Dalian AI Computing Platform, Huawei Ascend AI Platform, High-Performance Computing (HPC): the ssh launch mode of the dstart scheduler does not work


Posted: 2023-08-25 12:11:24

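If an MPI job on this HPC platform does not specify --mca plm_rsh_agent, it falls back to the default launch method and starts MPI on the other nodes over ssh. In practice this does not work, and the job fails with the following error: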

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   dlhpcshare-agent-37
  target node:  dlhpcshare-agent-25

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
[dlhpcshare-agent-37:2299732] 22 more processes have sent help message help-errmgr-base.txt / no-path
[dlhpcshare-agent-37:2299732] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
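Most of this output is repetition. Each failed ssh attempt to a remote node produces its own lines, and Open MPI's generic help text follows them. The minimal Python sketch below tallies the lines by failure type, so the dominant cause is visible at a glance. The signature strings are copied from the log above. The labels attached to them and the function name summarize_mpirun_log are illustrative assumptions, not part of Open MPI.

```python
from collections import Counter

# Line prefixes copied from the mpirun log above, mapped to short
# human-readable labels (the labels are our own interpretation).
SIGNATURES = {
    "ssh_askpass: exec(": "ssh tried to prompt, but no askpass helper exists",
    "Host key verification failed.": "ssh rejected the remote host key",
    "ORTE was unable to reliably start one or more daemons.": "orted daemons did not start",
    "ORTE does not know how to route a message": "mpirun cannot route to a remote daemon",
}


def summarize_mpirun_log(text: str) -> Counter:
    """Count how many log lines start with each known failure signature."""
    counts = Counter()
    for line in text.splitlines():
        stripped = line.strip()
        for prefix, label in SIGNATURES.items():
            if stripped.startswith(prefix):
                counts[label] += 1
                break  # a line is counted under at most one label
    return counts
```

For example, calling summarize_mpirun_log on the text of a saved copy of the error output returns a Counter keyed by those labels. Its most_common() method then lists the failure types in order of frequency.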

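This HPC platform does not set up ssh for non-interactive use between compute nodes: there is no passwordless authentication configured. When mpirun tries to start its daemons on the other nodes over ssh, ssh therefore needs to prompt the user. With no terminal and no /usr/libexec/openssh/ssh-askpass helper installed, the prompt cannot be shown, so verification fails ("Host key verification failed.") and the daemons never start. The conclusion is that the ssh launch method cannot be used for inter-node communication on this HPC, and the child-node processes still have to be started with --mca plm_rsh_agent.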
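The diagnosis above can be checked before launching. The following is a minimal Python sketch, not the platform's own tooling. It probes each node with ssh -o BatchMode=yes, which makes ssh fail immediately instead of prompting. It keeps Open MPI's default ssh launcher only if every node passes, and otherwise returns the --mca plm_rsh_agent override. The agent value SCHEDULER_AGENT and all function names are hypothetical placeholders. The post does not say which value dstart expects for plm_rsh_agent.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical placeholder: the post only says --mca plm_rsh_agent must be
# used, not which agent value the dstart scheduler expects.
SCHEDULER_AGENT = "YOUR_SCHEDULER_AGENT"


def ssh_probe_cmd(host: str) -> List[str]:
    """Build an ssh command that runs `true` on `host` without ever prompting.

    BatchMode=yes makes ssh fail at once instead of asking for a password or
    a host-key confirmation (the kind of prompt that tried to exec ssh-askpass).
    """
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def ssh_usable(host: str, run: Callable = subprocess.run) -> bool:
    """Return True if non-interactive ssh to `host` exits with status 0."""
    try:
        result = run(ssh_probe_cmd(host), stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the remote side hung past the timeout.
        return False
    return result.returncode == 0


def launch_mca_args(hosts: Sequence[str], agent: str = SCHEDULER_AGENT,
                    run: Callable = subprocess.run) -> List[str]:
    """Return extra mpirun arguments for launching on `hosts`.

    An empty list keeps Open MPI's default ssh launcher; it is only chosen
    when every host passes the non-interactive probe (all() stops at the
    first failing host).
    """
    if all(ssh_usable(h, run) for h in hosts):
        return []
    return ["--mca", "plm_rsh_agent", agent]
```

The returned list is meant to be spliced into an mpirun command line. For example: ["mpirun", *launch_mca_args(hosts), "-np", "4", "./app"], where ./app is a hypothetical program. On this platform the probe is expected to fail, so the plm_rsh_agent override would be returned. The run parameter exists only so the probe can be replaced in tests.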

Source: https://www.cnblogs.com/devilmaycry812839668/p/17656607.html