Dalian AI Computing Platform (Huawei Ascend AI platform), HPC: the ssh launch mode of the dstart scheduler does not work

Date: 2023-08-25 12:11:24
Tags: platform, dstart, AI, openssh, askpass, failed, Host, ssh, key

 

According to Huawei's official documentation:

https://support.huawei.com/enterprise/zh/doc/EDOC1100228705/d1f5a239#ZH-CN_TOPIC_0000001212004449

 

 

 

 

 

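The documentation indicates that if an HPC job is launched without specifying --mca plm_rsh_agent, MPI is started over ssh by default. In practice this does not work, and the launch fails with the following errors: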

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   dlhpcshare-agent-37
  target node:  dlhpcshare-agent-25

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[dlhpcshare-agent-37:2299732] 22 more processes have sent help message help-errmgr-base.txt / no-path
[dlhpcshare-agent-37:2299732] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
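The log above mixes two distinct ssh messages before ORTE gives up, and their counts are easy to lose in the repetition. The following is a minimal Python sketch for tallying them, assuming the mpirun output has been saved as text. The function name and the category labels are illustrative and are not part of Open MPI or dstart; only substrings that appear verbatim in the log are matched.

```python
from collections import Counter

# Substrings copied verbatim from the log above.
# The category names on the left are illustrative labels, not Open MPI terms.
PATTERNS = {
    "askpass_missing": "ssh_askpass: exec(",
    "host_key_failed": "Host key verification failed.",
    "orte_daemon_start_failed": "ORTE was unable to reliably start one or more daemons.",
}


def summarize_launch_log(text):
    """Count how many log lines fall into each known failure category."""
    counts = Counter()
    for line in text.splitlines():
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory",
        "Host key verification failed.",
        "ORTE was unable to reliably start one or more daemons.",
    ])
    for name, count in summarize_launch_log(sample).items():
        print(name, count)
```

In OpenSSH the host-key check happens before user authentication, so a large `host_key_failed` count suggests that ssh is stopping at an unconfirmed host key rather than at a rejected password or key.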

 

 

 

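Conclusion:

This HPC platform has not set up passwordless ssh authentication between its compute nodes, so the nodes cannot authenticate to one another over ssh, which produces the errors above. It follows that ssh cannot be used for communication between compute nodes on this HPC; child nodes must still be launched by specifying --mca plm_rsh_agent.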
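One way to confirm this from inside a job is to probe a peer node with ssh in non-interactive mode. The sketch below is an assumption-laden illustration rather than a procedure from the Huawei documentation: the helper names are hypothetical, and it relies only on the standard OpenSSH options `BatchMode` and `ConnectTimeout`. With `BatchMode=yes`, ssh fails instead of prompting, so the probe succeeds only if both the host key and the user authentication pass without interaction.

```python
import subprocess


def build_probe_command(host, timeout_s=5):
    """Return an ssh argv that runs `true` on the host without any prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt; fail instead
        "-o", "ConnectTimeout=%d" % timeout_s,  # give up on unreachable hosts
        host,
        "true",                                 # trivial remote command
    ]


def can_ssh_noninteractively(host, timeout_s=5):
    """Return True if ssh to the host succeeds without any user interaction."""
    try:
        result = subprocess.run(
            build_probe_command(host, timeout_s),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection hung past the timeout.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Node name taken from the log above; this only prints the command.
    print(" ".join(build_probe_command("dlhpcshare-agent-25")))
```

Calling `can_ssh_noninteractively("dlhpcshare-agent-25")` from dlhpcshare-agent-37 would be expected to return False on a platform configured as described here.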

 

 

=========================================================

 

From: https://www.cnblogs.com/devilmaycry812839668/p/17656607.html
