Multi-GPU Monitoring

Time: 2022-12-13 23:23:27

Tags: power gpu memory GPU monitoring gpu1 gpu0

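Anyone running experiments wants their GPUs busy around the clock, but a run sometimes finishes in the middle of the night. To avoid getting out of bed to launch the next experiment, and to avoid wasting compute, we can poll several GPUs and start the next experiment as soon as one of them is found idle.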
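
The idle test itself can be sketched separately from how the readings are collected. The snippet below is a minimal illustration, not the author's code: the `GpuReading` type, the `first_idle_gpu` helper, and the sample numbers are all hypothetical, and the default thresholds (1000 MiB, 70 W) simply mirror the ones used in the script in this post.

```python
from typing import NamedTuple, Optional, Sequence


class GpuReading(NamedTuple):
    """One GPU's state. Field names are illustrative, not from the original script."""
    index: int
    memory_mib: int
    power_w: float


def first_idle_gpu(readings: Sequence[GpuReading],
                   max_memory_mib: int = 1000,
                   max_power_w: float = 70) -> Optional[int]:
    """Return the index of the first GPU below both thresholds, or None if all are busy."""
    for reading in readings:
        if reading.memory_mib < max_memory_mib and reading.power_w < max_power_w:
            return reading.index
    return None


# Made-up sample readings: GPU 0 is busy, GPU 1 is nearly empty.
readings = [GpuReading(0, 9959, 68), GpuReading(1, 3, 9)]
print(first_idle_gpu(readings))  # prints 1
```

Writing the check over a list of readings means it works for any number of GPUs without adding a new pair of variables per device.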

import os
import sys
import time
 
cmd0 = 'CUDA_VISIBLE_DEVICES=0 nohup bash run.sh --stage 9'  # command to run once GPU 0 is idle
cmd1 = 'CUDA_VISIBLE_DEVICES=1 nohup bash run.sh --stage 9'  # command to run once GPU 1 is idle
 
def gpu_info():
    # Keep only the nvidia-smi rows that contain '%' (one per GPU), then split the text on '|' into a list
    gpu_status = os.popen('nvidia-smi | grep %').read().split('|')
    '''
    Example result:
    ['', ' N/A   64C    P0    68W /  70W ', '   9959MiB / 15079MiB ', '     79%      Default ', 
    '\n', ' N/A   73C    P0   108W /  70W ', '  11055MiB / 15079MiB ', '     63%      Default ', 
    '\n', ' N/A   60C    P0    55W /  70W ', '   3243MiB / 15079MiB ', '     63%      Default ', '\n']
    '''
    gpu0_status = gpu_status[0:4]
    gpu1_status = gpu_status[4:8]
    #gpu2_status = gpu_status[8:]
    #print(gpu2_status)
    # Memory in use (MiB): element 2 looks like '   9959MiB / 15079MiB ';
    # take the text before '/', then the text before 'M', and convert it to int
    gpu0_memory = int(gpu0_status[2].split('/')[0].split('M')[0].strip())
    gpu1_memory = int(gpu1_status[2].split('/')[0].split('M')[0].strip())
    #print(gpu_memory)
    # Power draw (W): element 1 ends with '68W /  70W ';
    # take the last three-space-separated chunk, then the text before '/', then before 'W'
    gpu0_power = int(gpu0_status[1].split('   ')[-1].split('/')[0].split('W')[0].strip())
    gpu1_power = int(gpu1_status[1].split('   ')[-1].split('/')[0].split('W')[0].strip())
    #print(gpu_power)
    # GPU utilization (%), currently unused: element 3 looks like '     79%      Default '
    #gpu_util = int(gpu1_status[3].split('   ')[1].split('%')[0].strip())
    #print(gpu_util)
    return gpu0_power, gpu0_memory, gpu1_power, gpu1_memory  # , gpu_util
 
 
def narrow_setup(secs=900):  # poll every 15 minutes
    gpu0_power, gpu0_memory, gpu1_power, gpu1_memory = gpu_info()
    i = 0
    # Keep waiting until at least one GPU is below both the memory and the power threshold
    # (utilization is not checked)
    while not ((gpu0_memory < 1000 and gpu0_power < 70) or (gpu1_memory < 1000 and gpu1_power < 70)):
        i = i % 5
        symbol = 'monitoring: ' + '>' * i + ' ' * (10 - i - 1) + '|'
        gpu0_power_str = 'NO.0 GPU power:%d W |' % gpu0_power
        gpu0_memory_str = 'NO.0 GPU memory:%d MiB |' % gpu0_memory
        gpu1_power_str = 'NO.1 GPU power:%d W |' % gpu1_power
        gpu1_memory_str = 'NO.1 GPU memory:%d MiB |' % gpu1_memory
        #gpu_util_str = 'gpu util:%d %% |' % gpu_util
        sys.stdout.write('\r' + gpu0_memory_str + ' ' + gpu0_power_str + ' ' + symbol+'\n' + gpu1_memory_str + ' ' + gpu1_power_str + ' ' + symbol)
        # sys.stdout.write(obj + '\n') is equivalent to print(obj)
        sys.stdout.flush()  # flush the output buffer
        time.sleep(secs)  # suspend this process for `secs` seconds
        # Read again after sleeping, so the loop condition and the branch below use fresh values
        gpu0_power, gpu0_memory, gpu1_power, gpu1_memory = gpu_info()
        i += 1
    if gpu0_memory < 1000 and gpu0_power < 70:
        print('\n' + cmd0)
        os.system(cmd0)  # run the script on GPU 0
    else:
        print('\n' + cmd1)
        os.system(cmd1)  # run the script on GPU 1
          
 
 
if __name__ == '__main__':
    narrow_setup()
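
Splitting the human-readable `nvidia-smi` table on `'|'` depends on the exact column layout. As an alternative sketch (not part of the original script), `nvidia-smi` can emit plain CSV through its `--query-gpu` and `--format` options, which is easier to parse. The helper names `parse_query_output` and `query_gpus` and the sample string below are hypothetical; the non-numeric power value is handled on the assumption that some boards do not report power draw.

```python
import subprocess
from typing import List, Optional, Tuple

# CSV query: one line per GPU, no header, numbers without units.
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,power.draw",
         "--format=csv,noheader,nounits"]


def parse_query_output(text: str) -> List[Tuple[int, int, Optional[float]]]:
    """Parse lines such as '0, 9959, 68.12' into (index, memory MiB, power W)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, memory, power = [field.strip() for field in line.split(",")]
        try:
            power_w = float(power)
        except ValueError:
            power_w = None  # assumption: a non-numeric placeholder means "not reported"
        rows.append((int(index), int(memory), power_w))
    return rows


def query_gpus() -> List[Tuple[int, int, Optional[float]]]:
    """Run nvidia-smi and parse its CSV output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_query_output(result.stdout)


# Hand-written sample text, not real nvidia-smi output.
sample = "0, 9959, 68.12\n1, 3, 9.40\n"
print(parse_query_output(sample))
```

On a machine with GPUs, `query_gpus()` would replace the string slicing in `gpu_info()` and returns one tuple per device, so a third or fourth GPU needs no extra code.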

Source: https://www.cnblogs.com/Uriel-w/p/16980958.html
