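Opening gnome-terminal from Python immediately shows up as a zombie

Tags: python python-3.x multiprocessing popen psutil

For background, I am writing a script to train multiple PyTorch models. I have a training script that I want to run as a subprocess in a gnome-terminal window. The main reason is so that I can keep an eye on training progress as it runs. If I have multiple GPUs, I want to run the training script several times, each in its own window. To do this I have been using Popen. The following code opens a new terminal window and starts the training script: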

#create a list of commands
commands = []
kd_cfg = KDConfig.read(kd_cfg_path)
cmd = "python scripts/train_torch.py "
for specialist in kd_cfg.specialists:
    cmd += f"--config {kd_cfg.runtime_dict['specialists'][specialist]['config']} "
    ...

# Run each command in a new terminal and store the process object
num_gpus = len(gpus)
free_gpus = copy.deepcopy(gpus)
processes = []
worker_sema = threading.Semaphore(num_gpus)
commands_done = [False for _ in range(len(commands))]

#start the watchdog
watch = threading.Thread(target=watch_dog, args=(processes,free_gpus,commands_done,worker_sema))
watch.start()

for cmd_idx, command in enumerate(commands):

    worker_sema.acquire()

    gpu = free_gpus.pop()
    command += f" --gpu {gpu}" #allocate a free GPU from the list
    split_cmd_arr = shlex.split(command)
    proc = subprocess.Popen(['gnome-terminal', '--'] + split_cmd_arr)

    processes.append( (cmd_idx,gpu,proc) )

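The part I am stuck on is concurrency control. To protect the GPU resources I use a semaphore. My plan was to monitor the process that launches gnome-terminal and release the semaphore when it finishes, so that the next training run can start. Instead, all commands run at once. When I test with two commands and limit myself to one GPU, I still see two terminals open and two training runs start. In the watchdog thread code below, I see that both processes are zombies with no children, even though I can watch the training loop executing in both terminals without crashing.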

   # Check if processes are still running
    while not all(commands_done):
        for cmd_idx, gpu, proc in processes:
            # try:
            # Check if process is still running
            ps_proc = psutil.Process(proc.pid)

            #BC we call bash python out of the gate it executes as a child proc
            ps_proc_children = get_child_processes(proc.pid)
            proc_has_running_children = any(child.is_running for child in ps_proc_children)

            print(f"status: {ps_proc.status()}")
            print(f"children:  {ps_proc_children}")
            if proc_has_running_children:
                print(f"Process {proc.pid} on GPU {gpu} is still running", end='\r')
            else:
                print(f"Process {proc.pid} has terminated")
                free_gpus.append(gpu)
                commands_done[cmd_idx] = True
                processes.remove((cmd_idx, gpu, proc))

                ps_proc.wait()
                print(f"removed proc {ps_proc.pid}")
                worker_sema.release()

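I thought maybe the subprocess essentially launches another process and then returns immediately, but I was surprised to find that there were no children either. If anyone has any insight, it would be much appreciated.

If it helps, here is some sample output from the watchdog.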

status: zombie
children:  []
Process 4076 has terminated
removed proc 4076
status: zombie
children:  []
Process 4133 has terminated
removed proc 4133

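The problem is expecting the process started by subprocess.Popen to stay alive until the gnome-terminal window closes, which it does not. Popen starts the process and returns immediately without waiting for it or its children to finish. In addition, the gnome-terminal command normally exits as soon as it has asked the gnome-terminal-server process to open the window. The exited launcher remains a zombie until the parent reaps it with wait() or poll(), and the training script runs under the server rather than under proc.pid, which is why the watchdog prints status: zombie and children: [].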
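To make the zombie status in the watchdog output concrete, here is a minimal sketch that reproduces it without gnome-terminal. It assumes Linux (it reads /proc directly instead of using psutil), and the helper names proc_state and zombie_demo are hypothetical, chosen only for this illustration. A short-lived Python child stands in for the gnome-terminal launcher.

```python
import os
import subprocess
import sys

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may contain ')', so split on the last one.
    return data.rsplit(")", 1)[1].split()[0]

def zombie_demo():
    # A child that exits immediately, standing in for the gnome-terminal launcher.
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, proc.pid, os.WEXITED | os.WNOWAIT)
    before = proc_state(proc.pid)  # exited, parent has not collected the status yet
    proc.wait()                    # collects the exit status and reaps the child
    after = proc_state(proc.pid)
    return before, after
```

On Linux, the first value should be the zombie state "Z" and the second should be None, because the process table entry only disappears once the parent calls wait() or poll().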

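Here are a few ways to address this:

1. Use wait

You can call wait() on the Popen object to block until the launched process exits. For that to mean the training script has finished, gnome-terminal has to be started with its --wait option (assuming your version supports it); otherwise the launcher returns right away. Calling wait() directly in the launch loop also means you can no longer run several training runs at the same time.

    proc = subprocess.Popen(['gnome-terminal', '--wait', '--'] + split_cmd_arr)
    proc.wait()  # wait for the process to finish
    free_gpus.append(gpu)
    commands_done[cmd_idx] = True
    ...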
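One way to keep the wait() call without serializing the training runs is to do the waiting in a worker thread per command. The following is a sketch under stated assumptions, not a drop-in replacement: launch_all, run_and_release and the prefix parameter are hypothetical names, the --gpu flag is taken from the question, and the default prefix assumes a gnome-terminal version that supports --wait.

```python
import subprocess
import threading

def run_and_release(argv, gpu, free_gpus, sema, lock):
    """Run argv to completion, then hand the GPU back and free one slot."""
    try:
        subprocess.Popen(argv).wait()  # blocks only this worker thread
    finally:
        with lock:
            free_gpus.append(gpu)
        sema.release()

def launch_all(commands, gpus, prefix=("gnome-terminal", "--wait", "--")):
    """Run every command (a list of argv strings), at most one per GPU at a time."""
    free_gpus = list(gpus)
    sema = threading.Semaphore(len(free_gpus))
    lock = threading.Lock()
    threads = []
    for argv in commands:
        sema.acquire()  # blocks until some worker has released a GPU
        with lock:
            gpu = free_gpus.pop()
        full_argv = list(prefix) + list(argv) + ["--gpu", str(gpu)]
        worker = threading.Thread(
            target=run_and_release,
            args=(full_argv, gpu, free_gpus, sema, lock),
        )
        worker.start()
        threads.append(worker)
    for worker in threads:
        worker.join()
    return free_gpus  # all GPUs are back in the list once everything has finished
```

Passing prefix=() runs the commands directly, which is convenient for checking the GPU bookkeeping without opening any terminal windows.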

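2. Monitor child processes with psutil

You can use psutil to monitor the gnome-terminal process and its children, and release the semaphore once they have all finished. This is the approach you are already attempting, but it needs to search the whole process tree recursively rather than only the direct children. Be aware that in the watchdog output above the launcher has already exited and has no children, because the terminal's command runs under gnome-terminal-server rather than under proc.pid. The check below therefore also treats the entry as running while proc.poll() returns None, which is only useful if the launcher stays alive for the whole run (for example when gnome-terminal is started with --wait, assuming your version supports that option).

import psutil

def get_all_child_processes(pid):
    """Recursively collect all descendants of pid."""
    try:
        parent = psutil.Process(pid)
        return parent.children(recursive=True)
    except psutil.NoSuchProcess:
        return []

# In your watchdog loop:
for cmd_idx, gpu, proc in list(processes):  # iterate over a copy, since entries are removed below
    children = get_all_child_processes(proc.pid)
    # poll() returns None while the launcher is alive and reaps it once it has exited
    if proc.poll() is None or any(child.is_running() for child in children):
        print(f"Process {proc.pid} on GPU {gpu} is still running", end='\r')
    else:
        print(f"Process {proc.pid} has terminated")
        free_gpus.append(gpu)
        commands_done[cmd_idx] = True
        processes.remove((cmd_idx, gpu, proc))
        worker_sema.release()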

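3. Use a library better suited to concurrency

For more complex concurrency scenarios, consider the standard library's concurrent.futures or asyncio modules, or a third-party library such as gevent. These provide higher-level tools for managing concurrent work.

Keep in mind that whichever solution you implement, you will need to adjust the rest of your code accordingly, such as the semaphore handling and the updates to the process list.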
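As an illustration of the concurrent.futures route, the sketch below replaces the semaphore and the free-GPU list with a thread pool and a queue.Queue of GPU ids. It is only one possible arrangement: run_all, run_on_free_gpu and the prefix parameter are hypothetical names, the --gpu flag comes from the question, and the default prefix again assumes a gnome-terminal version that supports --wait.

```python
import queue
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_free_gpu(argv, gpu_pool, prefix):
    """Take a GPU from the pool, run the command to completion, then return the GPU."""
    gpu = gpu_pool.get()  # blocks until a GPU id is available
    try:
        # subprocess.call waits for the launched process and returns its exit code
        return subprocess.call(list(prefix) + list(argv) + ["--gpu", str(gpu)])
    finally:
        gpu_pool.put(gpu)

def run_all(commands, gpus, prefix=("gnome-terminal", "--wait", "--")):
    """Run every command with at most len(gpus) running at once; return the exit codes in order."""
    gpu_pool = queue.Queue()
    for gpu in gpus:
        gpu_pool.put(gpu)
    with ThreadPoolExecutor(max_workers=len(gpus)) as pool:
        futures = [
            pool.submit(run_on_free_gpu, argv, gpu_pool, prefix)
            for argv in commands
        ]
        return [future.result() for future in futures]
```

Because the pool has exactly one worker per GPU, no separate watchdog thread is needed: a worker only picks up the next command after its previous one has returned. When a terminal prefix is used, the returned values are the exit codes of the launcher process.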

From: 78780790
