Troubleshooting a Hadoop ResourceManager Process That Uses 100% CPU

Tags: ResourceManager, 100%, Hadoop, private, hadoop, RenewerPoolTracker, future, new, public

1. Use top to check per-process CPU usage

top


2. Find the service that owns that PID (22054)

Option 1: with top running, press the "c" key to show each process's full command line.

Option 2: look the PID up with ps -ef | grep PID

ps -ef | grep 22054

3. Find the thread inside that process that uses the most CPU

top -Hp 22054

4. Convert the thread ID to hexadecimal (needed to match the thread in the process's stack dump)

printf "%x\n" 22545
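
The same conversion can be done in Java, which may be handy when stack dumps are post-processed in code. This is a minimal sketch: the class and method names (ThreadIdHex, toHex, fromHex) are made up for illustration. It assumes the JVM prints the native thread ID in hex as nid=0x..., which is typical of the JDK 8 and 11 builds commonly used with Hadoop 3.3.x but can differ on newer JDKs.

```java
public class ThreadIdHex {
    /** Converts a decimal thread ID (as shown by top -H) to the hex string to search for. */
    public static String toHex(long tid) {
        return Long.toHexString(tid);
    }

    /** Reverse direction: parses a hex nid, with or without the 0x prefix, back to decimal. */
    public static long fromHex(String nid) {
        String digits = nid.startsWith("0x") ? nid.substring(2) : nid;
        return Long.parseLong(digits, 16);
    }

    public static void main(String[] args) {
        System.out.println(toHex(22545));      // prints 5811
        System.out.println(fromHex("0x5811")); // prints 22545
    }
}
```

The reverse direction is useful when starting from a stack dump and looking for the matching row in top -Hp.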

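5. Use jstack (shipped with the JDK) to dump the process's thread stacks and find the offending code

Run jstack 22054 | grep "5811" -A 30 to view the stack of that specific thread.

jstack: Prints Java thread stack traces for a Java process, core file, or remote debug server.

Here grep -A 30 prints the 30 lines that follow each matching line (trailing context only, not the lines before it).

Run the command as the user that started the process (yarn):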

sudo su - yarn
jstack 22054 | grep "5811" -A 30

The stack trace points to the run method of DelegationTokenRenewerPoolTracker, an inner class of DelegationTokenRenewer. The offending code is at line 990.
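
Steps 3 to 5 can also be approximated from inside a JVM with the standard ThreadMXBean API. The sketch below is illustrative only: HottestThread and sample are hypothetical names, and it only sees threads of the JVM it runs in, so it does not replace top -Hp and jstack against a running ResourceManager. It assumes per-thread CPU time measurement is supported and enabled on the platform.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class HottestThread {
    /**
     * Returns the live thread that consumed the most CPU during the sampling window,
     * or null if CPU time is unavailable or the wait was interrupted.
     */
    public static ThreadInfo sample(long windowMs) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return null;
        }
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id)); // nanoseconds, -1 if unavailable
        }
        try {
            Thread.sleep(windowMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        long hottestId = -1;
        long hottestDelta = -1;
        for (long id : mx.getAllThreadIds()) {
            Long then = before.get(id);
            long now = mx.getThreadCpuTime(id);
            if (then == null || then < 0 || now < 0) {
                continue; // thread started or ended during the window, or no data
            }
            if (now - then > hottestDelta) {
                hottestDelta = now - then;
                hottestId = id;
            }
        }
        // Up to 30 frames, comparable to grep -A 30 on jstack output
        return hottestId < 0 ? null : mx.getThreadInfo(hottestId, 30);
    }

    public static void main(String[] args) {
        Thread spinner = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) { /* deliberate busy loop */ }
        }, "spinner");
        spinner.setDaemon(true);
        spinner.start();

        ThreadInfo info = sample(300);
        spinner.interrupt();
        if (info != null) {
            System.out.println("hottest thread: " + info.getThreadName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Measuring the CPU time delta over a window, rather than the total since thread start, mirrors what top shows: the thread that is busy right now.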

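6. Read the code and analyze the problem

Because this is Hadoop source code, you can browse the matching version directly on GitHub, or download that version and open it locally in IDEA. We run Hadoop 3.3.4; the relevant code is the run method of DelegationTokenRenewerPoolTracker in DelegationTokenRenewer.java (a simplified version is given in section 8.1).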


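The code looks fine at first glance, but it is not. This class renews credentials for users on clusters that have Kerberos authentication enabled. It runs as soon as a job is submitted, then keeps waiting so that it can renew the token again when it expires, which prevents a job from failing mid-run because its token is no longer valid.

The while(true) loop is what keeps the CPU busy. When the map is empty, the inner for loop has nothing to iterate over, so nothing in the loop ever blocks and the while(true) spins, holding on to one CPU core.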
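
To make the spinning concrete, the sketch below counts how many passes of the outer loop complete in a fixed time window when the map is empty. It is an illustration under stated assumptions, not the Hadoop code: EmptyMapSpin and countPasses are hypothetical names, and a deadline replaces while(true) so that the method returns.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

public class EmptyMapSpin {
    /** Counts how many passes of the outer loop complete within windowMs when the map is empty. */
    public static long countPasses(long windowMs) {
        Map<String, Future<?>> futures = new ConcurrentHashMap<>();
        long deadline = System.nanoTime() + windowMs * 1_000_000L;
        long passes = 0;
        while (System.nanoTime() < deadline) { // stands in for while (true)
            for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                // Never reached: the map is empty, so no blocking future.get(...) call happens
                entry.getValue();
            }
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println("passes in 100 ms: " + countPasses(100));
    }
}
```

The exact count depends on the machine, but it is typically very large, which is the signature of a busy loop: the thread is always runnable and never sleeps or blocks.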

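7. Fix the problem

This problem ships with this Hadoop version. One option is to wait until a Hadoop release fixes it and then deploy that release. The other is to patch the source yourself, then rebuild, package, and deploy it, although the stability of such a build is not guaranteed.

Someone on GitHub ran into the same problem and submitted their own fix, but so far that change has not been merged into Hadoop's latest release branch.

Meanwhile, the Hadoop trunk branch already contains a different, more elegant fix, so you can wait for a new Hadoop release.

References: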

https://github.com/apache/hadoop/pull/4435

https://issues.apache.org/jira/browse/YARN-11178?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/DelegationTokenRenewer.java

The community developer's fix: see the simplified version in section 8.2.

The official Hadoop fix: see the simplified version in section 8.3.

8. Reproducing the problem by hand

8.1. Simplified example of the problematic code in older Hadoop versions

package com.example.demo.futuretracker;

import java.util.Map;
import java.util.concurrent.*;

/**
 * Problem: when the map is empty, the while (true) loop in run() keeps spinning
 * and never gives up the CPU.
 */
public class FutureTrackerExample {
    private final Map<String, Future<?>> futures = new ConcurrentHashMap<>();
    private final long tokenRenewerThreadTimeout = 60;

    private final class RenewerPoolTracker implements Runnable {

        @Override
        public void run() {
            while (true) {
                for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                    String key = entry.getKey();
                    Future<?> future = entry.getValue();
                    try {
                        future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
                    } catch (TimeoutException e) {
                        // Handle TimeoutException, retry logic can be added here
                        System.out.println("Timeout occurred for key: " + key);
                    } catch (InterruptedException | ExecutionException ex) {
                        ex.printStackTrace();
                        // Handle other exceptions if needed
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        FutureTrackerExample example = new FutureTrackerExample();
        RenewerPoolTracker renewerPoolTracker = example.new RenewerPoolTracker();
        Thread trackerThread = new Thread(renewerPoolTracker);
        trackerThread.start();
    }
}

8.2. Example of the fix proposed by the community developer

package com.example.demo.futuretracker;

import java.util.Map;
import java.util.concurrent.*;

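/**
 * Improvement: at the top of each while (true) iteration, check whether the map is empty.
 *        If it is, call wait() with a timeout, then check again;
 *        keep checking and waiting for as long as the map stays empty.
 * Idea: while the map is empty, wait() gives up the CPU for a while (the monitor is
 *       released during the wait). Once the map contains tasks, processing runs as usual.
 *
 */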
public class FutureTrackerExample1 {
    private final Map<String, Future<?>> futures = new ConcurrentHashMap<>();
    private final long tokenRenewerThreadTimeout = 60;

    private final class RenewerPoolTracker implements Runnable {

        @Override
        public void run() {
            while (true) {
                if (futures.isEmpty()) {
                    synchronized (this) {
                        try {
                            long waitingTimeMs = Math.min(10000, Math.max(500, tokenRenewerThreadTimeout));
                            this.wait(waitingTimeMs);
                        } catch (InterruptedException e) {
                            throw new RuntimeException(e);
                        }
                    }
                    if (futures.isEmpty()) {
                        continue;
                    }
                }

                for (Map.Entry<String, Future<?>> entry : futures.entrySet()) {
                    String key = entry.getKey();
                    Future<?> future = entry.getValue();
                    try {
                        future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
                    } catch (TimeoutException e) {
                        // Handle TimeoutException, retry logic can be added here
                        System.out.println("Timeout occurred for key: " + key);
                    } catch (InterruptedException | ExecutionException ex) {
                        ex.printStackTrace();
                        // Handle other exceptions if needed
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        FutureTrackerExample1 example = new FutureTrackerExample1();
        RenewerPoolTracker renewerPoolTracker = example.new RenewerPoolTracker();
        Thread trackerThread = new Thread(renewerPoolTracker);
        trackerThread.start();
    }
}

8.3. Example of the official Hadoop fix (trunk branch)

package com.example.demo.futuretracker;

import java.util.concurrent.*;
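/**
 * Improvement: change the data structure. Tasks are no longer stored in a map but in a
 *        LinkedBlockingDeque (a blocking queue).
 *        The while (true) loop fetches tasks with take(), which blocks (it calls await internally):
 *        while the queue is empty the thread waits without holding the lock and without
 *        consuming CPU (the core idea).
 *        When a task is added (a put operation), put signals the thread that is in await,
 *        which wakes up, reacquires the lock, and carries on normally.
 * Idea: rely on the lock's wait/signal mechanism, which is more flexible and elegant.
 *
 */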
public class FutureTrackerExample2 {
    private final LinkedBlockingDeque<DelegationTokenRenewerFuture> futures = new LinkedBlockingDeque<>();
    private final long tokenRenewerThreadTimeout = 60;

    private final class RenewerPoolTracker implements Runnable {
        @Override
        public void run() {
            while (true) {
                DelegationTokenRenewerFuture dtrf;
                try {
                    // take() blocks while the queue is empty: the lock is released and the thread waits to be woken
                    dtrf = futures.take();
                    Future<?> future = dtrf.getFuture();
                    future.get(tokenRenewerThreadTimeout, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Handle TimeoutException, retry logic can be added here
                    System.out.println("Timeout occurred ");
                } catch (InterruptedException | ExecutionException ex) {
                    ex.printStackTrace();
                    // Handle other exceptions if needed
                }
            }
        }
    }

    /** Simple holder for the Future of one renewal task. */
    public static class DelegationTokenRenewerFuture {
        private Future<?> future;

        public DelegationTokenRenewerFuture() {
        }

        public DelegationTokenRenewerFuture(Future<?> future) {
            this.future = future;
        }

        public Future<?> getFuture() {
            return future;
        }

        public void setFuture(Future<?> future) {
            this.future = future;
        }
    }

    public static void main(String[] args) {
        FutureTrackerExample2 example = new FutureTrackerExample2();
        RenewerPoolTracker renewerPoolTracker = example.new RenewerPoolTracker();
        Thread trackerThread = new Thread(renewerPoolTracker);
        trackerThread.start();
    }
}
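
The example in 8.3 never puts anything into the queue, so the wake-up path is not exercised. The sketch below adds a producer side to show it. It is a simplified illustration, not the Hadoop implementation: BlockingTrackerDemo and run are hypothetical names, the queue holds plain Future objects instead of DelegationTokenRenewerFuture, and an already completed CompletableFuture stands in for a real renewal task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingTrackerDemo {
    /** Reports the tracker thread's state while the queue is empty, then feeds it one task. */
    public static String run() {
        LinkedBlockingDeque<Future<?>> futures = new LinkedBlockingDeque<>();
        CountDownLatch handled = new CountDownLatch(1);

        Thread tracker = new Thread(() -> {
            try {
                while (true) {
                    Future<?> future = futures.take(); // parks while the queue is empty
                    future.get(60, TimeUnit.MILLISECONDS);
                    handled.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop the tracker on interrupt
            } catch (ExecutionException | TimeoutException e) {
                e.printStackTrace();
            }
        }, "tracker");
        tracker.setDaemon(true);
        tracker.start();

        try {
            Thread.sleep(200); // give the tracker time to reach take()
            Thread.State idleState = tracker.getState();
            // put() signals the waiting tracker, which wakes up and handles the task
            futures.put(CompletableFuture.completedFuture("renewed"));
            boolean processed = handled.await(2, TimeUnit.SECONDS);
            return idleState + "/" + processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            tracker.interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // typically prints WAITING/true
    }
}
```

A thread in the WAITING state is not scheduled, so an idle tracker uses essentially no CPU, in contrast to the spinning loop in 8.1.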

From: https://blog.51cto.com/simplelife/9161410