Flink TaskManager OutOfMemoryError: Metaspace Troubleshooting Notes

An interesting record of troubleshooting a Flink job failure.

1. Environment

Flink 1.12 in Standalone mode on a single machine. Because it runs in a customer environment, the runtime status usually goes unwatched for long stretches of time.

2. Symptoms

Colleagues on site reported that after the device had been running at the customer site for a while, every Flink job went down: none of the jobs showed up on the Flink Dashboard any more, and the TaskManager was effectively dead even though the TaskManager process was still alive. Manually restarting the taskmanager service brought everything back to normal.

3. Investigation

System monitoring showed disk I/O, CPU, and memory usage were all normal, and there was nothing unusual in /var/log/messages either.

The Flink taskmanager log, however, contained an OutOfMemoryError: Metaspace:

2022-11-03 16:20:59,122 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down... java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
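The log points at two possibilities. For the first one (the job genuinely needs more metaspace), the option it names can be raised in conf/flink-conf.yaml; a minimal sketch with an illustrative value:

```yaml
# conf/flink-conf.yaml -- the size below is illustrative, pick one that fits your jobs
taskmanager.memory.jvm-metaspace.size: 512m
```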

4. Root-cause analysis

Because the OOM message in the flink taskmanager log hints that "If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...", the suspicion was that repeated job restarts had caused it.

5. First internal reproduction

Write a Flink job that fails as soon as it lands on the TaskManager (simply throw an exception in the open() method so the job crashes), and submit this broken job rapidly and repeatedly. On the Flink dashboard, the TaskManager's JVM Metaspace then grows by roughly 20 MB after every job restart.
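For reference, a minimal sketch of this kind of deliberately failing job (class and message names are made up): a RichMapFunction whose open() throws, so the task crashes as soon as it is deployed on the TaskManager.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AlwaysFailingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3)
           .map(new RichMapFunction<Integer, Integer>() {
               @Override
               public void open(Configuration parameters) {
                   // Fail as soon as the task is deployed on the TaskManager.
                   throw new RuntimeException("boom: failing in open()");
               }

               @Override
               public Integer map(Integer value) {
                   return value;
               }
           })
           .print();
        env.execute("always-failing-job");
    }
}
```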

Metaspace keeps growing until it reaches 1 GB, at which point the TaskManager goes into a zombie state: the process still exists, but the jobs have completely crashed, no TaskManager instance is visible on the Task Managers page, and dolphinscheduler keeps resubmitting the jobs without success, so they restart over and over.

(Screenshot: job restart history)

The TaskManager is effectively down, but its process is still alive (a zombie state).


At this point the scheduler keeps retrying to bring the jobs back up and keeps failing, yet the Taskmanager log shows no OutOfMemoryError.

6. Second reproduction: Metaspace OutOfMemoryError successfully reproduced

Submit the instantly failing job repeatedly but slowly, one submission every 60 s. After a long stretch of failed submissions, the Metaspace OutOfMemoryError message appears in the Taskmanager log.
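A sketch of how such a slow, repeated submission can be scripted (the paths and jar name are assumptions; in practice the scheduler's retries can play the same role):

```bash
# Submit the always-failing job once every 60 seconds against the standalone cluster.
while true; do
  "$FLINK_HOME"/bin/flink run -d /path/to/always-failing-job.jar || true
  sleep 60
done
```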

(Screenshot: the reproduced TaskManager out-of-memory error log)

The flink-taskmanager restart can also be seen in systemd.
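One way to confirm this from systemd's side (the unit name flink-taskmanager matches this deployment; adjust it to your own setup):

```bash
# Current status plus recent start/stop events of the TaskManager unit
systemctl status flink-taskmanager
journalctl -u flink-taskmanager --since "2 hours ago" | grep -iE "started|stopped|metaspace"
```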

At this point, however, the Flink dashboard still shows every job as Running with no apparent problem (even though nothing is actually working).

Drilling into an individual job, the Overview tab also shows a healthy status.

On the Exceptions tab, however, there are messages about no slots being available, and the job cannot actually be deployed.

7. Third internal reproduction: automatic recovery is possible

The steps are the same as in the second reproduction, but this time the Taskmanager dies after a while and is automatically brought back up (via systemd); after a few retries by the scheduler the jobs are submitted successfully and everything returns to normal.

8. Fix

The first priority is, of course, to fix the bug in the program so that the jobs stop restarting. To guard against similar problems in the future, a couple of approaches found online are worth considering (a sketch of the first one follows below):

1. Separate the JAR dependencies: put the third-party libraries the jobs use into the flink/lib directory.
2. Switch the deployment mode to Local or YARN (this kind of memory leak only shows up in Standalone-cluster mode).
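A sketch of option 1, assuming the shared third-party library is a MySQL JDBC driver (the jar name and paths are placeholders); the same dependency should then be marked as provided in the user-jar build so it is not bundled twice:

```bash
# Move the shared dependency into Flink's lib/ so the parent classloader loads it once,
# then restart the TaskManager to pick it up.
cp mysql-connector-java-8.0.28.jar "$FLINK_HOME"/lib/
systemctl restart flink-taskmanager
```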

The official documentation discusses this issue here:

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/#unloading-of-dynamically-loaded-classes-in-user-code

The original wording from the docs may be clearer:

Unloading of Dynamically Loaded Classes in User Code

All scenarios that involve dynamic user code classloading (sessions) rely on classes being unloaded again. Class unloading means that the Garbage Collector finds that no objects from a class exist any more, and thus removes the class (the code, static variables, metadata, etc).

Whenever a TaskManager starts (or restarts) a task, it will load that specific task’s code. Unless classes can be unloaded, this will become a memory leak, as new versions of classes are loaded and the total number of loaded classes accumulates over time. This typically manifests itself through an OutOfMemoryError: Metaspace.

Common causes for class leaks and suggested fixes:

Lingering Threads: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.

Interners: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava’s interners, or Avro’s class/object caches in the serializers.

JDBC: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should add the driver jars to Flink’s lib/ folder instead of bundling them in the user-jar. If you can’t guarantee that none of your user-jars bundle the driver, you have to additionally add the driver classes to the list of parent-first loaded classes via classloader.parent-first-patterns-additional.
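A sketch of that setting, assuming a MySQL driver; note that in flink-conf.yaml the key is written as classloader.parent-first-patterns.additional (verify the exact spelling for your Flink version):

```yaml
# conf/flink-conf.yaml -- load the JDBC driver classes parent-first so they are loaded only once
classloader.parent-first-patterns.additional: com.mysql.
```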

A helpful tool for unloading dynamically loaded classes are the user code class loader release hooks. These are hooks which are executed prior to the unloading of a classloader. It is generally recommended to shutdown and unload resources as part of the regular function lifecycle (typically the close() methods). But in some cases (for example for static fields), it is better to unload once a classloader is certainly not needed anymore.

Class loader release hooks can be registered via the RuntimeContext.registerUserCodeClassLoaderReleaseHookIfAbsent() method.
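A minimal sketch combining the lifecycle cleanup and the release hook described above (class and cache names are made up, and it assumes a Flink version whose RuntimeContext exposes the hook method just quoted): threads are shut down in close(), and a release hook clears a static cache right before the user-code classloader is released.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class LeakAwareMapper extends RichMapFunction<String, String> {

    // A static cache lives in the user-code classloader; if nothing ever clears it,
    // it pins user classes and prevents the classloader from being unloaded.
    static final Map<String, String> STATIC_CACHE = new ConcurrentHashMap<>();

    private transient ExecutorService background;

    @Override
    public void open(Configuration parameters) {
        background = Executors.newSingleThreadExecutor();

        // Registered at most once per classloader; runs just before Flink releases it.
        getRuntimeContext().registerUserCodeClassLoaderReleaseHookIfAbsent(
                "clear-static-cache", STATIC_CACHE::clear);
    }

    @Override
    public String map(String value) {
        STATIC_CACHE.putIfAbsent(value, value.toUpperCase());
        return STATIC_CACHE.get(value);
    }

    @Override
    public void close() throws Exception {
        // Lingering threads are another common cause of class leaks: stop them
        // as part of the regular function lifecycle.
        if (background != null) {
            background.shutdownNow();
            background.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
}
```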

A related section of the docs, on avoiding dynamic classloading:

Flink's components (JobManager, TaskManager, Client, ApplicationMaster, etc.) log their classpath setting in the environment section at the start of their logs. When the JobManager and TaskManager are started in a mode dedicated to one specific job, the user-code JAR files can be placed in the /lib directory so that they are part of the classpath and are not loaded dynamically. It usually works to put the job's JAR file into the /lib directory: the JAR is then part of both the classpath (the AppClassLoader) and the dynamic classloader (the FlinkUserCodeClassLoader). Because the AppClassLoader is the parent of the FlinkUserCodeClassLoader (and Java loads classes parent-first by default), each class is loaded only once. When a job's JAR files cannot all be placed in /lib (for example a session shared by multiple jobs), it may still be possible to put the common libraries into /lib and thereby avoid dynamic loading for those classes.

From: https://blog.51cto.com/mapengfei/5881698
