1. Symptoms
Background: the application kept restarting even though JVM heap usage was stable. The error logs pointed to an off-heap (direct) memory leak, and in the related traces only one user was triggering OutOfDirectMemoryError.
Further investigation:
- The Redis egress traffic was nowhere near large enough to explain it.
- The key referenced in the corresponding stack trace wasn't that big either: about 900 KB (still large, and one of the contributing factors).
- Requests from this user were very frequent.
Our spring-boot-starter-data-redis connects to Redis through lettuce-core 5.1.4.RELEASE. Lettuce is built on Netty, and Netty's
io.netty.util.internal.PlatformDependent#DIRECT_MEMORY_COUNTER
tracks how much direct memory is currently in use. We used Arthas to read this counter some time after a restart:
getstatic io.netty.util.internal.PlatformDependent DIRECT_MEMORY_COUNTER -x 1
Checking again after a while, the counter never went down: it only grew, eventually ending in OutOfDirectMemoryError. The problem looked like it was here.
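Arthas works well for ad-hoc checks; to watch the trend over time without attaching a tool, the same counter can also be read from inside the process. A minimal sketch, assuming Netty 4.1 on the classpath (note that PlatformDependent is an internal Netty class, and usedDirectMemory() returns -1 when Netty is not tracking the counter):
import io.netty.util.internal.PlatformDependent;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DirectMemoryLogger {

    public static void start() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "direct-memory-logger");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(() ->
                // usedDirectMemory() reads the same counter as DIRECT_MEMORY_COUNTER
                System.out.printf("netty direct memory: used=%d, max=%d%n",
                        PlatformDependent.usedDirectMemory(),
                        PlatformDependent.maxDirectMemory()),
                0, 30, TimeUnit.SECONDS);
    }
}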
2. Cause
Q: Why did the application keep restarting?
A: Direct memory usage grew past the limit, which caused the restarts.
Q: Why did the direct memory leak?
A: It kept growing and was never reclaimed, eventually ending in OutOfDirectMemoryError.
Q: Why did only this user trigger it? The answer comes later.
3. Quick mitigation
To limit the blast radius, the first step was to move this user out of the shared production environment into a separate one. Other users were no longer affected, and the isolated setup made the investigation easier.
4. Reproducing the problem
Let's list the conditions needed to reproduce it:
- this user's unusually large cached value (the ~900 KB key)
- a high rate of cache reads
- lettuce-core 5.1.4.RELEASE
Nothing special there, so we can point the project at a local Redis and try to reproduce it:
// in ChatListenerTest; the @Autowired CacheTest bean is the Spring proxy seen in the stack trace
@Autowired
private CacheTest cacheTest;

@Test
public void testOOD() throws Exception {
    ExecutorService executorService = Executors.newFixedThreadPool(10);
    Long param = 1L; // id of the ~900 KB entry pre-loaded into Redis (placeholder value)
    for (int i = 0; i < 10; i++) {
        executorService.execute(() ->
                System.out.println(cacheTest.selectSettingsList(param).size()));
    }
    System.in.read(); // keep the JVM alive while the Netty event loop keeps decoding responses
}

// in CacheTest, a Spring bean, so the @Cacheable interception applies
@Cacheable(cacheNames = "cache:settings", key = "#param")
public List<SettingsDTO> selectSettingsList(Long param) {
    // the data was already inserted into Redis beforehand, so returning an empty list is fine
    return Lists.newArrayList();
}
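The comment above assumes the big value already sits in Redis. A minimal seeding sketch for the same test class, going through the Spring cache abstraction so the configured key prefix and serializer are applied; the helper name seedCache and the 10,000 loop count are placeholders to tune until the serialized value reaches roughly the 900 KB seen in production:
@Autowired
private CacheManager cacheManager; // org.springframework.cache.CacheManager

// hypothetical helper: store a big List<SettingsDTO> under the entry the test reads
private void seedCache(Long param) {
    List<SettingsDTO> big = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
        big.add(new SettingsDTO()); // placeholder payload; assumes a no-arg constructor
    }
    cacheManager.getCache("cache:settings").put(param, big); // cache is created on demand by the default RedisCacheManager
}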
Sure enough, it reproduced reliably.
Local test run with -XX:MaxDirectMemorySize=5m:
2022-03-05 18:12:50.272 WARN 51615 --- [ioEventLoop-4-1] io.lettuce.core.protocol.CommandHandler : [null, io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880), io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880)] Unexpected exception during request: [null, io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880), io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880)]
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
at com.berserk.CacheTest$$EnhancerBySpringCGLIB$$65af824b.selectSettingsList(<generated>)
at com.berserk.ChatListenerTest.lambda$test$0(ChatListenerTest.java:33)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.lettuce.core.RedisException: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:129)
at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:69)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy82.get(Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:66)
... 19 more
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2097152 byte(s) of direct memory (used: 3155342, max: 5242880)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:655)
at io.netty.util.internal.PlatformDependent.reallocateDirectNoCleaner(PlatformDependent.java:628)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.reallocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:34)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.capacity(UnpooledUnsafeNoCleanerDirectByteBuf.java:51)
at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:299)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:278)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1096)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1087)
at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:554)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
5. Root cause
The core flow: on a single connection, every command goes through the same channel, and the responses accumulate in the same ByteBuf. The problematic code is io.lettuce.core.protocol.CommandHandler#decode(io.netty.channel.ChannelHandlerContext, io.netty.buffer.ByteBuf).
The key code:
protected void decode(ChannelHandlerContext ctx, ByteBuf buffer) throws InterruptedException {
    ......
    // decodable as long as the deque is not empty && the buffer is readable (writerIndex > readerIndex)
    while (canDecode(buffer)) {
        // peek at the command that was queued first
        RedisCommand<?, ?, ?> command = stack.peek();
        if (debugEnabled) {
            logger.debug("{} Stack contains: {} commands", logPrefix(), stack.size());
        }
        pristine = false;
        try {
            // decode one response
            if (!decode(ctx, buffer, command)) {
                return;
            }
        } catch (Exception e) {
            ctx.close();
            throw e;
        }
        ...
        if (canComplete(command)) {
            // decoded successfully, pop the first command off the stack
            stack.poll();
            try {
                // complete the command and hand the result back to the caller thread
                complete(command);
            } catch (Exception e) {
                logger.warn("{} Unexpected exception during request: {}", logPrefix, e.toString(), e);
            }
        }
        afterDecode(ctx, command);
    }
    // reset the read/write indexes: either writerIndex = readerIndex = 0,
    // or writerIndex -= readerIndex and readerIndex = 0
    if (buffer.refCnt() != 0) {
        buffer.discardReadBytes();
    }
}

/**
 * Whether there is anything left to decode.
 *
 * @param buffer
 * @return
 */
protected boolean canDecode(ByteBuf buffer) {
    return !stack.isEmpty() && buffer.isReadable();
}
As described above, all commands on a connection share the same channel and the same cumulative ByteBuf, and the problem sits in this decode step. The code allows an extreme scenario: responses arrive faster than they can be fully decoded, so the inner decode(ctx, buffer, command) returns early, and while canDecode(buffer) keeps holding (the deque is not empty && writerIndex > readerIndex) the read/write indexes never get reset by discardReadBytes. Each new chunk read from the socket is appended to the same buffer, which therefore keeps expanding. Combined with this user's fairly high concurrency on a ~900 KB value, direct memory usage stays high and is never reclaimed.
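To make the mechanism concrete, here is a standalone toy sketch (plain netty-buffer API, not lettuce code): when the already-consumed part of a cumulative buffer is never discarded, every new append has to enlarge the buffer instead of reusing the space in front of readerIndex, which is exactly the reallocateDirect path in the stack trace above.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

public class DiscardReadBytesDemo {

    public static void main(String[] args) {
        ByteBuf cumulation = ByteBufAllocator.DEFAULT.directBuffer(256);
        byte[] reply = new byte[900 * 1024]; // pretend every Redis reply is ~900 KB

        for (int i = 0; i < 5; i++) {
            cumulation.writeBytes(reply);                      // channelRead appends the next chunk
            cumulation.skipBytes(cumulation.readableBytes());  // "decode" consumes everything readable
            // Without cumulation.discardReadBytes() here, the consumed region in front of readerIndex
            // is never reclaimed, so the next writeBytes has to enlarge the buffer again and again.
            System.out.println("capacity = " + cumulation.capacity()
                    + ", readerIndex = " + cumulation.readerIndex());
        }
        cumulation.release();
    }
}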
6. Fix
There are two options at the moment:
- switch to Jedis
- upgrade lettuce-core to 5.2.0.RELEASE or later (POM snippet below). Why doesn't 5.2.0.RELEASE run into OutOfDirectMemoryError when decoding can't keep up? Because when it finds during decode that it cannot decode any further, it forces a discardReadBytes to reset the read/write indexes, so it no longer keeps allocating additional direct memory on top of what is already there.
<dependency>
    <groupId>io.lettuce</groupId>
    <artifactId>lettuce-core</artifactId>
    <version>5.2.0.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
    <exclusions>
        <exclusion>
            <groupId>io.lettuce</groupId>
            <artifactId>lettuce-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>
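After updating the POM, it is worth confirming which lettuce-core version actually lands on the classpath, for example with mvn dependency:tree | grep lettuce-core.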
https://github.com/redis/lettuce/issues/906
This is a similar issue on GitHub; in fact it was the same reporter who later fixed it with a PR.
7. Follow-up optimizations
- Trim fields that aren't actually needed from the cached value, shrinking the ~900 KB payload.
- Introduce a multi-level cache in front of Redis (see the sketch below).
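A minimal sketch of the multi-level cache idea, assuming Caffeine as a small local tier in front of the existing Redis-backed bean; the class name LocalFirstSettingsService and the sizing numbers are hypothetical. Hot values are then served from the heap instead of being re-fetched and re-decoded from Redis on every request.
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class LocalFirstSettingsService {

    private final CacheTest settingsService; // the existing Redis-backed @Cacheable bean

    // local tier: small and short-lived, bounded so it cannot blow up the heap
    private final LoadingCache<Long, List<SettingsDTO>> local;

    public LocalFirstSettingsService(CacheTest settingsService) {
        this.settingsService = settingsService;
        this.local = Caffeine.newBuilder()
                .maximumSize(100)
                .expireAfterWrite(30, TimeUnit.SECONDS)
                .build(settingsService::selectSettingsList); // loads from Redis on a local miss
    }

    public List<SettingsDTO> selectSettingsList(Long param) {
        return local.get(param);
    }
}
Keeping the local tier small and short-lived bounds the extra heap usage while still cutting most of the Redis traffic for a hot key.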