Resolving unassigned shards after an Elasticsearch restart (allocation_status: "no_attempt")

Date: 2024-07-29 16:19:31

Tags: status, node, attempt, STARTED, shard, shards, allocation, gb

Environment

Elasticsearch 6.3.2, three-node cluster
Ubuntu 16.04
One index named user, configured with 3 primary shards and 2 replicas per primary shard

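Under normal conditions, every node holds one copy of each of the three shards of the user index, so the cluster can tolerate the loss of one machine without losing any data.

After a failure (an analyzer plugin had been modified and then failed to load correctly), node-151 went down. Once the problem was fixed, the node started normally with ./bin/elasticsearch -d, but the cluster still had three unassigned shards. I had expected them to be allocated automatically once node-151 was back, yet they stayed unassigned.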

Solution

First, GET user/_recovery?active_only=true showed that the cluster was not running any replica recovery.

Running GET _cluster/allocation/explain?pretty returned:

"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-09-29T08:02:03.794Z], failed_attempts[5], delayed=false, details[failed shard on node [mKkj4112T7aLeC2oNouOrg]: failed to update mapping for index, failure MapperParsingException[Failed to parse mapping [profile]: analyzer [hanlp_standard] not found for field [details]]; nested: MapperParsingException[analyzer [hanlp_standard] not found for field [details]]; ]

So the cause was the broken analyzer plugin. Looking at the log more closely, there is also this line:

allocation_status: "no_attempt"

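The reason: automatic allocation of the shard had already been retried the maximum of 5 times and still failed, so Elasticsearch stopped attempting allocation and the shard's allocation status is reported as no_attempt. The fix is to run POST /_cluster/reroute?retry_failed=true, for example in Kibana Dev Tools. The maximum number of retries is controlled by the index.allocation.max_retries setting.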
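To see at a glance why a manual retry is needed, the explanation message can be reduced to a few fields. The following is only a sketch: the helper name summarize_explanation is hypothetical, the text is a shortened copy of the message quoted above, and the message layout is assumed to match this 6.3.2 output (other versions may word it differently).

```python
import re

# Shortened copy of the "explanation" text from the allocation explain output.
EXPLANATION = (
    "shard has exceeded the maximum number of retries [5] on failed allocation "
    "attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, "
    "[unassigned_info[[reason=ALLOCATION_FAILED], at[2018-09-29T08:02:03.794Z], "
    "failed_attempts[5], delayed=false, details[failed shard on node "
    "[mKkj4112T7aLeC2oNouOrg]: failed to update mapping for index"
)


def summarize_explanation(text):
    """Pull the retry limit, failure count, reason and node id out of the message.

    Hypothetical helper; it relies on the wording of the message shown above.
    """

    def first(pattern):
        match = re.search(pattern, text)
        return match.group(1) if match else None

    max_retries = first(r"maximum number of retries \[(\d+)\]")
    failed = first(r"failed_attempts\[(\d+)\]")
    return {
        "reason": first(r"reason=(\w+)"),
        "max_retries": int(max_retries) if max_retries else None,
        "failed_attempts": int(failed) if failed else None,
        "node_id": first(r"failed shard on node \[([^\]]+)\]"),
        # The message itself names the manual retry call when the limit is hit.
        "needs_manual_retry": "retry_failed=true" in text,
    }


summary = summarize_explanation(EXPLANATION)
print(summary)
```

When failed_attempts equals max_retries, as it does here, the shard stays unassigned until a retry is requested by hand.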

The cluster will attempt to allocate a shard a maximum of index.allocation.max_retries times in a row (defaults to 5), before giving up and leaving the shard unallocated.
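Outside Kibana, the same calls can be issued from a script. The sketch below only builds the HTTP requests with Python's standard library and does not send them. The address http://localhost:9200 and the helper names are assumptions for illustration, and changing index.allocation.max_retries through the index settings endpoint assumes the setting is dynamic on your version.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def retry_failed_request(base_url=ES_URL):
    """Build (but do not send) POST /_cluster/reroute?retry_failed=true."""
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true", method="POST"
    )


def max_retries_request(index, retries, base_url=ES_URL):
    """Build a PUT <index>/_settings request that changes index.allocation.max_retries."""
    body = json.dumps({"index.allocation.max_retries": retries}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = retry_failed_request()
print(req.get_method(), req.full_url)
# To actually send it against a reachable cluster:
# urllib.request.urlopen(req)
```

Retrying only helps once the underlying failure is fixed (here, the analyzer plugin); otherwise the shard fails again and the counter fills up once more.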

After the reroute command has been processed, Elasticsearch rebalances shards as usual. Rebalancing is governed by the cluster.routing.rebalance.enable setting, which defaults to all.

It is important to note that after processing any reroute commands Elasticsearch will perform rebalancing as normal (respecting the values of settings such as cluster.routing.rebalance.enable) in order to remain in a balanced state.

After a while, GET /_cat/shards?index=user showed that all shards of the user index were allocated normally again:

user 1 p STARTED 13610428 2.6gb node-248
user 1 r STARTED 13610428 2.5gb node-151
user 1 r STARTED 13610428 2.8gb node-140
user 2 p STARTED 13606674 2.8gb node-248
user 2 r STARTED 13606674 2.7gb node-151
user 2 r STARTED 13606684 3.8gb node-140
user 0 p STARTED 13603429 2.6gb node-248
user 0 r STARTED 13603429 2.6gb node-151
user 0 r STARTED 13603429 2.7gb node-140

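Column 1 is the index name; column 2 is the shard number; column 3 shows whether the copy is a primary (p) or a replica (r); column 4 is the shard state; column 5 is the number of documents in the shard; column 6 is its size on disk; the last column is the node name.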
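The same check can be scripted. The sketch below parses the listing above and confirms that every copy is STARTED and that each shard has copies on three different nodes. The function names are hypothetical, and the parser assumes the trimmed column layout shown here (the default _cat/shards output also contains an ip column, and unassigned rows have no node).

```python
from collections import defaultdict

# The listing from above, copied verbatim.
CAT_SHARDS = """\
user 1 p STARTED 13610428 2.6gb node-248
user 1 r STARTED 13610428 2.5gb node-151
user 1 r STARTED 13610428 2.8gb node-140
user 2 p STARTED 13606674 2.8gb node-248
user 2 r STARTED 13606674 2.7gb node-151
user 2 r STARTED 13606684 3.8gb node-140
user 0 p STARTED 13603429 2.6gb node-248
user 0 r STARTED 13603429 2.6gb node-151
user 0 r STARTED 13603429 2.7gb node-140
"""


def parse_cat_shards(text):
    """Parse lines of the form: index shard prirep state [docs store node]."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        rows.append({
            "index": parts[0],
            "shard": int(parts[1]),
            "prirep": parts[2],
            "state": parts[3],
            # Unassigned copies have no trailing columns, hence no node.
            "node": parts[-1] if len(parts) > 4 else None,
        })
    return rows


def not_started(rows):
    """Return every shard copy whose state is not STARTED."""
    return [r for r in rows if r["state"] != "STARTED"]


def nodes_per_shard(rows):
    """Map each shard number to the set of nodes holding a copy of it."""
    placement = defaultdict(set)
    for r in rows:
        if r["node"] is not None:
            placement[r["shard"]].add(r["node"])
    return dict(placement)


rows = parse_cat_shards(CAT_SHARDS)
print(len(rows), "copies,", len(not_started(rows)), "not started")
```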

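Summary

In general, Elasticsearch allocates unassigned shards automatically. When shards stay unassigned for a long time, first check whether the index asks for more primary and replica copies than the nodes in the cluster can hold. The other cause is the one described here: after a failure, automatic allocation hit the maximum number of retries, and running reroute with retry_failed=true fixes it.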
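The first check can be done with simple arithmetic, because Elasticsearch never places two copies of the same shard on one node. The sketch below is a simplification under that single rule: the function name is hypothetical, and it ignores other allocation constraints such as disk watermarks or allocation filtering.

```python
def unassignable_copies(primaries, replicas, data_nodes):
    """Count shard copies that cannot be placed for lack of nodes.

    Each shard has 1 + replicas copies, and every copy of a shard needs
    its own node, so copies beyond data_nodes per shard stay unassigned.
    """
    copies_per_shard = 1 + replicas
    return primaries * max(0, copies_per_shard - data_nodes)


# The user index from this post: 3 primaries, 2 replicas each.
print(unassignable_copies(3, 2, 3))  # all three nodes up
print(unassignable_copies(3, 2, 2))  # one node down
```

With one of the three nodes down the function returns 3, which matches the three unassigned shards seen while node-151 was offline. Once the node is back the count is 0, so shards that remain unassigned point to a different cause, such as the exhausted retries described above.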

References

/_cat/shards API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html

2018.9.30
Original: https://www.cnblogs.com/hapjin/p/9726469.html

From： https://www.cnblogs.com/gaoyuechen/p/18330350
