
K12531: Troubleshooting health monitors


Issue

A monitor is a BIG-IP feature that verifies connections to pool members or nodes. A health monitor is designed to report the status of a pool, pool member, or node on an ongoing basis, at a set interval. When a health monitor marks a pool, pool member, or node as down, the BIG-IP system stops sending traffic to the device.

A failing or misconfigured health monitor may cause traffic management issues similar to, but not limited to, the following:

  • Connections to the virtual server are interrupted or fail.
  • Web pages or applications fail to load or run.
  • Certain pool members or nodes receive more connections than others.

Any of these symptoms may indicate that a health monitor is marking a pool, pool member, or node as indefinitely down or that a monitor is repeatedly marking a pool member or node as down and then as back up (often called "bouncing"). For example, if a misconfigured health monitor repeatedly marks pool members as down and then as back up, connections to the virtual server may be interrupted or fail altogether. If this occurs, you need to determine whether the monitor is misconfigured, the device or application is failing, or some other factor, such as a network-related issue, is causing the monitor to fail. The troubleshooting steps you take depend on the monitor type and the symptoms you observe.

You can use the following procedures to troubleshoot health monitor issues:

Identifying a failing health monitor

You can use the Configuration utility, command line utilities, logs, or SNMP to help identify when a health monitor marks a pool, pool member, or node as down.

Configuration utility

The following table lists Configuration utility pages where you can check the status of pools, pool members, and nodes.

Configuration utility page | Description | Location
Network map | Summary of pools, pool members, and nodes | Local Traffic > Network Map
Pools | Current status of pools | Local Traffic > Pools > Statistics
Pool members | Current status of pool members | Local Traffic > Pools > Statistics
Nodes | Current status of nodes | Local Traffic > Nodes > Statistics

Command line utilities

The following table lists command line utilities that you can use to monitor the status of pools, pool members, and nodes.

Command line utility | Description | Example commands
TMOS Shell (tmsh) (BIG-IP 10.x and later) | Statistical information about pools, pool members, and nodes | tmsh show /ltm pool <pool_name>, tmsh show /ltm node <node_IP>
bigtop | Live statistics for pool members and nodes | bigtop -n
bigpipe (BIG-IP 10.x) | Statistical information about pools, pool members, and nodes | bigpipe pool show, bigpipe node show
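For an ongoing view from the shell, you can repeat one of the commands above with watch. The following is a sketch; the pool name my_pool is a placeholder, and the grep pattern assumes the Availability and State fields shown by tmsh on your version:

# Poll pool status every 5 seconds and display only the availability-related lines
watch -n 5 "tmsh show /ltm pool my_pool | grep -iE 'availability|state'"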

Logs

The BIG-IP system logs messages related to health monitors to the /var/log/ltm file. You can review log files to determine the frequency with which the system marks pool members and nodes as down.

  • Pools

    When a health monitor marks all members of a pool as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    tmm err tmm[4779]: 01010028:3: No members available for pool <Pool_name>
    tmm err tmm[4779]: 01010221:3: Pool <Pool_name> now has available members

  • Pool members

    When a health monitor marks pool members as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[2964]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status down [ <MonitorA_name>: down, <MonitorB_name>: down ] [ was up for <#>hrs:<#>mins:<#>sec ]
    notice mcpd[2964]: 01070727:5: Pool <Pool_name> member <ServerIP_port> monitor status up. [ <MonitorA_name>: down, <MonitorB_name>: up ] [ was down for <#>hrs:<#>mins:<#>sec ]


    When a pool member is forced offline by the administrator, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[5897]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status forced down. [ <MonitorA_name>: down, <MonitorB_name>: up ] [ was up for <#>hrs:<#>mins:<#>sec ]

  • Nodes

    When a health monitor marks a node as down or up, the BIG-IP system logs messages to the /var/log/ltm file which appear similar to the following example:

    notice mcpd[2964]: 01070640:5: Node <ServerIP> monitor status down.
    notice mcpd[2964]: 01070728:5: Node <ServerIP> monitor status up.
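To gauge how frequently the system marks pool members as down, you can count the matching log messages. The following one-liner is a minimal sketch; it assumes the message format shown in the examples above:

# Count 'monitor status down' events per pool member in /var/log/ltm
grep 'monitor status down' /var/log/ltm | grep -o 'member [^ ]*' | sort | uniq -c | sort -rn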

Monitor logging

In BIG-IP 11.5.0 and later, the Monitor Logging option allows the system to log more verbose monitor messages at the pool member and node level. The BIG-IP system stores the log for each pool member or node in the /var/log/monitors/ directory. The system does not save the Monitor Logging option setting in the system configuration; instead, it disables the option when the configuration loads. Additionally, the BIG-IP system does not include the Monitor Logging option in configuration synchronization operations.

The log file has the following file naming format:

<MonitorPartition>_<MonitorName>-<NodePartition>_<NodeName>-<port>.log

For example, if the Gateway_ICMP monitor is set to monitor pool member 10.10.12.200 and the Monitor Logging option is set to Enabled, the BIG-IP system generates the following log file for the pool member:

/var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log
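After you enable the option using one of the following procedures, you can follow the corresponding log in real time. For example, using the log file name shown above:

# Follow the verbose monitor log for the example pool member
tail -f /var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log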

Enabling monitor logging for a pool member

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a long period of time. Be sure to disable monitor logging after troubleshooting.

  1. Log in to the Configuration utility.
  2. Go to Local Traffic > Pools > Pool List.
  3. Select the name of the pool that contains the pool member for which you want to enable monitor logging.
  4. Select the Members tab.
  5. In the Current Members list, select the name of the pool member for which you want to enable monitor logging.
  6. For Monitor Logging, select the Enable check box.
  7. Select Update.
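Alternatively, you can typically enable the same setting from the command line. The following tmsh sketch assumes the monitor-logging property is available on your BIG-IP version; the pool name and member address are placeholders:

# Enable verbose monitor logging for a single pool member
tmsh modify /ltm pool my_pool members modify { 10.10.12.200:80 { monitor-logging enabled } }

# Disable it again after troubleshooting
tmsh modify /ltm pool my_pool members modify { 10.10.12.200:80 { monitor-logging disabled } }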

Enabling monitor logging for a node

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a long period of time. Be sure to disable monitor logging after troubleshooting.

  1. Log in to the Configuration utility.
  2. Go to Local Traffic > Nodes > Node List.
  3. Select the name of the node for which you want to enable monitor logging.
  4. For Monitor Logging, select the Enable check box.
  5. Select Update.
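A similar tmsh sketch applies to a node, again assuming the monitor-logging property exists on your version; the node address is a placeholder:

# Enable, and later disable, verbose monitor logging for a node
tmsh modify /ltm node 10.10.12.200 monitor-logging enabled
tmsh modify /ltm node 10.10.12.200 monitor-logging disabled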

SNMP

When you configure the BIG-IP system to send SNMP traps and a health monitor marks a pool member or node as down or up, the system sends the following traps:

  • Pool members

    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.10"
    }
    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.11"
    }

  • Nodes

    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.12"
    }
    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.13"
    }
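To confirm how these alerts are defined on your system, you can search the default alert configuration. The following sketch assumes the standard /etc/alertd/alert_default.conf location:

# List the monitor status alert definitions and their trap OIDs
grep -A 2 'MON_STATUS' /etc/alertd/alert_default.conf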

Verifying monitor settings

You must verify that monitor settings are properly defined for your environment. F5 recommends that in most cases the timeout value should be equal to three times the interval value, plus one. For example, the default interval/timeout ratio is 5/16 (three times 5 plus one equals 16). This setting prevents the monitor from marking the node as down before sending the last check.

Simple monitors

You can use a simple monitor to verify the status of a destination node (or the path to the node through a transparent device). Simple monitors only monitor the node address itself, not individual protocols, services, or applications on a node. The BIG-IP system provides the following pre-configured simple monitor types: gateway_icmp, icmp, tcp_echo, tcp_half_open. If you determine that a simple monitor is marking a node as down, you can verify the following settings:

Note: There are other monitor settings that can be defined for simple monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5 product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

  • Interval/timeout ratio

    You must configure an appropriate interval/timeout ratio for simple monitors. In most cases, the timeout value should be equal to three times the interval value, plus one. For example, the default ratio is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

  • Transparent

    A transparent monitor uses a path through the associated node to monitor the aliased destination. Verify that the destination target device is reachable and configured properly for the monitor.
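As an illustration, a custom transparent gateway ICMP monitor might be created as in the following sketch; the monitor name, destination address, and interval/timeout values are assumptions for this example:

# Create a transparent gateway ICMP monitor that probes 10.10.10.1 through the associated node
tmsh create ltm monitor gateway-icmp gw_icmp_custom defaults-from gateway_icmp interval 10 timeout 31 transparent enabled destination 10.10.10.1:*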

Extended content verification (ECV) monitors

ECV monitors use Send and Receive string settings to retrieve content from pool members or nodes. The BIG-IP system provides the following pre-configured monitor types: tcp, http, https, and https_443. If you determine that an ECV monitor is marking a pool member or node as down, you can verify the following settings:

Note: There are other monitor settings that can be defined for ECV monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5 product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

Note: HTTPS monitors use OpenSSL for cipher negotiations.

  • Interval/timeout ratio

    As with simple monitors, you need to properly set the interval/timeout ratio for ECV monitors. In most cases, the timeout value should be equal to three times the interval value, plus one. For example, the default ratio is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

  • Send string

    The Send string is a text string that the monitor sends to the pool member. The default setting is GET /, which retrieves a default HTML file for a website. If the Send string is not properly constructed, the server may send an unexpected response, and the monitor subsequently marks the pool member as down. For example, if the server requires the monitor request to be HTTP/1.1 compliant, you must adjust the monitor's Send string (see the example monitor definition after this list).

    Note: For information about modifying HTTP requests for use with HTTP or HTTPS application health monitors, refer to the related AskF5 articles.

  • Receive string

    The Receive string is the regular expression representing the text string that the monitor looks for in the returned resource. ECV monitor requests may fail and mark the pool member as down if the Receive string is not configured properly. For example, if the Receive string appears too late in the server response, or the server responds with a redirect, the monitor marks the pool member as down.

    Note: For information about modifying the monitor to issue a request to a redirection target, refer to K3224: HTTP health checks may fail even though the node is responding correctly.

  • User name and password

    ECV monitors have User Name and Password fields, which can be used for resources that require authentication. Verify whether the pool member requires authentication and ensure that these fields contain valid credentials.
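For example, a custom HTTP monitor with an HTTP/1.1-compliant Send string and a matching Receive string might look like the following sketch; the monitor name, Host header, and expected response are assumptions, and quoting may need adjustment depending on your shell:

# Create a custom HTTP monitor that sends an HTTP/1.1 request and expects a 200 response
tmsh create ltm monitor http http_v11_custom defaults-from http interval 10 timeout 31 send "GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n" recv "200 OK"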

Troubleshooting monitor types

Simple monitors

If you determine that a simple monitor is marking a node as down (or if the node is bouncing), you can use the following steps to troubleshoot:

  1. Determine the IP address of the nodes being marked as down.

    You can determine the IP addresses of the nodes that the monitor is marking as down by using the Configuration utility, command line utilities, or log files. You can quickly search the /var/log/ltm file for node status messages by typing the following command:

    # grep 'Node' /var/log/ltm |grep 'status'

    Output will appear similar to the following example:

    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.1 monitor status down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 172.24.64.4 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.200 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.122 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.100 monitor status unchecked.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 11.1.1.1 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.3 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.229 monitor status down.

    Note: If a large number of nodes are being marked as down (or bouncing), you can sort the results by IP addresses by typing the following command.

    grep 'Node' /var/log/ltm |grep 'status' | sort -t . -k 3,3n -k 4,4n

  2. Check connectivity to the node.

    If there are occurrences of node addresses being marked as down and not back up, or of nodes bouncing, use commands such as ping and traceroute (BIG-IP 10.x and 11.x) to check the connectivity to the nodes from the BIG-IP system. For example, if you determine that a simple monitor is marking the node address 10.10.65.1 as down, you can attempt to ping the resource from the BIG-IP system, as shown in the following example:

    # ping -c 4 10.10.65.1
    PING 10.10.65.1 (10.10.65.1) 56(84) bytes of data.
    64 bytes from 10.10.65.1: icmp_seq=1 ttl=64 time=11.32 ms
    64 bytes from 10.10.65.1: icmp_seq=2 ttl=64 time=8.989 ms
    64 bytes from 10.10.65.1: icmp_seq=3 ttl=64 time=10.981 ms
    64 bytes from 10.10.65.1: icmp_seq=4 ttl=64 time=9.985 ms

    Note: The ping output in the previous example shows high round-trip times, which may indicate a network issue or a slowly responding node.

    In addition, make sure that the node is configured to respond to the simple monitor. For example, tcp_echo is a simple monitor type that requires that you enable TCP echo service on the monitored nodes. The BIG-IP system sends a SYN segment with information that the receiving device echoes.

  3. Check the monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval/timeout ratio) are appropriate for the node.

    Type the following tmsh command to list the configuration for the icmp_new monitor:

    tmsh list /ltm monitor icmp_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings, as shown in the example following these steps.
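For example, the following sketch creates a custom ICMP monitor with a longer interval/timeout ratio and assigns it to an affected node for testing; the monitor name and node address are placeholders:

# Create a custom ICMP monitor with a 10-second interval and a 31-second timeout
tmsh create ltm monitor icmp icmp_custom defaults-from icmp interval 10 timeout 31

# Assign the new monitor to a node for testing
tmsh modify ltm node 10.10.65.1 monitor icmp_custom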

ECV monitors

If you determine that an ECV monitor is marking a pool member as down (or if the pool member is bouncing), you can use the following steps to troubleshoot the issue:

  1. Determine the IP address of the pool members that the monitor is marking as down by using the Configuration utility, command line utilities, or log files.

    For example, you can search the /var/log/ltm file for pool member status messages by typing the following command:

    # grep -i 'pool member' /var/log/ltm | grep 'status'

    Output appears similar to the following example:

    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:21 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status node down.
    Jan 21 15:05:05 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status unchecked.

  2. Check connectivity to the pool member.

    Check the connectivity to the pool members from the BIG-IP system using the ping or traceroute commands.

  3. Check the ECV monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval/timeout ratio) are appropriate for the pool members.

    The following tmsh command lists the configuration for the http_new monitor:

    tmsh list /ltm monitor http_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings.

  5. Test the response from the application.

    Use a command line utility on the BIG-IP system to test the response from the web application. For example, the following command uses curl, timed with the time utility, to transfer data from the web server while measuring the response time:

    # time curl http://10.10.65.1

    Output syntax appears similar to the following example:

    <html>
    <head>
    ---
    </body>
    </html>
    real 0m18.032s
    user 0m0.030s
    sys 0m0.060s

    Note: If you want to test a specific HTTP request, including HTTP headers, you can use the telnet command to connect to the pool member.

    For example:

    telnet <serverIP> <serverPort>

    At the prompt, enter an appropriate HTTP request line and HTTP headers, pressing Enter once after each line.

    For example:

    GET / HTTP/1.1 <enter>
    Host: www.yoursite.com <enter>
    Connection: close <enter>
    <enter>
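If the monitor's Send string includes specific headers, you can reproduce a similar request with curl and inspect the response status and headers. The following is a sketch; the Host header value and addresses are placeholders:

# Send a request with an explicit Host header and display verbose request/response details
curl -v -H 'Host: www.yoursite.com' -H 'Connection: close' http://10.10.65.1/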

Troubleshooting daemons related to health monitoring

The bigd process manages health checking for pool members, nodes, and services on the BIG-IP LTM system. The bigd process collects health checking status and communicates the status information to the mcpd process, which stores the data in shared memory so that the Traffic Management Microkernel (TMM) can read it. If you are having monitoring issues, you can check the memory utilization of the bigd process. If the %MEM is unusually high, or continually increases, the process may be leaking memory.

For example, to check the current memory utilization of bigd, type the ps command:

# ps aux |grep bigd

Output appears similar to the following example:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3020 0.0 0.6 28208 10488 ? S 2010 5:08 /usr/bin/bigd
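To see whether the bigd memory usage grows over time, you can sample the process periodically. The following is a minimal sketch; the output file path is a placeholder, and the bracketed grep pattern simply excludes the grep process itself:

# Append a timestamped bigd memory sample to a file every 60 seconds
while true; do
    date >> /var/tmp/bigd_mem.log
    ps aux | grep '[b]igd' >> /var/tmp/bigd_mem.log
    sleep 60
done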

Note: If the bigd process fails, the health check status of pool members, nodes, and services remains in its current state until the bigd process restarts. For more information, refer to K6967: When the BIG-IP LTM bigd daemon fails, the health check status of pool members, nodes, and services remain unchanged until the bigd daemon restarts.

Additionally, you can run the bigd process in debug mode. Debug logging for the bigd process is extremely verbose as it logs multiple messages for every monitor attempt. For information about running bigd in debug mode, contact F5 Technical Support.

Using tcpdump to capture the monitor traffic

If you are unable to determine the cause of a failing health monitor, you may need to perform packet captures on the BIG-IP system. To use the tcpdump command to capture monitor traffic, perform the following steps:

Impact of procedure: You should only run tcpdump packet captures during active troubleshooting sessions.

  1. Log in to the BIG-IP command line.
  2. Use the following command syntax to determine the self IP address that the BIG-IP system uses for health monitoring:

    ip route get <server ip address>

    Note: Replace <server ip address> with the IP address of the destination server.

    Output appears similar to the following example, which uses the destination server address 10.20.4.100:

    ip route get 10.20.4.100
    10.20.4.100 dev internal_vlan  src 10.20.4.3
    cache

    Note: In the example, the server 10.20.4.100 is associated with VLAN internal_vlan and the self IP address for health monitoring is 10.20.4.3.

  3. Use the following tcpdump syntax to capture monitor traffic.

    tcpdump -nnvi <internal_vlan_name>:nnn -s0 -w /var/tmp/<filename>.pcap host <self-ip address>

    For example:

    tcpdump -nnvi internal_vlan:nnn -s0 -w /var/tmp/monitortraffic.pcap host 10.20.4.3

  4. When you have captured the appropriate amount of monitor traffic, press Ctrl+C to terminate the tcpdump capture.

Note: For more information about running tcpdump, refer to K411: Overview of packet tracing with the tcpdump utility.
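After you stop the capture, you can review it directly on the BIG-IP system or copy it off-box for analysis. For example, using the capture file name from step 3:

# Summarize the first packets of the saved monitor capture
tcpdump -nnr /var/tmp/monitortraffic.pcap | head -20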

Verify the connectivity between the BIG-IP system and pool members

  1. Send a ping from the BIG-IP system to a pool member.
  2. Identify any intermediate device between the BIG-IP system and the pool member, and ping that device's IP address from the BIG-IP system.
  3. If the intermediate device is a switch, check for an ARP entry in the BIG-IP ARP table using the arp -a command.
  4. Verify the VLAN and VLAN tagging configuration on the BIG-IP system and the connected switch or Layer 3 switch.
  5. If ICMP (ping) is blocked, perform a telnet test to the pool member's service port instead (see the example commands after this list).
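The following is a minimal sketch of these checks from the BIG-IP command line; the addresses and port are placeholders:

# 1. ICMP reachability to the pool member
ping -c 4 10.10.65.1

# 3. Confirm an ARP entry exists for the pool member
arp -a | grep 10.10.65.1

# 5. If ICMP is blocked, test the member's service port instead
telnet 10.10.65.1 80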
