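Goals
Customers may choose a product because its features are well designed, or because it responds quickly and is stable and reliable. For example, users switch to Gaode (Amap) when Didi is unavailable, brokerages are penalized when their apps fail, and WeChat has suffered an early-morning service outage. The value of stability work is therefore to protect the customer experience and to avoid financial loss and negative public opinion.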
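Handling the incident lifecycle
Organized around the incident lifecycle, the fault-location system can be divided into three stages. Before an incident (the preparation stage), quantitative analysis is used to uncover latent risks. Once an incident has started, the priority is to detect it and locate its direct cause as quickly as possible; the direct cause is what you need in order to stop the loss, and the root cause can be investigated afterward. After recovery comes the postmortem, which produces a TODO list of targeted improvements.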
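Preparation stage
1. Observability system
Even with mature infrastructure and software architecture you cannot rest easy: production problems can never be fully prevented, so building an observability system is essential. The preparation stage involves two tasks, collecting instrumentation data and organizing that data, so that later troubleshooting is easier.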
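Observability data is usually divided into four types: metrics, traces, logs, and events.
Metrics are cheap to store and the most widely used. Well-known products include Prometheus, which has become the de facto industry standard for monitoring systems; the time series databases VictoriaMetrics and OpenTSDB; and the collectors Telegraf, Categraf, Grafana-agent, and Datadog-agent.
Traces: when services are numerous and their relationships complex, faults become hard to troubleshoot and a distributed tracing system is needed. The industry has introduced the observability framework OpenTelemetry, which covers most tracing needs, but a rollout only pays off once all modules are instrumented.
Logs are the most important troubleshooting tool, but they are expensive to store, so log management needs to be fine-grained. Older data is almost never queried; recent data is kept in ES for troubleshooting, and numeric metrics extracted from logs are stored in a time series database, where they can be retained longer.
Troubleshooting typically starts with a metric signaling an anomaly. You then look at the logs for the relevant time window; the logs may contain a traceid, which you use to query the trace data and so find the direct cause of the fault more quickly.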
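The metric, then log, then trace flow described above can be sketched in shell. This is only an illustration under stated assumptions, not part of any tool mentioned here: the log line format (date and time in the first two fields, a traceid=<id> token somewhere after them) and the function name traceids_in_window are hypothetical.

```shell
# Print the unique trace IDs found in log lines whose timestamp falls
# inside [start, end]. Assumed (hypothetical) log line format:
#   2023-04-13 10:59:58 ERROR traceid=abc123 some message
# Fixed-width timestamps like these compare correctly as plain strings.
traceids_in_window() {
    awk -v s="$1" -v e="$2" '
        {
            ts = $1 " " $2
            if (ts < s || ts > e) next          # outside the anomaly window
            for (i = 3; i <= NF; i++)
                if ($i ~ /^traceid=/) {
                    id = substr($i, 9)          # text after "traceid="
                    if (!(id in seen)) { seen[id] = 1; print id }
                }
        }'
}

# Usage (app.log is a placeholder file name):
#   traceids_in_window "2023-04-13 10:59:00" "2023-04-13 11:00:00" < app.log
```

The IDs it prints are what you would then look up in the tracing backend.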
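Events usually include alert events and change events. Fault-location records are events too, and all of them need to be collected in one place so that they can be correlated along the time dimension.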
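As a rough sketch of what correlating events along the time dimension can look like, the function below pairs each alert with the most recent change event that preceded it within a given number of seconds. The input format (epoch seconds, a kind of "change" or "alert", then free text) and the function name are assumptions made for this example.

```shell
# stdin, sorted by time: "<epoch_seconds> <kind> <description...>"
# where kind is "change" or "alert" (hypothetical format).
# $1: window in seconds. For every alert, report the latest change
# that happened at most <window> seconds before it, if any.
correlate_alerts_with_changes() {
    awk -v w="$1" '
        $2 == "change" { last_ts = $1; last = $0; have = 1; next }
        $2 == "alert" {
            if (have && $1 - last_ts <= w)
                print $0 " <= possibly related: " last
            else
                print $0 " <= no recent change"
        }'
}

# Usage (events.txt is a placeholder file name):
#   sort -n events.txt | correlate_alerts_with_changes 300
```

A match only says the two events are close in time; it is a starting point for investigation, not proof of cause.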
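2. Risk quantification system
This mainly analyzes and evaluates the results of the observability system, i.e. whether it is complete. It can also quantify the change process and show whether each team's change operations can be trusted. Warning signs include releasing to full traffic without a canary (gray) rollout, frequently deploying during peak hours, and frequent rollbacks. A quantified health score per team is then used to push the weaker business lines to improve.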
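A team health score of this kind could be computed along the following lines. The white paper does not give a formula, so everything here is an assumption for illustration: the input columns, the function name, and the weights (40 for releases without a canary, 30 each for peak-hour releases and rollbacks).

```shell
# stdin: "<team> <total_changes> <no_canary> <peak_hour> <rollbacks>"
# Prints "<team> <score>", where 100 means none of the risky behaviours
# occurred. The weights are hypothetical and should be tuned per organization.
team_health_scores() {
    awk '
        $2 > 0 {
            penalty = (40 * $3 + 30 * $4 + 30 * $5) / $2
            printf "%s %d\n", $1, 100 - penalty
        }'
}

# Usage (changes.txt is a placeholder file name):
#   team_health_scores < changes.txt | sort -k2 -n
```

Sorting by score puts the teams most in need of attention at the top.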
The content above is excerpted from the article "稳定性体系建设白皮书" (Stability System Construction White Paper) by Qin Xiaohui (秦晓辉@快猫星云).
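The Nightingale solution
Nightingale (夜莺) is a visual monitoring product. Its open-source edition originated in Didi's operations team; it is one of the most active enterprise-grade cloud-native monitoring solutions in China, has been deployed by many teams, and has been proven in production. With Categraf, VictoriaMetrics, and Nightingale, an observability system can be set up quickly. Nightingale is a server-side component, similar to Grafana, that connects to different data sources and then analyzes, alerts on, and visualizes their data, and also handles follow-up event processing and alert self-healing. Like Grafana it provides visualization; unlike Prometheus, which manages alert rules through configuration files, Nightingale manages them collaboratively through a WebUI. The commercial edition offers more enterprise features for unified monitoring and fault location.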
Getting started with Nightingale V6
Operating system: 云耀云服务器 (Hyper Elastic Cloud Server)
cat /etc/centos-release
AlmaLinux release 8.4 (Electric Cheetah)
lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: AlmaLinux
Description: AlmaLinux release 8.4 (Electric Cheetah)
Release: 8.4
Codename: ElectricCheetah
uname -a
Linux hecs-34116 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Tue Sep 13 06:07:14 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
CPU information
lscpu|grep CPU
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 4
On-line CPU(s) list: 0-3
CPU family: 6
Model name: Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz
CPU MHz: 2600.000
NUMA node0 CPU(s): 0-3
x86_64, x64, and AMD64 are essentially the same thing; the Intel/AMD desktop CPUs in common use today are basically all x86_64.
Installing MariaDB
Default repository
yum info mariadb
Last metadata expiration check: 3:17:56 ago on Thu 13 Apr 2023 07:04:57 AM CST.
Available Packages
Name : mariadb
Epoch : 3
Version : 10.3.35
Release : 1.module_el8.6.0+3265+230ed96b
Architecture : x86_64
Size : 6.0 M
Source : mariadb-10.3.35-1.module_el8.6.0+3265+230ed96b.src.rpm
Repository : appstream
Summary : A very fast and robust SQL database server
URL : http://mariadb.org
License : GPLv2 with exceptions and LGPLv2 and BSD
Description : MariaDB is a community developed branch of MySQL - a multi-user, multi-threaded
: SQL database server. It is a client/server implementation consisting of
: a server daemon (mysqld) and many different client programs and libraries.
: The base package contains the standard MariaDB/MySQL client programs and
: generic MySQL files.
Add the MariaDB yum repository, choosing a mirror from the official site as needed
vim /etc/yum.repos.d/MariaDB.repo
# MariaDB 11.0 [RC] CentOS repository list - created 2023-04-13 02:37 UTC
# https://mariadb.org/download/
[mariadb]
name = MariaDB
# rpm.mariadb.org is a dynamic mirror if your preferred mirror goes offline. See https://mariadb.org/mirrorbits/ for details.
# baseurl = https://rpm.mariadb.org/11.0/centos/$releasever/$basearch
baseurl = https://mirrors.neusoft.edu.cn/mariadb/yum/11.0/centos/$releasever/$basearch
module_hotfixes = 1
# gpgkey = https://rpm.mariadb.org/RPM-GPG-KEY-MariaDB
gpgkey = https://mirrors.neusoft.edu.cn/mariadb/yum/RPM-GPG-KEY-MariaDB
gpgcheck = 1
Rebuild the cache.
yum clean all
yum makecache
Remove the old version
yum remove mariadb-server mariadb mariadb-libs
yum clean all
Find leftover directories and delete them (review the results before deleting anything)
find / -name mariadb
find / -name mysql
Install the new version and start the database
yum install MariaDB-server
Answer y at each prompt
Check the status:
systemctl status mariadb
● mariadb.service - MariaDB 11.0.1 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/mariadb.service.d
└─migrated-from-my.cnf-settings.conf
Active: inactive (dead)
Docs: man:mariadbd(8)
https://mariadb.com/kb/en/library/systemd/
Start
systemctl start mariadb
At this point it unexpectedly failed
Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
Following the hint and inspecting the error shows that the port is already in use
...
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [Note] Server socket created on IP: '0.0.0.0'.
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Can't start server: Bind on TCP/IP port. Got error: 98: Address already in use
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Do you already have another server running on port: 3306 ?
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Aborting
...
netstat -ntuap
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 232910/nginx: maste
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 232910/nginx: maste
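When the listing is long, picking out the process that holds one port by eye is tedious. The helper below is a small sketch that does it with awk; the function name port_owner is made up for this example, and it assumes the column layout of netstat -ntlp output shown above.

```shell
# Reads netstat-style lines on stdin and prints the "PID/Program name"
# column for TCP sockets in LISTEN state on the given port.
port_owner() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            n = split($4, parts, ":")           # port is the last ":" field
            if (parts[n] == port) {
                owner = $7                      # program name may contain spaces
                for (i = 8; i <= NF; i++) owner = owner " " $i
                print owner
            }
        }'
}

# Usage:
#   netstat -ntlp | port_owner 3306
```

For the output above this reports that the nginx master process, not a database, is holding port 3306.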
Change the port
vi /etc/my.cnf.d/server.cnf
Find the line that begins with [mysqld] and place the following port directive under it, as in the excerpt below. Change the port value as appropriate.
[mysqld]
port = 12345
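The same edit can be scripted instead of done in vi. The sketch below is one possible approach, not something the MariaDB packages provide: the function name set_mysqld_port is hypothetical, and it assumes the section header appears on its own line exactly as [mysqld].

```shell
# Reads an ini-style file on stdin and writes it to stdout with
# "port = <n>" set inside the [mysqld] section: an existing port line in
# that section is dropped, and the new one is added right after the header.
set_mysqld_port() {
    awk -v port="$1" '
        /^\[/ { in_section = ($0 == "[mysqld]") }
        in_section && /^[ \t]*port[ \t]*=/ { next }   # drop the old value
        { print }
        $0 == "[mysqld]" { print "port = " port }
    '
}

# Usage: write to a temporary file, review it, then replace the original.
#   set_mysqld_port 12345 < /etc/my.cnf.d/server.cnf > /tmp/server.cnf
```

Writing to a separate file first keeps the original intact if the result is not what you expected.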
Start again and check the status
systemctl start mariadb
systemctl status mariadb
● mariadb.service - MariaDB 11.0.1 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/mariadb.service.d
└─migrated-from-my.cnf-settings.conf
Active: active (running) since Thu 2023-04-13 11:21:07 CST; 2min 44s ago
Docs: man:mariadbd(8)
https://mariadb.com/kb/en/library/systemd/
Process: 2579593 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Process: 2579562 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment>
Process: 2579560 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Main PID: 2579576 (mariadbd)
Status: "Taking your SQL requests now..."
Tasks: 9 (limit: 49448)
Memory: 169.3M
CGroup: /system.slice/mariadb.service
└─2579576 /usr/sbin/mariadbd
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: log sequence number 47295; transaction id 14
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Plugin 'FEEDBACK' is disabled.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Plugin 'wsrep-provider' is disabled.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Server socket created on IP: '0.0.0.0'.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Server socket created on IP: '::'.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: Buffer pool(s) load completed at 230413 11:21:07
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] /usr/sbin/mariadbd: ready for connections.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: Version: '11.0.1-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 12345 MariaDB Server
Apr 13 11:21:07 hecs-34116 systemd[1]: Started MariaDB 11.0.1 database server.
It worked. Next, set up a user and password for MariaDB
Connect to the database
mysql
select user, host, plugin from mysql.user;
+-------------+------------+-----------------------+
| User | Host | plugin |
+-------------+------------+-----------------------+
| mariadb.sys | localhost | mysql_native_password |
| root | localhost | mysql_native_password |
| mysql | localhost | mysql_native_password |
| PUBLIC | | |
| | localhost | |
| | hecs-34116 | |
+-------------+------------+-----------------------+
Set privileges and a password (123456 is only a demo value; use a strong password on a real system)
GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY '123456';
Exit, then log in again
mysql -uroot -p123456 -e "show databases"
mysql: Deprecated program name. It will be removed in a future release, use '/usr/bin/mariadb' instead
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sys |
| test |
+--------------------+
Connect to the test database
mysql -uroot -p -D test
Installing redis
Check the packages available in the repositories
dnf list|grep redis
hiredis.x86_64 0.13.3-13.el8 epel
hiredis-devel.x86_64 0.13.3-13.el8 epel
pcp-pmda-redis.x86_64 5.3.7-7.el8 appstream
perl-RDF-Trine-redis.noarch 1.019-8.el8 epel
python3-redis.noarch 3.5.3-1.el8 epel
redis.x86_64 5.0.3-5.module_el8.4.0+2583+b9845322 appstream
redis-devel.x86_64 5.0.3-5.module_el8.4.0+2583+b9845322 appstream
redis-doc.noarch 5.0.3-5.module_el8.4.0+2583+b9845322 appstream
syslog-ng-redis.x86_64 3.23.1-3.el8 epel
uwsgi-logger-redis.x86_64 2.0.21-2.el8 epel
uwsgi-router-redis.x86_64 2.0.21-2.el8 epel
Download a newer release from the official source
cd /home/tarball/
wget https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.10 -O redis-7.0.10.tar.gz
tar zxvf redis-7.0.10.tar.gz
cd redis-7.0.10
make
Output on success
Hint: It's a good idea to run 'make test' ;)
make[1]: Leaving directory '/home/tarball/redis-7.0.10/src'
make PREFIX=/home/tarball/redis-7.0.10 install
Output on success
cd src && make install
make[1]: Entering directory '/home/tarball/redis-7.0.10/src'
CC Makefile.dep
Hint: It's a good idea to run 'make test' ;)
INSTALL redis-server
INSTALL redis-benchmark
INSTALL redis-cli
make[1]: Leaving directory '/home/tarball/redis-7.0.10/src'
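Start redis. The config file must come before the &; written as ./bin/redis-server& ./redis.conf, the server starts in the background with its built-in defaults and the shell then tries to run redis.conf as a separate command.
./bin/redis-server ./redis.conf &
Check redis status. The output below was captured with the server running on its defaults, hence the listener on *:6379; with the shipped redis.conf loaded, the bind address may differ.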
ps -aux | grep redis
root 2584263 0.0 0.1 62644 10196 pts/0 Sl 12:51 0:00 ./bin/redis-server *:6379
root 2584270 0.0 0.0 12140 1116 pts/1 S+ 12:51 0:00 grep --color=auto redis
netstat -ntuap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 2487/sshd
tcp 0 0 0.0.0.0:12345 0.0.0.0:* LISTEN 2579576/mariadbd
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 2584263/./bin/redis
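Installing the TSDB
According to the official documentation: "For small-scale use, e.g. fewer than 1,000 machines, Prometheus is sufficient as storage; beyond 1,000 machines, VictoriaMetrics is probably a better fit.
VictoriaMetrics comes in a single-node version and a cluster version. If you write fewer than 1 million data points per second (to put that number in perspective: for machine monitoring alone, with about 200 metrics collected per machine at a 10-second interval, each machine produces about 20 data points per second, and 1,000,000 / 20 = 50,000 machines), VictoriaMetrics officially recommends the single-node version by default. The single-node version gains performance linearly as you add CPU cores, memory, and IOPS, and it is easy to configure and operate."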
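The arithmetic in the quoted passage is easy to redo for your own numbers. The function below is a back-of-the-envelope sketch; its name and parameter order are made up here, and it ignores anything other than plain machine metrics.

```shell
# machines = capacity / (metrics per machine / collection interval)
#          = capacity * interval / metrics
# $1: data points per second the TSDB can ingest
# $2: metrics collected per machine
# $3: collection interval in seconds
machines_supported() {
    echo $(( $1 * $3 / $2 ))
}

machines_supported 1000000 200 10   # prints 50000
```

Halving the collection interval to 5 seconds, for instance, halves the number of machines the same write capacity can serve.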
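I have also heard from experienced users that at large scale Nightingale's main bottleneck is the TSDB. Given the figures above, single-node VictoriaMetrics is used for this setup.
Download VictoriaMetrics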
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.90.0/victoria-metrics-linux-amd64-v1.90.0.tar.gz
The download is rather slow
mkdir victoria-metrics
tar xf victoria-metrics-linux-amd64-v1.90.0.tar.gz -C victoria-metrics
cd victoria-metrics
Start
nohup ./victoria-metrics-prod &>victoria.log &
Check that the default port 8428 is listening
ss -ntpl
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=2487,fd=5))
LISTEN 0 80 0.0.0.0:12345 0.0.0.0:* users:(("mariadbd",pid=2579576,fd=21))
LISTEN 0 511 0.0.0.0:6379 0.0.0.0:* users:(("redis-server",pid=2584263,fd=6))
LISTEN 0 1024 0.0.0.0:8428 0.0.0.0:* users:(("victoria-metric",pid=2584341,fd=10))
Installing Nightingale
Download from the official site
wget https://download.flashcat.cloud/n9e-v6.0.0-ga.3-linux-amd64.tar.gz
mkdir n9e
tar zxvf n9e-v6.0.0-ga.3-linux-amd64.tar.gz -C n9e
Import the database schema (run this and the following commands from inside the n9e directory)
mysql -uroot -p <n9e.sql
Edit the N9e configuration file (be sure to change the secret keys and Auth-related fields before going to production)
vim etc/config.toml
[DB]
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
DSN="root:123456@tcp(127.0.0.1:12345)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true" # change the DB password and port
[[Pushgw.Writers]]
# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
Url = "http://127.0.0.1:8428/api/v1/write" # change to the VictoriaMetrics port
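The DSN in the [DB] section packs user, password, host, port, and database name into one string, which is easy to get wrong when editing by hand. As a small sketch, a helper like the following assembles it from its parts; the function name n9e_mysql_dsn is hypothetical, and the query parameters are simply copied from the DSN line above.

```shell
# Build the MySQL-style DSN used in etc/config.toml.
# Arguments: user, password, host, port, database
n9e_mysql_dsn() {
    printf '%s:%s@tcp(%s:%s)/%s?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true\n' \
        "$1" "$2" "$3" "$4" "$5"
}

n9e_mysql_dsn root 123456 127.0.0.1 12345 n9e_v6
```

With the values used in this article the output is the same string as the DSN shown above, with the MariaDB port changed earlier (12345) in the host part.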
Start the n9e service
nohup ./n9e &>n9e.log &
ss -ntlp | grep 17000
LISTEN 0 1024 *:17000 *:* users:(("n9e",pid=2584479,fd=9))
Configure nginx
server {
listen 80;
server_name xxxx.xxxx.com;
location / {
proxy_pass http://localhost:17000;
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
Open the site in a browser and log in with username root and password root.2020.
Download categraf
wget https://download.flashcat.cloud/categraf-v0.2.38-linux-amd64.tar.gz
tar xf categraf-v0.2.38-linux-amd64.tar.gz
cd categraf-v0.2.38-linux-amd64/
vim conf/config.toml
Change these two settings
[[writers]]
url = "http://127.0.0.1:17000/prometheus/v1/write"
[heartbeat]
enable = true
Start categraf
nohup ./categraf &>categraf.log &
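This completes a minimal, centralized deployment of Nightingale. From here, work in the WebUI: add data sources under System Configuration > Data Sources, define alerting conditions under Alert Rules, and so on.
The V6 architecture diagram and deployment modes are covered on the official blog. In short, n9e stores alert and configuration data in MySQL (MariaDB in this setup), stores authentication data, metadata, and heartbeats in redis, and uses the TSDB to store the metrics that alerts are evaluated against; categraf does the data collection.
n9e can be clustered, with multiple n9e instances sharing the work of evaluating alert rules (in that mode n9e is a stateful service).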
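Since the deployment consists of several independent processes, a quick way to confirm that all of them are up is to check their listening ports in one go. The function below is a sketch under assumptions: its name is made up, it expects ss -ntl output on stdin, and the port numbers in the usage line are the ones used in this article (12345 MariaDB, 6379 redis, 8428 VictoriaMetrics, 17000 n9e).

```shell
# stdin: output of `ss -ntl`; arguments: ports expected to be listening.
# Prints one line per port and returns non-zero if any port is missing.
check_listeners() {
    listing=$(cat)
    missing=0
    for port in "$@"; do
        if printf '%s\n' "$listing" | awk -v p="$port" '
               $1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == p) found = 1 }
               END { exit !found }'; then
            echo "port $port: listening"
        else
            echo "port $port: NOT listening"
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   ss -ntl | check_listeners 12345 6379 8428 17000
```

A listening port only shows that the process is running; it does not verify that n9e can actually reach the database, redis, or the TSDB.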
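Finally, thank you for reading. The author's experience is limited and many of these tools are still unfamiliar, so corrections of any errors or omissions are welcome.
References: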
https://flashcat.cloud/blog/sre-practice-white-paper/
https://mp.weixin.qq.com/s/5Ik-Kk1_B7jjgXLxHH1Oug
https://blog.csdn.net/m0_61323675/article/details/130114281
https://flashcat.cloud/blog/nightingale-v6-arch/
https://blog.csdn.net/wf19930209/article/details/79536506
https://www.cnblogs.com/xunzhiyou/p/16365158.html
https://blog.csdn.net/sqlquan/article/details/122093702
https://blog.csdn.net/w892824196/article/details/107062729
https://www.cnblogs.com/pxyblog/p/mysql.html
https://www.cnblogs.com/hunanzp/p/12304622.html
https://developer.aliyun.com/article/789869
https://flashcat.cloud/docs/content/flashcat-monitor/nightingale/install/victoriametrics/
https://blog.csdn.net/qihoo_tech/article/details/120558834