首页 > 数据库 >PostgreSQL 性能优化之 - 大锁与long sql/xact的蝴蝶效应

PostgreSQL 性能优化之 - 大锁与long sql/xact的蝴蝶效应

时间:2023-04-24 14:33:57浏览次数:46  
标签:PostgreSQL ms 16 long tps stddev 大锁 lat progress

PostgreSQL 性能优化之 - 大锁与long sql/xact的蝴蝶效应

简介:

在使用数据库时,某些特定的因素结合在一起,就可能引发蝴蝶效应。
导致数据库性能的急剧下降。
本文要讲的是和以下几个因素有关的:

因素1
PG的锁排队机制,即使没有获得锁,只要在锁队列中就会造成锁竞争。

session A lock1 get  
session B lock2 wait lock1 release  
session C lock3 可能和lock1, lock2 冲突  
session D lock4 可能和lock1,2,3 冲突  

因素2
SQL响应变慢后,服务端程序通常会增加到数据库的连接来处理拥塞的请求

因素3
数据库的性能会随着连接数增加到一个值(通常是核数的3倍)后,性能开始下降

因素4
对象锁在事务结束时释放
例如在事务中查询表时,表的共享锁需要等到事务结束时释放。

以上4个因素加在一起,就可能引发一次应用级别的故障。

模拟
开启lock跟踪:

log_lock_waits = on   
deadlock_timeout = 1s  

创建测试表

postgres=# create table test(id int primary key, info text, crt_time timestamp);  
CREATE TABLE  
postgres=# insert into test select generate_series(1,10000000),md5(random()::text),clock_timestamp();  
INSERT 0 10000000  

测试脚本

vi test1.sql  
\setrandom id 1 10000000  
update test set info=info where id=:id;  

.1. pgbench1 模拟数据更新A。使用10个链接(假设正常只需要10个)
正常的性能

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 10 -j 10 -T 10  
progress: 2.0 s, 65994.3 tps, lat 0.149 ms stddev 0.038  
progress: 3.0 s, 67706.5 tps, lat 0.145 ms stddev 0.051  
progress: 4.0 s, 72865.0 tps, lat 0.135 ms stddev 0.048  
progress: 5.0 s, 77664.2 tps, lat 0.126 ms stddev 0.032  
progress: 6.0 s, 77138.9 tps, lat 0.127 ms stddev 0.037  
progress: 7.0 s, 75941.3 tps, lat 0.129 ms stddev 0.061  
progress: 8.0 s, 77328.8 tps, lat 0.127 ms stddev 0.036  

开启长时间更新请求

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 10 -j 10 -T 10000  

.2. 模拟一个查询长事务,查询表A
也可能是后台的whole vacuum prevent wrapper

postgres=# begin;  
BEGIN  
postgres=# select * from test limit 1;  
 id |               info               |          crt_time            
----+----------------------------------+----------------------------  
  1 | e86e219d51c39d16f78d77cf697395ca | 2016-03-16 16:07:49.814487  
(1 row)  

暂不结束事务, 持有test表的shared lock.

.3. 模拟一个DDL请求A

postgres=# alter table test add column c1 int;  

等待test shared lock锁释放

马上会堵塞正常的业务请求,tps降到0

progress: 53.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 54.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 56.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 57.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 58.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 59.0 s, 0.0 tps, lat -nan ms stddev -nan  

.4. 这个时候,业务并不知道数据库堵塞了,会增加更多的连接来处理用户的请求。甚至可能把连接塞满。
pgbench2 模拟拥塞更新A,新建500链接

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 500 -j 500 -T 10000  

新增的连接会全部处于这样的状态:

digoal  25434  22068  0 16:21 ?        00:00:00 postgres: postgres postgres [local] PARSE waiting  
digoal  25437  22068  0 16:21 ?        00:00:00 postgres: postgres postgres [local] PARSE waiting  

.5. 结束长事务或结束DDL请求后,锁释放。
锁释放,大量并发的连接开始处理拥塞的请求
此时性能下降了一半
pgbench2

progress: 10.3 s, 270.5 tps, lat 1396.862 ms stddev 3498.526  
progress: 11.0 s, 34443.5 tps, lat 64.132 ms stddev 709.718  
progress: 12.0 s, 34986.1 tps, lat 14.229 ms stddev 18.469  
progress: 13.0 s, 36645.0 tps, lat 13.661 ms stddev 17.686  
progress: 14.0 s, 34570.1 tps, lat 14.463 ms stddev 18.716  
progress: 15.0 s, 36435.8 tps, lat 13.752 ms stddev 17.621  
progress: 16.0 s, 35513.3 tps, lat 14.052 ms stddev 18.087  
progress: 17.0 s, 35560.0 tps, lat 14.013 ms stddev 18.159  

pgbench1

progress: 59.0 s, 688.7 tps, lat 340.857 ms stddev 2734.371  
progress: 60.0 s, 733.0 tps, lat 13.659 ms stddev 18.501  
progress: 61.0 s, 816.0 tps, lat 12.237 ms stddev 16.941  
progress: 62.0 s, 811.0 tps, lat 12.328 ms stddev 16.715  
progress: 63.0 s, 809.9 tps, lat 12.370 ms stddev 17.370  
progress: 64.0 s, 750.1 tps, lat 13.338 ms stddev 17.745  

将后建立的500个连接释放后,恢复正常的性能

progress: 66.0 s, 1937.8 tps, lat 5.044 ms stddev 12.975  
progress: 67.0 s, 64995.8 tps, lat 0.157 ms stddev 0.757  
progress: 68.0 s, 73996.3 tps, lat 0.133 ms stddev 0.042  
progress: 69.0 s, 78099.4 tps, lat 0.125 ms stddev 0.038  

日志
可以追踪到锁等待的源头,但是不能追踪到大锁。

2016-03-16 16:25:57.531 CST,"postgres","postgres",48877,"[local]",56e91894.beed,3,"ALTER TABLE waiting",2016-03-16 16:25:56 CST,13/28,580426398,LOG,00000,"process 48877 still waiting for AccessExclusiveLock on relation 61245 of database 13241 after 1000.048 ms","Process holding the lock: 48557. Wait queue: 48877, 46333, 46331, 46338, 46334, 46339, 46335, 46340, 46337, 46328, 46336.",,,,,"alter table test add column c1 int;",,"ProcSleep, proc.c:1323","psql"  
2016-03-16 16:25:57.531 CST,"postgres","postgres",46333,"[local]",56e91871.b4fd,3,"BIND waiting",2016-03-16 16:25:21 CST,4/263058,0,LOG,00000,"process 46333 still waiting for RowExclusiveLock on relation 61245 of database 13241 after 1000.036 ms","Process holding the lock: 48557. Wait queue: 48877, 46333, 46331, 46338, 46334, 46339, 46335, 46340, 46337, 46328, 46336.",,,,,"update test set info=info where id=$1;",,"ProcSleep, proc.c:1323","pgbench"  
  
2016-03-16 16:26:10.191 CST,"postgres","postgres",49812,"[local]",56e918a1.c294,3,"PARSE waiting",2016-03-16 16:26:09 CST,14/29,0,LOG,00000,"process 49812 still waiting for RowExclusiveLock on relation 61245 of database 13241 after 1000.207 ms","Process holding the lock: 48557. Wait queue: 48877, 46333, 此处省略500+ PIDs, 50816, 50817.",,,,,"update test set info=info where id=$1;",8,"ProcSleep, proc.c:1323","pgbench"  
  
2016-03-16 16:26:19.367 CST,"postgres","postgres",48877,"[local]",56e91894.beed,4,"ALTER TABLE waiting",2016-03-16 16:25:56 CST,13/28,580426398,LOG,00000,"process 48877 acquired AccessExclusiveLock on relation 61245 of database 13241 after 22836.312 ms",,,,,,"alter table test add column c1 int;",,"ProcSleep, proc.c:1327","psql"  
2016-03-16 16:26:19.368 CST,"postgres","postgres",48877,"[local]",56e91894.beed,5,"ALTER TABLE",2016-03-16 16:25:56 CST,13/28,580426398,ERROR,42701,"column ""c1"" of relation ""test"" already exists",,,,,,"alter table test add column c1 int;",,"check_for_column_name_collision, tablecmds.c:5069","psql"  
  
2016-03-16 16:26:19.379 CST,"postgres","postgres",49814,"[local]",56e918a1.c296,4,"PARSE waiting",2016-03-16 16:26:09 CST,15/2,0,LOG,00000,"process 49814 acquired RowExclusiveLock on relation 61245 of database 13241 after 10177.162 ms",,,,,,"update test set info=info where id=$1;",8,"ProcSleep, proc.c:1327","pgbench"  

要追踪大锁,
可以使用以下SQL

with t_wait as                       
(select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,  
a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a,  
transactionid,b.query,b.xact_start,b.query_start,b.usename,b.datname   
  from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted),  
t_run as   
(select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,  
a.classid,a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,  
a,transactionid,b.query,b.xact_start,b.query_start,  
b.usename,b.datname from pg_locks a,pg_stat_activity b where   
a.pid=b.pid and a.granted)   
select r.locktype,r.mode r_mode,r.usename r_user,r.datname r_db,  
r.relation::regclass,r.pid r_pid,  
r.page r_page,r.tuple r_tuple,r.xact_start r_xact_start,  
r.query_start r_query_start,  
now()-r.query_start r_locktime,r.query r_query,w.mode w_mode,  
w.pid w_pid,w.page w_page,  
w.tuple w_tuple,w.xact_start w_xact_start,w.query_start w_query_start,  
now()-w.query_start w_locktime,w.query w_query    
from t_wait w,t_run r where  
  r.locktype is not distinct from w.locktype and  
  r.database is not distinct from w.database and  
  r.relation is not distinct from w.relation and  
  r.page is not distinct from w.page and  
  r.tuple is not distinct from w.tuple and  
  r.classid is not distinct from w.classid and  
  r.objid is not distinct from w.objid and  
  r.objsubid is not distinct from w.objsubid and  
  r.transactionid is not distinct from w.transactionid and  
  r.pid <> w.pid  
  order by   
  ((  case w.mode  
    when 'INVALID' then 0  
    when 'AccessShareLock' then 1  
    when 'RowShareLock' then 2  
    when 'RowExclusiveLock' then 3  
    when 'ShareUpdateExclusiveLock' then 4  
    when 'ShareLock' then 5  
    when 'ShareRowExclusiveLock' then 6  
    when 'ExclusiveLock' then 7  
    when 'AccessExclusiveLock' then 8  
    else 0  
  end  ) +   
  (  case r.mode  
    when 'INVALID' then 0  
    when 'AccessShareLock' then 1  
    when 'RowShareLock' then 2  
    when 'RowExclusiveLock' then 3  
    when 'ShareUpdateExclusiveLock' then 4  
    when 'ShareLock' then 5  
    when 'ShareRowExclusiveLock' then 6  
    when 'ExclusiveLock' then 7  
    when 'AccessExclusiveLock' then 8  
    else 0  
  end  )) desc,r.xact_start;  

优化措施或处理措施
.1. 养成大锁处理习惯, 配置锁等待超时
.2. 应用程序或中间件应该有自动释放空闲连接的功能
.3. auto_explain也不会记录锁等待的时间,所以不利于分析原因。只有从日志中才能分析。

标签:PostgreSQL,ms,16,long,tps,stddev,大锁,lat,progress
From: https://www.cnblogs.com/yaoyangding/p/17349415.html

相关文章

  • 批量更新Postgresql的序列
    序列(sequence)是PostgreSQL中的一种对象,用于生成自动递增的唯一标识符。通常,序列会与表的自增主键一起使用,以确保每个新插入的行都有一个唯一的标识符。在某些情况下,可能需要更新序列的值:从另一个数据库中导入数据,自增列的值也从原来的数据中导入。导入的过程中,目标数据库的序列......
  • 7 Best Practice Tips for PostgreSQL Bulk Data Loading
    7BestPracticeTipsforPostgreSQLBulkDataLoading  February19,2023Sometimes,PostgreSQLdatabasesneedtoimportlargequantitiesofdatainasingleoraminimalnumberofsteps.Thisiscommonlyknownasbulkdataimportwherethedatas......
  • Java Long对象对比,用equals函数
    文章目录一、前文二、调试三、后记一、前文同事问我一个问题,我咋一看也愣住了代码:if(ObjectUtil.isNotNull(temp.getUserId())&&temp.getUserId()!=SecurityUtils.getUserId()){logger.error(temp.toString());logger.error("SecurityUtils.getUserId()="+Secur......
  • 后端long类型传到前端精度丢失
    问题:c#的long类型精度比前端JS的number精度高,前端JS的long类型只有17位,当后端传给前端的long类型超过17位时,就会造成数据精度丢失。解决方案: 1.我们可以自定义一个json序列化的转换器,将long类型转换为string类型。代码如下:publicclassLongToStringConverter:Json......
  • PostgreSQL中的复制延迟
    PostgreSQL是一个流行的开源关系数据库管理系统,PostgreSQL中可能遇到的一个常见问题是复制延迟。在这篇博客中,我们将讨论什么是复制延迟,它为什么会发生,以及如何在PostgreSQL中减轻它。什么是复制延迟?复制延迟是指数据写入主数据库和复制到备用数据库之间的时间延迟。在PostgreSQ......
  • ubuntu 22.04安装postgresql
    安装sudoaptinstallpostgresql修改/etc/postgresql/14/main/postgresql.conf把listen_addresses='127.0.0.1'修改为listen_addresses='*'/etc/postgresql/14/main/pg_hba.conf添加hostallall0.0.0.0/0m......
  • 9.x - 13.0 postgresql 分区表新特性及简单用法
    一、分区表定义与意义1.分区表的定义把一个大的物理表分成若干个小物理表,并使得这些小物理表在逻辑上可以被当成一张表来使用。主表/父表/MasterTable主表是创建子表的模板,是一个正常的普通表,一般主表并不存任何数据。子表/分区表/ChlidTable/PartitionTable子表继承并属于一......
  • 解决 Error running ‘Application‘: Command line is too long.
    一、项目场景:运行刚拉取下来的项目代码,出现下面问题描述的错误提示。二、问题描述Errorrunning'Application':Commandlineistoolong.ShortencommandlineforApplicationoralsoforSpringBootdefaultconfiguration?翻译翻译:运行“Application”时出错:命令行太长......
  • postgresql -- 执行计划
    一、显示执行计划pg中explain命令格式如下explain[options]sql语句例如explainselect*fromtest;explain(formatxml)select*fromtest;explain(analyzetrue,bufferstrue)select*fromtest;options可选项如下:ANALYZE(默认FALSE):实际执行sql,显示真实的执行计划及......
  • postgresql 函数错误捕捉
    CREATEORREPLACEFUNCTION"public"."proc_net_agent_diamond_loss"("dwuserid"int4,"strdate"date)RETURNS"public"."my_returninfo"AS$BODY$DECLAREresultMy_ReturnInfo;v_start_timeTIMESTAMP(......