PostgreSQL DBA之数据库指标监控

news/2024/7/9 22:03:51 标签: 数据库, postgresql, dba

常规监控

  • 连接数使用超过95%告警,最大连接数和当前连接数

  select (select count(*) from pg_stat_activity) /(select setting from pg_settings where name = 'max_connections') ::numeric >0.95;
  • 存在空闲连接idle in transaction大于5秒的连,如果长期存在这样的连接说明应用有问题,并且会造成数据库出现异常

 select count(*) from pg_stat_activity where state = 'idle in transaction' and now()-query_start>'5s'::interval);
  • 长事务,超过30分钟的慢sql

 select count(*) from pg_stat_activity where state <> 'idle' and (backend_xid is not null or backend_xmin is not null) and now()-xact_start > interval '1800 sec'::interval;
 select pg_size_pretty(pg_database_size('postgres'));
 select datname, pg_size_pretty(pg_database_size(datname)) from pg_database;
  • 数据库中存在长时间阻塞的锁,阻塞超过3分钟

 select
  kl.pid as 阻塞pid,
 a.usename as blocked_user,
 ka.query as 阻塞sql,
 ka.state as 阻塞sql状态,
 now()- ka.query_start as 阻塞语句执行时长,--阻塞的语句执行时长
  bl.pid as 被阻塞pid,
 ka.usename as 被阻塞用户,
 a.query as 被阻塞sql,
 now()- a.query_start as 被阻塞语句执行时长--被阻塞语句执行时长
 from
 pg_locks bl
 join pg_stat_activity a on
 a.pid = bl.pid
 join pg_locks kl on
 kl.transactionid = bl.transactionid
 and kl.pid != bl.pid
 join pg_stat_activity ka on
 ka.pid = kl.pid
 where
 not bl.granted
  and now()- ka.query_start >'60s'::interval;--阻塞时长超过1分钟
 
 
  --阻塞pid对应的是数据库中pid,可以使用select pg_terminate_backend(pid);进行kill
  • 坏元组超过10000行,且占比超过20%的表

 select current_database(),
  schemaname||'.'||relname,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup*100/ (n_live_tup )::numeric,2)||'%' AS dead_tup_ratio
  FROM
  pg_stat_all_tables
  WHERE
  n_dead_tup >= 10000 and n_live_tup>0 and (n_dead_tup)/(n_live_tup)::numeric>0.25 and schemaname not in ('pg_toast','pg_catalog','information_schema')
  ORDER BY dead_tup_ratio DESC
  LIMIT 10;
 
  • wal日志生成量

 --'AD/FAFFF2A8'是5分钟前select pg_current_wal_lsn()获取的值
 select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),'AD/FAFFF2A8'));
  • 表缓存命中率和索引缓存命中率,命中率低于80%说明缓存不足

  • 注意,不常用的库缓存可能为0

 SELECT'index hit rate' AS name,
 (sum(idx_blks_hit)) / nullif(sum(idx_blks_hit + idx_blks_read),0) AS ratio
 FROM pg_statio_user_indexes
 UNION ALL
 SELECT
 'table hit rate' AS name,
 sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read),0) AS ratio
 FROM pg_statio_user_tables;
  • 数据库年龄监控

  • 当age达到2亿会触发自动清理,如果大于5亿则告警

 select datname,age(datfrozenxid),pg_size_pretty(pg_database_size(oid)) from pg_database where age(datfrozenxid) > 500000000 order by age(datfrozenxid) desc limit 10 ;
  • 表年龄(用于查看详情,默认数据库年龄会存储表年龄最大的一条)

 select sp.nspname,relname,age(relfrozenxid), pg_size_pretty(pg_table_size(cl.oid)) from pg_class cl join pg_namespace sp on cl.relnamespace = sp.oid where relkind in ('t','r') order by age(relfrozenxid) desc limit 10;
 
 -- 说明:当age到达2亿(默认)时触发自动清理,期间会大量占用系统资源。提前做好监控避免在业务高峰时发生。可在库级别操作,也可在表基本操作。
  • 事务提交,事务回滚,全表扫描,索引扫描,插入记录,更新记录

 select sum(xact_commit),sum(xact_rollback),sum(tup_returned),sum(tup_fetched),sum(tup_inserted),sum(tup_updated),sum(tup_deleted),sum(conflicts),sum(deadlocks) from pg_stat_database;
 -- pg_stat_get_db_xact_commit 为stable函数,一个事务中两次调用之间只执行一次,所以需要外部多次执行。

回滚率计算:select xact_rollback/(xact_commit+xact_rollback),如果超过10%则代表回滚率较高

  • 过去5分钟内生成wal个数,可以判断当前数据库的dml频繁程度

 select count(1) from pg_catalog.pg_ls_waldir() where modification > CURRENT_TIMESTAMP - '5 minutes' :: INTERVAL ;
  • wal写入速率

  • lsn是一个递增的值

 SELECT CASE WHEN pg_is_in_recovery() THEN pg_last_wal_replay_lsn() ELSE pg_current_wal_lsn() END - '0/0' as wal_lsn;

高可用,逻辑复制

  • 现场使用arrs或者高可用,出现异常后并不知情,可监控复制状态,以及日志堆积大小判断当前复制情况

pg10+

  • 复制状态,true表示正常,false表示异常

  select active from pg_replication_slots;
  • 备库applay延迟,以及堆积情况

 select
 application_name,client_addr,client_hostname,client_port,state,sync_priority,sync_state,pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn)),
 pg_wal_lsn_diff(pg_current_wal_lsn(),write_lsn) write_delay,
 pg_wal_lsn_diff(pg_current_wal_lsn(),flush_lsn) flush_delay,
 pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) replay_dely
  from pg_stat_replication ;
  • arrs复制延迟,以及堆积情况

  select slot_name, plugin, slot_type,
 temporary, active, active_pid,
 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) from
 pg_replication_slots ;

pg10以下

  • 备库applay延迟,复制堆积超过1GB

 select
 application_name,client_addr,client_hostname,client_port,state,sync_priority,
 sync_state,pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(),replay_location)),
 pg_xlog_location_diff(pg_current_xlog_location(),write_location) write_delay,
 pg_xlog_location_diff(pg_current_xlog_location(),flush_location) flush_delay,
 pg_xlog_location_diff(pg_current_xlog_location(),replay_location) replay_dely
  from pg_stat_replication;
  select slot_name, plugin, slot_type,
  active, active_pid,
 pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) from
 pg_replication_slots;
  • 主从流复制延时时间 (从库执行)

 SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END
 10 版本及以后
 SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END;
 -- 主从复制延迟字节 (主库执行)
 select greatest(0,pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) from pg_stat_replication ; where client_addr = '从库

慢SQL监控

  • 数据库需要安装插件pg_stat_statements

  • 数据库中当前频繁执行的SQL TOP5

 SELECT C.rolname,
  b.datname,
  A.total_time / A.calls per_call_time,
  A.*
 FROM
  pg_stat_statements A,
  pg_database b,
  pg_authid C
 WHERE
  A.userid = C.oid
  AND A.dbid = b.oid
 ORDER BY
  A.total_time DESC
  LIMIT 5
  • 最耗时 SQL,单次调用最耗时 SQL TOP5

 select userid::regrole, dbid, query from pg_stat_statements order by mean_time desc limit 5;
  • 最耗共享内存 SQL

 select userid::regrole, dbid, query from pg_stat_statements order by (shared_blks_hit+shared_blks_dirtied) desc limit 5;
  • 总最耗IO SQL TOP5

 select userid::regrole, dbid, query from pg_stat_statements order by (blk_read_time+blk_write_time) desc limit 5;
  • 最耗IO SQL,单次调用最耗IO SQL TOP5

 select userid::regrole, dbid, query from pg_stat_statements order by (blk_read_time+blk_write_time)/calls desc limit 5;
  • QPS

  with
  a as(select sum(calls)s from pg_stat_statements),
  b as(select sum(calls)s from pg_stat_statements,pg_sleep(1))
  select b.s-a.s --QPS
  from a,b;
  • 长期(30天)未更新统计信息的表,坏元组比例超过10%就应该触发更新统计信息(警告)

 
 SELECT
  to_char((now() - last_analyze), 'DD'), to_char((now() - last_autoanalyze),'DD' ) ,
 current_database(),
  schemaname||'.'||relname,n_live_tup,n_dead_tup,
  round(n_dead_tup*100/n_live_tup::numeric,2)||'%' as dead_tup_ratio
 FROM
  pg_stat_all_tables
 WHERE
  schemaname not in( 'pg_toast','pg_catalog','information_schema' )
 AND (last_analyze IS null OR to_char((now() - last_analyze), 'DD') :: NUMERIC > 30)
 AND (last_autovacuum IS null OR to_char((now() - last_autovacuum),'DD') :: NUMERIC > 30)
 AND (last_autoanalyze IS null OR to_char((now() - last_autoanalyze),'DD' ) :: NUMERIC > 30)
 and n_dead_tup >= 10000 and n_live_tup>0 and (n_dead_tup)/(n_live_tup)::numeric > 0.1;
  • 未使用的索引

 SELECT
 schemaname || '.' || relname AS table,
 indexrelname AS index,
 pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
 idx_scan as index_scans
 FROM pg_stat_user_indexes ui
 JOIN pg_index i ON ui.indexrelid = i.indexrelid
 WHERE NOT indisunique
 AND idx_scan < 5
 AND pg_relation_size(relid) > 5 * 8192
 ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
 pg_relation_size(i.indexrelid) DESC;
 
--修改后:

 SELECT
 schemaname || '.' || relname AS table,
 indexrelname AS index,
 pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,pg_size_pretty(pg_relation_size(relid)),
 idx_scan as index_scans
 FROM pg_stat_user_indexes ui
 JOIN pg_index i ON ui.indexrelid = i.indexrelid
 WHERE NOT indisunique
 AND idx_scan < 5
 AND pg_relation_size(relid) > 5 * 8192*1000
 ORDER BY pg_relation_size(i.indexrelid) desc limit 5;
  • 表未创建主键

 SELECT
 current_database (),ns.nspname,obj.relname
 FROM
 pg_class obj,pg_namespace ns
 WHERE
 obj.relnamespace = ns.oid
 AND obj.relkind = 'r'
 AND ns.nspname NOT IN ('pg_toast','pg_catalog','information_schema')
 AND obj.oid NOT IN (SELECTobj.oid FROMpg_class obj,pg_index idxWHEREobj.oid = idx.indrelidAND idx.indisprimary = 'true')
 limit 10;
 SELECT
 current_database (),
 ns.nspname,
 rel.relname,
 tri.tgname
 FROM
 pg_class rel,
 pg_namespace ns,
 pg_trigger tri
 WHERE
 ns.nspname NOT IN ('pg_toast','pg_catalog','information_schema')
 AND rel.oid = tri.tgrelid
 AND rel.relnamespace = ns.oid
 AND rel.relhastriggers = true and tgisinternal= = false;
  • 上班期间存在备份进程

 select st.application_name,st.datname,st.query_start,st.client_addr,client_hostname,query from pg_stat_activity st where state='active' and application_name='pg_dump';
  • 上班期间存在创建索引进程

 select st.application_name,st.datname,st.query_start,st.client_addr,client_hostname
 from pg_stat_activity st where state='active' and (st.query ilike 'create index%' or st.query ilike 'create uniuqe index%');
  • 上班期间重建索引

 select st.application_name,st.datname,st.query_start,st.client_addr,client_hostname from pg_stat_activity st where state='active' and query ilike 'reindex%';
 
  • vacuum full进程

 select st.application_name,st.datname,st.query_start,st.client_addr,client_hostname,query from pg_stat_activity st where state='active' and query ilike 'vacuum full%';
  • 重复索引

 select allindx.schemaname,allindx.relname,allindx.indexrelname,ind.indexdef FROM pg_stat_all_indexes allindx,pg_indexes ind where allindx.relname=ind.tablename and allindx.indexrelname=ind.indexname and allindx.indexrelid :: VARCHAR
 IN (SELECT regexp_split_to_table(indexs, $$,$$)
 FROM(SELECT string_agg (idx :: VARCHAR, $$,$$) AS indexs
 FROM(
 SELECT indexrelid :: oid AS idx,
 (indrelid :: TEXT || $$BB$$ || indclass :: TEXT || $$BB$$ || indkey :: TEXT || $$BB$$ || COALESCE (indexprs :: TEXT, $$BB$$) || $$BB$$ || COALESCE (indpred :: TEXT, $$BB$$)) AS val FROM pg_index ) sub
 GROUP BYval HAVINGCOUNT (*) > 1
 ORDER BY SUM (pg_relation_size(idx)) desc
 ) arr) and allindx.schemaname not in ('pg_toast','pg_catalog');

http://www.niftyadmin.cn/n/5260945.html

相关文章

selenium+python自动化测试 —— 解决无法启动IE浏览器及报错问题!

前言&#xff1a;记录启动IE浏览器的报错及解决方法。 错误1&#xff1a; selenium.common.exceptions.WebDriverException: Message: IEDriverServer.exe executable needs to be in PATH. Please download from http://selenium-release.storage.googleapis.com/index.html…

案例:xpath实例+功能

素材&#xff1a;test.html 代码如下&#xff1a; <!DOCTYPE html> <html lang"en"><head><meta charset"UTF-8" /><title>Title</title></head><body><span>我爱你</span><ul><li…

华为OD机试真题-精准核酸检测-2023年OD统一考试(C卷)

题目描述: 为了达到新冠疫情精准防控的需要,为了避免全员核酸检测带来的浪费,需要精准圈定可能被感染的人群。 现在根据传染病流调以及大数据分析,得到了每个人之间在时间、空间上是否存在轨迹的交叉。 现在给定一组确诊人员编号(X1, X2, X3, .... Xn),在所有人当中,找出…

R语言对医学中的自然语言(NLP)进行机器学习处理(1)

什么是自然语言(NLP)&#xff0c;就是网络中的一些书面文本。对于医疗方面&#xff0c;例如医疗记录、病人反馈、医生业绩评估和社交媒体评论,可以成为帮助临床决策和提高质量的丰富数据来源。如互联网上有基于文本的数据(例如,对医疗保健提供者的社交媒体评论),这些数据我们可…

JavaScript询问gpt:如何合并数组?

如何合并数组&#xff1f; 在 JavaScript 中&#xff0c;你可以使用多种方法来合并数组。以下是一些常见的方法&#xff1a; 1. 使用 concat() 方法&#xff1a; javascript const array1 [1, 2, 3]; const array2 [4, 5, 6]; const mergedArray array1.concat(…

hive聚合函数之JOIN原理及案例

1.数据准备 原始数据 创建dept.txt文件&#xff0c;并赋值如下内容&#xff0c;上传HDFS。 部门编号 部门名称 部门位置id 10 行政部 1700 20 财务部 1800 30 教学部 1900 40 销售部 1700创建emp.txt文件&#xff0c;并赋值如下内容&#xff0c;上传HDFS。 员工编号 姓名 岗…

centos8stream 升级 sqlite3 ,解决 SQLite 3.27 or later is required (found 3.26.0).

服务器环境是centos8stream, 默认的sqlite是 3.26 &#xff0c;因此&#xff0c;需要升级。 sqlite官网&#xff1a;SQLite Download Page 1.从官网下载最新源码包 cd /opt/ wget https://www.sqlite.org/2023/sqlite-autoconf-3440200.tar.gz tar xvf sqlite-autoconf-344020…

【JVM从入门到实战】(六)类加载器的双亲委派机制

一、双亲委派机制 在Java中如何使用代码的方式去主动加载一个类呢&#xff1f; 方式1&#xff1a;使用Class.forName方法&#xff0c;使用当前类的类加载器去加载指定的类。 方式2&#xff1a;获取到类加载器&#xff0c;通过类加载器的loadClass方法指定某个类加载器加载。 …