
关键词:深入理解MySQL主从架构中的Seconds_Behind_Master指标、复制延迟、高可用、故障切换、监控
在互联网金融、实时计费、秒杀等高并发场景下,主从延迟直接决定用户体验与资金安全。Seconds_Behind_Master(SBM)作为最直观的复制延迟指标,却常被误读。本文从源码级拆解其计算逻辑,给出可落地的监控+报警+切换一体化方案,并附完整 Python 代码案例,帮助你在 5 分钟内发现延迟根因,10 分钟内完成故障切换。
官方定义:SHOW SLAVE STATUS\G 中的 Seconds_Behind_Master 表示“从库 SQL 线程当前执行到的事务与 I/O 线程接收到的主库 binlog 时间戳差值”。
计算源码(MySQL 8.0.33,sql/rpl_slave.cc):
long seconds= (long)(time(0) – mi->rli->last_master_timestamp);
if (seconds < 0) seconds= 0;
关键点:
last_master_timestamp 取自 relay log 中事务的原始 original_commit_timestamp;time(0) 持续增大,SBM 随之飙升;| 场景 | 允许最大 SBM | 触发动作 |
|---|---|---|
| 金融支付 | 1 s | 关闭读流量,主库直写 |
| 电商秒杀 | 3 s | 限流降级,队列缓冲 |
| 报表从库 | 300 s | 只读查询路由切换 |
使用 mysqld_exporter 采集 mysql_slave_lag_seconds{instance="slave1:3306"},PromQL:
mysql_slave_lag_seconds > 3
告警模板:
summary = "MySQL 主从延迟 {{ $value }}s",
description = "主从延迟持续 3 个采集周期超过阈值,请检查 IO/SQL 线程状态",
下面给出完整可运行的 Python3 脚本,依赖包:
pip install pymysql prometheus_client dnspython
import pymysql, time, os, socket
from prometheus_client import Gauge, start_http_server
GAUGE = Gauge('mysql_slave_lag_seconds', 'Seconds Behind Master', ['host'])
def get_slave_lag(host, port, user, pwd):
"""
获取 Seconds_Behind_Master 并处理 NULL 场景
返回值: int 型延迟秒数,异常返回 -1
"""
try:
conn = pymysql.connect(host=host, port=port, user=user,
password=pwd, charset='utf8mb4',
connect_timeout=2, read_timeout=3)
with conn.cursor() as cur:
cur.execute("SHOW SLAVE STATUS")
row = cur.fetchone()
if row is None:
return -1 # 非从库
cols = [d[0] for d in cur.description]
dat = dict(zip(cols, row))
io_thread = dat['Slave_IO_Running'] == 'Yes'
sql_thread = dat['Slave_SQL_Running'] == 'Yes'
sbm = dat['Seconds_Behind_Master']
if not io_thread or not sql_thread:
# 任一线程停止,认为延迟无限大
return 99999
if sbm is None:
# 常见于刚启动复制或 SQL 线程空闲
return 0
return int(sbm)
except Exception as e:
print("ERROR", e)
return -1
finally:
conn.close()
def main():
start_http_server(9104)
slave_host = os.getenv("SLAVE_HOST", "172.16.0.31")
while True:
lag = get_slave_lag(slave_host, 3306, "exporter", "xxx")
GAUGE.labels(host=slave_host).set(lag)
time.sleep(1)
if __name__ == '__main__':
main()
代码解读(≥500 字):
connect_timeout+read_timeout 防止网络 hang 住主线程;-1,与正常延迟 0 区分,方便 Prometheus 做 up{job="mysql-slag"} 存活检查。Slave_IO_Running 与 Slave_SQL_Running 都为 Yes,才真正认为复制正常;否则返回 99999,触发熔断。CHANGE MASTER TO 启动复制,尚未收到任何事件时,Seconds_Behind_Master 为 NULL。脚本将其映射为 0,符合“尚未落后”语义。prometheus_client 库,在 9104 端口暴露 /metrics,可直接被 Prometheus 拉取,无需额外 exporter。指标名 mysql_slave_lag_seconds 与 mysqld_exporter 保持一致,方便统一仪表盘。SLAVE_HOST 注入目标从库地址,镜像仅 35 MB,可在 K8s 中作为 Sidecar 与从库 Pod 同生命周期部署,实现“一对一”精准监控,避免传统中心 exporter“一挂全挂”风险。
import dns.resolver, pymysql, os, time, signal, subprocess
DOMAIN = "read-only.example.com"
THRESHOLD = 3 # 秒
def get_a_record(domain):
ans = dns.resolver.resolve(domain, 'A')
return [str(r) for r in ans][0]
def set_a_record(domain, new_ip, ttl=30):
"""
调用阿里云 CLI 修改 DNS,可自行替换为 Route53、Cloudflare API
"""
subprocess.run([
"aliyun", "alidns", "UpdateDomainRecord",
"--RecordId", os.getenv("RECORD_ID"),
"--RR", "read-only",
"--Type", "A",
"--Value", new_ip,
"--TTL", str(ttl)
], check=True)
def find_healthiest_slave():
candidates = ["172.16.0.31", "172.16.0.32", "172.16.0.33"]
lag_list = []
for ip in candidates:
lag = get_slave_lag(ip, 3306, "monitor", "xxx")
if lag != -1 and lag <= THRESHOLD:
lag_list.append((ip, lag))
if not lag_list:
return None
lag_list.sort(key=lambda x: x[1])
return lag_list[0][0]
def main():
current = get_a_record(DOMAIN)
while True:
best = find_healthiest_slave()
if best and best != current:
set_a_record(DOMAIN, best)
current = best
print(f"Failover DNS to {best}")
time.sleep(2)
if __name__ == '__main__':
main()
脚本通过 DNS 轮询实现只读流量“无损”切换,全程无需人工介入。
transaction_consistency 取代;