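Alertmanager Alert Configuration

Prometheus and Alertmanager Configuration Tutorial

This tutorial uses Docker Compose to set up a Prometheus monitoring system and configures Alertmanager to manage alerts.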

Contents

  1. Docker Compose Configuration
  2. Prometheus Configuration
  3. Alertmanager Configuration
  4. Alert Rule Configuration

Docker Compose Configuration

First, use Docker Compose to deploy Prometheus, Grafana, Pushgateway, and Alertmanager.

Create a docker-compose.yml file with the following content:

    version: "3"
    services:
      prometheus:
        image: prom/prometheus
        container_name: prometheus
        user: root
        ports:
          - "9090:9090"
        volumes:
          - ./conf/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - ./conf/rules:/etc/prometheus/rules
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/usr/share/prometheus/console_libraries'
          - '--web.console.templates=/usr/share/prometheus/consoles'
        networks:
          - net-prometheus

      grafana:
        image: grafana/grafana
        container_name: grafana
        user: root
        ports:
          - "3000:3000"
        volumes:
          - ./data/prometheus/grafana_data:/var/lib/grafana
        depends_on:
          - prometheus
        networks:
          - net-prometheus

      pushgateway:
        image: prom/pushgateway
        container_name: pushgateway
        user: root
        ports:
          - "9091:9091"
        volumes:
          - ./data/prometheus/pushgateway_data:/var/lib/pushgateway
        networks:
          - net-prometheus

      alertmanager:
        image: prom/alertmanager
        container_name: alertmanager
        user: root
        ports:
          - "9093:9093"
        volumes:
          - ./conf/alertmanager:/etc/alertmanager
          - ./data/prometheus/alertmanager_data:/var/lib/alertmanager
        networks:
          - net-prometheus

    networks:
      net-prometheus:

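This file defines four services: Prometheus, Grafana, Pushgateway, and Alertmanager, each with its own port mapping and volume mounts. The images are referenced without a tag, so Docker pulls the latest version; pinning each image to a fixed version tag is recommended.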
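
The bind mounts in the compose file expect certain directories to exist on the host. The following is a minimal sketch of preparing that layout, assuming Python 3 is available on the Docker host and that all paths are relative to the directory containing docker-compose.yml. The helper name create_layout is hypothetical and not part of Docker or Prometheus.

```python
from pathlib import Path

# Host directories referenced by the bind mounts in docker-compose.yml.
MOUNT_DIRS = [
    "conf/prometheus",
    "conf/rules",
    "conf/alertmanager",
    "data/prometheus/grafana_data",
    "data/prometheus/pushgateway_data",
    "data/prometheus/alertmanager_data",
]


def create_layout(base: Path) -> list:
    """Create every mount directory under `base` and return the paths."""
    created = []
    for relative in MOUNT_DIRS:
        path = base / relative
        # parents=True builds intermediate directories; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_layout(Path(".")) in the project directory before starting the stack should leave every mount source in place. The configuration files themselves still have to be created as described in the following sections; if ./conf/prometheus/prometheus.yml is missing when the container starts, Docker typically creates a directory with that name instead of a file.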

Prometheus Configuration

Next, configure Prometheus. Create a prometheus.yml file in the mapped directory (./conf/prometheus on the host):

    global:
      scrape_interval: 5s
      evaluation_interval: 5s

      external_labels:
        monitor: 'dashboard'

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - "alertmanager:9093"
          timeout: 30s

    rule_files:
      - /etc/prometheus/rules/*.rules

    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['prometheus:9090']

      # Label values are case-sensitive: this name must match the
      # job="windows" selector used in the alert rules.
      - job_name: "windows"
        static_configs:
          - targets: ["your_windows_host:9182"]

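This file defines the global settings, the Alertmanager endpoint, the rule file location, and the scrape targets. Replace your_windows_host with the actual address of your Windows host. Port 9182 is the default port of windows_exporter, which must be running on that host.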
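
Once Prometheus is running, the scrape configuration can be checked through the targets endpoint of the Prometheus HTTP API (/api/v1/targets). The sketch below is one possible way to summarize that response with the Python standard library. The base URL http://localhost:9090 is an assumption (the port published on the Docker host), the function names are hypothetical, and the sample payload is hand-written in the shape of the API response rather than captured output.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> dict:
    """Map (job, instance) to the health reported for each active target."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        summary[(labels["job"], labels["instance"])] = target["health"]
    return summary


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Query a running Prometheus; base_url is an assumption, adjust as needed."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return summarize_targets(json.load(resp))


# Hand-written sample in the shape of a /api/v1/targets response (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "prometheus:9090"}, "health": "up"},
            {"labels": {"job": "windows", "instance": "your_windows_host:9182"}, "health": "down"},
        ]
    },
}

for (job, instance), health in summarize_targets(sample).items():
    print(f"{job:<12} {instance:<28} {health}")
```

The job value shown for each target is the exact string that alert expressions have to match, including letter case.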

Alertmanager Configuration

Create an alertmanager.yml file in the mapped directory (./conf/alertmanager on the host):

    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:465'
      smtp_from: 'your_email@example.com'            # placeholder: sender address
      smtp_auth_username: 'your_email@example.com'   # placeholder: SMTP login
      smtp_auth_password: 'your_smtp_password'
      smtp_require_tls: false

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'email-notifications'

    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'recipient@example.com'   # placeholder: recipient address
            send_resolved: true

    inhibit_rules:
      # A firing critical alert mutes a warning alert only when every
      # label listed in "equal" has the same value on both alerts.
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']

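This file sets up the global SMTP settings, the routing rule, the receiver, and the inhibition rule.

Before using it, replace the example values with your own: the SMTP server address (smtp_smarthost), the sender address (smtp_from), the SMTP credentials (smtp_auth_username and smtp_auth_password), and the recipient address (to).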
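
The effect of the inhibition rule depends on the labels listed under equal. The sketch below is a simplified model of that matching logic, not Alertmanager's actual implementation: an alert matching target_match is muted when some firing alert matches source_match and both carry the same value for every label in equal, with a missing label treated as an empty value. The function names and the instance value host1:9182 are hypothetical; the alert names are the ones defined in the alert rules of this tutorial.

```python
def matches(labels: dict, matchers: dict) -> bool:
    """True if every matcher equals the alert's label value (missing label counts as "")."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Simplified model: does any firing source alert mute the target alert?"""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # Every label in "equal" must have the same value on both alerts.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

server_down = {"alertname": "WindowsServerDown", "severity": "critical", "instance": "host1:9182"}
high_memory = {"alertname": "WindowsHighMemoryUsage", "severity": "warning", "instance": "host1:9182"}

print(is_inhibited(high_memory, [server_down], RULE))
print(is_inhibited(high_memory, [server_down], {**RULE, "equal": ["instance"]}))
```

Under this model the critical WindowsServerDown alert does not mute the memory warning for the same instance, because alertname is part of equal and the two names differ. Listing only instance would let a critical alert mute warnings from the same host; whether that is desirable depends on the setup.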

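Alert Rule Configuration

Finally, configure the alert rules. Create a windows_alerts.rules file in ./conf/rules on the host, so that it matches the rule_files pattern in prometheus.yml: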

    groups:
      - name: windows_alerts
        rules:
          # Metric names vary between windows_exporter versions; adjust
          # them to what your exporter actually exposes.
          - alert: WindowsHighMemoryUsage
            expr: 100 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100 > 70
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on Windows host (instance {{ $labels.instance }})"
              description: "Memory usage on Windows host {{ $labels.instance }} is above 70%, current value: {{ $value | printf \"%.2f\" }}%"

          - alert: WindowsHighCPUUsage
            expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 90
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on Windows host (instance {{ $labels.instance }})"
              description: "CPU usage on Windows host {{ $labels.instance }} has been above 90% for 5 minutes, current value: {{ $value | printf \"%.2f\" }}%"

          # The job value is case-sensitive and must match job_name in prometheus.yml.
          - alert: WindowsServerDown
            expr: up{job="windows"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Windows server down (instance {{ $labels.instance }})"
              description: "Windows server {{ $labels.instance }} has been down for more than 5 minutes"

This file defines three alert rules:

  1. Memory usage above 70% for 1 minute
  2. CPU usage above 90% for 5 minutes
  3. Windows server down for more than 5 minutes
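
The arithmetic behind the first two expressions can be reproduced outside Prometheus. The sketch below mirrors it in plain Python under simplifying assumptions: the input numbers are made up for illustration, each idle rate stands for the fraction of time one CPU core spent idle over the rate window (a value between 0 and 1), and the for clause is ignored. The function names are hypothetical.

```python
def memory_usage_percent(free_bytes: float, total_bytes: float) -> float:
    """Mirror of: 100 - (free / total) * 100."""
    return 100 - (free_bytes / total_bytes) * 100


def cpu_usage_percent(idle_rates: list) -> float:
    """Mirror of: 100 - avg(idle rate per core) * 100."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


GIB = 2 ** 30

# Illustrative values: 4 GiB free out of 16 GiB, two cores idle 6.25% and 12.5% of the time.
memory = memory_usage_percent(free_bytes=4 * GIB, total_bytes=16 * GIB)
cpu = cpu_usage_percent([0.0625, 0.125])

print(f"memory usage {memory:.2f}%, condition (> 70) met: {memory > 70}")
print(f"cpu usage {cpu:.2f}%, condition (> 90) met: {cpu > 90}")
```

In Prometheus the condition additionally has to stay true for the whole for duration before the alert moves from pending to firing.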

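Summary

With the configuration above, you have a monitoring and alerting system based on Prometheus and Alertmanager. It monitors the state of the Windows host and sends email notifications when a problem occurs.

To start the whole system, run the following command in the directory that contains docker-compose.yml: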

    docker-compose up -d

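After that, each service is reachable on the Docker host at the port published in docker-compose.yml: Prometheus on 9090, Grafana on 3000, Pushgateway on 9091, and Alertmanager on 9093.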
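
To confirm that the four services came up, their health endpoints can be probed. This is a minimal sketch using only the Python standard library; the host name localhost is an assumption (use the Docker host's address if you check from another machine), and the function names are hypothetical. The ports are the ones published in docker-compose.yml.

```python
import urllib.error
import urllib.request

# Published port and health endpoint of each service.
HEALTH_ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "grafana": (3000, "/api/health"),
    "pushgateway": (9091, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
}


def health_urls(host: str = "localhost") -> dict:
    """Build the health-check URL of every service for the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in HEALTH_ENDPOINTS.items()
    }


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


def check_all(host: str = "localhost") -> dict:
    """Probe every service and map its name to True (healthy) or False."""
    return {name: is_healthy(url) for name, url in health_urls(host).items()}
```

Calling check_all() a few seconds after docker-compose up -d returns one boolean per service; a False entry usually means the container is still starting or failed to start, in which case the container logs are the next place to look.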


Alertmanager Alert Configuration
http://example.com/2024/07/31/Alertmanager告警配置/
Author: Sanli Ma
Published: July 31, 2024