Building a Prometheus and Grafana Monitoring Stack in Practice

Latest recommended article published on 2025-12-16 14:07:13

License: CC 4.0 BY-SA

Tags: #prometheus #grafana

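This article walks through building a Prometheus + Grafana monitoring stack that covers servers, applications, and databases.

Preface

A production environment needs monitoring in order to:

Detect problems promptly
Trace historical data
Provide a basis for capacity planning
Send alert notifications

Prometheus + Grafana is one of the most widely used open-source monitoring stacks:

Prometheus: collects and stores metrics
Grafana: visualizes them
A rich ecosystem: exporters for most common systems

The rest of this article builds a complete monitoring stack step by step.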

1. Architecture Design

1.1 Overall Architecture


┌─────────────────────────────────────────────────────┐
│                       Grafana                       │
│                   (visualization)                   │
└─────────────────────────────────────────────────────┘
                           ↑
┌─────────────────────────────────────────────────────┐
│                     Prometheus                      │
│              (scrape + store + query)               │
└─────────────────────────────────────────────────────┘
        ↑                  ↑                  ↑
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Node Exporter  │ │ MySQL Exporter │ │ Redis Exporter │
│ (host metrics) │ │(MySQL metrics) │ │(Redis metrics) │
└────────────────┘ └────────────────┘ └────────────────┘

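1.2 Data Flow


1. Exporters collect metrics → expose them over HTTP (:9100, etc.)
2. Prometheus scrapes on a schedule → stores time series
3. Grafana queries Prometheus → renders charts
4. Prometheus evaluates alert rules → Alertmanager routes and sends notifications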
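
To make step 1 concrete, here is a minimal sketch of what an exporter puts on its HTTP endpoint: plain text lines in the Prometheus exposition format. The function name `render_metrics` and the `demo_*` metric names are made up for illustration, and the sketch does not escape special characters in label values, which a real exporter must do.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text exposition lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are emitted in sorted order so the output is deterministic.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample values, for illustration only.
print(render_metrics([
    ("demo_requests_total", {"path": "/", "code": "200"}, 42),
    ("demo_up", {}, 1),
]))
```

In practice an official client library would normally generate this output; the sketch only shows the shape of the data that Prometheus pulls in step 2.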

2. Deploying Prometheus

2.1 Deploying with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      # Example password only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

2.2 Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host monitoring
  - job_name: 'node'
    static_configs:
      - targets: 
        - '192.168.1.101:9100'
        - '192.168.1.102:9100'
        - '192.168.1.103:9100'

  # MySQL monitoring
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.101:9104']

  # Redis monitoring
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.101:9121']

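2.3 Starting the Services

# Create the directories
mkdir -p prometheus/rules alertmanager

# Start the stack (prometheus/prometheus.yml and alertmanager/alertmanager.yml
# must already exist, otherwise the bind mounts cannot be satisfied)
docker compose up -d

# Access
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin123)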

3. Node Exporter (Host Monitoring)

3.1 Installation

# Option 1: Docker
docker run -d --name node_exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

# Option 2: binary install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*/
./node_exporter &

3.2 Verification

curl http://localhost:9100/metrics

# Sample output
# node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
# node_memory_MemTotal_bytes 8.3e+09
# node_filesystem_size_bytes{device="/dev/sda1"} 1.0e+11
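
The output above is plain text, so it is easy to inspect by hand or with a short script. The following sketch parses a few lines of that format and derives memory usage from them. `parse_metrics` and the sample values are illustrative assumptions, and the parser deliberately ignores optional timestamps and escaped characters in label values.

```python
def parse_metrics(text):
    """Parse simple exposition lines into {(name, label_string): float}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        result[(name, labels.rstrip("}"))] = float(value)
    return result


# Made-up sample in the same shape as the curl output.
SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
node_memory_MemTotal_bytes 8.3e+09
node_memory_MemAvailable_bytes 4.15e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
"""

metrics = parse_metrics(SAMPLE)
total = metrics[("node_memory_MemTotal_bytes", "")]
available = metrics[("node_memory_MemAvailable_bytes", "")]
print(f"memory used: {(1 - available / total) * 100:.1f}%")  # memory used: 50.0%
```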

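3.3 Common Metrics

Metric	Description
node_cpu_seconds_total	CPU time spent in each mode (seconds, counter)
node_memory_MemTotal_bytes	Total memory
node_memory_MemAvailable_bytes	Available memory
node_filesystem_size_bytes	Filesystem size
node_filesystem_avail_bytes	Filesystem space available
node_network_receive_bytes_total	Network bytes received (counter)
node_network_transmit_bytes_total	Network bytes transmitted (counter)
node_load1 / node_load5 / node_load15	System load average (1, 5, 15 minutes)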
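
Several of these metrics are counters that only ever increase, so dashboards work with their per-second rate rather than the raw value. The sketch below shows the basic arithmetic on made-up samples. `simple_rate` is a simplified stand-in: PromQL's `rate()` additionally extrapolates to the edges of the time window, which is omitted here.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. after a reboot),
        # so the new value is counted from zero.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


# Hypothetical idle-CPU counter readings taken 15 s apart: (time, seconds).
idle = [(0, 1000.0), (15, 1012.0), (30, 1024.0), (45, 1036.0), (60, 1048.0)]
idle_fraction = simple_rate(idle)  # 48 idle seconds over 60 s of wall time
print(f"CPU usage: {100 - idle_fraction * 100:.0f}%")  # CPU usage: 20%
```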

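4. Application Monitoring

4.1 MySQL Exporter

# Deploy. DATA_SOURCE_NAME is only read by mysqld_exporter releases before v0.15.0,
# so the image tag is pinned here. The host name "mysql" must resolve from inside
# the container (for example, via a shared Docker network).
docker run -d --name mysql_exporter \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="exporter:password@(mysql:3306)/" \
  prom/mysqld-exporter:v0.14.0

-- Create the monitoring user (run in a MySQL client)
CREATE USER 'exporter'@'%' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;

Common metrics:

mysql_up: whether MySQL is reachable
mysql_global_status_connections: total connection attempts since startup (current connections are in mysql_global_status_threads_connected)
mysql_global_status_slow_queries: number of slow queries
mysql_global_status_questions: total number of statements executed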

4.2 Redis Exporter

docker run -d --name redis_exporter \
  -p 9121:9121 \
  -e REDIS_ADDR=redis://192.168.1.101:6379 \
  oliver006/redis_exporter

Common metrics:

redis_up: whether Redis is reachable
redis_connected_clients: number of connected clients
redis_memory_used_bytes: memory in use
redis_commands_processed_total: total commands processed

4.3 Nginx Exporter

# The Nginx stub_status module must be enabled first
# Add to nginx.conf:
# location /nginx_status {
#     stub_status on;
# }

docker run -d --name nginx_exporter \
  -p 9113:9113 \
  nginx/nginx-prometheus-exporter \
  --nginx.scrape-uri=http://192.168.1.101/nginx_status

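4.4 Java Applications (Micrometer)

<!-- pom.xml (Spring Boot project; versions are managed by the Boot BOM) -->
<!-- The actuator starter provides the /actuator endpoints -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>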

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health
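  # Spring Boot 2.x key (enabled by default). In Spring Boot 3.x the key is
  # management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true

Metrics are then available at http://localhost:8080/actuator/prometheus.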

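5. Grafana Configuration

5.1 Adding a Data Source

1. Configuration → Data Sources → Add data source (in recent Grafana versions: Connections → Data sources)
2. Select Prometheus
3. URL: http://prometheus:9090 (inside the Docker network)
   or http://192.168.1.100:9090 (external access)
4. Save & Test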

5.2 Importing Dashboards

Recommended dashboards (IDs from grafana.com):

ID	Name	Purpose
1860	Node Exporter Full	Host monitoring
7362	MySQL Overview	MySQL monitoring
763	Redis Dashboard	Redis monitoring
12708	Nginx Exporter	Nginx monitoring

To import:
1. Dashboards → Import
2. Enter the ID: 1860
3. Load → select the data source → Import

5.3 Custom Panels

# CPU usage (%), per host
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage (%)
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
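
The same expressions can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`), which is handy for checking a query before building a panel. This is a minimal sketch using only the standard library; the helper names and the example response values are assumptions, and `query_prometheus` needs a reachable Prometheus at the given base URL.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def extract_values(response):
    """Map each series' instance label to its value in an instant-query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in response["data"]["result"]
    }


def query_prometheus(base_url, promql, timeout=5):
    """Run the query against a live server (not called in this sketch)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_values(json.load(resp))


# Shape of an instant-vector response; the numbers are made up.
EXAMPLE = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"instance": "192.168.1.101:9100"}, "value": [1700000000, "37.5"]},
]}}
print(extract_values(EXAMPLE))  # {'192.168.1.101:9100': 37.5}
```

Against the stack from section 2, a call such as `query_prometheus("http://localhost:9090", "up")` would return one entry per scraped target.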

6. Alerting Configuration

6.1 Alert Rules

# prometheus/rules/alert.yml
groups:
  - name: host-alerts
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "The host has been unreachable for more than 1 minute"

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on host {{ $labels.instance }}"
          description: 'CPU usage is above 80%, current value: {{ printf "%.1f" $value }}%'

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on host {{ $labels.instance }}"
          description: 'Memory usage is above 80%, current value: {{ printf "%.1f" $value }}%'

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on host {{ $labels.instance }}"
          description: 'Disk usage is above 85%, current value: {{ printf "%.1f" $value }}%'
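
The `for:` field is what keeps a short spike from paging anyone: a rule whose expression is true first goes to pending and only fires once the condition has held for the whole duration. The following sketch simulates that behaviour on made-up readings; `alert_states` is an illustrative model, not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared, the timer starts over
            states.append("inactive")
    return states


# Hypothetical CPU readings evaluated every 60 s against "> 80" with "for: 5m".
readings = [(0, 70), (60, 85), (120, 90), (180, 88),
            (240, 91), (300, 86), (360, 92), (420, 60)]
print(alert_states(readings, threshold=80, for_seconds=300))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The condition becomes true at t=60 and the alert fires at t=360, exactly five minutes later; a single reading below the threshold resets the timer.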

6.2 Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'webhook'

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url/alert'
        send_resolved: true

  # Or send email instead
  # - name: 'email'
  #   email_configs:
  #     - to: 'admin@example.com'
  #       from: 'alert@example.com'
  #       smarthost: 'smtp.example.com:587'
  #       auth_username: 'alert@example.com'
  #       auth_password: 'password'
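
The webhook receiver gets a JSON document via HTTP POST, with the grouped alerts under the `alerts` key. Below is a minimal sketch of a service that could sit behind the placeholder URL above. The handler class, port 5001, and the example payload are assumptions for illustration; only the payload fields used here (`alerts`, `status`, `labels`, `annotations`) are relied on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload):
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "unknown").upper()
        lines.append(
            f"[{status}] {labels.get('alertname', '?')} "
            f"on {labels.get('instance', '?')}: {summary}"
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize_alerts(payload):
            print(line)  # replace with a chat or SMS integration
        self.send_response(200)
        self.end_headers()


# Hypothetical payload, trimmed to the fields used above.
EXAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HostDown", "instance": "192.168.1.101:9100"},
        "annotations": {"summary": "Host 192.168.1.101:9100 is down"},
    }],
}
print(summarize_alerts(EXAMPLE_PAYLOAD))

# To run the receiver (port 5001 is an arbitrary choice):
# HTTPServer(("0.0.0.0", 5001), AlertHandler).serve_forever()
```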

6.3 Testing Alerts

# Check alert status
curl http://localhost:9090/api/v1/alerts

# Check rule status
curl http://localhost:9090/api/v1/rules
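
The two calls above only show what Prometheus has evaluated. To exercise the notification path without waiting for a real incident, a test alert can be posted straight to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below builds such a request body; the alert name, instance, and file name are made up for the test.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(name, instance, minutes=5, now=None):
    """Build a request body for POST /api/v2/alerts with a single test alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "instance": instance, "severity": "warning"},
        "annotations": {"summary": "Test alert, please ignore"},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


body = json.dumps(build_test_alert("PipelineTest", "test-host:9100"), indent=2)
print(body)

# Save the output as alert.json and send it with, for example:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d @alert.json http://localhost:9093/api/v2/alerts
```

If routing and the receiver are set up correctly, the notification should arrive after the `group_wait` delay.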

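7. Multi-Site Monitoring

7.1 Scenario

Monitoring requirements:
- 10 servers in the headquarters data center
- 5 servers in the branch A data center
- 3 servers in the branch B data center
- 2 servers in the cloud

Challenge: the sites have no network connectivity to one another

7.2 Traditional Approaches

Option 1: deploy a Prometheus at each site

Pros: each site runs independently
Cons: no unified view, alerts are scattered

Option 2: expose the exporters on the public internet

Pros: centralized collection
Cons: high security risk

7.3 Overlay Networking (Recommended)

Use overlay-networking software (such as 星空组网) to connect all nodes: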

Architecture after connecting the sites:
                    ┌──────────────────────┐
                    │  Central Prometheus  │
                    │      10.10.0.1       │
                    └──────────────────────┘
                               ↑
        ┌──────────────────────┼──────────────────────┐
        ↑                      ↑                      ↑
┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│      HQ      │       │   Branch A   │       │   Branch B   │
│  10.10.0.2   │       │  10.10.0.3   │       │  10.10.0.4   │
│Node Exporter │       │Node Exporter │       │Node Exporter │
│    :9100     │       │    :9100     │       │    :9100     │
└──────────────┘       └──────────────┘       └──────────────┘

Prometheus configuration:

# Note: these jobs are not named "node", so rules and panels that filter on
# job="node" need a wider matcher such as job=~"node.*"
scrape_configs:
  # Headquarters servers (overlay-network IPs)
  - job_name: 'node-headquarters'
    static_configs:
      - targets: 
        - '10.10.0.10:9100'
        - '10.10.0.11:9100'
        - '10.10.0.12:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: location
        replacement: 'headquarters'

  # Branch A servers (overlay-network IPs)
  - job_name: 'node-branch-a'
    static_configs:
      - targets: 
        - '10.10.0.20:9100'
        - '10.10.0.21:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: location
        replacement: 'branch-a'

  # Branch B servers (overlay-network IPs)
  - job_name: 'node-branch-b'
    static_configs:
      - targets: 
        - '10.10.0.30:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: location
        replacement: 'branch-b'
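
As the number of sites grows, maintaining these target lists by hand becomes error-prone. One possible alternative, sketched below, is to generate a target file from a site inventory and let Prometheus load it through `file_sd_configs`, which accepts JSON in the form of a list of objects with `targets` and `labels`. The `SITES` inventory and the function name are hypothetical.

```python
import json


def build_file_sd(sites, port=9100):
    """Build file-based service discovery entries, one group per site."""
    return [
        {
            "targets": [f"{ip}:{port}" for ip in ips],
            "labels": {"location": location},
        }
        for location, ips in sites.items()
    ]


# Hypothetical inventory using the overlay-network IPs from the config above.
SITES = {
    "headquarters": ["10.10.0.10", "10.10.0.11", "10.10.0.12"],
    "branch-a": ["10.10.0.20", "10.10.0.21"],
    "branch-b": ["10.10.0.30"],
}
print(json.dumps(build_file_sd(SITES), indent=2))
```

With the output saved under a path referenced by a `file_sd_configs` entry, a single scrape job can cover all sites, and the `location` label is attached without any relabeling.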

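Advantages:

A single entry point for monitoring
All data displayed in one place
Centralized alert management
No exporters exposed on the public internet
Simple configuration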

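8. High-Availability Deployment

8.1 Prometheus Federation

# Central Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls every series that has a job label; narrow the selector
        # if the volume becomes a problem
        - '{job=~".+"}'
    static_configs:
      - targets:
        - '10.10.0.2:9090'  # Headquarters Prometheus
        - '10.10.0.3:9090'  # Branch Prometheus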
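
Before pointing the central server at a site, it can help to fetch the `/federate` endpoint by hand and see how much data a selector returns. The sketch below builds the URL with one `match[]` parameter per selector; the function name and the second selector (recording rules whose names start with `job:`) are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Narrower selectors than '{job=~".+"}' keep the federated volume down.
url = build_federate_url(
    "http://10.10.0.2:9090",
    ['{job="node"}', '{__name__=~"job:.*"}'],
)
print(url)  # fetch it with curl to inspect what the site would hand over
```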

8.2 Grafana High Availability

# Store Grafana's data in an external MySQL database
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_DATABASE_TYPE=mysql
      - GF_DATABASE_HOST=mysql:3306
      - GF_DATABASE_NAME=grafana
      - GF_DATABASE_USER=grafana
      - GF_DATABASE_PASSWORD=password

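9. Common Issues

9.1 High Prometheus Memory Usage

# Shorten retention (this mainly saves disk; memory is driven largely by the number of active series)
--storage.tsdb.retention.time=15d

# Scrape less often
global:
  scrape_interval: 30s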
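
To judge how much these two settings help, a rough estimate is useful. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it; the figure of 100,000 active series is an assumption for illustration.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rule-of-thumb TSDB size: retention * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Compare the original settings with the reduced ones for 100,000 series.
for interval, days in [(15, 30), (30, 15)]:
    gib = estimate_disk_bytes(100_000, interval, days) / 2**30
    print(f"{interval}s scrape, {days}d retention: ~{gib:.1f} GiB")
```

Halving both the scrape frequency and the retention cuts the estimate to a quarter. Memory does not follow this formula, so if memory is the bottleneck, reducing the number of series (fewer targets, dropped labels or metrics) is usually the more effective lever.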

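9.2 Slow Queries

# Precompute expensive expressions with recording rules
groups:
  - name: recording
    rules:
      # Fraction of CPU time spent non-idle, averaged per job
      - record: job:node_cpu_usage:avg
        expr: 1 - avg by (job) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

9.3 Hot-Reloading the Configuration

# Requires the --web.enable-lifecycle flag (set in the Compose file in section 2.1)
curl -X POST http://localhost:9090/-/reload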

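10. Summary

Key points for building the monitoring stack:

Core components: Prometheus + Grafana + Alertmanager
Host monitoring: Node Exporter on every host
Application monitoring: choose exporters to match your tech stack
Dashboards: import existing ones first, then customize
Alert rules: set them by priority
Multi-site: connect the sites with an overlay network, then monitor centrally
High availability: federation + external storage

My monitoring checklist:

Must-monitor items:
- CPU / memory / disk / network
- Service liveness
- Database connections and slow queries
- Application response time and error rate

Monitoring is the eyes of operations; a system without it is running blind.

References

Prometheus documentation: https://prometheus.io/docs/
Grafana documentation: https://grafana.com/docs/
Awesome Prometheus Alerts: https://awesome-prometheus-alerts.grep.to/

💡 Tip: start with the core metrics and expand gradually. Keep the number of alerts small, or people will start ignoring them.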