Monitoring a Dockerized CouchDB Instance with Prometheus

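I recently adopted CouchDB as the document database for a project and, with production requirements in mind, worked out a monitoring setup based on Prometheus. Everything runs under Docker. This post writes it up for anyone who needs something similar.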

Why monitor CouchDB

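CouchDB is a document-oriented NoSQL database that exposes an HTTP API by default. In production, we need to keep an eye on:

Service availability: whether the database is up and answering requests
Request performance: QPS, latency, error rate
Resource consumption: memory, disk, file descriptors
Replication status: how healthy synchronization is across a multi-node cluster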

CouchDB itself exposes HTTP endpoints such as /_node/_local/_stats (plain /_stats in 1.x) and /_active_tasks, and couchdb-prometheus-exporter converts that data into a format Prometheus can scrape.
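
To make the conversion concrete, here is a minimal Python sketch of the idea: walk a nested stats document and emit one `name value` line per numeric leaf. It is an illustration only, not how couchdb-prometheus-exporter is actually implemented. The sample document and the names `flatten_stats` and `to_exposition` are made up for this example, and the real exporter also adds labels, HELP/TYPE lines, and its own metric names.

```python
from typing import Iterator, Tuple, Union

Number = Union[int, float]


def flatten_stats(node: dict, prefix: str = "") -> Iterator[Tuple[str, Number]]:
    """Yield (metric_name, value) for every numeric leaf in a nested stats dict."""
    for key, child in node.items():
        if not isinstance(child, dict):
            continue
        name = f"{prefix}_{key}" if prefix else key
        if "value" in child and "type" in child:
            value = child["value"]
            # Only plain numbers are emitted; histogram-style leaves are skipped here.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                yield name, value
        else:
            # Intermediate level: recurse with the extended name.
            yield from flatten_stats(child, name)


def to_exposition(stats: dict) -> str:
    """Render the flattened stats as simple 'name value' lines."""
    return "".join(f"{name} {value}\n" for name, value in flatten_stats(stats))


# Hypothetical, heavily trimmed stats document for illustration.
SAMPLE_STATS = {
    "couchdb": {
        "httpd": {
            "requests": {"value": 42, "type": "counter", "desc": "number of HTTP requests"},
        },
        "open_databases": {"value": 3, "type": "counter", "desc": "number of open databases"},
    }
}
```

Calling `to_exposition(SAMPLE_STATS)` yields one line per counter, which is roughly the shape of what Prometheus scrapes from the exporter's /metrics page.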

Docker Compose deployment

The deployment has two parts: the CouchDB database and the CouchDB exporter.

services:
  couchdb:
    image: docker.m.daocloud.io/library/couchdb:3.5.1
    container_name: couchdb
    user: 1000:1000
    environment:
      - COUCHDB_USER=${COUCHDB_USER}
      - COUCHDB_PASSWORD=${COUCHDB_PASSWORD}
    volumes:
      - couchdb-data:/opt/couchdb/data
    restart: unless-stopped
    ports:
      - 5984:5984
    networks:
      - couchdb-net

  couchdb-exporter:
    image: gesellix/couchdb-prometheus-exporter:v30.17.0
    container_name: couchdb-exporter
    ports:
      - 9984:9984
    environment:
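      # Credentials appear both inside the URI and as separate variables below.
      # A password containing URL-reserved characters (@, /, :) must be percent-encoded in the URI.
      COUCHDB_URI: http://${COUCHDB_USER}:${COUCHDB_PASSWORD}@couchdb:5984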
      COUCHDB_USERNAME: ${COUCHDB_USER}
      COUCHDB_PASSWORD: ${COUCHDB_PASSWORD}
    depends_on:
      - couchdb
    networks:
      - couchdb-net
    restart: unless-stopped

volumes:
  couchdb-data:
    name: couchdb-service_couchdb-data

networks:
  couchdb-net:
    name: couchdb-net
    driver: bridge

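Key configuration notes:

Setting	Description
`couchdb-exporter` image	`gesellix/couchdb-prometheus-exporter`, pinned to v30.17.0 (the stable release at the time of writing)
Exporter port	9984; Prometheus scrapes metrics through this port
Authentication	The CouchDB username and password are passed in through environment variables
Network	CouchDB and the exporter talk to each other over the same bridge network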

Prometheus configuration

Adding the scrape target

Add a new job to prometheus.yml:

scrape_configs:
  # ... other jobs ...

  - job_name: couchdb
    static_configs:
      - targets:
          - <COUCHDB_HOST>:9100      # node-exporter
          - <COUCHDB_HOST>:9984      # couchdb-exporter
        labels:
          service: couchdb

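Port 9100 is the host-level node-exporter (for the host's CPU, memory, and disk), and port 9984 is the CouchDB-specific exporter. Both targets share the job name couchdb, so any query or alert that should apply to only one of them has to filter on the instance label.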

Verifying the scrape

curl -s http://localhost:9984/metrics | grep couchdb_httpd_up

If everything is working you should see couchdb_httpd_up 1, which means the exporter can reach CouchDB.
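
The same check can be scripted, for example as a smoke test after deployment. The following is a small sketch, assuming the exporter is reachable at the URL used above; `fetch_metrics`, `read_metric`, and `couchdb_is_up` are helper names invented for this example, and the parser only handles the simple sample lines needed here.

```python
import urllib.request
from typing import Optional


def fetch_metrics(url: str = "http://localhost:9984/metrics") -> str:
    """Download the exporter's metrics page as text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def read_metric(exposition: str, name: str) -> Optional[float]:
    """Return the value of the first sample called `name`, or None if absent."""
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # "<name>{labels} <value>": split at the closing brace of the label set.
            head, _, tail = line.rpartition("}")
            metric = head.split("{", 1)[0]
        else:
            # "<name> <value>"
            metric, _, tail = line.partition(" ")
        if metric == name:
            return float(tail.split()[0])
    return None


def couchdb_is_up(exposition: str) -> bool:
    """True when couchdb_httpd_up is present and equal to 1."""
    return read_metric(exposition, "couchdb_httpd_up") == 1.0
```

With the stack running, `couchdb_is_up(fetch_metrics())` should return `True`.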

CouchDB metrics in detail

couchdb-prometheus-exporter exposes a broad set of metrics covering most aspects of a running CouchDB. They are grouped by category below.

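Service health metrics

Metric	Description
`couchdb_httpd_up`	CouchDB connection health check (1 = healthy, 0 = unhealthy)
`couchdb_up`	Health of the exporter itself

These are the most basic metrics, used to tell whether the service is alive.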

Request and performance metrics

Metric	Description
`couchdb_httpd_requests`	Total number of HTTP requests
`couchdb_httpd_request_methods{method="GET/POST/PUT/DELETE"}`	Request count per HTTP method
`couchdb_httpd_request_time`	Request handling time (excluding the MochiWeb layer)
`couchdb_httpd_database_reads`	Number of document reads
`couchdb_httpd_database_writes`	Number of database writes
`couchdb_httpd_bulk_requests`	Number of bulk requests

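HTTP response status codes

Metric	Description
`couchdb_httpd_status_codes{code="2xx/4xx/5xx"}`	Response count per status code

rate(couchdb_httpd_status_codes{code=~"5.."}[5m]) gives the per-second rate of 5xx responses over the last five minutes; dividing it by rate(couchdb_httpd_requests[5m]) turns that into an error ratio.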
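
For readers less familiar with PromQL, the arithmetic behind that expression can be sketched in Python. This is a deliberate simplification: Prometheus's real rate() also extrapolates to the window boundaries, which is ignored here. The function names and the sample numbers are invented for the example.

```python
def per_second_rate(first: float, last: float, seconds: float) -> float:
    """Average per-second increase of a counter between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    increase = last - first
    if increase < 0:
        # The counter went backwards, e.g. after a restart: count from zero.
        increase = last
    return increase / seconds


def error_ratio(error_rate: float, request_rate: float) -> float:
    """Share of requests that ended in a 5xx response."""
    return error_rate / request_rate if request_rate > 0 else 0.0


# Two samples taken 300 seconds (5 minutes) apart.
errors_per_s = per_second_rate(first=10, last=25, seconds=300)         # 5xx responses
requests_per_s = per_second_rate(first=1000, last=1300, seconds=300)   # all requests
ratio = error_ratio(errors_per_s, requests_per_s)
```

In this made-up case 15 of 300 requests failed, so `ratio` is about 0.05, right at the 5% mark.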

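Authentication and caching

Metric	Description
`couchdb_httpd_auth_cache_hits`	Authentication cache hits
`couchdb_httpd_auth_cache_misses`	Authentication cache misses

A high hit ratio means authentication lookups are mostly served from the cache. If misses dominate, CouchDB's authentication cache settings may need tuning.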
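
The hit ratio itself is just hits divided by total lookups. A tiny sketch follows, with `cache_hit_ratio` as an invented helper name. In practice you would feed it the increase of each counter over a time window rather than lifetime totals, for instance via rate(couchdb_httpd_auth_cache_hits[5m]) and rate(couchdb_httpd_auth_cache_misses[5m]).

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of authentication lookups served from the cache."""
    total = hits + misses
    # With no lookups at all there is nothing to measure; report 0.0.
    return hits / total if total > 0 else 0.0
```

For example, 90 hits and 10 misses in a window give a ratio of 0.9.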

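Database-level metrics

Metric	Description
`couchdb_httpd_databases_total`	Total number of databases in the cluster
`couchdb_httpd_open_databases`	Number of currently open databases
`couchdb_httpd_open_os_files`	Number of open file descriptors
`couchdb_database_doc_count{db_name="xxx"}`	Document count per database
`couchdb_database_data_size{db_name="xxx"}`	Data size per database (bytes)
`couchdb_database_disk_size{db_name="xxx"}`	Disk usage per database (bytes)
`couchdb_database_overhead{db_name="xxx"}`	Disk overhead per database

couchdb_database_overhead is worth watching closely. Once it exceeds 50% of the database's on-disk size, the file is heavily fragmented and running a compaction to reclaim space is recommended.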
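
As a rough sketch of that rule, the snippet below computes the overhead share and builds (without sending) the HTTP request that triggers compaction. It assumes that overhead is the difference between disk size and data size, which the article does not spell out. `overhead_ratio`, `needs_compaction`, and `build_compact_request` are names invented for this example, and a real call to POST /{db}/_compact also needs admin credentials, omitted here.

```python
import urllib.parse
import urllib.request


def overhead_ratio(disk_size: int, data_size: int) -> float:
    """Share of the on-disk file not occupied by live data (assumed definition)."""
    if disk_size <= 0:
        return 0.0
    return max(disk_size - data_size, 0) / disk_size


def needs_compaction(disk_size: int, data_size: int, threshold: float = 0.5) -> bool:
    """Apply the 50% rule of thumb from the text."""
    return overhead_ratio(disk_size, data_size) > threshold


def build_compact_request(base_url: str, db_name: str) -> urllib.request.Request:
    """Build the POST /{db}/_compact request; the caller decides whether to send it."""
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(db_name, safe='')}/_compact"
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        # CouchDB expects a JSON content type on this endpoint.
        headers={"Content-Type": "application/json"},
    )
```

A database with 1000 bytes on disk and 400 bytes of live data has a ratio of 0.6 and would be flagged; passing the built request to `urllib.request.urlopen` (with authentication added) would start the compaction.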

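Replicator metrics

CouchDB has replication built in (bidirectional sync is set up as two one-way replications), and the replicator metrics matter a great deal when operating a cluster:

Metric	Description
`couchdb_replicator_jobs{metric="running/pending/crashed"}`	Replication jobs by state
`couchdb_replicator_cluster_is_stable`	Whether the cluster is stable (1 = stable, 0 = unstable)
`couchdb_replicator_failed_starts`	Number of failed replication starts
`couchdb_replicator_checkpoints{metric="failure/success"}`	Replication checkpoint outcomes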

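Fabric (distributed) metrics

In clustered mode, some of CouchDB's internal coordination work is exposed as fabric metrics:

Metric	Description
`couchdb_fabric_doc_update{metric="errors"}`	Document update errors
`couchdb_fabric_read_repairs{metric="failure/success"}`	Read repair successes and failures
`couchdb_fabric_open_shard{metric="timeout"}`	Shard open timeouts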

Erlang VM metrics

CouchDB runs on Erlang/OTP, and a few key runtime metrics are exposed:

Metric	Description
`couchdb_erlang_memory_atom`	Memory used by the atom table
`couchdb_erlang_memory_processes`	Process memory
`couchdb_erlang_memory_binary`	Binary data memory
`couchdb_erlang_memory_code`	Memory used by loaded code

Log statistics

Metric	Description
`couchdb_server_couch_log{level="error/warning/info"}`	Number of log messages per level

Watching the count of error-level messages helps surface CouchDB problems early.

Prometheus alerting rules

Metrics alone are not enough; alerting rules are what surface problems in time. Here is a practical set:

groups:
  - name: couchdb_alerts
    interval: 30s
    rules:
      - alert: CouchDBDown
        expr: couchdb_httpd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CouchDB is down"
          description: "CouchDB has been down for more than 1 minute"

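      - alert: CouchDBExporterDown
        # The scrape job is named "couchdb" and also contains node-exporter,
        # so select the exporter target by its port.
        expr: up{job="couchdb", instance=~".*:9984"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CouchDB exporter is down"
          description: "The CouchDB metrics exporter has stopped running"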

      - alert: CouchDBHighErrorRate
        # Sum the 5xx series first so the division is one-to-one per target.
        expr: |
          sum without (code) (rate(couchdb_httpd_status_codes{code=~"5.."}[5m]))
          / rate(couchdb_httpd_requests[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CouchDB error rate is high"
          description: "5xx error rate is above 5%"

      - alert: CouchDBReplicatorUnstable
        expr: couchdb_replicator_cluster_is_stable == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "CouchDB replicator cluster is unstable"
          description: "The replicator cluster is unstable, which may affect data synchronization"

      - alert: CouchDBDiskOverheadHigh
        expr: couchdb_database_overhead / couchdb_database_disk_size > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CouchDB database disk overhead is high"
          description: "Database {{ $labels.db_name }} has disk overhead above 50% of its on-disk size; consider running compact"
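
Each rule above carries a `for:` duration, which means the condition has to hold continuously for that long before the alert fires. The following Python sketch models that pending/firing behavior for a single alert. It is a toy model for intuition, not Prometheus's implementation, and `AlertState` is a name invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert through inactive -> pending -> firing."""

    for_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> str:
        """Feed in one rule evaluation and return the resulting state."""
        if not condition:
            # Any evaluation where the expression is false resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# CouchDBDown uses "for: 1m", evaluated here every 30 seconds.
state = AlertState(for_seconds=60)
timeline = [state.evaluate(True, t) for t in (0, 30, 60)]
```

In this run `timeline` goes from pending to firing only at the third evaluation, which is why a brief blip shorter than the `for:` window never pages anyone.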

Alert notification pipeline

The full alert flow is:

CouchDB Metrics → couchdb-exporter (:9984) → Prometheus (:9090)
                                              ↓
                                        Alertmanager (:9093)
                                              ↓
                                      alert-transformer (:9091)
                                              ↓
                                        OpenClaw (:18789)
                                              ↓
                                     Feishu notification

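Alertmanager receives alerts from Prometheus, deduplicates them, and forwards them to alert-transformer for format conversion; they are finally delivered to a Feishu group through the OpenClaw agent.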
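
The format-conversion step can be pictured with a short sketch. The input follows the general shape of Alertmanager's webhook payload (an `alerts` list whose items carry `status`, `labels`, and `annotations`). Everything else is an assumption: this post does not describe alert-transformer's actual output or OpenClaw's API, so `format_alerts` and the one-line-per-alert text layout are purely hypothetical.

```python
def format_alerts(payload: dict) -> str:
    """Turn an Alertmanager-style webhook payload into plain text, one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {summary}")
    return "\n".join(lines)


# Trimmed example payload, using one of the rules defined earlier.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "CouchDBDown", "severity": "critical"},
            "annotations": {"summary": "CouchDB is down"},
        }
    ],
}
```

Here `format_alerts(SAMPLE_PAYLOAD)` produces a single line starting with `[FIRING] CouchDBDown`, which a downstream agent could then post to a chat group.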

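Summary

Advantages of this setup:

Non-intrusive: CouchDB itself is not modified; you only deploy an extra exporter container
Broad coverage: there are metrics for everything from service health to internal database state
Automated alerting: Prometheus's rule engine lets you configure alert conditions flexibly
Scalable: to monitor more CouchDB instances, you only need to add new targets in Prometheus

If your project uses CouchDB, this monitoring setup is worth adopting. Data stability underpins application stability, and monitoring is the first line of defense for it.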
