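The previous post covered log alerting for CouchDB with Alloy + Loki. This post applies the same approach to Windmill.
Windmill is a workflow orchestration platform made up of several Docker containers: server (API service), worker (job execution), and worker_gpu (GPU jobs). The logs of these containers contain a range of runtime errors:
- API token lacks the required scope
- S3 storage configuration is missing
- Worker execution failures
None of these show up in Prometheus metrics; they are only visible in the container logs, so catching them promptly requires alerting at the log level.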
Log format analysis
Windmill logs are structured JSON; each entry carries fields such as level, msg, and timestamp:
{"level":"ERROR","msg":"Permission denied. Required scope: jobs:run:flows","timestamp":"..."}
{"level":"ERROR","msg":"Storage _default_ not found at the workspace level","timestamp":"..."}
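The advantage of JSON is that Loki can parse the fields directly with | json and then filter precisely on field values, which is more robust than regex matching on plain text.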
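To make the parse-then-filter idea concrete, here is a minimal Python sketch of what `| json | level = "ERROR"` does conceptually. It is not how Loki is implemented; the INFO line and the plain-text line are made-up samples added for contrast.

```python
import json

# The first line has the shape shown above; the other two are hypothetical
# samples (an INFO record and a non-JSON line) added for contrast.
RAW_LINES = [
    '{"level":"ERROR","msg":"Permission denied. Required scope: jobs:run:flows","timestamp":"..."}',
    '{"level":"INFO","msg":"job completed","timestamp":"..."}',
    "plain text line that is not JSON",
]


def parse_line(line):
    """Return the decoded JSON object, or None if the line is not a JSON object."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def errors_only(lines):
    """Keep only parsed records whose level field equals "ERROR"."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r.get("level") == "ERROR"]


ERROR_RECORDS = errors_only(RAW_LINES)
for record in ERROR_RECORDS:
    print(record["msg"])
```

Because the filter compares a parsed field rather than scanning raw text, an INFO line whose message happens to contain the word ERROR would not be counted.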
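Common Windmill ERROR types:
| Error type | Meaning | Severity |
|---|---|---|
| Permission denied | API token scope is insufficient | warning |
| Storage not found | S3 storage configuration is missing | critical |
| unshare isolation | Container isolation configuration is missing (harmless) | ignorable |
| Worker ERROR | Job execution failure | warning |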
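The table can be read as a small classification function. The sketch below is only an illustration of that mapping: the patterns are substring guesses based on the sample messages in this post (the exact wording of the unshare message is an assumption), and the fallback severity for any other ERROR line follows the "Worker ERROR" row.

```python
import re

# (pattern, severity) pairs derived from the table above. None means
# "do not alert". The "unshare" pattern is an assumed substring.
SEVERITY_RULES = [
    (re.compile(r"Permission denied"), "warning"),
    (re.compile(r"Storage.*not found"), "critical"),
    (re.compile(r"unshare"), None),
]
DEFAULT_SEVERITY = "warning"  # any other ERROR line


def severity_for(msg):
    """Return the severity for an ERROR message, or None if it can be ignored."""
    for pattern, severity in SEVERITY_RULES:
        if pattern.search(msg):
            return severity
    return DEFAULT_SEVERITY


print(severity_for("Storage _default_ not found at the workspace level"))
```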
Overall architecture
server / worker / worker_gpu --> Alloy (loki.source.docker) --> Loki (log storage + Ruler) --> Alertmanager (alert deduplication and routing) --> alert-transformer (formatting) --> OpenClaw --> Feishu notification
Prometheus (metric alerts) --> Alertmanager
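Log alerts and the existing metric alerts converge at Alertmanager and share the same notification path to Feishu.
Alloy deployment
Compared with the CouchDB setup in the previous post, the Alloy container is deployed identically; only the relabel rules and the log processing in config.alloy differ.
docker-compose.yaml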
services:
  alloy:
    image: m.daocloud.io/docker.io/grafana/alloy:v1.14.1
    container_name: alloy
    restart: unless-stopped
    ports:
      - 12345:12345
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock
    command:
      - run
      - --server.http.listen-addr=0.0.0.0:12345
      - --storage.path=/var/lib/alloy/data
      - /etc/alloy/config.alloy
config.alloy
// Discover Docker containers
discovery.docker "local" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}
// Label each container and drop the unrelated ones
discovery.relabel "local" {
  targets = discovery.docker.local.targets

  // server
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_server-1"
    target_label  = "container"
    replacement   = "windmill_server-1"
  }

  // worker (two replicas)
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker-1"
    target_label  = "container"
    replacement   = "windmill_worker-1"
  }
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker-2"
    target_label  = "container"
    replacement   = "windmill_worker-2"
  }

  // worker_gpu
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker_gpu-1"
    target_label  = "container"
    replacement   = "windmill_worker_gpu-1"
  }

  // Drop caddy (reverse proxy) and windmill_extra (LSP debugger).
  // Relabel regexes are anchored on both ends, so the trailing .* is needed
  // to cover the replica suffix; action = "drop" removes the target.
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-(caddy|windmill_extra).*"
    action        = "drop"
  }
}
loki.source.docker "local" {
  host             = "unix:///var/run/docker.sock"
  targets          = discovery.relabel.local.output
  forward_to       = [loki.process.local.receiver]
  refresh_interval = "5s"
}
// Parse the JSON log line and promote level to a label
loki.process "local" {
  stage.json {
    expressions = {
      level = "level",
      msg   = "msg",
    }
  }

  stage.labels {
    values = {
      level     = "",
      container = "",
    }
  }

  forward_to = [loki.write.remote.receiver]
}
// Push to Loki
loki.write "remote" {
  endpoint {
    url       = "http://loki.example.com:3100/loki/api/v1/push"
    tenant_id = "my-tenant"
  }
}
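Configuration notes
- Mount /var/run/docker.sock: this lets Alloy discover Docker containers and read their logs.
- stage.json + stage.labels: unlike the CouchDB post, Windmill's JSON logs expose level and msg fields. stage.json extracts the fields and stage.labels promotes level to a label, so LogQL can filter on level = "ERROR" directly instead of re-parsing it in every rule. Filtering on msg still needs | json at query time, because msg is not promoted to a label.
- Skip unrelated containers: the logs from caddy (reverse proxy) and windmill_extra (LSP debugger) have nothing to do with runtime errors, so they are skipped to reduce storage and alert noise.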
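Relabel regexes are anchored on both ends, which matters when matching container names that carry a replica suffix. The Python sketch below imitates that behaviour with `re.fullmatch`; it is a simulation rather than Alloy's own code, and the container name `/windmill-caddy-1` is an assumed example that follows the Compose naming pattern.

```python
import re


def relabel_regex_matches(regex, value):
    """Imitate relabel matching: the regex must match the whole value."""
    return re.fullmatch(regex, value) is not None


CADDY_NAME = "/windmill-caddy-1"  # assumed container name, for illustration

# Without a trailing wildcard the "-1" suffix is left unmatched.
MATCH_WITHOUT_WILDCARD = relabel_regex_matches(
    "/windmill-(caddy|windmill_extra)", CADDY_NAME
)
# With ".*" the whole name is covered.
MATCH_WITH_WILDCARD = relabel_regex_matches(
    "/windmill-(caddy|windmill_extra).*", CADDY_NAME
)
# An exact name needs no wildcard.
MATCH_EXACT = relabel_regex_matches(
    "/windmill-windmill_server-1", "/windmill-windmill_server-1"
)

print(MATCH_WITHOUT_WILDCARD, MATCH_WITH_WILDCARD, MATCH_EXACT)
```

The same reasoning explains why the per-container rules spell out the full name including the index.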
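Container name pitfall
Docker Compose appends a replica index to the service name when it names containers (the real names also carry the windmill- project prefix that the relabel regexes above match):
| Docker Compose service name | Actual container name |
|---|---|
| windmill_server | windmill_server-1 |
| windmill_worker (replicas: 2) | windmill_worker-1, windmill_worker-2 |
| windmill_worker_gpu | windmill_worker_gpu-1 |
Query the label values in Loki to confirm the actual container names: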
curl -s -H "X-Scope-OrgID: my-tenant" \
"http://loki.example.com:3100/loki/api/v1/label/container/values"
Loki Ruler alert rules
Enabling the Ruler
As described in the previous post, enable the Ruler in loki-config.yaml:
ruler:
  enable_api: true
  enable_alertmanager_v2: true
  alertmanager_url: http://alertmanager:9093
  poll_interval: 30s
  storage:
    type: local
    local:
      directory: /loki/rules
Rule file
Create <rules_dir>/<tenant_id>/windmill.yaml:
groups:
  - name: windmill_errors
    interval: 30s
    rules:
      # Too many server errors (aggregate alert, debounced with `for`)
      - alert: WindmillServerError
        expr: |
          sum(count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR" [1m])) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Too many Windmill server errors"
          description: "{{ $value }} ERROR lines in the last minute, sustained for 2 minutes"
      # API permission problems (match on the msg field).
      # Label-filter regexes are fully anchored, hence the trailing .*;
      # sum() collapses the per-line series that the extracted JSON fields
      # would otherwise produce.
      - alert: WindmillPermissionDenied
        expr: |
          sum(count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR"
            | msg =~ "Permission denied.*" [5m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "Windmill API permission denied"
          description: "The API token may have expired or lacks the required scope"
      # S3 storage configuration problems
      - alert: WindmillS3Error
        expr: |
          sum(count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR"
            | msg =~ "Storage.*not found.*" [5m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Windmill S3 storage misconfigured"
          description: "S3 storage is unavailable, which affects file access"
      # Worker failures (the selector matches windmill_worker-1 and
      # windmill_worker-2 but not windmill_worker_gpu-1)
      - alert: WindmillWorkerError
        expr: |
          sum(count_over_time({container=~"windmill_worker-\\d+$"}
            | json
            | level = "ERROR" [5m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "Windmill worker runtime error"
          description: "ERROR lines appeared in a worker container log"
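The interaction between the `[1m]` window, the `> 3` threshold, the 30s evaluation interval, and `for: 2m` in WindmillServerError is easier to see in a small simulation. The Python sketch below is a simplified model written for this post, not Loki's Ruler; timestamps are plain seconds and the event lists are made up.

```python
def count_over_time(timestamps, now, window):
    """Count events with now - window < t <= now."""
    return sum(1 for t in timestamps if now - window < t <= now)


def evaluate(timestamps, start, end, interval=30, window=60, threshold=3, hold=120):
    """Evaluate the rule every `interval` seconds.

    Returns a list of (eval_time, count, state) where state is
    "inactive", "pending" (condition true but shorter than `hold`),
    or "firing" (condition true for at least `hold` seconds).
    """
    results = []
    pending_since = None
    for now in range(start, end + 1, interval):
        count = count_over_time(timestamps, now, window)
        if count > threshold:
            if pending_since is None:
                pending_since = now
            state = "firing" if now - pending_since >= hold else "pending"
        else:
            pending_since = None
            state = "inactive"
        results.append((now, count, state))
    return results


# Steady stream: one ERROR every 10 seconds for 5 minutes.
STEADY = evaluate(list(range(0, 301, 10)), start=60, end=300)
# Short burst: five ERRORs within 5 seconds, then silence.
BURST = evaluate([100, 101, 102, 103, 104], start=60, end=300)

for row in STEADY:
    print(row)
```

In this model the steady stream stays pending for four evaluations and fires at the fifth, while the short burst drops out of the window before the two minutes elapse and never fires, which is the debounce effect the rule is after.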
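LogQL syntax notes
| Syntax | Meaning | Example / note |
|---|---|---|
| `{container="windmill_server-1"}` | Select log streams by label | Exact match |
| `{container=~"windmill_worker-\\d+$"}` | Select log streams by regex | Matches worker-1 and worker-2 |
| `\| json` | Parse the line as JSON | Extracts level, msg, and other fields |
| `level = "ERROR"` | Filter on a field value | Keeps only ERROR lines |
| `msg =~ "Permission denied.*"` | Regex match on a field value | The regex is fully anchored; the trailing `.*` makes it a prefix match |
| `count_over_time(... [1m])` | Count matching lines in the last minute | Used for thresholds |
| `for: 2m` | Condition must hold for 2 minutes before firing | Debounce |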
What sum() does: when several containers match the same rule (multiple workers, for example), it adds their counts into a single value.
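As a rough illustration of the selector plus sum() combination, the sketch below applies the worker regex to a set of per-container counts and adds up the matches. The counts are invented numbers, and `re.fullmatch` stands in for Loki's anchored label matching.

```python
import re

# Hypothetical per-container ERROR counts within one window.
COUNTS = {
    "windmill_server-1": 4,
    "windmill_worker-1": 2,
    "windmill_worker-2": 1,
    "windmill_worker_gpu-1": 5,
}

WORKER_SELECTOR = r"windmill_worker-\d+$"

# Stream selection: keep only containers whose full name matches the regex.
MATCHED = {
    name: count
    for name, count in COUNTS.items()
    if re.fullmatch(WORKER_SELECTOR, name)
}

# sum(): collapse the per-container series into one value.
TOTAL = sum(MATCHED.values())

print(sorted(MATCHED), TOTAL)
```

Note that windmill_worker_gpu-1 is not selected, because `_gpu` sits between `windmill_worker` and the `-\d+` part.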
Alertmanager configuration
The existing configuration is reused as-is; no new route is needed:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'openclaw'
receivers:
  - name: 'openclaw'
    webhook_configs:
      - url: 'http://alert-transformer:9091/alertmanager'
        send_resolved: true
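For readers who have not looked at what Alertmanager posts to a webhook, the sketch below formats a payload into one line per alert. It is a hypothetical stand-in, not the alert-transformer used in this setup; `format_alerts` and the sample payload are made up, and only the payload field names (`alerts`, `status`, `labels`, `annotations`) follow Alertmanager's webhook format.

```python
def format_alerts(payload):
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "?")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {summary}")
    return lines


# Hypothetical payload, trimmed to the fields used above.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "WindmillS3Error", "severity": "critical"},
            "annotations": {"summary": "Windmill S3 storage misconfigured"},
        }
    ],
}

FORMATTED = format_alerts(SAMPLE_PAYLOAD)
print("\n".join(FORMATTED))
```

Because send_resolved is true, the same function would also receive payloads whose alerts carry the status "resolved".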
Verification
# List the values of the container label
curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/label/container/values"
# Confirm that the alert rules are loaded
curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/rules"
# Query ERROR logs from windmill_server
curl -s -H "X-Scope-OrgID: my-tenant" \
"http://loki.example.com:3100/loki/api/v1/query_range?\
query=%7Bcontainer%3D%22windmill_server-1%22%7D%20%7C%20json%20%7C%20level%20%3D%20%22ERROR%22&limit=3"
# Confirm the alerts in Alertmanager
curl -s "http://alertmanager:9093/api/v2/alerts"
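The percent-encoded query string in the query_range call is hard to write by hand. A small Python sketch like the following can generate it from the readable LogQL expression; it only builds and prints the URL and does not contact Loki.

```python
from urllib.parse import quote

# The readable LogQL expression behind the encoded query above.
LOGQL = '{container="windmill_server-1"} | json | level = "ERROR"'

# safe="" forces every reserved character to be percent-encoded.
ENCODED = quote(LOGQL, safe="")

URL = (
    "http://loki.example.com:3100/loki/api/v1/query_range"
    f"?query={ENCODED}&limit=3"
)
print(URL)
```

With curl alone, `-G --data-urlencode 'query=...'` achieves the same thing without pre-encoding.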
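Summary
Compared with the CouchDB log monitoring in the previous post:
| Aspect | CouchDB (text logs) | Windmill (JSON logs) |
|---|---|---|
| Log format | Erlang [error] | {"level":"ERROR","msg":"..."} |
| Alloy parsing | None, lines are passed through as-is | stage.json extracts fields |
| LogQL | `\|~ "\\\\[error\\\\]"` | `\| json \| level = "ERROR"` |
| Field filtering | Not supported | Supports precise matching on msg |
For JSON logs, | json is more robust and more expressive than text regexes.
If the project has other services that log JSON, they can be onboarded with exactly the same pattern: on the Alloy side only the relabel rules change, and on the Loki side you add one more YAML rule file.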