Alertron
An anomaly detection system using Kafka, Prometheus, Grafana and Slack.
GitHub Repository: View on GitHub
Introduction
I built a real-time anomaly-detection pipeline that ingests streaming IoT-style data, scores each event with an ML model, exports rich Prometheus metrics, raises automated Slack alerts when the fleet goes off‑nominal, and visualizes the system in Grafana. It’s a compact, hands‑on MLOps project that shows how to take a model from a notebook to a production‑like, observable service.
Problem
Detect anomalous device behavior across a fleet as it happens, not hours later. The system needed to:
- handle a continuous stream of readings,
- score each event with low latency,
- expose operational metrics (throughput, latency, errors),
- raise a reliable alert when anomaly volume spikes,
- be easy to run locally and share.
Approach
Pipeline
IoT generator → Kafka ("readings")
↓
FastAPI inference service (aiokafka)
• Isolation Forest scoring
• Severity via score quantiles (low/medium/high)
• Prometheus metrics (/metrics)
↓
Prometheus → Alert rules → Alertmanager → Slack
↓
Grafana dashboards
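To make the service step concrete, here is a minimal sketch of the consume-and-score loop. It is illustrative rather than the repo's actual code: the message shape (a device_id plus a features array), the model file name, the broker address, and the hard-coded severity cut-offs are all assumptions; only the readings topic, the Isolation Forest model family, and the metric names come from the design described here.

# inference_service.py -- minimal sketch, not the repository's actual code.
import asyncio
import json

import joblib
import numpy as np
from aiokafka import AIOKafkaConsumer
from fastapi import FastAPI
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

# Metric names mirror the list under "Key tech"; prometheus_client appends
# "_total" to counter names on exposition.
ANOMALIES = Counter("anomaly_count", "Anomalies detected", ["device_id", "severity"])
LAST_SCORE = Gauge("last_anomaly_score", "Most recent score per device", ["device_id"])
LATENCY = Histogram("inference_latency_seconds", "Per-event scoring latency")
PREDICTIONS = Counter("predictions", "Events scored")
CONSUMED = Counter("kafka_messages_consumed", "Messages consumed", ["topic"])
KAFKA_ERRORS = Counter("kafka_errors", "Consumer/deserialization errors")

MODEL = joblib.load("isolation_forest.joblib")      # hypothetical file name
THRESHOLDS = {"medium": 0.55, "high": 0.65}          # assumed quantile cut-offs

def severity(score: float) -> str:
    # Map a score to low/medium/high using the calibrated thresholds.
    if score >= THRESHOLDS["high"]:
        return "high"
    if score >= THRESHOLDS["medium"]:
        return "medium"
    return "low"

async def consume(topic: str = "readings") -> None:
    consumer = AIOKafkaConsumer(
        topic,
        bootstrap_servers="redpanda:9092",           # assumed advertised address
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    await consumer.start()
    try:
        async for msg in consumer:
            CONSUMED.labels(topic=topic).inc()
            try:
                event = msg.value
                features = np.asarray(event["features"], dtype=float).reshape(1, -1)
                with LATENCY.time():
                    # Sign-flip score_samples so higher means "more anomalous".
                    score = float(-MODEL.score_samples(features)[0])
                    is_anomaly = MODEL.predict(features)[0] == -1
                PREDICTIONS.inc()
                LAST_SCORE.labels(device_id=event["device_id"]).set(score)
                if is_anomaly:
                    ANOMALIES.labels(device_id=event["device_id"],
                                     severity=severity(score)).inc()
            except Exception:
                KAFKA_ERRORS.inc()
    finally:
        await consumer.stop()

app = FastAPI()
app.mount("/metrics", make_asgi_app())               # Prometheus scrape target

@app.on_event("startup")
async def start_consumer() -> None:
    asyncio.create_task(consume())

A production version would also need graceful shutdown and could forward flagged events to the optional anomalies topic mentioned below; the sketch keeps only the scoring path.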
Key tech
- Kafka (Redpanda) for the stream (readings topic, plus an optional anomalies topic).
- FastAPI service scoring with Isolation Forest (scikit‑learn, joblib).
- Prometheus metrics exposed by the service: anomaly_count_total{device_id,severity}, last_anomaly_score{device_id}, inference_latency_seconds_bucket/_sum/_count, predictions_total, kafka_messages_consumed_total{topic}, kafka_errors_total.
- Alerting: Prometheus rule (e.g., sum(increase(anomaly_count_total[5m])) > threshold) → Alertmanager → Slack (app webhook).
- Grafana dashboard for anomalies, throughput, p95 latency, Kafka error %, top devices, and last scores.
- Docker Compose for one‑command up; a topic-init job ensures topics exist before consumers start (a sketch of that job follows this list).
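The topic-init job is essentially an idempotent "create the topics if they are missing" step that Compose runs before the consumers. This write-up doesn't show how it is implemented (it could just as easily be an rpk or kafka-topics.sh one-liner), so the following is only a plausible Python version using kafka-python's admin client; the partition and replication counts are assumptions for a single-node Redpanda.

# topic_init.py -- hypothetical version of the Compose "topic-init" job.
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

def ensure_topics(bootstrap: str = "redpanda:9092") -> None:
    admin = KafkaAdminClient(bootstrap_servers=bootstrap, client_id="topic-init")
    topics = [
        NewTopic(name="readings", num_partitions=3, replication_factor=1),
        NewTopic(name="anomalies", num_partitions=3, replication_factor=1),
    ]
    for topic in topics:
        try:
            admin.create_topics([topic])
            print(f"created topic {topic.name}")
        except TopicAlreadyExistsError:
            print(f"topic {topic.name} already exists")  # safe to re-run
    admin.close()

if __name__ == "__main__":
    ensure_topics()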
Results
- Throughput: ~9–10 predictions/sec on my local run.
- Latency: ~24–25 ms p95 inference latency.
- Alerting: HighAnomalyRate fires under sustained spikes and delivers to Slack.
- Reliability: Kafka error % at 0.00% during steady state.
- Observability: Live panels for anomaly volume (5m), by‑severity trend, top noisy devices, and recent per‑device scores.
Key Findings
- Observability from day 0 changes how you build ML services: latency histograms + rates make bottlenecks obvious.
- Severity needs calibration: quantile‑based thresholds (e.g., p80/p95 of scores) give meaningful low/medium/high bands (a calibration sketch follows this list).
- Alert tuning matters: use increase() over a window and sum across devices; start conservative to avoid Slack spam.
- Streaming > polling for scale: Kafka decouples generation from scoring and handles backpressure cleanly.
- Infra gotchas: Kafka advertised addresses and YAML indentation in Prometheus rules are common pitfalls; document them.
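One way to make the quantile‑based calibration concrete: fit the Isolation Forest, score a sample of mostly-normal traffic, and take the p80/p95 of those scores as the low/medium and medium/high boundaries. The snippet below is a sketch of that calibration step under the same sign-flipped-score convention as the service sketch above; the synthetic training data and the output file names are assumptions, not the project's actual pipeline.

# calibrate_severity.py -- sketch of quantile-based severity thresholds.
import json

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

def calibrate(train_X: np.ndarray, quantiles=(0.80, 0.95)) -> dict:
    model = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
    model.fit(train_X)

    # Higher = more anomalous (sign-flipped score_samples, as in the service).
    scores = -model.score_samples(train_X)
    p_medium, p_high = np.quantile(scores, quantiles)

    joblib.dump(model, "isolation_forest.joblib")            # hypothetical file name
    thresholds = {"medium": float(p_medium), "high": float(p_high)}
    with open("severity_thresholds.json", "w") as f:
        json.dump(thresholds, f)
    return thresholds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))       # stand-in for real training features
    print(calibrate(X))

At inference time, severity then costs only two comparisons per event (low below p80, medium between p80 and p95, high above p95), which keeps the scoring path cheap.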
What I Solved
- Turned a notebook model into a real‑time, observable microservice.
- Built an end‑to‑end alerting loop (metrics → rule → Slack) operators can trust.
- Packaged everything with Docker Compose so anyone can reproduce the system in minutes.
- Shipped a Grafana dashboard and PromQL library that make the service explainable to SREs and PMs, not just ML engineers.
Conclusion
This project demonstrates practical MLOps readiness: streaming ingestion, low‑latency inference, first‑class metrics, actionable alerts, and clear dashboards. It’s small enough to run locally yet architected like a production system—useful as a template for real deployments.
Future Improvements
- Modeling: windowed features, autoencoder baseline, concept‑drift detection, periodic retraining.
- Data/Contracts: schema registry (Avro/Protobuf), input validation, dead‑letter topic.
- Ops: CI/CD with unit + integration tests, canary deploys, blue/green for the service.
- Platform: Kubernetes (Prometheus Operator, Alertmanager, Grafana provisioning), Helm charts.
- Observability+: OpenTelemetry traces, SLOs + burn‑rate alerts, per‑device alert routing, on‑call runbooks.
- Security: TLS between components, secret management, RBAC on dashboards.