Alertron

An anomaly detection system using Kafka, Prometheus, Grafana and Slack.

GitHub Repository: View on GitHub


Introduction

I built a real-time anomaly-detection pipeline that ingests streaming IoT-style data, scores each event with an ML model, exports rich Prometheus metrics, raises automated Slack alerts when the fleet goes off‑nominal, and visualizes the system in Grafana. It’s a compact, hands‑on MLOps project that shows how to take a model from a notebook to a production‑like, observable service.

Problem

Detect anomalous device behavior across a fleet as it happens, not hours later. The system needed to:

  • handle a continuous stream of readings,
  • score each event with low latency,
  • expose operational metrics (throughput, latency, errors),
  • raise a reliable alert when anomaly volume spikes,
  • be easy to run locally and share.

Approach

Pipeline

IoT generator → Kafka ("readings")
                   ↓
        FastAPI inference service (aiokafka)
           • Isolation Forest scoring
           • Severity via score quantiles (low/medium/high)
           • Prometheus metrics (/metrics)
                   ↓
      Prometheus → Alert rules → Alertmanager → Slack
                   ↓
        Grafana dashboards
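
To make the flow concrete, here is a minimal sketch of the consume → score → classify step. It assumes JSON readings carrying a device_id plus a few numeric features, and a joblib bundle holding the fitted Isolation Forest together with its p80/p95 severity cutoffs (the calibration sketch under Key Findings produces exactly that bundle); the real service runs this loop inside FastAPI and updates the Prometheus metrics listed under Key tech.

  # Sketch of the consume -> score -> classify loop. The "readings" topic comes
  # from the pipeline above; the payload shape, feature names, broker address,
  # and model bundle format are illustrative assumptions.
  import asyncio
  import json

  import joblib
  import numpy as np
  from aiokafka import AIOKafkaConsumer

  FEATURES = ["temperature", "vibration", "pressure"]  # assumed reading fields

  bundle = joblib.load("model.joblib")  # fitted IsolationForest + p80/p95 cutoffs
  model, p80, p95 = bundle["model"], bundle["p80"], bundle["p95"]

  def severity(score: float) -> str:
      # Anomaly score is the negated decision_function: higher = more anomalous.
      if score >= p95:
          return "high"
      if score >= p80:
          return "medium"
      return "low"

  async def consume() -> None:
      consumer = AIOKafkaConsumer(
          "readings",
          bootstrap_servers="localhost:9092",  # assumed local Redpanda address
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
      )
      await consumer.start()
      try:
          async for msg in consumer:
              reading = msg.value
              x = np.array([[reading[f] for f in FEATURES]])
              score = float(-model.decision_function(x)[0])
              if model.predict(x)[0] == -1:  # flagged as anomalous
                  # In the real service this is where anomaly_count_total,
                  # last_anomaly_score, etc. get updated (see Key tech below).
                  print(f"anomaly device={reading['device_id']} "
                        f"score={score:.3f} severity={severity(score)}")
      finally:
          await consumer.stop()

  if __name__ == "__main__":
      asyncio.run(consume())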

Key tech

  • Kafka (Redpanda) for the stream (readings, optional anomalies topic).
  • FastAPI service scoring with Isolation Forest (scikit‑learn, joblib).
  • Prometheus metrics exposed by the service (declared roughly as in the first sketch after this list):
    anomaly_count_total{device_id,severity}, last_anomaly_score{device_id},
    inference_latency_seconds_bucket/_sum/_count, predictions_total,
    kafka_messages_consumed_total{topic}, kafka_errors_total.
  • Alerting: Prometheus rule (e.g., sum(increase(anomaly_count_total[5m])) > threshold) → Alertmanager → Slack (via a Slack app webhook).
  • Grafana dashboard for anomalies, throughput, p95 latency, Kafka error %, top devices, and last scores; the underlying PromQL is sketched below.
  • Docker Compose for one‑command up; a topic-init job ensures topics exist before consumers start.
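
The metric and label names above are the ones the service exports; here is a sketch of how they can be declared with prometheus_client (bucket boundaries and variable names are my illustrative choices).

  # Declaring the service's Prometheus metrics with prometheus_client.
  # Metric and label names match the list above; buckets are assumptions.
  from prometheus_client import Counter, Gauge, Histogram

  ANOMALY_COUNT = Counter(
      "anomaly_count",                       # exported as anomaly_count_total
      "Anomalies detected, by device and severity",
      ["device_id", "severity"],
  )
  LAST_ANOMALY_SCORE = Gauge(
      "last_anomaly_score", "Most recent anomaly score per device", ["device_id"]
  )
  INFERENCE_LATENCY = Histogram(             # exports _bucket / _sum / _count
      "inference_latency_seconds",
      "Model scoring latency in seconds",
      buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
  )
  PREDICTIONS = Counter("predictions", "Total predictions made")  # predictions_total
  KAFKA_CONSUMED = Counter(
      "kafka_messages_consumed", "Messages consumed from Kafka", ["topic"]
  )
  KAFKA_ERRORS = Counter("kafka_errors", "Kafka consume/produce errors")

  # Inside the scoring loop:
  #   with INFERENCE_LATENCY.time():
  #       score = -model.decision_function(x)[0]
  #   PREDICTIONS.inc()
  #   KAFKA_CONSUMED.labels(topic="readings").inc()
  #   if is_anomaly:
  #       ANOMALY_COUNT.labels(device_id=dev, severity=sev).inc()
  #       LAST_ANOMALY_SCORE.labels(device_id=dev).set(score)
  #
  # The FastAPI app serves these on /metrics, e.g.:
  #   from prometheus_client import make_asgi_app
  #   app.mount("/metrics", make_asgi_app())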
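
Those metrics are what the alert rule and the dashboard panels query. A rough sketch of the kind of PromQL involved, run here against Prometheus's standard HTTP query API (the URL, windows, and grouping are illustrative; only the metric names come from the service):

  # Querying Prometheus for the expressions behind the alert and the main
  # dashboard panels (Prometheus URL and window choices are assumptions).
  import requests

  PROM_URL = "http://localhost:9090/api/v1/query"

  QUERIES = {
      # Alert-style query: fleet-wide anomalies over the last 5 minutes
      # (the alert rule compares this sum against a threshold).
      "anomalies_5m": 'sum(increase(anomaly_count_total[5m]))',
      # Throughput panel: predictions per second.
      "throughput": 'rate(predictions_total[5m])',
      # Latency panel: p95 inference latency from the histogram buckets.
      "p95_latency": (
          'histogram_quantile(0.95, '
          'sum(rate(inference_latency_seconds_bucket[5m])) by (le))'
      ),
      # Reliability panel: Kafka error percentage.
      "kafka_error_pct": (
          '100 * rate(kafka_errors_total[5m]) '
          '/ rate(kafka_messages_consumed_total{topic="readings"}[5m])'
      ),
  }

  for name, expr in QUERIES.items():
      result = requests.get(PROM_URL, params={"query": expr}, timeout=5).json()
      print(name, result["data"]["result"])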

Results

  • Throughput: ~9–10 predictions/sec on my local run.
  • Latency: ~24–25 ms p95 inference latency.
  • Alerting: HighAnomalyRate fires under sustained spikes and delivers to Slack.
  • Reliability: Kafka error rate held at 0.00% during steady state.
  • Observability: Live panels for anomaly volume (5m), by‑severity trend, top noisy devices, and recent per‑device scores.

Key Findings

  • Observability from day 0 changes how you build ML services: latency histograms + rates make bottlenecks obvious.
  • Severity needs calibration: quantile‑based thresholds (e.g., p80/p95 of scores) give meaningful low/medium/high bands; see the sketch after this list.
  • Alert tuning matters: use increase() over a window and sum across devices; start conservative to avoid Slack spam.
  • Streaming > polling for scale: Kafka decouples generation from scoring and handles backpressure cleanly.
  • Infra gotchas: correct Kafka advertised addresses and YAML indentation for Prometheus rules are common pitfalls—document them.
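
A minimal sketch of that calibration step on synthetic stand-in data (the p80/p95 cut follows the finding above; the features, contamination, and bundle format are illustrative). It produces the model bundle that the scoring-loop sketch under Approach assumes.

  # Calibrating low/medium/high severity bands from score quantiles
  # (p80/p95 as above; the training data here is a synthetic stand-in).
  import joblib
  import numpy as np
  from sklearn.ensemble import IsolationForest

  X_train = np.random.RandomState(0).normal(size=(5000, 3))  # stand-in readings

  model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
  model.fit(X_train)

  # Higher anomaly score = more anomalous (negated decision_function).
  scores = -model.decision_function(X_train)
  p80, p95 = np.quantile(scores, [0.80, 0.95])

  def severity(score: float) -> str:
      if score >= p95:
          return "high"
      if score >= p80:
          return "medium"
      return "low"

  # Bundle the model with its cutoffs so the service loads one artifact.
  joblib.dump({"model": model, "p80": float(p80), "p95": float(p95)}, "model.joblib")
  print(f"severity bands: medium >= {p80:.3f}, high >= {p95:.3f}")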

What I Solved

  • Turned a notebook model into a real‑time, observable microservice.
  • Built an end‑to‑end alerting loop (metrics → rule → Slack) operators can trust.
  • Packaged everything with Docker Compose so anyone can reproduce the system in minutes.
  • Shipped a Grafana dashboard and PromQL library that make the service explainable to SREs and PMs, not just ML engineers.

Conclusion

This project demonstrates practical MLOps readiness: streaming ingestion, low‑latency inference, first‑class metrics, actionable alerts, and clear dashboards. It’s small enough to run locally yet architected like a production system—useful as a template for real deployments.

Future Improvements

  • Modeling: windowed features, autoencoder baseline, concept‑drift detection, periodic retraining.
  • Data/Contracts: schema registry (Avro/Protobuf), input validation, dead‑letter topic.
  • Ops: CI/CD with unit + integration tests, canary deploys, blue/green for the service.
  • Platform: Kubernetes (Prometheus Operator, Alertmanager, Grafana provisioning), Helm charts.
  • Observability+: OpenTelemetry traces, SLOs + burn‑rate alerts, per‑device alert routing, on‑call runbooks.
  • Security: TLS between components, secret management, RBAC on dashboards.