Operational Recovery Through WhatsApp

Your infrastructure should message you directly.

AlertEngine replaces dashboards with operational conversations. Detect degradation. Get a WhatsApp message. Tap to approve the fix. Everything audited.

📉

Detect

API latency spikes

→

🧠

Diagnose

AI finds root cause

→

💬

Alert

WhatsApp message sent

→

✅

Approve

Engineer taps approve

→

📋

Audit

Every action logged

$ pip install fastapi-alertengine

View on GitHub · Request Managed Pilot

No dashboards · No alert fatigue · No chart interpretation · No complex setup

How it works

Five stages. No dashboards required.

From anomaly detection to authorized recovery — entirely through your phone.

Detect

Add instrument(app). P95 latency, error rate, and health scoring start immediately. The orchestrator polls every 5 seconds.

Diagnose

Claude AI analyses the incident context and produces a plain English root-cause assessment with a recovery recommendation.

Alert

WhatsApp or Telegram message arrives. Plain English description of what broke, why, and what to do. Recovery link included.

Approve

Preview the recovery action. Tap to authorize. Nothing executes without your explicit approval. The system waits for you.

Audit

Every stage transition, delivery attempt, and authorization is logged immutably. Full incident history per tenant.

Free SDK

Free Forever · MIT Licensed

Local Incident Sensing — Free Forever

Everything you need to understand what your API is doing right now. No account. No cloud. No catch.

✓ P95 Latency Tracking: rolling window, no dependencies
✓ Error Rate Detection: 4xx and 5xx tracked separately
✓ Health Score 0–100: adaptive, composite metric
✓ /health/alerts Endpoint: machine-readable incident feed
✓ Memory Fallback: Redis optional; never crashes your app
✓ Zero Breaking Changes: add it to any running FastAPI service
✓ MIT Licensed: use it however you like

# Install
pip install fastapi-alertengine

# In your FastAPI app
from fastapi import FastAPI
from fastapi_alertengine import instrument

app = FastAPI()
instrument(app)  # that's it

# Now visit /health/alerts
# {
#   "health_score": 94,
#   "p95_latency_ms": 42,
#   "error_rate": 0.002,
#   "alerts": []
# }
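
To make the rolling-window idea concrete, here is a minimal sketch of how P95 latency and separate 4xx/5xx rates can be computed over the most recent requests. It is not the SDK's internal code: the `RollingWindow` class, its method names, and the window size are assumptions made for illustration, and the percentile uses the simple nearest-rank method.

```python
from collections import deque


class RollingWindow:
    """Keeps the most recent `size` request samples (hypothetical helper)."""

    def __init__(self, size: int = 200) -> None:
        # Each sample is a (latency_ms, status_code) pair; old ones fall off.
        self.samples = deque(maxlen=size)

    def record(self, latency_ms: float, status_code: int) -> None:
        self.samples.append((latency_ms, status_code))

    def p95_latency_ms(self) -> float:
        """Nearest-rank 95th percentile of the latencies in the window."""
        if not self.samples:
            return 0.0
        latencies = sorted(latency for latency, _ in self.samples)
        # ceil(0.95 * n) in integer arithmetic; rank is 1-based.
        rank = (95 * len(latencies) + 99) // 100
        return latencies[rank - 1]

    def error_rates(self) -> dict:
        """Share of 4xx and 5xx responses, tracked separately."""
        total = len(self.samples)
        if total == 0:
            return {"4xx": 0.0, "5xx": 0.0}
        client = sum(1 for _, status in self.samples if 400 <= status < 500)
        server = sum(1 for _, status in self.samples if 500 <= status < 600)
        return {"4xx": client / total, "5xx": server / total}


window = RollingWindow(size=100)
for i in range(1, 101):
    # 100 requests with latencies 1..100 ms; every 10th one returns a 500.
    window.record(float(i), 500 if i % 10 == 0 else 200)

print(window.p95_latency_ms())  # 95.0
print(window.error_rates())     # {'4xx': 0.0, '5xx': 0.1}
```

A bounded `deque` keeps memory constant, which fits the "no dependencies" claim above; sorting on every read is fine for small windows but would need a smarter structure at high request rates.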
        

Architecture

Pipeline and response format.

Deterministic data flow from request to alert. Every stage is observable, append-only, and recoverable.

# Pipeline
FastAPI Request
  → RequestMetricsMiddleware (latency + status)
  → Redis Streams (append-only event log)
  → Alert Engine (P95 + error rate + anomaly scoring)
  → /health/alerts (status: ok | warning | critical)

# JSON response from /health/alerts
{
  "status": "critical",
  "health_score": {"score": 23, "status": "critical", "trend": "degrading"},
  "metrics": {
    "overall_p95_ms": 2847.3,
    "error_rate": 0.19,
    "anomaly_score": 1.4,
    "sample_size": 187
  },
  "alerts": [
    {
      "type": "latency_spike",
      "severity": "critical",
      "reason_for_trigger": "P95 latency 2847ms exceeds threshold",
      "triggered_by": "absolute_threshold"
    }
  ]
}
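
As an illustration of consuming this response, the sketch below parses the sample payload above and turns it into short, human-readable lines. The field names come from the sample; the `summarize` and `needs_attention` helpers are hypothetical and not part of the SDK.

```python
import json

# The sample response shown above, inlined so the sketch is self-contained.
SAMPLE_RESPONSE = """
{
  "status": "critical",
  "health_score": {"score": 23, "status": "critical", "trend": "degrading"},
  "metrics": {"overall_p95_ms": 2847.3, "error_rate": 0.19,
              "anomaly_score": 1.4, "sample_size": 187},
  "alerts": [
    {"type": "latency_spike", "severity": "critical",
     "reason_for_trigger": "P95 latency 2847ms exceeds threshold",
     "triggered_by": "absolute_threshold"}
  ]
}
"""


def needs_attention(payload: dict) -> bool:
    """True for the two non-ok statuses listed in the pipeline."""
    return payload["status"] in {"warning", "critical"}


def summarize(payload: dict) -> list[str]:
    """One headline line, then one line per alert."""
    score = payload["health_score"]
    lines = [
        f"{payload['status'].upper()}: score {score['score']}, "
        f"trend {score['trend']}"
    ]
    for alert in payload.get("alerts", []):
        lines.append(
            f"[{alert['severity']}] {alert['type']}: "
            f"{alert['reason_for_trigger']}"
        )
    return lines


payload = json.loads(SAMPLE_RESPONSE)
if needs_attention(payload):
    for line in summarize(payload):
        print(line)
```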
    

Channels

Alerts where engineers actually are.

No new apps to install. No dashboards to check. Just the channel your team already uses.

💬

WhatsApp

Via Twilio or Sent.dm. The most reliable mobile interrupt channel globally. Recovery approvals arrive as tappable links.

Developer plan+

✈️

Telegram

Via Telegram Bot API. Available on all plans including Hobby. No per-message cost — flat rate. Instant delivery globally.

All plans

Slack

Webhook-based Slack integration for team notifications. Incidents posted to your channel with recovery link.

Startup plan+

🔗

Webhook

Generic HTTP webhook fallback. Fires when primary channel fails. Integrates with any endpoint.

All plans

📞

Voice

Automated voice call escalation via Twilio. Fires after configurable timeout if approval is not received.

Scale plan+

📋

Audit Trail

Every delivery attempt logged immutably. Success, failure, provider, timestamp. Full ledger per incident.

All plans

Channel	Provider	Plan	Best For
WhatsApp	Sent.dm	Solo (default)	Zero-friction, instant setup
WhatsApp	Twilio	All	Enterprise existing accounts
Telegram	Telegram Bot API	All	Developers, North America
Slack	Incoming Webhooks	Startup+	Team transparency
Webhook	HTTP POST	All	Slack/Teams/PagerDuty fallback
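
The fallback behaviour described above (the webhook fires when the primary channel fails, and every attempt is recorded) can be sketched as an ordered list of senders. This is an assumption about the shape of the logic, not the orchestrator's actual code; `dispatch`, the `Sender` type, and the two stub senders are hypothetical names.

```python
from typing import Callable

# A sender takes the message text and returns True when delivery succeeded.
Sender = Callable[[str], bool]


def dispatch(message: str, channels: list[tuple[str, Sender]]) -> list[dict]:
    """Try channels in order and stop at the first success.

    Every attempt, successful or not, is appended to the returned ledger.
    """
    ledger = []
    for name, send in channels:
        try:
            ok = send(message)
        except Exception:
            # A provider error counts as a failed attempt, not a crash.
            ok = False
        ledger.append({"channel": name, "success": ok})
        if ok:
            break
    return ledger


def failing_whatsapp(message: str) -> bool:
    # Stub that simulates a provider outage.
    raise ConnectionError("provider unavailable")


def working_webhook(message: str) -> bool:
    # Stub that simulates a successful HTTP POST.
    return True


ledger = dispatch(
    "P95 latency 2847ms exceeds threshold",
    [("whatsapp", failing_whatsapp), ("webhook", working_webhook)],
)
print(ledger)
```

A real ledger entry would also carry the provider and a timestamp, as the Audit Trail card describes; they are left out here to keep the sketch deterministic.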

API

Orchestrator Endpoints

RESTful endpoints for tenant management, incident audit, and human-authorized recovery.

Method	Path	Description
GET	/health	Service health + Redis status
GET	/status	Active tenants, degraded mode, DLQ, stage gates
POST	/onboard	Register a new tenant
POST	/verify	Verify WhatsApp number
GET	/tenant/{id}	Get tenant status
GET	/tenant/{id}/contacts	Get contact verification status
POST	/tenant/{id}/test	Trigger test incident
GET	/audit/{incident_id}	Incident audit log (query parameter: tenant_id)
GET	/delivery/{incident_id}	Delivery log (query parameter: tenant_id)
GET	/dlq	Dead letter queue (Startup plan and above)
GET	/action/recover	Preview recovery action (safe, no side effects)
POST	/action/recover/confirm	Execute recovery (irreversible, requires JWT)
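
As a small example of addressing these endpoints, the sketch below builds the audit and delivery URLs for one incident. The paths and the `tenant_id` query parameter come from the table above; the base URL and the helper names are placeholders, and any authentication a real deployment requires is not shown because the table does not describe it.

```python
from urllib.parse import quote, urlencode

# Hypothetical orchestrator address; in practice this would be the value
# of ALERTENGINE_BASE_URL.
BASE_URL = "https://alerts.example.com"


def _incident_url(kind: str, incident_id: str, tenant_id: str) -> str:
    """Build /<kind>/{incident_id}?tenant_id=... with safe escaping."""
    path = f"/{kind}/{quote(incident_id, safe='')}"
    query = urlencode({"tenant_id": tenant_id})
    return f"{BASE_URL}{path}?{query}"


def audit_url(incident_id: str, tenant_id: str) -> str:
    return _incident_url("audit", incident_id, tenant_id)


def delivery_url(incident_id: str, tenant_id: str) -> str:
    return _incident_url("delivery", incident_id, tenant_id)


print(audit_url("inc-42", "tenant-a"))
print(delivery_url("inc-42", "tenant-a"))
```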

Configuration

Environment Variables

Required and optional configuration for the managed orchestrator.

Variable	Required	Description
REDIS_URL	Yes	Redis connection URL
ALERTENGINE_BASE_URL	Yes	Public URL of this orchestrator
ANTHROPIC_API_KEY	Yes	Claude AI API key for diagnosis
ALERT_SECRET	Yes	JWT signing secret (min 32 chars)
TWILIO_ACCOUNT_SID	Twilio only	Twilio account SID
TWILIO_AUTH_TOKEN	Twilio only	Twilio auth token
TWILIO_WHATSAPP_FROM	Twilio only	Sender WhatsApp number
SENT_API_KEY	Sent.dm only	Sent.dm API key
SENT_PHONE_ID	Sent.dm only	Sent.dm phone ID
LOOP_INTERVAL_S	Optional	Polling interval seconds (default: 5)
POLICY_MIN_SCORE_TO_ALERT	Optional	Min score to open incident (default: 70)
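
A startup check along these lines can catch misconfiguration early. The variable names, defaults, and the 32-character minimum are taken from the table above; the `Settings` class and `load_settings` function are hypothetical, and the provider-specific Twilio and Sent.dm variables are left out for brevity.

```python
import os
from dataclasses import dataclass

REQUIRED = ("REDIS_URL", "ALERTENGINE_BASE_URL", "ANTHROPIC_API_KEY", "ALERT_SECRET")


@dataclass(frozen=True)
class Settings:
    redis_url: str
    base_url: str
    anthropic_api_key: str
    alert_secret: str
    loop_interval_s: float
    min_score_to_alert: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) and validate."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    if len(env["ALERT_SECRET"]) < 32:
        raise ValueError("ALERT_SECRET must be at least 32 characters")
    return Settings(
        redis_url=env["REDIS_URL"],
        base_url=env["ALERTENGINE_BASE_URL"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        alert_secret=env["ALERT_SECRET"],
        # Optional values fall back to the documented defaults.
        loop_interval_s=float(env.get("LOOP_INTERVAL_S", "5")),
        min_score_to_alert=int(env.get("POLICY_MIN_SCORE_TO_ALERT", "70")),
    )


# Placeholder values for illustration only.
example_env = {
    "REDIS_URL": "redis://localhost:6379/0",
    "ALERTENGINE_BASE_URL": "https://alerts.example.com",
    "ANTHROPIC_API_KEY": "placeholder-key",
    "ALERT_SECRET": "s" * 32,
}
settings = load_settings(example_env)
print(settings.loop_interval_s, settings.min_score_to_alert)
```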

Pricing

Simple, honest pricing.

The SDK is free forever. Pay only for the managed orchestration layer.

Why not just build it yourself?

Hobby — $19/mo

Free uptime tools tell you when your server is dead. They don't tell you when your payment rail is dropping 40% of traffic due to API latency. Building a sliding-window P95 tracker that catches business degradation takes days of engineering. At $19/mo you're buying back a week of development time for the price of two pizzas.

Developer — $99/mo

Datadog APM for 3 microservices costs $93/mo and gives you a graph. AlertEngine gives you the graph, the WhatsApp message, the AI diagnosis, and the recovery authorization link. We pay for WhatsApp routing and Claude inference on your behalf.

Solo+ — $299/mo

SOC 2 auditors ask: "Prove an automated script didn't modify production without oversight." AlertEngine shows the exact timestamped human authorization for every recovery action. If this shaves two days off your audit, it has paid for itself for the entire year.

Hobby

$19/mo

1 service. 5 incidents. Telegram only. Ideal for evaluating AlertEngine on a small project.

Get started

Developer

$99/mo

Single app. WhatsApp or Telegram. AI diagnosis. Ideal for side projects going live.

Get started

Popular

Solo

$299/mo

Up to 3 apps. Priority diagnosis. Full audit trail. Best for indie developers running real products.

Get started

Startup

$799/mo

Up to 10 apps. Team approvals. Multi-channel delivery. SLA included.

Get started

Scale

$1,500/mo

Unlimited apps. Custom thresholds. Dedicated support. Compliance exports.

Get started

Enterprise

Custom

On-premise. SSO. Custom SLA. Dedicated account manager.

Safety

Human-Authorized. Always.

No automated remediation. No background execution. Every recovery action requires explicit human authorization.

🔑

JWT Recovery Tokens

Every recovery action is gated by a tenant-scoped JWT with a 5-minute TTL. Tokens are single-use and validated atomically in Redis — no replay possible.
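
For readers who want to see what a tenant-scoped token with a 5-minute TTL involves, here is a minimal HS256 sketch using only the standard library. It is not AlertEngine's implementation: the claim names `tenant_id` and `incident_id` and both function names are assumptions, and production code would normally use a maintained JWT library and also validate the header's algorithm. Single-use enforcement is a separate step and is not shown here.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

TTL_SECONDS = 300  # the 5-minute TTL described above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: str, signing_input: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: str, tenant_id: str, incident_id: str, now: float) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "tenant_id": tenant_id,      # assumed claim name
        "incident_id": incident_id,  # assumed claim name
        "exp": int(now) + TTL_SECONDS,
    }
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    ]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: str, token: str, tenant_id: str, now: float) -> Optional[dict]:
    """Return the claims if signature, expiry, and tenant all check out."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    expected = _sign(secret, f"{header_b64}.{claims_b64}")
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        return None
    claims = json.loads(_b64url_decode(claims_b64))
    if claims["exp"] <= now or claims["tenant_id"] != tenant_id:
        return None
    return claims


secret = "x" * 32
token = issue_token(secret, "tenant-a", "inc-42", now=1_000)
print(verify_token(secret, token, "tenant-a", now=1_100))  # claims dict
print(verify_token(secret, token, "tenant-a", now=1_300))  # None: expired
```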

👁

Preview Before Authorization

GET the recovery link to see exactly what will happen. POST to execute. The preview is read-only; irreversible actions are always a separate, explicit step.

🔒

Cross-Tenant Isolation

All endpoints enforce tenant ownership. An adversarial audit confirmed: attempting to access another tenant's incidents returns 403 — always.

📋

Immutable Audit Trail

Every alert, diagnosis, delivery attempt, and recovery authorization is written to an append-only log with 7-day retention. Nothing happens silently.

Security

Survived a full adversarial audit.

An autonomous AI agent acted as a hostile tenant and attempted to break isolation, replay tokens, and flood the system. 10/10 passed.

Check	Result	Detail
Cross-tenant audit access	✓ Blocked	403 returned
Cross-tenant delivery access	✓ Blocked	403 returned
Recovery token replay (20 concurrent)	✓ Protected	1 succeeded, 19 rejected
Duplicate incident creation (race)	✓ Protected	Exactly 1 created
Concurrent token flood	✓ Handled	Atomic Redis SET NX
Natural incident detection	✓ Confirmed	End-to-end verified
WhatsApp delivery	✓ Confirmed	Live production delivery
Recovery authorization audit trail	✓ Written	Immutable append-only log
Degraded mode handling	✓ Confirmed	NORMAL/DEGRADED/EMERGENCY
Lease renewal under load	✓ Atomic	Lua compare-and-delete
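
The replay result in the table (1 of 20 concurrent attempts succeeds) follows from set-if-absent semantics. The sketch below reproduces that behaviour with an in-memory stand-in for Redis so it runs without a server; `InMemoryStore`, `redeem`, and the key format are hypothetical and only mimic what an atomic `SET ... NX` provides.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Stand-in for Redis: set_nx mimics SET key value NX."""

    def __init__(self) -> None:
        self._data = {}
        self._lock = threading.Lock()

    def set_nx(self, key: str, value: str) -> bool:
        # The lock makes check-and-set atomic, as Redis does server-side.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def redeem(store: InMemoryStore, token_id: str) -> bool:
    """Only the first caller to mark the token as used may proceed."""
    return store.set_nx(f"used:{token_id}", "1")


store = InMemoryStore()
with ThreadPoolExecutor(max_workers=20) as pool:
    # 20 concurrent attempts to redeem the same token.
    results = list(pool.map(lambda _: redeem(store, "token-123"), range(20)))

print(results.count(True), results.count(False))  # 1 19
```

With the redis-py client, the equivalent call would likely be `client.set(key, value, nx=True, ex=300)`, where the expiry lets used-token markers age out after the token's own TTL has passed.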

✓

Cross-Tenant Isolation

Attempted unauthorized access to another tenant's incidents. System returned 403 on every request.

✓

Replay Attack (20 Concurrent)

Flooded recovery endpoint with 20 identical JWT tokens. Exactly 1 succeeded; 19 were atomically rejected via Redis SET NX.

✓

Natural Incident Detection

Simulated latency spike and error-rate surge. Detection triggered within 5 seconds with correct severity classification.

✓

Recovery Authorization Audit Trail

Every preview GET and confirm POST was logged with timestamp, tenant ID, and JWT fingerprint. Trail was immutable.

✓

DLQ Plan Enforcement

Attempted to access dead-letter queue on Hobby plan. Endpoint correctly returned plan-gated denial.

✓

Concurrent Token Flood

Race-condition test: 50 concurrent authorization attempts for the same incident. No double-execution occurred.

✓

Circuit Breaker Resilience

Redis outage simulation. SDK entered memory-fallback mode without crashing the host FastAPI application.

✓

Monthly Counter Reset

Simulated 30-day boundary crossing. Incident counters reset atomically; no plan-limit bypass possible.

✓

Test Incident Plan Enforcement

Hobby tenant attempted to trigger AI diagnosis (not available on Hobby). System blocked with clear plan-upgrade messaging.

✓

WhatsApp Delivery Confirmed

Live end-to-end delivery to verified number. Recovery link delivered. Human authorization confirmed working.

Open Source

Repository Structure

Clean separation between the free SDK and the paid orchestrator. MIT-licensed core; commercial managed layer.

fastapi_alertengine/ ← Free PyPI package
  middleware.py ← RequestMetricsMiddleware
  engine.py ← Core alert engine
  intelligence.py ← Adaptive thresholds, health scoring
  storage.py ← Redis Streams persistence
orchestrator/ ← Paid managed service
  loop.py ← Multi-tenant polling
  pipeline.py ← Incident state machine
  claude_engine.py ← AI diagnosis
  notifications.py ← Multi-channel dispatch
  providers/ ← WhatsApp, Telegram, Slack, Webhook
  audit.py ← Immutable forensic log
  plans.py ← Billing tiers and feature gates
tests/ ← 232 tests, Python 3.10/3.11/3.12
docs/ ← This landing page

Compliance

Built for compliance and auditability.

AlertEngine is designed for teams where operational decisions must be documented and defensible.

✓ Every recovery action requires explicit human authorization, never autonomous
✓ Immutable audit trail on every incident, stage transition, and recovery event
✓ JWT-scoped single-use tokens with replay protection
✓ Cross-tenant data isolation enforced at every endpoint
✓ Delivery ledger records every notification attempt

Suitable for fintech, regulated industries, and teams preparing for SOC 2 or ISO 27001.

🇿🇼

Built in Zimbabwe for mobile-first operational reality.

A latency spike in Zimbabwe means a customer walks away mid-transaction. We built FastAPI AlertEngine while running a WhatsApp-native commerce platform — where every failure was visible first on mobile. Mobile-first isn't a design choice here. It's the infrastructure constraint.

10/10

Adversarial checks passed including replay attacks and cross-tenant isolation

Live

Live fintech platform monitored in production — real workloads, real tenants

232

Tests passing across the full SDK and orchestration suite

End-to-end detection latency from spike to WhatsApp alert
