Operational Recovery Through WhatsApp

Your infrastructure should message you directly.

AlertEngine replaces dashboards with operational conversations. Detect degradation. Get a WhatsApp message. Tap to approve the fix. Everything audited.

📉 Detect: API latency spikes
→ 🧠 Diagnose: AI finds root cause
→ 💬 Alert: WhatsApp message sent
→ ✅ Approve: Engineer taps approve
→ 📋 Audit: Every action logged
$ pip install fastapi-alertengine
No dashboards · No alert fatigue · No chart interpretation · No complex setup
232 tests passing (full pytest suite)
10/10 adversarial audit (all checks passed)
5s detection (alert polling interval)
Live production tenant (fintech, Zimbabwe)

Five stages. No dashboards required.

From anomaly detection to authorized recovery, entirely through your phone.

1. Detect

Add instrument(app). P95 latency, error rate, and health scoring start immediately. The orchestrator polls every 5 seconds.

2. Diagnose

Claude AI analyses the incident context and produces a plain English root-cause assessment with a recovery recommendation.

3. Alert

WhatsApp or Telegram message arrives. Plain English description of what broke, why, and what to do. Recovery link included.

4. Approve

Preview the recovery action. Tap to authorize. Nothing executes without your explicit approval. The system waits for you.

5. Audit

Every stage transition, delivery attempt, and authorization is logged immutably. Full incident history per tenant.
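The five stages above behave like a small state machine with an approval gate. The following is a minimal sketch of that idea, not the orchestrator's actual code: the `Stage`, `Incident`, and `advance` names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Stage(Enum):
    DETECTED = "detect"
    DIAGNOSED = "diagnose"
    ALERTED = "alert"
    APPROVED = "approve"
    AUDITED = "audit"


# Allowed forward transitions (hypothetical; mirrors the five stages above).
NEXT = {
    Stage.DETECTED: Stage.DIAGNOSED,
    Stage.DIAGNOSED: Stage.ALERTED,
    Stage.ALERTED: Stage.APPROVED,
    Stage.APPROVED: Stage.AUDITED,
}


@dataclass
class Incident:
    incident_id: str
    stage: Stage = Stage.DETECTED
    history: List[Tuple[str, str]] = field(default_factory=list)

    def advance(self, human_approved: bool = False) -> Stage:
        nxt = NEXT.get(self.stage)
        if nxt is None:
            raise ValueError("incident is already in its final stage")
        # Approval gate: the incident waits at ALERTED until a human approves.
        if nxt is Stage.APPROVED and not human_approved:
            raise PermissionError("recovery requires explicit human approval")
        # Record every transition so the history can be audited later.
        self.history.append((self.stage.value, nxt.value))
        self.stage = nxt
        return nxt
```

Calling `advance()` on an incident in the alert stage without `human_approved=True` raises and leaves the stage unchanged, which is the "the system waits for you" behavior in miniature.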

Free Forever · MIT Licensed

Local Incident Sensing: Free Forever

Everything you need to understand what your API is doing right now. No account. No cloud. No catch.

  • ✓ P95 Latency Tracking: rolling window, no dependencies
  • ✓ Error Rate Detection: 4xx and 5xx tracked separately
  • ✓ Health Score 0–100: adaptive, composite metric
  • ✓ /health/alerts Endpoint: machine-readable incident feed
  • ✓ Memory Fallback: Redis optional; never crashes your app
  • ✓ Zero Breaking Changes: add it to any running FastAPI service
  • ✓ MIT Licensed: use it however you like
# Install
pip install fastapi-alertengine

# In your FastAPI app
from fastapi import FastAPI
from fastapi_alertengine import instrument

app = FastAPI()
instrument(app)  # that's it

# Now visit /health/alerts
# {
#   "health_score": 94,
#   "p95_latency_ms": 42,
#   "error_rate": 0.002,
#   "alerts": []
# }
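To make "rolling window" P95 and separate 4xx/5xx tracking concrete, here is a minimal standalone sketch. It is an illustration under assumptions, not the SDK's implementation: the `RollingWindow` class, its sample-count window, and the nearest-rank percentile method are all choices made for this example.

```python
from collections import deque


class RollingWindow:
    """Keeps the most recent samples and derives P95 latency and error rates."""

    def __init__(self, max_samples: int = 200) -> None:
        # Each sample is (latency_ms, status_code); old samples fall off the left.
        self.samples = deque(maxlen=max_samples)

    def record(self, latency_ms: float, status_code: int) -> None:
        self.samples.append((latency_ms, status_code))

    def p95_ms(self) -> float:
        if not self.samples:
            return 0.0
        latencies = sorted(latency for latency, _ in self.samples)
        # Nearest-rank percentile: integer ceiling of 0.95 * n, 1-based.
        rank = (95 * len(latencies) + 99) // 100
        return latencies[rank - 1]

    def error_rates(self) -> dict:
        total = len(self.samples)
        if total == 0:
            return {"4xx": 0.0, "5xx": 0.0}
        client = sum(1 for _, status in self.samples if 400 <= status < 500)
        server = sum(1 for _, status in self.samples if 500 <= status < 600)
        return {"4xx": client / total, "5xx": server / total}
```

A middleware would call `record()` once per request with the measured latency and response status. A bounded `deque` keeps memory constant regardless of traffic volume.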

Pipeline and response format.

Deterministic data flow from request to alert. Every stage is observable, append-only, and recoverable.

# Pipeline
FastAPI Request
  → RequestMetricsMiddleware (latency + status)
  → Redis Streams (append-only event log)
  → Alert Engine (P95 + error rate + anomaly scoring)
  → /health/alerts (status: ok | warning | critical)

# JSON response from /health/alerts
{
  "status": "critical",
  "health_score": {"score": 23, "status": "critical", "trend": "degrading"},
  "metrics": {
    "overall_p95_ms": 2847.3,
    "error_rate": 0.19,
    "anomaly_score": 1.4,
    "sample_size": 187
  },
  "alerts": [
    {
      "type": "latency_spike",
      "severity": "critical",
      "reason_for_trigger": "P95 latency 2847ms exceeds threshold",
      "triggered_by": "absolute_threshold"
    }
  ]
}
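A consumer of this feed usually wants a one-line summary rather than the full payload. The sketch below shows one way to derive it from the fields shown above; `summarize` is a hypothetical helper, not part of the package. It accepts `health_score` either as an object or as a bare number, since both shapes appear on this page.

```python
def summarize(payload: dict) -> str:
    """Turn a /health/alerts payload into a one-line, human-readable summary."""
    status = payload.get("status", "unknown")
    health = payload.get("health_score")
    # health_score may be an object with a "score" key or a plain number.
    score = health.get("score") if isinstance(health, dict) else health
    alerts = payload.get("alerts", [])
    if not alerts:
        return f"[{status}] health {score}: no active alerts"
    reasons = "; ".join(
        alert.get("reason_for_trigger", alert.get("type", "alert")) for alert in alerts
    )
    return f"[{status}] health {score}: {reasons}"
```

Fed the critical response above, this yields a line naming the status, the score of 23, and the P95 trigger reason.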

Alerts where engineers actually are.

No new apps to install. No dashboards to check. Just the channel your team already uses.

💬

WhatsApp

Via Twilio or Sent.dm. The most reliable mobile interrupt channel globally. Recovery approvals arrive as tappable links.

Developer plan+
✈️

Telegram

Via Telegram Bot API. Available on all plans including Hobby. No per-message cost; flat rate. Instant delivery globally.

All plans
#

Slack

Webhook-based Slack integration for team notifications. Incidents posted to your channel with recovery link.

Startup plan+
🔗

Webhook

Generic HTTP webhook fallback. Fires when primary channel fails. Integrates with any endpoint.

All plans
📞

Voice

Automated voice call escalation via Twilio. Fires after configurable timeout if approval is not received.

Scale plan+
📋

Audit Trail

Every delivery attempt logged immutably. Success, failure, provider, timestamp. Full ledger per incident.

All plans

| Channel | Provider | Plan | Best For |
|---|---|---|---|
| WhatsApp | Sent.dm | Solo (default) | Zero-friction, instant setup |
| WhatsApp | Twilio | All | Enterprise existing accounts |
| Telegram | Telegram Bot API | All | Developers, North America |
| Slack | Incoming Webhooks | Startup+ | Team transparency |
| Webhook | HTTP POST | All | Slack/Teams/PagerDuty fallback |
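
The webhook fallback and the per-attempt delivery ledger described above can be sketched as an ordered list of senders. This is an assumed design for illustration only: `dispatch`, the `Sender` type, and the ledger fields are hypothetical, and real senders would call the Twilio, Sent.dm, or Telegram APIs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A sender delivers one message and raises an exception on failure.
Sender = Callable[[str], None]


def dispatch(message: str, channels: Sequence[Tuple[str, Sender]]) -> List[Dict[str, str]]:
    """Try each channel in order, stop at the first success, return the ledger."""
    ledger: List[Dict[str, str]] = []
    for name, send in channels:
        try:
            send(message)
        except Exception as exc:  # real senders would raise provider-specific errors
            ledger.append({"channel": name, "result": "failure", "error": str(exc)})
            continue
        ledger.append({"channel": name, "result": "success"})
        break
    return ledger
```

Passing `[("whatsapp", send_whatsapp), ("webhook", send_webhook)]` means the webhook fires only when the WhatsApp attempt raises, and both attempts end up in the returned ledger.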

Orchestrator Endpoints

RESTful endpoints for tenant management, incident audit, and human-authorized recovery.

| Method | Path | Description |
|---|---|---|
| GET | /health | Service health + Redis status |
| GET | /status | Active tenants, degraded mode, DLQ, stage gates |
| POST | /onboard | Register a new tenant |
| POST | /verify | Verify WhatsApp number |
| GET | /tenant/{id} | Get tenant status |
| GET | /tenant/{id}/contacts | Get contact verification status |
| POST | /tenant/{id}/test | Trigger test incident |
| GET | /audit/{incident_id} | Incident audit log (?tenant_id=) |
| GET | /delivery/{incident_id} | Delivery log (?tenant_id=) |
| GET | /dlq | Dead letter queue (Startup+) |
| GET | /action/recover | Preview recovery action (safe, no side effects) |
| POST | /action/recover/confirm | Execute recovery (irreversible, requires JWT) |

Environment Variables

Required and optional configuration for the managed orchestrator.

| Variable | Required | Description |
|---|---|---|
| REDIS_URL | Yes | Redis connection URL |
| ALERTENGINE_BASE_URL | Yes | Public URL of this orchestrator |
| ANTHROPIC_API_KEY | Yes | Claude AI API key for diagnosis |
| ALERT_SECRET | Yes | JWT signing secret (min 32 chars) |
| TWILIO_ACCOUNT_SID | Twilio only | Twilio account SID |
| TWILIO_AUTH_TOKEN | Twilio only | Twilio auth token |
| TWILIO_WHATSAPP_FROM | Twilio only | Sender WhatsApp number |
| SENT_API_KEY | Sent.dm only | Sent.dm API key |
| SENT_PHONE_ID | Sent.dm only | Sent.dm phone ID |
| LOOP_INTERVAL_S | Optional | Polling interval seconds (default: 5) |
| POLICY_MIN_SCORE_TO_ALERT | Optional | Min score to open incident (default: 70) |
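
As an illustration of how these variables might be validated at startup, here is a minimal sketch. The variable names and defaults come from the table above; the `Settings` class, the `load_settings` function, and the assumption that both optional values are integers are hypothetical. Provider-specific variables are omitted for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("REDIS_URL", "ALERTENGINE_BASE_URL", "ANTHROPIC_API_KEY", "ALERT_SECRET")


@dataclass(frozen=True)
class Settings:
    redis_url: str
    base_url: str
    anthropic_api_key: str
    alert_secret: str
    loop_interval_s: int = 5
    policy_min_score_to_alert: int = 70


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Fail fast with one message that lists every missing required variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    if len(env["ALERT_SECRET"]) < 32:
        raise RuntimeError("ALERT_SECRET must be at least 32 characters")
    return Settings(
        redis_url=env["REDIS_URL"],
        base_url=env["ALERTENGINE_BASE_URL"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        alert_secret=env["ALERT_SECRET"],
        # Optional values fall back to the documented defaults.
        loop_interval_s=int(env.get("LOOP_INTERVAL_S", "5")),
        policy_min_score_to_alert=int(env.get("POLICY_MIN_SCORE_TO_ALERT", "70")),
    )
```

Accepting any mapping rather than reading `os.environ` directly keeps the loader easy to test with a plain dictionary.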

Simple, honest pricing.

The SDK is free forever. Pay only for the managed orchestration layer.

Why not just build it yourself?

Hobby ($19/mo)

Free uptime tools tell you when your server is dead. They don't tell you when your payment rail is dropping 40% of traffic due to API latency. Building a sliding-window P95 tracker that catches business degradation takes days of engineering. At $19/mo you're buying back a week of development time for the price of two pizzas.

Developer ($99/mo)

Datadog APM for 3 microservices costs $93/mo and gives you a graph. AlertEngine gives you the graph, the WhatsApp message, the AI diagnosis, and the recovery authorization link. We pay for WhatsApp routing and Claude inference on your behalf.

Solo+ ($299/mo)

SOC 2 auditors ask: "Prove an automated script didn't modify production without oversight." AlertEngine shows the exact timestamped human authorization for every recovery action. If this shaves two days off your audit, it has paid for itself for the entire year.

Hobby ($19/mo)

1 service. 5 incidents. Telegram only. Ideal for evaluating AlertEngine on a small project.

Developer ($99/mo)

Single app. WhatsApp or Telegram. AI diagnosis. Ideal for side projects going live.

Startup ($799/mo)

Up to 10 apps. Team approvals. Multi-channel delivery. SLA included.

Scale ($1,500/mo)

Unlimited apps. Custom thresholds. Dedicated support. Compliance exports.

Enterprise (custom pricing)

On-premise. SSO. Custom SLA. Dedicated account manager.

Human-Authorized. Always.

No automated remediation. No background execution. Every recovery action requires explicit human authorization.

🔑

JWT Recovery Tokens

Every recovery action is gated by a tenant-scoped JWT with a 5-minute TTL. Tokens are single-use and validated atomically in Redis, so a replayed token is rejected.
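
To show what a tenant-scoped token with a 5-minute TTL involves, here is a standard-library sketch of HS256 signing and verification. It is an illustration only: the claim names (`tenant_id`, `incident_id`, `jti`) and both function names are assumptions, and a production service would more likely use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: str, signing_input: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: str, tenant_id: str, incident_id: str, jti: str,
                now: float, ttl_s: int = 300) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    # exp is "now + TTL"; 300 seconds matches the 5-minute TTL described above.
    claims = {"tenant_id": tenant_id, "incident_id": incident_id,
              "jti": jti, "exp": int(now) + ttl_s}
    signing_input = (_b64url(json.dumps(header).encode()) + "."
                     + _b64url(json.dumps(claims).encode()))
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: str, token: str, tenant_id: str, now: float) -> dict:
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(secret, header_b64 + "." + claims_b64)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    if claims["tenant_id"] != tenant_id:
        raise ValueError("token belongs to another tenant")
    return claims
```

In use, `now` would be `time.time()`. The single-use guarantee is a separate step: after verification, the `jti` would be claimed atomically in a shared store so that a second presentation of the same token fails.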

👁

Preview Before Authorization

GET the recovery link to see exactly what will happen. POST to execute. The preview is read-only; irreversible actions are always a separate, explicit step.

🔒

Cross-Tenant Isolation

All endpoints enforce tenant ownership. An adversarial audit confirmed that attempting to access another tenant's incidents always returns 403.

📋

Immutable Audit Trail

Every alert, diagnosis, delivery attempt, and recovery authorization is written to an append-only log with 7-day retention. Nothing happens silently.
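
The shape of an append-only, tenant-scoped log with a retention window can be sketched in a few lines. This in-memory version is purely illustrative: the `AuditEntry` fields and `AuditLog` methods are hypothetical, and the source describes the real log only as append-only with 7-day retention.

```python
from dataclasses import dataclass
from typing import List, Tuple

RETENTION_S = 7 * 24 * 3600  # 7-day retention, as described above


@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class AuditEntry:
    ts: float
    tenant_id: str
    incident_id: str
    event: str


class AuditLog:
    """Exposes append and read only; there is no update or delete method."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def append(self, entry: AuditEntry) -> None:
        self._entries.append(entry)

    def for_incident(self, tenant_id: str, incident_id: str, now: float) -> Tuple[AuditEntry, ...]:
        cutoff = now - RETENTION_S
        # Filter on tenant as well as incident so one tenant never sees another's rows.
        return tuple(
            e for e in self._entries
            if e.tenant_id == tenant_id and e.incident_id == incident_id and e.ts >= cutoff
        )
```

Returning a tuple of frozen entries means callers cannot alter history through the value they receive.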

Survived a full adversarial audit.

An autonomous AI agent acted as a hostile tenant and attempted to break isolation, replay tokens, and flood the system. 10/10 passed.

| Check | Result | Detail |
|---|---|---|
| Cross-tenant audit access | ✓ Blocked | 403 returned |
| Cross-tenant delivery access | ✓ Blocked | 403 returned |
| Recovery token replay (20 concurrent) | ✓ Protected | 1 succeeded, 19 rejected |
| Duplicate incident creation (race) | ✓ Protected | Exactly 1 created |
| Concurrent token flood | ✓ Handled | Atomic Redis SET NX |
| Natural incident detection | ✓ Confirmed | End-to-end verified |
| WhatsApp delivery | ✓ Confirmed | Live production delivery |
| Recovery authorization audit trail | ✓ Written | Immutable append-only log |
| Degraded mode handling | ✓ Confirmed | NORMAL/DEGRADED/EMERGENCY |
| Lease renewal under load | ✓ Atomic | Lua compare-and-delete |

✓ Cross-Tenant Isolation

Attempted unauthorized access to another tenant's incidents. System returned 403 on every request.

✓ Replay Attack (20 Concurrent)

Flooded recovery endpoint with 20 identical JWT tokens. Exactly 1 succeeded; 19 were atomically rejected via Redis SET NX.

✓ Natural Incident Detection

Simulated latency spike and error-rate surge. Detection triggered within 5 seconds with correct severity classification.

✓ Recovery Authorization Audit Trail

Every preview GET and confirm POST was logged with timestamp, tenant ID, and JWT fingerprint. Trail was immutable.

✓ DLQ Plan Enforcement

Attempted to access dead-letter queue on Hobby plan. Endpoint correctly returned plan-gated denial.

✓ Concurrent Token Flood

Race-condition test: 50 concurrent authorization attempts for the same incident. No double-execution occurred.

✓ Circuit Breaker Resilience

Redis outage simulation. SDK entered memory-fallback mode without crashing the host FastAPI application.

✓ Monthly Counter Reset

Simulated 30-day boundary crossing. Incident counters reset atomically; no plan-limit bypass possible.

✓ Test Incident Plan Enforcement

Hobby tenant attempted to trigger AI diagnosis (not available on Hobby). System blocked with clear plan-upgrade messaging.

✓ WhatsApp Delivery Confirmed

Live end-to-end delivery to verified number. Recovery link delivered. Human authorization confirmed working.
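
The replay check described above (20 concurrent presentations of one token, exactly 1 accepted) depends on an atomic set-if-absent operation. The sketch below reproduces that property in memory with a lock standing in for Redis SET with NX. The `SingleUseRegistry` class and `replay` helper are hypothetical and are not the audit harness itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SingleUseRegistry:
    """In-memory stand-in for an atomic set-if-absent operation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._used = set()

    def claim(self, token_id: str) -> bool:
        # Check and set happen under one lock, so only the first caller wins.
        with self._lock:
            if token_id in self._used:
                return False
            self._used.add(token_id)
            return True


def replay(registry: SingleUseRegistry, token_id: str, attempts: int = 20) -> list:
    """Present the same token id from many threads and collect the outcomes."""
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        return list(pool.map(lambda _: registry.claim(token_id), range(attempts)))
```

The essential point is that the check and the write are one indivisible step. A separate "does it exist?" read followed by a write would let two concurrent requests both pass the check.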

Repository Structure

Clean separation between the free SDK and the paid orchestrator. MIT-licensed core; commercial managed layer.

fastapi_alertengine/     ← Free PyPI package
  middleware.py          ← RequestMetricsMiddleware
  engine.py              ← Core alert engine
  intelligence.py        ← Adaptive thresholds, health scoring
  storage.py             ← Redis Streams persistence

orchestrator/            ← Paid managed service
  loop.py                ← Multi-tenant polling
  pipeline.py            ← Incident state machine
  claude_engine.py       ← AI diagnosis
  notifications.py       ← Multi-channel dispatch
  providers/             ← WhatsApp, Telegram, Slack, Webhook
  audit.py               ← Immutable forensic log
  plans.py               ← Billing tiers and feature gates

tests/                   ← 232 tests, Python 3.10/3.11/3.12
docs/                    ← This landing page

Built for compliance and auditability.

AlertEngine is designed for teams where operational decisions must be documented and defensible.

Suitable for fintech, regulated industries, and teams preparing for SOC 2 or ISO 27001.

🇿🇼

Built in Zimbabwe for mobile-first operational reality.

A latency spike in Zimbabwe means a customer walks away mid-transaction. We built FastAPI AlertEngine while running a WhatsApp-native commerce platform, where every failure was visible first on mobile. Mobile-first isn't a design choice here. It's the infrastructure constraint.

10/10
Adversarial checks passed including replay attacks and cross-tenant isolation
Live
Live fintech platform monitored in production: real workloads, real tenants
232
Tests passing across the full SDK and orchestration suite
5s
End-to-end detection latency from spike to WhatsApp alert

Your infrastructure should message you directly.

The SDK is free and takes one line. The managed layer is ready when you are.

$ pip install fastapi-alertengine