AlertEngine replaces dashboards with operational conversations. Detect degradation. Get a WhatsApp message. Tap to approve the fix. Everything audited.
From anomaly detection to authorized recovery, entirely through your phone.
Add instrument(app). P95 latency, error rate, and health scoring start immediately. The orchestrator polls every 5 seconds.
Claude AI analyses the incident context and produces a plain English root-cause assessment with a recovery recommendation.
WhatsApp or Telegram message arrives. Plain English description of what broke, why, and what to do. Recovery link included.
Preview the recovery action. Tap to authorize. Nothing executes without your explicit approval. The system waits for you.
Every stage transition, delivery attempt, and authorization is logged immutably. Full incident history per tenant.
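
The P95 latency tracking mentioned in the first step can be illustrated with a small sliding-window tracker. This is only a sketch under assumptions: the class name `SlidingP95`, the 60-second window, and the nearest-rank percentile method are hypothetical choices for illustration, not the SDK's actual internals.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingP95:
    """Hypothetical tracker: keeps latencies from the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        # (timestamp, latency_ms) pairs, oldest first
        self._samples: Deque[Tuple[float, float]] = deque()

    def record(self, latency_ms: float, now: Optional[float] = None) -> None:
        """Store one latency sample and drop samples that left the window."""
        now = time.monotonic() if now is None else now
        self._samples.append((now, latency_ms))
        self._evict(now)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_s
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def p95(self, now: Optional[float] = None) -> Optional[float]:
        """Return the nearest-rank 95th percentile, or None if the window is empty."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._samples:
            return None
        ordered = sorted(latency for _, latency in self._samples)
        # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]
```

Sorting on every read keeps the sketch short; a production tracker would more likely use a streaming quantile estimate to avoid the per-call sort.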
Everything you need to understand what your API is doing right now. No account. No cloud. No catch.
Deterministic data flow from request to alert. Every stage is observable, append-only, and recoverable.
No new apps to install. No dashboards to check. Just the channel your team already uses.
Via Twilio or Sent.dm. The most reliable mobile interrupt channel globally. Recovery approvals arrive as tappable links.
Via Telegram Bot API. Available on all plans including Hobby. No per-message cost: flat rate. Instant delivery globally.
Webhook-based Slack integration for team notifications. Incidents posted to your channel with recovery link.
Generic HTTP webhook fallback. Fires when primary channel fails. Integrates with any endpoint.
Automated voice call escalation via Twilio. Fires after configurable timeout if approval is not received.
Every delivery attempt logged immutably. Success, failure, provider, timestamp. Full ledger per incident.
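
The fallback and ledger behaviour described above (try the primary channel, fall through to the webhook on failure, record every attempt) can be sketched as follows. The names `DeliveryAttempt` and `deliver` are hypothetical, and the ledger is a plain in-memory list rather than the orchestrator's real store.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class DeliveryAttempt:
    """One immutable ledger row: success or failure, provider, timestamp."""
    incident_id: str
    provider: str
    success: bool
    timestamp: float
    error: str = ""


def deliver(
    incident_id: str,
    message: str,
    channels: List[Tuple[str, Callable[[str], None]]],
    ledger: List[DeliveryAttempt],
) -> bool:
    """Try each channel in order; append one ledger row per attempt."""
    for provider, send in channels:
        try:
            send(message)
        except Exception as exc:  # any provider failure triggers the next channel
            ledger.append(
                DeliveryAttempt(incident_id, provider, False, time.time(), str(exc))
            )
            continue
        ledger.append(DeliveryAttempt(incident_id, provider, True, time.time()))
        return True
    return False  # every channel failed
```

Each `send` callable stands in for a real provider client; passing the channels as an ordered list keeps the fallback order explicit and easy to audit.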
| Channel | Provider | Plan | Best For |
|---|---|---|---|
| WhatsApp | Sent.dm | Solo (default) | Zero-friction, instant setup |
| WhatsApp | Twilio | All | Enterprise existing accounts |
| Telegram | Telegram Bot API | All | Developers, North America |
| Slack | Incoming Webhooks | Startup+ | Team transparency |
| Webhook | HTTP POST | All | Slack/Teams/PagerDuty fallback |
RESTful endpoints for tenant management, incident audit, and human-authorized recovery.
| Method | Path | Description |
|---|---|---|
| GET | /health | Service health + Redis status |
| GET | /status | Active tenants, degraded mode, DLQ, stage gates |
| POST | /onboard | Register a new tenant |
| POST | /verify | Verify WhatsApp number |
| GET | /tenant/{id} | Get tenant status |
| GET | /tenant/{id}/contacts | Get contact verification status |
| POST | /tenant/{id}/test | Trigger test incident |
| GET | /audit/{incident_id} | Incident audit log (query: `?tenant_id=`) |
| GET | /delivery/{incident_id} | Delivery log (query: `?tenant_id=`) |
| GET | /dlq | Dead letter queue (Startup+) |
| GET | /action/recover | Preview recovery action (safe, no side effects) |
| POST | /action/recover/confirm | Execute recovery (irreversible, requires JWT) |
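
As a rough illustration of the two-step recovery flow in the last two rows, the sketch below builds the preview and confirm requests without sending them. The `token` query parameter and the example base URL are assumptions; the table only states that the confirm step requires a JWT, not how it is passed.

```python
from typing import Tuple
from urllib.parse import urlencode
from urllib.request import Request


def build_recovery_requests(base_url: str, token: str) -> Tuple[Request, Request]:
    """Build (preview, confirm) requests for a recovery action.

    Assumption: the JWT travels as a `token` query parameter.
    """
    root = base_url.rstrip("/")
    query = urlencode({"token": token})
    # Step 1: read-only preview, safe to open from a tapped link.
    preview = Request(f"{root}/action/recover?{query}", method="GET")
    # Step 2: explicit, irreversible confirmation.
    confirm = Request(f"{root}/action/recover/confirm?{query}", data=b"", method="POST")
    return preview, confirm
```

Passing either request to `urllib.request.urlopen` would send it; keeping construction separate from sending mirrors the preview-then-confirm split.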
Required and optional configuration for the managed orchestrator.
| Variable | Required | Description |
|---|---|---|
| REDIS_URL | Yes | Redis connection URL |
| ALERTENGINE_BASE_URL | Yes | Public URL of this orchestrator |
| ANTHROPIC_API_KEY | Yes | Claude AI API key for diagnosis |
| ALERT_SECRET | Yes | JWT signing secret (min 32 chars) |
| TWILIO_ACCOUNT_SID | Twilio only | Twilio account SID |
| TWILIO_AUTH_TOKEN | Twilio only | Twilio auth token |
| TWILIO_WHATSAPP_FROM | Twilio only | Sender WhatsApp number |
| SENT_API_KEY | Sent.dm only | Sent.dm API key |
| SENT_PHONE_ID | Sent.dm only | Sent.dm phone ID |
| LOOP_INTERVAL_S | Optional | Polling interval seconds (default: 5) |
| POLICY_MIN_SCORE_TO_ALERT | Optional | Min score to open incident (default: 70) |
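
A minimal sketch of how these variables might be read and validated at startup. The variable names, defaults, and the 32-character minimum come from the table; the `OrchestratorConfig` class and `load_config` function are hypothetical, and provider-specific variables are omitted for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("REDIS_URL", "ALERTENGINE_BASE_URL", "ANTHROPIC_API_KEY", "ALERT_SECRET")


@dataclass(frozen=True)
class OrchestratorConfig:
    redis_url: str
    base_url: str
    anthropic_api_key: str
    alert_secret: str
    loop_interval_s: float = 5.0   # LOOP_INTERVAL_S default
    min_score_to_alert: int = 70   # POLICY_MIN_SCORE_TO_ALERT default


def load_config(env: Mapping[str, str] = os.environ) -> OrchestratorConfig:
    """Validate required variables and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    if len(env["ALERT_SECRET"]) < 32:
        raise ValueError("ALERT_SECRET must be at least 32 characters")
    return OrchestratorConfig(
        redis_url=env["REDIS_URL"],
        base_url=env["ALERTENGINE_BASE_URL"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        alert_secret=env["ALERT_SECRET"],
        loop_interval_s=float(env.get("LOOP_INTERVAL_S", "5")),
        min_score_to_alert=int(env.get("POLICY_MIN_SCORE_TO_ALERT", "70")),
    )
```

Failing fast on a missing or short secret surfaces misconfiguration at boot rather than at the first recovery attempt.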
The SDK is free forever. Pay only for the managed orchestration layer.
Hobby: $19/mo
Free uptime tools tell you when your server is dead. They don't tell you when your payment rail is dropping 40% of traffic due to API latency. Building a sliding-window P95 tracker that catches business degradation takes days of engineering. At $19/mo you're buying back a week of development time for the price of two pizzas.
Developer: $99/mo
Datadog APM for 3 microservices costs $93/mo and gives you a graph. AlertEngine gives you the graph, the WhatsApp message, the AI diagnosis, and the recovery authorization link. We pay for WhatsApp routing and Claude inference on your behalf.
Solo+: $299/mo
SOC 2 auditors ask: "Prove an automated script didn't modify production without oversight." AlertEngine shows the exact timestamped human authorization for every recovery action. If this shaves two days off your audit, it has paid for itself for the entire year.
1 service. 5 incidents. Telegram only. Ideal for evaluating AlertEngine on a small project.
Single app. WhatsApp or Telegram. AI diagnosis. Ideal for side projects going live.
Up to 3 apps. Priority diagnosis. Full audit trail. Best for indie developers running real products.
No automated remediation. No background execution. Every recovery action requires explicit human authorization.
Every recovery action is gated by a tenant-scoped JWT with a 5-minute TTL. Tokens are single-use and validated atomically in Redis, so replay is not possible.
GET the recovery link to see exactly what will happen. POST to execute. The preview is read-only, and irreversible actions are always a separate, explicit step.
All endpoints enforce tenant ownership. An adversarial audit confirmed: attempting to access another tenant's incidents returns 403, always.
Every alert, diagnosis, delivery attempt, and recovery authorization is written to an append-only log with 7-day retention. Nothing happens silently.
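
The single-use recovery token described above relies on an atomic "set if not exists" operation. The sketch below shows the idea with an in-memory stand-in whose `set(key, value, nx=True, ex=ttl)` call mimics the shape of a Redis client's `SET ... NX EX`. The key format, `claim_token`, and `InMemoryStore` are assumptions for illustration, not the orchestrator's actual code.

```python
import threading
import time
from typing import Dict, List, Optional

TOKEN_TTL_S = 300  # 5-minute TTL, as stated in the text


class InMemoryStore:
    """Thread-safe stand-in for a Redis client's `set(..., nx=True, ex=...)`."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._expiry: Dict[str, float] = {}

    def set(self, key: str, value: str, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]:
        # `value` is accepted for API parity; this sketch tracks key existence only.
        now = time.monotonic()
        with self._lock:
            expires = self._expiry.get(key)
            exists = expires is not None and expires > now
            if nx and exists:
                return None  # like Redis: NX fails when the key already exists
            self._expiry[key] = (now + ex) if ex is not None else float("inf")
            return True


def claim_token(store: InMemoryStore, tenant_id: str, token_id: str) -> bool:
    """Return True only for the first claim of a token within its TTL."""
    key = f"recovery:used:{tenant_id}:{token_id}"  # hypothetical key layout
    return store.set(key, "1", nx=True, ex=TOKEN_TTL_S) is True


def replay(store: InMemoryStore, attempts: int = 20) -> List[bool]:
    """Fire `attempts` concurrent claims for the same token."""
    results: List[bool] = []

    def attempt() -> None:
        results.append(claim_token(store, "tenant-a", "jti-123"))

    threads = [threading.Thread(target=attempt) for _ in range(attempts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the existence check and the write happen as one atomic step, concurrent replays of the same token cannot both succeed. Scoping the key by tenant keeps one tenant's token IDs from colliding with another's.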
An autonomous AI agent acted as a hostile tenant and attempted to break isolation, replay tokens, and flood the system. 10/10 passed.
| Check | Result | Detail |
|---|---|---|
| Cross-tenant audit access | Blocked | 403 returned |
| Cross-tenant delivery access | Blocked | 403 returned |
| Recovery token replay (20 concurrent) | Protected | 1 succeeded, 19 rejected |
| Duplicate incident creation (race) | Protected | Exactly 1 created |
| Concurrent token flood | Handled | Atomic Redis SET NX |
| Natural incident detection | Confirmed | End-to-end verified |
| WhatsApp delivery | Confirmed | Live production delivery |
| Recovery authorization audit trail | Written | Immutable append-only log |
| Degraded mode handling | Confirmed | NORMAL/DEGRADED/EMERGENCY |
| Lease renewal under load | Atomic | Lua compare-and-delete |
Attempted unauthorized access to another tenant's incidents. System returned 403 on every request.
Flooded recovery endpoint with 20 identical JWT tokens. Exactly 1 succeeded; 19 were atomically rejected via Redis SET NX.
Simulated latency spike and error-rate surge. Detection triggered within 5 seconds with correct severity classification.
Every preview GET and confirm POST was logged with timestamp, tenant ID, and JWT fingerprint. Trail was immutable.
Attempted to access dead-letter queue on Hobby plan. Endpoint correctly returned plan-gated denial.
Race-condition test: 50 concurrent authorization attempts for the same incident. No double-execution occurred.
Redis outage simulation. SDK entered memory-fallback mode without crashing the host FastAPI application.
Simulated 30-day boundary crossing. Incident counters reset atomically; no plan-limit bypass possible.
Hobby tenant attempted to trigger AI diagnosis (not available on Hobby). System blocked with clear plan-upgrade messaging.
Live end-to-end delivery to verified number. Recovery link delivered. Human authorization confirmed working.
Clean separation between the free SDK and the paid orchestrator. MIT-licensed core; commercial managed layer.
AlertEngine is designed for teams where operational decisions must be documented and defensible.
Suitable for fintech, regulated industries, and teams preparing for SOC 2 or ISO 27001.
Built in Zimbabwe for mobile-first operational reality.
A latency spike in Zimbabwe means a customer walks away mid-transaction. We built FastAPI AlertEngine while running a WhatsApp-native commerce platform, where every failure was visible first on mobile. Mobile-first isn't a design choice here. It's the infrastructure constraint.
The SDK is free and takes one line. The managed layer is ready when you are.