Incident context assembled before the engineer opens their laptop
When a P1 fires, Doe pulls the PagerDuty alert, correlates Datadog metrics and deploy diffs, maps the blast radius via New Relic, and delivers a brief with the likely cause and rollback steps within 2 minutes.
Alert details from PagerDuty, correlated metrics from Datadog, and service health from New Relic are assembled into an incident brief, complete with the likely cause, blast radius, and suggested rollback steps, before the on-call engineer starts investigating.
What changes
| Dimension | Before | With Doe |
|---|---|---|
| Context-gathering time | 20-30 minutes opening tabs and cross-referencing timestamps | Incident brief ready within 2 minutes of the PagerDuty alert |
| Deploy correlation | Engineer manually checks deploy logs for recent changes | Recent deploys correlated with metric changes automatically |
| Blast radius visibility | Downstream services discovered as they start failing | Affected downstream services identified in the initial brief |
| Runbook access | Engineer searches Confluence for the right runbook mid-incident | Relevant runbook linked in the brief based on the service and failure type |
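The deploy-correlation row above amounts to a time-window match: flag any deploy that landed shortly before the metric anomaly began. A minimal sketch, with hypothetical function names, timestamps, and a 15-minute window chosen for illustration:

```python
from datetime import datetime, timedelta

def correlate_deploys(anomaly_start, deploys, window_minutes=15):
    """Return deploys that landed shortly before the metric anomaly began.

    anomaly_start: datetime when the metric first breached its baseline.
    deploys: list of (service, commit_message, deployed_at) tuples.
    """
    window = timedelta(minutes=window_minutes)
    return [d for d in deploys
            if anomaly_start - window <= d[2] <= anomaly_start]

# Example mirroring the scenario on this page: CPU spiked at 2:11 AM,
# three minutes after a 2:08 AM deploy. Dates are placeholders.
anomaly = datetime(2024, 1, 1, 2, 11)
deploys = [
    ("payment-api", "add retry logic to payment processor",
     datetime(2024, 1, 1, 2, 8)),
    ("billing-service", "bump logging level",
     datetime(2023, 12, 31, 22, 0)),
]
suspects = correlate_deploys(anomaly, deploys)
```

A real pipeline would pull these events from the deploy log and Datadog rather than hard-coding them; the matching logic is the same.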
How Doe builds the incident brief
From PagerDuty: P1 alert fired at 2:14 AM. Service: payment-api. Escalation policy: on-call SRE (J. Park). Alert details: "payment-api response time exceeded 5s threshold for 3 consecutive checks." Previous incident on this service: 11 days ago (bad config deploy).
From Datadog: payment-api CPU spiked to 94% at 2:11 AM (baseline: 35%). Error rate jumped from 0.2% to 12.4%. Last deploy to payment-api: 2:08 AM by deploy-bot (commit: "add retry logic to payment processor"). 2 other services showing elevated error rates: checkout-api and billing-service (both downstream of payment-api).
From New Relic: payment-api Apdex dropped to 0.12 (normal: 0.94). Transaction trace shows the retry loop in processPayment() averaging 8.3s per call (normal: 120ms). Upstream dependency map confirms checkout-api and billing-service are healthy independently but failing on calls to payment-api.
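The blast-radius step, walking the dependency map outward from the failing service, can be sketched as a plain graph traversal. The mapping below is a hypothetical stand-in for what a dependency map from New Relic would provide:

```python
from collections import deque

def blast_radius(failing_service, dependents):
    """Collect every service downstream of the failing one via BFS.

    dependents maps a service to the services that call it directly
    (i.e., its downstream dependents).
    """
    affected, queue = set(), deque([failing_service])
    while queue:
        svc = queue.popleft()
        for downstream in dependents.get(svc, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

# Dependency map mirroring this page's example: checkout-api and
# billing-service both call payment-api.
dependents = {"payment-api": ["checkout-api", "billing-service"]}
affected = blast_radius("payment-api", dependents)
```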
Incident brief: the payment-api deploy at 2:08 AM introduced a retry loop in processPayment() consuming CPU and cascading errors to checkout-api and billing-service. Suggested action: roll back commit abc123. Deploy diff, Datadog and New Relic dashboards, and the payment-api rollback runbook linked.
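The brief itself is just structured data rendered for the incident channel. A minimal sketch of that last assembly step, with illustrative field names and link labels:

```python
def render_brief(service, cause, affected, rollback_commit, links):
    """Render collected incident signals as a plain-text brief."""
    lines = [
        f"Incident brief: {service}",
        f"Likely cause: {cause}",
        f"Blast radius: {', '.join(sorted(affected))}",
        f"Suggested action: roll back commit {rollback_commit}",
        "Links: " + " | ".join(links),
    ]
    return "\n".join(lines)

# Values mirror the example on this page.
brief = render_brief(
    service="payment-api",
    cause="2:08 AM deploy introduced a retry loop in processPayment()",
    affected={"checkout-api", "billing-service"},
    rollback_commit="abc123",
    links=["deploy diff", "Datadog dashboard", "rollback runbook"],
)
```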
The first 30 minutes of every incident are wasted on context-gathering
PagerDuty fires at 2 AM. The on-call engineer opens Datadog, then New Relic, then checks the deploy log. They cross-reference timestamps, look for correlated service errors, and try to figure out what changed. Meanwhile, the service is down and the incident channel is filling up with "any update?" messages.
The last P1 took 47 minutes to resolve. Of those, 28 minutes were spent gathering context: which service, what changed, when it started. The actual fix took 19 minutes.
Get started in under 10 minutes
Connect your tools
One-click OAuth for each integration. No API keys, no engineering.
Describe what you need
“When a P1 alert fires in PagerDuty, pull the alert details, grab CPU, error rate, and recent deploys from Datadog, check service health in New Relic, and assemble an incident brief with the likely cause, affected services, and suggested rollback steps. Link the deploy diff and the relevant runbook.”
It runs on schedule
Triggered on every P1 PagerDuty alert. Brief available within 2 minutes.
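Triggering only on P1s means filtering incoming alert payloads by priority before doing any work. A sketch assuming a PagerDuty-style webhook body (the payload shape here is simplified; check the path against your webhook event version):

```python
def is_p1(payload):
    """Return True when the alert payload carries P1 priority.

    The nested path below is a simplified stand-in for a real
    PagerDuty webhook body, not a guaranteed schema.
    """
    priority = (payload.get("event", {})
                       .get("data", {})
                       .get("priority") or {})
    return priority.get("summary") == "P1"
```

Lower-priority alerts fall through this check, so the brief-building pipeline only spends API calls on incidents that page someone.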
Incident Response Brief FAQ
How fast is the brief available after a P1 fires?
Typically within 2 minutes. The bottleneck is API response times from Datadog and New Relic, not processing. Most briefs are ready before the on-call engineer finishes reading the PagerDuty notification.
Stop doing the work your tools should do for you.
Set it up once. Doe runs it every time.