Overview
Alert fatigue is the silent killer of SOC effectiveness. When analysts face thousands of alerts per day, having a clear, repeatable triage process is the difference between catching a real attack and missing it in the noise. This runbook standardizes how alerts are classified, prioritized, investigated, and dispositioned so that every analyst follows the same methodology regardless of experience level or time of day.
Triage Workflow Steps
- Receive and acknowledge the alert within the SLA window
- Validate the alert against enrichment data: asset context, user behavior, threat intelligence
- Classify the alert using the severity matrix and MITRE ATT&CK mapping
- Determine if the alert is a true positive, false positive, or requires further investigation
- For true positives, escalate to incident response and open an incident ticket
- For false positives, document the root cause and submit a tuning request
- For uncertain alerts, conduct deeper investigation within the defined time window
- Close the alert with a documented disposition and investigation notes
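The workflow above can be sketched as a small disposition function. This is a minimal illustration, not any specific SOAR playbook; the `Alert` fields and verdict strings are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical alert record; field names are illustrative, not tied to any SIEM schema.
@dataclass
class Alert:
    alert_id: str
    disposition: str = "open"   # open | true_positive | false_positive | uncertain
    notes: list = field(default_factory=list)

def triage(alert: Alert, verdict: str, note: str) -> str:
    """Record the disposition and investigation note, then return the next action
    per the runbook: escalate true positives, tune false positives, dig into the rest."""
    alert.disposition = verdict
    alert.notes.append(note)
    if verdict == "true_positive":
        return "escalate_to_ir"         # open an incident ticket
    if verdict == "false_positive":
        return "submit_tuning_request"  # document root cause first
    return "deep_investigation"         # uncertain: investigate within the time window
```

For example, `triage(Alert("A-1001"), "false_positive", "Known backup job")` returns `"submit_tuning_request"`, and the alert carries its documented disposition and note when closed.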
Alert Classification Matrix
| Category | Examples | Priority | Response window |
|---|---|---|---|
| Exfiltration | Large data transfer to external IP, DNS tunneling, cloud storage upload | Critical | 15 minutes |
| Execution | PowerShell encoded commands, fileless malware, living-off-the-land binaries | High | 30 minutes |
| Credential Access | Brute force, pass-the-hash, credential dumping tools detected | High | 30 minutes |
| Lateral Movement | RDP from unexpected source, PsExec usage, SMB scanning | High | 1 hour |
| Initial Access | Phishing link clicked, exploit attempt on external service | Medium | 2 hours |
| Persistence | Scheduled task creation, registry modification, startup folder changes | Medium | 4 hours |
| Reconnaissance | Port scanning, directory enumeration, DNS zone transfer attempt | Low | 8 hours |
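The matrix above can be encoded as a simple lookup so that classification and SLA assignment are consistent across analysts and automation. A minimal sketch; category keys and the default fallback are assumptions for illustration.

```python
# Encodes the classification matrix above; response windows are in minutes.
SEVERITY_MATRIX = {
    "exfiltration":      ("Critical", 15),
    "execution":         ("High", 30),
    "credential_access": ("High", 30),
    "lateral_movement":  ("High", 60),
    "initial_access":    ("Medium", 120),
    "persistence":       ("Medium", 240),
    "reconnaissance":    ("Low", 480),
}

def classify(category: str) -> tuple:
    """Return (priority, response_window_minutes) for a category.
    Unknown categories default to Medium / 120 minutes (an assumed fallback)."""
    key = category.lower().replace(" ", "_")
    return SEVERITY_MATRIX.get(key, ("Medium", 120))
```

Keeping the matrix in code (or config) means a SOAR playbook can stamp the SLA onto the ticket at alert creation rather than relying on analyst memory.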
Enrichment and Context Gathering
Before making a triage decision, enrich the alert with context:
- Look up the source IP in threat intelligence feeds.
- Check the user account against HR records for role and department.
- Verify whether the asset is a high-value target such as a domain controller or database server.
- Review recent alert history for the same source to identify patterns.
- Check the time of activity against normal working hours.
A single data point rarely tells the whole story. The best triage decisions come from combining multiple context signals before rendering a verdict.
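Combining those context checks can be sketched as a signal collector. The alert dictionary keys here are illustrative assumptions; a real implementation would query TI feeds, the HR system, and the CMDB rather than reading precomputed fields.

```python
def enrichment_signals(alert: dict) -> list:
    """Collect context flags from the enrichment checks described above.
    Keys on `alert` are hypothetical; real lookups would hit TI, HR, and CMDB APIs."""
    signals = []
    if alert.get("src_ip_on_ti_feed"):
        signals.append("known_bad_ip")
    if alert.get("asset_tier") == "high_value":   # e.g. domain controller, DB server
        signals.append("high_value_asset")
    if alert.get("prior_alerts_24h", 0) >= 3:     # repeated hits from the same source
        signals.append("repeat_source")
    if alert.get("outside_working_hours"):
        signals.append("off_hours_activity")
    return signals
```

An alert with two or more flags generally deserves a closer look than any single flag would on its own, which is the point of combining signals before rendering a verdict.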
Handling False Positives
False positives are not just noise; they are a tuning opportunity. When you close an alert as a false positive, document exactly why it was false. Identify the root cause: was it a misconfigured rule, a legitimate business process, or missing allow-list entries? Submit a structured tuning request to the detection engineering team with the alert ID, false positive evidence, and a proposed rule modification. Track false positive rates per detection rule to identify the noisiest rules. A well-tuned SIEM should have a false positive rate below 30 percent.
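The structured tuning request described above might look like the following sketch. The field names and the three root-cause values mirror this section; the validation behavior is an assumption about how such a helper could be built.

```python
# Root causes named in this runbook; values are illustrative identifiers.
ROOT_CAUSES = {"misconfigured_rule", "legitimate_business_process", "missing_allowlist_entry"}

def build_tuning_request(alert_id: str, rule_id: str, root_cause: str,
                         evidence: str, proposed_change: str) -> dict:
    """Assemble a structured tuning request for detection engineering.
    Rejects unrecognized root causes so every request stays categorizable."""
    if root_cause not in ROOT_CAUSES:
        raise ValueError(f"root_cause must be one of {sorted(ROOT_CAUSES)}")
    return {
        "alert_id": alert_id,
        "rule_id": rule_id,
        "root_cause": root_cause,
        "evidence": evidence,
        "proposed_change": proposed_change,
    }
```

Forcing a root-cause category on every request is what makes per-rule false positive rates trackable later: free-text reasons cannot be aggregated.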
Metrics and Continuous Improvement
- Track mean time to acknowledge (MTTA), mean time to triage (MTTT), and disposition accuracy
- Measure the ratio of true positives to false positives by detection rule and category
- Monitor alert volume trends to anticipate capacity needs
- Review escalation accuracy: how often do escalated alerts turn into confirmed incidents?
- Conduct weekly triage reviews where the team discusses interesting or ambiguous alerts
- Use triage metrics to identify training gaps and optimize analyst workflows
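The timing metrics above (MTTA and MTTT) reduce to simple arithmetic over alert timestamps. A minimal sketch, assuming each alert record carries `created`, `acknowledged`, and `triaged` epoch-second timestamps; the schema is illustrative.

```python
from statistics import mean

def triage_metrics(alerts: list) -> dict:
    """Compute mean time to acknowledge (MTTA) and mean time to triage (MTTT)
    in minutes from per-alert epoch-second timestamps (hypothetical schema)."""
    mtta = mean((a["acknowledged"] - a["created"]) / 60 for a in alerts)
    mttt = mean((a["triaged"] - a["created"]) / 60 for a in alerts)
    return {"mtta_minutes": round(mtta, 1), "mttt_minutes": round(mttt, 1)}
```

Trending these per severity tier, rather than as one global average, shows whether Critical alerts are actually being acknowledged inside their 15-minute window.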
Frequently Asked Questions
How long should alert triage take per alert?
Initial triage should take 5 to 15 minutes for a straightforward alert. Complex alerts requiring deeper investigation may take 30 to 60 minutes. Set SLAs by alert severity rather than a single time target for all alerts.
What tools are essential for efficient alert triage?
A SIEM with good search and correlation capabilities, a SOAR platform for automated enrichment and playbook execution, threat intelligence feeds for IP and domain lookups, and an asset/CMDB integration for context about affected systems and users.
How do we reduce alert fatigue?
Tune detection rules aggressively to reduce false positives, automate enrichment so analysts focus on decisions rather than data gathering, use risk-based alerting that correlates multiple signals before firing, and ensure staffing matches alert volume.
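Risk-based alerting, mentioned above, can be illustrated with a weighted-signal threshold. The signal names, weights, and threshold here are invented for the example; real deployments tune all three.

```python
# Illustrative per-signal weights; a real deployment would tune these values.
SIGNAL_WEIGHTS = {
    "failed_logins_burst":     25,
    "new_geo_login":           20,
    "privilege_change":        30,
    "large_outbound_transfer": 40,
}

def risk_score(observed: list) -> int:
    """Sum the weights of the observed signals; unknown signals score zero."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in observed)

def should_alert(observed: list, threshold: int = 60) -> bool:
    """Fire a single correlated alert only when the combined risk crosses the threshold,
    instead of emitting one alert per raw signal."""
    return risk_score(observed) >= threshold
```

One burst of failed logins alone stays below the threshold, but the same burst plus a login from a new geography plus a privilege change fires a single high-confidence alert: that collapse of three raw signals into one decision is what reduces fatigue.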
Should L1 analysts escalate or investigate further?
L1 analysts should perform initial triage and make a classification decision. If the alert requires deeper technical analysis beyond their skill set or tool access, they should escalate to L2 with clear documentation of what they found. Do not hold alerts at L1 for extended investigation.
How do we measure triage quality?
Review a random sample of closed alerts weekly to verify the disposition was correct. Track the rate of re-opened alerts and missed true positives. Conduct peer reviews where L2 and L3 analysts audit L1 triage decisions to identify coaching opportunities.
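The weekly random sample of closed alerts can be drawn reproducibly with the standard library. The 5 percent default rate is an assumption for illustration; size the sample to reviewer capacity.

```python
import random

def sample_for_review(closed_alerts: list, rate: float = 0.05, seed=None) -> list:
    """Draw a random sample of closed alerts for the weekly quality review.
    A fixed seed makes the draw reproducible for audit purposes."""
    k = max(1, round(len(closed_alerts) * rate))  # always review at least one
    rng = random.Random(seed)
    return rng.sample(closed_alerts, k)
```

Seeding the sampler lets a second reviewer regenerate the exact same sample when auditing the audit itself.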