One of our Document Verification API clusters is experiencing overload
Resolved
Aug 28 at 01:50am EDT
Our Document Verification API experienced severe performance degradation, causing some ID scan uploads to take 5-10 minutes instead of completing almost instantly. Multiple users reported these delays, and an estimated 37.5% of all verification operations may have been affected during the incident window.
User Impact
- 3-4 confirmed users reported 5-10 minute delays for ID uploads
- 10-15 total users attempted to use the service during the incident
- Verifications that did complete showed success badges, but their data was not accessible
- Records were not appearing in the admin dashboard for review
Root Cause
A critical service component entered a failure state, consuming excessive CPU resources while repeatedly failing to process requests. This component handled approximately 50% of all verification traffic, causing widespread impact when it became unresponsive.
Contributing Factors
- The affected service component entered an unrecoverable state requiring manual intervention
- No automatic recovery mechanism was in place for high CPU usage scenarios
Timeline
- 2025-08-26: Normal operations, no errors reported
- 2025-08-27: Error spike began; "No scan response" errors started to appear
- 2025-08-28: Issue identified and resolved through manual intervention
Resolution
Immediate Actions Taken:
- Manual restart of the failed service component
- Traffic redistribution to reduce dependency on any single server from 50% to 25%
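The effect of the traffic redistribution can be illustrated with a short calculation. This is a minimal sketch only: it assumes equally weighted servers, and the server names and weights are hypothetical rather than taken from our actual load balancer configuration.

```python
def traffic_shares(weights):
    """Return each server's fraction of total traffic, given routing weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one server must have a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def worst_case_share(weights):
    """Largest fraction of traffic that a single failing server can take down."""
    return max(traffic_shares(weights).values())


# Hypothetical layouts: two equally weighted servers before the change,
# four equally weighted servers after it.
before = {"server-a": 1, "server-b": 1}
after = {"server-a": 1, "server-b": 1, "server-c": 1, "server-d": 1}

print(worst_case_share(before))  # 0.5
print(worst_case_share(after))   # 0.25
```

Under these assumptions, a single unresponsive server would affect at most a quarter of verification traffic instead of half.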
Planned Improvements:
- Enhanced monitoring and alerting for application-level health
- Implementation of automatic recovery mechanisms for performance degradation
- Improved health checks that can detect and isolate failing components
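The planned health checks and automatic recovery could work roughly as sketched below. This is an illustration under stated assumptions, not our implementation: the class name, the 90% CPU threshold, the five-sample window, and the "isolate" and "restart" actions are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_percent: float  # CPU usage of the component at sampling time
    request_ok: bool    # whether a probe request was processed successfully


class ComponentWatchdog:
    """Decides on a recovery action from a sliding window of health samples."""

    def __init__(self, cpu_threshold=90.0, window=5):
        # Both values are assumed defaults chosen for illustration.
        self.cpu_threshold = cpu_threshold
        self.window = window
        self.samples = deque(maxlen=window)

    def observe(self, sample):
        """Record a sample and return "ok", "isolate", or "restart"."""
        self.samples.append(sample)
        if len(self.samples) < self.window:
            return "ok"  # not enough history to judge yet
        if any(s.request_ok for s in self.samples):
            return "ok"  # at least one probe succeeded in the window
        # Every probe in the window failed: take the component out of rotation.
        if all(s.cpu_percent >= self.cpu_threshold for s in self.samples):
            # Failing and pinned at high CPU for the whole window: restart it.
            return "restart"
        return "isolate"


watchdog = ComponentWatchdog()
actions = [watchdog.observe(Sample(cpu_percent=98.0, request_ok=False))
           for _ in range(5)]
print(actions[-1])  # restart
```

Requiring a full window of failed probes before acting is a deliberate choice in this sketch: it avoids restarting a component over a single slow request, at the cost of a short detection delay.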
Created
Aug 27 at 03:56pm EDT
We have received reports of elevated "No scan response" errors. We are investigating.