Platform outage for app.thousandeyes.com and api.thousandeyes.com
Incident Report for ThousandEyes
Postmortem

Event Summary

On Thursday, June 30, 2022, starting at 20:38 UTC, the ThousandEyes platform experienced an outage in which:

- Users were unable to log into the platform

- Alerts were not dispatched

- Agents experienced delays uploading data and checking in to the platform

Impact Summary

The ThousandEyes web application and API were unavailable between 20:38 and 21:21 UTC.

Some Cloud and Enterprise Agent tests are permanently missing data between 20:40 and 20:50 UTC.

- To check whether a test is missing data during this period, open “Views” for the test. A test that is missing data displays an empty timeline for this period, and hovering over the empty timeline shows a tooltip stating that there was no data for that metric during that period.
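For anyone who has already exported a test's round timestamps, the same check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `has_data_in_gap` and the input (a list of timezone-aware UTC datetimes) are hypothetical and not part of any ThousandEyes API; only the 20:40 to 20:50 UTC window comes from this report.

```python
from datetime import datetime, timezone

# Window in which this report says some test data is permanently missing.
GAP_START = datetime(2022, 6, 30, 20, 40, tzinfo=timezone.utc)
GAP_END = datetime(2022, 6, 30, 20, 50, tzinfo=timezone.utc)


def has_data_in_gap(round_times):
    """Return True if any timestamp falls inside the window (inclusive).

    `round_times` is a hypothetical input: timezone-aware datetimes of the
    test rounds for which data exists, obtained however you export them.
    """
    return any(GAP_START <= t <= GAP_END for t in round_times)


if __name__ == "__main__":
    rounds = [
        datetime(2022, 6, 30, 20, 35, tzinfo=timezone.utc),
        datetime(2022, 6, 30, 20, 55, tzinfo=timezone.utc),
    ]
    if has_data_in_gap(rounds):
        print("Data present in the 20:40-20:50 UTC window")
    else:
        print("No data in the 20:40-20:50 UTC window")
```

A result of False only suggests missing data if the test normally runs at least once every ten minutes; a test with a longer interval may have no round scheduled in the window at all.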

Resolution

The ThousandEyes web application and API were brought back online by restarting the database cluster in single-node mode.

Restoring the ThousandEyes platform also allowed the agents to resume submitting data.

Root Cause Analysis

A maintenance operation overloaded a critical database cluster, tripping safeguards that unexpectedly and simultaneously shut down every read-only replica in the cluster. As a result, the ThousandEyes web application and API became unavailable.

Additional information is available on request by emailing support@thousandeyes.com or by opening a support chat. Instructions on opening a support chat are available here: https://docs.thousandeyes.com/product-documentation/getting-started/getting-support-from-thousandeyes

Posted Jul 08, 2022 - 22:03 UTC

Resolved
This incident has been resolved.
Posted Jul 01, 2022 - 03:00 UTC
Monitoring
Summary
A fix has been implemented and we are monitoring the results.

Status
The affected Enterprise Agents should now show as online, with no lag in checking in to the ThousandEyes platform.
Posted Jul 01, 2022 - 00:58 UTC
Identified
Status
- Some Enterprise Agents may show as offline due to a delay in the check-in process but continue to run scheduled tests
Posted Jul 01, 2022 - 00:06 UTC
Update
Summary
We are continuing to monitor for any further issues.

Status
- Some Enterprise Agents may show as offline due to a delay in the check-in process but continue to run scheduled tests
Posted Jun 30, 2022 - 23:20 UTC
Update
Summary
We are continuing to monitor for any further issues.

Status
- Some Enterprise Agents may show as offline due to a delay in the check-in process but continue to run scheduled tests
- Delay in endpoint agent alerts
Posted Jun 30, 2022 - 22:44 UTC
Update
We are continuing to monitor for any further issues.
Posted Jun 30, 2022 - 22:02 UTC
Monitoring
Summary
We are continuing to monitor for any further issues.

Status
Some customers may still experience:
- Delay in Internet Insights Application Outage Alerts
- Enterprise Agents may show as offline due to a delay in the check-in process
Posted Jun 30, 2022 - 21:56 UTC
Update
Summary
A fix has been implemented and we are monitoring the results.

Status
- Some users may experience an elevated error rate while accessing services from our Web app and API
- Delays in all test data and alert dispatching are expected
- Agents may show as offline but continue to run scheduled tests
Posted Jun 30, 2022 - 21:45 UTC
Update
A fix has been implemented for app.thousandeyes.com and api.thousandeyes.com. We are monitoring the results.
Posted Jun 30, 2022 - 21:25 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jun 30, 2022 - 21:18 UTC
Identified
The issue has been identified.
Posted Jun 30, 2022 - 21:01 UTC
Investigating
Summary
We are currently experiencing an outage of our web app and API.

If you cannot open a support chat from https://status.thousandeyes.com or https://app.thousandeyes.com, please send an email to support@thousandeyes.com requesting assistance.
Posted Jun 30, 2022 - 20:48 UTC
This incident affected: Agent Configuration and Data Collection (Cloud and Enterprise Agents: Registration controller), ThousandEyes Platform and API (Platform Availability, API Availability, Test data availability), Alerts and Notifications (Alert processing), and Internet Insights collection and processing.