Event Summary
On Thursday, June 30, 2022, starting at 20:38 UTC, the ThousandEyes platform experienced an outage in which:
-Users were unable to log in to the platform
-Alerts were not dispatched
-Agents experienced delays uploading data and checking in to the platform
Impact Summary
The ThousandEyes web application and API were unavailable between 20:38 and 21:21 UTC.
Some Cloud and Enterprise Agent tests are permanently missing data between 20:40 and 20:50 UTC.
-To check whether a test is missing data for this period, open “Views” for the test. A test that is missing data displays an empty timeline for this period, and hovering over the empty timeline shows a tooltip indicating that there was no data for that metric.
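For customers who prefer to check exported results rather than the timeline, the same check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes you already have the timestamps of a test's data points as timezone-aware values, it treats the 20:40 to 20:50 UTC window as inclusive, and the function name `has_data_in_window` and the sample timestamps are hypothetical, not part of any ThousandEyes API.

```python
from datetime import datetime, timezone

# Window in which some tests are reported as permanently missing data (UTC).
GAP_START = datetime(2022, 6, 30, 20, 40, tzinfo=timezone.utc)
GAP_END = datetime(2022, 6, 30, 20, 50, tzinfo=timezone.utc)


def has_data_in_window(timestamps, start=GAP_START, end=GAP_END):
    """Return True if any timezone-aware timestamp falls within [start, end]."""
    return any(start <= ts <= end for ts in timestamps)


# Hypothetical sample: data-point timestamps for one test (not real data).
example_rounds = [
    datetime(2022, 6, 30, 20, 35, tzinfo=timezone.utc),
    datetime(2022, 6, 30, 20, 55, tzinfo=timezone.utc),
]

# No sample timestamp falls inside the window, so this test would be flagged.
missing = not has_data_in_window(example_rounds)
print("missing data in window:", missing)
```

A test flagged this way should still be confirmed in “Views”, since the timeline and its tooltip remain the authoritative indication of missing data.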
Resolution
The ThousandEyes web application and API were brought back online by restarting the database cluster in single-node mode.
Restoring the ThousandEyes platform also allowed the agents to resume submitting data.
Root Cause Analysis
A maintenance operation overloaded a critical database cluster, tripping safeguards that unexpectedly and simultaneously shut down every read-only replica in the cluster. As a result, the ThousandEyes web application and API became unavailable.
Additional information is available upon request by emailing support@thousandeyes.com or by opening a support chat. Instructions for opening a support chat are available here: https://docs.thousandeyes.com/product-documentation/getting-started/getting-support-from-thousandeyes