Affected
Major outage from 3:17 PM to 3:30 PM
- Resolved
At 14:56:09 UTC, a desynchronization event occurred at one node location in our database cluster, triggered by congestion at the upstream provider, which was itself caused by a large inbound DDoS attack targeting another customer of that provider. The attack was not directed at our platform, but it impaired the cluster's ability to communicate effectively.
The desynchronization caused a cascading effect, a behavior we've seen previously with our current database platform, where multiple nodes began marking themselves as unhealthy, and some briefly returned to a healthy state before failing again. At 15:17:23 UTC, the Singapore node, the last remaining healthy node, also marked itself as unhealthy while our team was working to bring the other node locations back online. This caused WISP front-end services to become fully inaccessible.
From 15:17:23 to 15:30:01 UTC, no healthy nodes were available to handle traffic. At 15:30:02 UTC, the Singapore node recovered. Following this, our team scaled up the other locations as the remaining nodes resynchronized, fully restoring services.
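To make the failure mode above more concrete: the sketch below is a rough, hypothetical illustration (not WISP's actual health-check code) of how a node that tracks replication lag can mark itself unhealthy when upstream congestion delays syncing, even though the node itself is working. The threshold, cluster size, and function names are assumptions invented for the example.

```python
# Hypothetical sketch only -- not WISP's implementation.
# A node steps down (marks itself unhealthy) when it can no longer confirm
# replication with a majority of its peers within an assumed lag threshold.

SYNC_LAG_THRESHOLD_S = 5.0        # assumed replication-lag tolerance, seconds
CLUSTER_SIZE = 5                  # current five-node cluster
MAJORITY = CLUSTER_SIZE // 2 + 1  # 3 of 5 nodes

def evaluate_health(peer_lag_seconds):
    """Return True if this node should stay healthy.

    peer_lag_seconds maps each peer name to the seconds since that peer last
    acknowledged replication (illustrative input, not a real WISP API).
    """
    # Count this node plus every peer still acknowledging within the threshold.
    in_sync = 1 + sum(lag <= SYNC_LAG_THRESHOLD_S
                      for lag in peer_lag_seconds.values())
    # Without a majority in sync, the node marks itself unhealthy rather than
    # serve potentially stale data -- upstream congestion alone can trigger this.
    return in_sync >= MAJORITY

# Congestion delays two peers past the threshold: still healthy (3 of 5 in sync).
print(evaluate_health({"london": 1.2, "singapore": 0.8, "la": 30.0, "germany": 45.0}))
# Congestion delays all peers: the node marks itself unhealthy.
print(evaluate_health({"london": 12.0, "singapore": 9.5, "la": 30.0, "germany": 45.0}))
```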
Since taking over WISP, we've gone through three different node configurations to improve reliability and ensure true geographic redundancy. Under the previous ownership, all nodes were hosted with a single provider in one facility, which did not meet our standards for redundancy. This design left the platform vulnerable to geographic outages or unexpected catastrophic events, and it lacked the true geographic load balancing that WISP is known for.
To fix these issues, we tried the following approaches:
- Five-node setup (3 Chicago, 2 London): Provided low inter-site latency and some stability, but if one location had problems, the entire cluster was impacted.
- Three-node setup (Chicago, London, Singapore): Improved geographic diversity but created a new issue where a single node failure could trigger a cascading cluster failure.
- Current five-node setup (Chicago, London, Singapore, Los Angeles, Germany): Our most stable design yet, offering two-node failure tolerance (see the quorum sketch below) and strong geographic diversity. However, limitations with the underlying database system still cause occasional random desynchronization and nodes marking themselves as unhealthy even without actual communication problems.

Due to these recurring issues, one of our team members has been tasked exclusively with evaluating alternative database backends. For the past several weeks, well before this incident occurred, we have been rigorously testing other solutions to prevent random desync behavior and eliminate the self-failure conditions that can cascade through the system.
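The failure-tolerance figures above follow from standard majority-quorum arithmetic, worked through in the short sketch below. This is general quorum math, not a statement about the internals of the specific database platform WISP runs on.

```python
# General majority-quorum arithmetic (not specific to WISP's database platform):
# a cluster of N nodes stays available as long as a majority, floor(N / 2) + 1,
# of nodes remain healthy, so it tolerates floor((N - 1) / 2) simultaneous failures.

def quorum_tolerance(nodes: int) -> int:
    """Number of node failures a majority-quorum cluster can survive."""
    return (nodes - 1) // 2

for n in (3, 5):
    print(f"{n}-node cluster: majority = {n // 2 + 1}, tolerates {quorum_tolerance(n)} failure(s)")
# 3-node cluster: majority = 2, tolerates 1 failure(s)
# 5-node cluster: majority = 3, tolerates 2 failure(s)
```

By that arithmetic alone, a three-node cluster can survive one failure; the problem described above was that, on the current platform, a single node failure could cascade into further self-failures and drop the cluster below quorum, which the five-node layout makes less likely.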
While our current five-node cluster is much more stable than past setups, this incident highlights a known issue with the current database platform under extreme conditions where latency and congestion disrupt syncing between sites. Normally, the cluster self-heals, and we've experienced this behavior before without any service impact, making this event unusual. Since moving to the current setup, these issues have been far less frequent, but they remain a known limitation. We are currently evaluating other database platforms to replace the existing system and prevent these problems entirely.
Additionally, before this event, a separate issue occurred with the scheduler, where the queue manager became stuck due to failed API requests caused by Cloudflare API problems. This temporarily prevented certain tasks from being processed. The issue has been resolved, and we are implementing additional safeguards to ensure that future scheduler problems do not prevent other tasks from running as expected.
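As a rough illustration of the kind of safeguard mentioned above (not WISP's actual scheduler code), the sketch below bounds the retries for each queued task and sets persistently failing tasks aside, so that one broken upstream API cannot stall the rest of the queue. The function names, retry budget, and delay are assumptions made for the example.

```python
import queue
import time

# Hypothetical sketch only -- not WISP's scheduler implementation.
MAX_ATTEMPTS = 3      # assumed retry budget per task
RETRY_DELAY_S = 2.0   # assumed delay between attempts

def run_queue(tasks: queue.Queue, execute, dead_letter: list) -> None:
    """Drain the queue, isolating failing tasks instead of stalling on them.

    `execute` is whatever callable performs the task's outbound API request
    (for example, a Cloudflare API call); it is assumed to raise on failure.
    """
    while not tasks.empty():
        task = tasks.get()
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                execute(task)
                break                        # task succeeded, move on
            except Exception as exc:         # upstream API error, timeout, etc.
                if attempt == MAX_ATTEMPTS:
                    # Park the task for later inspection so one failing
                    # dependency cannot wedge the whole queue.
                    dead_letter.append((task, exc))
                else:
                    time.sleep(RETRY_DELAY_S)
        tasks.task_done()
```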
We are fully committed to improving WISP's reliability through ongoing testing, architecture improvements, and a planned migration to a more resilient database platform following validation. We sincerely apologize for the disruption and thank you for your trust as we work tirelessly to enhance the stability and resilience of WISP.
- Identified
We're aware of issues communicating with the panel. Our team is working on it.
![](https://instatus.com/user-content/v1721847486/kfdsm5cn9ljdlzb6qa8y.png)