How Google Deleted a $125 Billion Pension Fund's Cloud Account
In early May 2024, UniSuper, one of Australia's largest superannuation funds, faced a nightmare scenario that every cloud user dreads: their cloud provider, Google Cloud Platform (GCP), accidentally deleted their data. This incident not only disrupted services for over 600,000 pension fund members but also underscored the critical importance of robust disaster recovery (DR) planning and backup strategies.
Background
UniSuper, managing $125 billion AUD in retirement savings, relies heavily on GCP for its cloud infrastructure. The incident began with a misconfiguration during the provisioning of UniSuper's private cloud services: a bug in the creation script set the private cloud up with a one-year subscription instead of a perpetual one. When the year elapsed, Google Cloud automatically deleted the private cloud, causing a significant outage from May 2nd to May 13th.
What Went Wrong
The deletion was triggered by a rare bug in Google Cloud's provisioning tooling, which silently set the private cloud's subscription to expire after one year; when that term lapsed, Google Cloud dutifully deleted it. Worse, the deletion wiped UniSuper's environment in every region where it had been duplicated for redundancy.
The outage not only disrupted UniSuper's operations but also left more than half a million members without access to their accounts. Despite geographical redundancy and third-party backups, restoring the data and services was an arduous process that took over a week.
Analysis of the Incident
Google Cloud’s response to the incident involved both a joint statement with UniSuper and a follow-up blog post explaining the details. While Google Cloud admitted fault, the communication throughout the incident was criticized for its lack of clarity and timeliness.
UniSuper’s disaster recovery measures included backups with a third-party provider, which eventually enabled the recovery of their systems. However, the restoration process highlighted the complexity and time-consuming nature of recovering from such a large-scale deletion.
Key Lessons Learned
1. Importance of Robust Backup Strategies
The 3-2-1 backup rule (keep three copies of your data, on two different types of media, with one copy offsite) proved crucial in this incident. UniSuper's reliance on a third-party backup provider, outside Google Cloud entirely, was a smart move that ultimately facilitated their recovery.
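To make the rule concrete, here is a minimal 3-2-1 sketch in Python. The paths, NAS mount, and bucket name are all hypothetical, and the offsite step assumes the AWS CLI is installed; any offsite target would do.

```python
# A minimal 3-2-1 sketch: copy 1 is the live export, copy 2 goes to
# separate on-site media, copy 3 goes offsite. All names are placeholders.
import shutil
import subprocess
from pathlib import Path

PRIMARY = Path("/data/exports/backup-2024-05-02.tar.gz")  # copy 1: the export itself
LOCAL_MEDIA = Path("/mnt/nas/backups")                     # copy 2: different on-site media
OFFSITE = "s3://example-offsite-backups/"                  # copy 3: offsite object storage

def replicate(archive: Path) -> None:
    # Copy 2: duplicate onto a second medium (NAS, external drive, tape).
    shutil.copy2(archive, LOCAL_MEDIA / archive.name)
    # Copy 3: push offsite; here via the AWS CLI, but any offsite target works.
    subprocess.run(["aws", "s3", "cp", str(archive), OFFSITE + archive.name], check=True)

if __name__ == "__main__":
    replicate(PRIMARY)
```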
2. Necessity of Disaster Recovery Planning
Geo-redundancy within the same cloud provider isn’t enough. This incident demonstrated the need for diversified DR strategies that include replication across different cloud providers to ensure availability and quick recovery.
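As an illustration of that diversification, the sketch below writes each backup object to two independent S3-compatible providers. The secondary endpoint, bucket, and credentials are placeholders, not a recommendation of any specific vendor.

```python
# Sketch of cross-provider replication: write every backup object to two
# independent S3-compatible object stores so no single vendor holds all copies.
import boto3

primary = boto3.client("s3")  # e.g. AWS, using your default credentials
secondary = boto3.client(
    "s3",
    endpoint_url="https://storage.example-second-provider.com",  # placeholder endpoint
    aws_access_key_id="SECOND_PROVIDER_KEY",                     # placeholder credentials
    aws_secret_access_key="SECOND_PROVIDER_SECRET",
)

def put_everywhere(bucket: str, key: str, body: bytes) -> None:
    # The same object lands with both providers, so a deletion bug,
    # billing lapse, or account compromise at one vendor cannot reach both.
    for client in (primary, secondary):
        client.put_object(Bucket=bucket, Key=key, Body=body)

put_everywhere("example-dr-backups", "jira/backup-2024-05-02.tar.gz", b"...")
```

The point is not the specific providers but the independence: had UniSuper's only copies lived inside Google Cloud, the expiry bug would have taken all of them.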
3. Customer Responsibility in Cloud Environments
Cloud providers often do not offer comprehensive DR guarantees. Customers must understand their role in maintaining DR plans and proactively mitigating risks, even when the cloud provider is at fault.
4. Importance of Detailed Communication
Clear and transparent communication during an incident is essential. Google Cloud’s vague and delayed updates exacerbated the situation, highlighting the need for straightforward communication to manage customer expectations effectively.
But What About Atlassian Cloud?
The Atlassian Cloud platform has suffered a number of critical outages. Some of those outages resulted in lost Jira data and configuration, so, like any enterprise SaaS environment, Atlassian is not immune to failure.
As recently as June 3rd, 2024, Atlassian Jira suffered a global outage which, thankfully, did not result in any lost customer data.
But previous incidents in Atlassian's cloud have resulted in lost services and data, which underpins the main point: no cloud vendor is immune from disasters.
This is one of the reasons why Atlassian makes it very clear to end customers that the backup and recovery of your Jira and Confluence data is your responsibility.
"You need to go through that whole thing periodically and make sure everyone knows what their role is and what they're supposed to do."
Alex Ortiz during The Jira Life "Emergency Livestream - Jira is DOWN!"
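If you want to automate your own exports, here is a hedged sketch that triggers a Jira Cloud site backup over HTTP. It uses the endpoint seen in Atlassian's sample backup scripts, which is not a stable, officially supported API, so verify it against current Atlassian documentation before relying on it; the site name, email, and token are placeholders.

```python
# Hedged sketch: queue a Jira Cloud site export. The runbackup endpoint is
# unofficial and may change; confirm it against Atlassian's current docs.
import requests

SITE = "your-site.atlassian.net"         # placeholder site
AUTH = ("you@example.com", "API_TOKEN")  # Atlassian account email + API token

resp = requests.post(
    f"https://{SITE}/rest/backup/1/export/runbackup",
    json={"cbAttachments": "true", "exportToCloud": "true"},
    auth=AUTH,
    headers={"Accept": "application/json"},
)
resp.raise_for_status()  # raises if Jira refused the backup request
print("Backup queued:", resp.text)
```

Queuing the export is only step one: the resulting archive still has to be downloaded and stored following the 3-2-1 rule above.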
Recommendations for Cloud Users
At a bare minimum, we recommend that you:
- Develop comprehensive backup and disaster recovery plans.
- Regularly test and update your DR protocols to ensure they are current and effective (a minimal restore-drill sketch follows this list).
- Diversify your cloud service providers to minimize risk.
- Maintain clear communication channels with your cloud providers so you receive timely and accurate updates during incidents.
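On the testing point, a DR plan you have never exercised is a guess. Below is a minimal sketch of an automated restore drill, assuming you record a checksum in a backup manifest at backup time; the paths and digest are placeholders.

```python
# Sketch of an automated restore drill: restore the latest backup to a
# scratch location, then verify its checksum against the digest recorded
# when the backup was taken.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(restored: Path, recorded_digest: str) -> None:
    actual = sha256(restored)
    if actual != recorded_digest:
        raise RuntimeError(f"Restore drill FAILED for {restored}: {actual}")
    print("Restore drill passed:", restored)

if __name__ == "__main__":
    # In a real drill, recorded_digest comes from your backup manifest.
    verify_restore(
        Path("/tmp/dr-drill/backup-2024-05-02.tar.gz"),
        recorded_digest="<digest recorded at backup time>",
    )
```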
Conclusion
The UniSuper Google Cloud outage serves as a stark reminder of the vulnerabilities in even the most sophisticated digital infrastructures.
It underscores the necessity for robust disaster recovery planning and the importance of diversified backup strategies. By prioritizing DR, organizations not only protect their operations and data but also preserve their reputation and client trust.
As a leader in data protection and disaster recovery for Atlassian cloud, we keep our door open to discuss your situation.