The “GitLab meltdown”: moral of the story?
Pretty simple: verify your backups!
GitLab.com is in crisis after suffering severe data loss caused by human error and ineffectual backups.
On Tuesday evening, one database experienced severe performance degradation, and a sysadmin began emergency database maintenance.
But another (tired) sysadmin, working late at night in the Netherlands, accidentally deleted a directory on the wrong server during a database replication process, wiping a folder containing 300GB of live production data.
In the Google Doc documenting the incident, the sysadmins note:
“This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis).”
So not all is lost? Don't be too optimistic: the document concludes with the following:
“So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.”
A few hours later, GitLab published a post on its blog trying to reassure users about the situation:
Yesterday we had a serious incident with one of our databases. We lost 6 hours of database data (issues, merge requests, users, comments, snippets, etc.) for GitLab.com. Git/wiki repositories and self hosted installations were not affected. Losing production data is unacceptable and in a few days we’ll post the 5 why’s of why this happened and a list of measures we will implement.
As of time of writing, we’re restoring data from a 6-hours old backup of our database.
This means that any data between 17:20 UTC and 23:25 UTC from the database (projects, issues, merge requests, users, comments, snippets, etc.) is lost by the time GitLab.com is live again.
Update 18:14 UTC: GitLab.com is back online
I'd like to quote this sentence from The Register:
The world doesn’t contain enough faces and palms to even begin to offer a reaction to that sentence.
Moral of the story?
- Verify backups
- Do not work late!
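"Verify backups" can be automated. As a minimal sketch (the paths and checks are assumptions, not GitLab's actual tooling), a nightly job could at least confirm that a backup archive exists, is non-empty, and is not corrupt; the only truly trustworthy backup is one you have actually restored:

```shell
#!/bin/sh
# Minimal backup-verification sketch. BACKUP is a hypothetical path;
# adapt it to your own backup tooling.
set -eu

BACKUP=/tmp/demo_backup.sql.gz

# For demonstration only: create a stand-in "backup" file.
echo "CREATE TABLE projects (id int);" | gzip > "$BACKUP"

# 1. The backup must exist and be non-empty (GitLab's S3 bucket was empty).
[ -s "$BACKUP" ] || { echo "FAIL: backup missing or empty"; exit 1; }

# 2. The archive must decompress cleanly.
gzip -t "$BACKUP" || { echo "FAIL: backup archive corrupt"; exit 1; }

# 3. Ideally, go further: restore into a scratch database and run sanity
#    queries, e.g.  gunzip -c "$BACKUP" | psql scratch_db  (not run here).

echo "backup verification passed"
```

A check like this, run on a schedule that alerts on failure, would have caught the silently broken backup jobs long before they were needed.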