This post was originally published at http://www.binss.me/blog/backup-and-backup/. Please credit the source when reposting.

Backup!

Two big pieces of industry news lately both involved backups. First, Hearthstone's China servers (operated by NetEase) lost data and ended up rolling back to a version from several days earlier. Second, GitLab wiped its production database, leaving only about 4.5 GB out of roughly 300 GB. The reason these made headlines is, of course, that the data was gone and could not be recovered. Incidents like this have serious consequences, so knowing how to pass the blame gracefully becomes very important.

NetEase's response:

The database failed due to an unexpected interruption of the power supply, resulting in data corruption.

Although Blizzard's and NetEase's engineers began emergency repairs immediately after the incident, restarting servers and attempting to recover the data, these attempts unfortunately all failed because the relevant backup databases had also failed.

To which people in the industry responded: a power outage? Are you kidding me?

GitLab's response, by contrast, was rather candid:

team-member-1 thinks that perhaps pg_basebackup is refusing to work due to the PostgreSQL data directory being present (despite being empty), decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com.

team-member-1 - terminates the removal, but it’s too late. Of around 300 GB only about 4.5 GB is left.

In other words, the intention was to clear out the data directory on the backup database so that pg_basebackup would run, but the data directory of the production database got deleted instead.

This reminds me of the time I noticed a directory eating a huge amount of space inside a virtual machine and reflexively ran rm -rf on it. A few seconds later I realized the directory was a mount of my host machine's home directory and frantically hit Ctrl+C. Luckily the deletion had started with Applications and was still stuck grinding through the Adobe directory (the very situation the Adobe meme comes from).

Ahem, back to the point. Nobody is perfect and mistakes are inevitable, which is exactly why we design and deploy all those cold and hot backups, primary-replica setups, off-site backups, and even log-based recovery. All of this burns a lot of resources precisely so that it can save our skins at the critical moment. Someone, fetch my backups!

What? The backups are missing / corrupted / unusable / inconsistent? (face of someone who has lost the will to live)

Backup?

Looking back at incidents across the industry, the backups never quite manage to rescue the damaged primary database and bring it back to a consistent state the way we expect. Why is that?

My personal take: before anything goes wrong, nobody cares about the backups. We tend to treat backups as a placebo and never bother to check whether they actually work.

Take this GitLab incident:

  1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

  2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size. It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

  3. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.

  4. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost

  5. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented. We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.

  6. Our backups to S3 apparently don’t work either: the bucket is empty

With six backup mechanisms on hand, you can imagine how gloriously they were hyped on slide decks. Yet when disaster actually struck, it turned out that nobody knew where the regular database backups (pg_dump) had gone, and they had most likely been failing silently; Azure disk snapshots were not enabled on the DB servers; the webhooks had been wiped; the replication procedure was unusable; and the S3 backup bucket was empty. The only thing that worked was a single cold backup, the LVM snapshot.

Clearly, nobody had ever checked on the backups in normal times, let alone tested whether they could actually be restored. The so-called disaster and recovery drills were probably just going through the motions. After all, what if a drill accidentally breaks something? Who takes the blame? It is extra work that does nothing for your KPI. And then, the moment something real goes wrong, you are left holding the bag.
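Checking for this kind of rot is not hard. Below is a minimal sketch of a routine sanity check; the dump directory, file pattern, and thresholds are my own assumptions for illustration, not GitLab's real layout. It only verifies that the newest dump exists, is recent, and is not suspiciously small, which by itself would have caught the "files only a few bytes in size" failure mode:

```python
#!/usr/bin/env python3
"""Sanity-check the latest pg_dump artifact: it must exist, be recent, and be non-trivial in size.

Paths, patterns, and thresholds below are illustrative assumptions.
"""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")   # assumed dump location
MIN_SIZE_BYTES = 100 * 1024 * 1024           # under ~100 MB is suspicious for a ~300 GB database
MAX_AGE_SECONDS = 24 * 3600                  # we expect at least one dump per day


def latest_dump(directory: Path):
    """Return the most recently modified *.dump file, or None if there is none."""
    dumps = sorted(directory.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    return dumps[-1] if dumps else None


def main() -> int:
    if not BACKUP_DIR.is_dir():
        print(f"FAIL: backup directory {BACKUP_DIR} does not exist")
        return 1
    dump = latest_dump(BACKUP_DIR)
    if dump is None:
        print("FAIL: no dump files found at all")
        return 1
    info = dump.stat()
    age = time.time() - info.st_mtime
    if age > MAX_AGE_SECONDS:
        print(f"FAIL: newest dump {dump.name} is {age / 3600:.1f} hours old")
        return 1
    if info.st_size < MIN_SIZE_BYTES:
        # This is exactly the "files only a few bytes in size" failure mode.
        print(f"FAIL: {dump.name} is only {info.st_size} bytes; pg_dump probably failed silently")
        return 1
    print(f"OK: {dump.name}, {info.st_size / 2**30:.1f} GiB, {age / 3600:.1f} hours old")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Hook the exit code into whatever monitoring already pages people, and a silently failing pg_dump becomes an alert the next morning instead of a discovery made in the middle of an outage.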

The lesson

A backup is something nobody can guarantee is usable or reliable until it has actually been put to use.
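The only convincing proof, then, is to actually use the backup on a schedule: restore the latest dump into a scratch database and run a sanity query against it. Here is a rough sketch of such a restore drill; the scratch database name, the dump path, and the table being counted are assumptions made for this example, and connection settings are left to the usual PostgreSQL environment defaults:

```python
#!/usr/bin/env python3
"""Restore drill: load the newest dump into a scratch database and run a sanity query.

The database name, dump path, and table below are assumptions for this sketch.
"""
import subprocess
import sys

DUMP_PATH = "/var/backups/postgres/latest.dump"  # assumed custom-format dump (pg_dump -Fc)
SCRATCH_DB = "restore_drill"


def run(*cmd: str) -> str:
    """Run a command, fail loudly on a non-zero exit code, and return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def main() -> int:
    # Recreate the scratch database from a clean slate for every drill.
    run("dropdb", "--if-exists", SCRATCH_DB)
    run("createdb", SCRATCH_DB)

    # Restore the dump; --exit-on-error makes pg_restore stop and fail if the file is truncated or garbage.
    run("pg_restore", "--exit-on-error", "--no-owner", "--dbname", SCRATCH_DB, DUMP_PATH)

    # Sanity query against a table we expect to be populated (the table name is an assumption).
    rows = int(run("psql", "-d", SCRATCH_DB, "-tAc", "SELECT count(*) FROM projects"))
    if rows == 0:
        print("FAIL: restore completed but the projects table is empty")
        return 1
    print(f"OK: restored a dump with {rows} rows in projects")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run something like this on a schedule and alert on failure, and "the backup turned out to be unusable" stops being a post-mortem headline and becomes a routine ticket.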