Are Daily Backups Really Sufficient?

Monday afternoon we had a critical failure of an Oracle database at work. Within a few minutes of the fault taking place, I started seeing block corruption errors whilst I was reviewing some information in the production environment. At this stage I thought we might have dropped a disk in the SAN, so I referred it on to our database administrator to rectify.

As is quite common, our environment consists of multiple Oracle 10g RAC nodes connected to a shared data source. The shared data source in this instance is a SAN, where we have a whole bunch of disks configured in groups for redundancy and performance. As soon as the database administrator became involved, it became apparent that we hadn’t dropped a single disk but had in fact lost access to an entire group of disks within the SAN.

Due to the manner in which the SAN and Oracle are configured, we were not in a position where the RAID protection was going to help. If we had dropped a single disk or a subset of disks from any group within the SAN, everything would have been fine; unfortunately, we dropped an entire disk group. The end result was that we were forced to roll our database back to the previous night’s backup.

The following days have been spent recovering the lost day’s data through various checks and balances, but it takes a lot of time and energy from everyone involved to make that happen. We’ve been fortunate enough to trade for several years without ever needing to roll back our production database due to a significant event, which I suppose we should be thankful for.

After three years without performing a production disaster recovery, had we become complacent about data restoration and recovery because we hadn’t really needed it before? I believe that, since we haven’t had a requirement to perform a disaster recovery for some three years, our previous data recovery guidelines have become out of date. Whilst a daily backup may have been more than sufficient for this particular database two or three years ago, the business has undergone significant growth since that time. The daily changeset for this database is now large enough that, whilst having a daily backup remains critical, it takes a significant amount of work to recover all of the data in a reasonable time frame.
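
To put that in concrete terms, one way of narrowing the gap is to keep the nightly backup but also back up the archived redo logs during the day, so a restore only ever has a few hours of redo to roll forward. Something along these lines, as illustrative RMAN commands rather than our actual scripts, scheduled alongside the nightly job would do it:

    # nightly: level 0 backup of the database plus the archived redo to date
    BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;

    # every few hours during the day: back up any archived redo not yet backed up
    BACKUP ARCHIVELOG ALL NOT BACKED UP 1 TIMES;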

As a direct result of this disaster, we’re going to be reviewing our data recovery policies shortly. The outcome of that discussion will most likely be that we require higher levels of redundancy in our environment to reduce the impact of a failure. Whilst it would be ideal to have an entire copy of our production hardware, it probably isn’t going to be a cost-effective solution. I’m open to suggestions about what sort of data recovery strategy we implement; however, I think that having some sort of independent warm spare may win out.
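
For the warm spare, the obvious Oracle-shaped option is shipping redo to a standby database (Data Guard). As a rough sketch of the primary-side redo transport only, with the service and database names purely illustrative and the standby build itself omitted:

    -- ship redo asynchronously to a standby database on independent hardware
    -- (SERVICE and DB_UNIQUE_NAME values here are illustrative placeholders)
    ALTER SYSTEM SET log_archive_dest_2 =
      'SERVICE=standby_db ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=standby_db'
      SCOPE=BOTH;
    ALTER SYSTEM SET log_archive_dest_state_2 = 'ENABLE' SCOPE=BOTH;

Even a standby on much more modest hardware than production would take the recovery window from “restore last night’s backup” down to however far the redo shipping lags behind.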

What have we learned from this whole event?

  • daily backup of data is mandatory
  • daily backup of data may not be sufficient
  • verify that your backup sets are valid; invalid backup data isn’t worth the media it is stored on (see the sketch after this list)
  • be vigilant about keeping data recovery strategies in step with business growth and expectations
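
On the validation point, RMAN can at least confirm that the backup pieces are readable and that a restore would find everything it needs. Purely as an illustration:

    # check that the backups needed to restore the database exist and are intact
    RESTORE DATABASE VALIDATE;

    # read the datafiles and archived logs, reporting corrupt blocks without writing a backup
    BACKUP VALIDATE DATABASE ARCHIVELOG ALL;

It isn’t a substitute for periodically restoring into another environment, but it’s cheap enough to run alongside the backups themselves.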

Maybe periodic disasters are actually healthy for a business? Whilst every business strives to avoid any sort of downtime, I expect that, as a direct result of the typically high availability of certain systems, disaster recovery isn’t put through its paces often or rigorously enough, which may result in longer downtime or complete loss of data when an actual disaster recovery is required.

3 thoughts on “Are Daily Backups Really Sufficient?”

  1. [Edit]: This comment was in response to someone from a company offering backup services. Whilst I was willing to let one spam-esque comment through in case it was done manually for a little self-promotion, I’m not willing to let a second one through. The person in question suggested that, as a test, any company should request their IT department restore some ‘critical’ data.

    Vicky,

    Recovering data in the above manner isn’t a problem; we restore and use our backup data regularly in various other environments.

    In my opinion, the problem we faced was that our backups weren’t happening often enough: too many changes were taking place during a business day, making a full disaster recovery painful (not impossible, just painful).

    After all is said and done, a business will always place a cost on its data. If the cost of losing it outweighs the cost of protecting it, then better solutions will be provided. Given that this is the first such incident in many years, they might consider our current backup policy ‘adequate’ and not want to pursue it further (unlikely, but it illustrates the point).

    Al.

  2. Actually… the real crux of the problem was that the archive logs were not being written to a secondary device, from which the processed transactions could have been rolled forward once the backup was restored.

    In the event that this had been properly configured, we would have been in a much better position. I’ve already discussed options and am working on the plan that will stop this from happening again (see the sketch at the end of this comment)…

    and yes internet ppl that read this, i’m the DBA responsible for this database… please feel free to flame away :)
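
    For anyone curious, the sort of change I mean is simply having the database write a second copy of every archived log to storage that is independent of the SAN. The destination names below are illustrative only:

        -- primary archive destination stays in the ASM disk group on the SAN
        ALTER SYSTEM SET log_archive_dest_1 = 'LOCATION=+DATA' SCOPE=BOTH;

        -- second, mandatory copy of every archived log on storage outside the SAN
        ALTER SYSTEM SET log_archive_dest_2 = 'LOCATION=/u02/arch MANDATORY' SCOPE=BOTH;

    With both destinations in place, losing the SAN copy still leaves a complete set of archived redo to roll the restored backup forward with.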

  3. I should have added, the archive logs were being written to the SAN, but the nature of the problem that we experienced meant that the data on a couple of the member disks in the diskset was completely overwritten and destroyed. Due to how the ASM configuration was set up, we were relying on the RAID redundancy of the shelf rather than the inbuilt redundancy that you can configure in ASM, which would also have gotten us out of the situation that we were in (see the sketch below).
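
    To make that concrete, ASM normal redundancy mirrors each extent across failure groups, so losing every disk behind one shelf or controller still leaves a complete copy elsewhere. As a sketch only, with the disk group name and device paths purely illustrative:

        -- mirror extents across two failure groups, one per shelf/controller
        CREATE DISKGROUP data NORMAL REDUNDANCY
          FAILGROUP shelf_a DISK '/dev/rdsk/c1t1d0s4', '/dev/rdsk/c1t2d0s4'
          FAILGROUP shelf_b DISK '/dev/rdsk/c2t1d0s4', '/dev/rdsk/c2t2d0s4';

    The trade-off is the usual one: ASM mirroring on top of the shelf’s RAID costs capacity, but it means a failure that takes out a whole failure group doesn’t take the disk group with it.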
