Thu 28 Dec 2006
File recovery (e.g. recovering a deleted or overwritten Excel file) and system recovery (e.g. restoring a server that was destroyed in a flood) have been linked together for many years in the 770 type organisation. For a long time, restoring one Excel file from tape was the same process, on a different scale, as restoring the whole server. In both cases we went to the last backup tape and restored from it.
With the current Microsoft products, MS Windows 2003 Server and MS Exchange 2003, we can configure soft deletion of items through the Windows Volume Shadow Copy Service (VSS) and Exchange’s Store Retention Policy respectively. We can configure the server so that when a file or an email is deleted it in fact goes into a holding area for a week or two before it is physically deleted. This holding area is not visible unless you go to a particular menu option in Windows or Exchange, but once there it is easy for the user to restore the item. The holding time can be set to be longer than the tape cycle, which makes it a more effective backup than tape for this kind of restoration.
This is the most common way today that we restore a deleted file on the server or a single email in Exchange. Whether recovering a single file or a complete system, the two most important time parameters are the Recovery Time and the Recovery Point.
The Recovery Time is the time it takes to get the data back for use or restore the system to a minimum level of operation.
For an Excel file recovered with VSS the Recovery Time is minutes. A complete restoration of a server from tape could take days.
The Recovery Time for a server from tape depends on how well the restoration is planned. For example, if your plan is to purchase the same server hardware, it could take weeks for this to arrive!
The Recovery Point is the time at which the backup you are using was created. Ideally the gap between that time and the failure should be minimised. If you are using a 30 day old backup tape, your Recovery Point is 30 days and you will lose your last 30 days of work.
For the recovery of an Excel file with VSS, the Recovery Point is conveniently the same as the moment the file was deleted. This is because VSS keeps a copy of the Excel file when it is mistakenly deleted or written over and puts it in the hidden holding area.
For a server system restoration from tape the Recovery Point is the age of the last backup. If you are backing up every night then the worst possible Recovery Point is 24 hours, which occurs if the system fails, is stolen, etc. just before the next backup starts.
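The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the example dates are hypothetical, and it assumes the last backup completed successfully and is restorable.

```python
from datetime import datetime, timedelta


def recovery_point_age(last_backup: datetime, failure: datetime) -> timedelta:
    """Work lost: the time between the last good backup and the failure."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the backup being restored")
    return failure - last_backup


def worst_case_recovery_point(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure happens just before the next backup starts,
    so a full backup interval of work is lost."""
    return backup_interval


# Hypothetical example: nightly backup at 23:00, server lost at 22:30 the next day.
last_backup = datetime(2006, 12, 27, 23, 0)
failure = datetime(2006, 12, 28, 22, 30)
lost = recovery_point_age(last_backup, failure)
print("Work lost:", lost)
print("Worst case for nightly backups:", worst_case_recovery_point(timedelta(hours=24)))
```

In this example nearly the whole day of work is lost, which is why a nightly tape cycle is normally quoted as a 24 hour Recovery Point.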
Before VSS and Exchange Store Retention, all restorations, big and small, came from tape, and therefore the de facto expected Recovery Point has been 24 hours. This has had a profound impact on the way we plan for total system failure. A Recovery Point of 24 hours has been detrimental to both the Recovery Time after a complete server failure and the cost of provisioning.
Server failures happen from time to time, but it is very rare that we have to completely rebuild a server from tape, that is, that the disk subsystem of the failed server is unavailable. Having looked back at the last 6 years, we estimate that for any given company this kind of failure occurs, on average, less often than once every 20 years. Sure, we have server failures.
Most of the failures have been hardware failures. Where the failed hardware is something other than the hard disks, exercising a properly specified warranty has been the fastest way to recover the server. All the servers we provide have redundant disks that take over when the primary drive fails, and this has sufficed in all but one case: the only occasion on which all disks failed was a flood.
Where the failure is a Windows software failure and not a hardware failure, the disk subsystem and the data on it are still available. The fastest way to recover is to rebuild the server on the hard disk drives carrying the data. In 2006, if you use HP hardware with an appropriate warranty this is a very rare event.
So when you are planning for a once in 20 years event, if the Recovery Point is pushing the Recovery Time out and piling on the cost, it should be looked at. Data and system security is closely analogous to insurance. In this case the Recovery Point is like the insurance excess: the higher it is set, the lower the premium.
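To put rough numbers on the insurance analogy, here is a minimal sketch in Python. It assumes the worst case, where a full Recovery Point of work is lost in each failure, and simply spreads that loss over the years between failures. The function name and the candidate Recovery Points are hypothetical; only the 20 year figure comes from the estimate above.

```python
def expected_hours_lost_per_year(recovery_point_hours: float,
                                 years_between_failures: float) -> float:
    """Worst-case work lost per failure, averaged over the years between failures."""
    if years_between_failures <= 0:
        raise ValueError("years_between_failures must be positive")
    return recovery_point_hours / years_between_failures


# Hypothetical Recovery Points of one day, two days and one week,
# for a failure that occurs about once every 20 years.
for rp_hours in (24, 48, 168):
    yearly = expected_hours_lost_per_year(rp_hours, 20)
    print(f"Recovery Point {rp_hours:>3} h -> about {yearly:.1f} h of work at risk per year")
```

The figure that matters for planning is how this averaged loss compares with the yearly cost of provisioning a shorter Recovery Point.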
I will discuss this trade-off between Recovery Point and cost in the next post on the matter.