Oh no, all my servers are down! What now?

Our SAN failed and all our servers were lost - find out what we did next!

It was a fairly standard Tuesday afternoon. It was 2:30pm and I was sat in the boardroom on a conference call with a partner. I looked up and saw Ian, our senior engineer, frantically waving at me through the glass. I made my apologies and slipped out of the room, and Ian said the words every CTO dreads: “Our SAN has failed and all our servers are offline!”

“How can this happen?” I said. Ian replied, “It’s RAID5 and a hard drive failed yesterday. We put in a new hard drive straight away, but a second drive failed before the first finished rebuilding.” For those who don’t know, RAID5 is a commonly used form of drive redundancy. It spreads your data, plus parity information, across multiple hard drives so you can lose one drive without losing any data. However, if you lose a second drive before the first has been rebuilt, it is game over! All your data is gone - forever!
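For the curious, the reason one drive can fail safely but two cannot comes down to parity. The snippet below is a minimal sketch of the idea, not how our SAN is actually implemented: the block contents and the `xor_blocks` helper are made up for illustration, and a real RAID5 array rotates parity across all the drives rather than keeping it on one.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, each imagined to live on its own drive (illustrative values).
data = [b"SALE", b"FILE", b"DOOR"]

# The parity block, imagined to live on a fourth drive.
parity = xor_blocks(data)

# One drive fails (the one holding data[1]): XOR the survivors with the
# parity block and the missing block comes back.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two drives fail (data[1] and data[2]): all that can be recovered is the
# XOR of the two missing blocks, which is not enough to separate them.
leftover = xor_blocks([data[0], parity])
assert leftover == xor_blocks([data[1], data[2]])
```

A rebuild has to read every surviving drive in full, which is why a second failure during that window is so damaging.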

A few years ago, I would have been stressed up to my eyeballs at this point, but we had planned for this. We invested in a solid backup and recovery solution, Datto, and we run annual tests. Therefore, I simply said to Ian, “Follow the DR plan and recover everything on the Datto,” and walked back into the boardroom to finish my call. This blog explains what happened next:

Our Systems:

Most of our systems are in the Cloud, so they were not affected, but we still have four critical systems which reside in the office. They are:

  • Phone Server
  • File Server
  • Reporting Server
  • Door Access Server

Our Backup and Recovery Solution

We back up all our systems once an hour using a Datto Siris appliance. This means that if a server fails we never lose more than an hour of data. This is much better than the days when everyone backed up nightly – can you imagine telling your staff they need to recreate all their work for the last 24 hours? You would not be popular!

The Datto Siris also has a number of restore options and includes its own hypervisor and offsite storage in the cloud so we can recover either on or off site, without any spare hardware and regardless of the scale of the disaster.

The Timeline of Events:

  • 14:20 – Our users suddenly began to complain they could not see their network drives, and our NOC received an alert to say our SAN had failed.
  • 14:30 – Ian double checked to confirm and notified me that the SAN had failed.
  • 14:40 – Ian restored and virtualised our Phone Server directly on our Datto device using its built-in hypervisor. This restored in seconds and by 14:50 the Phone Server was up and running as normal.
  • 15:00 – Ian restored the File Server, but this time used the “Virtualise using Hypervisor” option. This uses the processing power of our VMware environment while still using the storage from our Datto instead of the failed SAN. It is a little slower to restore but allows the system to run faster once it is up. After 40 minutes it was running as normal and Sales and Finance could work again.
  • 16:20 – Ian restored the Reporting and Door Access Servers in the same way. After 30 minutes they were both running as normal.
  • 17:00 – We ran a final check over the systems to make sure they were all responding correctly and left for the pub, on time at 17:30, to celebrate!
  • Next day – We sat down and planned the move from our Datto to our replacement hardware. We now have the advantage of time, which allows us to plan and manage this project to avoid further disruption to the business.
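To put numbers on the timeline above, here is a rough sketch that works out how long each system was unavailable. It assumes the outage began at 14:20, when users first noticed, and takes each "back up" time from the bullets. The script, its names and the split of the Reporting and Door Access Servers into two entries are illustrative, not part of our DR tooling.

```python
from datetime import datetime

# Assumption: the outage is counted from the first user complaints.
OUTAGE_START = "14:20"

# Time each system was running again, read off the timeline.
restored_at = {
    "Phone Server": "14:50",
    "File Server": "15:40",         # restore began 15:00, took 40 minutes
    "Reporting Server": "16:50",    # restore began 16:20, took 30 minutes
    "Door Access Server": "16:50",  # restored alongside the Reporting Server
}

def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

downtime = {name: minutes_between(OUTAGE_START, t) for name, t in restored_at.items()}

for name, minutes in downtime.items():
    print(f"{name}: {minutes} minutes")
```

On these figures the Phone Server was down for 30 minutes, the File Server for 80, and the last two servers for 150, all within a single afternoon.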

My Conclusions:

I spend my life preaching to customers about the importance of having a solid disaster recovery plan whilst secretly hoping I would never need ours. However, we did need it, and I am glad. It confirmed that our plan works in a real failure and not just in our annual test.

Could we have done anything better? Yes, of course, but I am pleased to see that by planning and investing correctly, one engineer could recover several servers in a very acceptable time, far exceeding the requirements of the business. Ultimately, it changed a disaster into a minor inconvenience.

If you are responsible for IT at your company, your primary goal is to protect the company data. Other priorities, like new apps, improved usability, system performance and even business growth, come second to this because, if you lose the data, you will (and should) lose your job! You may even destroy the company too. That would not be a good Tuesday!

If you would like to know more, or discuss how we can help you improve your recovery options, please give us a call.

P.S. I am also glad we have Ian!

Richard Palmer

Chief Technology Officer
