Failover is dead

  • November 15, 2018

A quick Google search for “failover meaning” returns this definition:

“a procedure by which a system automatically transfers control to a duplicate system when it detects a fault or failure.”

The definition is perfect, but the concept is broken. Failover is broken. It was ostensibly a good idea years ago, but if you want high availability with downtime close to zero, it’s time to forget the whole idea of “failover” altogether.

In a failover scenario, you have an active server providing services and a passive server waiting for the first one to fail. When the active server fails, the passive (or secondary) server takes control automatically. But can this scenario truly be trusted to work reliably? You have one server that nobody is using, and you assume it will be ready when you need it. In the best environments, I’ve seen people simulate failures to check that failover works, but it is not a common practice. I’ve seen places that don’t test their HA processes for months or even years. You could argue that if they don’t follow best practices, it doesn’t mean the concept itself is broken. Technically true. However, we advise against the concept for the following additional reasons:

  • Syncing can be tricky. It depends too much on people checking that the right things are synced. It’s an automated process, of course, but someone has to configure it, and that leaves a lot of room for human error. I don’t want to discover during an outage that someone missed a component needed for a smooth failover.
  • I have to reiterate that the secondary could malfunction. I don’t want to rely on something that hasn’t been checked to save me when I’m on fire.
  • You have better options, which are enumerated below.

If you want HA, you have to design the architecture of your application and the services it requires to run distributed. I’m not talking about complex distributed systems. Your system as a whole should run on nodes, and all of those nodes should be working all the time. If one node fails, the system should continue working. The only thing you need to do is recover that node. The idea is to have as many independently functioning parts as you can, so that identifying and fixing a problem is as simple as it can be.
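To make the "all nodes working all the time" idea concrete, here is a minimal sketch in Python. It is not a real load balancer: `Node`, `NodeDown`, and `ActivePool` are hypothetical names invented for this illustration, and a failure is simulated with a flag. The point is only that every node takes traffic, and a request simply moves on to the next node when one is down.

```python
import itertools


class NodeDown(Exception):
    """Raised by a node that cannot serve a request (hypothetical)."""


class Node:
    """A toy application node; `healthy` simulates whether it is up."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise NodeDown(self.name)
        return f"{self.name} served {request}"


class ActivePool:
    """Round-robin over nodes that all serve traffic all the time."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._cycle = itertools.cycle(self.nodes)

    def dispatch(self, request):
        # Try each node at most once, starting from the next one in rotation.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            try:
                return node.handle(request)
            except NodeDown:
                continue  # skip the failed node; it can be recovered later
        raise RuntimeError("no healthy nodes left")


pool = ActivePool([Node("a"), Node("b"), Node("c")])
pool.nodes[1].healthy = False  # simulate one node failing
results = [pool.dispatch(f"req{i}") for i in range(4)]
```

With node `b` down, the four requests are served alternately by `a` and `c`. Nothing had to be "promoted", because there was never an idle standby in the first place.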

In simple applications, you have one service and one database. Your application should be able to run N times on different servers, preferably with a shared-nothing architecture (i.e. don’t use NFS!!!). The application should use services to store anything shared: databases for data, an object store for files.
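As a rough illustration of that layout, the sketch below runs two copies of the same application that keep no state of their own. `SharedDatabase` and `SharedObjectStore` are in-memory stand-ins that I made up for this example; in a real deployment they would be a database service and an object store reached over the network.

```python
class SharedDatabase:
    """Stand-in for a real database service (hypothetical, in-memory)."""

    def __init__(self):
        self._rows = {}

    def set(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]


class SharedObjectStore:
    """Stand-in for a real object store (hypothetical, in-memory)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class AppInstance:
    """One copy of the application; it holds no local state."""

    def __init__(self, name, db, files):
        self.name = name
        self.db = db
        self.files = files

    def upload_avatar(self, user, image_bytes):
        key = f"avatars/{user}"
        self.files.put(key, image_bytes)  # the file goes to the object store
        self.db.set(user, key)            # the reference goes to the database

    def get_avatar(self, user):
        return self.files.get(self.db.get(user))


db, files = SharedDatabase(), SharedObjectStore()
app1 = AppInstance("app-1", db, files)
app2 = AppInstance("app-2", db, files)

app1.upload_avatar("alice", b"png-bytes")
avatar_from_app2 = app2.get_avatar("alice")  # served by a different instance
```

Because neither instance writes to its own disk, a request can land on any copy of the application, and losing one copy loses no data.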

When choosing a database, use one that works as a distributed system as well. The application should be able to read and write all the time. Some databases can’t accept writes on every node all the time, but at least they let you read from any node all the time.
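The difference between "write anywhere" and "read anywhere" can be sketched as a small routing function. This models no particular database product: `DbNode`, `accepts_writes`, and `pick_node` are hypothetical names, and the example assumes a cluster where only one node accepts writes.

```python
import random


class DbNode:
    """A toy database node; `up` simulates whether it is reachable."""

    def __init__(self, name, accepts_writes):
        self.name = name
        self.accepts_writes = accepts_writes
        self.up = True


def pick_node(nodes, operation, rng=random):
    """Pick a live node for a 'read' or a 'write'."""
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    live = [n for n in nodes if n.up]
    if operation == "write":
        # Writes may only go to nodes that accept them.
        live = [n for n in live if n.accepts_writes]
    if not live:
        raise RuntimeError(f"no live node available for {operation}")
    return rng.choice(live)


nodes = [
    DbNode("db-1", accepts_writes=True),
    DbNode("db-2", accepts_writes=False),
    DbNode("db-3", accepts_writes=False),
]

write_target = pick_node(nodes, "write")  # only db-1 qualifies
nodes[0].up = False                       # simulate losing the writable node
read_target = pick_node(nodes, "read")    # reads still succeed on db-2 or db-3
```

In this model, losing the writable node stops writes but reads keep working. A database that accepts writes on every node would set `accepts_writes=True` everywhere, and the same function would keep routing both operations.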

In summary: if your application is complex and contains additional components such as caches and queues, configure them to run distributed as well. Many independently functioning parts that are well organized and connected to one another are superior to a failover system, because individual parts fail rather than the whole system.