Most of us understand that any migration should be preceded by simulations to give it the best possible chance of success. What is often missed or underestimated, however, is the importance of including people in this process.
A successful simulation without the parties responsible for the real migration, those who must execute, coordinate and communicate throughout that process, may provide a false sense of preparedness. How can we know that a successful simulation can be replicated, after all, when the whole team was not involved in the process?
Consider the scenario we at Flugel faced a while back. A client required the migration of its microservices-based application to AWS, using containers. The application was running on-site, using legacy DevOps tooling, Xen VMs, Chef, and some scripting.
So what, exactly, was the problem?
There were network limitations and security concerns that made progressive migration almost impossible. The only component that could be duplicated and synchronized was the database. There were third-party API limitations that complicated the move.
This migration required a temporary halt on all on-site microservices, leading to some downtime with no guarantee of successful migration.
A good solution needed to reduce downtime to as near zero as possible while ensuring successful migration, preferably on the first attempt.
There were three important parts of this process: automation, communication, and testing.
First, the automation or infrastructure code: All deployment to the cloud was fully automated, from network layout to application deployment in each instance.
Second, communication (and coordination): Authority to proceed with this migration was not vested in a single team but shared by several teams. These teams, therefore, needed to coordinate and communicate effectively if the migration was to succeed.
Third and most important, testing: The application and infrastructure would need to be tested, of course. But because a successful migration hinged on the cooperative functioning of the company’s various teams, testing had to include these teams. We had to verify that these teams would be in sync throughout the migration process.
We also needed to test the rollback in the event that something went wrong.
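One way to picture a rehearsal that covers both the teams and the rollback is the minimal sketch below. It is an illustration only, not the tooling used in this project: the `Step` type, the `rehearse` function, and the team names are all hypothetical. Each step has an owning team, a go/no-go gate refuses to start unless every owning team has confirmed readiness, and a failed step rolls back the completed steps in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One migration step (hypothetical structure, for illustration)."""
    name: str
    owner: str                      # team responsible for this step
    run: Callable[[], bool]         # returns True on success
    rollback: Callable[[], None]    # undoes the step


def rehearse(steps: List[Step], ready_teams: List[str]) -> Tuple[bool, List[str]]:
    """Run the steps in order and return (success, log)."""
    log: List[str] = []

    # Go/no-go gate: every team that owns a step must have confirmed readiness.
    missing = sorted({s.owner for s in steps} - set(ready_teams))
    if missing:
        log.append(f"no-go: teams not ready: {', '.join(missing)}")
        return False, log

    done: List[Step] = []
    for step in steps:
        if step.run():
            log.append(f"ok: {step.name} ({step.owner})")
            done.append(step)
        else:
            log.append(f"failed: {step.name} ({step.owner})")
            # Roll back completed steps, most recent first.
            for prev in reversed(done):
                prev.rollback()
                log.append(f"rolled back: {prev.name}")
            return False, log

    return True, log
```

In a sketch like this, a team that has not confirmed readiness stops the rehearsal before anything is touched, and the rollback path is exercised whenever a step fails rather than only in an emergency.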
Migration was scheduled for January. We hoped to do at least two simulations before that time.
The first simulation ran in a pre-production environment in November. It involved QA (for manual testing), the network team (for DNS changes), developers, infrastructure engineers, and project managers. The November test failed because there was a communication issue, and QA was not ready to verify on time.
The second simulation, which ran in the same environment at the end of December, was successful. January’s successful production migration ran as expected, with less than one hour of downtime.
When facing a complex and critical migration like the one discussed here, the focus seems naturally to fall on technologies, but we should recognize that human factors are equally important. In this particular case, and based on our first (unsuccessful) simulation, the fault lines ran between the client’s various teams.
Testing the application and infrastructure, while important, is not enough on its own. The success of such testing shows an incomplete picture. Failure to rehearse the various teams would have led us to a false assumption of readiness, leading to failure of the real migration, when it mattered most. Without testing the people part of the process, we would have been left wondering what, exactly, had gone wrong.
Recognizing that people are an important part of advance testing contributed a great deal toward our happy, less-than-one-hour downtime migration. Through the simulations, the client migration team came to understand their responsibilities and their importance to migration success.
In the end, all technical problems are people problems.