Throughout my career working on software development projects, I’ve seen many production deployments go wrong or fail. It will happen to any software engineering team at some point; it’s a fact! Another (not so pleasant) fact is that people find many different ways to make the situation even worse. Bear with me, you will get my point in a minute.
Back in the 2000s, the team I was working on deployed a bug to production that caused trouble for some users. Executives got upset and eventually put the blame on the deployment manager, saying that he had allowed the release to happen without adequate quality assurance. This was actually the responsibility of the engineering manager (and the team), but since the deployment manager was the one blamed for the problem, he had the right to do something about it. And he did: he introduced a policy requiring teams to announce deployments at least one week in advance, so the release team could plan its own black-box testing and make sure things looked good.
This “solution” created a very bad “developer versus QA” dispute and an environment marked by lack of trust. Instead of collaborating, the development and QA/release teams were fighting each other when they could have been working together to improve the product and its quality. Not only that: the release process got substantially more complicated and time-consuming, and in the end this only made the company slower and more bureaucratic without fixing the quality. Not to mention that when the next problem happened, the *development team* put the blame on the testing team too! They said: “It’s your fault, you didn’t test it right!” How absurd is that?
Another situation happened at a different company I worked for, where a team released a new feature to production that almost completely prevented users from using an application. The solutions discussed involved a complicated release process that would require two weeks for a feature to be rolled out, locking the code repository so people couldn’t touch the code after some point, and so on. The result of all those actions is no different from the first situation: the team and the company get slower at releasing features, people become afraid of taking risks and so don’t innovate, the work becomes more bureaucratic, people become less productive and then frustrated, and it goes on and on.
The pattern is simple: instead of fixing the actual problem, people create other kinds of problems while trying to fix the original one.
But how do you fix it effectively?
First, if you want effective immediate/short-term solutions for these problems, the initial step is to really understand what happened so you can fix the right thing. Was it that some parts of the system are lacking test coverage? Was it that the acceptance tests are not automated and the team didn’t do proper regression testing? Was it that the release process is manual and the release engineer was sleepy that day and ran the wrong script? Only when you know the root cause will you be able to do something effective about it.
Second, and more important, if you don’t think about the entire system and look for a holistic solution, you will run into the same problems over and over again in the future.
Bottom line: you need to fix the system, and not the problem.
People tend to go for short-term solutions that appear to fix the problem but don’t (and also won’t prevent it from happening again). You shouldn’t be thinking about things like firing people or sending angry e-mails. The impact this would have on the system is that people stop taking risks so they don’t get burned; thus they won’t innovate, your company will miss opportunities, and your people could get frustrated and go somewhere else. And that’s pretty bad.
A real-world example
One possible approach, which I learned and implemented over the last year, is Continuous Deployment: constantly deploying software to production in an automated fashion, several times a day, with little to no human intervention and with a high level of confidence.
And how does that fix this class of problem? When you adopt such extreme deployment practices, you need to introduce amazing discipline, automation and quality, otherwise you will really screw things up. That said, the Continuous Deployment practice will naturally trigger several healthy behaviors that end up fixing the software development system as a side effect. Think about it.
Say that you want to have your product going to production immediately after a check-in on the version control system.
The first thing you will need is a way to deploy code automatically to production. But you need to do it safely, because you don’t want to disturb the users of your application. One approach might be to take one of the servers out of your production cluster for a while, install the new version of the software, run some tests, put the server back, and then repeat that operation until all servers are updated. But because you don’t want to do that manually (it’s error-prone and slow), you need to automate some smoke tests and the entire deployment process. As a result, deployment will be automated from tip to tail, including verifying that it went well.
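To make the idea concrete, here is a minimal sketch of that one-server-at-a-time rollout. It is not the tooling we actually used: the function name `rolling_deploy` and the four callables passed into it are hypothetical placeholders for whatever your load balancer, installer and smoke tests look like.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    servers: Iterable[str],
    version: str,
    remove_from_pool: Callable[[str], None],
    install: Callable[[str, str], None],
    smoke_test: Callable[[str], bool],
    add_to_pool: Callable[[str], None],
) -> List[str]:
    """Update servers one at a time; halt at the first failed smoke test.

    All callables are assumptions standing in for real infrastructure.
    Returns the servers that were updated and put back into rotation.
    """
    updated: List[str] = []
    for server in servers:
        remove_from_pool(server)   # stop sending user traffic to this server
        install(server, version)   # install the new build while it is idle
        if not smoke_test(server):
            # Keep the broken server out of rotation and stop the rollout,
            # so the remaining servers keep serving the old version.
            raise RuntimeError(f"smoke test failed on {server}; rollout halted")
        add_to_pool(server)        # healthy: start sending traffic again
        updated.append(server)
    return updated
```

The point of the design is that a failure stops the rollout early: users only ever reach servers that either still run the old version or have passed the smoke tests on the new one.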
If you want to go even more extreme and make your code go straight to production after a commit, you will want to make sure that not only the newly added code is working but also that you didn’t break other things that were already working. But because you don’t want to test everything manually (again, it’s error-prone and slow), you need a good amount of test coverage. In that case, Test-Driven Development, automated acceptance tests and Continuous Integration might be useful practices, as they will help you maintain a high-coverage, high-quality test suite. Also, you will probably want to do some code review, because it’s always good to have another pair of eyes and another brain do some verification, not to mention that this is a good practice to spread knowledge and understanding about the software. But because you don’t want to lose time and you do want to deliver customer value as soon as possible, you might want to go extreme and start doing some Pair Programming. And to prevent unfinished features from going live, you might want to use some feature flags too. As a result, you now have a very trustworthy process from code check-in to deployment in production. Not to mention a pretty damn fast team: the system is so well engineered that people will be alerted if something goes wrong, so they get more confident and courageous about adding features and changing the codebase faster.
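As an illustration of the feature flag idea, here is a minimal sketch, assuming flags are just a mapping from a flag name to a rollout percentage. The `FeatureFlags` class and its `is_enabled` method are hypothetical names invented for this example, not a specific library.

```python
import hashlib
from typing import Dict


class FeatureFlags:
    """Percentage-based feature flags (illustrative sketch, not a real library)."""

    def __init__(self, rollout_percent: Dict[str, int]) -> None:
        # Flag name -> percentage of users (0 to 100) who should see the feature.
        self._rollout = dict(rollout_percent)

    def is_enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags are off by default
        # Hash flag and user together so each user lands in a stable bucket
        # (0 to 99) per flag, and buckets differ from one flag to another.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent
```

With something like this, unfinished code can ship to production behind a flag set to 0, then be opened to a small share of users by raising the percentage, without another deployment of the code itself.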
And the cultural change doesn’t stop there
You also don’t want any big-bang deployments that change the entire application. You want to add features in small increments so that you can slowly and continuously change the system and see how it behaves. Even the big features you will want to split into smaller ones that can be better digested by the team and deployed in smaller parts, either to see how users will behave or to see how the system/environment/code itself will behave. You will soon realize that you need shorter sprints, because you want to deploy to production right after things are ready. Why wait two weeks to deploy if everything is ready? By the way, why do you need sprints anymore? Why not shift to a sprint-less Agile approach? No more Sprint Reviews & Retrospectives: everything will happen all the time, just in time, after every story or task is completed. No more deployments at the end of the sprint: deployments can happen at any time and all the time!
This is not a hypothetical story: it is almost entirely based on a real project that I recently worked on. By having the right mindset, creating the right culture and ultimately aiming for Continuous Deployment, we changed the entire development system & process and shifted everybody’s behavior, from managers to developers, QA and release engineers.