Management Programming

Deploying to Production: When things go wrong, don’t make it worse! (a.k.a.: A story about Continuous Deployment)

Throughout my career working in software development projects I’ve seen many deployments to production go wrong or fail. It will happen to any software engineering team at some point, it’s a fact! Another (not so pleasant) fact is that people find many different ways to make the situation even worse. Bare with me, you will get my point in a minute.

Back in the 200Xs, the team I was working on deployed a bug to production that caused trouble to some users. Executives got upset and eventually put the blame on the deployment manager, saying that he allowed the release to happen without adequate quality assurance. This was actually the engineering manager’s (and team) responsibility, but since the other guy was blamed for the problem, he had the right to do something about it. And he did – he introduced a policy that would require teams to notify about deployments at least one week in advance so the release team could plan to do some black-box testing by themselves in advance and make sure things looked good.

Did you mean “lack of trust”?

This “solution” created a very bad “developer versus QA” dispute and lack-of-trust environment. Instead of collaborating, the development and QA/release teams were fighting against each other when they could be working together improving the product and quality. Not only that: the release process got substantially more complicated and time consuming, and in the end this only contributed for the company to be slower and more bureaucratic and didn’t fix the quality. Not to mention that when the next problem happened, the *development team* put the blame on the testing team too! They said: “It’s your fault, you didn’t test it right!”. How nonsense is that?

Another situation happened in a different company I worked for where a team released a new feature to production that almost completely prevented users to use an application. The solutions discussed involved a complicated release process that would require 2 weeks for a feature to be rolled out, locking the code repository so people couldn’t touch the code after some point and so on. And the result of all those actions are not different from the first situation: the team and company will get slower to release features, people become afraid of taking risks thus don’t innovate, the work becomes more bureaucratic, people become less productive, then become frustrated, and it goes on and on.

The pattern is simple: instead of fixing the actual problem, other kinds of problems are created in order to try to fix the original problem.

But how do you fix it effectively?

First, if you want effective immediate/short-term solutions for these problems, the initial step is to really understand what happened so you can fix the right thing. Was it that there are some parts of the system that are lacking test coverage? What is that the acceptance tests are not automated and the team didn’t do proper regression testing? Was it that the release process is manual and the release engineer was sleepy that day and ran the wrong script? Only when you know the root cause you will be able to do something effective about it.

Second and more important, if you don’t think about the entire system and think about a holistic solution, you will eventually run into the same problems over and over again in the future.

Bottom line: you need to fix the system, and not the problem.

People tend to go for short-term solutions that will apparently fix the problem but won’t (and also won’t prevent it from happening again). You shouldn’t be thinking about things like firing people or sending angry e-mails. The impact this would have in the system is that people won’t take risks anymore so they don’t get screwed up, thus they won’t innovate, thus your company will miss opportunities and your people could get frustrated and go somewhere else. And that’s pretty bad.

A real world example

One of the possible approaches that I learned and implemented over the last year is to implement Continuous Deployment, that consists in constantly deploying software to production in an automated fashion, several times a day, with little-to-none human intervention and with a high level of confidence.

And how does that fix this class of problem? When you implement such extreme deploying practices, you need to introduce amazing discipline, automation and quality, otherwise you will really screw things up. That said, the Continuous Deployment practice will naturally trigger several healthy behaviors that will end up fixing the software development system as collateral damage. Think about it.

Say that you want to have your product going to production immediately after a check-in on the version control system.

First thing you will need is to find a way to deploy code automatically to production. But you need to do it safely because you don’t want to disturb users using your application. One approach might be to take one of the servers out of your production cluster for a while, install a new version of the software, run some tests, and then repeat that operation until all servers are updated. But because you don’t want to do that manually (because it’s error prone and slow), you need to automate some smoke tests and the entire deployment process. As a result, deployment will be automated from tip to tail, including verifying if it went well.

If you want to go even more extreme and make your code go straight to production after a commit, you will want to make sure that not only the new just-added code is working but also that you didn’t break other things that were already working. But because you don’t want to test everything manually (because, again, it’s error prone and slow), you need to have a good amount of test coverage. In that case, Test-Driven Development, automated acceptance tests and Continuous Integration might be useful practices, as they will help you maintain a really high coverage and high-quality test suite. Also, you will probably want to do some code review because it’s always good to have another eye and brain do some verification, not to mention that this is a good practice to spread knowledge and understanding about the software. But because you don’t want to lose time and deliver customer value as soon as possible, you might want to go extreme and start doing some Pair Programming. And to prevent unfinished features from going live, you might want to use some feature flags too. As a result, now you have a very trustworthy process from code check-in to deployment in production. Not to mention a pretty damn fast team, because the system is so well engineered that they will be alerted if something goes wrong, so they get more confident and courageous to add features and make changes to the codebase faster.

And the cultural change doesn’t stop there

You also don’t want any big-bang deployments that change the entire application. You want to add features in small increments so that you can slowly and continuously change the system and see how it will behave. Even the big features you will want to split in smaller features that can be better digested by the team and deployed in smaller parts, either to see how users will behave or to see how the system/environment/code itself will behave. You will soon realize that you need to have shorter sprints, because you want to deploy to production right after things are ready. Why do you have to wait for 2 weeks to deploy if everything is ready? By the way, why do you need sprints anymore? Why not shifting to a sprint-less Agile approach? Sprint Reviews & Retrospectives no more – everything will happen all the time, just in time, after every story or task is completed. Deployments by the end of the sprint no more – deployments can happen at any time and all the time!

This is not a hypothetical story – it is almost entirely based on a true project that I recently worked on. By having the right mindset, creating the right culture and ultimately targeting at Continuous Deployment, we changed the entire development system & process and shifted everybody’s behaviors from managers to developers, QA and release engineers.

Control your outputs by controlling the system, and not the outputs.


Refactoring: Do it all the time!

Refactoring is a very well known technique that has been used for years in the software development field. Here are some popular definitions:

“Code refactoring is the process of changing a computer program’s source code without modifying its external functional behavior in order to improve some of the nonfunctional attributes of the software. Advantages include improved code readability and reduced complexity to improve the maintainability of the source code, as well as a more expressive internal architecture or object model to improve extensibility.”

Or, being more concise:

“A change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior.”
Martin Fowler

My concern with Refactoring is that most people don’t realize that they will have much more benefits if they do it all the time.

As the definition above explains, the purpose of Refactoring is to make code easier to change over time. It doesn’t make sense to keep adding features to your code for, say, 10 months and then stop-the-world for a couple of weeks to refactor it. If you’re doing that, you’re totally missing the point. In this hypothetical situation, not only (1) it would become more complex to make changes in the software over time – and you’d be in trouble for several months – but also (2) at some point of time the team would have to stop and re-engineer everything – because the software would become very difficult to change.

If a team waits for 10 months, or 2 months, or 2 weeks to refactor something, they are doing it wrong. Each time new features are added the code becomes more complex, and it’s necessary to review it as soon as possible to make the necessary improvements. If the team doesn’t do that often, they will become slower and slower each day, they will be less productive and deliver much less features (or value).

The “re-engineer” thing even worse. First, teams that stop to make huge refactorings are delivering nothing to their customers while doing that because they’re too busy fixing their own mess. It’s definitely bad to pay a software team to deliver nothing. Second (and more important), huge refactorings are often risky, because they change a great amount of code and software components, usually introducing bugs and unexpected behaviors.

To avoid all these bad things, you must do refactoring every day, all the time!

(1) Red, (2) Green, (3) RefactorEvery time you make a change, you must look for an opportunity to make an improvement. If you keep doing that, your code will always be clean, maintainable and easy to modify. Remember the TDD cycle: Red-Green-Refactor? Don’t miss the refactor part, do it all the time!

In my experience it takes no more than a few minutes to refactor if you are doing it in small chunks all the time, and more importantly, this will keep you from stopping to make huge and risky software re-writes.


Why I don’t write code comments

These days I was configuring some personal projects on ohloh and one thing called my attention. Ohloh showed the message “very few source code comments” in all my projects.

This is no surprise to me. I really don’t like to write source code comments. Why? Because if my code isn’t clear and easy enough for one to read and easily understand, I need to refactor. Many times I ask friends to take a look at code snippets and check if they understand. When they don’t, the method is usually too big or the variable/method names are not clear enough, and then I refactor to make it better to understand.

One example that I love (and use in many presentations) is the code to send e-mails using Java Mail API. The code would be something like this (from Java World tutorial):

// create some properties and get the default Session
Properties props = new Properties();
props.put("", _smtpHost);
Session session = Session.getDefaultInstance(props, null);
// create a message
Address replyToList[] = { new InternetAddress(replyTo) };
Message newMessage = new MimeMessage(session);
if (_fromName != null)
    newMessage.setFrom(new InternetAddress(from,
        _fromName + " on behalf of " + replyTo));
    newMessage.setFrom(new InternetAddress(from));
// send newMessage
Transport transport = session.getTransport(SMTP_MAIL);
transport.connect(_smtpHost, _user, _password);
transport.sendMessage(newMessage, _toList);

I know that this is just an example but you know that it’s not difficult to find monsters like this one in production code.

This code is so difficult to read that the programmer had to say what he was doing, otherwise nobody would understand its purpose. There is a lot of infrastructure and business code together resulting on a huge abstraction leak.

Now take a look at the following code (from Fluent Mail API) and tell me how many seconds do you need to understand what the code does:

new EmailMessage()
    .withSubject("Fluent Mail API")
    .withBody("Demo message")

This is an extreme example, of course. Maybe you will not write Fluent Interfaces for every little thing you need in your code, but if you add just a little bit of organization (using a better abstraction) it will already improve code readability:

class Email {
    public Email(String from, String to, ...) { ... }
    public void setFrom(String from) { ... }
    public void setTo(String to) { ... }
    public void send() {
        // put your complicated Java Mail code here

Just by encapsulating the code in an abstraction that makes sense the WTF effect will be sensibly reduced. The programmers can now at least assume that the code sends e-mail messages since it’s included in a class called “Email”.

And that’s the point. When I create a culture of prohibiting myself from writing code comments, I’m obligating me to write easy, maintainable and understandable code.

That’s very extreme, I know, but the results of this approach are awesome.