
Deploying to Production: When things go wrong, don’t make it worse! (a.k.a.: A story about Continuous Deployment)

Throughout my career working on software development projects I’ve seen many deployments to production go wrong or fail. It will happen to any software engineering team at some point; it’s a fact! Another (not so pleasant) fact is that people find many different ways to make the situation even worse. Bear with me, you will get my point in a minute.

Back in the 200Xs, the team I was working on deployed a bug to production that caused trouble for some users. Executives got upset and eventually put the blame on the deployment manager, saying that he had allowed the release to happen without adequate quality assurance. Quality was actually the responsibility of the engineering manager (and the team), but since the deployment manager was blamed for the problem, he had the right to do something about it. And he did – he introduced a policy that required teams to give at least one week’s notice of deployments so the release team could plan its own black-box testing and make sure things looked good.

Did you mean “lack of trust”?

This “solution” created a very bad “developer versus QA” dispute and an environment of distrust. Instead of collaborating to improve the product and its quality, the development and QA/release teams were fighting each other. Not only that: the release process got substantially more complicated and time-consuming, and in the end this only made the company slower and more bureaucratic without fixing the quality problem. Not to mention that when the next problem happened, the *development team* put the blame on the testing team too! They said: “It’s your fault, you didn’t test it right!”. How nonsensical is that?

Another situation happened at a different company I worked for, where a team released a new feature to production that almost completely prevented users from using an application. The solutions discussed involved a complicated release process that would require 2 weeks for a feature to be rolled out, locking the code repository so people couldn’t touch the code after some point, and so on. The result of all those actions is no different from the first situation: the team and the company get slower at releasing features, people become afraid of taking risks and so don’t innovate, the work becomes more bureaucratic, people become less productive and then frustrated, and it goes on and on.

The pattern is simple: instead of fixing the actual problem, other kinds of problems are created in order to try to fix the original problem.

But how do you fix it effectively?

First, if you want effective immediate/short-term solutions for these problems, the initial step is to really understand what happened so you can fix the right thing. Was it that some parts of the system are lacking test coverage? Was it that the acceptance tests are not automated and the team didn’t do proper regression testing? Was it that the release process is manual and the release engineer was sleepy that day and ran the wrong script? Only when you know the root cause will you be able to do something effective about it.

Second and more important, if you don’t think about the entire system and think about a holistic solution, you will eventually run into the same problems over and over again in the future.

Bottom line: you need to fix the system, and not the problem.

People tend to go for short-term solutions that appear to fix the problem but don’t (and also won’t prevent it from happening again). You shouldn’t be thinking about things like firing people or sending angry e-mails. The impact this would have on the system is that people stop taking risks so they don’t get burned, so they don’t innovate, so your company misses opportunities and your people may get frustrated and go somewhere else. And that’s pretty bad.

A real world example

One approach that I learned and put into practice over the last year is Continuous Deployment, which consists of constantly deploying software to production in an automated fashion, several times a day, with little to no human intervention and with a high level of confidence.

And how does that fix this class of problem? When you adopt such extreme deployment practices, you need to introduce amazing discipline, automation and quality, otherwise you will really screw things up. That said, the Continuous Deployment practice will naturally trigger several healthy behaviors that end up fixing the software development system as a side effect. Think about it.

Say that you want to have your product going to production immediately after a check-in on the version control system.

The first thing you will need is a way to deploy code automatically to production. But you need to do it safely because you don’t want to disturb users using your application. One approach might be to take one of the servers out of your production cluster for a while, install a new version of the software, run some tests, and then repeat that operation until all servers are updated. But because you don’t want to do that manually (because it’s error prone and slow), you need to automate some smoke tests and the entire deployment process. As a result, deployment will be automated from tip to tail, including verifying that it went well.
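The one-server-at-a-time approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the `drain`, `install`, `smoke_test` and `restore` callables are hypothetical hooks that you would wire to your own load balancer, installer and test suite. They are not part of any real tool.

```python
from typing import Callable, List


def rolling_deploy(
    servers: List[str],
    version: str,
    drain: Callable[[str], None],          # hypothetical: take server out of the cluster
    install: Callable[[str, str], None],   # hypothetical: install `version` on server
    smoke_test: Callable[[str], bool],     # hypothetical: True if the server looks healthy
    restore: Callable[[str], None],        # hypothetical: put server back in the cluster
) -> List[str]:
    """Update servers one by one and return the ones that were updated.

    The rollout halts at the first failed smoke test, so the servers that
    were not reached yet keep serving the previous version.
    """
    updated: List[str] = []
    for server in servers:
        drain(server)
        install(server, version)
        if not smoke_test(server):
            # The broken server stays out of the cluster; nothing else is touched.
            raise RuntimeError(f"smoke test failed on {server}; rollout halted")
        restore(server)
        updated.append(server)
    return updated
```

Halting on the first failure is a design choice made for this sketch: it limits the damage to a single server that is already out of rotation, which is what makes it reasonable to run this several times a day.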

If you want to go even more extreme and make your code go straight to production after a commit, you will want to make sure that not only is the newly added code working but also that you didn’t break other things that were already working. But because you don’t want to test everything manually (because, again, it’s error prone and slow), you need to have a good amount of test coverage. In that case, Test-Driven Development, automated acceptance tests and Continuous Integration might be useful practices, as they will help you maintain a high-coverage, high-quality test suite. Also, you will probably want to do some code review because it’s always good to have another eye and brain do some verification, not to mention that this is a good practice to spread knowledge and understanding about the software. But because you don’t want to lose time and you do want to deliver customer value as soon as possible, you might want to go extreme and start doing some Pair Programming. And to prevent unfinished features from going live, you might want to use some feature flags too. As a result, now you have a very trustworthy process from code check-in to deployment in production. Not to mention a pretty damn fast team, because the system is so well engineered that they will be alerted if something goes wrong, so they get more confident and courageous to add features and make changes to the codebase faster.
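To make the feature-flag idea concrete, here is a minimal sketch. The `FeatureFlags` class, the `new_checkout` flag name and the `checkout_page` function are all made up for illustration. A real project would more likely read flags from configuration or a flag service.

```python
class FeatureFlags:
    """A tiny in-memory flag store (hypothetical, for illustration only)."""

    def __init__(self, enabled=None):
        self._enabled = set(enabled or [])

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def checkout_page(flags):
    # The unfinished code path ships to production but stays dark
    # until the flag is switched on.
    if flags.is_enabled("new_checkout"):
        return "new checkout"
    return "old checkout"
```

With something like this in place, merging and deploying half-finished work is safe as long as the flag defaults to off, and turning the feature on becomes a separate decision from deploying it.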

And the cultural change doesn’t stop there

You also don’t want any big-bang deployments that change the entire application. You want to add features in small increments so that you can slowly and continuously change the system and see how it behaves. Even the big features you will want to split into smaller ones that can be better digested by the team and deployed in smaller parts, either to see how users will behave or to see how the system/environment/code itself will behave. You will soon realize that you need to have shorter sprints, because you want to deploy to production right after things are ready. Why do you have to wait 2 weeks to deploy if everything is ready? By the way, why do you need sprints anymore? Why not shift to a sprint-less Agile approach? Sprint Reviews & Retrospectives no more – everything will happen all the time, just in time, after every story or task is completed. Deployments by the end of the sprint no more – deployments can happen at any time and all the time!

This is not a hypothetical story – it is almost entirely based on a real project that I recently worked on. By having the right mindset, creating the right culture and ultimately aiming for Continuous Deployment, we changed the entire development system & process and changed everybody’s behavior, from managers to developers, QA and release engineers.

Control your outputs by controlling the system, and not the outputs.


Thoughts on Git and “Enterprise Open Source”

Three years ago I wrote a post about how Git and Github changed the Open Source world and how companies can benefit from this model (in Portuguese, unfortunately, but you can try to read an almost-decent automatically translated version).

How come, three years later, companies are still struggling to find a good version control system and a good model for internal collaboration when there’s a working model out there ready to be copied? 🙂

In many big corporations – like the one where I work – finding code in (Subversion) repositories is like finding a needle in a haystack. Code is hidden from potential contributors, which makes collaboration very hard. If you are really willing to collaborate, you have to download the code and send a patch by e-mail or via the bug tracker. You just don’t know when, or how, or if the patch was applied. Besides all that, people lose hours and hours on slow SVN blames, checkouts and updates. Branching and merging with Subversion is too painful to mention: it’s an error-prone process that requires a lot of attention and, sometimes, hours of work. And if you are using SVN externals, oh god, poor you!

To solve these problems, two years ago my team started to work with Git using Git-SVN. This is a very well-known approach that lets you work on your Subversion repositories using Git as the client. You get some of the Git benefits like the automatic merges, local stash, local repositories and local commits but keep using Subversion as your remote repository for team syncs and “source of truth”. Well, it turns out that Git-SVN has some problems and you need to do some tricks to avoid trouble. And because you are still working with Subversion, in some situations you will still be slow, for instance, when you do an initial clone (it can take several hours if the repository is big). This approach seems acceptable at first, until you realize you’ve put a Lamborghini body around your old Horsey Horseless car.

In an effort to increase our productivity, we decided to go for a different (and obvious) approach: build our own Git server and quit Subversion forever. But because life is hard, we had several things plugged into our SVN repositories, from CI to the internationalization system, and given that Git is not “official”, some of those internal tools didn’t support it and wouldn’t work. Not to mention that if we still wanted to be compliant with IT we needed to have our code on Subversion anyway. Not having Subversion was not an option.

To solve that, we wrote a server-side hook that replicates code to Subversion when there’s a git push. Despite the fact that we were still having slow Git pulls and pushes sometimes, it was great because we got rid of many Git-SVN problems. The downside is that our code was being replicated to Subversion without commit history. Every push – regardless of how many commits it contains – becomes one single commit on Subversion, and we lose the commit messages. But since the main goal was to use Git only and still have the hooked systems working with Subversion, we loved it. We use Git for history (and everything else) while Subversion is just the thing that’s there because we couldn’t get rid of it.

Two years later and after many different (sometimes inexplicably weird) attempts, our Git setup evolved a lot. We’re now using Gitolite for managing our Git server and, because of that, we needed to change the sync strategy to a cron sync due to the fact that the server-side hook was somewhat unstable with Gitolite. We now have several repositories, there are a few different teams using our server and if other teams want to have their own similar setups, we created a comprehensive step-by-step manual so that in less than a couple hours they can be up and running with their own boxes.

Git adoption in our teams was painful but we made it. But we made it only for a couple of teams, and a lot of people are still suffering with Subversion throughout the company. And even if the entire company were using Git (which would be awesome already, don’t get me wrong), that would solve only the development productivity problem, not the collaboration problem. Repositories would still be hidden and you would only be able to clone them and send (manual) pull requests if you knew where they were.

That’s just wrong.

First, I believe developers shouldn’t still have to justify and fight for Git adoption when it is clearly becoming the industry standard everywhere. For instance, Google Code had to add Git support to catch up with Github, and because they took too long to do that they lost many projects and developers (including myself). Atlassian’s Bitbucket has also supported Git since late 2011. Even Microsoft recently announced Git support for Windows Azure. And the list goes on and on. The big players are recognizing Git’s relevance. And the small ones don’t care about anything else. Take Heroku for example, where deployment is only possible through Git.

Second, many Open Source projects like Rails, Node.js, Symfony, Django, the PHP language, Qt, openSUSE, YUI, jQuery and countless others have already done us the favor of proving how platforms like Github and Gitorious can greatly improve the collaboration and contribution experience by providing workflows and tools that are really helpful for maintaining software projects. Those platforms enhance collaboration significantly, besides giving visibility to people’s projects and the things they are working on.

We are not talking about a new hype here. Git and Github (and maybe this could be extended to other Github-like systems like Gitorious and Bitbucket) have become the industry standard in the past years. Git has been around for some 7 years now and was inspired by the way people worked on the Linux Kernel, one of the biggest and most important software projects in the history of computer science. By the way, if you haven’t done it already, stop for an hour now and watch this great video by Linus Torvalds on why he created Git. I promise you won’t regret it. Git is ultra fast, stable, scalable, secure and makes collaboration much easier and faster. Managing merges and patches won’t be a nightmare anymore, not only because a lot of effort and intelligence went into the merging itself but also because Git embraces collaboration workflows in a way that makes your life much easier (both for project owners and contributors). And Github will make your work more pleasant, collaborative and visible by adding even more tools and value on top of Git. But let’s not make this any longer, you get the idea already: they became the new standard for a reason.

The door is there. Now companies have to walk through it. And because I like challenges, I’ll help one more company take the red pill. See you on the other side.


Titanium Mobile hack: execute your projects from the command line using Make

Appcelerator’s Titanium Mobile is a nice platform to develop native iOS (and Android) applications using JavaScript. Unlike other platforms that provide only a web container to run web applications that pretend to be native, Titanium translates JavaScript code into native applications that look and behave like they were written in Objective-C.

To run Titanium applications you need to use Titanium Developer. This program executes a bunch of scripts to compile your projects using Xcode, start the application in the iOS simulator and so on. Fortunately Titanium is open sourced on Github, and while reading its code I had an idea that would make me feel much more comfortable about running my Titanium applications. If you are a “command-liner” like me, you will probably prefer to execute your applications from your shell using Makefiles, and that’s what my hack is all about.

The first thing I did was to understand how Titanium starts its projects and write a command-line script that replicates this behavior. Titanium uses one of its own scripts to do all the magic, and I wrote a simple wrapper script that calls it to compile and start the application.

You will notice in the script source that I read some data (the application name and ID) from tiapp.xml. This is the file used by Titanium to store a project’s metadata. If you want to use this script in your projects, it’s important to change the tiapp.xml path according to your project structure. My projects usually have a structure like this:

/bin (directory)
/SampleApp (directory)

… where:

  • bin is the directory where I put the wrapper script and other utility scripts.
  • SampleApp is the directory where my application lives (this directory has the same name as my application).
  • Makefile lives in the root directory.

If you use this same project structure, your tiapp.xml file will be located at /SampleApp/tiapp.xml. In this case, just copy my script and it will work for you too.
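The script that reads tiapp.xml is not reproduced here, so as a rough illustration of that step, the sketch below pulls the application name and ID out of the file with Python’s standard library. It assumes tiapp.xml keeps them in top-level `<name>` and `<id>` elements. The function names and the sample values are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_app_metadata(tiapp_xml_text):
    """Return the application name and ID found in a tiapp.xml document.

    Assumes <name> and <id> are direct children of the root element.
    """
    root = ET.fromstring(tiapp_xml_text)
    return {"name": root.findtext("name"), "id": root.findtext("id")}


def read_app_metadata(tiapp_xml_path="SampleApp/tiapp.xml"):
    """Same as parse_app_metadata, but reads from a path relative to the project root."""
    with open(tiapp_xml_path, encoding="utf-8") as handle:
        return parse_app_metadata(handle.read())
```

The default path matches the project layout above. If your tiapp.xml lives somewhere else, that is the only value you would need to change.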

You can also configure the Titanium and iOS SDK versions that will be used to compile. Just change the values of the *_SDK_VERSION variables to the versions you have installed on your computer. Some variables (like PROJECT_NAME) are parameters that will be passed when the script is called (in our case, they will be supplied by the Makefile).

One last thing about the wrapper script is that I use some Perl “black magic” at the end of the script to produce beautiful colored output depending on the log type (info, debug, error, etc.) – just like Titanium Developer does.
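The original does this with a Perl one-liner that is not shown here. The idea can be sketched in Python as below, under the assumption that each log line starts with its type in square brackets (for example `[INFO]` or `[ERROR]`). The color choices are arbitrary.

```python
import re

# ANSI escape codes; the level-to-color mapping is an arbitrary choice.
COLORS = {
    "ERROR": "\033[31m",  # red
    "WARN": "\033[33m",   # yellow
    "INFO": "\033[32m",   # green
    "DEBUG": "\033[36m",  # cyan
}
RESET = "\033[0m"
LEVEL_PREFIX = re.compile(r"^\[(\w+)\]")


def colorize(line):
    """Wrap a log line in the color of its level; leave other lines untouched."""
    match = LEVEL_PREFIX.match(line)
    color = COLORS.get(match.group(1).upper()) if match else None
    return f"{color}{line}{RESET}" if color else line


def colorize_stream(lines):
    """Colorize an iterable of lines lazily, e.g. the simulator's output."""
    for line in lines:
        yield colorize(line.rstrip("\n"))
```

To use it as a filter, you could pipe the simulator output into a small script that prints each item of `colorize_stream(sys.stdin)`.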

After that, you will just need a simple Makefile that will allow you to start the application by simply typing “make run” at the command line:

The Makefile is responsible for calling the wrapper script and passing all the necessary variables. Just copy my Makefile to your project root, change the PROJECT_NAME variable and you are good to go!

If you want to see how I use these scripts in my projects, take a look at titanium-jasmine on GitHub. You will notice in this project that you can do a lot of other fancy stuff using Make.


Refactoring: Do it all the time!

Refactoring is a very well known technique that has been used for years in the software development field. Here are some popular definitions:

“Code refactoring is the process of changing a computer program’s source code without modifying its external functional behavior in order to improve some of the nonfunctional attributes of the software. Advantages include improved code readability and reduced complexity to improve the maintainability of the source code, as well as a more expressive internal architecture or object model to improve extensibility.”

Or, being more concise:

“A change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior.”
Martin Fowler

My concern with Refactoring is that most people don’t realize that they will get many more benefits if they do it all the time.

As the definition above explains, the purpose of Refactoring is to make code easier to change over time. It doesn’t make sense to keep adding features to your code for, say, 10 months and then stop the world for a couple of weeks to refactor it. If you’re doing that, you’re totally missing the point. In this hypothetical situation, not only (1) would it become more complex to make changes in the software over time – and you’d be in trouble for several months – but also (2) at some point in time the team would have to stop and re-engineer everything – because the software would become very difficult to change.

If a team waits for 10 months, or 2 months, or 2 weeks to refactor something, they are doing it wrong. Each time new features are added the code becomes more complex, and it’s necessary to review it as soon as possible to make the necessary improvements. If the team doesn’t do that often, they will become slower and slower each day, they will be less productive and they will deliver far fewer features (or value).

The “re-engineer” thing is even worse. First, teams that stop to make huge refactorings are delivering nothing to their customers while doing that because they’re too busy fixing their own mess. It’s definitely bad to pay a software team to deliver nothing. Second (and more important), huge refactorings are often risky, because they change a great amount of code and software components, usually introducing bugs and unexpected behaviors.

To avoid all these bad things, you must do refactoring every day, all the time!

(1) Red, (2) Green, (3) Refactor

Every time you make a change, you must look for an opportunity to make an improvement. If you keep doing that, your code will always be clean, maintainable and easy to modify. Remember the TDD cycle: Red-Green-Refactor? Don’t miss the refactor part, do it all the time!

In my experience it takes no more than a few minutes to refactor if you are doing it in small chunks all the time, and more importantly, this will keep you from stopping to make huge and risky software re-writes.
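As a small illustration of what one of those few-minute refactorings might look like, here is a made-up example. The order-total function, its discount rule and all the names are hypothetical. The structure changes but the observable behavior does not, which is exactly what the tests from the Green step let you verify.

```python
# Before: it works (the tests are green), but the intent is buried.
def order_total_before(items):
    t = 0
    for i in items:
        if i["qty"] > 0:
            t = t + i["price"] * i["qty"]
    if t > 100:
        t = t - t * 0.1
    return t


# After: same behavior, with the business rules given names.
BULK_DISCOUNT_THRESHOLD = 100
BULK_DISCOUNT_RATE = 0.1


def subtotal(items):
    return sum(item["price"] * item["qty"] for item in items if item["qty"] > 0)


def apply_bulk_discount(amount):
    if amount > BULK_DISCOUNT_THRESHOLD:
        return amount - amount * BULK_DISCOUNT_RATE
    return amount


def order_total(items):
    return apply_bulk_discount(subtotal(items))
```

Because the change is this small, running the existing tests against `order_total` is enough to confirm that nothing broke, and the next feature lands on cleaner code.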


Coding Dojo SP @ Yahoo! Brazil

Last Friday (July 30th) we hosted our first Coding Dojo SP group meeting at Yahoo! Brazil’s office!

Coding Dojo @ Yahoo!

The meeting was really cool and very crowded. 🙂 We had a Randori session with around 30 developers to solve the “write numbers to words” problem using Python (thanks to the influence of our Pythonist friend “rbp”, who led the Dojo session).

I just finished writing a post on Yahoo! Developer Network’s blog explaining in more detail how the meeting went (and what the heck a Coding Dojo is). You can also see some pictures on my Flickr page.

The next meeting will be next week (we still have to define the date). Subscribe to the group’s mailing list to be notified about the next meetings.

See you next time here in the office! 😉


Why I don’t write code comments

The other day I was configuring some personal projects on Ohloh and one thing caught my attention. Ohloh showed the message “very few source code comments” in all my projects.

This is no surprise to me. I really don’t like to write source code comments. Why? Because if my code isn’t clear enough for someone to read and easily understand, I need to refactor. Many times I ask friends to take a look at code snippets and check if they understand them. When they don’t, the method is usually too big or the variable/method names are not clear enough, and then I refactor to make it easier to understand.

One example that I love (and use in many presentations) is the code to send e-mails using the Java Mail API. The code would be something like this (from a Java World tutorial):

// create some properties and get the default Session
Properties props = new Properties();
props.put("", _smtpHost);
Session session = Session.getDefaultInstance(props, null);
// create a message
Address replyToList[] = { new InternetAddress(replyTo) };
Message newMessage = new MimeMessage(session);
if (_fromName != null)
    newMessage.setFrom(new InternetAddress(from,
        _fromName + " on behalf of " + replyTo));
else
    newMessage.setFrom(new InternetAddress(from));
// send newMessage
Transport transport = session.getTransport(SMTP_MAIL);
transport.connect(_smtpHost, _user, _password);
transport.sendMessage(newMessage, _toList);

I know that this is just an example but you know that it’s not difficult to find monsters like this one in production code.

This code is so difficult to read that the programmer had to say what he was doing, otherwise nobody would understand its purpose. There is a lot of infrastructure and business code mixed together, resulting in a huge abstraction leak.

Now take a look at the following code (from Fluent Mail API) and tell me how many seconds you need to understand what the code does:

new EmailMessage()
    .withSubject("Fluent Mail API")
    .withBody("Demo message")

This is an extreme example, of course. Maybe you will not write Fluent Interfaces for every little thing you need in your code, but if you add just a little bit of organization (using a better abstraction) it will already improve code readability:

class Email {
    public Email(String from, String to, ...) { ... }
    public void setFrom(String from) { ... }
    public void setTo(String to) { ... }
    public void send() {
        // put your complicated Java Mail code here
    }
}

Just by encapsulating the code in an abstraction that makes sense, the WTF effect will be noticeably reduced. Programmers can now at least assume that the code sends e-mail messages, since it’s included in a class called “Email”.

And that’s the point. When I create a culture of prohibiting myself from writing code comments, I’m forcing myself to write simple, maintainable and understandable code.

That’s very extreme, I know, but the results of this approach are awesome.