“If development is frequently called in the middle of the night, automation is the likely outcome. If operations is frequently called, the usual reaction is to grow the operations team.” –James Hamilton
Klarna is in the middle of a continuous migration from its monolithic legacy system towards a more distributed, service-oriented architecture. Unlike more traditional financial companies, such as banks and stock exchanges, Klarna’s merchant-facing APIs do not have scheduled service windows. We want as close to 24/7 availability as possible, because our ability to accept purchases directly impacts our merchants’ ability to convert sales for the end user.
FRED is Klarna’s Next-Generation Purchase-Accepting System. It uses Riak as a high-availability distributed data store, with a data synchronization layer towards the legacy system transported over RabbitMQ. We are currently using Chef for configuration management and Splunk for log aggregation and search.
I’ve been with FRED since its planning stages and have witnessed it grow from an architectural idea into a proof-of-concept and finally into a production system responsible for a significant portion of Klarna’s purchase-accepting ability. Klarna’s legacy system was a Heinleinian dinkum thinkum for all of the company’s in-house functionality, so transitioning its most real-time critical component to a separate system has been a maturing experience for the department and the engineers involved.
Typical for a young system, but atypical for a financial application, FRED is operated by its developers. DevOps is a big buzzword these days, and can mean a lot of things, such as using Infrastructure-as-Code configuration management and having developers-as-operators.
For FRED, DevOps means:
a pool of Operator-Developers groomed from knowledgeable developers
Fire Marshall: rotating System Owner who is the designated operator-on-call
Fire Marshall is responsible for the master branch, live system and production upgrades
be pragmatic and focus on what is doable
incrementally improve on what works poorly
communicate with the rest of the organization through standardized interfaces
There is a lot of responsibility on the Fire Marshall, but it also pushes us to really streamline operations. One team member likes to tell an anecdote that illustrates why developer-operated systems tend to be more reliable. During the Vietnam War, the US military had major problems with helicopters failing and even crashing out of the sky during missions. Eventually, they started putting the mechanics on board the helicopters. Unsurprisingly, the incidence of mechanical failures decreased dramatically.
Having a regular rotation for the Fire Marshall further streamlines operations. A single mechanic in the helicopter might create a setup that is very efficient for himself to operate in, but mysterious and error-prone for the occasional backup who has to fill in. Having multiple people who rotate through the position forces the creation of a consistent and convenient set of operations utilities, and makes it glaringly obvious when an operational task is not documented, automated or understood by more than one person.
For the most part, when “documenting” operational tasks we prefer well-commented executable scripts, or documents pointing to a single command for each action, over written documentation enumerating multiple steps. Executable scripts are more amenable to peer review, are testable, and are by definition the procedure that actually gets performed. Oftentimes, I’ve found myself writing technical documentation for a procedure and then, after looking at the finished product, realizing I might as well have scripted the procedure instead. So eventually I do.
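To make the idea concrete, here is a minimal sketch of a procedure written as a script rather than as a list of steps. It is not FRED’s actual tooling: the `Step` type, the `run_procedure` helper and the two placeholder commands are all hypothetical, chosen only so the example runs anywhere Python does.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    description: str    # what a written runbook would have said
    command: List[str]  # the command that actually gets run


# Hypothetical procedure; the commands are harmless placeholders.
STEPS = [
    Step("Check that the node answers", [sys.executable, "-c", "print('pong')"]),
    Step("Report the running version", [sys.executable, "-c", "print('1.0.0')"]),
]


def run_procedure(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Print every step; execute it unless dry_run. Stops at the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        line = f"[{number}/{len(steps)}] {step.description}: {' '.join(step.command)}"
        print(line)
        log.append(line)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(step.command, check=True)
    return log


if __name__ == "__main__":
    # Dry run by default; pass --execute to actually run the commands.
    run_procedure(STEPS, dry_run="--execute" not in sys.argv)
```

The dry-run output doubles as the human-readable documentation, and the same file is what gets reviewed and executed.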
Being the lazy and particular software engineers that we are, we try to apply as much good interface design to how we communicate with the rest of the organization as we do to the APIs and frameworks we implement. Our release notifications are generated from the pull request history and the Jira tickets. Requests for operations tasks come through structured tickets or mail forms with designated fields for all necessary information. The philosophy here is that requests for work should have a clear interface that requires all the information needed to complete the task. To whatever degree we can make a computer enforce that for us, we do.
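As a rough illustration of letting a computer enforce such an interface, the sketch below rejects a request that lacks any required field. The field names and the `accept` function are assumptions made up for this example, not the actual forms we use.

```python
from typing import Dict, List

# Hypothetical required fields for an operations request.
REQUIRED_FIELDS = ("requester", "environment", "action", "reason")


def missing_fields(request: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or blank, in declared order."""
    return [name for name in REQUIRED_FIELDS if not request.get(name, "").strip()]


def accept(request: Dict[str, str]) -> Dict[str, str]:
    """Refuse incomplete requests up front instead of asking follow-up questions."""
    missing = missing_fields(request)
    if missing:
        raise ValueError("request is missing: " + ", ".join(missing))
    return request
```

A request that passes `accept` can be acted on without a round of clarifying e-mails, which is the whole point of the form.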
A great example of this is how our pull request management has evolved. In FRED’s infancy, Klarna used separate, un-integrated systems for git repository management (gitolite), issue management (Jira), code reviews (Reviewboard) and regression testing (Jenkins).
The practice of all developers pushing against the master branch lasted just days. This was chaotically bad for team relations, and we weren’t anywhere near being in production.
Then we went through an awkward phase of e-mailed pull requests with inconsistent information between requests.
Since it was clear that we really needed links to all relevant points, we just made a mail form with the code review, regression run and Jira issue as required fields.
Once we had all this information in a formatted e-mail, it wasn’t too difficult to pre-format the command that the merger ran to verify that all these things were actually related to each other before merging and pushing.
Given the disjointed systems we had to work with, that last procedure worked remarkably well, but it was still tiresome to enter all the links correctly.
These days, however, we just use a tight integration between Atlassian Stash, Atlassian Jira and Jenkins.
Stash takes care of the git repository management, code reviews and pull requests.
Regression tests are automatically kicked off on Jenkins when developers push code, and the results are reported back to Stash, which can use the information to gatekeep pull requests.
Pull requests and commits are associated with Jira issues based on naming convention, so in the end all relevant information is related to the issue ticket.
Oh brave new world, that has such automation in it!
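The naming convention mentioned above is what makes the linking cheap. As a minimal sketch, assuming issue keys of the usual Jira shape (an uppercase project key, a hyphen, a number) and using `FRED` as a hypothetical project key, the keys can be pulled out of a branch name or commit message like this. The regular expression is an approximation for illustration, not the rule the integration itself applies.

```python
import re
from typing import List

# Assumed key shape: uppercase project key, hyphen, issue number (e.g. FRED-123).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issue_keys(text: str) -> List[str]:
    """Return the distinct issue keys found in text, in order of first appearance."""
    seen = []
    for key in ISSUE_KEY.findall(text):
        if key not in seen:
            seen.append(key)
    return seen
```

For example, `issue_keys("feature/FRED-123-add-switch")` picks out the single key embedded in the branch name, while a commit message with no key yields an empty list and can be flagged before it is merged.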
As gatekeepers of the FRED master branch and the production system, the rotating Fire Marshalls have a vested interest in making sure changes to the system are as operator-friendly as possible. Even though FRED supports rolling upgrades, and can often roll back a bad release, we prefer that risky new changes have an On/Off switch, with counters and timers on as much relevant information as possible. When a potential Fire Marshall is not on-call, much of his development time goes towards code and infrastructure improvements that enrich the configurability and monitorability of the production system. Or cool operations improvements to impress the other Fire Marshalls.
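Here is a minimal sketch of what an On/Off switch with counters and timers can look like. It is written in Python purely for illustration and does not reflect how FRED implements it; the switch name `new_risk_check`, the metric names and the `accept_purchase` function are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

switches = {"new_risk_check": False}  # hypothetical switch, off by default
counters = defaultdict(int)           # how often each code path ran
timings = defaultdict(list)           # elapsed seconds per code path


@contextmanager
def timed(name):
    """Record how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)


def accept_purchase(amount, new_path, old_path):
    """Use new_path only while the switch is on; count and time whichever path runs."""
    path = "new" if switches["new_risk_check"] else "old"
    counters[f"accept_purchase.{path}"] += 1
    with timed(f"accept_purchase.{path}"):
        handler = new_path if path == "new" else old_path
        return handler(amount)
```

Flipping `switches["new_risk_check"]` at runtime moves traffic between the two paths without a deploy, and the per-path counters and timings show the operator what the change is doing.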
Being an operator-developer on FRED and watching the system grow has been a really satisfying experience. As the first major functionality to be split off from the legacy system, we’ve had to trail-blaze many new solutions while still being part of a much larger organization. Klarna is further diversifying into a real service-oriented architecture with independent services for payment processing, risk decisions, customer identification, and back-office servicing, which makes me look forward to seeing the company grow as a laboratory for innovations in very agile developer-operations while still providing a highly-available payments solution for our customers.
The current roster of FRED Fire Marshalls:
Thomas Järvstrand: Thomas has been with Klarna the longest and had extensive experience working with Klarna’s merchant API before joining the FRED project to help port that functionality to a new system. He is well-regarded in the Erlang community as the maintainer of EDTS, an Emacs plugin for IDE-like functionality for Erlang development. He tames hairy Chef recipes with a smile on his face and a great big hammer in his hand.
Malcolm Matalka: Malcolm has a strong interest in performance analysis, distributed systems and databases, while happening to be a rather ninja-like shell utility guru. Rumor has it he might even understand how Makefiles work instead of just copy-tweaking them like the rest of us. In his spare time, he writes distributed systems solutions in OCaml and broils big pieces of meat to steakhouse standards.
Daniel Lee: The author in a past life was a researcher specializing in structured operational semantics. Despite the name, structured operational semantics has remarkably little to do with DevOps. His professional passion is picking apart hard problems until they have simple solutions. He dreams of one day replacing himself with an IRC-bot that calls a bunch of shell scripts.
/Daniel Lee, The Operator-Developer Fire Marshalls of Klarna’s Next Gen Purchase