Peak Season 2023: How Klarna achieved consistent success

Anu Sasidharan
Published in Klarna Engineering
9 min read · Mar 1, 2024

Introduction

This article summarizes how Klarna consistently achieved its Peak Season goals during 2023.

In 2023, we aimed higher. We were committed to continual improvement, building on the progress made in 2022. Our primary focus was to deliver the best possible experiences to our customers. During the Black Friday sale, we broke our own records, a testament to our commitment and capability.

1. Klarna’s systems showed resilience with zero critical or major incidents, and we saw a 30% reduction in overall incidents compared to 2022.

2. We optimized the management of resources, leading to optimal cloud costs.

3. We paid special attention to our engineers’ experience, ensuring a smooth Peak Season preparation. This improved efficiency and lessened the workload.

The topics this article covers:

  • Peak Season at Klarna
  • Factors contributing to our success
  • Approaches to Peak Season essentials
  • Lessons learned

Peak Season at Klarna

The most important time of the year for Klarna is the ‘Peak Season.’ It starts with the busy week of Black Friday and ends with the year-end sales. Peak Season is busy because of holidays and festivals, sales and discounts, and seasonal necessities. During this time, e-commerce and fintech companies see heightened activity, and Klarna is a significant player among them.

Why is the Peak Season important for Klarna?

Klarna’s mission is to give shoppers around the world easy, safe, and ‘smoooth’ ways to pay. We handle an average of over 2 million purchases every day, serving over 150 million active shoppers at more than 450,000 sellers in 45 countries. On Black Friday, we handle more than 3 times the usual daily purchases.

Big sales events, especially flash sales, mean our systems have to manage a lot of traffic. Flash Sales are quick discounts or promotions from stores that get buyers excited. This rush adds to the already busy season, and our systems have to quickly handle more than 8 times the usual activity. We have even seen this go up to 40 times for a big merchant, all while still providing customers with a delightful experience.

What were the key challenges in getting ready for the Peak Season?

Klarna is a large organization with many teams working in different domains and functions. Peak Season preparation involved all systems that directly serve shopping traffic, as well as the systems that support those frontline systems. In total, over 450 systems, 180 teams, and 17 domains prepared for the Peak Season.

The challenges were:

1. Making sure the 450+ systems were ready to manage the increased loads smooothly, especially during flash sales,

2. Ensuring the security of these systems from potential attacks,

3. Aligning key decisions and strategies across the teams,

4. Communicating decisions, timelines, processes, and best practices effectively.

Key Success Factors

Klarna’s decentralized structure lets teams work together and make decisions quickly, helping respond to the market’s needs. This focus on innovation and customer needs ensures smoooth shopping experiences.

For Peak Season management, we have implemented an efficient organizational structure designed for our distributed environment, with clearly defined roles and responsibilities. Good teamwork, particularly central coordination, was integral to our success during the 2023 Peak Season.

  • Teams/System Owners, being the key players, maintained their systems to meet the required standards.
  • The Continuous Readiness Team centrally coordinated the necessary preparations by making appropriate decisions and developing effective tools, processes, and practices to ensure operational readiness.
  • The Steering Committee, which includes Yaron Shaer (our CTO), Domain Leaders, and Architects, oversaw and approved the Continuous Readiness Team’s decisions.
  • Domain Readiness Leads took the lead in their areas and provided useful feedback to the Readiness Team, helping to identify and handle potential issues early.
  • Business Developers worked closely with merchants to give important flash sale details.

Klarna’s Engineering Platform laid a solid foundation for the System Owners to ensure optimal performance of their systems. In 2023, we focused on effectively distributing responsibilities across team, domain, and central levels, designing Peak Season requirements, and creating tools for compliance checks.

We have specified two types of readiness configurations: non-negotiable requirements, which are measured objectively, and team-managed configurations, which are evaluated subjectively under the oversight of the respective domain.

  • We implemented a readiness tool that automatically checked the non-negotiable compliance requirements and alerted System Owners so they could keep their systems aligned with Peak Season and continuous requirements. It applied 56 rules examining system readiness across categories such as resilience, availability, databases, performance, and system capacity. Readiness dashboards enabled the Central Team to monitor and ensure system alignment. The tool assigned readiness scores to systems, teams, groups, domains, and Klarna as a whole. Our goal was to achieve a 100% score by October 31, 2023, ensuring readiness while leaving room for unplanned contingencies.
  • Readiness reviews were conducted to prevent overlooking any aspect, especially subjective assessments. System Owners underwent a comprehensive checklist review, which domain architects approved. Critical systems, which form the backbone of the purchase flow, were reviewed by a central team of architects and engineering leaders.
  • Under the oversight of experts from our cloud providers, we conducted DDOS fire drills on publicly exposed systems to identify potential vulnerabilities that attackers could exploit during peak times.
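
To make the idea of rule-based readiness scoring concrete, here is a minimal sketch in Python. It is not Klarna's tool: the rule names, the configuration fields, and the choice to average system scores when rolling up to a domain are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    category: str
    check: Callable[[dict], bool]  # returns True when the system complies


# Four hypothetical rules; the tool described above applied 56.
RULES = [
    Rule("multi_az", "availability", lambda s: s.get("availability_zones", 0) >= 2),
    Rule("autoscaling_enabled", "system capacity", lambda s: bool(s.get("autoscaling"))),
    Rule("db_failover_configured", "databases", lambda s: bool(s.get("db_failover"))),
    Rule("alerts_defined", "resilience", lambda s: s.get("alert_count", 0) > 0),
]


def failing_rules(system: dict) -> list[str]:
    """Names of the rules the system does not yet satisfy."""
    return [rule.name for rule in RULES if not rule.check(system)]


def system_score(system: dict) -> float:
    """Readiness score: percentage of rules the system passes."""
    return 100.0 * (len(RULES) - len(failing_rules(system))) / len(RULES)


def rollup(systems: list[dict], key: str) -> dict[str, float]:
    """Average system scores per group (assumed aggregation), e.g. per team or domain."""
    groups: dict[str, list[float]] = {}
    for system in systems:
        groups.setdefault(system[key], []).append(system_score(system))
    return {name: sum(scores) / len(scores) for name, scores in groups.items()}


# Hypothetical inventory of two systems in one domain.
systems = [
    {"name": "checkout", "domain": "payments", "availability_zones": 3,
     "autoscaling": True, "db_failover": True, "alert_count": 5},
    {"name": "ledger", "domain": "payments", "availability_zones": 1,
     "autoscaling": True, "db_failover": False, "alert_count": 2},
]

for s in systems:
    print(s["name"], system_score(s), failing_rules(s))
print(rollup(systems, "domain"))
```

Listing the failing rules alongside the score is what makes such a tool actionable: a System Owner sees not only that the score is below 100% but which requirement to fix.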

We’ve empowered our teams in the following ways:

  • Traffic Capacity Predictions: We introduced Kapacity (Klarna Capacity), our in-house tool, to help teams project minute-by-minute request volumes using past data and merchant predictions. Kapacity provides growth metrics and extra capacity for unexpected increases, allowing System Owners to easily access data, estimate incoming requests, and make informed decisions about resource allocation. This resolves a pain point from 2022, when more than half of our systems had to make independent predictions based on central metrics. Now, with Kapacity, we offer predictions for every system and service.
  • Performance Testing Tools & Framework: Customized to Klarna’s engineering needs, our centralized performance testing framework streamlines the development, building, and execution of performance tests, ensuring our services can handle heavy loads. System Owners are guided by comprehensive best practices to guarantee a consistent user experience and confirm that their test parameters fulfill well-defined SLO requirements. The tool is capable of monitoring the success of tests conducted by the systems.
  • Best Practices & Guidelines: We provide guidance in several areas, including Lambda readiness, databases, handling dependencies (internal and third-party), capacity reservations, monitoring, observability, runbooks, and more.
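
As a rough illustration of the kind of projection described for Kapacity, the sketch below scales last year's per-minute request counts by a growth percentage, adds merchant-predicted extra traffic, and applies a safety buffer. The formula, the function name, and the parameters are assumptions; the article does not describe Kapacity's actual model.

```python
def project_capacity(history_rpm, growth_pct, buffer_pct, merchant_extra_rpm=None):
    """Project requests per minute for each minute of the coming peak.

    history_rpm:        last year's requests per minute, one value per minute
    growth_pct:         expected year-over-year growth, in percent
    buffer_pct:         extra headroom for unexpected increases, in percent
    merchant_extra_rpm: optional merchant-predicted additional requests per minute
    """
    extra = merchant_extra_rpm or [0] * len(history_rpm)
    if len(extra) != len(history_rpm):
        raise ValueError("history and merchant predictions must cover the same minutes")

    projected = []
    for past, merchant in zip(history_rpm, extra):
        # Integer arithmetic (scaled by 100 * 100) avoids float rounding surprises.
        scaled = (past * (100 + growth_pct) + merchant * 100) * (100 + buffer_pct)
        projected.append(-(-scaled // 10_000))  # ceiling division
    return projected


# Hypothetical three-minute window with a merchant campaign in the last minute.
history = [1000, 1200, 4000]
forecast = project_capacity(history, growth_pct=25, buffer_pct=20,
                            merchant_extra_rpm=[0, 0, 500])
print(forecast)       # [1500, 1800, 6600]
print(max(forecast))  # 6600, the peak a System Owner would provision for
```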

A Closer Look at Peak Season Essentials

This section is for readers who want to dig deeper into our key Peak Season preparations. Some of these topics were touched on in previous sections; here we go into the details.

To guarantee a seamless and delightful customer experience during the 2023 Peak Season, preparations and detailed planning were undertaken, focusing primarily on the following essential areas:

1. Comprehensive Performance Testing

2. Proficient Management of Flash Sales

3. DDOS Readiness

Approach for Performance Testing: Understanding the importance of system efficiency in handling elevated traffic, particularly during flash sales, extensive performance testing has been carried out in alignment with Klarna’s specific requirements. Based on direct traffic dependencies, the distinctive nature of flash sales, and trends observed from historical data, the Klarna Infrastructure capacity has been segregated into two levels:

  • FLAS Capacity (Fast, Large Spike): Designed to accommodate sudden, substantial surges of activity often triggered by flash sales, campaigns, or incidents. In such scenarios, traffic can spike drastically within a two-minute duration, necessitating a system robust enough to manage these increments without relying on auto-scaling, which might be too slow to react.
  • Baseline Capacity: The maximum capacity needed to support regular daily traffic (excluding FLAS events). It is sized to comfortably handle peak daily volumes, with auto-scaling enabled for consistent performance.

The systems anticipated to face FLAS events have undergone load tests (where performance is gauged against expected loads), spike tests (assessing the system’s handling of sudden load increments, typically due to flash sales), and overload tests (designed to ascertain the load point at which a system fails or exhibits significant degradation). Conversely, the Baseline systems only needed to conduct load and overload tests.
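
The test requirements per capacity level, and the shape of a spike test, can be sketched as follows. This is an illustrative Python sketch, not Klarna's performance testing framework: the names, the linear ramp, and the 30-second step size are assumptions. The two-minute ramp and the 8x peak reflect the figures mentioned in this article.

```python
# Test types required for each capacity level, as described above.
REQUIRED_TESTS = {
    "FLAS": ("load", "spike", "overload"),
    "BASELINE": ("load", "overload"),
}


def missing_tests(level, completed):
    """Test types a system at the given capacity level still has to run."""
    return [test for test in REQUIRED_TESTS[level] if test not in completed]


def spike_profile(baseline_rps, peak_rps, ramp_seconds=120, step_seconds=30):
    """Target request rates for a linear ramp from baseline to peak.

    Returns (seconds_from_start, target_requests_per_second) pairs.
    """
    steps = ramp_seconds // step_seconds
    span = peak_rps - baseline_rps
    return [(i * step_seconds, baseline_rps + span * i // steps)
            for i in range(steps + 1)]


print(missing_tests("FLAS", {"load"}))
# ['spike', 'overload']

print(spike_profile(baseline_rps=100, peak_rps=800))
# [(0, 100), (30, 275), (60, 450), (90, 625), (120, 800)]
```

A load generator would hold each target rate until the next step. Because the whole ramp completes in two minutes, the system under test has to absorb it with capacity that is already provisioned rather than by scaling out.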

Moreover, it is essential to test underlying components such as dependencies and databases to ensure comprehensive system performance.

  • Failover tests play a critical role too. Failover is the automatic switch from a primary system to a backup system when a fault or failure is detected. Configuring databases and their clients for rapid failover is crucial for maintaining system resilience and overall performance, especially during unexpected events. For instance, if a sudden traffic surge occurs and the database is not primed for quick failover, notable slowdowns may occur, or the system might become completely unresponsive. Such a scenario could result in downtime, and could compromise data integrity or cause data loss if handled incorrectly. Configuring fast failover is thus a priority to avoid such disruptions and to ensure a seamless user experience even under significant load. We employed strategic retries, exponential backoffs, and robust exception handling. Additionally, we used explicit query timeouts to keep both indexed and non-indexed lookups fast. Together, these measures contribute to smoother recovery of database operations, preserving application performance.
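
The retry-with-exponential-backoff pattern mentioned above can be sketched like this. It is a generic Python illustration, not Klarna's client code: the exception type, the default delays, and the use of random jitter are assumptions. Any query timeout is expected to be enforced inside the operation passed in.

```python
import random
import time


class TransientDatabaseError(Exception):
    """Hypothetical error raised while a database failover is in progress."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    The delay ceiling doubles on each attempt (0.1s, 0.2s, 0.4s, ...) up to
    max_delay. The actual wait is drawn at random below that ceiling so that
    many clients recovering from the same failover do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the failure
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: an operation that fails twice during a simulated failover.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientDatabaseError("failover in progress")
    return "ok"


print(call_with_retries(flaky_query))  # ok
```

Only errors known to be transient are retried here. Retrying every exception would hide real bugs and add load to a database that is already struggling.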

Approach for Managing Flash Sales: Flash sales management is divided into two categories: Managed Flash Sales and Unmanaged Flash Sales.

  • Managed Flash Sales are reported through a dedicated process by a Business Developer, who has direct contact with the merchant. This report triggers an automated notification to the Continuous Readiness team. The team then reviews the merchant’s historical data and projected peak capacity, and compares this information against the existing rate limit for the relevant merchant category. This evaluation helps manage intense sales activity during periods when partners are expected to exceed their standard rate limit, and it protects the Klarna platform from overloads that could negatively impact all partners. If a rate limit adjustment is needed for a flash sale, the Continuous Readiness team requests a temporary rate limit increase from the central team that manages rate limits for all merchants. This decision is based on Klarna’s set capacity levels and a comprehensive understanding of the situation. Impacted teams, including Accountable Leads and On-Call members, are then notified. The Business Developer or Key Account Manager who reported the flash sale, along with the merchant’s solution engineer, is also informed. All relevant stakeholders join a direct message group where they receive further details about the flash sale, such as the schedule, impacted countries, expected peak times, and other important considerations. Significant flash sales are monitored centrally to provide continuous support during the event.
  • Unmanaged Flash Sales, on the other hand, are unpredictable. Any sudden load during these sales is absorbed by the FLAS capacity, which is sized from the predictions provided by Kapacity (our prediction tool) and includes a safety buffer. During Peak Season, particularly on Black Friday, continuous central monitoring is in place to ensure prompt support should any issues arise. To further ensure operational continuity, Technical Managers from our cloud providers were also on standby for support.
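
The rate limit evaluation for a Managed Flash Sale could be sketched as below. This is a hypothetical Python illustration of the comparison described above: the field names, the notion of platform headroom, and the three outcomes (including the escalation path) are assumptions, not Klarna's actual process or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashSaleReport:
    merchant: str
    historical_peak_rps: int  # highest rate seen in the merchant's past sales
    projected_peak_rps: int   # peak the merchant expects for this sale


def evaluate_flash_sale(report, category_rate_limit_rps, platform_headroom_rps):
    """Decide how to handle a reported flash sale.

    Returns one of:
      "no_change"                   expected peak fits within the current limit
      "request_temporary_increase"  peak exceeds the limit but fits platform capacity
      "escalate"                    peak exceeds what the platform can safely absorb
    """
    # Plan for whichever is higher: past behavior or the merchant's own estimate.
    expected_peak = max(report.historical_peak_rps, report.projected_peak_rps)
    if expected_peak <= category_rate_limit_rps:
        return "no_change"
    extra_needed = expected_peak - category_rate_limit_rps
    if extra_needed <= platform_headroom_rps:
        return "request_temporary_increase"
    return "escalate"


report = FlashSaleReport("example-merchant", historical_peak_rps=300,
                         projected_peak_rps=900)
print(evaluate_flash_sale(report, category_rate_limit_rps=500,
                          platform_headroom_rps=1000))
# request_temporary_increase
```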

Approach for DDOS Readiness: Our procedure for ensuring DDOS readiness involved enabling central, platform-level protection in several areas, in addition to the fire drill conducted for identified public endpoints. This exercise was led by Case Taintor (Competence Group Lead of Klarna Engineering Platform), in collaboration with our networking and security teams and our cloud providers. The fire drill was planned with the aim of boosting confidence, identifying areas for improvement, validating processes, and gaining practice.

The fire drill included a synthetic test, a review of setups such as WAF rules, and targeted advice. It helped system owners identify necessary actions on CloudFront configurations, alerting mechanisms, and origin setups. It also helped update essential checklists and procedures in the Runbook.

Lessons Learned: Aspiring for Continual Growth

One of the leadership principles that resonates profoundly with me is ‘start small and learn fast.’ At Klarna, we have the daring to experiment with cutting-edge technology trends, which accelerates our learning and fosters innovation. The accomplishment of each Peak Season is a tribute to the collaborative efforts of all the teams within the readiness scope. Our most recent Peak Season has set a new standard, thanks to its unprecedentedly high-quality delivery.

In the future,

  • We aim to further enhance our engineers’ efficiency by lessening the efforts needed for Peak Season preparations as well as in other operations throughout the year.
  • We will emphasize customer delight as the cornerstone of Klarna, rooted in our customer-centric philosophy. Therefore, persistent performance testing is of utmost importance, ensuring the reliability and optimum performance of Klarna’s systems. We currently boast a suite of impressive tools, and our goal is to foster a culture that encourages performance-driven development.
  • As the premier AI-powered bank, we are committed to leveraging AI to enhance efficiency and maintain an unwavering focus on quality.
  • We also plan to strengthen our dependency management with third-party systems to further improve readiness.

In conclusion, Klarna’s Peak Season 2023 was a game-changer, setting new standards in planning, preparedness, and execution. Excellence in Peak Season is no longer a goal, but a standard that Klarna strives to raise with each passing year, promising a future of seamless shopping experiences for our customers. In the face of ever-evolving challenges, Klarna’s steadfast commitment to customer delight, driven by a spirit of continual growth and innovation, continues to guide us.

Did you enjoy this post and want to stay updated on our latest projects and advancements in the engineering field? Join the Klarna Engineering community on Medium, Meetup.com and LinkedIn.

An accomplished Engineering Manager with diverse industry experience. Accountable and Competence Lead at Klarna focused on building resilient systems.