Automating the Klarna Card Ownership Fees System using AWS Step Functions

Michel Neumann
Published in Klarna Engineering
May 2, 2024

This article outlines how my team and I used AWS Step Functions and CloudFormation to automate the system that charges the monthly fee for Klarna Cards, transforming a previously manual routine into a self-sufficient, scheduled workflow. The initiative significantly streamlined operations and reduced maintenance costs.

Introduction

In early 2023, Klarna introduced monthly fees for Klarna Cards in the US. Two teams in the Card & Banking domain, with me as one of the engineers, developed this system within a tight four-month deadline. Initially, the system required extensive manual operation, including a detailed checklist that engineers had to follow to ensure successful executions. The teams launched with the plan to improve that routine iteratively.

Months passed without any progress in refining the operating process or automating any part of it. To run the batch jobs, the teams had to take care of the following:

  • Designating an engineer to lead the monthly process, coordinated using JIRA tickets
  • Updating exemption lists and submitting pull requests to the code-base prior to initiating batch runs
  • Ensuring data integrity by performing Athena queries across three different production databases within the AWS Console
  • Manually initiating multiple batch jobs in a specified sequence with manual input of arguments in a live production setting
  • Awaiting termination of the jobs and conducting a thorough review of the outcomes of the final batch jobs for each market

Overall, this routine took around three business days per month and required two engineers to approve code changes and review the results of the batch runs. With fees potentially being rolled out to new markets, this workflow posed a significant challenge to maintaining high quality standards and preventing incidents.

Taking on The Challenge

We recognized that continuing with our current process was unsustainable and bound to cause issues down the line. When the topic got priority, I took the opportunity to lead the initiative, dedicating my time to investigate the problem and propose a viable solution.

I created a “Request for Comments” (RFC) document, a formal method employed by Klarna for proposing ideas, to outline potential solutions and gather feedback. After discussing the options, we decided to commit to AWS Step Functions, a service for orchestrating multiple services into serverless workflows.

To manage expectations and set project milestones, I developed a detailed timeline. I estimated the completion of different phases of the automation project, including the initial MVP and the final implementation. Additionally, I outlined a series of implementation and discovery tasks to be integrated into upcoming sprints.

System Overview

For better context, I am going to describe the Ownership Fee system as it operated before the introduction of automation. The system is distributed over two AWS accounts: one directly owned by the team and another shared account with resources used by the Klarna App.

Architecture diagram of the Ownership Fees before automation

CloudFormation and AWS CDK are used to deploy resources. Most notable are two Glue jobs and two SQS queues, one of which is a “Dead Letter Queue” (DLQ) that holds messages that could not be processed. The Glue jobs read data from three distinct databases, each managed by a separate domain and housed in yet another AWS account.

A Python script running PySpark preprocesses, joins, and transforms the data to generate JSON records, each describing a customer’s card information for a given month. These records are written into the SQS queue and processed by an AWS Lambda function, which is deployed in the shared AWS account. The Lambda function implements a decision tree that results in either triggering a fee charge using a dependency system or exempting the customer from the fees for the current month. Decisions made by the Lambda function are stored in a DynamoDB table within the same account.
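The decision step can be pictured as a pure function from one record to one decision. The sketch below is only an illustration: the record fields (`cardActive`, `onExemptionList`) and the rules themselves are hypothetical, since the actual schema and business rules are not part of this article.

```typescript
// Hypothetical shape of one record produced by the Glue job; the real
// schema is not described in this article.
interface CardRecord {
  customerId: string
  market: string
  feePeriod: string // e.g. "2024-01"
  cardActive: boolean
  onExemptionList: boolean
}

type Decision =
  | { action: 'CHARGE'; customerId: string; feePeriod: string }
  | { action: 'EXEMPT'; customerId: string; feePeriod: string; reason: string }

// Illustrative decision tree: every branch ends in exactly one decision,
// which the real Lambda function would persist to DynamoDB.
function decide(record: CardRecord): Decision {
  const { customerId, feePeriod } = record
  if (record.onExemptionList) {
    return { action: 'EXEMPT', customerId, feePeriod, reason: 'exemption-list' }
  }
  if (!record.cardActive) {
    return { action: 'EXEMPT', customerId, feePeriod, reason: 'inactive-card' }
  }
  return { action: 'CHARGE', customerId, feePeriod }
}
```

Keeping the decision logic free of side effects like this makes each branch easy to unit test, with the charge call and the DynamoDB write handled by the caller.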

Adding Automation

To implement automation, we created a so-called “state machine”, which describes a sequence of event-driven steps where each step in the workflow is called a “state”. A state represents a unit of work that can call any AWS service or API. In our case, these states consist of Lambda functions and Glue job triggers that replace the manual steps described previously. Finally, we needed a scheduler that triggers the workflow at a predetermined interval, so that no engineer has to log into the AWS Console and launch the state machine manually.

Building on our existing use of AWS CloudFormation, we created a new stack with CDK that contains only the automation resources. A stack is a modular grouping of AWS resources that can be managed and altered independently without impacting other stacks. The state machine is defined in this stack.

import { Stack, StackProps } from 'aws-cdk-lib'
import { StateMachine } from 'aws-cdk-lib/aws-stepfunctions'
import { Construct } from 'constructs'

export class KlarnaCardOwnershipFeesAutomationStack extends Stack {
  constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props)
    this.provision()
  }

  private provision() {
    // stateMachineExecutionRole and definitionBody are defined in the snippets further below
    const stateMachine = new StateMachine(this, 'state-machine', {
      stateMachineName: 'klarna-card-fees-automation',
      role: stateMachineExecutionRole,
      definitionBody,
    })

    // More resources will follow here...
  }
}

To integrate the execution of Lambda functions into the workflow, “LambdaInvoke” tasks needed to be created.

Because the Lambda is defined in the same CDK stack, we were able to reference it directly. However, it is also possible to invoke Lambda functions that are deployed elsewhere, for example by using their resource ARNs.

import { Code, Function, Runtime } from 'aws-cdk-lib/aws-lambda'
import { JsonPath } from 'aws-cdk-lib/aws-stepfunctions'
import { LambdaInvoke } from 'aws-cdk-lib/aws-stepfunctions-tasks'

const lambda = new Function(this, 'lambda', {
  functionName: 'klarna-card-fees-lambda',
  code: Code.fromAsset('...'),
  handler: 'index.handler',
  runtime: Runtime.NODEJS_20_X,
})

// Invoke the Lambda function and expose its return value as "FeePeriod"
const lambdaInvoke = new LambdaInvoke(this, 'step-invoke-lambda', {
  lambdaFunction: lambda,
  resultSelector: {
    FeePeriod: JsonPath.stringAt('$.Payload'),
  },
})

To pass arguments between states, tasks support input and output selectors. This is a feature of the “Amazon States Language”, a JSON-based, structured language used to define state machines. Here, the result selector picks the return value of the Lambda function (“Payload”) and passes it to the next state as a JSON record of the following format:

{
"FeePeriod": "2024-01"
}
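To make the selector mechanics concrete, here is a minimal plain-TypeScript sketch that mimics what a result selector does for simple dotted paths. It is not how Step Functions is implemented, and the helper names `resolvePath` and `applyResultSelector` are made up for this illustration; real JSONPath supports far more than this.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// Resolves a simple dotted path such as "$" or "$.a.b" against a JSON value.
function resolvePath(root: Json, path: string): Json | undefined {
  if (path === '$') return root
  if (path.slice(0, 2) !== '$.') throw new Error(`Unsupported path: ${path}`)
  let current: Json | undefined = root
  for (const key of path.slice(2).split('.')) {
    if (current === null || typeof current !== 'object' || Array.isArray(current)) {
      return undefined
    }
    current = current[key]
  }
  return current
}

// Builds the next state's input by resolving every selector path against the
// raw task result, similar in spirit to ResultSelector.
function applyResultSelector(
  taskResult: Json,
  selector: { [name: string]: string },
): { [name: string]: Json | undefined } {
  const output: { [name: string]: Json | undefined } = {}
  for (const name of Object.keys(selector)) {
    output[name] = resolvePath(taskResult, selector[name])
  }
  return output
}

// Simplified task result: the function's return value sits under "Payload".
const lambdaResult: Json = { StatusCode: 200, Payload: '2024-01' }
const nextStateInput = applyResultSelector(lambdaResult, { FeePeriod: '$.Payload' })
```

With this input, `nextStateInput` holds a single `FeePeriod` property, which matches the record shown above.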

Next, we defined the steps that trigger the Glue jobs. The tasks reference the Glue jobs by name. It is important to note that the state machine and the Glue job must be deployed in the same region, as there is no support for invoking them across regions or accounts as of the time of writing.

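import { IntegrationPattern, JsonPath, TaskInput } from 'aws-cdk-lib/aws-stepfunctions'
import { GlueStartJobRun } from 'aws-cdk-lib/aws-stepfunctions-tasks'

const startGlueJob = new GlueStartJobRun(this, 'step-fn-start-glue-job', {
  glueJobName: 'klarna-card-fees-glue-job',
  // RUN_JOB waits for the Glue job run to finish before moving on
  integrationPattern: IntegrationPattern.RUN_JOB,
  arguments: TaskInput.fromObject({
    // Read from the state input, which the previous Lambda task produced
    '--FEE_PERIOD': JsonPath.stringAt('$.FeePeriod'),
    '--MARKET': 'US',
  }),
})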

Accessing the state input, which in our case was returned by the previous Lambda task, can be achieved through a syntax known as “JSONPath”. With “$” pointing to the root of the input object, the “FeePeriod” property is selected, read as a string and passed as a Glue job run argument.

By setting the “integrationPattern” to “IntegrationPattern.RUN_JOB”, the execution of the Glue job is handled synchronously, meaning the state machine will pause until the Glue job has terminated.
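For readers who want to see what the integration pattern changes underneath, the following sketch builds roughly the Amazon States Language task state that such a CDK task is rendered into. The helper `glueTaskState` is hypothetical and the output is an approximation of the generated definition, not a copy of it.

```typescript
type GlueIntegration = 'REQUEST_RESPONSE' | 'RUN_JOB'

// Returns an ASL task state that starts a Glue job. The ".sync" suffix on the
// resource is what makes Step Functions wait for the job run to finish.
function glueTaskState(jobName: string, pattern: GlueIntegration, next: string) {
  const suffix = pattern === 'RUN_JOB' ? '.sync' : ''
  return {
    Type: 'Task',
    Resource: `arn:aws:states:::glue:startJobRun${suffix}`,
    Parameters: {
      JobName: jobName,
      Arguments: {
        // A key ending in ".$" tells Step Functions to resolve the value as a
        // JSONPath against the state input instead of using it literally.
        '--FEE_PERIOD.$': '$.FeePeriod',
        '--MARKET': 'US',
      },
    },
    Next: next,
  }
}

const syncState = glueTaskState('klarna-card-fees-glue-job', 'RUN_JOB', 'automation-complete')
```

With `REQUEST_RESPONSE` the suffix is omitted, and the state machine would move on as soon as the job run has been started.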

Linking the previously defined steps together, concluded with a “Succeed” step to mark the execution as successful, forms the basis of the so-called “definition body” of the state machine.

import { DefinitionBody, Succeed } from 'aws-cdk-lib/aws-stepfunctions'

// Lambda task -> Glue task -> Succeed
const definitionBody = DefinitionBody.fromChainable(
  lambdaInvoke
    .next(startGlueJob)
    .next(new Succeed(this, 'automation-complete'))
)
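The chain can also be read as a plain Amazon States Language definition in which each state points to its successor via `Next`. The sketch below is an approximation of such a definition, trimmed to the fields needed to follow the flow, together with a hypothetical helper `executionOrder` that walks it; state names are assumed to equal the CDK construct IDs.

```typescript
interface AslState {
  Type: 'Task' | 'Succeed'
  Resource?: string
  Next?: string
}

interface AslDefinition {
  StartAt: string
  States: { [name: string]: AslState }
}

// Approximate definition for the chain: Lambda task -> Glue task -> Succeed.
const definition: AslDefinition = {
  StartAt: 'step-invoke-lambda',
  States: {
    'step-invoke-lambda': {
      Type: 'Task',
      Resource: 'arn:aws:states:::lambda:invoke',
      Next: 'step-fn-start-glue-job',
    },
    'step-fn-start-glue-job': {
      Type: 'Task',
      Resource: 'arn:aws:states:::glue:startJobRun.sync',
      Next: 'automation-complete',
    },
    'automation-complete': { Type: 'Succeed' },
  },
}

// Follows the Next pointers from StartAt and returns the visit order,
// failing on dangling references or cycles.
function executionOrder(def: AslDefinition): string[] {
  const order: string[] = []
  let current: string | undefined = def.StartAt
  while (current !== undefined) {
    const state: AslState | undefined = def.States[current]
    if (state === undefined) throw new Error(`Unknown state: ${current}`)
    if (order.indexOf(current) !== -1) throw new Error(`Cycle at state: ${current}`)
    order.push(current)
    current = state.Next
  }
  return order
}
```

A check like this is a cheap way to assert in a unit test that a refactored chain still runs its steps in the intended sequence.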

Following the “least privilege principle” of the AWS Well-Architected Framework, the state machine itself has no permissions to execute the Lambda functions and Glue jobs. To let the state machine do exactly what it needs and nothing more, we created an AWS IAM role that grants the missing permissions and added it to the state machine’s configuration.

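import { Effect, PolicyDocument, PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam'

const stateMachineExecutionRole = new Role(this, 'state-machine-execution-role', {
  assumedBy: new ServicePrincipal('states.us-east-1.amazonaws.com'),
  roleName: 'state-machine-execution-role',
  // inlinePolicies expects a map of policy name to PolicyDocument
  inlinePolicies: {
    'state-machine-execution-role-policy': new PolicyDocument({
      statements: [
        new PolicyStatement({
          effect: Effect.ALLOW,
          actions: ['lambda:InvokeFunction'],
          // this.account is the AWS account ID of the enclosing stack
          resources: [`arn:aws:lambda:*:${this.account}:function:klarna-card-fees-lambda`],
        }),
        new PolicyStatement({
          effect: Effect.ALLOW,
          // RUN_JOB also polls (and can stop) the job run, not only start it
          actions: ['glue:StartJobRun', 'glue:GetJobRun', 'glue:GetJobRuns', 'glue:BatchStopJobRun'],
          resources: [`arn:aws:glue:*:${this.account}:job/klarna-card-fees-glue-job`],
        }),
      ],
    }),
  },
})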

With all components configured, deploying the stack is as simple as executing the “cdk deploy” command, after which the resources will be set up. Accessing the state machine is straightforward through the AWS Console under the “Step Functions” section.

Step Functions module in the AWS Console

A scheduled trigger to enable full automation is still missing. For this, Amazon EventBridge Scheduler lets us set up a CRON schedule that automatically triggers the state machine at specified intervals. The scheduler needs an additional IAM role with permission to start the state machine.

import { Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam'
import { CfnSchedule } from 'aws-cdk-lib/aws-scheduler'

const schedulerExecutionRole = new Role(this, 'scheduler-execution-role', {
  assumedBy: new ServicePrincipal('scheduler.amazonaws.com'),
  roleName: 'klarna-card-fees-scheduler-role',
})

const schedule = new CfnSchedule(this, 'automation-schedule', {
  name: 'klarna-card-fees-automation-schedule',
  scheduleExpression: 'cron(0 9 3 * ? *)', // 3rd of every month at 9:00 AM UTC
  flexibleTimeWindow: { mode: 'OFF' }, // required property; OFF fires at the exact time
  state: 'ENABLED',
  target: {
    arn: stateMachine.stateMachineArn,
    roleArn: schedulerExecutionRole.roleArn,
  },
})

// Allow the scheduler role to start executions of the state machine
stateMachine.grantStartExecution(schedulerExecutionRole)

By configuring the CRON schedule, enabling it, and designating the state machine as the target, the setup is complete. Following a redeployment, automation is fully in place, allowing the system to operate independently. To get notified about issues during the execution, we created custom Datadog monitors integrated with Opsgenie, which page an engineer if unexpected errors occur.

A CRON scheduler has been created using AWS EventBridge, targeting the state machine
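To sanity-check when such a schedule fires, the expression can be mirrored in a few lines of plain TypeScript. The helper `nextMonthlyRun` is hypothetical and only covers the "fixed day of month at a fixed UTC hour" case used here; it is not a general cron parser.

```typescript
// Returns the next time a "cron(0 9 3 * ? *)" style schedule fires strictly
// after `from`, evaluated in UTC: 09:00 on the 3rd day of a month by default.
function nextMonthlyRun(from: Date, dayOfMonth = 3, hourUtc = 9): Date {
  const thisMonth = new Date(
    Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), dayOfMonth, hourUtc),
  )
  if (thisMonth.getTime() > from.getTime()) return thisMonth
  // Date.UTC normalises month index 12 to January of the following year.
  return new Date(
    Date.UTC(from.getUTCFullYear(), from.getUTCMonth() + 1, dayOfMonth, hourUtc),
  )
}

const nextRun = nextMonthlyRun(new Date('2024-05-02T00:00:00.000Z'))
```

For the publication date of this article, May 2, 2024, `nextRun` falls on May 3, 2024 at 09:00 UTC; once that moment has passed, the next run moves to June 3.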

A complete architecture diagram showcasing the addition of the entire automation framework is provided below.

Complete architecture diagram including automation

Results

What were the advantages of solving this problem? Primarily, it led to a 12.5% monthly reduction in the team’s workload, freeing up resources for additional projects and initiatives. This decrease in workload translates into lower operational costs, as there’s no longer a need to dedicate engineer hours to system operation. By automating processes, the risk of human error was minimized, thereby reducing the likelihood of potential incidents.

Additionally, the team had the chance to reduce tech debt, reworking IAM roles by eliminating unnecessary permissions and improving the staging environment, enabling thorough testing of the automation process before its rollout to production.

I personally gained a lot of knowledge during this project which I gave back to my team and domain by creating documentation and hosting presentations to deep-dive into the technical aspects and challenges. Finally, working on this project greatly enhanced my proficiency with AWS, which ultimately led me to successfully acquire the AWS Solutions Architect Associate certification!

Summary

In conclusion, AWS Step Functions is highly beneficial for anyone who manages system components within AWS accounts that need to run regularly or in a specific sequence. This blog post has demonstrated the functionality of AWS Step Functions through a straightforward example. It is also possible to architect more complex workflows that invoke other AWS resources, run processes in parallel, incorporate error-handling strategies, and much more.

A big thanks to all contributors and to my team for their extensive support, providing me the opportunity to drive this project!

Did you enjoy this post? Follow Klarna Engineering on Medium and LinkedIn to stay updated on more articles like this.
