Chaos Engineering: Simulating Outages using Chaos API

The LocalStack Chaos API simulates infrastructure faults so that you can run controlled chaos engineering tests against AWS infrastructure. By deliberately introducing failures and observing their impact, you can uncover vulnerabilities and harden your systems against real outages.

In this tutorial, we study the effects of an outage on a sample AWS application. We use the Chaos API to simulate the outage and then design a mitigation that makes the application resilient to database outages.

This tutorial is designed for users new to the Chaos API and assumes basic knowledge of the AWS CLI and our awslocal wrapper script. In this example, we will use the Chaos API to create controlled outages in a DynamoDB database. The aim is to test the software’s behavior and error handling capabilities.

For this particular example, we’ll be using a sample application repository. Clone the repository, and follow the instructions below to get started.

The general prerequisites for this guide are:

Start LocalStack using the docker-compose.yml file from the repository. Make sure to set your Auth Token as an environment variable when you do so. The cloud resources are created automatically when LocalStack starts.

Terminal window
LOCALSTACK_AUTH_TOKEN=<YOUR_LOCALSTACK_AUTH_TOKEN> docker compose up

The following diagram shows the architecture that this application builds and deploys:

Architecture

Before starting any outages, it’s important to verify that our application is functioning correctly. Start by creating an entity and saving it. To do this, use curl to call the API Gateway endpoint for the POST method:

Terminal window
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-2004",
"name": "Ultimate Gadget",
"price": "49.99",
"description": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
}'
Output
Product added/updated successfully.

Next, we configure the Chaos API to target all DynamoDB operations. The Chaos API can restrict faults to particular operations such as PutItem or GetItem, but the objective here is to simulate a failure of the entire service. With the following configuration, each DynamoDB API call fails with a probability of 80%, returning an HTTP 500 status code and a SomethingWentWrong error.

Terminal window
curl --location --request PATCH 'http://localhost.localstack.cloud:4566/_localstack/chaos/faults' \
--header 'Content-Type: application/json' \
--data '
[
{
"service": "dynamodb",
"probability": 0.8,
"error": {
"statusCode": 500,
"code": "SomethingWentWrong"
}
}
]'
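The request body above can also be assembled programmatically, which helps when a test suite needs several fault rules. The following is a minimal sketch, not part of the sample application: `fault_rule` is a hypothetical helper, and the optional `operation` key is an assumption about how a rule is narrowed to a single operation such as PutItem, so check it against the Chaos API reference before relying on it.

```python
import json


def fault_rule(service, probability=1.0, status_code=500,
               code="SomethingWentWrong", operation=None):
    """Build one fault rule in the shape used by the request body above."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rule = {
        "service": service,
        "probability": probability,
        "error": {"statusCode": status_code, "code": code},
    }
    if operation is not None:
        # Assumption: an "operation" key limits the rule to one API operation.
        rule["operation"] = operation
    return rule


# Same rule as in the curl command: all DynamoDB calls, 80% failure probability.
rules = [fault_rule("dynamodb", probability=0.8)]
body = json.dumps(rules, indent=2)
print(body)
```

The printed JSON can be passed to curl with `--data` in place of the inline body.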

This makes the database effectively inaccessible. Most requests fail, whether they come from external clients or from other LocalStack services, so products can no longer be reliably retrieved or added, and API Gateway returns an Internal Server Error.
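Because the fault is configured with a probability of 0.8 rather than 1, a client that retries will occasionally get through. The sketch below works out how likely it is that every attempt fails, assuming each call fails independently with the configured probability (an assumption made for illustration; `all_attempts_fail` is a hypothetical helper).

```python
def all_attempts_fail(p_fail, attempts):
    """Probability that every one of `attempts` independent calls fails."""
    return p_fail ** attempts


for attempts in (1, 3, 5, 10):
    print(attempts, round(all_attempts_fail(0.8, attempts), 3))
```

Under that assumption, three attempts still all fail slightly more than half the time, which is why retries alone are not a sufficient mitigation here.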

Downtime and data loss are critical issues to avoid in enterprise applications. Fortunately, encountering this issue early in the development phase allows developers to implement effective error handling and develop mechanisms to prevent data loss during a database outage.

Architecture

A possible solution involves setting up an SNS topic, an SQS queue, and a Lambda function. The Lambda function will be responsible for retrieving queued items and attempting to re-execute the PutItem operation on the database. If DynamoDB remains unavailable, the item will be placed back in the queue for a later retry.
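The retry flow described above can be sketched with in-memory stand-ins. This is an illustrative model only, not the sample application's code: `FakeTable`, `TableUnavailable`, `save_product`, and `drain_queue` are hypothetical names, a `deque` plays the role of the SNS topic and SQS queue, and `drain_queue` stands in for the Lambda function.

```python
from collections import deque


class TableUnavailable(Exception):
    """Stand-in for the database error raised during the outage."""


class FakeTable:
    """In-memory stand-in for the DynamoDB table, not a real client."""

    def __init__(self):
        self.available = True
        self.items = {}

    def put_item(self, item):
        if not self.available:
            raise TableUnavailable("simulated outage")
        self.items[item["id"]] = item


def save_product(table, queue, item):
    """Try the write; on failure, queue the item instead of dropping it."""
    try:
        table.put_item(item)
        return "Product added/updated successfully."
    except TableUnavailable:
        queue.append(item)
        return "A DynamoDB error occurred. Message sent to queue."


def drain_queue(table, queue):
    """Retry each queued item once; items that fail again are re-queued."""
    for _ in range(len(queue)):
        item = queue.popleft()
        try:
            table.put_item(item)
        except TableUnavailable:
            queue.append(item)


table, queue = FakeTable(), deque()
table.available = False  # outage starts
print(save_product(table, queue, {"id": "prod-1003", "name": "Super Widget"}))
drain_queue(table, queue)  # outage still active: the item stays queued
print(len(queue), len(table.items))
table.available = True  # outage ends
drain_queue(table, queue)  # the queued item is now written
print(len(queue), len(table.items))
```

The point of the design is that a failed write is never discarded: it either lands in the table or stays in the queue for a later attempt.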

Terminal window
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1003",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
Output
A DynamoDB error occurred.
Message sent to queue.

The logs show that the DynamoDbException was handled and that the item was published to the queue.

2023-11-06T22:21:40.789 INFO --- [ asgi_gw_2] localstack.request.aws : AWS dynamodb.PutItem => 500 (DynamoDbException)
2023-11-06T22:21:40.834 DEBUG --- [ asgi_gw_4] l.services.sns.publisher : Topic 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic' publishing '5520d37a-fc21-4a73-b1bf-f9b9afce5908' to subscribed 'arn:aws:sqs:us-east-1:000000000000:ProductEventsQueue' with protocol 'sqs' (subscription 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic:0a4abf8c-744a-404a-9ff9-f132e25d1b30')

The message remains in the queue until the outage is resolved.

To stop the outage, send an empty fault list:

Terminal window
curl --location --request POST 'http://localhost.localstack.cloud:4566/_localstack/chaos/faults' \
--header 'Content-Type: application/json' \
--data '[]'
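In an automated test, the same call can be issued from code so that the outage is always cleared during teardown. The following is a minimal sketch using only the Python standard library; `faults_request` is a hypothetical helper, and the URL is the one from the curl command above. The request is only constructed here, and sending it requires a running LocalStack instance.

```python
import json
import urllib.request

FAULTS_URL = "http://localhost.localstack.cloud:4566/_localstack/chaos/faults"


def faults_request(rules, method="POST"):
    """Build (but do not send) a request that sets the fault configuration."""
    data = json.dumps(rules).encode("utf-8")
    return urllib.request.Request(
        FAULTS_URL,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


clear = faults_request([])  # an empty list ends the outage
print(clear.get_method(), clear.full_url, clear.data)

# To send it against a running LocalStack instance:
# urllib.request.urlopen(clear)
```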

With the outage over, the product that initially failed to reach the database is finally stored. You can confirm this by scanning the table.

Terminal window
awslocal dynamodb scan --table-name Products
Output
{
"Items": [
{
"name": {
"S": "Super Widget"
},
"description": {
"S": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
},
"id": {
"S": "prod-1003"
},
"price": {
"N": "29.99"
}
},
{
"name": {
"S": "Ultimate Gadget"
},
"description": {
"S": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
},
"id": {
"S": "prod-2004"
},
"price": {
"N": "49.99"
}
}
],
"Count": 2,
"ScannedCount": 2,
"ConsumedCapacity": null
}

The LocalStack Chaos API can also introduce network latency on all connections. This can be done with the following configuration:

Terminal window
curl --location --request POST 'http://localhost.localstack.cloud:4566/_localstack/chaos/effects' \
--header 'Content-Type: application/json' \
--data '{
"latency": 5000
}'

With this configured, you can use the same sample stack to observe and understand the effects of a 5-second delay on each service call.

Terminal window
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--max-time 2 \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1088",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
Output
An error occurred (InternalError) when calling the GetResources operation (reached max retries: 4)
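The request above gives the client two seconds (`--max-time 2`) while each service call is delayed by five, so the client's time budget runs out before a response can arrive. The sketch below reproduces that relationship locally with scaled-down delays. It does not contact LocalStack: `slow_call` and `call_with_timeout` are hypothetical stand-ins for a delayed service call and a client with a timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_call(latency_s):
    """Stand-in for a service call delayed by the injected latency."""
    time.sleep(latency_s)
    return "ok"


def call_with_timeout(latency_s, timeout_s):
    """Wait at most timeout_s for the call, like a client-side timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_call, latency_s)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return "timed out"


# Latency longer than the client's budget: the client gives up first.
print(call_with_timeout(latency_s=0.5, timeout_s=0.2))
# Latency within the budget: the call completes normally.
print(call_with_timeout(latency_s=0.05, timeout_s=0.2))
```

When tuning real clients, the same comparison applies: a timeout shorter than the injected latency turns every call into a failure, while a longer one only makes calls slow.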