Athena
Introduction
Athena is an interactive query service provided by Amazon Web Services (AWS) that enables you to analyze data stored in S3 using standard SQL queries. Athena allows users to run ad-hoc queries to analyze, filter, aggregate, and join datasets stored in S3. It supports various file formats, such as JSON, Parquet, and CSV, making it compatible with a wide range of data sources.
LocalStack allows you to configure the Athena APIs with a Hive metastore that can connect to the S3 API and query your data directly in your local environment. The supported APIs are available on our API coverage page, which provides information on the extent of Athena’s integration with LocalStack.
Getting started
This guide is designed for users new to Athena and assumes basic knowledge of the AWS CLI and our awslocal wrapper script.
Start your LocalStack container using your preferred method. We will demonstrate how to create an Athena table, run a query against it, and read the results with the AWS CLI.
Note
To utilize the Athena API, LocalStack will download additional dependencies. This involves fetching a Docker image of around 1.5GB that contains Presto, Hive, and other tools. These components are retrieved automatically when you initiate the service. To ensure a smooth initial setup, make sure you are connected to a stable internet connection while these components are fetched for the first time.
Create an S3 bucket
You can create an S3 bucket using the mb command. Run the following command to create a bucket named athena-bucket:
$ awslocal s3 mb s3://athena-bucket
You can create some sample data using the following commands:
$ echo "Name,Service" > data.csv
$ echo "LocalStack,Athena" >> data.csv
You can upload the data to your bucket using the cp command:
$ awslocal s3 cp data.csv s3://athena-bucket/data/
Create an Athena table
You can create an Athena table by submitting a CREATE EXTERNAL TABLE statement via the StartQueryExecution API.
Run the following command to create a table named tbl01:
$ awslocal athena start-query-execution \
--query-string "create external table tbl01 (name STRING, surname STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://athena-bucket/data/';" --result-configuration "OutputLocation=s3://athena-bucket/output/"
You should see output similar to the following:
{
"QueryExecutionId": "593acab7"
}
You can retrieve information about the query execution using the GetQueryExecution API.
Run the following command:
$ awslocal athena get-query-execution --query-execution-id 593acab7
Replace 593acab7 with the QueryExecutionId returned by the StartQueryExecution API.
Get output of the query
You can get the output of the query using the GetQueryResults API.
Run the following command:
$ awslocal athena get-query-results --query-execution-id 593acab7
You can now query the tbl01 table, reading the data from the S3 location specified in your table creation statement.
Run the following command:
$ awslocal athena start-query-execution \
--query-string "select * from tbl01;" --result-configuration "OutputLocation=s3://athena-bucket/output/"
You can retrieve the execution details similarly using the GetQueryExecution API with the QueryExecutionId returned by the previous step.
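For example, assuming the SELECT query above returned the (hypothetical) QueryExecutionId 7a3bc1d2, you could check its status as follows; substitute the ID from your own output:
$ awslocal athena get-query-execution --query-execution-id 7a3bc1d2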
You can copy the ResultConfiguration from the output and use it to retrieve the results of the query.
Run the following command:
$ awslocal s3 cp s3://athena-bucket/output/593acab7.csv .
$ cat 593acab7.csv
Replace 593acab7.csv with the path to the file that was present in the ResultConfiguration of the previous step.
You can also use the GetQueryResults API to retrieve the results of the query.
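For example, again using the hypothetical execution ID from the previous step:
$ awslocal athena get-query-results --query-execution-id 7a3bc1d2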
Delta Lake Tables
LocalStack Athena supports Delta Lake, an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.
To illustrate this feature, we take a sample published in the AWS blog.
The Delta Lake files used in this sample are available in a public S3 bucket under s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table.
For your convenience, we have prepared the test files in a downloadable ZIP file.
We start by downloading and extracting this ZIP file:
$ mkdir /tmp/delta-lake-sample; cd /tmp/delta-lake-sample
$ wget https://localstack-assets.s3.amazonaws.com/aws-sample-athena-delta-lake.zip
$ unzip aws-sample-athena-delta-lake.zip; rm aws-sample-athena-delta-lake.zip
We can then create an S3 bucket in LocalStack using the awslocal command line, and upload the files to the bucket:
$ awslocal s3 mb s3://test
$ awslocal s3 sync /tmp/delta-lake-sample s3://test
Next, we create the table definitions in Athena:
$ awslocal athena start-query-execution \
--query-string "CREATE EXTERNAL TABLE test (product_id string, product_name string, \
price bigint, currency string, category string, updated_at double) \
LOCATION 's3://test/' TBLPROPERTIES ('table_type'='DELTA')"
Please note that this query may take some time to finish executing.
You can observe the output in the LocalStack container (ideally with DEBUG=1 enabled) to follow the steps of the query execution.
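For example, if you manage LocalStack through its CLI, one way (of several) to start the container with debug logs enabled and then follow its log output is:
$ DEBUG=1 localstack start
$ localstack logs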
Finally, we can now run a SELECT query to extract data from the Delta Lake table we've just created:
$ queryId=$(awslocal athena start-query-execution --query-string "SELECT * from deltalake.default.test" | jq -r .QueryExecutionId)
$ awslocal athena get-query-results --query-execution-id $queryId
The query should yield a result similar to the output below:
...
"Rows": [
{
"Data": [
{ "VarCharValue": "product_id" },
{ "VarCharValue": "product_name" },
{ "VarCharValue": "price" },
{ "VarCharValue": "currency" },
{ "VarCharValue": "category" },
{ "VarCharValue": "updated_at" }
]
},
{
"Data": [
{ "VarCharValue": "00005" },
{ "VarCharValue": "USB charger" },
{ "VarCharValue": "50" },
{ "VarCharValue": "INR" },
{ "VarCharValue": "Electronics" },
{ "VarCharValue": "1653462374.9975588" }
]
},
...
...
Note
The SELECT statement above currently requires us to prefix the database/table name with deltalake. (as in deltalake.default.test); this will be further improved in a future iteration, for better parity with AWS.
Iceberg Tables
The LocalStack Athena implementation also supports Iceberg tables.
You can define an Iceberg table in Athena using the CREATE TABLE statement, as shown in the example below:
CREATE TABLE mytable (c1 integer, c2 string, c3 double)
LOCATION 's3://mybucket/prefix/' TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
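Data can then be inserted through regular DML statements submitted via the StartQueryExecution API. The following is a minimal sketch that reuses the mytable and mybucket names from the CREATE TABLE example above; the output/ prefix used for query results is an illustrative assumption:
$ awslocal athena start-query-execution \
--query-string "INSERT INTO mytable VALUES (1, 'foo', 1.0)" \
--result-configuration "OutputLocation=s3://mybucket/output/"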
Once the table has been created and data inserted into it, you can see the Iceberg metadata and data files being created in S3:
s3://mybucket/_tmp.prefix/
s3://mybucket/prefix/data/00000-0-user1_20230212221600_cd8f8cbd-4dcc-4c3f-96a2-f08d4104d6fb-job_local1695603329_0001-00001.parquet
s3://mybucket/prefix/data/00000-0-user1_20230212221606_eef1fd88-8ff1-467a-a15b-7a24be7bc52b-job_local1976884152_0002-00001.parquet
s3://mybucket/prefix/metadata/00000-06706bea-e09d-4ff1-b366-353705634f3a.metadata.json
s3://mybucket/prefix/metadata/00001-3df6a04e-070d-447c-a213-644fe6633759.metadata.json
s3://mybucket/prefix/metadata/00002-5dcd5d07-a9ed-4757-a6bc-9e87fcd671d5.metadata.json
s3://mybucket/prefix/metadata/2f8d3628-bb13-4081-b5a9-30f2e81b7226-m0.avro
s3://mybucket/prefix/metadata/70de28f7-6507-44ae-b505-618d734174b9-m0.avro
s3://mybucket/prefix/metadata/snap-8425363304532374388-1-70de28f7-6507-44ae-b505-618d734174b9.avro
s3://mybucket/prefix/metadata/snap-9068645333036463050-1-2f8d3628-bb13-4081-b5a9-30f2e81b7226.avro
s3://mybucket/prefix/temp/
Client configuration
You can configure the Athena service in LocalStack with various clients, such as PyAthena and awswrangler. Here are small snippets to get you started, first with PyAthena and then with awswrangler:
# PyAthena: connect to Athena through the LocalStack edge endpoint
from pyathena import connect
conn = connect(
s3_staging_dir="s3://s3-results-bucket/output/",
region_name="us-east-1",
endpoint_url="http://localhost:4566",
)
cursor = conn.cursor()
cursor.execute("SELECT 1,2,3 AS test")
print(cursor.fetchall())
# awswrangler: point the Athena, Glue, and S3 endpoints at LocalStack
import awswrangler as wr
import pandas as pd
ENDPOINT = "http://localhost:4566"
DATABASE = "testdb"
S3_BUCKET = "s3://s3-results-bucket/output/"
wr.config.athena_endpoint_url = ENDPOINT
wr.config.glue_endpoint_url = ENDPOINT
wr.config.s3_endpoint_url = ENDPOINT
wr.catalog.create_database(DATABASE)
df = wr.athena.read_sql_query("SELECT 1 AS col1, 2 AS col2, 3 AS col3", database=DATABASE)
print(df)
Resource Browser
The LocalStack Web Application provides a Resource Browser for executing Athena queries, writing SQL queries, and visualizing query results. You can access the Resource Browser by opening the LocalStack Web Application in your browser, navigating to the Resources section, and then clicking on Athena under the Analytics section.
The Resource Browser allows you to perform the following actions:
- View Databases: View the databases available in your Athena instance by clicking on the Databases tab.
- View Catalogs: View the catalogs available in your Athena instance by clicking on the Catalogs tab.
- Edit Catalogs: Edit the catalogs available in your Athena instance by clicking on the Catalog name, editing the catalog, and then clicking on the Submit button.
- Create Catalogs: Create a new catalog by clicking on the Create Catalog button, entering the catalog details, and then clicking on the Submit button.
- Run SQL Queries: Run SQL queries by clicking on the SQL button, entering the query, and then clicking on the Execute button.
Examples
The following code snippets and sample applications provide practical examples of how to use Athena in LocalStack for various use cases: