Redshift
4 minute read
Introduction
RedShift is a cloud-based data warehouse solution which allows end users to aggregate huge volumes of data and parallel processing of data. RedShift is fully managed by AWS and serves as a petabyte-scale service which allows users to create visualization reports and critically analyze collected data. The query results can be saved to an S3 Data Lake while additional analytics can be provided by Athena or SageMaker.
LocalStack allows you to use the RedShift APIs in your local environment to analyze structured and semi-structured data across local data warehouses and data lakes. The supported APIs are available on our API coverage page, which provides information on the extent of RedShift’s integration with LocalStack.
Getting started
This guide is designed for users new to RedShift and assumes basic knowledge of the AWS CLI and our awslocal
wrapper script.
Start your LocalStack container using your preferred method. We will demonstrate how to create a RedShift cluster and database while using a Glue Crawler to populate the metadata store with the schema of the RedShift database tables using the AWS CLI.
Define the variables
First, we will define the variables we will use throughout this guide. Export the following variables in your shell:
REDSHIFT_CLUSTER_IDENTIFIER="redshiftcluster"
REDSHIFT_SCHEMA_NAME="public"
REDSHIFT_DATABASE_NAME="db1"
REDSHIFT_TABLE_NAME="sales"
REDSHIFT_USERNAME="crawlertestredshiftusername"
REDSHIFT_PASSWORD="crawlertestredshiftpassword"
GLUE_DATABASE_NAME="gluedb"
GLUE_CONNECTION_NAME="glueconnection"
GLUE_CRAWLER_NAME="gluecrawler"
The above variables will be used to create a RedShift cluster, database, table, and user. You will also create a Glue database, connection, and crawler to populate the Glue Data Catalog with the schema of the RedShift database tables.
Create a RedShift cluster and database
You can create a RedShift cluster using the CreateCluster
API.
The following command will create a RedShift cluster with the variables defined above:
$ awslocal redshift create-cluster \
--cluster-identifier $REDSHIFT_CLUSTER_IDENTIFIER \
--db-name $REDSHIFT_DATABASE_NAME \
--master-username $REDSHIFT_USERNAME \
--master-user-password $REDSHIFT_PASSWORD \
--node-type n1
You can fetch the status of the cluster using the DescribeClusters
API.
Run the following command to extract the URL of the cluster:
$ REDSHIFT_URL=$(awslocal redshift describe-clusters \
--cluster-identifier $REDSHIFT_CLUSTER_IDENTIFIER | jq -r '(.Clusters[0].Endpoint.Address) + ":" + (.Clusters[0].Endpoint.Port|tostring)')
Create a Glue database, connection, and crawler
You can create a Glue database using the CreateDatabase
API.
The following command will create a Glue database:
$ awslocal glue create-database \
--database-input "{\"Name\": \"$GLUE_DATABASE_NAME\"}"
You can create a connection to the RedShift cluster using the CreateConnection
API.
The following command will create a Glue connection with the RedShift cluster:
$ awslocal glue create-connection \
--connection-input "{\"Name\":\"$GLUE_CONNECTION_NAME\", \"ConnectionType\": \"JDBC\", \"ConnectionProperties\": {\"USERNAME\": \"$REDSHIFT_USERNAME\", \"PASSWORD\": \"$REDSHIFT_PASSWORD\", \"JDBC_CONNECTION_URL\": \"jdbc:redshift://$REDSHIFT_URL/$REDSHIFT_DATABASE_NAME\"}}"
Finally, you can create a Glue crawler using the CreateCrawler
API.
The following command will create a Glue crawler:
$ awslocal glue create-crawler \
--name $GLUE_CRAWLER_NAME \
--database-name $GLUE_DATABASE_NAME \
--targets "{\"JdbcTargets\": [{\"ConnectionName\": \"$GLUE_CONNECTION_NAME\", \"Path\": \"$REDSHIFT_DATABASE_NAME/%/$REDSHIFT_TABLE_NAME\"}]}" \
--role r1
Create table in RedShift
You can create a table in RedShift using the CreateTable
API.
The following command will create a table in RedShift:
$ REDSHIFT_STATEMENT_ID=$(awslocal redshift-data execute-statement \
--cluster-identifier $REDSHIFT_CLUSTER_IDENTIFIER \
--database $REDSHIFT_DATABASE_NAME \
--sql \
"create table $REDSHIFT_TABLE_NAME(salesid integer not null, listid integer not null, sellerid integer not null, buyerid integer not null, eventid integer not null, dateid smallint not null, qtysold smallint not null, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp)" | jq -r .Id)
You can check the status of the statement using the DescribeStatement
API.
The following command will check the status of the statement:
$ wait "awslocal redshift-data describe-statement \
--id $REDSHIFT_STATEMENT_ID" ".Status" "FINISHED"
Run the crawler
You can run the crawler using the StartCrawler
API.
The following command will run the crawler:
$ awslocal glue start-crawler \
--name $GLUE_CRAWLER_NAME
You can wait for the crawler to finish using the GetCrawler
API.
The following command will wait for the crawler to finish:
$ wait "awslocal glue get-crawler \
--name $GLUE_CRAWLER_NAME" ".Crawler.State" "READY"
You can finally retrieve the schema of the table using the GetTable
API.
The following command will retrieve the schema of the table:
$ awslocal glue get-table \
--database-name $GLUE_DATABASE_NAME \
--name "${REDSHIFT_DATABASE_NAME}_${REDSHIFT_SCHEMA_NAME}_${REDSHIFT_TABLE_NAME}"