Credit Scoring with Snowpark & LocalStack

Introduction

In this tutorial, you will learn how you can use the Snowflake emulator with Snowpark for Python and your favorite Python libraries for data analysis.

The Jupyter Notebook and the dataset used in this tutorial are available on GitHub.

Prerequisites

localstack CLI with a LOCALSTACK_AUTH_TOKEN
LocalStack for Snowflake
Snowpark with other Python libraries
Jupyter Notebook

You should also download credit_files.csv and credit_request.csv files from the LocalStack repository. The files should be present in the same directory as your Jupyter Notebook.

Start the Snowflake emulator

Start your LocalStack container in your preferred terminal/shell.

export LOCALSTACK_AUTH_TOKEN=<your_auth_token>
localstack start --stack snowflake

Create a Snowpark session

The next step is to configure the Snowflake emulator. The Snowflake emulator runs on snowflake.localhost.localstack.cloud. You can use the Snowpark to connect to the locally running Snowflake server.

Start Jupyter Notebook and create a new notebook. Add the following code to connect to the Snowflake emulator:

1
from snowflake.snowpark import *
2
from snowflake.snowpark.functions import *
3

4
connection_parameters = {
5
    "user": "test",
6
    "password": "test",
7
    "account": "test",
8
    "warehouse": "test",
9
    "host": "snowflake.localhost.localstack.cloud",
10
}
11

12
session = Session.builder.configs(connection_parameters).create()

In the above configuration, you can set user, password, account, and warehouse as test to avoid passing any production values. You can now run Snowflake SQL queries on your local machine.

1
session.sql("create or replace database credit_bank").collect()
2
session.sql("use schema credit_bank.public").collect()
3
print(session.sql("select current_warehouse(), current_database(), current_schema(), current_user(), current_role()").collect())

1
[Row(?COLUMN?='TEST', CURRENT_DATABASE='CREDIT_BANK', CURRENT_SCHEMA='public', ?COLUMN?='TEST', GET_CURRENT_ROLE='PUBLIC')]

Create the tables

You can now create two tables associated with this tutorial:

CREDIT_FILES: This table contains the credit on files along with the credit standing whether the loan is being repaid or if there are actual issues with reimbursing the credit.
CREDIT_REQUESTS: This table contains the new credit requests that the bank needs to provide approval on.

Run the following code to create the credit_df table:

1
import pandas as pd
2
credit_files = pd.read_csv('credit_files.csv')
3
session.write_pandas(credit_files,"CREDIT_FILES",auto_create_table='True')
4
credit_df = session.table("CREDIT_FILES")
5
credit_df.schema

StructType([StructField('CREDIT_REQUEST_ID', LongType(), nullable=True), StructField('CREDIT_AMOUNT', LongType(), nullable=True), StructField('CREDIT_DURATION', LongType(), nullable=True), StructField('PURPOSE', StringType(), nullable=True), StructField('INSTALLMENT_COMMITMENT', LongType(), nullable=True), StructField('OTHER_PARTIES', StringType(), nullable=True), StructField('CREDIT_STANDING', StringType(), nullable=True), StructField('CREDIT_SCORE', LongType(), nullable=True), StructField('CHECKING_BALANCE', LongType(), nullable=True), StructField('SAVINGS_BALANCE', LongType(), nullable=True), StructField('EXISTING_CREDITS', LongType(), nullable=True), StructField('ASSETS', StringType(), nullable=True), StructField('HOUSING', StringType(), nullable=True), StructField('QUALIFICATION', StringType(), nullable=True), StructField('JOB_HISTORY', LongType(), nullable=True), StructField('AGE', LongType(), nullable=True), StructField('SEX', StringType(), nullable=True), StructField('MARITAL_STATUS', StringType(), nullable=True), StructField('NUM_DEPENDENTS', LongType(), nullable=True), StructField('RESIDENCE_SINCE', LongType(), nullable=True), StructField('OTHER_PAYMENT_PLANS', StringType(), nullable=True)])

In a similar fashion, you can create the credit_req_df table:

1
credit_requests = pd.read_csv('credit_request.csv')
2
session.write_pandas(credit_requests,"CREDIT_REQUESTS",auto_create_table='True')
3
credit_req_df = session.table("CREDIT_REQUESTS")
4
credit_req_df.schema

StructType([StructField('CREDIT_REQUEST_ID', LongType(), nullable=True), StructField('CREDIT_AMOUNT', LongType(), nullable=True), StructField('CREDIT_DURATION', LongType(), nullable=True), StructField('PURPOSE', StringType(), nullable=True), StructField('INSTALLMENT_COMMITMENT', LongType(), nullable=True), StructField('OTHER_PARTIES', StringType(), nullable=True), StructField('CREDIT_SCORE', LongType(), nullable=True), StructField('CHECKING_BALANCE', LongType(), nullable=True), StructField('SAVINGS_BALANCE', LongType(), nullable=True), StructField('EXISTING_CREDITS', LongType(), nullable=True), StructField('ASSETS', StringType(), nullable=True), StructField('HOUSING', StringType(), nullable=True), StructField('QUALIFICATION', StringType(), nullable=True), StructField('JOB_HISTORY', LongType(), nullable=True), StructField('AGE', LongType(), nullable=True), StructField('SEX', StringType(), nullable=True), StructField('MARITAL_STATUS', StringType(), nullable=True), StructField('NUM_DEPENDENTS', LongType(), nullable=True), StructField('RESIDENCE_SINCE', LongType(), nullable=True), StructField('OTHER_PAYMENT_PLANS', StringType(), nullable=True)])

Analyze the data

You can now analyze the data using Snowpark and your favorite Python libraries. For example, you can fetch the first 5 rows of the credit_df table:

credit_df.toPandas().head()

You can similarly fetch the first 5 rows of the credit_req_df table:

credit_req_df.toPandas().head()

You can also visualize the numeric features and categorical features. For example, you can visualize the histogram of the credit_df table:

credit_df.toPandas().hist(figsize=(15,15))

credit_df_hist

You can also visualize the categorical features of the credit_df table:

1
import matplotlib.pyplot as plt
2
import seaborn as sns
3

4
sns.set(style="darkgrid")
5

6
fig, axs = plt.subplots(5, 2, figsize=(15, 30))
7
df = credit_df.toPandas()
8
sns.countplot(data=df, y="PURPOSE", ax=axs[0,0])
9
sns.countplot(data=df, x="OTHER_PARTIES", ax=axs[0,1])
10
sns.countplot(data=df, x="CREDIT_STANDING", ax=axs[1,0])
11
sns.countplot(data=df, x="ASSETS", ax=axs[1,1])
12
sns.countplot(data=df, x="HOUSING", ax=axs[2,0])
13
sns.countplot(data=df, x="QUALIFICATION", ax=axs[2,1])
14
sns.countplot(data=df, x="SEX", ax=axs[3,0])
15
sns.countplot(data=df, x="MARITAL_STATUS", ax=axs[3,1])
16
sns.countplot(data=df, x="OTHER_PAYMENT_PLANS", ax=axs[4,0])
17
sns.stripplot(y="PURPOSE", x="CREDIT_AMOUNT", data=df, hue='CREDIT_STANDING', jitter=True, ax=axs[4,1])
18
plt.show()

credit_df_cat

Conclusion

You can now perform further experimentations with the Snowflake emulator. For example, you can use the Snowpark API to run queries to get various insights, such as determining the range of loans per different category.