Textract

Get started with Textract on LocalStack

Textract is a machine learning service that automatically extracts text, forms, and tables from scanned documents. It simplifies the process of extracting valuable information from a variety of document types, enabling applications to quickly analyze and understand document content.

LocalStack supports Textract via the Pro/Team offering, allowing you to mock Textract APIs in your local environment. The supported APIs are available on our API coverage page, providing details on the extent of Textract’s integration with LocalStack.

Getting started

This guide is tailored for users new to Textract and assumes basic knowledge of the AWS CLI and our awslocal wrapper script.

Start your LocalStack container using your preferred method. We will demonstrate how to perform basic Textract operations, such as mocking text detection in a document.

Detect document text

You can use the DetectDocumentText API to identify and extract text from a document. Execute the following command:

$ awslocal textract detect-document-text \
    --document '{"S3Object":{"Bucket":"your-bucket","Name":"your-document"}}'

The following output would be retrieved:

{
    "DocumentMetadata": {
        "Pages": {
            "Pages": 389
        }
    },
    "Blocks": [],
    "DetectDocumentTextModelVersion": "1.0"
}

Start document text detection job

You can use the StartDocumentTextDetection API to asynchronously detect text in a document. Execute the following command:

$ awslocal textract start-document-text-detection \
        --document-location '{"S3Object":{"Bucket":"bucket","Name":"document"}}'

The following output would be retrieved:

{
    "JobId": "501d7251-1249-41e0-a0b3-898064bfc506"
}

Save the JobId value to use in the next command.

Get document text detection job

You can use the GetDocumentTextDetection API to retrieve the results of a document text detection job. Execute the following command:

$ awslocal textract get-document-text-detection \
    --job-id "501d7251-1249-41e0-a0b3-898064bfc506"

Replace 501d7251-1249-41e0-a0b3-898064bfc506 with the JobId value retrieved from the previous command. The following output would be retrieved:

{
    "DocumentMetadata": {
        "Pages": {
            "Pages": 389
        }
    },
    "JobStatus": "SUCCEEDED",
    "Blocks": [],
    "DetectDocumentTextModelVersion": "1.0"
}

Last modified February 9, 2024: add docs on textract (#1072) (ea27d8c21)