# Text (PDF) Extraction AI Blueprint

# Batch-Predict

Text extraction uses a trained model to extract raw text from a batch of digital or scanned PDF files. Further operations such as searches can be performed on the text extracted using the blueprint.

# Purpose

Use this batch blueprint to extract text from a batch of PDF files and store the extracted text in JSON format.

The blueprint uses a combination of PDF extraction and OCR techniques to extract text from digital and scanned PDFs. To use this blueprint and generate a JSON file in the output artifacts, provide the path to the directory containing the PDF files to be extracted. All the PDF files in the selected directory are parsed and their text stored in the resulting JSON file.
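The blueprint's internal logic is not documented on this page. As a rough illustration only, one common way to combine the two techniques is to read a page's embedded text layer first and fall back to OCR when that layer is empty or nearly empty. In the sketch below, `extract_page_text`, `read_text_layer`, `run_ocr`, and the `min_chars` threshold are all hypothetical names and values chosen for the example; the stand-in extractors replace a real PDF parser and OCR engine so that the sketch runs on its own.

```python
from typing import Callable, Tuple


def extract_page_text(
    page: object,
    read_text_layer: Callable[[object], str],
    run_ocr: Callable[[object], str],
    min_chars: int = 20,  # assumed threshold, not a blueprint setting
) -> Tuple[str, str]:
    """Return (text, method) for one page.

    Tries the embedded text layer first; if it yields fewer than
    min_chars characters, the page is treated as scanned and OCR is used.
    """
    text = read_text_layer(page).strip()
    if len(text) >= min_chars:
        return text, "text_layer"
    return run_ocr(page).strip(), "ocr"


# Stand-in pages and extractors (hypothetical) so the sketch runs
# without a PDF library or an OCR engine installed.
digital_page = {"text_layer": "Invoice 1234, total due 56.00 USD", "image": ""}
scanned_page = {"text_layer": "", "image": "Scanned letter body"}


def fake_text_layer(page):
    return page["text_layer"]


def fake_ocr(page):
    return page["image"]


digital_result = extract_page_text(digital_page, fake_text_layer, fake_ocr)
scanned_result = extract_page_text(scanned_page, fake_text_layer, fake_ocr)
print(digital_result)
print(scanned_result)
```

Passing the extractors in as callables keeps the fallback decision separate from whichever PDF and OCR libraries are actually used.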

# Deep Dive

The following flow diagram illustrates this batch-predict blueprint’s pipeline:

# Flow

The following list provides a high-level flow of this blueprint’s run:

  • In the S3 Connector, the user uploads the PDF files containing text to be extracted.
  • In the Batch task, the user provides the S3 location of the PDF files.
  • The blueprint outputs a single JSON file with information about the extracted text.
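The first step of the flow, placing the PDFs in S3, can also be scripted. The following is a minimal sketch, assuming the third-party `boto3` package is installed and AWS credentials are already configured. The helper names `plan_uploads` and `upload_pdfs` are hypothetical, and the bucket and prefix are whatever you later enter in the S3 Connector task.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix):
    """Map each PDF in local_dir to the S3 key it would be stored under."""
    prefix = prefix.strip("/")
    pdfs = sorted(
        p for p in Path(local_dir).iterdir() if p.suffix.lower() == ".pdf"
    )
    return [(p, f"{prefix}/{p.name}") for p in pdfs]


def upload_pdfs(local_dir, bucket, prefix):
    """Upload every PDF in local_dir to s3://bucket/prefix/."""
    import boto3  # third-party; imported here so plan_uploads works without it

    s3 = boto3.client("s3")
    for path, key in plan_uploads(local_dir, prefix):
        # upload_file(Filename, Bucket, Key)
        s3.upload_file(str(path), bucket, key)
```

Calling `plan_uploads` on its own is a dry run: it lists the keys that would be created without contacting S3.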

# Arguments/Artifacts

For more information on this blueprint’s tasks, its inputs, and outputs, click here.

# Input

--dir is the path to the directory containing the relevant PDF files.

# Output

--result.json is the name of the output JSON file that contains the extracted text. An example result.json file can be found here.
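Once result.json is available, follow-up operations such as searches can be run on it. The file's exact schema is not described on this page, so the sketch below makes no assumption about it: it walks whatever nested objects and arrays the JSON contains and reports where a query string occurs. The function names and the `sample` structure are made up for illustration.

```python
import json


def iter_strings(node, path=""):
    """Yield (path, text) for every string found in a parsed JSON value."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}/{index}")


def search_extracted_text(data, query):
    """Return the paths of all strings containing query (case-insensitive)."""
    needle = query.casefold()
    return [p for p, text in iter_strings(data) if needle in text.casefold()]


def search_result_file(json_path, query):
    """Load a result.json file and search it."""
    with open(json_path, encoding="utf-8") as handle:
        return search_extracted_text(json.load(handle), query)


# Made-up structure for illustration; the real result.json may differ.
sample = {
    "report.pdf": ["Quarterly revenue rose.", "No further notes."],
    "memo.pdf": "Revenue forecast attached.",
}
hits = search_extracted_text(sample, "revenue")
print(hits)
```

Each hit is a slash-separated path into the JSON, which indicates where the matching text is stored.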

# Instructions

NOTE

The minimum resource recommendations to run this blueprint are 3.5 CPU and 8 GB RAM.

Complete the following steps to run the text-extractor blueprint in batch mode:

  1. Click the Use Blueprint button. The cnvrg Blueprint Flow page displays.

  2. Click the S3 Connector task to display its dialog.

    • Within the Parameters tab, provide the following Key-Value information:
      • Key: bucketname − Value: provide the data bucket name
      • Key: prefix − Value: provide the main path to the folder with the PDFs
    • Click the Advanced tab to change resources to run the blueprint, as required.
  3. Click the Batch task to display its dialog.

    • Within the Parameters tab, provide the following Key-Value pair information:

      • Key: dir − Value: provide the path to the PDF directory including the S3 prefix
      • /input/s3_connector/pdf_extraction_data − ensure the path adheres to this format

      NOTE

      You can use the prebuilt data example paths provided.

    • Click the Advanced tab to change resources to run the blueprint, as required.

  4. Click the Run button.

    The cnvrg software deploys a text-extractor model that extracts text from the batch of PDFs and outputs a single JSON file with information about the extracted text.

  5. Track the blueprint’s real-time progress in its Experiments page, which displays artifacts such as logs, metrics, hyperparameters, and algorithms.

  6. Select Batch > Experiments > Artifacts and locate the batch output JSON files.

  7. Click the result.json File Name, click the Menu icon, and select Open File to view the output JSON file.

A custom model that extracts text from a batch of PDF files has now been deployed. For information on this blueprint's software version and release details, click here.
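The `dir` value entered in step 3 follows the pattern of the example path, `/input/s3_connector/pdf_extraction_data`. The sketch below derives that value from the S3 prefix. It assumes, based only on that single example, that the S3 Connector output sits under `/input/s3_connector` with the prefix appended; `batch_dir_for_prefix` is a hypothetical helper name.

```python
import posixpath

# Root taken from the example path in step 3 (an assumption about the layout).
S3_CONNECTOR_ROOT = "/input/s3_connector"


def batch_dir_for_prefix(prefix):
    """Build the Batch task's dir value from the S3 Connector prefix."""
    cleaned = prefix.strip("/")
    if not cleaned:
        raise ValueError("prefix must not be empty")
    return posixpath.join(S3_CONNECTOR_ROOT, cleaned)


example_dir = batch_dir_for_prefix("pdf_extraction_data/")
print(example_dir)
```

Stripping leading and trailing slashes first means `pdf_extraction_data`, `/pdf_extraction_data`, and `pdf_extraction_data/` all produce the same `dir` value.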

# Connected Libraries

Refer to the following libraries connected to this blueprint:

Refer to the following blueprint related to this batch blueprint:

# Inference

Text extraction uses a trained model to extract text from PDFs. This blueprint can be used with digital or scanned PDFs. Input a PDF and receive raw text data from the PDF’s textual content. Further operations such as searches can be performed on the text extracted using the blueprint.

# Purpose

Use this inference blueprint to immediately extract text from a scanned or digital PDF. To use this pretrained text-extractor model, create a ready-to-use API endpoint that can be quickly integrated with your data and application.

This inference blueprint’s model was trained to extract text from one PDF at a time. To extract text from multiple PDFs in one run, use this blueprint’s batch-predict counterpart, which extracts text from multiple PDFs placed in an S3 bucket or a cnvrg dataset.

# Instructions

NOTE

The minimum resource recommendations to run this blueprint are 3.5 CPU and 8 GB RAM.

Complete the following steps to deploy a text-extractor API endpoint:

  1. Click the Use Blueprint button.

  2. In the dialog, select the relevant compute to deploy the API endpoint and click the Start button.

  3. The cnvrg software redirects to your endpoint. Complete one or both of the following options:

    • Use the Try it Live section with any text-containing PDF to check the model's ability to extract text.
    • Use the bottom integration panel to integrate your API with your code by copying in the code snippet.

An API endpoint that extracts text from any digital or scanned PDF has now been deployed. For information on this blueprint's software version and release details, click here.
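The integration panel provides the authoritative snippet for your endpoint, including its URL, authentication header, and input format. Purely to illustrate the general shape of such a call, the sketch below prepares a POST request that carries a PDF as base64 text inside a JSON body. The URL, header name, API key, and the `pdf` field name are placeholders and not the endpoint's documented contract, so replace them with the values shown in the panel. The request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def build_pdf_request(endpoint_url, headers, pdf_bytes, field_name):
    """Prepare (but do not send) a JSON POST request carrying one PDF.

    field_name is the JSON key the endpoint expects; it is an assumption
    here and must be taken from the integration panel.
    """
    payload = {field_name: base64.b64encode(pdf_bytes).decode("ascii")}
    all_headers = {"Content-Type": "application/json", **headers}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


# Placeholder values only; none of these come from the blueprint itself.
pdf_bytes = b"%PDF-1.4 placeholder bytes"
request = build_pdf_request(
    "https://example.invalid/your-endpoint",
    {"X-Api-Key": "YOUR_API_KEY"},
    pdf_bytes,
    "pdf",
)
print(request.get_method(), request.full_url)

# To send it once the placeholders are replaced:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the code at a live endpoint.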

Refer to the following blueprint related to this inference blueprint:

Last Updated: 1/17/2023, 10:52:15 PM