# Document Classification AI Blueprint

# Batch-Predict

Document classification is the task of categorizing documents by their content to facilitate their filtering, management, organization, and searchability.

# Purpose

Use this batch blueprint to classify text from document files in .pdf, .txt, .docx, and .doc formats. The blueprint skips other document formats. The blueprint’s input files are placed in a directory and its output is stored in CSV format. Provide the path to the directory containing the folders with the supported document files. Also provide, as comma-separated values, the tags or labels to associate with these files. The blueprint outputs a result.csv file that includes each tag’s probability for each file.

# Deep Dive

The following flow diagram illustrates this batch-predict blueprint’s pipeline:

# Flow

The following list provides a high-level flow of this blueprint’s run:

  • In the S3 Connector, the user provides the data bucket name and the directory path where the files are located.
  • In the Batch task, the user provides the directory path (--dir) to the S3 Connector’s document-file folders.
  • The blueprint outputs a result.csv CSV file with document classifications.
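In outline, the batch flow above might look like the following sketch. This is not the blueprint’s actual implementation: `classify_text` is a stand-in for the real model call (here it just returns a uniform distribution), and the output column layout is an assumption.

```python
import csv
import os

# Formats the blueprint accepts; everything else is skipped.
SUPPORTED_EXTS = {".pdf", ".txt", ".docx", ".doc"}

def classify_text(text, labels):
    """Stand-in for the blueprint's model call: returns a probability per label.

    Here we simply assign a uniform distribution; the real blueprint scores
    each label with its classifier.
    """
    p = 1.0 / len(labels)
    return {label: p for label in labels}

def batch_classify(input_dir, labels, out_path="result.csv"):
    """Walk input_dir, skip unsupported formats, and write one row per file."""
    rows = []
    for root, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() not in SUPPORTED_EXTS:
                continue  # the blueprint skips other document formats
            # Real text extraction depends on the file type (pdf/docx parsers);
            # here we pass only the file name to the stand-in classifier.
            scores = classify_text(name, labels)
            rows.append({"file": name, **scores})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", *labels])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```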

# Arguments/Artifacts

For more information and examples on this blueprint’s inputs and outputs, click here.

# Inputs

  • --dir is the path to the folder containing all the subfolders that store the target files
  • --labels is the list of target labels, provided in quotes as comma-separated values without spaces; for example: "value1,value2". NOTE: To use your own trained model, provide the following two inputs and omit this --labels input.
  • --trained_model is the path to the model.pt file trained with this blueprint’s training counterpart. To access it, download the file from the output artifacts of the Train task of the Document Classification Train Blueprint.
  • --trained_classes is the path to the classes.json file, required when using your own trained model. Download this file from the output artifacts of the Train task of the Document Classification Train Blueprint.
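As a rough illustration of how these flags interact, the following sketch parses them with argparse. The flag names come from the list above, but the validation logic (labels required unless both trained-model inputs are given) is an assumption of this sketch.

```python
import argparse

def parse_args(argv):
    """Parse the batch task's documented flags (sketch; validation is assumed)."""
    parser = argparse.ArgumentParser(description="Document classification batch task")
    parser.add_argument("--dir", required=True,
                        help="folder containing the subfolders of target files")
    parser.add_argument("--labels", default=None,
                        help='comma-separated labels, e.g. "value1,value2"')
    parser.add_argument("--trained_model", default=None,
                        help="path to model.pt from the training blueprint")
    parser.add_argument("--trained_classes", default=None,
                        help="path to classes.json from the training blueprint")
    args = parser.parse_args(argv)
    if args.trained_model and args.trained_classes:
        args.labels = None  # a custom model supplies its own classes
    elif not args.labels:
        parser.error("provide --labels, or both --trained_model and --trained_classes")
    return args
```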

# Output

  • result.csv is the blueprint’s output that includes the document file names, their labels, and each label’s probability for each file.
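The exact column layout of result.csv is not documented here; assuming one probability column per label, a result might look like the inline sample below, which can be read back to pick the top label per file. File and label names are illustrative.

```python
import csv
import io

# A hypothetical result.csv: one row per document, one probability column per label.
sample = """file,invoice,contract
report.pdf,0.82,0.18
letter.docx,0.35,0.65
"""

def top_labels(csv_text):
    """Return {file: best_label} by picking the highest-probability column."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop("file")
        result[name] = max(row, key=lambda label: float(row[label]))
    return result
```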

# Instructions

NOTE

The minimum resource recommendations to run this blueprint are 3.5 CPU and 8 GB RAM.

NOTE

This blueprint’s performance can benefit from using GPU as its compute.

Complete the following steps to run the document-classifier model in batch mode:

  1. Click the Use Blueprint button. The cnvrg Blueprint Flow page displays.
  2. Click the S3 Connector task to display its dialog.
    • Within the Parameters tab, provide the following Key-Value information:
      • Key: bucketname − Value: provide the data bucket name
      • Key: prefix − Value: provide the main path to the files folders
    • Click the Advanced tab to change resources to run the blueprint, as required.
  3. Click the Batch task to display its dialog.
    • Within the Parameters tab, provide the following Key-Value pair information:

      • Key: --dir − Value: provide the S3 path to the folder containing the folders storing the target files in the following format: /input/s3_connector/dc_classification_data/; see Batch Inputs for more information.
      • Key: --labels − Value: provide the target labels in quotes as CSVs without spaces; see Batch Inputs for more information.

      NOTE

      You can use the prebuilt data example paths provided.

    • Click the Advanced tab to change resources to run the blueprint, as required.

  4. Click the Run button. The cnvrg software deploys a document-classifier model that classifies text in a batch of files and outputs a CSV file with the document classifications.
  5. Select Batch > Experiments > Artifacts and locate the batch output CSV file.
  6. Select the result.csv File Name, click the Menu icon, and select Open File to view the output CSV file.

A custom model that classifies text in different formatted document files has been deployed in batch mode. For information on this blueprint’s software version and release details, click here.

# Connected Libraries

Refer to the following libraries connected to this blueprint:

Refer to the following blueprints related to this batch blueprint:

# Inference

Document classification is the task of categorizing documents by their content to facilitate their filtering, management, organization, and searchability.

# Purpose

Use this inference blueprint to deploy a document-classifier model and its API endpoint. This pretrained model provides a ready-to-use API endpoint that can be quickly integrated with your input data, supplied as raw text along with custom label names, and returns an associated probability value for each label.

This inference blueprint’s model was trained using the Hugging Face multi_nli dataset. To use custom document data specific to your business, run this blueprint’s training counterpart, which trains the model and establishes an endpoint based on the newly trained model.
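For illustration only, a request to such an endpoint and its response might be shaped as follows. The field names (input_params, text, labels, prediction) are assumptions, not the endpoint’s documented schema; consult the endpoint’s integration panel for the actual format.

```json
{
  "input_params": {
    "text": "Please find attached the invoice for last month's services.",
    "labels": "invoice,contract,resume"
  }
}
```

A corresponding response would return one probability per label:

```json
{
  "prediction": {
    "invoice": 0.91,
    "contract": 0.07,
    "resume": 0.02
  }
}
```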

# Instructions

NOTE

The minimum resource recommendations to run this blueprint are 3.5 CPU and 8 GB RAM.

NOTE

This blueprint’s performance can benefit from using GPU as its compute.

Complete the following steps to deploy this document-classifier endpoint:

  1. Click the Use Blueprint button.
  2. In the dialog, select the relevant compute to deploy the API endpoint and click the Start button.
  3. The cnvrg software redirects to your endpoint. Complete one or both of the following options:
    • Use the Try it Live section with any document file or link to be classified.
    • Use the bottom integration panel to integrate your API with your code by copying in the code snippet.
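As a hedged sketch of the integration step, the following builds a POST request for a hypothetical endpoint. The URL, header name, and payload fields are placeholders; copy the actual snippet from your endpoint’s integration panel.

```python
import json
import urllib.request

# Hypothetical endpoint URL; substitute the URL shown in your integration panel.
ENDPOINT_URL = "https://app.cnvrg.example/api/v1/endpoints/document-classifier"

def build_request(text, labels, token="<your-api-token>"):
    """Build (but do not send) a POST request for the classification endpoint.

    The payload shape and auth header name are assumptions of this sketch.
    """
    payload = json.dumps({"input_params": {"text": text, "labels": labels}}).encode()
    return urllib.request.Request(
        ENDPOINT_URL,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Cnvrg-Api-Key": token},
        method="POST",
    )

# To send: urllib.request.urlopen(build_request("some text", "invoice,contract"))
```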

An API endpoint that classifies documents has now been deployed. For more information on this blueprint’s software version and release details, click here.

Refer to the following blueprints related to this inference blueprint:

# Training

Document classification is the task of categorizing documents by their content to facilitate their filtering, management, organization, and searchability.

# Overview

The following diagram provides an overview of this blueprint's inputs and outputs.

# Purpose

Use this training blueprint to train a custom model on textual content within a set of documents. This blueprint also establishes an endpoint that can be used to classify documents based on the newly trained model.

To train this model with your data, provide in S3 a documents_dir dataset directory with multiple subdirectories containing the different classes of documents. The blueprint supports document files in .pdf, .txt, .docx, and .doc formats. Other document formats are skipped.
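Assuming class subfolders as described, a documents_dir layout might look like the following; folder and file names are illustrative.

```
documents_dir/
├── invoices/
│   ├── inv_001.pdf
│   └── inv_002.docx
├── contracts/
│   ├── contract_a.pdf
│   └── contract_b.txt
└── resumes/
    └── cv_jane.doc
```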

# Deep Dive

The following flow diagram illustrates this blueprint's pipeline:

# Flow

The following list provides a high-level flow of this blueprint’s run:

  • In the S3 Connector, the user provides the data bucket name and the directory path containing the training documents, divided into class subfolders inside the main folder. Also provided is a single CSV file containing document names and their mappings to their respective classes.
  • In the Train task, the user provides the documents_dir path to the documents directory including the previous S3 prefix.
  • The blueprint trains the model on the given dataset and produces a model output file.
  • The user uses the deployed endpoint to classify personalized business documents.

# Arguments/Artifacts

For information and examples of this task’s inputs and outputs, click here.

# Train Inputs

  • --documents_dir is the path to the main folder containing the training documents, divided into class subfolders inside this main folder.
  • --labels path is the path to the CSV file that maps document names to their classes. This two-column CSV file contains the document names and their classes/labels: the first column, called document, lists the document names present in the training folder; the second column, called class, lists the classes/labels the model is to learn so that future documents can be associated with them with a certain level of confidence. For an example CSV file, click here.
  • --epochs is the number of training iterations for the model, which can be increased if the loss is high at the end of training. Default: 200.
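An illustrative labels CSV matching the two-column layout described above; the document names are hypothetical.

```csv
document,class
inv_001.pdf,invoice
contract_a.pdf,contract
cv_jane.doc,resume
```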

# Train Outputs

  • model.pt is the file containing the retrained model, which can be used later for classifying documents.
  • classes.json is the file containing all the unique classes found in the training dataset.
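A minimal sketch of consuming these artifacts, assuming classes.json stores a JSON list of class names and model.pt is a standard PyTorch checkpoint; both assumptions should be checked against the actual Train task outputs.

```python
import json

def load_classes(path):
    """Read the unique class names written by the Train task.

    Assumes classes.json holds a JSON list, e.g. ["invoice", "contract"].
    """
    with open(path) as f:
        return json.load(f)

def load_model(path):
    """Load model.pt if PyTorch is installed; the checkpoint format is assumed."""
    import torch  # requires the torch package
    return torch.load(path, map_location="cpu")
```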

# Instructions

NOTE

The minimum resource recommendations to run this blueprint are 3.5 CPU and 8 GB RAM.

NOTE

This blueprint’s performance can benefit from using GPU as its compute.

Complete the following steps to train the document-classifier model:

  1. Click the Use Blueprint button. The cnvrg Blueprint Flow page displays.

  2. In the flow, click the S3 Connector task to display its dialog.

    • Within the Parameters tab, provide the following Key-Value pair information:
      • Key: bucketname - Value: enter the data bucket name
      • Key: prefix - Value: provide the main path to the documents folder
    • Click the Advanced tab to change resources to run the blueprint, as required.
  3. Return to the flow and click the Train task to display its dialog.

    • Within the Parameters tab, provide the following Key-Value pair information:

      • Key: documents_dir − Value: provide the path to the directory including the S3 prefix in the following format: /input/s3_connector/dc_classification_train_data
      • Key: labels path − Value: provide the path to the CSV file containing the mapping of document names to their classes in the following format: /input/s3_connector/dc_classification_train_data/file.csv
      • Key: epochs − Value: provide the number of training iterations for the model.

      NOTE

      You can use the prebuilt example data paths provided.

    • Click the Advanced tab to change resources to run the blueprint, as required.

  4. Click the Run button. The cnvrg software launches the training blueprint as a set of experiments, generating a trained document-classifier model and deploying it as a new API endpoint.

    NOTE

    The time required for model training and endpoint deployment depends on the size of the training data, the compute resources, and the training parameters.

    For more information on cnvrg endpoint deployment capability, see cnvrg Serving.

  5. Track the blueprint's real-time progress in its Experiment page, which displays artifacts such as logs, metrics, hyperparameters, and algorithms.

  6. Click the Serving tab in the project and locate your endpoint.

  7. Complete one or both of the following options:

    • Use the Try it Live section with any document file or link to be classified.
    • Use the bottom integration panel to integrate your API with your code by copying in the code snippet.

A custom model and API endpoint, which can classify a document’s textual content, have now been trained and deployed. For information on this blueprint’s software version and release details, click here.

# Connected Libraries

Refer to the following libraries connected to this blueprint:

Refer to the following blueprints related to this training blueprint:

Last Updated: 1/17/2023, 10:52:15 PM