# Datasets

Datasets in cnvrg allow users to easily upload and version any kind of file.

The cnvrg Datasets functionality uses an object store as its backend to host files of any type, in any quantity, with dataset sizes up to 500GB. Additionally, cnvrg Datasets allows you to version, label, and tag your data.

Datasets are managed at the organizational level rather than on a separate project level. Once you've uploaded datasets to cnvrg in your organization, you can reuse them in every project, experiment, and notebook.

# Datasets Page

Access all your connected datasets from the Datasets tab of your organization.

The cnvrg platform automatically manages your dataset with an internal version-control system, so you can track your dataset at every stage. Every action is written as a new commit, so you can browse and select specific versions. Versioning gives you the confidence to use your dataset as needed: cnvrg keeps it safe and controlled, without the risk of lost files or features.

Access different versions of your data using the Version drop-down menu on the dataset's page. Select the desired version and the page refreshes to display the selected version.

Under the Actions column for any specific dataset, click:

  • Browse to perform queries on the dataset.
  • Revert to revert your dataset to that specific commit version.

# Dataset File Uploads

Use the following file-size guidelines to identify the correct method to upload your dataset files:

  • cnvrg user interface (UI): for small dataset files, up to 20MB per upload instance.
  • cnvrg command line interface (CLI): for files greater than 20MB but less than 5GB.
  • Network File System (NFS): for datasets greater than 5GB* but less than 500GB. See NFS Integration.
  • PersistentVolumeClaim (PVC) (Kubernetes only): for datasets greater than 5GB* but less than 500GB. They upload in Local Folders as an NFS mount. Refer to NFS Cache Configuration.

*NOTE

These are guidelines; the dataset doesn't have to be greater than 5GB to use PVC mount or NFS integration.
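Purely to illustrate the size guidelines above, a hypothetical helper (not part of any cnvrg tool) could map a file size to the suggested upload method:

```python
# Hypothetical helper illustrating the upload-size guidelines; not a cnvrg API.
MB = 1024 ** 2
GB = 1024 ** 3

def suggest_upload_method(size_bytes):
    """Map a file size to the upload method suggested by the guidelines."""
    if size_bytes <= 20 * MB:
        return "UI"          # web UI: files up to 20MB per upload instance
    if size_bytes < 5 * GB:
        return "CLI"         # CLI: larger than 20MB but under 5GB
    if size_bytes <= 500 * GB:
        return "NFS or PVC"  # network storage for very large datasets
    raise ValueError("datasets above 500GB are not supported")

print(suggest_upload_method(10 * MB))   # UI
print(suggest_upload_method(1 * GB))    # CLI
print(suggest_upload_method(100 * GB))  # NFS or PVC
```

Remember that these thresholds are guidelines, not hard limits (apart from the 20MB UI cap).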

There are several ways to add datasets. For a large dataset, it is recommended to use the CLI, NFS, or PVC.

# Uploading datasets using the web UI

The cnvrg UI suffices to upload small dataset files. Complete the following steps to create a new dataset:

  1. Navigate to the Datasets tab.
  2. Click + New Dataset.
  3. In the displayed panel, select a Name and a Type (optional).
  4. Click Save Dataset.

cnvrg creates an empty dataset ready for all your files to be added.

After creating your dataset, you can upload your data using the drag-and-drop UI. Every upload session is counted as a commit.

WARNING

Keep in mind the cnvrg web UI limits each file upload instance to 20MB. If you want to upload larger files, use the CLI, NFS, or PVC.

NOTE

You can use the CLI to upload files to datasets that you created in the UI.

# Uploading and removing files using the CLI

If you have a large dataset, you can use the CLI.

TIP

Details about installing and using the cnvrg CLI can be found here.

Upload data

Use the cnvrg data put command to upload files to a remote dataset without initializing a local version:

cnvrg data put [--commit=SHA1/latest] [--message='MESSAGE'] dataset_slug file_path/file

The data put command uploads the matching files to the remote dataset.

If you use the command with a specific file (for example, cnvrg data put mnist image.png), it uploads the file to the parent directory of the dataset.

If you include the file path (for example, cnvrg data put mnist training/image.png), it uploads the file to that same path in the remote dataset.

If you use a wildcard pattern (for example, cnvrg data put mnist *.png), all files that match the pattern upload to the remote dataset.

Learn more about cnvrg data put in the CLI documentation.

NOTE

In CLI versions below 1.9.5, you must use the full dataset URL. For example: cnvrg data put https://app.cnvrg.io/my_org/datasets/mnist image.png

Remove data

Use the cnvrg data rm command to remove files from a remote dataset without initializing a local version:

cnvrg data rm [--message='MESSAGE'] dataset_slug file_path/file

The data rm command removes the matching files from the remote dataset.

If you use the command with a specific file (for example, cnvrg data rm mnist image.png), it removes the file from the dataset.

If you use the command with a folder (for example, cnvrg data rm mnist training/), the folder and all its contents are removed from the remote dataset.

Learn more about cnvrg data rm in the CLI documentation.

# Version Control

As described above, cnvrg automatically versions your dataset: every action is written as a new commit, which you can browse using the Version drop-down menu on the dataset's page, or revert to using the Revert action in the Actions column.

# Commits

Click the Commits tab of your dataset to access an overview of its version history.

# Commit table

The commit table displays a row for each commit of your dataset with the following information for each commit:

  • Commit Message: The commit's message. Note: Clicking on the message displays the commit summary page.
  • Commit SHA1
  • Commit Size
  • File Count: The number of files in the commit.
  • User: The user who made the commit.
  • Created at: The date and time of the commit.
  • (If connected to NFS) Cache: The toggle to cache/clear the commit for each connected NFS.
  • (If connected to NFS) Storage: The NFS the Cache toggle and Status are referring to.
  • (If connected to NFS) Status: If and when the commit was cached to the specific NFS.
  • Commit actions menu:
    • Browse: The commit's browser. Note: Clicking this link displays the file viewer for the commit.
    • Revert: The default revert function. Note: Clicking this link makes the commit the default commit (requires confirmation).

# Commit summary page

To access this page, go to your dataset and click Commits. Click the commit message corresponding to the desired commit.

This page provides a summary of the selected commit's changes, including:

  • The commit message and SHA1.
  • The date and time the commit was made.
  • The link to Browse the file viewer for the dataset at this commit.
  • The SHA1 of the parent commit.
  • A list of the files changed by the commit.

# Dataset Commit Caches (NFS Integration)

cnvrg can be integrated with one or more NFS data storage units. When an NFS disk is connected to your organization, you can cache commits of your dataset to the disk. When a commit is cached, you can attach it to jobs so they have immediate access to the data. Moreover, the job does not need to clone the dataset on start-up.

The caching process is simple. You can clear unused commits manually or even automatically.

You can access the controls for caching and clearing commits from several places in the UI, including the Commits table of the dataset and each cached commit's information page.

# Connect to NFS storage

To integrate NFS into your environment, please contact cnvrg.io support. The cnvrg team will help you set up the integration.

# Cache a commit to NFS

Complete the following steps to cache a commit to NFS:

  1. Access the Cache button for the selected commit through one of the methods to display the Cache Commit panel, which summarizes the commit to be cached:
    • The commit SHA1
    • The commit size
    • The commit status
    • The used and remaining storage for the selected NFS unit
  2. In the panel, use the drop-down menu to select the NFS unit on which the commit is to be cached.
  3. Click Cache to start caching the dataset commit.

The commit begins to cache. The information page for the cached commit displays, where you can track the caching process live.

An email notification is sent when the caching process finishes.

# Interrupt caching and clear the commit

If you choose to cache a commit, and the commit has not finished caching, you can interrupt the process, and stop and clear the commit.

Complete the following steps to stop and clear a caching commit from the cached commit's information screen:

  1. Navigate to the information page of the caching commit you want to interrupt.
  2. If the status is Caching, a red Stop & Clear button displays. Locate this button.
  3. Click the Stop & Clear button. An information panel displays with details about the commit.
  4. Click Stop & Clear on the panel to interrupt the caching and clear the commit from the NFS disk.

The caching is interrupted and the commit is cleared. You can follow the process live from the cached commit's information screen.

# Clear a cached commit

WARNING

You cannot clear a commit while the cached commit is currently in use by any active jobs such as workspaces and experiments. To enable clearing of the commit, either wait for the jobs to finish or stop them.

Complete the following steps to clear a cached commit and free the space currently in use on the NFS unit:

  1. Access the Clear button for the chosen commit using one of the supported methods to display the Clear Commit panel showing a summary of the commit being cleared:
    • The commit SHA1
    • The NFS storage to be cleared
    • The NFS storage available after the clear
  2. Click Clear to start clearing the dataset commit.

When the commit begins to clear, an information page displays, tracking the live clearing process.

# Access cached commit information page

Each cached commit has its own information page. On the cached commit's information page, you can find:

  • The commit SHA1.
  • The user who last used the cached commit and when.
  • The current cache status.
  • The size of the commit.
  • The most recent caching activity for the commit.
  • The name of the NFS disk and its remaining capacity.
  • The date and time when the next automatic clear will occur.

There are two ways to access this page:

  • On the files viewer for the commit:
    • Navigate to the files viewer for the desired cached commit.
    • Click the cached activity status along the top.
  • From the Commits tab of the dataset:
    • Navigate to the desired dataset.
    • Click the Commits tab.
    • Click the cached activity status for the desired commit.

# Automatically clear cached commits

cnvrg has the capability to automatically clear unused cached commits from your NFS disk to save disk space. Control this functionality in the organization Settings.

To turn this functionality on or off, go to Settings and use the Automatically Clear Cached Commits toggle. Set the number of days a cached commit is left unused before cnvrg automatically clears it from the NFS disk.

When the auto-clear functionality is toggled on, cnvrg clears any cached commit unused for the time specified in Settings. For example, if you set the threshold to 10 days, any cached commit not used by a job (like a workspace or experiment) for 10 days is cleared from its NFS disk and the space is restored.

A day before an unused cached commit is due to be automatically cleared, cnvrg sends an email notification to the user.
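The auto-clear policy can be sketched locally as follows. This is only an illustration of the documented logic, with hypothetical commit SHA1s and dates, not cnvrg's actual implementation:

```python
from datetime import datetime, timedelta

# Illustration of the auto-clear policy; hypothetical data, not cnvrg code.
def commits_to_clear(last_used, now, unused_days):
    """Return the SHA1s of cached commits unused for at least `unused_days`."""
    cutoff = now - timedelta(days=unused_days)
    return [sha for sha, ts in last_used.items() if ts <= cutoff]

now = datetime(2022, 11, 30)
last_used = {
    "a1b2c3": datetime(2022, 11, 29),  # used yesterday: kept
    "d4e5f6": datetime(2022, 11, 10),  # unused for 20 days: cleared
}
print(commits_to_clear(last_used, now, unused_days=10))  # ['d4e5f6']
```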

# Local Folders (Network Storage Support)

If you have a storage disk attached to your network, you can easily mount the storage as a local folder for a job. There is no added setup required, but you must ensure the cluster or machine on which the job is running has access to the storage disk.

The option to use a network drive as a job's local folder is located in the Advanced Settings for Experiments and Workspaces.

Pass the IP address and folder of any external drive you want to attach as Local Folders to the job. Use the format <ip_of_network_drive>:/<name_of_folder>. This is mounted to the machine as /nfs/<name_of_folder>.
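For example, assuming a drive at 10.0.0.5 exporting a training_data folder (hypothetical values), the naming convention maps as in this sketch:

```python
# Hypothetical illustration of the Local Folders naming convention; not a cnvrg API.
def local_folder_paths(ip, folder):
    """Return the (mount spec to enter in the UI, path inside the job) pair."""
    return f"{ip}:/{folder}", f"/nfs/{folder}"

spec, mount_point = local_folder_paths("10.0.0.5", "training_data")
print(spec)         # 10.0.0.5:/training_data
print(mount_point)  # /nfs/training_data
```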

# Dataset Additions to Jobs

You can add a dataset in the following situations:

  • At workspace or experiment startup
  • In an online workspace
  • In a flow

You can also add datasets to active jobs using the cnvrg CLI and SDK.

# At workspace or experiment startup

When starting a workspace or an experiment from the UI, use the Datasets selector to add one or more datasets:

  1. Click Start Workspace or New Experiment and provide the relevant details in the displayed pane.
  2. To add your dataset(s), select the Datasets drop-down list.
  3. For each dataset to add to the workspace or experiment, choose the commit or query and the portion of the dataset to clone.
  4. Click the dataset to add it to the list of datasets for the workspace or experiment.
    You can remove a dataset from the list by clicking the X next to its name.
  5. Under Advanced Settings, pass the IP address and folder of any external drive to attach as Local Folders to the job. Use the format <ip_of_network_drive>:/<name_of_folder>. This mounts to the machine as /nfs/<name_of_folder>.
  6. Click Start Workspace or Run.

The datasets are cloned to the remote compute.

# In an online workspace

Complete the following steps to add datasets on-the-fly to online workspaces:

  1. Click the Datasets tab on the right sidebar for your open workspace to display the statuses of all datasets already connected to the workspace.
  2. To add a new dataset, click the Select Datasets To Attach drop-down menu.
  3. For each dataset to add to the workspace, choose the commit or query and the portion of the dataset to clone.
  4. Click the dataset to add it to the list of datasets for the workspace.
    You can remove a dataset from the list by clicking the X next to its name.
  5. Click Attach.

The selected datasets begin cloning to the workspace. You can track their statuses from the datasets panel where you attached them.

# In a flow

Complete the following steps to add a dataset in a flow:

  1. Open the flow to which you want to add the dataset.
  2. In the New Task drop-down list, select Data Task.
  3. In the displayed panel, select the Dataset, Dataset Commit, and Dataset Query (only if queries exist for the dataset).
  4. Click Save Changes.
  5. Link the displayed purple box to the task you want the dataset to be available in.

You can also add datasets when constructing a flow with a YAML file.

NOTE

The dataset is also available in any tasks that follow the task you connect it to.

# With the CLI

To add a dataset when running an experiment using the cnvrg CLI, use the --datasets flag:

cnvrg run python3 train.py --datasets='[{"id": "hotdogs", "commit": "latest", "query": "ketchup", "tree_only": true}]'

You can include multiple datasets in the array.

You can also include dataset information in a flow YAML file and use it with the CLI. See the full cnvrg CLI documentation for more information about running experiments using the CLI.

# With the SDK

To add a dataset when running an experiment using the cnvrg SDK, use the datasets parameter:

from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                    datasets=['hotdogs.ketchup'])

You can include multiple datasets in the array.

You can also include dataset information in a flow YAML file and use it with the SDK. See the full cnvrg SDK documentation for more information about running experiments using the SDK.

# Dataset File Downloads

There are several ways to download datasets or use them in your code. The following sections provide examples of dataset usage with the cnvrg CLI, the SDK, and other popular framework integrations.

# Including datasets within jobs

When you choose to include a dataset at an experiment start-up or within a workspace (whether at start-up or on-the-fly using the sidebar), the dataset is mounted in the remote compute and accessible from the absolute path: /data/name_of_dataset/.

For example, when working in a workspace, if you include a dataset hotdogs, the dataset and all its associated files can be found in the /data/hotdogs/ directory.

# Cloning using the CLI

Clone a dataset using the following command:

cnvrg data clone dataset_url [--commit=commit] [--only_tree] [--query=query]

This command is used to clone the dataset from the app servers into your local environment.

See the full cnvrg CLI documentation for more information about dataset usage in the CLI.

TIP

The dataset is accessible at /data/dataset_name/.

# Downloading using the cnvrg SDK

To download a dataset using the cnvrg SDK, use the following code snippet in your Python code or Jupyter workspace:

from cnvrg.modules.dataset import Dataset
dataset = Dataset('owner/dataset')

See the full cnvrg SDK documentation for more information about data commands in the SDK.

TIP

The dataset is accessible at /data/dataset_name/.

# Dataset Metadata

In machine learning, most data entries have metadata associated with them. Metadata can represent labels, image size, data source, and any other features of your files. cnvrg has built-in support for tagging objects.

Simply include a YAML file for each object for which you want to store metadata.

A metadata YAML file is simple to create. Use the following guidelines:

  • Name it according to the following convention: NameOfFile.ext_tags.yml.

  • Ensure the file's contents fit the following syntax:

    ---
    label1: value1
    label2: value2
    

As an example, for an image 9.png representing the digit "9" in an MNIST dataset, to tag it with metadata, create a YAML file with the following content:

9.png_tags.yml

---
label: "9"
width: "32"
height: "32"
source: "yann lecun"

Then update your changes back to cnvrg using the cnvrg UI or CLI.
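If you have many files to tag, a short script can generate the tag files in bulk. This is a sketch with hypothetical file names and tags; it writes the YAML by hand so it has no dependencies:

```python
import os
import tempfile

# Sketch: write a NameOfFile.ext_tags.yml file next to each tagged object.
# File names and tags below are hypothetical examples.
def write_tag_file(data_dir, filename, tags):
    """Write <filename>_tags.yml in data_dir, one `label: "value"` per line."""
    lines = ["---"] + [f'{key}: "{value}"' for key, value in tags.items()]
    path = os.path.join(data_dir, f"{filename}_tags.yml")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

data_dir = tempfile.mkdtemp()  # stand-in for your local dataset folder
path = write_tag_file(data_dir, "9.png",
                      {"label": "9", "width": "32",
                       "height": "32", "source": "yann lecun"})
print(open(path).read())
```

Run it over your local dataset folder, then upload the generated `_tags.yml` files with the rest of your data.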

# Data Queries

Using cnvrg, you can create queries and subsets of your original dataset. This is especially useful when you want to work on a subset of your dataset (for example, only data labelled "train").

There are two ways to query your data: using the file path or using tags. Complete the following steps:

  1. Navigate to the relevant dataset page.
  2. At the top, enter your query and click Search.

The files that match the query load.

TIP

You can also use a regular expression within your query.

# Querying datasets using the file path

You can query for any file, file type, or folder in your dataset. Simply enter the file path query you are looking for.

Example queries:

  • *.zip
  • cats/*.png
  • *.tfrecord
  • image-0*.png
  • training/6/*.png

To search using wildcards:

{ "fullpath": "image*.png" }
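Python's fnmatch module uses the same shell-style wildcards, so you can preview locally which paths a pattern would match. The paths below are hypothetical, and whether `*` crosses directory separators may differ from cnvrg's server-side matching:

```python
from fnmatch import fnmatch

# Hypothetical file paths to illustrate shell-style wildcard matching.
paths = ["image-01.png", "image-02.png", "cats/image-03.png", "notes.txt"]
matched = [p for p in paths if fnmatch(p, "image*.png")]
print(matched)  # ['image-01.png', 'image-02.png']
```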

# Querying datasets using tags

Once a dataset has been tagged using YAML files, you can use the key:value syntax within the query box to search for those labels.

{ "label": "hotdog" }

The query above searches for files that have metadata with the label hotdog.

# Querying datasets using wildcards

Use wildcards when querying labels:

{ "label": "hot*" }

The query above searches for files whose label metadata matches the pattern hot* (for example, hotdog).

# AND/OR operators

To run queries with and/or logical operators, use the _or and _and operators. For example, to query objects that are yellow or blue, use the following query:

{ _or: [{"color": "yellow"}, {"color": "blue"}] }

using wildcards:

{ _or: [{"color": "yel*"}, {"color": "blue"}] }

For AND operators, use a similar query:

{ _and: [{"color": "yellow"}, {"background": "blue"}] }

using wildcards:

{ _and: [{"color": "yellow"}, {"background": "*"}] }

# Mathematical operations

To run queries with >, <, <=, and >= operators, use the following:

{ result: { gt: 10 } }

The following table provides the keywords required to query the operators:

| Operator | Keyword |
| --- | --- |
| Greater than (>) | gt |
| Greater than or equal to (>=) | gte |
| Less than (<) | lt |
| Less than or equal to (<=) | lte |

For example, to run a query on a range of objects where 290 <= x <= 305, use the following:

{ _and: [{"x": { lte: 305 }}, {"x": { gte: 290 } } ]}

# In/not queries

To run queries with in operators, use the following:

{ "color": ["brown", "yellow"] }

To run queries with not operators, use the following:

{ "color": { not: "brown" } }

To run queries with not and in operators, use the following:

{ "color": { not: ["brown", "yellow"] } }

To run queries with not in fullpath, use the following:

{ _and: [{"fullpath":"15*.jpg"}, {"fullpath": {not: "1519.jpg"}}]}
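To build intuition for the operator semantics above, here is a small local evaluator over tag dictionaries. It is an illustrative sketch of the documented behavior, not cnvrg's query engine:

```python
from fnmatch import fnmatch

# Illustrative sketch of the documented query semantics; not cnvrg's engine.
def matches(tags, query):
    """Return True if a file's tag dict satisfies the query."""
    for key, cond in query.items():
        if key == "_and":
            if not all(matches(tags, q) for q in cond):
                return False
        elif key == "_or":
            if not any(matches(tags, q) for q in cond):
                return False
        elif isinstance(cond, dict):  # {gt/gte/lt/lte/not: value}
            value = tags.get(key)
            for op, operand in cond.items():
                if op == "gt" and not value > operand: return False
                if op == "gte" and not value >= operand: return False
                if op == "lt" and not value < operand: return False
                if op == "lte" and not value <= operand: return False
                if op == "not":
                    targets = operand if isinstance(operand, list) else [operand]
                    if value in targets: return False
        elif isinstance(cond, list):  # "in" query
            if tags.get(key) not in cond:
                return False
        else:  # plain value, with wildcard support
            if not fnmatch(str(tags.get(key)), str(cond)):
                return False
    return True

tags = {"color": "yellow", "x": 300}
print(matches(tags, {"_or": [{"color": "yel*"}, {"color": "blue"}]}))       # True
print(matches(tags, {"_and": [{"x": {"lte": 305}}, {"x": {"gte": 290}}]}))  # True
print(matches(tags, {"color": {"not": ["brown", "yellow"]}}))               # False
```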

TIP

When running a query, search on a specific commit version by using the Commit drop-down.

# Saving queries

Once you have searched using a query, save it by clicking Save Query. For each saved query, click the Queries tab in the dataset's left sidebar to browse its files and view its information.

You can also use saved queries when loading datasets in cnvrg jobs. Use the Datasets selector to choose the dataset and then the specific commit or query in the drop-down list.

# Collaborators

Making collaboration simpler is a key goal in cnvrg's mission. As such, every dataset in cnvrg can have different members. Datasets are private by default and only their collaborators can access them.

To view the collaborators in your dataset, click the Settings tab of the dataset and then click Collaborators.

TIP

More details on cnvrg collaboration functionality can be found here.

# Add a collaborator

Complete the following steps to add a collaborator to a dataset:

  1. Go to Dataset > Settings > Collaborators.
  2. Click Add Collaborator.
  3. In the displayed panel, enter the username or email of the user you want to add.
  4. Click Submit.

cnvrg adds the user as a collaborator to the dataset.

# Remove a collaborator

Complete the following steps to remove a collaborator from a dataset:

  1. Go to Dataset > Settings > Collaborators.
  2. Click the Remove button next to the user you want to remove.

cnvrg removes the user from the dataset.

NOTE

Administrators in your organization have access to all datasets, without being added as a collaborator.

Last Updated: 11/30/2022, 4:33:15 PM