# Datasets

cnvrg datasets let you upload and version any kind of file automatically and easily.

cnvrg datasets use an object store as the backend, so you can host any kind of file with no limit on size or quantity. Additionally, cnvrg datasets let you version, label, and tag your data.

Datasets are managed at the organization level, not per project. Once you've uploaded a dataset to cnvrg, you can reuse it in every project, experiment, and notebook.


# Datasets Page

You can access all your connected datasets from the Datasets tab of your organization.

cnvrg automatically manages your dataset with an internal version-control system, so you can track your dataset at every stage. Every action is recorded as a new commit, so you can browse and select specific versions. Versioning gives you the confidence to use your dataset as you need -- cnvrg always keeps it safe and controlled. No more lost files or features!

You can access different versions of your data from the version drop-down menu on the dataset's page. Select the version you want and the page refreshes to display it.

Under the Actions column for any specific dataset, click:

  • Browse to perform queries on the dataset.
  • Revert to revert your dataset to that specific commit version.

# Connecting to Your Dataset

There are two ways to upload data to datasets: through the web UI or with the CLI.

For small datasets, the web interface suffices; for large datasets, the CLI is recommended.

# Uploading datasets using the web UI

To create a new dataset:

  1. Navigate to the Datasets tab.
  2. Click + New Dataset.
  3. In the panel that appears, enter a name and, optionally, select a type.
  4. Click Save Dataset.

cnvrg creates an empty dataset ready for all your files to be added.

After creating your dataset through the UI, you can upload your data using the simple drag-and-drop interface. Every upload session counts as a new commit.

WARNING

Keep in mind that each file uploaded through the web UI is limited to 20MB. To upload larger files, use the CLI.
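As a minimal sketch of how you might check your files against this limit before choosing an upload method (`needs_cli` is a hypothetical helper, not part of cnvrg, and the limit is assumed to be 20 × 1024 × 1024 bytes):

```python
LIMIT_BYTES = 20 * 1024 * 1024  # assumed 20 MB per-file limit for web-UI uploads

def needs_cli(size_bytes):
    # True when a file of this size exceeds the web-UI limit
    # and should be uploaded with the CLI instead.
    return size_bytes > LIMIT_BYTES

print(needs_cli(25 * 1024 * 1024))  # True: use the CLI
print(needs_cli(5 * 1024 * 1024))   # False: the web UI is fine
```

You could feed this with `os.path.getsize(path)` for each file you plan to upload.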

NOTE

You can also use the CLI to upload files to datasets that you created in the UI.

# Uploading and removing files using the CLI

If you have a large dataset, you might consider using the CLI. The cnvrg CLI supports files of any size.

TIP

Details about installing and using the cnvrg CLI can be found in the CLI documentation.

Upload data

You can easily upload files to a remote dataset without initializing a local version using cnvrg data put.

cnvrg data put [--commit=SHA1/latest] [--message='MESSAGE'] dataset_url file_path/file

The data put command will upload the matching files to the remote dataset.

If you use the command with a specific file (for example, cnvrg data put https://app.cnvrg.io/my_org/datasets/mnist image.png), it will upload the file to the root directory of the dataset.

If you include the file path (for example, cnvrg data put https://app.cnvrg.io/my_org/datasets/mnist training/image.png), it will upload the file to that same path in the remote dataset.

If you use a wildcard pattern (for example, cnvrg data put https://app.cnvrg.io/my_org/datasets/mnist *.png), all files that match the pattern will be uploaded to the remote dataset.

Learn more about cnvrg data put in the CLI documentation.

Remove data

You can remove files from a remote dataset without initializing a local version using cnvrg data rm.

cnvrg data rm [--message='MESSAGE'] dataset_url file_path/file

The data rm command will remove the matching files from the remote dataset.

If you use the command with a specific file (for example, cnvrg data rm https://app.cnvrg.io/my_org/datasets/mnist image.png), it will remove the file from the dataset.

If you use the command with a folder (for example, cnvrg data rm https://app.cnvrg.io/my_org/datasets/mnist training/), the folder and all its contents will be removed from the remote dataset.

Learn more about cnvrg data rm in the CLI documentation.

# Version Control

cnvrg automatically records every dataset action as a new commit, so you can browse, compare, and revert to specific versions at any time. Switch versions from the version drop-down menu on the dataset's page, and use the Browse and Revert actions to inspect or restore a specific commit.

# Commits

The Commits tab of your dataset provides an overview of the version history for the dataset.

The commit table has a row for each commit of your dataset. For each commit you can see:

  • Commit Message: Clicking on the message takes you to the commit summary page.
  • Commit Sha1
  • Commit Size
  • User: The user who made the commit.
  • Created at: The date it was committed.
  • (If connected to NFS) Cache (the toggle to cache/clear the commit)
  • (If connected to NFS) Status: If and when the commit was cached.
  • Commit actions menu:
    • Browse: Clicking this link takes you to the file viewer for the commit.
    • Revert: Makes the commit the default commit. (Requires confirmation).

# Commit summary page

To access this page, go to your chosen dataset > Commits > Click the commit message corresponding to the desired commit.

This page provides a summary of the selected commit's changes, including:

  • The commit message and Sha1.
  • When the commit was made.
  • The link to Browse the file viewer for the dataset at this commit.
  • The sha1 of the parent commit.
  • A list of the files changed by the commit.

# Caching Dataset Commits (NFS Integration)

cnvrg can be integrated with an NFS data storage unit. When an NFS disk is connected to your organization you can cache commits of your dataset on that NFS. When a commit is cached, you can attach it to jobs so they will have immediate access to the data, and the job will not need to clone the dataset on start-up.

The caching process is simple, and you can clear unused commits manually or automatically.

You can access the controls for caching and clearing commits in several places, including the Commits tab of the dataset and the cached commit's information page.

# Connect to NFS storage

To integrate NFS into your environment, contact cnvrg support. Our team will help you get the integration set up.

# Cache a commit to NFS

To cache a commit to NFS:

  1. Through any of the methods for accessing the Cache button for the chosen commit, you will reach the Cache Commit panel that summarizes the commit you are about to cache:
    • The commit sha1.
    • The commit size.
    • The status of the commit.
    • For the selected NFS unit, the already used and remaining storage is displayed.
  2. In the panel, choose the NFS unit from the drop-down menu on which you will be caching the commit.
  3. Click Cache to start caching the dataset commit.

The commit will begin to cache. You will be taken to the information page for the cached commit, where you can track the caching process live.

You will receive an email notification when the caching process finishes.

# Interrupt caching and clear the commit

If a commit you chose to cache has not finished caching, you can interrupt the process and clear the partially cached commit.

You can stop and clear a caching commit from the cached commit's information screen.

  1. Navigate to the information page of the caching commit you wish to interrupt.
  2. If the status is Caching, there will be a red button titled Stop & Clear. Click this button.
  3. An information panel appears with some information about the commit.
  4. Press Stop & Clear on the panel to interrupt the caching and clear the commit from the NFS disk.

The caching is interrupted and the commit cleared. You can follow the process live from the cached commit's information screen.

# Clear a cached commit

WARNING

You cannot clear a commit while the cached commit is currently in use by any active jobs (workspaces, experiments and so on). To enable clearing of the commit, either wait for the jobs to finish or stop them.

To clear a cached commit and free the space it is currently using on the NFS unit:

  1. Through any of the methods for accessing the Clear button for the chosen commit, you will reach the Clear Commit panel that summarizes the commit you are about to clear:
    • The commit sha1.
    • The storage to be cleared on the NFS.
    • Storage available on NFS after the clear.
  2. Click Clear to start clearing the dataset commit.

The commit will begin to clear. You will be taken to the information page for the cached commit, where you can track the clearing process live.

# Cached commit information page

Each cached commit has its own information page. On the cached commit's information page, you can find:

  • The commit sha1.
  • Who the cached commit was last used by and when.
  • The current cache status.
  • The size of the commit.
  • The most recent caching activity for the commit.
  • The name of the NFS disk and its remaining capacity.
  • When the next automatic clear will occur.

There are two ways you can access this page:

  • On the files viewer for the commit:
    • Navigate to the files viewer for the cached commit of your choice.
    • Click on the cached activity status along the top.
  • From the Commits tab of the dataset.
    • Navigate to the dataset of your choice.
    • Click on the Commits tab.
    • Click on the cached activity status for the commit of your choice.

# Automatically clear cached commits

cnvrg has the capability to automatically clear unused cached commits from your NFS disk to save disk space. This functionality is controlled in the organization Settings.

To turn this functionality on or off, use the Automatically Clear Cached Commits toggle in Settings. Set the number of days a cached commit must remain unused before cnvrg automatically clears it from the NFS disk.

When toggled on, cnvrg clears any cached commit that has not been used for the amount of time you set. For example, if you choose to clear cached commits unused for 10 days, any cached commit that has not been used by a job (workspace, experiment, and so on) for 10 days is cleared from its NFS disk and the space freed.

A day before an unused cached commit is due to be automatically cleared, you will receive an email notification.

# Local Folders (Network Storage Support)

If you have a storage disk attached to your network, you can easily mount the storage as a local folder for a job. There is no added setup required, but you must ensure the cluster or machine on which the job is running has access to the storage disk.

The option to use a network drive as a local folder in a job can be found in the Advanced Settings for experiments and workspaces.

Pass the ip address and folder of any external drive you would like to attach as Local Folders to the job. Use the format <ip_of_network_drive>:/<name_of_folder>. This will be mounted to the machine as /nfs/<name_of_folder>.
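As a sketch of how the mapping works (the IP address, folder name, and the `mount_path` helper below are illustrative, not part of cnvrg):

```python
def mount_path(local_folder_spec):
    # A Local Folders entry of the form <ip_of_network_drive>:/<name_of_folder>
    # is mounted inside the job at /nfs/<name_of_folder>.
    ip, folder = local_folder_spec.split(":/", 1)
    return "/nfs/" + folder

print(mount_path("192.168.1.20:/training_data"))  # /nfs/training_data
```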

# Adding Datasets to Jobs

There are three situations in which you can add a dataset:

  • When starting a workspace or experiment using the web UI
  • In an online workspace
  • In a flow

You can also add datasets to jobs when running them using the cnvrg CLI and cnvrg SDK.

# When starting a workspace or experiment using the web UI

When you start a workspace or an experiment from the UI, you can choose one or more datasets to include, using the dataset selector.

  1. Click Start Workspace or New Experiment and fill in the relevant details in the pane that appears.
    To add your dataset(s), select the Datasets drop-down list.
  2. For each dataset you want to add to the workspace or experiment, choose the commit or query and the portion of the dataset to clone.
  3. Click the dataset to add it to the list of datasets for the workspace or experiment.
    You can remove a dataset from the list by clicking the X next to its name.
  4. Under Advanced Settings, you can pass the ip address and folder of any external drive you would like to attach as Local Folders to the job. Use the format <ip_of_network_drive>:/<name_of_folder>. This will be mounted to the machine as /nfs/<name_of_folder>.
  5. Click Start Workspace or Run.

The datasets are cloned to the remote compute.

# In an online workspace

You can add datasets on the fly to online workspaces.

  1. Click the Datasets tab on the right sidebar for your open workspace. There you will see the statuses of all datasets already connected to the workspace.
  2. To add a new dataset, click the Select Datasets To Attach drop-down menu.
  3. For each dataset you want to add to the workspace, choose the commit or query and the portion of the dataset to clone.
  4. Click the dataset to add it to the list of datasets for the workspace.
    You can remove a dataset from the list by clicking the X next to its name.
  5. Click Attach.

The selected datasets begin cloning to the workspace. You can track their statuses from the datasets panel where you attached them.

# In a flow

  1. Open the flow to which you wish to add the dataset.
  2. In the New Task drop-down list, select Data Task.
  3. In the panel that appears, select the Dataset, Dataset Commit and Dataset Query (only if queries exist for the dataset).
  4. Click Save Changes.
  5. Link the purple box that appears to the task you want the dataset to be available in.

You can also add datasets when constructing a flow with a YAML file.

NOTE

The dataset will also be available in any tasks that follow on from the task you connect it to.

# Using the CLI

To add a dataset when running an experiment using the cnvrg CLI use the --datasets flag:

cnvrg run python3 train.py --datasets='[{"id": "hotdogs", "commit": "latest", "query": "ketchup", "tree_only": true}]'

You can include multiple datasets in the array.

You can also include dataset information in a flow YAML file and use it with the CLI. See the full cnvrg CLI documentation for more information about running experiments using the CLI.

# Using the SDK

To add a dataset when running an experiment using the cnvrg SDK use the datasets parameter:

from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                    datasets=['hotdogs.ketchup'])

You can include multiple datasets in the array.

You can also include dataset information in a flow YAML file and use it with the SDK. See the full cnvrg SDK documentation for more information about running experiments using the SDK.

# Accessing the Dataset Files

There are several ways to download datasets or use them in your code. Below are examples of how to access datasets using the CLI, the cnvrg SDK, and other popular framework integrations.

# Within jobs

When you include a dataset in a workspace (whether added at start-up or on the fly using the sidebar) or an experiment, the dataset is mounted in the remote compute and accessible at the absolute path /data/name_of_dataset/.

For example, if you include your dataset hotdogs, it can be found at /data/hotdogs/ inside the workspace, and all of the dataset's files will be in that directory.
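A small helper can make these mount paths explicit in your code; `dataset_path` and the `hotdogs` name below are illustrative, not part of the cnvrg SDK:

```python
import os

def dataset_path(name, *subpaths):
    # Datasets attached to a job are mounted at /data/<dataset_name>/.
    return os.path.join("/data", name, *subpaths)

print(dataset_path("hotdogs"))               # /data/hotdogs
print(dataset_path("hotdogs", "image.png"))  # /data/hotdogs/image.png
```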

# Cloning using the CLI

You can clone a dataset with the following command:

cnvrg data clone dataset_url [--commit=commit] [--only_tree] [--query=query]

This command is used to clone the dataset from the app servers into your local environment.

See the full cnvrg CLI documentation for more information about datasets with the CLI.

TIP

The dataset will be accessible at /data/dataset_name/.

# Using the cnvrg SDK

# Download

To download a dataset using the cnvrg SDK, you can use the following code snippet in your Python code or Jupyter workspace:

from cnvrg.modules.dataset import Dataset
dataset = Dataset('owner/dataset')

See the full cnvrg SDK documentation for more information about data commands in the SDK.

TIP

The dataset will be accessible at /data/dataset_name/.

# Generator
# PyTorch

You can load a dataset directly as a PyTorch data object using the following code:

from cnvrg.modules.dataset import Dataset
dataset = Dataset('owner/dataset').pytorch_dataset()

# Dataset Metadata

In machine learning, most data entries have metadata associated with them. Metadata can represent labels, image size, data source and any other features of your files. cnvrg has built-in support for tagging objects.

Simply include a YAML file for each object you want to store metadata about.

# Create metadata

A metadata YAML file is very simple to create.

  • It must be named according to the following convention: NameOfFile.ext_tags.yml.

  • The file's contents should fit the following syntax:

    ---
    label1: value1
    label2: value2
    

For example, let's say you have an image 9.png representing the digit "9" in an MNIST dataset. To tag it with metadata, create a YAML file with the following content:

9.png_tags.yml

---
label: "9"
width: "32"
height: "32"
source: "yann lecun"

Then update your changes back to cnvrg using the CLI or web UI.
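The convention above can be automated with a short script. This is a sketch, not a cnvrg utility; the file name and tags are the example from this section:

```python
import os

def write_tags(directory, filename, tags):
    # Metadata files follow the convention NameOfFile.ext_tags.yml.
    tag_path = os.path.join(directory, filename + "_tags.yml")
    with open(tag_path, "w") as f:
        f.write("---\n")
        for key, value in tags.items():
            f.write(f'{key}: "{value}"\n')
    return tag_path

# Tag 9.png with its MNIST label and dimensions, as in the example above.
write_tags(".", "9.png", {"label": "9", "width": "32", "height": "32"})
```

After generating the tag files, upload them alongside your data (for example, with cnvrg data put).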

# Querying Data

Using cnvrg, you can create queries and subsets of your original dataset. This is especially useful when you want to work on a subset of your dataset (for example, only data labelled "train").

There are two ways to query your data: using the file path, or using tags.

  1. Navigate to the relevant Dataset page.
  2. At the top, enter your query and click Search.

The files that match the query are loaded.

TIP

You can also use a regular expression within your query.

# Querying datasets using the file path

You can query for any file, type, or folder in your dataset. Simply enter the file path pattern you are looking for.

Examples

  • *.zip
  • cats/*.png
  • *.tfrecord
  • image-0*.png
  • training/6/*.png

To search using wildcards:

{ "fullpath": "image*.png" }

# Querying datasets using tags

Once a dataset has been tagged using YAML files, you can use the key:value syntax within the query box to search for those labels.

{ "label": "hotdog" }

The query above searches for files that have metadata with the label hotdog.

# Querying datasets using wildcards

You can also use wildcards when querying labels:

{ "label": "hot*" }

The query above searches for files whose label metadata matches hot*.

# AND/OR operators

To run queries with and/or logic, you can use the _or and _and operators. For example, to query objects that are yellow or blue, use the following query:

{ _or: [{"color": "yellow"}, {"color": "blue"}] }

using wildcards:

{ _or: [{"color": "yel*"}, {"color": "blue"}] }

For AND operators, use a similar query:

{ _and: [{"color": "yellow"}, {"background": "blue"}] }

using wildcards:

{ _and: [{"color": "yellow"}, {"background": "*"}] }

# Mathematical operations

To run queries with >, <, <=, and >= operators, use the following:

{ result: { gt: 10 } }

The following table provides the keywords you require to query the operators:

| Operator | Keyword |
| --- | --- |
| Greater than (>) | gt |
| Greater than or equal to (>=) | gte |
| Less than (<) | lt |
| Less than or equal to (<=) | lte |

For example, to run a query for objects in the range 290 <= x <= 305, use the following:

{ _and: [{"x": { lte: 305 }}, {"x": { gte: 290 } } ]}
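If you build such queries programmatically, the same structure can be produced with a small helper; `range_query` is an illustrative sketch (note that in strict JSON the operator keys are quoted, unlike the shorthand shown above):

```python
import json

def range_query(field, low, high):
    # Builds { _and: [{field: {lte: high}}, {field: {gte: low}}] }
    # with quoted keys, as required in strict JSON.
    return {"_and": [{field: {"lte": high}}, {field: {"gte": low}}]}

print(json.dumps(range_query("x", 290, 305)))
```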

# In/not queries

To run queries with in operators, use the following:

{ "color": ["brown", "yellow"] }

To run queries with not operators:

{ "color": { not: "brown" } }

To run queries with not and in operators:

{ "color": { not: ["brown", "yellow"] } }

To run queries with not in fullpath:

{ _and: [{"fullpath":"15*.jpg"}, {"fullpath": {not: "1519.jpg"}}]}

TIP

When running a query, you can search on a specific commit version by using the Commit drop-down.

# Saving queries

Once you have searched using a query, you can then save it by clicking Save Query. For each saved query, you can browse its files and see its information from within the Queries tab in the dataset's left sidebar.

You can also use saved queries when loading datasets in cnvrg jobs. In the datasets selector, choose the dataset and then you can choose the specific commit or query in the drop-down list.

# Collaborators


Making collaboration simpler is a key goal in cnvrg's mission. As such, every dataset in cnvrg can have different members. Datasets are private by default and only their collaborators can access them.

To view the collaborators in your dataset, go to the Settings tab of the dataset and click Collaborators.

TIP

More details on how collaboration works in cnvrg can be found here.

# Add a collaborator


To add a collaborator to the dataset:

  1. Go to Dataset > Settings > Collaborators.
  2. Click Add Collaborator.
  3. In the panel that appears, type the username or email of the person you wish to add.
  4. Click Submit.

cnvrg adds the user as a collaborator on the dataset.

# Remove a collaborator

To remove a collaborator from the dataset:

  1. Go to Dataset > Settings > Collaborators.
  2. Click the Remove button next to the user you wish to remove.

cnvrg removes the user from the dataset.

NOTE

Administrators in your organization have access to all datasets, without being added as a collaborator.

Last Updated: 7/6/2020, 9:50:58 AM