The `dataregistry` CLI

The dataregistry Command Line Interface (CLI) is a convenient way to perform some basic tasks with the data registry directly from the command line.

First, make sure the dataregistry package is installed (see here for details), after which the dregs command will be available. To check, type

dregs --help

to display the list of available commands for dregs.

The full reference documentation for the dataregistry CLI command options can be found here, below we cover some example use cases.

Registering a dataset

Registering a dataset using the CLI is essentially identical to using the dataregistry package from within Python (it is simply calling the Registrar.register_dataset() function with arguments from the command line). Therefore reviewing the first tutorial notebook is the best place to get familiar with the registration process and for descriptions of the associated dataset metadata.

Typing

dregs register dataset --help

will list all the metadata properties that can be associated with a dataset during registration. As when registering datasets using the dataregistry package, the dataset name and version properties are mandatory, which will always be the first two parameters passed to the dregs register dataset command respectively.

For example, say I have produced some data from my latest DESC publication that I want to archive/distribute via the data registry. My data are located in the directory /some/place/at/nersc/my_paper_dataset/, and I want to tag it as a project dataset owned by the DESC Generic Working Group. To do this I would run the CLI as follows:

dregs register dataset \
   my_paper_dataset \
   1.0.0 \
   --old-location /some/place/at/nersc/my_paper_dataset/ \
   --owner_type project \
   --owner "DESC Generic Working Group" \
   --description "Data from my_paper_dataset"

This will recursively copy the /some/place/at/nersc/my_paper_dataset/ directory into the data registry shared space with the name='my_paper_dataset' (other non-specified properties will keep their default values).

Updating a dataset

Now say a few months later a bug has been discovered in the my_paper_dataset data and the entry needs to be updated. As we did not deliberately specify that my_paper_dataset could be overwritten during the initial registration, we need to create a new version of the dataset.

dregs register dataset \
   my_paper_dataset \
   patch \
   --old-location /some/place/at/nersc/my_paper_dataset_updated/ \
   --owner_type project \
   --owner "DESC Generic Working Group" \
   --description "Data from my_paper_dataset describing bugfix" \

Here we associate it with the previous dataset through name= my_paper_dataset (and making sure we keep the same owner and owner_type), and tell the data registry to automatically bump the patch version to 1.0.1 by specifying “patch” as the version string (you could however have entered “1.0.1” here if you prefer).

Querying the data registry

We can also do some simple querying via the CLI to see what datasets we, or others, have in the data registry.

We can do this using the dregs ls command (type dregs ls --help for more info).

By default, typing

dregs ls

will list all the datasets registered by “you” (i.e., where owner == $USER). To be more precise, you can specify the owner and/or owner_type you want to list the datasets for.

For example, to see all the datasets from the DESC Generic Working Group we would do

dregs ls --owner "DESC Generic Working Group"

To list entries from all owners do --owner none.

You can search against the dataset.name column, with wildcard support, where * is the wildcard character, e.g.,

dregs ls --name dataset:dc2:*

will search for all datasets whose name starts with the pattern “dataset:dc2:” (note the --name queries here are case insensitive).

To select what columns are printed in the result use the --return_cols option, e.g.,

dregs ls --return_cols dataset_id name description status

Using dregs ls is a quick an easy way to remind yourself what names you gave to previous datasets, and what relative paths they reside at.

The dataregistry CLI

Registering a dataset

Updating a dataset

Querying the data registry

The `dataregistry` CLI