The dataregistry package

Reference documentation for the core objects within the dataregistry package. Demonstrations of their usage can be found in the tutorials section.

The DataRegistry class

The DataRegistry class is the primary front end to the dataregistry package. It should be the only object users need to import into their code.

It connects the user to the database, and serves as a wrapper to both the Registrar and Query classes.

class dataregistry.DataRegistry(owner=None, owner_type=None, config_file=None, schema=None, root_dir=None, verbose=False, site=None)

Primary data registry wrapper class.

The DataRegistry class links to both the Registrar class, to register/modify/delete datasets, and the Query class, to query existing datasets.

The database connection is set up automatically using:
  • the user's config file (if None, defaults are used)

  • the passed schema (if None, the default schema is used)

The root_dir is the location the data is copied to. It can be passed manually, or alternatively a predefined site can be chosen. If neither is chosen, the NERSC site is selected as the default.

Parameters:
  • owner (str) – Default owner for all datasets registered in this instance.

  • owner_type (str) – Default owner_type for all datasets registered in this instance.

  • config_file (str) – Path to the config file; if None, the default location is assumed.

  • schema (str) – Schema to connect to; if None, the default schema is assumed.

  • root_dir (str) – Root directory for datasets; if None, the default is assumed.

  • verbose (bool) – True for more output.

  • site (str) – Can be used instead of root_dir. Some predefined “sites” are built in, such as “nersc”, which will set the root_dir to the data registry’s default data location at NERSC.
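The choice between root_dir, site and the NERSC default can be pictured with the short sketch below. It is not the package's code: the SITES table, its placeholder path and the resolve_root_dir helper are hypothetical, and letting an explicit root_dir win when both arguments are given is an assumption that the description above does not settle.

```python
# Hypothetical table of predefined sites; the real paths are defined by the package.
SITES = {"nersc": "/placeholder/nersc/registry/root"}


def resolve_root_dir(root_dir=None, site=None):
    """Pick a root directory from an explicit path, a named site, or the default."""
    if root_dir is not None:
        # Assumption: an explicit root_dir takes priority over site.
        return root_dir
    if site is None:
        # Neither argument given: fall back to the NERSC site.
        site = "nersc"
    if site not in SITES:
        raise ValueError(f"unknown site: {site!r}")
    return SITES[site]


print(resolve_root_dir(root_dir="/my/registry"))
print(resolve_root_dir(site="nersc"))
print(resolve_root_dir())
```

In this sketch an unknown site name is rejected with a ValueError; how the package itself reacts to one is not stated here.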

Registrar.get_owner_types()

Returns the set of allowed owner_types that can be registered within the data registry.

Returns:

Set of owner_types

Return type:

set
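One possible use of this set is to validate an owner_type before attempting a registration. The sketch below assumes the Registrar is reachable as an attribute named Registrar on a DataRegistry instance, which matches the heading above but is not spelled out in this reference. The check_owner_type helper, the stand-in object and its owner type values are all hypothetical, included only so the example runs without a database.

```python
from types import SimpleNamespace


def check_owner_type(datareg, owner_type):
    """Raise early if owner_type is not one the registry accepts."""
    allowed = datareg.Registrar.get_owner_types()
    if owner_type not in allowed:
        raise ValueError(f"owner_type {owner_type!r} not in {sorted(allowed)}")
    return owner_type


# Stand-in for a DataRegistry instance; the owner types here are made up.
fake_datareg = SimpleNamespace(
    Registrar=SimpleNamespace(get_owner_types=lambda: {"type_a", "type_b"})
)

print(check_owner_type(fake_datareg, "type_a"))
```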

Query.find_datasets(property_names=None, filters=[], verbose=False, return_format='property_dict')

Get specified properties for datasets satisfying all filters

If property_names is None, return all properties from the dataset table (only). Otherwise, return the property_names columns for each discovered dataset (which can be from multiple tables via a join).

Filters should be a list of dataregistry Filter objects, which are logical constraints on column values.

These choices get translated into an SQL query.

Parameters:
  • property_names (list, optional) – List of database columns to return (SELECT clause)

  • filters (list, optional) – List of filters (WHERE clauses) to apply

  • verbose (bool, optional) – True for more output relating to the query

  • return_format (str, optional) – The format the query result is returned in. Options are “CursorResult” (SQLAlchemy default format), “DataFrame”, or “property_dict”. Note this is not case sensitive.

Returns:

result – Requested property values

Return type:

CursorResult, dict, or DataFrame (depending on return_format)
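To illustrate how property names and filters map onto the SELECT and WHERE clauses, here is a self-contained sketch using sqlite3 from the standard library. It does not use dataregistry or SQLAlchemy: the Filter namedtuple, its field names, the toy dataset table and the build_query and find_datasets helpers are stand-ins invented for this example. Returning a dict that maps each column to a list of values is one reading of a property-dict result, not a documented guarantee.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for a Filter: column name, comparison operator, value.
Filter = namedtuple("Filter", ["property_name", "bin_op", "value"])

# Operators accepted by this sketch, mapped to their SQL spelling.
ALLOWED_OPS = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}
# Columns of the toy table used below.
COLUMNS = ("dataset_id", "name", "version", "owner_type")


def build_query(property_names=None, filters=()):
    """Translate column names and filters into SQL text plus bound parameters."""
    columns = list(property_names) if property_names else list(COLUMNS)
    for col in columns + [f.property_name for f in filters]:
        if col not in COLUMNS:
            raise ValueError(f"unknown column: {col}")
    sql = f"SELECT {', '.join(columns)} FROM dataset"
    params = []
    if filters:
        clauses = []
        for f in filters:
            clauses.append(f"{f.property_name} {ALLOWED_OPS[f.bin_op]} ?")
            params.append(f.value)
        # Every filter must hold, so the clauses are joined with AND.
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def find_datasets(conn, property_names=None, filters=()):
    """Run the query and return one list of values per requested column."""
    sql, params = build_query(property_names, filters)
    cursor = conn.execute(sql, params)
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT, "
    "version TEXT, owner_type TEXT)"
)
conn.executemany(
    "INSERT INTO dataset (name, version, owner_type) VALUES (?, ?, ?)",
    [("a", "1.0.0", "type_a"), ("b", "1.0.0", "type_b"), ("a", "1.1.0", "type_a")],
)

result = find_datasets(
    conn,
    property_names=["name", "version"],
    filters=[Filter("name", "==", "a"), Filter("owner_type", "==", "type_a")],
)
print(result)
```

Values are passed as bound parameters rather than formatted into the SQL string, and column names and operators are checked against fixed lists, so nothing user-supplied is interpolated into the query text unchecked.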

DatasetTable.register(relative_path, version, version_suffix=None, name=None, creation_date=None, description=None, execution_id=None, access_API=None, access_API_configuration=None, is_overwritable=False, old_location=None, copy=True, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, location_type='dataregistry', url=None, contact_email=None)

Create a new dataset entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

First, the dataset entry is created in the database. If that succeeds, the data is then copied (if old_location was provided). Only if both steps succeed will the entry have a “valid” status in the registry.

Parameters:
  • relative_path** (str)

  • version** (str)

  • version_suffix** (str, optional)

  • name** (str, optional)

  • creation_date** (datetime, optional)

  • description** (str, optional)

  • execution_id** (int, optional)

  • access_API** (str, optional)

  • is_overwritable** (bool, optional)

  • old_location (str, optional) –

    Absolute location of dataset to copy into the data registry.

    If None, dataset should already be at correct relative_path within the data registry.

  • copy (bool, optional) – True to copy data from old_location into the data registry (default behaviour). False to create a symlink.

  • verbose (bool, optional) – Provide some additional output information

  • owner** (str, optional)

  • owner_type** (str, optional)

  • execution_name** (str, optional)

  • execution_description** (str, optional)

  • execution_start** (datetime, optional)

  • execution_site** (str, optional)

  • execution_configuration** (str, optional)

  • input_datasets (list, optional) – List of dataset ids that were the input to this execution

  • input_production_datasets (list, optional) – List of production dataset ids that were the input to this execution

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

  • location_type** (str, optional) – If location_type=“external”, either url or contact_email must be supplied

  • url** (str, optional) – For location_type=“external” only

  • contact_email** (str, optional)

Returns:

  • prim_key (int) – The dataset ID of the new row relating to this entry (else None)

  • execution_id (int) – The execution ID associated with the dataset
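The two-step behaviour described above (create the entry, then copy the data, and only then treat the entry as valid) can be sketched without a database as follows. Everything here is illustrative: the list standing in for the dataset table, the "invalid" placeholder status, the register_dataset helper and the existence check for data that is already in place are assumptions, not the package's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def register_dataset(entries, root_dir, relative_path, old_location=None):
    """Toy two-step registration; `entries` is a list standing in for the table."""
    # Step 1: create the row, not yet marked as valid.
    entry = {"relative_path": relative_path, "status": "invalid"}
    entries.append(entry)
    dataset_id = len(entries) - 1

    destination = Path(root_dir) / relative_path
    try:
        if old_location is not None:
            # Step 2: copy the data into the registry tree.
            source = Path(old_location)
            destination.parent.mkdir(parents=True, exist_ok=True)
            if source.is_dir():
                shutil.copytree(source, destination)
            else:
                shutil.copy2(source, destination)
        elif not destination.exists():
            # Assumption: data said to be in place must actually be there.
            raise FileNotFoundError(destination)
    except OSError:
        # The row exists, but it never reaches the valid status.
        return dataset_id

    entry["status"] = "valid"
    return dataset_id


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "registry"
    source_file = Path(tmp) / "catalog.txt"
    source_file.write_text("example data")

    table = []
    good_id = register_dataset(table, root, "my/catalog.txt", old_location=source_file)
    bad_id = register_dataset(
        table, root, "my/other.txt", old_location=Path(tmp) / "missing.txt"
    )
    statuses = [row["status"] for row in table]
    copied_text = (root / "my" / "catalog.txt").read_text()

print(statuses)
```

Writing the row first and promoting it only after the copy means a failed transfer leaves a visible, non-valid record rather than data on disk with no entry describing it.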

ExecutionTable.register(name, description=None, execution_start=None, site=None, configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None)

Create a new execution entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • name** (str)

  • description** (str, optional)

  • execution_start** (datetime, optional)

  • site** (str, optional)

  • configuration** (str, optional)

  • input_datasets** (list, optional)

  • input_production_datasets** (list, optional)

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

Returns:

my_id – The execution ID of the new row relating to this entry

Return type:

int
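As a rough picture of what a line limit such as max_config_length implies, the sketch below reads at most a given number of lines from a file. The read_configuration helper is hypothetical, and silently truncating a longer file is an assumption; the package may instead reject files that exceed the limit.

```python
import os
import tempfile
from itertools import islice


def read_configuration(path, max_config_length=None):
    """Return the file contents, keeping at most max_config_length lines."""
    with open(path, "r", encoding="utf-8") as handle:
        # islice with None as the stop value reads every line.
        return "".join(islice(handle, max_config_length))


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "run.yaml")
    with open(config_path, "w", encoding="utf-8") as handle:
        handle.write("a: 1\nb: 2\nc: 3\n")

    full_text = read_configuration(config_path)
    first_two = read_configuration(config_path, max_config_length=2)

print(first_two)
```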

DatasetAliasTable.register(aliasname, dataset_id)

Create a new dataset_alias entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • aliasname** (str)

  • dataset_id** (int)

Returns:

prim_key – The dataset_alias ID of the new row relating to this entry

Return type:

int