The dataregistry package

Reference documentation for the core objects within the dataregistry package. Demonstrations of their usage can be found in the tutorials section.

The DataRegistry class

The DataRegistry class is the primary front end to the dataregistry package. This should be the only object users have to import to their code.

It connects the user to the database, and serves as a wrapper to both the Registrar and Query classes.

class dataregistry.DataRegistry(owner=None, owner_type=None, config_file=None, schema=None, root_dir=None, verbose=False, site=None)

Primary data registry wrapper class.

The DataRegistry class links to both the Registrar class, to register/modify/delete datasets, and the Query class, to query existing datasets.

Connection to the database is made automatically using:
  • the user's config file (if None, defaults are used)

  • the passed schema (if None, the default schema is used)

The root_dir is the location the data is copied to. This can be passed manually, or alternatively a predefined site can be chosen. If neither is given, the NERSC site is selected as the default.

Parameters:
  • owner (str) – To set the default owner for all registered datasets in this instance.

  • owner_type (str) – To set the default owner_type for all registered datasets in this instance.

  • config_file (str) – Path to the config file; if None, the default location is assumed.

  • schema (str) – Schema to connect to; if None, the default schema is assumed.

  • root_dir (str) – Root directory for datasets; if None, the default is assumed.

  • verbose (bool) – True for more output.

  • site (str) – Can be used instead of root_dir. Some predefined “sites” are built in, such as “nersc”, which will set the root_dir to the data registry’s default data location at NERSC.
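For example, a minimal connection might look like the following (a sketch; the owner, owner_type and site values are placeholders, and whether they apply to your setup depends on your config):

```python
from dataregistry import DataRegistry

# Connect using the default config file, schema and root_dir.
datareg = DataRegistry()

# Or set defaults for subsequent registrations and pick a predefined
# site instead of passing root_dir explicitly.
datareg = DataRegistry(owner="desc", owner_type="group", site="nersc")
```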

Registrar.get_owner_types()

Returns a list of allowed owner_types that can be registered within the data registry.

Returns:

owner_types – Set of allowed owner_types

Return type:

set
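For example, assuming a connected DataRegistry instance named datareg (the attribute path below follows the wrapper description above and is illustrative):

```python
# List the owner types the registry will accept, e.g. to validate
# input before registering. The exact contents depend on the schema.
allowed = datareg.Registrar.get_owner_types()
print(allowed)
```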

Query.find_datasets(property_names=None, filters=[], verbose=False, return_format='property_dict')

Get specified properties for datasets satisfying all filters.

If property_names is None, return all properties from the dataset table (only). Otherwise, return the property_names columns for each discovered dataset (which can be from multiple tables via a join).

Filters should be a list of dataregistry Filter objects, which are logic constraints on column values.

These choices get translated into an SQL query.

Parameters:
  • property_names (list, optional) – List of database columns to return (SELECT clause)

  • filters (list, optional) – List of filters (WHERE clauses) to apply

  • verbose (bool, optional) – True for more output relating to the query

  • return_format (str, optional) – The format the query result is returned in. Options are “CursorResult” (SQLAlchemy default format), “DataFrame”, or “property_dict”. Note this is not case-sensitive.

Returns:

result – Requested property values

Return type:

CursorResult, dict, or DataFrame (depending on return_format)
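As a sketch, assuming the Query object is reached through a DataRegistry instance and that Filter objects are built with a gen_filter helper (an assumption; if your version exposes filter construction differently, build the Filter objects accordingly):

```python
# Hypothetical query: find datasets owned by "desc" and return two
# columns as a property dict ({column_name: [values, ...]}).
f = datareg.Query.gen_filter("dataset.owner", "==", "desc")
results = datareg.Query.find_datasets(
    property_names=["dataset.name", "dataset.version_string"],
    filters=[f],
    return_format="property_dict",
)
```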

DatasetTable.register(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, relative_path=None, kwargs_dict=None)

Create a new dataset entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

First, the dataset entry is created in the database. If that succeeds, the data is then copied (if old_location was provided). Only if both steps succeed will the entry have a “valid” status in the registry.

Parameters:
  • name** (str)

  • version** (str)

  • creation_date** (datetime, optional)

  • description** (str, optional)

  • execution_id** (int, optional)

  • access_api** (str, optional)

  • is_overwritable** (bool, optional)

  • old_location (str, optional) –

    Absolute location of dataset to copy into the data registry.

    If None, dataset should already be at correct relative_path within the data registry.

  • verbose (bool, optional) – Provide some additional output information

  • owner** (str, optional)

  • owner_type** (str, optional)

  • execution_name** (str, optional)

  • execution_description** (str, optional)

  • execution_start** (datetime, optional)

  • execution_site** (str, optional)

  • execution_configuration** (str, optional)

  • input_datasets (list, optional) – List of dataset ids that were the input to this execution

  • input_production_datasets (list, optional) – List of production dataset ids that were the input to this execution

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

  • keywords (list[str], optional) – List of keywords to tag dataset with. Each keyword must be registered already in the keywords table.

  • location_type** (str, optional) – If location_type=”external”, either url or contact_email must be supplied

  • url** (str, optional) – For location_type=”external” only

  • contact_email** (str, optional)

  • test_production (bool, optional) – Default False. Set to True when testing code for the production owner_type

  • relative_path** (str, optional)

  • kwargs_dict (dict) – Stores all the keyword arguments passed to this function (and defaults). Automatically generated by the decorator, do not pass manually.

Returns:

  • prim_key (int) – The dataset ID of the new row relating to this entry (else None)

  • execution_id (int) – The execution ID associated with the dataset
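A minimal registration might look like this (a sketch; the attribute path datareg.Registrar.dataset is an assumption, and the name, version and path are placeholders):

```python
# Register a new dataset, copying the files from old_location into the
# registry's root_dir. Returns the new dataset ID and its execution ID.
dataset_id, execution_id = datareg.Registrar.dataset.register(
    "my_dataset",
    "1.0.0",
    description="Example entry",
    old_location="/path/to/my/data",
    is_overwritable=True,
)
```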

DatasetTable.replace(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, kwargs_dict=None)

Replace a dataset in the registry.

This allows a user to keep the same name/version/owner/owner_type combination as a previous dataset. Note the original dataset must have is_overwritable=True for the replace to work.

The process is as follows:
  • The original dataset is deleted, and the entry in the database tagged accordingly

  • A new entry is made with the same name/version combination as before, and the data goes into the same relative_path as before. All other properties are what the user specifies in the replace function

  • The old dataset entry is pointed to the new dataset, marking the new entry as the most up-to-date iteration

Returns:

  • prim_key (int) – The dataset ID of the new row relating to this entry (else None)

  • execution_id (int) – The execution ID associated with the dataset
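Following the registration sketch above, a replace call might look like (illustrative; assumes the original entry was registered with is_overwritable=True):

```python
# Replace an existing dataset, keeping the same name/version/owner/
# owner_type combination and the same relative_path as before.
new_id, execution_id = datareg.Registrar.dataset.replace(
    "my_dataset",
    "1.0.0",
    description="Updated data, same name and version",
    old_location="/path/to/my/new/data",
)
```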

DatasetTable.modify(entry_id, modify_fields)

Modify an entry in the DESC data registry.

Only certain columns are allowed to be modified after registration; these are defined in the schema yaml file.

Parameters:
  • entry_id (int) – The dataset/execution/etc ID we wish to modify in the database

  • modify_fields (dict) – Dict where key is the column to modify (must be allowed to modify) and value is the desired new value for the entry
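For example (a sketch; the column name is a placeholder, and which columns are allowed comes from the schema, see get_modifiable_columns()):

```python
# Update a modifiable column on an existing dataset entry.
datareg.Registrar.dataset.modify(
    dataset_id, {"description": "A better description"}
)
```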

DatasetTable.delete(dataset_id)

Delete a dataset entry from the DESC data registry.

This will also remove the raw data from the root dir, but the dataset entry remains in the registry (now with an updated status field).

Parameters:

dataset_id (int) – Dataset we want to delete from the registry

DatasetTable.add_keywords(dataset_id, keywords)

Add/append keywords to an already existing dataset.

First the keywords are checked for validity, then appended. If the dataset already has one or more of the passed keywords attributed to it, the keyword(s) will not be duplicated.

Parameters:
  • dataset_id (int)

  • keywords (list[str])
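For example (a sketch; the keyword strings below are placeholders and must already exist in the keywords table, see get_keywords()):

```python
# Tag an existing dataset with pre-registered keywords; duplicates
# already attached to the dataset are silently skipped.
datareg.Registrar.dataset.add_keywords(dataset_id, ["simulation", "validation"])
```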

DatasetTable.get_modifiable_columns()

Return a list of all columns in this table that are “modifiable”, as defined in the schema yaml file.

Returns:

mod_list – List of modifiable column names

Return type:

list[str]

DatasetTable.get_keywords()

Returns the list of system keywords that are allowed.

Returns:

keywords

Return type:

list[str]

ExecutionTable.register(name, description=None, execution_start=None, site=None, configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None)

Create a new execution entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • name** (str)

  • description** (str, optional)

  • execution_start** (datetime, optional)

  • site** (str, optional)

  • configuration** (str, optional)

  • input_datasets** (list, optional)

  • input_production_datasets** (list, optional)

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

Returns:

my_id – The execution ID of the new row relating to this entry

Return type:

int
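An execution can be registered up front and linked to its input datasets, for example (a sketch; the attribute path datareg.Registrar.execution is an assumption, and the name and IDs are placeholders):

```python
# Register an execution (e.g. a pipeline stage) and link the datasets
# that were its inputs. Returns the new execution ID.
execution_id = datareg.Registrar.execution.register(
    "my_pipeline_stage",
    description="Example execution entry",
    input_datasets=[dataset_id],
)
```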

DatasetAliasTable.register(aliasname, dataset_id, ref_alias_id=None, access_api=None, access_api_configuration=None, supersede=False)

Create a new dataset_alias entry in the DESC data registry. It may refer to a dataset (default) or to another alias.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • aliasname (str) – Alias name

  • dataset_id** (int) – Not None if alias refers to a dataset

  • ref_alias_id** (int) – Not None if alias refers to another alias

  • access_api** (str) – API, if any, which can read the dataset

  • access_api_configuration** (str) – Extra information for access_api

  • supersede (bool) – If True, create a new entry with this alias name even if old ones exist

Returns:

prim_key – The dataset_alias ID of the new row relating to this entry

Return type:

int
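For example (a sketch; the attribute path datareg.Registrar.dataset_alias is an assumption, and the alias name is a placeholder):

```python
# Create an alias pointing at an existing dataset. To point at another
# alias instead, pass ref_alias_id rather than dataset_id.
alias_id = datareg.Registrar.dataset_alias.register("latest_catalog", dataset_id)
```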