The dataregistry package

Reference documentation for the core objects within the dataregistry package. Demonstrations of their usage can be found in the tutorials section.

The DataRegistry class

The DataRegistry class is the primary front end to the dataregistry package. This should be the only object users have to import to their code.

It connects the user to the database, and serves as a wrapper to both the Registrar and Query classes.

class dataregistry.DataRegistry(owner=None, owner_type=None, config_file=None, schema=None, root_dir=None, verbose=False, site=None)

Primary data registry wrapper class.

The DataRegistry class links to both the Registrar class, to register/modify/delete datasets, and the Query class, to query existing datasets.

Connection to the database is made automatically using:
  • the user's config file (if None, defaults are used)

  • the passed schema (if None, the default schema is used)

The root_dir is the location the data is copied to. This can be passed manually, or alternatively a predefined site can be chosen. If neither is given, the NERSC site is selected as the default.

Parameters:
  • owner (str) – To set the default owner for all registered datasets in this instance.

  • owner_type (str) – To set the default owner_type for all registered datasets in this instance.

  • config_file (str) – Path to the config file; if None, the default location is assumed.

  • schema (str) – Schema to connect to; if None, the default schema is assumed.

  • root_dir (str) – Root directory for datasets; if None, the default is assumed.

  • verbose (bool) – True for more output.

  • site (str) – Can be used instead of root_dir. Some predefined “sites” are built in, such as “nersc”, which will set the root_dir to the data registry’s default data location at NERSC.
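For example, a minimal connection might look like the following (a sketch; the owner, owner_type and site values are placeholders, and whether they apply to your setup depends on your config):

```python
from dataregistry import DataRegistry

# Connect using the default config file, schema and root_dir.
datareg = DataRegistry()

# Or set defaults for subsequent registrations and pick a predefined
# site instead of passing root_dir explicitly.
datareg = DataRegistry(owner="desc", owner_type="group", site="nersc")
```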

Registrar.get_owner_types()

Returns a list of allowed owner_types that can be registered within the data registry.

Returns:

owner_types – Set of allowed owner_types

Return type:

set
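For example, assuming a connected DataRegistry instance named datareg (the attribute path below follows the wrapper description above and is illustrative):

```python
# List the owner types the registry will accept, e.g. to validate
# input before registering. The exact contents depend on the schema.
allowed = datareg.Registrar.get_owner_types()
print(allowed)
```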

Query.find_datasets(property_names=None, filters=[], verbose=False, return_format='property_dict')

Get specified properties for datasets satisfying all filters.

If property_names is None, return all properties from the dataset table (only). Otherwise, return the property_names columns for each discovered dataset (which can be from multiple tables via a join).

Filters should be a list of dataregistry Filter objects, which are logic constraints on column values.

These choices get translated into an SQL query.

Parameters:
  • property_names (list, optional) – List of database columns to return (SELECT clause)

  • filters (list, optional) – List of filters (WHERE clauses) to apply

  • verbose (bool, optional) – True for more output relating to the query

  • return_format (str, optional) – The format the query result is returned in. Options are “CursorResult” (SQLAlchemy default format), “DataFrame”, or “property_dict”. Note this is not case-sensitive.

Returns:

result – Requested property values

Return type:

CursorResult, dict, or DataFrame (depending on return_format)
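As a sketch, assuming the Query object is reached through a DataRegistry instance and that Filter objects are built with a gen_filter helper (an assumption; if your version exposes filter construction differently, build the Filter objects accordingly):

```python
# Hypothetical query: find datasets owned by "desc" and return two
# columns as a property dict ({column_name: [values, ...]}).
f = datareg.Query.gen_filter("dataset.owner", "==", "desc")
results = datareg.Query.find_datasets(
    property_names=["dataset.name", "dataset.version_string"],
    filters=[f],
    return_format="property_dict",
)
```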

DatasetTable.register(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, relative_path=None, kwargs_dict=None)

Create a new dataset entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

First, the dataset entry is created in the database. If that succeeds, the data is then copied (if old_location was provided). Only if both steps succeed will the entry have a “valid” status in the registry.

Parameters:
  • name** (str)

  • version** (str)

  • creation_date** (datetime, optional)

  • description** (str, optional)

  • execution_id** (int, optional)

  • access_api** (str, optional)

  • is_overwritable** (bool, optional)

  • old_location (str, optional) –

    Absolute location of dataset to copy into the data registry.

    If None, dataset should already be at correct relative_path within the data registry.

  • verbose (bool, optional) – Provide some additional output information

  • owner** (str, optional)

  • owner_type** (str, optional)

  • execution_name** (str, optional)

  • execution_description** (str, optional)

  • execution_start** (datetime, optional)

  • execution_site** (str, optional)

  • execution_configuration** (str, optional)

  • input_datasets (list, optional) – List of dataset ids that were the input to this execution

  • input_production_datasets (list, optional) – List of production dataset ids that were the input to this execution

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

  • keywords (list[str], optional) – List of keywords to tag dataset with. Each keyword must be registered already in the keywords table.

  • location_type** (str, optional) – If location_type=”external”, either url or contact_email must be supplied

  • url** (str, optional) – For location_type=”external” only

  • contact_email** (str, optional)

  • test_production (bool, optional) – Default False. Set to True when testing code for the production owner_type

  • relative_path** (str, optional)

  • kwargs_dict (dict) – Stores all the keyword arguments passed to this function (and defaults). Automatically generated by the decorator, do not pass manually.

Returns:

  • prim_key (int) – The dataset ID of the new row relating to this entry (else None)

  • execution_id (int) – The execution ID associated with the dataset
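A minimal registration might look like this (a sketch; the attribute path datareg.Registrar.dataset is an assumption, and the name, version and path are placeholders):

```python
# Register a new dataset, copying the files from old_location into the
# registry's root_dir. Returns the new dataset ID and its execution ID.
dataset_id, execution_id = datareg.Registrar.dataset.register(
    "my_dataset",
    "1.0.0",
    description="Example entry",
    old_location="/path/to/my/data",
    is_overwritable=True,
)
```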

DatasetTable.replace(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, kwargs_dict=None)

Replace a dataset in the registry.

This allows a user to keep the same name/version/owner/owner_type combination as a previous dataset. Note the original dataset must have is_overwritable=True for the replace to work.

The process is as follows:
  • The original dataset is deleted, and the entry in the database tagged accordingly

  • A new entry is made with the same name/version combination as before, and the data goes into the same relative_path as before. All other properties are what the user specifies in the replace function

  • The old dataset entry is pointed to the new dataset, marking the new entry as the most up-to-date iteration

Returns:

  • prim_key (int) – The dataset ID of the new row relating to this entry (else None)

  • execution_id (int) – The execution ID associated with the dataset
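Following the registration sketch above, a replace call might look like (illustrative; assumes the original entry was registered with is_overwritable=True):

```python
# Replace an existing dataset, keeping the same name/version/owner/
# owner_type combination and the same relative_path as before.
new_id, execution_id = datareg.Registrar.dataset.replace(
    "my_dataset",
    "1.0.0",
    description="Updated data, same name and version",
    old_location="/path/to/my/new/data",
)
```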

DatasetTable.modify(entry_id, modify_fields)

Modify an entry in the DESC data registry.

Only certain columns are allowed to be modified after registration; these are defined in the schema yaml file.

Parameters:
  • entry_id (int) – The dataset/execution/etc ID we wish to modify in the database

  • modify_fields (dict) – Dict where key is the column to modify (must be allowed to modify) and value is the desired new value for the entry
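For example (a sketch; the column name is a placeholder, and which columns are allowed comes from the schema, see get_modifiable_columns()):

```python
# Update a modifiable column on an existing dataset entry.
datareg.Registrar.dataset.modify(
    dataset_id, {"description": "A better description"}
)
```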

DatasetTable.delete(dataset_id)

Delete a dataset entry from the DESC data registry.

This will also remove the raw data from the root dir, but the dataset entry remains in the registry (now with an updated status field).

Parameters:

dataset_id (int) – Dataset we want to delete from the registry

DatasetTable.add_keywords(dataset_id, keywords)

Add/append keywords to an already existing dataset.

First the keywords are checked for validity, then appended. If the dataset already has one or more of the passed keywords attributed to it, the keyword(s) will not be duplicated.

Parameters:
  • dataset_id (int)

  • keywords (list[str])
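For example (a sketch; the keyword strings below are placeholders and must already exist in the keywords table, see get_keywords()):

```python
# Tag an existing dataset with pre-registered keywords; duplicates
# already attached to the dataset are silently skipped.
datareg.Registrar.dataset.add_keywords(dataset_id, ["simulation", "validation"])
```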

DatasetTable.get_modifiable_columns()

Return a list of all columns in this table that are “modifiable”, as defined in the schema yaml file.

Returns:

mod_list – List of modifiable column names

Return type:

list[str]

DatasetTable.get_keywords()

Returns the list of system keywords that are allowed.

Returns:

keywords

Return type:

list[str]

ExecutionTable.register(name, description=None, execution_start=None, site=None, configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None)

Create a new execution entry in the DESC data registry.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • name** (str)

  • description** (str, optional)

  • execution_start** (datetime, optional)

  • site** (str, optional)

  • configuration** (str, optional)

  • input_datasets** (list, optional)

  • input_production_datasets** (list, optional)

  • max_config_length (int, optional) – Maximum number of lines to read from a configuration file

Returns:

my_id – The execution ID of the new row relating to this entry

Return type:

int
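An execution can be registered up front and linked to its input datasets, for example (a sketch; the attribute path datareg.Registrar.execution is an assumption, and the name and IDs are placeholders):

```python
# Register an execution (e.g. a pipeline stage) and link the datasets
# that were its inputs. Returns the new execution ID.
execution_id = datareg.Registrar.execution.register(
    "my_pipeline_stage",
    description="Example execution entry",
    input_datasets=[dataset_id],
)
```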

DatasetAliasTable.register(aliasname, dataset_id, ref_alias_id=None, access_api=None, access_api_configuration=None, supersede=False)

Create a new dataset_alias entry in the DESC data registry. It may refer to a dataset (default) or to another alias.

Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.

Parameters:
  • aliasname (str) – Alias name

  • dataset_id** (int) – Not None if alias refers to a dataset

  • ref_alias_id** (int) – Not None if alias refers to another alias

  • access_api** (str) – API, if any, which can read the dataset

  • access_api_configuration** (str) – Extra information for access_api

  • supersede (bool) – If True, create a new entry with this alias name even if old ones exist

Returns:

prim_key – The dataset_alias ID of the new row relating to this entry

Return type:

int
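For example (a sketch; the attribute path datareg.Registrar.dataset_alias is an assumption, and the alias name is a placeholder):

```python
# Create an alias pointing at an existing dataset. To point at another
# alias instead, pass ref_alias_id rather than dataset_id.
alias_id = datareg.Registrar.dataset_alias.register("latest_catalog", dataset_id)
```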