The dataregistry package
Reference documentation for the core objects within the dataregistry
package. Demonstrations of their usage can be found in the tutorials section.
The DataRegistry class
The DataRegistry class is the primary front end to the dataregistry package.
This should be the only object users have to import into their code.
It connects the user to the database and serves as a wrapper for both the Registrar and Query classes.
- class dataregistry.DataRegistry(owner=None, owner_type=None, config_file=None, schema=None, root_dir=None, verbose=False, site=None)
Primary data registry wrapper class.
The DataRegistry class links to both the Registrar class, to register/modify/delete datasets, and the Query class, to query existing datasets.
- The link to the database is made automatically using:
the user's config file (if None, defaults are used)
the passed schema (if None, the default schema is used)
The root_dir is the location the data is copied to. It can be passed manually, or alternatively a predefined site can be chosen. If neither is chosen, the NERSC site is selected as the default.
- Parameters:
owner (str) – To set the default owner for all registered datasets in this instance.
owner_type (str) – To set the default owner_type for all registered datasets in this instance.
config_file (str) – Path to config file; if None, the default location is assumed.
schema (str) – Schema to connect to; if None, the default schema is assumed.
root_dir (str) – Root directory for datasets; if None, the default is assumed.
verbose (bool) – True for more output.
site (str) – Can be used instead of root_dir. Some predefined “sites” are built in, such as “nersc”, which will set the root_dir to the data registry’s default data location at NERSC.
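The interplay between root_dir and site described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name resolve_root_dir, the SITES table, and its placeholder path are hypothetical, and letting an explicit root_dir take precedence when both are passed is an assumption.

```python
# Hypothetical site table; the real locations are defined inside dataregistry.
SITES = {"nersc": "/placeholder/nersc/registry/root"}


def resolve_root_dir(root_dir=None, site=None):
    """Pick the root directory from an explicit path, a named site, or the default."""
    if root_dir is not None:
        # Assumption: an explicit root_dir wins over a site name.
        return root_dir
    if site is None:
        # Neither option was given, so fall back to the NERSC site.
        site = "nersc"
    if site not in SITES:
        raise ValueError(f"Unknown site: {site!r}")
    return SITES[site]


print(resolve_root_dir(root_dir="/tmp/my_registry"))
print(resolve_root_dir(site="nersc"))
print(resolve_root_dir())
```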
- Registrar.get_owner_types()
Returns the allowed owner_types that can be registered within the data registry.
- Returns:
Set of owner_types
- Return type:
set
- Query.find_datasets(property_names=None, filters=[], verbose=False, return_format='property_dict')
Get specified properties for datasets satisfying all filters.
If property_names is None, return all properties from the dataset table (only). Otherwise, return the property_names columns for each discovered dataset (which can be from multiple tables via a join).
Filters should be a list of dataregistry Filter objects, which are logic constraints on column values.
These choices get translated into an SQL query.
- Parameters:
property_names (list, optional) – List of database columns to return (SELECT clause)
filters (list, optional) – List of filters (WHERE clauses) to apply
verbose (bool, optional) – True for more output relating to the query
return_format (str, optional) – The format the query result is returned in. Options are “CursorResult” (SQLAlchemy default format), “DataFrame”, or “property_dict”. Note this is not case sensitive.
- Returns:
result – Requested property values
- Return type:
CursorResult, dict, or DataFrame (depending on return_format)
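To make the translation from property names and filters into an SQL query concrete, here is a minimal sketch using only the standard library. It is not the package's query builder: the Filter namedtuple, its field names, the operator table, and build_query are stand-ins chosen for illustration, and an in-memory SQLite table replaces the real registry database.

```python
import sqlite3
from collections import namedtuple

# Stand-in for a filter object: one logic constraint on one column.
Filter = namedtuple("Filter", ["property_name", "bin_op", "value"])

# Hypothetical mapping from filter operators to SQL operators.
SQL_OPS = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}


def build_query(table, property_names=None, filters=()):
    """Return (sql, params): a SELECT whose WHERE clauses are AND-ed together."""
    columns = ", ".join(property_names) if property_names else "*"
    sql = f"SELECT {columns} FROM {table}"
    clauses, params = [], []
    for f in filters:
        # Values are bound as parameters rather than pasted into the SQL text.
        clauses.append(f"{f.property_name} {SQL_OPS[f.bin_op]} ?")
        params.append(f.value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (dataset_id INTEGER, name TEXT, version TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [(1, "catalog_a", "1.0.0"), (2, "catalog_a", "1.1.0"), (3, "catalog_b", "1.0.0")],
)

sql, params = build_query(
    "dataset",
    property_names=["dataset_id", "version"],
    filters=[Filter("name", "==", "catalog_a"), Filter("version", "!=", "1.0.0")],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Every filter narrows the result further, which matches the "satisfying all filters" wording above.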
- DatasetTable.register(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, relative_path=None, kwargs_dict=None)
Create a new dataset entry in the DESC data registry.
Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.
First, the dataset entry is created in the database. If that succeeds, the data is then copied (if old_location was provided). Only if both steps are successful will there be a “valid” status entry in the registry.
- Parameters:
name** (str)
version** (str)
creation_date** (datetime, optional)
description** (str, optional)
execution_id** (int, optional)
access_api** (str, optional)
is_overwritable** (bool, optional)
old_location (str, optional) –
Absolute location of dataset to copy into the data registry.
If None, dataset should already be at correct relative_path within the data registry.
verbose (bool, optional) – Provide some additional output information
owner** (str, optional)
owner_type** (str, optional)
execution_name** (str, optional)
execution_description** (str, optional)
execution_start** (datetime, optional)
execution_site** (str, optional)
execution_configuration** (str, optional)
input_datasets (list, optional) – List of dataset ids that were the input to this execution
input_production_datasets (list, optional) – List of production dataset ids that were the input to this execution
max_config_length (int, optional) – Maximum number of lines to read from a configuration file
keywords (list[str], optional) – List of keywords to tag dataset with. Each keyword must be registered already in the keywords table.
location_type** (str, optional) – If location_type=“external”, either url or contact_email must be supplied
url** (str, optional) – For location_type=“external” only
contact_email** (str, optional)
test_production (bool, optional) – Default False. Set to True when testing code for the production owner_type.
relative_path** (str, optional)
kwargs_dict (dict) – Stores all the keyword arguments passed to this function (and defaults). Automatically generated by the decorator, do not pass manually.
- Returns:
prim_key (int) – The dataset ID of the new row relating to this entry (else None)
execution_id (int) – The execution ID associated with the dataset
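The two-step behaviour described for register (create the entry, then copy the data, and mark the entry valid only if both succeed) can be sketched as below. This is an illustration under stated assumptions, not the package's code: register_sketch, the in-memory entries dict, and the "invalid"/"valid" status strings are hypothetical, and a single file stands in for a dataset.

```python
import shutil
import tempfile
from pathlib import Path


def register_sketch(entries, root_dir, name, version, relative_path, old_location=None):
    """Create the entry first, copy the data second, and only then mark it valid."""
    # Step 1: the entry exists in the "database" but is not yet valid.
    dataset_id = len(entries) + 1
    entries[dataset_id] = {
        "name": name,
        "version": version,
        "relative_path": relative_path,
        "status": "invalid",
    }
    # Step 2: copy the data only when an old_location was provided.
    if old_location is not None:
        destination = Path(root_dir) / relative_path
        try:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(old_location, destination)
        except OSError:
            # The copy failed, so the entry keeps its non-valid status.
            return dataset_id
    entries[dataset_id]["status"] = "valid"
    return dataset_id


entries = {}
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("example data")
    root_dir = Path(tmp) / "root"

    ok_id = register_sketch(
        entries, root_dir, "catalog_a", "1.0.0",
        "user/catalog_a/1.0.0.txt", old_location=source,
    )
    copied = (root_dir / "user/catalog_a/1.0.0.txt").exists()

    bad_id = register_sketch(
        entries, root_dir, "catalog_b", "1.0.0",
        "user/catalog_b/1.0.0.txt", old_location=Path(tmp) / "missing.txt",
    )

print(entries[ok_id]["status"], entries[bad_id]["status"])
```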
- DatasetTable.replace(name, version, creation_date=None, description=None, execution_id=None, access_api=None, access_api_configuration=None, is_overwritable=False, old_location=None, verbose=False, owner=None, owner_type=None, execution_name=None, execution_description=None, execution_start=None, execution_site=None, execution_configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None, keywords=[], location_type='dataregistry', url=None, contact_email=None, test_production=False, kwargs_dict=None)
Replace a dataset in the registry.
This lets a user keep the same name/version/owner/owner_type combination as a previous dataset. Note the original dataset must have is_overwritable=True for the replace to work.
- The process is as follows:
The original dataset is deleted, and its entry in the database is tagged accordingly.
A new entry is made with the same name/version combination as before, and the data goes into the same relative_path as before. All other properties are what the user specifies in the replace function.
The old dataset is pointed to the new dataset, marking the new one as the most up-to-date iteration.
- Returns:
prim_key (int) – The dataset ID of the new row relating to this entry (else None)
execution_id (int) – The execution ID associated with the dataset
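The replace process can be modelled on a plain dictionary of entries, as in the rough sketch below. It is not the package's implementation: replace_sketch, the status strings, and the replaced_by field are hypothetical names, and owner/owner_type matching is left out for brevity.

```python
def replace_sketch(entries, name, version, **new_properties):
    """Apply the replace steps to an in-memory dict of dataset entries."""
    matches = [
        entry_id
        for entry_id, entry in entries.items()
        if entry["name"] == name
        and entry["version"] == version
        and entry["status"] == "valid"
    ]
    if not matches:
        raise LookupError(f"No valid dataset {name} {version} to replace")
    old = entries[matches[0]]
    if not old["is_overwritable"]:
        raise ValueError("Original dataset must have is_overwritable=True")

    # Step 1: the original dataset is deleted and its entry tagged accordingly.
    old["status"] = "deleted"

    # Step 2: the new entry reuses name, version and relative_path;
    # everything else comes from what the caller passed in.
    new_id = max(entries) + 1
    entries[new_id] = {
        "name": name,
        "version": version,
        "relative_path": old["relative_path"],
        "status": "valid",
        "is_overwritable": False,
        "replaced_by": None,
        **new_properties,
    }

    # Step 3: the old entry points at its replacement.
    old["replaced_by"] = new_id
    return new_id


entries = {
    1: {
        "name": "catalog_a",
        "version": "1.0.0",
        "relative_path": "user/catalog_a/1.0.0",
        "status": "valid",
        "is_overwritable": True,
        "replaced_by": None,
        "description": "first attempt",
    }
}
new_id = replace_sketch(entries, "catalog_a", "1.0.0", description="second attempt")
print(entries[1]["status"], entries[1]["replaced_by"], entries[new_id]["description"])
```

In this sketch the replacement defaults to is_overwritable=False, mirroring the default in the signature, so a second replace fails unless the caller opts in again.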
- DatasetTable.modify(entry_id, modify_fields)
Modify an entry in the DESC data registry.
Only certain columns may be modified after registration; these are defined in the schema yaml file.
- Parameters:
entry_id (int) – The dataset/execution/etc ID we wish to modify in the database
modify_fields (dict) – Dict where key is the column to modify (must be allowed to modify) and value is the desired new value for the entry
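The restriction to modifiable columns amounts to validating the keys of modify_fields before anything is written. A minimal sketch, assuming a hypothetical MODIFIABLE_COLUMNS set (the real list comes from the schema yaml file) and a plain dict in place of a database row:

```python
# Hypothetical set of modifiable columns; the real one is defined by the schema.
MODIFIABLE_COLUMNS = {"description", "access_api"}


def modify_sketch(entry, modify_fields, modifiable=MODIFIABLE_COLUMNS):
    """Update an entry, refusing the whole request if any column is locked."""
    blocked = sorted(set(modify_fields) - set(modifiable))
    if blocked:
        raise ValueError(f"Columns cannot be modified: {blocked}")
    # All keys passed the check, so apply the new values in one go.
    entry.update(modify_fields)
    return entry


entry = {"name": "catalog_a", "version": "1.0.0", "description": "old text"}
modify_sketch(entry, {"description": "new text"})
print(entry["description"])
```

Checking every key before updating means a partly invalid request leaves the entry untouched.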
- DatasetTable.delete(dataset_id)
Delete a dataset entry from the DESC data registry.
This will also remove the raw data from the root dir, but the dataset entry remains in the registry (now with an updated status field).
- Parameters:
dataset_id (int) – Dataset we want to delete from the registry
- DatasetTable.add_keywords(dataset_id, keywords)
Add/append keywords to an already existing dataset.
First check the keywords are valid, then append. If the dataset already has one or more of the passed keywords attributed to it, the keyword(s) will not be duplicated.
- Parameters:
dataset_id (int)
keywords (list[str])
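The validate-then-append behaviour, including the rule that existing keywords are not duplicated, can be sketched in a few lines. The function add_keywords_sketch and the allowed set are hypothetical stand-ins for the registry's keywords table:

```python
def add_keywords_sketch(current, new, allowed):
    """Return current keywords plus any new ones, without duplicates."""
    # First check that every requested keyword is registered.
    unknown = [keyword for keyword in new if keyword not in allowed]
    if unknown:
        raise ValueError(f"Keywords not registered: {unknown}")
    # Then append, skipping anything the dataset already carries.
    merged = list(current)
    for keyword in new:
        if keyword not in merged:
            merged.append(keyword)
    return merged


allowed = {"simulation", "observation", "catalog"}
keywords = add_keywords_sketch(["simulation"], ["simulation", "catalog"], allowed)
print(keywords)
```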
- DatasetTable.get_modifiable_columns()
Return a list of all columns in this table that are “modifiable”, as defined in the schema yaml file.
- Returns:
mod_list
- Return type:
list[str]
- DatasetTable.get_keywords()
Returns the list of system keywords that are allowed.
- Returns:
keywords
- Return type:
list[str]
- ExecutionTable.register(name, description=None, execution_start=None, site=None, configuration=None, input_datasets=[], input_production_datasets=[], max_config_length=None)
Create a new execution entry in the DESC data registry.
Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.
- Parameters:
name** (str)
description** (str, optional)
execution_start** (datetime, optional)
site** (str, optional)
configuration** (str, optional)
input_datasets** (list, optional)
input_production_datasets** (list, optional)
max_config_length (int, optional) – Maximum number of lines to read from a configuration file
- Returns:
my_id – The execution ID of the new row relating to this entry
- Return type:
int
- DatasetAliasTable.register(aliasname, dataset_id, ref_alias_id=None, access_api=None, access_api_configuration=None, supersede=False)
Create a new dataset_alias entry in the DESC data registry. It may refer to a dataset (default) or to another alias.
Any args marked with ‘**’ share their name with the associated column in the registry schema. Descriptions of what these columns are can be found in schema.yaml or the documentation.
- Parameters:
aliasname (str) – Alias name
dataset_id** (int) – Not None if the alias refers to a dataset
ref_alias_id** (int) – Not None if the alias refers to another alias
access_api** (str) – API, if any, which can read the dataset
access_api_configuration** (str) – Extra information for access_api
supersede (bool) – If True, create a new entry with this alias name even if old ones exist
- Returns:
prim_key – The dataset_alias ID of the new row relating to this entry
- Return type:
int
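Because an alias may refer either to a dataset or to another alias, finding the underlying dataset means following ref_alias_id links until an entry with a dataset_id is reached. The sketch below illustrates that idea on a plain dict; resolve_alias is a hypothetical helper, not a function of the package, and the cycle check is an added safeguard rather than documented behaviour.

```python
def resolve_alias(aliases, alias_id):
    """Follow alias-to-alias references until a dataset_id is found."""
    seen = set()
    while True:
        if alias_id in seen:
            raise ValueError("Alias chain loops back on itself")
        seen.add(alias_id)
        entry = aliases[alias_id]
        if entry["dataset_id"] is not None:
            # This alias points directly at a dataset.
            return entry["dataset_id"]
        if entry["ref_alias_id"] is None:
            raise ValueError("Alias refers to neither a dataset nor an alias")
        # Otherwise hop to the alias it refers to.
        alias_id = entry["ref_alias_id"]


aliases = {
    1: {"alias": "latest_catalog", "dataset_id": 42, "ref_alias_id": None},
    2: {"alias": "default_catalog", "dataset_id": None, "ref_alias_id": 1},
}
print(resolve_alias(aliases, 2))
```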