Database schema

A description of the data registry schema layout. Note that each schema in the database (e.g., the default and production schemas) follows the same structure.

The execution table

row	description	type
execution_id	Unique identifier for execution	Integer
description	Short description of execution	String
register_date	When was the execution registered in the database	DateTime
execution_start	When was the execution performed at site	DateTime
name	Identifies the code executed (e.g., could be pipeline name)	String
site	Site where the code was run (e.g., NERSC)	String
configuration	Path to execution configuration file (txt, YAML, TOML, etc). Ingested as raw text	String
creator_uid	UID of person who registered the entry	StringShort

The provenance table

row	description	type
provenance_id	Unique identifier for this provenance entry	Integer
code_version_major	Major version of code when this schema was created	Integer
code_version_minor	Minor version of code when this schema was created	Integer
code_version_patch	Patch version of code when this schema was created	Integer
creator_uid	UID of person who registered the entry	StringShort
db_version_major	Major version of schema	Integer
db_version_minor	Minor version of schema	Integer
db_version_patch	Patch version of schema	Integer
git_hash	Git hash at time of schema creation	String
repo_is_clean	Was git repo clean at schema creation?	Boolean
update_method	What type of schema update does this entry relate to (‘CREATE’,’MODIFY’,’MIGRATE’)	String
schema_enabled_date	Date schema was created/updated	DateTime
comment	Comment relating to new provenance entry	StringLong
associated_production	Name of production schema associated with this schema	String

The execution_alias table

row	description	type
execution_alias_id	Unique identifier for execution alias	Integer
supersede_date	If a new entry has been added to the table with the same alias name (but different dataset_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None	DateTime
creator_uid	UID of person who registered the entry	StringShort
register_date	Date the execution alias was registered	DateTime
alias	User given execution alias name	String
execution_id	Execution this alias is linked to	Integer

The dataset table

row	description	type
dataset_id	Unique identifier for this dataset	Integer
name	Any convenient, evocative name for the human. Note the combination of name and version must be unique.	String
relative_path	Relative path storing the data, relative to <root_dir>. If None, generated from the name and version_string	String
version_major	Major version in semantic string (i.e., X.x.x)	Integer
version_minor	Minor version in semantic string (i.e., x.X.x)	Integer
version_patch	Patch version in semantic string (i.e., x.x.X)	Integer
version_string	Version string	String
creation_date	Dataset creation date	DateTime
register_date	Date the dataset was registered	DateTime
creator_uid	UID of person who registered the entry	StringShort
access_api	Describes the software that can read the dataset (e.g., ‘GCRCatalogs’, ‘skyCatalogs’)	StringShort
access_api_configuration	Additional (text) info which may be needed by access_api	String
owner	Owner of the dataset (defaults to $USER)	String
owner_type	Datasets owner type, can be ‘user’, ‘group’, ‘project’ or ‘production’.	String
data_org	Dataset organisation (‘file’ or ‘directory’)	String
nfiles	How many files are in the dataset	Integer
total_disk_space	Total disk spaced used by the dataset	Float
register_root_dir	The root_dir the dataset was originally ingested into	String
description	User provided human-readable description of the dataset	String
execution_id	Execution this dataset is linked to	Integer
is_overwritable	True means this dataset can be overwritten in the future	Boolean
status	What is the status of the dataset? This is a bitmask description of multiple states. Bit number 0=valid (1 if dataset is valid, 0 if copy data failed during creation), 1=deleted (1 if dataset is deleted and no longer on disk, 0 if data is still on disk, database entry is always kept) 3=archived (1 if data has been archived), 3=replaced (1 if the dataset has been overwritten/replaced). For example ‘0b0011` would be valid=1, deleted=1, archived=0 and replaced=0.	Integer
archive_date	Dataset archive date	DateTime
archive_path	Path the dataset was archived to	String
delete_date	Date the dataset was deleted	DateTime
delete_uid	User ID of person who deleted the dataset	String
move_date	Date the dataset was last moved	DateTime
location_type	What is the physical location of the data? ‘dataregistry’ means the data is located within the <root_dir> and managed by the dataregistry. External means the data is not managed by the dataregistry, either because it is off-site or because it is stored outside <root_dir> therefore there is only a database entry (in this case a url or contact_email must be provided during registration). ‘meta_only’ is for a legitimate entry involving no actual data, but possibly referring to other entries which do directly reference managed data, as may happen for some GCRCatalogs entries, and ‘dummy’ is a dataset for internal testing purposes only. The data registry will only attempt to manage data created with this field set to ‘dataregistry’.	String
url	URL that points to the data (used in the case of external datasets, i.e., location_type=’external’).	String
contact_email	Contact information for someone regarding the dataset.	String
replace_id	Dataset ID of the new entry that replaced this dataset	Integer
replace_iteration	How many times has this datset been overwritten?	Integer

The dependency table

row	description	type
dependency_id	Unique identifier for dependency	Integer
register_date	Date the dependency was registered	DateTime
input_id	Dataset this dependency is linked to (for every dependency, this, or input_production_id, must be non-null)	Integer
input_production_id	Production dataset this dependency is linked to (for every dependency, this, or input_id, must be non-null)	Integer
execution_id	Execution this dependency is linked to	Integer

The dataset_alias table

row	description	type
dataset_alias_id	Unique identifier for dataset alias	Integer
supersede_date	If a new entry has been added to the table with the same alias name (but different dataset_id`or `alias_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None	DateTime
creator_uid	UID of person who registered the entry	StringShort
register_date	Date the dataset alias was registered	DateTime
alias	User-given dataset alias name	String
dataset_id	Dataset this alias is linked to	Integer
ref_alias_id	Alias this alias is linked to	Integer
access_api	Describes the software that can read the dataset (e.g., ‘gcr-catalogs’, ‘skyCatalogs’)	StringShort
access_api_configuration	Additional (text) info which may be needed by access_api	String