Database schema
A description of the data registry schema layout. Note that each schema in the database (e.g., the default and production schemas) follows the same structure.
The execution table
row |
description |
type |
---|---|---|
execution_id |
Unique identifier for execution |
Integer |
description |
Short description of execution |
String |
register_date |
When was the execution registered in the database |
DateTime |
execution_start |
When was the execution performed at site |
DateTime |
name |
Identifies the code executed (e.g., could be pipeline name) |
String |
site |
Site where the code was run (e.g., NERSC) |
String |
configuration |
Path to execution configuration file (txt, YAML, TOML, etc). Ingested as raw text |
String |
creator_uid |
UID of person who registered the entry |
StringShort |
The provenance table
row |
description |
type |
---|---|---|
provenance_id |
Unique identifier for this provenance entry |
Integer |
code_version_major |
Major version of code when this schema was created |
Integer |
code_version_minor |
Minor version of code when this schema was created |
Integer |
code_version_patch |
Patch version of code when this schema was created |
Integer |
creator_uid |
UID of person who registered the entry |
StringShort |
db_version_major |
Major version of schema |
Integer |
db_version_minor |
Minor version of schema |
Integer |
db_version_patch |
Patch version of schema |
Integer |
git_hash |
Git hash at time of schema creation |
String |
repo_is_clean |
Was git repo clean at schema creation? |
Boolean |
update_method |
What type of schema update does this entry relate to (‘CREATE’,’MODIFY’,’MIGRATE’) |
String |
schema_enabled_date |
Date schema was created/updated |
DateTime |
comment |
Comment relating to new provenance entry |
StringLong |
associated_production |
Name of production schema associated with this schema |
String |
The execution_alias table
row |
description |
type |
---|---|---|
execution_alias_id |
Unique identifier for execution alias |
Integer |
supersede_date |
If a new entry has been added to the table with the same alias name (but different dataset_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
register_date |
Date the execution alias was registered |
DateTime |
alias |
User given execution alias name |
String |
execution_id |
Execution this alias is linked to |
Integer |
The dataset table
row |
description |
type |
---|---|---|
dataset_id |
Unique identifier for this dataset |
Integer |
name |
Any convenient, evocative name for the human. Note the combination of name and version must be unique. |
String |
relative_path |
Relative path storing the data, relative to <root_dir>. If None, generated from the name and version_string |
String |
version_major |
Major version in semantic string (i.e., X.x.x) |
Integer |
version_minor |
Minor version in semantic string (i.e., x.X.x) |
Integer |
version_patch |
Patch version in semantic string (i.e., x.x.X) |
Integer |
version_string |
Version string |
String |
creation_date |
Dataset creation date |
DateTime |
register_date |
Date the dataset was registered |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
access_api |
Describes the software that can read the dataset (e.g., ‘GCRCatalogs’, ‘skyCatalogs’) |
StringShort |
access_api_configuration |
Additional (text) info which may be needed by access_api |
String |
owner |
Owner of the dataset (defaults to $USER) |
String |
owner_type |
Datasets owner type, can be ‘user’, ‘group’, ‘project’ or ‘production’. |
String |
data_org |
Dataset organisation (‘file’ or ‘directory’) |
String |
nfiles |
How many files are in the dataset |
Integer |
total_disk_space |
Total disk spaced used by the dataset |
Float |
register_root_dir |
The root_dir the dataset was originally ingested into |
String |
description |
User provided human-readable description of the dataset |
String |
execution_id |
Execution this dataset is linked to |
Integer |
is_overwritable |
True means this dataset can be overwritten in the future |
Boolean |
status |
What is the status of the dataset? This is a bitmask description of multiple states. Bit number 0=valid (1 if dataset is valid, 0 if copy data failed during creation), 1=deleted (1 if dataset is deleted and no longer on disk, 0 if data is still on disk, database entry is always kept) 3=archived (1 if data has been archived), 3=replaced (1 if the dataset has been overwritten/replaced). For example ‘0b0011` would be valid=1, deleted=1, archived=0 and replaced=0. |
Integer |
archive_date |
Dataset archive date |
DateTime |
archive_path |
Path the dataset was archived to |
String |
delete_date |
Date the dataset was deleted |
DateTime |
delete_uid |
User ID of person who deleted the dataset |
String |
move_date |
Date the dataset was last moved |
DateTime |
location_type |
What is the physical location of the data? ‘dataregistry’ means the data is located within the <root_dir> and managed by the dataregistry. External means the data is not managed by the dataregistry, either because it is off-site or because it is stored outside <root_dir> therefore there is only a database entry (in this case a url or contact_email must be provided during registration). ‘meta_only’ is for a legitimate entry involving no actual data, but possibly referring to other entries which do directly reference managed data, as may happen for some GCRCatalogs entries, and ‘dummy’ is a dataset for internal testing purposes only. The data registry will only attempt to manage data created with this field set to ‘dataregistry’. |
String |
url |
URL that points to the data (used in the case of external datasets, i.e., location_type=’external’). |
String |
contact_email |
Contact information for someone regarding the dataset. |
String |
replace_id |
Dataset ID of the new entry that replaced this dataset |
Integer |
replace_iteration |
How many times has this datset been overwritten? |
Integer |
The dependency table
row |
description |
type |
---|---|---|
dependency_id |
Unique identifier for dependency |
Integer |
register_date |
Date the dependency was registered |
DateTime |
input_id |
Dataset this dependency is linked to (for every dependency, this, or input_production_id, must be non-null) |
Integer |
input_production_id |
Production dataset this dependency is linked to (for every dependency, this, or input_id, must be non-null) |
Integer |
execution_id |
Execution this dependency is linked to |
Integer |
The dataset_alias table
row |
description |
type |
---|---|---|
dataset_alias_id |
Unique identifier for dataset alias |
Integer |
supersede_date |
If a new entry has been added to the table with the same alias name (but different dataset_id`or `alias_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
register_date |
Date the dataset alias was registered |
DateTime |
alias |
User-given dataset alias name |
String |
dataset_id |
Dataset this alias is linked to |
Integer |
ref_alias_id |
Alias this alias is linked to |
Integer |
access_api |
Describes the software that can read the dataset (e.g., ‘gcr-catalogs’, ‘skyCatalogs’) |
StringShort |
access_api_configuration |
Additional (text) info which may be needed by access_api |
String |