Database schema
A description of the data registry schema layout. Note that each schema in the database (e.g., the default and production schemas) follows the same structure.
The execution table
row |
description |
type |
---|---|---|
execution_id |
Unique identifier for execution |
Integer |
description |
Short description of execution |
String |
register_date |
When was the execution registered in the database |
DateTime |
execution_start |
When was the execution performed at site |
DateTime |
name |
Identifies the code executed (e.g., could be pipeline name) |
String |
site |
Site where the code was run (e.g., NERSC) |
String |
configuration |
Path to execution configuration file (txt, YAML, TOML, etc). Ingested as raw text |
String |
creator_uid |
UID of person who registered the entry |
StringShort |
The provenance table
row |
description |
type |
---|---|---|
provenance_id |
Unique identifier for this provenance entry |
Integer |
code_version_major |
Major version of code when this schema was created |
Integer |
code_version_minor |
Minor version of code when this schema was created |
Integer |
code_version_patch |
Patch version of code when this schema was created |
Integer |
code_version_suffix |
Version suffix of code when this schema was created |
String |
creator_uid |
UID of person who registered the entry |
StringShort |
db_version_major |
Major version of schema |
Integer |
db_version_minor |
Minor version of schema |
Integer |
db_version_patch |
Patch version of schema |
Integer |
git_hash |
Git hash at time of schema creation |
String |
repo_is_clean |
Was git repo clean at schema creation? |
Boolean |
update_method |
What type of schema update does this entry relate to (‘CREATE’,’MODIFY’,’MIGRATE’) |
String |
schema_enabled_date |
Date schema was created/updated |
DateTime |
comment |
Comment relating to new provenance entry |
StringLong |
The execution_alias table
row |
description |
type |
---|---|---|
execution_alias_id |
Unique identifier for execution alias |
Integer |
supersede_date |
If a new entry has been added to the table with the same alias name (but different dataset_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
register_date |
Date the execution alias was registered |
DateTime |
alias |
User given execution alias name |
String |
execution_id |
Execution this alias is linked to |
Integer |
The dataset table
row |
description |
type |
---|---|---|
dataset_id |
Unique identifier for this dataset |
Integer |
name |
Any convenient, evocative name for the human. Note the combination of name, version and version_suffix must be unique. If None name is generated from the relative path. |
String |
relative_path |
Relative path storing the data, relative to <root_dir> |
String |
version_major |
Major version in semantic string (i.e., X.x.x) |
Integer |
version_minor |
Minor version in semantic string (i.e., x.X.x) |
Integer |
version_patch |
Patch version in semantic string (i.e., x.x.X) |
Integer |
version_suffix |
Optional version suffix to place at the end of the version string. Cannot be used for production datasets. |
String |
version_string |
Version string |
String |
creation_date |
Dataset creation date |
DateTime |
register_date |
Date the dataset was registered |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
access_API |
Describes the software that can read the dataset (e.g., ‘gcr-catalogs’, ‘skyCatalogs’) |
StringShort |
owner |
Owner of the dataset (defaults to $USER) |
String |
owner_type |
Datasets owner type, can be ‘user’, ‘group’, ‘project’ or ‘production’. |
String |
data_org |
Dataset organisation (‘file’ or ‘directory’) |
String |
nfiles |
How many files are in the dataset |
Integer |
total_disk_space |
Total disk spaced used by the dataset |
Float |
register_root_dir |
The root_dir the dataset was originally ingested into |
String |
description |
User provided human-readable description of the dataset |
String |
execution_id |
Execution this dataset is linked to |
Integer |
is_overwritten |
True if the original data for this dataset has been overwritten at some point. This would have required that is_overwritable was set to true on the original dataset |
Boolean |
is_overwritable |
True means this dataset can be overwritten in the future |
Boolean |
status |
What is the status of the dataset? This is a bitmask description of multiple states. Bit number 0=valid (1 if dataset is valid, 0 if copy data failed during creation), 1=deleted (1 if dataset is deleted and no longer on disk, 0 if data is still on disk, database entry is always kept) 3=archived (1 if data has been archived). For example ‘0b011` would be valid=1, deleted=1 and archived=0. |
Integer |
archive_date |
Dataset archive date |
DateTime |
archive_path |
Path the dataset was archived to |
String |
delete_date |
Date the dataset was deleted |
DateTime |
delete_uid |
User ID of person who deleted the dataset |
String |
move_date |
Date the dataset was last moved |
DateTime |
location_type |
What is the physical location of the data? ‘dataregistry’ means the data is located within the <root_dir> and managed by the dataregistry, external means the data is not managed by the dataregistry, either because it is off-site or because it is stored outside <root_dir> therefore there is only a database entry (in this case a url or contact_email must be provided during registration) and ‘dummy’ is a dataset for testing purposes only (only a database entry is created in this case). |
String |
url |
URL that points to the data (used in the case of external datasets, i.e., location_type=’external’). |
String |
contact_email |
Contact information for someone regarding the dataset. |
String |
The dependency table
row |
description |
type |
---|---|---|
dependency_id |
Unique identifier for dependency |
Integer |
register_date |
Date the dependency was registered |
DateTime |
input_id |
Dataset this dependency is linked to (for every dependency, this, or input_production_id, must be non-null) |
Integer |
input_production_id |
Production dataset this dependency is linked to (for every dependency, this, or input_id, must be non-null) |
Integer |
execution_id |
Execution this dependency is linked to |
Integer |
The dataset_alias table
row |
description |
type |
---|---|---|
dataset_alias_id |
Unique identifier for dataset alias |
Integer |
supersede_date |
If a new entry has been added to the table with the same alias name (but different dataset_id), the old entry will be superseded. supersede_date in the old entry tracks when this happened. If the entry has not been superseded, supersede_date will be None |
DateTime |
creator_uid |
UID of person who registered the entry |
StringShort |
register_date |
Date the dataset alias was registered |
DateTime |
alias |
User given dataset alias name |
String |
dataset_id |
Dataset this alias is linked to |
Integer |