Accessing DC2 data in PostgreSQL at NERSC part 2#

Owner: Joanne Bogart @jrbogart
Last Verified to Run: 2020-08-03

This notebook introduces some additional features of the PostgreSQL database at NERSC.

Learning objectives:

After going through this notebook, you should be able to:

  1. Discover which object catalogs are available

  2. Query on native quantities in those catalogs

  3. Make use of custom functions, in particular for area searches

Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To set up your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC

Prerequisites#

Please see the first notebook in this series for instructions on how to gain access to the database.

import psycopg2
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

Make the db connection

dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)

Finding Data#

Tables for the Run1.2i data, as well as a view that makes dpdd quantities more easily accessible, are in the schema run12i (a schema acts like a namespace). To reference, say, a table called position for Run1.2i, use run12i.position.
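For example, a minimal query against that table, using the dbconn connection created above, might look like the following sketch (only object_id is selected, since we know the position table has that column):

pos_query = 'SELECT object_id FROM run12i.position LIMIT 5'
with dbconn.cursor() as cursor:
    cursor.execute(pos_query)
    for record in cursor:
        print(record[0])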

Finding Datasets#

To find out which datasets are available, and under what schema names, query the table run_provenance. It's in a special schema known as public, which normally does not need to be specified.

cols = ['schema_name', 'run_designation','simulation_program', 'db_ingest', 'remarks']
hdrs = ['schema_name', 'run_desig', 'sim_prog','db_ingest', 'remarks']
# Additional columns in run_provenance store software and input data versions
prov_query = 'SELECT '  + ','.join(cols) + ' from run_provenance'
with dbconn.cursor() as cursor:
    cursor.execute(prov_query)
    fmt = '{0!s:14} {1!s:10} {2!s:9} {3!s:15} {4!s}'
    print(fmt.format(hdrs[0], hdrs[1], hdrs[2], hdrs[3], hdrs[4]))
    for record in cursor:
        print(fmt.format(record[0], record[1], record[2], record[3], record[4]))

Normally only datasets whose db_ingest field contains 'complete' are of interest.
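As a sketch, the provenance query above can be filtered to list only those datasets (this assumes the db_ingest values for fully ingested datasets literally contain the string 'complete'):

complete_query = prov_query + " where db_ingest like '%complete%'"
with dbconn.cursor() as cursor:
    cursor.execute(complete_query)
    for record in cursor:
        print(record[0])         # schema_name of each fully ingested dataset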

# Pick one of the supported datasets
schema = 'run12p_v4'

Querying on Native Quantities#

Unlike DPDD quantities, which are all in a single view, native quantities are split across several tables. The first notebook in the PostgreSQL collection shows how to find out which tables belong to a schema and what columns a table has. Alternatively, if you know the name of a column you want, you can query for the name of the table containing it. The following looks for the table containing ext_shapeHSM_HsmShapeRegauss_flag.

column = 'ext_shapehsm_hsmshaperegauss_flag'
find_table_query = "select table_name from information_schema.columns where table_schema='{}' and column_name='{}'"
find_table_query = find_table_query.format(schema, column)
print(find_table_query)
with dbconn.cursor() as cursor:
    cursor.execute(find_table_query)
    for record in cursor:
        print(record[0])

There is some necessary fussiness here:

  • Note ext_shapeHSM_HsmShapeRegauss_flag has been transformed to all lower-case in the query. This is required when querying information_schema, where this string is a value in the database (not a column name).

  • In the query, single quotes are used around literals such as run12p_v4. Double quotes won't work.
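One way to sidestep the quoting rules entirely is to let psycopg2 substitute the literals. The sketch below is equivalent to the query above; the %s placeholders are bound at execution time and psycopg2 supplies the single quotes:

find_table_query = ('select table_name from information_schema.columns '
                    'where table_schema=%s and column_name=%s')
with dbconn.cursor() as cursor:
    cursor.execute(find_table_query, (schema, column))   # values quoted by psycopg2
    for record in cursor:
        print(record[0])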

Now suppose we want to combine a cut on this quantity, which lives in the table dpdd_ref, with cuts on DPDD quantities like clean. The query then has to be made on a join of these two tables, where we specify that dpdd_ref.object_id = dpdd.objectid. This causes the corresponding rows from each table (or view) to be treated as if they were assembled into one long row. Here is a simple query showing how this is done; a more realistic one would have more conditions in the where clause and might join more than two tables.

schema = 'run12i'
join_query = 'select count(object_id) from {schema}.dpdd join {schema}.dpdd_ref '
join_query += 'on {schema}.dpdd_ref.object_id = {schema}.dpdd.objectid'
join_query = join_query.format(**locals())
where = " where (ext_shapeHSM_HSmShapeRegauss_flag = 'f') and clean"
join_query += where
print(join_query)                # confirm the query looks reasonable
with dbconn.cursor() as cursor:
    %time cursor.execute(join_query)
    for record in cursor:
        print(record[0])

Adjustments for Larger Datasets#

In the above query I switched to a different, significantly smaller dataset (2.8 million objects rather than 13.7 million). For the much larger 2.1i_dr1b_v1 (over 78 million objects) we have to pay some attention to performance. The dpdd view is formed from a join of three tables, one of which is dpdd_ref, so in the above query that table is unnecessarily joined to itself, increasing the resources needed to execute the query. It is more efficient to get all quantities from a join of the underlying tables only, but the dpdd view is more than the result of a join with some columns renamed: clean is formed by applying logical operations to several native-quantity flags. Starting with run 2.1i_dr1b_v1 it can be expressed as good and not deblend_skipped, where both good and deblend_skipped are columns in dpdd_ref. (In earlier runs, good existed only in the dpdd view, itself the result of logical operations on several native-quantity flags.) We also have to exclude non-primary objects, as the dpdd view does; the flag detect_isprimary is in the position table. The query for run 2.1i_dr1b_v1 can be written as shown in the next cell. Skip it if you're in a hurry; even with these techniques it still takes about 13 minutes.

schema = 'run21i_dr1b_v1'
join_query = 'select count(position.object_id) from {schema}.position left join {schema}.dpdd_ref '
join_query += 'on {schema}.position.object_id = {schema}.dpdd_ref.object_id'
join_query = join_query.format(**locals())
where = " where detect_isprimary and good and (not deblend_skipped) and (ext_shapeHSM_HSmShapeRegauss_flag = 'f')"
join_query += where
print(join_query)                # confirm the query looks reasonable
with dbconn.cursor() as cursor:
    %time cursor.execute(join_query)
    for record in cursor:
        print(record[0])

In all such joins, position should be the first table in the join and left joins should be used, as in the example. This will in general result in better performance since it forces the join(s) to be done in the order specified. All tables are indexed on object_id. Only the position table has additional indexes (on detect_isprimary and on coord, a special column used in area searches).
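To check what plan PostgreSQL will actually use, and in particular the join order, you can prepend EXPLAIN to any query; the plan is returned as rows of text and the query itself is not executed. A minimal sketch using the join query built above:

with dbconn.cursor() as cursor:
    cursor.execute('EXPLAIN ' + join_query)
    for record in cursor:
        print(record[0])         # one line of the query plan per row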

User-defined Functions (UDFs)#

Many math functions from the C library have been wrapped and incorporated in an extension module installed in the database. They have their normal C library names with the prefix c_. Functions with a floating-point argument or return value usually come in two versions, such as c_log (double-precision natural logarithm) and c_logf (single-precision). They can be incorporated in queries as in this example, which uses the command-line interface program psql:

desc_dc2_drp=> select c_asin(1.0);

     c_asin
-----------------
 1.5707963267949
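The same UDFs can also be called from this notebook through psycopg2, for example:

with dbconn.cursor() as cursor:
    cursor.execute('select c_asin(1.0)')
    print(cursor.fetchone()[0])    # pi/2, about 1.5707963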

There are also functions specially crafted for HSC or LSST catalogs, with suggestive names like patch_contains, tract_from_object_id (used in q3 in the first notebook of this series), sky_to_pixel, and tractsearch, which is used below to count the objects in a single tract:

desc_dc2_drp=> select count(*) from run12i.dpdd where tractsearch(objectId, 5063);
 count
--------
 233982
(1 row)

Restricting by tract or patch#

Let’s try the last query from the previous section restricted to tract 3446 (picked at random), which has about 570,000 objects.

schema = 'run21i_dr1b_v1'
join_query = 'select count(position.object_id) from {schema}.position left join {schema}.dpdd_ref '
join_query += 'on {schema}.position.object_id = {schema}.dpdd_ref.object_id'
join_query = join_query.format(**locals())
where = " where detect_isprimary and tractsearch(position.object_id, 3446) and  "
where += "good and (not deblend_skipped) and (ext_shapeHSM_HSmShapeRegauss_flag = 'f')"
join_query += where
print(join_query)                # confirm the query looks reasonable
with dbconn.cursor() as cursor:
    %time cursor.execute(join_query)
    for record in cursor:
        print(record[0])

Tract 3446 has about 1/136 of the objects in the run21i_dr1b_v1 dataset. The elapsed time for the tract query is less than 1/136 of the time taken by the original query. It pays to restrict queries to a tract when possible, even if it means issuing the same query for several tracts.
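As a sketch, such a per-tract loop might look like the following, where the list of tract numbers is purely hypothetical and the tract number is bound by psycopg2 at execution time:

tracts = [3446, 3447]            # hypothetical list of tracts of interest
per_tract_query = ('select count(position.object_id) from {schema}.position '
                   'left join {schema}.dpdd_ref '
                   'on {schema}.position.object_id = {schema}.dpdd_ref.object_id '
                   'where detect_isprimary and tractsearch(position.object_id, %s) '
                   "and good and (not deblend_skipped) and "
                   "(ext_shapeHSM_HsmShapeRegauss_flag = 'f')").format(schema=schema)
total = 0
with dbconn.cursor() as cursor:
    for tract in tracts:
        cursor.execute(per_tract_query, (tract,))   # psycopg2 binds the tract number
        total += cursor.fetchone()[0]
print(total)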

Area Searches#

The dpdd view has one extra column, coord, which is not formally a DPDD quantity. coord is an alternate way (other than ra and dec) to express location. A coord value is a triple of doubles representing a position on a sphere in units of arcseconds. This column is indexed, which can make certain calculations faster. In particular, using the functions conesearch and boxsearch (which take a coord as input) rather than starting with ra and dec makes queries much faster. There are also functions to translate between coord and (ra, dec).
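As a sketch of a cone search on the run12i dpdd view: the exact calling convention of conesearch is an assumption here (the coord column, then the center ra and dec in degrees, then a radius in arcseconds), and the search region is a hypothetical placeholder; check the function definition in the database if in doubt.

cone_query = ("select count(objectid) from run12i.dpdd "
              "where conesearch(coord, '{ra}', '{dec}', '{radius}')")
cone_query = cone_query.format(ra=55.0, dec=-29.0, radius=120.0)   # hypothetical region
print(cone_query)                # confirm the query looks reasonable
with dbconn.cursor() as cursor:
    cursor.execute(cone_query)
    print(cursor.fetchone()[0])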