Accessing DC2 truth and simulated observations data in PostgreSQL at NERSC

Accessing DC2 truth and simulated observations data in PostgreSQL at NERSC#

Owner: Joanne Bogart @jrbogart
Last Verified to Run: 2020-11-24

This notebook demonstrates access to truth catalog data via the PostgreSQL database at NERSC. Currently (May 1, 2020) two kinds of truth catalogs are available: star truth and supernova truth. This notebook will concentrate on star truth. The minion observation database is also available, as well as the Run2.2i dr6 object catalog (dpdd columns approximating specification in LSE-163 only).

Learning objectives:

After going through this notebook, you should be able to:

Find out what star truth and simulated observation information is available and query it.
Make use of standard tools to, e.g., plot light curves

Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC

Prerequisites#

See Getting Started with PostgreSQL at NERSC, especially the “Preliminaries” section
Some minimal acquaintance with SQL is helpful. See the “SQL Primer” section of the above document

Conventions#

SQL keywords have been written in ALL CAPS only to make them stand out in queries. (The database server ignores case in queries for keywords, column names and table names.)

import psycopg2
import numpy as np
%matplotlib inline 
import matplotlib.pyplot as plt
import pandas as pd

Make the db connection

dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)

obs_schema = 'minion_test'
truth_schema = 'star_truth'
object_schema = 'run22i_dr6_wfd_v1'
#truth_schema = 'sne_truth'

Initial steps in this notebook can be run for supernova truth instead of star truth just by changing the value for truth_schema above, but the section labeled Sample Query and the light curve queries need more adjustments.

Convenience utilities, defined here so as not to clutter up the main line.

def get_schema_tables(conn, schema):
    '''
    Returns 1-d numpy array of table names
    '''
    q = f"""SELECT DISTINCT table_name FROM information_schema.columns 
         WHERE table_schema='{schema}' 
         ORDER BY table_name"""
    with conn.cursor() as cursor:
        cursor.execute(q)
        records = cursor.fetchall()
    tables = np.array([r[0] for r in records])
    return tables

def get_table_columns(conn, schema, table):
    '''
    Returns a pandas dataframe with columns column_name, datatype
    '''
    q = f"""SELECT column_name, data_type FROM information_schema.columns 
        WHERE table_schema='{schema}' AND table_name='{table}' 
        ORDER BY column_name"""
    with conn.cursor() as cursor:
        cursor.execute(q)
        records = cursor.fetchall()
    print(len(records))
    df = pd.DataFrame(records, columns=['column_name', 'data_type'])
    return df
    

Another convenience routine. It makes use of the public function conesearch and provides a place to document its arguments. See also a more comprehensive list of such functions.

def format_cone_search(coord_column, ra, dec, radius):
    '''
    Parameters
    coord_column:  name of column of type earth in the table
    ra:  ra value at center of cone (degrees)
    dec:  dec value at center of cone (degrees)
    radius: radius of cone (arcseconds)
    
    Returns
    Condition to be inserted into WHERE clause for the query
    '''
    cond = f"""conesearch({coord_column},'{ra}','{dec}','{radius}')"""
    return cond

Display all tables belonging to the schema.

schemas = [truth_schema, obs_schema]
for s in schemas:
    tables = get_schema_tables(dbconn, s)
    print(f"\nTables for schema {s}:")
    for t in tables: 
        print(t)

Display all columns belonging to a couple tables

truth_summary_info = get_table_columns(dbconn, truth_schema, 'truth_summary' )
truth_summary_info.style.set_properties(**{'text-align': 'left'})
truth_summary_info

stellar_variability_info = get_table_columns(dbconn, truth_schema, 'stellar_variability_truth' )
stellar_variability_info.style.set_properties(**{'text-align': 'left'})
stellar_variability_info

The most generally useful table in the obs database is summary. Note ra and dec are stored in radians in this table.

obs_summary_info = get_table_columns(dbconn, obs_schema, 'summary' )
obs_summary_info.style.set_properties(subset=["column_name", "data_type"], **{'text-align': 'right'})
obs_summary_info

Sample Query#

Find delta flux readings for stars which are expected to be in the field of view for a particular visit. This sort of query returns practically instantly. In the stellar_variability_truth both of the columns mentioned in the WHERE clause - id and obshistid - are indexed.

obshistid = 731791
table = 'stellar_variability_truth'
ids = ["'835183'", "'31303590103'","'31102013522'","'31303588649'", "'30317268917'",
       "'30825472052'","'835279'","'31102039372'","'30825477672'","'31102046245'", 
       "'30321363109'","'31102051190'","'31102061342'","'30321363877'","'31102061079'",
       "'31411663457'", "'31107813412'"]
id_list = ",".join(ids)

flux_q = f"""SELECT id, delta_flux FROM {truth_schema}.{table} 
         WHERE (obshistid={obshistid})
         AND id IN ({id_list});"""
print(flux_q)
with dbconn.cursor() as cursor:
    cursor.execute(flux_q)
    f_records = cursor.fetchall()
df_flux = pd.DataFrame(f_records, columns=['id', 'delta_flux']) 
df_flux

Area searches#

The column coord in table truth_summary and columns with “coord” in their names in the observation summary table are of a special type earth. The value is a triple of double precision numbers describing the position on a unit sphere corresponding to ra and dec. Indexes have been defined on these columns to speed up area searches.

The observation with obshistid=14 has desc dithered ra, dec = (1.69831745204, -0.59856) in radians. Find all observations within a radius (expressed in arcseconds).

Warning: All ra, dec in the observation summary table are in radians. ra and dec in truth_summary are in degrees. format_cone_search needs ra, dec in degrees.

radius = 500
ra_deg = np.degrees((1.69831745204,))[0]
dec_deg = np.degrees((-0.59856,))[0]
dec = -0.59856
cond = format_cone_search('descditheredcoord', ra_deg, dec_deg, radius)
obs_query = f"""SELECT obshistid,descditheredra,descdithereddec FROM {obs_schema}.summary
            WHERE {cond}"""
# Uncomment the following to confirm that the query looks reasonable
#print(obs_query)
with dbconn.cursor() as cursor:
    %time cursor.execute(obs_query)
    records = cursor.fetchall()
    df_obs = pd.DataFrame(records, columns=['obshistid', 'ra_radians', 'dec_radians'])
df_obs

Light curves#

Find length of light curves for variable stars near a particular location

# pick a location that probably gets lots of visits
# for (70.0, -30.0, 80)  get 3; for (60.0, -30.0, 80) get 1
ra = 60.0       # 70.0     
dec = -30.0
radius = 150    
tbl_spec = f"""SELECT S.id, S.ra, S.dec, max(abs(V.delta_flux)),count(V.bandpass) AS visit_count 
           FROM {truth_schema}.truth_summary AS S JOIN 
           {truth_schema}.stellar_variability_truth AS V ON S.id=V.id """
where = "WHERE " + format_cone_search('S.coord', ra, dec, radius) + " AND S.is_variable=1 "
group_by = " GROUP BY S.id,S.ra,S.dec"
q = tbl_spec + where + group_by

# This takes a couple minutes to complete
with dbconn.cursor() as cursor:
    %time cursor.execute(q)
    records = cursor.fetchall()

df_lengths = pd.DataFrame(records, columns=['id', 'ra','dec', 'max_delta_flux','count'])
df_lengths

Similar to above, but this time don’t count visits. Get the delta_flux values instead

ra = 70.0     
dec = -30.0
radius = 80 
columns = ['S.id', 'ra', 'dec', 'bandpass', 'delta_flux']
col_list = (',').join(columns)
tbl_spec = f"""SELECT {col_list} 
           FROM {truth_schema}.truth_summary AS S JOIN 
           {truth_schema}.stellar_variability_truth AS V ON S.id=V.id """
where = "WHERE " + format_cone_search('S.coord', ra, dec, radius) + " and S.is_variable=1 "
q = tbl_spec + where

with dbconn.cursor() as cursor:
    %time cursor.execute(q)
    records_lc = cursor.fetchall()

df_cone_lcs = pd.DataFrame(records_lc, columns=columns)
df_cone_lcs

Plot light curves for one star#

Pick the second object from the results of the first query in this section (id=1568931714) since max_delta_flux is large

Get the data#

Get delta_flux and time values for the plot and some summary information about the star. Use ORDER BY clause so that data are presented conveniently for plotting.

id = 1568931714
var_tbl = 'stellar_variability_truth'
lc_q = f"""SELECT bandpass,mjd,delta_flux FROM {truth_schema}.{var_tbl}
       WHERE id='{id}' ORDER BY bandpass, mjd;"""
print(lc_q)
with dbconn.cursor() as cursor:
    %time cursor.execute(lc_q)
    lc_records = cursor.fetchall()
print(len(lc_records))
df_single_lc = pd.DataFrame(lc_records, columns=['bandpass','mjd','delta_flux'])
df_single_lc

sum_tbl = 'truth_summary'
sum_fluxes = ','.join([f"flux_{b}" for b in ['g', 'i', 'r', 'u', 'y', 'z']])
sum_q = f"""SELECT ra,dec,{sum_fluxes} FROM {truth_schema}.{sum_tbl} 
         WHERE id='{id}';"""
print(sum_q)
with dbconn.cursor() as cursor:
    %time cursor.execute(sum_q)
    sum_record = cursor.fetchone()
lc_ra = sum_record[0]
lc_dec = sum_record[1]
print(f'ra={lc_ra}, dec={lc_dec}')

Plotting#

from astropy.time import Time
def plot_band_lc(axes, times, fluxes, params):
    out = axes.scatter(np.asarray(times), np.asarray(fluxes), **params)

def plot_level(axes, yvalue, params):
    xmin, xmax = axes.get_xlim()
    out = axes.plot(np.asarray([xmin, xmax]), np.asarray([yvalue, yvalue]), **params)

def format_title(id, ra, dec, band=None):  
    if band is None:
        return f'Per-band light curves for star {id} at (ra,dec)=({ra}, {dec})'
    else:
        return f'Light curve for star {id}, band={band} at (ra,dec)=({ra}, {dec})'

def plot_object(title, the_data, band=None):
    '''
    Plot r, g and i light 'curves' (delta_flux as scatter plot) for an object
    or plot only requested band
    Parameters
    -----------
    title : string
    the_data : data frame which must include columns filtername, obsstart, mag
    '''
    good_d = the_data[(np.isnan(the_data.delta_flux)==False)]
    red_d = good_d[(good_d.bandpass=="r")]
    green_d = good_d[(good_d.bandpass=="g")]
    i_d = good_d[(good_d.bandpass=="i")]
    #print("red data shape: ", red_e.shape, "   green data shape: ", green_e.shape, "  i data shape: ", i_e.shape)
    fix, axes = plt.subplots(figsize=(12,8))

    plt.title(title)
    plt.xlabel('Julian date')
    plt.ylabel('Delta flux')

    params_r = {'marker' : 'o', 'label' : 'r band', 'color' : 'red'}
    params_g = {'marker' : 'o', 'label' : 'g band', 'color' : 'green'}
    params_i = {'marker' : 'o', 'label' : 'i band', 'color' : 'orange'}
    #print('In plot_object printing i-band values')
    #for ival in list(i_d['mag']): print(ival)
    if band is None or band=='r':
        plot_band_lc(axes, list(red_d['mjd']), list(red_d['delta_flux']), params_r)
    if band is None or band=='g':
        plot_band_lc(axes, list(green_d['mjd']), list(green_d['delta_flux']), params_g)
    if band is None or band=='i':
        plot_band_lc(axes, list(i_d['mjd']), list(i_d['delta_flux']), params_i)
    #plot_level(axes, coadd_mag['r'], {'label' : 'r coadd mag', 'color' : 'red'})
    #plot_level(axes, coadd_mag['g'], {'label' : 'g coadd mag', 'color' : 'green'})
    #plot_level(axes, coadd_mag['i'], {'label' : 'i coadd mag', 'color' : 'orange'})
    if band is None:
        plt.legend()

for band in ('r','g','i'):
    title = format_title(id, lc_ra, lc_dec, band)
    plot_object(title, df_single_lc, band)

Is it in the object table?#

First get truth information for our chosen star, then try to find a match, restricting to point sources within a few arcseconds

truth_q = f"SELECT ra,dec,flux_g,flux_r,flux_i from {truth_schema}.truth_summary where id='{id}'"
print(truth_q)
with dbconn.cursor() as cursor:
    %time cursor.execute(truth_q)
    truth_records = cursor.fetchall()
#truth_records
print(len(truth_records))
truth_df = None
truth_df = pd.DataFrame(truth_records, columns=['ra', 'dec', 'flux_g', 'flux_r', 'flux_i'])
truth_df.shape
truth_df

cols = ['objectid','ra', 'dec', 'psflux_g','psflux_r', 'psflux_i']
col_spec = ','.join(cols)
obj_q = f"""SELECT {col_spec} from {object_schema}.dpdd_object  WHERE extendedness < 0.5 AND """
radius = 20   # in arcseconds
obj_q += format_cone_search('coord', lc_ra, lc_dec, radius) 
print(obj_q)
with dbconn.cursor() as cursor:
    %time cursor.execute(obj_q)
    obj_records = cursor.fetchall()
obj_df = pd.DataFrame(obj_records, columns=cols)
obj_df