Accessing DC2 truth and simulated observations data in PostgreSQL at NERSC#
Owner: Joanne Bogart @jrbogart
Last Verified to Run: 2020-11-24
This notebook demonstrates access to truth catalog data via the PostgreSQL database at NERSC. Currently (May 1, 2020) two kinds of truth catalogs are available: star truth and supernova truth. This notebook will concentrate on star truth. The minion observation database is also available, as well as the Run2.2i dr6 object catalog (dpdd columns approximating specification in LSE-163 only).
Learning objectives:
After going through this notebook, you should be able to:
Find out what star truth and simulated observation information is available and query it.
Make use of standard tools to, e.g., plot light curves
Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC
Prerequisites#
See Getting Started with PostgreSQL at NERSC, especially the “Preliminaries” section
Some minimal acquaintance with SQL is helpful. See the “SQL Primer” section of the above document
Conventions#
SQL keywords have been written in ALL CAPS only to make them stand out in queries. (The database server ignores case in queries for keywords, column names and table names.)
import psycopg2
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
Make the db connection
dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)
obs_schema = 'minion_test'
truth_schema = 'star_truth'
object_schema = 'run22i_dr6_wfd_v1'
#truth_schema = 'sne_truth'
Initial steps in this notebook can be run for supernova truth instead of star truth
just by changing the value for truth_schema
above, but the section labeled Sample Query
and the light curve queries need more adjustments.
Convenience utilities, defined here so as not to clutter up the main line.
def get_schema_tables(conn, schema):
'''
Returns 1-d numpy array of table names
'''
q = f"""SELECT DISTINCT table_name FROM information_schema.columns
WHERE table_schema='{schema}'
ORDER BY table_name"""
with conn.cursor() as cursor:
cursor.execute(q)
records = cursor.fetchall()
tables = np.array([r[0] for r in records])
return tables
def get_table_columns(conn, schema, table):
'''
Returns a pandas dataframe with columns column_name, datatype
'''
q = f"""SELECT column_name, data_type FROM information_schema.columns
WHERE table_schema='{schema}' AND table_name='{table}'
ORDER BY column_name"""
with conn.cursor() as cursor:
cursor.execute(q)
records = cursor.fetchall()
print(len(records))
df = pd.DataFrame(records, columns=['column_name', 'data_type'])
return df
Another convenience routine.
It makes use of the public function conesearch
and provides a place to document its arguments.
See also a more comprehensive list of such functions.
def format_cone_search(coord_column, ra, dec, radius):
'''
Parameters
coord_column: name of column of type earth in the table
ra: ra value at center of cone (degrees)
dec: dec value at center of cone (degrees)
radius: radius of cone (arcseconds)
Returns
Condition to be inserted into WHERE clause for the query
'''
cond = f"""conesearch({coord_column},'{ra}','{dec}','{radius}')"""
return cond
Display all tables belonging to the schema.
schemas = [truth_schema, obs_schema]
for s in schemas:
tables = get_schema_tables(dbconn, s)
print(f"\nTables for schema {s}:")
for t in tables:
print(t)
Display all columns belonging to a couple tables
truth_summary_info = get_table_columns(dbconn, truth_schema, 'truth_summary' )
truth_summary_info.style.set_properties(**{'text-align': 'left'})
truth_summary_info
stellar_variability_info = get_table_columns(dbconn, truth_schema, 'stellar_variability_truth' )
stellar_variability_info.style.set_properties(**{'text-align': 'left'})
stellar_variability_info
The most generally useful table in the obs database is summary
. Note ra and dec are stored in radians in this table.
obs_summary_info = get_table_columns(dbconn, obs_schema, 'summary' )
obs_summary_info.style.set_properties(subset=["column_name", "data_type"], **{'text-align': 'right'})
obs_summary_info
Sample Query#
Find delta flux readings for stars which are expected to be in the field of view for a particular visit. This sort of query returns practically instantly. In the stellar_variability_truth
both of the columns mentioned in the WHERE
clause - id
and obshistid
- are indexed.
obshistid = 731791
table = 'stellar_variability_truth'
ids = ["'835183'", "'31303590103'","'31102013522'","'31303588649'", "'30317268917'",
"'30825472052'","'835279'","'31102039372'","'30825477672'","'31102046245'",
"'30321363109'","'31102051190'","'31102061342'","'30321363877'","'31102061079'",
"'31411663457'", "'31107813412'"]
id_list = ",".join(ids)
flux_q = f"""SELECT id, delta_flux FROM {truth_schema}.{table}
WHERE (obshistid={obshistid})
AND id IN ({id_list});"""
print(flux_q)
with dbconn.cursor() as cursor:
cursor.execute(flux_q)
f_records = cursor.fetchall()
df_flux = pd.DataFrame(f_records, columns=['id', 'delta_flux'])
df_flux
Area searches#
The column coord
in table truth_summary
and columns with “coord” in their names in the observation summary table are of a special type earth
. The value is a triple of double precision numbers describing the position on a unit sphere corresponding to ra
and dec
. Indexes have been defined on these columns to speed up area searches.
The observation with obshistid=14 has desc dithered ra, dec = (1.69831745204, -0.59856) in radians. Find all observations within a radius (expressed in arcseconds).
Warning: All ra, dec in the observation summary table are in radians. ra and dec in truth_summary
are in degrees. format_cone_search
needs ra, dec in degrees.
radius = 500
ra_deg = np.degrees((1.69831745204,))[0]
dec_deg = np.degrees((-0.59856,))[0]
dec = -0.59856
cond = format_cone_search('descditheredcoord', ra_deg, dec_deg, radius)
obs_query = f"""SELECT obshistid,descditheredra,descdithereddec FROM {obs_schema}.summary
WHERE {cond}"""
# Uncomment the following to confirm that the query looks reasonable
#print(obs_query)
with dbconn.cursor() as cursor:
%time cursor.execute(obs_query)
records = cursor.fetchall()
df_obs = pd.DataFrame(records, columns=['obshistid', 'ra_radians', 'dec_radians'])
df_obs
Light curves#
Find length of light curves for variable stars near a particular location
# pick a location that probably gets lots of visits
# for (70.0, -30.0, 80) get 3; for (60.0, -30.0, 80) get 1
ra = 60.0 # 70.0
dec = -30.0
radius = 150
tbl_spec = f"""SELECT S.id, S.ra, S.dec, max(abs(V.delta_flux)),count(V.bandpass) AS visit_count
FROM {truth_schema}.truth_summary AS S JOIN
{truth_schema}.stellar_variability_truth AS V ON S.id=V.id """
where = "WHERE " + format_cone_search('S.coord', ra, dec, radius) + " AND S.is_variable=1 "
group_by = " GROUP BY S.id,S.ra,S.dec"
q = tbl_spec + where + group_by
# This takes a couple minutes to complete
with dbconn.cursor() as cursor:
%time cursor.execute(q)
records = cursor.fetchall()
df_lengths = pd.DataFrame(records, columns=['id', 'ra','dec', 'max_delta_flux','count'])
df_lengths
Similar to above, but this time don’t count visits. Get the delta_flux values instead
ra = 70.0
dec = -30.0
radius = 80
columns = ['S.id', 'ra', 'dec', 'bandpass', 'delta_flux']
col_list = (',').join(columns)
tbl_spec = f"""SELECT {col_list}
FROM {truth_schema}.truth_summary AS S JOIN
{truth_schema}.stellar_variability_truth AS V ON S.id=V.id """
where = "WHERE " + format_cone_search('S.coord', ra, dec, radius) + " and S.is_variable=1 "
q = tbl_spec + where
with dbconn.cursor() as cursor:
%time cursor.execute(q)
records_lc = cursor.fetchall()
df_cone_lcs = pd.DataFrame(records_lc, columns=columns)
df_cone_lcs
Plot light curves for one star#
Pick the second object from the results of the first query in this section (id=1568931714) since max_delta_flux is large
Get the data#
Get delta_flux and time values for the plot and some summary information about the star. Use ORDER BY
clause so that data are presented conveniently for plotting.
id = 1568931714
var_tbl = 'stellar_variability_truth'
lc_q = f"""SELECT bandpass,mjd,delta_flux FROM {truth_schema}.{var_tbl}
WHERE id='{id}' ORDER BY bandpass, mjd;"""
print(lc_q)
with dbconn.cursor() as cursor:
%time cursor.execute(lc_q)
lc_records = cursor.fetchall()
print(len(lc_records))
df_single_lc = pd.DataFrame(lc_records, columns=['bandpass','mjd','delta_flux'])
df_single_lc
sum_tbl = 'truth_summary'
sum_fluxes = ','.join([f"flux_{b}" for b in ['g', 'i', 'r', 'u', 'y', 'z']])
sum_q = f"""SELECT ra,dec,{sum_fluxes} FROM {truth_schema}.{sum_tbl}
WHERE id='{id}';"""
print(sum_q)
with dbconn.cursor() as cursor:
%time cursor.execute(sum_q)
sum_record = cursor.fetchone()
lc_ra = sum_record[0]
lc_dec = sum_record[1]
print(f'ra={lc_ra}, dec={lc_dec}')
Plotting#
from astropy.time import Time
def plot_band_lc(axes, times, fluxes, params):
out = axes.scatter(np.asarray(times), np.asarray(fluxes), **params)
def plot_level(axes, yvalue, params):
xmin, xmax = axes.get_xlim()
out = axes.plot(np.asarray([xmin, xmax]), np.asarray([yvalue, yvalue]), **params)
def format_title(id, ra, dec, band=None):
if band is None:
return f'Per-band light curves for star {id} at (ra,dec)=({ra}, {dec})'
else:
return f'Light curve for star {id}, band={band} at (ra,dec)=({ra}, {dec})'
def plot_object(title, the_data, band=None):
'''
Plot r, g and i light 'curves' (delta_flux as scatter plot) for an object
or plot only requested band
Parameters
-----------
title : string
the_data : data frame which must include columns filtername, obsstart, mag
'''
good_d = the_data[(np.isnan(the_data.delta_flux)==False)]
red_d = good_d[(good_d.bandpass=="r")]
green_d = good_d[(good_d.bandpass=="g")]
i_d = good_d[(good_d.bandpass=="i")]
#print("red data shape: ", red_e.shape, " green data shape: ", green_e.shape, " i data shape: ", i_e.shape)
fix, axes = plt.subplots(figsize=(12,8))
plt.title(title)
plt.xlabel('Julian date')
plt.ylabel('Delta flux')
params_r = {'marker' : 'o', 'label' : 'r band', 'color' : 'red'}
params_g = {'marker' : 'o', 'label' : 'g band', 'color' : 'green'}
params_i = {'marker' : 'o', 'label' : 'i band', 'color' : 'orange'}
#print('In plot_object printing i-band values')
#for ival in list(i_d['mag']): print(ival)
if band is None or band=='r':
plot_band_lc(axes, list(red_d['mjd']), list(red_d['delta_flux']), params_r)
if band is None or band=='g':
plot_band_lc(axes, list(green_d['mjd']), list(green_d['delta_flux']), params_g)
if band is None or band=='i':
plot_band_lc(axes, list(i_d['mjd']), list(i_d['delta_flux']), params_i)
#plot_level(axes, coadd_mag['r'], {'label' : 'r coadd mag', 'color' : 'red'})
#plot_level(axes, coadd_mag['g'], {'label' : 'g coadd mag', 'color' : 'green'})
#plot_level(axes, coadd_mag['i'], {'label' : 'i coadd mag', 'color' : 'orange'})
if band is None:
plt.legend()
for band in ('r','g','i'):
title = format_title(id, lc_ra, lc_dec, band)
plot_object(title, df_single_lc, band)
Is it in the object table?#
First get truth information for our chosen star, then try to find a match, restricting to point sources within a few arcseconds
truth_q = f"SELECT ra,dec,flux_g,flux_r,flux_i from {truth_schema}.truth_summary where id='{id}'"
print(truth_q)
with dbconn.cursor() as cursor:
%time cursor.execute(truth_q)
truth_records = cursor.fetchall()
#truth_records
print(len(truth_records))
truth_df = None
truth_df = pd.DataFrame(truth_records, columns=['ra', 'dec', 'flux_g', 'flux_r', 'flux_i'])
truth_df.shape
truth_df
cols = ['objectid','ra', 'dec', 'psflux_g','psflux_r', 'psflux_i']
col_spec = ','.join(cols)
obj_q = f"""SELECT {col_spec} from {object_schema}.dpdd_object WHERE extendedness < 0.5 AND """
radius = 20 # in arcseconds
obj_q += format_cone_search('coord', lc_ra, lc_dec, radius)
print(obj_q)
with dbconn.cursor() as cursor:
%time cursor.execute(obj_q)
obj_records = cursor.fetchall()
obj_df = pd.DataFrame(obj_records, columns=cols)
obj_df