DC2 Object Catalog Run2.2i GCR tutorial – Part IV: accessing photo-z

DC2 Object Catalog Run2.2i GCR tutorial – Part IV: accessing photo-z#

Owners: Yao-Yuan Mao @yymao, Sam Schmidt @sschmidt23

Last Verifed to Run: 2020-06-09 (by @sschmidt23)

This notebook will show you how to access the “add-on” columns that provide the photometric redshift (photo-z) information for the DC2 Object Catalog (Run 2.2i).

Learning objectives: After going through this notebook, you should be able to:

  1. Load and efficiently access a DC2 object catalog (+ photo-z) with the GCR

  2. Understand how the photo-z data are stored / represented

  3. Look at an example of galaxy photo-z distributions

Logistics: This notebook is intended to be run through the Jupyter Lab NERSC interface available here: https://jupyter.nersc.gov/. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC

Other notes: If you restart your kernel, or if it automatically restarts for some reason, all imports and variables will become undefined so, you will have to re-run everything.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import GCRCatalogs
from GCR import GCRQuery
GCRCatalogs.__version__

Load the catalog#

Loading the object catalog with photo-z add-on. The catalog name is dc2_object_run2.2i_with_photoz.

It takes a few seconds for the catalog instance to initiate.

cat = GCRCatalogs.load_catalog('dc2_object_run2.2i_dr3_with_photoz')

Photo-z access methods#

There are several photo-z related quantities available in the catalog, a summary of which can be found on this Confluence page: https://confluence.slac.stanford.edu/display/LSSTDESC/List+of+available+DC2+catalogs+created+by+PhotoZ

There are photo-z estimates in the form of both a single number “point estimate” for each galaxy, as well as a 1D redshift probability density function (PDF) representing the posterior probability of the galaxy being at a given redshift calculated on a specific redshift grid.

There are multiple single point estimates:

  1. photoz_mode: the mode of the redshift PDF, the highest peak of the posterior probability

  2. photoz_mean: the weighted mean of the redshift PDF.

  3. photoz_median: the redshift where the redshift CDF is equal to 0.5.

The redshift pdf is stored in the multi-valued column photoz-pdf. The grid of redshifts at which the posterior probability is evaluated is stored in the catalog with the special attribute of photoz_pdf_bin_centers. You can access this attribute for catalog cat with something like zgrid = cat.photoz_pdf_bin_centers

There are three additional columns that can be used as various quality flags:

  1. photoz_odds (see Benitez 2000) is a measure of the integrated amount or probaility within a fixed distance around photoz_mode. If the redshift posterior is single peaked and narrow this number will be close to 1.0, if the posterior is multi-peaked and/or broad it is likely to be smaller. Thus, high values of photoz_odds can be used as an indicator of photo-z quality.

  2. photoz_mode_ml_red_chi2 is the reduced chi-squared value for the maximum likelihood estimate of the best fit template at the photo-z mode. If this chi-squared value is very large, it indicates that none of the SED templates employed by the photo-z code were good fits to the observed colors, and thus the redshift may be suspect. High values may also occur for very bright galaxies where photometric errors are small and thus chi-squared values can grow large.

We will demonstrate access methods for several of these quantities in detail. You can notice that all the photo-z columns have a prefix of photoz_.

Let’s first make sure that these columns are indeed available.

sorted(q for q in cat.list_all_quantities() if q.startswith('photoz_'))

Let’s now try access the photo-z data! Everything you already about the GCR access of object catalogs will still apply. Including the use of filters and native_filters (native_filters is used for selecting tracts mostly). We will access the single tract 4850.

data = cat.get_quantities(['photoz_mode'], 
                          filters=['photoz_mode < 0.2', 'mag_i < 26'], 
                          native_filters=['tract==4850'])

# check if the filters work
print((data['photoz_mode'] < 0.2).all())

Now, if you want to make a plot of the PDF for a galaxy or galaxies, you will need to access the photoz_pdf column. Note that each entry stores an array, so use with care!

As an example, let’s just load one tract (using the return_iterator feature) of the full PDFs and also peak and mean values at the same time. The first few tracts have very few objects, some of which are only detected in a handful of bands, so we’ll load the 14th tract:

xcat = cat.get_quantities(['photoz_pdf', 'photoz_mode', 'photoz_mean'], return_iterator=True)
for i in range(14):
    data = next(xcat)

There are 684,634 objects in this tract, and there are 301 bins in the photo-z PDF. The photoz_pdf data is a 1D array has a shape of (684634,), but each entry stores an array of 301 values.

data['photoz_pdf'].shape

Now, let’s plot 10 example PDFs. We will plot every 100th PDF as this shows a bit more variety in PDF shapes. The PDFs were evaluated on a set grid of redshift values. For run2.2 this grid extends to z=3.0.
To get the array of bin center values, you can access the photoz_pdf_bin_centers attribute, as demonstrated below.
We overplot the photoz_mode and photoz_mean values as well.

fig, ax = plt.subplots(5, 2, figsize=(12,16))
for pdf, z_peak, z_mean, ax_this in zip(data['photoz_pdf'][::100], data['photoz_mode'][::100],
                                        data['photoz_mean'][::100], ax.flat):
    l = ax_this.plot(cat.photoz_pdf_bin_centers, pdf,label='p(z)');
    ax_this.axvline(z_peak, color=l[0].get_color(), ls=':', lw=1,label='photoz_mode');
    ax_this.axvline(z_mean,color='r',ls='--',lw=1,label='photoz_mean');
    ax_this.set_xlabel('$z$');
    ax_this.set_ylabel('$p(z)$');
    ax_this.legend(loc='upper right')

We see that photoz_mode does, indeed, correspond to the mode/peak of the posterior. photoz_mean lies at the weighted mean redshift. For multi-peaked posteriors this position can actually be in a location of relatively low probability between two peaks. We see a variety of PDF shapes: narrow unimodal; broad and/or tailed; and multimodal distributions. This is due to a combination of the galaxy’s observed colors and magnitude uncertainties.

Example#

Now that we have learned all the access methods, let’s try to work out an example!

First of all, let’s define a set of reasonable cuts to give us galaxies

cuts = [
    GCRQuery('extendedness > 0'),     # Extended objects
    GCRQuery((np.isfinite, 'mag_i')), # Select objects that have i-band magnitudes
    GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...) 
                       # and was not skipped by the deblender
    GCRQuery('snr_i_cModel > 20'),    # SNR > 20
    GCRQuery('snr_r_cModel > 20'),
    GCRQuery('snr_g_cModel > 20'),
    GCRQuery('mag_i_cModel < 23'),    # cModel imag brighter than 23
    GCRQuery('mag_i_cModel > 20')     # cModel imag fainter than 20 (exclude super bright objects)
]

Now let’s make some plots! Let’s compare the histogram of photoz_mode values to the sum of the individual PDF values, the “stacked” PDF being a common (but not statistically correct) way of estimating redshift distributions:

data = cat.get_quantities(['photoz_mode', 'mag_g_cModel', 'mag_r_cModel', 'mag_i_cModel','photoz_pdf'], filters=cuts, native_filters=['tract==4850'])

We can construct a rough estimate for the N(z) by summing the individual galaxy PDFs in photoz_pdf, and compare the results of this sum to the histogram of the single point photoz_mode values.

sumpdf = np.sum(data['photoz_pdf'],axis=0)
fig = plt.figure(figsize=(12,8))
plt.hist(data['photoz_mode'], 100, label="photo-z mode");
plt.plot(cat.photoz_pdf_bin_centers, sumpdf*3.,label="summed $p(z)$",lw=2,c='r');
plt.xlabel("redshift");
plt.legend(loc='upper right');

The photoz_mode and photoz_pdf shapes roughly agree, though we see a more smooth distribution with the full posteriors. We also see a reduction of the anomalous high redshift features that appear with photoz_mode, where the posteriors are multi-peaked: using the full PDFs properly puts some of this probability at lower redshift rather than assigning to the single high redshift value.

Let’s also plot a color-color diagram and color code by the photoz_mode value

fig = plt.figure(figsize=(10,8))
plt.scatter(data['mag_g_cModel'] - data['mag_r_cModel'],
            data['mag_r_cModel'] - data['mag_i_cModel'],
            c=data['photoz_mode'], s=4, vmin=0, vmax=2.5,cmap='inferno');

plt.xlim(-1, 3);
plt.ylim(-0.5, 2);
plt.xlabel('$g-r$',fontsize=18);
plt.ylabel('$r-i$',fontsize=18);
plt.colorbar(label='photoz_mode');

# Food for thought: Look at how the photo-z values are distributed in this color-color space. Is this behavior expected?

We can see a strong correlation between redshift and position in color space, however the colors are determined by both the SED shape and redshift, so we also see degenerate areas, particularly at the blue end near color = 0.0 where low and high redshift solutions are close in the observed color space.

Now let’s examine a tomographic slice selected in terms of photoz_mode:

bin_cut = GCRQuery('photoz_mode > 0.6', 'photoz_mode < 0.8').mask(data)
sumpdf_bin = np.sum(data['photoz_pdf'][bin_cut],axis=0)
fig = plt.figure(figsize=(10,5))
plt.hist(data['photoz_mode'][bin_cut], bins = np.arange(0.0,3.01,.01), label=r'photo-z mode, $0.6 < z_{\rm mode} < 0.8$');
plt.plot(cat.photoz_pdf_bin_centers, sumpdf_bin,c='r', label=r'summed $p(z)$, $0.6 < z_{\rm mode} < 0.8$');
plt.xlabel("redshift",fontsize=15);
plt.ylabel("Number",fontsize=15)
plt.legend(loc='upper right');
plt.xlim(0.25,1.25)

We see that, while the photoz_mode values have been selected in a narrow range, the summed PDF values show “tails” that extend beyond the nominal bin. This illustrates the uncertain values in photo-z estimates that are not easily summed up in a single number like photoz_mode.

Let’s look at the location of galaxies within this tomographic bin in color space:

fig = plt.figure(figsize=(10,8))
plt.scatter(data['mag_g_cModel'] - data['mag_r_cModel'],
            data['mag_r_cModel'] - data['mag_i_cModel'],
            c='k',s=2,alpha=0.8,label='all galaxies');
plt.scatter(data['mag_g_cModel'][bin_cut] - data['mag_r_cModel'][bin_cut],
            data['mag_r_cModel'][bin_cut] - data['mag_i_cModel'][bin_cut],
            c=data['photoz_mode'][bin_cut], s=10, vmin=0, vmax=2.5, cmap='inferno',
            label='$0.6<z\_mode<0.8$');

plt.xlim(-1, 3);
plt.ylim(-0.5, 2);
plt.xlabel('$g-r$',fontsize=18);
plt.ylabel('$r-i$',fontsize=18);
plt.colorbar(label='photoz_mode');
plt.legend(loc='upper left')

We see that the range of colors for this tomographic bin is restricted compared to the overall distribution. Users may want to modify bin definitions by looking at color distributions to avoid common degeneracies. As an example of such a degenracy, let’s attempt to select galaxies with photoz_mode between 2.0 and 2.5:

bin_cut = GCRQuery('photoz_mode > 2.0', 'photoz_mode < 2.5').mask(data)
sumpdf_bin = np.sum(data['photoz_pdf'][bin_cut],axis=0)
fig = plt.figure(figsize=(10,5))
plt.hist(data['photoz_mode'][bin_cut], bins = np.arange(0.0,3.01,.01), label=r'photo-z mode, $0.6 < z_{\rm mode} < 0.8$');
plt.plot(cat.photoz_pdf_bin_centers, sumpdf_bin,c='r', label=r'summed $p(z)$, $0.6 < z_{\rm mode} < 0.8$');
plt.xlabel("redshift",fontsize=18);
plt.ylabel("Number",fontsize=18)
plt.legend(loc='upper right');
#plt.xlim(0.25,1.25)

We see that the sum of the PDFs indicates that there are potential degenerate solutions at both z~0.2 and 1.4 with non-negligible probability. We also see that the relative height fo the two peaks in the 2<z<2.5 bin is rather different between the photoz_mode histogram and the sum of photoz_pdf. Let’s plot the colors of these galaxies:

fig = plt.figure(figsize=(10,8))
plt.scatter(data['mag_g_cModel'] - data['mag_r_cModel'],
            data['mag_r_cModel'] - data['mag_i_cModel'],
            c='k',s=2,alpha=0.8,label='all galaxies');
plt.scatter(data['mag_g_cModel'][bin_cut] - data['mag_r_cModel'][bin_cut],
            data['mag_r_cModel'][bin_cut] - data['mag_i_cModel'][bin_cut],
            c=data['photoz_mode'][bin_cut], s=10, vmin=0, vmax=2.5, cmap='inferno',
            label='$2.0<z\_mode<2.5$');

plt.xlim(-1, 3);
plt.ylim(-0.5, 2);
plt.xlabel('$g-r$',fontsize=18);
plt.ylabel('$r-i$',fontsize=18);
plt.colorbar(label='photoz_mode');
plt.legend(loc='upper left')

We see that galaxies with photoz_mode between 2.0 and 2.5 reside in several areas of color space, including an area of potential degeneracy near g-r ~ r-i ~ 0 where galaxies at both low and high redshift reside. This is not uncommon for high redshift galaxies, which tend to have very blue, power-law-like SEDs, which tend to have blue colors at nearly all redshifts, and are thus prone to degeneracies.