API

sndata

Module containing Dataset classes. These read in data from various sources and turn the light curves into astropy tables that can be read by the rest of the code.

class snmachine.sndata.Dataset(folder, subset='none', filter_set=['desg', 'desr', 'desi', 'desz'])[source]

Class to manage the files from a single dataset. The base class works with data from the SPCC. This class can be inherited and overridden to work with a completely different kind of dataset. All this class really needs is a list of object names and a method, get_lightcurve, which takes an individual object name and returns a light curve. Other functions provided here are for plotting and convenience.

get_lightcurve(flname)[source]

Given a filename, returns a light curve astropy table that conforms with sncosmo requirements

Parameters:flname (str) – The filename of the supernova (relative to data_root)
Returns:Light curve
Return type:astropy.table.Table
get_max_length()[source]

Gets the length (in days) of the longest observation in the dataset.

get_object_names(subset='none')[source]

Gets a list of the names of the files within the dataset.

Parameters:subset (str or list-like, optional) – Used to specify which files you want. Currently, get_object_names accepts a list of indices, a list of actual object names, or the keyword ‘spectro’.
get_redshift()[source]

Returns a list of the redshifts of the entire dataset.

Returns:Array of redshifts
Return type:~numpy.ndarray
get_types()[source]

Returns a list of the types of the entire dataset.

Returns:Array of types
Return type:~numpy.ndarray
plot_all(plot_model=True)[source]

Plots all the supernovae in the dataset and allows the user to cycle through them with the left and right arrow keys.

Parameters:plot_model (bool, optional) – Whether or not to overplot the model.
plot_lc(fname, plot_model=True, title=True, loc='best')[source]

Public function to plot a single light curve.

Parameters:
  • fname (str) – The filename of the supernova (relative to data_root)
  • plot_model (bool, optional) – Whether or not to overplot the model
  • title (str, optional) – Put a title on the plot
  • loc (str, optional) – Location of the legend
reduced_chi_squared(subset='none')[source]

Returns the reduced chi squared for each object, once a model has been set.

Parameters:subset (str or list-like, optional) – List of a subset of object names. If not supplied, the full dataset will be used
Returns:Dictionary of reduced chi^2 for each object
Return type:dict
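
For orientation, the quantity reported for each object can be sketched with NumPy as below. The helper name `reduced_chi2`, its arguments, and the choice of degrees of freedom (number of points minus an optional parameter count) are assumptions for illustration and may differ from what snmachine does internally.

```python
import numpy as np

def reduced_chi2(flux, flux_err, model_flux, n_params=0):
    """Hypothetical helper: chi-squared per degree of freedom for one light curve."""
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    model_flux = np.asarray(model_flux, dtype=float)
    # Sum of squared, error-weighted residuals between data and model
    chi2 = np.sum(((flux - model_flux) / flux_err) ** 2)
    # Assumption: degrees of freedom = number of points - number of fitted parameters
    dof = flux.size - n_params
    return chi2 / dof

# Toy data: observed fluxes, 1-sigma errors, and model predictions
obs = [10.0, 12.0, 9.0, 11.0]
err = [1.0, 1.0, 1.0, 1.0]
mod = [11.0, 11.0, 10.0, 10.0]
result = reduced_chi2(obs, err, mod)
```

Each of the four toy residuals is exactly one sigma, so the reduced chi-squared comes out as 1.
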
set_model(fit_sn, *args)[source]

Can use any function to set the model for all objects in the data.

Parameters:
  • fit_sn (function) – A function which can take a light curve (astropy table) argument and a list of arguments and returns an astropy table
  • args (list, optional) – Whatever arguments fit_sn requires
sim_stats(**kwargs)[source]

Prints information about the survey/simulation.

Parameters:
  • indices (list-like, optional) – List of indices to indicate which objects to consider. This allows you to, for example, see the statistics of a training subsample.
  • plot_redshift (bool, optional) – Plots a histogram of the redshift distribution
class snmachine.sndata.OpsimDataset(folder, subset='none', mix=False, filter_set=['lsstu', 'lsstg', 'lsstr', 'lssti', 'lsstz', 'lssty'])[source]

Class to read in an LSST simulated dataset, based on OpSim runs and SNANA simulations.

get_data(folder, subset='none')[source]

Reads in the simulated data

Parameters:
  • folder (str) – Folder where simulations are located
  • subset (str or list-like, optional) – List of a subset of object names. If not supplied, the full dataset will be used
get_lightcurve(tab)[source]

Converts the sncosmo convention for the astropy tables to snmachine’s.

Parameters:tab (astropy.table.Table) – Light curve
class snmachine.sndata.SDSS_Data(folder, subset='none', training_only=False, filter_set=['sdssu', 'sdssg', 'sdssr', 'sdssi', 'sdssz'], subset_length=False, classification='none')[source]

Class to read in the SDSS supernovae dataset

get_SNe(subset_length)[source]

Function to take all supernovae from Master SDSS data file and return a random sample of SNe of user-specified length if requested

Parameters:subset_length (int) – Number of objects to return
Returns:List of object names
Return type:list-like
get_info(flname)[source]

Function which takes the file name of a supernova and returns a dictionary of spectroscopic and photometric redshifts and their errors (when available), as well as the type of the supernova

Parameters:flname (str) – Name of object
Returns:Redshift, redshift error, type
Return type:list-like
get_lightcurve(flname)[source]

Given a filename, returns a light curve astropy table that conforms with sncosmo requirements

Parameters:flname (str) – The filename of the supernova (relative to data_root)
Returns:Light curve
Return type:astropy.table.Table
get_object_names(subset='none', subset_length=False, classification='none')[source]

Gets a list of the names of the files within the dataset.

Parameters:
  • subset (str or list-like, optional) – List of a subset of object names. If not supplied, the full dataset will be used
  • subset_length (bool or int, optional) – Number of objects to return (False to return all)
  • classification (str, optional) – Can specify a particular type of supernova to return (‘none’ for all types)
Returns:Object names
Return type:list-like
get_photo(subset_length, classification)[source]

Function to take all purely photometric supernovae from Master file and return a random sample of SNe of user-specified length if requested

Parameters:
  • subset_length (bool or int) – Number of objects to return (False to return all)
  • classification (str) – Can specify a particular type of supernova to return (‘none’ for all types)
Returns:List of object names
Return type:list-like
get_spectro(subset_length, classification)[source]

Function to take all spectroscopically confirmed supernovae from Master file and return a random sample of SNe of user-specified length if requested

Parameters:
  • subset_length (bool or int) – Number of objects to return (False to return all)
  • classification (str) – Can specify a particular type of supernova to return (‘none’ for all types)
Returns:

List of object names

Return type:

list-like

class snmachine.sndata.SDSS_Simulations(folder, subset='none', training_only=False, filter_set=['sdssu', 'sdssg', 'sdssr', 'sdssi', 'sdssz'], subset_length=False, classification='none')[source]

Class to read in the SDSS simulations dataset

get_data(subset='none', subset_length=False, classification='none')[source]

Function to get all data in same form as SDSS Data

get_lightcurve(lc)[source]

Given a filename, returns a light curve astropy table that conforms with sncosmo requirements

Parameters:flname (str) – The filename of the supernova (relative to data_root)
Returns:Light curve
Return type:astropy.table.Table
snmachine.sndata.plot_lc(lc)[source]

External function to plot light curves.

Parameters:lc (astropy.table.Table) – Light curve

snfeatures

Module for feature extraction on supernova light curves.

class snmachine.snfeatures.Features[source]

Base class to define basic functionality for extracting features from supernova datasets. Users are not restricted to inheriting from this class, but any Features class must contain the functions extract_features and fit_sn.

convert_astropy_array(tab)[source]

Convenience function to convert an astropy table (floats only) into a numpy array.

goodness_of_fit(d)[source]

Test (for any feature set) how well the reconstruction from the features fits each of the objects in the dataset.

Parameters:d (Dataset) – Dataset object.
Returns:Table with the reduced Chi2 for each object
Return type:astropy.table.Table
posterior_predictor(lc, nparams, chi2)[source]

Computes posterior predictive p-value to see if the model fits sufficiently well. *UNTESTED*

Parameters:
  • lc (astropy.table.Table) – Light curve
  • nparams (int) – The number of parameters in the model. For the wavelets, this will be the number of PCA coefficients. For the parametric models, this will be the number of parameters per model multiplied by the number of filters. For the template models, this is simply the number of parameters in the model.
  • chi2 (array-like) – An array of chi2 values for each set of parameter space in the parameter samples. This is easy to obtain as -2*loglikelihood output from a multinest or mcmc chain. For features such as the wavelets, this will have to be separately calculated by drawing thousands of curves consistent with the coefficients and their errors and then computing the chi2.
Returns:

The posterior predictive p-value. If this number is too close to 0 or 1 it implies the model is a poor fit.

Return type:

float

class snmachine.snfeatures.ParametricFeatures(model_choice, sampler='leastsq', limits=None)[source]

Fits a few options of generalised, parametric models to the data.

extract_features(d, chain_directory='chains', save_output=True, n_attempts=20, nprocesses=1, n_walkers=100, n_steps=500, walker_spread=0.1, burn=50, nlp=1000, starting_point=None, convert_to_binary=True, n_iter=0, restart=False, seed=-1)[source]

Fit parametric models and return best-fitting parameters as features.

Parameters:
  • d (Dataset object) – Dataset
  • chain_directory (str) – Where to save the chains
  • save_output (bool) – Whether or not to save the intermediate output (if Bayesian inference is used instead of least squares)
  • n_attempts (int) – Allow the minimiser to start in new random locations if the fit is bad. Put n_attempts=1 to fit only once with the default starting position.
  • nprocesses (int, optional) – Number of processors to use for parallelisation (shared memory only)
  • n_walkers (int) – emcee parameter - number of walkers to use
  • n_steps (int) – emcee parameter - total number of steps
  • walker_spread (float) – emcee parameter - standard deviation of distribution of starting points of walkers
  • burn (int) – emcee parameter - length of burn-in
  • nlp (int) – multinest parameter - number of live points
  • starting_point (None or array-like) – Starting points of parameters for leastsq or emcee
  • convert_to_binary (bool) – multinest parameter - whether or not to convert ascii output files to binary
  • n_iter (int) – leastsq parameter - number of iterations to avoid local minima
  • restart (bool) – Whether or not to restart from existing multinest chains
Returns:

Best-fitting parameters

Return type:

astropy.table.Table

fit_sn(lc, features)[source]

Fits the chosen parametric model to a given light curve.

Parameters:
  • lc (astropy.table.Table) – Light curve
  • features (astropy.table.Table) – Model parameters
Returns:

Fitted light curve

Return type:

astropy.table.Table

lnprob_emcee(params, x, y, yerr)[source]

Likelihood function for emcee

Parameters:
  • params
  • x
  • y
  • yerr
run_emcee(d, obj, save_output, chain_directory, n_walkers, n_steps, walker_spread, burn, starting_point, seed=-1)[source]

Runs emcee on all the filter bands of a given light curve, fitting the model to each one and extracting the best fitting parameters.

Parameters:
  • d (Dataset object) – Dataset
  • obj (str) – Object name
  • save_output (bool) – Whether or not to save the intermediate output
  • chain_directory (str) – Where to save the chains
  • n_walkers (int) – emcee parameter - number of walkers to use
  • n_steps (int) – emcee parameter - total number of steps
  • walker_spread (float) – emcee parameter - standard deviation of distribution of starting points of walkers
  • burn (int) – emcee parameter - length of burn-in
  • starting_point (None or array-like) – Starting points of parameters
Returns:

Best fitting parameters (at the maximum posterior)

Return type:

astropy.table.Table

class snmachine.snfeatures.TemplateFeatures(model=['Ia'], sampler='leastsq', lsst_bands=False, lsst_dir='../lsst_bands/')[source]

Calls sncosmo to fit a variety of templates to the data. The number of features will depend on the templates chosen (e.g. salt2, nugent2p etc.)

extract_features(d, save_chains=False, chain_directory='chains', use_redshift=False, nprocesses=1, restart=False, seed=-1)[source]

Extract template features for a dataset.

Parameters:
  • d (Dataset object) – Dataset
  • save_chains (bool) – Whether or not to save the intermediate output (if Bayesian inference is used instead of least squares)
  • chain_directory (str) – Where to save the chains
  • use_redshift (bool) – Whether or not to use provided redshift when fitting objects
  • nprocesses (int, optional) – Number of processors to use for parallelisation (shared memory only)
  • restart (bool) – Whether or not to restart from multinest chains
Returns:

Table of fitted model parameters.

Return type:

astropy.table.Table

fit_sn(lc, features)[source]

Fits the chosen template model to a given light curve.

Parameters:
  • lc (astropy.table.Table) – Light curve
  • features (astropy.table.Table) – Model parameters
Returns:

Fitted light curve

Return type:

astropy.table.Table

registerBands(dirname, prefix=None, suffix=None)[source]

Register LSST bandpasses with sncosmo. Courtesy of Rahul Biswas

class snmachine.snfeatures.WaveletFeatures(wavelet='sym2', ngp=100, **kwargs)[source]

Uses wavelets to decompose the data and then reduces dimensionality of the feature space using PCA.

GP(obj, d, ngp=200, xmin=0, xmax=170, initheta=[500, 20], gpalgo='george')[source]

Fit a Gaussian process curve at specific evenly spaced points along a light curve.

Parameters:
  • obj (str) – Object name
  • d (Dataset object) – Dataset
  • ngp (int, optional) – Number of points to evaluate Gaussian Process at
  • xmin (float, optional) – Minimum time to evaluate at
  • xmax (float, optional) – Maximum time to evaluate at
  • initheta (list-like, optional) – Initial values for theta parameters. These should be roughly the scale length in the y & x directions.
Returns:

Table with evaluated Gaussian process curve and errors

Return type:

astropy.table.Table

Notes

Wraps internal module-level function in order to circumvent multiprocessing module limitations in dealing with objects when parallelising.

best_coeffs(vals, tol=0.98)[source]

Determine the minimum number of PCA components required to adequately describe the dataset.

Parameters:
  • vals (list-like) – List of eigenvalues (ordered largest to smallest)
  • tol (float, optional) – How much ‘energy’ or information must be retained in the dataset.
Returns:

The required number of coefficients to retain the requested amount of “information”.

Return type:

int
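
A minimal sketch of this kind of selection, assuming that "energy" means the cumulative fraction of the eigenvalue sum. The helper name `n_components_for_tolerance` is hypothetical and the code is an illustration with NumPy, not snmachine's own implementation.

```python
import numpy as np

def n_components_for_tolerance(vals, tol=0.98):
    """Hypothetical helper: smallest number of leading eigenvalues whose
    cumulative sum reaches a fraction `tol` of the total."""
    vals = np.asarray(vals, dtype=float)          # assumed sorted largest first
    cumulative = np.cumsum(vals) / np.sum(vals)   # fraction retained after each component
    # Index of the first entry >= tol, converted to a count of components
    n = int(np.searchsorted(cumulative, tol)) + 1
    return min(n, vals.size)                      # guard against rounding at tol=1

eigenvalues = [6.0, 3.0, 0.5, 0.5]
n_keep = n_components_for_tolerance(eigenvalues, tol=0.85)
```

With these toy eigenvalues the cumulative fractions are 0.6, 0.9, 0.95 and 1.0, so a tolerance of 0.85 is met by the first two components.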

extract_GP(d, ngp, xmin, xmax, initheta, save_output, output_root, nprocesses, gpalgo='george')[source]

Runs Gaussian process code on entire dataset. The result is stored inside the models attribute of the dataset object.

Parameters:
  • d (Dataset object) – Dataset
  • ngp (int) – Number of points to evaluate Gaussian Process at
  • xmin (float) – Minimum time to evaluate at
  • xmax (float) – Maximum time to evaluate at
  • initheta (list-like) – Initial values for theta parameters. These should be roughly the scale length in the y & x directions.
  • save_output (bool) – Whether or not to save the output
  • output_root (str) – Output directory
  • nprocesses (int, optional) – Number of processors to use for parallelisation (shared memory only)
extract_features(d, initheta=[500, 20], save_output='none', output_root='features', nprocesses=1, restart='none', gpalgo='george', xmin=None, xmax=None)[source]

Applies a wavelet transform followed by PCA dimensionality reduction to extract wavelet coefficients as features.

Parameters:
  • d (Dataset object) – Dataset
  • initheta (list-like, optional) – Initial values for theta parameters. These should be roughly the scale length in the y & x directions.
  • save_output (bool, optional) – Whether or not to save the output
  • output_root (str, optional) – Output directory
  • nprocesses (int, optional) – Number of processors to use for parallelisation (shared memory only)
  • restart (str, optional) – Either ‘none’ to start from scratch, ‘gp’ to restart from saved Gaussian processes, or ‘wavelet’ to restart from saved wavelet decompositions (will look in output_root for the previously saved outputs).
  • log (bool, optional) – Whether or not to take the logarithm of the final PCA components. Recommended setting is False (legacy code).
Returns:

Table of features (first column object names, the rest are the PCA coefficient values)

Return type:

astropy.table.Table

extract_pca(object_names, wavout)[source]

Dimensionality reduction of wavelet coefficients using PCA.

Parameters:
  • object_names (list-like) – Object names corresponding to each row of the wavelet coefficient array.
  • wavout (array) – Wavelet coefficient array, each row corresponds to an object, each column is a coefficient.
  • log
Returns:

Astropy table containing PCA features.

Return type:

astropy.table.Table

extract_wavelets(d, wav, mlev, nprocesses, save_output, output_root)[source]

Perform wavelet decomposition on all objects in dataset. Output is stored as astropy table for each object.

Parameters:
  • d (Dataset object) – Dataset
  • wav (str or swt.Wavelet object) – Which wavelet family to use
  • mlev (int) – Max depth
  • nprocesses (int, optional) – Number of processors to use for parallelisation (shared memory only)
  • save_output (bool, optional) – Whether or not to save the output
  • output_root (str, optional) – Output directory
Returns:

  • wavout (array) – A numpy array of the wavelet coefficients where each row is an object and each column a different coefficient
  • wavout_err (array) – A numpy array storing the (assuming Gaussian) error on each coefficient.

fit_sn(lc, comps, vec, mn, xmin, xmax, filter_set)[source]

Fits a single object using previously run PCA components. Performs the full inverse wavelet transform.

Parameters:
  • lc (astropy.table.Table) – Light curve
  • comps (astropy.table.Table) – The PCA coefficients for each object (i.e. the astropy table of wavelet features returned by extract_features).
  • vec (array-like) – PCA component vectors as array (each column is a vector, ordered from most to least significant)
  • mn (array-like) – Mean vector
  • xmin (float) – The minimum on the x axis (as defined for the original GP decomposition)
  • xmax (float) – The maximum on the x axis (as defined for the original GP decomposition)
  • filter_set (list-like) – The full set of filters of the original dataset
Returns:

Fitted light curve

Return type:

astropy.table.Table

iswt(coefficients, wavelet)[source]

Performs inverse wavelet transform. M. G. Marino to complement pyWavelets’ swt.

Parameters:
  • coefficients (array) – approx and detail coefficients, arranged in level value exactly as output from swt: e.g. [(cA1, cD1), (cA2, cD2), …, (cAn, cDn)]
  • wavelet (str or swt.Wavelet) – Either the name of a wavelet or a Wavelet object
Returns:

The inverse transformed array

Return type:

array

pca(X)[source]

Performs PCA decomposition of a feature array X.

Parameters:X (array) – Array of features to perform PCA on.
Returns:
  • vals (list-like) – Ordered array of eigenvalues
  • vec (array) – Ordered array of eigenvectors, where each column is an eigenvector.
  • mn (array) – The mean of the dataset, which is subtracted before PCA is performed.

Notes

Although SVD is considerably more efficient than eigh, it seems more numerically unstable and results in many more components being required to adequately describe the dataset, at least for the wavelet feature sets considered.
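
The eigh-based approach mentioned in the note can be sketched as follows. This is an illustrative NumPy version under the assumption that the covariance matrix of the mean-subtracted data is what gets decomposed; the function name `pca_eigh` is hypothetical and the details may differ from snmachine's code.

```python
import numpy as np

def pca_eigh(X):
    """Sketch of PCA by eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    mn = X.mean(axis=0)                   # mean of each feature (column)
    centred = X - mn                      # subtract the mean before PCA
    cov = np.cov(centred, rowvar=False)   # shape (n_features, n_features)
    vals, vec = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]        # reorder so the largest comes first
    return vals[order], vec[:, order], mn

# Toy data: 50 objects, 3 features with very different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([5.0, 1.0, 0.1])
vals, vec, mn = pca_eigh(X)
```

Each column of `vec` is an eigenvector, matching the ordering convention described in the return values above.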

project_pca(X, eig_vec)[source]

Project a vector onto a PCA axis (i.e. transform data to PCA space).

Parameters:
  • X (array) – Vector of original data (for one object).
  • eig_vec (array) – Array of eigenvectors, first column most significant.
Returns:

Coefficients of the eigenvectors

Return type:

array
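
As a rough sketch of what projecting onto PCA axes means, the example below projects one vector onto a set of orthonormal eigenvectors and reconstructs it from a truncated set of coefficients. The helpers `project` and `reconstruct` are hypothetical, and whether the mean is subtracted inside or outside project_pca is an assumption here.

```python
import numpy as np

def project(x, eig_vec, mn):
    """Hypothetical helper: coefficients of the mean-subtracted vector x
    in the basis given by the columns of eig_vec."""
    return (np.asarray(x, dtype=float) - mn) @ eig_vec

def reconstruct(coeffs, eig_vec, mn):
    """Hypothetical helper: map coefficients back to the original space."""
    return coeffs @ eig_vec.T + mn

# Orthonormal 2-d basis (a 45 degree rotation); columns are the eigenvectors
c = np.sqrt(0.5)
eig_vec = np.array([[c, -c],
                    [c,  c]])
mn = np.array([1.0, 1.0])
x = np.array([3.0, 1.0])

coeffs = project(x, eig_vec, mn)
full = reconstruct(coeffs, eig_vec, mn)                    # uses both components
approx = reconstruct(coeffs[:1], eig_vec[:, :1], mn)       # first component only
```

Keeping all components recovers the input exactly, while truncating to the leading component gives the closest point along that axis.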

restart_from_gp(d, output_root)[source]

Allows the feature extraction process to be restarted from previously saved Gaussian Process curves.

Parameters:
  • d (Dataset object) – The same dataset (object) on which the previous GP analysis was performed.
  • output_root (str) – Location of GP objects
restart_from_wavelets(d, output_root)[source]

Allows the feature extraction process to be restarted from previously saved wavelet decompositions. This allows you to quickly try different dimensionality reduction (e.g. PCA) algorithms on the wavelets.

Parameters:
  • d (Dataset object) – The same dataset (object) on which the previous wavelet analysis was performed.
  • output_root (str) – Location of previously decomposed wavelet coefficients
Returns:

  • wavout (array) – A numpy array of the wavelet coefficients where each row is an object and each column a different coefficient
  • wavout_err (array) – A similar numpy array storing the (assuming Gaussian) error on each coefficient.

wavelet_decomp(lc, wav, mlev)[source]

Perform a wavelet decomposition on a single light curve.

Parameters:
  • lc (astropy.table.Table) – Light curve
  • wav (str or swt.Wavelet object) – Which wavelet family to use
  • mlev (int) – Max depth
Returns:

Decomposed coefficients in each filter.

Return type:

astropy.table.Table

snmachine.snfeatures.get_MAP(chain_name)[source]

Read maximum posterior parameters from a stats file of multinest.

Parameters:chain_name (str) – Root for the chain files
Returns:Best-fitting parameters
Return type:list-like
snmachine.snfeatures.output_time(tm)[source]

Simple function to output the time nicely formatted.

Parameters:tm – Input time in seconds.

parametric models

Module for parametric models for use in snfeatures module

class snmachine.parametric_models.KarpenkaModel(**kwargs)[source]

Parametric model as implemented in Karpenka et al. (http://arxiv.org/abs/1208.1264)

evaluate(t, params)[source]

Evaluate the function at given values of t

Parameters:
  • t (ndarray) – The time steps over which to evaluate (starting at 0)
  • params (list-like) – The parameters of the model
Returns:

Function values evaluated at t

Return type:

ndarray
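
For readers who want a feel for the curve shape, the sketch below evaluates the functional form published in Karpenka et al., as best recalled here: an exponential decline modulated by a sigmoid rise and a quadratic term. The function name, argument names and argument order are assumptions and may not match the `params` ordering that KarpenkaModel.evaluate expects, so check against the paper and the source before relying on it.

```python
import numpy as np

def karpenka_flux(t, A, B, t1, t0, T_rise, T_fall):
    """Illustrative flux model (parameter order is an assumption):
    f(t) = A * [1 + B (t - t1)^2] * exp(-(t - t0)/T_fall) / (1 + exp(-(t - t0)/T_rise))
    """
    t = np.asarray(t, dtype=float)
    rise = 1.0 + np.exp(-(t - t0) / T_rise)   # sigmoid turn-on around t0
    fall = np.exp(-(t - t0) / T_fall)         # exponential decline after t0
    return A * (1.0 + B * (t - t1) ** 2) * fall / rise

# Evaluate on a grid of time steps starting at 0
times = np.linspace(0.0, 100.0, 11)
flux = karpenka_flux(times, A=100.0, B=0.0, t1=10.0, t0=20.0, T_rise=5.0, T_fall=30.0)
```

With B set to zero, the value at t = t0 is A/2, which gives a quick sanity check on the implementation.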

class snmachine.parametric_models.NewlingModel(**kwargs)[source]

Parametric model as implemented in Newling et al. (http://arxiv.org/abs/1010.1005)

evaluate(t, params)[source]

Evaluate the function at given values of t

Parameters:
  • t (ndarray) – The time steps over which to evaluate (starting at 0)
  • params (list-like) – The parameters of the model
Returns:

Function values evaluated at t

Return type:

ndarray

tsne plot

Utility script for making nice t-SNE plots (https://lvdmaaten.github.io/tsne/)

snmachine.tsne_plot.get_tsne(feats, objs, perplexity=100, seed=-1)[source]

Return the transformed features running the sklearn t-SNE code.

Parameters:
  • feats (astropy.table.Table) – Input features
  • objs (list) – Subset of objects to run on (t-SNE is slow for large numbers, 2000 randomly selected objects is a good compromise)
  • perplexity (float, optional) – t-SNE parameter which controls (roughly speaking) how sensitive the t-SNE plot is to small details
Returns:

Xfit – Transformed, embedded 2-d features

Return type:

array

snmachine.tsne_plot.plot(feats, types, objs=[], seed=-1)[source]

Convenience function to run t-SNE and plot

Parameters:
  • feats (astropy.table.Table) – Input features
  • types (array) – Types of the supernovae (to colour the points appropriately)
  • objs (list) – Subset of objects to run on (t-SNE is slow for large numbers, 2000 randomly selected objects is a good compromise)
snmachine.tsne_plot.plot_tsne(Xfit, types, loc='best')[source]

Plot the resulting t-SNE embedded features.

Parameters:
  • Xfit (array) – Transformed, embedded 2-d features
  • types (array) – Types of the supernovae (to colour the points appropriately)
  • loc (str, optional) – Location of the legend in the plot

snclassifier

Utility module mostly wrapping sklearn functionality and providing utility functions.

snmachine.snclassifier.F1(pr, Yt, true_class, full_output=False)[source]

Calculate an F1 score for many probability threshold increments and select the best one.

Parameters:
  • pr (array) – An array of probability scores, either a 1d array of size N_samples or an nd array, in which case the column corresponding to the true class will be used.
  • Yt (array) – An array of class labels, of size (N_samples,)
  • true_class (int) – which class is taken to be the “true class” (e.g. Ia vs everything else)
  • full_output (bool, optional) – If True, returns two vectors (F1 as a function of threshold, and the thresholds) instead of the best value.
Returns:

  • best_F1 (float) – (If full_output=False) The largest F1 value
  • best_threshold (array) – (If full_output=False) The probability threshold corresponding to best_F1
  • f1 (array) – (If full_output=True) F1 as a function of threshold.
  • threshold (array) – (If full_output=True) Vector of thresholds (from 0 to 1)

snmachine.snclassifier.FoM(pr, Yt, true_class=1, full_output=False)[source]

Calculate a Kessler FoM for many probability threshold increments and select the best one.

FoM is defined as: FoM = TP^2/((TP+FN)(TP+3*FP))

Parameters:
  • pr (array) – An array of probability scores, either a 1d array of size N_samples or an nd array, in which case the column corresponding to the true class will be used.
  • Yt (array) – An array of class labels, of size (N_samples,)
  • true_class (int) – which class is taken to be the “true class” (e.g. Ia vs everything else)
  • full_output (bool, optional) – If True, returns two vectors (FoM as a function of threshold, and the thresholds) instead of the best value.
Returns:

  • best_FoM (float) – (If full_output=False) The largest FoM value
  • best_threshold (array) – (If full_output=False) The probability threshold corresponding to best_FoM
  • fom (array) – (If full_output=True) FoM as a function of threshold.
  • threshold (array) – (If full_output=True) Vector of thresholds (from 0 to 1)
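
The formula above can be checked on toy data with a short NumPy sketch that evaluates the FoM at a single threshold. The helper name `fom_at_threshold` is hypothetical, and the sketch assumes a 1d array of probabilities for the true class.

```python
import numpy as np

def fom_at_threshold(pr, Yt, true_class, threshold):
    """Hypothetical helper: FoM = TP^2 / ((TP + FN) * (TP + 3 * FP)) at one threshold."""
    pr = np.asarray(pr, dtype=float)
    is_true = np.asarray(Yt) == true_class
    predicted = pr >= threshold
    tp = np.sum(predicted & is_true)      # true positives
    fp = np.sum(predicted & ~is_true)     # false positives
    fn = np.sum(~predicted & is_true)     # false negatives
    denom = (tp + fn) * (tp + 3 * fp)
    return float(tp ** 2 / denom) if denom > 0 else 0.0

probs = np.array([0.9, 0.8, 0.3, 0.2])
labels = np.array([1, 1, 2, 2])
fom_all = fom_at_threshold(probs, labels, 1, 0.0)     # everything classified as class 1
fom_clean = fom_at_threshold(probs, labels, 1, 0.5)   # clean separation
```

Accepting every object gives TP = 2, FP = 2, FN = 0 and a FoM of 0.25, while the clean cut gives 1. The factor of 3 on FP penalises contamination more heavily than lost true positives.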

class snmachine.snclassifier.OptimisedClassifier(classifier, optimise=True, **kwargs)[source]

Implements an optimised classifier (although it can be run without optimisation). Equipped with interfaces to several sklearn classes and functions.

classify(X_train, y_train, X_test)[source]

Run unoptimised classifier with initial parameters.

Parameters:
  • X_train (array) – Array of training features of shape (n_train,n_features)
  • y_train (array) – Array of known classes of shape (n_train)
  • X_test (array) – Array of validation features of shape (n_test,n_features)
Returns:
  • Yfit (array) – Predicted classes for X_test
  • probs (array) – (If self.prob=True) Probability for each object to belong to each class.
optimised_classify(X_train, y_train, X_test, **kwargs)[source]

Run optimised classifier using grid search with cross validation to choose optimal classifier parameters.

Parameters:
  • X_train (array) – Array of training features of shape (n_train,n_features)
  • y_train (array) – Array of known classes of shape (n_train)
  • X_test (array) – Array of validation features of shape (n_test,n_features)
  • params – Allows the user to specify which parameters and over what ranges to optimise. If not set, defaults will be used.
  • true_class (int, optional) – The class determined to be the desired class (e.g. Ias, which might correspond to class 1). This allows the user to optimise for different classes (based on ROC curve AUC).
Returns:
  • Yfit (array) – Predicted classes for X_test
  • probs (array) – (If self.prob=True) Probability for each object to belong to each class.
snmachine.snclassifier.plot_roc(fpr, tpr, auc, labels=[], cols=[], label_size=26, tick_size=18, line_width=3, figsize=(8, 6))[source]

Plots a ROC curve or multiple curves. Can plot the results from multiple classifiers if fpr and tpr are arrays where each column corresponds to a different classifier.

Parameters:
  • fpr (array) – An array containing the false positive rate at each probability threshold
  • tpr (array) – An array containing the true positive rate at each probability threshold
  • auc (float) – The area under the ROC curve
  • labels (list, optional) – Labels of each curve (e.g. ML algorithm names)
  • cols (list, optional) – Colors of the line(s)
  • label_size (float, optional) – Size of x and y axis labels.
  • tick_size (float, optional) – Size of tick labels.
  • line_width (float, optional) – Line width
snmachine.snclassifier.roc(pr, Yt, true_class=0)[source]

Produce the false positive rate and true positive rate required to plot a ROC curve, and the area under that curve.

Parameters:
  • pr (array) – An array of probability scores, either a 1d array of size N_samples or an nd array, in which case the column corresponding to the true class will be used.
  • Yt (array) – An array of class labels, of size (N_samples,)
  • true_class (int) – which class is taken to be the “true class” (e.g. Ia vs everything else)
Returns:

  • fpr (array) – An array containing the false positive rate at each probability threshold
  • tpr (array) – An array containing the true positive rate at each probability threshold
  • auc (float) – The area under the ROC curve
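
To make the three return values concrete, here is a minimal NumPy sketch that builds a ROC curve by scanning thresholds and integrates it with the trapezoidal rule. The helper name `roc_sketch` and the fixed grid of 101 thresholds are assumptions for illustration; snmachine's own function may compute these differently.

```python
import numpy as np

def roc_sketch(pr, Yt, true_class, n_thresholds=101):
    """Hypothetical helper: false/true positive rates over thresholds, plus AUC."""
    pr = np.asarray(pr, dtype=float)
    is_true = np.asarray(Yt) == true_class
    # Descending thresholds so that fpr and tpr increase along the arrays
    thresholds = np.linspace(1.0, 0.0, n_thresholds)
    tpr = np.array([np.sum((pr >= th) & is_true) for th in thresholds]) / np.sum(is_true)
    fpr = np.array([np.sum((pr >= th) & ~is_true) for th in thresholds]) / np.sum(~is_true)
    # Trapezoidal integration of tpr over fpr
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

probs = np.array([0.9, 0.8, 0.3, 0.2])
labels = np.array([1, 1, 2, 2])
fpr, tpr, auc = roc_sketch(probs, labels, true_class=1)
```

Because the toy classes separate perfectly, the curve reaches a true positive rate of 1 before any false positives appear and the area under it is 1.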

snmachine.snclassifier.run_pipeline(features, types, output_name='', columns=[], classifiers=['nb', 'knn', 'svm', 'neural_network', 'boost_dt'], training_set=0.7, param_dict={}, nprocesses=1, scale=True, plot_roc_curve=True, return_classifier=False)[source]

Utility function to classify a dataset with a number of classification methods. This does assume your test set has known values to compare against. Returns, if requested, the classifier objects to run on future test sets.

Parameters:
  • features (astropy.table.Table or array) – Features either in the form of a table or array
  • types (astropy.table.Table or array) – Classes, either in the form of a table or array
  • output_name (str, optional) – Full root path and name for output (e.g. ‘<output_path>/salt2-run-’)
  • columns (list, optional) – If you want to run a subset of columns
  • classifiers (list, optional) – Which available ML classifiers to use
  • training_set (float or list, optional) – If a float, this is the fraction of objects that will be used as the training set. If a list, it is assumed these are the IDs of the objects to be used
  • param_dict (dict, optional) – Use to run different ranges of hyperparameters for the classifiers when optimising
  • nprocesses (int, optional) – Number of processors for multiprocessing (shared memory only). Each classifier will then be run in parallel.
  • scale (bool, optional) – Rescale features using sklearn’s preprocessing scaler class (highly recommended this is True)
  • plot_roc_curve (bool, optional) – Whether or not to plot the ROC curve at the end
  • return_classifier (bool, optional) – Whether or not to return the actual classifier objects (due to the limitations of multiprocessing, this can’t be done in parallel at the moment).
Returns:

(If return_classifier=True) Dictionary of fitted sklearn Classifier objects

Return type:

dict