Reading HDF5 Files
==================

Overview
--------

Many TXPipe outputs are in the HDF5 format. This is a fast and flexible file type that can also be easily read and written in parallel.

HDF5 files contain three types of object:

* ``datasets`` are equivalent to saved numpy arrays.
* ``groups`` are like directories and can contain datasets or sub-groups.
* ``attributes`` are small pieces of metadata. They can be attached to whole files or to individual groups or datasets, and a set of attributes can be converted to a python dictionary.

The naming scheme for datasets and groups is the same as for Unix files and folders, e.g. ``f['group/subgroup/dataset']``.

From the command line, you can use the ``h5ls`` command to list the contents of an HDF5 file::

    h5ls -r filename.hdf5

h5py
----

In python, you read these files with the ``h5py`` library. Here's an example opening one of the files generated by the example "laptop" pipeline in TXPipe::

    import h5py

    # Open the file read-only
    f = h5py.File("./data/example/outputs/shear_tomography_catalog.hdf5", "r")

    # Print out the items in the root of the file
    print(f.keys())
    # prints the root-level keys, showing the three groups
    # generated by the tomography stage

We can create variables to represent groups in the file::

    g = f["tomography"]
    print(g.keys())
    # prints the names of the items in the group

Printing a dataset doesn't load it; it just shows the size and type of the data::

    print(g["mean_e1"])
    # prints a one-line summary of the dataset rather than its values

Instead, we load datasets as numpy arrays with a slice::

    e = g["mean_e1"][:]
    print(e)
    # prints [ 0.00283134 -0.0140038 0.0011645 -0.01299088]

For longer arrays we may want to read just a subset of the data::

    b = g["source_bins"][0:100]

Attributes
----------

The easiest way to read attributes from h5py is to turn them into a dictionary::

    d = dict(f['provenance'].attrs)
    print(d)
    # prints lots of provenance tracking information like all the package versions
    # and configuration options
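
To try these calls without a TXPipe output file at hand, the following minimal sketch writes a small throwaway file and reads it back. The file name ``demo.hdf5``, the group ``demo_group``, the dataset ``values``, and the attribute names are all hypothetical, chosen only for illustration; they are not part of any TXPipe output. It assumes ``h5py`` and ``numpy`` are installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and object names, used only for this illustration.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.hdf5")

# Write a small file containing one group, one dataset, and two attributes.
with h5py.File(path, "w") as f:
    g = f.create_group("demo_group")
    g.create_dataset("values", data=np.arange(10, dtype=np.float64))
    g.attrs["description"] = "toy data"
    g.attrs["nbin"] = 4

# Read it back. The "with" block closes the file when it ends.
with h5py.File(path, "r") as f:
    # Collect the path of every group and dataset, similar to "h5ls -r".
    names = []
    f.visit(names.append)

    # Slicing reads only the requested elements into a numpy array.
    subset = f["demo_group/values"][2:5]
    full = f["demo_group/values"][:]

    # Attributes convert directly to a python dictionary.
    attrs = dict(f["demo_group"].attrs)

print(names)
print(subset)
print(attrs)
```

Opening the file in a ``with`` block is one way to make sure it is closed again; arrays and dictionaries read inside the block, such as ``subset`` and ``attrs`` here, remain usable afterwards because they are ordinary in-memory copies.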