Computing Infrastructure

  • Salman Habib (ANL) 
  • Jim Chiang (SLAC)

The Computing Infrastructure group must provide several capabilities to the DESC: sufficient computing resources and tools to run and manage large productions; a software framework within which collaboration-developed and external code can operate; and common tools needed by multiple working groups (WGs).

The figure below uses weak lensing as an example to illustrate some of the components that DESC members need to access and provide in carrying out science analyses, and that therefore need to be embedded in a software framework. Collaboration members will use simulation tools and the Data Management stack. They will take the outputs (e.g., galaxy catalogs or shear measurements) and build their own analysis code in multiple layers, starting, e.g., with computations of 2-point functions and propagating all the way to constraints on cosmological parameters. The overarching vision for the DESC computing environment is that scientists will find it easy to access the project tools, develop their own code, assemble pipelines from modules written by others within the collaboration, incorporate external code (such as the HEALPix library and CAMB), and track results across the collaboration to identify the optimal algorithms, modules, and pipelines for different problems.
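To make the first analysis layer concrete, the sketch below estimates an angular 2-point correlation function from a small mock catalog using the standard Landy-Szalay estimator, w(θ) = (DD − 2DR + RR)/RR. This is not DESC pipeline code; it is a minimal, flat-sky numpy illustration, and the function names (`pair_counts`, `landy_szalay`) and the mock data are our own assumptions.

```python
import numpy as np

def pair_counts(a, b, bins):
    """Histogram of separations between two point sets (illustrative only).

    a, b: arrays of shape (N, 2) holding (RA, Dec) in radians on a small
    patch, so a flat-sky Euclidean separation is an adequate approximation.
    """
    d = a[:, None, :] - b[None, :, :]          # all pairwise offsets, (N, M, 2)
    sep = np.hypot(d[..., 0], d[..., 1]).ravel()
    counts, _ = np.histogram(sep, bins=bins)
    return counts

def landy_szalay(data, randoms, bins):
    """Landy-Szalay estimator w(theta) = (DD - 2 DR + RR) / RR.

    Auto-pair counts are double-counted by pair_counts, so they are
    normalized by n*(n-1); cross counts are normalized by nd*nr.
    """
    nd, nr = len(data), len(randoms)
    dd = pair_counts(data, data, bins) / (nd * (nd - 1))
    rr = pair_counts(randoms, randoms, bins) / (nr * (nr - 1))
    dr = pair_counts(data, randoms, bins) / (nd * nr)
    return (dd - 2 * dr + rr) / rr

# Mock inputs: uniform "data" and a denser random catalog on a 0.2 x 0.2 rad patch.
rng = np.random.default_rng(42)
bins = np.linspace(1e-3, 0.05, 8)              # angular bins in radians
data = rng.uniform(0.0, 0.2, size=(300, 2))
randoms = rng.uniform(0.0, 0.2, size=(600, 2))
w = landy_szalay(data, randoms, bins)          # ~0 for unclustered mock data
```

In a real DESC pipeline this stage would read catalogs produced by the Data Management stack and hand its w(θ) output to an inference module; the brute-force O(N²) pair counting here would be replaced by tree- or grid-based counting at survey scale.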

On the computing infrastructure side, the collaboration is employing pathfinder projects as a practical way to begin developing cross-working-group coordination, to facilitate the identification and prototyping of tools and methods, to test project management tools, and to integrate pipeline and computing infrastructure through the combined efforts of computing professionals and key DESC members. We are taking lessons learned from the DC1 pathfinders to build a solid foundation that can be expanded to more working groups and larger datasets in DC2, as a step toward full collaboration-wide implementation in DC3. The Twinkles project is the first such pathfinder: an end-to-end demonstrator incorporating cosmological simulations, detailed photon simulations, data processing, and final analysis by a distributed group of collaborators.
Recommendations have already emerged for the organization of GitHub repositories, code development and review strategies, unit and continuous-integration test setups, and the use of issues for tracking work. A candidate workflow engine and file catalog have been employed for bulk production of simulations and image processing at NERSC and SLAC, providing provenance and production metrics for the resulting datasets, which required some 25,000 CPU hours to produce.

The LSS 2-point correlation function validation project is the second pathfinder; it focuses primarily on executing the analysis end of the pipeline and demonstrates the need for a framework connecting inference tools. With NERSC as our primary computing host, a significant effort is underway to optimize its use, both in terms of running large numbers of jobs and of making efficient use of the architecture (many-core, limited memory per core, vectorized).
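The provenance-tracking idea behind the workflow engine and file catalog can be illustrated with a toy sketch: each pipeline stage records what it ran on and with which parameters, and datasets are registered under a content-derived ID so that identical configurations map to the same entry. This is neither the actual DESC workflow engine nor its file catalog; the class names (`ProvenanceRecord`, `Catalog`) and the stage names are hypothetical stand-ins.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Minimal provenance entry for one pipeline stage (illustrative only)."""
    stage: str       # name of the processing stage
    inputs: list     # dataset IDs this stage consumed
    params: dict     # configuration used for the run
    started: float = field(default_factory=time.time)

    def dataset_id(self):
        # Content-derived ID: the same stage, inputs, and parameters always
        # yield the same ID, so reruns are recognized as the same dataset.
        blob = json.dumps({"stage": self.stage, "inputs": self.inputs,
                           "params": self.params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]

class Catalog:
    """Toy stand-in for a file catalog keyed by dataset ID."""
    def __init__(self):
        self.entries = {}

    def register(self, record, payload):
        did = record.dataset_id()
        self.entries[did] = {"provenance": record, "data": payload}
        return did

# Chain two stages: image simulation -> object measurement.  The second
# record's inputs list points at the first dataset, so the full production
# history of the measurement catalog can be walked back through the catalog.
catalog = Catalog()
sim = ProvenanceRecord("image_sim", inputs=[], params={"visit": 840, "seed": 1})
sim_id = catalog.register(sim, payload="raw_images")
meas = ProvenanceRecord("measure", inputs=[sim_id], params={"psf": "gaussian"})
meas_id = catalog.register(meas, payload="object_catalog")
```

A production system adds much more (file locations, site and batch metadata, validation status), but the core design choice sketched here, immutable provenance records linked by dataset IDs, is what makes the production metrics and dataset lineage reported above possible.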