Fred Johnson, Acting Director, Computational Science Research & Partnerships (SciDAC) Division
Office of Advanced Scientific Computing Research
DOE Office of Science
The critical importance of enabling software technology for leading edge research is being thrown into sharp relief by the remarkable escalation in application complexity, in the quantities of data that scientists must now grapple with, and in the scale of the computing platforms that they must use to do it. The effects of this ongoing complexity and data tsunami, as well as the drive toward petascale computing, are reverberating throughout every level of the software environment on which today’s vanguard applications depend – through the algorithms, the libraries, the system components, and the diverse collection of tools and methodologies for software development, performance optimization, data management, and data visualization. It is increasingly clear that our ability today to adapt and scale up the elements of this common software foundation will largely determine our ability tomorrow to attack the questions emerging at the frontiers of science.
Nowhere is this connection between scalable software technology and breakthrough science more evident than in the articles of this issue of CTWatch Quarterly. Each one offers an informative and stimulating discussion of some of the major work being carried out by one of the Centers for Enabling Technologies (CET) of the Department of Energy’s wide-ranging and influential SciDAC program. The joint mission of the CETs is to assure that the scientific computing software infrastructure addresses the needs of SciDAC applications, data sets and parallel computing platforms, and to help prepare the scientific community for an environment where distributed, interdisciplinary collaboration is the norm. Each CET is a multidisciplinary team that works closely with one or more of SciDAC’s major application teams. Each one focuses its attention on the mathematical and computing problems confronting some major aspect of software functionality, such as distributed data management, application development, performance tuning, or scientific visualization. Making the necessary progress in any of these areas requires a collective effort from the national (and international) research community, yet as these articles show, working in the context of SciDAC research has enabled these CETs to make leadership contributions.
The articles here reflect the rich diversity of components, layers and perspectives encompassed by SciDAC’s software ecosystem. They are grouped together according to the aspect of the problem of scalability they address. One group of articles focuses on the software innovations that will be necessary to cope with increases of multiple orders of magnitude in the number of processors and processor cores on petascale systems and beyond; another set focuses on the data management challenges spawned by the exponential growth in the size of tomorrow’s routine data sets; and finally, CETs dedicated to scientific visualization address the need to understand increasingly large and complex data sets generated either experimentally or computationally. The articles in this issue of CTWatch Quarterly follow these groupings.
We begin with a discussion (Gibson et al.) of the future requirements for fault tolerant computing from the leaders of the Petascale Data Storage Institute (PDSI). Given the surprising consequences that scaling up often introduces, it seems to strike an appropriate note – sobriety based on experience. The PDSI team has been collecting and analyzing data on failure rates from contemporary HPC systems in an effort to understand the impact that scaling up to systems with millions of hardware elements will have on successful application execution in general, and on the requirements for next generation storage systems in particular. The results of their timely analysis are thought provoking. They show generally that as systems scale up, conventional approaches to fault tolerance based on the familiar checkpoint and restart may break down along various fronts, because the size and frequency of the checkpoints that must be taken on massive systems make the process unsustainable. Their analysis makes it clear that systems research in this area is destined to become more and more critical.
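To make the checkpoint argument concrete, the following is a minimal back-of-the-envelope sketch, not the PDSI team's own analysis. It substitutes Young's well-known first-order approximation for the checkpoint interval, and it assumes that component failures are independent, so that the system's mean time to interrupt shrinks in proportion to the number of nodes. The node MTBF and checkpoint duration are hypothetical values chosen only for illustration. The checkpoint time is held constant here, whereas on real systems it tends to grow with memory size.

```python
import math


def system_mtti_hours(node_mtbf_hours, n_nodes):
    """Mean time to interrupt of the whole system.

    Assumes independent node failures, so the system-wide failure rate
    is simply n_nodes times the per-node rate.
    """
    return node_mtbf_hours / n_nodes


def young_interval_hours(checkpoint_hours, mtti_hours):
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_hours * mtti_hours)


def wasted_fraction(checkpoint_hours, mtti_hours):
    """Approximate fraction of machine time lost to fault tolerance.

    First term: time spent writing checkpoints.
    Second term: expected work redone after a failure (half an interval
    on average). The estimate is only meaningful while the checkpoint
    time is much smaller than the MTTI, so it is capped at 1.0; hitting
    the cap signals that checkpoint/restart has broken down.
    """
    tau = young_interval_hours(checkpoint_hours, mtti_hours)
    waste = checkpoint_hours / tau + tau / (2.0 * mtti_hours)
    return min(1.0, waste)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 50_000.0   # hypothetical per-node MTBF
    CHECKPOINT_HOURS = 0.25      # hypothetical time to write one checkpoint

    for n in (1_000, 10_000, 100_000, 1_000_000):
        mtti = system_mtti_hours(NODE_MTBF_HOURS, n)
        waste = wasted_fraction(CHECKPOINT_HOURS, mtti)
        print(f"{n:>9} nodes: MTTI = {mtti:8.3f} h, "
              f"time lost = {100 * waste:5.1f}%")
```

Under these assumptions the lost fraction grows roughly with the square root of the node count until the checkpoint time becomes comparable to the mean time to interrupt, at which point the application can no longer make forward progress.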
Three of the CETs focus on issues of software development and maintenance that are raised by the extreme demands of next generation applications and the requirements of the HPC systems on which they must run. The scope of the Center for Technology for Advanced Scientific Computing Software (TASCS), presented in Parker et al., is the most general. For the TASCS group, the increasing scale and complexity of SciDAC applications and systems software is itself a critical problem. They argue that a far higher degree of modularity is required in the software that describes the multi-physics, multi-scale simulations that are now being developed. The more stove-piped these applications are, the less smoothly and intelligently they will be able to adapt and innovate to meet the conditions that we know are coming – more parallelism, more data intensity, shorter mean time to failure, and so on. The core techniques, tools, components and best practices of the Common Component Architecture (CCA) that they survey in their article are designed to help solve this aspect of the scalability problem for the broad SciDAC community.
The other two code-oriented CETs – the Performance Engineering Research Institute (PERI) and the Center for Scalable Application Development Software (CScADS) – focus on application performance and programmer productivity in the context of systems designed with thousands or millions of multicore and/or heterogeneous processors. They share the common goal of providing a tool set for achieving high performance that is as automated and easy to use as possible, allowing researchers to keep their attention focused on the domain science questions at hand. Both have made concerted efforts, through sponsored workshops and direct contact, to engage with and leverage the experience of the SciDAC developer community, with initial emphasis in the areas of Fusion Energy and Combustion. Yet their work emphasizes different but complementary aspects of the problem. The PERI group (Bailey et al.) builds on a foundation of performance modeling, endeavoring to understand, through systematic empirical testing and analysis, the way real world applications behave on real world systems. The knowledge gained thereby is then used to help guide the application design and development process through a variety of techniques, the more automated the better. By contrast, the CScADS group (Mellor-Crummey et al.) is exploring programming models that make the process of developing well tuned, highly parallel software as easy and efficient as possible by innovatively combining high level languages, scripting languages, compilers and other software tools. As these efforts converge, their collective results hold tremendous promise for the HPC developer community.
The CETs dedicated to scientific visualization have to confront the problem of petascale science from a uniquely important point of view, namely, where the bits meet the mind and the bandwidth is inherently limited. Their task is to find ways to enable scientists to fruitfully apply their observational capabilities, constrained as they are by nature, to some of the world’s largest and most complex datasets, using some of the world’s most massive and sophisticated computational platforms.
As described in the Bethel, Johnson et al. article, the Visualization and Analytics Center for Enabling Technologies (VACET) group is developing solutions to this problem that combine “query-driven” strategies, which pre-filter the data to be visualized for relevance and interest, with “context and focus” user interface designs, which enable scientists to control their field of attention while navigating complex data spaces. The success of this approach obviously depends on finding fast and efficient ways to index and search targeted data sets; VACET is collaborating closely with other centers in researching this problem. The work of the Institute for Ultrascale Visualization (Ultraviz Institute), described in Kwan-Liu Ma’s article, also (by necessity) puts the question of interface design at the center of its research agenda, especially for cases requiring the exploration of time-varying multivariate volume data. The Institute’s investigation of “in situ” visualization attempts to address problems at the other end of the visualization pipeline. To overcome the severe problems of data logistics involved in managing the rendering of multi-terabyte data sets in networked environments, in situ visualization performs the necessary calculations while data still resides on the supercomputer that was used to generate it.
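The dependence of query-driven visualization on indexing can be illustrated with a toy sketch. It is not VACET's software, and the function names and sample values are hypothetical. The idea it shows is only the general one described above: build a coarse index over one variable once, then answer a range query by touching just the bins that can contain matches, so that only the selected records are handed on to the renderer.

```python
from collections import defaultdict


def build_bin_index(values, bin_width):
    """Map each bin number to the list of record indices that fall in it."""
    index = defaultdict(list)
    for i, v in enumerate(values):
        index[int(v // bin_width)].append(i)
    return index


def query_range(values, index, bin_width, lo, hi):
    """Return the sorted record indices i with lo <= values[i] < hi.

    Only bins overlapping [lo, hi) are visited; every other record is
    skipped without being read.
    """
    first_bin = int(lo // bin_width)
    last_bin = int(hi // bin_width)
    hits = []
    for b in range(first_bin, last_bin + 1):
        for i in index.get(b, ()):
            # Edge bins can hold values outside the range, so each
            # candidate is checked against the actual bounds.
            if lo <= values[i] < hi:
                hits.append(i)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical per-cell values of one simulated variable.
    temperature = [0.1, 5.2, 9.9, 5.7, 3.3, 12.4, 5.0]
    idx = build_bin_index(temperature, bin_width=1.0)

    # Select only the cells of interest before any rendering happens.
    selected = query_range(temperature, idx, 1.0, lo=5.0, hi=6.0)
    print("cells passed to the renderer:", selected)
```

Production systems use far more compact and scalable structures for this purpose, but the division of labor is the same: the index narrows the data first, and the visualization stage sees only what survives the query.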
Similar problems of petascale data logistics are central to the mission of the three CETs that focus on large scale data management for distributed environments. As the leaders of the Center for Enabling Distributed Petascale Science (CEDPS) make clear in their article (Schopf et al.), such questions of “data placement” are central to the end-to-end effectiveness of SciDAC’s highly distributed collaboration environments. The authors describe their development of a policy-driven data placement service, which builds on their experience working with several leading application communities, including HEP, Fusion Energy, Combustion, and Earth Systems. Complementary efforts on automated scientific workflow, using the well-known Kepler middleware, are also underway at the Scientific Data Management (SDM) Center. But in order to help investigators manage and analyze the data deluge they confront, the SDM Center research portfolio extends farther down the storage middleware stack. In the Shoshani et al. article, they describe the ensemble of software tools and middleware that they are developing to help scientists explore their data through automatic feature extraction and highly scalable indexing of massive data sets, and optimize their use of storage resources through low level parallel I/O libraries and in situ processing on the storage nodes (“active storage”).
The third CET focused on data management – the Earth System Grid Center for Enabling Technologies (ESG-CET) – revolves, as the name suggests, around a single major application community, viz. the climate research community. This community has been at the forefront of the data grid movement for many years, aggressively developing and deploying data grid technology to show how high impact data sharing can be implemented on a global scale, even while the volume of data continues to escalate. Their discussion (Williams et al.) of the past successes, the current implementation, and the future plans for the Earth System Grid describes a model that several other application communities would do well to emulate as we enter the era of petascale data.
Reflecting on the range and diversity of the work on software cyberinfrastructure presented in this issue of CTWatch Quarterly, it’s hard to avoid the conclusion that the relentless movement toward petascale science, in which the DOE SciDAC program has played such a leading role, has generated a software ecosystem whose continued vitality seems more and more essential to success on the new frontiers of research. But we cannot be complacent. The push beyond petascale is just around the corner and, as before, the effort to scale up even further is certain to bring up uniquely difficult problems that we have not yet anticipated. We must hope, therefore, that the next generation of enabling software technology researchers contains the same kind of energetic, dedicated and creative pioneers who have led the current one.