Creating and Operating National-Scale Cyberinfrastructure Services
Charlie Catlett, Pete Beckman, Dane Skow and Ian Foster, The Computation Institute, University of Chicago and Argonne National Laboratory
The term “cyberinfrastructure” is broadly defined to include computer applications, services, data, networks, and many other components supporting science.1 Here we discuss the underlying resources and integrative systems and software that together comprise a grid “facility” offering a variety of services to users and applications. These services can range from application execution services to data management and analysis services, presented in such a way that end-user applications can access these services separately or in combination (e.g., in a workflow).
We use the TeraGrid2 project to illustrate the functions and costs of providing national cyberinfrastructure. Developed and deployed in its initial configuration between 2001 and 2004, the TeraGrid is a persistent, reliable, production national facility that today integrates eighteen distinct resources at eight “resource provider” facilities.3 This facility supports over 1000 projects and several thousand users (Fig. 1) across the sciences. TeraGrid architecture, planning, coordination, operation, and common software and services are provided through the Grid Infrastructure Group (GIG), led by the University of Chicago. TeraGrid staff work with end-users, both directly and through surveys and interviews, to drive the technical design and evolution of the TeraGrid facility in support of science. In addition, TeraGrid is developing partnerships with major science facilities and communities to provide needed computational, information management, data analysis, and other services and resources, thus allowing those communities to focus on their science rather than on the creation and operation of services.
TeraGrid supports a variety of use scenarios, ranging from traditional supercomputing to advanced Grid workflow and distributed applications. In general terms, TeraGrid emphasizes two complementary types of use. TeraGrid “Deep” involves harnessing TeraGrid’s integrated high-capability resources to enable scientific discovery that would not otherwise be possible. TeraGrid “Wide” is an initiative that is adapting TeraGrid services and capabilities to be readily used by the broader scientific community through interfaces such as web portals and desktop applications. All of these use scenarios—even traditional supercomputing users—benefit from the common services that are operated across the participating organizations, such as uniform access to storage, common data movement mechanisms, facility-wide authentication, and distributed accounting and allocations systems that provide the basis for authorization.
Creating and operating a grid facility involves integrating resources, software, and user support services into a coherent set of services for users and applications. Resources are explored by Roskies,4 while Killeen and Simon5 discuss user and community support. We discuss here the software infrastructure and policies required to integrate these diverse components to create a persistent, reliable national-scale facility. While the federation of multiple, independent computing centers requires carefully designed federation, governance, and sociological policies and processes, in this article we focus only on the functional and technical costs of operating a national grid infrastructure.
Software components in a grid facility include science applications, grid middleware, infrastructure support services, and mechanisms to integrate community-developed systems we call “Science Gateways.” If we define the fundamental components of infrastructure to be those that have the longest useful lifespan, then software is clearly the critical investment. While particular platforms (e.g., x86) may have long lifetimes, individual high-end computational resources have a useful lifespan of perhaps five years. In contrast, many components of our software infrastructure are already 10 years old. For example, TeraGrid deployed the Globus Toolkit6 nearly five years ago (it was not new at the time), and our expectation is that this software will be integral for the foreseeable future. Similarly, scientific communities have invested several years in building software infrastructure – tools, databases, and web portals for example – for their communities. Science application software and the tools for developing, debugging and managing that software are often even older. As we consider costs and investments for integrating grid facilities, it is essential that we leverage these investments.
The vast majority of scientific grid facilities rely heavily on a common core set of middleware systems, such as the Globus middleware (which includes numerous components, such as GridFTP for data transfer, GRAM for job submission, Grid Security Infrastructure, and the credential management software MyProxy7) and a variety of related tools such as the Condor scheduling system8 and the verification and validation suite Inca.9 The development and wide-scale adoption of these components has been made possible by substantial investments by DOE, NSF, and other agencies in the U.S. and abroad. In particular, NSF’s investment of roughly $50M in the NSF Middleware Initiative (NMI) program10 over the past five years has played a key role in developing and “hardening” these and other software systems such that they can be reliably used in grid facilities, as evidenced by their widespread adoption world-wide in hundreds of grid projects and facilities. For example, the NMI GRIDS Center11 has supported the development, integration testing, and packaging of many components. This work has reduced the complexity of creating a basic grid system and greatly simplified updating systems that adopted earlier versions of software. Additional investments of tens of millions of dollars have been made worldwide in grid deployment projects that have contributed to the maturation of these software systems, the development of tools for particular functions, and the pioneering of the new application approaches enabled by TeraGrid-class facilities. For example, the TeraGrid project invested roughly $1M in the initial design and development of the Inca system, which is one of many such components that are available today through the NMI program.
Continued investment in middleware capabilities development, through programs like NMI, is critical if we are to deliver on the promise of cyberinfrastructure. Major grid facilities like TeraGrid, and the user-driven application and user environment projects that build on those facilities, typically involve a two-year development schedule and a five-year capability roadmap, both of which rely on the progression of capabilities from research prototypes to demonstration systems to supportable software infrastructure.
In parallel with NMI over the past several years, other programs within NSF, DOE, NIH, and other agencies have provided funding to bring together software engineers and computational scientists to create software infrastructure aimed at harnessing cyberinfrastructure for specific disciplines. For example, the Linked Environments for Atmospheric Discovery12 project is creating an integrated set of software and services designed for atmospheric scientists and educators. Similar cyberinfrastructure has been created in other disciplines such as high energy and nuclear physics,13 14 15 fusion science,16 earth sciences,17 18 astronomy,19 20 nanotechnology,21 bioinformatics,22 and cancer research and clinical practice.23
In the TeraGrid project we have formed a set of partnerships around the concept of “Science Gateways,” with the objective of providing TeraGrid services (e.g., computational, information management, and visualization services) to user communities through the tools and environments they are already using, in contrast to traditional approaches that require the user to learn how to use the Grid facilities directly. The most common presentation of these community-developed cyberinfrastructure environments is in the form of web portals, though some provide desktop applications or community-specific grid systems instead of, or in addition to, a web portal.
We have partnered in the TeraGrid project not only with gateway providers but also with other grid facilities to identify and standardize a set of services and interaction methods that will enable web portals and applications to invoke computation, information management, visualization, and other services. While still in the early stages, the TeraGrid Science Gateways program has catalyzed a new paradigm for delivering cyberinfrastructure to the science and education community, with a scalable wholesale/retail relationship between grid facilities and gateway providers. Additional benefits to this model include improved security architecture (offering targeted, restricted access to users rather than open login access) and collaboration support (community members can readily share workflows, tools, or data through and among gateway systems).
The creation of a grid facility involves the integration of a set of resource providers. A coherent grid facility must leverage software infrastructure, as described above, to provide a set of common services, a framework that allows for exploitation of unique facilities, and the infrastructure needed to coordinate the efforts of the resource providers in support of users. Common services include operations centers, network connectivity, software architecture and support, planning, and verification and validation systems. Facility-wide infrastructure includes components such as web servers, collaboration systems, the framework for resource management policy and processes, operation coordination, training documentation and services, and software repositories. Underlying Grid middleware software provides common services and interfaces for such functions as authentication and authorization, job submission and execution, data movement, monitoring, discovery, resource brokering, and workflow.
For scientific computing, and in particular high-performance computing, the fact that a user can reliably expect the Unix operating system as the standard environment on almost all major shared resources has been a boon to scientists making persistent software investments. Internet connectivity and basic services such as SSH and FTP have similarly become standard offerings. A grid facility aims to provide services that allow for resources to be aggregated, such that applications hosted on various resources can be combined into a complex workflow. Such a set of services, operated within a single organization, would be merely complex. Providing these services across a collection of organizations adds policy, social, coordination, and other integration requirements that exceed the complexity of the grid middleware itself.
Addressing these requirements to create and manage a national-scale grid environment requires the creation and operation of both organizational and technical integration services. We do not attempt to prescribe organizational structures within which these functions reside. However, we do make several observations. First, a grid facility requires close collaboration and cooperation among all participating organizations, each of which provides one or more functions and services as part of the overall facility. Second, despite the fact that each participating resource provider shares the goal of creating and operating a high-quality grid facility, it is necessary to identify specific responsibilities for coordinating and providing common services. In most grid projects, this function is performed by a system integration team that coordinates and plans common services, providing these services directly and through partner organizations.
In the rest of this section, we use the TeraGrid to illustrate the specific functions and costs required to provide national cyberinfrastructure. For each functional section, we discuss the scope of work as well as the approximate staffing levels, both in the system integration team (the GIG) and at the resource provider facilities. We use units of “full time equivalents” or “FTE” to measure effort because most staff members are employed partially on TeraGrid funding and partially on other institutional funding.
The TeraGrid software environment involves four areas. Grid middleware, including the Globus Toolkit, Condor, and other tools, provides capabilities for harnessing TeraGrid resources as an integrated system. These Grid middleware components are deployed, along with libraries, tools, and virtualization constructs, as the Coordinated TeraGrid Software and Services (CTSS) system, which provides users with a common development and runtime environment across heterogeneous platforms. This common environment lowers barriers users encounter in exploiting the diverse capabilities of the distributed TeraGrid facility to build and run applications. A software deployment validation and verification system, Inca, continuously monitors this complex environment, providing users, administrators, and operators with real-time and historical information about system functionality. In addition, users are provided with login credentials and an allocations infrastructure that allows a single allocation to be used on any TeraGrid system through the Account Management Information Exchange (AMIE)24 account management and distributed accounting system.
These four components must work seamlessly together, combined with related administrative policies and procedures, to deliver TeraGrid services to users. Software integration efforts must ensure that these components can be readily deployed, updated, and managed by resource provider staff, while working with science partners to both harden and enhance the capabilities of the overall system and with the NMI project to implement an independent external test process.
Increasingly, TeraGrid is also providing service-hosting capabilities that allow science communities to leverage the operational infrastructure of this national-scale grid. For example, data and collections-hosting services are provided as part of the TeraGrid resource provider activities at SDSC. Users may request storage space, specifying their desired access protocols, ranging from remote file I/O via a wide area parallel filesystem to GridFTP25 and Storage Resource Broker (SRB).26 Similarly, communities are provided with software areas on all TeraGrid computational resources, thus enabling community-supported software stacks and applications to be deployed TeraGrid-wide.
A general-purpose facility such as TeraGrid must evolve constantly in concert with the changing and growing needs and ideas of its user community. Sometimes the need for a new capability or the improvement of an existing one will be obvious from operational experience or groundswell requests from the user community. In other cases, multiple competing ideas may arise within particular communities/subsets of the facility that must either be replaced by a new common component, or integrated into a coherent system. TeraGrid services are defined as part of the CTSS package, with major releases at roughly six-month intervals used to introduce new capabilities.
The costs of integrating, deploying, and operating these software systems can be significant. The TeraGrid project applies 10 FTEs to the tasks of integrating, verifying, validating, and deploying new capabilities. This staff works with 42 resource integration and support FTEs from the resource provider facilities. The latter staff is responsible for the support and administration of the specific computational, information management, visualization, and data resources operated by resource provider facilities.
User support is best done in a manner that can fully exploit all available human connections to users and their problem domains. The most frequent model is to have the user support staff local to the resource providers. This model is motivated in part by the historical organization of computing centers as vertically integrated, standalone facilities, and in part by the fact that close connection to the users and their issues is important to the centers, providing vital information for tuning, improving and designing next generation facilities.
TeraGrid leverages this model, coordinating the support staff across the sites to provide a set of support programs that give users a “one stop shop” whose major function (beyond basic “first aid”) is to establish the connection between the user and the appropriate local support. This approach also allows us to draw on the expertise and availability of peers across the full organization.
This coordinated, leveraged approach is essential when supporting a user community in the context of a distributed grid facility, where services and applications involve multiple components. Diagnosing and tuning applications in such an environment often requires the engagement of experts from multiple organizations. At the same time, it is important that a single responsible party “own” getting a solution to the user. Often, providing a modest amount of focused attention, while drawing on specialists across the facility, allows researchers to make rapid substantial progress in the efficiency and capabilities of their applications.
TeraGrid user support services comprise three FTEs who provide central coordination and 25 applications support and consulting FTEs from the resource provider facilities. A particular benefit to this distributed teaming approach is that TeraGrid can draw on a much more diverse group of experts than can be found in any single facility.
The TeraGrid GIG is also creating a team of experts whose role currently is to integrate a set of prototype science gateways. Consisting of 10 FTE located at eight science gateway sites, this distributed support team will shift within 12-18 months from primarily integrating prototypes to becoming a general support team for the dozens of science gateways that we anticipate will emerge from these early pioneering efforts. Complementing the direct end-user support team, this team’s customers will be user support and technical staff associated with science gateways.
As with a single-site facility, a national cyberinfrastructure requires focused effort on communications to key groups, including end-users, funding agencies, and other stakeholders. Each resource provider within a national grid facility will provide documentation and training for the resources and services locally provided, and these materials must be proactively integrated, in a similar fashion to the services and resources themselves. This task requires an overall communication architecture that provides structure and common interfaces and formats for the training and documentation materials, along with the curation (the analog to software verification and validation) of the overall system.
TeraGrid coordinates these areas with two FTEs who work with three FTEs at resource provider facilities as well as the external relations, education, and training staff at those facilities (but not dedicated to TeraGrid).
A key strategy for not only communication but also user support and simplifying the use of TeraGrid is a user portal program that provides users with a web-based, customizable interface for training, documentation, and common user functions such as resource directories, job submission and monitoring, and management of authorization credentials across TeraGrid. The user portal project involves two FTEs who work closely with the communications, training, and documentation staff.
While largely transparent to end-users, any national grid facility must be supported by a deep foundation of operational infrastructure. This need is particularly important for facilities such as TeraGrid that operate national-scale resources, purchased and supported on behalf of government agencies, where accountability for the use of those resources is required, coupled with an open peer-review process for allocating access to the resources. Operational services discussed here also include networking, security coordination, and an operations center.
Resource Allocation and Management
Many national-scale grid consortia operate “best-effort” services that provide access to excess capacity to stakeholder user groups. In contrast, TeraGrid operates resources on behalf of broad national communities, and these resources are allocated by formal processes. Specifically, resources are allocated by a peer-review committee that meets quarterly to review user requests for allocations. (Allocations are specified in service units, analogous to CPU hours.) The mechanisms needed to support this nationally peer-reviewed system include a distributed accounting system that works in concert with authentication and authorization systems to debit project allocations according to use by users authorized by the principal investigator of the given project. In addition, support for the allocation review process itself requires a proposal request and review infrastructure, databases for users and usage, and information exchange systems for usage data and user credentials. The TeraGrid has obtained much of this infrastructure from its predecessor, the NSF Partnerships for Advanced Computational Infrastructure (PACI) program, in which several million dollars of software development was invested during the past decade.
The operation of the TeraGrid resource allocation and management infrastructure requires four GIG FTEs for coordination along with seven FTEs at resource provider facilities to support the various databases and proposal support systems, and to perform local accounting integration with the distributed TeraGrid system.
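The debit path described above (usage by users whom the principal investigator has authorized is charged against the project's allocation of service units) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AMIE protocol or TeraGrid's actual accounting code: the class names, field names, user names, and site names are all hypothetical, and a production system would keep its records in a central usage database and exchange them between sites.

```python
from dataclasses import dataclass, field


@dataclass
class UsageRecord:
    """One charge reported by a resource provider site (hypothetical format)."""
    user: str
    site: str
    service_units: float


@dataclass
class Project:
    """A peer-reviewed allocation, measured in service units (SUs)."""
    pi: str
    allocated_su: float
    authorized_users: set = field(default_factory=set)
    ledger: list = field(default_factory=list)

    def authorize(self, requester: str, user: str) -> None:
        # Only the principal investigator may add users to the project.
        if requester != self.pi:
            raise PermissionError(f"{requester} is not the PI of this project")
        self.authorized_users.add(user)

    @property
    def used_su(self) -> float:
        return sum(record.service_units for record in self.ledger)

    @property
    def remaining_su(self) -> float:
        return self.allocated_su - self.used_su

    def debit(self, record: UsageRecord) -> None:
        # Authorization check: the PI or a user the PI has authorized.
        if record.user != self.pi and record.user not in self.authorized_users:
            raise PermissionError(f"{record.user} is not authorized on this project")
        # Accounting check: never charge more than the remaining allocation.
        if record.service_units > self.remaining_su:
            raise ValueError("charge exceeds remaining allocation")
        self.ledger.append(record)


# A single allocation is debited for usage at two different (made-up) sites.
project = Project(pi="pi_alice", allocated_su=1000.0)
project.authorize("pi_alice", "bob")
project.debit(UsageRecord(user="bob", site="site_a", service_units=250.0))
project.debit(UsageRecord(user="pi_alice", site="site_b", service_units=100.0))
print(project.remaining_su)  # prints 650.0
```

Keeping a ledger of per-site records, rather than a single running total, mirrors the idea that usage is collected from many resource providers and reconciled against one project-wide allocation.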
Security management in a national grid facility requires a high degree of coordination among security professionals at many sites. TeraGrid security coordination is based on a set of agreed-upon policies ranging from minimum security practices to change management and protocols for incident response and notification.
The GIG team provides coordination of the distributed security team for general communication, incident response management, and analysis of the security impact of system changes (e.g., software, new systems, etc.). However, the provision of distributed authentication and authorization services for individual users and groups (or “virtual organizations”27), as is required in grid facilities, is also a significant part of the security coordination effort.
Security coordination across TeraGrid requires two GIG FTEs working with three FTEs at resource provider sites, with participation from additional security operations staff from each resource provider organization. While participation in a national grid security coordination team requires investment of time on the part of local security staff, the benefits to the site are high in terms of training, assistance, and early notification of events that might impact the local site.
Many national-scale grid facilities rely on existing Internet connectivity. In contrast, TeraGrid operates a dedicated network. Irrespective of the networking strategy, effort is needed to optimize services over networks between resource provider locations, particularly with respect to data movement over high bandwidth-delay product networks. In addition, distributed applications and services often require assistance from networking experts at multiple sites. Thus, a national-scale grid facility such as TeraGrid requires a networking team consisting of contacts from each resource provider site. As with the security team, the benefits to the site far outweigh the time-investment on the part of local networking staff.
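To make the phrase “high bandwidth-delay product” concrete, the short Python sketch below computes how much data must be in flight to keep a long, fast path full, and how far a single TCP stream falls short when its window is small. The link speed, round-trip time, and window size are illustrative assumptions, not measurements of the TeraGrid network, and the function names are hypothetical.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the path full: bandwidth x round-trip time."""
    return bandwidth_bits_per_s * rtt_s / 8


def window_limited_throughput_bits_per_s(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on one TCP stream's throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_s


# Assumed values for illustration only.
link_bits_per_s = 10e9      # a 10 Gb/s wide-area path
rtt_s = 0.060               # 60 ms coast-to-coast round-trip time
default_window = 64 * 1024  # a 64 KiB window, i.e. no TCP window scaling

bdp = bandwidth_delay_product_bytes(link_bits_per_s, rtt_s)
limit = window_limited_throughput_bits_per_s(default_window, rtt_s)

print(f"bandwidth-delay product: {bdp / 1e6:.0f} MB")
print(f"single-stream limit with a 64 KiB window: {limit / 1e6:.1f} Mb/s")
```

Under these assumptions roughly 75 MB must be in flight to fill the path, while a single stream with a 64 KiB window is capped below 10 Mb/s. This gap is why wide-area data movement typically depends on tuned buffer sizes or on multiple parallel streams, and why it benefits from coordinated networking expertise at both ends of a transfer.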
In the case of TeraGrid, this component of the support infrastructure comprises a network architect/coordinator within the GIG to oversee the networking team, which includes five FTEs from resource provider facilities along with general networking contacts at all sites. The networking working group coordinates the operation of the TeraGrid network. Participants also assist in user support, such as diagnosing problems and optimizing performance of distributed services and applications.
TeraGrid provides a distributed operations center, leveraging the 24/7 operations centers at two of the resource provider facilities (NCSA and SDSC) to provide around-the-clock support. The distributed 24/7 operations center plays several essential roles in the TeraGrid facility, including the management of a common trouble-ticket system and ongoing measurement of key metrics related to the health and performance of the facility. TeraGrid operations requirements also include management of the distributed accounting system, which involves the collection of usage information into a central usage database. The TeraGrid GIG funds two FTEs for various aspects of operations and two FTEs at resource provider facilities.
The creation of a national-scale team composed of individuals from multiple independent institutions requires careful attention to collaboration systems and processes in support of virtual, distributed teams. Two FTEs within the TeraGrid GIG maintain infrastructure (e.g., CVS repository, discussion forums, wiki and bugzilla servers) that is used both for day-to-day collaboration and to “curate” important project data. For coordination of activities, TeraGrid relies on two types of virtual teams. Working groups are persistent groups of TeraGrid staff with a common mission, such as supporting TeraGrid software, networks, or accounting systems. These working groups are complemented with short-term planning teams called requirements analysis teams (affectionately, “RATs”). Working groups typically involve key members from each resource provider site and coordinators from the GIG and meet regularly on an ongoing basis. RATs are generally smaller (4-6 members) and work on a particular issue for 6-8 weeks to produce a recommendation or proposal for new policy or projects.
The resources and integrative software and services that make up a national-scale grid facility define one axis of its operations. However, there is also a distinct institutional axis, which is where decisions are made regarding the facility’s operations, policies, and major changes to its services, resources, and software. TeraGrid formalizes these latter processes in terms of a numbered, citable, persistent document series, not unlike those used by standards bodies. The initial document28 in the series lays out the roles and responsibilities of TeraGrid’s GIG and resource provider partners as well as a decision-making process.
While top-down, hierarchical management is feasible in a single organization, a federation of interdependent peer organizations requires a different model. At the same time, while democratic processes may work for loose collaborations, they are not appropriate for operation of a production facility. TeraGrid decision-making relies on consensus among representatives of each resource provider, under the leadership of the principal investigator of the GIG who serves as overall TeraGrid project director. This team of resource provider and GIG principals, called the Resource Provider Forum, meets weekly in an open Access Grid session and quarterly for face-to-face review and planning.
Figure 2 shows how the TeraGrid cyberinfrastructure facility allocates staff to provide high-capability, high-capacity, high-reliability computational, information management, and data analysis services on a national scale. Approximately 25% of the staff are allocated to common integration functions (TeraGrid GIG) and 75% to resource provider facility functions. User support and external communications are emphasized at similar levels in both the resource provider efforts and the common GIG effort. GIG effort accounts for the bulk of the software, policy and management, and operational services; resource provider effort accounts for the bulk of the resource integration and support functions. Note that even the “central” functions are distributed: the common services are largely staffed in a distributed fashion at the resource provider sites. TeraGrid’s GIG, operated by the University of Chicago, relies on subcontracts with resource provider facilities for more than 2/3 of the GIG staff, making even the common services team a distributed enterprise. What is important is that this GIG staff, and the services that it provides, are coordinated by a single entity.
Although these numbers will differ in the particular areas from one national grid project to another, we believe that they are representative of the general balance of requirements, both among different functions and between “common” or centrally-provided services and those provided by resource provider facilities.
2 The TeraGrid 2006 – www.teragrid.org/
3 TeraGrid Resource Providers are Argonne National Laboratory / University of Chicago, Indiana University, the National Center for Supercomputing Applications, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, Purdue University, the San Diego Supercomputer Center, and the Texas Advanced Computing Center.
4 Roskies, R., Zacharia, T. “Designing and Supporting High-end Computational Facilities,” CTWatch Quarterly 2(2): May 2006.
5 Killeen, T. L., Simon, H. D. “Supporting National User Communities at NERSC and NCAR,” CTWatch Quarterly 2(2): May 2006.
6 Foster, I. “Globus Toolkit Version 4: Software for Service-Oriented Systems,” IFIP International Conference on Network and Parallel Computing, 2005, Springer-Verlag LNCS 3779, 2-13.
7 Novotny, J., Tuecke, S. and Welch, V. “An Online Credential Repository for the Grid: MyProxy,” 10th IEEE International Symposium on High Performance Distributed Computing, San Francisco, 2001, IEEE Computer Society Press.
8 Litzkow, M. and Livny, M. “Experience with the Condor Distributed Batch System,” IEEE Workshop on Experimental Distributed Systems, 1990.
9 Smallen, S., Olschanowsky, C., Ericson, K., Beckman, P. and Schopf, J.M. “The Inca Test Harness and Reporting Framework,” SC’2004 High Performance Computing, Networking, and Storage Conference, 2004.
10 NSF Middleware Initiative (NMI), 2006 – www.nsf-middleware.org/
11 NSF Middleware Initiative (NMI) Grid Research Integration Development and Support (GRIDS) Center, 2006 – www.grids-center.org/
12 Droegemeier, K. et al, “Linked Environments for Atmospheric Discovery (LEAD): Architecture, Technology Roadmap, and Deployment Strategy,” 21st Conference on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, 2005, American Meteorological Society.
13 Open Science Grid (OSG), 2006 – www.opensciencegrid.org/
14 Avery, P. and Foster, I. “The GriPhyN Project: Towards Petascale Virtual Data Grids,” 2001 – www.griphyn.org/
15 Avery, P., Foster, I., Gardner, R., Newman, H. and Szalay, A. “An International Virtual-Data Grid Laboratory for Data Intensive Science,” Technical Report GriPhyN-2001-2, 2001 – www.griphyn.org
16 Schissel, D.P., Keahey, K., Araki, T., Burruss, J.R., Feibush, E., Flanagan, S.M., Foster, I., Fredian, T.W., Greenwald, M.J., Klasky, S.A., Leggett, T., Li, K., McCune, D.C., Lane, P., Papka, M.E., Peng, Q., Randerson, L., Sanderson, A., Stillerman, J., Thompson, M.R. and Wallace, G. “The National Fusion Collaboratory Project: Applying Grid Technology for Magnetic Fusion Research,” Workshop on Case Studies on Grid Applications, 2004.
17 GEON: The Geosciences Network, 2006 – www.geongrid.org/
18 Bernholdt, D., Bharathi, S., Brown, D., Chanchio, K., Chen, M., Chervenak, A., Cinquini, L., Drach, B., Foster, I., Fox, P., Garcia, J., Kesselman, C., Markel, R., Middleton, D., Nefedova, V., Pouchard, L., Shoshani, A., Sim, A., Strand, G. and Williams, D. “The Earth System Grid: Supporting the Next Generation of Climate Modeling Research,” Proceedings of the IEEE, 93 (3). 485-495. 2005.
19 National Virtual Observatory, 2006 – www.us-vo.org/
20 Szalay, A. and Gray, J. “The World-Wide Telescope,” Science, 293. 2037-2040. 2001.
21 Nanotechnology Simulation Hub (NanoHub), 2006 – www.nanohub.org/
22 Ellisman, M. and Peltier, S. “Medical Data Federation: The Biomedical Informatics Research Network,” The Grid: Blueprint for a New Computing Infrastructure (2nd Edition), Morgan Kaufmann, 2004.
23 Cancer Bioinformatics Grid (caBIG), 2006 – cabig.nci.nih.gov/
24 Account Management Information Exchange (AMIE), 2006 – scv.bu.edu/AMIE/
25 Allcock, B., Bresnahan, J., Kettimuthu, R., Link, M., Dumitrescu, C., Raicu, I. and Foster, I. “The Globus Striped GridFTP Framework and Server,” SC’2005, 2005.
26 Baru, C., Moore, R., Rajasekar, A. and Wan, M. “The SDSC Storage Resource Broker,” 8th Annual IBM Centers for Advanced Studies Conference, Toronto, Canada, 1998.
27 Foster, I., Kesselman, C. and Tuecke, S. “The Anatomy of the Grid: Enabling Scalable Virtual Organizations,” International Journal of Supercomputer Applications, 15 (3). 200-222. 2001.
28 Catlett, C., Goasguen, S. and Cobb, J. “TeraGrid Policy Management Framework,” TeraGrid Report TG-1, 2006.