The NRC Report on the Future of Supercomputing
Marc Snir, University of Illinois at Urbana-Champaign
A variety of events led to a reevaluation of the United States supercomputing programs by several studies in 2003 and 2004. The events include the emergence of the Japanese Earth Simulator in early 2002 as the leading supercomputing platform; the near disappearance of Cray, the last remaining U.S. manufacturer of custom supercomputers; some criticism of the acquisition budgets of the Department of Energy’s (DOE) Advanced Simulation and Computing (ASC) program; and some doubts about the level and direction of supercomputing R&D in the U.S. We report here on a study that was conducted by a committee convened by the Computer Science and Telecommunications Board (CSTB) of the National Research Council (NRC). It was chaired by Susan L. Graham and Marc Snir; it had sixteen additional members with diverse backgrounds: William J. Dally, James W. Demmel, Jack J. Dongarra, Kenneth S. Flamm, Mary Jane Irwin, Charles Koelbel, Butler W. Lampson, Robert F. Lucas, Paul C. Messina, Jeffrey M. Perloff, William H. Press, Albert J. Semtner, Scott Stern, Shankar Subramaniam, Lawrence C. Tarbell, Jr. and Steve J. Wallach. The CSTB study director was Cynthia A. Patterson, assisted by Phil Hilliard, Margaret Marsh Huynh and Herbert S. Lin. The study was sponsored by the DOE’s Office of Science and the DOE’s Advanced Simulation and Computing program.
The study commenced in March 2003. Information was gathered from briefings during 5 committee meetings; an application workshop in which more than 20 computational scientists participated; site visits to DOE labs and NSA; a town hall meeting at the 2003 Supercomputing Conference; and a visit to Japan that included a supercomputing forum held in Tokyo. An interim report was issued in July 2003 and the final report was issued in November 2004. The report was extensively reviewed by seventeen external reviewers in a blind peer-review process as well as by NRC staff. The prepublication version of the report (at over 200 pages), entitled “Getting up to Speed: The Future of Supercomputing,” is available from the National Academies Press [1] and also from DOE [2]. The final published version of the report is due in early 2005.
The study focuses on supercomputing, narrowly defined as the development and use of the fastest and most powerful computing systems — i.e., capability computing. It covers technological, political and economic aspects of the supercomputing enterprise. We summarize in the following sections the main findings and recommendations of this study.
The study documents past contributions of supercomputing to national defense and to scientific discovery, together with evidence of its increasing importance in the future. Numerical simulation and digital data analysis have become essential to research in most disciplines, and many disciplines have insatiable needs for more performance. In areas such as climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals; and there is a clear understanding of the likely advances that will accrue from the use of better performing supercomputing platforms. Supercomputers are essential to the missions of government agencies in areas such as intelligence or stockpile stewardship; they are an essential tool to the solution of important societal problems. Finally, technologies developed on supercomputers broadly contribute to our economy. Examples include application codes (such as NASTRAN) that were initially developed in national labs and run on supercomputers and then disseminated to broad industrial bases; as well as core IT technologies (such as multithreading or vector processing) that were pioneered on supercomputers and migrated to broadly used IT platforms. For reasons explained later, we expect this “trickle-down” process to continue, and perhaps intensify, in coming years. Although it is hard to quantify in a precise manner the benefits of supercomputing, the committee believes that the returns on increased investments in supercomputing will greatly exceed the cost of these investments.
The public sector is the leading user and purchaser of supercomputers: According to International Data Corporation (IDC), more than 50 percent of high-performance computer (HPC) purchases and more than 80 percent of capability system purchases in 2003 were made by the public sector. The reason for this is that supercomputers are used mostly to produce “public goods” and are essential for many government missions. They are used to support government-funded basic and applied research; and they are used to support DoD or DOE missions, and the missions of intelligence agencies. Supercomputing technologies have often migrated to mainstream computing, but on a timetable that is longer than the horizon of commercial computer companies.
This state of affairs implies that the government plays a crucial role in the supercomputing industry, since its acquisitions have a major impact on the health, and indeed the existence, of such an industry. Historically, the government has played an active role in ensuring that supercomputers are available to fulfill its needs by funding supercomputing R&D and by forging long-term relationships with key providers. While active government intervention has risks, it is necessary in areas where the private market is nonexistent or too small to ensure a steady flow of products and technologies that satisfy government needs. Thus, one clearly needs an active government policy to ensure a steady supply of military submarines or aircraft, whereas no active government involvement is needed to ensure a steady supply of PCs. Are supercomputers more like military submarines or like PCs? To answer this question, we first need to look at the current state of supercomputing in the US.
There are strong signs that supercomputing is a healthy business overall, and a healthy business in the US. Supercomputers at academic centers and government laboratories are used to do important research; supercomputers are used effectively in support of essential security missions; good progress is being made on stockpile stewardship, using supercomputing simulations. The large majority of supercomputers are US made: according to IDC, US vendors had 98% market share in capability systems in 2003; 91% of the TOP500 systems, as of June 2004, were US made.
On the other hand, companies that primarily develop supercomputing technologies, such as Cray, have a hard time staying in business. Supercomputers are a diminishing fraction of the total computer market, with a total value of less than $1 billion a year. It is an unstable market, with variations of more than 20 percent in sales from year to year. It is a market that is almost entirely dependent on government acquisitions.
The current state of supercomputing is largely a consequence of the success of commodity-based supercomputing. Most of the systems on the TOP500 list are now clusters, i.e., systems assembled from commercial, off-the-shelf (COTS) processors and switches; more than 95 percent of the systems use commodity microprocessor nodes. On the other hand, on the first TOP500 list of June 1993 only a quarter of the systems used commodity scalar microprocessors and none used COTS switches.
Cluster supercomputers have ridden on the coattails of Moore’s Law, benefiting from the huge investments in commodity processors and the fast increase in processor performance. Indeed, the top performing commodity-based system on the June 1994 TOP500 list had 3,689 nodes; in June 2004 it had 4,096 nodes. While the number of nodes increased by only 11 percent, the system performance, as measured by the Linpack benchmark, improved by a factor of 139 in ten years! Cluster technology offers, for many applications, supercomputing performance at the cost/performance of a PC: as a result, high-performance computing can be afforded by an increasing number of users. Indeed, the verdict of the market is that clusters offer better value for money in many sectors where custom vector systems were previously used.
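The arithmetic behind these figures can be checked with a short sketch. It uses only the node counts and the Linpack improvement factor quoted above; the function name and the assumption that Linpack performance scales in proportion to node count (used to estimate the per-node gain) are illustrative and not taken from the report.

```python
# Figures quoted in the text for the top commodity-based TOP500 system.
NODES_1994 = 3689
NODES_2004 = 4096
SYSTEM_SPEEDUP = 139.0  # Linpack improvement over the decade
YEARS = 10


def growth_summary(nodes_start, nodes_end, system_speedup, years):
    """Split a system-level speedup into node-count growth and per-node gain.

    Assumption (not from the report): Linpack performance scales in
    proportion to the number of nodes, so the per-node gain is simply the
    system speedup divided by the growth in node count.
    """
    node_growth = nodes_end / nodes_start
    per_node_speedup = system_speedup / node_growth
    annual_system_rate = system_speedup ** (1.0 / years)
    return node_growth, per_node_speedup, annual_system_rate


node_growth, per_node_speedup, annual_rate = growth_summary(
    NODES_1994, NODES_2004, SYSTEM_SPEEDUP, YEARS
)
print(f"node count growth:       {100 * (node_growth - 1):.1f} percent")
print(f"estimated per-node gain: {per_node_speedup:.0f}x over {YEARS} years")
print(f"compound system growth:  {annual_rate:.2f}x per year")
```

Under that proportionality assumption, almost all of the improvement comes from faster nodes rather than from a larger machine, which is the point the paragraph makes about commodity processor performance.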
Yet clusters cannot satisfy all supercomputing needs. For some problems, acceptable time to solution can be achieved only by scaling to a very large number of commodity nodes, at which point communication overheads become a bottleneck. A hybrid supercomputer, where commodity processors are connected via a custom network interface (connecting to the memory bus, rather than to an I/O bus), can support higher per-node bandwidth with lower overheads, thus enabling efficient use of a larger number of nodes. (The Cray XT3 and the SGI Altix are examples of such systems.) A custom supercomputer, built of custom processors, can provide higher per-node performance and thus reduce the need to scale to a large number of nodes, at the expense of using more intra-node parallelism. (The Cray X1 and the NEC SX6 are the two current examples of such systems.) Custom processors are especially important for codes that exhibit low locality and, thus, do not take advantage of caches. In such a case, it is important that the intra-node parallelism of the processor support a large number of concurrent memory accesses, as vector or heavily multithreaded processors do.
The success of clusters has reduced the market for hybrid and custom supercomputers to the point where the viability of these systems is heavily dependent on government support. Government investment in the development and acquisition of such platforms has shrunk. Computer suppliers are reluctant to invest in custom supercomputing due to the small size of the market, the uncertainty of the financial returns, and the opportunity cost of not applying skilled personnel to products designed for the broader IT market. Furthermore, academic research on the design of supercomputers has diminished. From the mid-nineties to the early 2000s, the number of published papers on supercomputing or high-performance computing has shrunk by half; the number of National Science Foundation (NSF) grants on parallel architecture design has shrunk by half; and large projects that build prototype systems have disappeared. The reduced research investment is worrisome, as it will be harder to benefit from advances due to Moore’s Law in the future. Some of the main obstacles are summarized next.
Memory latency continues to increase, relative to processor speed: An extrapolation of current trends would lead to the conclusion that by 2020 a processor will execute about 800 loads and 90,000 floating point operations while waiting for one memory access to complete — an untenable differential. While the problem affects all processors, it affects scientific computing and high-performance computing earlier, as commercial codes can usually take better advantage of caches.
Global communication latency continues to increase and global bandwidth continues to decrease, relative to node speed. Again, an extrapolation of current trends would lead by 2020 to a global bandwidth of about 0.001 word per flop and a global latency equivalent to almost 1 Mflop (that is, the time in which a node could execute about a million floating-point operations). The problem affects tightly coupled HPC applications much more than loosely coupled commercial workloads.
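To see what such ratios would mean for a tightly coupled code, here is a minimal sketch of a simple bound. It takes the projected 0.001 word per flop from the text and reads the latency figure as roughly one million floating-point operations of lost work per global message, which is an interpretation. The application demand of 0.01 word per flop and the message count are hypothetical values chosen for illustration, and the function names are not from the report.

```python
PROJECTED_WORDS_PER_FLOP = 0.001  # projected global bandwidth, from the text
PROJECTED_LATENCY_FLOPS = 1.0e6   # projected global latency, read as flops of lost work


def bandwidth_bound_fraction(words_per_flop_available, words_per_flop_needed):
    """Upper bound on the fraction of peak flops a code can sustain when
    global bandwidth is the only limit (other bottlenecks are ignored)."""
    if words_per_flop_needed <= 0:
        return 1.0
    return min(1.0, words_per_flop_available / words_per_flop_needed)


def latency_cost_in_flops(latency_flops, unoverlapped_messages):
    """Floating-point work forgone while waiting on global messages that
    are not overlapped with computation."""
    return latency_flops * unoverlapped_messages


# Hypothetical application: 0.01 word of global traffic per flop,
# and 10 unoverlapped global messages per time step.
fraction = bandwidth_bound_fraction(PROJECTED_WORDS_PER_FLOP, 0.01)
lost = latency_cost_in_flops(PROJECTED_LATENCY_FLOPS, 10)
print(f"bandwidth-bound fraction of peak: {fraction:.2f}")
print(f"flops forgone to latency per step: {lost:.0f}")
```

The bound is deliberately crude: it ignores overlap of communication with computation and any local memory limits, so it only illustrates why codes with heavy global communication are hit first.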
Improvement in single processor performance is slowing down: It is hard to further increase pipelining depth or instruction-level parallelism, so that increasing chip gate counts do not contribute much to single processor performance. To stay on Moore’s curve of microprocessor performance, vendors need to use increasing levels of on-chip multiprocessing. This is not a major problem for many commercial applications that can cope with modest levels of parallelism, but will be a problem for high-end supercomputers that will need to cope with hundreds of thousands of concurrent threads.
As circuit size shrinks and the number of circuits in a large supercomputer grows, mean-time-to-failure decreases. The largest computer systems are more affected by this problem than modest size computers.
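The effect of system size on reliability can be illustrated with a standard simplified model: if node failures are independent and exponentially distributed, the system's mean time to failure is the per-node figure divided by the number of nodes. The sketch below assumes this model; the per-node figure of five years and the node counts are hypothetical, not values from the report.

```python
HOURS_PER_YEAR = 8760.0


def system_mttf_hours(node_mttf_hours, node_count):
    """Mean time to first failure of a system of identical nodes.

    Assumes independent, exponentially distributed node failures, so the
    failure rates add and the system MTTF is node MTTF / node count.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mttf_hours / node_count


# Hypothetical per-node MTTF of five years.
node_mttf = 5 * HOURS_PER_YEAR
for nodes in (1, 128, 4096, 100_000):
    print(f"{nodes:>7} nodes: system MTTF = {system_mttf_hours(node_mttf, nodes):.2f} hours")
```

Real systems deviate from this model (failures are correlated, and some components fail far more often than others), but the inverse dependence on component count is why the largest machines feel the problem first.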
Although clusters have reduced the hardware cost of supercomputing, they have increased the programming effort needed to implement large parallel codes. Scientific codes and the platforms these codes run on have become more complex while the programming environments used to develop these codes have seen little progress. As a result, software productivity is low. Programming is done using message-passing libraries that are low-level and contribute large communication overheads. No higher-level programming notation that adequately captures parallelism and locality, the two main algorithmic concerns of parallel programming, has emerged. The application development environments and tools used to program complex parallel scientific codes are generally less advanced and less robust than those used for general commercial computing. Hybrid or custom systems could support more efficient parallel programming models, e.g., models that use global memory. But this potential is largely unrealized, because of the very low investments in supercomputing software such as compilers, the desire to maintain compatibility with the prevalent cluster architecture, and the fear of investing in software that runs only on architectures that may disappear in a few years. The software problem will worsen as higher levels of parallelism are required and as global communication becomes relatively slower.
The problems listed above indicate a clear need for change. We need new architectures to cope with the breakdown in current designs due to the diverging rate of improvement of various components (e.g., processor speed vs. memory speed vs. switch speed). We need new languages, new tools, and new operating systems to cope with the increased levels of parallelism and the low software productivity. We need continued improvements in algorithms to handle larger problems and new models (adopted to improve performance or accuracy), and to exploit changing supercomputer hardware characteristics.
But it takes time to realize the benefits of research. It took more than a decade from the first vector product until vector programming was well supported by algorithms, languages and compilers; it took more than a decade from the first massively parallel processor (MPP) products to well-supported standard message-passing programming environments. As the research pipeline has emptied, we are in a weak position to cope with the obstacles that are likely to limit supercomputing progress in the next decade.
Change is inhibited by the large investments in application software. While new hardware is purchased every three to five years, large software packages are maintained and used over decades. Changes in architectures and programming models may require expensive recoding, a nearly impossible task for poorly maintained, large “dusty deck” codes. Ecosystems are created through the mutually reinforcing effect of hardware and software that supports well a certain programming model, application software designed for such a programming model, and people that are familiar with the programming model and its environment. Even though the ecosystem may be caught in a “local minimum” and better productivity could be achieved with other architectures and programming models, change requires coordination in all aspects of technology (hardware and software), and very large investments in code rewriting and people retraining to overcome the potential barrier.
Progress also will be hampered by the small size and fragility of the supercomputing ecosystem. The community of researchers that develop new supercomputing hardware and software and applications is small. For example, according to the Taulbee surveys of the last few years, out of more than 800 CS PhDs that graduate each year in the U.S., only 36 specialize in computational sciences (and only 3 are hired by national laboratories). Since supercomputing is a very small fraction of the total IT industry, and since large system skills are needed in many other areas (e.g., Google), people can easily move to new jobs. There is little flow of personnel among the various groups in industry working on supercomputing and little institutional memory: the same problems are solved again and again. The loss of a few tens of people with essential skills can critically hamper a company or a lab. Instability of long-term funding and uncertainty in policies compound this problem.
Our report concludes that the U.S. government has unique supercomputing needs that will not be satisfied without government involvement. In this sense, producing custom supercomputers and supercomputing-unique technologies is like producing cutting-edge weapon systems. However, there are essential differences: not only are custom supercomputers essential to our security, they can also accelerate many other research and engineering endeavors. Furthermore, custom supercomputers are much more closely related to commercially available products, such as clusters, than, say, military aircraft are to civilian aircraft. There is a significant reuse of commercial technologies in custom supercomputers and a continuous flow of invention from custom supercomputers to commodity systems. Finally, the development cycles are much shorter and the development costs are much lower.
This leads to the following overall recommendation:
Overall Recommendation: To meet the current and future needs of the United States, the government agencies that depend on supercomputing, together with the U.S. Congress, need to take primary responsibility for accelerating advances in supercomputing and ensuring that there are multiple strong domestic suppliers of both hardware and software.
To facilitate the government’s assumption of that responsibility, the committee makes eight recommendations.
Recommendation 1. To get the maximum leverage from the national effort, the government agencies that are the major users of supercomputing should be jointly responsible for the strength and continued evolution of the supercomputing infrastructure in the United States, from basic research to suppliers and deployed platforms. The Congress should provide adequate and sustained funding.
A small number of government agencies are the primary users of supercomputing. These agencies are also the major funders of supercomputing research. At present, those agencies include the Department of Energy (DOE), including its National Nuclear Security Administration and its Office of Science; the Department of Defense (DoD), including its National Security Agency (NSA); the National Aeronautics and Space Administration (NASA); the National Oceanic and Atmospheric Administration (NOAA); and the National Science Foundation (NSF). The increasing use of supercomputing in biomedical applications suggests that NIH should be added to the list. There is a significant overlap among the supercomputing needs of these agencies.
The model we envisage is not a loose coordination, where each agency informs the others of its plans, but an integrated effort based on a joint long term plan. This 5-10 year High End Computing (HEC) plan would be based on both the roadmap that is the subject of Recommendation 5 and the needs of the participating agencies. Included in the plan would be a clear delineation of the responsibilities of various agencies. Joint planning and coordination of acquisitions will reduce procurement overhead and provide more stability to vendors. Agencies that support research will coordinate their efforts to ensure adequate funding of research addressing major roadblocks, as described in Recommendation 6. A more integrated effort by a few agencies may fund industrial development. House and Senate appropriation committees would ensure that budgets passed into law are consistent with the HEC plan.
Recommendation 2. The government agencies that are the primary users of supercomputing should ensure domestic leadership in those technologies that are essential to meet national needs.
Since the broad market on its own will not satisfy some of the supercomputing needs, the government should ensure the continued availability of needed unique technologies. The U.S. government may want to restrict the export of some technologies, and thus may want these technologies to be produced in the U.S. More importantly, no other country is certain to produce these technologies. The United States needs to invest in supercomputing not in order to be ahead of other countries, but in order to have the tools needed to support critical agency missions in areas such as signals intelligence and weapon stewardship. These investments will also broadly benefit scientific research and the U.S. economy.
Recommendations 3 through 8 outline some of the actions that need to be taken by these agencies to maintain leadership.
Recommendation 3. To satisfy its need for unique supercomputing technologies such as high-bandwidth systems, the government needs to ensure the viability of multiple domestic suppliers.
The viability of vendors of unique supercomputing technologies depends on stable, long-term government investments at adequate levels: both the absolute investment level and its predictability matter, because of the lack of alternative support. Such stable support can be provided either via government funding of R&D expenses or via steady procurements (or both). The model proposed by the British UKHEC initiative, whereby government solicits and funds proposals for the procurement of three successive generations of a supercomputer family over four to six years, is a good example of a model that reduces instability.
The most important unique supercomputing technology identified in this report is custom supercomputing systems. The committee estimated the R&D cost for such a product to be about $70 million per year. This includes both the hardware platform and the software stack. The cost would be lower for a vendor that does not do both.
There also are many supercomputing unique technologies in the software area, leading to the following recommendation:
Recommendation 4. The creation and long-term maintenance of the software that is key to supercomputing requires the support of those agencies that are responsible for supercomputing R&D. That software includes operating systems, libraries, compilers, software development and data analysis tools, application codes, and databases.
The committee believes that higher and more coordinated investments could significantly improve the productivity of supercomputing platforms. The models for software support are likely to be varied: vertically integrated vendors that produce both hardware and software, horizontal vendors that produce software for many different hardware platforms, not-for-profit organizations, software developed in the open source model, etc. However, no matter which model is used, stability and continuity are essential. Software has to be maintained and evolved over decades; this requires a stable cadre of software developers with intimate knowledge of the software. Independent software vendors (ISVs) can play an important role in developing and maintaining software products; the government can help by ensuring that software is developed in national labs only when it cannot be bought.
Recommendation 5. The government agencies responsible for supercomputing should underwrite a community effort to develop and maintain a roadmap that identifies key obstacles and synergies in all of supercomputing.
A roadmap is necessary to ensure that investments in supercomputing R&D are prioritized appropriately. It should be developed with wide participation from researchers, developers of both commodity and custom technologies and users; it should be driven both top-down from application needs and bottom-up from technology barriers; it should be, as much as possible, quantitative in measuring needs and capabilities; finally, it should not ignore the strong interdependencies of technologies.
The roadmap should be used by agencies and by Congress to guide their long-term research and development investments. It is important also to invest in some high-risk, high-return research ideas that are not indicated by the roadmap, to avoid being blindsided.
Recommendation 6. Government agencies responsible for supercomputing should increase their levels of stable, robust, sustained multiagency investment in basic research. More research is needed in all the key technologies required for the design and use of supercomputers (architecture, software, algorithms, and applications).
The decrease in research investment at a time when roadblocks are accumulating puts the supercomputing enterprise at risk. A major correction is needed. The committee estimated the investment needed to support research in core supercomputing technologies at $140 million per year. (This estimate does not include application development or platform acquisition.) It is important that this investment focus on universities, both because of the importance of a free flow of information at an early stage, and because of the role of universities in educating the future cadre of supercomputing practitioners. Finally, research should include a mix of small, medium and large projects, including demonstration systems where technologies are integrated. Such systems are important to study the interplay of technologies and validate them in a realistic environment; they should not be confused with product prototypes and should not be expected to support users.
Recommendation 7. Supercomputing research is an international activity; barriers to international collaboration should be minimized.
Research has always benefited from the open exchange of ideas. In light of the relatively small community of supercomputing researchers, international collaborations are particularly beneficial. The fast development cycles, the fast technology evolution, and the frequent flow of ideas and technologies between supercomputing and the broader IT industry require close interaction between the supercomputing industry and the broader IT industry, and between supercomputing research and broader IT research. The strategic advantage to the U.S. from supercomputing comes not from a single product but from a broad capability to acquire and effectively exploit systems that can best reduce the time to solution of important computational problems. Looser export restrictions would not erode this advantage but would benefit U.S. vendors; in particular, restrictions that affect commodity clusters that can be assembled from widely available components lack any rationale.
Barriers also reduce the benefit of supercomputing to science. Science is a collaborative international endeavor, and many of the best U.S. graduate students are foreigners. A restriction on supercomputer access by foreign nationals means that supercomputers are less available to support science in the U.S.
Recommendation 8. The U.S. government should ensure that researchers with the most demanding computational requirements have access to the most powerful supercomputing systems.
Supercomputing is important for the advancement of science. NSF supercomputing centers, as well as DOE science centers, have done an important service in providing supercomputing support to scientists. However, these centers have seen a broadening of their mission with constant budgets, and have been under pressure to support an increasing number of users. It is important that sufficient stable funding be provided to support an adequate science supercomputing infrastructure. Capability systems should be available to scientists with the most demanding problems and should be used only for jobs that need this capability; supercomputing resources should be available to educate the next generation and to develop the needed software infrastructure. Finally, it is important that the science communities that use supercomputers have a strong say and a shared responsibility for the provision of adequate supercomputing infrastructure, with budgets for the acquisition and maintenance of such infrastructure being clearly separated from the budgets for IT research.
The final report was presented at the 2004 Supercomputing Conference and at briefings attended by DOE staff, congressional staff, staff from the Office of Science and Technology Policy and staff from the Office of Management and Budget. Its content was covered by the general press and by trade publications. The report was generally well received, with words of caution on the difficulty in allocating more money to supercomputing in a world of retrenching research budgets. People in the audience at Supercomputing rightly remarked that similar recommendations appeared in previous reports yet were not acted upon. While it would be better if these recommendations had been acted upon, it is good that various reports push similar recommendations: In order to effect change, the community has to speak in one voice. The recent High-End Computing Revitalization Act is a step in the right direction, but much more is needed. The agencies and the scientists that need supercomputers have to act together and push not only for an adequate supercomputing infrastructure now, but for adequate plans and investments that will ensure they have the tools they need in five or ten years.
The authors wish to thank Cynthia A. Patterson for her careful editing of this text.