OnDemand is a Dell cluster with 64 Intel dual-socket, dual-core compute nodes for a total of 256 processors. The 2.33 GHz, 4-way nodes each have 8 GB of memory. The system, which has a nominal theoretical peak performance of 2.4 Tflops, runs the SDSC-developed Rocks open-source Linux cluster software and the IBRIX parallel file system. Jobs are scheduled by the Sun Grid Engine.
OnDemand also makes use of the SPRUCE system developed by a team at Argonne National Laboratory. SPRUCE provides production-level functionality, including access controls, reporting, and fine-grained control of urgent computing jobs. An organization can issue tokens to user groups that it has approved for urgent computing runs. Different colors (classes) of SPRUCE tokens represent varying urgency levels: a yellow token places the requested job in the normal queue of the Sun Grid Engine scheduler; an orange token goes to the high-priority queue; and a job submitted with a red token will preempt running jobs if necessary.
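The token-to-queue mapping described above can be sketched as follows. This is an illustrative model only; the function, structure, and queue names are assumptions, not part of the actual SPRUCE or Sun Grid Engine interfaces:

```python
from dataclasses import dataclass

# Illustrative mapping of SPRUCE token colors to scheduler queues
# (queue names are hypothetical, not actual SGE configuration).
QUEUE_FOR_TOKEN = {
    "yellow": "normal",  # normal queue
    "orange": "high",    # high-priority queue
    "red": "high",       # high-priority queue, may preempt running jobs
}

@dataclass
class SubmissionPlan:
    queue: str
    preempt: bool

def plan_submission(token_color: str) -> SubmissionPlan:
    """Map a SPRUCE token color to a scheduling decision."""
    if token_color not in QUEUE_FOR_TOKEN:
        raise ValueError(f"unknown token color: {token_color}")
    # Only a red token authorizes preemption of running jobs.
    return SubmissionPlan(queue=QUEUE_FOR_TOKEN[token_color],
                          preempt=(token_color == "red"))
```

For example, `plan_submission("orange")` routes the job to the high-priority queue without preempting anything, while `plan_submission("red")` additionally allows running jobs to be displaced.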
The researchers are working to develop additional capabilities. Currently, the jobs with the least accumulated CPU time are the first to be preempted. In the future, preempted backfill jobs may be held rather than killed and restarted when appropriate; investigation of checkpoint-and-restart systems is also ongoing.
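The current preemption policy can be illustrated with a short sketch. The job representation and field names here are assumptions for illustration, not the scheduler's actual data model:

```python
def preemption_order(jobs):
    """Order backfill jobs by accumulated CPU time, least first,
    matching the policy that the job with the least accumulated
    CPU is the first candidate for preemption."""
    return sorted(jobs, key=lambda job: job["cpu_seconds"])

def select_victims(jobs, slots_needed):
    """Preempt just enough low-CPU jobs to free the requested
    number of processor slots for an urgent (red-token) job."""
    victims, freed = [], 0
    for job in preemption_order(jobs):
        if freed >= slots_needed:
            break
        victims.append(job)
        freed += job["slots"]
    return victims
```

This greedy selection minimizes the amount of completed work that is discarded, which is the rationale for preempting the least-accumulated-CPU jobs first.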
Backfill jobs consist of a variety of regular user jobs, primarily parallel scientific computing and visualization applications using MPI. Users who run on the OnDemand cluster are made aware of the cluster’s mission to prioritize jobs that require immediate turnaround.
One of the most interesting and successful applications using OnDemand is a commercial product called Star-P [5], which extends easy access to supercomputing to a much wider range of researchers. Users can code models and algorithms on their desktop computers using familiar applications such as MATLAB, Python, and R, and then run them interactively on SDSC's OnDemand cluster through the Star-P platform. This eliminates the need to re-program applications to run on parallel systems, so that programming that took months can now be done in days, and simulations that took days on the desktop can now be done in minutes. Lowering the barrier to supercomputing resources lets researchers jumpstart research that otherwise wouldn't get done.
Star-P supports researchers by allowing them to use HPC clusters transparently through a client (running in their desktop environment) and server framework (running in an HPC cluster environment). For example, existing MATLAB users on a desktop PC can now achieve parallel scalability from the same MATLAB interface with a simple set of Star-P commands. This has enabled many users to achieve the tremendous speed-ups that advanced research groups obtain by laboriously reprogramming applications with MPI.
Researchers on SDSC's OnDemand are using Star-P in a variety of application areas, spanning science, engineering, medical, and financial disciplines. Several research groups have seen true performance breakthroughs through Star-P, which fundamentally changes the type of problems they are able to explore. Interactive Supercomputing, in close collaboration with SDSC, also won the HPC Challenge at SC07.
SDSC and its academic and industrial partners, including Argonne National Laboratory and Interactive Supercomputing, are aggressively continuing to improve the cluster environment to enhance this urgent computing service. The accumulating experience with OnDemand at SDSC is playing a critical role, serving as a testbed as the team works to further develop the urgent computing paradigm and a robust supporting infrastructure.
[2] SDSC Allocations - www.sdsc.edu/us/allocations/
[3] SDSC OnDemand cluster - www.sdsc.edu/us/resources/ondemand/
[4] ShakeMovie, Caltech's Near Real-Time Simulation of Southern California Seismic Events Portal - http://shakemovie.caltech.edu/
[5] Star-P at Interactive Supercomputing - www.interactivesupercomputing.com/