Once the advisor has identified a resource, the job is submitted. SPRUCE supports both Globus-based urgent submissions and direct submission to local job-queuing systems. It currently supports the major resource managers, such as Torque, LoadLeveler, and LSF, and schedulers such as Moab, Maui, PBS Pro, SGE, and Catalina; other schedulers can be supported with little effort. By extending the Resource Specification Language (RSL) of the Globus Toolkit, which is used to express user-specific resource requests, SPRUCE incorporates the ability to indicate a job’s level of urgency. A new “urgency” parameter is defined with three levels: critical (red), high (orange), and important (yellow). These urgency levels are guidelines that help resource providers enable varying site-local response protocols to differentiate potentially competing jobs. Users with valid SPRUCE tokens can simply submit their original Globus submission script with one additional RSL parameter of the form “urgency = <level>”, where the level is red, orange, or yellow.
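As a minimal sketch of how a submission script gains urgency, the following appends an urgency clause to a Globus RSL job description. The helper function and the example job attributes are illustrative assumptions; only the “urgency” parameter and its red/orange/yellow values come from the text above.

```python
# Hypothetical sketch: adding the SPRUCE urgency parameter to a Globus
# RSL job description. The attribute name "urgency" and its levels are
# described in the text; the helper and job values are illustrative.

URGENCY_LEVELS = ("red", "orange", "yellow")  # critical, high, important

def add_urgency(rsl: str, level: str) -> str:
    """Append an (urgency=...) clause to an existing RSL string."""
    if level not in URGENCY_LEVELS:
        raise ValueError(f"unknown urgency level: {level}")
    return rsl + f"(urgency={level})"

job = "&(executable=/home/user/flood_model)(count=64)"
print(add_urgency(job, "red"))
# -> &(executable=/home/user/flood_model)(count=64)(urgency=red)
```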
At the core of the SPRUCE architecture is the invariant that urgent jobs may be submitted only while a right-of-way token is active. To enforce this, a remote authentication step is inserted into the job-submission tool-chain of each resource supporting urgent computation. Since the SPRUCE portal holds the up-to-date information on active sessions and on the users permitted to submit urgent jobs, it is also the natural point for authentication. When an urgent computing job is submitted, the urgency parameter triggers authentication. This authentication is unrelated to the user’s access to the resource, which has already been handled by the traditional Grid certificate or by logging into the Unix-based resource. Rather, it is a “Mother, may I” request for permission to queue a high-priority job. The request is sent to the SPRUCE portal, where it is checked against active tokens, resource names, maximum priority, and associated users. Permission is granted if an appropriate right-of-way token is active and the job parameters fall within the constraints set for that token. All transactions, successful and unsuccessful, are logged.
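The portal-side check can be sketched as follows. All data structures and names here are hypothetical; the real SPRUCE portal performs an equivalent lookup against its token database, granting permission only when an active token covers the requesting user, the target resource, and the requested urgency level, and logging every decision.

```python
# Illustrative sketch of the "Mother, may I" authorization described
# above. Token fields mirror the checks named in the text: active
# session, resource name, maximum priority, and associated users.
import time

LEVEL_RANK = {"yellow": 1, "orange": 2, "red": 3}

class Token:
    def __init__(self, resources, users, max_level, expires_at):
        self.resources = set(resources)
        self.users = set(users)
        self.max_level = max_level      # highest urgency this token allows
        self.expires_at = expires_at    # Unix time when the session ends

    def active(self, now=None):
        return (now if now is not None else time.time()) < self.expires_at

def authorize(tokens, user, resource, level, log):
    """Grant permission iff an active token covers this user, resource,
    and requested urgency level; log every transaction either way."""
    for tok in tokens:
        if (tok.active() and user in tok.users
                and resource in tok.resources
                and LEVEL_RANK[level] <= LEVEL_RANK[tok.max_level]):
            log.append(("granted", user, resource, level))
            return True
    log.append(("denied", user, resource, level))
    return False
```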
All of the above works only if resource providers support a set of urgent computing policy responses corresponding to the different levels of requested urgency. These policies can vary from site to site according to each site’s comfort level. The SPRUCE architecture does not define or assume any particular policy for how sites respond to urgent computing requests. This approach complicates some usage scenarios, but it is unavoidable given the way Grids are built from the distributed resources of independent, autonomous centers, and given the diversity of resources and operating systems available for computing. The SPRUCE architecture cannot simply standardize the strategy for responding to urgent computation. Instead, many choices for supporting urgent computation remain, depending on the systems software and middleware as well as on constraints imposed by CPU-cycle accounting, machine usability, and user acceptance. Given the current technology for Linux clusters and for more tightly integrated systems such as the Cray XT3 and the IBM Blue Gene, the following responses to an urgent computing request are possible:
- Scheduling the urgent job as “next-to-run” in a priority queue. This approach is simple, and we recommend it as a baseline response for all resource providers. No running computation is killed, and the impact on normal use is low. The urgent job begins once all running jobs on a given set of CPUs complete. Unfortunately, this wait could stretch to hours or even days.
- Suspending running jobs and immediately launching the urgent job. Suspension forces some memory paging, but the suspended jobs can be resumed later. Node crashes and failed network connections, however, can be an obstacle to reviving suspended jobs. The benefit of this policy is that urgent jobs begin almost immediately, making it attractive in some cases.
- Forcing a checkpoint/restart of running jobs and re-queuing the urgent job as the next to run. This response is similar to the previous one but safely moves the checkpoint to a location from which the job can be restarted on alternative resources. Architectures with reliable system-level checkpoint/restart can use this approach to support urgent computing. For large-memory systems, however, checkpointing could take 30 minutes or more depending on I/O and disk rates.
- Killing all running jobs and queuing the urgent job as next to run. This response is clearly drastic and frustrating to the users who lose their computations. Nevertheless, it ensures that extremely urgent computations begin immediately after the running jobs are killed.
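One site-local policy might map the three urgency levels directly onto the responses above. The mapping below is purely an illustrative sketch; every site chooses its own mapping, and the strategy names are invented for this example.

```python
# Hypothetical site-local policy: map SPRUCE urgency levels to the four
# response strategies described above. Each site would define its own
# mapping and the mechanisms behind each strategy.

def respond(level: str) -> str:
    policy = {
        "yellow": "next_to_run",       # queue ahead of waiting jobs; low impact
        "orange": "suspend_running",   # pause jobs, launch urgent job now
        "red":    "kill_running",      # drastic: reclaim nodes immediately
        # a site with reliable system-level checkpoint/restart might
        # instead map "orange" to "checkpoint_and_requeue"
    }
    return policy[level]
```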
Another factor in choosing a response policy is accounting and stakeholder accountability. Certain machines are funded for specific activities, and only a small amount of discretionary time is permitted. Furthermore, to improve fairness, some form of compensation (e.g., refunding CPU hours or a one-time higher-priority rescheduling) could be offered for jobs that are killed to make room for an urgent job. Another idea is to provide discounted CPU cycles for jobs willing to be terminated to make room for urgent computations. In any case, resource providers are encouraged to map all three levels of urgency (critical, high, and important) to clearly defined responses.
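The compensation idea above can be sketched as a simple accounting record. The function name, the record fields, and the refund formula are all illustrative assumptions, not part of SPRUCE; they merely show one way a site could refund consumed CPU hours and grant a one-time priority boost to a killed job.

```python
# Hedged sketch of one compensation scheme for jobs killed to make room
# for urgent computation: refund the CPU hours already consumed and
# grant a one-time higher-priority rescheduling. All names and the
# formula are hypothetical.

def compensate(nodes: int, hours_run: float) -> dict:
    """Return a compensation record for a killed job."""
    return {
        "refund_cpu_hours": nodes * hours_run,  # give back consumed time
        "priority_boost": True,                 # one-time higher priority
    }
```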