Understanding Policy Configuration
Applies To: Microsoft HPC Pack 2008 R2, Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
The policy configuration settings control how resources are allocated to queued or running jobs. The Scheduling Mode lets you optimize resource allocation for large batch and MPI workloads or for service workloads. For information about how to change the configuration options, see Configure the HPC Job Scheduler Service.
The following table summarizes the two scheduling modes and their default configurations:
Queued: Start jobs in queue order, and attempt to allocate the maximum requested resources to running jobs. See Queued mode settings in this topic.
Balanced: Attempt to start all incoming jobs as soon as possible at their minimum resource requirements. If additional resources are available, grow jobs based on priority. See Balanced mode settings in this topic.
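The difference between the two modes can be illustrated with a small simulation. This is a minimal sketch and not the scheduler's actual algorithm: the function names, the job tuples, and the one-dimensional "cores" resource are all assumptions made for illustration, and the sketch ignores preemption and priorities.

```python
def queued_allocation(jobs, total_cores):
    """Queued-style sketch: walk the queue in order and give each job as
    much as possible, up to its maximum. jobs: list of (name, min, max)."""
    free = total_cores
    allocation = {}
    for name, minimum, maximum in jobs:
        if free < minimum:
            break  # strict queue order in this sketch: later jobs wait too
        allocation[name] = min(maximum, free)
        free -= allocation[name]
    return allocation


def balanced_start(jobs, total_cores):
    """Balanced-style sketch: start every job that fits at its minimum.
    Returns the allocation and the cores left over for later growth."""
    free = total_cores
    allocation = {}
    for name, minimum, _maximum in jobs:
        if free >= minimum:
            allocation[name] = minimum
            free -= minimum
    return allocation, free


jobs = [("A", 2, 10), ("B", 2, 10), ("C", 2, 10)]
print(queued_allocation(jobs, 16))  # {'A': 10, 'B': 6}
print(balanced_start(jobs, 16))     # ({'A': 2, 'B': 2, 'C': 2}, 10)
```

In this sketch, Queued mode runs two jobs with large allocations while the third waits, whereas Balanced mode starts all three and leaves 10 cores to be distributed according to priority.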
Queued mode settings
In Queued mode, the HPC Job Scheduler Service starts jobs in queue order, and attempts to allocate the maximum requested resources to running jobs. The following sections describe the preemption and adaptive resource allocation settings that are associated with Queued mode.
Preemption allows higher priority jobs that are waiting in the queue to start sooner by taking resources away from lower priority, preemptible jobs that are already running. If you enable the Grow by preemption policy (see “Adaptive resource allocation” below), preemption is also used to help grow higher priority, running jobs to their maximum resource request (available starting with HPC Pack 2008 R2 with Service Pack 2 (SP2)).
The Preemptable job property is defined by the administrator in job templates. Use job templates to define the types of jobs that can be preempted, or the sets of users who can submit preemptible or nonpreemptible jobs. Users cannot set Preemptable when submitting a job through HPC Cluster Manager, HPC Job Manager, HPC PowerShell, or the HPC command-line tools. They can set it only through the HPC API, and only if the selected job template specifies both True and False as valid values for the Preemptable job property.
Preemption in Queued mode has the following options:
Graceful preemption (Default): Take resources from the preempted job as its running tasks complete so that work is not lost.
Immediate preemption: Take resources from the preempted job by canceling all running tasks so that resources can be allocated to the high priority job immediately. For more information about job and task cancelation, see the Additional Considerations section in Cancel a Job or Task.
Task level preemption (introduced in HPC Pack 2008 R2 with SP3): Enable preemption of individual tasks instead of entire jobs. With the default immediate preemption settings, the scheduler will cancel an entire job if any of its resources are needed for a higher priority job. When you enable task level preemption, the scheduler will cancel individual tasks instead. For example, if a Normal priority job is running 100 tasks on 1 core each, and a High priority job is submitted that requires 10 cores, task level preemption will cancel 10 tasks, rather than canceling the entire job. This option can improve job throughput by minimizing the amount of rework that must be done due to preemption.
Starting with HPC Pack 2012, in Queued scheduling mode, the default option for preemption behavior is task-level immediate preemption, rather than job-level preemption. This default behavior means that only as many tasks of low priority jobs are preempted as are needed to provide the resources required for the higher priority jobs, rather than preempting all of the tasks in the low priority jobs.
Starting with HPC Pack 2012 with Service Pack 1 (SP1), a service-oriented architecture (SOA) job ends its tasks after the current request is finished, even if there are additional requests to be calculated. In previous versions of HPC Pack, a SOA job ends its tasks to release resources for other jobs only after all the requests are calculated.
No preemption: Do not preempt jobs.
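The task level preemption example above (100 one-core tasks, with a High priority job that needs 10 cores) can be worked through in a few lines. This is a minimal sketch under stated assumptions: `tasks_to_cancel` is a hypothetical helper, not part of any HPC Pack API, and it assumes every task of the lower priority job uses the same number of cores.

```python
def tasks_to_cancel(running_tasks, cores_per_task, cores_needed, task_level):
    """Return how many running tasks of the lower priority job are canceled."""
    if cores_needed <= 0:
        return 0
    if not task_level:
        # Job-level immediate preemption: the entire job is canceled.
        return running_tasks
    # Task-level preemption: cancel just enough tasks to cover the request
    # (ceiling division), but never more tasks than are running.
    needed = -(-cores_needed // cores_per_task)
    return min(needed, running_tasks)


print(tasks_to_cancel(100, 1, 10, task_level=False))  # 100
print(tasks_to_cancel(100, 1, 10, task_level=True))   # 10
```

The other 90 tasks keep running in the task-level case, which is where the reduction in rework comes from.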
Adaptive resource allocation
Adaptive resource allocation dynamically adjusts the resources allocated to a job based on its tasks. Enabling resource adjustments can result in a significant improvement in cluster utilization and reduced job queue times, especially for clusters which run jobs composed of multiple tasks, such as parametric sweep computations. Only jobs that contain more than one task or subtask can benefit from automatic resource adjustment.
Adaptive allocation has the following settings that can be enabled or disabled:
Increase resources automatically (enabled by default): Use available resources to grow higher priority, running jobs to their maximum before starting lower priority jobs. With automatic growth enabled, the HPC Job Scheduler Service can allocate free resources to running jobs that have additional tasks to run. The service will not allocate more resources than the maximum requested for the job. This results in jobs spending more time in the queue waiting for resources, but they finish more quickly after they are started. Available resources are allocated first to the highest-priority job in the system, whether this job is running or queued.
- Grow by preemption (introduced in HPC Pack 2008 R2 with SP2): To help grow higher priority running jobs to their maximum, use preemption to take resources away from lower priority, running jobs. Preemption must be enabled to use this setting.
Decrease resources automatically (enabled by default): With automatic shrink enabled, the HPC Job Scheduler Service can release unused resources from running jobs that have no additional tasks to run. The service will not shrink resources below the minimum requested for the job. Automatic shrink results in better overall cluster utilization, but it may cause problems if you add tasks to jobs that are already in progress.
In the default job template, the job properties Auto Calculate Maximum and Auto Calculate Minimum are set to a default value of True. If a job template specifies that True is the only valid value for these properties, the submitting user will not have the option of specifying maximum and minimum resources for a job submitted with that template, and resources will be automatically calculated based on the tasks in the job.
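The grow and shrink rules can be summarized as clamping a job's allocation between its minimum and maximum. The following is a minimal sketch of that idea, not the scheduler's implementation: `adjust_allocation` and its parameters are hypothetical, and `demand` stands for the number of cores the job could use right now (for example, one core per running or waiting task).

```python
def adjust_allocation(current, minimum, maximum, demand, free_cores,
                      grow=True, shrink=True):
    """Return the job's new core count after one adjustment step."""
    # Never target less than the minimum or more than the maximum.
    target = max(minimum, min(maximum, demand))
    if target > current and grow:
        # Grow only as far as free cores allow.
        return min(target, current + free_cores)
    if target < current and shrink:
        # Release cores the job no longer has tasks for.
        return target
    return current


# A job with many waiting tasks grows to its maximum of 16.
print(adjust_allocation(8, 4, 16, demand=40, free_cores=20))  # 16
# A job that is running out of tasks shrinks, but not below its minimum of 4.
print(adjust_allocation(8, 4, 16, demand=2, free_cores=0))    # 4
```

With `shrink=False`, the second call would keep all 8 cores, which matches the case where you plan to add tasks to a job that is already in progress.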
Balanced mode settings
In Balanced mode, the HPC Job Scheduler Service attempts to start all incoming jobs as soon as possible at their minimum resource requirements. After all the jobs in the queue have their minimum resources, additional cluster resources are allocated to jobs based on their priority. Resource allocation is periodically rebalanced to fill idle resources, start new jobs, and adjust allocation according to the Priority Bias setting. The following sections describe the settings associated with Balanced mode.
Balanced scheduling is limited in situations where node groups overlap. Balanced mode is more effective in non-overlapping node groups.
If you specify that a job should run on a single node (available starting with HPC Pack 2012), the balancing performed by the HPC Job Scheduler Service may be limited by other jobs that are running on the cluster.
Preemption in Balanced mode allows jobs that are waiting in the queue to start sooner by taking resources away from preemptible jobs that are already running.
Starting with HPC Pack 2012 with Service Pack 1 (SP1), an HPC administrator can configure the preemption settings in Balanced mode. In previous versions of HPC Pack, preemption in Balanced mode is always Immediate.
The Preemptable job property is defined by the administrator in job templates. Use job templates to define the types of jobs that can be preempted, or the sets of users who can submit preemptible or nonpreemptible jobs. Users cannot set Preemptable when submitting a job through HPC Cluster Manager, HPC Job Manager, HPC PowerShell, or the HPC command-line tools. They can set it only through the HPC API, and only if the selected job template specifies both True and False as valid values for the Preemptable job property. (The default is True.)
Preemption in Balanced mode has the following options:
Immediate preemption (Default): Take resources from the preempted job by canceling and requeuing sufficient running tasks so that resources can be allocated to another job immediately. For most cluster workloads, immediate preemption in Balanced mode enables more jobs to start in a given time period. For that reason it is recommended in most cases to achieve balanced scheduling.
Graceful preemption: Take resources from the preempted job as its running tasks complete so that work is not lost. This is an advanced setting that should only be enabled for specific workloads. For example, it might be considered when using Balanced mode with service-oriented architecture (SOA) jobs consisting of long-running tasks, where it is critical to keep the results returned by each intermediate task.
Graceful preemption in Balanced mode can slow down the response time of starting a new job, and can reduce the balancing speed. The cluster should be carefully tested and monitored when graceful preemption in Balanced mode is enabled. For more information, see the following additional considerations.
Additional considerations for preemption in Balanced mode
Balancing speed: Balanced mode attempts to balance jobs as quickly as possible, using immediate preemption by default. If you choose to enable graceful preemption in Balanced mode, the balancing can only take place at the rate at which tasks exit. If there are long-running tasks on the cluster, balancing can take a long time. If the rate of incoming jobs exceeds the rate of the exiting tasks, the cluster will only balance when sufficient tasks have exited to reallocate the resources.
Resource utilization: By default in Balanced mode, the HPC Job Scheduler Service immediately preempts tasks to free up the resources (such as cores, nodes, or sockets) needed by any waiting job. However, if graceful preemption is enabled, resources are freed up as tasks exit, regardless of the resource requirements of a waiting job. It is possible that the freed-up resources are not the ones that are required by the next waiting job, and resources may remain idle until other tasks finish.
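The effect of graceful preemption on balancing speed can be estimated with a rough model. This is a minimal sketch under simplifying assumptions: `seconds_until_freed` is a hypothetical helper, every running task occupies one core, any freed core is usable by the waiting job (which, as noted above, is not always true), and cancellation delays for immediate preemption are ignored.

```python
def seconds_until_freed(remaining_runtimes, cores_needed, graceful):
    """remaining_runtimes: seconds left for each running one-core task."""
    if cores_needed <= 0:
        return 0
    if cores_needed > len(remaining_runtimes):
        raise ValueError("not enough running tasks to free that many cores")
    if not graceful:
        # Immediate preemption: tasks are canceled right away in this model.
        return 0
    # Graceful preemption: wait for the N shortest-remaining tasks to exit.
    return sorted(remaining_runtimes)[cores_needed - 1]


runtimes = [30, 600, 45, 3600, 120]
print(seconds_until_freed(runtimes, 3, graceful=False))  # 0
print(seconds_until_freed(runtimes, 3, graceful=True))   # 120
```

With long-running tasks in the mix, the graceful wait is set by how soon enough tasks happen to finish, which is why testing and monitoring are recommended before enabling it.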
Priority Bias controls how additional resources are allocated to jobs. In Balanced mode, “additional resources” refers to cluster resources above the total minimum resources for all running jobs. Tasks that are running on additional resources can be canceled with immediate preemption to accommodate new jobs or to converge on the desired allocation pattern.
Priority Bias has the following options:
High Bias: All additional resources are allocated to higher priority jobs.
Medium Bias (Default): Each priority band is given a higher proportion of additional resources than the band below it. The priority bands are Highest, Above Normal, Normal, Below Normal, and Lowest.
No Bias: Additional resources are allocated equally across the job queue.
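The three Priority Bias options can be compared with a small allocation sketch. This is illustrative only and does not reproduce the scheduler's actual formula: the function names are hypothetical, job maximums are ignored, High Bias is modeled as giving everything to the highest band present, and the Medium Bias weights (each band counting twice as much as the band below it, applied per job) are an assumption, since the text does not state the real proportions.

```python
# Priority bands from the text, highest first.
BANDS = ["Highest", "Above Normal", "Normal", "Below Normal", "Lowest"]

# HYPOTHETICAL Medium Bias weights: 16, 8, 4, 2, 1 from Highest to Lowest.
MEDIUM_WEIGHTS = {band: 2 ** (len(BANDS) - 1 - i) for i, band in enumerate(BANDS)}


def split(extra, weights):
    """Split `extra` whole cores in proportion to integer weights."""
    total = sum(weights)
    if total == 0:
        return [0] * len(weights)
    shares = [extra * w // total for w in weights]
    leftover = extra - sum(shares)
    # Hand cores lost to rounding to the heaviest-weighted jobs first.
    heaviest_first = sorted(range(len(weights)), key=lambda i: -weights[i])
    for i in heaviest_first[:leftover]:
        shares[i] += 1
    return shares


def allocate(jobs, total_cores, bias):
    """jobs: list of (name, band, minimum_cores). Returns {name: cores}."""
    if not jobs:
        return {}
    # Every job gets its minimum first; only the remainder is "additional".
    extra = total_cores - sum(minimum for _, _, minimum in jobs)
    if extra < 0:
        raise ValueError("not enough cores for every job's minimum")
    if bias == "none":
        weights = [1] * len(jobs)
    elif bias == "medium":
        weights = [MEDIUM_WEIGHTS[band] for _, band, _ in jobs]
    elif bias == "high":
        top = min(BANDS.index(band) for _, band, _ in jobs)
        weights = [1 if BANDS.index(band) == top else 0 for _, band, _ in jobs]
    else:
        raise ValueError("bias must be 'high', 'medium', or 'none'")
    shares = split(extra, weights)
    return {name: minimum + share
            for (name, _, minimum), share in zip(jobs, shares)}


jobs = [("A", "Highest", 4), ("B", "Normal", 4),
        ("C", "Normal", 4), ("D", "Lowest", 4)]
print(allocate(jobs, 40, "high"))    # {'A': 28, 'B': 4, 'C': 4, 'D': 4}
print(allocate(jobs, 40, "medium"))  # {'A': 20, 'B': 8, 'C': 8, 'D': 4}
print(allocate(jobs, 40, "none"))    # {'A': 10, 'B': 10, 'C': 10, 'D': 10}
```

In all three cases each job keeps its minimum of 4 cores, and only the 24 additional cores are distributed differently.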
The Rebalancing Interval represents the time, in seconds, between rebalancing passes. The default value is 10 seconds.
A longer interval can improve scheduler performance, but it can take longer to respond to new jobs and converge on the desired allocation pattern. Longer intervals are good if you do not need instant growing and shrinking. If your cluster has a high turnaround rate (jobs are submitted frequently and finish quickly), you might want a longer interval to avoid excessive growing and shrinking.
A shorter rebalancing interval provides a faster response when new jobs are submitted, at the cost of additional load on the head node. If you need faster responses, you can also adjust the Task Cancel Grace Period and the Release Task Timeout, both of which affect how long it takes for running work to be moved out of the way.
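The trade-off between response time and head node load can be made concrete with simple arithmetic. This is a minimal sketch with hypothetical helper names, and it assumes that rebalancing passes occur at exact multiples of the interval, counted from time zero. The real scheduler's timing may differ.

```python
def wait_for_next_pass(submit_time, interval=10):
    """Seconds from job submission until the next rebalancing pass."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    remainder = submit_time % interval
    return 0 if remainder == 0 else interval - remainder


def passes_per_hour(interval=10):
    """How many rebalancing passes the scheduler performs per hour."""
    return 3600 / interval


# A job submitted 23 seconds in, with the default 10-second interval:
print(wait_for_next_pass(23), passes_per_hour())        # 7 360.0
# The same job with a 60-second interval:
print(wait_for_next_pass(23, 60), passes_per_hour(60))  # 37 60.0
```

Under these assumptions, raising the interval from 10 to 60 seconds cuts the number of passes by a factor of six, while a new job may wait up to a minute before the next rebalancing pass considers it.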