Job Is Scheduled Using Stale Node Hardware Configuration Information

Updated: May 2011

Applies To: Windows HPC Server 2008, Windows HPC Server 2008 R2

After you change the hardware configuration of an existing compute node, it can take some time for the Windows HPC Server 2008 R2 or Windows HPC Server 2008 cluster to discover the change. Until the change is discovered and certain internal data stores are updated with this information, a compute job in the cluster can be scheduled using the previous (stale) hardware configuration information for that node.

Cause

Windows HPC Server uses an internal job scheduling database to store hardware and other configuration information about each compute node. This information is used by the job scheduler to select resources for a compute job.

After a hardware configuration change on a compute node, the cluster must discover the configuration change and then update the job scheduling database. If the job scheduling database is not updated after a hardware configuration change, a job may be scheduled for a compute node based on stale hardware configuration information.

The job scheduling database may fail to be updated with the changed hardware configuration information under conditions such as the following:

  • The compute node stays in the Online state during and after the discovery of the configuration change by the cluster.

  • The compute node is in the Offline state during the configuration change, but it is brought online before the cluster discovers the change.

Resolution

After you change the hardware configuration of a compute node, ensure that the cluster discovers the configuration change and that the internal job scheduling database is updated with the new configuration information.

To update the cluster with current node hardware configuration information

  1. Take the compute node offline (or, if the node is already in the Offline state, leave it in this state).

  2. Restart the compute node.

  3. After the compute node starts, check the operations log to verify that the configuration of the compute node is discovered. Discovery may take several minutes or longer. Discovery is complete when an entry similar to the following appears in the operations log and is in the Committed state:

    Discovering the configuration of node <NodeName>.

    Note
    You can confirm that the configuration change is discovered by viewing the properties of the node in Node Management.

  4. After the hardware configuration change is discovered, bring the node online. At this time, the node can accept and run cluster jobs that use the changed hardware configuration.

    Important
    You must wait for discovery to complete before you bring the node online. If you bring the node online before the hardware change is discovered by the cluster, the job scheduling database is not updated.

For information about how to view the operations log, see Read the Operations Log.
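
The check in steps 3 and 4 can be summarized as a small decision rule. The following Python sketch is illustrative only and does not call any Windows HPC Server API. It assumes that you have already copied the relevant operations log entries into a list of dictionaries with the hypothetical keys `name` and `state`. The function names are also hypothetical.

```python
from typing import Iterable, Mapping


def discovery_committed(entries: Iterable[Mapping[str, str]], node_name: str) -> bool:
    """Return True if a discovery entry for node_name is in the Committed state.

    Assumption: each entry is a dict with hypothetical keys "name" and "state".
    Pass only entries logged after the restart in step 2, so that an older
    discovery entry is not mistaken for the current one.
    """
    # The article says the entry is "similar to" this text, so adjust the
    # match if the wording in your operations log differs.
    target = f"discovering the configuration of node {node_name}".lower()
    for entry in entries:
        name = entry.get("name", "").strip().rstrip(".").lower()
        state = entry.get("state", "").strip().lower()
        if name == target and state == "committed":
            return True
    return False


def safe_to_bring_online(node_state: str,
                         entries: Iterable[Mapping[str, str]],
                         node_name: str) -> bool:
    """Apply the rule from steps 1 to 4: the node is Offline and discovery is Committed."""
    node_is_offline = node_state.strip().lower() == "offline"
    return node_is_offline and discovery_committed(entries, node_name)
```

The rule is deliberately conservative: if the node is not offline, or if no committed discovery entry is found, the function returns `False`, which corresponds to waiting before you bring the node online.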

Verification

To verify that the job scheduler is using the latest hardware configuration information about the compute node, run the node view command. This command returns information about the compute node from the job scheduling database. For example, to view a detailed list of properties and values for the compute node, type the following command:

node view /detailed <NodeName>

For more information, see node view.
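
If you verify several nodes, you can compare the reported values with the values you expect after the hardware change. The following Python sketch is one possible approach, not part of the product. It assumes that the command output consists of lines in the form `Property : Value`; confirm this against the actual output on your cluster. The property names used with it are supplied by you, and the function names are hypothetical.

```python
from typing import Dict, Mapping, Optional, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form "Property : Value" into a dict.

    Lines without a colon, or with an empty property name, are skipped.
    Only the first colon splits the line, so values that contain colons
    are kept whole.
    """
    props: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            props[name.strip()] = value.strip()
    return props


def find_stale(expected: Mapping[str, object],
               reported: Mapping[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {property: (expected, reported)} for every value that differs.

    A reported value of None means the property was not found in the output.
    """
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, want in expected.items():
        got = reported.get(name)
        if got != str(want):
            mismatches[name] = (str(want), got)
    return mismatches
```

To use the sketch, capture the output of the node view command as text (for example, by redirecting it to a file), pass it to `parse_properties`, and then call `find_stale` with a dictionary of the property names and values that you expect. An empty result suggests that the job scheduling database reflects the new configuration for the properties you checked.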