Troubleshooting Jobs

 

Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2

Jobs and tasks can fail for a number of reasons. The steps below provide a starting point for investigating failures using HPC Job Manager.

To review job and task error messages

  1. In the Navigation Pane, under My Jobs, click Failed.

  2. Double-click a job (or right-click a job, and then click View Job) to see the job details.

  3. In the Job Progress tab, review the Messages field for information about why the job and tasks failed.

  4. Click a task error message to pivot to a filtered view in the View Tasks tab that displays all the tasks that failed with that error message. To clear the filter and view all tasks, click Clear Filter.

  5. In the View Tasks tab, you can filter the visible tasks to see all failed tasks:

    1. In Filter, select State.

    2. In the filter value box, select Failed.

    3. Click the filter icon or press the Enter key.

Common causes of job failure

  • One or more tasks in the job have failed. This is the most common cause of job failure, and it indicates that one or more tasks could not be run or did not complete successfully. View the task-level error messages to investigate this type of job failure. For more information, see Common causes of task failure in this topic.

  • A node assigned to the job could not be contacted. Jobs that fail because of a node falling out of contact are automatically retried a certain number of times, but eventually fail if the problem continues. If you receive this error message, you can try requesting different nodes for the job, or specifically exclude the problem node from the job. For more information, see Define Excluded Nodes for a Job - Job Manager.

  • The job’s run time expired. The HPC Job Scheduler Service cancels jobs that reach the end of their run time. If possible, modify the run time for your job, and then requeue your job. For more information, see Modify a Job - Job Manager and Requeue a Job or Task - Job Manager.

  • The job could not be started on one of its allocated nodes. The most common cause of this type of failure is that an invalid user name or password is associated with the job. You can use the job modify command to update the credentials attached to your job, and then try requeuing the job.

Common causes of task failure

  • The task failed during execution. This type of error occurs in the application itself. Check the output and error files for details. If you did not specify standard output and error files for the task, review the Output and Error fields in the Task Properties dialog box.

    Note

    This message indicates that the task’s command line returned an exit code that the HPC Job Scheduler Service interprets as a failure (by default, a non-zero exit code). However, some applications might return a non-zero exit code even when they succeed. For more information, see Command line evaluation statements for non-zero exit codes in this topic.

    In HPC Pack 2012, success exit codes other than 0 can be defined for all tasks in a job or for individual tasks. For more information, see Understanding Job and Task Properties - Job Manager.

  • The task’s run time expired. The HPC Job Scheduler Service cancels tasks that reach the end of their run time. You can create a new copy of your task with a longer run time and attempt to requeue the job.

  • A file location required by the task could not be accessed. A frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations. Check the following possible causes:

    • A permissions issue is preventing the task from accessing the specified file.

    • A networking issue is preventing access to the file from the specified compute node.

    • The working directory, input file, or output file location does not exist.

  • A node assigned to the task could not be contacted. Tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. If you receive this error message, you can try requesting different nodes for the job, or specifically exclude the problem node from the job. For more information, see Define Excluded Nodes for a Job - Job Manager.
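
The file-location causes listed above can be checked before a task is submitted. The following is a minimal sketch in Python and is not part of HPC Pack: the function name check_task_paths and its parameters are hypothetical, and it assumes that Python is available and that the check runs on the compute node under the same account as the task. On Windows, os.access may not reflect permissions set through access control lists, so an empty result does not guarantee that the task can access the files.

```python
import os


def check_task_paths(work_dir, input_file=None, output_file=None):
    """Return a list of problems found; an empty list means none were detected.

    Relative input and output paths are resolved against work_dir.
    """
    problems = []

    # The working directory must exist before the task starts.
    if not os.path.isdir(work_dir):
        problems.append(f"working directory does not exist: {work_dir}")

    if input_file is not None:
        resolved = os.path.join(work_dir, input_file)
        if not os.path.isfile(resolved):
            problems.append(f"input file does not exist: {resolved}")
        elif not os.access(resolved, os.R_OK):
            problems.append(f"input file is not readable: {resolved}")

    if output_file is not None:
        # The output file itself may not exist yet, so check its folder.
        out_dir = os.path.dirname(os.path.join(work_dir, output_file))
        if not os.path.isdir(out_dir):
            problems.append(f"output location does not exist: {out_dir}")
        elif not os.access(out_dir, os.W_OK):
            problems.append(f"output location is not writable: {out_dir}")

    return problems
```

A networking issue usually shows up here as a missing path, because an unreachable share looks the same as a folder that does not exist.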

Command line evaluation statements for non-zero exit codes

If your application returns non-zero exit codes on success, you can add an evaluation statement to the command line that checks for the successful exit codes. Use the %ERRORLEVEL% environment variable to evaluate the application’s exit code and, on success, return a value of 0 to the HPC Job Scheduler Service. Alternatively, you can modify the command to ignore all exit codes.

For example, Robocopy.exe returns exit codes 0 and 1 for success (1 indicates that files were copied). If you submit a task that specifies the command robocopy c:\dirA c:\dirB *.*, the task might complete successfully but still be marked as Failed by the HPC Job Scheduler Service because the exit code is non-zero.

  • To check for successful exit codes (less than or equal to 1), you can modify the command as follows:

    robocopy c:\dirA c:\dirB *.* ^& IF %ERRORLEVEL% LEQ 1 exit 0

  • To ignore exit codes, you can modify the command as follows:

    robocopy c:\dirA c:\dirB *.* ^& exit 0
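
As an alternative to evaluating the exit code in the command line, the same mapping can be done in a small wrapper script. The following is a minimal sketch in Python, assuming that Python is installed on the compute nodes. The names SUCCESS_CODES, normalize_exit_code, and run_and_normalize, and the script name wrap_exit.py, are hypothetical and are not part of HPC Pack.

```python
import subprocess
import sys

# Assumption: the exit codes to treat as success (0 and 1, as in the
# Robocopy example above). Adjust this set for your own application.
SUCCESS_CODES = frozenset({0, 1})


def normalize_exit_code(code, success_codes=SUCCESS_CODES):
    """Return 0 when code counts as success; otherwise return code unchanged."""
    return 0 if code in success_codes else code


def run_and_normalize(command, success_codes=SUCCESS_CODES):
    """Run command (a list of arguments) and return its normalized exit code."""
    completed = subprocess.run(command)
    return normalize_exit_code(completed.returncode, success_codes)


# To use this as a task command, save it as a script (hypothetical name:
# wrap_exit.py) that ends with:
#     sys.exit(run_and_normalize(sys.argv[1:]))
```

With that final line added, a task command such as python wrap_exit.py robocopy c:\dirA c:\dirB *.* would report 0 to the HPC Job Scheduler Service for exit codes 0 and 1 and pass any other exit code through unchanged, so real failures are still marked as Failed.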

Additional references