Slurm Megathread - Check here first for problems with Slurm

The following is a list of common problems users have with Slurm.

If you do not see your problem below, or a listed solution does not work, please create a new post in this topic.

Job Not Starting

Common Errors Associated

  • ReqNodeNotAvail, UnavailableNodes:
  • Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
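
These messages normally show up as the pending reason that squeue reports for a job. As a rough sketch, the helper below (jobs_blocked_by_nodes is a hypothetical name, not a Slurm command) reads JOBID|REASON lines and prints the IDs of jobs whose reason matches either error above. It assumes the reason text contains one of the two strings exactly as listed; wording may differ between Slurm versions.

```shell
# Hypothetical helper (not part of Slurm): reads "JOBID|REASON" lines on stdin
# and prints the IDs of jobs whose pending reason matches the errors listed above.
jobs_blocked_by_nodes() {
  while IFS='|' read -r jobid reason; do
    case "$reason" in
      *ReqNodeNotAvail*|*"Nodes required for job are DOWN"*)
        printf '%s\n' "$jobid"
        ;;
    esac
  done
}

# Usage on a login node (squeue flags: -h no header, -u user, -t job state,
# -o output format; %i is the job ID and %r is the pending reason):
#   squeue -h -u "$USER" -t PENDING -o '%i|%r' | jobs_blocked_by_nodes
```

Jobs that are pending for other reasons (for example Priority or Resources) are skipped, since the resolutions below do not apply to them.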

Common Resolutions

As the error suggests, the nodes you are attempting to schedule the job to are unavailable. Common reasons that they might not be available are:

  • The node went into a DRAIN or DOWN state after a job failed on the node.
  • The node was put into a DRAIN state for maintenance.

To see the current reason the node(s) are drained, use the command sinfo -R -n [NODE] (e.g. sinfo -R -n r21).

If the node was drained for the reason “Kill task failed”, it can likely be put back into service by an administrator. We’re usually pretty good at catching these, but if a node stays in that state for multiple days, please reach out.

To see whether the node was drained for a maintenance event, please check our Alerts Page, the General Category, as well as the SSH MOTD for any announcements and updates.
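
The checks above can be sketched as a small shell helper. This is only an illustration: suggest_next_step is a hypothetical name, and the messages it prints simply restate the advice in this post. It reads NODE|STATE|REASON lines, which sinfo can produce with a custom output format.

```shell
# Hypothetical helper (not part of Slurm): reads "NODE|STATE|REASON" lines on
# stdin and prints a suggested next step for each node, following the advice above.
suggest_next_step() {
  while IFS='|' read -r node state reason; do
    case "$reason" in
      *"Kill task failed"*)
        printf '%s (%s): an administrator can likely return this node to service\n' "$node" "$state"
        ;;
      *)
        printf '%s (%s): reason is "%s", check the maintenance announcements\n' "$node" "$state" "$reason"
        ;;
    esac
  done
}

# Usage (sinfo flags: -h no header, -n node list, -o output format;
# %N is the node name, %T the node state, %E the reason it is unavailable):
#   sinfo -h -n r21 -o '%N|%T|%E' | suggest_next_step
```

A node that belongs to several partitions may be printed more than once; the suggestion is the same for each line.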