Job Control


Controlling Jobs

Job control command                       Explanation
squeue                                    View job and job step information for jobs managed by Slurm.
scontrol show node                        Show detailed information about compute nodes.
scontrol show partition <partition name>  Show detailed information about a specific partition/queue.
scontrol show job <job ID>                Show detailed information about a specific job, or about all jobs if no job ID is given.
sinfo                                     View information about Slurm nodes and partitions/queues.
scancel <job ID>                          Kill a job. Users can kill their own jobs; root can kill any job.
scontrol hold <job ID>                    Hold a job.
scontrol release <job ID>                 Release a job.
sbalance                                  Check the available account balance.

Note: Users can check their /home and /scratch quotas using the "myquota" command.

 

Sample command outputs are given below:

List jobs

 $ squeue

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

    106  standard slurm-jo    user1  R       0:04      1 atom01
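A listing like the one above can be summarized per job state. The following is a minimal sketch, not part of Slurm: `count_states` is a hypothetical helper name, and it assumes the default column order shown above (state code in the fifth column, ST) and job names without spaces.

```shell
# Hypothetical helper (not a Slurm command): count jobs per state code.
# Reads squeue-style output on stdin; assumes the fifth column is ST.
count_states() {
    awk 'NF >= 5 && $1 != "JOBID" { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Only query the scheduler when squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}" | count_states
fi
```

Each output line is a state code followed by a job count, for example one line for R (running) and one for PD (pending).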

Get job details

$ scontrol show job 106

JobId=106 Name=slurm-job.sh

   UserId=user1(1001) GroupId=user1(1001)

   Priority=4294901717 Account=(null) QOS=normal

   JobState=RUNNING Reason=None Dependency=(null)

   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0

   RunTime=00:00:07 TimeLimit=14-00:00:00 TimeMin=N/A

   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02

   StartTime=2013-01-26T12:55:02 EndTime=Unknown

   PreemptTime=None SuspendTime=None SecsPreSuspend=0

   Partition=standard AllocNode:Sid=atom-head1:3526

   ReqNodeList=(null) ExcNodeList=(null)

   NodeList=atom01

   BatchHost=atom01

   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*

   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0

   Features=(null) Gres=(null) Reservation=(null)

   Shared=0 Contiguous=0 Licenses=(null) Network=(null)

   Command=/home/user1/slurm/local/slurm-job.sh

   WorkDir=/home/user1/slurm/local
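Because the job record above is a series of Key=Value pairs, a single field can be pulled out with standard text tools. This is only a sketch: `job_field` is a hypothetical helper name, and it assumes the value of interest contains no spaces (so it would not suit a Command line that carries arguments).

```shell
# Hypothetical helper (not a Slurm command): print the value of one
# Key=Value field from "scontrol show job" output read on stdin.
job_field() {
    # $1 is the field name, e.g. JobState or RunTime.
    tr ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Demonstration on two lines copied from the sample output above.
printf '%s\n' 'JobId=106 Name=slurm-job.sh' \
    '   JobState=RUNNING Reason=None Dependency=(null)' | job_field JobState
```

On a cluster the same helper could be fed directly, for example `scontrol show job 106 | job_field JobState`.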

Kill a job. Users can kill their own jobs; root can kill any job.

$ scancel 135

$ squeue

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
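Since a cancelled job cannot be recovered, it can help to check the job ID before calling scancel. The wrapper below is a sketch under stated assumptions: `cancel_job` and the `DRY_RUN` variable are illustrative names, not Slurm features, and the check accepts only plain numeric IDs (so array-style IDs are rejected).

```shell
# Hypothetical wrapper around scancel. By default it only prints the
# command; set DRY_RUN=0 to really cancel the job.
cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            echo "not a numeric job ID: $1" >&2
            return 1
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: scancel $1"
    else
        scancel "$1"
    fi
}

cancel_job 135    # dry run by default; prints the command only
```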

Hold a job:

$ squeue

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

    139  standard   simple    user1 PD       0:00      1 (Dependency)

    138  standard   simple    user1  R       0:16      1 atom01

$ scontrol hold 139

$ squeue

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

    139  standard   simple    user1 PD       0:00      1 (JobHeldUser)

    138  standard   simple    user1  R       0:32      1 atom01

Release a job:

$ scontrol release 139

$ squeue

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

    139  standard   simple    user1 PD       0:00      1 (Dependency)

    138  standard   simple    user1  R       0:46      1 atom01
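The hold and release steps above show a held job carrying the reason (JobHeldUser) in the last squeue column. A sketch for finding such jobs follows; `held_jobs` and `comma_join` are hypothetical helper names. The scontrol man page describes hold and release as taking a comma-separated job list, but check `man scontrol` on your system before relying on that.

```shell
# Hypothetical helpers (illustrative names, not Slurm commands).

# Print the IDs of jobs whose last squeue column is (JobHeldUser).
held_jobs() {
    awk '$NF == "(JobHeldUser)" { print $1 }'
}

# Join IDs read one per line on stdin into a comma-separated list.
comma_join() {
    paste -sd, -
}

# On a cluster, print (rather than run) the matching release command.
if command -v squeue >/dev/null 2>&1; then
    ids=$(squeue -u "${USER:-$(id -un)}" | held_jobs | comma_join)
    if [ -n "$ids" ]; then
        echo "scontrol release $ids"
    fi
fi
```

Printing the command instead of running it leaves the decision to release with the user.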

View the available partitions/queues and node status:

$ sinfo -s

PARTITION     AVAIL   TIMELIMIT   NODES(A/I/O/T)  NODELIST

standard         up 3-00:00:00    32/356/54/442  cn[001-384],gpu[001-022],hm[001-036]

gpu              up 3-00:00:00        0/21/1/22  gpu[001-022]

hm               up 3-00:00:00        0/35/1/36  hm[001-036]

standard-low*    up 3-00:00:00    32/356/54/442  cn[001-384],gpu[001-022],hm[001-036]
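In the summary above, the NODES(A/I/O/T) column packs four counts into one field: allocated, idle, other, and total. A minimal sketch for extracting the idle count per partition is shown here; `idle_nodes` is a hypothetical helper name, and it assumes the four-column layout printed above.

```shell
# Hypothetical helper (not a Slurm command): print
# "<partition> <idle node count>" for each partition line of
# "sinfo -s" output read on stdin.
idle_nodes() {
    awk '$1 != "PARTITION" && split($4, n, "/") == 4 {
        p = $1
        sub(/\*$/, "", p)   # drop the "*" that marks the default partition
        print p, n[2]       # n[1..4] = allocated, idle, other, total
    }'
}

# Only query the scheduler when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -s | idle_nodes
fi
```

For the sample above, the standard partition line 32/356/54/442 would yield 356 idle nodes.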