
Hyak: Viewing and interacting with Jobs

seanb80 edited this page Jun 21, 2017 · 1 revision

Viewing Jobs:

There are two main commands for viewing job status: squeue and scontrol. squeue shows information about all jobs currently queued or running on Hyak, while scontrol shows detailed information about a specific job and requires additional arguments.

squeue -

Typing squeue on any type of Hyak node produces output like the following:

[seanb80@mox1 CanuTest]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             24074     build   my_job    dsale PD       0:00      1 (QOSMaxCpuPerUserLimit)
             24355     build   my_job    dsale PD       0:00      1 (Priority)
             32589      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32590      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32591      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32592      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32594      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32595      chem dft_dime  inguyen PD       0:00      2 (Resources)
             32628      chem Modello_     gd24 PD       0:00      1 (Resources)
             32770    ilahie  R5local   vanouk PD       0:00     18 (Resources)
             32765       stf R5global   vanouk PD       0:00     11 (Resources)
             32776  ferrante     bash      af0  R      14:38      1 n2013
             32103      choe NPc_f2_5    ychoe  R 4-10:43:42      1 n2179
             32482      choe NPex_dn2    ychoe  R   22:14:20      1 n2012
             32481      choe  NPex_dn    ychoe  R   22:14:51      1 n2195
             32192      chem EXC-DMC-    lrm13  R 2-18:29:37      1 n2014
             32588      chem dft_dime  inguyen  R    3:34:51      2 n[2024-2025]
             32619      chem dft_snap   yliu92  R   17:08:07      1 n2201
             32618      chem dft_snap   yliu92  R   17:21:38      1 n2005
             32769    ilahie   R5orig   vanouk  R      57:21     18 n[2156-2173]
             32494      chem prova2_E     gd24  R   17:56:09      1 n2180
             32504      chem prova2_E     gd24  R   17:56:09      1 n2184
             ...

This shows the JobID (important for scontrol), the partition the job was submitted to (here named after the owning group), the job name, the user, the job state (PD for pending, R for running), the elapsed run time, the number of nodes used, and the node IDs (important for sshing in to view process information). For pending jobs, the last column shows the reason the job is waiting instead. To find a single group's jobs, pipe the output into grep, e.g. squeue | grep "srlab".

[seanb80@n2149 CanuTest]$ squeue | grep "srlab"
             32779     srlab     bash  seanb80  R       0:09      1 n2149

scontrol -

scontrol shows more in-depth information about a specific job or node.

Job information: scontrol show job JobID returns the job's state, run time, time limit, and allocated resources. The output below shows that our job has been running for 00:14:55 (RunTime), has a total TimeLimit of 00:30:00, and has been allocated 28 CPUs (NumCPUs) and 28 GB of memory (mem) on a single node.

[seanb80@n2149 CanuTest]$ scontrol show job 32779
JobId=32779 JobName=bash
   UserId=seanb80(557445) GroupId=hyak-srlab(415510) MCS_label=N/A
   Priority=100 Nice=0 Account=srlab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:14:55 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2017-06-21T07:32:33 EligibleTime=2017-06-21T07:32:33
   StartTime=2017-06-21T07:32:33 EndTime=2017-06-21T08:02:33 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=srlab AllocNode:Sid=mox1:15953
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n2149
   BatchHost=n2149
   NumNodes=1 NumCPUs=28 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=28G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/gscratch/srlab/data/CanuTest
   Power=
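When only one or two of these values are needed, for example in a script that checks how long a job has left, they can be pulled out of the Key=Value text. The sketch below assumes the output format shown above. job_field is a hypothetical helper name, and it only handles values that contain no spaces.

```shell
# job_field KEY
# Reads `scontrol show job` output on stdin and prints the value of the
# first Key=Value pair whose key is exactly KEY.
job_field() {
    # Put each whitespace-separated Key=Value pair on its own line, then
    # print everything after "KEY=" for the first pair that starts with it.
    tr -s ' ' '\n' | awk -v key="$1" '
        !found && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            found = 1
        }'
}

# Intended use on a Hyak node:
#   scontrol show job 32779 | job_field RunTime
#   scontrol show job 32779 | job_field TimeLimit
```

Splitting on the first "=" only means that nested values such as TRES=cpu=28,mem=28G,node=1 are returned whole.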

Canceling Jobs:

Jobs are canceled with the scancel JobID command. It cancels any job you own. As the output below shows, srun waits up to 12 seconds for the job step to finish before killing it, so be sure you're canceling the right job before you run the command.

[seanb80@n2149 CanuTest]$ scancel 32779
srun: Force Terminated job 32779
[seanb80@n2149 CanuTest]$ srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: n2149: task 0: Killed
[seanb80@mox1 CanuTest]$ 
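One way to guard against canceling the wrong job is to look at the job's squeue entry before confirming. The following is a minimal sketch of that idea. confirm_cancel is a hypothetical wrapper, not a Slurm command, and it relies only on squeue -j and scancel.

```shell
# confirm_cancel JOBID
# Prints the squeue entry for JOBID, asks for confirmation, and calls
# scancel only if the answer is y or Y.
confirm_cancel() {
    jobid=$1
    # Show the job so its ID can be checked against its name, user, and node.
    squeue -j "$jobid" || return 1
    printf 'Cancel job %s? [y/N] ' "$jobid"
    read -r answer
    case $answer in
        y|Y) scancel "$jobid" ;;
        *)   echo "Left job $jobid running." ;;
    esac
}

# Intended use on a Hyak node:
#   confirm_cancel 32779
```

Any answer other than y or Y, including just pressing Enter, leaves the job untouched.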