Why are my jobs staying so long in the wait queue?

The machine is most likely busy, and other jobs may have a higher priority than yours. However, it is also possible that your job is incorrectly configured. For example, if the squeue command shows (QOSMaxWallDurationPerJobLimit) in the NODELIST(REASON) column, you have requested more execution time (directive #SBATCH --time=… in your job) than is allowed by the selected QoS (either the default QoS or the one set via the directive #SBATCH --qos=… in your job). In this case, you should delete your job, modify it, and then resubmit it.
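The delete, modify and resubmit cycle described above can be sketched with the standard Slurm commands. This is only an illustration: the job ID 123456 and the script name job.slurm are placeholders, not values from this page.

```shell
# List your own jobs; for a pending job, the NODELIST(REASON) column
# shows why it is waiting (e.g. QOSMaxWallDurationPerJobLimit).
squeue -u "$USER"

# Delete the job that can never start (123456 is a placeholder job ID).
scancel 123456

# Edit the "#SBATCH --time=" and/or "#SBATCH --qos=" directives in your
# script so that the requested time fits the QoS, then resubmit it
# (job.slurm is a hypothetical file name).
sbatch job.slurm
```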

Note: If your CPU jobs (or GPU jobs) meet the criteria of the CPU partitions and the CPU QoS (respectively, the GPU partitions and the GPU QoS) defined on the machine, do not delete them. They will be executed as soon as possible.

Job priority depends on the following criteria:

  • The QoS (Quality of Service) chosen: The “dev” QoS accepts only short jobs (less than 2h) and has a higher priority than the default “t3” QoS (which accepts jobs up to 20h); the “t3” QoS in turn has a higher priority than the “t4” QoS, which accepts single-node jobs up to 100h.
  • The regulation of computing hours based on a fairshare principle: A project that is ahead of schedule in consuming its allocated hours will have a lower priority. You can check the consumption status of your project with the idr_compuse command.
  • The wait time in the queue: The longer the time since a job was submitted, the higher its priority.

Together, these criteria ensure that all users can access the resources while maximizing the effective usage of the machine.
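To avoid a job that is blocked by the time limit of its QoS, the requested time can be checked before submission. The following is a minimal bash sketch, not an official tool: the function names are hypothetical, the short labels “dev”, “t3” and “t4” are used as written above (the actual QoS names on the machine may differ), the limits are assumed to be inclusive, and only the HH:MM:SS time format is handled.

```shell
#!/bin/bash

# Maximum walltime in hours for each QoS, as quoted in the list above.
# The labels are taken from the text; real QoS names may differ.
qos_max_hours() {
    case "$1" in
        dev) echo 2 ;;
        t3)  echo 20 ;;
        t4)  echo 100 ;;
        *)   return 1 ;;   # unknown QoS label
    esac
}

# Convert a walltime given as HH:MM:SS into seconds.
# The 10# prefix forces base 10 so that values like "08" are not read as octal.
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Exit status 0 if the walltime fits within the QoS limit,
# 1 if it exceeds the limit, 2 if the QoS label is unknown.
fits_qos() {
    local qos=$1 walltime=$2 max_hours
    max_hours=$(qos_max_hours "$qos") || return 2
    [ "$(walltime_to_seconds "$walltime")" -le $(( max_hours * 3600 )) ]
}
```

For instance, `fits_qos dev 01:30:00` succeeds, whereas `fits_qos t3 24:00:00` returns a non-zero status because 24h exceeds the 20h quoted for “t3”.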