PBS jobfile structure

To start a job on one or more nodes, after preparing your input files (for an installed application such as Abaqus) or compiling your executable (if you wrote the application yourself), you need to prepare a jobfile containing shell commands and PBS directives and submit it to the PBS scheduler, which will take care of its execution on the compute nodes.

A jobfile is a plain text file containing shell commands and PBS directives. You can name it as you like, but we suggest a human-readable name that identifies your specific job. In the following, <jobfile> stands for the name of a specific jobfile.

Jobfile structure

Here we show the content of a jobfile independent of the specific application to run; this part is common to all of your jobfiles. Make sure you understand each of the PBS directives, because they affect job scheduling and execution.

<jobfile> content:

#!/bin/bash
#
# Set Job execution shell
#PBS -S /bin/bash
 
# Set Job name: <jobname> will be the name used to refer to your job in the queue 
# lists, so use a clearly understandable name
#PBS -N <jobname>
 
# Set the execution queue: <queue name> is the name of the queue you want to submit
# the job to. It is one of the defined queues: gandalf, merlino, default, morgana,
# covenant. The choice of queue depends on the resources needed (# of nodes,
# # of cores = # of MPI threads per node, memory, node interconnect network);
# also make sure you have the right to access restricted queues
#PBS -q <queue name>
 
# Set mail addresses that will receive mail from PBS about job
# Can be a list of addresses separated by commas (,)
#PBS -M <polimi.it or mail.polimi.it email address only>
 
# Set whether the job can be re-run (y = yes, n = no)
# Notice: the re-run capability depends on the specific application
#PBS -r n
 
# Set the events for which PBS sends mail about the job
# Send an email to the address specified above
# for these job events: abort (a), begin (b), end (e)
#PBS -m abe
 
# Set the standard output file
# (a relative path is resolved against the directory of execution)
# For clarity, use the jobfile name
#PBS -o <jobfile>.out
 
# Set the standard error file
# (a relative path is resolved against the directory of execution)
# For clarity, use the jobfile name
#PBS -e <jobfile>.err
 
# Set total wall clock time (hh:mm:ss)
# Notice: this is a job-wide resource
# See more in the section Requesting job resources
#PBS -l walltime=00:20:00
 
# Request nodes, number of CPUs (cores) per node, and number of MPI processes per node.
# #nodes is the number of chunks (portions of compute nodes) from which you request
# the cores and the MPI threads.
#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores
 
# Pass environment to job
# This is important: keep in mind that jobs
# actually run on nodes that do not share the same
# environment as the masternode
#PBS -V
 
# Change to submission directory
#!!!!! IMPORTANT: READ THE SECTION ABOUT JOB EXECUTION DIRECTORY BELOW !!!!!
cd $PBS_O_WORKDIR
 
# Command to launch the application and its parameters
# This is application dependent, so it is covered in each application's section
 
module load <module specific to the application>
command-to-start-the-application <parameters and arguments> 
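
Once the jobfile is ready, submit it from the folder that contains your input files. For example (myjob.pbs and the job id shown below are only illustrative):

qsub myjob.pbs           # submits the job and prints its id, e.g. 12345.masternode
qstat -u $USER           # lists your jobs and their status
qdel 12345.masternode    # removes the job from the queue, if needed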

Job Execution Directory

The jobfile is usually submitted from a subfolder of your home directory where you collect the input files and auxiliary files such as C or Fortran sources. If you submit the job with qsub <jobfile>, the execution directory will be this folder and all the I/O of the process will go through the home file system. To improve performance, it is better to select another file system according to one of the following two cases:

You have requested only one node (regardless of the number of cores)

All the processing takes place on a single node, but the file I/O still goes over the network to the home file system, which slows the job down. To avoid this overhead and speed up the job, modify your jobfile as follows:

1- Remove the line cd $PBS_O_WORKDIR

2- Before your “module load” command, copy and paste these lines

# Name of the per-job scratch directory: <username>_<job folder name>_jobid_<jobid>
SCRATCH_NAME="${PBS_O_LOGNAME}_$(basename $PBS_O_WORKDIR)_jobid_${PBS_JOBID}"
SCRATCH_DIR="/scratch_local/${SCRATCH_NAME}"
mkdir "$SCRATCH_DIR"
cp -R "$PBS_O_WORKDIR"/* "$SCRATCH_DIR"
cd "$SCRATCH_DIR"
# Mount the node's local scratch under ~/remote_scratch on the masternode so you can monitor the job
REMOTE_MOUNT_CMD="ssh masternode mount_remote_scratch $(head -n 1 $PBS_NODEFILE) ${SCRATCH_NAME}"
REMOTE_UMOUNT_CMD="ssh masternode umount_remote_scratch ${SCRATCH_NAME}"
eval $REMOTE_MOUNT_CMD

3- At the end of the jobfile copy-paste these lines of code

eval $REMOTE_UMOUNT_CMD
mv "$SCRATCH_DIR"/* "$PBS_O_WORKDIR"
rm -Rf "$SCRATCH_DIR"

The whole content of the submission folder is copied to the /scratch_local file system of the compute node, into a directory named <username>_<your job folder name>_jobid_<jobid number>, and the execution directory is set to that folder, eliminating the network I/O overhead. When the job ends, the content of the local scratch is moved back to the original folder in your home.
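
Putting the pieces together, a single-node jobfile that uses the local scratch looks roughly like the sketch below; the directive values (queue, walltime, select) are only examples, and <module> and the application command are placeholders to replace with your own.

#PBS -S /bin/bash
#PBS -N myjob
#PBS -q <queue name>
#PBS -l walltime=02:00:00
#PBS -l select=1:ncpus=8:mpiprocs=8:mem=8gb
#PBS -V
# Note: no "cd $PBS_O_WORKDIR" here

# Set up the local scratch directory (point 2 above)
SCRATCH_NAME="${PBS_O_LOGNAME}_$(basename $PBS_O_WORKDIR)_jobid_${PBS_JOBID}"
SCRATCH_DIR="/scratch_local/${SCRATCH_NAME}"
mkdir "$SCRATCH_DIR"
cp -R "$PBS_O_WORKDIR"/* "$SCRATCH_DIR"
cd "$SCRATCH_DIR"
REMOTE_MOUNT_CMD="ssh masternode mount_remote_scratch $(head -n 1 $PBS_NODEFILE) ${SCRATCH_NAME}"
REMOTE_UMOUNT_CMD="ssh masternode umount_remote_scratch ${SCRATCH_NAME}"
eval $REMOTE_MOUNT_CMD

# Run the application
module load <module specific to the application>
command-to-start-the-application <parameters and arguments>

# Move results back and clean up (point 3 above)
eval $REMOTE_UMOUNT_CMD
mv "$SCRATCH_DIR"/* "$PBS_O_WORKDIR"
rm -Rf "$SCRATCH_DIR"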

To check on the running job, log on to the masternode and cd into the remote_scratch folder in your home directory; there you will find a directory with the same name structure, <username>_<your job folder name>_jobid_<jobid number>, on which the compute node's local scratch is automatically mounted for as long as the job runs. You may find more than one directory in remote_scratch, one for each single-node job you submitted. To find out which compute node has been assigned to the job, use the command qstat -n.
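
For example, assuming a job submitted from a folder named myjob that received the (illustrative) id 12345.masternode:

qstat -n -u $USER                                  # shows your jobs and the node assigned to each
cd ~/remote_scratch/${USER}_myjob_jobid_12345.masternode
ls -l                                              # inspect the files the job is producing
tail -f <application output file>                  # follow an output file while the job runs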

You have requested more than one node (regardless of the number of cores)

The processing takes place on several nodes, and the network carries both the MPI data exchange between the nodes and the file I/O from the nodes to the masternode. To speed up the job, modify your jobfile as follows:

1- Remove the line cd $PBS_O_WORKDIR

2- Before your “module load” command, copy and paste these lines

# Per-job directory on the /scratch file system shared between the nodes
SCRATCH_DIR="/scratch/${PBS_O_LOGNAME}_$(basename $PBS_O_WORKDIR)_jobid_${PBS_JOBID}"
mkdir "$SCRATCH_DIR"
cp -R "$PBS_O_WORKDIR"/* "$SCRATCH_DIR"
cd "$SCRATCH_DIR"

3- At the end of the jobfile copy-paste these lines of code

mv "$SCRATCH_DIR"/* "$PBS_O_WORKDIR"
rm -rf "$SCRATCH_DIR"

The whole content of the submission folder is copied to the /scratch file system shared between the nodes, and the execution directory is set to that folder; this separates your job's I/O from the traffic of all the other nodes accessing the home file system and uses a faster file system. When the job ends, the content of the shared scratch is moved back to the original folder in your home.

To check on the running job, log on to the masternode and go to the execution folder with the command cd /scratch/<username>_<your job folder name>_jobid_<jobid number>.
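
Putting it together for a multi-node MPI job, a jobfile could look roughly like the sketch below. The resource values are only examples, and the mpirun line is just one common way of starting an MPI program: the actual launch command depends on your application and MPI library and is covered in the application-specific sections.

#PBS -S /bin/bash
#PBS -N mympijob
#PBS -q <queue name>
#PBS -l walltime=04:00:00
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=16gb
#PBS -V
# Note: no "cd $PBS_O_WORKDIR" here

# Set up the shared scratch directory (point 2 above)
SCRATCH_DIR="/scratch/${PBS_O_LOGNAME}_$(basename $PBS_O_WORKDIR)_jobid_${PBS_JOBID}"
mkdir "$SCRATCH_DIR"
cp -R "$PBS_O_WORKDIR"/* "$SCRATCH_DIR"
cd "$SCRATCH_DIR"

module load <module specific to the application>
# 2 chunks x 8 mpiprocs = 16 MPI ranks; $PBS_NODEFILE lists the nodes assigned by PBS
mpirun -np 16 -machinefile $PBS_NODEFILE ./my_mpi_application <parameters and arguments>

# Move results back and clean up (point 3 above)
mv "$SCRATCH_DIR"/* "$PBS_O_WORKDIR"
rm -rf "$SCRATCH_DIR"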

Network interconnect and Architecture

Our cluster is configured with queues that partition the nodes according to their hardware characteristics, so selecting a queue implicitly selects a homogeneous network interconnect and architecture. This matters because in MPI jobs it is important that the nodes have the same speed; otherwise the job runs at the pace of the slowest node.

Nodes, cores and MPI processes

With the directive

#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores

you ask PBS to reserve #nodes portions of compute nodes (chunks), each made of #ncpus cores and running #mpiprocs MPI threads sharing those cores. mpiprocs counts the MPI threads that will share the #ncpus cores requested on each node; if omitted, it defaults to 1.

Examples:

  • select=3:ncpus=2 - allocates 3 portions of compute nodes, each with 2 cores and 1 MPI thread running on the 2 cores
  • select=3:ncpus=2:mpiprocs=2 - allocates 3 portions of compute nodes, each with 2 cores and 2 MPI threads sharing the 2 cores
  • select=3:ncpus=1:mpiprocs=2 - allocates 3 portions of compute nodes, each with 1 core and 2 MPI threads sharing the 1 core
  • Setting mpiprocs=0 or ncpus=0 with select=1 means a non-MPI (serial) job
  • If you run SMP MPI threads, request only one node: select=1:ncpus=8 - allocates 8 cores from 1 node with 1 MPI task sharing the 8 cores
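
The total number of MPI ranks of the job is therefore select × mpiprocs; for example:

#PBS -l select=4:ncpus=8:mpiprocs=8
# 4 chunks x 8 MPI threads each = 32 MPI ranks, running on 4 x 8 = 32 cores in total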

Old style syntax

The old syntax for requesting nodes, cores and MPI processes is:

#PBS -l nodes=#nodes:ncpus or cpp=#cores:ppn=#processes-per-node

  • nodes is converted to select
  • cpp or ncpus is converted to ncpus
  • ppn is converted to mpiprocs

However, you will often find the short form

#PBS -l nodes=N:ppn=p

This is converted to: #PBS -l select=N:ncpus=p:mpiprocs=p
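
For example, the following two directives are equivalent (the numbers are only illustrative):

#PBS -l nodes=4:ppn=8
#PBS -l select=4:ncpus=8:mpiprocs=8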

Memory

You also need to reserve memory for your jobs, otherwise the default applies (1 GB per node, regardless of the cores requested). You can request memory in the same directive used for CPUs:

#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-per-node:mem=<# of gigabytes>gb

This reserves on each chunk the specified number of gigabytes, SHARED between the cores and the MPI processes. Obviously you cannot request more memory than the physical limit of the nodes in the selected queue (for example, on gandalf mem must be < 8gb, on morgana < 24gb).

Examples:

  • select=3:ncpus=1:mpiprocs=2:mem=4gb - allocates 3 portions of compute nodes, each with 1 core, 2 MPI threads sharing the 1 core and 4 GB of RAM. The total memory used by the job is 12 GB (distributed over the 3 portions).
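
Another example of the arithmetic (the values are only illustrative):

#PBS -l select=2:ncpus=4:mpiprocs=4:mem=8gb
# 2 chunks x 8 GB = 16 GB total for the job;
# on each chunk the 4 MPI processes share the 8 GB (about 2 GB each)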

MPI process placement

The portions of nodes (chunks) requested for a job must be placed on the compute nodes according to a placement strategy. There are three main strategies:

  • free - takes the portions of nodes from any available node and may put two or more portions on the same node
  • scatter - takes the portions of nodes from the available nodes, but no two portions are placed on the same node
  • pack - takes all the portions from a single node

The default strategy in our cluster is: scatter. To override this behaviour, you must explicitly ask for a different placement in the job file:

#PBS -l place=<one of: free, scatter or pack>

Examples:

#PBS -l place=pack - Tells PBS to put all the portions of the job on the same node.
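
For example, to force the two chunks of a small MPI job onto the same physical node (the select values are only illustrative):

#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l place=pack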

Walltime

The total wall clock execution time of a job is the elapsed real time needed to complete the job. For a serial job this is greater than the CPU time consumed, because generally no job can use a CPU 100% of the time. For parallel jobs using N cores, the walltime is roughly equal to (total CPU time) / N.

You should request a walltime limit for your job, because time is a critical resource that the scheduler uses to estimate the starting time of the other queued jobs. If a PBS server does not enforce walltime limits, PBS sets a default limit of 5 years. When setting the walltime T for your job, keep in mind that with N cores you are implicitly imposing a CPU time limit of N*T.

#PBS -l walltime=10:00:00 - Sets a maximum walltime of 10 hours; if the job has 4 cores, this implies a job CPU time limit of 40 hours.

At present no walltime limits are enforced on our cluster, so the walltime directive is not strictly required; it is still good practice to set it so the scheduler can plan the other queued jobs.
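
To see how much walltime a running job has already consumed, you can query the server; a sketch, assuming 12345 is the id of a running job and that your PBS version reports resources_used in qstat -f:

qstat -f 12345 | grep resources_used.walltime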

GPU devices requested for CUDA

To submit a CUDA job you must select a queue whose nodes contain at least 1 GPU (endurance, covenant, raosq, minervanichoid), specify only 1 node in select=#nodes and add ngpus=1 or ngpus=2 to the directive

#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores

Examples:

  • select=1:ncpus=2:ngpus=1 - allocates 2 cores from 1 compute node with 1 MPI thread running on the 2 cores, and reserves 1 GPU for CUDA computing

Notice that requesting 1 or 2 ngpus automatically selects the first available node in the queue with the desired number of GPUs, so use this parameter as appropriate.
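
A minimal sketch of the relevant jobfile lines for a CUDA job; the queue, the resource values and the module/executable names are only placeholders to replace with what is actually available on the cluster:

#PBS -q covenant
#PBS -l select=1:ncpus=2:ngpus=1:mem=4gb
#PBS -l walltime=01:00:00
#PBS -V

cd $PBS_O_WORKDIR
module load <CUDA module available on the cluster>
./my_cuda_application <parameters and arguments>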
