Quick Menu
Applications howtos
Archived pages
Support
- Phone 3004
To start a job on one or more nodes, after having prepared your input files (for an installed application such as Abaqus) or compiled the executable (if you are programming the application yourself), you need to prepare a jobfile with some shell commands and PBS directives and submit it to the PBS scheduler, which will take care of its execution on the compute nodes.
Usually a jobfile is a plain text file with some shell commands and PBS directives. You can name it as you like, but we suggest a human-readable name that identifies your specific job. In the following, <jobfile> stands for the name of a specific jobfile.
Here we will show the content of a jobfile independent from a specific application to run; this will be a part common to all of your jobfiles. The highlighted parts should be clearly understood because they affect the job scheduling and execution.
<jobfile> content:
#!/bin/bash
#
# Set Job execution shell
#PBS -S /bin/bash

# Set Job name: <jobname> will be the name used to refer to your job in the queue
# lists, so use a clearly understandable name
#PBS -N <jobname>

# Set the execution queue: <queue name> is the name of the queue you want to submit
# the job to. It's one of the defined queues: gandalf, merlino, default, morgana,
# covenant. The choice of the queue depends on the resources needed (# of nodes,
# # of cores = # of MPI threads per node, memory, node interconnect
# network; be aware also to have the right to access the restricted queues
#PBS -q <queue name>

# Set mail addresses that will receive mail from PBS about job
# Can be a list of addresses separated by commas (,)
#PBS -M <polimi.it or mail.polimi.it email address only>

# Job re-run (yes or no)
# Notice: the re-run capability depends on the specific application
#PBS -r n

# Set events for mail from PBS about job
# Send an email to the address specified above
# for all events of the job: start, end, abort, error
#PBS -m abe

# Set standard output file
# (if relative path defaults to dir of execution)
# For clearness use the same jobfile name
#PBS -o <jobfile>.out

# Set standard error file
# (if relative path defaults to dir of execution)
# For clearness use the same jobfile name
#PBS -e <jobfile>.err

# Set total wall clock time (hh:mm:ss)
# Notice: this is a job-wide resource
# See more in the section Requesting job resources
#PBS -l walltime=00:20:00

# Set request for nodes,number of cpu (cores),number of mpi processes per node
# The #nodes is how many chunks (portion) of compute nodes you request the cores
# and the MPI threads from.
#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores

# Pass environment to job
# This is important: you should mind that the jobs
# actually run on nodes that do not share the same
# environment with the masternode
#PBS -V

# Change to submission directory
#!!!!! IMPORTANT: READ THE SECTION ABOUT JOB EXECUTION DIRECTORY BELOW !!!!!
cd $PBS_O_WORKDIR

# Command to launch application and its parameters
# It's application dependent so will be covered in each application section
module load <module specific to the application>
command-to-start-the-application <parameters and arguments>
The jobfile is normally submitted from a subfolder of your home directory in which you collect the input files and auxiliary files such as C or Fortran sources. If you submit the job with qsub <jobfile>, this folder becomes the execution directory and all the I/O of the process goes through the home file system. To improve performance it is better to use another file system, according to which of the following two cases applies:
You have requested only one node (regardless of the number of cores)
All the processing takes place on the same node, except for the file I/O, which goes over the network and slows the job down. To avoid this overhead and speed up the job, modify your jobfile as follows:
1- Delete the line cd $PBS_O_WORKDIR
2- Before your “module load” command copy-paste these lines
mkdir /scratch_local/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
cd /scratch_local/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
cp -R $PBS_O_WORKDIR/* /scratch_local/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
REMOTE_MOUNT_CMD="ssh masternode mount_remote_scratch $(head -n 1 $PBS_NODEFILE) $(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)"
REMOTE_UMOUNT_CMD="ssh masternode umount_remote_scratch $(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)"
eval $REMOTE_MOUNT_CMD
3- At the end of the jobfile copy-paste these lines of code
eval $REMOTE_UMOUNT_CMD
mv /scratch_local/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)/* $PBS_O_WORKDIR
rm -Rf /scratch_local/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
All the content of the submission folder is copied to the /scratch_local folder on the remote node, into a directory named <username>_<your job folder name>_jobid_<jobid number>, and the execution folder is set to that directory, eliminating the network I/O overhead. When the job ends, all the content of the remote scratch directory is moved back to the original folder in the user home.
To check on a running job, log on to masternode and cd to the folder remote_scratch in your home directory. Inside you will find a directory with the same name structure, <username>_<your job folder name>_jobid_<jobid number>, on which the local scratch of the remote node is automatically mounted for as long as the job runs. You may find more than one directory in remote_scratch, one for each single-node job that you submitted. To find out which node has been assigned to the job, use the command qstat -n.
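The scratch directory name is spelled out in full on every line of the commands above, which makes typos easy. As a minimal sketch, the name could be built once by a small function and reused; job_scratch_dir is a hypothetical helper introduced here for illustration, not a command provided by the cluster, and it relies only on the PBS variables already used above.

```shell
#!/bin/bash
# Sketch only: job_scratch_dir is a hypothetical helper, not a cluster command.
# It prints <root>/<username>_<job folder name>_jobid_<jobid>, the same
# naming scheme used by the mkdir/cd/cp commands of the jobfile.
job_scratch_dir() {
    # $1 is the scratch root; /scratch_local is used when it is omitted
    local root="${1:-/scratch_local}"
    printf '%s/%s_%s_jobid_%s\n' \
        "$root" "$PBS_O_LOGNAME" "$(basename "$PBS_O_WORKDIR")" "$PBS_JOBID"
}

# Possible use inside a jobfile (left commented out on purpose):
# SCRATCH_DIR=$(job_scratch_dir /scratch_local)
# mkdir "$SCRATCH_DIR" && cd "$SCRATCH_DIR"
```

Keeping the name in one variable also means that the final mv and rm lines are guaranteed to refer to the same directory that was created at the start.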
You have requested more than one node (regardless of the number of cores)
The processing takes place on several nodes, and the network carries both the data exchanged between the nodes (MPI) and the file I/O from the nodes to the masternode. To speed up the job, modify your jobfile as follows:
1- Delete the line cd $PBS_O_WORKDIR
2- Before your “module load” command copy-paste these three lines
mkdir /scratch/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
cd /scratch/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
cp -R $PBS_O_WORKDIR/* /scratch/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
3- At the end of the jobfile copy-paste these lines of code
mv /scratch/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)/* $PBS_O_WORKDIR
rm -rf /scratch/$(echo $PBS_O_LOGNAME)_$(basename $PBS_O_WORKDIR)_jobid_$(echo $PBS_JOBID)
All the content of the submission folder is copied to the /scratch folder, which is shared between the nodes, and the execution folder is set to that folder. This separates the network I/O of your job from that of all the nodes accessing the home file system, and uses a faster file system. When the job ends, all the content of the shared scratch directory is moved back to the original folder in the user home.
To check on a running job, log on to masternode and go to the execution folder with the command cd /scratch/<your username>_<name that you gave to the job folder>_jobid_<job-id>.
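The copy-in and copy-back steps described above can also be wrapped in two small functions, so that the same code serves both the /scratch_local and the /scratch case. This is only a sketch under stated assumptions: stage_in and stage_out are hypothetical names introduced here, the sketch does not include the mount_remote_scratch and umount_remote_scratch calls of the single-node case, and it copies with "dir/." instead of "dir/*", so hidden files are included as well, unlike the original commands.

```shell
#!/bin/bash
# Sketch only: stage_in and stage_out are hypothetical helpers.

# stage_in <submission dir> <scratch dir>
# Creates the scratch directory and copies the whole submission folder into it.
stage_in() {
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}

# stage_out <submission dir> <scratch dir>
# Copies everything back and removes the scratch directory only if the
# copy succeeded, so results are not lost when the copy fails.
stage_out() {
    cp -R "$2"/. "$1"/ && rm -rf "$2"
}

# Possible use inside a jobfile (left commented out on purpose):
# SCRATCH_DIR=/scratch/${PBS_O_LOGNAME}_$(basename "$PBS_O_WORKDIR")_jobid_${PBS_JOBID}
# stage_in "$PBS_O_WORKDIR" "$SCRATCH_DIR" && cd "$SCRATCH_DIR"
# ... module load and application command ...
# stage_out "$PBS_O_WORKDIR" "$SCRATCH_DIR"
```

Copying back and then deleting, rather than moving, is a design choice of this sketch: it is slower than mv but leaves the scratch copy in place if the home file system is unavailable when the job ends.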
Our cluster is configured with queues that partition the nodes according to their hardware characteristics, so selecting a queue implicitly selects a homogeneous network interconnect and architecture. This matters because in MPI jobs it is important that all the nodes have the same speed, otherwise the job will run at the pace of the slowest node.
With the directive
#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores
you ask PBS to reserve #nodes portions of compute nodes (chunks), each made of #cores-per-node cores and running #mpi-threads-that-share-the-cores MPI processes. The mpiprocs value counts the MPI processes that will share the ncpus cores requested in each chunk. If omitted, mpiprocs defaults to 1.
Examples:
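As a hedged illustration of how the values of a select request multiply, the sketch below prints the job-wide totals for two made-up requests. select_totals is a hypothetical helper, not a PBS command; it handles only a single chunk type, and it assumes ncpus=1 when ncpus is omitted (the mpiprocs default of 1 is the one stated above).

```shell
#!/bin/bash
# Sketch only: select_totals is a hypothetical helper, not part of PBS.
# It takes the value of a select request, e.g. "2:ncpus=8:mpiprocs=8",
# and prints the number of chunks, total cores and total MPI processes.
select_totals() {
    local spec="$1" chunks ncpus=1 mpiprocs=1 field
    chunks="${spec%%:*}"          # text before the first colon
    local IFS=':'                 # split the remaining fields on colons
    for field in ${spec#*:}; do
        case "$field" in
            ncpus=*)    ncpus="${field#ncpus=}" ;;
            mpiprocs=*) mpiprocs="${field#mpiprocs=}" ;;
        esac
    done
    echo "chunks=$chunks cores=$((chunks * ncpus)) mpiprocs=$((chunks * mpiprocs))"
}

select_totals "2:ncpus=8:mpiprocs=8"   # prints: chunks=2 cores=16 mpiprocs=16
select_totals "4:ncpus=2"              # prints: chunks=4 cores=8 mpiprocs=4
```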
Old style syntax
The old syntax for requesting nodes, cores and MPI processes is:
#PBS -l nodes=#nodes:ncpus or cpp=#cores:ppn=#processes-per-node
nodes is converted to select
cpp or ncpus is converted to ncpus
ppn is converted to mpiprocs
But often you will find the short form
#PBS -l nodes=N:ppn=p
This is converted to: #PBS -l select=N:ncpus=p:mpiprocs=p
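The short-form rule above is mechanical enough to be sketched in a few lines of bash. old_to_select is a hypothetical helper written for illustration only; it covers just the nodes=N:ppn=p short form and rejects anything else.

```shell
#!/bin/bash
# Sketch only: old_to_select is a hypothetical helper, not part of PBS.
# It rewrites the short old-style request nodes=N:ppn=p as
# select=N:ncpus=p:mpiprocs=p, following the rule stated in the text.
old_to_select() {
    if [[ "$1" =~ ^nodes=([0-9]+):ppn=([0-9]+)$ ]]; then
        echo "select=${BASH_REMATCH[1]}:ncpus=${BASH_REMATCH[2]}:mpiprocs=${BASH_REMATCH[2]}"
    else
        echo "unsupported resource request: $1" >&2
        return 1
    fi
}

old_to_select "nodes=2:ppn=4"   # prints: select=2:ncpus=4:mpiprocs=4
```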
You also need to reserve memory for your jobs, otherwise the defaults apply (1 GB per node, regardless of the cores requested). You can request memory in the same directive used for the cpus:
#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-per-node:mem=<# of gigabytes>gb
This reserves on each compute node the specified number of gigabytes, SHARED between the cores and the MPI processes. Obviously you cannot request more memory than the physical limit of the nodes in the selected queue (for example on gandalf mem must be < 8gb, on morgana < 24gb).
Examples:
* select=3:ncpus=1:mpiprocs=2:mem=4gb - Allocates 3 portions of compute nodes, each with 1 core shared by 2 MPI processes and 4 GB of RAM. The total memory used by the job will be 12 GB (distributed over 3 nodes).
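The arithmetic of the example above, together with the per-queue limits quoted earlier, can be checked before submitting with a sketch like the following. mem_limit_gb and check_mem are hypothetical helpers; only the gandalf and morgana limits mentioned in the text are encoded, and the check assumes that each chunk lands on a different node.

```shell
#!/bin/bash
# Sketch only: mem_limit_gb and check_mem are hypothetical helpers.
# Only the two limits quoted in the text are known here.
mem_limit_gb() {
    case "$1" in
        gandalf) echo 8 ;;
        morgana) echo 24 ;;
        *)       return 1 ;;
    esac
}

# check_mem <queue> <chunks> <gb per chunk>
# Assumes one chunk per node, so the per-chunk request is compared
# directly with the per-node limit (mem must be strictly below it).
check_mem() {
    local limit
    limit=$(mem_limit_gb "$1") || { echo "no limit known for queue $1" >&2; return 2; }
    if [ "$3" -lt "$limit" ]; then
        echo "ok: $(( $2 * $3 )) GB in total over $2 chunks"
    else
        echo "too much: mem must be below ${limit}gb on $1" >&2
        return 1
    fi
}

check_mem gandalf 3 4   # prints: ok: 12 GB in total over 3 chunks
```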
The portions of nodes requested for a job must be placed on the compute nodes using a placement strategy. There are three main strategies: free, scatter and pack.
The default strategy in our cluster is: scatter. To override this behaviour, you must explicitly ask for a different placement in the job file:
#PBS -l place=<one of the placement free, scatter or pack>
Examples:
#PBS -l place=pack - Tells PBS to put the portions of nodes on the same node.
The total wallclock execution time of a job is the external elapsed time to complete the job. This is greater than the total CPU time consumed by a job running on a cpu, because generally no job can use a cpu 100% of the time. In case of parallel jobs that use N cores, the walltime is more or less equal to (total CPU time) / N.
You must request a walltime limit for your job, because time is a critical resource that the scheduler needs in order to estimate the starting time of the other queued jobs. If a PBS server doesn't enable walltime limits for jobs, PBS sets a default limit of 5 years. When setting the walltime T for your job, please consider that with N cores you are implicitly imposing a CPU time limit of N*T.
#PBS -l walltime=10:00:00 - Sets a max walltime of 10 hours; if the job has 4 cores, this sets a job CPU time limit of 40 hours.
Currently the walltime request is not required, because no walltime limits are configured on the cluster.
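The N*T relation can be worked out for any request with a few lines of bash. cpu_time_limit is a hypothetical helper written for this sketch; it takes a walltime in the hh:mm:ss form used by the directive and the number of cores, and prints the implied CPU time limit.

```shell
#!/bin/bash
# Sketch only: cpu_time_limit is a hypothetical helper, not part of PBS.
# cpu_time_limit <hh:mm:ss> <cores> prints the implied CPU time limit N*T.
cpu_time_limit() {
    local h m s total
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10, so fields such as "08" are not read as octal
    total=$(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * $2 ))
    printf '%02d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

cpu_time_limit 10:00:00 4   # prints: 40:00:00
```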
To submit a CUDA job you must select a queue whose nodes contain at least 1 GPU (endurance, covenant, raosq, minervanichoid), specify only 1 node in select=#nodes and add ngpus=1 or ngpus=2 to the directive
#PBS -l select=#nodes:ncpus=#cores-per-node:mpiprocs=#mpi-threads-that-share-the-cores
Examples:
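As a hedged sketch, the function below assembles a directive for a single-node GPU request. gpu_select is a hypothetical helper, and it assumes that ngpus is appended to the select value as one more colon-separated resource, like ncpus and mpiprocs; check this against a working jobfile on the cluster before relying on it.

```shell
#!/bin/bash
# Sketch only: gpu_select is a hypothetical helper, not part of PBS.
# gpu_select <cores> <mpi processes> <gpus>
# Always requests exactly 1 node and accepts only 1 or 2 GPUs,
# as required in the text. Assumption: ngpus goes inside select.
gpu_select() {
    case "$3" in
        1|2) echo "#PBS -l select=1:ncpus=$1:mpiprocs=$2:ngpus=$3" ;;
        *)   echo "ngpus must be 1 or 2" >&2; return 1 ;;
    esac
}

gpu_select 4 1 1   # prints: #PBS -l select=1:ncpus=4:mpiprocs=1:ngpus=1
```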
Notice that selecting ngpus=1 or ngpus=2 automatically selects the first available node in the queue with the requested number of GPUs, so use this parameter as appropriate.