When you connect to a cluster, you cannot execute code at will. Because many users are connected at once, nobody should have immediate access to all the nodes on the cluster.
Instead, the cluster uses a job scheduler. You may compile your code directly, but you may not execute it. To run it, you submit it to the scheduler, which takes your request and places it in a queue for execution.
Clusters use different job schedulers (PBS is one of them, and it is freely available), but there are three basic commands:
- Job submission: With this command you submit a program to be executed through a job script (defined below).
- Job monitoring: This command shows the list of programs waiting to be executed, or the status of just your own programs.
- Job termination: This command is used to remove a program from the queue of programs to be executed.
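With PBS, these three operations correspond to the qsub, qstat, and qdel commands. The following is a minimal sketch, assuming a PBS installation and a hypothetical job script named job.pbs; the guard only makes the sketch harmless on a machine where PBS is not installed.

```shell
#!/bin/bash
# Sketch of the three basic scheduler operations using the PBS commands.
# job.pbs is a hypothetical job script name, not one from this text.
JOB_SCRIPT="job.pbs"

if command -v qsub >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    # Job submission: qsub prints the identifier of the new job.
    JOB_ID=$(qsub "$JOB_SCRIPT")
    echo "Submitted job: $JOB_ID"

    # Job monitoring: list all jobs, then only your own.
    qstat
    qstat -u "$USER"

    # Job termination: remove the job from the queue.
    qdel "$JOB_ID"
    STATUS="submitted"
else
    echo "PBS commands or $JOB_SCRIPT not found; nothing submitted."
    STATUS="skipped"
fi
```

Other schedulers provide equivalent commands under different names.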
To submit a job to the system, you submit a job script to the scheduler. This is basically a text file with options for the scheduler, mainly:
- Job name, a user-specified string to identify the program (a user can submit several jobs at once).
- What shell to use (C Shell, Bash, Tcsh, etc.)
- An e-mail address used to tell the user that the program has been submitted to the queue, has begun execution, has exited, or was aborted.
- The number of processors to run the program on.
- A time limit for program execution (in order to avoid runaway programs wasting computing units on the system).
- The files from which the program reads its input and a file in which to store its output, since there is no interaction with the user while the program is executing.
Job scripts also offer the ability to include shell programming commands.
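As a rough illustration of such shell commands, the sketch below changes to the submission directory and counts the allocated processors before starting the program. It assumes the PBS environment variables PBS_O_WORKDIR and PBS_NODEFILE, where the node file typically holds one line per allocated processor slot (this may differ between PBS variants). The executable name my_program is hypothetical, and the fallbacks exist only so the sketch also runs outside the scheduler.

```shell
#!/bin/bash
# Sketch of shell commands in the body of a PBS job script.
# PBS_O_WORKDIR: directory from which the job was submitted (set by PBS).
# Fall back to the current directory when run outside the scheduler.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# PBS_NODEFILE: file listing the allocated processor slots (set by PBS).
# Assumption: one line per slot; use 1 when the file is not available.
if [ -n "$PBS_NODEFILE" ] && [ -f "$PBS_NODEFILE" ]; then
    NP=$(( $(wc -l < "$PBS_NODEFILE") ))
else
    NP=1
fi
echo "Running on $NP processor(s), started at $(date)"

# ./my_program is a hypothetical executable name.
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_program ]; then
    mpirun -np "$NP" ./my_program
fi
```

Deriving the process count from the node file keeps the mpirun call in step with whatever the scheduler actually allocated.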
Sample job script using PBS
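# Send e-mail when the job begins (b) and ends (e)
#PBS -m be
# Job name
#PBS -N q1a
# Files for standard error and standard output
#PBS -e /home/user/parallel/q1a.error
#PBS -o /home/user/parallel/q1a.output
# Request 4 nodes with 8 processors per node
#PBS -l nodes=4:ppn=8
# Address that receives the e-mail notifications
#PBS -M email@example.com
# Start 16 MPI processes of q1a with arguments 20 20
mpirun -np 16 ./q1a 20 20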
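
The sample script above leaves out some of the options listed earlier: the shell, a time limit, and an input file. The variant below is a sketch of how they might be added, not a definitive script. The 30-minute limit, the 2 x 8 processor request (chosen here so that it matches the 16 MPI processes), and the input file q1a.input are assumptions made for illustration; -m abe additionally requests mail if the job is aborted.

```shell
#!/bin/bash
#PBS -N q1a
# Shell used to interpret the job script
#PBS -S /bin/bash
# Assumed request: 2 nodes with 8 processors each
#PBS -l nodes=2:ppn=8
# Assumed time limit of 30 minutes (hh:mm:ss)
#PBS -l walltime=00:30:00
# Mail on abort (a), begin (b) and end (e)
#PBS -m abe
#PBS -M email@example.com
#PBS -o /home/user/parallel/q1a.output
#PBS -e /home/user/parallel/q1a.error

# 2 nodes x 8 processors per node = 16 MPI processes
NP=$((2 * 8))

# q1a.input is a hypothetical input file, read on standard input.
if command -v mpirun >/dev/null 2>&1 && [ -x ./q1a ] && [ -f q1a.input ]; then
    mpirun -np "$NP" ./q1a 20 20 < q1a.input
else
    echo "mpirun, ./q1a or q1a.input not found; skipping run."
fi
```

Comments are kept on their own lines because text after a directive on the same line may be read as part of the directive.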