|
|
Green cluster
The green cluster is our main computational resources. The machine is shared with Bill Wedemeyer's group.
The cluster consists of 44 dual Xeon compute nodes, two storage nodes, one for the Feig group, the other one for the Wedemeyer group, and a login node. The machine was installed in April 2004 by Dell.
The cluster consists of a login/master node, called green,
44 compute nodes greenc01-greenc44, and the two storage nodes
greens1 (Feig) and greens2. In order to use the compute nodes,
computational jobs have to be submitted in the form of job scripts to the
PBS queueing system
on the login node. The queueing system allocates nodes to jobs as resources
become available and allows fair usage by many users. Particularly, the
queueing system is setup so that half of the cluster is guaranteed to
each of the Feig and Wedemeyer groups if needed, but unused nodes become
available to all users otherwise.
The green cluster can only be reached from within the MSU network and only via secure shell connection to the login node green.bch.msu.edu. From a Linux or UNIX system this is easily done with slogin or ssh as in the following example: $ slogin dude@green.bch.msu.edu dude@green's password: xxxxxx Dell | Red Hat : Linux Beowulf Node [Tue Mar 30 15:05:56 CST 2004] [dude@green ~]$ _When connecting for the first time, a message may appear about adding green.bch.msu.edu to the list of known hosts, which can be safely answered with "yes" in order to avoid the same message when connecting the next time. When connecting from a Windows machine the process is essentially the same, however, it may be necessary to install a secure shell client first. There are many choices (see for example the list given here). On a Mac with OS X slogin can be run from a terminal window since it is based on a UNIX system.
The purpose of the queueing system is to share the compute nodes between
many computational jobs from different users. It is used by submitting a job
script along with a request for specific resources (one or more CPUs, the amount
of wall clock time that is needed to complete the job). The queueing system
then allocates an appropriate set of nodes and starts the job script.
If all of the compute nodes are busy, the queueing system will hold the job
until resources become available and begin job execution at a later time.
The PBS system manages different demands on resources through the use of
queues. A queue can be viewed as a channel through which a part or all of
the nodes are available with certain resource limits. The combination of
multiple queues with different priorities then allows the implementation
of more complicated usage policies.
This setup guarantees that at least 22 nodes will always be available to either group within a maximum of 2 hours, the time limit of the short queue, but unused nodes of the other group can be used for short jobs with the short queue.
A PBS job script is very similar to a regular shell script. An example PBS script is shown in the following: #!/bin/csh #PBS -l nodes=1:ppn=1,walltime=1:00:00 cd /feigdata/feig/simulations mdCHARMM.pl -par dynsteps=100,gb -trajout 3gb1.dcd 3gb1.pdb This script, to be executed by /bin/csh, would change the working directory to /feigdata/feig/simulations on the Feig storage node greens1 and then perform 100 steps of molecular dynamics with the MMTSB Tool Set utility mdCHARMM.pl. The change of directory is necessary, because all of the scripts submitted through the queue always initially begin in the user's home directory on green. The most interesting part of this script is the second line #PBS -l .... Because it begins with a # it is considered a comment for the shell script, but it is interpreted by the PBS queueing system in order to find out about PBS options and specifically what kind of resources the job needs. In the example the -l option (limits) is used to request a single node (nodes=1) with 1 CPU per node (ppn=1 and a total wallclock time of 1 hour (the format is hours:minutes:seconds). In the case of a parallel job, where multiple nodes/CPUs are needed at once, this line may look like this: #PBS -l nodes=8:ppn=2,walltime=12:00:00This would mean that 8 nodes, each with 2 CPUs for a total of 16 CPUs, are requested over a time period of 12 hours. The given time should accurately reflect the actual time a job needs since this information is used to schedule the available resources effectively. However, because this time limit is also enforced by terminating a job which exceeds its requested time, it is best to request 10-20% more time than actually needed to make sure that the job can finish. In the examples, a specific queue was not given. By default, jobs are submitted automatically either to the feig or bill queues depending on the user name. If another queue is desired, for example the short queue, it can be requested explicity with the additional line #PBS -q shortMany more resources are available and explained in more detail here.
In order to submit a PBS job script, the command qsub is used, which is only available on green. If the script is called md.job the command would be: [dude@green ~]$ qsub md.job 756.green [dude@green ~]$ _It is also possible to submit a shell script without any #PBS while passing the PBS options as arguments to qsub as in the following example: [dude@green ~]$ qsub -q feig -l nodes=2:ppn=1 min.job 757.green [dude@green ~]$ _In this case the job script min.job would be submitted to the feig queue, requesting 2 nodes with one CPU each.
Once a job is submitted, the command qstat can be used to find out whether and where this job (and others) is running:
[dude@green ~]$ qstat -a
green:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------- -------- ----- ------- ------ --- --- ------ ----- - -----
756.green dude feig md.job 11510 1 -- -- 01:00 R 00:10
...
[dude@green ~]$ _
In this case, the job has started and has been running for 10 minutes. If the job
is waiting for nodes to become available it will not show any elapsed time and
a Q (for queued) instead of R (for running) in
the status column on the far right. More extenisve information is available with
the following command, which probably prints out more information than anybody
would want to know:
[dude@green ~]$ qstat -f
Job Id: 756.green
Job_Name = md.job
Job_Owner = dude@green
job_state = R
queue = feig
server = green
Checkpoint = u
ctime = Sat May 1 10:38:14 2004
Error_Path = green:/home1/dude/md.job.e756
exec_host = greenc05/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sat May 1 10:38:14 2004
Output_Path = green:/home1/dude/md.job.o756
Priority = 0
qtime = Sat May 1 10:38:14 2004
Rerunable = True
Resource_List.cput = 01:00:00
Resource_List.nodect = 1
Resource_List.nodes = 1#feig
Resource_List.walltime = 01:00:00
session_id = 11510
Variable_List = PBS_O_HOME=/home1/dude,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=dude,
PBS_O_PATH=/opt/Felix/lam/bin:/usr/local/bin:/bin:/usr/bin:
/usr/X11R6/bin:/apps/bin:/apps/script:/apps/mmtsb/perl:
/apps/mmtsb/bin:/apps/modeller/bin:/apps/intel_cc_80/bin:
/apps/intel_fc_80/bin:/apps/intel_idb_73/bin:
/usr/local/pbs/bin,
PBS_O_SHELL=/bin/tcsh,PBS_O_HOST=green,
PBS_O_WORKDIR=/home1/dude,PBS_O_QUEUE=default
comment = Job started on Sat May 01 at 10:38
etime = Sat May 1 10:38:14 2004
[dude@green ~]$ _
The information that is probably most useful from this very detailed output
is the line
exec_host = greenc05/0which indicates on which node (in this case: greenc05) the job is actually running. This is helpful in case the job crashes and manual cleanup is necessary.
From time to time it may be necessary to terminate a job that has been submitted to the queue either while it is waiting for resources to become available or when it is running already. In both cases, the command qdel can be used to remove a job if the job id is given as the argument: [dude@green ~]$ qdel 756.green [dude@green ~]$ _If the command was successful no output is given, but qstat can be used to check, whether the job is indeed removed from the queue.
Since the queueing system runs a job on a remote node and possibly at a much later time after
job submission, the job output does not appear on the terminal from where the job was
submitted. Instead, standard output and error output are collected in separate files.
The names of these files are composed of the name of the job script, followed by .o
(standard output) or .e (error output) and the number of the PBS job that was
printed out when the job was started. In the example above, the output file names would
be md.job.o756 and md.job.e756. These files appear as soon as the job
finishes in the directory from where the job was started. #PBS -o /dev/null #PBS -e /dev/nullin job scripts that are working well and are not expected to generate interesting output in these files.
So far all of the examples involved a job script that is submitted to the queueing system and run when resources become available. However, it is also possible to request one or more nodes for interactive use. This is done with qsub -I. An example is given in the following: [dude@green ~]$ qsub -I -l nodes=1:ppn=2 qsub: waiting for job 1100.green to start qsub: job 1100.green ready [dude@greenc33 ~]$ _This looks (and works) like regular remote login to one of the compute nodes, in this case greenc33, except that PBS knows that the node is reserved, actually 2 CPUs in this example, and other jobs cannot be submitted to this node until the interactive session is completed. Such interactive jobs are particularly useful for testing applications over a period of time, especially when multiple nodes are needed. Otherwise jobs would have to be submitted to the queue repeatedly, which is cumbersome, but may also cause significant delays if the required resources become unavailable and a test job has to wait until it can be started. Interactive PBS sessions should not be used for production jobs since they may result in nodes running idle for unnecessarily long times after a job finishes if a new job is not restarted immediately or the interactive session is not terminated.
The advantage of a cluster with many nodes is that it is possible to run programs
in parallel. Parallel program execution may be "embarassingly parallel"
if multiple program runs are essentially unconnected requiring very little communication
if any and the results are collected and combined at the end. An example would be
the combinatorial exploration of protein-ligand interactions between one protein
and many different potential ligands. However, parallel programs may also be
very tightly coupled such as in molecular dynamics simulations. These types of parallel
programs split the operations required for a single computational task over many
processors but require constant communication in order to update data from all processes
at certain times.
MPI executables are tightly
coupled applications compiled with an MPI communications library. There are many free
and commercial MPI implementations available. The two main free MPI versions are
MPICH and
LAM/MPI. On the green cluster,
we have installed LAM/MPI. lamboot -lat the beginning of a parallel job script. At the very end of the script the LAM environment should be stopped with lamhaltThe actual MPI program is started with mpirun. In the simplest form it is called with the option -np and the number of processors followed by the name of the MPI executable and additional arguments if necessary: mpirun -np 4 charmm_mpi < charmm.scriptThis example would start charmm_mpi on four processors and read a CHARMM command script from the file charmm.script. Of course, this assumes that the job script has requested 4 CPUs from the queueing system. A complete job script would look like the following: #!/bin/csh #PBS -l nodes=2:ppn=2 cd /feigdata/feig/simulations lamboot -l mpirun -np 4 charmm_mpi < charmm.script lamhaltOther options can be used to control how mpirun allocates processes onto nodes. Depending on the type of application, the allocation of processes onto nodes can affect parallel performance greatly. This is an effect of the heterogeneous architecture of the green cluster with dual CPU nodes with very fast communication connected to other nodes through relatively slow gigabit ethernet hardware. As should be clear by now, a special program version is always needed for running in parallel. While it may be possible to start a serial version of a given program with mpirun, it won't run in parallel, probably requiring the same time as if it was run only on a single CPU. How to compile MPI programs is explained in more detail below.
The MMTSB Tool Set can be used to take advantage of parallel processing for a variety
of modeling tasks. There are three different ways to use multiple nodes with the
Tool Set:
Finally, it is possible to use multiple nodes simply with rsh or ssh in order to start processes remotely on allocated nodes. Client/server type applications could be submitted in this manner by starting the server first on the node, where the job script is started, and then submitting clients to other nodes with rsh or ssh as in the following example: #!/bin/csh #PBS -l nodes=4:ppn=1 cd /feigdata/feig/simulations start-server & foreach n ( `cat $PBS_NODEFILE` ) rsh $n "( cd /feigdata/feig/simulations; start-client )" & end waitThe use of rsh/ssh allows great flexibility when using multiple CPUs. However, as in the previous examples, one should not start processes on nodes/processors that have not been allocated, and if parallel jobs are run in this manual fashion, it is the user's resposibility to clean up all processes when the job terminates either successfully or due to an error. |