Skip to Main Content
MSU logo
Feig Lab  ·  Computational Biophysics
SimDB link
MMTSB link
Feig lab Wiki link
Green cluster

The green cluster is our main computational resources. The machine is shared with Bill Wedemeyer's group.


Green in the Dark

The cluster consists of 44 dual Xeon compute nodes, two storage nodes, one for the Feig group, the other one for the Wedemeyer group, and a login node. The machine was installed in April 2004 by Dell.

Mode of operation

The cluster consists of a login/master node, called green, 44 compute nodes greenc01-greenc44, and the two storage nodes greens1 (Feig) and greens2. In order to use the compute nodes, computational jobs have to be submitted in the form of job scripts to the PBS queueing system on the login node. The queueing system allocates nodes to jobs as resources become available and allows fair usage by many users. Particularly, the queueing system is setup so that half of the cluster is guaranteed to each of the Feig and Wedemeyer groups if needed, but unused nodes become available to all users otherwise.

It is important that programs are never started manually on nodes that have not been allocated by the queueing system. Otherwise the purpose of the queueing system would be defeated and shared, fair and efficient usage of the cluster would become impossible.

The administrative nodes green, greens1, and greens2 should not be used for running extensive computations (> 1 minute), since a substantial load on these machines would affect the performance of the entire cluster. It is ok, though, to perform quick data analysis, compile programs or run very short tests on either one of these nodes. Otherwise, longer jobs need to be submitted to the queue or run elsewhere, e.g. on the blue cluster, where the green data is available over the network through NFS.

Connecting to the green cluster

The green cluster can only be reached from within the MSU network and only via secure shell connection to the login node green.bch.msu.edu. From a Linux or UNIX system this is easily done with slogin or ssh as in the following example:

$ slogin dude@green.bch.msu.edu
dude@green's password: xxxxxx
Dell | Red Hat : Linux Beowulf Node [Tue Mar 30 15:05:56 CST 2004]
[dude@green ~]$ _
When connecting for the first time, a message may appear about adding green.bch.msu.edu to the list of known hosts, which can be safely answered with "yes" in order to avoid the same message when connecting the next time.

When connecting from a Windows machine the process is essentially the same, however, it may be necessary to install a secure shell client first. There are many choices (see for example the list given here).

On a Mac with OS X slogin can be run from a terminal window since it is based on a UNIX system.


PBS queueing system

The purpose of the queueing system is to share the compute nodes between many computational jobs from different users. It is used by submitting a job script along with a request for specific resources (one or more CPUs, the amount of wall clock time that is needed to complete the job). The queueing system then allocates an appropriate set of nodes and starts the job script. If all of the compute nodes are busy, the queueing system will hold the job until resources become available and begin job execution at a later time.

On green we are using OpenPBS as the queueing software. Extensive information is available on the web, for example at this site.


PBS queues on green
arrow

The PBS system manages different demands on resources through the use of queues. A queue can be viewed as a channel through which a part or all of the nodes are available with certain resource limits. The combination of multiple queues with different priorities then allows the implementation of more complicated usage policies.

In the case of the green cluster, the total machine is shared equally between the Feig and Wedemeyer groups. An easy solution would be to partition the machine rigidly. However, this assumes that both groups always use half of the machine. If one group does not use all of their half, but the other group would benefit from having additional nodes available, it would make sense to make the unused nodes available to the other group until they are needed again by the first group. While such flexible node sharing would not be possible with a rigid partitioning of the machine, this can be achieved with the help of multiple PBS queues.

For the green cluster a total of three queues are available:

feig
Only Feig group members can use this queue with a time limit of 24 hours per job and a total number of 22 nodes or 44 CPUs. Only nodes 1 through 22 are available for jobs submitted to this queue.
bill
Only Wedemeyer group members can use this queue with a time limit of 24 hours per job and a total number of 22 nodes or 44 CPUs. Only nodes 23 through 44 are available for jobs submitted to this queue.
short
All users can submit jobs to this queue, but the wall clock time limit is 2 hours. All nodes are available to this queue.
In addition, jobs submitted to both the feig and bill queues have higher priorities than jobs submitted to the short queue.

This setup guarantees that at least 22 nodes will always be available to either group within a maximum of 2 hours, the time limit of the short queue, but unused nodes of the other group can be used for short jobs with the short queue.



PBS job scripts
arrow

A PBS job script is very similar to a regular shell script. An example PBS script is shown in the following:

#!/bin/csh
#PBS -l nodes=1:ppn=1,walltime=1:00:00

cd /feigdata/feig/simulations

mdCHARMM.pl -par dynsteps=100,gb -trajout 3gb1.dcd 3gb1.pdb

This script, to be executed by /bin/csh, would change the working directory to /feigdata/feig/simulations on the Feig storage node greens1 and then perform 100 steps of molecular dynamics with the MMTSB Tool Set utility mdCHARMM.pl. The change of directory is necessary, because all of the scripts submitted through the queue always initially begin in the user's home directory on green.

The most interesting part of this script is the second line #PBS -l .... Because it begins with a # it is considered a comment for the shell script, but it is interpreted by the PBS queueing system in order to find out about PBS options and specifically what kind of resources the job needs. In the example the -l option (limits) is used to request a single node (nodes=1) with 1 CPU per node (ppn=1 and a total wallclock time of 1 hour (the format is hours:minutes:seconds). In the case of a parallel job, where multiple nodes/CPUs are needed at once, this line may look like this:
#PBS -l nodes=8:ppn=2,walltime=12:00:00
This would mean that 8 nodes, each with 2 CPUs for a total of 16 CPUs, are requested over a time period of 12 hours. The given time should accurately reflect the actual time a job needs since this information is used to schedule the available resources effectively. However, because this time limit is also enforced by terminating a job which exceeds its requested time, it is best to request 10-20% more time than actually needed to make sure that the job can finish.

In the examples, a specific queue was not given. By default, jobs are submitted automatically either to the feig or bill queues depending on the user name. If another queue is desired, for example the short queue, it can be requested explicity with the additional line
#PBS -q short
Many more resources are available and explained in more detail here.



Submitting PBS jobs
arrow

In order to submit a PBS job script, the command qsub is used, which is only available on green. If the script is called md.job the command would be:

[dude@green ~]$ qsub md.job
756.green
[dude@green ~]$ _
It is also possible to submit a shell script without any #PBS while passing the PBS options as arguments to qsub as in the following example:
[dude@green ~]$ qsub -q feig -l nodes=2:ppn=1 min.job
757.green
[dude@green ~]$ _
In this case the job script min.job would be submitted to the feig queue, requesting 2 nodes with one CPU each.



Querying PBS job status
arrow

Once a job is submitted, the command qstat can be used to find out whether and where this job (and others) is running:

[dude@green ~]$ qstat -a
green:
                                                Req'd  Req'd   Elap
Job ID    Username Queue Jobname SessID NDS TSK Memory Time  S Time
--------- -------- ----- ------- ------ --- --- ------ ----- - -----
756.green dude     feig  md.job  11510   1  --    --   01:00 R 00:10 
...
[dude@green ~]$ _
In this case, the job has started and has been running for 10 minutes. If the job is waiting for nodes to become available it will not show any elapsed time and a Q (for queued) instead of R (for running) in the status column on the far right. More extenisve information is available with the following command, which probably prints out more information than anybody would want to know:
[dude@green ~]$ qstat -f
Job Id: 756.green
 Job_Name = md.job
 Job_Owner = dude@green
 job_state = R
 queue = feig
 server = green
 Checkpoint = u
 ctime = Sat May  1 10:38:14 2004
 Error_Path = green:/home1/dude/md.job.e756
 exec_host = greenc05/0
 Hold_Types = n
 Join_Path = n
 Keep_Files = n
 Mail_Points = a
 mtime = Sat May  1 10:38:14 2004
 Output_Path = green:/home1/dude/md.job.o756
 Priority = 0
 qtime = Sat May  1 10:38:14 2004
 Rerunable = True
 Resource_List.cput = 01:00:00
 Resource_List.nodect = 1
 Resource_List.nodes = 1#feig
 Resource_List.walltime = 01:00:00
 session_id = 11510
 Variable_List = PBS_O_HOME=/home1/dude,PBS_O_LANG=en_US.UTF-8,
     PBS_O_LOGNAME=dude,
     PBS_O_PATH=/opt/Felix/lam/bin:/usr/local/bin:/bin:/usr/bin:
     /usr/X11R6/bin:/apps/bin:/apps/script:/apps/mmtsb/perl:
     /apps/mmtsb/bin:/apps/modeller/bin:/apps/intel_cc_80/bin:
     /apps/intel_fc_80/bin:/apps/intel_idb_73/bin:
     /usr/local/pbs/bin,
     PBS_O_SHELL=/bin/tcsh,PBS_O_HOST=green,
     PBS_O_WORKDIR=/home1/dude,PBS_O_QUEUE=default
 comment = Job started on Sat May 01 at 10:38
 etime = Sat May  1 10:38:14 2004

[dude@green ~]$ _
The information that is probably most useful from this very detailed output is the line
exec_host = greenc05/0
which indicates on which node (in this case: greenc05) the job is actually running. This is helpful in case the job crashes and manual cleanup is necessary.


Removing PBS jobs
arrow

From time to time it may be necessary to terminate a job that has been submitted to the queue either while it is waiting for resources to become available or when it is running already. In both cases, the command qdel can be used to remove a job if the job id is given as the argument:

[dude@green ~]$ qdel 756.green
[dude@green ~]$ _
If the command was successful no output is given, but qstat can be used to check, whether the job is indeed removed from the queue.



Output from PBS jobs
arrow

Since the queueing system runs a job on a remote node and possibly at a much later time after job submission, the job output does not appear on the terminal from where the job was submitted. Instead, standard output and error output are collected in separate files. The names of these files are composed of the name of the job script, followed by .o (standard output) or .e (error output) and the number of the PBS job that was printed out when the job was started. In the example above, the output file names would be md.job.o756 and md.job.e756. These files appear as soon as the job finishes in the directory from where the job was started.

It is possible to combine the output into a single file with the option -j oe either in form of a #PBS -j oe line in the job script or as an argument to qsub when the job is submitted. Other options, -o and -e, can be used to provide a specific file name and directory, where the output file(s) are written to, rather than the default name and the directory from where the job was submitted. For more information, take a look at the qsub manual page.

Because such output files are generated for every job, they start to accumulate over time. It might be a good idea to remove these files periodically or use lines such as

#PBS -o /dev/null
#PBS -e /dev/null
in job scripts that are working well and are not expected to generate interesting output in these files.



Interactive PBS jobs
arrow

So far all of the examples involved a job script that is submitted to the queueing system and run when resources become available. However, it is also possible to request one or more nodes for interactive use. This is done with qsub -I. An example is given in the following:

[dude@green ~]$ qsub -I -l nodes=1:ppn=2
qsub: waiting for job 1100.green to start
qsub: job 1100.green ready

[dude@greenc33 ~]$ _
This looks (and works) like regular remote login to one of the compute nodes, in this case greenc33, except that PBS knows that the node is reserved, actually 2 CPUs in this example, and other jobs cannot be submitted to this node until the interactive session is completed. Such interactive jobs are particularly useful for testing applications over a period of time, especially when multiple nodes are needed. Otherwise jobs would have to be submitted to the queue repeatedly, which is cumbersome, but may also cause significant delays if the required resources become unavailable and a test job has to wait until it can be started.

Interactive PBS sessions should not be used for production jobs since they may result in nodes running idle for unnecessarily long times after a job finishes if a new job is not restarted immediately or the interactive session is not terminated.


Parallel jobs

The advantage of a cluster with many nodes is that it is possible to run programs in parallel. Parallel program execution may be "embarassingly parallel" if multiple program runs are essentially unconnected requiring very little communication if any and the results are collected and combined at the end. An example would be the combinatorial exploration of protein-ligand interactions between one protein and many different potential ligands. However, parallel programs may also be very tightly coupled such as in molecular dynamics simulations. These types of parallel programs split the operations required for a single computational task over many processors but require constant communication in order to update data from all processes at certain times.

Tightly coupled parallel programs often use a special communications library, MPI (Message Passing Inteface) being the most popular one. Loosely coupled programs can often be submitted through rsh/ssh on multiple nodes as with some utilities of the MMTSB Tool Set.

This section outlines how to run parallel programs through the queueing system on the green cluster.


Using MPI executables
arrow

MPI executables are tightly coupled applications compiled with an MPI communications library. There are many free and commercial MPI implementations available. The two main free MPI versions are MPICH and LAM/MPI. On the green cluster, we have installed LAM/MPI.

Before using LAM/MPI executables, it is necessary to initialize the LAM environment by adding the line

lamboot -l
at the beginning of a parallel job script. At the very end of the script the LAM environment should be stopped with
lamhalt
The actual MPI program is started with mpirun. In the simplest form it is called with the option -np and the number of processors followed by the name of the MPI executable and additional arguments if necessary:
mpirun -np 4 charmm_mpi < charmm.script
This example would start charmm_mpi on four processors and read a CHARMM command script from the file charmm.script. Of course, this assumes that the job script has requested 4 CPUs from the queueing system. A complete job script would look like the following:
#!/bin/csh
#PBS -l nodes=2:ppn=2
  
cd /feigdata/feig/simulations 

lamboot -l
mpirun -np 4 charmm_mpi < charmm.script
lamhalt
Other options can be used to control how mpirun allocates processes onto nodes. Depending on the type of application, the allocation of processes onto nodes can affect parallel performance greatly. This is an effect of the heterogeneous architecture of the green cluster with dual CPU nodes with very fast communication connected to other nodes through relatively slow gigabit ethernet hardware.

As should be clear by now, a special program version is always needed for running in parallel. While it may be possible to start a serial version of a given program with mpirun, it won't run in parallel, probably requiring the same time as if it was run only on a single CPU. How to compile MPI programs is explained in more detail below.


Using the MMTSB Tool Set
arrow

The MMTSB Tool Set can be used to take advantage of parallel processing for a variety of modeling tasks. There are three different ways to use multiple nodes with the Tool Set:

  1. If utilities like mdCHARMM.pl are used, it is possible to redefine the location of the CHARMM executable (environment variable CHARMMEXEC) to include the mpirun command introduced above. An example is given in the following job script:

    #!/bin/csh
    #PBS -l nodes=2:ppn=2
    
    cd /feigdata/feig/simulations
    
    setenv CHARMMEXEC "mpirun -np 4 /apps/bin/charmm_mpi"
    
    lamboot -l   
    mdCHARMM.pl -par dynsteps=1000 -trajout md.dcd md.pdb
    lamhalt
    

  2. All of the ensemble computing utilities support a set of options for parallel execution. These kind of runs are embarassingly parallel, since it is the idea of ensemble computing that the same operation is performed on a large set of different structures.

    In the case of a distributed machine, such as the green cluster, the first thing that is needed for parallel execution is a host file describing which nodes are available. This file contains one line for each node beginning with the node name and followed by the number of CPUs on that node and an optional directory that could be used for local temporary storage (often /tmp/<user> would be a good idea) if needed. An example of such a file is given below:

    greenc02 2 /tmp/feig
    greenc03 2 /tmp/feig
    greenc05 1 /tmp/feig
    greenc06 1 /tmp/feig
    
    In this case 4 nodes with a total of 6 CPUs are available while /tmp/feig may be used for temporary storage. Such a file is then given with the option -hosts <file name>, e.g. in
    ensmin.pl -hosts available.hosts -cpus 6 sample min 
    
    With the hosts file given above, this example would use 6 CPUs in order to minimize an ensemble of structures. The option -cpus 6 is also needed in addition to the host file in order to indicate how many CPUs should actually be used.

    If such a command is run through the PBS queue, the hosts file should obviously correspond to the hosts that have been assigned to the job by the PBS scheduler. This means that this file can only be generated after the job has been started and it has become clear which CPUs are available. PBS generates a file at the location given in the environment variable $PBS_NODEFILE with a list of node names. If a node name occurs more than once, it means that multiple CPUs are available on that node. With the following simple awk script the PBS file is converted into the MMTSB hosts file format:
    awk '{printf("%s 1 /tmp/feig\n",$1);}' < $PBS_NODEFILE > hosts
    
    Another MMTSB parallel option is interesting for distributed clusters. If -mp (message passing) is given as an extra argument, parallel ensemble computing is run in a truly distributed manner. Input files are sent via the network to the clients, the clients perform the desired operation and then send the results back to the server where they are stored and archived. The default mode on the other hand, when -mp is not specified, assumes that a shared disk is available and reads and writes input and output via NFS. Either way should work fine, but there may be performance differences depending on the number of processors that are being used and the overall load on the cluster and NFS server(s).

  3. Replica exchange simulations, coupling simulations at multiple temperature windows at periodic intervals, are inherently parallel. Although it is possible to run replica exchange simulations on a single CPU, they are most ideally suited for parallel clusters by using one CPU for each temperature window. The options for the aarex.pl and latrex.pl scripts are similar to the ensemble computing utilities except that the number of CPUs that are being used is taken automatically from the number of temperature windows and does not have to be given explicitly. The following example shows how to start an all-atom replica exchange run in parallel:
    aarex.pl -hosts available.hosts -temp 8:250:350 start.pdb
    
    Again, a hosts file is needed as described above either generated manually or from a PBS node file if run through the queue.
    
    


Distributing processes manually
arrow

Finally, it is possible to use multiple nodes simply with rsh or ssh in order to start processes remotely on allocated nodes. Client/server type applications could be submitted in this manner by starting the server first on the node, where the job script is started, and then submitting clients to other nodes with rsh or ssh as in the following example:

#!/bin/csh
#PBS -l nodes=4:ppn=1

cd /feigdata/feig/simulations

start-server & 

foreach n ( `cat $PBS_NODEFILE` )
  rsh $n "( cd /feigdata/feig/simulations; start-client )" &
end
   
wait
The use of rsh/ssh allows great flexibility when using multiple CPUs. However, as in the previous examples, one should not start processes on nodes/processors that have not been allocated, and if parallel jobs are run in this manual fashion, it is the user's resposibility to clean up all processes when the job terminates either successfully or due to an error.