MMTSB
Tool Set Documentation

PARALLELoptions

-cpus number

This option sets the number of CPUs to use in parallel, in shared-memory as well as distributed environments. The ensemble tools ensmin.pl, enslatsim.pl, ensrun.pl, and calcprop.pl require it to initiate parallel execution.

The replica exchange tools latrex.pl, aarex.pl, lmcrex.pl, and fsamrex.pl do not usually require this option, since by default the number of CPUs is taken from the number of temperature windows. It can, however, be used to specify fewer CPUs than temperature windows if computational resources are limited. For example, one could run 32 temperature windows on 8 CPUs by sharing each CPU among 4 windows. However, this job would take 4 times longer and require 4 times as much memory per CPU as a run on 32 CPUs. This option is ignored for replica exchange runs if more CPUs than temperature windows are requested.

The number of requested CPUs is not checked automatically against the CPUs actually available or against the load average on a given machine. To avoid overloading the system, the user should be careful not to request more CPUs than a machine provides.
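The arithmetic behind sharing CPUs among temperature windows can be sketched in a few lines of Python. This is only an illustration: the function name assign_windows and the round-robin assignment are assumptions made for the example, and the text above does not say how the replica exchange tools actually map windows to CPUs.

```python
def assign_windows(n_windows, n_cpus):
    """Spread window indices 0..n_windows-1 over CPUs (hypothetical round-robin scheme)."""
    if n_windows < 1 or n_cpus < 1:
        raise ValueError("need at least one window and one CPU")
    # Requesting more CPUs than windows gains nothing, so cap the count,
    # mirroring the rule that such a request is ignored.
    used = min(n_cpus, n_windows)
    return [list(range(cpu, n_windows, used)) for cpu in range(used)]


shares = assign_windows(32, 8)
# Each of the 8 CPUs carries 4 windows, hence roughly 4 times the run time
# and 4 times the memory per CPU compared with a 32-CPU run.
print(len(shares), [len(s) for s in shares])
```

With 32 windows and 32 or more CPUs the same function returns one window per CPU, which corresponds to the default behavior described above.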

-hosts file

For parallel execution in distributed environments where password-less remote login via ssh is possible, a file listing the available hosts can be given with this option. The file is expected to contain one line per host, giving the host name or IP address, the number of available CPUs, and an optional local directory name for temporary files.

An example host file describing three nodes with a total of 14 CPUs/cores looks as follows:

 snoopy 2 /tmp/workspace
 goofy 8 /tmp/workspace
 scoobydoo 4 /tmp/workspace
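As a quick sanity check before submitting a job, a host file in this format can be read and its CPU total computed. The following Python sketch is not part of the tool set; the names Host and parse_hosts are hypothetical, and it assumes whitespace-separated fields as in the example above.

```python
from typing import NamedTuple, Optional


class Host(NamedTuple):
    name: str               # host name or IP address
    cpus: int               # number of available CPUs
    workdir: Optional[str]  # optional local directory for temporary files


def parse_hosts(text):
    """Parse host-file text into a list of Host records, skipping blank lines."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 2:
            raise ValueError(f"expected 'host cpus [dir]', got: {line!r}")
        workdir = fields[2] if len(fields) > 2 else None
        hosts.append(Host(fields[0], int(fields[1]), workdir))
    return hosts


EXAMPLE = """\
 snoopy 2 /tmp/workspace
 goofy 8 /tmp/workspace
 scoobydoo 4 /tmp/workspace
"""

hosts = parse_hosts(EXAMPLE)
total_cpus = sum(h.cpus for h in hosts)
print(total_cpus)  # 14 for the three hosts listed above
```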

-mp

During distributed parallel execution, the default mode of operation is to use a common directory that is shared via NFS for all input and output. If NFS is unavailable or unstable (as in some Linux configurations), this option can be given so that a local directory (as specified in the hosts file) is used for running jobs instead, and all final output is sent to the server via direct network communication. In some cases this option may affect parallel performance, so it should be used only if necessary.

-keepmpdir

This option can be used in addition to the -mp option in order to keep the contents of the local directories after the parallel job has finished. This may be useful, e.g., for inspecting log files that are generated but not sent back to the server. By default all local data is removed at the end of each job.

-jobenv name

In distributed environments where remote job submission to a list of hosts via rsh is not possible, automated parallel runs are still possible if the utility is started separately on each CPU with an environment variable set to the parallel rank of that CPU. In this case the name of the environment variable is given with this option. The values are expected to range from 0 to the number of CPUs minus 1. If this option is used, the server is started only if the rank is equal to 0. Parallel clients are started for all ranks after the server has become available.
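The rank convention can be illustrated with a short Python sketch. It is not the tool set's implementation: the function role_for_rank and the variable name MY_RANK are hypothetical, and the code only mirrors the rules stated above (ranks run from 0 to the number of CPUs minus 1, and only rank 0 starts the server).

```python
import os


def role_for_rank(env_name, n_cpus, environ=os.environ):
    """Return (rank, starts_server) for the process described by environ[env_name]."""
    raw = environ.get(env_name)
    if raw is None:
        raise KeyError(f"environment variable {env_name} is not set")
    rank = int(raw)
    if not 0 <= rank < n_cpus:
        raise ValueError(f"rank {rank} outside 0..{n_cpus - 1}")
    # Only rank 0 starts the server; a client runs for every rank,
    # including rank 0, once the server is available.
    return rank, rank == 0


# MY_RANK is a made-up variable name used only for this example.
for value in ("0", "1", "3"):
    print(role_for_rank("MY_RANK", 4, {"MY_RANK": value}))
```

In practice the queuing system or launcher would be responsible for setting the variable to a different value for each process.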

-saveid file

This option can be used to save the server address, port number, and ID number under the given file name. This information may be needed so that external monitoring programs can connect to the parallel server. Such a file is generated automatically for parallel execution in distributed environments but not in shared memory environments. When the -jobenv option is used, this option can also be used to change the file name from the default name 'save.id'.

-rserv file

Normally this option is used internally for remote job execution in distributed parallel environments and provides the name of a server information file such as the one written out with the -saveid option. This option can also be used to manually connect to a running server that may have been started with jobserver.pl or rexserver.pl.

-jobs [from:]to

This option is also used internally for remote job execution. It defines the subset of jobs that are run on a particular machine.
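The [from:]to syntax can be read as a job range with an optional starting index. The Python sketch below shows one way to interpret such a value. The function parse_jobs is hypothetical, and two points are assumptions not stated in this documentation: that an omitted "from" means the first job is number 1, and that the range is inclusive.

```python
def parse_jobs(spec, default_first=1):
    """Split a '[from:]to' string into (first, last) job numbers.

    default_first=1 is an assumption; the text above does not say which
    job number is used when 'from' is omitted.
    """
    head, sep, tail = spec.partition(":")
    if sep:
        first, last = int(head), int(tail)
    else:
        first, last = default_first, int(head)
    if first > last:
        raise ValueError(f"empty job range: {spec!r}")
    return first, last


print(parse_jobs("5:8"))
print(parse_jobs("8"))
```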