Cluster Help Pages

Before using any of the clusters, read the documents below:
System Rules
BDJ http://hpc.research.yale.edu/wiki/index.php/Bulldogj
BDI http://maguro.cs.yale.edu:8000/Bulldogi
and these help pages.

Cluster computing takes some getting used to. Jobs do not always run as soon as they are submitted; they may have to wait in the queue for a while. This is healthy and gives everyone fair access to the cluster.

Courtney

The documentation for the N1 Grid Engine can be found in /opt/n1ge6/doc/ on any host. To submit a job, use the command qsub. Interactive logins are not available, but short jobs may be run directly on the master node. The basic syntax for submitting a job from the command line is:

qsub submit_script.sh

After a job is submitted, the grid engine decides when to run it, given the status of the cluster and the run time you specify. (Hint: the more accurately you estimate your program's run time, the more efficiently your job will be scheduled.) There are currently three queues configured on Courtney: singlq, amdq, and pedgeq. The singlq queue contains all the single-processor machines, while the amdq queue contains the dual-processor AMD machines. The single-processor machines have 1GB of memory each and the dual-processor machines have 2GB of memory each. The pedgeq queue contains the other Dell PowerEdge server, which is identical to the master node, with 2 processors and 2GB of memory.
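A submit script along the following lines passes a run-time estimate to the grid engine. This is only a sketch: the job name, the one-hour limit, the output file names, and the placeholder workload are made up for illustration, and h_rt is the standard Grid Engine run-time resource, which may be configured differently on Courtney. Lines starting with #$ are read by qsub as options and are ordinary comments to the shell.

```shell
#!/bin/bash
# Hypothetical submit script (illustration only; adapt names and limits).

# Run the job under bash.
#$ -S /bin/bash
# Job name (hypothetical).
#$ -N example_job
# Start in the directory the job was submitted from.
#$ -cwd
# Estimated run time as hh:mm:ss (assumed value).
#$ -l h_rt=01:00:00
# Files for standard output and standard error (hypothetical names).
#$ -o example_job.out
#$ -e example_job.err

# JOB_ID is set by the grid engine; fall back to "local" when run by hand.
job_label="job ${JOB_ID:-local} on ${HOSTNAME:-unknown-host}"

echo "Starting ${job_label}"
# Placeholder workload: replace with the real program invocation.
result=$(( 6 * 7 ))
echo "Result: ${result}"
echo "Finished ${job_label}"
```

A queue can also be requested on the command line, for example qsub -q amdq submit_script.sh, although this is not required.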

You do not have to specify a queue explicitly for your job to run. The time limit for any job running in this queue is 30 days. If you need to run a job longer than 30 days, you will need to rethink the structure of your code. Download this submit script to see the syntax of a submit script. If you wish to run an MPI program, download this MPI submit script.

qstat - Show the status of Grid Engine jobs and queues. This shows which jobs are currently running and which jobs are waiting in the queue. To see the status of every node explicitly, use qstat -f; to see the details of all jobs, type qstat -r. Type man qstat to see the many other options available for this command.
qhost - Show the status of Grid Engine hosts, queues, jobs. This gives a status report of all the nodes and the load they are carrying, even if a job was not submitted via N1GE6.
qdel - Delete a job. Type qstat to find your job ID number, then qdel jobid to delete your job.

Using submit scripts and running them from the command line is probably the quickest way to submit any job. If you find yourself frustrated, however, you may use the graphical interface provided by N1GE6. You will need an X server: Cygwin has an excellent one; just make sure you select xorg-x11-base when you install. Then run qmon & from the command line to start the graphical front end. You can submit jobs and do just about anything else with this tool. See the documentation for more details.
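When several jobs need to be deleted, the job IDs can be pulled out of the qstat listing with a small helper. The sketch below assumes the default qstat layout (a header line, a line of dashes, then one line per job with the job ID in the first column); the job_ids function and the sample listing are made up for illustration.

```shell
#!/bin/bash
# Print the job IDs from default-format qstat output read on stdin.
# Assumes one header line, one line of dashes, then one line per job
# with the numeric job ID in the first column.
job_ids() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Made-up sample standing in for real qstat output.
sample_output='job-ID  prior   name       user   state submit/start at     queue        slots
--------------------------------------------------------------------------------
    101 0.55500 meep_run   alice  r     01/15/2008 10:12:01 amdq@node03  2
    102 0.55500 fit_test   alice  qw    01/15/2008 10:15:42              1'

# xargs with no command joins the IDs onto a single line.
ids=$(printf '%s\n' "$sample_output" | job_ids | xargs)
echo "Job IDs: ${ids}"
```

On the cluster the same helper could be fed from qstat directly, for example qstat -u "$USER" | job_ids | xargs qdel, which would delete all of your own jobs, so use it with care.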

Bulldog J (BDJ)

Jobs on BDJ are not allowed to run longer than 24 hours. If a job has to run longer, incorporate checkpoints into the code so that it can be restarted after it has stopped. Otherwise run on Courtney or, as a last resort, Bulldog I.
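The checkpoint idea can be sketched as a job that records its progress after every completed step and resumes from that record when it is resubmitted. Everything here is hypothetical: the checkpoint file name, the step counts, and the do_step placeholder. A real code would save its actual state (fields, iteration counters) rather than just a step number, and would stop on a time limit rather than a step limit.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch (hypothetical names and step counts).
checkpoint_file="${TMPDIR:-/tmp}/example_job.$$.ckpt"

# Placeholder for one unit of real work.
do_step() {
    : "step $1"
}

# Run at most max_steps steps of a total_steps-step job, resuming from
# the step recorded in the checkpoint file if one exists.
run_job() {
    local total_steps=$1 max_steps=$2 step=0 done_now=0
    [ -f "$checkpoint_file" ] && step=$(cat "$checkpoint_file")
    while [ "$step" -lt "$total_steps" ] && [ "$done_now" -lt "$max_steps" ]; do
        do_step "$step"
        step=$(( step + 1 ))
        done_now=$(( done_now + 1 ))
        # Record progress after every completed step.
        echo "$step" > "$checkpoint_file"
    done
    echo "$step"
}

first=$(run_job 10 4)    # first submission stops after 4 steps
second=$(run_job 10 4)   # resubmission resumes at step 4
third=$(run_job 10 4)    # last submission finishes the remaining 2 steps
echo "Progress after each run: $first $second $third"
rm -f "$checkpoint_file"
```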

Common software used in our group is installed on BDJ in /home2/hc5/install. Download the .bashrc file, which should replace the one in your current home directory; remember to rename the downloaded file to .bashrc. Download a sample submit script for a meep control file, which may use OpenMPI or execute a serial job. You can see the status of our group's jobs on BDJ by executing the command:

jstat

This shows all our currently running jobs, plus the total number of processes free and queued on the cluster. If our group uses 51 processes 24 hours a day all year, we will reach our quota limit, so jstat also shows the number of processes available to our group if 51 were the limit. In reality, the limit for any user at one time is 256 jobs.

A FORTRAN compilation script (if90) is installed that uses the Intel Fortran compiler and automatically links to the OpenMPI library, FFTW3, GSL, LAPACK, and IMSL. Note that MPI programs using IMSL need MPICH instead of OpenMPI, so the .bashrc file would have to be changed for this compilation to work.
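For a sense of scale, the yearly usage implied by the figure of 51 processes can be worked out directly. The calculation below assumes a 365-day year and that usage is counted in process-hours; the actual quota and its units are not stated here.

```shell
#!/bin/bash
# Yearly usage implied by 51 processes running around the clock.
# Assumes a 365-day year and usage counted in process-hours.
processes=51
hours_per_day=24
days_per_year=365

process_hours=$(( processes * hours_per_day * days_per_year ))
echo "Process-hours per year: ${process_hours}"   # prints 446760
```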

Bulldog I (BDI)

Common software used in our group is installed on BDI in /home2/jda3/install. Download the .bashrc file, which should replace the one in your current home directory; remember to rename the downloaded file to .bashrc. Download a sample submit script for a meep control file, which may use OpenMPI or execute a serial job. You can see the status of our group's jobs on BDI by executing the command:

jstat

This will show all our currently running jobs, plus the number of total processes free and queued on the cluster. A FORTRAN compilation script (jf90) is installed that uses the Portland Group Fortran compiler and automatically links to the OpenMPI library and FFTW3.



Further help:

  • Simple examples of using the Message Passing Interface (MPI)

  • Tutorial of the N1 Grid Engine 6

  • Using the MIT Electromagnetic Equation Propagation (MEEP) FDTD code. Includes instructions for using meep-mpi.

  • Password-less SSH Setup

  • Technical Information about Courtney