Queen Bee Users Guide


Logon to Queen Bee via GSISSH using TeraGrid certificates

TeraGrid users can access Queen Bee at login1-qb.loni-lsu.teragrid.org or queenbee.loni-lsu.teragrid.org via gsissh, using their TeraGrid certificates. On any TeraGrid resource, or on a non-TeraGrid resource that supports myproxy and gsissh with a fairly recent Globus installation (4.0.1 or later), you can run:

$ myproxy-logon -l <username> -s myproxy.teragrid.org
Enter MyProxy pass phrase:

Note: Please replace <username> with your TeraGrid portal username, and enter your TeraGrid portal password when prompted for the MyProxy pass phrase.

After your credential has been received, you can execute the following to log on to Queen Bee:

$ gsissh login1-qb.loni-lsu.teragrid.org

If this is the first time you have logged in to Queen Bee you'll see something like this:

Generating public/private dsa key pair.
Enter file in which to save the key (/home/honggao/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/honggao/.ssh/id_dsa.
Your public key has been saved in /home/honggao/.ssh/id_dsa.pub.
The key fingerprint is:
31:53:a6:cb:ed:dd:8c:44:57:fd:d1:81:b5:b2:ec:29 honggao@qb2
      

You should accept the default file as the one in which to save the key, and you should use an empty passphrase. This will configure your account so that you can ssh to the other nodes without receiving the login prompt. This is necessary if you want to run parallel jobs on Queen Bee.

For this reason, you should also be careful about modifying anything in your .ssh directory. If you cannot freely ssh between Queen Bee nodes, your parallel programs will not run, and you will need to reset your ssh key with the following commands (accept the default file, answer "y" to Overwrite, and use an empty passphrase):

$ cd ~/.ssh
$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/honggao/.ssh/id_dsa): 
/home/honggao/.ssh/id_dsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/honggao/.ssh/id_dsa.
Your public key has been saved in /home/honggao/.ssh/id_dsa.pub.
The key fingerprint is:
55:76:90:c5:ad:06:b3:8a:a3:fe:b9:6b:3b:16:8d:4e honggao@qb1.loni.org

Or you can create a new ssh key with NO passphrase and save it to the default ssh key file by adding flags to ssh-keygen.

$ ssh-keygen -N "" -q -t dsa -f ~/.ssh/id_dsa
/home/honggao/.ssh/id_dsa already exists.
Overwrite (y/n)? y
$ cp -p ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
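
The key setup described above can be checked with a small shell function. The following is a minimal sketch, not an official Queen Bee tool: the name check_ssh_setup is hypothetical, and the function only checks that the key files exist and that the public key is listed in authorized_keys. It does not test whether the key has an empty passphrase or whether ssh between nodes actually works.

```shell
# check_ssh_setup is a hypothetical helper, not a Queen Bee command.
# It checks that the DSA key pair exists in the given directory
# (default: ~/.ssh) and that the public key is in authorized_keys.
check_ssh_setup() {
    dir=${1:-$HOME/.ssh}
    for f in id_dsa id_dsa.pub authorized_keys; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            return 1
        fi
    done
    # -F: fixed strings, -x: match whole lines, -q: no output
    if grep -qxF -f "$dir/id_dsa.pub" "$dir/authorized_keys"; then
        echo "ok"
    else
        echo "public key not in authorized_keys"
        return 1
    fi
}
```

Calling check_ssh_setup with no argument checks ~/.ssh; if it reports a problem, the reset commands above apply.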




Setting up your environment for TeraGrid users.

Once you log in to Queen Bee, you need to set up your TeraGrid environment. To make sure that all TeraGrid-related software packages and tools are in your path, check that your $HOME/.soft file contains:

# TeraGrid wide basic software suite
@teragrid-basic

# TeraGrid wide Globus 4 and Grid software suite
@globus-4.0

# Platform recommended development software suite 
@teragrid-dev 

Please remove the "@default" entry after adding these, and then save the file. After editing your $HOME/.soft file, you can update your environment by running:

$ resoft
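
Before running resoft, the contents of the .soft file can be checked against the entries listed above. This is a minimal sketch under stated assumptions: check_teragrid_soft is a hypothetical name and is not part of softenv, and the check simply compares whole lines after trimming surrounding whitespace.

```shell
# check_teragrid_soft is a hypothetical helper, not part of softenv.
# It reports which of the three required keys are missing from a
# .soft file and whether the "@default" entry is still present.
check_teragrid_soft() {
    soft_file=${1:-$HOME/.soft}
    rc=0
    # Trim leading and trailing whitespace so "@teragrid-dev " still matches.
    trimmed=$(sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$soft_file")
    for key in @teragrid-basic @globus-4.0 @teragrid-dev; do
        if ! printf '%s\n' "$trimmed" | grep -qxF -- "$key"; then
            echo "missing: $key"
            rc=1
        fi
    done
    if printf '%s\n' "$trimmed" | grep -qxF -- "@default"; then
        echo "remove: @default"
        rc=1
    fi
    return $rc
}
```

The function prints nothing and returns 0 when the file matches the description above.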

TeraGrid users access Queen Bee via gsissh using their TeraGrid certificates only, so no password is needed to log in. However, if you need a password to log in to Queen Bee using SSH, please request one using the password reset form at https://allocations.loni.org/user_reset.php or contact us at sys-help@loni.org.




Logon to Queen Bee via SSH for LONI users

Queen Bee has two head nodes, qb1.loni.org and qb2.loni.org. You can log in by connecting to either of them via ssh. Windows users will need an ssh client.

If this is the first time you have logged in to Queen Bee, you'll see something like this:

Generating public/private dsa key pair.
Enter file in which to save the key (/home/honggao/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/honggao/.ssh/id_dsa.
Your public key has been saved in /home/honggao/.ssh/id_dsa.pub.
The key fingerprint is:
31:53:a6:cb:ed:dd:8c:44:57:fd:d1:81:b5:b2:ec:29 honggao@qb2
      

You should accept the default file as the one in which to save the key, and you should use an empty passphrase. This will configure your account so that you can ssh to the other nodes without receiving the login prompt. This is necessary if you want to run parallel jobs on Queen Bee.

For this reason, you should also be careful about modifying anything in your .ssh directory. If you cannot freely ssh between Queen Bee nodes, your parallel programs will not run, and you will need to reset your ssh key with the following commands:

$ cd ~/.ssh  
$ ssh-keygen -t dsa
$ cp -p id_dsa.pub authorized_keys

Queen Bee has 2 head nodes (qb1 and qb2) and 668 compute nodes (qb001 to qb668). You compile your code on a head node and execute it on one or more compute nodes. The remainder of this tutorial guides you through an example of executing a parallel job on the compute nodes.




Setting up your environment for LONI users.

First you have to set up your environment. You must decide which packages you want from the big list of available packages. Take note of the magic strings under the "softenv" column. In this case the magic strings we want are +intel-cc-9.1, +intel-fc-9.1, and +mvapich-0.98-intel9.1.

Note that the suffix "intel9.1" on the mvapich package name indicates that this copy of mvapich was compiled with the 9.1 compilers from Intel.

Next you need to add the appropriate variables to your environment. You can do this with softenv: add these magic strings to the .soft file in your home directory (${HOME}/.soft) and then reset your environment by running the resoft command.

[user_name@qb ~]$ vi ${HOME}/.soft

@default
+intel-cc-9.1
+intel-fc-9.1
+mvapich-0.98-intel9.1
      
[user_name@qb ~]$ resoft




Creating home and work directories

Your home directory (/home/your_username) is created automatically the first time you log in to Queen Bee. Queen Bee has its own /home disk with a quota of 5 GB. Please do not use the /home volume for batch job I/O; use the /work volume instead.

Your directory on the /work volume (/work/your_username) is created automatically within an hour of your first login.

Please limit the number of files per directory to 10,000. No disk quotas are currently in effect for the /work volume, but all files will be purged after 30 days. Should disk space become critically low, files may be purged sooner. Please do not try to circumvent the removal process, as this may lead to restrictions on your access to the /work volume. If you need large storage, please contact us at sys-help@loni.org and project-based storage will be added for you on request.
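
To keep track of the two /work guidelines above, you can list files that are approaching the purge age and count the entries in a directory. The following is a rough sketch: both function names are hypothetical, and it is an assumption that the purge is based on modification time rather than access time. The -mindepth and -maxdepth options are supported by GNU and BSD find but are not part of POSIX.

```shell
# Both helpers are hypothetical sketches, not LONI tools.

# List regular files under directory $1 that were last modified
# more than $2 days ago. Using modification time is an assumption;
# the purge may use a different timestamp.
list_old_files() {
    find "$1" -type f -mtime +"$2"
}

# Count the entries directly inside directory $1 (not recursive),
# for comparison with the 10,000 files-per-directory guideline.
count_dir_entries() {
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}
```

For example, list_old_files /work/$USER 25 would show files that have not been modified for more than 25 days and should be copied elsewhere if they are still needed.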

TeraGrid users can transfer files to NCSA's archival storage for long-term retention. Please refer to the Using NCSA Archival Storage section at the bottom of this guide.




Changing your password and shell

You can change or reset your password for Queen Bee online at https://allocations.loni.org/user_reset.php (the form requires you to enter the email address associated with your account). You can also change your shell online at https://allocations.loni.org/profile.php.




Compiling on Queen Bee

So, you've managed to log in and set up your environment on Queen Bee. You've done whatever tweaking you like to do on any Linux machine you've worked on in the past, and your environment points to the Intel compilers and MPI packages. What now?

Let's assume we have a Fortran or C/C++ MPI program that we wish to compile and run under MPI. There are several flavors (versions) of MPI available on Queen Bee, and using the MPICH_HOME variable in your makefiles will make it easier for you to switch flavors if you need to.

For TeraGrid users, the default MPI is MVAPICH2 version 0.98 compiled with the Intel 10.1 compilers (softenv keyword +mvapich2-0.98-intel10.1). MVAPICH2, developed at Ohio State University, is an MPI implementation based on MPICH2 that makes efficient use of the InfiniBand network.

You can verify that you have set this up correctly by checking that the corresponding mpif90 and mpirun are in your path:

$ which mpif90
/usr/local/packages/mvapich2-0.98-intel10.1/bin/mpif90
$ which mpirun
/usr/local/packages/mvapich2-0.98-intel10.1/bin/mpirun

After the correct environment is set, you can compile your program using the following steps:

$ mpicc test.c -O3 -o a.out      # C code
or
$ mpif90 test.F -O3 -o a.out     # Fortran code

To run an MVAPICH2 job, a file named ".mpd.conf" containing the following line must exist in your home directory:

MPD_SECRETWORD=xxxxxxxxxxx

where xxxxxxxxxxx is a password string of your choice. The file ".mpd.conf" must not be readable or writable by anyone except yourself. To set the permissions, run:

$ chmod 600 ~/.mpd.conf 
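
The two steps above (writing the MPD_SECRETWORD line and restricting the permissions) can be combined so that the file is never readable by others, even briefly. This is a minimal sketch; write_mpd_conf is a hypothetical name, and the secret string in the usage note is only a placeholder.

```shell
# write_mpd_conf is a hypothetical helper. It writes the
# MPD_SECRETWORD line to file $1 using secret $2 and restricts
# the file to its owner, as the chmod 600 step above does.
write_mpd_conf() {
    conf_file=$1
    secret=$2
    (
        umask 077    # files created in this subshell get no group/other access
        printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf_file"
    )
    # Also covers the case where the file already existed with looser permissions.
    chmod 600 "$conf_file"
}
```

For example, write_mpd_conf ~/.mpd.conf "a-string-of-your-choice" creates the file in your home directory.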


You then need to start the mpd daemon before running MVAPICH2 jobs; the batch script in the next section does this with mpdboot.




Running on Queen Bee

To run a parallel job on Queen Bee, you submit it to the batch queue. Our queuing system is the Torque Portable Batch System (PBS), the workload management system from Cluster Resources, together with Moab, the job scheduler also from Cluster Resources. The command that you use to submit your job is "qsub".

The following PBS script shows an example of running an MVAPICH2 job. It is submitted using this command:

[user_name@qb ~]$ qsub test.qsub

The contents of test.qsub are as follows:

#!/bin/bash
#PBS -q workq
# the queue to be used. 
#
#PBS -A your_allocation
# specify your project allocation
#
#PBS -l nodes=4:ppn=8
# number of nodes and number of processors on each node to be used.
# Do NOT use ppn = 1. Note that there are 8 processors on each Queen Bee node.
#
#PBS -l cput=20:00:00
# requested CPU time.
#
#PBS -l walltime=20:00:00
# requested Wall-clock time.
#
#PBS -o myoutput2
# name of the standard output file.
#
#PBS -j oe
# standard error output merge to the standard output file.
#
#PBS -N s_type
# name of the job (that will appear on executing the qstat command).
#
# Following are non PBS commands. PLEASE ADOPT THE SAME EXECUTION SCHEME
# i.e. execute the job by copying the necessary files from your home directory
# to the scratch space, execute in the scratch space, and copy back
# the necessary files to your home directory.
#
export WORK_DIR=/work/$USER/your_code_directory
cd $WORK_DIR
# changing to your working directory (we recommend you to use work volume for batch job run)
#
export NPROCS=$(wc -l < "$PBS_NODEFILE")
# total number of processes: $PBS_NODEFILE has one line per allocated processor.
#
date
# record the time the job starts
#

# For MVAPICH2 jobs, start one mpd daemon on each allocated node.
# number of distinct nodes in the allocation
export MPDSNP=$(uniq "$PBS_NODEFILE" | wc -l)
# node file that lists each allocated node once
export MPD_NODEFILE=$WORK_DIR/mpd_nodefile_$USER
uniq "$PBS_NODEFILE" > "$MPD_NODEFILE"
mpdboot -v -n $MPDSNP -f "$MPD_NODEFILE"
mpdtrace -l
rm "$MPD_NODEFILE"
# run mvapich2 jobs
mpirun -np $NPROCS $WORK_DIR/your_executable
# stop mpd daemons
mpdallexit

date
# record the time the job ends
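
The process count and the number of mpd daemons in the script above are both derived from the node file that PBS provides, which contains one line per allocated processor. The following sketch isolates those two calculations so they can be tried on a sample file; count_procs and count_nodes are hypothetical names, and sort -u is used instead of uniq so that the result does not depend on the order of the lines.

```shell
# Hypothetical helpers showing how the process count and the
# number of mpd daemons follow from a PBS node file.

# Total number of processes: one line per allocated processor.
count_procs() {
    wc -l < "$1" | tr -d ' '
}

# Number of distinct nodes: one mpd daemon is started per node.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}
```

Inside a job, count_procs "$PBS_NODEFILE" and count_nodes "$PBS_NODEFILE" give the values that the script stores in NPROCS and MPDSNP.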

You can also use mpiexec to execute your MVAPICH2 applications. In your submit script, you can replace the "For MVAPICH2 jobs" part above with the following:


# For MVAPICH2 jobs, start the mpd daemon on each allocated node.
# specify number of mpd daemons (one per node, 8 processors per node)
number_mpd=$((NPROCS / 8))
# start mpd
mpdboot -n $number_mpd -f $PBS_NODEFILE
# run mvapich2 jobs
mpiexec -np $NPROCS $WORK_DIR/your_executable
# stop mpd daemons
mpdallexit

For LONI users, and for TeraGrid users who have not set up the TeraGrid environment described above, the default MPI is MVAPICH version 0.98 built using the Intel 10.1 compilers. TeraGrid users who prefer MVAPICH to MVAPICH2 can add the following line to the .soft file in their home directory:

+mvapich-0.98-intel10.1

Alternatively, you can achieve the same effect by setting the following environment variables in the .bashrc file in your home directory and then sourcing .bashrc again:

export MPICH_HOME=/usr/local/packages/mvapich-0.98-intel10.1
export LD_LIBRARY_PATH=$MPICH_HOME/lib:$LD_LIBRARY_PATH
export PATH=$MPICH_HOME/bin:$PATH  

You can verify that you have set this up correctly by checking that the corresponding mpif90 and mpirun are in your path:

$ which mpif90
/usr/local/packages/mvapich-0.98-intel10.1/bin/mpif90
$ which mpirun
/usr/local/packages/mvapich-0.98-intel10.1/bin/mpirun
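
Because the three exports above depend only on the installation directory, they can be wrapped in a function, which makes it easier to switch MPI flavors within one session. This is a minimal sketch; use_mpi is a hypothetical name and not a Queen Bee or softenv command.

```shell
# use_mpi is a hypothetical helper that applies the three exports
# shown above for the MPI installation directory given as $1.
use_mpi() {
    MPICH_HOME=$1
    # Append the old LD_LIBRARY_PATH only if it was already set.
    LD_LIBRARY_PATH=$MPICH_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    PATH=$MPICH_HOME/bin:$PATH
    export MPICH_HOME LD_LIBRARY_PATH PATH
}
```

For example, after use_mpi /usr/local/packages/mvapich-0.98-intel10.1, the command which mpif90 should report the path under that directory.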
      

After the correct environment is set, you can compile your program using one of the following commands:

$ mpicc test.c -O3 -o a.out
$ mpif90 test.F -O3 -o a.out

To run an application built with MVAPICH, you do not need to run the mpd daemon. If you want to run your program interactively on 16 processors, first submit an interactive job request to PBS:

$ qsub -I -l nodes=2:ppn=8 -l walltime=00:30:00 -l cput=00:30:00 

When your job request is granted, enter the directory under which your parallel executable is, then launch:

$ mpirun -np 16 myexecutable

The following is a sample PBS script for submitting your MVAPICH application to the PBS queue:

#!/bin/bash
#PBS -q checkpt
# the queue to be used.
#PBS -M your_mail_address@somehost.edu
# your notification email address
#PBS -A your_TG_ALLOCATION 
# the project allocation
#
#PBS -l nodes=16:ppn=8
#
# number of nodes and number of processors on each node to be used.
# Do NOT use ppn = 1 except for a serial job submitted to the single queue.
#
#PBS -l cput=01:00:00
# requested CPU time.
#
#PBS -l walltime=01:00:00
# requested Wall-clock time.
#
#PBS -V
# export the submission environment to the job.
#
#PBS -o stdout
#PBS -e stdout
# names of the standard output and standard error files.
#
#PBS -j oe
# merge standard error into the standard output file.
#
#PBS -N pbs-test
# name of the job (shown by the qstat command).
#
# Following are non PBS commands. PLEASE ADOPT THE SAME EXECUTION SCHEME
# i.e. execute the job by copying the necessary files from your home directory
# to the scratch space, execute in the scratch space, and copy back
# the necessary files to your home directory.
#

export WORK_DIR=/work/$USER/your_working_directory

export NPROCS=$(wc -l < "$PBS_NODEFILE")
# total number of processes: $PBS_NODEFILE has one line per allocated processor.

cd $WORK_DIR
# change to the working directory in the scratch space.

mpirun -machinefile $PBS_NODEFILE -np $NPROCS $WORK_DIR/test
# run the executable.

So now you've successfully submitted your job to the queue -- but is it actually running? And if it does run, how can you analyze how well it did?




Commands for monitoring

  1. qstat: this will show you the status of your job and the jobs of others in the queue. It can show you various other bits of information about your job as well, such as the number of nodes it intends to use, the name of the queue it's in, etc.
  2. mshow: this command displays various diagnostic messages about the system and job queues. It lists all the jobs in the queue, first those that are running, then those that are queued in the order that they will be run.
  3. showq: this command displays information about jobs within the batch system.
  4. showstart: this command gives an estimated starting time for your job.
  5. qdel: this command deletes a PBS job in the queue.

The following tools, written by our systems staff, are also available:

  1. qshow: this command shows the load on each compute node that your job is using. It can also show, and optionally kill, user processes on remote nodes, or execute commands on them.
  2. qfree: this command shows how the nodes in a cluster are allocated and shows system usage.

More detailed information on using the Torque PBS commands and Moab to schedule and monitor jobs can be found in the Cluster Resources online documentation.




Queue limits and descriptions

There are currently four queues on Queen Bee: workq, checkpt, preempt, and priority.

Queue name   Total nodes   Maximum nodes allowed   Max wall clock time   Description
workq        530           256                     48:00:00              default, for non-preemptable parallel jobs
checkpt      668           256                     48:00:00              preemptable, for parallel jobs that can be checkpointed
preempt      138           128                     48:00:00              for urgent parallel jobs that will preempt jobs in checkpt queue, requires special permission
priority     138           128                     48:00:00              for on-demand parallel jobs that will have higher priority, requires special permission

At any given time, users may run up to 8 jobs at once, consuming a maximum of 384 total nodes. Additional user limits may be enforced as well. Please contact us at sys-help@loni.org for more information on user limits or special requests.
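
A job request can be checked against the per-queue limits before it is submitted. The following is a rough sketch: check_queue_request is a hypothetical name, the limits are copied from the queue description above and may change, and the per-user limits (8 jobs, 384 nodes in total) are not checked.

```shell
# check_queue_request is a hypothetical helper. The limits are
# copied from the queue description above and may be out of date.
# Arguments: queue name, number of nodes, wall clock time in whole hours.
check_queue_request() {
    queue=$1
    nodes=$2
    hours=$3
    case $queue in
        workq|checkpt)    max_nodes=256 ;;
        preempt|priority) max_nodes=128 ;;
        *) echo "unknown queue: $queue"; return 1 ;;
    esac
    if [ "$nodes" -gt "$max_nodes" ]; then
        echo "too many nodes: $nodes > $max_nodes"
        return 1
    fi
    if [ "$hours" -gt 48 ]; then
        echo "walltime too long: $hours > 48 hours"
        return 1
    fi
    echo "ok"
}
```

For example, check_queue_request workq 4 20 prints ok, while a request for 300 nodes in checkpt is rejected.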




Job queuing priority

The queuing system schedules jobs based on job priority, which takes several factors into account. Jobs with a higher priority are scheduled ahead of jobs with a lower priority. The scheduler also has a backfill capability for jobs that are short in duration or require a small number of nodes: it runs such small jobs while waiting for the start time of a large job that requires many nodes.

To determine which jobs to run first, Moab uses the following formula to calculate the job priority:

Job priority = credential priority + fairshare priority + resource priority + service priority

(1) Credential Priority Subcomponent:

credential priority = credweight * (userweight * job.user.priority)
credential priority = 100 * (10 * 100) = 100000 (a constant)

(2) Fairshare Priority Subcomponent:

fairshare priority = fsweight * min (fscap, (fsuserweight * DeltaUserFSUsage))
fairshare priority = 100 * (10 * DeltaUserFSUsage)

A user's fair share usage is the sum over seven days of the daily processor seconds used by the user, each multiplied by the daily decay factor, divided by the sum over seven days of the total daily processor seconds used, each multiplied by the daily decay factor. The decay factor is 0.9. DeltaUserFSUsage is the fair share target percentage for each user (20 percent) minus the calculated fair share usage percentage.

In other words, it is the target percentage minus the percentage actually used. For a user who has not used the cluster for a week:

fairshare priority = 100 * (10 * 20) = 20000

(3) Resource Priority Subcomponent

resource priority = resweight * min (rescap, (procweight * TotalProcessorsRequested))
resource priority = 30 * min (26720, (10 * TotalProcessorsRequested))

For a 32 processor job: resource priority = 30 * 10 * 32 = 9600

(4) Service Priority Subcomponent

service priority = serviceweight * (queuetimeweight * QUEUETIME + xfactorweight * XFACTOR)
service priority = 2 * (2 * QUEUETIME + 20 * XFACTOR)

QUEUETIME is the time the job has been queued in minutes.

XFACTOR = 1 + QUEUETIME / WALLTIMELIMIT

For a one hour job in the queue for one day:

service priority = 2 * (2 * 1440 + 20 * (1 + 1440 / 60))
service priority = 2 * (2880 + 500) = 6760

These factors are adjusted as needed to make jobs of all sizes start fairly.
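
The four subcomponents above can be evaluated with a few shell functions. This is only a sketch for trying out numbers: the function names are hypothetical, the weights are the ones quoted in this section and may since have been adjusted, and the fairshare cap (fscap) is left out because its value is not given here.

```shell
# Hypothetical helpers that evaluate the priority formulas above
# with the weights quoted in this section.

credential_priority() {
    echo $((100 * 10 * 100))
}

# $1: calculated fair share usage in percent (the target is 20 percent)
fairshare_priority() {
    awk -v used="$1" 'BEGIN { printf "%.0f\n", 100 * 10 * (20 - used) }'
}

# $1: total processors requested
resource_priority() {
    scaled=$((10 * $1))
    if [ "$scaled" -gt 26720 ]; then scaled=26720; fi   # rescap
    echo $((30 * scaled))
}

# $1: minutes in queue, $2: wall time limit in minutes
service_priority() {
    awk -v q="$1" -v w="$2" 'BEGIN { printf "%.0f\n", 2 * (2 * q + 20 * (1 + q / w)) }'
}

# $1: fair share usage percent, $2: processors,
# $3: minutes in queue, $4: wall time limit in minutes
job_priority() {
    c=$(credential_priority)
    f=$(fairshare_priority "$1")
    r=$(resource_priority "$2")
    s=$(service_priority "$3" "$4")
    echo $((c + f + r + s))
}
```

Combining the worked examples above (no usage in the past week, 32 processors, a one hour job queued for one day), job_priority 0 32 1440 60 prints 136360, which is 100000 + 20000 + 9600 + 6760.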




Using NCSA Archival Storage

TeraGrid users can transfer files to NCSA's archival storage for long-term retention, using globus-url-copy and/or uberftp to transfer files between Queen Bee and NCSA's archival storage.

Obtaining a Proxy

Before you can transfer files, you must create a temporary credential called a certificate proxy.

To obtain the proxy, you will use myproxy. The general form of the command is:

myproxy-logon -l <username> -s myproxy.teragrid.org

You should use your TeraGrid User Portal username and password when requesting your proxy. If your username was abc, the following command would be used to obtain the proxy:

myproxy-logon -l abc -s myproxy.teragrid.org

You can use the grid-proxy-info command to verify that you have a valid proxy.

# grid-proxy-info
subject  : /C=US/O=National Center for Supercomputing Applications/CN=Allen Carlilse
issuer   : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
identity : /C=US/O=National Center for Supercomputing Applications/CN=Allen Carlilse
type     : end entity credential
strength : 512 bits
path     : /tmp/x509up_u1217
timeleft : 11:47:45
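
If you want to check the remaining proxy lifetime from a script, the timeleft line in the output above can be converted to seconds. This is a minimal sketch: proxy_seconds_left is a hypothetical name, and it assumes the timeleft line always has the hours:minutes:seconds form shown above.

```shell
# proxy_seconds_left is a hypothetical helper. It reads
# grid-proxy-info output on standard input and prints the remaining
# lifetime in seconds, taken from the "timeleft" line.
proxy_seconds_left() {
    awk '/^timeleft/ {
        sub(/^timeleft[ \t]*:[ \t]*/, "")   # keep only the h:mm:ss part
        split($0, t, ":")
        print t[1] * 3600 + t[2] * 60 + t[3]
    }'
}
```

It would be used as grid-proxy-info | proxy_seconds_left; for the output shown above, the result is 42465 seconds.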

Using globus-url-copy

globus-url-copy can be used to copy a file between Queen Bee and NCSA's archival storage. See TeraGrid's documentation for globus-url-copy for details. The general form of the command is:

globus-url-copy [-dbg] file:///absolute/path/tofile gsiftp://lsumss.ncsa.uiuc.edu/~/destination-name

Here is an example:

globus-url-copy -dbg file:///home/abc/test4.txt gsiftp://lsumss.ncsa.uiuc.edu/~/test4.txt

Note that complete absolute paths must be used.
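
Since the local side must be given as an absolute file:// URL, it can help to build both URLs with small functions instead of typing them. This is a sketch only; to_file_url and to_archive_url are hypothetical names, and the archive host is the one used in the examples above.

```shell
# Hypothetical helpers for building the two URLs that
# globus-url-copy expects.

# Print a file:// URL for a local path, making it absolute first.
to_file_url() {
    case $1 in
        /*) echo "file://$1" ;;          # already absolute
        *)  echo "file://$PWD/$1" ;;     # relative to the current directory
    esac
}

# Print the archive URL for a name under the remote home directory.
to_archive_url() {
    echo "gsiftp://lsumss.ncsa.uiuc.edu/~/$1"
}
```

With these, the example above could be written as globus-url-copy "$(to_file_url test4.txt)" "$(to_archive_url test4.txt)" when run from the directory that contains test4.txt.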

Using uberftp

When transferring more than one file, you may find it preferable to use uberftp instead of globus-url-copy. See TeraGrid's documentation on uberftp for details.

uberftp operates very much like ftp. You connect and transfer files in the same manner. Here is an example session:

# uberftp lsumss.ncsa.uiuc.edu
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

UNIX Archive FTP server (DiskXtender Version 2.9.1) active. Checking DiskXtender.conf

220 UNIX Archive FTP server ready.
230 User abc logged in.
uberftp> dir
drwx------  2  abc    ac  DK  common     1024 Dec 12  2005  .trash
-rw-------  1  abc    ac  DK  common  3145728 Mar  6 13:56  test3.txt
-rw-------  1  abc    ac  DK  common  3145728 Mar  6 14:32  test4.txt
uberftp> get test3.txt
test3.txt:  3145728 bytes in 2.09 seconds. 1501.90 KB/sec
uberftp> quit
221 Goodbye.
kthxbye

Sample Archival Storage Session

A typical session consists of a proxy request followed by file transfers, for example the user abc transferring a file to archival storage and then retrieving it.
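
A session of this kind can be sketched from the commands introduced in this section. The function below is a dry run: it only prints the commands in order and does not execute them. The name print_archive_session and the ".retrieved" suffix for the downloaded copy are hypothetical choices made for illustration, and retrieving with globus-url-copy (rather than uberftp) is just one option.

```shell
# print_archive_session is a hypothetical dry-run helper: it prints
# the commands for a store-and-retrieve session without running them.
# Arguments: portal username, absolute path of a local file.
print_archive_session() {
    user=$1
    local_path=$2
    name=${local_path##*/}    # file name without the directory part
    remote="gsiftp://lsumss.ncsa.uiuc.edu/~/$name"
    echo "myproxy-logon -l $user -s myproxy.teragrid.org"
    echo "grid-proxy-info"
    # store the file in the archive
    echo "globus-url-copy file://$local_path $remote"
    # retrieve it again under a different local name
    echo "globus-url-copy $remote file://${local_path}.retrieved"
}
```

For example, print_archive_session abc /home/abc/test4.txt prints the four commands that user abc would run to obtain a proxy, verify it, store test4.txt, and retrieve it.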


