Comet is a cluster designed by Dell and SDSC that delivers a peak performance of 2.76 petaflops. It features Intel Haswell processors with AVX2, Mellanox FDR InfiniBand interconnects, and Aeon storage. It was originally operated within the NSF's eXtreme Science and Engineering Discovery Environment (XSEDE).
The standard compute nodes consist of Intel Xeon E5-2680v3 processors, 128 GB DDR4 DRAM (64 GB per socket), and 320 GB of local SSD scratch storage. The GPU nodes contain four NVIDIA GPUs each. The large memory nodes contain 1.5 TB of DRAM and four Haswell processors each. The network is 56 Gbps FDR InfiniBand with full bisection bandwidth at the rack level and 4:1 oversubscription across racks. Comet has 7.6 petabytes of 200 GB/s performance storage and 6 petabytes of 100 GB/s durable storage, plus dedicated gateway/portal hosting nodes and a Virtual Image repository. External connectivity to Internet2 and ESnet is 100 Gbps.
Comet was retired from XSEDE service on July 15, 2021. Users retain access to the Comet filesystems via the data mover nodes, either through Globus (www.globus.org) using the endpoint xsede#comet or interactively at oasis-dm-interactive.sdsc.edu.
Comet was designed and is operated on the principle that the majority of computational research is performed at modest scale. Comet also supports science gateways, which are web-based applications that simplify access to HPC resources on behalf of a wide variety of research communities and science domains, typically with hundreds to thousands of users. Comet was originally an NSF-funded system operated by the San Diego Supercomputer Center at UC San Diego, and was available through the XSEDE program.
System Component | Configuration |
---|---|
Intel Haswell Standard Compute Nodes | |
Node count | 1,944 |
Clock speed | 2.5 GHz |
Cores/node | 24 |
DRAM/node | 128 GB |
SSD memory/node | 320 GB |
NVIDIA Kepler K80 GPU Nodes | |
Node count | 36 |
CPU cores:GPUs/node | 24:4 |
CPU:GPU DRAM/node | 128 GB:48 GB |
NVIDIA Pascal P100 GPU Nodes | |
Node count | 36 |
CPU cores:GPUs/node | 28:4 |
CPU:GPU DRAM/node | 128 GB:64 GB |
Large-memory Haswell Nodes | |
Node count | 4 |
Clock speed | 2.2 GHz |
Cores/node | 64 |
DRAM/node | 1.5 TB |
SSD memory/node | 400 GB |
Storage Systems | |
File systems | Lustre, NFS |
Performance Storage | 7.6 PB |
Home file system | 280 TB |
Trial Accounts give potential users rapid access to Comet for the purpose of evaluating Comet for their research. This can be a valuable step in assessing the usefulness of the system by allowing them to compile, run, and do initial benchmarking of their application prior to submitting a larger Startup or Research allocation. Trial Accounts are for 1000 CPU core-hours, and/or 100 GPU hours. Requests are fulfilled within 1 working day.
System Component | Configuration |
---|---|
1944 Standard Compute Nodes | |
Processor Type | Intel Xeon E5-2680v3 |
Sockets | 2 |
Cores/socket | 12 |
Clock speed | 2.5 GHz |
Flop speed | 1.866 PFlop/s |
Memory capacity | 128 GB DDR4 DRAM |
Flash memory | 320 GB SSD |
Memory bandwidth | 120 GB/s |
STREAM Triad bandwidth | 104 GB/s |
36 K80 GPU Nodes | |
GPUs | 4 NVIDIA K80 |
Cores/socket | 12 |
Sockets | 2 |
Clock speed | 2.5 GHz |
Flop speed | 0.208 PFlop/s |
Memory capacity | 128 GB DDR4 DRAM |
Flash memory | 400 GB SSD |
Memory bandwidth | 120 GB/s |
STREAM Triad bandwidth | 104 GB/s |
36 P100 GPU Nodes | |
GPUs | 4 NVIDIA P100 |
Cores/socket | 14 |
Sockets | 2 |
Clock speed | 2.4 GHz |
Flop speed | 0.676 PFlop/s |
Memory capacity | 128 GB DDR4 DRAM |
Flash memory | 400 GB SSD |
Memory bandwidth | 150 GB/s |
STREAM Triad bandwidth | 116 GB/s |
4 Large Memory Nodes | |
Sockets | 4 |
Cores/socket | 16 |
Clock speed | 2.2 GHz |
Flop speed | 0.009 PFlop/s |
Memory capacity | 1.5 TB |
Flash memory | 400 GB |
STREAM Triad bandwidth | 142 GB/sec |
Full System | |
Total compute nodes | 1984 |
Total compute cores | 47,776 |
Peak performance | 2.76 PFlop/s |
Total memory | 247 TB |
Total memory bandwidth | 228 TB/s |
Total flash memory | 634 TB |
FDR InfiniBand Interconnect | |
Topology | Hybrid Fat-Tree |
Link bandwidth | 56 Gb/s (bidirectional) |
Peak bisection bandwidth | 3.4 TB/s |
MPI latency | 1.03-1.97 µs |
DISK I/O Subsystem | |
File Systems | NFS, Lustre |
Storage capacity (durable) | 6 PB |
Storage capacity (performance) | 7.6 PB |
I/O bandwidth (performance disk) | 200 GB/s |
Comet supported the XSEDE core software stack, which included remote login, remote computation, data movement, science workflow support, and science gateway support toolkits.
Software Function | Description |
---|---|
Cluster Management | Rocks |
Operating System | CentOS |
File Systems | NFS, Lustre |
Scheduler and Resource Manager | SLURM |
XSEDE Software | CTSS |
User Environment | Modules |
Compilers | Intel and PGI Fortran, C, C++ |
Message Passing | Intel MPI, MVAPICH, Open MPI |
Debugger | DDT |
Performance | IPM, mpiP, PAPI, TAU |
Domain | Software |
---|---|
Biochemistry | APBS |
Bioinformatics | BamTools, BEAGLE, BEAST, BEAST 2, bedtools, Bismark, BLAST, BLAT, Bowtie, Bowtie 2, BWA, Cufflinks, DPPDiv, Edena, FastQC, FastTree, FASTX-Toolkit, FSA, GARLI, GATK, GMAP-GSNAP, IDBA-UD, MAFFT, MrBayes, PhyloBayes, Picard, PLINK, QIIME, RAxML, SAMtools, SOAPdenovo2, SOAPsnp, SPAdes, TopHat, Trimmomatic, Trinity, Velvet |
Compilers | GNU, Intel, Mono, PGI |
File format libraries | HDF4, HDF5, NetCDF |
Interpreted languages | MATLAB, Octave, R |
Large-scale data-analysis frameworks | Hadoop 1, Hadoop 2 (with YARN), Spark, RDMA-Spark |
Molecular dynamics | Amber, Gromacs, LAMMPS, NAMD |
MPI libraries | MPICH2, MVAPICH2, Open MPI |
Numerical libraries | ATLAS, FFTW, GSL, LAPACK, MKL, ParMETIS, PETSc, ScaLAPACK, SPRNG, Sundials, SuperLU, Trilinos |
Predictive analytics | KNIME, Mahout, Weka |
Profiling and debugging | DDT, IDB, IPM, mpiP, PAPI, TAU, Valgrind |
Quantum chemistry | CPMD, CP2K, GAMESS, Gaussian, MOPAC, NWChem, Q-Chem, VASP |
Structural mechanics | Abaqus |
Visualization | IDL, VisIt |
As an XSEDE computing resource, Comet was accessible to XSEDE users who were given time on the system.
Interested parties may contact SDSC User Support for help with a Comet proposal (see sidebar for contact information).
Logging in to Comet
Comet supports Single Sign On through the XSEDE User Portal and from the command line using an XSEDE-wide password. To log in to Comet from the command line, use the hostname:
comet.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in to Comet:
ssh <your_username>@comet.sdsc.edu
ssh -l <your_username> comet.sdsc.edu
Do not use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.
The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.
For example, if the Intel module and mvapich2_ib module are loaded and the user compiles with mpif90, the generated code is compiled with the Intel Fortran 90 compiler and linked with the mvapich2_ib MPI libraries.
Several modules that determine the default Comet environment are loaded at login time. These include the MVAPICH implementation of the MPI library and the Intel compilers. We strongly suggest that you use this combination whenever possible to get the best performance.
Useful Modules Commands
Here are some common module commands and their descriptions:
Command | Description |
---|---|
module list | List the modules that are currently loaded |
module avail | List the modules that are available |
module display <module_name> | Show the environment variables used by <module_name> and how they are affected |
module unload <module_name> | Remove <module_name> from the environment |
module load <module_name> | Load <module_name> into the environment |
module swap <module_one> <module_two> | Replace <module_one> with <module_two> in the environment |
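For example, a typical sequence for inspecting and adjusting the environment might look like the following (module names are illustrative; run module avail to see exactly what is installed):

module list                  # show the currently loaded modules
module avail intel           # list the available versions of the Intel compilers
module swap intel pgi        # replace the Intel compilers with PGI
module load gsl              # load an additional library (module name assumed; check module avail)
module unload gsl            # remove it again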
Loading and unloading modules
You must remove some modules before loading others.
Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich2_ib are both loaded, running the command module unload intel will automatically unload mvapich2_ib. Subsequently issuing the module load intel command does not automatically reload mvapich2_ib.
If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (.bashrc for bash users, .cshrc for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.
Module: command not found
The error message module: command not found is sometimes encountered when switching from one shell to another or attempting to run the module command from within a shell script or batch job. The module command may not be inherited as expected because it is defined as a function for your login shell. If you encounter this error, execute the following from the command line (interactive shells) or add it to your shell script (including Slurm batch scripts):
source /etc/profile.d/modules.sh
The show_accounts command (or show_accounts --gpu for GPU allocations) lists the accounts that you are authorized to use on the specified resource, together with a summary of the used and remaining time.
[user@comet-login1 ~]$ show_accounts (--gpu)
ID name    project     used            available         used_by_proj
---------------------------------------------------------------
<user>     <project>   <SUs by user>   <SUs available>   <SUs by proj>
To charge your job to one of these projects, replace << project >> with one from the list and put this Slurm directive in your job script:
#SBATCH -A << project >>
Many users will have access to multiple accounts (e.g. an allocation for a research project and a separate allocation for classroom or educational use). On some systems a default account is assumed, but please get in the habit of explicitly setting an account for all batch jobs. Awards are normally made for a specific purpose and should not be used for other projects.
Project PIs and co-PIs can add or remove users from an account. To do this, log in to your XSEDE portal account and go to the Add User page.
The charge unit for all SDSC machines, including Comet, is the Service Unit (SU). This corresponds to the use of one compute core for one hour. Keep in mind that your charges are based on the resources that are tied up by your job and don’t necessarily reflect how the resources are used. Charges are based on either the number of cores or the fraction of the memory requested, whichever is larger. The minimum charge for any job longer than 10 seconds is 1 SU.
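For example (an illustrative calculation of this rule): a job in the shared partition that requests 4 cores and 64 GB of memory, half of a standard node's 128 GB, is charged as if it used half of the node's 24 cores; running for 2 hours, it would be charged max(4, 12) x 2 = 24 SUs.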
Comet provides the Intel, Portland Group (PGI), and GNU compilers along with multiple MPI implementations (MVAPICH2, MPICH2, OpenMPI). Most applications will achieve the best performance on Comet using the Intel compilers and MVAPICH2 and the majority of libraries installed on Comet have been built using this combination. Although other compilers and MPI implementations are available, we suggest using these only for compatibility purposes.
All three compilers now support the Advanced Vector Extensions 2 (AVX2). Using AVX2, up to eight floating point operations can be executed per cycle per core, potentially doubling the performance relative to non-AVX2 processors running at the same clock speed. Note that AVX2 support is not enabled by default and compiler flags must be set as described below.
The Intel compilers and the MVAPICH2 MPI implementation will be loaded by default. If you have modified your environment, you can reload by executing the following commands at the Linux prompt or placing in your startup file (~/.cshrc or ~/.bashrc)
module purge
module load intel mvapich2_ib
For AVX2 support, compile with the -xHOST option. Note that -xHOST alone does not enable aggressive optimization, so compilation with -O3 is also suggested. The -fast flag invokes -xHOST, but should be avoided since it also turns on interprocedural optimization (-ipo), which may cause problems in some instances.
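For example, a serial code and an MPI code could be built with AVX2 enabled as follows (source and executable names are placeholders):

ifort -O3 -xHOST -o mycode.exe mycode.f90
mpicc -O3 -xHOST -o mycode_mpi.exe mycode_mpi.c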
Intel MKL libraries are available as part of the "intel" modules on Comet. Once this module is loaded, the environment variable MKL_ROOT points to the location of the mkl libraries. The MKL link advisor can be used to ascertain the link line (change the MKL_ROOT aspect appropriately).
For example, to compile a C program statically linking the 64-bit ScaLAPACK libraries on Comet:
mpicc -o pdpttr.exe pdpttr.c -I$MKL_ROOT/include \
    ${MKL_ROOT}/lib/intel64/libmkl_scalapack_lp64.a \
    -Wl,--start-group ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.a \
    ${MKL_ROOT}/lib/intel64/libmkl_core.a ${MKL_ROOT}/lib/intel64/libmkl_sequential.a \
    -Wl,--end-group ${MKL_ROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm
For more information on the Intel compilers: [ifort | icc | icpc] -help
| Serial | MPI | OpenMP | MPI+OpenMP |
---|---|---|---|---|
Fortran | ifort | mpif90 | ifort -openmp | mpif90 -openmp |
C | icc | mpicc | icc -openmp | mpicc -openmp |
C++ | icpc | mpicxx | icpc -openmp | mpicxx -openmp |
Note for C/C++ users: the compiler warning "feupdateenv is not implemented and will always fail" can safely be ignored by most users. By default, the Intel C/C++ compilers only link against Intel's optimized version of the C standard math library (libimf), and the warning stems from the fact that several of the newer C99 library functions related to floating-point rounding and exception handling have not been implemented.
The PGI compilers can be loaded by executing the following commands at the Linux prompt or placing in your startup file (~/.cshrc or ~/.bashrc)
module purge
module load pgi mvapich2_ib
For AVX support, compile with -fast.
For more information on the PGI compilers: man [pgf90 | pgcc | pgCC]
| Serial | MPI | OpenMP | MPI+OpenMP |
---|---|---|---|---|
Fortran | pgf90 | mpif90 | pgf90 -mp | mpif90 -mp |
C | pgcc | mpicc | pgcc -mp | mpicc -mp |
C++ | pgCC | mpicxx | pgCC -mp | mpicxx -mp |
The GNU compilers can be loaded by executing the following commands at the Linux prompt or placing in your startup files (~/.cshrc or ~/.bashrc)
module purge
module load gnu openmpi_ib
For AVX support, compile with -mavx. Note that AVX support is only available in version 4.7 or later, so it is necessary to explicitly load the gnu/4.9.2 module until such time that it becomes the default.
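For example, to build an MPI code with AVX enabled using the GNU toolchain (source and executable names are placeholders):

module purge
module load gnu/4.9.2 openmpi_ib
mpicc -O3 -mavx -o mycode_mpi.exe mycode_mpi.c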
For more information on the GNU compilers: man [gfortran | gcc | g++]
| Serial | MPI | OpenMP | MPI+OpenMP |
---|---|---|---|---|
Fortran | gfortran | mpif90 | gfortran -fopenmp | mpif90 -fopenmp |
C | gcc | mpicc | gcc -fopenmp | mpicc -fopenmp |
C++ | g++ | mpicxx | g++ -fopenmp | mpicxx -fopenmp |
The GPU nodes on Comet have MVAPICH2-GDR available. MVAPICH2-GDR is based on the standard MVAPICH2 software stack and incorporates designs that take advantage of GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with Mellanox InfiniBand interconnects. The mvapich2-gdr modules are also available on the login nodes for compilation purposes. An example compile and run script is provided in /share/apps/examples/mvapich2gdr.
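For example, a CUDA-aware MPI code might be compiled on a login node roughly as follows; the exact module names and build flags should be taken from module avail and from the example script noted above.

module purge
module load cuda mvapich2-gdr              # module names assumed; check module avail for exact versions
mpicc -o gpu_mpi_code.exe gpu_mpi_code.c   # mpicc from the MVAPICH2-GDR stack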
Comet uses the Simple Linux Utility for Resource Management (SLURM) batch environment. When you run in the batch mode, you submit jobs to be run on the compute nodes using the sbatch command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes.
Comet places limits on the number of jobs queued and running on a per group (allocation) and partition basis. Please note that submitting a large number of jobs (especially very short ones) can impact the overall scheduler response for all users. If you are anticipating submitting a lot of jobs, please contact the SDSC consulting staff before you submit them. We can work to check if there are bundling options that make your workflow more efficient and reduce the impact on the scheduler.
The limits are provided for each partition in the table below.
Partition Name | Max Walltime | Max Nodes/Job | Max Running Jobs | Max Running + Queued Jobs | Comments |
---|---|---|---|---|---|
compute | 48 hrs | 72 | 144 | 360 | Used for exclusive access to regular compute nodes |
gpu | 48 hrs | 8 | 8 | 20 | Used for exclusive access to the GPU nodes |
gpu-shared | 48 hrs | 1 | 16 | 25 | Single-node jobs using fewer than 4 GPUs |
shared | 48 hrs | 1 | 1728 | 4320 | Single-node jobs using fewer than 24 cores |
large-shared | 48 hrs | 1 | - | 8 | Single-node jobs using large memory up to 1.45 TB |
debug | 30 mins | 2 | 8 | 8 | Used for access to debug nodes |
You can request an interactive session using the srun command. The following example requests one node with all 24 cores in the debug partition for 30 minutes:
srun --partition=debug --pty --nodes=1 --ntasks-per-node=24 -t 00:30:00 --wait=0 --export=ALL /bin/bash
Jobs can be submitted to the batch partitions using the sbatch command as follows:
[user@comet-ln1]$ sbatch jobscriptfile
where jobscriptfile is the name of a UNIX format file containing special statements (corresponding to sbatch options), resource specifications and shell commands. Several example SLURM scripts are given below:
#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 24 cores per node for a total of 48 cores.
#ibrun in verbose mode will give binding detail
ibrun -v ../hello_mpi
#!/bin/bash
#SBATCH --job-name="hello_openmp"
#SBATCH --output="hello_openmp.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Set the number of OpenMP threads
export OMP_NUM_THREADS=24

#Run the OpenMP job
./hello_openmp
#!/bin/bash
#SBATCH --job-name="hellohybrid"
#SBATCH --output="hellohybrid.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 24 cores per node for a total of 48 cores.
#We use 8 MPI tasks and 6 OpenMP threads per MPI task
export OMP_NUM_THREADS=6
ibrun --npernode 4 ./hello_hybrid
#!/bin/bash
#SBATCH --job-name="hellompirunrsh"
#SBATCH --output="hellompirunrsh.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Generate a hostfile from the slurm node list
export SLURM_NODEFILE=`generate_pbs_nodefile`

#Run the job using mpirun_rsh
mpirun_rsh -hostfile $SLURM_NODEFILE -np 48 ../hello_mpi
#!/bin/bash
#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=40G
#SBATCH -t 01:00:00
#SBATCH -J HPL.8
#SBATCH -o HPL.8.%j.%N.out
#SBATCH -e HPL.8.%j.%N.err
#SBATCH --export=ALL

export MV2_SHOW_CPU_BINDING=1
ibrun -np 8 ./xhpl.exe
The above script will run using 8 cores and 40 GB of memory. Please note that the performance in the shared partition may vary depending on how sensitive your application is to memory locality and the cores you are assigned by the scheduler. It is possible the 8 cores will span two sockets for example.
SLURM will requeue jobs if there is a node failure. However, in some cases this might be detrimental if files get overwritten. If users wish to avoid automatic requeue, the following line should be added to their script:
#SBATCH --no-requeue
#!/bin/bash
#SBATCH --job-name="abaqus"
#SBATCH --output="abaqus.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -L abaqus:24
#SBATCH -t 01:00:00

module load abaqus/6.14-1
export EXE=`which abq6141`
$EXE job=s4b input=s4b scratch=/scratch/$USER/$SLURM_JOBID cpus=24 mp_mode=threads memory=120000mb interactive
SDSC User Services staff have developed sample run scripts for common applications. They are available in the /share/apps/examples directory on Comet.
There are several scenarios (e.g. splitting long running jobs, workflows) where users may require jobs with dependencies on successful completions of other jobs. In such cases, SLURM's --dependency option can be used. The syntax is as follows:
[user@comet-ln1 ~]$ sbatch --dependency=afterok:jobid jobscriptfile
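For example, a two-step workflow can be chained so that the second job starts only if the first completes successfully (script names are placeholders; --parsable is a standard sbatch option that prints just the job ID):

jobid=$(sbatch --parsable step1.sb)
sbatch --dependency=afterok:$jobid step2.sb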
Users can monitor jobs using the squeue command.
[user@comet-ln1 ~]$ squeue -u user1
JOBID   PARTITION  NAME      USER   ST  TIME     NODES  NODELIST(REASON)
256556  compute    raxml_na  user1  R   2:03:57  4      comet-14-[11-14]
256555  compute    raxml_na  user1  R   2:14:44  4      comet-02-[06-09]
In this example, the output lists two jobs that are running in the "compute" partition. The jobID, partition name, job names, user names, status, time, number of nodes, and the node list are provided for each job. Some common squeue options include:
Option | Result |
---|---|
-i <interval> | Repeatedly report at intervals (in seconds) |
-j <job_list> | Displays information for the specified job(s) |
-p <part_list> | Displays information for specified partitions (queues) |
-t <state_list> | Shows jobs in the specified state(s) |
Users can cancel their own jobs using the scancel command as follows:
[user@comet-ln1 ~]$ scancel <jobid>
The options and arguments for ibrun are as follows:
Usage: ibrun [options] <executable> [executable args]

Options:
  -n, -np <n>          launch n MPI ranks (default: use all cores provided by the resource manager)
  -o, --offset <n>     assign MPI ranks starting at the nth slot provided by the resource manager (default: 0)
  -no <n>              assign MPI ranks starting at the nth unique node provided by the resource manager (default: 0)
  --npernode <n>       only launch n MPI ranks per node (default: ppn from the resource manager)
  --tpr|--tpp|--threads-per-rank|--threads-per-process <n>
                       how many threads each MPI rank (often referred to as 'MPI process') will spawn
                       (default: $OMP_NUM_THREADS if defined, <ppn>/<npernode> if ppn is divisible by npernode, or 1 otherwise)
  --switches '<implementation-specific>'
                       pass additional command-line switches to the underlying implementation's MPI launcher;
                       these WILL be overridden by any switches ibrun subsequently enables (default: none)
  -bp|--binding-policy <scatter|compact|none>
                       define the CPU affinity binding policy for each MPI rank: 'scatter' distributes ranks
                       across each binding level, 'compact' fills up a binding level before allocating another,
                       and 'none' disables all affinity settings (default: optimized for job geometry)
  -bl|--binding-level <core|socket|numanode|none>
                       define the level of granularity for CPU affinity for each MPI rank: 'core' binds each
                       rank to a single core; 'socket' binds each rank to all cores on a single CPU socket
                       (good for multithreaded ranks); 'numanode' binds each rank to the subset of cores
                       belonging to a numanode; 'none' disables all affinity settings (default: optimized
                       for job geometry)
  --dryrun             do everything except actually launch the application
  -v|--verbose         print diagnostic messages
  -?                   print this message
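For example (an illustrative invocation using the options above), a hybrid run that places one multithreaded MPI rank per socket could be launched as follows (executable name is a placeholder):

export OMP_NUM_THREADS=12
ibrun --npernode 2 --binding-level socket ./hello_hybrid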
All of Comet's NFS and Lustre filesystems are accessible via the Globus endpoint xsede#comet. The servers also mount Gordon's filesystems, so the mount points differ between the two systems. The following table shows the mount points on the data mover nodes (the backend for xsede#comet and xsede#gordon).
Machine | Location on machine | Location on Globus/Data Movers |
---|---|---|
Comet | /home/$USER | /home/$USER |
Comet | /oasis/projects/nsf | /oasis/projects/nsf |
Comet | /oasis/scratch/comet | /oasis/scratch-comet |
Gordon | /oasis/scratch | /oasis/scratch |
The compute nodes on Comet have access to fast flash storage. There is 250 GB of SSD space available for use on each compute node. The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution, under the following directory local to each compute node:
/scratch/$USER/$SLURM_JOB_ID
Partition | Space Available |
---|---|
compute,shared | 212 GB |
gpu, gpu-shared | 286 GB |
large-shared | 286 GB |
A limited number of nodes in the "compute" partition have larger SSDs with a total of 1464 GB available in local scratch. They can be accessed by adding the following to the Slurm script:
#SBATCH --constraint="large_scratch"
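The sketch below illustrates staging data through the node-local scratch space described above (input/output names and the executable are placeholders):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -t 01:00:00

# Stage input onto the node-local SSD
cp $SLURM_SUBMIT_DIR/input.dat /scratch/$USER/$SLURM_JOB_ID/

# Run from local scratch to take advantage of the fast random I/O
cd /scratch/$USER/$SLURM_JOB_ID
ibrun $SLURM_SUBMIT_DIR/my_code.exe input.dat

# Copy results back before the job ends; the local scratch space is only accessible during the job
cp output.dat $SLURM_SUBMIT_DIR/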
In addition to the local scratch storage, users have access to global parallel filesystems on Comet: 7.6 petabytes of 200 GB/s performance storage and 6 petabytes of 100 GB/s durable storage. SDSC limits the number of files that can be stored in the /oasis/scratch filesystem to 2 million files per user. If your workflow requires extensive small-file I/O, please contact user support for assistance in order to avoid system issues associated with load on the metadata server.
Users can now access /oasis/projects from Comet. The two Lustre filesystems available on Comet are:
/oasis/scratch/comet/$USER/temp_project
/oasis/projects/nsf
After logging in, users are placed in their home directory, /home, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Jobs should never be run from the home file system, as it is not set up for high-performance throughput. Users should keep usage in $HOME under 100 GB. Backups are retained on a rolling 8-week basis. In case of file corruption or data loss, please contact us to retrieve the requested files.
Virtual clusters (VCs) are not meant to replace the standard HPC batch queuing system, which is well suited for most scientific and technical workloads. In addition, a VC should not be thought of simply as a VM (virtual machine); future XSEDE resources, such as Indiana University's Jetstream, will address that need. VCs are primarily intended for users who require both fine-grained control over their software stack and access to multiple nodes. With regard to the software stack, this may include access to operating systems different from the default version of CentOS available on Comet, or to low-level libraries that are closely integrated with the Linux distribution. Science gateways serving large research communities that require a flexible software environment are encouraged to consider applying for a VC, as are current users of commercial clouds who want to make the transition for performance or cost reasons.
Maintaining and configuring a virtual cluster requires a certain level of technical expertise. We expect that each project will have at least one person possessing strong systems administration experience with the relevant OS since the owner of the VC will be provided with "bare metal" root level access. SDSC staff will be available primarily to address performance issues that may be related to problems with the Comet hardware and not to help users build their system images.
All VC requests must include a brief justification that addresses the following:
As of July 1, 2017: Comet provides both NVIDIA K80 and P100 GPU-based resources. These GPU nodes are now allocated as a separate resource and can no longer be accessed using your Comet allocation. Current users must request a transfer of time from Comet CPU to Comet GPU through XRAS. The conversion rate is 14 Comet Service Units (SUs) to 1 K80 GPU-hour. The P100 GPUs are substantially faster than the K80, achieving more than twice the performance for some applications. Accordingly, users will incur a 1.5x premium when running on the P100 vs the K80.
The GPU nodes can be accessed via either the "gpu" or the "gpu-shared" partitions.
#SBATCH -p gpu
or
#SBATCH -p gpu-shared
In addition to the partition name (required), the GPU type (optional) and the number of GPUs must be specified, since individual GPUs are scheduled as a resource:
#SBATCH --gres=gpu[:type]:n
GPUs will be allocated on a first-available, first-scheduled basis unless a type is specified with the [type] option, where type can be k80 or p100 (type is case sensitive).
#SBATCH --gres=gpu:4         #first available GPU node
#SBATCH --gres=gpu:k80:4     #only K80 nodes
#SBATCH --gres=gpu:p100:4    #only P100 nodes
For example, on the "gpu" partition, the following lines are needed to utilize all 4 p100 GPUs:
#SBATCH -p gpu #SBATCH --gres=gpu:p100:4
Users should always set --ntasks-per-node equal to 6 x [number of GPUs] requested on all k80 "gpu-shared" jobs, and 7 x [number of GPUs] requested on all p100 "gpu-shared" jobs, to ensure proper resource distribution by the scheduler. Additionally, when requesting the P100 nodes it is recommended to ask for 25GB per GPU (unless more is needed for the code). The following requests 2 p100 GPUs on a "gpu-shared" partition:
#SBATCH -p gpu-shared
#SBATCH --ntasks-per-node=14
#SBATCH --gres=gpu:p100:2
#SBATCH --mem=50GB

(for a single GPU this would be --gres=gpu:p100:1, --mem=25G)
Here is an example AMBER script using the gpu-shared queue, aimed at a K80 node.

#!/bin/bash
#SBATCH --job-name="ambergpu-shared"
#SBATCH --output="ambergpu-shared.%j.%N.out"
#SBATCH --partition=gpu-shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --no-requeue
#SBATCH --gres=gpu:k80:1
#SBATCH -t 01:00:00

module purge
module load amber/18
module load cuda/8.0

pmemd.cuda -O -i mdin.GPU -o mdout.GPU.$SLURM_JOBID -x mdcrd.$SLURM_JOBID -inf mdinfo.$SLURM_JOBID -l mdlog.$SLURM_JOBID -p prmtop -c inpcrd
Please see /share/apps/examples/gpu for more examples.
GPU modes can be controlled for jobs in the "gpu" partition. By default, the GPUs are in non-exclusive mode and persistence mode is on. If a particular "gpu" partition job needs exclusive access, the following option should be set in your batch script:
#SBATCH --constraint=exclusive
To turn persistence off add the following line to your batch script:
#SBATCH --constraint=persistenceoff
Jobs run in the gpu-shared partition are charged differently from other shared partitions on Comet to reflect the fraction of the resource used, based on the number of GPUs requested and the relative performance of the different GPU types. We charge a 1.5x premium on P100 GPUs because they are generally substantially faster than the K80s, achieving more than twice the performance for some applications. One GPU is equivalent to 1/4 of a node: 6 cores on K80 nodes and 7 cores on P100 nodes.
The charging equation will be:
GPU SUs = [(Number of K80 GPUs) + 1.5 x (Number of P100 GPUs)] x (wallclock time)
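For example (an illustrative calculation): a job using 2 K80 GPUs for 10 hours is charged (2 + 1.5 x 0) x 10 = 20 GPU SUs, while a job using 2 P100 GPUs for 10 hours is charged (0 + 1.5 x 2) x 10 = 30 GPU SUs.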
The large memory nodes can be accessed via the "large-shared" partition. Charges are based on either the number of cores or the fraction of the memory requested, whichever is larger. By default the system will only allocate 5 GB of memory per core. If additional memory is required, users should explicitly use the --mem directive.
For example, on the "large-shared" partition, the following job requesting 16 cores and 455 GB of memory (about 31.3% of the 1455 GB available on one node) for 1 hour will be charged 20 SUs:
455/1455 (memory fraction) * 64 (cores per node) * 1 (hour) ≈ 20
#SBATCH --partition=large-shared
#SBATCH --ntasks=16
#SBATCH --mem=455G
While there is not a separate "large" partition, a job can still explicitly request all of the resources on a large memory node. Please note that there is no premium for using Comet's large memory nodes, but the processors are slightly slower (2.2 GHz compared to 2.5 GHz on the standard nodes). Users are advised to request the large memory nodes only if they need the extra memory.
Software Package | Compiler Suites | Parallel Interface |
---|---|---|
AMBER | intel | mvapich2_ib |
APBS | intel | mvapich2_ib |
CP2K | intel | mvapich2_ib |
DDT | | |
FFTW | intel,pgi,gnu | mvapich2_ib |
GAMESS: General Atomic Molecular Electronic Structure System | intel | native: sockets, ip over ib |
Gaussian | pgi | Single node, shared memory |
GROMACS | intel | mvapich2_ib |
HDF | intel,pgi,gnu | mvapich2_ib for hdf5 |
LAMMPS: Large-scale Atomic/Molecular Massively Parallel Simulator | intel | mvapich2_ib |
NAMD | intel | mvapich2_ib |
NCO | intel,pgi,gnu | none |
netCDF | intel,pgi,gnu | none |
Python (SciPy modules) | gnu: ipython,nose,pytz | None |
RDMA Hadoop | None | None |
RDMA Spark | None | None |
Singularity | None | None |
VisIt | intel | openmpi |
AMBER is a package of molecular simulation programs including SANDER (Simulated Annealing with NMR-Derived Energy Restraints) and a modified version, PMEMD (Particle Mesh Ewald Molecular Dynamics), that is faster and more scalable.
APBS evaluates the electrostatic properties of solvated biomolecular systems.
CP2K is a program to perform simulations of molecular systems. It provides a general framework for different methods such as Density Functional Theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials.
DDT is a debugging tool for scalar, multithreaded and parallel applications.
FFTW is a library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data.
GAMESS is a program for ab initio quantum chemistry. GAMESS can compute SCF wavefunctions, and correlation corrections to these wavefunctions as well as Density Functional Theory.
GAMESS documentation, examples, etc.
Gaussian 09 provides state-of-the-art capabilities for electronic structure modeling.
GROMACS is a versatile molecular dynamics package, primarily designed for biochemical molecules like proteins, lipids and nucleic acids.
HDF is a collection of utilities, applications and libraries for manipulating, viewing, and analyzing data in HDF format.
LAMMPS is a classical molecular dynamics simulation code.
NAMD is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems.
NCO operates on netCDF input files (e.g. derive new data, average, print, hyperslab, manipulate metadata) and outputs results to screen or text, binary, or netCDF file formats.
NCO documentation on SourceForge
netCDF is a set of libraries that support the creation, access, and sharing of array-oriented scientific data using machine-independent data formats.
netCDF documentation on UCAR's Unidata Program Center
The Python modules under /opt/scipy consist of: nose, numpy, scipy, matplotlib, pyfits, ipython, and pytz.
Video tutorial from a TACC workshop on Python
Python videos from Khan Academy
The HPC University Python resources
RDMA-based Apache Hadoop 2.x is a high performance derivative of Apache Hadoop developed as part of the High-Performance Big Data (HiBD) project at the Network-Based Computing Lab of The Ohio State University. The installed release on Comet (v0.9.7) is based on Apache Hadoop 2.6.0. The design uses Comet's InfiniBand network at the native level (verbs) for HDFS, MapReduce, and RPC components, and is optimized for use with Lustre.
The design features a hybrid RDMA-based HDFS with in-memory and heterogeneous storage including RAM disk, SSD, HDD, and Lustre. In addition, optimized MapReduce over Lustre (with RDMA-based shuffle) is also available. The implementation is fully integrated with SLURM (and PBS) on Comet, with scripts available to dynamically deploy Hadoop clusters within the SLURM scheduling framework.
Examples for various modes of usage are available in /share/apps/examples/HADOOP/RDMA. Details on the RDMA Hadoop and HiBD project are available at http://hibd.cse.ohio-state.edu.
The RDMA-based Apache Spark package is a high performance derivative of Apache Spark developed as part of the High-Performance Big Data (HiBD) project at the Network-Based Computing Lab of The Ohio State University. The installed release on Comet (v0.9.1) is based on Apache Spark 1.5.1. The design uses Comet's InfiniBand network at the native level (verbs) and features RDMA-based data shuffle, a SEDA-based shuffle architecture, efficient connection management, non-blocking and chunk-based data transfer, and off-JVM-heap buffer management.
The RDMA-Spark cluster setup and usage is managed via the myHadoop framework. An example script is provided in /share/apps/examples/spark/sparkgraphx_rdma. Details on the RDMA Spark and HiBD project are available at http://hibd.cse.ohio-state.edu.
Singularity: User Defined Images
Singularity is a platform to support users who have different environmental needs than what is provided by the resource or service provider. Singularity leverages a workflow and security model that makes it a very reasonable candidate for shared or multi-tenant HPC resources like Comet without requiring any modifications to the scheduler or system architecture. Additionally, all typical HPC functions can be leveraged within a Singularity container (e.g. InfiniBand, high performance file systems, GPUs, etc.). While Singularity supports MPI running in a hybrid model where you invoke MPI outside the container and it runs the MPI programs inside the container, we have not yet tested this.
On Comet, several applications have been enabled using Singularity. The Singularity images are located in /share/apps/gpu/singularity and /share/apps/compute/singularity. Applications enabled via Singularity include ParaView, TensorFlow, Torch, and Fenics. For the GPU nodes we also provide a baseline Ubuntu (16.04) image with the relevant CUDA libraries that match our system versions. This image can be used as a template by users who want to run applications in an Ubuntu environment on the GPU nodes.
A getting started guide is available on the system in /share/apps/examples/SINGULARITY/Singularity_getting_started. Details on the Singularity project are available at http://singularity.lbl.gov/#home.
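For example, an application inside one of these images might be run in a batch job roughly as follows (the module name and image path are assumptions; see the getting started guide above for the exact images and invocation):

module load singularity                  # module name assumed; check module avail
singularity exec /share/apps/compute/singularity/<image>.img python my_script.py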
The VisIt visualization package supports remote submission of parallel jobs and includes a Python interface that provides bindings to all of its plots and operators so they may be controlled by scripting.