After March 2017, the SDSC Gordon-Simons cluster is no longer an NSF-allocated resource.
Gordon is a dedicated data-intensive supercomputer designed by Appro and SDSC, consisting of 1024 compute nodes and 64 I/O nodes. Each compute node contains two 8-core 2.6 GHz Intel EM64T Xeon E5 (Sandy Bridge) processors and 64 GB of DDR3-1333 memory. The I/O nodes each contain two 6-core 2.67 GHz Intel X5650 (Westmere) processors, 48 GB of DDR3-1333 memory, and sixteen 300 GB Intel 710 solid-state drives. The network topology is a 4x4x4 3D torus with adjacent switches connected by three 4x QDR InfiniBand links (120 Gb/s). Compute nodes (16 per switch) and I/O nodes (1 per switch) are connected to the switches by 4x QDR (40 Gb/s). The theoretical peak performance of Gordon is 341 TFlop/s.
System Component | Configuration |
---|---|
Intel EM64T Xeon E5 Compute Nodes | |
Sockets | 2 |
Cores | 16 |
Clock speed | 2.6 GHz |
Flop speed | 333 Gflop/s |
Memory capacity | 64 GB |
Memory bandwidth | 85 GB/s |
STREAM Triad bandwidth | 60 GB/s |
I/O Nodes | |
Sockets | 2 |
Cores | 12 |
Clock speed | 2.67 GHz |
Memory capacity | 48 GB |
Memory bandwidth | 64 GB/s |
Flash memory | 4.8 TB |
Full System | |
Total compute nodes | 1024 |
Total compute cores | 16384 |
Peak performance | 341 Tflop/s |
Total memory | 64 TB |
Total memory bandwidth | 87 TB/s |
Total flash memory | 300 TB |
QDR InfiniBand Interconnect | |
Topology | 3D Torus |
Link bandwidth | 8 GB/s (bidirectional) |
Peak bisection bandwidth | TB/s (bidirectional) |
MPI latency | 1.3 µs |
DISK I/O Subsystems | |
File Systems | NFS + Lustre |
Storage capacity (usable) | 1.7 PB + 100 TB |
I/O bandwidth | 100 GB/s |
Software Function | Description |
---|---|
Cluster Management | Rocks |
Operating System | CentOS |
File Systems | NFS, Lustre |
Scheduler and Resource Manager | Slurm |
User Environment | Modules |
Compilers | Intel, GNU (Fortran, C, C++) |
Message Passing | MVAPICH2, Open MPI |
As a Flatiron Institute computing resource, Gordon is accessible to all FI staff with access to the local FI clusters. Other Simons staff who are interested in using Gordon should send email to "scicomp@flatironinstitute.org".
Logging in to Gordon
To log in to Gordon from the command line, you must first log in to "voms" (voms.simonsfoundation.org), then use the hostname:
gordon.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in to Gordon:
ssh <your_username>@gordon.sdsc.edu
ssh -l <your_username> gordon.sdsc.edu
Do not use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.
The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.
For example, if the Intel module and mvapich2_ib module are loaded and the user compiles with mpif90, the generated code is compiled with the Intel Fortran 90 compiler and linked with the mvapich2_ib MPI libraries.
Several modules that determine the default Gordon environment are loaded at login time. These include the MVAPICH2 implementation of the MPI library and the Intel compilers. We strongly suggest that you use this combination whenever possible to get the best performance.
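For example, a minimal sketch of verifying the default environment and compiling an MPI Fortran code with it (the source file hello_mpi.f90 is a hypothetical example):

module list                              # should show intel and mvapich2_ib among the loaded modules
mpif90 -O3 -o hello_mpi hello_mpi.f90    # compiles with the Intel Fortran compiler and links the MVAPICH2 libraries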
Useful Modules Commands
Here are some common module commands and their descriptions:
Command | Description |
---|---|
module list | List the modules that are currently loaded |
module avail | List the modules that are available |
module display <module_name> | Show the environment variables used by <module_name> and how they are affected |
module unload <module_name> | Remove <module_name> from the environment |
module load <module_name> | Load <module_name> into the environment |
module swap <module_one> <module_two> | Replace <module_one> with <module_two> in the environment |
Loading and unloading modules
You must remove some modules before loading others.
Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich2_ib are both loaded, running the command module unload intel will automatically unload mvapich2_ib. Subsequently issuing the module load intel command does not automatically reload mvapich2_ib.
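For instance, a minimal sketch of this behavior at the command line:

module unload intel        # mvapich2_ib is unloaded automatically as well
module load intel          # mvapich2_ib is NOT reloaded automatically
module load mvapich2_ib    # reload it explicitly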
If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (.bashrc for bash users, .cshrc for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.
Module: command not found
The error message module: command not found is sometimes encountered when switching from one shell to another or attempting to run the module command from within a shell script or batch job. The module command may not be inherited as expected because it is defined as a function for your login shell. If you encounter this error, execute the following from the command line (for interactive shells) or add it to your shell script (including Slurm batch scripts):
source /etc/profile.d/modules.sh
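For example, a minimal sketch of a Slurm batch script that restores the module command before loading modules (the module names shown are just the defaults described above):

#!/bin/bash
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH -t 00:10:00
source /etc/profile.d/modules.sh   # define the module function in this non-login shell
module load intel mvapich2_ib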
Gordon provides the Intel and GNU compilers along with multiple MPI implementations (MVAPICH2, MPICH2, OpenMPI). Most applications will achieve the best performance on Gordon using the Intel compilers and MVAPICH2, and the majority of libraries installed on Gordon have been built using this combination. Although other compilers and MPI implementations are available, we suggest using these only for compatibility purposes.
All compilers now support the Advanced Vector Extensions (AVX). Using AVX, up to eight floating point operations can be executed per cycle per core, potentially doubling the performance relative to non-AVX processors running at the same clock speed. Note that AVX support is not enabled by default and compiler flags must be set as described below.
The Intel compilers and the MVAPICH2 MPI implementation are loaded by default. If you have modified your environment, you can reload them by executing the following commands at the Linux prompt or placing them in your startup file (~/.cshrc or ~/.bashrc):
module purge
module load intel mvapich2_ib
For AVX support, compile with the -xHOST option. Note that -xHOST alone does not enable aggressive optimization, so compilation with -O3 is also suggested. The -fast flag invokes -xHOST, but should be avoided since it also turns on interprocedural optimization (-ipo), which may cause problems in some instances.
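For example, a minimal sketch of an AVX-enabled compile with the Intel compilers (mycode.f90 is a hypothetical source file):

ifort -O3 -xHOST -o mycode.exe mycode.f90   # -xHOST generates AVX code for the host CPU; -O3 enables aggressive optimization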
Intel MKL libraries are available as part of the "intel" module on Gordon. Once this module is loaded, the environment variable MKL_ROOT points to the location of the MKL libraries. The MKL link advisor can be used to ascertain the link line. (Note: if you use the MKL link advisor to generate a link line for your makefile or configuration, you will need to either modify the environment variable name or change the link line explicitly from MKLROOT, as it is set by the link advisor, to MKL_ROOT as shown below.)
For example, to compile a C program statically linking the 64-bit ScaLAPACK libraries on Gordon:
module unload mvapich2_ib
module unload pgi
module load intel
module load mvapich2_ib
mpicc -o pdpttr.exe pdpttr.c \
  -I$MKL_ROOT/include ${MKL_ROOT}/lib/intel64/libmkl_scalapack_lp64.a \
  -Wl,--start-group ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.a \
  ${MKL_ROOT}/lib/intel64/libmkl_core.a ${MKL_ROOT}/lib/intel64/libmkl_sequential.a \
  -Wl,--end-group ${MKL_ROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm
For more information on the Intel compilers: [ifort | icc | icpc] -help
 | Serial | MPI | OpenMP | MPI+OpenMP |
---|---|---|---|---|
Fortran | ifort | mpif90 | ifort -openmp | mpif90 -openmp |
C | icc | mpicc | icc -openmp | mpicc -openmp |
C++ | icpc | mpicxx | icpc -openmp | mpicxx -openmp |
Note for C/C++ users: The compiler warning "feupdateenv is not implemented and will always fail" can safely be ignored by most users. By default, the Intel C/C++ compilers only link against Intel's optimized version of the C standard math library (libimf). The warning stems from the fact that several of the newer C99 library functions related to floating point rounding and exception handling have not been implemented.
The GNU compilers can be loaded by executing the following commands at the Linux prompt or placing them in your startup files (~/.cshrc or ~/.bashrc):
module purge
module load gnu openmpi
For AVX support, compile with -mavx.
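For example, a minimal sketch of an AVX-enabled compile with the GNU compilers (mycode.c is a hypothetical source file):

gcc -O3 -mavx -o mycode.exe mycode.c   # -mavx generates AVX instructions; -O3 enables aggressive optimization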
For more information on the GNU compilers: man [gfortran | gcc | g++]
 | Serial | MPI | OpenMP | MPI+OpenMP |
---|---|---|---|---|
Fortran | gfortran | mpif90 | gfortran -fopenmp | mpif90 -fopenmp |
C | gcc | mpicc | gcc -fopenmp | mpicc -fopenmp |
C++ | g++ | mpicxx | g++ -fopenmp | mpicxx -fopenmp |
Gordon uses the Simple Linux Utility for Resource Management (Slurm) to manage user jobs. Whether you run in batch mode or interactively, you will access the compute nodes using the sbatch or srun commands as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes.
Gordon has the following partitions available:
Queue Name | Max Walltime | Max Nodes | Max Jobs | Comments |
---|---|---|---|---|
general | 7 days | 256 | unlimited | Used for exclusive priority access to compute nodes |
preempt | unlimited | unlimited | unlimited | Used for exclusive access to compute nodes, jobs are pre-emptable |
soap | unlimited | unlimited | unlimited | Used for SOAP, SOP and SAP groups |
You can request an interactive session using the srun command as follows:
[user@gordon-ln1]$ srun --pty --nodes=1 --ntasks-per-node=16 -p general -t 00:30:00 --wait 0 /bin/bash
Jobs can be submitted to the partitions using the sbatch command as follows:
[user@gordon-ln1]$ sbatch jobscriptfile
where jobscriptfile is the name of a UNIX format file containing special statements (corresponding to sbatch options), resource specifications and shell commands. Several example Slurm scripts are given below:
#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 16 cores per node for a total of 32 cores.
srun -n 32 --mpi=pmi2 ../hello_mpi
#!/bin/bash
#SBATCH --job-name="hello_openmp"
#SBATCH --output="hello_openmp.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Set the number of OpenMP threads
export OMP_NUM_THREADS=16
./hello_openmp
#!/bin/bash
#SBATCH --job-name="hellohybrid"
#SBATCH --output="hellohybrid.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 16 cores per node for a total of 32 cores.
#We use 8 MPI tasks and 4 OpenMP threads per MPI task.
export OMP_NUM_THREADS=4
srun -n 8 --cpus-per-task=4 --mpi=pmi2 ./hello_hybrid
#!/bin/bash
#SBATCH --job-name="hellompirunrsh"
#SBATCH --output="hellompirunrsh.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Generate a hostfile from the Slurm node list
export SLURM_NODEFILE=`generate_pbs_nodefile`
#Run the job using mpirun_rsh
mpirun_rsh -hostfile $SLURM_NODEFILE -np 32 ../hello_mpi
There are several scenarios (e.g., splitting long-running jobs, workflows) where users may require jobs with dependencies on the successful completion of other jobs. In such cases, Slurm's --dependency option can be used. The syntax is as follows:
[user@gordon-ln1 ~]$ sbatch --dependency=afterok:jobid jobscriptfile
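For example, a minimal sketch that chains two hypothetical job scripts (step1.sbatch and step2.sbatch) so the second starts only if the first completes successfully:

#Submit the first job and capture its job ID (--parsable prints only the ID)
jobid=$(sbatch --parsable step1.sbatch)
#Submit the second job; it stays pending until step1 finishes with exit code 0
sbatch --dependency=afterok:$jobid step2.sbatch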
Gordon compute nodes have access to fast flash storage. Each compute node mounts locally a single 300 GB SSD (280 GB usable space).
The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution under the following directory:
/scratch/$USER/$SLURM_JOBID
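For example, a minimal sketch of staging work through the node-local SSD inside a batch job (the application name myapp and the file names are hypothetical):

#Copy input to the node-local SSD, run there, and copy results back
cp $SLURM_SUBMIT_DIR/input.dat /scratch/$USER/$SLURM_JOBID/
cd /scratch/$USER/$SLURM_JOBID
$SLURM_SUBMIT_DIR/myapp input.dat > output.dat
cp output.dat $SLURM_SUBMIT_DIR/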
Lustre Scratch Space
The scratch disk space on Gordon has a capacity of 1.7 PB and is currently configured as follows.
/simons/scratch/$USER/
This Lustre scratch space should be used for applications that require a shared scratch area or need more storage than the local flash disks can provide. Unlike the SSD scratch storage, data stored in Lustre scratch is not purged immediately after job completion, so users have time after a job completes to copy back any data they wish to retain to their project directories, home directories, or home institution.
After logging in, users are placed in their home directory under /home, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Jobs should never be run from the home file system, as it is not set up for high-performance throughput. Users should keep usage in $HOME under 100 GB. Backups are currently retained on a rolling 8-week basis. In case of file corruption or data loss, please contact us at support@sdsc.edu to retrieve the requested files.
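For example, a quick way to check how much space your home directory currently uses (du is a standard utility; the exact quota tooling on Gordon may differ):

du -sh $HOME   # prints the total size of your home directory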
In this section, we describe some standard tools that you can use to monitor your batch jobs. We suggest that you familiarize yourself with the section of the user guide that deals with running jobs to get a deeper understanding of the batch queuing system before starting this section.
More details are provided in the Slurm User Guide.
sacct - displays job accounting data
This example illustrates the default invocation of the sacct command:
# sacct
Jobid      Jobname    Partition  Account    AllocCPUS  State      ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
2          script01   srun       acct1      1          RUNNING    0
3          script02   srun       acct1      1          RUNNING    0
4          endscript  srun       acct1      1          RUNNING    0
4.0        srun                  acct1      1          COMPLETED  0
The sacct command has several standard options. This example shows the same job accounting information with the --brief option, which displays only the jobid, state, and exitcode:
# sacct --brief
Jobid      State      ExitCode
---------- ---------- --------
2          RUNNING    0
3          RUNNING    0
4          RUNNING    0
4.0        COMPLETED  0
The sacct command can also be customized. This example demonstrates that ability. The fields are displayed in the order designated on the command line.
# sacct --format=jobid,elapsed,ncpus,ntasks,state
Jobid      Elapsed    Ncpus      Ntasks   State
---------- ---------- ---------- -------- ----------
3          00:01:30   2          1        COMPLETED
3.0        00:01:30   2          1        COMPLETED
4          00:00:00   2          2        COMPLETED
4.0        00:00:01   2          2        COMPLETED
5          00:01:23   2          1        COMPLETED
5.0        00:01:31   2          1        COMPLETED