Popeye-Simons User Guide

The SDSC Popeye-Simons cluster is not an NSF-allocated resource.

Technical Summary

The Popeye supercomputer is a heterogeneous cluster, physically located at SDSC, designed by Simons and SDSC staff, and integrated by Lenovo. The current hardware configuration consists of:

Five racks, each containing 72 standard compute nodes with either Intel Skylake or Cascade Lake processors
One rack with a Mellanox 648-port director switch that connects all nodes through an EDR InfiniBand network with fat-tree topology. This rack also houses two login and two management nodes.

In July 2019, the system will expand to include:

One rack with 16 GPU-accelerated compute nodes, each containing four NVIDIA V100 GPUs along with Intel Skylake processors.

Each compute node has 768 GB of DRAM, which gives an aggregate DRAM capacity of 270 TB. In July 2019, aggregate DRAM capacity will increase to 284 TB.

Current system peak speed is 1.55 Pflop/s. In July 2019 peak speed will increase to 2.05 Pflop/s, with 0.45 Pflop/s from the GPUs.

Data-intensive computations are supported by a large parallel file system based on Lustre, along with a home file system based on NFS. The current capacity of the parallel file system is 4.4 PB and will grow to 12 PB in June 2019.

Additional information on the hardware and software configurations is in the following tables.

Technical Details

System Component	Racks 1 & 2	Racks 4, 5, & 6	Rack 7
Production Date	February 2019	June 2019	July 2019
CPU Type	Intel Skylake 8168	Intel Cascade Lake 8268	Intel Skylake 6148
Clock speed (GHz)	2.7	2.9	2.4
Cores/Node	48	48	40
Peak flops/clock-core via AVX-512	32	32	32
Peak CPU speed/node (Tflop/s)	4.15	4.45	3.07
CPU DRAM type & speed (MHz)	DDR4-2666	DDR4-2666	DDR4-2666
CPU DRAM capacity/node (GB)	768	768	768
Flash memory/node (TB)	1.9	1.9	-
GPU type	-	-	NVIDIA V100
GPU/node	-	-	4
Peak GPU DP speed/node (Tflop/s)	-	-	28.0
GPU DRAM capacity/node (GB)	-	-	128
Nodes/system thru date of column	144	360	376
Cores/system thru date of column	6,912	17,280	17,920
GPUs/system thru date of column	-	-	64
Peak speed/system (Pflop/s) thru date of column	0.59	1.55	2.05
Aggregate DRAM/system (TB) thru date of column	108	270	284
Interconnect type	EDR InfiniBand	EDR InfiniBand	EDR InfiniBand
Interconnect topology	Fat tree	Fat tree	Fat tree
Maximum unidirectional MPI bandwidth/node (GB/s)	12.3	12.3	12.3
Minimum MPI latency (µs)	1.4	1.4	1.4
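
The per-node peak CPU speeds in the table follow from cores per node × clock speed (GHz) × peak flops per clock per core. The following is a minimal sketch of that arithmetic in the shell; the helper name peak_tflops is illustrative and is not part of any Popeye tooling.

```shell
#!/bin/bash
# peak_tflops CORES CLOCK_GHZ FLOPS_PER_CLOCK
# Prints the theoretical peak speed of one node in Tflop/s.
# (hypothetical helper, shown only to illustrate the table arithmetic)
peak_tflops() {
    awk -v cores="$1" -v ghz="$2" -v fpc="$3" \
        'BEGIN { printf "%.2f\n", cores * ghz * fpc / 1000 }'
}

peak_tflops 48 2.7 32   # Skylake 8168 nodes
peak_tflops 48 2.9 32   # Cascade Lake 8268 nodes
peak_tflops 40 2.4 32   # Skylake 6148 nodes
```

The three calls print 4.15, 4.45, and 3.07, matching the "Peak CPU speed/node" row.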

Systems Software Environment

Software Function	Description
Cluster Management	Bright
Operating System	CentOS
File Systems	Lustre, NFS
Scheduler and Resource Manager	Slurm
User Environment	Modules
Compilers	Intel, GNU (Fortran, C, C++)
Message Passing	Intel MPI, Open MPI

System Access

Popeye is a Flatiron Institute (FI) computing resource. It is accessible to all FI staff who have access to the local FI clusters. Other Simons staff who are interested in using Popeye should send email to scicomp@flatironinstitute.org.

Logging in to Popeye

Instructions for accessing Popeye/Gordon can be found on the Simons Foundation web page.

Notes and hints

When you log in to popeye.sdsc.edu, you will be assigned one of the two login nodes, popeye-login[1-2].sdsc.edu. These nodes are identical in both architecture and software environment. Users who see poor performance on one node can connect directly to the other.
When using screen or tmux be aware of the login node on which the session was started. When you try to reconnect to the old session be sure to use the same login node on which you started the original screen or tmux session.

Do not use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.

Modules

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

Modules are not loaded at login time. Users are responsible for loading all modules that they need. This includes the slurm module, which is needed to submit and monitor jobs.

Useful Modules Commands

Here are some common module commands and their descriptions:

Command	Description
module list	List the modules that are currently loaded
module avail	List the modules that are available
module display <module_name>	Show the environment variables used by <module_name> and how they are affected
module unload <module_name>	Remove <module_name> from the environment
module load <module_name>	Load <module_name> into the environment
module swap <module_one> <module_two>	Replace <module_one> with <module_two> in the environment

Loading and unloading modules

You must remove some modules before loading others.

Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich2_ib are both loaded, running the command module unload intel will automatically unload mvapich2_ib. Subsequently issuing the module load intel command does not automatically reload mvapich2_ib.

If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (.bashrc for bash users, .cshrc for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.

Module: command not found

The error message module: command not found is sometimes encountered when switching from one shell to another or when running the module command from within a shell script or batch job. The module command may not be inherited as expected because it is defined as a function for your login shell. If you encounter this error, run the following from the command line (interactive shells) or add it to your shell script (including Slurm batch scripts):

source /etc/profile.d/modules.sh
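
In scripts that may run both interactively and under Slurm, the same fix can be applied only when needed. The following is a minimal sketch; ensure_module is a hypothetical helper name, and the default path is the one given above.

```shell
#!/bin/bash
# ensure_module [INIT_SCRIPT]
# Makes the module command available in the current shell if it is missing.
# INIT_SCRIPT defaults to the path given in this guide.
# (hypothetical helper, not provided by the Modules package)
ensure_module() {
    init_script="${1:-/etc/profile.d/modules.sh}"
    # "command -v" also detects shell functions, which is how module is defined
    if command -v module >/dev/null 2>&1; then
        return 0
    fi
    if [ -r "$init_script" ]; then
        # "." is the portable spelling of "source"
        . "$init_script"
        command -v module >/dev/null 2>&1
        return
    fi
    echo "module init script not found: $init_script" >&2
    return 1
}

ensure_module || echo "continuing without modules" >&2
```

Calling ensure_module near the top of a batch script, before any module load line, avoids sourcing the init script a second time when the function is already defined.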

Compiling

Popeye provides the Intel and GNU compilers along with Intel MPI and Open MPI. All compilers now support Intel's Advanced Vector Extensions (AVX-512). Using AVX-512, up to 32 floating point operations can be executed per cycle per core, potentially quadrupling the performance relative to non-AVX-512 processors running at the same clock speed. Note that AVX-512 support is not enabled by default; compiler flags must be set as described below.

Using the Intel compilers

To load the Intel compilers and the Intel MPI implementation, execute the following commands at the Linux prompt or place them in your startup file (~/.cshrc or ~/.bashrc):

module purge
module load intel/compiler intel/mpi

For AVX-512 support, compile with the -xHOST option. Note that -xHOST alone does not enable aggressive optimization, so compilation with -O3 is also suggested. The -fast flag invokes -xHOST, but should be avoided since it also turns on interprocedural optimization (-ipo), which may cause problems in some instances.

Intel MKL libraries are available as part of the "intel/mkl" module on Popeye. Once this module is loaded, the environment variable MKLROOT points to the location of the MKL libraries. The MKL link advisor can be used to ascertain the link line.

For example, to compile a C program that statically links the 64-bit ScaLAPACK libraries on Popeye:

module load intel/compiler
module load intel/mpi
module load intel/mkl
mpicc -o pdpttr.exe pdpttr.c \
    -I$MKLROOT/include ${MKLROOT}/lib/intel64/libmkl_scalapack_lp64.a \
    -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a \
    ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_sequential.a \
    -Wl,--end-group ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm

For more information on the Intel compilers: [ifort | icc | icpc] -help

	Serial	MPI	OpenMP	MPI+OpenMP
Fortran	ifort	mpif90	ifort -openmp	mpif90 -openmp
C	icc	mpicc	icc -openmp	mpicc -openmp
C++	icpc	mpicxx	icpc -openmp	mpicxx -openmp

Note for C/C++ users: The compiler warning "feupdateenv is not implemented and will always fail" can safely be ignored by most users. By default, the Intel C/C++ compilers only link against Intel's optimized version of the C standard math library (libimf). The warning stems from the fact that several of the newer C99 library functions related to floating point rounding and exception handling have not been implemented.

Using the GNU compilers

The GNU compilers can be loaded by executing the following commands at the Linux prompt or by placing them in your startup files (~/.cshrc or ~/.bashrc):

module purge
module load gcc openmpi

For AVX support, compile with -mavx.

For more information on the GNU compilers: man [gfortran | gcc | g++]

	Serial	MPI	OpenMP	MPI+OpenMP
Fortran	gfortran	mpif90	gfortran -fopenmp	mpif90 -fopenmp
C	gcc	mpicc	gcc -fopenmp	mpicc -fopenmp
C++	g++	mpicxx	g++ -fopenmp	mpicxx -fopenmp

Notes and Hints

The mpif90, mpicc, and mpicxx commands are actually wrappers that call the appropriate serial compilers and load the correct MPI libraries. While the same names are used for the Intel and GNU compilers, keep in mind that these are completely independent scripts.
If you use the GNU compiler or switch between compilers for different applications, make sure that you load the appropriate modules before running your executables.
When building OpenMP applications and moving between different compilers, one of the most common errors is to use the wrong flag to enable handling of OpenMP directives. Note that Intel and GNU compilers use the -openmp and -fopenmp flags, respectively.
Explicitly set the optimization level in your makefiles or compilation scripts. Most well-written codes can safely use the highest optimization level (-O3), but many compilers set lower default levels (e.g. GNU compilers default to -O0, which turns off all optimizations).
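
To avoid mixing up the OpenMP flags when a build script has to work with both compiler families, the flag can be chosen from the compiler name. The following is a minimal sketch based only on the flags listed in the tables above; openmp_flag is a hypothetical helper, and hello_openmp.c is a placeholder source file.

```shell
#!/bin/bash
# openmp_flag COMPILER
# Prints the OpenMP flag listed in this guide for the given serial compiler.
# (hypothetical helper; the MPI wrappers are not handled because their
# underlying compiler depends on which modules are loaded)
openmp_flag() {
    case "$1" in
        ifort|icc|icpc)   echo "-openmp"  ;;   # Intel compilers
        gfortran|gcc|g++) echo "-fopenmp" ;;   # GNU compilers
        *) echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

# Build the command line and print it without running the compiler
CC=gcc
echo "$CC -O3 $(openmp_flag "$CC") -o hello_openmp hello_openmp.c"
```

The echo line prints the full command with an explicit -O3, in line with the advice above to set the optimization level explicitly.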

Running Jobs on Compute Nodes

Popeye uses the Simple Linux Utility for Resource Management (Slurm) to manage user jobs. The slurm module must be loaded before the Slurm commands are available:

module load slurm

Whether you run in batch mode or interactively, you will access the compute nodes using the sbatch or srun command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes.

Popeye Partitions

Popeye has the following partitions available:

Queue Name	Max Walltime	Max Nodes	Max Jobs	Comments
general	7 days	unlimited	unlimited	Used for exclusive priority access to compute nodes
preempt	unlimited	unlimited	unlimited	Used for exclusive access to compute nodes
soap	7 days	unlimited	unlimited	Used for SOAP group

Requesting interactive resources using srun

You can request an interactive session using the srun command as follows:

[user@popeye-ln1]$ srun --pty --nodes=1 --ntasks-per-node=48 -p general -t 00:30:00 --wait 0 /bin/bash

Submitting Jobs Using sbatch

Jobs can be submitted to the partitions listed above using the sbatch command as follows:

[user@popeye-ln1]$ sbatch jobscriptfile

where jobscriptfile is the name of a UNIX format file containing special statements (corresponding to sbatch options), resource specifications and shell commands. Several example Slurm scripts are given below:

Basic MPI Job

#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 48 cores per node for a total of 96 cores.
#srun launches one MPI task per core (96 tasks in total)

srun  -n 96 --mpi=pmi2  ../hello_mpi
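
The task count passed to srun above is nodes × tasks per node (2 × 48 = 96). As a sketch of how to avoid hard-coding it, the count can be derived from the environment variables that Slurm sets inside a job. The fallback values 2 and 48 are assumptions that mirror the script above so that the snippet also runs outside a job.

```shell
#!/bin/bash
# Inside a batch job Slurm sets SLURM_NNODES, and it sets
# SLURM_NTASKS_PER_NODE when --ntasks-per-node is requested.
# The defaults (2 and 48) are only used outside a job.
nodes="${SLURM_NNODES:-2}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-48}"
total_tasks=$(( nodes * tasks_per_node ))

# Print the launch line instead of executing it;
# drop "echo" when using this in a real job script.
echo srun -n "$total_tasks" --mpi=pmi2 ../hello_mpi
```

With this approach, changing --nodes in the #SBATCH header is enough; the srun line does not need to be edited to match.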

Basic OpenMP Job

#!/bin/bash
#SBATCH --job-name="hello_openmp"
#SBATCH --output="hello_openmp.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#SET the number of openmp threads
export OMP_NUM_THREADS=48

./hello_openmp

Hybrid MPI-OpenMP Job

#!/bin/bash
#SBATCH --job-name="hellohybrid"
#SBATCH --output="hellohybrid.%j.%N.out"
#SBATCH --partition=general
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 48 cores per node for a total of 96 cores.
# We use 2 MPI tasks and 48 OpenMP threads per MPI task

export OMP_NUM_THREADS=48
srun  -n 2 --cpus-per-task=48 --mpi=pmi2 ./hello_hybrid
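
In hybrid jobs the OpenMP thread count should match the number of cores given to each MPI task. The following is a minimal sketch that keeps the two in step, assuming the batch script requests the cores with an #SBATCH --cpus-per-task line (in which case Slurm exports SLURM_CPUS_PER_TASK); set_omp_threads is a hypothetical helper name.

```shell
#!/bin/bash
# Assumes the batch script contains "#SBATCH --cpus-per-task=<n>", so that
# Slurm exports SLURM_CPUS_PER_TASK into the job environment.
# (hypothetical helper, shown for illustration)
set_omp_threads() {
    # Fall back to a single thread when the variable is unset,
    # for example when the snippet is run outside a job.
    OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
    export OMP_NUM_THREADS
}

set_omp_threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```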

Job Dependencies

There are several scenarios (e.g. splitting long running jobs, workflows) where users may require jobs with dependencies on successful completions of other jobs. In such cases, Slurm's --dependency option can be used. The syntax is as follows:

[user@popeye-ln1 ~]$ sbatch --dependency=afterok:jobid jobscriptfile
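
For a workflow with several steps, the job ID of each submission can be captured and passed to the next one. The following is a minimal sketch that relies on sbatch's --parsable option to print just the job ID; submit_chain and the script names in the usage line are hypothetical.

```shell
#!/bin/bash
# submit_chain SCRIPT...
# Submits the scripts in order; each job starts only after the previous
# one has completed successfully (afterok).
# (hypothetical helper, not a Popeye or Slurm tool)
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:${prev}" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID
        prev="${jobid%%;*}"
        echo "submitted $script as job $prev"
    done
}

# Usage, on a login node with the slurm module loaded:
#   submit_chain step1.sh step2.sh step3.sh
```

If any submission fails, the function stops and returns a non-zero status, so later steps are not queued without their dependency.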

Storage Overview

SSD Scratch Space

Popeye compute nodes have access to fast flash storage. Each compute node mounts locally a single 1.9 TB SSD (1.8 TB usable space).

The latency to the SSDs is several orders of magnitude lower than that for spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution under the following directory:

/scratch/$USER/$SLURM_JOBID
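
Because the SSD directory exists only while the job runs, results written there have to be copied elsewhere before the job ends. The following is a minimal sketch of that pattern; run_in_scratch is a hypothetical helper, and the application name in the usage comment is a placeholder.

```shell
#!/bin/bash
# run_in_scratch SCRATCH_DIR RESULTS_DIR COMMAND [ARGS...]
# Runs COMMAND with SCRATCH_DIR as its working directory, then copies
# everything left in SCRATCH_DIR to RESULTS_DIR.
# (hypothetical helper, shown for illustration)
run_in_scratch() {
    scratch="$1"
    results="$2"
    shift 2
    [ -d "$scratch" ] || { echo "scratch dir missing: $scratch" >&2; return 1; }
    mkdir -p "$results" || return 1
    # Subshell so the caller's working directory is left unchanged
    ( cd "$scratch" && "$@" ) || return 1
    cp -r "$scratch"/. "$results"/
}

# Usage inside a job script (./my_app is a placeholder):
#   run_in_scratch "/scratch/$USER/$SLURM_JOBID" \
#       "$SLURM_SUBMIT_DIR/results" "$SLURM_SUBMIT_DIR/my_app"
```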

Lustre Scratch Space

The scratch disk space on Popeye has a current capacity of 4.4 PB. The scratch disk space will grow to 12 PB in June 2019. The disk space is available at:

/simons/scratch/$USER/

This space should be used for applications that require shared scratch space or need more storage than can be provided by the flash disks. Unlike SSD scratch storage, the data stored in Lustre scratch is not purged immediately after job completion. Users will have time after the job completes to copy back data they wish to retain to their projects directories, home directories, or to their home institution.

Home File System

After logging in, users are placed in their home directory under /home, which is also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Jobs should never be run from the home file system, as it is not set up for high-performance throughput. Users should keep usage on $HOME under 100 GB. Backups are currently retained for a rolling 8-week period. In case of file corruption or data loss, please contact us at support@sdsc.edu to retrieve the requested files.
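
Usage against the 100 GB guideline can be checked with du. The following is a minimal sketch; check_usage is a hypothetical helper, and it assumes that 1 GB is counted as 1024 × 1024 of the 1 KB blocks reported by du -sk.

```shell
#!/bin/bash
# check_usage [DIR] [LIMIT_GB]
# Prints the usage of DIR and returns non-zero when it exceeds LIMIT_GB.
# Defaults: $HOME and the 100 GB guideline from this guide.
# Assumption: 1 GB = 1024*1024 blocks of 1 KB as reported by "du -sk".
# (hypothetical helper, shown for illustration)
check_usage() {
    dir="${1:-$HOME}"
    limit_gb="${2:-100}"
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{ print $1 }')
    [ -n "$used_kb" ] || { echo "cannot read $dir" >&2; return 2; }
    limit_kb=$(( limit_gb * 1024 * 1024 ))
    echo "$dir: $(( used_kb / 1024 )) MB used, limit ${limit_gb} GB"
    [ "$used_kb" -le "$limit_kb" ]
}

# Usage:
#   check_usage            # checks $HOME against 100 GB
#   check_usage "$HOME" 50 # stricter personal limit
```

Scanning a large home directory with du can take a while, so run it occasionally rather than at every login.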

Monitoring Your Job

In this section, we describe some standard tools that you can use to monitor your batch jobs. We suggest that you familiarize yourself with the section of the user guide that deals with running jobs, to get a deeper understanding of the batch queuing system, before starting this section.

More details are provided in the Slurm User Guide.

Examples

sacct - displays job accounting data

This example illustrates the default invocation of the sacct command:

# sacct
Jobid      Jobname    Partition    Account AllocCPUS State     ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
2          script01   srun       acct1               1 RUNNING           0
3          script02   srun       acct1               1 RUNNING           0
4          endscript  srun       acct1               1 RUNNING           0
4.0                   srun       acct1               1 COMPLETED         0

The sacct command has several standard options. This example shows the same job accounting information with the --brief option, which displays the job ID, state, and exit code:

# sacct --brief
     Jobid     State  ExitCode
---------- ---------- --------
2          RUNNING           0
3          RUNNING           0
4          RUNNING           0
4.0        COMPLETED         0

The output of sacct can also be customized with the --format option. The fields are displayed in the order given on the command line:

# sacct --format=jobid,elapsed,ncpus,ntasks,state
     Jobid    Elapsed      Ncpus   Ntasks     State
---------- ---------- ---------- -------- ----------
3            00:01:30          2        1 COMPLETED
3.0          00:01:30          2        1 COMPLETED
4            00:00:00          2        2 COMPLETED
4.0          00:00:01          2        2 COMPLETED
5            00:01:23          2        1 COMPLETED
5.0          00:01:31          2        1 COMPLETED
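
For use in scripts, sacct can produce delimiter-separated output that is easier to filter than the aligned columns shown above. The following is a minimal sketch that uses the --parsable2 and --noheader options of sacct to list jobs that are not yet in the COMPLETED state; unfinished_jobs is a hypothetical helper name.

```shell
#!/bin/bash
# unfinished_jobs: reads "jobid|state" lines on stdin and prints the job IDs
# whose state is anything other than COMPLETED.
# (hypothetical helper, shown for illustration)
unfinished_jobs() {
    awk -F'|' '$2 != "COMPLETED" { print $1 }'
}

# Only query the accounting data when sacct is actually available
if command -v sacct >/dev/null 2>&1; then
    sacct --noheader --parsable2 --format=jobid,state | unfinished_jobs
fi
```

The filter works on any input in the same two-column format, so it can be tried on saved sacct output before being used in a workflow script.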