Voyager is an Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) innovative AI system designed specifically for science and engineering research at scale. Voyager focuses on supporting research in science and engineering that is increasingly dependent on artificial intelligence and deep learning as a critical element of the experimental and/or computational work. It features Habana Gaudi training processors and first-generation Habana inference processors, along with a high-performance, low-latency 400 gigabit-per-second interconnect from Arista. Voyager gives researchers the ability to work with extremely large data sets using standard AI tools, such as TensorFlow and PyTorch, or to develop their own deep learning models using developer tools and libraries from Habana Labs.
Voyager is an NSF-funded system developed in collaboration with Supermicro and Intel's Habana Labs, and is operated by the San Diego Supercomputer Center at UC San Diego. It began a three-year testbed phase in early 2022.
System Component | Configuration |
---|---|
Supermicro X12 Gaudi Training Nodes | |
CPU Type | Intel Xeon Gold 6336 |
Habana Gaudi processors | 336 |
Nodes | 42 |
Training processors/node | 8 |
Host x86 processors/node | 2 |
Sockets | 2 |
Memory capacity | 512 GB DDR4 DRAM |
Memory/training processor | 32 GB HBM2 |
Local storage | 6.4 TB local NVMe |
Max CPU memory bandwidth | ** GB/s |
Intel First-Generation Habana Inference Nodes | |
CPU Type | Intel Xeon Gold 6240 |
First-generation Habana inference processors | 16 |
Nodes | 2 |
First-generation Habana inference cards/node | 8 |
Cores/socket | 20 |
Sockets | 2 |
Clock speed | 2.5 GHz |
Flop speed | 34.4 TFlop/s |
Memory capacity | 384 GB DDR4 DRAM |
Local storage | 1.6 TB Samsung PM1745b NVMe PCIe SSD |
Max CPU memory bandwidth | 281.6 GB/s |
Standard Compute Nodes | |
CPU Type | Intel x86 |
Nodes | 36 |
x86 processors/node | 2 |
Memory capacity | 384 GB |
Local NVMe | 3.2 TB |
Interconnect | |
Topology | Full bisection bandwidth switch |
Per-node bandwidth | 6 x 400 Gb/s (bidirectional) |
Disk I/O Subsystem | |
File systems | Ceph |
Ceph storage | 1 PB |
Software Function | Description |
---|---|
Cluster Management | Bright Cluster Manager |
Operating System | Ubuntu 20.04 LTS |
File Systems | Ceph |
Scheduler and Resource Manager | Kubernetes |
User Environment | Lmod, Containers |
Voyager uses ssh key pairs for access. Approved users will need to send their ssh public key to consult@sdsc.edu to gain access to the system.
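If you do not already have a key pair, one way to generate one is shown below; the key type and file name are only an example:
$ ssh-keygen -t ed25519
$ cat ~/.ssh/id_ed25519.pub
The contents of the .pub file are what you send to consult@sdsc.edu; never share the private key.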
To log in to Voyager from the command line, use the hostname:
login.voyager.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in to Voyager:
ssh <your_username>@login.voyager.sdsc.edu
ssh -l <your_username> login.voyager.sdsc.edu
Voyager will feature cnvrg.io, a machine learning (ML) platform that helps manage, build, and automate ML workflows. cnvrg.io will provide a quick and easy way for Voyager users to collaborate, integrate, manage files, and submit and monitor jobs.
Approved Voyager project PIs and co-PIs can add or remove users (accounts) on a Voyager project. Please submit a support ticket to consult@sdsc.edu to add or remove users.
Environment Modules provide for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.
Voyager uses Lmod, a Lua-based module system. Users will ONLY need the kubernetes module, which is loaded by default.
Voyager runs Kubernetes. Kubernetes is an open-source platform for managing containerized workloads and services. A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods that run on nodes. The resources required by the Pods are specified in YAML files. Some basic Kubernetes commands and examples of running jobs are provided below.
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
On login.voyager.sdsc.edu set up the kubectl environment by loading the kubernetes module:
vgr-1-20:~$ module load kubernetes/voyager
Review your currently loaded modules
vgr-1-20:~$ module list
Currently Loaded Modules:
1) shared 3) default-environment 5) kubernetes/voyager/1.18.15
2) dot 4) DefaultModules
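With the module loaded, a few commonly used kubectl commands are shown below; resource names in angle brackets are placeholders:
$ kubectl get nodes                 # list cluster nodes
$ kubectl get pods                  # list pods in your default namespace
$ kubectl describe pod <pod-name>   # show detailed status and events for a pod
$ kubectl delete -f <file>.yaml     # delete the resources defined in a YAML file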
There are currently no limits enforced on Voyager resources; the values for each partition in the table below are the maximum available. Resource limits will be set based on the Early User Period evaluation.
Resource Name | Max Walltime | Max Nodes/Job | Nodes | Notes |
---|---|---|---|---|
First-Generation Habana Inference | 48 hrs | 2 | vgr-10-01, vgr-10-02 | Inference |
Gaudi | 48 hrs | 42 | vgr-[2-4]-[01-06], vgr-[6-9]-[01-06] | Training |
Compute | 48 hrs | 36 | vgr-10-[02-38] | Preprocessing/Postprocessing |
Kubernetes objects are persistent entities in the Kubernetes system. Objects describe what containerized applications are running (and on which nodes), the resources available to those applications, and the policies around how those applications behave.
The most commonly used objects on Voyager are Pods and Jobs:
kind: Pod
kind: Job
kind: MPIJob
Other Kubernetes objects are available but not recommended for routine application runs on Voyager: Deployment, ReplicaSet, DaemonSet, CronJob, ReplicationController.
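A Pod runs its containers once, while a Job wraps a Pod template and is better suited to batch-style work. A minimal Job sketch, assuming one Gaudi HPU and the Habana PyTorch image used elsewhere in this guide (the job name is only a placeholder), might look like:
apiVersion: batch/v1
kind: Job
metadata:
  name: hpu-test-job                # placeholder name
spec:
  backoffLimit: 0                   # do not retry the Pod if it fails
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gaudi-container
          image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1:latest
          command: ["hl-smi"]
          resources:
            limits:
              habana.ai/gaudi: 1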
YAML ("YAML Ain't Markup Language", originally "Yet Another Markup Language") is a human-readable data serialization language supported by many programming languages and often used as a format for configuration files. YAML uses a colon-based syntax for expressing key-value pairs. The official recommended filename extension for YAML files is .yaml.
SDSC User Services staff have developed sample run scripts for common applications available on Voyager in the directory:
/cm/shared/examples/sdsc
YAML formatting is very sensitive to outline indentation and whitespace; please do not copy and paste examples from this user guide.
# This example runs hl-smi in a container with a single Gaudi HPU.
apiVersion: v1
kind: Pod
metadata:
  name: hpu-test-pod
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-container
      image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
      command: ["hl-smi"]
      resources:
        limits:
          habana.ai/gaudi: 1
          hugepages-2Mi: 3800Mi
          memory: 32G
          cpu: 1
        requests:
          memory: 32G
          cpu: 1
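Assuming the manifest above is saved as hpu-test-pod.yaml (the filename is arbitrary), it can be submitted and its hl-smi output inspected with:
$ kubectl apply -f hpu-test-pod.yaml
$ kubectl get pods
$ kubectl logs hpu-test-pod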
apiVersion: v1
kind: Pod
metadata:
  name: goya-example
spec:
  restartPolicy: Never
  nodeSelector:
    brightcomputing.com/node-category: 'goya'
  containers:
    - name: goya-container
      image: ghcr.io/mkandes/naked-docker:habana-goya-0.9.15-31R
      command: ["hl-smi"]
      resources:
        limits:
          cpu: 4
          memory: 16Gi
          hugepages-2Mi: 500Mi
          habana.ai/goya: 1
apiVersion: v1
kind: Pod
metadata:
  name: compute-example
spec:
  restartPolicy: Never
  serviceAccountName: username
  nodeSelector:
    brightcomputing.com/node-category: 'compute'
  containers:
    - name: hpl-2-3-ubuntu-20-04-openmpi-4-0-5-openblas-0-3-14
      image: ghcr.io/mkandes/naked-docker:hpl-2.3-ubuntu-20.04-openmpi-4.0.5-openblas-0.3.14
      resources:
        requests:
          cpu: 52
          memory: 368Gi
        limits:
          cpu: 104
          memory: 371Gi
      command: ["/bin/bash", "-c"]
      args:
        - >-
          lscpu;
          free -h;
          printenv;
The following YAML file will request a pod named hpu-pod-interactive. It sets up a Pod with one Gaudi HPU using a Habana PyTorch container from the Habana vault registry. The command keeps the container running for 1000 seconds (using sleep) so that a user can log in to the pod and work interactively; otherwise the pod would complete as soon as the command finished.
apiVersion: v1
kind: Pod
metadata:
  name: hpu-pod-interactive
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-container
      image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1:latest
      command: ["/bin/sh", "-ec", "sleep 1000"]
      resources:
        limits:
          habana.ai/gaudi: 1
You can use the kubectl 'create' or 'apply' command. Both commands will create a Pod based on the YAML configuration file in your designated namespace. (A Kubernetes namespace is a virtual cluster and can be thought of as analogous to an ACCESS allocation.) In the hpu-interactive-test.yaml file we request a command that keeps the container running for 1000 seconds (using sleep) so that we can log in to the pod; otherwise the pod would complete as soon as the command finished. The 'apply' command allows the pod's configuration to be modified while it is running, which is useful for debugging.
$ kubectl apply -f hpu-interactive-test.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hpu-pod-interactive 1/1 Running 0 12s
$ kubectl exec --stdin --tty hpu-pod-interactive -- /bin/bash
To exit the interactive session before the sleep command completes:
hpu-pod-interactive:/# exit
Note: The short options -i and -t are the same as the long options --stdin and --tty
$ kubectl exec -it hpu-pod-interactive -- /bin/bash
Users can monitor pods and their status using the kubectl get command.
List the running Pods and their status:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hpu-pod-interactive 1/1 Running 0 12s
In this example, the output lists the pods in the user's default namespace that are currently active.
Check the logs for existing pods. Logs are deleted when the pod is deleted.
# check logs
$ kubectl logs hpu-test-pod
Delete Pods when the run is complete and the logs are no longer needed:
$ kubectl delete pods tf-benchmarks-hostpath
pod "tf-benchmarks-hostpath" deleted
All of Expanse's NFS and Lustre filesystems are accessible via the Globus endpoint xsede#expanse. The following table shows the mount points on the data mover nodes (which are the backend for xsede#expanse).
Machine | Location on machine | Location on Globus/Data Movers |
---|---|---|
Expanse | /home/$USER | /expanse/home/$USER |
Expanse | /expanse/lustre/projects | /expanse/lustre/projects/ |
Expanse | /expanse/lustre/scratch | /expanse/lustre/scratch/... |
Voyager | /voyager/projects | /voyager/projects |
Users are responsible for backing up all important data to protect against data loss at SDSC.
Voyager provides several storage options. Many of them must be mounted into your pods via the YAML file for the job to be able to interact with them.
volumes:
  - name: home
    hostPath:
      path: /home/username
      type: Directory
  - name: scratch
    emptyDir: {}
  - name: ceph
    hostPath:
      path: /voyager/ceph/users/username
      type: Directory
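Defining a volume alone does not make it visible inside the container; the container spec must also reference it under volumeMounts. A minimal sketch, using the volume names from the snippet above and illustrative mount paths:
containers:
  - name: gaudi-container
    image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1:latest
    volumeMounts:
      - name: home
        mountPath: /home/username                 # illustrative mount path
      - name: scratch
        mountPath: /scratch                       # illustrative mount path
      - name: ceph
        mountPath: /voyager/ceph/users/username   # illustrative mount path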
The compute nodes on Voyager have access to fast flash storage. The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution. The scratch directory will need to be mounted as an emptyDir, which indicates that the volume is created when the Pod is assigned to a node and exists only while the Pod is running on that node. Make sure to copy any important data off of scratch before deleting the pod (see the example after the snippet below).
volumes:
  - name: scratch
    emptyDir: {}
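One way to copy results off the scratch emptyDir before deleting the pod is kubectl cp; the pod name and paths below are only placeholders:
$ kubectl cp hpu-pod-interactive:/scratch/results ./results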
Partition | Space Available |
---|---|
First-Generation Habana Inference | 3.2 TB |
Gaudi | 6.4 TB |
Compute | 3.2 TB |
In addition to the local scratch storage, users have access to a global parallel filesystem on Ceph. Every Voyager node has access to a 3 PB Ceph parallel file system with 140 GB/s of performance. SDSC limits the number of files that can be stored in the /voyager/ceph filesystem to 2 million files per user. The SDSC Ceph file system (/voyager/ceph/user/$USER) IS NOT an archival file system. The SDSC Voyager Ceph file system IS NOT backed up. Users whose workflows require extensive small I/O should contact support at consult@sdsc.edu for assistance, to avoid causing system issues associated with load on the metadata server.
The Ceph filesystem available on Voyager is:
/voyager/ceph/user/username
Users will need to mount a directory volume inside their YAML file. This allows an existing directory to be mounted into the Pod. The directory is preserved, and the volume is simply unmounted, when the pod is deleted. A directory volume can be pre-populated with data, and that data can be shared between pods. The directory volume can be mounted by multiple writers simultaneously. To mount the Ceph file system, include the following in your YAML:
volumes:
  - name: ceph
    hostPath:
      path: /voyager/ceph/users/username
      type: Directory
After logging in, users are placed in their home directory, /home, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Users have access to 200 GB in /home and should keep usage under that limit. The SDSC Voyager /home and /voyager/projects file systems ARE NOT backed up. Users can mount their home directories by including the following in their YAML:
volumes:
  - name: home
    hostPath:
      path: /home/username
      type: Directory
At this time Voyager also has an NFS-mounted project space with 153 TB available at:
/voyager/projects/project/username
volumes:
  - name: projects
    hostPath:
      path: /voyager/projects/project/username
      type: Directory
Voyager supports Habana containers and custom containers that include the Habana drivers. Please refer to the Installation Guide in the Gaudi Documentation on the Habana developer pages (habana.ai).
A containerized stack, including prebuilt Habana containers (TensorFlow, PyTorch) and customized containers, will be available on a local GitLab.
Request access at: GitLab