User Guide

Technical Summary

Voyager is an innovative AI system in the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, designed specifically for science and engineering research at scale. It supports research that increasingly depends on artificial intelligence and deep learning as a critical element of the experimental or computational work. Featuring Habana Gaudi training processors and first-generation Habana inference processors, along with a high-performance, low-latency 400 gigabit-per-second interconnect from Arista, Voyager gives researchers the ability to work with extremely large data sets using standard AI tools, such as TensorFlow and PyTorch, or to develop their own deep learning models using developer tools and libraries from Habana Labs.

Voyager is an NSF-funded system developed in collaboration with Supermicro and Intel's Habana Labs and operated by the San Diego Supercomputer Center at UC San Diego. It began a 3-year testbed phase in early 2022.

Resource Allocation Policies

Current Status: Testbed Phase
The 3-year testbed phase is available to select focused projects, as well as workshops and industry interactions.
The testbed phase will be followed by a 2-year allocation phase open to the broader NSF community, along with user workshops.
To get access to Voyager, please send a request to HPC Consulting.

Job Scheduling Policies

No job scheduling policies are currently set.
Kubernetes is used for job scheduling.

Technical Details

System Component	Configuration
Supermicro X12 Gaudi Training Nodes
CPU Type	Intel Xeon Gold 6336
Habana Gaudi processors	336
Nodes	42
Training processors/Node	8
Host x86 processors/node	2
Sockets	2
Memory capacity	512 GB DDR4 DRAM
Memory/training processor	32 GB HBM2
Local Storage	6.4 TB local NVMe
Max CPU Memory bandwidth	** GB/s
Intel First Generation Habana Inference Nodes
CPU Type	Xeon Gold 6240
First-Generation Habana Inference Processors	16
Nodes	2
First-Generation Habana Inference Cards/node	8
Cores/socket	20
Sockets	2
Clock speed	2.5 GHz
Flop speed	34.4 TFlop/s
Memory capacity	384 GB DDR4 DRAM
Local Storage	1.6 TB Samsung PM1745b NVMe PCIe SSD
Max CPU Memory bandwidth	281.6 GB/s
Standard Compute Nodes
CPU Type	Intel x86
Nodes	36
x86 processors/node	2
Memory Capacity	384 GB
Local NVMe	3.2 TB
Interconnect
Topology	Full bi-section bandwidth switch
Per Node bandwidth	6*400 Gb/s (bidirectional)
DISK I/O Subsystem
File Systems	Ceph
Ceph Storage	1 PB

Systems Software Environment

Software Function	Description
Cluster Management	Bright Cluster Manager
Operating System	Ubuntu 20.04 LTS
File Systems	Ceph
Scheduler and Resource Manager	Kubernetes
User Environment	Lmod, Containers

System Access

Logging in to Voyager

Voyager uses ssh key pairs for access. Approved users will need to send their ssh public key to consult@sdsc.edu to gain access to the system.

To log in to Voyager from the command line, use the hostname:

login.voyager.sdsc.edu

The following are examples of Secure Shell (ssh) commands that may be used to log in to Voyager:

ssh <your_username>@login.voyager.sdsc.edu
ssh -l <your_username> login.voyager.sdsc.edu

Notes and hints

Voyager does not maintain local passwords; your public key must be appended to your ~/.ssh/authorized_keys file to enable access from authorized hosts. We accept RSA, ECDSA, and ed25519 keys. Make sure you have a strong passphrase on the private key on your local machine.
- You can use ssh-agent or keychain to avoid repeatedly typing the private key passphrase.
- Hosts that connect to SSH more frequently than ten times per minute may get blocked for a short period of time.
Do not use the login node for computationally intensive processes, as a host for running workflow management tools, as a primary data transfer node for large or numerous data transfers, or as a server providing other services accessible to the Internet. The login nodes are meant for file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be run using Kubernetes.
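
As a convenience, the login commands above can be captured in an SSH client configuration entry. The following is only a sketch: the host alias "voyager", the scratch file name voyager_ssh_config, and the key path ~/.ssh/id_ed25519 are assumptions chosen for illustration, and <your_username> must be replaced with your own Voyager username.

```shell
# Write a hypothetical SSH client config entry for Voyager to a scratch file.
# Assumptions: the alias "voyager" and the key path ~/.ssh/id_ed25519 are
# illustrative; replace <your_username> with your Voyager username.
cat > voyager_ssh_config <<'EOF'
Host voyager
    HostName login.voyager.sdsc.edu
    User <your_username>
    IdentityFile ~/.ssh/id_ed25519
    AddKeysToAgent yes
EOF

# Review the entry before appending it to ~/.ssh/config by hand.
cat voyager_ssh_config
```

Once the entry has been added to ~/.ssh/config, "ssh voyager" should behave like the longer commands shown earlier. With AddKeysToAgent set, a running ssh-agent caches the key after the first passphrase prompt.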

Voyager cnvrg.io

Voyager will feature cnvrg.io, a machine learning (ML) platform that helps manage, build, and automate ML workflows. cnvrg.io will provide a quick and easy way for Voyager users to collaborate, integrate, manage files, and submit and monitor jobs.

Adding Users to a Project

Approved Voyager project PIs and co-PIs can add users (accounts) to, or remove them from, a Voyager project. Please submit a support ticket to consult@sdsc.edu to add or remove users.

Modules

Environment Modules provide for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

Voyager uses Lmod, a Lua-based module system. Users will ONLY need the kubernetes module, which is loaded by default.

Running Jobs

Voyager runs Kubernetes, an open-source platform for managing containerized workloads and services. A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic Kubernetes commands and examples of running jobs are provided below.

The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.

Set up Kubectl environment

On login.voyager.sdsc.edu set up the kubectl environment by loading the kubernetes module:

$vgr-1-20:~$ module load kubernetes/voyager

Review your currently loaded modules:

$vgr-1-20:~$ module list
Currently Loaded Modules:

shared
dot
default-environment
DefaultModules
kubernetes/voyager/1.18.15
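
Because kubectl looks for its configuration in the $HOME/.kube directory, a quick check after loading the module can confirm that the file exists and show which context kubectl will use. The helper below is a minimal sketch; check_kubectl_setup is a hypothetical name and not a tool provided on Voyager.

```shell
# Report whether kubectl can find its configuration file.
# check_kubectl_setup is a hypothetical helper name, not a Voyager tool.
check_kubectl_setup() {
    # Default location that kubectl reads when no other config is specified.
    cfg="$HOME/.kube/config"
    if [ ! -f "$cfg" ]; then
        echo "kubectl config not found at $cfg" >&2
        return 1
    fi
    # Print the name of the context (cluster, user, namespace) in use.
    kubectl config current-context
}
```

Running check_kubectl_setup on the login node prints the current context name, or an error message if the configuration file is missing.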

Usage Guidelines

There are currently no limits set on Voyager resources. The limits for each partition noted in the table below are the maximum available. Resource limits will be set based on Early User Period evaluation.

Resource Name	Max Walltime	Max Nodes/Job	Nodes	Notes
First-Generation Habana Inference	48 hrs	2	vgr-10-01, vgr-10-02	Inference
Gaudi	48 hrs	42	vgr-[2-4]-[01-06],vgr-[6-9]-[01-06]	Training
Compute	48 hrs	36	vgr-10-[02-38]	Preprocessing/Postprocessing

Kubernetes Objects

Kubernetes objects are persistent entities in the Kubernetes system. Objects describe:

The containerized applications to run (the software stack is in the container).
The resources needed by the applications, including CPUs, memory, Gaudi processors, First-Generation Habana Inference processors, etc.
The policies that control how the application behaves.

The most commonly used objects on Voyager are Pods and Jobs.

kind: Pod

A Pod is the smallest deployable unit of computing. It includes resources, containers, storage, and run policies.

kind: Job

A Job creates one or more pods and will continue to retry execution of the pods until a specified number of them successfully terminate. In the event of a node failure, pods that are managed by a Job will be rescheduled on other nodes. In addition, Jobs allow users to run multiple instances using the completions and parallelism features.
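
To illustrate the completions and parallelism features, the following sketch writes a Job manifest that runs hl-smi three times with at most two pods active at once. The file name, the Job name hpu-test-job, and the completions, parallelism, and backoffLimit values are assumptions chosen for illustration; the container image and resource limits are taken from the basic Gaudi example in this guide.

```shell
# Write a sketch of a Job manifest; names and counts are illustrative only.
cat > hpu-test-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: hpu-test-job
spec:
  completions: 3      # total number of pods that must finish successfully
  parallelism: 2      # maximum number of pods running at the same time
  backoffLimit: 4     # retries allowed before the Job is marked as failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gaudi-container
        image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
        command: ["hl-smi"]
        resources:
          limits:
            habana.ai/gaudi: 1
            hugepages-2Mi: 3800Mi
            memory: 32G
            cpu: 1
EOF
```

The manifest can then be submitted with "kubectl apply -f hpu-test-job.yaml", and "kubectl get jobs" shows how many completions have succeeded so far.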

kind: MPIJob

An MPIJob is currently required for running multi-node jobs.

Other k8s objects are available but not recommended for routine application runs on Voyager: Deployment, ReplicaSet, DaemonSet, CronJob, ReplicationController.

Creating YAML Files and Running Jobs Using Kubernetes (k8s)

YAML (YAML Ain't Markup Language, originally Yet Another Markup Language) is a human-readable data serialization language for all programming languages, often used as a format for configuration files. YAML uses a colon-based syntax for expressing key-value pairs. The officially recommended filename extension for YAML files is .yaml.

Example YAML file and available containers

SDSC User Services staff have developed sample run scripts for common applications available on Voyager in directory:

/cm/shared/examples/sdsc

YAML formatting is very sensitive to indentation and whitespace, so please do not copy and paste examples from this user guide.

Using Gaudi

Basic Gaudi YAML

# This pod requests 1 Gaudi processor, 1 CPU, and 32G of memory, and runs hl-smi.

apiVersion: v1
kind: Pod
metadata:
        name: hpu-test-pod
spec:
        restartPolicy: Never
        containers:
        - name: gaudi-container
          image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
          command: ["hl-smi"]
          resources:
            limits:
              habana.ai/gaudi: 1
              hugepages-2Mi: 3800Mi
              memory: 32G
              cpu: 1
            requests:
              memory: 32G
              cpu: 1

Using First-Generation Habana Inference Processors

Basic First-Generation Habana Inference Processor YAML

apiVersion: v1
kind: Pod
metadata:
        name: goya-example
spec:
        restartPolicy: Never
        nodeSelector:
          brightcomputing.com/node-category: 'goya'
        containers:
        - name: goya-container
          image: ghcr.io/mkandes/naked-docker:habana-goya-0.9.15-31R
          command: ["hl-smi"]
          resources:
            limits:
              cpu: 4
              memory: 16Gi
              hugepages-2Mi: 500Mi
              habana.ai/goya: 1

Using Compute Nodes

Basic compute YAML

apiVersion: v1
kind: Pod
metadata:
  name: compute-example
spec:
      restartPolicy: Never
      serviceAccountName: username
      nodeSelector:
        brightcomputing.com/node-category: 'compute'
      containers:
      - name: hpl-2-3-ubuntu-20-04-openmpi-4-0-5-openblas-0-3-14
        image: ghcr.io/mkandes/naked-docker:hpl-2.3-ubuntu-20.04-openmpi-4.0.5-openblas-0.3.14
        resources:
          requests:
            cpu: 52
            memory: 368Gi
          limits:
            cpu: 104
            memory: 371Gi
        command: ["/bin/bash", "-c"]
        args:
        - >-
            lscpu;
            free -h;
            printenv;

Requesting Interactive access to pods

Sample Interactive YAML

The following YAML file requests a Pod named hpu-pod-interactive. It sets up a Pod with 1 HPU, using a PyTorch container from the Habana vault (vault.habana.ai). The command keeps the container running for 1000 seconds (using sleep) so that a user can log in to the pod and work interactively; otherwise the pod would complete as soon as the command finishes.

apiVersion: v1
kind: Pod
metadata:
        name: hpu-pod-interactive
spec:
        restartPolicy: Never
        containers:
        - name: gaudi-container
          image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1:latest
          command: ["/bin/sh", "-ec", "sleep 1000"]
          resources:
            limits:
              habana.ai/gaudi: 1

To activate the pod:

You can use either the kubectl 'create' or 'apply' command. Both create a Pod in your designated namespace based on the YAML configuration file. (A Kubernetes namespace is a virtual cluster and can be thought of as similar to an ACCESS allocation.) The hpu-interactive-test.yaml file asks for a command that keeps the container running for 1000 seconds (using sleep) so that you can log in to the pod; otherwise the pod would complete as soon as the command finishes. Unlike 'create', the 'apply' command can also update an existing object from a modified YAML file, which helps with debugging.

$ kubectl apply -f hpu-interactive-test.yaml

To review pods and their status:

$ kubectl get pods

NAME READY STATUS RESTARTS AGE
hpu-pod-interactive 1/1 Running 0 12s

To get shell access:

$ kubectl exec --stdin --tty hpu-pod-interactive -- /bin/bash

To exit the interactive session before the sleep command ends:

hpu-pod-interactive:/# exit

Note: The short options -i and -t are the same as the long options --stdin and --tty:

$ kubectl exec -it hpu-pod-interactive -- /bin/bash

Job Monitoring and Management

Users can monitor pods and their status using the kubectl get command.

List the running Pods and their status:

$ kubectl get pods

NAME READY STATUS RESTARTS AGE
hpu-pod-interactive 1/1 Running 0 12s

In this example, the output lists the currently active pods in the user's default namespace.

Check logs for existing pods. Logs are deleted when the pod is deleted.

# check logs
$ kubectl logs hpu-test-pod

Delete pods when the run is complete and the logs are no longer needed:

$ kubectl delete pods tf-benchmarks

pod "tf-benchmarks" deleted
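
The monitoring steps above can be combined into a small polling helper. This is a sketch under stated assumptions: pod_phase and wait_for_pod are hypothetical function names, and the 5-second polling interval is arbitrary. It relies only on the standard kubectl get (with a jsonpath output) and kubectl logs commands.

```shell
# pod_phase and wait_for_pod are hypothetical helper names for this sketch.

# Print the phase of a pod: Pending, Running, Succeeded, Failed, or Unknown.
pod_phase() {
    kubectl get pod "$1" -o jsonpath='{.status.phase}'
}

# Poll until the pod finishes, then print its logs.
# Logs must be collected before the pod is deleted.
wait_for_pod() {
    pod="$1"
    while :; do
        # Stop if the pod cannot be queried (for example, it does not exist).
        phase="$(pod_phase "$pod")" || return 1
        case "$phase" in
            Succeeded|Failed) break ;;
        esac
        sleep 5
    done
    echo "pod $pod finished with phase: $phase"
    kubectl logs "$pod"
}
```

For example, "wait_for_pod hpu-test-pod > hpu-test-pod.log" saves the final status and logs to a file, after which the pod can be deleted.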

Data Movement

Globus Endpoints, Data Movers and Mount Points (** Coming Soon)

All of Expanse's NFS and Lustre filesystems are accessible via the Globus endpoint xsede#expanse. The following table shows the mount points on the data mover nodes (that are the backend for xsede#expanse).

Globus Endpoints, Data Movers and Mount Points
Machine	Location on machine	Location on Globus/Data Movers
Expanse	`/home/$USER`	`/expanse/home/$USER`
Expanse	`/expanse/lustre/projects`	`/expanse/lustre/projects/`
Expanse	`/expanse/lustre/scratch`	`/expanse/lustre/scratch/...`
** Voyager	`/voyager/projects`	`/voyager/projects`

Storage

Overview

Users are responsible for backing up all important data to protect against data loss at SDSC.

Voyager provides several storage options. Many of them must be mounted into your pods through the YAML file so that the job can interact with them.

volumes:
- name: home
  hostPath:
    path: /home/username
    type: Directory
- name: scratch
  emptyDir: {}
- name: ceph
  hostPath:
    path: /voyager/ceph/users/username
    type: Directory
Local Scratch Disk

The compute nodes on Voyager have access to fast flash storage. The latency to the SSDs is several orders of magnitude lower than that for spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution. The scratch directory must be mounted as an emptyDir, which means the volume is created when the Pod is assigned to a node and exists only while the Pod is running on that node. Make sure to copy any important data off scratch before removing the pod.

volumes:
- name: scratch
  emptyDir: {}
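
Because an emptyDir volume disappears with the pod, a job that uses scratch typically copies its results to persistent storage as its last step. The following sketch shows that pattern on a compute node. The pod name scratch-example, the file name, the mount paths /scratch and /ceph, and the output.txt file are assumptions chosen for illustration; the image and nodeSelector come from the basic compute example in this guide.

```shell
# Write a sketch of a Pod that works in scratch, then copies results to Ceph.
# Names and mount paths are illustrative; replace "username" with your own.
cat > scratch-example.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: scratch-example
spec:
  restartPolicy: Never
  nodeSelector:
    brightcomputing.com/node-category: 'compute'
  containers:
  - name: compute-container
    image: ghcr.io/mkandes/naked-docker:hpl-2.3-ubuntu-20.04-openmpi-4.0.5-openblas-0.3.14
    command: ["/bin/bash", "-c"]
    args:
    - >-
        echo "result" > /scratch/output.txt;
        cp /scratch/output.txt /ceph/output.txt;
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: ceph
      mountPath: /ceph
  volumes:
  - name: scratch
    emptyDir: {}             # node-local, removed when the pod is removed
  - name: ceph
    hostPath:
      path: /voyager/ceph/users/username
      type: Directory
EOF
```

The copy step runs inside the container, so the result is already on the Ceph file system by the time the pod completes and can safely be deleted.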

Local Scratch Disk
Partition	Space Available
First-Generation Habana Inference	3.2 TB
Gaudi	6.4 TB
Compute	3.2 TB

Parallel Filesystems

In addition to the local scratch storage, users have access to a global parallel filesystem on Ceph. Every Voyager node has access to a 3 PB Ceph parallel file system with 140 GB/second performance. SDSC limits the number of files that can be stored in the /voyager/ceph filesystem to 2 million files per user. The SDSC Voyager Ceph file system (/voyager/ceph/user/$USER) IS NOT an archival file system and IS NOT backed up. Users whose workflows require extensive small I/O should contact support at consult@sdsc.edu for assistance, to avoid causing system issues associated with load on the metadata server.

The Ceph filesystem available on Voyager is:

Ceph Voyager filesystem: /voyager/ceph/user/username

Users need to define a directory volume in their YAML file. This allows an existing directory to be mounted into the Pod. When the pod is deleted, the volume is unmounted and the directory is preserved. A directory volume can be pre-populated with data, and that data can be shared between pods. The directory volume can be mounted by multiple writers simultaneously. To mount the Ceph file system, include the following in your YAML file:

volumes:
- name: ceph
  hostPath:
    path: /voyager/ceph/users/username
    type: Directory

Home and Project File System

After logging in, users are placed in their home directory under /home, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Each user has 200 GB available in /home and should keep usage of $HOME under that limit. The SDSC Voyager /home and /voyager/projects file systems ARE NOT backed up. Users can mount their home directories into a pod by including the following in the YAML file:

volumes:
- name: home
  hostPath:
    path: /home/username
    type: Directory

At this time Voyager also has an NFS-mounted project space with 153 TB available at:

Voyager project filesystem: /voyager/projects/project/username

volumes:
- name: projects
  hostPath:
    path: /voyager/projects/project/username
    type: Directory

Software

Voyager supports Habana containers and custom containers that include the Habana drivers. For developer information, please refer to the Installation Guide in the Gaudi Documentation 1.3.0 (habana.ai).

A containerized stack, including prebuilt Habana containers (TensorFlow, PyTorch) and customized containers, will be available on a local GitLab.

Request access at: GitLab

Citations & Publications

How to cite Voyager

We request that you cite your use of Voyager with the following citation format, modified as needed to conform with citation style guidelines. Most importantly, please include the Digital Object Identifier (DOI) that is unique to Voyager: https://dl.acm.org/doi/10.1145/3569951.3597597

Example

San Diego Supercomputer Center (2025): Voyager. University of California San Diego. Service. https://dl.acm.org/doi/10.1145/3569951.3597597

Publications

View a list of publications resulting from the use of Voyager.