User Guide
Technical Summary
Voyager is an Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) innovative AI system designed specifically for science and engineering research at scale. It supports research that increasingly depends on artificial intelligence and deep learning as a critical element of experimental and/or computational work. Voyager features Habana Gaudi training processors and first-generation Habana inference processors, along with a high-performance, low-latency 400 gigabit-per-second interconnect from Arista. It gives researchers the ability to work with extremely large data sets using standard AI tools, such as TensorFlow and PyTorch, or to develop their own deep learning models using developer tools and libraries from Habana Labs.
Voyager is an NSF-funded system developed in collaboration with Supermicro and Intel’s Habana Labs and operated by the San Diego Supercomputer Center at UC San Diego. It began a 3-year testbed phase in early 2022.
Resource Allocation Policies
- Current Status: Testbed Phase
- The 3-year testbed phase will be available to select focused projects, as well as workshops and industry interactions.
- The testbed phase will be followed by a 2-year allocation phase open to the broader NSF community, along with user workshops.
- To get access to Voyager, please send a request to HPC Consulting.
Job Scheduling Policies
- Currently no policies set
- Kubernetes for job scheduling
Technical Details
System Component | Configuration |
Supermicro X12 Gaudi Training Nodes | |
CPU Type | Intel Xeon Gold 6336 |
Habana Gaudi processors | 336 |
Nodes | 42 |
Training processors/Node | 8 |
Host x86 processors/node | 2 |
Sockets | 2 |
Memory capacity | * 512 GB DDR4 DRAM |
Memory/training processor | 32 GB HBM2 |
Local Storage | 6.4 TB local NVMe |
Max CPU Memory bandwidth | ** GB/s |
Intel First Generation Habana Inference Nodes | |
CPU Type | Xeon Gold 6240 |
First-Generation Habana Inference Processors | 16 |
Nodes | 2 |
First-Generation Habana Inference Cards/node | 8 |
Cores/socket | 20 |
Sockets | 2 |
Clock speed | 2.5 GHz |
Flop speed | 34.4 TFlop/s |
Memory capacity | *384 GB DDR4 DRAM |
Local Storage | 1.6 TB Samsung PM1745b NVMe PCIe SSD |
Max CPU Memory bandwidth | 281.6 GB/s |
Standard Compute Nodes | |
CPU Type | Intel x86 |
Nodes | 36 |
x86 processors/node | 2 |
Memory Capacity | 384 GB |
Local NVMe | 3.2 TB |
Interconnect | |
Topology | Full bi-section bandwidth switch |
Per Node bandwidth | 6*400 Gb/s (bidirectional) |
DISK I/O Subsystem | |
File Systems | Ceph |
Ceph Storage | 1 PB |
Systems Software Environment
Software Function | Description |
Cluster Management | Bright Cluster Manager |
Operating System | Ubuntu 20.04 LTS |
File Systems | Ceph |
Scheduler and Resource Manager | Kubernetes |
User Environment | Lmod, Containers |
System Access
Logging in to Voyager
Voyager uses SSH key pairs for access. Approved users must provide their SSH public key to gain access to the system.
To log in to Voyager from the command line, use the hostname:
The following are examples of Secure Shell (ssh) commands that may be used to log in to Voyager:
ssh <your_username>
ssh -l <your_username>
Notes and hints
- Voyager does not maintain local passwords. Your public key must be appended to your ~/.ssh/authorized_keys file to enable access from authorized hosts. We accept RSA, ECDSA, and ed25519 keys. Make sure you have a strong passphrase on the private key on your local machine.
- You can use ssh-agent or keychain to avoid repeatedly typing the private key passphrase.
- Hosts that connect to SSH more frequently than ten times per minute may be blocked for a short period of time.
- Do not use the login node for computationally intensive processes, as a host for running workflow management tools, as a primary data transfer node for large or numerous data transfers, or as a server providing other services accessible to the Internet. The login nodes are meant for file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be run using Kubernetes.
Voyager will feature a machine learning (ML) platform to help manage, build, and automate ML workflows. It will provide a quick and easy way for Voyager users to collaborate, integrate, manage files, and submit and monitor jobs.
Adding Users to a Project
Approved Voyager project PIs and co-PIs can add users (accounts) to, or remove them from, a Voyager project. Please submit a support ticket to add or remove users.
Environment Modules provide for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.
Voyager uses Lmod, a Lua-based module system. Users will ONLY need the kubernetes module, which is loaded by default.
Running Jobs
Voyager runs Kubernetes. Kubernetes is an open-source platform for managing containerized workloads and services. A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic Kubernetes commands and examples of running jobs are provided below.
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
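For orientation, the general layout of a kubectl config file is sketched here. This is only an illustration of the standard kubeconfig structure: the cluster name, server address, namespace, user name, and token are all hypothetical placeholders, and on a managed system this file is typically provisioned for you rather than written by hand.

```yaml
# Hypothetical kubeconfig layout; every name and the server URL are placeholders.
apiVersion: v1
kind: Config
clusters:
  - name: example-cluster
    cluster:
      server: https://kube-api.example.org:6443   # placeholder API server address
contexts:
  - name: example-context
    context:
      cluster: example-cluster
      namespace: example-project   # default namespace kubectl uses for this context
      user: example-user
current-context: example-context   # the context kubectl selects by default
users:
  - name: example-user
    user:
      token: REPLACE_ME            # placeholder credential, not a real token
```

The namespace entry under the context is what lets commands such as kubectl get pods run without naming a namespace on every call.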
Set up Kubectl environment
After logging in, set up the kubectl environment by loading the kubernetes module:
$vgr-1-20:~$ module load kubernetes/voyager
Review your currently loaded modules:
$vgr-1-20:~$ module list
Currently Loaded Modules:
Usage Guidelines
There are currently no limits set on Voyager resources. The limits for each partition noted in the table below are the maximum available. Resource limits will be set based on Early User Period evaluation.
Resource Name | Max Walltime | Max Nodes/Job | Nodes | Notes |
First-Generation Habana Inference | 48 hrs | 2 | vgr-10-01, vgr-10-02 | Inference |
Gaudi | 48 hrs | 42 | vgr-[2-4]-[01-06],vgr-[6-9]-[01-06] | Training |
Compute | 48 hrs | 36 | vgr-10-[02-38] | Preprocessing/Postprocessing |
Kubernetes Objects
Kubernetes objects are persistent entities in the Kubernetes system. Objects describe:
- The containerized applications to run (the software stack is in the container).
- The resources needed by the applications, including CPUs, memory, Gaudi processors, first-generation Habana inference processors, etc.
- The policies that control how the application behaves.
The most commonly used objects on Voyager are pods and jobs.
kind: pod
- A pod is the smallest deployable unit of computing. It includes resources, containers, storage, and run policies.
kind: job
- A job creates one or more pods and will continue to retry execution of the pods until a specified number of them successfully terminate. In the event of a node failure, pods that are managed by a Job will be rescheduled on other nodes. In addition, Jobs allow users to run multiple instances using the completions and parallelism features.
kind: MPIJob
- An MPIJob is currently required for running multi-node jobs.
Other k8s objects are available but not recommended for routine application runs on Voyager: Deployment, ReplicaSet, DaemonSet, CronJob, ReplicationController.
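As an illustration of the completions and parallelism features of a Job, a minimal manifest might look like the following sketch. The Job name, the image placeholder, and the habana.ai/gaudi resource key are assumptions that do not come from this guide; take the real values from the site-provided sample files.

```yaml
# Minimal Job sketch. The image and the habana.ai/gaudi resource key are
# assumptions; substitute the values used in the site-provided examples.
apiVersion: batch/v1
kind: Job
metadata:
  name: hpu-example-job            # hypothetical name
spec:
  completions: 4                   # total pods that must finish successfully
  parallelism: 2                   # pods allowed to run at the same time
  backoffLimit: 3                  # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gaudi-container
          image: <container-image> # placeholder, not a real image reference
          command: ["hl-smi"]
          resources:
            limits:
              habana.ai/gaudi: 1   # assumed device-plugin resource name
```

With these settings Kubernetes would run two pods at a time until four have completed successfully.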
Creating YAML files and running Jobs Using Kubernetes (k8s)
YAML (a recursive acronym for "YAML Ain't Markup Language", originally "Yet Another Markup Language") is a human-readable data serialization language that works with all programming languages and is often used for configuration files. YAML uses a colon-centered syntax to express key-value pairs, and indentation to express nesting. The officially recommended filename extension for YAML files is .yaml.
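The short sample below shows the basic YAML building blocks that appear in Kubernetes files. The keys and values are made up for illustration and do not form a valid Kubernetes manifest.

```yaml
# Illustrative only: these keys are invented and are not a Kubernetes manifest.
experiment: resnet50            # a key-value pair (string value)
epochs: 10                      # a key-value pair (integer value)
optimizer:                      # a nested mapping; children are indented with spaces
  name: sgd
  learning_rate: 0.1
datasets:                       # a sequence; each item starts with "- "
  - train
  - validation
command: ["python", "train.py"] # the same kind of sequence in inline (flow) style
```

Indentation must use spaces, not tabs, and items at the same nesting level must line up exactly.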
Example YAML file and available containers
SDSC User Services staff have developed sample run scripts for common applications available on Voyager in directory:
YAML formatting is very sensitive to indentation and whitespace, so please do not copy and paste examples from this user guide.
Using Gaudi
Basic Gaudi YAML
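# This pod requests 1 Gaudi processor, 1 CPU core, and 32 GB of memory, and runs hl-smi.
apiVersion: v1
kind: Pod
metadata:
  name: hpu-test-pod
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-container
      image: <container-image>   # image reference not shown in this guide
      command: ["hl-smi"]
      resources:
        limits:
          habana.ai/gaudi: 1     # resource key assumed; it is missing from this guide
          hugepages-2Mi: 3800Mi
          memory: 32G
          cpu: 1
        requests:
          memory: 32G
          cpu: 1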
Using First-Generation Habana Inference Processors
Basic First-Generation Habana Inference Processor YAML
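apiVersion: v1
kind: Pod
metadata:
  name: goya-example
spec:
  restartPolicy: Never
  nodeSelector: 'goya'           # node label key not shown in this guide
  containers:
    - name: goya-container
      image: <container-image>   # image reference not shown in this guide
      command: ["hl-smi"]
      resources:
        limits:
          cpu: 4
          memory: 16Gi
          hugepages-2Mi: 500Mi
          habana.ai/goya: 1      # resource key assumed; it is missing from this guide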
Using Compute Nodes
Basic compute YAML
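apiVersion: v1
kind: Pod
metadata:
  name: compute-example
spec:
  restartPolicy: Never
  serviceAccountName: username
  nodeSelector: 'compute'        # node label key not shown in this guide
  containers:
    - name: hpl-2-3-ubuntu-20-04-openmpi-4-0-5-openblas-0-3-14
      image: <container-image>   # image reference not shown in this guide
      resources:
        requests:
          cpu: 52
          memory: 368Gi
        limits:
          cpu: 104
          memory: 371Gi
      command: ["/bin/bash", "-c"]
      args:
        - >-
          free -h;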
Requesting Interactive access to pods
Sample Interactive YAML
The following YAML file requests a pod named hpu-pod-interactive. It sets up a Pod with 1 HPU using the container image specified in the file. The command keeps the container running for 1000 seconds (using sleep) so that a user can log in to the pod and work interactively; otherwise the pod would complete as soon as the command finishes.
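apiVersion: v1
kind: Pod
metadata:
  name: hpu-pod-interactive
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-container
      image: <container-image>   # image reference not shown in this guide
      command: ["/bin/sh", "-ec", "sleep 1000"]
      resources:
        limits:
          habana.ai/gaudi: 1     # resource key assumed; it is missing from this guide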
To activate the pod:
You can use either the kubectl 'create' or 'apply' command. Both create a Pod in your designated namespace based on the YAML configuration file. (A Kubernetes namespace is a virtual cluster and can be thought of as similar to an ACCESS allocation.) The 'apply' command also allows the pod to be modified while it is running, which helps with debugging.
$ kubectl apply -f hpu-interactive-test.yaml
To review pods and their status:
$ kubectl get pods
hpu-pod-interactive 1/1 Running 0 12s
To get shell access:
$ kubectl exec --stdin --tty hpu-pod-interactive -- /bin/bash
To exit the interactive session prior to end of session:
$ hpu-pod-interactive:/# exit
Note: The short options -i and -t are equivalent to the long options --stdin and --tty:
$ kubectl exec -it hpu-pod-interactive -- /bin/bash
Job Monitoring and Management
Users can monitor pods and their status using the kubectl get command.
List the running Pods and their status:
$ kubectl get pods
hpu-pod-interactive 1/1 Running 0 12s
In this example, the output lists the pods in the user's default namespace that are currently active.
Check logs for existing pods. Logs are deleted when the pod is deleted.
# check logs
$ kubectl logs hpu-test-pod
Delete pods when the run is complete and the logs are no longer needed:
$ kubectl delete pods tf-benchmarks
pod "tf-benchmarks" deleted
Data Movement
Globus Endpoints, Data Movers and Mount Points (** Coming Soon)
All of Expanse's NFS and Lustre filesystems are accessible via the Globus endpoint xsede#expanse. The following table shows the mount points on the data mover nodes (that are the backend for xsede#expanse).
Machine | Location on machine | Location on Globus/Data Movers |
Expanse | /home/$USER | /expanse/home/$USER |
Expanse | /expanse/lustre/projects | /expanse/lustre/projects/ |
Expanse | /expanse/lustre/scratch | /expanse/lustre/scratch/... |
** Voyager | /voyager/projects | /voyager/projects |
Users are responsible for backing up all important data to protect against data loss at SDSC.
Voyager provides several storage options. Many of them must be mounted into your pods via the YAML file before the job can interact with them.
volumes:
  - name: home
    hostPath:
      path: /home/username
      type: Directory
  - name: scratch
    emptyDir: {}
  - name: ceph
    hostPath:
      path: /voyager/ceph/users/username
      type: Directory
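A volume declared for a Pod only becomes visible inside a container when it is paired with a volumeMounts entry of the same name. The sketch below shows one way the three volumes could be wired into a Pod. It assumes the directory volumes are Kubernetes hostPath volumes (inferred from the path and type: Directory fields); the pod name, image placeholder, and mountPath values are hypothetical.

```yaml
# Sketch of a Pod that mounts home, scratch, and Ceph. The image and all
# mountPath values are assumptions chosen for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: storage-example            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: storage-container
      image: <container-image>     # placeholder, not a real image reference
      command: ["/bin/sh", "-c", "df -h /home/username /scratch /ceph"]
      volumeMounts:
        - name: home               # must match a name under volumes
          mountPath: /home/username
        - name: scratch
          mountPath: /scratch
        - name: ceph
          mountPath: /ceph
  volumes:
    - name: home
      hostPath:                    # assumed volume type for existing directories
        path: /home/username
        type: Directory
    - name: scratch
      emptyDir: {}
    - name: ceph
      hostPath:
        path: /voyager/ceph/users/username
        type: Directory
```

The mountPath is the location seen inside the container and does not have to match the path on the node.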
Local Scratch Disk
The compute nodes on Voyager have access to fast flash storage. The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution. The scratch directory must be mounted as an emptyDir, which indicates that the volume is created when the Pod is assigned to a node and exists only while the Pod is running on that node. Make sure to copy any important data off scratch before removing the pod.
volumes:
  - name: scratch
    emptyDir: {}
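Because an emptyDir volume disappears with the Pod, a common pattern is to write intermediate files to scratch and copy the results to persistent storage as the last step of the command. The following is a sketch of that pattern only: the pod name, image placeholder, mount paths, and file names are hypothetical, and the Ceph volume is assumed to be a hostPath volume.

```yaml
# Sketch: write to node-local scratch, then copy results to Ceph before the
# Pod exits. Image, mount paths, and file names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-example
spec:
  restartPolicy: Never
  containers:
    - name: scratch-container
      image: <container-image>     # placeholder, not a real image reference
      command: ["/bin/sh", "-c"]
      args:
        - >-
          echo "checkpoint data" > /scratch/checkpoint.txt &&
          cp /scratch/checkpoint.txt /ceph/checkpoint.txt
      volumeMounts:
        - name: scratch
          mountPath: /scratch
        - name: ceph
          mountPath: /ceph
  volumes:
    - name: scratch
      emptyDir: {}                 # created on the node when the Pod is scheduled
    - name: ceph
      hostPath:                    # assumed volume type for the Ceph directory
        path: /voyager/ceph/users/username
        type: Directory
```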
Partition | Space Available |
First-Generation Habana Inference | 3.2 TB |
Gaudi | 6.4 TB |
Compute | 3.2 TB |
Parallel Filesystems
In addition to the local scratch storage, users have access to a global parallel filesystem on Ceph. Every Voyager node has access to a 3 PB Ceph parallel file system with 140 GB/second performance storage. SDSC limits the number of files that can be stored in the /voyager/ceph filesystem to 2 million files per user. The SDSC Ceph file system (/voyager/ceph/user/$USER) IS NOT an archival file system. The SDSC Voyager Ceph file system IS NOT backed up. Users whose workflows require extensive small I/O should contact support for assistance, to avoid causing system issues associated with load on the metadata server.
The Ceph filesystem available on Voyager is:
- Ceph Voyager filesystem:
Users will need to mount a directory volume inside their YAML file. This allows an existing directory to be mounted into the Pod. When the pod is deleted, the volume is unmounted and the directory is preserved. A directory volume can be pre-populated with data, and that data can be shared between pods. The directory volume can be mounted by multiple writers simultaneously. To mount the Ceph file system, include the following in your YAML file:
volumes:
  - name: ceph
    hostPath:
      path: /voyager/ceph/users/username
      type: Directory
Home and Project File System
After logging in, users are placed in their home directory under /home, also referenced by the environment variable $HOME. The home directory is limited in space and should be used only for source code storage. Each user has access to 200 GB in /home and should keep usage of $HOME under that limit. The SDSC Voyager /home and /voyager/projects file systems ARE NOT backed up. Users can mount their home directories with the following:
volumes:
  - name: home
    hostPath:
      path: /home/username
      type: Directory
At this time, Voyager also has an NFS-mounted project space with 153 TB available at:
- Voyager project filesystem:
volumes:
  - name: projects
    hostPath:
      path: /voyager/projects/project/username
      type: Directory
Voyager supports Habana containers and custom containers that include Habana drivers. For instructions, please refer to the Installation Guide in the Gaudi Documentation (version 1.3.0) on the Habana developer page.
A containerized stack, including prebuilt Habana containers (TensorFlow, PyTorch) and customized containers, will be available on a local GitLab.
Request access at: GitLab
Citations & Publications
How to cite Voyager
We request that you cite your use of Voyager with the following citation format, modified as needed to conform with citation style guidelines. Most importantly, please include the Digital Object Identifier (DOI) that is unique to Voyager.
San Diego Supercomputer Center (2025): Voyager. University of California San Diego. Service.
View a list of publications resulting from the use of Voyager.