For our systems with AMD GPUs (Tioga, Tuo, RZAdams, El Capitan, Tenaya), we recommend using one of our pre-built torch wheels hosted on Nexus. To avoid known bugs in older versions, you must use PyTorch >= 2.7 on systems with AMD GPUs.

Please refer to this version matrix for PyTorch installation commands:

| PyTorch version | Python version | ROCm version | Installation command |
|---|---|---|---|
| 2.8.0 | 3.11 | 6.3.1 | `pip install torch==2.8.0a0+gitba56102.rocm631` |
| 2.8.0 | 3.11 | 6.4.2 | Coming soon! |

Installing PyTorch

Recommended steps:

  1. Load the python/3.11.5 module, for which PyTorch wheels have been built
  2. Create and activate a virtual environment based on this module
    1. Use `python3 -m venv <directory>` or `virtualenv <directory>`
    2. Do not use `--system-site-packages`
  3. Install one of the available PyTorch wheels listed above (of the form `<torch version>+<git hash>.rocm<rocm version>`) to your virtual environment

 module load python/3.11.5
 python3 -m venv mytorchenv
 source mytorchenv/bin/activate
 pip install torch==2.8.0a0+gitba56102.rocm631
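
To make the wheel naming scheme above concrete, the following is a minimal sketch that splits a version string of the form `<torch version>+<git hash>.rocm<rocm version>` into its parts. The function name `parse_wheel_version` and the regular expression are illustrative assumptions based only on the single example wheel listed above; they are not part of PyTorch or pip.

```python
import re

# Assumed pattern, inferred from the one wheel listed above:
#   <torch version>+git<hash>.rocm<rocm version digits>
WHEEL_RE = re.compile(
    r"^(?P<torch>[^+]+)\+git(?P<git_hash>[0-9a-f]+)\.rocm(?P<rocm>\d+)$"
)


def parse_wheel_version(version):
    """Split a wheel version string into torch version, git hash, and ROCm digits."""
    match = WHEEL_RE.match(version)
    if match is None:
        raise ValueError(f"not a recognized ROCm wheel version: {version!r}")
    # The ROCm digits are returned as written (e.g. "631" for ROCm 6.3.1),
    # since splitting them into dotted form would be a guess.
    return match.groupdict()


if __name__ == "__main__":
    print(parse_wheel_version("2.8.0a0+gitba56102.rocm631"))
```

This can help when scripting an install and checking that the requested wheel matches the ROCm version you intend to use.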

Verify the installation:

To check whether PyTorch is installed and whether GPUs are visible, run the following command from the command line:

python3 -c 'import torch ; print(torch.rand(5, 3)) ; print("Torch Version", torch.__version__) ; print("GPU available:", torch.cuda.is_available())'

which is equivalent to the following in the Python REPL:

>>> import torch; print(torch.rand(5, 3)); print("Torch Version", torch.__version__); print("GPU available:", torch.cuda.is_available())
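
For a slightly more thorough check, the one-liner can be expanded into a small script. This is a sketch, not an official test: the helper names `meets_minimum` and `check_install` are hypothetical, and the 2.7 threshold is taken from the requirement stated at the top of this page. It additionally prints `torch.version.hip`, which is `None` on builds without ROCm support.

```python
import re

MIN_TORCH = (2, 7)  # minimum PyTorch version required on AMD GPU systems


def meets_minimum(version, minimum=MIN_TORCH):
    """Compare the leading major.minor of a version string against `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_install():
    """Report the PyTorch version, ROCm build, and GPU visibility."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not importable in this environment")
        return False
    version_ok = meets_minimum(torch.__version__)
    print("Torch version:", torch.__version__, "(OK)" if version_ok else "(too old)")
    print("ROCm/HIP build:", torch.version.hip)  # None on non-ROCm builds
    print("GPU available:", torch.cuda.is_available())
    print("GPU count:", torch.cuda.device_count())
    return version_ok and torch.cuda.is_available()


if __name__ == "__main__":
    check_install()
```

Run it with `python3` from inside the activated virtual environment, on a node where GPUs are allocated to your job.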

Using PyTorch on multiple nodes

RCCL/OFI plug-in

When scaling PyTorch across multiple nodes via the Cray Slingshot network, achieving good performance requires a plugin that lets RCCL use the libfabric library. If you are using one of the recommended PyTorch wheels on Nexus, the plugin will be used by default.

Otherwise, versions of this plugin are located under `/collab/usr/global/tools/rccl`.
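
If you are not using one of the recommended wheels, one way to see whether a plugin directory is on your library search path is to inspect `LD_LIBRARY_PATH`. The sketch below assumes the plugin is picked up through `LD_LIBRARY_PATH`; the helper name `rccl_plugin_dirs` is hypothetical. An empty result does not prove the plugin is missing, since the recommended wheels use it by default.

```python
import os

# Location of the site-provided plugin builds, as stated above.
RCCL_PLUGIN_ROOT = "/collab/usr/global/tools/rccl"


def rccl_plugin_dirs(ld_library_path, root=RCCL_PLUGIN_ROOT):
    """Return the LD_LIBRARY_PATH entries that live under the plugin root."""
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [e for e in entries if e == root or e.startswith(root + "/")]


if __name__ == "__main__":
    found = rccl_plugin_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("Plugin directories on LD_LIBRARY_PATH:")
        for entry in found:
            print("  ", entry)
    else:
        print("No entries under", RCCL_PLUGIN_ROOT, "found on LD_LIBRARY_PATH")
```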

MPI4Py users

We recommend that MPI4Py users install one of the wheels we provide; those compatible with your Python version appear in the output of `pip index versions --pre mpi4py` with a version string that includes `dev0`.

For example,

pip install mpi4py==4.1.0.dev0+mpich.8.1.32

Spindle

Coming soon!

Using PyTorch from within a Jupyter notebook

Please follow the "Orbit and Jupyter notebooks" documentation to create a Jupyter kernel from your Python virtual environment. In particular, after creating your virtual environment as described above, you will need to:

  1. Install `ipykernel` to your virtual environment
  2. Install your custom kernel to `~/.local`
  3. Manually update `LD_LIBRARY_PATH` in `kernel.json`

pip install ipykernel
python3 -m ipykernel install --prefix=$HOME/.local --name 'mytorchenv' --display-name 'mytorchenv'
echo $LD_LIBRARY_PATH

Use the output of `echo $LD_LIBRARY_PATH` to update `$HOME/.local/share/jupyter/kernels/<yourKernelName>/kernel.json` as shown in the "Custom Kernel ENV" section of Orbit and Jupyter notebooks. Your definition for "env" in kernel.json might look like this:

 "env": {
  "LD_LIBRARY_PATH": "/collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-6.3.1/install/lib:/opt/cray/pe/lib64:/opt/cray/lib64:/opt/cray/pe/papi/7.2.0.2/lib64:/opt/cray/libfabric/2.1/lib64:${LD_LIBRARY_PATH}" 
 },
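
The manual edit of `kernel.json` could also be scripted. The following is a minimal sketch, assuming the kernel was installed under the name `mytorchenv` as in the commands above and that the file lives at the path given earlier; the function names `with_ld_library_path` and `update_kernel_json` are hypothetical. It copies the current value of `LD_LIBRARY_PATH` from the shell it is run in, so run it from the environment whose library path you want the kernel to use.

```python
import json
import os
from pathlib import Path


def with_ld_library_path(spec, ld_library_path):
    """Return a copy of a kernel spec dict with env.LD_LIBRARY_PATH set."""
    updated = dict(spec)
    env = dict(spec.get("env", {}))  # keep any other env entries
    env["LD_LIBRARY_PATH"] = ld_library_path
    updated["env"] = env
    return updated


def update_kernel_json(kernel_json, ld_library_path):
    """Rewrite kernel.json in place with the given LD_LIBRARY_PATH."""
    path = Path(kernel_json)
    spec = json.loads(path.read_text())
    path.write_text(json.dumps(with_ld_library_path(spec, ld_library_path), indent=1) + "\n")


if __name__ == "__main__":
    kernel_name = "mytorchenv"  # assumed: the --name used with ipykernel install
    kernel_json = Path.home() / ".local/share/jupyter/kernels" / kernel_name / "kernel.json"
    current = os.environ.get("LD_LIBRARY_PATH")
    if kernel_json.exists() and current:
        update_kernel_json(kernel_json, current)
        print("Updated", kernel_json)
    else:
        print("Nothing updated: kernel.json missing or LD_LIBRARY_PATH unset")
```

Check the resulting file against the "Custom Kernel ENV" section of the Orbit and Jupyter notebooks documentation before restarting the kernel.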