For our systems using AMD GPUs (Tioga, Tuo, RZAdams, El Capitan, Tenaya), we recommend that you use one of our pre-built PyTorch wheels, shown here on Nexus. To avoid known bugs in older versions, you must use PyTorch >= 2.7 on systems with AMD GPUs.
Please refer to this version matrix for PyTorch installation commands:
| PyTorch version | Python version | ROCm version | Installation command |
|---|---|---|---|
| 2.8.0 | 3.11 | 6.3.1 | `pip install torch==2.8.0a0+gitba56102.rocm631` |
| 2.8.0 | 3.11 | 6.4.2 | Coming soon! |
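To see which torch wheels are currently published, you can query the package index directly with `pip index versions --pre torch` (assuming your pip configuration points at the Nexus index); the available builds appear with version strings of the form shown in the table above.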
Installing PyTorch
Recommended steps:
- Load the python/3.11.5 module, for which PyTorch wheels have been built
- Create and activate a virtual environment based on this module
- Use `python3 -m venv <directory>` or `virtualenv <directory>`
- Do not use `--system-site-packages`
- Install one of the available PyTorch wheels listed above (of the form `<torch version>+<git hash>.rocm<rocm version>`) to your virtual environment
module load python/3.11.5
python3 -m venv mytorchenv
source mytorchenv/bin/activate
pip install torch==2.8.0a0+gitba56102.rocm631
Test that the installation worked:
To check whether PyTorch is installed and whether GPUs are visible, run the following from the command line:
python3 -c 'import torch ; print(torch.rand(5, 3)) ; print("Torch Version", torch.__version__) ; print("GPU available:", torch.cuda.is_available())'
which is equivalent to the following in the Python REPL:
>>> import torch; print(torch.rand(5, 3)); print("Torch Version", torch.__version__); print("GPU available:", torch.cuda.is_available())
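For a slightly more thorough check (a minimal sketch; run it from a node with at least one GPU allocated), the following script also reports the device count and runs a small operation on the GPU:

import torch

print("Torch Version", torch.__version__)
print("GPU available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

if torch.cuda.is_available():
    # On ROCm builds of PyTorch, the torch.cuda API drives the AMD GPUs.
    print("Device 0:", torch.cuda.get_device_name(0))
    x = torch.rand(5, 3, device="cuda")
    print(x @ x.T)  # simple matrix product executed on the GPU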
Using PyTorch on multiple nodes
RCCL/OFI plug-in
When scaling PyTorch across multiple nodes over the Cray Slingshot network, good performance requires a plugin that lets RCCL communicate through the libfabric library. If you are using one of the recommended PyTorch wheels from Nexus, this plugin is used by default.
Otherwise, versions of this plugin are located under /collab/usr/global/tools/rccl.
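For reference, the sketch below is a minimal multi-node smoke test, assuming you launch one process per GPU with a launcher such as torchrun that sets the standard distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); on ROCm builds of PyTorch, the "nccl" backend name maps to RCCL:

import os
import torch
import torch.distributed as dist

def main():
    # The launcher provides RANK, LOCAL_RANK, and WORLD_SIZE via the environment.
    dist.init_process_group(backend="nccl")  # "nccl" uses RCCL on ROCm builds
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Simple all-reduce to confirm communication across ranks and nodes.
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}: all-reduce sum = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If you are not using one of the recommended wheels, you will likely also need LD_LIBRARY_PATH to include one of the plugin installs under /collab/usr/global/tools/rccl so RCCL can locate it at run time (see the kernel.json example at the bottom of this page for the paths involved).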
MPI4Py users
We recommend that mpi4py users install one of the wheels we provide here; wheels compatible with your Python version appear with a version string containing `dev0` in the output of `pip index versions --pre mpi4py`.
For example,
pip install mpi4py==4.1.0.dev0+mpich.8.1.32
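As a quick sanity check (a minimal sketch; run it with multiple tasks under whatever MPI launcher you normally use on the system), the following prints one line per rank:

from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")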
Spindle
Coming soon!
Using PyTorch from within a Jupyter notebook
Please follow the Orbit and Jupyter notebooks documentation to create a Jupyter kernel from your Python virtual environment. In particular, after creating your virtual environment as described above, you will need to:
- Install `ipykernel` to your virtual environment
- Install your custom kernel to `~/.local`
- Manually update LD_LIBRARY_PATH in `kernel.json`.
pip install ipykernel
python3 -m ipykernel install --prefix=$HOME/.local --name 'mytorchenv' --display-name 'mytorchenv'
echo $LD_LIBRARY_PATH
Use the output of `echo $LD_LIBRARY_PATH` to update `$HOME/.local/share/jupyter/kernels/<yourKernelName>/kernel.json` as shown in the "Custom Kernel ENV" section of Orbit and Jupyter notebooks. Your definition for "env" in kernel.json might look like this:
"env": { "LD_LIBRARY_PATH": "/collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-6.3.1/install/lib:/opt/cray/pe/lib64:/opt/cray/lib64:/opt/cray/pe/papi/7.2.0.2/lib64:/opt/cray/libfabric/2.1/lib64:${LD_LIBRARY_PATH}" },