Troubleshoot

Recommended Installation Options

We recommend the following conda flags when installing the AWS PyTorch conda package.

The --strict-channel-priority flag changes conda's default behavior: if a package that matches the user's specification is found in a higher-priority channel (say https://aws-ml-conda.s3.us-west-2.amazonaws.com), conda ignores packages of the same name in lower-priority channels (like “pytorch” or “conda-forge”). In our case, this tells conda to ignore the “pytorch” package from the pytorch and conda-forge channels, so pytorch is installed only from the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel, provided that channel contains a package that satisfies the user's specification.

The --override-channels flag overrides the user's local conda configuration and forces conda to consider only the channels specified on the command line, in the order given. This flag is also important because a pre-existing .condarc file may contain settings that contradict the required ones.

If a user doesn't want to use the --strict-channel-priority flag when installing the AWS distribution of PyTorch, they can instead specify part of the build string used in the conda package. For example, pytorch=1.12.1=aws* installs the AWS distribution of PyTorch, provided that the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel is configured correctly.

Note: Users who manage AWS credentials with IAM profiles may need to explicitly allowlist https://aws-ml-conda.s3.us-west-2.amazonaws.com in their IAM policy.

How can I find out what packages are available in the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel?

Run the command conda search --override-channels -c https://aws-ml-conda.s3.us-west-2.amazonaws.com; the output should be similar to the following.

# Name                       Version           Build  Channel
aws-ofi-nccl                   1.4.0             aws
pytorch                       1.12.1 aws_py3.8_cuda11.6_cudnn8.3.2_1
torchvision                   0.13.1      py38_cu116

Conda installed pytorch from the pytorch channel instead of https://aws-ml-conda.s3.us-west-2.amazonaws.com, what should I do?

Make sure you use both the --strict-channel-priority and --override-channels flags when installing conda packages, as in the example below.

conda install python=3.9 pytorch=1.13.1 pytorch-cuda=11.7 torchvision torchaudio \
    --strict-channel-priority --override-channels \
    -c s3://aws-ml-conda \
    -c pytorch \
    -c nvidia \
    -c conda-forge

Alternatively, you can specify the aws keyword in the build string of the installation command, as shown below; then you don't need the --strict-channel-priority flag.

conda install python=3.9 pytorch=1.13.1=aws* pytorch-cuda=11.7 torchvision torchaudio \
    --override-channels \
    -c s3://aws-ml-conda \
    -c pytorch \
    -c nvidia \
    -c conda-forge
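
To confirm which distribution actually ended up in the environment, one option is to inspect the Build column reported by conda list. The sketch below assumes that AWS builds carry a build string starting with aws, as in the conda search listing earlier in this section; the helper name is_aws_build is hypothetical and not part of conda.

```shell
# Hypothetical helper: read `conda list <pkg>` output on stdin and succeed
# only if the named package's Build column starts with "aws".
# Assumption: columns are Name, Version, Build, Channel, whitespace-separated.
is_aws_build() {
    awk -v pkg="$1" '
        $1 == pkg { found = 1; if ($3 ~ /^aws/) ok = 1 }
        END { exit (found && ok) ? 0 : 1 }
    '
}

# Usage (requires an activated conda environment):
#   conda list pytorch | is_aws_build pytorch && echo "AWS build" || echo "not the AWS build"
```

If the check fails, rerunning one of the install commands above with --override-channels is a reasonable next step.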

Troubleshooting Distributed Training

Environment variables

When you activate the conda environment, the following environment variable is pre-set for you to enable asynchronous error handling for distributed training with NCCL, as described in “[RFC] Asynchronous Error Handling for Distributed Training with NCCL”.

export NCCL_ASYNC_ERROR_HANDLING=1

Note: When you deactivate the environment, the variable above is automatically unset.

If you are building custom CUDA libraries, you may want to configure your application to pick up the right CUDA version by setting CUDA_HOME and adding a symlink to the CUDA library path. Refer to our DLAMI configuration guide.

export CUDA_HOME=/usr/local/cuda-11.6
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-11.6 /usr/local/cuda
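
After changing the symlink, it can help to check that /usr/local/cuda really resolves to the directory named in CUDA_HOME. The following is a minimal sketch that assumes GNU readlink (with the -f option) is available; the helper name cuda_link_matches is hypothetical.

```shell
# Hypothetical helper: succeed if the symlink at $2 resolves to the
# directory $1 (for example, /usr/local/cuda -> /usr/local/cuda-11.6).
cuda_link_matches() {
    [ -d "$1" ] && [ -L "$2" ] && [ "$(readlink -f "$2")" = "$(readlink -f "$1")" ]
}

# Usage:
#   cuda_link_matches "$CUDA_HOME" /usr/local/cuda || echo "symlink does not point at CUDA_HOME"
```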

Verify NCCL is using EFA

NCCL should use EFA automatically, out of the box. If EFA is being used, you should see the following log lines from NCCL. Also make sure your security group is configured for EFA; refer to the EFA configuration guide.

NCCL INFO Using network AWS Libfabric
NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
NCCL INFO NET/OFI Selected Provider is efa

The following log indicates that EFA is not being used. In that case, please reach out to AWS support.

NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
INFO Using network Socket
INFO NET/IB : No device found.
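
NCCL prints INFO-level lines like these only when NCCL_DEBUG=INFO is set. As a rough sketch, a captured log can be classified by searching for the messages quoted in this section; the helper name nccl_transport is hypothetical, and the matched strings are taken from the sample logs above, so they may differ in other NCCL or plugin versions.

```shell
# Hypothetical helper: classify an NCCL log read from stdin using the
# messages quoted in this section. Prints "efa", "socket", or "unknown".
nccl_transport() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'Selected Provider is efa'; then
        echo efa
    elif printf '%s\n' "$log" | grep -q 'Using network Socket'; then
        echo socket
    else
        echo unknown
    fi
}

# Usage (train.py is a placeholder for your own launch command):
#   NCCL_DEBUG=INFO python train.py 2>&1 | tee nccl.log
#   nccl_transport < nccl.log
```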

NCCL WARN Could not open XML topology file

The aws-ofi-nccl plugin should automatically configure the P4D/P4De/P5 topology file by setting the NCCL_TOPO_FILE variable to point to the correct XML file. The XML topology file is located at /your/conda/env/shared/aws-ofi-nccl/xml/. If the file location is wrong, you can either copy the file manually or build the aws-ofi-nccl plugin from source and specify the file location via the --prefix argument. Failure to find this file may lead to a performance regression. You should see the following NCCL log line and no warning about a missing XML file.

NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to <path>/p4d-24xl-topo.xml
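
To see whether the topology files are where the plugin expects them, you can list the XML files under the environment prefix. This sketch assumes the directory layout stated above (shared/aws-ofi-nccl/xml under the conda environment) and uses CONDA_PREFIX, which conda sets on activation; the helper name list_topo_files is hypothetical.

```shell
# Hypothetical helper: list XML topology files under an environment prefix.
# Assumption: the files live in <prefix>/shared/aws-ofi-nccl/xml/ as stated
# above; adjust the path if your installation differs.
list_topo_files() {
    dir="$1/shared/aws-ofi-nccl/xml"
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    find "$dir" -maxdepth 1 -name '*.xml' -type f | sort
}

# Usage:
#   list_topo_files "$CONDA_PREFIX"
#   echo "NCCL_TOPO_FILE=${NCCL_TOPO_FILE:-<unset>}"
```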

NCCL failed to initialize/bootstrap

Make sure socket (network interface) names are consistent across all nodes and that NCCL can initialize over one of them. You can check the interface names with ifconfig and select one by setting NCCL_SOCKET_IFNAME. (example issue)
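
One way to find an interface name that exists on every node is to collect the names from each node into a file and intersect the lists. The sketch below assumes one interface name per line in each file; the helper name common_ifaces and the file names are hypothetical, and eth0 in the usage note is only an example value.

```shell
# Hypothetical helper: print interface names that appear in every per-node
# list passed as an argument. Each file is assumed to hold one interface
# name per line (for example, collected on each node with
# `ls /sys/class/net > "$(hostname).ifaces"`).
common_ifaces() {
    n=$#
    for f in "$@"; do
        sort -u "$f"          # de-duplicate within a node
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# Usage:
#   common_ifaces node1.ifaces node2.ifaces
#   export NCCL_SOCKET_IFNAME=eth0   # pick a name reported for all nodes
```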