Troubleshoot
Recommended Installation Options
We recommend the following conda flags to install the AWS PyTorch conda package successfully.
The --strict-channel-priority flag changes conda's default behavior: if a package matching the user's specification is found in a higher-priority channel (for example, https://aws-ml-conda.s3.us-west-2.amazonaws.com), conda ignores packages of the same name in lower-priority channels (such as “pytorch” or “conda-forge”). In our case, conda ignores the “pytorch” package from the pytorch and conda-forge channels and installs pytorch only from the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel, provided that this channel contains a package that satisfies the user's specification.
The --override-channels flag overrides the user's local conda configuration and forces conda to consider only packages from the channels specified afterwards, in that order. This flag is important because users may have a pre-existing .condarc file whose settings contradict the required ones.
If users don't want to use the --strict-channel-priority flag when installing the AWS distribution of PyTorch, they can instead specify part of the build string used in the conda package. For example, they can specify pytorch=1.12.1=aws* to install the AWS distribution of PyTorch, provided that the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel is configured correctly.
Note: Users who manage AWS credentials with IAM profiles may need to explicitly allowlist https://aws-ml-conda.s3.us-west-2.amazonaws.com in their IAM policy.
How can I find out what packages are available in the https://aws-ml-conda.s3.us-west-2.amazonaws.com channel?
Run the following command:
conda search --override-channels -c https://aws-ml-conda.s3.us-west-2.amazonaws.com
The output should be similar to the following.
# Name          Version  Build                              Channel
aws-ofi-nccl    1.4.0    aws
pytorch         1.12.1   aws_py3.8_cuda11.6_cudnn8.3.2_1
torchvision     0.13.1   py38_cu116
Conda installed pytorch from the pytorch channel instead of https://aws-ml-conda.s3.us-west-2.amazonaws.com. What should I do?
Make sure you use both the --strict-channel-priority and --override-channels flags when installing conda packages, as in the example below.
conda install python=3.9 pytorch=1.13.1 pytorch-cuda=11.7 torchvision torchaudio \
--strict-channel-priority --override-channels \
-c s3://aws-ml-conda \
-c pytorch \
-c nvidia \
-c conda-forge
Alternatively, you can specify the aws keyword in the installation command, as shown below. In that case you don't need the --strict-channel-priority flag.
conda install python=3.9 pytorch=1.13.1=aws* pytorch-cuda=11.7 torchvision torchaudio \
--override-channels \
-c s3://aws-ml-conda \
-c pytorch \
-c nvidia \
-c conda-forge
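To confirm after installation that conda picked the AWS build, one option is to inspect the build string reported by conda list. The sketch below assumes, based on the build strings shown earlier (for example aws_py3.8_cuda11.6_cudnn8.3.2_1), that AWS builds of pytorch start with "aws". The helper name is_aws_pytorch_build is hypothetical and not part of conda.

```shell
# Hypothetical helper: reads `conda list` output on stdin and succeeds only if
# a pytorch row is present and its build string (third column) starts with "aws".
is_aws_pytorch_build() {
    awk '
        $1 == "pytorch" { found = 1; if ($3 !~ /^aws/) bad = 1 }
        END { if (found && !bad) exit 0; exit 1 }
    '
}

# Usage (run inside the activated conda environment):
#   conda list -f pytorch | is_aws_pytorch_build && echo "AWS build of pytorch installed"
```

If the check fails, rerun the installation with one of the two command forms above.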
Troubleshooting Distributed Training
Environment variables
When you activate the conda environment, the following environment variable is pre-set for you to enable [RFC] Asynchronous Error Handling for Distributed Training with NCCL.
export NCCL_ASYNC_ERROR_HANDLING=1
Note: When you deactivate the environment, the variable above is automatically unset.
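A quick way to check that the variable is present in the current shell is sketched below. The helper name nccl_async_enabled is hypothetical; it only tests the shell environment and makes no claim about how the conda environment sets the variable.

```shell
# Hypothetical helper: succeeds only if NCCL_ASYNC_ERROR_HANDLING is set to 1
# in the current shell.
nccl_async_enabled() {
    [ "${NCCL_ASYNC_ERROR_HANDLING:-}" = "1" ]
}

# Usage (after activating the conda environment):
#   nccl_async_enabled && echo "async error handling enabled" || echo "variable not set"
```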
If you are building custom CUDA libraries, you may want to configure your application to pick up the right CUDA version by setting CUDA_HOME and adding a symlink to the CUDA library path. Refer to our DLAMI configuration guide.
export CUDA_HOME=/usr/local/cuda-11.6
# Remove the existing /usr/local/cuda symlink first, then point it at the desired version.
sudo rm /usr/local/cuda
sudo ln -s /usr/local/cuda-11.6 /usr/local/cuda
Verify NCCL is using EFA
NCCL should use EFA automatically out of the box. If EFA is being used, you should see the following log lines from NCCL. Also make sure your security group is configured for EFA; refer to the EFA configuration guide.
NCCL INFO Using network AWS Libfabric
NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
NCCL INFO NET/OFI Selected Provider is efa
The following log indicates that EFA is not being used. In that case, please reach out to AWS Support.
NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
INFO Using network Socket
INFO NET/IB : No device found.
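The log check above can be scripted. The sketch below assumes the NCCL output has been captured to a file and searches it for the "Selected Provider is efa" line shown earlier. The helper name nccl_log_uses_efa and the file name nccl.log are hypothetical; NCCL_DEBUG=INFO is the NCCL setting that makes these INFO lines appear.

```shell
# Hypothetical helper: succeeds if the captured NCCL log ($1) shows that the
# EFA provider was selected.
nccl_log_uses_efa() {
    grep -q 'NET/OFI Selected Provider is efa' "$1"
}

# Usage (train.py and the process count are placeholders for your own job):
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=8 train.py 2>&1 | tee nccl.log
#   nccl_log_uses_efa nccl.log && echo "EFA in use" || echo "EFA not detected"
```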
NCCL WARN Could not open XML topology file
The aws-ofi-nccl plugin should automatically configure the P4D/P4De/P5 topology file by setting the NCCL_TOPO_FILE variable to the correct XML file location. The XML topology file is located at /your/conda/env/shared/aws-ofi-nccl/xml/. If the file location is wrong, you can either copy the file manually or build the aws-ofi-nccl plugin from source and specify the file location via the --prefix argument. Failure to find the file may lead to a performance regression. You should see the following NCCL log line and no warning about the XML file not being found.
NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to <path>/p4d-24xl-topo.xml
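As a rough check, a captured NCCL log can be searched for both the expected message and the warning. This sketch relies only on the two log strings quoted in this section; the helper name nccl_topo_ok and the log file name are hypothetical.

```shell
# Hypothetical helper: succeeds if the captured NCCL log ($1) shows the plugin
# setting NCCL_TOPO_FILE and contains no "Could not open XML topology file" warning.
nccl_topo_ok() {
    grep -q 'Setting NCCL_TOPO_FILE environment variable to' "$1" &&
        ! grep -q 'Could not open XML topology file' "$1"
}

# Usage (nccl.log is a log captured with NCCL_DEBUG=INFO):
#   nccl_topo_ok nccl.log && echo "topology file found" || echo "check the XML topology file location"
```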
NCCL failed to initialize/bootstrap
Make sure socket names are consistent across all nodes and that NCCL can initialize via one of them. You can check the socket names with ifconfig and select one with NCCL_SOCKET_IFNAME. (example issue)
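To compare interface names across nodes, a short listing is often easier to read than full ifconfig output. The sketch below parses the one-line-per-interface output of ip -o link show, used here as an alternative to ifconfig on Linux hosts. The helper name list_ifnames is hypothetical, and eth0 is only a placeholder for whichever interface your nodes share.

```shell
# Hypothetical helper: prints one interface name per line from
# `ip -o link show` output on stdin, dropping any "@peer" suffix.
list_ifnames() {
    awk -F': ' '{ sub(/@.*/, "", $2); print $2 }'
}

# Usage (run on every node and compare the results):
#   ip -o link show | list_ifnames
# Then pin NCCL to an interface that exists on all nodes:
#   export NCCL_SOCKET_IFNAME=eth0
```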