Deprecation Announcement
Dear AWS customers,
We are writing to announce the upcoming deprecation of the AWS Conda channel for PyTorch. This channel was created to address usability issues with the aws-ofi-nccl library, which previously required a specific NCCL version.
Because the aws-ofi-nccl library is no longer tightly coupled to a specific NCCL version, it is now much easier to use, and the AWS Conda channel for PyTorch is no longer required. We are deprecating this channel, and no further development or updates will be made. For customers who would like to take advantage of Elastic Fabric Adapter (EFA), we recommend using SageMaker HyperPod or the AWS Deep Learning AMI, where EFA and the aws-ofi-nccl library are installed by default.
This deprecation will take effect on 10/24/2024. After this date, we encourage you to use the official PyTorch distribution or other supported channels for your PyTorch needs. The channel itself will remain available until 10/24/2025, after which it will be removed. If you have any questions or concerns about this deprecation, don't hesitate to reach out to us.
Frequently Asked Questions
Q1. When will the AWS Conda channel be made unavailable?
We will keep the AWS Conda channel available until 10/24/2025.
Q2. I have an existing workload that depends on the AWS Conda channel for PyTorch. Will it fail after the deprecation date?
No, we will keep the existing channel available until 10/24/2025, so your existing workload will continue to function after the deprecation date.
Q3. I want to use a newer PyTorch version, such as 2.4 and beyond. What should I do?
In your existing `conda install`/`conda create` commands or environment.yml files, replace `https://aws-ml-conda.s3.us-west-2.amazonaws.com` with `pytorch`, then rerun the commands to get newer PyTorch from the official distribution. You can also remove the `--strict-channel-priority` and `--override-channels` flags if you are using them.
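As an illustration, a minimal environment.yml migration might look like the following sketch (the environment name is hypothetical; only the channel swap comes from the guidance above):

```yaml
# Before (hypothetical): environment.yml pointing at the AWS Conda channel
# name: training-env
# channels:
#   - https://aws-ml-conda.s3.us-west-2.amazonaws.com
# dependencies:
#   - pytorch

# After: the same environment using the official pytorch channel
name: training-env
channels:
  - pytorch
dependencies:
  - pytorch
```

Recreate the environment with `conda env create -f environment.yml` to pull PyTorch from the official distribution.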
Q4. Will I need to recreate my existing conda environments?
If you want to use a PyTorch version beyond 2.3, you should create a new conda environment. However, if you want to keep using the software versions in your existing environment, you don't need to do anything.
Q5. What are the recommended alternatives to the deprecated AWS Conda channel?
We recommend using the official PyTorch distribution. Please refer to the getting started guide for details.
Q6. I want to install the latest PyTorch security fixes. What should I do?
We recommend using the official PyTorch distribution to get PyTorch with the latest security fixes. Check the PyTorch Security Policy for more details.
Hi there
We are building AWS PyTorch to help address usability and performance issues of PyTorch on AWS. The following is a list of changes we've made:
- Use EFA out of the box: We've pre-built the OFI-NCCL plugin as a conda package for you. It is set as a dependency of our distribution of PyTorch and will be installed automatically when you follow the installation guide.
- We provide the latest CUDA/cuDNN/NCCL for the best performance on P5 instances. We work with pytorch/builder on issues found and on future contributions.
Note: We don’t make any source code level changes on PyTorch.
Please read this document for details about the AWS PyTorch release. To get started, go through the conda setup guide first, then choose an installation command based on the PyTorch version of your choice:
(Installation selector: choose your Package Manager and Compute Platform, then run the corresponding command.)
If you face any issues, please check the troubleshooting guide to see if that helps. If not, feel free to reach out to us via Arindam Paul (aripauly@amazon.com).
Latest Release - 05/02/2024
- Added PyTorch 2.3.0 support: We have added PyTorch 2.3.0 with CUDA 11.8 - 12.1 and Python 3.8 - 3.11 variants.
For older release notes, please visit Release Notes page.
Known Issues / Callouts
- `DeviceIndex` is unique to each NetworkCard and must be a non-negative integer less than the limit of ENIs per NetworkCard. On `p5`, the number of ENIs per NetworkCard is 2, meaning the only valid values for `DeviceIndex` are 0 or 1. To reduce configuration overhead, `DeviceIndex` should be set to `0` for the first EFA device and `1` for the rest of them. Below is an example using the AWS CLI:

  ```
  --network-interfaces \
      "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      "NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
      ...
  ```

  Note: AWS ParallelCluster automatically configures the correct network device settings for `p5` instances, so users don't need to explicitly configure the `DeviceIndex` setting for EFA devices when using AWS ParallelCluster.
- `CUDA error: driver shutting down` can be observed with autograd when not explicitly passing `device` into PyTorch calls such as `torch.tensor()`. This error impacts pytorch-2.0.1 built with CUDA 12.1 support. The workaround is to explicitly pass `device` into PyTorch calls; for more details, please check this post. This will be fixed with our next release.
- Conda installing `opencv` along with the latest AWS PyTorch release can cause a dependency conflict. You can work around this problem by installing opencv with pip.
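As a minimal sketch of the autograd workaround described above (the tensor values are illustrative), explicitly bind tensors to a device rather than relying on the implicit default:

```python
import torch

# Workaround sketch: pass an explicit device to tensor-creating calls
# such as torch.tensor() instead of relying on the default device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)
y = (x * x).sum()
y.backward()  # autograd runs with tensors bound to an explicit device

print(x.grad)  # gradient of sum(x^2) is 2x
```

The same pattern applies to any other tensor-creating call that accepts a `device` argument.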