Introduction

In multi-GPU training, Peer-to-Peer (P2P) communication lets GPUs exchange data directly instead of staging it through host memory, which matters for synchronizing values such as gradients and losses across all GPUs. NVIDIA has restricted NCCL P2P communication support for the RTX 4090, so a workaround is needed to enable it. This guide uses a modified build of the open GPU kernel modules for that purpose.

Expected Results

The following images demonstrate the successful implementation of P2P communication:

[Image: P2P Communication Result 1]
[Image: P2P Communication Result 2]

Implementation Guide

1. Driver Installation

1.1 Remove Existing NVIDIA Drivers

sudo apt purge '^nvidia-.*'
sudo apt autoremove
sudo apt autoclean

1.2 System Reboot

Restart your system to ensure clean driver removal.

1.3 Unload NVIDIA DRM Module

sudo systemctl isolate multi-user.target
sudo modprobe -r nvidia-drm

# If the GUI does not come back after the installation is complete
sudo systemctl start graphical.target
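
Before building the replacement modules, it can help to confirm that no NVIDIA kernel modules are still loaded. The following is a minimal sketch; the helper name `loaded_nvidia_modules` is illustrative and not part of any tool.

```bash
# Print the names of loaded kernel modules that start with "nvidia".
# Reads `lsmod`-formatted text on stdin so it can be checked offline.
loaded_nvidia_modules() {
  # NR > 1 skips the "Module Size Used by" header line.
  awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}

if command -v lsmod >/dev/null 2>&1; then
  remaining=$(lsmod | loaded_nvidia_modules)
  if [ -n "$remaining" ]; then
    echo "Still loaded:"
    echo "$remaining"
  else
    echo "No NVIDIA kernel modules are loaded."
  fi
fi
```

If any modules are listed, unload them with `sudo modprobe -r` before continuing.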

1.4 Install Modified Driver

Clone the modified driver repository:

git clone https://github.com/tinygrad/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules

Switch to the appropriate branch:

git branch -a
git switch 565.57.01-p2p

Compile the modules:

make modules -j$(nproc)

If the build fails with GCC errors, install GCC 12, register it as the default compiler, and rerun the build:

sudo apt update
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 120 --slave /usr/bin/g++ g++ /usr/bin/g++-12

Install the compiled modules:

sudo make modules_install -j$(nproc)

Download the corresponding NVIDIA driver and install it without its kernel modules, so that the modules compiled above stay in place:

# Download from https://www.nvidia.com/en-us/drivers/details/233008/
sudo sh ./NVIDIA-Linux-[...].run --no-kernel-modules

Reboot the system:

sudo reboot
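
After the reboot, one way to confirm that the expected driver is running is to compare the version reported for each GPU with the version in the branch name. This is a sketch only; `all_match_version` is a hypothetical helper, and the expected version is assumed from the `565.57.01-p2p` branch used above.

```bash
# Expected driver version, assumed from the branch name 565.57.01-p2p.
EXPECTED_VERSION="565.57.01"

# Succeeds only if at least one line is read from stdin and every
# line's first field equals the expected version.
all_match_version() {
  awk -v want="$1" '
    { n++; if ($1 != want) bad = 1 }
    END { if (n > 0 && !bad) exit 0; exit 1 }
  '
}

if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=driver_version --format=csv,noheader \
      | all_match_version "$EXPECTED_VERSION"; then
    echo "All GPUs report driver $EXPECTED_VERSION"
  else
    echo "Driver version mismatch; check the installed kernel modules"
  fi
fi
```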

2. System Configuration

2.1 ReBar Verification

Verify ReBar activation using:

nvidia-smi -q | grep -i bar -A 3

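ReBar is considered active if the Total under "BAR1 Memory Usage" is well above 256 MiB, which is the usual BAR1 size when ReBar is off. On an RTX 4090 with ReBar enabled, the total is typically far larger. If ReBar is inactive, enable Resizable BAR in the BIOS, updating the BIOS and motherboard firmware if the option is missing.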

[Image: ReBar Configuration]
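
On machines with several GPUs, the grep output can be long. The sketch below extracts only the BAR1 total for each GPU so the values can be compared with the threshold described above. The helper name `bar1_totals_mib` is illustrative, and the parsing assumes the usual `nvidia-smi -q` layout in which a "Total" line follows the "BAR1 Memory Usage" heading.

```bash
# Extract the BAR1 "Total" value (in MiB) for each GPU from
# `nvidia-smi -q` output supplied on stdin.
bar1_totals_mib() {
  awk '
    /BAR1 Memory Usage/ { in_bar1 = 1; next }
    # Lines look like "Total : 32768 MiB"; the number is the
    # second-to-last field.
    in_bar1 && /Total/  { print $(NF - 1); in_bar1 = 0 }
  '
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q | bar1_totals_mib | while read -r mib; do
    echo "BAR1 total: ${mib} MiB"
  done
fi
```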

2.2 IOMMU Configuration

Disable IOMMU through the kernel command line. Open the GRUB configuration:

sudo nano /etc/default/grub

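Set the following line (on Intel platforms, use intel_iommu=off in place of amd_iommu=off):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off iommu=off"

Regenerate the GRUB configuration and reboot for the change to take effect:

sudo update-grub
sudo reboot

Note: P2P communication requires both ReBar activation and IOMMU deactivation.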
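
Editing `/etc/default/grub` has no effect until the running kernel is booted with the new parameters. A small check against `/proc/cmdline` can confirm that they were applied. This is a sketch; `cmdline_has` is a hypothetical helper name.

```bash
# Succeeds if the kernel command line (first argument) contains the
# exact space-separated parameter given as the second argument.
cmdline_has() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

if [ -r /proc/cmdline ]; then
  cmdline=$(cat /proc/cmdline)
  if cmdline_has "$cmdline" "iommu=off"; then
    echo "iommu=off is active on the running kernel"
  else
    echo "iommu=off not found; regenerate the GRUB configuration and reboot"
  fi
fi
```

Matching on whole parameters keeps `amd_iommu=off` from being mistaken for `iommu=off`.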

3. CUDA Toolkit Setup

3.1 Installation and Configuration

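Download and install the CUDA Toolkit from NVIDIA's website. The paths below assume version 12.9 installed under /usr/local/cuda-12.9; adjust them if your installation differs.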

Configure environment variables:

export PATH=/usr/local/cuda-12.9/bin:$PATH
export CUDAHOSTCXX=/usr/bin/g++-12
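
If these exports go into a shell startup file, sourcing it repeatedly adds the same directory to PATH again and again. The sketch below adds the directory only once and then reports which `nvcc` the shell resolves. `path_prepend` is a hypothetical helper, and the CUDA path is the one assumed above.

```bash
# CUDA binary directory, as assumed in the exports above.
CUDA_BIN=/usr/local/cuda-12.9/bin

# Prepend a directory to PATH only if it is not already present.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already on PATH, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
}

path_prepend "$CUDA_BIN"
export PATH
export CUDAHOSTCXX=/usr/bin/g++-12

# Report which nvcc the shell resolves, if any.
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc found at: $(command -v nvcc)"
else
  echo "nvcc not found; check that $CUDA_BIN exists"
fi
```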

4. P2P Testing

4.1 SimpleP2P Test

Clone the CUDA samples repository:

git clone https://github.com/NVIDIA/cuda-samples

Compile and run SimpleP2P:

cd cuda-samples/Samples/0_Introduction/simpleP2P/
mkdir build && cd build
cmake ..
make -j$(nproc)
./simpleP2P
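
simpleP2P needs at least two GPUs, so a quick check of what the driver exposes can save a debugging round. The sketch below counts the GPUs listed by `nvidia-smi -L` and, when there are two or more, prints the driver's P2P read-capability matrix. `count_gpus` is a hypothetical helper name.

```bash
# Count GPUs from `nvidia-smi -L` output on stdin. Each GPU appears on
# a line of the form "GPU <index>: <name> (UUID: ...)".
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  n=$(nvidia-smi -L | count_gpus)
  echo "Visible GPUs: $n"
  if [ "$n" -ge 2 ]; then
    # Show P2P read capability between every pair of GPUs.
    nvidia-smi topo -p2p r
  else
    echo "simpleP2P needs at least two GPUs"
  fi
fi
```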

4.2 P2P Bandwidth and Latency Test

Return to the directory that contains the cuda-samples checkout, then compile and run the latency test:

cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
mkdir build && cd build
cmake ..
make -j$(nproc)
./p2pBandwidthLatencyTest
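
The CUDA samples exercise P2P at the CUDA level. To see whether NCCL itself selects a P2P transport, one option is to run an NCCL workload with `NCCL_DEBUG=INFO` and look for channel lines that mention P2P. The sketch below is hedged in several ways: the workload command and the log file name `nccl.log` are hypothetical, and the `via P2P` wording is an assumption based on typical NCCL log output, which may differ between NCCL versions.

```bash
# Count log lines that report a P2P transport. Reads an
# NCCL_DEBUG=INFO log on stdin; the "via P2P" wording is an assumption.
count_p2p_channels() {
  grep -c 'via P2P'
}

# Hypothetical run; replace the command with your own NCCL workload:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=2 train.py 2>&1 | tee nccl.log

LOG=nccl.log   # hypothetical log file name
if [ -r "$LOG" ]; then
  echo "P2P channel lines: $(count_p2p_channels < "$LOG")"
fi
```

A count of zero suggests that NCCL fell back to another transport, such as shared memory.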

Conclusion

This guide covers the steps needed to enable P2P communication on NVIDIA RTX 4090 workstations: installing the modified kernel driver, activating ReBar and disabling IOMMU, and confirming the result with the simpleP2P and p2pBandwidthLatencyTest samples. If a test fails, recheck the driver installation and the ReBar and IOMMU settings first, since P2P depends on all three.

Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations