Introduction
In multi-GPU training, Peer-to-Peer (P2P) communication lets GPUs exchange data directly, which is crucial for efficient model training, particularly when synchronizing values such as the loss across all GPUs. NVIDIA's official driver does not support P2P on the RTX 4090, so NCCL cannot use it out of the box. This guide describes a workaround based on a modified kernel driver.
Expected Results
The following images demonstrate the successful implementation of P2P communication:
Implementation Guide
1. Driver Installation
1.1 Remove Existing NVIDIA Drivers
sudo apt purge '^nvidia-.*'
sudo apt autoremove
sudo apt autoclean
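Before rebooting, it can help to confirm that the purge actually removed everything. The following is a minimal sketch, assuming a Debian/Ubuntu system where `dpkg -l` is available; the helper name `installed_nvidia_packages` is made up for this example.

```shell
# Hypothetical helper: reads `dpkg -l` text on stdin and prints the names of
# packages that are still in the "ii" (installed) state and start with nvidia-.
installed_nvidia_packages() {
  awk '$1 == "ii" && $2 ~ /^nvidia-/ { print $2 }'
}

# Only query dpkg when it exists, so the sketch is harmless elsewhere.
if command -v dpkg >/dev/null 2>&1; then
  remaining=$(dpkg -l | installed_nvidia_packages)
  if [ -z "$remaining" ]; then
    echo "No nvidia-* packages are installed"
  else
    printf 'Still installed:\n%s\n' "$remaining"
  fi
fi
```

Packages listed in the `rc` state (removed, configuration files kept) are ignored by this check.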
1.2 System Reboot
Restart your system to ensure clean driver removal.
1.3 Unload NVIDIA DRM Module
systemctl isolate multi-user.target
modprobe -r nvidia-drm
# If GUI doesn't appear after completion
systemctl start graphical.target
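To check that no NVIDIA kernel module is still loaded before installing the modified driver, something like the sketch below could be used. It reads `/proc/modules`, the same data that `lsmod` prints; the function name `loaded_nvidia_modules` is an illustrative assumption.

```shell
# Hypothetical helper: reads lsmod-style text on stdin and prints the names
# of modules whose name starts with "nvidia".
loaded_nvidia_modules() {
  awk '$1 ~ /^nvidia/ { print $1 }'
}

# /proc/modules exists on Linux; skip quietly on other systems.
if [ -r /proc/modules ]; then
  still_loaded=$(loaded_nvidia_modules < /proc/modules)
  if [ -z "$still_loaded" ]; then
    echo "No NVIDIA kernel modules are loaded"
  else
    printf 'Still loaded:\n%s\n' "$still_loaded"
  fi
fi
```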
1.4 Install Modified Driver
- Clone the modified driver repository:
git clone https://github.com/tinygrad/open-gpu-kernel-modules
cd open-gpu-kernel-modules
- Switch to the appropriate branch:
git branch -a
git switch 565.57.01-p2p
- Compile the modules:
make modules -j$(nproc)
If GCC errors occur, install GCC-12:
sudo apt update
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 120 --slave /usr/bin/g++ g++ /usr/bin/g++-12
- Install the compiled modules:
sudo make modules_install -j$(nproc)
- Download and install the corresponding NVIDIA driver:
# Download from https://www.nvidia.com/en-us/drivers/details/233008/
sh ./NVIDIA-Linux-[...].run --no-kernel-modules
- Reboot the system
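After the reboot, a quick sanity check is to compare the driver version reported by `nvidia-smi` with the one in the branch name. This is only a sketch: it assumes the branch name `565.57.01-p2p` reflects the driver version, and the helper `driver_matches` is hypothetical.

```shell
# Assumption: the expected version is taken from the branch name 565.57.01-p2p.
EXPECTED_DRIVER="565.57.01"

# Hypothetical helper: reads one driver version per line on stdin and
# succeeds only if there is at least one line and every line equals $1.
driver_matches() {
  awk -v want="$1" '
    { n++; if (($1 "") != (want "")) bad = 1 }   # force string comparison
    END { exit (n == 0 || bad) ? 1 : 0 }
  '
}

# Query one version per GPU; skipped when nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  if nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_matches "$EXPECTED_DRIVER"; then
    echo "All GPUs report driver $EXPECTED_DRIVER"
  else
    echo "Driver version differs from $EXPECTED_DRIVER (or no GPU was found)"
  fi
fi
```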
2. System Configuration
2.1 ReBar Verification
Verify ReBar activation using:
nvidia-smi -q | grep -i bar -A 3
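ReBar is considered active if the BAR1 Total is larger than 256 MiB (with ReBar on, BAR1 typically spans the whole VRAM, for example 32768 MiB on an RTX 4090). A Total of exactly 256 MiB indicates that it is off. If inactive, enable Resizable BAR (and Above 4G Decoding) in the BIOS settings, updating your BIOS and motherboard firmware if the option is missing.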
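The check can be automated by parsing the `BAR1 Memory Usage` section of `nvidia-smi -q`. The sketch below assumes that a BAR1 Total above the 256 MiB default indicates ReBar is on, and that entries appear in GPU order; `bar1_report` is a name invented for this example.

```shell
# Hypothetical helper: reads `nvidia-smi -q` text on stdin and classifies the
# "Total" line that follows each "BAR1 Memory Usage" header.
bar1_report() {
  awk '
    /BAR1 Memory Usage/ { in_bar1 = 1; next }
    in_bar1 && $1 == "Total" {
      status = ($3 > 256) ? "ReBAR active" : "ReBAR inactive"
      printf "GPU %d: BAR1 %s MiB -> %s\n", gpu, $3, status
      gpu++
      in_bar1 = 0
    }
  '
}

# Run against the real output when the tool is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q | bar1_report
fi
```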
2.2 IOMMU Configuration
Disable IOMMU by modifying GRUB configuration:
- Edit GRUB configuration:
sudo nano /etc/default/grub
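- Modify the following line (on Intel platforms, use intel_iommu=off instead of amd_iommu=off):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off iommu=off"
- Apply the change and reboot:
sudo update-grub
sudo reboot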
Note: P2P communication requires both ReBar activation and IOMMU deactivation.
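Once the machine is back up, the running kernel's command line shows whether the GRUB change took effect. A minimal sketch, assuming the `iommu=off` parameter from the step above; the function name `iommu_off_in_cmdline` is hypothetical.

```shell
# Hypothetical helper: succeeds when the given kernel command line contains
# iommu=off as a standalone parameter (amd_iommu=off alone does not count).
iommu_off_in_cmdline() {
  case " $1 " in
    *" iommu=off "*) return 0 ;;
    *) return 1 ;;
  esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
if [ -r /proc/cmdline ]; then
  if iommu_off_in_cmdline "$(cat /proc/cmdline)"; then
    echo "iommu=off is set on the running kernel"
  else
    echo "iommu=off is NOT set; re-check /etc/default/grub and reboot"
  fi
fi
```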
3. CUDA Toolkit Setup
3.1 Installation and Configuration
- Download CUDA Toolkit from NVIDIA’s website
- Configure environment variables:
export PATH=/usr/local/cuda-12.9/bin:$PATH
export CUDAHOSTCXX=/usr/bin/g++-12
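To confirm that the shell now picks up the intended toolkit, the release number can be read from `nvcc --version`. The sketch below assumes the usual `release X.Y,` wording in that output; `nvcc_release` is an illustrative helper name.

```shell
# Hypothetical helper: reads `nvcc --version` text on stdin and prints the
# release number (for example 12.9) from the "release X.Y," line.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Report what is on PATH, or say so when nvcc cannot be found.
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc release: $(nvcc --version | nvcc_release)"
else
  echo "nvcc not found on PATH"
fi
```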
4. P2P Testing
4.1 SimpleP2P Test
- Clone the CUDA samples repository:
git clone https://github.com/NVIDIA/cuda-samples
- Compile and run SimpleP2P:
cd cuda-samples/Samples/0_Introduction/simpleP2P/
mkdir build && cd build
cmake ..
make -j$(nproc)
./simpleP2P
4.2 P2P Latency Test
- From the directory that contains the cuda-samples clone, compile and run the latency test:
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
mkdir build && cd build
cmake ..
make -j$(nproc)
./p2pBandwidthLatencyTest
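The test output can also be checked from a script. The sketch below assumes the output contains a `P2P Connectivity Matrix` section with a `D\D` header row followed by one row of 0/1 flags per GPU; if the format differs in your samples version, the check simply reports failure. The function name `all_pairs_p2p` is hypothetical.

```shell
# Hypothetical helper: reads p2pBandwidthLatencyTest output on stdin and
# succeeds only if every entry of the connectivity matrix is 1.
all_pairs_p2p() {
  awk '
    /P2P Connectivity Matrix/ { in_matrix = 1; next }
    in_matrix && $1 == "D\\D" { next }             # skip the header row
    in_matrix && $1 ~ /^[0-9]+$/ {                 # one row per GPU
      rows++
      for (i = 2; i <= NF; i++) if ($i != 1) bad = 1
      next
    }
    in_matrix { in_matrix = 0 }                    # any other line ends the table
    END { exit (rows == 0 || bad) ? 1 : 0 }
  '
}

# Run from the build directory, where the binary was just compiled.
if [ -x ./p2pBandwidthLatencyTest ]; then
  if ./p2pBandwidthLatencyTest | all_pairs_p2p; then
    echo "P2P is enabled between all GPU pairs"
  else
    echo "P2P is missing for at least one pair (or the matrix was not found)"
  fi
fi
```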
Conclusion
This guide covers enabling P2P communication on NVIDIA RTX 4090 workstations: installing the modified driver, activating ReBar, disabling IOMMU, and verifying the result with the CUDA samples. Each step needs care, since P2P only works when the driver, the system configuration, and the tests all check out.