Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations

Posted by Allan on May 21, 2025

Introduction

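In multi-GPU training, peer-to-peer (P2P) communication lets GPUs exchange data directly over PCIe instead of staging it through host memory. This speeds up the collective operations NCCL uses to synchronize gradients (and values such as the loss) across GPUs. NVIDIA's stock driver does not enable P2P on the GeForce RTX 4090, so NCCL falls back to slower paths through system memory. This guide re-enables P2P with tinygrad's patched open GPU kernel modules.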

Expected Results

The following images demonstrate the successful implementation of P2P communication:

[Figure: P2P Communication Result 1]
[Figure: P2P Communication Result 2]

Implementation Guide

1. Driver Installation

1.1 Remove Existing NVIDIA Drivers

sudo apt purge '^nvidia-.*'
sudo apt autoremove
sudo apt autoclean
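Before rebooting, you can confirm that the purge worked by filtering `dpkg -l` output for NVIDIA packages that are still in the installed (`ii`) state. This is a minimal sketch of my own, not part of the original procedure. The function name `remaining_nvidia_pkgs` is hypothetical, and the function only looks at package names starting with `nvidia-`, which mirrors the purge pattern used here.

```bash
#!/usr/bin/env bash
# Print the names of NVIDIA packages that dpkg still reports as installed.
# Reads `dpkg -l` output on stdin; installed packages have status "ii" in
# the first column and the package name in the second.
remaining_nvidia_pkgs() {
  awk '$1 == "ii" && $2 ~ /^nvidia-/ { print $2 }'
}

# Usage on the workstation (should print nothing after the purge):
#   dpkg -l | remaining_nvidia_pkgs
```

Packages in the `rc` state (removed, config files left behind) are ignored, because they no longer provide any driver files.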

1.2 System Reboot

Restart your system to ensure clean driver removal.

1.3 Unload NVIDIA DRM Module

sudo systemctl isolate multi-user.target
sudo modprobe -r nvidia-drm

# If the GUI does not come back after the installation is complete:
sudo systemctl start graphical.target

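To check that the DRM module is really gone before you build and install the new modules, you can look for it in `lsmod` output. The sketch below is an illustration under that assumption. The helper name `module_loaded` is hypothetical. It normalizes dashes to underscores because `lsmod` lists `nvidia_drm`, while `modprobe` also accepts `nvidia-drm`.

```bash
#!/usr/bin/env bash
# Return 0 if the named kernel module appears in `lsmod`-style input on stdin,
# 1 otherwise. The first line of lsmod output is a header and is skipped.
module_loaded() {
  # lsmod prints names with underscores; modprobe treats - and _ the same.
  local name="${1//-/_}"
  awk -v mod="$name" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Usage on the workstation (should print "unloaded" after modprobe -r):
#   if lsmod | module_loaded nvidia-drm; then echo loaded; else echo unloaded; fi
```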
1.4 Install Modified Driver

  1. Clone the modified driver repository and enter it:
    
    git clone https://github.com/tinygrad/open-gpu-kernel-modules
    cd open-gpu-kernel-modules
    
  2. Switch to the appropriate branch:
    
    git branch -a
    git switch 565.57.01-p2p
    
  3. Compile the modules:
    
    make modules -j$(nproc)
    

    If the build fails with GCC-related errors, install GCC 12 and make it the default compiler:
    
    sudo apt update
    sudo apt install gcc-12 g++-12
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 120 --slave /usr/bin/g++ g++ /usr/bin/g++-12
    
  4. Install the compiled modules:
    
    sudo make modules_install -j$(nproc)
    
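  5. Download the user-space NVIDIA driver with exactly the same version as the modules (565.57.01). Install it without its bundled kernel modules so that the patched modules stay in place:
    
    # Download from https://www.nvidia.com/en-us/drivers/details/233008/
    sudo sh ./NVIDIA-Linux-[...].run --no-kernel-modules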
    
  6. Reboot the system
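The patched kernel modules and the user-space driver must be the same version. After the reboot, you can confirm this by reading the version of the loaded module from `/proc/driver/nvidia/version` and comparing it with the branch you built. This is a sketch, and the first-line format it parses is an assumption based on how the open kernel modules usually report themselves. The helper names `kmod_version` and `versions_match` are hypothetical.

```bash
#!/usr/bin/env bash
# Extract the driver version from /proc/driver/nvidia/version-style text on stdin.
# Assumed first-line format (open kernel modules):
#   NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  565.57.01  Release Build  (...)
kmod_version() {
  awk '/^NVRM version:/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
  }'
}

# Compare a "-p2p" branch name (e.g. 565.57.01-p2p) with the loaded version.
versions_match() {
  local branch="$1" loaded="$2"
  [ "${branch%-p2p}" = "$loaded" ]
}

# Usage on the workstation after rebooting:
#   loaded="$(kmod_version < /proc/driver/nvidia/version)"
#   versions_match 565.57.01-p2p "$loaded" && echo match || echo "mismatch: $loaded"
```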

2. System Configuration

2.1 ReBar Verification

Verify ReBar activation using:

nvidia-smi -q | grep -i bar -A 3

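ReBar is active if the BAR1 Memory Usage Total is larger than 256 MiB (typically at least the card's full VRAM). A Total of exactly 256 MiB means ReBar is off. If it is inactive, enable Resizable BAR and Above 4G Decoding in the BIOS. If those options are missing, update the BIOS and motherboard firmware first.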

[Figure: ReBar Configuration]

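With several GPUs, scanning the `grep` output by eye gets tedious. The sketch below reads `nvidia-smi -q` output and prints one verdict per GPU. It assumes the BAR1 section has the usual `BAR1 Memory Usage` heading followed by a `Total : <n> MiB` line, and that sizes are in MiB. A Total of exactly 256 MiB is the default window without Resizable BAR, so only values above 256 MiB are treated as active. The helper name `rebar_status` is hypothetical.

```bash
#!/usr/bin/env bash
# Report, per GPU, whether Resizable BAR looks active, reading `nvidia-smi -q`
# output on stdin. Assumed layout of the BAR1 block (sizes in MiB):
#     BAR1 Memory Usage
#         Total                             : <n> MiB
rebar_status() {
  awk '
    BEGIN { gpu = 0 }
    /BAR1 Memory Usage/ { in_bar1 = 1; next }
    in_bar1 && /Total/ {
      split($0, parts, ":")
      total = parts[2] + 0          # " <n> MiB" converts to the number n
      status = (total > 256) ? "ReBar active" : "ReBar inactive"
      printf "GPU %d: BAR1 Total = %d MiB -> %s\n", gpu, total, status
      gpu++
      in_bar1 = 0                   # only the first Total after the heading
    }
  '
}

# Usage on the workstation:
#   nvidia-smi -q | rebar_status
```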
2.2 IOMMU Configuration

Disable the IOMMU by adding kernel parameters to the GRUB configuration:

  1. Edit GRUB configuration:
    
    sudo nano /etc/default/grub
    
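  2. Add the IOMMU parameters to the following line (on Intel platforms, use intel_iommu=off instead of amd_iommu=off):
    
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off iommu=off"
    
  3. Regenerate the GRUB configuration and reboot:
    
    sudo update-grub
    sudo reboot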

Note: P2P communication requires both ReBar activation and IOMMU deactivation.
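The GRUB change takes effect only after the GRUB configuration is regenerated (on Ubuntu, `sudo update-grub`) and the machine is rebooted. After that, `/proc/cmdline` shows the parameters the running kernel actually booted with. The sketch below checks that a set of parameters is present as whole words. The helper name `cmdline_has_params` is hypothetical, and the parameter list is whatever you added to GRUB.

```bash
#!/usr/bin/env bash
# Return 0 if every given parameter appears as a whole word on the kernel
# command line passed as the first argument, 1 otherwise.
cmdline_has_params() {
  local cmdline="$1"; shift
  local want word found
  local -a words
  read -r -a words <<< "$cmdline"   # split the command line on whitespace
  for want in "$@"; do
    found=0
    for word in "${words[@]}"; do
      if [ "$word" = "$want" ]; then found=1; break; fi
    done
    [ "$found" -eq 1 ] || return 1
  done
  return 0
}

# Usage on an AMD workstation (check intel_iommu=off instead on Intel):
#   cmdline_has_params "$(cat /proc/cmdline)" amd_iommu=off iommu=off && echo "IOMMU off"
```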

3. CUDA Toolkit Setup

3.1 Installation and Configuration

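  1. Download the CUDA Toolkit from NVIDIA’s website and install only the toolkit, not its bundled driver, so that the patched driver is not replaced.
  2. Configure the environment variables (add them to ~/.bashrc to make them persistent):
    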
    export PATH=/usr/local/cuda-12.9/bin:$PATH
    export CUDAHOSTCXX=/usr/bin/g++-12
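A quick sanity check after setting these variables is to confirm which CUDA release `nvcc` on the PATH reports and that `CUDAHOSTCXX` points at a real compiler. This is a sketch. It assumes `nvcc --version` prints a line of the form `Cuda compilation tools, release <major>.<minor>, V<full version>`. The helper names `nvcc_release` and `check_host_compiler` are hypothetical.

```bash
#!/usr/bin/env bash
# Print the CUDA release (e.g. "12.9") from `nvcc --version` output on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Return 1 (with a message on stderr) if CUDAHOSTCXX is set but is not an
# executable file; return 0 if it is executable or unset.
check_host_compiler() {
  if [ -n "${CUDAHOSTCXX:-}" ] && [ ! -x "$CUDAHOSTCXX" ]; then
    echo "CUDAHOSTCXX=$CUDAHOSTCXX is not executable" >&2
    return 1
  fi
  return 0
}

# Usage on the workstation (expect the release that matches your PATH entry):
#   nvcc --version | nvcc_release
#   check_host_compiler && echo "host compiler OK"
```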
    

4. P2P Testing

4.1 SimpleP2P Test

  1. Clone the CUDA samples repository:
    
    git clone https://github.com/NVIDIA/cuda-samples
    
  2. Compile and run simpleP2P:
    
    cd cuda-samples/Samples/0_Introduction/simpleP2P/
    mkdir build && cd build
    cmake ..
    make -j$(nproc)
    ./simpleP2P
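When you rerun simpleP2P after each configuration change, a short summary makes pass/fail obvious. The sketch below assumes the sample prints one `> Peer access from ... : Yes` (or `: No`) line per GPU pair and a `Test passed` line on success. Adjust the patterns if your version of the sample prints something different. The helper name `simplep2p_summary` is hypothetical.

```bash
#!/usr/bin/env bash
# Summarise simpleP2P output read on stdin. Exit status is 0 only if at least
# one pair reported peer access, no pair reported "No", and the test passed.
simplep2p_summary() {
  awk '
    /Peer access from/ { if ($NF == "Yes") yes++; else no++ }
    /Test passed/      { passed = 1 }
    END {
      printf "peer pairs with access: %d, without: %d, %s\n", yes, no, (passed ? "test passed" : "test not passed")
      exit !(passed && no == 0 && yes > 0)
    }
  '
}

# Usage from the build directory:
#   ./simpleP2P | simplep2p_summary
```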
    

4.2 P2P Bandwidth and Latency Test

  1. From the directory that contains the cuda-samples checkout, compile and run the bandwidth and latency test:
    
    cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
    mkdir build && cd build
    cmake ..
    make -j$(nproc)
    ./p2pBandwidthLatencyTest
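p2pBandwidthLatencyTest prints a P2P connectivity matrix before its bandwidth and latency tables, and a `0` anywhere in that matrix means a GPU pair still cannot use P2P. The sketch below extracts just that matrix. It assumes a layout of a `P2P Connectivity Matrix` title, a header row starting with `D\D`, and one row per GPU whose first field is the GPU index, followed by 1/0 entries. That layout is my assumption, so check it against your own output. The helper name `p2p_matrix_ok` is hypothetical.

```bash
#!/usr/bin/env bash
# Check the "P2P Connectivity Matrix" from p2pBandwidthLatencyTest (stdin).
# Prints every GPU pair without P2P; returns 1 if any are found or if no
# matrix rows were seen at all.
p2p_matrix_ok() {
  awk '
    /P2P Connectivity Matrix/ { in_matrix = 1; next }
    in_matrix && $1 == "D\\D" { next }          # header row
    in_matrix && $1 ~ /^[0-9]+$/ {
      rows++
      for (j = 2; j <= NF; j++)
        if ($j != 1) { printf "no P2P: GPU%d -> GPU%d\n", $1, j - 2; bad++ }
      next
    }
    in_matrix { in_matrix = 0 }                 # first non-row line ends the matrix
    END { exit (rows == 0 || bad > 0) }
  '
}

# Usage from the build directory:
#   ./p2pBandwidthLatencyTest | p2p_matrix_ok && echo "all pairs have P2P"
```

Only the connectivity matrix is checked. The bandwidth tables that follow have the same row shape, which is why the function stops reading at the first line that is not a matrix row.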
    

Conclusion

To recap, P2P communication on RTX 4090 workstations requires three things: tinygrad's patched open kernel modules paired with the matching user-space driver, Resizable BAR enabled, and the IOMMU disabled. simpleP2P and p2pBandwidthLatencyTest then confirm that peer access is actually working.