Workaround for Enabling NCCL P2P Communication for NVIDIA RTX 4090 Workstations
Why bother with NCCL P2P?
When you train or fine-tune on more than one GPU, libraries such as PyTorch and JAX rely on collective communication so every device sees the same gradients or activations at the right time. On NVIDIA hardware that usually means NCCL is in the hot path. Peer-to-peer (P2P)—direct GPU-to-GPU memory access—is how NCCL prefers to move data when the driver and topology allow it; when P2P is blocked, collectives may still run but through slower fallbacks (see the next section for definitions).
NVIDIA does not officially enable NCCL P2P on some consumer GeForce boards, including the RTX 4090, even when the hardware could support it in principle. If you bought two 4090s for a workstation and expect NCCL to “just work” like on a datacenter GPU, you can end up chasing cryptic logs until you either accept degraded comms or apply a driver/kernel workaround. This post documents one such path: modified open GPU kernel modules plus a few firmware and boot settings.
What is NCCL P2P?
NCCL (NVIDIA Collective Communications Library) implements multi-GPU collective operations: patterns such as all-reduce (combine gradients across GPUs), all-gather, broadcast, and reduce-scatter. Frameworks typically call these through a distributed backend (for example PyTorch’s ProcessGroup using NCCL). Internally, NCCL chooses topologies—rings, trees, or hybrids—and schedules send/recv-style steps along edges between GPUs.
P2P in this context means CUDA peer access: GPU i is allowed to load and store another GPU j’s device memory without an explicit copy through host DRAM. On PCIe-only setups that usually implies P2P DMA over the fabric between those endpoints (when NVLink exists, NCCL can use that too). “NCCL P2P” is shorthand for: NCCL is using peer-to-peer GPU memory paths as part of those collectives, rather than only staging through pinned host buffers.
When peer access is unavailable or disabled, NCCL can fall back to other transports, but you often see higher latency, lower effective bandwidth, or extra PCIe traffic—sometimes bad enough that scaling to two GPUs barely helps. Diagnostic hooks include cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess, NCCL’s environment knobs (for example NCCL_P2P_LEVEL, NCCL_P2P_DISABLE), and NCCL debug / topology logs that record whether the runtime believes P2P is usable between pairs of GPUs.
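To see which transport NCCL actually picked, turn on its debug logging around any two-GPU job. A minimal sketch, assuming a PyTorch install and using train.py as a stand-in for your own entry point:
# Surface NCCL's init and topology decisions in the job output.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
  torchrun --standalone --nproc_per_node=2 train.py
# For an A/B comparison, force the non-P2P fallback path:
NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO \
  torchrun --standalone --nproc_per_node=2 train.py
The INFO logs name the transport chosen per GPU pair, so you can confirm whether P2P was selected rather than a shared-memory or socket fallback.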
What is ReBAR (and why it shows up here)?
ReBAR is Resizable BAR (part of the PCIe specification). Traditionally the CPU could only map a small fixed window of GPU video memory at a time. With ReBAR, the system can expose a much larger contiguous mapping of VRAM to the CPU. That mainly helps CPU ↔ GPU traffic (textures, uploads, some unified-memory style use). It is not the same thing as GPU-to-GPU P2P, but on consumer platforms BIOS and driver stacks often treat BAR sizing and routing as part of the same configuration story.
For the workaround in this guide, ReBAR should be enabled in firmware where possible. The practical check is that the Total BAR size reported by the driver is far larger than the legacy 256 MiB window, ideally covering all of VRAM (the steps below use nvidia-smi). If ReBAR is off, update the motherboard BIOS and enable the relevant “Above 4G decoding” / Resizable BAR options before spending time on kernel modules.
IOMMU (I/O memory management unit) virtualization for devices can interfere with certain P2P paths on some boards. This guide turns IOMMU off in the kernel command line for the workstation case—only do that if you understand the tradeoff (simpler device DMA; less isolation for PCI passthrough / VFIO workflows).
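Before changing anything, it helps to record what the current boot actually enabled. One way to check (exact kernel log wording varies by platform):
cat /proc/cmdline                        # kernel parameters for the running boot
sudo dmesg | grep -i -e iommu -e amd-vi  # IOMMU initialization messages, if any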
Once the definitions and motivation are clear, the rest is execution: driver build, firmware/boot knobs, CUDA, and sanity tests.
Expected results
[Images: terminal output showing P2P working end-to-end after configuration.]
How: implementation guide
1. Driver installation
1.1 Remove existing NVIDIA drivers
sudo apt purge '^nvidia-.*'
sudo apt autoremove
sudo apt autoclean
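A quick check that the purge really took (leftover partial installs are a common source of conflicts later):
dpkg -l | grep -i nvidia   # ideally prints nothing NVIDIA-driver related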
1.2 Reboot
Restart the machine so the old stack is fully unloaded.
1.3 Unload the NVIDIA DRM module (text mode)
sudo systemctl isolate multi-user.target
sudo modprobe -r nvidia-drm
# If the GUI does not return afterward:
sudo systemctl start graphical.target
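To confirm the stack is actually unloaded before you build replacements:
lsmod | grep nvidia   # no output means no NVIDIA kernel modules are loaded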
1.4 Install the modified kernel modules
- Clone the open GPU kernel modules repository (use the repo URL, not the GitHub tree page):
git clone https://github.com/tinygrad/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
- Check out the P2P branch that matches your target driver line (example branch name; confirm on the repo):
git fetch --all
git branch -a
git switch 565.57.01-p2p
- Build the modules:
make modules -j$(nproc)
If the build fails on GCC, install GCC 12 and point alternatives at it:
sudo apt update
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 120 --slave /usr/bin/g++ g++ /usr/bin/g++-12
- Install the built modules:
sudo make modules_install -j$(nproc)
- Install the user-space driver from NVIDIA for the same version, without replacing the kernel modules you just built, for example:
# Download the matching runfile from NVIDIA, e.g.:
# https://www.nvidia.com/en-us/drivers/details/233008/
sh ./NVIDIA-Linux-[...].run --no-kernel-modules
- Reboot again.
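After the reboot, a quick way to confirm that the loaded kernel modules and the user-space driver agree on a version (mismatches typically surface as nvidia-smi errors):
cat /proc/driver/nvidia/version   # version of the loaded kernel modules
nvidia-smi                        # user-space view; the versions should match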
2. System configuration
2.1 Verify ReBAR
nvidia-smi -q | grep -i bar -A 3
Treat a Total far above the legacy 256 MiB default as a sign that ReBAR-style mapping is in play; on a 24 GB card such as the 4090, a fully resized BAR reads as 32768 MiB. If the number is 256 MiB or less, fix BIOS options before debugging NCCL.
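For a second opinion independent of the NVIDIA tools, PCI config space shows BAR sizes directly. A sketch, with 01:00.0 standing in for your GPU's actual PCI address:
lspci | grep -i nvidia                       # find the GPU's PCI address
sudo lspci -vv -s 01:00.0 | grep -i region   # BAR regions and their sizes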
2.2 Disable IOMMU in GRUB (AMD example)
Edit /etc/default/grub and adjust the default command line, for example:
sudo nano /etc/default/grub
Set something like:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off iommu=off"
Then sudo update-grub (Debian/Ubuntu) and reboot.
Reminder: P2P in this setup expects ReBAR on and IOMMU off for the paths described here.
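After the reboot, verify the flags took effect and see how the driver views the GPU-to-GPU topology:
cat /proc/cmdline    # should now include amd_iommu=off iommu=off
nvidia-smi topo -m   # connectivity matrix between GPUs (PIX, PHB, SYS, ...)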
3. CUDA toolkit
- Install a CUDA toolkit from NVIDIA’s CUDA downloads.
- Example environment for build tools:
export PATH=/usr/local/cuda-12.9/bin:$PATH
export CUDAHOSTCXX=/usr/bin/g++-12
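A quick sanity check that the toolchain lines up before building the samples:
nvcc --version             # should report the CUDA release you installed
"$CUDAHOSTCXX" --version   # host compiler CMake passes to nvcc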
4. P2P tests
4.1 SimpleP2P
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/0_Introduction/simpleP2P/
mkdir build && cd build
cmake ..
make -j$(nproc)
./simpleP2P
4.2 Bandwidth and latency
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/
mkdir build && cd build
cmake ..
make -j$(nproc)
./p2pBandwidthLatencyTest
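With the samples passing, a final end-to-end check is a tiny NCCL all-reduce. This is a sketch assuming PyTorch with CUDA support; the file path and tensor size are arbitrary:
# Write a minimal two-GPU all-reduce script and run it under torchrun.
cat > /tmp/allreduce_check.py <<'EOF'
import torch
import torch.distributed as dist

dist.init_process_group("nccl")         # torchrun supplies rank/world size
rank = dist.get_rank()
torch.cuda.set_device(rank)
x = torch.ones(1 << 20, device="cuda")  # ~1M floats per GPU
dist.all_reduce(x)                      # default op: sum across GPUs
print(f"rank {rank}: all-reduce ok, x[0] = {x[0].item()}")  # expect 2.0
dist.destroy_process_group()
EOF
NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node=2 /tmp/allreduce_check.py
Watch the NCCL INFO lines for the transport it selects between the two GPUs.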
Conclusion
NCCL P2P is worth the trouble when multi-GPU collectives are on your critical path and you are on consumer cards with official limitations. ReBAR is primarily about how much GPU memory the CPU can map at once; here it is a firmware prerequisite to line up with the driver workaround, alongside IOMMU settings and a matched open-kernel-module build + runfile install. Treat the whole stack as unsupported by NVIDIA for production—validate with the CUDA samples above, then run your real training job and watch NCCL logs for a clean bill of health.