11.2.4. InfiniBand / RoCE support
Error
TODO This section needs to be converted from FAQ Q&A style to regular documentation style.
11.2.4.1. How are InfiniBand / RoCE devices supported in Open MPI?
Open MPI’s support for InfiniBand and RoCE devices has changed over time.
In the Open MPI v5.0.x series, InfiniBand and RoCE devices are
supported via the UCX (ucx) PML.
Note
Prior versions of Open MPI also included the openib BTL for
InfiniBand and RoCE devices. Open MPI v5.0.x no longer includes the
openib BTL.
11.2.4.2. What is UCX?
UCX is an open-source, optimized communication library which supports multiple networks, including RoCE, InfiniBand, uGNI, TCP, shared memory, and others. UCX mixes and matches the transports and protocols available on the system to provide optimal performance. It also has built-in support for GPU transports (with CUDA and ROCm providers), which lets RDMA-capable transports access GPU memory directly.
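If UCX is installed, its ucx_info utility (shipped with UCX; assumed here to be in your PATH) gives a quick view of the UCX version and of the transports and devices UCX detects on the local node; look for the Transport and Device lines in the -d output:
shell$ ucx_info -v
shell$ ucx_info -d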
11.2.4.3. How do I use UCX with Open MPI?
If Open MPI includes UCX support, then UCX is enabled and selected by
default for InfiniBand and RoCE network devices; typically, no
additional parameters are required. In this case, the network port
with the highest bandwidth on the system will be used for inter-node
communication, and shared memory will be used for intra-node
communication. To select a specific network device to use (for
example, mlx5_0 device port 1):
shell$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 ...
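If you are not sure which device names and ports exist on a node, the ibv_devices and ibv_devinfo utilities from the rdma-core package (assumed here to be installed as part of your InfiniBand / RoCE software stack) will list them; mlx5_0 below is just an example device name:
shell$ ibv_devices
shell$ ibv_devinfo -d mlx5_0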
It’s also possible to force using UCX for MPI point-to-point and one-sided operations:
shell$ mpirun --mca pml ucx --mca osc ucx ...
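To confirm that the ucx PML was actually selected at run time, you can raise the PML framework's verbosity with the standard framework verbosity MCA parameter; the component selection messages are printed to the job's error output:
shell$ mpirun --mca pml ucx --mca pml_base_verbose 10 ...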
For OpenSHMEM, in addition to the above, it’s possible to force using UCX for remote memory access and atomic memory operations:
shell$ mpirun --mca pml ucx --mca osc ucx --mca scoll ucx --mca atomic ucx ...
11.2.4.4. What is RDMA over Converged Ethernet (RoCE)?
RoCE (which stands for RDMA over Converged Ethernet) provides InfiniBand native RDMA transport on top of lossless Ethernet data links.
Since we’re talking about Ethernet, there’s no Subnet Manager, no Subnet Administrator, no InfiniBand SL, nor any other InfiniBand Subnet Administration parameters.
Connection management in RoCE is based on the OFED RDMACM (RDMA Connection Manager) service:
The OS IP stack is used to resolve remote (IP,hostname) tuples to a destination MAC address (DMAC).
The outgoing Ethernet interface and VLAN are determined according to this resolution.
The appropriate RoCE device is selected accordingly.
Network parameters (such as MTU, SL, timeout) are set locally by the RDMACM in accordance with kernel policy.
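As a rough sanity check (the module names and sysfs paths below assume a standard Linux RDMA stack; mlx5_0 is just an example device name), you can verify that the RDMA CM kernel modules are loaded and look at the first GID table entry of the RoCE port:
shell$ lsmod | grep rdma_cm
shell$ cat /sys/class/infiniband/mlx5_0/ports/1/gids/0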
11.2.4.5. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters available for
any Open MPI component. For example:
shell$ ompi_info --param pml ucx --level 9
Important
Unlike most other Open MPI components, the UCX PML mainly uses environment variables for run-time tuning — not Open MPI MCA parameters. Consult the UCX documentation for details about what environment variables are available.
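The UCX documentation is the authoritative reference, but the ucx_info utility that ships with UCX can also dump the available UCX configuration environment variables together with their descriptions (the flags shown are those of recent UCX releases):
shell$ ucx_info -c -f
Each variable printed there can be set in the environment or passed to mpirun with -x, as in the examples in this section.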
11.2.4.6. How do I tell Open MPI which IB Service Level to use?
In order to tell the UCX PML which SL to use, the IB SL must be
specified using the UCX_IB_SL
environment variable. For example:
shell$ mpirun --mca pml ucx -x UCX_IB_SL=N ...
The value of IB SL N
should be between 0 and 15, where 0 is the
default value.
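For example, to run on IB service level 3 (assuming your subnet manager is configured so that SL 3 maps to the desired virtual lane):
shell$ mpirun --mca pml ucx -x UCX_IB_SL=3 ...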
11.2.4.7. How do I run Open MPI over RoCE?
In order to use RoCE with the UCX PML, the relevant Ethernet port must
be specified using the UCX_NET_DEVICES
environment variable. For
example:
shell$ mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ...
UCX selects IPv4 RoCEv2 by default. If different behavior is needed, you can set a specific GID index:
shell$ mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_GID_INDEX=1 ...
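To see which GID indexes exist on a port, and which RoCE version / address family each one corresponds to, you can inspect sysfs (the paths below assume a reasonably recent Linux kernel and the example device mlx5_0, port 1); some vendor driver packages also install a show_gids script that prints the same information as a table:
shell$ grep . /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/* 2>/dev/null
shell$ grep . /sys/class/infiniband/mlx5_0/ports/1/gids/* 2>/dev/null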
11.2.4.8. I’m experiencing a problem with Open MPI on my InfiniBand / RoCE network; how do I troubleshoot and get help?
In order for us to help you, it is most helpful if you run a few steps before sending an e-mail, both to perform some basic troubleshooting and to provide us with enough information about your environment. Please include answers to the following questions in your e-mail:
Which UCX and OpenFabrics versions are you running? Please specify where you got the software from (e.g., from the OpenFabrics and/or UCX community web sites, already included in your Linux distribution, downloaded from NVIDIA’s web site, etc.).
What distro and version of Linux are you running? What is your kernel version?
What is the output of the ibv_devinfo command on a known “good” node and a known “bad” node?
Note
There must be at least one port listed as “PORT_ACTIVE” for Open MPI to work. If there is not at least one PORT_ACTIVE port, something is wrong with your InfiniBand / RoCE environment and Open MPI will not be able to run.
What is the output of the ifconfig command on a known “good” node and a known “bad” node?
Note
Some Linux distributions do not put ifconfig in the default path for normal users; look for it at /sbin/ifconfig or /usr/sbin/ifconfig.
If running under Bourne shells, what is the output of the ulimit -l command?
If running under C shells, what is the output of the limit | grep memorylocked command?
Note
If the value is not unlimited, Open MPI (via UCX) may not be able to register enough memory; consult the UCX and/or vendor documentation for how to raise the locked-memory limit.
Error
TODO Would be good to point to some UCX/vendor docs here about setting memory limits (rather than reproducing this information ourselves).
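To make it easier to collect this information before sending an e-mail, here is a small sketch of a script that gathers most of the requested output on a single node (the commands are the ones mentioned above plus ucx_info; adjust paths for your system):
shell$ cat > gather-ib-info.sh <<'EOF'
#!/bin/sh
# Sketch: collect the troubleshooting information requested above on one node.
uname -a                                  # kernel version
cat /etc/os-release                       # Linux distribution and version
ucx_info -v                               # UCX version (if ucx_info is in PATH)
ibv_devinfo                               # devices and port states (look for PORT_ACTIVE)
/sbin/ifconfig 2>/dev/null || ifconfig    # IP configuration
ulimit -l                                 # locked-memory limit (Bourne shells)
EOF
shell$ sh gather-ib-info.sh > node-info.txt 2>&1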