Metonymical Deflection

Everyday things at an easy pace, and occasionally IT infrastructure

How to set up Scalable Function with vdpa for VMs

This article describes how to configure Scalable Function (referred to below as "SF") on CentOS8.
SF is implemented on top of the subfunction support in the Linux kernel; Intel Scalable IOV is a similar technology.*1

This article describes how to set up inter-VM communication with the vhost_vdpa module, using SFs instead of SR-IOV VFs as the network devices of the VMs.

For more information on SF, please refer to Home · Mellanox/scalablefunctions Wiki · GitHub.

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.4(2105)
Kernel                           : 5.13.4-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.4-1.0.3.0
qemu-kvm                         : v6.1.0-rc0
ovs                              : v2.14.1
libvirtd                         : v7.5.0
1-2.Overall flow

Advance Preparation
Building the Kernel
Build the related application
Enable switchdev mode
Configure ovs and VM
Operation check

1-3.Overall structure

Loop connection using a DAC (Direct Attach Cable). *2
fig.1
fig.1 is simplified and omits the internal architecture. The actual configuration looks like fig.2.

fig.2

Quoted from Red Hat's Blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to fig.1 and fig.2, respectively.
Furthermore, in fig.2, the actual traffic flow is described in blue and red letters. *3

2.Advance preparation

Although not covered in detail here, disabling SELinux, disabling the firewall, and configuring NTP time synchronization have been done in advance.

2-1.Install the Mellanox driver (OFED)

You can download the iso file from the Mellanox website (Mellanox Download Site).
Please save the downloaded iso file to /root/tmp/.
The following commands install the Mellanox driver and, at the same time, ovs v2.14.1.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.4-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot
2-2.Prepare the Scalable Function

Use the mst tool to configure the NIC settings at the H/W level.
First, use the lshw command to check the bsf (bus:slot.function) numbers of the NIC (0000:07:00.0, 0000:07:00.1).

lshw -businfo -c network
mst start
mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
mlxconfig -d 0000:07:00.1 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  ens1f0     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1     network        MT2892 Family [ConnectX-6 Dx]

[root@c84g155 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

[root@c84g155 ~]# mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

Device #1:
----------

Device type:    ConnectX6DX
Name:           MCX623106AS-CDA_Ax
Description:    ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Secure Boot; No Crypto
Device:         0000:07:00.0

Configurations:                              Next Boot       New
         PF_BAR2_ENABLE                      False(0)        False(0)
         PER_PF_NUM_SF                       True(1)         True(1)
         PF_TOTAL_SF                         236             236
         PF_SF_BAR_SIZE                      10              10
         SRIOV_EN                            True(1)         True(1)
         NUM_OF_VFS                          16              16
         NUM_PF_MSIX_VALID                   False(0)        False(0)
         PF_NUM_PF_MSIX                      1024            1024

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

0000:07:00.1 ens2f1 should be set in the same way.

Note
SRIOV_EN=1 and NUM_OF_VFS=16 are SR-IOV settings. They are not used this time, so you do not need to set them.

After completing the settings, reboot the system.

reboot
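
After the reboot, it may be worth confirming that the firmware settings took effect. The following is a minimal sketch and not part of the original procedure: the helper name mlxconfig_value is hypothetical, and it assumes that "mlxconfig query" prints rows in the same "name value" layout as the log above.

```shell
# mlxconfig_value NAME
# Reads `mlxconfig -d <device> query` output on stdin and prints the value
# column of the parameter NAME (assumes "NAME  VALUE" rows, as in the log above).
mlxconfig_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage on the host (requires the NIC and the mst tools):
#   mlxconfig -d 0000:07:00.0 query | mlxconfig_value PF_TOTAL_SF
#   mlxconfig -d 0000:07:00.1 query | mlxconfig_value PER_PF_NUM_SF
```

If the printed value differs from what was set, the new configuration has probably not been loaded yet.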

3.Building the Kernel

Kernel 5.13 or later is required. In addition, the SF-related modules must be enabled, so build the kernel from source.
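
The version requirement can be checked mechanically. The following is a small sketch under the assumption that GNU "sort -V" is available (it is on CentOS8); the function name kernel_at_least is hypothetical.

```shell
# kernel_at_least VERSION MINIMUM
# Succeeds when VERSION (e.g. the output of `uname -r`) is >= MINIMUM.
kernel_at_least() {
    ver=${1%%-*}   # drop the release suffix: 5.13.4-1.el8 -> 5.13.4
    min=$2
    # sort -V orders version strings; the smaller one comes first.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Usage on the host:
#   kernel_at_least "$(uname -r)" 5.13 || echo "kernel is older than 5.13"
```

Note that this only checks the version number; whether the SF modules are enabled still depends on the kernel configuration.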

3-2.Installing epel, etc
dnf -y install epel-release && \
dnf -y install http://repo.okay.com.mx/centos/8/x86_64/release/okay-release-1-5.el8.noarch.rpm && \
dnf -y --enablerepo=powertools install dwarves libbpf-devel
3-3.Install Kernel nosrc.rpm and download the source
cd /usr/src && \
wget https://elrepo.org/linux/kernel/el8/SRPMS/kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm && \
dnf -y localinstall kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm

wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.13.4.tar.xz && \
cp linux-5.13.4.tar.xz /root/rpmbuild/SOURCES && \
rpmbuild -bp /root/rpmbuild/SPECS/kernel-ml-5.13.spec
3-4.Running builddep
dnf builddep -y kernel
3-5.Prepare for make
cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/ && \
make mrproper && \
make oldconfig

The "make oldconfig" command prompts you to enable, disable, or build as a module each new item. The number of items is huge, so keep pressing Enter to accept the defaults.
Note
When the "make oldconfig" command is executed, the equivalent of "cp /boot/config-4.18.0-305.3.1.el8.x86_64 .config" is performed first.
As a result, configuration items that exist in both Kernel 4.18 and Kernel 5.13 inherit the settings of Kernel 4.18.
Items newly added in Kernel 5.13 are left at their default settings, so we configure them individually with the "make menuconfig" command.

3-6.Configuration by "make menuconfig"
make menuconfig

After executing the above command, the following screen will appear.
[Screenshot: top screen of "make menuconfig"]
Enable/modify the following modules.

MLX5_ESWITCH
MLX5_SF
VDPA

Make sure that the following items are set to [=y] or [=m], and if they are set to [=n], change the setting to [=m].*4
The configuration items correspond to the strings described in "Prompt". (e.g. Mellanox Technologies MLX5 SRIOV E-Switch support)
Please follow "Location" for the hierarchy of configuration items. (e.g. -> Device Drivers -> Network device support)

MLX5
│ Symbol: MLX5_ESWITCH [=y]                                                                                             │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:71                                                         │
│   Prompt: Mellanox Technologies MLX5 SRIOV E-Switch support                                                           │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE_EN [=y] && NET_SWITCHDEV [=y] │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│           -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│ (9)         -> Mellanox 5th generation network adapters (ConnectX series) Ethernet support (MLX5_CORE_EN [=y])        │

│ Symbol: MLX5_SF [=y]                                                                                                  │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:217                                                        │
│   Prompt: Mellanox Technologies subfunction device support using auxiliary device                                     │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE [=m] && MLX5_CORE_EN [=y]     │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│ (1)       -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│                                                                                                                       │
│                                                                                                                       │
│ Symbol: MLX5_SF_MANAGER [=y]                                                                                          │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:226                                                        │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_SF [=y] && MLX5_ESWITCH [=y]       │

VDPA
│ Symbol: VDPA [=m]                                          │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:2                          │
│   Prompt: vDPA drivers                                     │
│   Depends on: NET [=y]                                     │
│   Location:                                                │
│ (1) -> Device Drivers                                      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA [=y]                                     │
│ Type  : bool                                               │
│ Defined at drivers/vdpa/Kconfig:45                         │
│   Depends on: VDPA [=m]                                    │
│ Selects: VHOST_IOTLB [=m]                                  │
│ Selected by [m]:                                           │
│   - MLX5_VDPA_NET [=m] && VDPA [=m] && MLX5_CORE [=m]      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA_NET [=m]                                 │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:53                         │
│   Prompt: vDPA driver for ConnectX devices                 │
│   Depends on: VDPA [=m] && MLX5_CORE [=m]                  │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (2)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: MLX5_VDPA [=y]                                    │
│                                                            │
│                                                            │
│ Symbol: VHOST_VDPA [=m]                                    │
│ Type  : tristate                                           │
│ Defined at drivers/vhost/Kconfig:64                        │
│   Prompt: Vhost driver for vDPA-based backend              │
│   Depends on: VHOST_MENU [=y] && EVENTFD [=y] && VDPA [=m] │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (6)   -> VHOST drivers (VHOST_MENU [=y])                   │
│ Selects: VHOST [=m] && IRQ_BYPASS_MANAGER [=m]             │
│                                                            │
│                                                            │
│ Symbol: VP_VDPA [=m]                                       │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:63                         │
│   Prompt: Virtio PCI bridge vDPA driver                    │
│   Depends on: VDPA [=m] && PCI_MSI [=y]                    │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (8)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: VIRTIO_PCI_LIB [=y]                               │

After completing the configuration, save the file.
The configuration file will be saved to the following path.

/root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config
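
Before building, the saved file can be checked for the symbols listed above. This is only a sketch: the function name missing_kconfig is hypothetical, and the symbol list in the usage comment is taken from the menuconfig excerpts in this section.

```shell
# missing_kconfig CONFIG_FILE SYMBOL...
# Prints every SYMBOL that is not set to y or m in CONFIG_FILE.
missing_kconfig() {
    cfg=$1
    shift
    for sym in "$@"; do
        grep -Eq "^CONFIG_${sym}=(y|m)$" "$cfg" || echo "$sym"
    done
}

# Usage in the kernel source directory:
#   missing_kconfig .config MLX5_ESWITCH MLX5_SF MLX5_SF_MANAGER \
#       VDPA MLX5_VDPA_NET VHOST_VDPA VP_VDPA
```

An empty output means that every listed symbol is set to y or m.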
3-7.Modify the config file
sed -i -e 's/certs\/rhel.pem//g' /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config

The above command modifies the ".config" file as follows. If you do not do this, make will fail. *5

Before
CONFIG_SYSTEM_TRUSTED_KEYS="certs/rhel.pem"
After
CONFIG_SYSTEM_TRUSTED_KEYS=""
3-8.Build the Kernel
LOCALVERSION=-SFvDPA make -j32 && \
make modules_install && \
make install

Note

-SFvDPA : This is the string displayed when the OS starts. e.g. CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA)
-j32 : Specify the number of jobs to be processed in parallel to speed up the make process. A good rule of thumb is about twice the number of CPU cores.
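
The job count can also be derived from the machine instead of being hard-coded. A tiny sketch of the rule of thumb above, with a hypothetical helper name make_jobs:

```shell
# make_jobs CORES
# Prints twice the given core count, following the rule of thumb above.
make_jobs() {
    echo $(( $1 * 2 ))
}

# Usage in the kernel source directory (nproc prints the number of CPUs):
#   LOCALVERSION=-SFvDPA make -j"$(make_jobs "$(nproc)")"
```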
3-9.Startup Kernel Configuration
grubby --default-index
grubby --default-kernel

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-index
0
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-kernel
/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

Normally, it should be set as above.
If it is not, please check with the following command and set it to boot on the new Kernel.

Check the Index number
grubby --info=ALL

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --info=ALL
index=0
kernel="/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.13.4-1.el8.x86_64-SFvDPA.img $tuned_initrd"
title="CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA"
index=1
kernel="/boot/vmlinuz-4.18.0-305.3.1.el8.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-4.18.0-305.3.1.el8.x86_64.img $tuned_initrd"
title="CentOS Linux (4.18.0-305.3.1.el8.x86_64) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-4.18.0-305.3.1.el8.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57.img"
title="CentOS Linux (0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-0-rescue"

Startup Kernel Settings
grubby --set-default-index=0

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --set-default-index=0
The default is /boot/loader/entries/f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA.conf with index 0 and kernel /boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

Reboot the system to start with the built Kernel.

reboot

4.Build the related application

Build and install qemu, libvirtd, and iproute2-next.

4-1.Enabling various repositories
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo && \
dnf -y install https://pkgs.dyn.su/el8/base/x86_64/raven-release-1.0-1.el8.noarch.rpm && \
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/raven.repo
4-2.Install the necessary packages
dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel rpcgen python3-docutils \
gnutls gnutls-devel libxml2-devel yajl-devel libtirpc-devel libudev-devel libpciaccess-devel \
usbredir-devel spice-server-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y install https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/meson-0.55.3-3.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.1.0-rc0 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install

Checking Version after Installation

/usr/local/bin/qemu-system-x86_64 --version

[root@c84g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 6.0.90 (v6.1.0-rc0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Building libvirtd
cd /usr/src && \
git clone https://github.com/libvirt/libvirt.git && \
cd libvirt/ && \
git checkout v7.5.0 && \
meson build && \
ninja -C build && \
ninja -C build install

Checking Version after Installation

libvirtd -V

[root@c84g155 ~]# libvirtd -V
libvirtd (libvirt) 7.5.0
4-5.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-6.Change the user to run qemu
vi /etc/libvirt/qemu.conf

user = "root"  #uncomment this line
group = "root"  #uncomment this line
4-7.Change libvirt-sock settings

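Edit the file so that it contains the following lines (the added lines are shown in blue in the original article).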

vi /etc/rc.local

touch /var/lock/subsys/local
sleep 1
mkdir /var/run/libvirt
ln -s /var/local/run/libvirt/libvirt-sock /var/run/libvirt/libvirt-sock
sleep 1
exit 0

Granting execution permissions

chmod +x /etc/rc.d/rc.local
4-8.Building iproute2-next
cd /usr/src && \
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git iproute2-next && \
cd iproute2-next && \
./configure --include_dir /usr && \
make -j all && \
make install

Checking Version after Installation

devlink -V

[root@c84g155 ~]# devlink -V
devlink utility, iproute2-5.13.0

Reboot once again.

reboot

5.Enabling switchdev mode

5-1.Changing the operating mode
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev

Note
0000:07:00.0 and 0000:07:00.1 are the bsf numbers of the NIC, as confirmed by the lshw command.

5-2.Checking the operation mode

Noteworthy points are indicated in red text.

devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
devlink port show

[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
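
The mode check above can be scripted. The following sketch assumes the single-line output format shown in the log above; the function name eswitch_mode is hypothetical.

```shell
# eswitch_mode
# Reads `devlink dev eswitch show <dev>` output on stdin and prints the word
# that follows "mode" (inline-mode and encap-mode are different tokens and
# are not matched).
eswitch_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Usage on the host:
#   for dev in pci/0000:07:00.0 pci/0000:07:00.1; do
#       [ "$(devlink dev eswitch show "$dev" | eswitch_mode)" = switchdev ] ||
#           echo "$dev is not in switchdev mode"
#   done
```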
5-3.Configure the Scalable Function
devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
devlink port function set pci/0000:07:00.0/32768 state active
devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
devlink port function set pci/0000:07:00.1/98304 state active

[root@c84g155 ~]# devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
pci/0000:07:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 state active
[root@c84g155 ~]# devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
pci/0000:07:00.1/98304: type eth netdev eth0 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 state active
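
The port indexes (32768, 98304) used above are copied by hand from the output of "devlink port add" and may differ in other environments. As a sketch, the port handle can be extracted from that output instead; the function name port_handle is hypothetical, and the output format is assumed to match the log above.

```shell
# port_handle
# Reads `devlink port add ...` output on stdin and prints the port handle,
# i.e. the first word of the first line without its trailing colon.
port_handle() {
    awk 'NR == 1 { sub(/:$/, "", $1); print $1 }'
}

# Usage on the host:
#   handle=$(devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11 | port_handle)
#   devlink port function set "$handle" hw_addr CA:FE:C0:FF:EE:11
#   devlink port function set "$handle" state active
```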
5-4.Check the Scalable Function.

Noteworthy parts are shown in red text.

lshw -businfo -c network
devlink port show en7f0pf0sf11 -jp
devlink port show en7f1pf1sf12 -jp

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device        Class          Description
==========================================================
pci@0000:04:00.0  ens1f0        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.0  en7f0pf0sf11  network        Ethernet interface
pci@0000:07:00.1  en7f1pf1sf12  network        Ethernet interface
[root@c84g155 ~]# devlink port show en7f0pf0sf11 -jp
{
    "port": {
        "pci/0000:07:00.0/32768": {
            "type": "eth",
            "netdev": "en7f0pf0sf11",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 11,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:11",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
[root@c84g155 ~]# devlink port show en7f1pf1sf12 -jp
{
    "port": {
        "pci/0000:07:00.1/98304": {
            "type": "eth",
            "netdev": "en7f1pf1sf12",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 1,
            "sfnum": 12,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:12",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
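
The two values to look for in the JSON above are "state": "active" and "opstate": "attached". A minimal sketch of that check follows; it uses grep rather than a JSON parser so that no extra package is needed, and the function name sf_is_ready is hypothetical.

```shell
# sf_is_ready
# Reads `devlink port show <port> -j` (or -jp) output on stdin and succeeds
# only when the function is both active and attached.
sf_is_ready() {
    json=$(cat)
    echo "$json" | grep -q '"state": *"active"' &&
        echo "$json" | grep -q '"opstate": *"attached"'
}

# Usage on the host:
#   devlink port show en7f0pf0sf11 -jp | sf_is_ready || echo "SF 11 is not ready"
```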

Note
At this stage, we have completed the configuration of SF, but we have not bound SF to the vdpa device.
Therefore, in the next section, we will enable the vdpa module, unbind the existing driver, and add the vdpa device.
This will be the main difference from the SR-IOV VF.

6.Configure ovs and VM

6-1.Overall Flow - Overview -

Configure the settings in the order (1)-(9) described in fig.1 below.
fig.1

  1. Enable the vdpa module and add vdpa devices: (1)
  2. Initial configuration of ovs
  3. Configuration of br30-ovs: (2)(3)(4)
  4. Configuration of br31-ovs: (5)(6)(7)
  5. Configure and start virtual machine c77g153: (8)
  6. Configure and start virtual machine c77g159: (9)
6-2.Overall flow - Commands only -

Run the following commands.
Detailed explanations follow in the later sections; if you do not need them, just execute these commands.

1.Enable the vdpa module and add vdpa devices
(1)
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11

4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

5.Configure and start virtual machine c77g153
(8)
virsh edit c77g153
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>

6.Configure and start virtual machine c77g159
(9)
virsh edit c77g159
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>
6-3.Enable the vdpa module and add vdpa devices : (1)

Enable the vdpa module, unbind the existing drivers, and add vdpa devices.

modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

Enable the vdpa module
Enable the vhost_vdpa module
Unbind the rdma driver of en7f0pf0sf11
Unbind the eth driver of en7f0pf0sf11
Unbind the rdma driver of en7f1pf1sf12
Unbind the eth driver of en7f1pf1sf12
Add the vdpa0 device
Add the vdpa1 device
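
The auxiliary device indexes (sf.2, sf.3) are assigned by the kernel and may differ in other environments. The following sketch looks up the auxiliary device that belongs to a given sfnum. It assumes that each mlx5_core.sf.N directory exposes an "sfnum" attribute in sysfs; the function name aux_dev_for_sfnum is hypothetical, and because an sfnum is only guaranteed to be unique within one PF, the first match is returned.

```shell
# aux_dev_for_sfnum SFNUM [SYSFS_DIR]
# Prints the name of the auxiliary device (e.g. mlx5_core.sf.2) whose sfnum
# attribute equals SFNUM. SYSFS_DIR defaults to /sys/bus/auxiliary/devices.
aux_dev_for_sfnum() {
    want=$1
    base=${2:-/sys/bus/auxiliary/devices}
    for dev in "$base"/mlx5_core.sf.*; do
        [ -r "$dev/sfnum" ] || continue
        if [ "$(cat "$dev/sfnum")" = "$want" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Usage on the host:
#   aux=$(aux_dev_for_sfnum 11) && vdpa dev add name vdpa0 mgmtdev "auxiliary/$aux"
```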

Make sure you have the following settings.
Noteworthy parts are shown in red text.

devlink dev show
devlink port show
vdpa mgmtdev show
vdpa dev show
ls -Fal /dev/
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c84g155 ~]# devlink dev show
pci/0000:07:00.0
pci/0000:07:00.1
auxiliary/mlx5_core.sf.2
auxiliary/mlx5_core.sf.3
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.0/32768: type eth netdev en7f0pf0sf11 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:11 state active opstate attached
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
pci/0000:07:00.1/98304: type eth netdev en7f1pf1sf12 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:12 state active opstate attached
[root@c84g155 ~]# vdpa mgmtdev show
auxiliary/mlx5_core.sf.2:
  supported_classes net
auxiliary/mlx5_core.sf.3:
  supported_classes net
[root@c84g155 ~]# vdpa dev show
vdpa0: type network mgmtdev auxiliary/mlx5_core.sf.2 vendor_id 5555 max_vqs 16 max_vq_size 256
vdpa1: type network mgmtdev auxiliary/mlx5_core.sf.3 vendor_id 5555 max_vqs 16 max_vq_size 256
[root@c84g155 ~]# ls -Fal /dev/
total 0
drwxr-xr-x  22 root root          3660 Jul 27 09:45 ./
dr-xr-xr-x. 17 root root           244 Jul 26 13:17 ../
crw-r--r--   1 root root       10, 235 Jul 27 09:41 autofs
============ s n i p ============
crw-------   1 root root       10, 137 Jul 27 09:41 vhci
crw-------   1 root root       10, 238 Jul 27 09:48 vhost-net
crw-------   1 root root      240,   0 Jul 27 09:45 vhost-vdpa-0
crw-------   1 root root      240,   1 Jul 27 09:45 vhost-vdpa-1
[root@c84g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Jul 27 09:45 ./
drwxr-xr-x 3 root root    0 Jul 27 09:45 ../
--w------- 1 root root 4096 Jul 27 09:45 bind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jul 27 09:45 uevent
--w------- 1 root root 4096 Jul 27 09:45 unbind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.0/mlx5_core.sf.2/vdpa0/
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.1/mlx5_core.sf.3/vdpa1/

From the above output, we can confirm the following.

  • 0000:07:00.0, 0000:07:00.1, auxiliary/mlx5_core.sf.2, and auxiliary/mlx5_core.sf.3 are recognized as devlink devices.
  • pci/0000:07:00.0/32768 is recognized as en7f0pf0sf11 (as a devlink port).
  • auxiliary/mlx5_core.sf.2 and auxiliary/mlx5_core.sf.3 are recognized as vdpa management devices.
  • The vdpa device on auxiliary/mlx5_core.sf.2 is recognized as vdpa0.
  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices.
  • 0000:07:00.0/mlx5_core.sf.2/vdpa0 and 0000:07:00.1/mlx5_core.sf.3/vdpa1 are controlled by the vhost_vdpa driver.
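
The mapping between vdpa devices and their management devices can be extracted from "vdpa dev show" as follows. This is a sketch that assumes the one-line-per-device format shown in the log above; the function name vdpa_dev_map is hypothetical.

```shell
# vdpa_dev_map
# Reads `vdpa dev show` output on stdin and prints "<name> <mgmtdev>" per device.
vdpa_dev_map() {
    awk '{
        name = $1
        sub(/:$/, "", name)
        for (i = 2; i < NF; i++)
            if ($i == "mgmtdev") print name, $(i + 1)
    }'
}

# Usage on the host:
#   vdpa dev show | vdpa_dev_map
```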
6-4.Initial configuration of ovs

Since ovs has already been installed, start the service with systemctl.*6

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

Start the ovs service
HW offload and tc-policy configuration
Restart the ovs service (to reflect the above settings)

Use the following command to check the settings.

ovs-vsctl get Open_vSwitch . other_config

[root@c84g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}
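
The same check can be done in a script. A minimal sketch, assuming the output format shown above; the function name hw_offload_enabled is hypothetical.

```shell
# hw_offload_enabled
# Reads `ovs-vsctl get Open_vSwitch . other_config` output on stdin and
# succeeds when hw-offload is set to "true".
hw_offload_enabled() {
    grep -q 'hw-offload="true"'
}

# Usage on the host:
#   ovs-vsctl get Open_vSwitch . other_config | hw_offload_enabled ||
#       echo "hw-offload is not enabled"
```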

Note 1:
Here is a supplementary explanation of other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

Note 2:
If you want to remove the configuration, execute the command as follows.
The last argument is the key, so specify the key you want to delete, such as "hw-offload" or "tc-policy".

ovs-vsctl remove Open_vSwitch . other_config tc-policy
6-5.Configuration of br30-ovs : (2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11

(2) Create a bridge
(3) Create the uplink (specify PF and set the interface for the external NW)
(4) Create the downlink (specify the SF representor and set up the interface for the VM)

Check the settings with the following command.

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    ovs_version: "2.14.1"
6-6.Configuration of br31-ovs : (5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

Same as (2), (3), and (4).

Check the settings with the following command. The br31-ovs bridge and its ports are the part that has been added.

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    Bridge br31-ovs
        Port br31-ovs
            Interface br31-ovs
                type: internal
        Port ens2f1
            Interface ens2f1
        Port en7f1pf1sf12
            Interface en7f1pf1sf12
    ovs_version: "2.14.1"
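
Since 6-5 and 6-6 repeat the same three commands, they can be wrapped in one function. The following is a sketch under some assumptions: setup_bridge, run, and the DRY_RUN variable are hypothetical names introduced here, and the --may-exist option of ovs-vsctl is added (it is not used in the steps above) so that re-running the function does not fail on an existing bridge or port.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_bridge BRIDGE UPLINK DOWNLINK
#   BRIDGE   : ovs bridge name
#   UPLINK   : PF netdev (interface for the external NW)
#   DOWNLINK : representor netdev (interface for the VM)
setup_bridge() {
    run ovs-vsctl --may-exist add-br "$1" &&
    run ovs-vsctl --may-exist add-port "$1" "$2" &&
    run ovs-vsctl --may-exist add-port "$1" "$3"
}

# Usage with the same values as (2)-(7):
#   setup_bridge br30-ovs ens2f0 en7f0pf0sf11
#   setup_bridge br31-ovs ens2f1 en7f1pf1sf12
```

Setting DRY_RUN=1 only prints the commands, which is a convenient way to review them before touching the running ovs.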
6-7.Configure and start virtual machine c77g153 : (8)

Please upload the qcow2 file to "/var/lib/libvirt/images/".
In this article, the qcow2 file with CentOS7.7 installed was prepared beforehand.
Additionally, once you have created a virtual machine with virt-manager, you will edit it with the "virsh edit" command.

Login to the host OS via VNC or other means, and start virt-manager.
When creating a new virtual machine, delete the following [1]-[5] devices.*7
f:id:metonymical:20210412213049j:plain
After booting the VM, shut it down once.
After shutdown, the device configuration should look like the following.
The NICs listed here are not used for vDPA, but they let you ssh into the VM, so assign a management IP to them if needed.
f:id:metonymical:20210412212616j:plain

After shutdown, use the virsh edit command to perform the following settings.

(8)
virsh edit c77g153

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
6-8.Configure and start virtual machine c77g159 : (9)

Same as 6-7, except that /dev/vhost-vdpa-1 is used.

(9)
virsh edit c77g159

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>
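
Before starting the VMs, it may be worth confirming that the device nodes referenced in the XML actually exist. The following is a minimal sketch: is_vhost_vdpa_dev and check_vdpa_devs are hypothetical helper names, and the check only tests that each path is a character device.

```shell
# is_vhost_vdpa_dev: hypothetical helper. Succeeds if the given path
# exists and is a character device, as /dev/vhost-vdpa-N should be.
is_vhost_vdpa_dev() {
    [ -c "$1" ]
}

# check_vdpa_devs: print the state of every given path and return
# non-zero if at least one of them is missing.
check_vdpa_devs() {
    rc=0
    for dev in "$@"; do
        if is_vhost_vdpa_dev "$dev"; then
            echo "OK      $dev"
        else
            echo "MISSING $dev"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on the host:
#   check_vdpa_devs /dev/vhost-vdpa-0 /dev/vhost-vdpa-1
```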

7.Operation check

7-1.advance preparation

Prepare five consoles on hostOS c84g155.

ConsoleA : tail -f /var/log/messages
           To refer to the logs that should be checked at VM startup
ConsoleB : watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
           To check the packet count on c77g153
ConsoleC : watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
           To check the packet count on c77g159
ConsoleD : virsh start c77g153; virsh console c77g153
           For the console of virtual machine c77g153
ConsoleE : virsh start c77g159; virsh console c77g159
           For the console of virtual machine c77g159
7-2.Booting the VM

Before starting the VM, please execute the above commands on ConsoleA, B, and C.
Then, start c77g153.
Wait a few seconds and then start c77g159.
Send a ping from c77g153 or c77g159.
As an example, follow fig.1 and execute ping 192.168.30.159 from c77g153.

fig.1
f:id:metonymical:20210727105016p:plain

The following is the output. The noteworthy points are explained in the Note at the end of this section.
ConsoleA

Jul 27 11:50:49 c84g155 systemd[1]: Starting Virtualization daemon...
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtualization daemon.
Jul 27 11:50:49 c84g155 kvm[3831]: 1 guest now active
Jul 27 11:50:49 c84g155 kvm[3832]: 0 guests now active
Jul 27 11:50:49 c84g155 systemd[1]: Listening on Virtual machine log manager socket.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual machine log manager.
Jul 27 11:50:49 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_set_status:1786:(pid 3806): performing device reset
Jul 27 11:50:49 c84g155 systemd-machined[1192]: New machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual Machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 kernel: cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
Jul 27 11:50:49 c84g155 kvm[3850]: 1 guest now active
Jul 27 11:50:59 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_set_status:1786:(pid 3803): performing device reset
Jul 27 11:50:59 c84g155 systemd-machined[1192]: New machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 systemd[1]: Started Virtual Machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 kvm[3892]: 2 guests now active
Jul 27 11:51:11 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_handle_set_map:475:(pid 3855): memory map update
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00002|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:20 c84g155 kernel: Mirror/redirect action on
Jul 27 11:51:26 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_handle_set_map:475:(pid 3897): memory map update
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br30-ovs is down
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br31-ovs is down
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler5)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler3)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)

ConsoleB

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=135.157s
  port  ens2f0: rx pkts=33, bytes=4723, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=34, bytes=4835, drop=0, errs=0, coll=0
           duration=135.163s
  port  en7f0pf0sf11: rx pkts=34, bytes=4699, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=167, bytes=16515, drop=0, errs=0, coll=0
           duration=135.161s

ConsoleC

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=140.882s
  port  ens2f1: rx pkts=34, bytes=4835, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33, bytes=4723, drop=0, errs=0, coll=0
           duration=140.875s
  port  en7f1pf1sf12: rx pkts=33, bytes=4591, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=168, bytes=16627, drop=0, errs=0, coll=0
           duration=140.887s

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
64 bytes from 192.168.30.159: icmp_seq=1 ttl=64 time=123 ms
64 bytes from 192.168.30.159: icmp_seq=2 ttl=64 time=0.809 ms
64 bytes from 192.168.30.159: icmp_seq=3 ttl=64 time=0.454 ms
64 bytes from 192.168.30.159: icmp_seq=4 ttl=64 time=0.457 ms

--- 192.168.30.159 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.454/25.142/123.539/49.198 ms
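
The counters shown on ConsoleB and ConsoleC can also be read by a script instead of by eye. The following is a sketch only: port_rx_pkts is a hypothetical helper name, and it assumes the dump-ports output format shown above.

```shell
# port_rx_pkts PORT: read "ovs-ofctl dump-ports" output on stdin and
# print the rx packet count of PORT (hypothetical helper).
port_rx_pkts() {
    awk -v port="$1" '
        $1 == "port" && $2 == (port ":") {
            sub(/^.*rx pkts=/, "")   # drop everything up to the counter
            sub(/,.*$/, "")          # drop everything after the counter
            print
            exit
        }'
}

# Usage on the host:
#   before=$(ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_rx_pkts en7f0pf0sf11)
#   ... send a few pings from c77g153 ...
#   after=$(ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_rx_pkts en7f0pf0sf11)
#   [ "$after" -gt "$before" ] && echo "rx counter is increasing"
```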

Note:

performing device reset : mlx5_vdpa is initialized by mlx5_core.
memory map update : mlx5_core maps and updates the memory of mlx5_vdpa. This is the most important message, because the VM can never communicate unless this log is output.
tc mirred to Houston : A few lines are output as a temporary error, but this is not a problem.
internal error parsing flow key : ovs reports multicast-related errors, but nothing is wrong. You can ignore them.
en7f0pf0sf11, en7f1pf1sf12 : You can see that the tx/rx packet counts and byte counts for each port are increasing.
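
Because the "memory map update" message is the key indicator, it can be convenient to check for it per SF device. The following is a minimal sketch: vdpa_map_updated is a hypothetical helper name, and it assumes the log line format shown on ConsoleA.

```shell
# vdpa_map_updated LOGFILE DEVICE
# Succeeds if LOGFILE contains a "memory map update" line for DEVICE
# (for example mlx5_core.sf.2). Hypothetical helper.
vdpa_map_updated() {
    grep -F "$2: mlx5_vdpa_handle_set_map" "$1" | grep -Fq "memory map update"
}

# Usage on the host:
#   for dev in mlx5_core.sf.2 mlx5_core.sf.3; do
#       vdpa_map_updated /var/log/messages "$dev" ||
#           echo "no memory map update logged yet for $dev"
#   done
```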

8.Finally

We referred to the following websites.
https://github.com/Mellanox/scalablefunctions/wiki
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/devlink/devlink-port.rst?h=v5.12-rc3#n125
https://legacy.netdevconf.info/0x14/pub/slides/45/sf_mgmt_using_devlink_netdevconf_0x14.pdf
https://legacy.netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

We can think of SF as a lightweight SR-IOV VF with excellent scalability.
Of course, there are many more advantages than just that, as you can see in the URL above.
In addition, SF is at its best when used in conjunction with vDPA, and I expect that it will not take long for such technology to be widely adopted as systems move toward cloud-native.

Intel Scalable IOV appeared in 2018, and Sub Function and Scalable Function of the Linux Kernel are relatively new technologies that have emerged in the last few years.
SR-IOV, on the other hand, has already been in use for more than 10 years: as I recall, it was introduced in 2007 and gradually became popular around 2010.
It is said that low layer technologies evolve at a slower pace than higher layer technologies, but I believe that we are now approaching a turning point for new technologies.
As an engineer, I think it is necessary to catch these trends quickly and turn them into technical skills that can be used at the practical level.

*1:Intel Scalable IOV can be enabled by setting the kernel parameter "intel_iommu=sm_on". In addition, you need to use a NIC (Intel SmartNIC N3000 series, N5000 series) that supports the ifcvf driver.

*2:A loop connection is used in case a 100Gbps switch or a server with 100G NICs is not available. However, since we believe it is important that packets generated by the VM are physically sent to the outside, we use the configuration shown in fig.1.

*3:This is a description of what I understand. If the content is incorrect, please point it out.

*4:If it is [=n], change the setting to [=m]. [=m] indicates modularization. In this case, it is possible to enable it with modprobe commands after the OS is booted. Note, however, that if [=n] is set, it will be excluded from the build and cannot be enabled by modprobe commands.

*5:Strictly speaking, when you run "make modules_install", an error will be output.

*6:It has already been installed in section 2-1.

*7:This is because related packages such as spice were not installed when qemu was built, and the virtual machine could not be started without removing these devices. Since this is not directly related to vDPA, we will not discuss how to deal with these issues.