This article describes how to configure Scalable Function (referred to below as "SF") on CentOS8.
SF is implemented on top of the subfunction feature of the Linux Kernel; a similar technology is Intel Scalable IOV.*1
This article describes how to set up inter-VM communication using SFs instead of SR-IOV VFs, with the VMs attached through the vhost vDPA module.
For more information on SF, please refer to the Mellanox/scalablefunctions Wiki on GitHub.
1.Overview
1-1.Environment
IA server : ProLiant DL360p Gen8 or DL360 Gen9
System ROM : P71 01/22/2018
NIC : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS : CentOS8.4(2105)
Kernel : 5.13.4-1.el8.elrepo.x86_64
Installed Environment Groups :
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools
Mellanox OFED : v5.4-1.0.3.0
qemu-kvm : v6.1.0-rc0
ovs : v2.14.1
libvirtd : v7.5.0
1-2.Overall flow
Advance Preparation
Building the Kernel
Build the related application
Enable switchdev mode
Configure ovs and VM
Operation check
1-3.Overall structure
Loop connection using a DAC(Direct Attached Cable). *2
fig.1
fig.1 is a simplified description and omits the internal architecture. The actual configuration is closer to the following.
fig.2
Quoted from Red Hat's Blog
vDPA kernel framework part 3: usage for VMs and containers
The orange dotted lines (A) and (B) correspond to fig.1 and fig.2, respectively.
Furthermore, in fig.2, the actual traffic flow is described in blue and red letters. *3
2.Advance preparation
Although not described in detail here, SELinux and the firewall are disabled and NTP time synchronization is configured in advance.
2-1.Install the Mellanox driver (OFED)
You can download the iso file from the Mellanox website (Mellanox Download Site).
Please save the downloaded iso file to /root/tmp/.
The following command will install the Mellanox driver, but it will also install ovs v2.14.1 at the same time.
dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.4-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint
After the installation is complete, reboot.
reboot
2-2.Prepare the Scalable Function
Use the mst tool to configure the NIC settings at the H/W level.
First, use the lshw command to check the BDF (bus:device.function) number of the NIC (0000:07:00.0, 0000:07:00.1).
lshw -businfo -c network
mst start
mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
mlxconfig -d 0000:07:00.1 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device  Class    Description
=======================================================
pci@0000:04:00.0  ens1f0  network  82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1  network  82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1    network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2    network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3    network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4    network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0  network  MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1  network  MT2892 Family [ConnectX-6 Dx]
[root@c84g155 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
[root@c84g155 ~]# mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
Device #1:
----------
Device type:    ConnectX6DX
Name:           MCX623106AS-CDA_Ax
Description:    ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Secure Boot; No Crypto
Device:         0000:07:00.0
Configurations:        Next Boot    New
  PF_BAR2_ENABLE       False(0)     False(0)
  PER_PF_NUM_SF        True(1)      True(1)
  PF_TOTAL_SF          236          236
  PF_SF_BAR_SIZE       10           10
  SRIOV_EN             True(1)      True(1)
  NUM_OF_VFS           16           16
  NUM_PF_MSIX_VALID    False(0)     False(0)
  PF_NUM_PF_MSIX       1024         1024
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
0000:07:00.1 ens2f1 should be set in the same way.
Note
SRIOV_EN=1 and NUM_OF_VFS=16 are SR-IOV settings. They are not used this time, so you do not need to set them.
After completing the settings, reboot the system.
reboot
3.Building the Kernel
Kernel 5.13 or later is required. In addition, the SF-related modules need to be enabled, so build the Kernel from source.
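The version requirement above can be checked before and after the build with a small helper. This is only a sketch: the function name kernel_at_least is made up for illustration, and it assumes GNU sort with the -V (version sort) option, which CentOS 8 provides.

```shell
# Hypothetical helper (not from the original article).
# Succeeds when version $1 is at least version $2, using GNU "sort -V".
kernel_at_least() {
    current="$1"
    required="$2"
    # The smaller of the two versions sorts first; if that is the
    # required version, the current kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Example usage on the host (commented out so the sketch has no side effects):
# kernel_at_least "$(uname -r)" 5.13 && echo "kernel is new enough for SF"
```

The check only compares version strings; it does not tell you whether the SF and vDPA options are enabled in that kernel.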
3-1.Installing elrepo
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org && \
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
3-2.Installing epel, etc
dnf -y install epel-release && \
dnf -y install http://repo.okay.com.mx/centos/8/x86_64/release/okay-release-1-5.el8.noarch.rpm && \
dnf -y --enablerepo=powertools install dwarves libbpf-devel
3-3.Install Kernel nosrc.rpm and download the source
cd /usr/src && \
wget https://elrepo.org/linux/kernel/el8/SRPMS/kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm && \
dnf -y localinstall kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.13.4.tar.xz && \
cp linux-5.13.4.tar.xz /root/rpmbuild/SOURCES && \
rpmbuild -bp /root/rpmbuild/SPECS/kernel-ml-5.13.spec
3-4.Running builddep
dnf builddep -y kernel
3-5.Prepare for make
cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/ && \
make mrproper && \
make oldconfig
The "make oldconfig" command prompts you to enable, disable, or build as a module each new item. The list is very long, so keep pressing Enter to accept the defaults.
Note
When the "make oldconfig" command is executed, the equivalent of "cp /boot/config-4.18.0-305.3.1.el8.x86_64 .config" is performed first.
As a result, configuration items that exist in both Kernel 4.18 and Kernel 5.13 inherit the settings of Kernel 4.18.
Items newly added in Kernel 5.13 are left at their default settings, so configure them individually with the "make menuconfig" command.
3-6.Configuration by "make menuconfig"
make menuconfig
After executing the above command, the following screen will appear.
Enable/modify the following modules.
MLX5_ESWITCH MLX5_SF VDPA
Make sure that the following items are set to [=y] or [=m], and if they are set to [=n], change the setting to [=m].*4
The configuration items correspond to the strings described in "Prompt". (e.g. Mellanox Technologies MLX5 SRIOV E-Switch support)
Please follow "Location" for the hierarchy of configuration items. (e.g. -> Device Drivers -> Network device support)
MLX5
Symbol: MLX5_ESWITCH [=y]
Type : bool
Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:71
Prompt: Mellanox Technologies MLX5 SRIOV E-Switch support
Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE_EN [=y] && NET_SWITCHDEV [=y]
Location:
  -> Device Drivers
    -> Network device support (NETDEVICES [=y])
      -> Ethernet driver support (ETHERNET [=y])
        -> Mellanox devices (NET_VENDOR_MELLANOX [=y])
(9)       -> Mellanox 5th generation network adapters (ConnectX series) Ethernet support (MLX5_CORE_EN [=y])
Symbol: MLX5_SF [=y]
Type : bool
Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:217
Prompt: Mellanox Technologies subfunction device support using auxiliary device
Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE [=m] && MLX5_CORE_EN [=y]
Location:
  -> Device Drivers
    -> Network device support (NETDEVICES [=y])
      -> Ethernet driver support (ETHERNET [=y])
(1)     -> Mellanox devices (NET_VENDOR_MELLANOX [=y])
Symbol: MLX5_SF_MANAGER [=y]
Type : bool
Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:226
Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_SF [=y] && MLX5_ESWITCH [=y]
VDPA
Symbol: VDPA [=m]
Type : tristate
Defined at drivers/vdpa/Kconfig:2
Prompt: vDPA drivers
Depends on: NET [=y]
Location:
(1) -> Device Drivers
Symbol: MLX5_VDPA [=y]
Type : bool
Defined at drivers/vdpa/Kconfig:45
Depends on: VDPA [=m]
Selects: VHOST_IOTLB [=m]
Selected by [m]:
  - MLX5_VDPA_NET [=m] && VDPA [=m] && MLX5_CORE [=m]
Symbol: MLX5_VDPA_NET [=m]
Type : tristate
Defined at drivers/vdpa/Kconfig:53
Prompt: vDPA driver for ConnectX devices
Depends on: VDPA [=m] && MLX5_CORE [=m]
Location:
  -> Device Drivers
(2) -> vDPA drivers (VDPA [=m])
Selects: MLX5_VDPA [=y]
Symbol: VHOST_VDPA [=m]
Type : tristate
Defined at drivers/vhost/Kconfig:64
Prompt: Vhost driver for vDPA-based backend
Depends on: VHOST_MENU [=y] && EVENTFD [=y] && VDPA [=m]
Location:
  -> Device Drivers
(6) -> VHOST drivers (VHOST_MENU [=y])
Selects: VHOST [=m] && IRQ_BYPASS_MANAGER [=m]
Symbol: VP_VDPA [=m]
Type : tristate
Defined at drivers/vdpa/Kconfig:63
Prompt: Virtio PCI bridge vDPA driver
Depends on: VDPA [=m] && PCI_MSI [=y]
Location:
  -> Device Drivers
(8) -> vDPA drivers (VDPA [=m])
Selects: VIRTIO_PCI_LIB [=y]
After completing the configuration, save the file.
The configuration file will be saved to the following path.
/root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config
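Once the file is saved, the "[=y] or [=m]" requirement can be double-checked from the shell instead of going back through the menus. The following is a minimal sketch; check_kconfig is a hypothetical helper name, and the symbol list in the usage line is taken from the menuconfig output in this article.

```shell
# Hypothetical helper (not from the original article).
# Reports, for each symbol, whether CONFIG_<symbol> is set to y or m
# in the given .config file. Returns non-zero if any symbol is missing.
check_kconfig() {
    config_file="$1"
    shift
    missing=0
    for symbol in "$@"; do
        if grep -Eq "^CONFIG_${symbol}=(y|m)$" "$config_file"; then
            echo "ok: ${symbol}"
        else
            echo "NOT SET: ${symbol}"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (commented out; path as used in this article):
# check_kconfig /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config \
#     MLX5_ESWITCH MLX5_SF MLX5_SF_MANAGER VDPA MLX5_VDPA MLX5_VDPA_NET VHOST_VDPA VP_VDPA
```

A disabled item appears in .config as a "# CONFIG_... is not set" comment, which this check reports as NOT SET.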
3-7.Modify the config file
sed -i -e 's/certs\/rhel.pem//g' /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config
The above command modifies the ".config" file as follows. If you do not do this, make will fail. *5
Before
CONFIG_SYSTEM_TRUSTED_KEYS="certs/rhel.pem"
After
CONFIG_SYSTEM_TRUSTED_KEYS=""
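If you want to see the effect of the substitution before editing ".config" in place, the same sed expression can be run on a sample line through a pipe. This is only an illustration; the wrapper name strip_rhel_cert is made up here.

```shell
# Hypothetical wrapper (not from the original article) around the same
# sed expression used in 3-7, reading stdin and writing stdout so that
# nothing on disk is modified.
strip_rhel_cert() {
    sed -e 's/certs\/rhel.pem//g'
}

# Example usage (commented out): preview the affected line of the real file.
# grep CONFIG_SYSTEM_TRUSTED_KEYS .config | strip_rhel_cert
```

Because the expression removes only the string "certs/rhel.pem", the surrounding double quotes stay in place and the value becomes an empty string.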
3-8.Build the Kernel
LOCALVERSION=-SFvDPA make -j32 && \
make modules_install && \
make install
Note
-SFvDPA | This is the string displayed when the OS starts. e.g. CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA) |
-j32 | Specify the number of jobs to be processed in parallel to speed up the make process. A good rule of thumb is about twice the number of CPU cores. |
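The rule of thumb in the note can be turned into a value for -j instead of hard-coding 32. A minimal sketch, assuming the "twice the number of CPU cores" guideline above; make_jobs is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original article).
# Prints twice the given core count, following the rule of thumb above.
make_jobs() {
    cores="$1"
    echo $(( cores * 2 ))
}

# Example usage (commented out): nproc reports the available CPU cores.
# LOCALVERSION=-SFvDPA make -j"$(make_jobs "$(nproc)")"
```

On a host with 16 cores this yields the -j32 used in this article.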
3-9.Startup Kernel Configuration
grubby --default-index
grubby --default-kernel
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-index
0
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-kernel
/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA
Normally, it should be set as above.
If it is not, please check with the following command and set it to boot on the new Kernel.
Check the Index number
grubby --info=ALL
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --info=ALL
index=0
kernel="/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.13.4-1.el8.x86_64-SFvDPA.img $tuned_initrd"
title="CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA"
index=1
kernel="/boot/vmlinuz-4.18.0-305.3.1.el8.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-4.18.0-305.3.1.el8.x86_64.img $tuned_initrd"
title="CentOS Linux (4.18.0-305.3.1.el8.x86_64) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-4.18.0-305.3.1.el8.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57.img"
title="CentOS Linux (0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-0-rescue"
Startup Kernel Settings
grubby --set-default-index=0
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --set-default-index=0
The default is /boot/loader/entries/f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA.conf with index 0 and kernel /boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA
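If the list of boot entries is long, the index of the newly built Kernel can be picked out of the "grubby --info=ALL" output with a short filter. This is a sketch under the assumption that the output has the "index=" and "kernel=" lines shown in this article; grubby_index_for is a hypothetical helper name.

```shell
# Hypothetical helper (not from the original article).
# Reads "grubby --info=ALL" output on stdin and prints the index of the
# first entry whose kernel= line contains the given string.
grubby_index_for() {
    pattern="$1"
    awk -v pat="$pattern" '
        /^index=/  { idx = substr($0, 7) }            # remember the current entry index
        /^kernel=/ && index($0, pat) { print idx; exit }
    '
}

# Example usage (commented out):
# idx=$(grubby --info=ALL | grubby_index_for SFvDPA)
# grubby --set-default-index="$idx"
```

The string is matched literally, so the LOCALVERSION suffix chosen in 3-8 (here "SFvDPA") is a convenient search key.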
Reboot the system to start with the built Kernel.
reboot
4.Build the related application
Build and install qemu, libvirtd, and iproute2-next.
4-1.Enabling various repositories
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo && \
dnf -y install https://pkgs.dyn.su/el8/base/x86_64/raven-release-1.0-1.el8.noarch.rpm && \
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/raven.repo
4-2.Install the necessary packages
dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel rpcgen python3-docutils \
gnutls gnutls-devel libxml2-devel yajl-devel libtirpc-devel libudev-devel libpciaccess-devel \
usbredir-devel spice-server-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y install https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/meson-0.55.3-3.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.1.0-rc0 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install
Checking Version after Installation
/usr/local/bin/qemu-system-x86_64 --version
[root@c84g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 6.0.90 (v6.1.0-rc0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Building libvirtd
cd /usr/src && \
git clone https://github.com/libvirt/libvirt.git && \
cd libvirt/ && \
git checkout v7.5.0 && \
meson build && \
ninja -C build && \
ninja -C build install
Checking Version after Installation
libvirtd -V
[root@c84g155 ~]# libvirtd -V
libvirtd (libvirt) 7.5.0
4-5.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-6.Change the user to run qemu
vi /etc/libvirt/qemu.conf
user = "root"   # uncomment this line
group = "root"  # uncomment this line
4-7.Change libvirt-sock settings
Add the lines that follow "touch /var/lock/subsys/local" below (shown in blue in the original article).
vi /etc/rc.local
touch /var/lock/subsys/local
sleep 1
mkdir /var/run/libvirt
ln -s /var/local/run/libvirt/libvirt-sock /var/run/libvirt/libvirt-sock
sleep 1
exit 0
Granting execution permissions
chmod +x /etc/rc.d/rc.local
4-8.Building iproute2-next
cd /usr/src && \
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git iproute2-next && \
cd iproute2-next && \
./configure --include_dir /usr && \
make -j all && \
make install
Checking Version after Installation
devlink -V
[root@c84g155 ~]# devlink -V
devlink utility, iproute2-5.13.0
Reboot once again.
reboot
5.Enabling switchdev mode
5-1.Changing the operating mode
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev
Note
0000:07:00.0 and 0000:07:00.1 are the BDF (bus:device.function) numbers of the NIC, as confirmed by the lshw command.
5-2.Checking the operation mode
Noteworthy points are indicated in red text.
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
devlink port show
[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
5-3.Configure the Scalable Function
devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
devlink port function set pci/0000:07:00.0/32768 state active
devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
devlink port function set pci/0000:07:00.1/98304 state active
[root@c84g155 ~]# devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
pci/0000:07:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 state active
[root@c84g155 ~]# devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
pci/0000:07:00.1/98304: type eth netdev eth0 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 state active
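The port index used by the "devlink port function set" commands (32768 and 98304 here) is reported in the output of "devlink port add", so it may differ in other environments. As a sketch, the handle can be taken from the first line of that output rather than typed by hand; port_handle is a hypothetical helper name, and the parsing assumes the text output format shown in this article.

```shell
# Hypothetical helper (not from the original article).
# Reads "devlink port add" text output on stdin and prints the port
# handle (the first field of the first line, without the trailing colon).
port_handle() {
    awk 'NR == 1 { sub(/:$/, "", $1); print $1 }'
}

# Example usage (commented out; requires the NIC in switchdev mode):
# handle=$(devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11 | port_handle)
# devlink port function set "$handle" hw_addr CA:FE:C0:FF:EE:11
# devlink port function set "$handle" state active
```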
5-4.Check the Scalable Function.
Noteworthy parts are shown in red text.
lshw -businfo -c network
devlink port show en7f0pf0sf11 -jp
devlink port show en7f1pf1sf12 -jp
[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device        Class    Description
==========================================================
pci@0000:04:00.0  ens1f0        network  82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1        network  82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1          network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2          network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3          network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4          network  NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0        network  MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1        network  MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.0  en7f0pf0sf11  network  Ethernet interface
pci@0000:07:00.1  en7f1pf1sf12  network  Ethernet interface
[root@c84g155 ~]# devlink port show en7f0pf0sf11 -jp
{
    "port": {
        "pci/0000:07:00.0/32768": {
            "type": "eth",
            "netdev": "en7f0pf0sf11",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 11,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:11",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
[root@c84g155 ~]# devlink port show en7f1pf1sf12 -jp
{
    "port": {
        "pci/0000:07:00.1/98304": {
            "type": "eth",
            "netdev": "en7f1pf1sf12",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 1,
            "sfnum": 12,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:12",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
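The two values to look for in this output are state "active" and opstate "attached". As a minimal sketch, the plain-text form of "devlink port show" (without -jp) can be filtered for both at once; sf_is_ready is a hypothetical helper name, and the match string assumes the text format that devlink prints in this article.

```shell
# Hypothetical helper (not from the original article).
# Reads "devlink port show <port>" text output on stdin and succeeds
# only if the function line reports "state active opstate attached".
sf_is_ready() {
    grep -q 'state active opstate attached'
}

# Example usage (commented out):
# devlink port show en7f0pf0sf11 | sf_is_ready && echo "SF 11 is ready"
# devlink port show en7f1pf1sf12 | sf_is_ready && echo "SF 12 is ready"
```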
Note
At this stage, we have completed the configuration of SF, but we have not bound SF to the vdpa device.
Therefore, in the next section, we will enable the vdpa module, unbind the existing driver, and add the vdpa device.
This will be the main difference from the SR-IOV VF.
6.Configure ovs and VM
6-1.Overall Flow - Overview -
Configure the settings in the order (1)-(9) described in fig.1 below.
fig.1
- Enable the vdpa module and add vdpa devices: (1)
- Initial configuration of ovs
- Configuration of br30-ovs: (2)(3)(4)
- Configuration of br31-ovs: (5)(6)(7)
- Configure and start virtual machine c77g153: (8)
- Configure and start virtual machine c77g159: (9)
6-2.Overall flow - Commands only -
Execute the following commands.
Detailed explanations will follow, but if you don't need the explanations, just execute the commands.
1.Enable the vdpa module and add vdpa devices
(1)
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3
2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch
3.Configuration of br30-ovs
(2) ovs-vsctl add-br br30-ovs
(3) ovs-vsctl add-port br30-ovs ens2f0
(4) ovs-vsctl add-port br30-ovs en7f0pf0sf11
4.Configuration of br31-ovs
(5) ovs-vsctl add-br br31-ovs
(6) ovs-vsctl add-port br31-ovs ens2f1
(7) ovs-vsctl add-port br31-ovs en7f1pf1sf12
5.Configure and start virtual machine c77g153
(8) virsh edit c77g153
<interface type='vdpa'>
  <source dev='/dev/vhost-vdpa-0'/>
</interface>
6.Configure and start virtual machine c77g159
(9) virsh edit c77g159
<interface type='vdpa'>
  <source dev='/dev/vhost-vdpa-1'/>
</interface>
6-3.Enable the vdpa module and add vdpa devices : (1)
Enable the vdpa module, unbind the existing drivers, and add vdpa devices.
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3
The commands above do the following, in order.
Load the vdpa module
Load the vhost_vdpa module
Unbind the rdma driver of en7f0pf0sf11
Unbind the eth driver of en7f0pf0sf11
Unbind the rdma driver of en7f1pf1sf12
Unbind the eth driver of en7f1pf1sf12
Add the vdpa0 device
Add the vdpa1 device
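The unbind paths follow a regular pattern, so they can be generated from the SF's auxiliary device index (2 and 3 in this environment). The following is a sketch only: sf_unbind_paths is a hypothetical helper name, and it assumes, as observed in this article, that the rdma and eth auxiliary devices carry the same index as their parent mlx5_core.sf device. Check "ls /sys/bus/auxiliary/devices/" on your own host before relying on it.

```shell
# Hypothetical helper (not from the original article).
# Prints the rdma and eth "unbind" paths for one SF auxiliary device,
# assuming the child devices share the SF's index (an assumption based
# on the paths seen in this environment).
sf_unbind_paths() {
    sf_index="$1"
    for class in rdma eth; do
        echo "/sys/bus/auxiliary/devices/mlx5_core.sf.${sf_index}/mlx5_core.${class}.${sf_index}/driver/unbind"
    done
}

# Example usage (commented out): list the paths first, then unbind by hand.
# sf_unbind_paths 2
# echo mlx5_core.rdma.2 > "$(sf_unbind_paths 2 | head -n 1)"
```

Printing the paths before writing to them is deliberate, since writing to the wrong "unbind" file detaches a driver that may still be in use.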
Make sure you have the following settings.
Noteworthy parts are shown in red text.
devlink dev show
devlink port show
vdpa mgmtdev show
vdpa dev show
ls -Fal /dev/
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
[root@c84g155 ~]# devlink dev show
pci/0000:07:00.0
pci/0000:07:00.1
auxiliary/mlx5_core.sf.2
auxiliary/mlx5_core.sf.3
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.0/32768: type eth netdev en7f0pf0sf11 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:11 state active opstate attached
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
pci/0000:07:00.1/98304: type eth netdev en7f1pf1sf12 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:12 state active opstate attached
[root@c84g155 ~]# vdpa mgmtdev show
auxiliary/mlx5_core.sf.2:
  supported_classes net
auxiliary/mlx5_core.sf.3:
  supported_classes net
[root@c84g155 ~]# vdpa dev show
vdpa0: type network mgmtdev auxiliary/mlx5_core.sf.2 vendor_id 5555 max_vqs 16 max_vq_size 256
vdpa1: type network mgmtdev auxiliary/mlx5_core.sf.3 vendor_id 5555 max_vqs 16 max_vq_size 256
[root@c84g155 ~]# ls -Fal /dev/
total 0
drwxr-xr-x 22 root root 3660 Jul 27 09:45 ./
dr-xr-xr-x. 17 root root 244 Jul 26 13:17 ../
crw-r--r-- 1 root root 10, 235 Jul 27 09:41 autofs
============ s n i p ============
crw------- 1 root root 10, 137 Jul 27 09:41 vhci
crw------- 1 root root 10, 238 Jul 27 09:48 vhost-net
crw------- 1 root root 240, 0 Jul 27 09:45 vhost-vdpa-0
crw------- 1 root root 240, 1 Jul 27 09:45 vhost-vdpa-1
[root@c84g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root 0 Jul 27 09:45 ./
drwxr-xr-x 3 root root 0 Jul 27 09:45 ../
--w------- 1 root root 4096 Jul 27 09:45 bind
lrwxrwxrwx 1 root root 0 Jul 27 09:45 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jul 27 09:45 uevent
--w------- 1 root root 4096 Jul 27 09:45 unbind
lrwxrwxrwx 1 root root 0 Jul 27 09:45 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.0/mlx5_core.sf.2/vdpa0/
lrwxrwxrwx 1 root root 0 Jul 27 09:45 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.1/mlx5_core.sf.3/vdpa1/
From the above output, we can confirm the following:
- pci/0000:07:00.0, pci/0000:07:00.1, auxiliary/mlx5_core.sf.2, and auxiliary/mlx5_core.sf.3 are recognized as devlink devices.
- pci/0000:07:00.0/32768 is recognized as en7f0pf0sf11 (as a devlink port).
- auxiliary/mlx5_core.sf.2 and auxiliary/mlx5_core.sf.3 are recognized as vdpa management devices.
- The vdpa device whose management device is auxiliary/mlx5_core.sf.2 is recognized as vdpa0.
- /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices.
- 0000:07:00.0/mlx5_core.sf.2/vdpa0 and 0000:07:00.1/mlx5_core.sf.3/vdpa1 are controlled by the vhost_vdpa driver.
6-4.Initial configuration of ovs
Since ovs has already been installed, start the service from systemctl.*6
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch
Start the ovs service
HW offload and tc-policy configuration
Restart the ovs service (to reflect the above settings)
Use the following command to check the settings.
ovs-vsctl get Open_vSwitch . other_config
[root@c84g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}
Note 1:
Here is a supplementary explanation of other_config:tc-policy.
The following options can be set for tc-policy.
none | adds a TC rule to both the software and the hardware (default) |
skip_sw | adds a TC rule only to the hardware |
skip_hw | adds a TC rule only to the software |
Note 2:
If you want to remove a setting, execute the command as follows.
In this example "tc-policy" is the key; specify whichever key you want to delete, such as "hw-offload" or "tc-policy".
ovs-vsctl remove Open_vSwitch . other_config tc-policy
6-5.Configuration of br30-ovs : (2)(3)(4)
Create the first bridge.
(2) ovs-vsctl add-br br30-ovs
(3) ovs-vsctl add-port br30-ovs ens2f0
(4) ovs-vsctl add-port br30-ovs en7f0pf0sf11
(2) Create a bridge
(3) Create the uplink (specify PF and set the interface for the external NW)
(4) Create the downlink (specify the SF representor and set up the interface for the VM)
Check the settings with the following command.
[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
Bridge br30-ovs
Port br30-ovs
Interface br30-ovs
type: internal
Port ens2f0
Interface ens2f0
Port en7f0pf0sf11
Interface en7f0pf0sf11
ovs_version: "2.14.1"
6-6.Configuration of br31-ovs : (5)(6)(7)
Create the second bridge.
(5) ovs-vsctl add-br br31-ovs
(6) ovs-vsctl add-port br31-ovs ens2f1
(7) ovs-vsctl add-port br31-ovs en7f1pf1sf12
Same as (2), (3), and (4).
Check the settings with the following command. The blue text is the part that has been added.
[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
Bridge br30-ovs
Port br30-ovs
Interface br30-ovs
type: internal
Port ens2f0
Interface ens2f0
Port en7f0pf0sf11
Interface en7f0pf0sf11
Bridge br31-ovs
Port br31-ovs
Interface br31-ovs
type: internal
Port ens2f1
Interface ens2f1
Port en7f1pf1sf12
Interface en7f1pf1sf12
ovs_version: "2.14.1"
6-7.Configure and start virtual machine c77g153 : (8)
Please upload the qcow2 file to "/var/lib/libvirt/images/".
In this article, the qcow2 file with CentOS7.7 installed was prepared beforehand.
First create the virtual machine with virt-manager, and then edit it with the "virsh edit" command.
Login to the host OS via VNC or other means, and start virt-manager.
When creating a new virtual machine, delete the following [1]-[5] devices.*7
After booting the VM, shut it down once.
After shutdown, the device configuration should look like the following.
The NICs listed here are not used for vDPA, but they allow you to ssh to the VM, so assign a management IP to them if necessary.
After shutdown, use the virsh edit command to perform the following settings.
(8) virsh edit c77g153
<devices>
============ s n i p ============
<interface type='vdpa'>
  <source dev='/dev/vhost-vdpa-0'/>
</interface>
6-8.Configure and start virtual machine c77g159 : (9)
Same as 6-7, except that /dev/vhost-vdpa-1 is specified.
(9) virsh edit c77g159
<devices>
============ s n i p ============
<interface type='vdpa'>
  <source dev='/dev/vhost-vdpa-1'/>
</interface>
7.Operation check
7-1.Advance preparation
Prepare five consoles on hostOS c84g155.
ConsoleA | tail -f /var/log/messages | To refer to the logs that should be checked at VM startup |
ConsoleB | watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | To check the packet count on c77g153 |
ConsoleC | watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs | To check the packet count on c77g159 |
ConsoleD | virsh start c77g153; virsh console c77g153 | For the console of virtual machine c77g153 |
ConsoleE | virsh start c77g159; virsh console c77g159 | For the console of virtual machine c77g159 |
7-2.Booting the VM
Before starting the VM, please execute the above commands on ConsoleA, B, and C.
Then, start c77g153.
Wait a few seconds and then start c77g159.
Send a ping from c77g153 or c77g159.
As an example, follow fig.1 and execute ping 192.168.30.159 from c77g153.
fig.1
The following is the output result. Noteworthy points are indicated in red text.
ConsoleA
Jul 27 11:50:49 c84g155 systemd[1]: Starting Virtualization daemon...
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtualization daemon.
Jul 27 11:50:49 c84g155 kvm[3831]: 1 guest now active
Jul 27 11:50:49 c84g155 kvm[3832]: 0 guests now active
Jul 27 11:50:49 c84g155 systemd[1]: Listening on Virtual machine log manager socket.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual machine log manager.
Jul 27 11:50:49 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_set_status:1786:(pid 3806): performing device reset
Jul 27 11:50:49 c84g155 systemd-machined[1192]: New machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual Machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 kernel: cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
Jul 27 11:50:49 c84g155 kvm[3850]: 1 guest now active
Jul 27 11:50:59 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_set_status:1786:(pid 3803): performing device reset
Jul 27 11:50:59 c84g155 systemd-machined[1192]: New machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 systemd[1]: Started Virtual Machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 kvm[3892]: 2 guests now active
Jul 27 11:51:11 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_handle_set_map:475:(pid 3855): memory map update
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00002|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:20 c84g155 kernel: Mirror/redirect action on
Jul 27 11:51:26 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_handle_set_map:475:(pid 3897): memory map update
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br30-ovs is down
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br31-ovs is down
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler5)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler3)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
ConsoleB
[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=135.157s
  port ens2f0: rx pkts=33, bytes=4723, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=34, bytes=4835, drop=0, errs=0, coll=0
           duration=135.163s
  port en7f0pf0sf11: rx pkts=34, bytes=4699, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=167, bytes=16515, drop=0, errs=0, coll=0
           duration=135.161s
ConsoleC
[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=140.882s
  port ens2f1: rx pkts=34, bytes=4835, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33, bytes=4723, drop=0, errs=0, coll=0
           duration=140.875s
  port en7f1pf1sf12: rx pkts=33, bytes=4591, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=168, bytes=16627, drop=0, errs=0, coll=0
           duration=140.887s
ConsoleD
[root@c77g153 ~]# ping 192.168.30.159
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
64 bytes from 192.168.30.159: icmp_seq=1 ttl=64 time=123 ms
64 bytes from 192.168.30.159: icmp_seq=2 ttl=64 time=0.809 ms
64 bytes from 192.168.30.159: icmp_seq=3 ttl=64 time=0.454 ms
64 bytes from 192.168.30.159: icmp_seq=4 ttl=64 time=0.457 ms
--- 192.168.30.159 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.454/25.142/123.539/49.198 ms
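The per-port counters shown in ConsoleB and ConsoleC can be reduced to one line per port and direction, which makes it easier to compare two runs taken before and after the ping. The following is only a minimal sketch: the function name port_counters is hypothetical, and it assumes the dump-ports output format shown above (it also accepts the same text flattened onto a single line).

```shell
#!/bin/sh
# port_counters (hypothetical helper name): reads "ovs-ofctl dump-ports"
# output on stdin and prints "<port> <rx|tx> <pkts> <bytes>" per line.
# Assumption: the output layout matches the OVS v2.14.1 output in this article.
port_counters() {
    awk '
    {
        gsub(/,/, " ")                 # commas only separate key=value pairs
        for (i = 1; i <= NF; i++) {
            t = $i
            if (t == "port") {         # next token is the port name, e.g. "ens2f0:"
                name = $(++i)
                sub(/:$/, "", name)
            } else if (t == "rx" || t == "tx") {
                dir = t                # remember the direction of the counters
            } else if (t ~ /^pkts=/) {
                pkts = substr(t, 6)    # text after "pkts="
            } else if (t ~ /^bytes=/) {
                print name, dir, pkts, substr(t, 7)   # text after "bytes="
            }
        }
    }'
}

# Usage on the host (not executed here):
#   ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_counters
```

Saving the output once before and once after the ping and comparing the two files with diff shows which counters have increased.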
Note:
performing device reset | mlx5_core is initializing mlx5_vdpa. |
memory map update | mlx5_core is setting up and updating the memory mapping for mlx5_vdpa. This is the most important message: the VMs cannot communicate unless it has been logged. |
tc mirred to Houston | A few of these lines are logged as a transient error; they can be ignored. |
internal error parsing flow key | ovs reports these multicast-related errors, but they are harmless and can be ignored. |
en7f0pf0sf11, en7f1pf1sf12 | The tx/rx packet counts and byte counts for each port increase. |
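Because communication depends on the "memory map update" message, it can be useful to list which SF devices have logged it. The following is a minimal sketch, assuming the kernel message format shown in ConsoleA. The function name sf_memory_map_devices is hypothetical, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
#!/bin/sh
# sf_memory_map_devices (hypothetical helper name): reads log text on stdin
# and prints each SF device (for example mlx5_core.sf.2) that logged
# mlx5_vdpa_handle_set_map, i.e. the "memory map update" message.
# Assumption: the message layout is the one shown in ConsoleA.
sf_memory_map_devices() {
    # Match "<device>: mlx5_vdpa_handle_set_map" so that "performing device
    # reset" messages from the same device are not counted.
    grep -o 'mlx5_core\.sf\.[0-9][0-9]*: mlx5_vdpa_handle_set_map' \
        | cut -d: -f1 \
        | sort -u
}

# Usage on the host (not executed here; assumes the kernel log is read
# through journalctl):
#   journalctl -k --no-pager | sf_memory_map_devices
```

With the ConsoleA log above, both mlx5_core.sf.2 and mlx5_core.sf.3 should be listed. If a device is missing, the corresponding VM is not expected to be able to communicate.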
8.Finally
We referred to the following websites.
https://github.com/Mellanox/scalablefunctions/wiki
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/devlink/devlink-port.rst?h=v5.12-rc3#n125
https://legacy.netdevconf.info/0x14/pub/slides/45/sf_mgmt_using_devlink_netdevconf_0x14.pdf
https://legacy.netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
We can think of SF as a lightweight SR-IOV VF with excellent scalability.
Of course, there are many more advantages than just that, as described at the URLs above.
In addition, SF is at its best when used in conjunction with vDPA, and I do not expect it to take long for this combination to be widely adopted as systems go cloud-native.
Intel Scalable IOV was announced in 2018, and the Sub Function and Scalable Function of the Linux Kernel are relatively new technologies that have emerged in the last few years.
SR-IOV, on the other hand, has already been in use for more than 10 years; as I recall, it was introduced in 2007 and gradually became popular around 2010.
It is said that low layer technologies evolve at a slower pace than higher layer technologies, but I believe that we are now approaching a turning point for new technologies.
As an engineer, I think it is necessary to catch these trends early and turn them into technical skills that can be used at the practical level.
*1:Intel Scalable IOV can be enabled by setting the kernel parameter "intel_iommu=sm_on". In addition, you need to use a NIC (Intel SmartNIC N3000 series, N5000 series) that supports the ifcvf driver.
*2:A loop connection is used in case a 100Gbps switch or a server with 100G NICs is not available. However, since we believe it is important that packets generated by the VM are physically sent to the outside, we use the configuration shown in fig.1.
*3:This is a description of what I understand. If the content is incorrect, please point it out.
*4:If it is [=n], change the setting to [=m]. [=m] means that the feature is built as a loadable module, so it can be enabled with the modprobe command after the OS has booted. Note, however, that if [=n] is set, the feature is excluded from the build and cannot be enabled with modprobe.
*5:Strictly speaking, when you run "make modules_install", an error will be output.
*6:It has already been installed in section 2-1.
*7:This is because related packages such as spice were not installed when qemu was built, and the virtual machine could not be started without removing these devices. Since this is not directly related to vDPA, we will not discuss how to deal with these issues.