Metonymical Deflection

Casual notes on daily life, and occasionally IT infrastructure

How to set up vDPA - appendix -

In this article, we describe how to procure NICs, how to configure vDPA with switching backends other than ovs-dpdk (Linux Bridge and plain ovs), and the issues we are currently facing.
Since the topics are diverse, you may read them in any order.

This is a sequel to the previous article.*1
The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Topics

  • How to procure a NIC
  • How to configure in Linux Bridge
  • How to configure ovs (without dpdk)
  • Issue 1: How to start a virtual machine
  • Issue 2: VF relay configuration

2.How to procure a NIC

There are three drivers that support vDPA in the Linux kernel.
linux/drivers/vdpa at master · torvalds/linux · GitHub
The NICs that are supported by each driver are as follows.

ifcvf     Intel SmartNIC N3000 series, N5000 series. Expensive (around $4k) and difficult for individuals to purchase.
mlx5      Mellanox ConnectX-6 Dx, BlueField SmartNIC. The ConnectX-6 Dx is expensive (around $1.2k), but individuals can purchase it.
vdpa_sim  A vdpa device simulator included in the standard CentOS 8.3 kernel; you can check the details with the modinfo vdpa_sim command. Since it is a simulator, it cannot communicate with external networks.
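If you want to try the vDPA control path without any hardware, the vdpa_sim module mentioned above can be loaded and inspected as follows (a sketch assuming the stock CentOS 8.3 kernel; requires root):

```shell
# Load the vdpa device simulator (no external network connectivity)
modprobe vdpa_sim

# Show module details: description, license, dependencies
modinfo vdpa_sim

# The simulated device should appear on the vdpa bus
ls /sys/bus/vdpa/devices/
```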

Therefore, we think it is practical to purchase the ConnectX-6 Dx.
Also, for the ConnectX-6 Dx, the officially supported model numbers are listed on the following website.
5. MLX5 vDPA driver — Data Plane Development Kit 21.05.0 documentation

Mellanox® ConnectX®-6 200G MCX654106A-HCAT (2x200G)
Mellanox® ConnectX®-6 Dx EN 25G MCX621102AN-ADAT (2x25G)
Mellanox® ConnectX®-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
Mellanox® ConnectX®-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
Mellanox® BlueField SmartNIC 25G MBF1M332A-ASCAT (2x25G)

Of the above, the "MCX623106AN-CDAT" can be purchased at the following site.*2
NVIDIA Mellanox MCX623106AN-CDAT ConnectX®-6 Dx EN Network Interface Card
https://www.fs.com/products/119646.html


When we procured our NIC, we bought the HPE-branded "MCX623106AS-CDAT" on eBay because the above site did not yet carry it.*3

3.How to configure in Linux Bridge

In the previous article, we used ovs-dpdk for three reasons.

  1. With ovs-dpdk, you must explicitly specify the representor together with the BSF (bus:slot.function) number. We used ovs-dpdk because we thought this would help you understand the relationship between the PF, VF, and VF representor.
  2. Of ovs-dpdk, ovs, and Linux Bridge, ovs-dpdk requires the most time-consuming and complicated configuration. On the other hand, we thought that once you understand the ovs-dpdk configuration, you can easily understand the other cases.
  3. We plan to run TRex on a virtual machine in the future, and we want to run load tests in each environment: ovs-dpdk, ovs, and Linux Bridge.

Note
For example, in the case of vhost_vdpa, following the highlighted parts of the output below (shown in red in the original article) will help you understand the relationship between the PF, VF, and VF representor.

ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

This section describes how to configure Linux Bridge.
Please make sure you have done "6. Change to SR-IOV switchdev mode" in the previous article.

3-1.Creating the Linux Bridge
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0_0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1_0 master br31ens2f1
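After creating the bridges, it may help to confirm that both the PF (ens2f0) and the VF representor (ens2f0_0) are actually enslaved. A quick check, assuming the interface names used above:

```shell
# Each member port should show "master br30ens2f0" (or br31ens2f1)
bridge link show

# List only the interfaces enslaved to a specific bridge
ip link show master br30ens2f0
ip link show master br31ens2f1
```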
3-2.Checking the Linux Bridge
nmcli con show

[root@c83g155 ~]# nmcli con show
NAME                   UUID                                  TYPE      DEVICE
System eno1            c68c8589-0538-477f-8921-d1da3a50a7c1  ethernet  eno1
br11eno2               15c513f2-8180-44e7-b5a5-af8a61574118  bridge    br11eno2
br30ens2f0             c3d898f3-c3c4-4f84-ae94-4c0d663bd02c  bridge    br30ens2f0
br31ens2f1             0704f4ff-d502-4541-b2a9-b6bc7291ae47  bridge    br31ens2f1
virbr0                 28699aa1-57bf-4686-86f4-7625a2311628  bridge    virbr0
bridge-slave-eno2      4e9db04a-0e3a-4099-8726-0c3fc5b09d78  ethernet  eno2
bridge-slave-ens2f0    d612bf46-c387-4883-aef1-b37b5d29aca2  ethernet  ens2f0
bridge-slave-ens2f0_0  173c7792-6df9-41af-b6ab-9a4f66567d95  ethernet  ens2f0_0
bridge-slave-ens2f1    2a90162e-585a-4c5e-ac52-73ca935c80d0  ethernet  ens2f1
bridge-slave-ens2f1_0  327e0276-f90c-4dd2-89fd-2aeeabd6c69f  ethernet  ens2f1_0
eno3                   8bcc3718-cdd2-40f9-a484-d25ccc5b9a11  ethernet  --
eno4                   3e0a2034-acbb-49c6-a63f-6438213c51c3  ethernet  --
ens1f0                 fd0d2166-840e-4e1a-bfbd-55c0df2ad791  ethernet  --
ens1f1                 f9c8d336-f15b-4b8b-91c9-14d5094d7fd7  ethernet  --
3-3.Enabling the vhost_vdpa module : For vhost_vdpa

If the virtio_vdpa module is enabled, please disable it using the rmmod virtio_vdpa command.

rmmod virtio_vdpa
modprobe vhost_vdpa
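After loading vhost_vdpa, the character devices that qemu's vhostdev= option points at should exist. A quick sanity check (device numbering assumed from the lshw/ls output shown earlier):

```shell
# The vhost-vdpa character devices consumed by qemu
ls -l /dev/vhost-vdpa-*

# The vdpa devices currently bound to the vhost_vdpa driver
ls -l /sys/bus/vdpa/drivers/vhost_vdpa/
```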
3-4.Start the virtual machine : For vhost_vdpa
c77g153
(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

c77g159
(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
3-5.Enabling the virtio_vdpa module : For virtio_vdpa

If the vhost_vdpa module is enabled, disable it with the rmmod vhost_vdpa command.

rmmod vhost_vdpa
modprobe virtio_vdpa
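Similarly, after switching to virtio_vdpa, the vdpa devices should rebind to that driver instead of vhost_vdpa. A quick check:

```shell
# vdpa0/vdpa1 should now appear under the virtio_vdpa driver
ls -l /sys/bus/vdpa/drivers/virtio_vdpa/

# The /dev/vhost-vdpa-* character devices should no longer exist
ls /dev/vhost-vdpa-* 2>/dev/null || echo "no vhost-vdpa devices (expected)"
```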
3-6.Configuring dpdk-vdpa : For virtio_vdpa
/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2
3-7.Start a virtual machine : For virtio_vdpa
c77g153
virsh start c77g153; virsh console c77g153

c77g159
virsh start c77g159; virsh console c77g159

4.How to configure ovs (without dpdk)

The configuration of ovs is very simple, just like a Linux Bridge.
We will not use dpdk, but we will configure HW offload.

4-1.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch
ovs-vsctl get Open_vSwitch . other_config

The configuration values for other_config should look like the following.
Please refer to the previous article for how to delete other configuration values (dpdk-init, etc.).
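For reference, leftover other_config keys from the ovs-dpdk setup can also be removed individually with ovs-vsctl's remove command (the key names below are examples from the ovs-dpdk setup; adjust to whatever keys remain in your environment):

```shell
# Remove leftover dpdk-related keys one by one
ovs-vsctl remove Open_vSwitch . other_config dpdk-init
ovs-vsctl remove Open_vSwitch . other_config dpdk-socket-mem

# Verify that only hw-offload and tc-policy remain
ovs-vsctl get Open_vSwitch . other_config
```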

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}
4-2.Configuring ovs
ovs-vsctl add-br br30-ovs
ovs-vsctl add-port br30-ovs ens2f0
ovs-vsctl add-port br30-ovs ens2f0_0
ovs-vsctl add-br br31-ovs
ovs-vsctl add-port br31-ovs ens2f1
ovs-vsctl add-port br31-ovs ens2f1_0
ovs-vsctl show

The resulting configuration is as follows.

ovs-vsctl show

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        Port ens2f1
            Interface ens2f1
        Port ens2f1_0
            Interface ens2f1_0
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port ens2f0_0
            Interface ens2f0_0
    ovs_version: "2.14.1"
4-3.Start a virtual machine

The procedure for starting a virtual machine is the same as in 3-3 to 3-7.

4-4.Status check

ConsoleA
vhost_vdpa

Apr 15 10:00:53 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_set_status:1786:(pid 8284): performing device reset
Apr 15 10:00:53 c83g155 kvm[8292]: 1 guest now active
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.73' (uid=0 pid=2781 comm="/usr/libexec/tracker-miner-fs ")
Apr 15 10:00:53 c83g155 systemd[1964]: Starting Tracker metadata extractor...
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr 15 10:00:53 c83g155 systemd[1964]: Started Tracker metadata extractor.
Apr 15 10:01:16 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_handle_set_map:473:(pid 8291): memory map update
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00001|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00002|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br30-ovs is down
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br31-ovs is down
=====================snip=====================
Apr 15 10:01:36 c83g155 systemd[1964]: tracker-extract.service: Succeeded.

You will see the message "tc mirred to Houston: device br30-ovs is down", but this is not a problem.

virtio_vdpa
The same message as ovs-dpdk will be output.

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=302.134s
  port  ens2f0: rx pkts=2557039, bytes=260820713, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557043, bytes=260821159, drop=0, errs=0, coll=0
           duration=302.105s
  port  "ens2f0_0": rx pkts=2557014, bytes=250589621, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33992436, bytes=3162372357, drop=0, errs=0, coll=0
           duration=302.085s

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=308.302s
  port  ens2f1: rx pkts=2557043, bytes=260821159, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557039, bytes=260820713, drop=0, errs=0, coll=0
           duration=308.279s
  port  "ens2f1_0": rx pkts=2557014, bytes=250589517, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33999072, bytes=3162887345, drop=0, errs=0, coll=0
           duration=308.258s

ConsoleD
The same message as ovs-dpdk will be output.

5.Issue 1: How to start a virtual machine

2021/06/01 Update
"How to boot VM with libvirtd" has been added to the following article.
How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection

From here, we describe what we tried that did not work.
So far, we have covered the following two methods for starting a virtual machine.

  • vhost_vdpa: starting the virtual machine directly with qemu
  • virtio_vdpa: starting the virtual machine via libvirt

This is because the other combinations failed to communicate.

starting method  vhost_vdpa  virtio_vdpa
qemu             OK          NG
libvirt          NG          OK

In this section, we will describe the specific settings where communication was not possible.

5-1.Starting a virtual machine with qemu and virtio_vdpa

The virtual machine can be started with the following command, but it could not communicate.

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -mem-prealloc \
 -chardev socket,id=char0,path=/tmp/sock-virtio0 \
 -netdev type=vhost-user,id=vdpa,chardev=char0 \
 -device virtio-net-pci,netdev=vdpa,mac=00:11:22:33:44:00,page-per-vq=on \
 -nographic \
 2>&1 | tee vm153.log

We would like to replace the following libvirt settings with qemu settings.

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

However, after checking qemu's help, it seems possible to configure NUMA, but we do not know how to configure hugepages for qemu, so we would like to continue our investigation.
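For what it's worth, qemu can back guest memory with hugepages via a memory-backend-file object, which may correspond to the libvirt <memoryBacking> and memAccess='shared' settings above. We have not verified this in our environment, so treat the following as an untested sketch (share=on is generally required for vhost-user):

```shell
# Untested sketch: hugepage-backed, shared guest memory for vhost-user
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,share=on,prealloc=yes \
 -numa node,memdev=mem0 \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -chardev socket,id=char0,path=/tmp/sock-virtio0 \
 -netdev type=vhost-user,id=vdpa,chardev=char0 \
 -device virtio-net-pci,netdev=vdpa,mac=00:11:22:33:44:00,page-per-vq=on \
 -nographic \
 2>&1 | tee vm153.log
```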

5-2.Starting a virtual machine with libvirt and vhost_vdpa

In the case of vhost_vdpa, vDPA devices can be configured in XML.*4
https://libvirt.org/formatdomain.html#vdpa-devices
The configurations are as follows.

<devices>
......
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
......
</devices>

However, in addition to the above, we also tried the following settings, but were still unable to communicate.

virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on'

Since we think it is better to standardize on libvirt for virtual machine management, we will investigate method 5-2 as a priority.

6.Issue 2: VF relay configuration

The following Mellanox site describes how to set up Hardware vDPA with VF Relay.
VirtIO Acceleration through VF Relay (Software & Hardware vDPA)
https://docs.mellanox.com/pages/viewpage.action?pageId=43718786#OVSOffloadUsingASAP%C2%B2Direct-swvdpaVirtIOAccelerationthroughVFRelay(Software&HardwarevDPA)
A sample configuration is provided, and the key point is "vdpa-socket-path=<sock path>".

ovs-vsctl add-port br0 vdpa0 -- set Interface vdpa0 type=dpdkvdpa \
options:vdpa-socket-path=<sock path> \
options:vdpa-accelerator-devargs= \
options:dpdk-devargs=,representor=[id] \
options: vdpa-max-queues = \
options: vdpa-sw=

Specifically, the configuration is as follows.

ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options: vdpa-max-queues=8

The sock file in "/var/run/virtio-forwarder/sock0" needs to be configured for VF relay in advance using "virtio-forwarder".
Virtio-forwarder Documentation — Virtio-forwarder 1.2.99.19 documentation
However, when we tried to install virtio-forwarder following the steps on the above site, the installation failed with the following error.

Error:
 Problem: package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1()(64bit), but none of the providers can be installed
  - package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1(MLX4_1.0)(64bit), but none of the providers can be installed
……

For this reason, we are going to try building it from source, linking it against the dpdk 21.02 libraries we are using this time.
As stated on the Mellanox site, this method is not accepted by the ovs-dpdk upstream, so we will investigate it with a lower priority.*5

7.Finally

The sites that we referred to are listed in the text.

We don't know whether vDPA will become more popular or fade away, but we predict that its true value will be realized when it is combined with technologies such as Scalable IOV.
01.org

In the next article, we would like to describe in detail how to configure vDPA in a Kubernetes environment.

Related Articles

No vm(qemu)/k8s k8s Pod/VMI vDPA Framework vDPA Type SR-IOV mode Related Articles
1 vm - kernel vhost legacy Not started
2 vm - kernel vhost switchdev How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3 vm - kernel virtio legacy Not started
4 vm - kernel virtio switchdev Not started
5 vm - dpdk vhost legacy Not started
6 vm - dpdk vhost switchdev Not started
7 vm - dpdk virtio legacy Not started
8 vm - dpdk virtio switchdev How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9 k8s pod kernel vhost legacy How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
10 k8s pod kernel vhost switchdev How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
11 k8s pod kernel virtio legacy Not started
12 k8s pod kernel virtio switchdev Not started
13 k8s pod dpdk client legacy Not started
14 k8s pod dpdk client switchdev Not started
15 k8s pod dpdk server legacy Not started
16 k8s pod dpdk server switchdev Not started

*1:This article is intended for those who have already read the previous article, so detailed explanations are omitted.

*2:The 100Gbps DACs used in previous article was purchased from this site.

*3:The MCX623106AN-CDAT (the officially supported NIC) and the MCX623106AS-CDAT (the NIC we bought) have slightly different model numbers, but there is no problem.

*4:libvirt v6.9.0 QEMU only, requires QEMU 5.1.0 or newer

*5:There is one more thing that we're curious about, "type=dpdkvdpa". When we built ovs v2.15.0 from source and specified this type, it was rejected with an error. Therefore, if you don't use the ovs installed in "2-3. Installing Mellanox driver (OFED)" in the previous article, you may not be able to configure VF relay.