Metonymical Deflection

Everyday life at an easy pace, and occasionally IT infrastructure

How to set up vDPA - appendix -

This is a sequel to the previous article.*1
metonymical.hatenablog.com

In this article, we describe how to procure NICs, how to configure vDPA with something other than ovs-dpdk, and what issues we are still facing.
Since the topics are diverse, you may read them in any order.

1.Topics

  • How to procure a NIC
  • How to configure in Linux Bridge
  • How to configure ovs (without dpdk)
  • Issue 1: How to start a virtual machine
  • Issue 2: VF relay configuration

2.How to procure a NIC

There are three drivers that support vDPA on the Linux Kernel.
linux/drivers/vdpa at master · torvalds/linux · GitHub
The NICs that are supported by each driver are as follows.

Driver    Supported NICs                                       Notes
ifcvf     Intel SmartNIC N3000 series, N5000 series            Expensive ($4k) and difficult for individuals to purchase
mlx5      Mellanox ConnectX-6 Dx, BlueField SmartNIC           The ConnectX-6 Dx is expensive ($1.2k), but individuals can purchase it
vdpa_sim  vDPA device simulator (CentOS 8.3 standard kernel)   Check the details with modinfo vdpa_sim; as a simulator, it cannot communicate with external networks
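
To see which of these three drivers a given host's kernel ships, one option is to probe each module with modinfo. The following is a minimal sketch: the function name vdpa_module_report and its optional probe-command argument are our own illustration, and the module names ifcvf, mlx5_vdpa, and vdpa_sim are assumed to match the kernel in use.

```shell
# Report which vDPA kernel modules are known to this host.
# Assumption: the module names below match your kernel build.
# The optional first argument replaces modinfo (useful for a dry run).
vdpa_module_report() {
  local probe="${1:-modinfo}"
  local mod
  for mod in ifcvf mlx5_vdpa vdpa_sim; do
    if "$probe" "$mod" >/dev/null 2>&1; then
      echo "$mod: available"
    else
      echo "$mod: not found"
    fi
  done
}
```

Running vdpa_module_report with no arguments on the host prints one line per driver.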

Therefore, we think the ConnectX-6 Dx is the practical choice.
Also, for the ConnectX-6 Dx, the officially supported model numbers are listed on the following website.
5. MLX5 vDPA driver — Data Plane Development Kit 21.05.0-rc0 documentation

Mellanox® ConnectX®-6 200G MCX654106A-HCAT (2x200G)
Mellanox® ConnectX®-6 Dx EN 25G MCX621102AN-ADAT (2x25G)
Mellanox® ConnectX®-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
Mellanox® ConnectX®-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
Mellanox® BlueField SmartNIC 25G MBF1M332A-ASCAT (2x25G)

Of the above, the "MCX623106AN-CDAT" can be purchased at the following site.*2
NVIDIA Mellanox MCX623106AN-CDAT ConnectX®-6 Dx EN Network Interface Card
https://www.fs.com/products/119646.html

When we procured our NIC, the above site did not yet carry it, so we searched eBay and happened to find the HPE-branded "MCX623106AS-CDAT" for about $650.*3

3.How to configure in Linux Bridge

In the previous article, we used ovs-dpdk for three reasons.

  1. With ovs-dpdk, you need to explicitly specify the representor in a way that includes the bsf (bus:slot.function) number. We used ovs-dpdk because we thought this would help you understand the relationship between the PF, VF, and VF rep.
  2. Of ovs-dpdk, ovs, and Linux Bridge, ovs-dpdk requires the most time-consuming and complicated configuration. Conversely, we thought that once you understand the ovs-dpdk configuration, the other cases would be easy to understand.
  3. We plan to run TRex on a virtual machine in the future, and we wanted to run load tests in each of the ovs-dpdk, ovs, and Linux Bridge environments.

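Note
For example, in the case of vhost_vdpa, you can see the relationship between the PF, VF, and VF rep by following the PCI addresses and device names through the command and outputs below (for example 0000:07:00.0, representor=[0], ens2f0_0, 0000:07:00.2, vdpa0, and /dev/vhost-vdpa-0).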

ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log
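
The link between each vdpa device and its VF can also be read programmatically from the symlinks shown in the ls output above. The following is a small sketch under that assumption: the function name vdpa_to_pci and the optional directory argument are hypothetical, and it relies on the parent directory of each vdpaN symlink target being the VF's PCI address, as it is in the listing above.

```shell
# Print "vdpaN <VF PCI address>" for every vdpa device bound to a driver.
# The optional argument is the driver directory (default: vhost_vdpa in sysfs).
vdpa_to_pci() {
  local drv_dir="${1:-/sys/bus/vdpa/drivers/vhost_vdpa}"
  local link target
  for link in "$drv_dir"/vdpa*; do
    [ -L "$link" ] || continue          # skip anything that is not a symlink
    target=$(readlink "$link")
    # The parent directory of vdpaN in the link target is the VF's PCI address.
    echo "$(basename "$link") $(basename "$(dirname "$target")")"
  done
}
```

On the host above, vdpa_to_pci would be expected to pair vdpa0 with 0000:07:00.2 and vdpa1 with 0000:07:01.2.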

This section describes how to configure Linux Bridge.
Please make sure you have done "6. Change to SR-IOV switchdev mode" in the previous article.

3-1.Creating the Linux Bridge
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0_0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1_0 master br31ens2f1
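
The two command sets above differ only in the bridge name, the PF, and the VF representor, so they can be generated from one helper. This is a sketch only: linux_bridge_cmds is a hypothetical name, and the function prints the commands rather than running them so that they can be reviewed first.

```shell
# Print (not run) the nmcli commands that bridge a PF and its VF representor.
# Arguments: bridge name, PF interface, VF representor interface.
linux_bridge_cmds() {
  local br="$1" pf="$2" rep="$3"
  echo "nmcli connection add type bridge autoconnect yes con-name $br ifname $br"
  echo "nmcli connection modify $br bridge.stp no"
  echo "nmcli connection modify $br ipv4.method disabled ipv6.method ignore"
  echo "nmcli connection up $br"
  echo "nmcli connection add type bridge-slave ifname $pf master $br"
  echo "nmcli connection add type bridge-slave ifname $rep master $br"
}
```

For example, linux_bridge_cmds br30ens2f0 ens2f0 ens2f0_0 prints the first six commands above; piping the output to sh would execute them.
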
3-2.Checking the Linux Bridge
nmcli con show

[root@c83g155 ~]# nmcli con show
NAME                   UUID                                  TYPE      DEVICE
System eno1            c68c8589-0538-477f-8921-d1da3a50a7c1  ethernet  eno1
br11eno2               15c513f2-8180-44e7-b5a5-af8a61574118  bridge    br11eno2
br30ens2f0             c3d898f3-c3c4-4f84-ae94-4c0d663bd02c  bridge    br30ens2f0
br31ens2f1             0704f4ff-d502-4541-b2a9-b6bc7291ae47  bridge    br31ens2f1
virbr0                 28699aa1-57bf-4686-86f4-7625a2311628  bridge    virbr0
bridge-slave-eno2      4e9db04a-0e3a-4099-8726-0c3fc5b09d78  ethernet  eno2
bridge-slave-ens2f0    d612bf46-c387-4883-aef1-b37b5d29aca2  ethernet  ens2f0
bridge-slave-ens2f0_0  173c7792-6df9-41af-b6ab-9a4f66567d95  ethernet  ens2f0_0
bridge-slave-ens2f1    2a90162e-585a-4c5e-ac52-73ca935c80d0  ethernet  ens2f1
bridge-slave-ens2f1_0  327e0276-f90c-4dd2-89fd-2aeeabd6c69f  ethernet  ens2f1_0
eno3                   8bcc3718-cdd2-40f9-a484-d25ccc5b9a11  ethernet  --
eno4                   3e0a2034-acbb-49c6-a63f-6438213c51c3  ethernet  --
ens1f0                 fd0d2166-840e-4e1a-bfbd-55c0df2ad791  ethernet  --
ens1f1                 f9c8d336-f15b-4b8b-91c9-14d5094d7fd7  ethernet  --

3-3.Enabling the vhost_vdpa module : For vhost_vdpa

If the virtio_vdpa module is enabled, please disable it using the rmmod virtio_vdpa command.

rmmod virtio_vdpa
modprobe vhost_vdpa
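
Before switching between vhost_vdpa and virtio_vdpa, it can help to check which of the two is currently loaded. The following is a minimal sketch; the function name loaded_vdpa_bus_drivers and the optional file argument are our own, and the function only reads the module list without changing anything.

```shell
# List which vDPA bus drivers are currently loaded.
# The optional argument is a file in /proc/modules format (default: /proc/modules).
loaded_vdpa_bus_drivers() {
  local modules_file="${1:-/proc/modules}"
  awk '$1 == "vhost_vdpa" || $1 == "virtio_vdpa" { print $1 }' "$modules_file"
}
```

If this prints virtio_vdpa, run rmmod virtio_vdpa before modprobe vhost_vdpa; if it prints vhost_vdpa, the reverse applies.
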
3-4.Start the virtual machine : For vhost_vdpa
c77g153
(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

c77g159
(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
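
The two qemu commands differ only in the disk image, the log file, and the vhost-vdpa device number. As an illustration, the network arguments can be derived from that number; vhost_vdpa_qemu_args is a hypothetical helper that only prints the arguments used above.

```shell
# Print the -netdev/-device argument pair for vhost-vdpa device number N.
# The device options mirror the commands above; nothing is executed.
vhost_vdpa_qemu_args() {
  local n="$1"
  echo "-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-$n,id=vhost-vdpa$n"
  echo "-device virtio-net-pci,netdev=vhost-vdpa$n,page-per-vq=on,iommu_platform=on,disable-legacy=on"
}
```

For example, vhost_vdpa_qemu_args 1 prints the two network arguments used for c77g159.
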
3-5.Enabling the virtio_vdpa module : For virtio_vdpa

If the vhost_vdpa module is enabled, disable it with the rmmod vhost_vdpa command.

rmmod vhost_vdpa
modprobe virtio_vdpa

3-6.Configuring dpdk-vdpa : For virtio_vdpa
/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2

3-7.Start a virtual machine : For virtio_vdpa
c77g153
virsh start c77g153; virsh console c77g153

c77g159
virsh start c77g159; virsh console c77g159

4.How to configure ovs (without dpdk)

The configuration of ovs is very simple, just like a Linux Bridge.
We will not use dpdk, but we will configure HW offload.

4-1.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch
ovs-vsctl get Open_vSwitch . other_config

The configuration values for other_config should look like the following.
Please refer to the previous article for how to delete other configuration values (dpdk-init, etc.).

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}
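
The check on other_config can be scripted. The following is a sketch under our own assumptions: ovs_offload_config_ok is a hypothetical name, it reads the output of the ovs-vsctl get command on standard input, and it treats any remaining dpdk-init key as a failure because this section does not use dpdk.

```shell
# Read "ovs-vsctl get Open_vSwitch . other_config" output on stdin.
# Succeeds only if hw-offload is "true" and no dpdk-init key remains.
ovs_offload_config_ok() {
  local cfg
  cfg=$(cat)
  case "$cfg" in
    *dpdk-init*) echo "dpdk-init is still set"; return 1 ;;
  esac
  case "$cfg" in
    *'hw-offload="true"'*) echo "hw-offload is enabled"; return 0 ;;
  esac
  echo "hw-offload is not enabled"
  return 1
}
```

Usage: ovs-vsctl get Open_vSwitch . other_config | ovs_offload_config_ok
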
4-2.Configuring ovs
ovs-vsctl add-br br30-ovs
ovs-vsctl add-port br30-ovs ens2f0
ovs-vsctl add-port br30-ovs ens2f0_0
ovs-vsctl add-br br31-ovs
ovs-vsctl add-port br31-ovs ens2f1
ovs-vsctl add-port br31-ovs ens2f1_0
ovs-vsctl show

The result should look like the following.

ovs-vsctl show

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        Port ens2f1
            Interface ens2f1
        Port ens2f1_0
            Interface ens2f1_0
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port ens2f0_0
            Interface ens2f0_0
    ovs_version: "2.14.1"

4-3.Start a virtual machine

The procedure for starting a virtual machine is the same as in 3-3 to 3-7.

4-4.Status check

ConsoleA
vhost_vdpa

Apr 15 10:00:53 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_set_status:1786:(pid 8284): performing device reset
Apr 15 10:00:53 c83g155 kvm[8292]: 1 guest now active
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.73' (uid=0 pid=2781 comm="/usr/libexec/tracker-miner-fs ")
Apr 15 10:00:53 c83g155 systemd[1964]: Starting Tracker metadata extractor...
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr 15 10:00:53 c83g155 systemd[1964]: Started Tracker metadata extractor.
Apr 15 10:01:16 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_handle_set_map:473:(pid 8291): memory map update
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00001|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00002|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br30-ovs is down
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br31-ovs is down
=====================snip=====================
Apr 15 10:01:36 c83g155 systemd[1964]: tracker-extract.service: Succeeded.

The message "tc mirred to Houston: device br30-ovs is down" appears, but it is not a problem.

virtio_vdpa
The same messages as with ovs-dpdk are output.

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=302.134s
  port  ens2f0: rx pkts=2557039, bytes=260820713, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557043, bytes=260821159, drop=0, errs=0, coll=0
           duration=302.105s
  port  "ens2f0_0": rx pkts=2557014, bytes=250589621, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33992436, bytes=3162372357, drop=0, errs=0, coll=0
           duration=302.085s

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=308.302s
  port  ens2f1: rx pkts=2557043, bytes=260821159, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557039, bytes=260820713, drop=0, errs=0, coll=0
           duration=308.279s
  port  "ens2f1_0": rx pkts=2557014, bytes=250589517, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33999072, bytes=3162887345, drop=0, errs=0, coll=0
           duration=308.258s
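
In the two outputs above, the rx packet count of ens2f0 (2557039) equals the tx packet count of ens2f1, and vice versa. To compare such counters without reading the full output, the per-port packet counts can be extracted. This is a sketch only: port_pkts is a hypothetical name, and the parsing assumes the dump-ports layout shown above.

```shell
# Read "ovs-ofctl dump-ports" output on stdin and print "port rx_pkts tx_pkts".
port_pkts() {
  awk '
    $1 == "port" {
      name = $2
      gsub(/[":]/, "", name)            # strip quotes and the trailing colon
      if (match($0, /rx pkts=[0-9]+/))
        rx = substr($0, RSTART + 8, RLENGTH - 8)
    }
    /tx pkts=/ {
      if (match($0, /tx pkts=[0-9]+/))
        print name, rx, substr($0, RSTART + 8, RLENGTH - 8)
    }
  '
}
```

Usage: ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_pkts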

ConsoleD
The output is the same as with ovs-dpdk.

5.Issue 1: How to start a virtual machine

From here on, we describe what we tried that did not work.
So far we have described the following two methods for starting a virtual machine.

  • vhost_vdpa: starting a virtual machine directly from qemu
  • virtio_vdpa: using libvirt to start a virtual machine

This is because the other combinations failed to communicate, as shown below.

Starting method  vhost_vdpa  virtio_vdpa
qemu             OK          NG
libvirt          NG          OK

In this section, we describe the specific settings with which communication was not possible.

5-1.Starting a virtual machine with qemu and virtio_vdpa

The virtual machine starts with the following command, but we could not communicate with it.

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -mem-prealloc \
 -chardev socket,id=char0,path=/tmp/sock-virtio0 \
 -netdev type=vhost-user,id=vdpa,chardev=char0 \
 -device virtio-net-pci,netdev=vdpa,mac=00:11:22:33:44:00,page-per-vq=on \
 -nographic \
 2>&1 | tee vm153.log

We think communication will work once the following libvirt settings can be expressed as qemu settings.

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

According to qemu's help, NUMA can be configured, but we do not yet know how to configure hugepages for qemu, so we will continue investigating.
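
One candidate, which we have not verified in this environment, is QEMU's memory-backend-file object with share=on, attached to a NUMA node. This is the usual qemu-side counterpart of hugepage-backed shared memory for vhost-user. The sketch below only prints the options: qemu_hugepage_args and the id mem0 are hypothetical names, and /dev/hugepages is assumed to be a hugetlbfs mount that provides the desired page size.

```shell
# Print QEMU memory options that back guest RAM with shared hugepages.
# Arguments: guest memory size (e.g. 4G) and hugetlbfs mount point.
# Assumption: the mount point is a hugetlbfs mount with the desired page size.
qemu_hugepage_args() {
  local size="${1:-4G}" path="${2:-/dev/hugepages}"
  echo "-m $size"
  echo "-object memory-backend-file,id=mem0,size=$size,mem-path=$path,share=on"
  echo "-numa node,memdev=mem0"
  echo "-mem-prealloc"
}
```

These options would replace the -m 8G and -mem-prealloc lines in the command above. Whether they resolve the communication problem remains to be confirmed.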

5-2.Starting a virtual machine with libvirt and vhost_vdpa

In the case of vhost_vdpa, vDPA devices can be configured in XML.*4
https://libvirt.org/formatdomain.html#vdpa-devices
The configurations are as follows.

<devices>
......
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
......
</devices>

However, even after adding the following settings on top of the above, we were unable to communicate.

virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on'

We will continue to investigate this as well.
Since we think it is better to standardize on libvirt for managing virtual machines, we will investigate method 5-2 as a priority.

6.Issue 2: VF relay configuration

The following Mellanox site describes how to set up Hardware vDPA with VF Relay.
VirtIO Acceleration through VF Relay (Software & Hardware vDPA)
https://docs.mellanox.com/pages/viewpage.action?pageId=43718786#OVSOffloadUsingASAP%C2%B2Direct-swvdpaVirtIOAccelerationthroughVFRelay(Software&HardwarevDPA)
A sample configuration is provided, and the key point is "vdpa-socket-path=<sock path>".

ovs-vsctl add-port br0 vdpa0 -- set Interface vdpa0 type=dpdkvdpa \
options:vdpa-socket-path=<sock path> \
options:vdpa-accelerator-devargs= \
options:dpdk-devargs=,representor=[id] \
options: vdpa-max-queues = \
options: vdpa-sw=

Specifically, the configurations are as follows.

ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8

Here, "<sock path>" is "/var/run/virtio-forwarder/sock0", and it appears that VF relay must be configured in advance using "virtio-forwarder".
Virtio-forwarder Documentation — Virtio-forwarder 1.2.99.19 documentation
However, when we tried to install virtio-forwarder following the steps on the above site, the installation failed with the following error.

Error:
 Problem: package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1()(64bit), but none of the providers can be installed
  - package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1(MLX4_1.0)(64bit), but none of the providers can be installed
……

For this reason, we are going to try building it from source after linking it against the libraries of "dpdk21.02", which we are using this time.
As stated on the Mellanox site, this method is not approved by the upstream of ovs-dpdk, so we're going to investigate it with a lower priority.*5

7.Finally

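The sites that we referred to are listed in the text.

vDPA itself had many difficult points, but the hardest part was the SR-IOV switchdev mode; it took us a long time to understand the traffic flow in switchdev mode.
However, the explanation in the following blog gave us a breakthrough, and we managed to move forward.
We are very grateful to yunazuno for the explanation.
yunazuno.hatenablog.com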

We don't know whether vDPA will become more popular or disappear in the future, but we predict that its true value will be realized when it is used in combination with technologies such as ScalableIOV.
01.org

In the next article, we would like to describe in detail how to configure vDPA in a Kubernetes environment.

*1:This article is intended for those who have already read the previous article, so detailed explanations are omitted.

*2:The 100Gbps DACs used in the previous article were purchased from this site.

*3:The MCX623106AN-CDAT (officially supported NIC) and the MCX623106AS-CDAT (the NIC we bought) have slightly different model numbers, so we were worried that vDPA HW offload might not work, but there was no problem.

*4:libvirt v6.9.0 QEMU only, requires QEMU 5.1.0 or newer

*5:There is one more thing that we're curious about, "type=dpdkvdpa". When we built ovs v2.15.0 from source and specified this type, it was rejected with an error. Therefore, if you don't use the ovs installed in "2-3. Installing Mellanox driver (OFED)" in the previous article, you may not be able to configure VF relay.

vDPA の設定方法 番外編

前回記事からの続編となります。*1
metonymical.hatenablog.com
metonymical.hatenablog.com

本ブログでは、NICの調達方法やovs-dpdk以外の設定方法、課題となっている事などを記載していきます。
トピックが多岐に渡るため、順不同に読んで頂いて構いません。

1.トピック

  • NICの調達方法
  • Linux Bridgeにおける設定方法
  • ovsにおける設定方法(dpdk無し)
  • 課題1:仮想マシンの起動方法
  • 課題2:VFリレー設定

2.NICの調達方法

Linux KernelでvDPAに対応したドライバは3つあります。
linux/drivers/vdpa at master · torvalds/linux · GitHub
各ドライバに対応したNICは以下の通りです。

ifcvf Intel製 SmartNIC N3000シリーズ, N5000シリーズ 高額($4k)で、かつ個人購入は困難
mlx5 Mellanox製 ConnectX-6 Dx, BlueField SmartNIC ConnectX-6 Dxの場合、高額($1.2k)だが個人購入は可能
vdpa_sim vdpaデバイスシミュレータですが、CentOS8.3標準Kernelに入っています。modinfo vdpa_sim コマンドで確認できます。 シミュレータのため外部NWと通信できない。

このため、ConnectX-6 Dxの購入が現実的だと思います。
また、ConnectX-6 Dxの場合、正式サポートしている型番が以下のサイトに掲載されています。
5. MLX5 vDPA driver — Data Plane Development Kit 21.05.0-rc0 documentation

Mellanox® ConnectX®-6 200G MCX654106A-HCAT (2x200G)
Mellanox® ConnectX®-6 Dx EN 25G MCX621102AN-ADAT (2x25G)
Mellanox® ConnectX®-6 Dx EN 100G MCX623106AN-CDAT (2x100G)
Mellanox® ConnectX®-6 Dx EN 200G MCX623105AN-VDAT (1x200G)
Mellanox® BlueField SmartNIC 25G MBF1M332A-ASCAT (2x25G)

上記のうち、"MCX623106AN-CDAT"は、以下のサイトで購入可能です。*2
NVIDIA MCX623106AN-CDAT ConnectX-6 Dx 100GbE QSFP56 ネットワークアダプター, SmartNIC - FS 日本

なお、私がNICを調達したとき、上記のサイトではまだ取り扱っていなかったため、ebayを探していたところ、たまたまHPEブランドの"MCX623106AS-CDAT"が$650くらいで購入できました。*3

3.Linux Bridgeにおける設定方法

本編の記事では、ovs-dpdkを使用しましたが、それには以下3つの理由があります。

  1. ovs-dpdkの場合、bsf番号を含めた形でrepresentorを明示的に指定する必要があります。これにより、どことどこが接続されているのかを理解できると考えたため、ovs-dpdkを使用しました。
  2. ovs-dpdk, ovs, Linux Bridgeの3つを比較した場合、ovs-dpdkは最も手間の掛かる複雑な設定が要求されます。一方でovs-dpdkの設定が理解できれば、他のケースは容易に理解することができると考えたためです。
  3. 今後、仮想マシンでTRexを動作させた負荷試験を検討しており、ovs-dpdk, ovs, Linux Bridgeそれぞれの環境で負荷試験を実施したいと考えていたからです。

<補足>
例えば、vhost_vdpaの場合、以下の赤文字を辿ると、PF, VF, VF repの関連性が理解できると思います。

ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

前置きが長くなりましたが、上記を踏まえて、Linux Bridgeにおける設定方法を記載します。
事前に本編の"6.SR-IOV switchdev modeへ変更"を実施しておいてください。

3-1.Linux Bridgeの作成
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0_0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1_0 master br31ens2f1
3-2.Linux Bridgeの確認
nmcli con show

[root@c83g155 ~]# nmcli con show
NAME                   UUID                                  TYPE      DEVICE
System eno1            c68c8589-0538-477f-8921-d1da3a50a7c1  ethernet  eno1
br11eno2               15c513f2-8180-44e7-b5a5-af8a61574118  bridge    br11eno2
br30ens2f0             c3d898f3-c3c4-4f84-ae94-4c0d663bd02c  bridge    br30ens2f0
br31ens2f1             0704f4ff-d502-4541-b2a9-b6bc7291ae47  bridge    br31ens2f1
virbr0                 28699aa1-57bf-4686-86f4-7625a2311628  bridge    virbr0
bridge-slave-eno2      4e9db04a-0e3a-4099-8726-0c3fc5b09d78  ethernet  eno2
bridge-slave-ens2f0    d612bf46-c387-4883-aef1-b37b5d29aca2  ethernet  ens2f0
bridge-slave-ens2f0_0  173c7792-6df9-41af-b6ab-9a4f66567d95  ethernet  ens2f0_0
bridge-slave-ens2f1    2a90162e-585a-4c5e-ac52-73ca935c80d0  ethernet  ens2f1
bridge-slave-ens2f1_0  327e0276-f90c-4dd2-89fd-2aeeabd6c69f  ethernet  ens2f1_0
eno3                   8bcc3718-cdd2-40f9-a484-d25ccc5b9a11  ethernet  --
eno4                   3e0a2034-acbb-49c6-a63f-6438213c51c3  ethernet  --
ens1f0                 fd0d2166-840e-4e1a-bfbd-55c0df2ad791  ethernet  --
ens1f1                 f9c8d336-f15b-4b8b-91c9-14d5094d7fd7  ethernet  --
3-3.vhost_vdpaモジュールの有効化 : vhost_vdpaの場合

virtio_vdpaモジュールが有効化されている場合、rmmod virtio_vdpaコマンドで無効化してください。

rmmod virtio_vdpa
modprobe vhost_vdpa
3-4.仮想マシンの起動 : vhost_vdpaの場合
c77g153
(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

c77g159
(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
3-5.virtio_vdpaモジュールの有効化 : virtio_vdpaの場合

vhost_vdpaモジュールが有効化されている場合、rmmod vhost_vdpaコマンドで無効化してください。

rmmod vhost_vdpa
modprobe virtio_vdpa
3-6.dpdk-vdpaの設定 : virtio_vdpaの場合
/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2
3-7.仮想マシンの起動 : virtio_vdpaの場合
c77g153
virsh start c77g153; virsh console c77g153

c77g159
virsh start c77g159; virsh console c77g159

4.ovsにおける設定方法(dpdk無し)

ovsの設定方法は、Linux Bridgeと同様に、とてもシンプルな設定です。
dpdkは使用しませんが、HW offloadは設定しておきます。

4-1.ovsの初期設定
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch
ovs-vsctl get Open_vSwitch . other_config

other_configの設定値は以下のようになっていればOKです。
他の設定値(dpdk-initなど)の削除方法は前回記事を参照してください。

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}
4-2.ovsの設定
ovs-vsctl add-br br30-ovs
ovs-vsctl add-port br30-ovs ens2f0
ovs-vsctl add-port br30-ovs ens2f0_0
ovs-vsctl add-br br31-ovs
ovs-vsctl add-port br31-ovs ens2f1
ovs-vsctl add-port br31-ovs ens2f1_0
ovs-vsctl show

以下のように設定されていればOKです。

ovs-vsctl show

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        Port ens2f1
            Interface ens2f1
        Port ens2f1_0
            Interface ens2f1_0
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port ens2f0_0
            Interface ens2f0_0
    ovs_version: "2.14.1"
4-3.仮想マシンの起動

仮想マシンの起動方法は3-3から3-7と同じ手順となります。

4-4.ステータス確認

<ConsoleA>
vhost_vdpaの場合

Apr 15 10:00:53 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_set_status:1786:(pid 8284): performing device reset
Apr 15 10:00:53 c83g155 kvm[8292]: 1 guest now active
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.73' (uid=0 pid=2781 comm="/usr/libexec/tracker-miner-fs ")
Apr 15 10:00:53 c83g155 systemd[1964]: Starting Tracker metadata extractor...
Apr 15 10:00:53 c83g155 dbus-daemon[2123]: [session uid=0 pid=2123] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr 15 10:00:53 c83g155 systemd[1964]: Started Tracker metadata extractor.
Apr 15 10:01:16 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_handle_set_map:473:(pid 8291): memory map update
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00001|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 ovs-vswitchd[8061]: ovs|00002|odp_util(handler4)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br30-ovs is down
Apr 15 10:01:18 c83g155 kernel: tc mirred to Houston: device br31-ovs is down
=====================snip=====================
Apr 15 10:01:36 c83g155 systemd[1964]: tracker-extract.service: Succeeded.

"tc mirred to Houston: device br30-ovs is down"のメッセージが出力されますが、特に問題ありません。

virtio_vdpaの場合
ovs-dpdkと同じメッセージが出力されます。

<ConsoleB>

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=302.134s
  port  ens2f0: rx pkts=2557039, bytes=260820713, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557043, bytes=260821159, drop=0, errs=0, coll=0
           duration=302.105s
  port  "ens2f0_0": rx pkts=2557014, bytes=250589621, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33992436, bytes=3162372357, drop=0, errs=0, coll=0
           duration=302.085s

<ConsoleC>

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=308.302s
  port  ens2f1: rx pkts=2557043, bytes=260821159, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=2557039, bytes=260820713, drop=0, errs=0, coll=0
           duration=308.279s
  port  "ens2f1_0": rx pkts=2557014, bytes=250589517, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33999072, bytes=3162887345, drop=0, errs=0, coll=0
           duration=308.258s

<ConsoleD>
ovs-dpdkと同じ出力となります。

5.課題1:仮想マシンの起動方法

ここからは、試してみたけど上手く行かなかったことを記載していきます。
仮想マシンの起動方法として、以下2つの方法を記載しています。

これは以下の組合せの場合、通信が出来なかったためです。

起動方法 vhost_vdpa virtio_vdpa
qemu OK NG
libvirt NG OK

ここでは、具体的にどのような設定をして通信不可となったのかを記載します。

5-1.qemuとvirtio_vdpaによる仮想マシンの起動

以下のコマンドで仮想マシンを起動することは可能ですが、通信できませんでした。

qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -mem-prealloc \
 -chardev socket,id=char0,path=/tmp/sock-virtio0 \
 -netdev type=vhost-user,id=vdpa,chardev=char0 \
 -device virtio-net-pci,netdev=vdpa,mac=00:11:22:33:44:00,page-per-vq=on \
 -nographic \
 2>&1 | tee vm153.log

ポイントは、以下のlibvirtの設定をqemuの設定に置換できれば、通信ができるようになると考えています。

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

qemuのヘルプを確認したところ、numaの設定はできそうです。しかし、qemuにおけるhugepagesの設定方法がわからないため、引き続き調査をしていきたいと考えています。

5-2.libvirtとvhost_vdpaによる仮想マシンの起動

vhost_vdpaの場合、vDPAデバイスXMLで設定することが可能となっています。*4
https://libvirt.org/formatdomain.html#vdpa-devices
具体的には以下のように設定します。

<devices>
......
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
......
</devices>

しかし、上記の設定に加え、以下の設定を実施してみましたが通信できませんでした。

virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on'

こちらについても、引き続き調査をしていきたいと考えています。
仮想マシンの管理上、libvirtに統一した方が良いと考えているため、5-2の方法を優先的に調べようと思います。

6.Issue 2: VF relay configuration

The following Mellanox site describes how to configure Hardware vDPA with VF relay.
VirtIO Acceleration through VF Relay (Software & Hardware vDPA)
https://docs.mellanox.com/pages/viewpage.action?pageId=43718786#OVSOffloadUsingASAP%C2%B2Direct-swvdpaVirtIOAccelerationthroughVFRelay(Software&HardwarevDPA)
A sample configuration is provided there, and the key point is "vdpa-socket-path=<sock path>".

ovs-vsctl add-port br0 vdpa0 -- set Interface vdpa0 type=dpdkvdpa \
options:vdpa-socket-path=<sock path> \
options:vdpa-accelerator-devargs= \
options:dpdk-devargs=,representor=[id] \
options:vdpa-max-queues= \
options:vdpa-sw=

Specifically, configure it as follows.

ovs-vsctl add-port br0-ovs vdpa0 -- set Interface vdpa0 type=dpdkvdpa options:vdpa-socket-path=/var/run/virtio-forwarder/sock0 options:vdpa-accelerator-devargs=0000:01:00.2 options:dpdk-devargs=0000:01:00.0,representor=[0] options:vdpa-max-queues=8

Here, "<sock path>" is "/var/run/virtio-forwarder/sock0", but it appears that the VF relay must be configured in advance by using "virtio-forwarder".
Virtio-forwarder Documentation — Virtio-forwarder 1.2.99.19 documentation
However, when we tried to install virtio-forwarder by following the procedure on the above site, the installation failed with the following error.

Error:
 Problem: package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1()(64bit), but none of the providers can be installed
  - package dpdk-19.11.3-1.el8.x86_64 requires libmlx4.so.1(MLX4_1.0)(64bit), but none of the providers can be installed
……

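For this reason, we are considering linking the libraries of "dpdk21.02" used in this series and then building virtio-forwarder from source.
As the Mellanox site also notes, this method has not been accepted by upstream ovs-dpdk, so we will investigate it at a lower priority.*5

That's all.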

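7.Finally

The sites we referred to are listed in the text.

vDPA itself had many difficult points, but the hardest part was the SR-IOV switchdev mode: it took us a long time to understand the traffic flow in switchdev mode.
However, the explanation in the following blog gave us a breakthrough, and we managed to move forward.
We are very grateful to yunazuno for the explanation.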
yunazuno.hatenablog.com

We do not know whether vDPA will become widespread or fade away, but we expect that its true value will be demonstrated when it is used in combination with technologies such as Scalable IOV.
01.org

Next time, we would like to describe in detail how to configure vDPA in a Kubernetes environment.

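*1:This article is written for those who have already read the previous articles, so detailed explanations are omitted.

*2:By the way, we purchased our 100Gbps DAC from this site.

*3:The model numbers of MCX623106AN-CDAT (the officially supported NIC) and MCX623106AS-CDAT (the NIC we purchased) differ slightly, so we carried out the verification while fighting the worry that the vDPA HW offload might not work.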

*4:libvirt v6.9.0 QEMU only, requires QEMU 5.1.0 or newer

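*5:One more point of concern is "type=dpdkvdpa". When we built ovs v2.15.0 from source and specified this type, it was rejected with an error. Therefore, the VF relay configuration may not be possible unless you use the ovs installed in "2-3.Install the Mellanox driver (OFED)" of the main article.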

How to set up vDPA with virtio_vdpa for VMs

This is a sequel to the previous article.
metonymical.hatenablog.com

Update 2021/04/16
Related articles are listed below.
metonymical.hatenablog.com

In this article, we will describe how to set up communication between VMs (virtual machines) using the virtio_vdpa module.

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.3(2011)
Kernel                           : 5.11.11-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.2-2.2.0.0
qemu-kvm                         : v6.0.0-rc1
DPDK                             : v21.02
ovs                              : v2.14.1
1-2.Overall flow

Advance preparation
Kernel update
Building qemu
Building dpdk
Change to SR-IOV switchdev mode
Configure ovs-dpdk and VM : Different from previous article
Operation check : Different from previous article

Note
Since many items are the same as in the previous article, the items that differ are written in bold blue text.
If you have already set up your environment by following the previous article, please reboot the host OS and start reading from "Change to SR-IOV switchdev mode".

1-3.Overall structure

The following points are different from the previous article.

this article (1) /tmp/sock-virtio0
previous article (1) /dev/vhost-vdpa-0

fig.1
f:id:metonymical:20210412202928j:plain
fig.1 is a simplified description and omits the internal architecture. The actual configuration is closer to the following.

fig.2
f:id:metonymical:20210413220050j:plain

Quoted from Red Hat's Blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to fig.1 and fig.2, respectively.
Furthermore, in fig.2, the actual traffic flow is described in blue and red letters. *1

In fig.2, PF and VF of SR-IOV are written respectively, and "VF rep" is written in addition to them.
It should be noted that the bsf (Bus, Slot, Function) numbers of PF and VF rep are the same.

PF VF0 VF0 rep
ens2f0 ens2f0v0 ens2f0_0
07:00.0 07:00.2 07:00.0

rep=representor is an interface specific to switchdev mode in SR-IOV, and is created by enabling switchdev mode.
In contrast to switchdev mode, the conventional SR-IOV VF mode is called legacy mode, and the two must be clearly distinguished.
In addition, switchdev mode is a mandatory requirement for ConnectX-6 Dx to enable the vDPA HW offload.

2.Advance preparation

Although not described in detail here, we disabled SELinux and the firewall and configured NTP time synchronization in advance.

2-1.Enabling HugePage and IOMMU
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, implement the mount settings for HugePage. It will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-2.SR-IOV VF settings

Configure the SR-IOV VF settings; you can increase the number of VFs, but for the sake of simplicity, we have set the number of VFs to "1". In addition, setting the MAC address is mandatory. *2

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-3.Install the Mellanox driver (OFED)

You can download the iso file from the Mellanox website. Mellanox Download Site
Please save the downloaded iso file to /root/tmp/.
The following command will install the Mellanox driver, but it will also install ovs v2.14.1 at the same time.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g155 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g155 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0
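
As a quick consistency check on the output above, the reserved hugepage memory should equal HugePages_Total multiplied by Hugepagesize (16 x 1048576 kB = 16777216 kB, which matches the Hugetlb line). A minimal sketch of that arithmetic follows; the function name hugepage_total_kb is hypothetical.

```shell
#!/usr/bin/env bash
# Read /proc/meminfo-style text on stdin and print the memory reserved
# for hugepages in kB (HugePages_Total x Hugepagesize).
hugepage_total_kb() {
    awk '/^HugePages_Total:/ { n = $2 }
         /^Hugepagesize:/    { sz = $2 }
         END                 { print n * sz }'
}

# Example (not executed here):
#   hugepage_total_kb < /proc/meminfo
```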

3.Kernel update

As of April 8, 2021, the vDPA-related modules are updated at a high frequency, so install the latest Kernel.

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *3
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel version.
Swap kernel-headers for kernel-ml-headers.
Remove the existing kernel-tools and kernel-tools-libs.
Install kernel-ml-tools and kernel-ml-tools-libs.
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g155 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Building qemu

4-1.Enabling the PowerTools Repository
vi /etc/yum.repos.d/CentOS-Linux-PowerTools.repo

enabled=1
4-2.Install the necessary packages

In addition to qemu, we have also installed the packages that are required for the dpdk build.

dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.0.0-rc1 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j$(nproc) && \
make install

Check the version after installation.

/usr/local/bin/qemu-system-x86_64 --version

[root@c83g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 5.2.91 (v6.0.0-rc1)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-5.Change the user to run qemu
vi /etc/libvirt/qemu.conf

user = "root"  # uncomment this line
group = "root"  # uncomment this line

5.Building dpdk

5-1.Building dpdk
cd /usr/src/ && \
git clone git://dpdk.org/dpdk && \
cd dpdk && \
git checkout v21.02 && \
meson -Dexamples=all build && \
ninja -C build && \
ninja -C build install
5-2.Links to dpdk-related libraries

Create a new file with vi and write the path of the dpdk lib directory in it.

vi /etc/ld.so.conf.d/libdpdk.conf

/usr/src/dpdk/build/lib

After running ldconfig, make sure the libs are linked.

ldconfig
ldconfig -p |grep dpdk

It is OK if the libraries resolve to the following paths.

[root@c83g155 dpdk]# ldconfig -p |grep dpdk
        librte_vhost.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so.21
        librte_vhost.so (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so
        librte_timer.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_timer.so.21
============ s n i p ============

Now, reboot once again.

reboot

6.Change to SR-IOV switchdev mode

6-1.Check the current operation mode.
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) number of the PCI device.
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
6-2.Changing the operating mode

Note that the bsf numbers are slightly different.*4

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver for VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the PF operation mode to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver of VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
6-3.Check the operation mode after the change.
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Changed to switchdev mode.

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable
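
If you want to check the mode from a script rather than by eye, the mode field can be extracted from the devlink output shown above. This is a minimal sketch that parses that text format only; the function name eswitch_mode is hypothetical.

```shell
#!/usr/bin/env bash
# Read one line of "devlink dev eswitch show" output on stdin and
# print the word that follows the "mode" keyword (legacy or switchdev).
eswitch_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Example (not executed here):
#   devlink dev eswitch show pci/0000:07:00.0 | eswitch_mode
```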

VF representors have been added.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, make sure that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g155 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g155 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
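
The line that matters in the output above is hw-tc-offload. A minimal sketch for picking out only that value from the ethtool text is shown below; the function name hw_tc_offload_state is hypothetical.

```shell
#!/usr/bin/env bash
# Read "ethtool -k" output on stdin and print the value of the
# hw-tc-offload feature (for example "on", "off" or "off [fixed]").
hw_tc_offload_state() {
    awk -F': *' '$1 == "hw-tc-offload" { print $2; exit }'
}

# Example (not executed here):
#   ethtool -k ens2f0 | hw_tc_offload_state
```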

7.Configure ovs-dpdk and VM : Different from previous article

7-1.Overall Flow - Overview -

Configure the settings in the order (1)-(9) described in fig.1 below.
fig.1
f:id:metonymical:20210412202928j:plain

  1. Enabling the virtio_vdpa module and configuring dpdk-vdpa : (1) : Different from previous article
  2. Initial configuration of ovs
  3. Configuration of br30-ovs: (2)(3)(4)
  4. Configuration of br31-ovs: (5)(6)(7)
  5. Configure and start virtual machine c77g153: (8) : Different from previous article
  6. Configure and start virtual machine c77g159: (9) : Different from previous article
7-2.Overall flow - Commands only -

We will run the following commands.
Detailed explanations will follow, but if you don't need the explanations, just execute the commands.

1.Enabling the virtio_vdpa module and configuring dpdk-vdpa
(1)
modprobe virtio_vdpa

/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

5.Configure and start virtual machine c77g153
(8)
virsh edit c77g153
<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

virt-xml c77g153 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g153 --edit --qemu-commandline='-chardev'
virt-xml c77g153 --edit --qemu-commandline='socket,id=charnet1,path=/tmp/sock-virtio0'
virt-xml c77g153 --edit --qemu-commandline='-netdev'
virt-xml c77g153 --edit --qemu-commandline='vhost-user,chardev=charnet1,queues=16,id=hostnet1'
virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=00:11:22:33:44:00,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'

6.Configure and start virtual machine c77g159
(9)
virsh edit c77g159
<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

virt-xml c77g159 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g159 --edit --qemu-commandline='-chardev'
virt-xml c77g159 --edit --qemu-commandline='socket,id=charnet2,path=/tmp/sock-virtio1'
virt-xml c77g159 --edit --qemu-commandline='-netdev'
virt-xml c77g159 --edit --qemu-commandline='vhost-user,chardev=charnet2,queues=16,id=hostnet2'
virt-xml c77g159 --edit --qemu-commandline='-device'
virt-xml c77g159 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet2,id=net1,mac=00:11:22:33:44:10,addr=0x7,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'
7-3.Enabling the virtio_vdpa module and configuring dpdk-vdpa:(1) : Different from previous article

Enabling the virtio_vdpa module
We will check the changes before and after executing the modprobe virtio_vdpa command.

Before running modprobe virtio_vdpa

lsmod |grep vd
ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa

[root@c83g155 ~]# lsmod |grep vd
mlx5_vdpa              45056  0
vhost_iotlb            16384  2 vhost,mlx5_vdpa
vdpa                   16384  1 mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa
ls: cannot access '/sys/bus/vdpa/drivers/virtio_vdpa': No such file or directory

After running modprobe virtio_vdpa

modprobe virtio_vdpa
lsmod |grep vd
ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa

[root@c83g155 ~]# lsmod |grep vd
virtio_vdpa            16384  0
mlx5_vdpa              45056  0
vhost_iotlb            16384  1 mlx5_vdpa
vdpa                   16384  2 virtio_vdpa,mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr 12 21:00 ./
drwxr-xr-x 3 root root    0 Apr 12 21:00 ../
--w------- 1 root root 4096 Apr 12 21:00 bind
lrwxrwxrwx 1 root root    0 Apr 12 21:00 module -> ../../../../module/virtio_vdpa/
--w------- 1 root root 4096 Apr 12 21:00 uevent
--w------- 1 root root 4096 Apr 12 21:00 unbind
lrwxrwxrwx 1 root root    0 Apr 12 21:00 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr 12 21:00 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output results, we can confirm the following.

  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the virtio_vdpa driver
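
The same check can be scripted: devices bound to a driver appear as symlinks in the driver's sysfs directory, as the ls output above shows. The following is a minimal sketch under that assumption; the function name bound_vdpa_devices is hypothetical.

```shell
#!/usr/bin/env bash
# List the vDPA devices bound to a driver, given the driver's sysfs directory.
bound_vdpa_devices() {
    local drv_dir="$1" dev
    for dev in "$drv_dir"/vdpa*; do
        # Bound devices are symlinks; an unmatched glob stays literal and is skipped.
        [ -L "$dev" ] && basename "$dev"
    done
    return 0
}

# Example (not executed here):
#   bound_vdpa_devices /sys/bus/vdpa/drivers/virtio_vdpa
```

In this article's environment, the expected result would be vdpa0 and vdpa1.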

Configuring dpdk-vdpa
Next, run the dpdk-vdpa command.

/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

When the prompt changes to "vdpa>", execute the following command.

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2

Connect to the host OS via ssh in another terminal and confirm that the sock file has been generated using the following command.

[root@c83g155 ~]# ls -Fal /tmp
total 36
drwxrwxrwt. 17 root root 4096 Apr 12 21:08 ./
dr-xr-xr-x. 17 root root  244 Apr  7 20:30 ../
-rw-r--r--   1 root root 1874 Apr  7 20:30 anaconda.log
===================== s n i p =====================
srwxr-xr-x   1 root root    0 Apr 12 21:08 sock-virtio0=
srwxr-xr-x   1 root root    0 Apr 12 21:08 sock-virtio1=
drwx------   3 root root   17 Apr 12 19:56 systemd-private-f5b122148a7c4019be8cf0116bd9f2cc-chronyd.service-IEe7hb/
===================== s n i p =====================

Note
The following is an example of output from the dpdk-vdpa command.

[root@c83g155 ~]# /usr/src/dpdk/build/examples/dpdk-vdpa \
> --socket-mem 1024,1024 \
> -a 0000:07:00.2,class=vdpa \
> -a 0000:07:01.2,class=vdpa \
> --log-level=pmd,debug -- -i
EAL: Detected 16 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: No available 2048 kB hugepages reported
EAL: Probing VFIO support...
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:07:00.2 (socket 0)
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: PCI information matches for device "mlx5_2".
common_mlx5: Netlink "devlink" family ID is 20.
common_mlx5: ROCE is enabled for device "0000:07:00.2".
common_mlx5: Device 0000:07:00.2 ROCE was disabled by Netlink successfully.
common_mlx5: Device "0000:07:00.2" was reloaded by Netlink successfully.
mlx5_vdpa: ROCE is disabled by Netlink successfully.
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: Checking device "mlx5_1"..
mlx5_vdpa: Checking device "mlx5_0"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: event mode is 1.
mlx5_vdpa: event_us is 0 us.
mlx5_vdpa: no traffic time is 2 s.
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:07:01.2 (socket 0)
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: PCI information matches for device "mlx5_3".
common_mlx5: Netlink "devlink" family ID is 20.
common_mlx5: ROCE is enabled for device "0000:07:01.2".
common_mlx5: Device 0000:07:01.2 ROCE was disabled by Netlink successfully.
common_mlx5: Device "0000:07:01.2" was reloaded by Netlink successfully.
mlx5_vdpa: ROCE is disabled by Netlink successfully.
mlx5_vdpa: Checking device "mlx5_1"..
mlx5_vdpa: Checking device "mlx5_0"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: event mode is 1.
mlx5_vdpa: event_us is 0 us.
mlx5_vdpa: no traffic time is 2 s.
EAL: No legacy callbacks, legacy socket not created
Interactive-mode selected
vdpa>                                     < < < < After executing the command, the prompt changes to "vdpa>".
vdpa> create /tmp/sock-virtio0 0000:07:00.2
VHOST_CONFIG: vhost-user server: socket created, fd: 83
VHOST_CONFIG: bind to /tmp/sock-virtio0
vdpa> create /tmp/sock-virtio1 0000:07:01.2
VHOST_CONFIG: vhost-user server: socket created, fd: 86
VHOST_CONFIG: bind to /tmp/sock-virtio1
vdpa>

Please keep this terminal as it is, as we will use it in the operation check later.

7-4.Initial configuration of ovs

Since ovs has already been installed, start the service from systemctl.*5

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

Start the ovs service
Initialize dpdk
HW offload and tc-policy configuration
Memory allocation
IOMMU configuration for vhost
Configure representor
Restart the ovs service (to reflect the above settings)

Use the following command to check the settings.

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{dpdk-extra=" -w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 -w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0", dpdk-init="true", dpdk-socket-mem="1024,1024", hw-offload="true", tc-policy=none, vhost-iommu-support="true"}
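
When only one key is of interest, it can be picked out of the map printed above. The sketch below parses that text format and handles both quoted and unquoted values; the function name other_config_value is hypothetical.

```shell
#!/usr/bin/env bash
# Read the map printed by "ovs-vsctl get Open_vSwitch . other_config"
# on stdin and print the value stored under the given key.
other_config_value() {
    local key="$1"
    # First try a quoted value (key="value"); if that matched, stop.
    # Otherwise fall back to an unquoted value (key=value).
    sed -n -E \
        -e "s/.*[{ ]${key}=\"([^\"]*)\".*/\1/p;t" \
        -e "s/.*[{ ]${key}=([^,}]*).*/\1/p"
}

# Example (not executed here):
#   ovs-vsctl get Open_vSwitch . other_config | other_config_value hw-offload
```

ovs-vsctl can also return a single key directly, for example "ovs-vsctl get Open_vSwitch . other_config:hw-offload".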

Note 1:
Here is a supplementary explanation of other_config:dpdk-extra.
There is the following correspondence between the output results of "lshw -businfo -c network" and the commands configured in "other_config:dpdk-extra".

0000:07:00.0 ens2f0_0 -w 0000:07:00.0,representor=[0]
0000:07:00.1 ens2f1_0 -w 0000:07:00.1,representor=[0]

Note 2:
Here is a supplementary explanation of other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

Note 3:
If you want to remove the configuration, execute the command as follows.
"dpdk-extra" is the key, so specify any key you want to delete, such as "dpdk-init" or "hw-offload".

ovs-vsctl remove Open_vSwitch . other_config dpdk-extra
7-5.Configuration of br30-ovs : (2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

(2) Create a bridge
(3) Create the uplink (specify PF and set the interface for the external NW)
(4) Create the downlink (specify the VF representor and set up the interface for the VM)

Check the settings with the following command.

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-6.Configuration of br31-ovs : (5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

Same as (2), (3), and (4).

Check the settings with the following command. The blue text is the part that has been added.

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        datapath_type: netdev
        Port ens2f1_0
            Interface ens2f1_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1,representor=[0]"}
        Port ens2f1
            Interface ens2f1
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1"}
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
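
To confirm at a glance that each bridge holds its PF and its VF representor, the "ovs-vsctl show" output above can be reduced to bridge and port pairs. This is a minimal sketch that parses that text format; the function name ovs_ports_by_bridge is hypothetical.

```shell
#!/usr/bin/env bash
# Read "ovs-vsctl show" output on stdin and print one
# "<bridge> <port>" pair per line.
ovs_ports_by_bridge() {
    awk '{ gsub(/"/, "") }                 # some versions quote the names
         $1 == "Bridge" { br = $2 }
         $1 == "Port"   { print br, $2 }'
}

# Example (not executed here):
#   ovs-vsctl show | ovs_ports_by_bridge
```
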
7-7.Configure and start virtual machine c77g153 : (8) : Different from previous article

Please upload the qcow2 file to "/var/lib/libvirt/images/".
In this article, the qcow2 file with CentOS7.7 installed was prepared beforehand.
Additionally, once you have created a virtual machine with virt-manager, you will edit it with the "virsh edit" and "virt-xml" commands.*6

Log in to the host OS via VNC or other means, and start virt-manager.
When creating a new virtual machine, delete the following [1]-[5] devices.*7
f:id:metonymical:20210412213049j:plain
After booting the VM, shut it down once.
After shutdown, the device configuration should look like the following.
The NICs listed here are not used for vDPA, but they allow you to ssh into the VM, so assign a management IP address to them if necessary.
f:id:metonymical:20210412212616j:plain

After shutdown, use the virsh edit command to perform the following settings.

(8)
virsh edit c77g153

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

After returning to the bash, further configure the following settings using the virt-xml command.

(8)
virt-xml c77g153 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g153 --edit --qemu-commandline='-chardev'
virt-xml c77g153 --edit --qemu-commandline='socket,id=charnet1,path=/tmp/sock-virtio0'
virt-xml c77g153 --edit --qemu-commandline='-netdev'
virt-xml c77g153 --edit --qemu-commandline='vhost-user,chardev=charnet1,queues=16,id=hostnet1'
virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=00:11:22:33:44:00,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'
-mem-prealloc : We have not been able to confirm the details, but it appears to be a mandatory setting, since it is used for exchanging virtqueues with the Platform IOMMU shown in fig.2.
path=/tmp/sock-virtio0 : Explicitly specifies the sock file for dpdk-vdpa.
mq=on : Enables multi-queue.
page-per-vq=on : This setting is required to use virtqueue.
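
Since the seven virt-xml commands differ between the two VMs only in a few values, they can be generated instead of typed. The following is a minimal sketch, not part of the original procedure: the function name print_vdpa_virt_xml_cmds and its argument order are hypothetical, and it only prints the commands so that you can review them before running anything.

```shell
# Hypothetical helper (not from the article): print the virt-xml commands for one VM.
# Arguments: VM name, chardev/netdev index, socket path, MAC address, guest PCI slot.
print_vdpa_virt_xml_cmds() {
    local vm="$1" idx="$2" sock="$3" mac="$4" slot="$5"
    local arg
    # The argument list mirrors the seven commands shown in this section.
    for arg in \
        "-mem-prealloc" \
        "-chardev" \
        "socket,id=charnet${idx},path=${sock}" \
        "-netdev" \
        "vhost-user,chardev=charnet${idx},queues=16,id=hostnet${idx}" \
        "-device" \
        "virtio-net-pci,mq=on,vectors=6,netdev=hostnet${idx},id=net1,mac=${mac},addr=${slot},page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024"
    do
        printf "virt-xml %s --edit --qemu-commandline='%s'\n" "$vm" "$arg"
    done
}
```

For example, print_vdpa_virt_xml_cmds c77g153 1 /tmp/sock-virtio0 00:11:22:33:44:00 0x6 prints the commands for c77g153. After checking the output, it could be piped to bash.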

Note
When you run the virt-xml command, the following WARNING message is displayed. Please ignore it.

WARNING  XML did not change after domain define. You may have changed a value that libvirt is setting by default.
7-8.Configure and start virtual machine c77g159 : (9) : Different from previous article

Same as 7-7, except for the socket path (/tmp/sock-virtio1), the chardev and netdev IDs, the MAC address, and the addr value.

(9)
virsh edit c77g159

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

After returning to the bash prompt, configure the following additional settings with the virt-xml command.

(9)
virt-xml c77g159 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g159 --edit --qemu-commandline='-chardev'
virt-xml c77g159 --edit --qemu-commandline='socket,id=charnet2,path=/tmp/sock-virtio1'
virt-xml c77g159 --edit --qemu-commandline='-netdev'
virt-xml c77g159 --edit --qemu-commandline='vhost-user,chardev=charnet2,queues=16,id=hostnet2'
virt-xml c77g159 --edit --qemu-commandline='-device'
virt-xml c77g159 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet2,id=net1,mac=00:11:22:33:44:10,addr=0x7,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'

8.Operation check : Different from previous article

8-1.Advance preparation

Prepare five consoles on hostOS c83g155.

ConsoleA : Already started in 7-3 : To view the dpdk-vdpa log
ConsoleB : watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs : To check the packet count on c77g153
ConsoleC : watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs : To check the packet count on c77g159
ConsoleD : virsh start c77g153; virsh console c77g153 : For the console of virtual machine c77g153
ConsoleE : virsh start c77g159; virsh console c77g159 : For the console of virtual machine c77g159
8-2.Booting the VM

ConsoleA has been running in debug mode since the dpdk-vdpa command was executed in 7-3.
On ConsoleB and ConsoleC, run the above commands before starting the VMs.
Then, on ConsoleD, start c77g153 with the above command.
Wait a few seconds, then start c77g159 on ConsoleE with the above command.
Send a ping from c77g153 or c77g159.
As an example, following fig.1, execute ping 192.168.30.159 -f from c77g153.

fig.1
f:id:metonymical:20210412202928j:plain

The following is the output result. The points of interest are in red.

ConsoleA
The ConsoleA log is an excerpt.
The full output has been saved to this link.

vdpa> 
VHOST_CONFIG: new vhost user connection is 87
VHOST_CONFIG: new device, handle is 0
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_PROTOCOL_FEATURES
===================== s n i p =====================
VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
VHOST_CONFIG: negotiated Virtio features: 0x140601803
VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
VHOST_CONFIG: guest memory region size: 0x80000000
         guest physical addr: 0x0
         guest virtual  addr: 0x7faa40000000
         host  virtual  addr: 0x7f8080000000
         mmap addr : 0x7f8080000000
         mmap size : 0x80000000
         mmap align: 0x40000000
         mmap off  : 0x0
VHOST_CONFIG: guest memory region size: 0x80000000
         guest physical addr: 0x100000000
         guest virtual  addr: 0x7faac0000000
         host  virtual  addr: 0x7f8000000000
         mmap addr : 0x7f7f80000000
         mmap size : 0x100000000
         mmap align: 0x40000000
         mmap off  : 0x80000000
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
===================== s n i p =====================
new port /tmp/sock-virtio0, device : 0000:07:00.2
mlx5_vdpa: Cannot get vhost MTU - -95.
mlx5_vdpa: MTU cannot be set on device 0000:07:00.2.
mlx5_vdpa: Region 0: HVA 0x7f8080000000, GPA 0x0, size 0x80000000.
mlx5_vdpa: Region 1: HVA 0x7f8000000000, GPA 0x100000000, size 0x80000000.
mlx5_vdpa: Indirect mkey mode is KLM Fixed Buffer Size.
mlx5_vdpa: Memory registration information: nregions = 2, mem_size = 0x180000000, GCD = 0x80000000, klm_fbs_entries_num = 0x3, klm_entries_num = 0x3.
mlx5_vdpa: Dump fill Mkey = 1792.
mlx5_vdpa: Registered error interrupt for device0.
mlx5_vdpa: VAR address of doorbell mapping is 0x7f8157669000.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 0.
mlx5_vdpa: Register fd 123 interrupt for virtq 0.
mlx5_vdpa: vid 0 virtq 0 was created successfully.
mlx5_vdpa: Virtq 0 notifier state is enabled.
mlx5_vdpa: Ring virtq 0 doorbell.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 1.
mlx5_vdpa: Register fd 89 interrupt for virtq 1.
mlx5_vdpa: vid 0 virtq 1 was created successfully.
mlx5_vdpa: Virtq 1 notifier state is enabled.
mlx5_vdpa: Ring virtq 1 doorbell.
mlx5_vdpa: vDPA device 0 was configured.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:127
mlx5_vdpa: Update virtq 1 status enable -> disable.
mlx5_vdpa: vid 0 virtq 1 was stopped.
mlx5_vdpa: Query vid 0 vring 1: hw_available_idx=0, hw_used_index=0
mlx5_vdpa: Update virtq 1 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 1.
mlx5_vdpa: Register fd 89 interrupt for virtq 1.
mlx5_vdpa: vid 0 virtq 1 was created successfully.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 1 to qp idx: 0
===================== s n i p =====================
mlx5_vdpa: Update virtq 2 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 2.
mlx5_vdpa: Register fd 90 interrupt for virtq 2.
mlx5_vdpa: vid 0 virtq 2 was created successfully.
mlx5_vdpa: Virtq 2 notifier state is enabled.
mlx5_vdpa: Ring virtq 2 doorbell.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 1 to qp idx: 3
mlx5_vdpa: Update virtq 3 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 3.
mlx5_vdpa: Register fd 91 interrupt for virtq 3.
mlx5_vdpa: vid 0 virtq 3 was created successfully.
===================== s n i p =====================
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 0 to qp idx: 31
mlx5_vdpa: Virtq 3 notifier state is enabled.
mlx5_vdpa: Ring virtq 3 doorbell.
mlx5_vdpa: Device 0000:07:00.2 virtq 3 cq 2277 event was captured. Timer is off, cq ci is 1.
mlx5_vdpa: Device 0000:07:00.2 virtq 1 cq 2270 event was captured. Timer is on, cq ci is 1.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped.
mlx5_vdpa: Device 0000:07:00.2 virtq 3 cq 2277 event was captured. Timer is off, cq ci is 18.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped.
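
If you capture the ConsoleA output to a file, the virtq creation messages can be counted instead of read line by line. This is only a sketch under assumptions: the function name count_created_virtqs is hypothetical, and the pattern is based solely on the log lines shown in this article, so it may need adjusting for other DPDK versions.

```shell
# Hypothetical helper: count the distinct virtq numbers that a dpdk-vdpa debug log
# (read from stdin) reports as "was created successfully".
count_created_virtqs() {
    sed -n 's/^mlx5_vdpa: vid [0-9]* virtq \([0-9]*\) was created successfully.*/\1/p' \
        | sort -un \
        | wc -l \
        | tr -d ' '
}
```

Usage would be something like count_created_virtqs < dpdk-vdpa.log, where dpdk-vdpa.log is an assumed file name for the saved console output. A virtq that is stopped and created again, like virtq 1 in the excerpt, is counted once.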

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f0: rx pkts=159317, bytes=15614385, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159318, bytes=15614457, drop=0, errs=0, coll=?
           duration=173.964s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=53, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=54, errs=0, coll=0
           duration=173.957s
  port  "ens2f0_0": rx pkts=159318, bytes=15614457, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159317, bytes=15614385, drop=0, errs=0, coll=?
           duration=173.729s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f1: rx pkts=159318, bytes=15614493, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159317, bytes=15614349, drop=0, errs=0, coll=?
           duration=180.549s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=53, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=54, errs=0, coll=0
           duration=181.910s
  port  "ens2f1_0": rx pkts=159317, bytes=15614349, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159318, bytes=15614493, drop=0, errs=0, coll=?
           duration=180.861s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159 -f
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
.
--- 192.168.30.159 ping statistics ---
159288 packets transmitted, 159288 received, 0% packet loss, time 24357ms
rtt min/avg/max/mdev = 0.069/0.086/60.812/0.202 ms, pipe 5, ipg/ewma 0.152/0.101 ms

Note

mlx5_vdpa: Cannot get vhost MTU - -95. : This MTU message is output, but it is not a problem (-95 is -EOPNOTSUPP, i.e. the operation is not supported).
mlx5_vdpa: vid 0 virtq 0 was created successfully. : Indicates that the virtq was created successfully.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped. : This message appears a short while after the virtual machine starts, but it does not mean that sending and receiving traffic has stopped, so it is not a problem.
ens2f0 "ens2f0_0" : You can see that the tx/rx packet counts and byte counts for each port are increasing.
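
To compare the counters of the PF port and the VF representor without scanning the whole dump, the rx/tx packet counts of one port can be extracted from the ovs-ofctl output. A minimal sketch follows. The function name port_pkts is hypothetical, and the parsing assumes the dump-ports layout shown in ConsoleB and ConsoleC above.

```shell
# Hypothetical helper: print "rx_pkts tx_pkts" for one port, reading
# `ovs-ofctl dump-ports` output from stdin. Exits non-zero if the port is not found.
port_pkts() {
    awk -v port="$1" '
        /^ *port / {
            name = $2
            gsub(/[":]/, "", name)      # strip the quotes and the trailing colon
            in_port = (name == port)
        }
        in_port && /rx pkts=/ { sub(/.*rx pkts=/, ""); sub(/,.*/, ""); rx = $0 }
        in_port && /tx pkts=/ { sub(/.*tx pkts=/, ""); sub(/,.*/, ""); tx = $0 }
        END { if (rx != "" && tx != "") print rx, tx; else exit 1 }
    '
}
```

For example, ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_pkts ens2f0_0 prints the two counters for the representor. In the output above, the rx count of ens2f0 equals the tx count of ens2f0_0 and vice versa, which is what you would expect when all traffic is forwarded between the two ports.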

That's all.

*1:This description is based on my own understanding. If anything is incorrect, please point it out.

*2:We have confirmed that if the MAC address is not set, the VM will not recognize the VF after VM startup.

*3:The core and modules packages will be installed at the same time.

*4:The "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so I don't think you need to worry too much about it.

*5:It has already been installed in 2-3.

*6:We will describe the details in the extra chapter, but in the case of vhost_vdpa in the previous article, we were able to start the virtual machine with virt-manager but were not able to communicate with it. For this reason, with vhost_vdpa, we booted directly from qemu-kvm.

*7:This is because related packages such as spice were not installed when qemu was built, and the virtual machine could not be started without removing these devices. Since this is not directly related to vDPA, we will not discuss how to deal with these issues.

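How to set up vDPA with virtio_vdpa for VMs

This article is a sequel to the previous one.
metonymical.hatenablog.com

This article describes how to configure communication between VMs (virtual machines) using the virtio_vdpa module.

1.Overview

1-1.Environment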
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.3(2011)
Kernel                           : 5.11.11-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.2-2.2.0.0
qemu-kvm                         : v6.0.0-rc1
DPDK                             : v21.02
ovs                              : v2.14.1
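1-2.Overall flow

Advance preparation
Kernel update
Build qemu
Build dpdk
Change to SR-IOV switchdev mode
Configure ovs-dpdk and VM : Different from previous article
Operation check : Different from previous article

<Note>
Because many items are identical to the previous article, the items that differ are written in bold blue text.
If your environment is already set up from the previous article, reboot the host OS and then start reading from "Change to SR-IOV switchdev mode".

1-3.Overall configuration

The following point differs from the previous article.

This article (1) /tmp/sock-virtio0
Previous article (1) /dev/vhost-vdpa-0

fig.1
f:id:metonymical:20210412202928j:plain
fig.1 is simplified and omits the internal architecture. In practice, please picture the configuration shown in fig.2.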

fig.2
f:id:metonymical:20210413220050j:plain

Quoted from the Red Hat blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to each other in fig.1 and fig.2.
In addition, fig.2 shows the actual traffic flows in blue and red text.*1

fig.2 shows the SR-IOV PF and VF, and additionally the "VF rep". Note that the PF and the VF rep have the same bsf (Bus, Slot, Function) number.

PF VF0 VF0 rep
ens2f0 ens2f0v0 ens2f0_0
07:00.0 07:00.2 07:00.0

rep (representor) is an interface specific to the switchdev mode of SR-IOV, and it is created when switchdev mode is enabled.
In contrast to switchdev mode, the conventional SR-IOV VF operation is called legacy mode, and the two must be explicitly distinguished. In addition, on the ConnectX-6 Dx, switchdev mode is a mandatory requirement for enabling vDPA HW offload.

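2.Advance preparation

Although not described in detail here, SELinux and the firewall were disabled and NTP time synchronization was configured beforehand.

2-1.Enabling HugePage and IOMMU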
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, configure the HugePage mount. It will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
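2-2.SR-IOV VF settings

Configure the SR-IOV VFs. You can increase the number of VFs, but to keep the explanation simple, the number of VFs is set to "1" here. In addition, setting the MAC address is mandatory.*2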

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-3.Install the Mellanox driver (OFED)

Download the iso file from the Mellanox website. Mellanox Download Site
Save the downloaded iso file to /root/tmp/.
The following command installs the Mellanox driver. ovs v2.14.1 is installed at the same time.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

Once the installation is complete, reboot.

reboot

Once the reboot is complete, check the HugePage settings.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g155 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g155 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0
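
The same check can be scripted so that it fails loudly when the kernel parameters did not take effect. This is a minimal sketch: the function name check_hugepages is hypothetical, and the 1 GiB page size is taken from the grub settings in 2-1.

```shell
# Hypothetical helper: succeed if the meminfo text on stdin reports at least
# $1 huge pages and a huge page size of 1 GiB (1048576 kB).
check_hugepages() {
    awk -v want="$1" '
        /^HugePages_Total:/ { total = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END { exit !(total >= want && size_kb == 1048576) }
    '
}
```

On the host, check_hugepages 16 < /proc/meminfo && echo OK would confirm the 16 pages configured in 2-1.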

3.Kernel update

As of April 8, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.

3-2.Install the kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed kernels
Install kernel-ml and kernel-ml-devel*3
Check the installed kernels
Reboot

3-3.Install kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the version of the currently running kernel
Install kernel-headers
Remove the existing kernel-tools and kernel-tools-libs
Install kernel-tools and kernel-tools-libs
Check the installed kernels

If the output looks like the following, it is OK.

[root@c83g155 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Build qemu

4-1.Enable the PowerTools repository
vi /etc/yum.repos.d/CentOS-Linux-PowerTools.repo

enabled=1
4-2.Install the required packages

In addition to the packages for qemu, the packages required for building dpdk are installed here as well.

dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm
4-3.Build qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.0.0-rc1 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install

Check the version after installation.

/usr/local/bin/qemu-system-x86_64 --version

[root@c83g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 5.2.91 (v6.0.0-rc1)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Change the qemu executable path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-5.Change the qemu execution user
vi /etc/libvirt/qemu.conf

user = "root"  #uncomment this line
group = "root"  #uncomment this line

5.Build dpdk

5-1.Build dpdk
cd /usr/src/ && \
git clone git://dpdk.org/dpdk && \
cd dpdk && \
git checkout v21.02 && \
meson -Dexamples=all build && \
ninja -C build && \
ninja -C build install
5-2.Link the dpdk-related libraries

Create a new file with vi and write the path of the lib directory in it.

vi /etc/ld.so.conf.d/libdpdk.conf

/usr/src/dpdk/build/lib

After running ldconfig, confirm that the libs are linked.

ldconfig
ldconfig -p |grep dpdk

It is OK if the libs point to the paths as shown below.

[root@c83g155 dpdk]# ldconfig -p |grep dpdk
        librte_vhost.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so.21
        librte_vhost.so (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so
        librte_timer.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_timer.so.21

Reboot once at this point.

reboot

6.Change to SR-IOV switchdev mode

6-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) numbers of the PCI devices
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
6-2.Change the operation mode

Note that the bsf numbers are slightly different.*4

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver from the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the operation mode of the PFs to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver to the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
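
The unbind, mode change, rebind order can be written down as a small generator that prints the commands for one PF and its VFs. This is only a sketch: the function name print_switchdev_cmds is hypothetical, and handling one PF at a time (rather than unbinding all VFs first, as in the commands above) is an assumption that we have not verified on hardware.

```shell
# Hypothetical helper: print the commands that switch one PF to switchdev mode.
# Arguments: PF bsf with domain (e.g. 0000:07:00.0), followed by the bsf of each of its VFs.
print_switchdev_cmds() {
    local pf="$1"
    shift
    local vf
    # 1. Unbind mlx5_core from every VF of this PF.
    for vf in "$@"; do
        echo "echo $vf > /sys/bus/pci/drivers/mlx5_core/unbind"
    done
    # 2. Change the eswitch mode of the PF.
    echo "devlink dev eswitch set pci/$pf mode switchdev"
    # 3. Rebind mlx5_core to every VF.
    for vf in "$@"; do
        echo "echo $vf > /sys/bus/pci/drivers/mlx5_core/bind"
    done
}
```

For example, print_switchdev_cmds 0000:07:00.0 0000:07:00.2 prints the three commands for ens2f0, and print_switchdev_cmds 0000:07:00.1 0000:07:01.2 prints those for ens2f1.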
6-3.Check the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

The mode has been changed to switchdev.

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable
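
If you want to check the mode in a script rather than by eye, the mode field can be pulled out of the devlink output. A minimal sketch, assuming the one-line output format shown above. The function name eswitch_mode is hypothetical.

```shell
# Hypothetical helper: print the eswitch mode ("legacy" or "switchdev") from the
# output of `devlink dev eswitch show pci/<bsf>` read on stdin.
eswitch_mode() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "mode") { print $(i + 1); exit }   # "inline-mode" does not match
        }
    }'
}
```

For example, devlink dev eswitch show pci/0000:07:00.0 | eswitch_mode should print switchdev after the change in 6-2.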

The VF representors have been added.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, confirm that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g155 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g155 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
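
The line that matters in the output above is hw-tc-offload: on. As a small sketch (the function name hw_tc_offload_on is hypothetical), the check can be reduced to that single line:

```shell
# Hypothetical helper: succeed if the `ethtool -k <port>` output on stdin
# contains "hw-tc-offload: on".
hw_tc_offload_on() {
    grep -q '^hw-tc-offload: on'
}
```

For example, ethtool -k ens2f0 | hw_tc_offload_on && echo "ens2f0: hw-tc-offload is on" reports the state of the first port.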

7.Configure ovs-dpdk and VM : Different from previous article

7-1.Overall flow - Overview -

Configure in the order of (1)-(9) shown in fig.1 below.
fig.1
f:id:metonymical:20210412202928j:plain

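  1. Enable the virtio_vdpa module and configure dpdk-vdpa : (1) : Different from previous article
  2. Initial configuration of ovs
  3. Configuration of br30-ovs : (2)(3)(4)
  4. Configuration of br31-ovs : (5)(6)(7)
  5. Configure and start virtual machine c77g153 : (8) : Different from previous article
  6. Configure and start virtual machine c77g159 : (9) : Different from previous article
7-2.Overall flow - Commands only -

Enter the following commands.
Detailed explanations follow later. If you do not need them, just run the commands.

1.Enable the virtio_vdpa module and configure dpdk-vdpa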
(1)
modprobe virtio_vdpa

/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]
 
4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

5.Configure and start virtual machine c77g153
(8)
virsh edit c77g153
<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

virt-xml c77g153 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g153 --edit --qemu-commandline='-chardev'
virt-xml c77g153 --edit --qemu-commandline='socket,id=charnet1,path=/tmp/sock-virtio0'
virt-xml c77g153 --edit --qemu-commandline='-netdev'
virt-xml c77g153 --edit --qemu-commandline='vhost-user,chardev=charnet1,queues=16,id=hostnet1'
virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=00:11:22:33:44:00,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'

6.Configure and start virtual machine c77g159
(9)
virsh edit c77g159
<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

virt-xml c77g159 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g159 --edit --qemu-commandline='-chardev'
virt-xml c77g159 --edit --qemu-commandline='socket,id=charnet2,path=/tmp/sock-virtio1'
virt-xml c77g159 --edit --qemu-commandline='-netdev'
virt-xml c77g159 --edit --qemu-commandline='vhost-user,chardev=charnet2,queues=16,id=hostnet2'
virt-xml c77g159 --edit --qemu-commandline='-device'
virt-xml c77g159 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet2,id=net1,mac=00:11:22:33:44:10,addr=0x7,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'
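7-3.Enable the virtio_vdpa module and configure dpdk-vdpa : (1) : Different from previous article

Enabling the virtio_vdpa module
We will check what changes before and after running the modprobe virtio_vdpa command.

Before running modprobe virtio_vdpa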

lsmod |grep vd
ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa

[root@c83g155 ~]# lsmod |grep vd
mlx5_vdpa              45056  0
vhost_iotlb            16384  2 vhost,mlx5_vdpa
vdpa                   16384  1 mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa
ls: cannot access '/sys/bus/vdpa/drivers/virtio_vdpa': No such file or directory

After running modprobe virtio_vdpa

modprobe virtio_vdpa
lsmod |grep vd
ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa

[root@c83g155 ~]# lsmod |grep vd
virtio_vdpa            16384  0
mlx5_vdpa              45056  0
vhost_iotlb            16384  1 mlx5_vdpa
vdpa                   16384  2 virtio_vdpa,mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/virtio_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr 12 21:00 ./
drwxr-xr-x 3 root root    0 Apr 12 21:00 ../
--w------- 1 root root 4096 Apr 12 21:00 bind
lrwxrwxrwx 1 root root    0 Apr 12 21:00 module -> ../../../../module/virtio_vdpa/
--w------- 1 root root 4096 Apr 12 21:00 uevent
--w------- 1 root root 4096 Apr 12 21:00 unbind
lrwxrwxrwx 1 root root    0 Apr 12 21:00 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr 12 21:00 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output, we can confirm the following.

  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the virtio_vdpa driver
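
The vdpaN entries under the driver directory are what show this binding, so they can also be listed on their own. The following is a minimal sketch. The function name list_vdpa_devices is hypothetical, and the optional second argument exists only so that the function can be tried against a copy of the directory tree.

```shell
# Hypothetical helper: list the vDPA devices bound to a driver.
# Arguments: driver name, optional sysfs root (default: /sys/bus/vdpa/drivers).
list_vdpa_devices() {
    local driver="$1" root="${2:-/sys/bus/vdpa/drivers}"
    local entry
    for entry in "$root/$driver"/vdpa*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        basename "$entry"
    done
}
```

On the host in this article, list_vdpa_devices virtio_vdpa would be expected to print vdpa0 and vdpa1 after modprobe virtio_vdpa, and nothing before it.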

Configuring dpdk-vdpa
Next, run the dpdk-vdpa command.

/usr/src/dpdk/build/examples/dpdk-vdpa \
--socket-mem 1024,1024 \
-a 0000:07:00.2,class=vdpa \
-a 0000:07:01.2,class=vdpa \
--log-level=pmd,debug -- -i

The prompt changes to vdpa>, so run the following commands.

create /tmp/sock-virtio0 0000:07:00.2
create /tmp/sock-virtio1 0000:07:01.2

Connect to the host OS via ssh from another terminal, and confirm with the following command that the sock files have been created.

[root@c83g155 ~]# ls -Fal /tmp
total 36
drwxrwxrwt. 17 root root 4096 Apr 12 21:08 ./
dr-xr-xr-x. 17 root root  244 Apr  7 20:30 ../
-rw-r--r--   1 root root 1874 Apr  7 20:30 anaconda.log
===================== s n i p =====================
srwxr-xr-x   1 root root    0 Apr 12 21:08 sock-virtio0=
srwxr-xr-x   1 root root    0 Apr 12 21:08 sock-virtio1=
drwx------   3 root root   17 Apr 12 19:56 systemd-private-f5b122148a7c4019be8cf0116bd9f2cc-chronyd.service-IEe7hb/
===================== s n i p =====================
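
If the VMs are started from a script, it helps to wait until both sockets exist before continuing. The sketch below is only an illustration: wait_for_socket is a hypothetical helper, and the one-second polling interval and the default timeout are arbitrary assumptions.

```bash
# Wait until PATH exists as a UNIX domain socket, polling once per second.
# wait_for_socket is a hypothetical helper, not part of dpdk or libvirt.
# Usage: wait_for_socket PATH [TIMEOUT_SECONDS]
wait_for_socket() {
  local path="$1" timeout="${2:-10}" waited=0
  while [ ! -S "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$path is ready"
}

# Example (on the host):
#   wait_for_socket /tmp/sock-virtio0 && wait_for_socket /tmp/sock-virtio1
```

The test [ -S path ] is true only for socket files, which corresponds to the leading "s" in the srwxr-xr-x permissions shown above.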

<Note>
An example of the dpdk-vdpa command output is shown below.

[root@c83g155 ~]# /usr/src/dpdk/build/examples/dpdk-vdpa \
> --socket-mem 1024,1024 \
> -a 0000:07:00.2,class=vdpa \
> -a 0000:07:01.2,class=vdpa \
> --log-level=pmd,debug -- -i
EAL: Detected 16 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: No available 2048 kB hugepages reported
EAL: Probing VFIO support...
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:07:00.2 (socket 0)
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: PCI information matches for device "mlx5_2".
common_mlx5: Netlink "devlink" family ID is 20.
common_mlx5: ROCE is enabled for device "0000:07:00.2".
common_mlx5: Device 0000:07:00.2 ROCE was disabled by Netlink successfully.
common_mlx5: Device "0000:07:00.2" was reloaded by Netlink successfully.
mlx5_vdpa: ROCE is disabled by Netlink successfully.
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: Checking device "mlx5_1"..
mlx5_vdpa: Checking device "mlx5_0"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: event mode is 1.
mlx5_vdpa: event_us is 0 us.
mlx5_vdpa: no traffic time is 2 s.
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:07:01.2 (socket 0)
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: PCI information matches for device "mlx5_3".
common_mlx5: Netlink "devlink" family ID is 20.
common_mlx5: ROCE is enabled for device "0000:07:01.2".
common_mlx5: Device 0000:07:01.2 ROCE was disabled by Netlink successfully.
common_mlx5: Device "0000:07:01.2" was reloaded by Netlink successfully.
mlx5_vdpa: ROCE is disabled by Netlink successfully.
mlx5_vdpa: Checking device "mlx5_1"..
mlx5_vdpa: Checking device "mlx5_0"..
mlx5_vdpa: Checking device "mlx5_2"..
mlx5_vdpa: Checking device "mlx5_3"..
mlx5_vdpa: event mode is 1.
mlx5_vdpa: event_us is 0 us.
mlx5_vdpa: no traffic time is 2 s.
EAL: No legacy callbacks, legacy socket not created
Interactive-mode selected
vdpa>                                     < < < < After the command runs, the prompt becomes "vdpa>"
vdpa> create /tmp/sock-virtio0 0000:07:00.2
VHOST_CONFIG: vhost-user server: socket created, fd: 83
VHOST_CONFIG: bind to /tmp/sock-virtio0
vdpa> create /tmp/sock-virtio1 0000:07:01.2
VHOST_CONFIG: vhost-user server: socket created, fd: 86
VHOST_CONFIG: bind to /tmp/sock-virtio1
vdpa>

Leave this terminal as it is, because it is also used in the operation check later.

7-4.Initial configuration of ovs

Since ovs is already installed*5, start the service with systemctl.

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

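Start the ovs service
Initialize dpdk
Configure HW offload and tc-policy
Allocate memory
Configure IOMMU support for vhost
Configure the representors
Restart the ovs service (to apply the settings above)

Check the settings with the following command.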

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{dpdk-extra=" -w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 -w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0", dpdk-init="true", dpdk-socket-mem="1024,1024", hw-offload="true", tc-policy=none, vhost-iommu-support="true"}
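
Rather than reading the long other_config line by eye, each expected key can be checked individually. This is a rough sketch under stated assumptions: has_setting is a hypothetical helper, and the sample string is copied from the output above with dpdk-extra omitted for brevity.

```bash
# Succeed when KEY is set to VALUE in the text printed by
# "ovs-vsctl get Open_vSwitch . other_config" (values may be quoted or not).
# has_setting is a hypothetical helper, not an ovs command.
# Usage: has_setting TEXT KEY VALUE
has_setting() {
  printf '%s\n' "$1" | grep -Eq "(^|[{ ,])$2=\"?$3\"?([,}]|\$)"
}

# Sample taken from the output above (dpdk-extra omitted for brevity).
sample='{dpdk-init="true", dpdk-socket-mem="1024,1024", hw-offload="true", tc-policy=none, vhost-iommu-support="true"}'
for kv in dpdk-init=true hw-offload=true tc-policy=none vhost-iommu-support=true; do
  if has_setting "$sample" "${kv%%=*}" "${kv#*=}"; then
    echo "OK      $kv"
  else
    echo "MISSING $kv"
  fi
done
```

On a live host, the sample string would be replaced by "$(ovs-vsctl get Open_vSwitch . other_config)".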

<Note 1>
A supplementary note on other_config:dpdk-extra.
The output of lshw -businfo -c network corresponds to the arguments set in other_config:dpdk-extra as follows.

0000:07:00.0 ens2f0_0 -w 0000:07:00.0,representor=[0]
0000:07:00.1 ens2f1_0 -w 0000:07:00.1,representor=[0]

<Note 2>
A supplementary note on other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

<Note 3>
To delete a setting, run the command as shown below.
In this example dpdk-extra is the key; specify whichever key you want to delete, such as dpdk-init or hw-offload.

ovs-vsctl remove Open_vSwitch . other_config dpdk-extra
7-5.Configuration of br30-ovs: (2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

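(2) Create the bridge
(3) Create the uplink (specify the PF and configure the interface facing the external NW)
(4) Create the downlink (specify the VF representor and configure the interface facing the VM)

Check the settings with the following command.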
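
Steps (2)-(4) follow a fixed pattern that repeats for the second bridge in 7-6. As an illustration only, the pattern can be written as a small generator. print_bridge_cmds is a hypothetical helper, and the assumption that the representor is named after the PF with a "_0" suffix simply follows the interface names used in this article. The function only prints the commands; it does not run them.

```bash
# Print the ovs-vsctl commands for one bridge: the bridge itself, the PF
# uplink, and the VF representor downlink. Nothing is executed.
# print_bridge_cmds is a hypothetical helper, not an ovs command.
# Usage: print_bridge_cmds BRIDGE PF_NAME PF_PCI_ADDRESS
print_bridge_cmds() {
  local br="$1" pf="$2" pci="$3"
  echo "ovs-vsctl add-br $br -- set bridge $br datapath_type=netdev"
  echo "ovs-vsctl add-port $br $pf -- set Interface $pf type=dpdk options:dpdk-devargs=$pci"
  echo "ovs-vsctl add-port $br ${pf}_0 -- set Interface ${pf}_0 type=dpdk options:dpdk-devargs=$pci,representor=[0]"
}

print_bridge_cmds br30-ovs ens2f0 0000:07:00.0
print_bridge_cmds br31-ovs ens2f1 0000:07:00.1
```

Printing instead of executing keeps the sketch safe to run anywhere and makes it easy to compare the result with the commands in (2)-(7).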

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-6.Configuration of br31-ovs: (5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

This is the same as (2)(3)(4).

Check the settings with the following command. The part in blue (the Bridge br31-ovs block) is what was added.

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        datapath_type: netdev
        Port ens2f1_0
            Interface ens2f1_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1,representor=[0]"}
        Port ens2f1
            Interface ens2f1
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1"}
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
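7-7.Configure and start virtual machine c77g153: (8): differs from the previous article

Upload the qcow2 file to /var/lib/libvirt/images/ in advance.
For this blog, we prepared a qcow2 file with CentOS7.7 installed beforehand.
In addition, after creating the virtual machine once with virt-manager, we edit it with the "virsh edit" and "virt-xml" commands.*6

Log in to the host OS via VNC or similar and start virt-manager.
When creating the new virtual machine, delete the devices [1]-[5] shown below.*7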
f:id:metonymical:20210412213049j:plain
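After the VM starts, shut it down once.
After shutdown, the device configuration should look like the following.
The NIC listed here is not used for vDPA, but it allows ssh access, so assign a management IP to it if necessary.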
f:id:metonymical:20210412212616j:plain

After shutdown, apply the following settings with the virsh edit command.

(8)
virsh edit c77g153

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

After returning to bash, apply the following additional settings with the virt-xml command.

(8)
virt-xml c77g153 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g153 --edit --qemu-commandline='-chardev'
virt-xml c77g153 --edit --qemu-commandline='socket,id=charnet1,path=/tmp/sock-virtio0'
virt-xml c77g153 --edit --qemu-commandline='-netdev'
virt-xml c77g153 --edit --qemu-commandline='vhost-user,chardev=charnet1,queues=16,id=hostnet1'
virt-xml c77g153 --edit --qemu-commandline='-device'
virt-xml c77g153 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet1,id=net1,mac=00:11:22:33:44:00,addr=0x6,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'
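
-mem-prealloc We have not confirmed the details, but judging from fig.2 this setting appears to be required because it is used for the exchange between the Platform IOMMU and the virtqueue.
path=/tmp/sock-virtio0 Explicitly specifies the sock file of dpdk-vdpa.
mq=on Enables multi-queue.
page-per-vq=on A required setting for using the virtqueue.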
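
Only a few values differ between the two VMs (socket path, chardev/netdev IDs, MAC address and PCI slot), so the seven virt-xml calls can be generated from those values. The following is a sketch, assuming the same fixed options as in (8); print_vhost_user_cmds and its parameter order are hypothetical. It only prints the commands.

```bash
# Print the virt-xml commands that attach one vhost-user NIC to a VM.
# print_vhost_user_cmds is a hypothetical helper, not a libvirt tool.
# Usage: print_vhost_user_cmds VM SOCKET INDEX MAC PCI_SLOT
print_vhost_user_cmds() {
  local vm="$1" sock="$2" idx="$3" mac="$4" slot="$5" arg
  for arg in \
    "-mem-prealloc" \
    "-chardev" "socket,id=charnet$idx,path=$sock" \
    "-netdev" "vhost-user,chardev=charnet$idx,queues=16,id=hostnet$idx" \
    "-device" "virtio-net-pci,mq=on,vectors=6,netdev=hostnet$idx,id=net1,mac=$mac,addr=$slot,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024"
  do
    echo "virt-xml $vm --edit --qemu-commandline='$arg'"
  done
}

print_vhost_user_cmds c77g153 /tmp/sock-virtio0 1 00:11:22:33:44:00 0x6
```

Calling it with c77g159 /tmp/sock-virtio1 2 00:11:22:33:44:10 0x7 would produce the commands for the second VM.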

<Note>
When you run the virt-xml command, the following WARNING message is displayed; you can ignore it.

WARNING  XML did not change after domain define. You may have changed a value that libvirt is setting by default.
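7-8.Configure and start virtual machine c77g159: (9): differs from the previous article

Apart from the socket path /tmp/sock-virtio1 and the per-VM IDs, MAC address and PCI address, this is the same as 7-7.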

(9)
virsh edit c77g159

<currentMemory unit='KiB'>4194304</currentMemory>
<memoryBacking>
  <hugepages>
    <page size='1048576' unit='KiB'/>
  </hugepages>
</memoryBacking>

<cpu mode='custom' match='exact' check='partial'>
  <numa>
    <cell id='0' cpus='0-1' memory='4194304' unit='KiB' memAccess='shared'/>
  </numa>
</cpu>

After returning to bash, apply the following settings with the virt-xml command.

(9)
virt-xml c77g159 --edit --qemu-commandline='-mem-prealloc'
virt-xml c77g159 --edit --qemu-commandline='-chardev'
virt-xml c77g159 --edit --qemu-commandline='socket,id=charnet2,path=/tmp/sock-virtio1'
virt-xml c77g159 --edit --qemu-commandline='-netdev'
virt-xml c77g159 --edit --qemu-commandline='vhost-user,chardev=charnet2,queues=16,id=hostnet2'
virt-xml c77g159 --edit --qemu-commandline='-device'
virt-xml c77g159 --edit --qemu-commandline='virtio-net-pci,mq=on,vectors=6,netdev=hostnet2,id=net1,mac=00:11:22:33:44:10,addr=0x7,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024'

8.Operation check: differs from the previous article

8-1.Advance preparation

Prepare five consoles on the host OS c83g155.

ConsoleA Already started in 7-3. For viewing the dpdk-vdpa logs.
ConsoleB watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs For checking the packet counts of c77g153.
ConsoleC watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs For checking the packet counts of c77g159.
ConsoleD virsh start c77g153; virsh console c77g153 Console for virtual machine c77g153.
ConsoleE virsh start c77g159; virsh console c77g159 Console for virtual machine c77g159.
8-2.Starting the VMs

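ConsoleA was started in debug mode when the dpdk-vdpa command was run in 7-3.
On ConsoleB and ConsoleC, run the commands above before starting the VMs.
Then, on ConsoleD, start c77g153 with the command above.
Wait a few seconds, then start c77g159 on ConsoleE with the command above.
Send pings from c77g153 or c77g159.
As an example, following fig.1, we run ping 192.168.30.159 -f from c77g153.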

fig.1
f:id:metonymical:20210412202928j:plain

The output is shown below. Points of interest are shown in red.
ConsoleA
The ConsoleA log is an excerpt.
The full output is saved at this link.

vdpa> 
VHOST_CONFIG: new vhost user connection is 87
VHOST_CONFIG: new device, handle is 0
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_PROTOCOL_FEATURES
===================== s n i p =====================
VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
VHOST_CONFIG: negotiated Virtio features: 0x140601803
VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
VHOST_CONFIG: guest memory region size: 0x80000000
         guest physical addr: 0x0
         guest virtual  addr: 0x7faa40000000
         host  virtual  addr: 0x7f8080000000
         mmap addr : 0x7f8080000000
         mmap size : 0x80000000
         mmap align: 0x40000000
         mmap off  : 0x0
VHOST_CONFIG: guest memory region size: 0x80000000
         guest physical addr: 0x100000000
         guest virtual  addr: 0x7faac0000000
         host  virtual  addr: 0x7f8000000000
         mmap addr : 0x7f7f80000000
         mmap size : 0x100000000
         mmap align: 0x40000000
         mmap off  : 0x80000000
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
===================== s n i p =====================
new port /tmp/sock-virtio0, device : 0000:07:00.2
mlx5_vdpa: Cannot get vhost MTU - -95.
mlx5_vdpa: MTU cannot be set on device 0000:07:00.2.
mlx5_vdpa: Region 0: HVA 0x7f8080000000, GPA 0x0, size 0x80000000.
mlx5_vdpa: Region 1: HVA 0x7f8000000000, GPA 0x100000000, size 0x80000000.
mlx5_vdpa: Indirect mkey mode is KLM Fixed Buffer Size.
mlx5_vdpa: Memory registration information: nregions = 2, mem_size = 0x180000000, GCD = 0x80000000, klm_fbs_entries_num = 0x3, klm_entries_num = 0x3.
mlx5_vdpa: Dump fill Mkey = 1792.
mlx5_vdpa: Registered error interrupt for device0.
mlx5_vdpa: VAR address of doorbell mapping is 0x7f8157669000.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 0.
mlx5_vdpa: Register fd 123 interrupt for virtq 0.
mlx5_vdpa: vid 0 virtq 0 was created successfully.
mlx5_vdpa: Virtq 0 notifier state is enabled.
mlx5_vdpa: Ring virtq 0 doorbell.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 1.
mlx5_vdpa: Register fd 89 interrupt for virtq 1.
mlx5_vdpa: vid 0 virtq 1 was created successfully.
mlx5_vdpa: Virtq 1 notifier state is enabled.
mlx5_vdpa: Ring virtq 1 doorbell.
mlx5_vdpa: vDPA device 0 was configured.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:127
mlx5_vdpa: Update virtq 1 status enable -> disable.
mlx5_vdpa: vid 0 virtq 1 was stopped.
mlx5_vdpa: Query vid 0 vring 1: hw_available_idx=0, hw_used_index=0
mlx5_vdpa: Update virtq 1 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 1.
mlx5_vdpa: Register fd 89 interrupt for virtq 1.
mlx5_vdpa: vid 0 virtq 1 was created successfully.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 1 to qp idx: 0
===================== s n i p =====================
mlx5_vdpa: Update virtq 2 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 2.
mlx5_vdpa: Register fd 90 interrupt for virtq 2.
mlx5_vdpa: vid 0 virtq 2 was created successfully.
mlx5_vdpa: Virtq 2 notifier state is enabled.
mlx5_vdpa: Ring virtq 2 doorbell.
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 1 to qp idx: 3
mlx5_vdpa: Update virtq 3 status disable -> enable.
mlx5_vdpa: vid 0: Init last_avail_idx=0, last_used_idx=0 for virtq 3.
mlx5_vdpa: Register fd 91 interrupt for virtq 3.
mlx5_vdpa: vid 0 virtq 3 was created successfully.
===================== s n i p =====================
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ENABLE
VHOST_CONFIG: set queue enable: 0 to qp idx: 31
mlx5_vdpa: Virtq 3 notifier state is enabled.
mlx5_vdpa: Ring virtq 3 doorbell.
mlx5_vdpa: Device 0000:07:00.2 virtq 3 cq 2277 event was captured. Timer is off, cq ci is 1.
mlx5_vdpa: Device 0000:07:00.2 virtq 1 cq 2270 event was captured. Timer is on, cq ci is 1.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped.
mlx5_vdpa: Device 0000:07:00.2 virtq 3 cq 2277 event was captured. Timer is off, cq ci is 18.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped.

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f0: rx pkts=159317, bytes=15614385, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159318, bytes=15614457, drop=0, errs=0, coll=?
           duration=173.964s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=53, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=54, errs=0, coll=0
           duration=173.957s
  port  "ens2f0_0": rx pkts=159318, bytes=15614457, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159317, bytes=15614385, drop=0, errs=0, coll=?
           duration=173.729s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f1: rx pkts=159318, bytes=15614493, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159317, bytes=15614349, drop=0, errs=0, coll=?
           duration=180.549s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=53, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=54, errs=0, coll=0
           duration=181.910s
  port  "ens2f1_0": rx pkts=159317, bytes=15614349, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=159318, bytes=15614493, drop=0, errs=0, coll=?
           duration=180.861s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159 -f
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
.
--- 192.168.30.159 ping statistics ---
159288 packets transmitted, 159288 received, 0% packet loss, time 24357ms
rtt min/avg/max/mdev = 0.069/0.086/60.812/0.202 ms, pipe 5, ipg/ewma 0.152/0.101 ms

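<Note>

mlx5_vdpa: Cannot get vhost MTU - -95. An MTU-related message is output, but it is not a problem.
mlx5_vdpa: vid 0 virtq 0 was created successfully. Indicates that the virtq was created successfully.
mlx5_vdpa: Device 0000:07:00.2 traffic was stopped. This message appears a while after the virtual machine starts, but it does not mean that traffic transmission and reception has stopped, so it is not a problem.
ens2f0 "ens2f0_0" You can confirm that the tx/rx packet counts and byte counts of each port are increasing.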
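
To compare counters without reading the whole dump, the rx packet count of a single port can be extracted. This is a minimal sketch: port_rx_pkts is a hypothetical helper, and it assumes the "port NAME: rx pkts=N," line format shown in the ConsoleB and ConsoleC output above. The two sample lines are shortened copies of that output.

```bash
# Print the rx packet count of PORT from "ovs-ofctl dump-ports" text on stdin.
# port_rx_pkts is a hypothetical helper, not an ovs command.
# Usage: ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_rx_pkts ens2f0_0
port_rx_pkts() {
  awk -v want="$1" '
    $1 == "port" {
      name = $2
      gsub(/[":]/, "", name)      # strip quotes and the trailing colon
      if (name == want) {
        val = $4                  # looks like "pkts=159318,"
        sub(/^pkts=/, "", val)
        sub(/,$/, "", val)
        print val
        exit
      }
    }'
}

# Shortened sample lines from the ConsoleB output above.
printf '%s\n' \
  '  port  ens2f0: rx pkts=159317, bytes=15614385, drop=0, errs=0' \
  '  port  "ens2f0_0": rx pkts=159318, bytes=15614457, drop=0, errs=0' |
  port_rx_pkts ens2f0_0
```

Running the function twice a few seconds apart and comparing the two values is a simple way to confirm that the counter is increasing.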


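That's all.

*1:This article describes what I have understood. If anything is incorrect, please point it out.

*2:We observed that, unless the MAC address is configured, the VM does not recognize the VF after it starts.

*3:core and modules are installed at the same time.

*4:As an aside, the "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so you do not need to worry about it much.

*5:It was installed in 2-3. Install the Mellanox driver (OFED).

*6:Details are covered in the appendix, but with vhost_vdpa in the previous article, the virtual machine could be started from virt-manager but could not communicate. For this reason, with vhost_vdpa we start it directly from qemu-kvm.

*7:Because related packages such as spice were not installed when qemu was built, the virtual machine could not start unless these devices were deleted. Since this is not directly related to vDPA, we omit how to deal with it.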

How to set up vDPA with vhost_vdpa for VMs

This article describes how to configure vDPA (virtio data path acceleration) on CentOS8. *1
vDPA is a framework that combines the performance of SR-IOV with the flexibility of VirtIO.
It is a promising technology for cloud-native NWs and was officially merged into Linux Kernel 5.7.0.
For more information on vDPA, please refer to the official Red Hat blog.

Update 2021/04/16
Related articles are listed below.
metonymical.hatenablog.com
metonymical.hatenablog.com

In this article, we will describe how to set up communication between VMs (virtual machines) using the vhost_vdpa module.

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.3(2011)
Kernel                           : 5.11.11-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.2-2.2.0.0
qemu-kvm                         : v6.0.0-rc1
DPDK                             : v21.02
ovs                              : v2.14.1
1-2.Overall flow

Advance preparation
Kernel update
Building qemu
Building dpdk
Change to SR-IOV switchdev mode
Configure ovs-dpdk and VM
Operation check

1-3.Overall structure

A loop connection is made using a DAC (Direct Attach Cable). *2
fig.1
f:id:metonymical:20210407180251j:plain
fig.1 is a simplified diagram that omits the internal architecture. In practice, picture the configuration shown in fig.2 below.

fig.2
f:id:metonymical:20210408164131j:plain

Quoted from Red Hat's Blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to fig.1 and fig.2, respectively.
Furthermore, in fig.2, the actual traffic flow is described in blue and red letters. *3

In fig.2, the SR-IOV PF and VF are each shown, together with a "VF rep".
Note that the PF and the VF rep have the same bsf (Bus, Slot, Function) number.

PF VF0 VF0 rep
ens2f0 ens2f0v0 ens2f0_0
07:00.0 07:00.2 07:00.0

rep (representor) is an interface specific to SR-IOV switchdev mode and is created when switchdev mode is enabled.
In contrast to switchdev mode, the conventional SR-IOV VF operation is called legacy mode, and the two must be clearly distinguished.
In addition, switchdev mode is mandatory for enabling vDPA HW offload on the ConnectX-6 Dx.

2.Advance preparation

Although not described in detail here, SELinux and the FW are disabled and NTP time synchronization is configured in advance.

2-1.Enabling HugePage and IOMMU
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, add the mount setting for HugePage. It will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-2.SR-IOV VF settings

Configure the SR-IOV VF settings; you can increase the number of VFs, but for the sake of simplicity, we have set the number of VFs to "1". In addition, setting the MAC address is mandatory. *4

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-3.Install the Mellanox driver (OFED)

You can download the iso file from the Mellanox website (Mellanox Download Site).
Please save the downloaded iso file to /root/tmp/.
The following command installs the Mellanox driver and, at the same time, ovs v2.14.1.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g155 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g155 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0
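
The HugePage check can also be turned into a pass/fail test. The sketch below is an illustration only: check_hugepages is a hypothetical helper, and it assumes the HugePages_Total and Hugepagesize field names shown in the /proc/meminfo output above.

```bash
# Read /proc/meminfo-style text on stdin and succeed only when at least
# WANT huge pages of SIZE_KB kilobytes each are configured.
# check_hugepages is a hypothetical helper, not an existing tool.
# Usage: check_hugepages WANT SIZE_KB < /proc/meminfo
check_hugepages() {
  awk -v want="$1" -v size="$2" '
    $1 == "HugePages_Total:" { total = $2 }
    $1 == "Hugepagesize:"    { pagesize = $2 }
    END {
      printf "HugePages_Total=%d Hugepagesize=%d kB\n", total, pagesize
      exit !(total >= want && pagesize == size)
    }'
}

# Example for the settings in 2-1 (16 pages of 1 GiB):
#   check_hugepages 16 1048576 < /proc/meminfo || echo "HugePage setup is incomplete"
```

The function returns a non-zero status when either the page count or the page size does not match, so it can be used directly in an if statement or with ||.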

3.Kernel update

As of April 8, 2021, the vDPA-related modules are being updated frequently, so install the latest Kernel.

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *5
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel version.
Swap kernel-headers for kernel-ml-headers.
Remove the existing kernel-tools and kernel-tools-libs.
Install kernel-ml-tools and kernel-ml-tools-libs.
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g155 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Building qemu

4-1.Enabling the PowerTools Repository
vi /etc/yum.repos.d/CentOS-Linux-PowerTools.repo

enabled=1
4-2.Install the necessary packages

In addition to the packages for qemu, we also install the packages required for the dpdk build.

dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.0.0-rc1 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j"$(nproc)" && \
make install

Check the version after installation.

/usr/local/bin/qemu-system-x86_64 --version

[root@c83g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 5.2.91 (v6.0.0-rc1)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-5.Change the user to run qemu
vi /etc/libvirt/qemu.conf

user = "root"  # uncomment this line
group = "root"  # uncomment this line

5.Building dpdk

5-1.Building dpdk
cd /usr/src/ && \
git clone git://dpdk.org/dpdk && \
cd dpdk && \
git checkout v21.02 && \
meson -Dexamples=all build && \
ninja -C build && \
ninja -C build install
5-2.Links to dpdk-related libraries

Create a new file with vi and add the path of the lib directory.

vi /etc/ld.so.conf.d/libdpdk.conf

/usr/src/dpdk/build/lib

After running ldconfig, make sure the libs are linked.

ldconfig
ldconfig -p |grep dpdk

It is OK if the libraries resolve to the following paths.

[root@c83g155 dpdk]# ldconfig -p |grep dpdk
        librte_vhost.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so.21
        librte_vhost.so (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so
        librte_timer.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_timer.so.21
============ s n i p ============

Now, reboot once again.

reboot

6.Change to SR-IOV switchdev mode

6-1.Check the current operation mode.
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) number of the PCI device.
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
6-2.Changing the operating mode

Note that the bsf numbers are slightly different.*6

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver for VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the PF operation mode to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver of VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
6-3.Check the operation mode after the change.
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Changed to switchdev mode.

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable
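
The mode check can be reduced to a single word per PF, which is convenient in scripts. The following is a minimal sketch: eswitch_mode is a hypothetical helper, and it assumes the "mode VALUE" pair appears in the devlink output exactly as shown above.

```bash
# Print the eswitch mode ("legacy" or "switchdev") from the text produced by
# "devlink dev eswitch show", read on stdin.
# eswitch_mode is a hypothetical helper, not a devlink subcommand.
eswitch_mode() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Sample line copied from the output above.
echo 'pci/0000:07:00.0: mode switchdev inline-mode none encap enable' | eswitch_mode

# Example (on the host):
#   for pf in 0000:07:00.0 0000:07:00.1; do
#     [ "$(devlink dev eswitch show "pci/$pf" | eswitch_mode)" = switchdev ] ||
#       echo "$pf is not in switchdev mode"
#   done
```

Because the comparison is on the whole field, "inline-mode" is not mistaken for "mode".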

The VF representors have been added.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, make sure that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g155 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g155 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
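
The same check can be narrowed to the single line that matters. A minimal sketch, assuming the ethtool -k output format shown above; hw_tc_offload is a hypothetical helper name.

```shell
# Read `ethtool -k <dev>` output on stdin and print the hw-tc-offload value.
# Indented sub-features do not match because $1 must be exactly the key.
hw_tc_offload() {
    awk -F': ' '$1 == "hw-tc-offload" { print $2 }'
}
```

Usage: ethtool -k ens2f0 | hw_tc_offload
If the value is off, it can usually be enabled with ethtool -K ens2f0 hw-tc-offload on.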

7.Configure ovs-dpdk and VM

7-1.Overall Flow - Overview -

Configure the settings in the order (1)-(9) described in fig.1 below.
fig.1

  1. Enable vhost_vdpa module: (1)
  2. Initial configuration of ovs
  3. Configuration of br30-ovs: (2)(3)(4)
  4. Configuration of br31-ovs: (5)(6)(7)
  5. Configure and start virtual machine c77g153: (8)
  6. Configure and start virtual machine c77g159: (9)
7-2.Overall flow - Commands only -

Run the following commands in order.
Detailed explanations follow; if you do not need them, just execute the commands.

1.Enable vhost_vdpa module
(1)
modprobe vhost_vdpa

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

5.Configure and start virtual machine c77g153
(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

6.Configure and start virtual machine c77g159
(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
7-3.Enable vhost_vdpa module:(1)

We will check the changes before and after executing the modprobe vhost_vdpa command.

Before running modprobe vhost_vdpa

lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g155 ~]# lsmod |grep vd
mlx5_vdpa              45056  0
vhost_iotlb            16384  2 vhost,mlx5_vdpa
vdpa                   16384  1 mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
ls: cannot access '/sys/bus/vdpa/drivers/vhost_vdpa': No such file or directory

After running modprobe vhost_vdpa

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g155 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output, we can confirm the following:

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver
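
The mapping between vdpa devices and VF addresses can also be listed with a short loop over the sysfs directory shown above. A minimal sketch; list_vhost_vdpa is a hypothetical helper name, and the optional directory argument exists only so that the function can be tried on a mock tree.

```shell
# Print "<vdpa device> <parent PCI address>" for each device bound to vhost_vdpa.
list_vhost_vdpa() {
    dir="${1:-/sys/bus/vdpa/drivers/vhost_vdpa}"
    for link in "$dir"/vdpa*; do
        # Skip the unexpanded glob when no device is bound.
        [ -L "$link" ] || continue
        target="$(readlink "$link")"
        # In .../0000:07:00.2/vdpa0 the parent directory is the VF PCI address.
        echo "$(basename "$link") $(basename "$(dirname "$target")")"
    done
}
```

On the host in this article, we would expect it to print vdpa0 with 0000:07:00.2 and vdpa1 with 0000:07:01.2.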
7-4.Initial configuration of ovs

Since ovs has already been installed, start the service from systemctl.*7

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

Start the ovs service
Initialize dpdk
HW offload and tc-policy configuration
Memory allocation
IOMMU configuration for vhost
Configure representer
Restart the ovs service (to reflect the above settings)

Use the following command to check the settings.

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{dpdk-extra=" -w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 -w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0", dpdk-init="true", dpdk-socket-mem="1024,1024", hw-offload="true", tc-policy=none, vhost-iommu-support="true"}

Note 1:
Here is a supplementary explanation of other_config:dpdk-extra.
There is the following correspondence between the output results of "lshw -businfo -c network" and the commands configured in "other_config:dpdk-extra".

0000:07:00.0 ens2f0_0 -w 0000:07:00.0,representor=[0]
0000:07:00.1 ens2f1_0 -w 0000:07:00.1,representor=[0]
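
The correspondence can be written down as a one-line generator. This is only a sketch: dpdk_extra_entry is a hypothetical helper name, and the dv_* flags are simply copied from the configuration used in this article.

```shell
# Print one "-w" entry for other_config:dpdk-extra.
# Usage: dpdk_extra_entry <PF PCI address> <VF index>
dpdk_extra_entry() {
    printf '%s\n' "-w $1,representor=[$2],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
}
```

For example, dpdk_extra_entry 0000:07:00.1 0 prints the entry that belongs to ens2f1_0.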

Note 2:
Here is a supplementary explanation of other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

Note 3:
If you want to remove the configuration, execute the command as follows.
In this example the key is "dpdk-extra"; replace it with whichever key you want to delete, such as "dpdk-init" or "hw-offload".

ovs-vsctl remove Open_vSwitch . other_config dpdk-extra
7-5.Configuration of br30-ovs : (2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

(2) Create a bridge
(3) Create the uplink (specify PF and set the interface for the external NW)
(4) Create downlink (specify VF Representer and set up the interface for VM)
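
Because both bridges follow the same pattern, the three commands can be generated from the bridge name, the PF name, the PF address, and the VF index. A minimal dry-run sketch; bridge_cmds is a hypothetical helper name, and it assumes the representor is named <PF name>_<VF index> as in this article.

```shell
# Print (not execute) the ovs-vsctl commands for one bridge.
# Usage: bridge_cmds <bridge> <PF netdev> <PF PCI address> <VF index>
bridge_cmds() {
    br="$1"; pf="$2"; bdf="$3"; vf="$4"
    rep="${pf}_${vf}"   # representor name, e.g. ens2f0_0
    # Create the bridge with the userspace (netdev) datapath.
    echo "ovs-vsctl add-br $br -- set bridge $br datapath_type=netdev"
    # Uplink: the PF, facing the external network.
    echo "ovs-vsctl add-port $br $pf -- set Interface $pf type=dpdk options:dpdk-devargs=$bdf"
    # Downlink: the VF representor, facing the VM.
    echo "ovs-vsctl add-port $br $rep -- set Interface $rep type=dpdk options:dpdk-devargs=$bdf,representor=[$vf]"
}
```

For example, bridge_cmds br30-ovs ens2f0 0000:07:00.0 0 prints commands (2), (3), and (4).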

Check the settings with the following command.

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-6.Configuration of br31-ovs : (5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

Same as (2), (3), and (4).

Check the settings with the following command. The br31-ovs block is the part that has been added.

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        datapath_type: netdev
        Port ens2f1_0
            Interface ens2f1_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1,representor=[0]"}
        Port ens2f1
            Interface ens2f1
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1"}
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-7.Configure and start virtual machine c77g153 : (8)

Please upload the qcow2 file to "/var/lib/libvirt/images/".
In this article, the qcow2 file with CentOS7.7 installed was prepared beforehand.

(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

type=vhost-vdpa : To be able to specify vhost-vdpa as the type, qemu was built from source.
vhostdev=/dev/vhost-vdpa-0 : Specifies the vdpa device created by the modprobe vhost_vdpa command.
page-per-vq=on : This setting is required to use virtqueue.
iommu_platform=on : We have not checked the details, but judging from fig.2 it appears to be mandatory, since virtqueue communication goes through the platform IOMMU.
7-8.Configure and start virtual machine c77g159 : (9)

Other than the qcow2 file and vdpa device, it is the same as 7-7.

(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
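
Since (8) and (9) differ only in the image file and the vhost-vdpa device index, the command line can be built from those two values. A minimal dry-run sketch; vdpa_vm_cmd is a hypothetical helper name, and the fixed options are copied from the commands above.

```shell
# Print (not execute) the QEMU command line for one vDPA-backed VM.
# Usage: vdpa_vm_cmd <qcow2 image> <vhost-vdpa index>
vdpa_vm_cmd() {
    img="$1"; idx="$2"
    # printf repeats the '%s ' format for every argument that follows it.
    printf '%s ' \
        qemu-system-x86_64 -enable-kvm -cpu host -m 8G \
        -hda "$img" \
        -netdev "type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-$idx,id=vhost-vdpa$idx" \
        -device "virtio-net-pci,netdev=vhost-vdpa$idx,page-per-vq=on,iommu_platform=on,disable-legacy=on" \
        -nographic
    echo
}
```

For example, vdpa_vm_cmd /var/lib/libvirt/images/c77g1592.qcow2 1 prints the command for c77g159 without the log redirection.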

8.Operation check

Although the VMs (c77g153 and c77g159) have already been started as a result of executing the commands (8) and (9), please shut them down once to confirm normal operation.

8-1.Advance preparation

Prepare five consoles on hostOS c83g155.

ConsoleA : tail -f /var/log/messages : To refer to the logs that should be checked at VM startup
ConsoleB : watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs : To check the packet count on c77g153
ConsoleC : watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs : To check the packet count on c77g159
ConsoleD : Execute the command in (8) : For the console of virtual machine c77g153
ConsoleE : Execute the command in (9) : For the console of virtual machine c77g159
8-2.Booting the VM

Before starting the VM, please execute the above commands on ConsoleA, B, and C.
Then, execute the command in (8) to start c77g153.
Wait for a few seconds, then execute command (9) to start c77g159.
Send a ping from c77g153 or c77g159.
As an example, follow fig.1 and execute ping 192.168.30.159 -f from c77g153.

fig.1

The following is the output result. The points of interest are in red.
ConsoleA

Apr  8 10:38:21 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.72' (uid=0 pid=2706 comm="/usr/libexec/tracker-miner-fs ")
Apr  8 10:38:21 c83g155 systemd[1919]: Starting Tracker metadata extractor...
Apr  8 10:38:21 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_set_status:1786:(pid 5077): performing device reset
Apr  8 10:38:21 c83g155 kvm[5088]: 1 guest now active
Apr  8 10:38:21 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr  8 10:38:21 c83g155 systemd[1919]: Started Tracker metadata extractor.
Apr  8 10:38:46 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_handle_set_map:473:(pid 5086): memory map update
Apr  8 10:38:51 c83g155 ovs-vswitchd[4998]: ovs|00001|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(6),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:38:51 c83g155 ovs-vswitchd[4998]: ovs|00002|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(3),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:39:08 c83g155 systemd[1919]: tracker-extract.service: Succeeded.

Apr  8 10:41:52 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.72' (uid=0 pid=2706 comm="/usr/libexec/tracker-miner-fs ")
Apr  8 10:41:52 c83g155 systemd[1919]: Starting Tracker metadata extractor...
Apr  8 10:41:52 c83g155 kernel: mlx5_core 0000:07:01.2: mlx5_vdpa_set_status:1786:(pid 5370): performing device reset
Apr  8 10:41:52 c83g155 kvm[5380]: 2 guests now active
Apr  8 10:41:52 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr  8 10:41:52 c83g155 systemd[1919]: Started Tracker metadata extractor.
Apr  8 10:42:16 c83g155 kernel: mlx5_core 0000:07:01.2: mlx5_vdpa_handle_set_map:473:(pid 5381): memory map update
Apr  8 10:42:21 c83g155 ovs-vswitchd[4998]: ovs|00005|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(2),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:10,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:42:21 c83g155 ovs-vswitchd[4998]: ovs|00006|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(4),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:10,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:42:40 c83g155 systemd[1919]: tracker-extract.service: Succeeded.

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f0: rx pkts=85847, bytes=8414161, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414028, drop=0, errs=0, coll=?
           duration=423.079s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=54, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=55, errs=0, coll=0
           duration=423.075s
  port  "ens2f0_0": rx pkts=85847, bytes=8414028, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414161, drop=0, errs=0, coll=?
           duration=422.848s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f1: rx pkts=85847, bytes=8414104, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414085, drop=0, errs=0, coll=?
           duration=450.620s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=54, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=55, errs=0, coll=0
           duration=451.970s
  port  "ens2f1_0": rx pkts=85847, bytes=8414085, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414104, drop=0, errs=0, coll=?
           duration=450.915s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159 -f
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.

--- 192.168.30.159 ping statistics ---
85742 packets transmitted, 85742 received, 0% packet loss, time 20040ms
rtt min/avg/max/mdev = 0.093/0.111/7.100/0.055 ms, ipg/ewma 0.233/0.144 ms

Note:

performing device reset : mlx5_vdpa is being initialized by mlx5_core.
memory map update : mlx5_core is performing the IOMMU memory mapping and update for mlx5_vdpa. This is the most important message: the VM can never communicate unless this log line is output.
internal error parsing flow key : ovs reports multicast-related errors, but they do not affect operation and can be ignored.
ens2f0 "ens2f0_0" : You can see that the tx/rx packet count and byte count for each port are increasing.
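
Because the "memory map update" line is the one that matters, it can be checked for each VF with a single grep. A minimal sketch, assuming the log format shown in ConsoleA; vdpa_map_updated is a hypothetical helper name.

```shell
# Succeed if the log contains a "memory map update" line for the given VF.
# Usage: vdpa_map_updated <VF PCI address> [log file]
vdpa_map_updated() {
    grep -q "mlx5_core $1: mlx5_vdpa_handle_set_map.*memory map update" "${2:-/var/log/messages}"
}
```

Usage: vdpa_map_updated 0000:07:00.2 && echo "vdpa0 is mapped"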


That's all.

*1:After checking various documents, I found that the "v" in vDPA has three different meanings: virtual, vhost, and virtio, but they all seem to be the same. In this article, I have followed Introduction to vDPA kernel framework and used virtio.

*2:A loop connection is used in case a 100Gbps switch or a server with 100GNICs is not available. However, since we believe it is important that packets generated by the VM are physically sent to the outside, we use the configuration shown in fig.1.

*3:This is a description of what I understand. If the content is incorrect, please point it out.

*4:We have confirmed that if the MAC address is not set, the VM will not recognize the VF after VM startup.

*5:core and modules will be installed at the same time

*6:The "0000" in front of the bsf number is called the Domain number. I have never seen a value other than "0000", so I don't think you need to worry too much about it.

*7:It has already been installed in 2-3.

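How to set up vDPA with vhost_vdpa for VMs

This article describes how to configure vDPA (virtio data path acceleration) on CentOS8.*1
vDPA is a framework that combines the performance of SR-IOV with the flexibility of VirtIO.
It is a promising technology for cloud-native networking, and it has been officially merged since Linux Kernel 5.7.0.
For details on vDPA, please refer to The official Red Hat blog.

In this blog, we describe how to configure communication between VMs (virtual machines) using the vhost_vdpa module.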

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.3(2011)
Kernel                           : 5.11.11-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.2-2.2.0.0
qemu-kvm                         : v6.0.0-rc1
DPDK                             : v21.02
ovs                              : v2.14.1
1-2.Overall flow

Advance preparation
Kernel update
Build qemu
Build dpdk
Change to SR-IOV switchdev mode
Configure ovs-dpdk and VM
Operation check

1-3.Overall structure

A loop connection is made using a DAC (Direct Attached Cable). *2
fig.1
fig.1 is simplified and omits the internal architecture. Please picture the following configuration as the actual setup.

fig.2

Quoted from the Red Hat blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to each other in fig.1 and fig.2.
In addition, in fig.2 the actual traffic flow is written in blue and red text.*3

fig.2 shows the SR-IOV PF and VF, and in addition the "VF rep". Note also that the PF and the VF rep have the same bsf (Bus, Slot, Function) number.

PF VF0 VF0 rep
ens2f0 ens2f0v0 ens2f0_0
07:00.0 07:00.2 07:00.0

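rep (representor) is an interface specific to the switchdev mode of SR-IOV, and it is created when switchdev mode is enabled.
In contrast to switchdev mode, the conventional SR-IOV VF operation is called legacy mode, and the two must be explicitly distinguished. In addition, on the ConnectX-6 Dx, switchdev mode is a mandatory requirement for enabling vDPA HW offload.

2.Preparation

Although not described in detail here, SELinux and the firewall were disabled and NTP time synchronization was configured in advance.

2-1.Enabling HugePage and IOMMU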
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, configure the HugePage mount. It will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
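2-2.Configuration of SR-IOV VF

Configure the SR-IOV VFs. Increasing the number of VFs is not a problem, but to keep the explanation simple we use one VF per PF here. In addition, setting the MAC address is mandatory.*4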

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-3.Install the Mellanox driver (OFED)

Download the iso file from the Mellanox website. Mellanox Download Site
Save the downloaded iso file in /root/tmp/.
The following command installs the Mellanox driver; ovs v2.14.1 is installed at the same time.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.2-2.2.0.0-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

Once the installation is complete, reboot.

reboot

After the reboot, check the HugePages.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g155 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g155 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0
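
The two values that matter here, the page count and the page size, can be pulled out of the meminfo text in one pass. A minimal sketch; hugepage_summary is a hypothetical helper name.

```shell
# Read /proc/meminfo-style text on stdin and print "<page count> <page size in kB>".
hugepage_summary() {
    awk '$1 == "HugePages_Total:" { n = $2 }
         $1 == "Hugepagesize:"    { s = $2 }
         END { print n, s }'
}
```

Usage: hugepage_summary < /proc/meminfo
With the grub settings in 2-1, we would expect 16 pages of 1048576 kB each.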

3.Kernel update

As of April 8, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.

3-2.Install Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed kernels
Install kernel-ml and kernel-ml-devel*5
Check the installed kernels
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

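Check the version of the currently running kernel
Install kernel-headers (swap kernel-headers for kernel-ml-headers)
Remove the existing kernel-tools and kernel-tools-libs
Install kernel-ml-tools and kernel-ml-tools-libs
Check the installed kernels

If the output looks like the following, it is OK.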

[root@c83g155 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.11.11-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Build qemu

4-1.Enabling the PowerTools repository
vi /etc/yum.repos.d/CentOS-Linux-PowerTools.repo

enabled=1
4-2.Install the necessary packages

In addition to the packages for qemu, the packages required to build dpdk are also installed here.

dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.0.0-rc1 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install

Check the version after installation.

/usr/local/bin/qemu-system-x86_64 --version

[root@c83g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 5.2.91 (v6.0.0-rc1)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-5.Change the user to run qemu
vi /etc/libvirt/qemu.conf

user = "root"  # uncomment this line
group = "root"  # uncomment this line

5.Build dpdk

5-1.Build dpdk
cd /usr/src/ && \
git clone git://dpdk.org/dpdk && \
cd dpdk && \
git checkout v21.02 && \
meson -Dexamples=all build && \
ninja -C build && \
ninja -C build install
5-2.Links to dpdk-related libraries

Create a new file with vi and enter the path to the libraries.

vi /etc/ld.so.conf.d/libdpdk.conf

/usr/src/dpdk/build/lib

After running ldconfig, confirm that the libraries are linked.

ldconfig
ldconfig -p |grep dpdk

It is OK if the entries point to the paths shown below.

[root@c83g155 dpdk]# ldconfig -p |grep dpdk
        librte_vhost.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so.21
        librte_vhost.so (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_vhost.so
        librte_timer.so.21 (libc6,x86-64) => /usr/src/dpdk/build/lib/librte_timer.so.21

Reboot once at this point.

reboot

6.Change to SR-IOV switchdev mode

6-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) numbers of the PCI devices
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
6-2.Changing the operation mode

Note that the bsf numbers of the two VFs differ slightly (07:00.2 and 07:01.2).*6

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver from the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the operation mode of the PFs to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver to the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
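
Because the VF addresses do not follow a simple pattern (07:00.2 for port 0 but 07:01.2 for port 1), it may be safer to read them from sysfs than to type them by hand. The following is a minimal sketch, assuming the standard sysfs layout in which each PF directory contains virtfn* symlinks to its VFs. The function name list_vfs and its optional sysfs-root argument are hypothetical helpers introduced here for illustration and are not part of the original procedure.

```shell
# list_vfs PF_ADDRESS [SYSFS_ROOT]
# Print the PCI address of every VF that belongs to the given PF by
# following the virtfn* symlinks under the PF's sysfs directory.
# SYSFS_ROOT defaults to /sys; it is a hypothetical extra argument that
# only exists so the function can be tried against a test directory.
list_vfs() {
    pf=$1
    sysfs_root=${2:-/sys}
    for link in "$sysfs_root/bus/pci/devices/$pf"/virtfn*; do
        # Skip the unexpanded glob when the PF has no VFs.
        [ -L "$link" ] || continue
        basename "$(readlink "$link")"
    done
}

# Example (not executed here): unbind every VF of both PFs.
#   for pf in 0000:07:00.0 0000:07:00.1; do
#       for vf in $(list_vfs "$pf"); do
#           echo "$vf" > /sys/bus/pci/drivers/mlx5_core/unbind
#       done
#   done
```

On the host used in this article, list_vfs 0000:07:00.1 would be expected to print 0000:07:01.2, which matches the lshw output in 6-1.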
6-3.Checking the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

The mode has been changed to switchdev.

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable
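
If you want to check the mode from a script rather than by eye, the mode field can be extracted from the devlink output. This is a minimal sketch that only assumes the output format shown above ("... mode <value> inline-mode ..."); eswitch_mode is a hypothetical helper name.

```shell
# eswitch_mode
# Read `devlink dev eswitch show` output on stdin and print the word that
# follows the "mode" keyword (for example "legacy" or "switchdev").
eswitch_mode() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "mode") { print $(i + 1); exit }
        }
    }'
}

# Example (not executed here): warn if a PF is not in switchdev mode.
#   for pf in 0000:07:00.0 0000:07:00.1; do
#       mode=$(devlink dev eswitch show "pci/$pf" | eswitch_mode)
#       [ "$mode" = "switchdev" ] || echo "WARN: $pf is in $mode mode"
#   done
```

The comparison is on the exact field "mode", so the "inline-mode" field on the same line is not matched by mistake.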

The VF representors have been added.

[root@c83g155 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, confirm that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g155 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g155 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
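
The grep above also matches the tcp-segmentation lines, so the one line that matters (hw-tc-offload) is easy to overlook. The sketch below picks out only that value. It assumes the "name: value" format of the ethtool -k output shown above; hw_tc_offload_state is a hypothetical helper name.

```shell
# hw_tc_offload_state
# Read `ethtool -k <dev>` output on stdin and print the state of the
# hw-tc-offload feature ("on" or "off"), ignoring a trailing "[fixed]".
hw_tc_offload_state() {
    awk -F': *' '$1 == "hw-tc-offload" {
        split($2, words, " ")
        print words[1]
        exit
    }'
}

# Example (not executed here): enable the feature if it is off.
#   for dev in ens2f0 ens2f1; do
#       if [ "$(ethtool -k "$dev" | hw_tc_offload_state)" != "on" ]; then
#           ethtool -K "$dev" hw-tc-offload on
#       fi
#   done
```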

7.Configuring ovs-dpdk and the VMs

7-1.Overall flow - Overview -

Configure the items in the order (1)-(9) shown in fig.1 below.
fig.1
f:id:metonymical:20210407180251j:plain

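  1. Enabling the vhost_vdpa module:(1)
  2. Initial configuration of ovs
  3. Configuration of br30-ovs:(2)(3)(4)
  4. Configuration of br31-ovs:(5)(6)(7)
  5. Configuring and starting the virtual machine c77g153:(8)
  6. Configuring and starting the virtual machine c77g159:(9)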
7-2.Overall flow - Commands only -

Run the following commands in order.
A detailed explanation follows later, so if you do not need it, just run the commands.

1.Enabling the vhost_vdpa module
(1)
modprobe vhost_vdpa

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]
 
4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

5.Configuring and starting the virtual machine c77g153
(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log

6.Configuring and starting the virtual machine c77g159
(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log
7-3.Enabling the vhost_vdpa module:(1)

We will check what changes before and after running the modprobe vhost_vdpa command.

Before running modprobe vhost_vdpa

lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g155 ~]# lsmod |grep vd
mlx5_vdpa              45056  0
vhost_iotlb            16384  2 vhost,mlx5_vdpa
vdpa                   16384  1 mlx5_vdpa
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
ls: cannot access '/sys/bus/vdpa/drivers/vhost_vdpa': No such file or directory

After running modprobe vhost_vdpa

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g155 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g155 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the output above, we can confirm the following.

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver
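
The second point can also be read directly from the symlinks in the driver directory. The following is a minimal sketch, assuming the symlink layout shown in the ls output above (each vdpaN link points into the directory of its parent VF); list_vdpa_parents and its optional directory argument are hypothetical helpers for illustration.

```shell
# list_vdpa_parents [DRIVER_DIR]
# For every vdpaN symlink in the vhost_vdpa driver directory, print
# "<vdpa name> <PCI address of the parent VF>".
# DRIVER_DIR defaults to /sys/bus/vdpa/drivers/vhost_vdpa; the argument is
# a hypothetical extra so the function can be tried on a test directory.
list_vdpa_parents() {
    drv_dir=${1:-/sys/bus/vdpa/drivers/vhost_vdpa}
    for link in "$drv_dir"/vdpa*; do
        # Skip the unexpanded glob when no vdpa device is bound.
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        target=${target%/}
        printf '%s %s\n' "$(basename "$link")" \
            "$(basename "$(dirname "$target")")"
    done
}
```

With the links shown above, this would be expected to print "vdpa0 0000:07:00.2" and "vdpa1 0000:07:01.2". An empty result suggests that no vdpa device is bound to the vhost_vdpa driver yet.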
7-4.Initial configuration of ovs

Since ovs has already been installed*7, start the service with systemctl.

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=1024,1024
ovs-vsctl set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vsctl set Open_vSwitch . other_config:dpdk-extra=" \
-w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 \
-w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
systemctl restart openvswitch

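Start the ovs service
Initialize dpdk
Configure HW offload and tc-policy
Allocate memory
Configure IOMMU support for vhost
Configure the representors
Restart the ovs service (to apply the settings above)

Check the settings with the following command.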

ovs-vsctl get Open_vSwitch . other_config

[root@c83g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{dpdk-extra=" -w 0000:07:00.0,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0 -w 0000:07:00.1,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0", dpdk-init="true", dpdk-socket-mem="1024,1024", hw-offload="true", tc-policy=none, vhost-iommu-support="true"}

<Note 1>
A note on other_config:dpdk-extra.
The output of lshw -businfo -c network and the options set in other_config:dpdk-extra correspond as follows.

0000:07:00.0 ens2f0_0 -w 0000:07:00.0,representor=[0]
0000:07:00.1 ens2f1_0 -w 0000:07:00.1,representor=[0]
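
Since each PF contributes one "-w" entry with the same devargs, the dpdk-extra string can be generated from the list of PF addresses. This is only a sketch under the assumption that every PF uses representor=[0] and the same dv_* options as in this article; build_dpdk_extra is a hypothetical helper name.

```shell
# build_dpdk_extra PF_ADDRESS...
# Print one "-w <pf>,representor=[0],..." entry per PF, separated by
# spaces, using the same devargs as in this article.
build_dpdk_extra() {
    extra=""
    for pf in "$@"; do
        extra="$extra -w $pf,representor=[0],dv_flow_en=1,dv_esw_en=1,dv_xmeta_en=0"
    done
    # Drop the single leading space left by the loop.
    printf '%s\n' "${extra# }"
}

# Example (not executed here):
#   ovs-vsctl set Open_vSwitch . \
#       other_config:dpdk-extra="$(build_dpdk_extra 0000:07:00.0 0000:07:00.1)"
```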

<Note 2>
A note on other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

<Note 3>
To delete a setting, run the command as shown below.
In this example dpdk-extra is the key; specify whichever key you want to delete, such as dpdk-init or hw-offload.

ovs-vsctl remove Open_vSwitch . other_config dpdk-extra
7-5.Configuration of br30-ovs:(2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs -- set bridge br30-ovs datapath_type=netdev
(3)
ovs-vsctl add-port br30-ovs ens2f0 -- set Interface ens2f0 type=dpdk options:dpdk-devargs=0000:07:00.0
(4)
ovs-vsctl add-port br30-ovs ens2f0_0 -- set Interface ens2f0_0 type=dpdk options:dpdk-devargs=0000:07:00.0,representor=[0]

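(2)Create the bridge
(3)Create the uplink (specify the PF and configure the interface facing the external NW)
(4)Create the downlink (specify the VF representor and configure the interface facing the VM)

Check the configuration with the following command.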

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-6.Configuration of br31-ovs:(5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs -- set bridge br31-ovs datapath_type=netdev
(6)
ovs-vsctl add-port br31-ovs ens2f1 -- set Interface ens2f1 type=dpdk options:dpdk-devargs=0000:07:00.1
(7)
ovs-vsctl add-port br31-ovs ens2f1_0 -- set Interface ens2f1_0 type=dpdk options:dpdk-devargs=0000:07:00.1,representor=[0]

This is the same as (2)(3)(4).

Check the configuration with the following command. The newly added part is the br31-ovs bridge (shown in blue).

[root@c83g155 ~]# ovs-vsctl show
59a34ea2-ca80-48b9-8b14-a656c79bc451
    Bridge br31-ovs
        datapath_type: netdev
        Port ens2f1_0
            Interface ens2f1_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1,representor=[0]"}
        Port ens2f1
            Interface ens2f1
                type: dpdk
                options: {dpdk-devargs="0000:07:00.1"}
        Port br31-ovs
            Interface br31-ovs
                type: internal
    Bridge br30-ovs
        datapath_type: netdev
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0_0
            Interface ens2f0_0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0,representor=[0]"}
        Port ens2f0
            Interface ens2f0
                type: dpdk
                options: {dpdk-devargs="0000:07:00.0"}
    ovs_version: "2.14.1"
7-7.Configuring and starting the virtual machine c77g153:(8)

Upload the qcow2 file to /var/lib/libvirt/images/ in advance.
In this blog, we prepared a qcow2 file with CentOS7.7 installed beforehand.

(8)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1532.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
 -device virtio-net-pci,netdev=vhost-vdpa0,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm153.log
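type=vhost-vdpa qemu was built from source so that vhost-vdpa can be specified as the type.
vhostdev=/dev/vhost-vdpa-0 Specifies the vdpa device created by the modprobe vhost_vdpa command.
page-per-vq=on Required in order to use the virtqueues.
iommu_platform=on We have not confirmed the details, but judging from fig.2 it appears to be required because it is used in the exchange between the platform IOMMU and the virtqueues.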
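
Because vhostdev must point at an existing character device, a small pre-flight check before launching qemu can save a failed start. The following is a minimal sketch; require_vdpa_dev is a hypothetical helper name, and the error text is our own wording, not a qemu message.

```shell
# require_vdpa_dev DEVICE_PATH
# Return 0 if DEVICE_PATH exists and is a character device, otherwise
# print a hint to stderr and return 1.
require_vdpa_dev() {
    dev=$1
    if [ -c "$dev" ]; then
        return 0
    fi
    echo "missing vDPA device: $dev (has 'modprobe vhost_vdpa' been run?)" >&2
    return 1
}

# Example (not executed here): only start the VM when the device exists.
#   require_vdpa_dev /dev/vhost-vdpa-0 && qemu-system-x86_64 ...
```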
7-8.Configuring and starting the virtual machine c77g159:(9)

This is the same as 7-7 except for the qcow2 file and the vdpa device.

(9)
qemu-system-x86_64 \
 -enable-kvm \
 -cpu host \
 -m 8G \
 -hda /var/lib/libvirt/images/c77g1592.qcow2 \
 -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-1,id=vhost-vdpa1 \
 -device virtio-net-pci,netdev=vhost-vdpa1,page-per-vq=on,iommu_platform=on,disable-legacy=on \
 -nographic \
 2>&1 | tee vm159.log

8.Operation check

The VMs (c77g153 and c77g159) are already running as a result of commands (8) and (9), but shut them down once in order to verify normal operation.

8-1.Preparation

Prepare five consoles on the host OS c83g155.

ConsoleA tail -f /var/log/messages To view the logs that should be checked when the VMs start
ConsoleB watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs To check the packet counts of c77g153
ConsoleC watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs To check the packet counts of c77g159
ConsoleD Run command (8) For the console of the virtual machine c77g153
ConsoleE Run command (9) For the console of the virtual machine c77g159
8-2.Starting the VMs

Before starting the VMs, run the commands above on ConsoleA, B, and C.
Then run command (8) to start c77g153.
Wait a few seconds, then run command (9) to start c77g159.
Send a ping from c77g153 or c77g159.
As an example, following fig.1, run ping 192.168.30.159 -f from c77g153.

fig.1
f:id:metonymical:20210407180251j:plain

The output is shown below. The points to note are shown in red.
ConsoleA

Apr  8 10:38:21 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.72' (uid=0 pid=2706 comm="/usr/libexec/tracker-miner-fs ")
Apr  8 10:38:21 c83g155 systemd[1919]: Starting Tracker metadata extractor...
Apr  8 10:38:21 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_set_status:1786:(pid 5077): performing device reset
Apr  8 10:38:21 c83g155 kvm[5088]: 1 guest now active
Apr  8 10:38:21 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr  8 10:38:21 c83g155 systemd[1919]: Started Tracker metadata extractor.
Apr  8 10:38:46 c83g155 kernel: mlx5_core 0000:07:00.2: mlx5_vdpa_handle_set_map:473:(pid 5086): memory map update
Apr  8 10:38:51 c83g155 ovs-vswitchd[4998]: ovs|00001|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(6),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:38:51 c83g155 ovs-vswitchd[4998]: ovs|00002|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(3),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:00,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:39:08 c83g155 systemd[1919]: tracker-extract.service: Succeeded.

Apr  8 10:41:52 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.72' (uid=0 pid=2706 comm="/usr/libexec/tracker-miner-fs ")
Apr  8 10:41:52 c83g155 systemd[1919]: Starting Tracker metadata extractor...
Apr  8 10:41:52 c83g155 kernel: mlx5_core 0000:07:01.2: mlx5_vdpa_set_status:1786:(pid 5370): performing device reset
Apr  8 10:41:52 c83g155 kvm[5380]: 2 guests now active
Apr  8 10:41:52 c83g155 dbus-daemon[2076]: [session uid=0 pid=2076] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
Apr  8 10:41:52 c83g155 systemd[1919]: Started Tracker metadata extractor.
Apr  8 10:42:16 c83g155 kernel: mlx5_core 0000:07:01.2: mlx5_vdpa_handle_set_map:473:(pid 5381): memory map update
Apr  8 10:42:21 c83g155 ovs-vswitchd[4998]: ovs|00005|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(2),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:10,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:42:21 c83g155 ovs-vswitchd[4998]: ovs|00006|dpif_netdev(revalidator17)|ERR|internal error parsing flow key skb_priority(0),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),recirc_id(0),dp_hash(0),in_port(4),packet_type(ns=0,id=0),eth(src=00:11:22:33:44:10,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Apr  8 10:42:40 c83g155 systemd[1919]: tracker-extract.service: Succeeded.

ConsoleB

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f0: rx pkts=85847, bytes=8414161, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414028, drop=0, errs=0, coll=?
           duration=423.079s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=54, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=55, errs=0, coll=0
           duration=423.075s
  port  "ens2f0_0": rx pkts=85847, bytes=8414028, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414161, drop=0, errs=0, coll=?
           duration=422.848s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,

ConsoleC

[root@c83g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port  ens2f1: rx pkts=85847, bytes=8414104, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414085, drop=0, errs=0, coll=?
           duration=450.620s
           rx rfc2819 broadcast_packets=2,
           tx rfc2819 multicast_packets=54, broadcast_packets=1,
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, rx_wqe_errors=0, rx_phy_crc_errors=0,
                      rx_phy_in_range_len_errors=0, rx_phy_symbol_errors=0, tx_phy_errors=0,
                      tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0, tx_pp_clock_queue_errors=0,
                      tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
  port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=55, errs=0, coll=0
           duration=451.970s
  port  "ens2f1_0": rx pkts=85847, bytes=8414085, drop=0, errs=0, frame=?, over=?, crc=?
           tx pkts=85847, bytes=8414104, drop=0, errs=0, coll=?
           duration=450.915s
           CUSTOM Statistics
                      ovs_tx_failure_drops=0, ovs_tx_mtu_exceeded_drops=0, ovs_tx_qos_drops=0,
                      ovs_rx_qos_drops=0, ovs_tx_invalid_hwol_drops=0, rx_missed_errors=0,
                      rx_errors=0, tx_errors=0, rx_mbuf_allocation_errors=0,
                      rx_q0_errors=0, tx_pp_missed_interrupt_errors=0, tx_pp_rearm_queue_errors=0,
                      tx_pp_clock_queue_errors=0, tx_pp_timestamp_past_errors=0, tx_pp_timestamp_future_errors=0,
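
To compare the counters of the PF and the VF representor without reading the whole dump, the rx/tx packet count of a single port can be extracted. This is a minimal sketch that assumes the dump-ports layout shown in ConsoleB and ConsoleC (a "port NAME: rx pkts=..." line followed by a "tx pkts=..." line); port_pkts is a hypothetical helper name.

```shell
# port_pkts PORT_NAME DIRECTION
# Read `ovs-ofctl dump-ports` output on stdin and print the packet count
# of the given port for DIRECTION ("rx" or "tx").
port_pkts() {
    awk -v port="$1" -v dir="$2" '{
        if ($1 == "port") {
            # Remember the current port name without quotes and colon.
            cur = $2
            gsub(/[":]/, "", cur)
            d = $3; p = $4
        } else {
            d = $1; p = $2
        }
        # Only the "pkts=" lines count; "rx rfc2819 ..." lines are skipped.
        if (cur == port && d == dir && p ~ /^pkts=/) {
            gsub(/[^0-9]/, "", p)
            print p
            exit
        }
    }'
}

# Example (not executed here):
#   ovs-ofctl -O OpenFlow14 dump-ports br30-ovs | port_pkts ens2f0_0 rx
```

In the output above, rx of ens2f0 and tx of ens2f0_0 are both 85847, which is what you would look for when checking that traffic passes between the PF and the representor.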

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159 -f
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.

--- 192.168.30.159 ping statistics ---
85742 packets transmitted, 85742 received, 0% packet loss, time 20040ms
rtt min/avg/max/mdev = 0.093/0.111/7.100/0.055 ms, ipg/ewma 0.233/0.144 ms

<Note>

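performing device reset mlx5_vdpa is being initialized by mlx5_core.
memory map update mlx5_core is performing and updating the IOMMU memory mapping of mlx5_vdpa. This is the most important message, because communication is never possible unless this log is output.
internal error parsing flow key ovs outputs multicast-related errors, but they cause no particular problem and can be ignored.
ens2f0 "ens2f0_0" You can confirm that the tx/rx packet counts and byte counts of each port are increasing.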
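
Since "memory map update" is the message to watch for, it can help to list the VFs that have actually logged it. The sketch below assumes the kernel log format shown in ConsoleA; vfs_with_memory_map_update is a hypothetical helper name.

```shell
# vfs_with_memory_map_update
# Read kernel log lines on stdin and print the PCI address of every VF
# for which mlx5_vdpa logged "memory map update" (one address per line).
vfs_with_memory_map_update() {
    sed -n 's/.*mlx5_core \([0-9a-f:.]*\): mlx5_vdpa_handle_set_map.*memory map update.*/\1/p' |
        sort -u
}

# Example (not executed here):
#   vfs_with_memory_map_update < /var/log/messages
```

If a VF appears in a "performing device reset" line but not in this list, the VM attached to it has probably not completed the mapping yet, so communication should not be expected.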


That's all.

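*1:After checking various documents, we found that the "v" in vDPA is written in three ways (virtual, vhost, virtio), but they all appear to mean the same thing. In this blog, we follow Introduction to vDPA kernel framework and use virtio consistently.

*2:A loop connection is used, considering the case where a 100Gbps switch or a server with a 100G NIC cannot be prepared as the peer device. However, since we consider it important that packets generated by the VM are physically sent to the outside, we use the configuration in fig.1.

*3:This describes what I have understood. If anything is wrong, please point it out.

*4:We observed that if the MAC address is not configured, the VM does not recognize the VF after the VM starts.

*5:core and modules are installed at the same time.

*6:As an aside, the "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so I do not think you need to worry much about it.

*7:It was already installed in 2-3.Installing the Mellanox driver (OFED).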

CentOS7 FD.io/VPP Setup: KVM edition

This article describes how to set up FD.io/VPP (hereafter VPP), up to the automatic startup configuration, for the following four patterns.

  • Bare Metal
  • Linux Bridge
  • OvS-DPDK
  • SR-IOV

1.Configuration

1-1.Environment
Chassis                          : DL360 Gen9
CPU                              : Intel(R) Xeon E5-4667 v4 CPU @ 2.20GHz
System ROM                       : P89 05/21/2018
NIC                              : Intel 82599ES
OS                               : CentOS7.6(1810)
Kernel                           : 3.10.0-957.el7.x86_64
Installed Environment Groups     : Server with GUI
Add-Ons for Selected Environment : Virtualization Client, Virtualization Hypervisor, Virtualization Tools 
ovs                              : 2.13.0
DPDK                             : 19.11.2
1-2.Overall configuration

f:id:metonymical:20200810000205j:plain
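We will build each of the VPP environments.
Except for Bare Metal, VPP is installed on virtual machines on KVM.

1-3.Overall flow - Overview -

The installation and configuration steps are the same as in the previous article, so there is some overlap.
We first describe Bare Metal, and then, for virtual machines on KVM, describe the points that differ for Linux Bridge, OvS-DPDK, and SR-IOV.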

  1. Preparation and installation
  2. Initial configuration
  3. Operation check
  4. In the case of Linux Bridge
  5. In the case of OvS-DPDK
  6. In the case of SR-IOV

2.Preparation and installation

yum -y install epel-release dpdk-tools
curl -s https://packagecloud.io/install/repositories/fdio/release/script.rpm.sh | sudo bash
yum -y install vpp vpp-plugins vpp-devel vpp-api-python vpp-api-lua vpp-debuginfo

Install the required packages
Configure the repository
Install VPP

3.Initial configuration

3-1.Checking the bsf of the interfaces

Check the bsf (Bus, Slot, Function) numbers and interface names of the interfaces to be assigned to VPP.
The "05:00.x" part shown in red is the "bus:slot.function".*1

[root@c76g116 ~]# dpdk-devbind -s

Network devices using kernel driver
===================================
0000:02:00.0 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno1 drv=tg3 unused=vfio-pci *Active*
0000:02:00.1 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno2 drv=tg3 unused=vfio-pci
0000:02:00.2 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno3 drv=tg3 unused=vfio-pci
0000:02:00.3 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno4 drv=tg3 unused=vfio-pci
0000:05:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=ens2f0 drv=ixgbe unused=vfio-pci
0000:05:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' if=ens2f1 drv=ixgbe unused=vfio-pci 

You can also check with the following command, so either one is fine.

[root@c76g116 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:05:00.0  ens2f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:05:00.1  ens2f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:02:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:02:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:02:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:02:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
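
When the address is needed in a script (for example to fill in startup.conf later), it can be looked up by interface name from the dpdk-devbind output. This is a minimal sketch that assumes the "if=<name>" field format shown above; bsf_of_if is a hypothetical helper name.

```shell
# bsf_of_if INTERFACE_NAME
# Read `dpdk-devbind -s` output on stdin and print the PCI address
# (domain:bus:slot.function) of the line whose "if=" field matches.
bsf_of_if() {
    awk -v ifname="$1" '{
        for (i = 2; i <= NF; i++) {
            if ($i == ("if=" ifname)) { print $1; exit }
        }
    }'
}

# Example (not executed here):
#   dpdk-devbind -s | bsf_of_if ens2f0
```

Note that the "if=" field is only shown while a device is bound to a kernel driver, so this lookup is meant to be run before the interface is handed over to VPP.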
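3-2.Configuration at OS startup

vfio-pci is loaded and the interfaces are brought down at OS startup, so append the following settings to the end of the file.
After completing the configuration, grant execute permission.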

vi /etc/rc.local

[root@c76g116 ~]# vi /etc/rc.local

modprobe vfio-pci
ifconfig ens2f0 down
ifconfig ens2f1 down
exit 0

[root@c76g116 ~]# chmod 755 /etc/rc.local
3-3.Configuring startup.conf of VPP

Add settings to startup.conf of VPP.
Add the parts shown in blue.*2
Comment out the parts shown in green.

vi /etc/vpp/startup.conf

[root@c76g116 ~]# vi /etc/vpp/startup.conf
unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  gid vpp
  startup-config /root/vpp.cfg
}

dpdk {
        dev default {
                num-rx-desc 512
                num-tx-desc 512
        }
        no-multi-seg
        dev 0000:05:00.0 {
                name ens2f0
                num-rx-queues 2
                num-tx-queues 2
        }
        dev 0000:05:00.1 {
                name ens2f1
                num-rx-queues 2
                num-tx-queues 2
        }
}

cpu {
        ## Set logical CPU core where main thread runs, if main core is not set
        ## VPP will use core 1 if available
         main-core 1

        ## Set logical CPU core(s) where worker threads are running
        # corelist-workers 2-3,18-19
         corelist-workers 2-3
}

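The settings above attach the interfaces when the VPP service starts and set the path from which the vpp.cfg file is read.*3

3-4.Configuring vpp.cfg of VPP

The vpp.cfg file corresponds to what Cisco calls the startup-config file.
Without it, you would have to configure the IP addresses and so on every time VPP starts, so we enter a minimal configuration.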

vi /root/vpp.cfg

[root@c76g116 ~]# vi /root/vpp.cfg

set int ip address ens2f0 192.168.30.116/24
set int ip address ens2f1 192.168.31.116/24
set int state ens2f0 up
set int state ens2f1 up

Configure the IP addresses, and finally add the settings that bring the interfaces up.*4
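
If you maintain several hosts, the same four lines can be generated from a list of interface and address pairs. This is a minimal sketch; gen_vpp_cfg and its "name=address/prefix" argument format are hypothetical conventions introduced here, while the emitted lines are the same VPP commands used above.

```shell
# gen_vpp_cfg NAME=ADDRESS/PREFIX...
# Print the vpp.cfg lines for the given interfaces: first every
# "set int ip address" line, then every "set int state ... up" line.
gen_vpp_cfg() {
    for pair in "$@"; do
        printf 'set int ip address %s %s\n' "${pair%%=*}" "${pair#*=}"
    done
    for pair in "$@"; do
        printf 'set int state %s up\n' "${pair%%=*}"
    done
}

# Example (not executed here):
#   gen_vpp_cfg ens2f0=192.168.30.116/24 ens2f1=192.168.31.116/24 > /root/vpp.cfg
```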

3-5.Enabling automatic startup of the VPP service and rebooting

Enable automatic startup of the VPP service and reboot.

systemctl enable vpp

reboot

4.Operation check

4-1.Checking the interfaces

Enter the VPP console with the following command.

vppctl

Example output

[root@c76g116 ~]# vppctl
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/

vpp#

Once in the VPP console, check the status with the following five commands.

show pci
show hardware-interfaces
show interface
show interface addr
show threads

Example output
The points to check are shown in red.

vpp# show pci
Address      Sock VID:PID     Link Speed   Driver          Product Name                    Vital Product Data
0000:02:00.0   0  14e4:1657   5.0 GT/s x2  tg3             HPE Ethernet 1Gb 4-port 331i Ada PN: N/A
                                                                                           EC: A-5427
                                                                                           SN: N/A
                                                                                           MN: 103C
                                                                                           V0: 0x 35 57 20 50 43 49 65 47 ...
                                                                                           V1: 0x 32 30 2e 31 32 2e 32
                                                                                           V2: 0x 4e 2f 41
                                                                                           V3: 0x 31 2e 34 36
                                                                                           V5: 0x 30 41
                                                                                           V6: 0x 32 30 2e 36 2e 35 30
                                                                                           V7: 0x 33 33 31 69
                                                                                           VA: 0x 32 30 2e 31 32 2e 34 31
                                                                                           RV: 0x a1 00 00 00 00 00 00 00 ...
0000:02:00.1   0  14e4:1657   5.0 GT/s x2  tg3             HPE Ethernet 1Gb 4-port 331i Ada PN: N/A
PCI information is displayed for each interface, but it is omitted here.
0000:02:00.2   0  14e4:1657   5.0 GT/s x2  tg3             HPE Ethernet 1Gb 4-port 331i Ada PN: N/A
0000:02:00.3   0  14e4:1657   5.0 GT/s x2  tg3             HPE Ethernet 1Gb 4-port 331i Ada PN: N/A
0000:05:00.0   0  8086:10fb   5.0 GT/s x8  vfio-pci
0000:05:00.1   0  8086:10fb   5.0 GT/s x8  vfio-pci


vpp# show hardware-interfaces
              Name                Idx   Link  Hardware
ens2f0                             1     up   ens2f0
  Link speed: 10 Gbps
  Ethernet address 00:1b:21:8e:84:c8
  Intel 82599
    carrier up full duplex mtu 9206
    flags: admin-up pmd tx-offload intel-phdr-cksum rx-ip4-cksum
    Devargs:
    rx: queues 2 (max 128), desc 512 (min 32 max 4096 align 8)
    tx: queues 2 (max 64), desc 512 (min 32 max 4096 align 8)
    pci: device 8086:10fb subsystem 108e:7b11 address 0000:05:00.00 numa 0
    max rx packet len: 15872
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum tcp-lro
                       macsec-strip vlan-filter vlan-extend jumbo-frame scatter
                       security keep-crc rss-hash
    rx offload active: ipv4-cksum
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso macsec-insert multi-segs security
    tx offload active: udp-cksum tcp-cksum
    rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
                       ipv6-udp ipv6-ex ipv6
    rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
                       ipv6-udp ipv6-ex ipv6
    tx burst function: ixgbe_xmit_pkts
    rx burst function: ixgbe_recv_pkts_vec

ens2f1                             2     up   ens2f1
  Link speed: 10 Gbps
  Ethernet address 00:1b:21:8e:84:c9
  Intel 82599
    carrier up full duplex mtu 9206
    flags: admin-up pmd tx-offload intel-phdr-cksum rx-ip4-cksum
    Devargs:
    rx: queues 2 (max 128), desc 512 (min 32 max 4096 align 8)
    tx: queues 2 (max 64), desc 512 (min 32 max 4096 align 8)
    pci: device 8086:10fb subsystem 108e:7b11 address 0000:05:00.01 numa 0
    max rx packet len: 15872
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum tcp-lro
                       macsec-strip vlan-filter vlan-extend jumbo-frame scatter
                       security keep-crc rss-hash
    rx offload active: ipv4-cksum
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso macsec-insert multi-segs security
    tx offload active: udp-cksum tcp-cksum
    rss avail:         ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
                       ipv6-udp ipv6-ex ipv6
    rss active:        ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
                       ipv6-udp ipv6-ex ipv6
    tx burst function: ixgbe_xmit_pkts
    rx burst function: ixgbe_recv_pkts_vec

    extended stats:
      mac remote errors                                    1
local0                             0    down  local0
  Link speed: unknown
  local


vpp# show interface
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
ens2f0                            1      up          9000/0/0/0
ens2f1                            2      up          9000/0/0/0
local0                            0     down          0/0/0/0


vpp# show interface addr
ens2f0 (up):
  L3 192.168.30.116/24
ens2f1 (up):
  L3 192.168.31.116/24
local0 (dn):


vpp# show threads
ID     Name                Type        LWP     Sched Policy (Priority)  lcore  Core   Socket State
0      vpp_main                        24943   other (0)                1      2      0
1      vpp_wk_0            workers     24945   other (0)                2      4      0
2      vpp_wk_1            workers     24946   other (0)                3      9      0

4-2.Connectivity check

VPP does not reply to pings sent to its own IP address, so ping the peer device, or ping VPP from the peer device.

vpp# ping 192.168.30.116

Statistics: 5 sent, 0 received, 100% packet loss
As shown, a ping to VPP's own IP address gets no reply.

vpp# ping 192.168.30.254
116 bytes from 192.168.30.254: icmp_seq=1 ttl=64 time=.3834 ms
116 bytes from 192.168.30.254: icmp_seq=2 ttl=64 time=.3127 ms
116 bytes from 192.168.30.254: icmp_seq=3 ttl=64 time=.2928 ms
116 bytes from 192.168.30.254: icmp_seq=4 ttl=64 time=.3206 ms
116 bytes from 192.168.30.254: icmp_seq=5 ttl=64 time=.2619 ms

Statistics: 5 sent, 5 received, 0% packet loss
vpp#
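
The check above can also be scripted. The following is a minimal sketch, assuming the summary line keeps the "Statistics: N sent, M received, ..." format shown above. ping_ok is a hypothetical helper name, not part of VPP.

```shell
#!/bin/sh
# ping_ok (hypothetical helper): reads VPP ping output on stdin and
# succeeds only if the "Statistics:" line reports at least one reply.
ping_ok() {
  awk '/^Statistics:/ { found = 1; if ($4 + 0 > 0) ok = 1 }
       END { exit (found && ok) ? 0 : 1 }'
}

# Sample lines copied from the output above.
printf 'Statistics: 5 sent, 5 received, 0%% packet loss\n' | ping_ok && echo "peer reachable"
printf 'Statistics: 5 sent, 0 received, 100%% packet loss\n' | ping_ok || echo "no reply"
```

On a real host the input would probably come from something like `vppctl ping 192.168.30.254 | ping_ok`, with the peer address adjusted to your environment.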

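If everything works up to this point, the Bare Metal setup is complete.

4-3.Preparation before moving on to the next section

From the next section onward, we create the virtual machine c77g118, install VPP on it, and then describe the key points for connecting to the external network via Linux Bridge, OvS-DPDK, and SR-IOV respectively.

If you continue working on the same physical machine, apply the following settings and then reboot once.*5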

systemctl disable vpp

vi /etc/rc.local

#modprobe vfio-pci
#ifconfig ens2f0 down
#ifconfig ens2f1 down

reboot

The first command disables VPP autostart.
The commented-out lines revert /etc/rc.local to its original state.
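
The rc.local edit can also be done without opening an editor. The following is a minimal sketch, assuming the file contains exactly the three lines shown above plus a trailing `exit 0` (the sample file content is an assumption). comment_out_dpdk_lines is a hypothetical helper name.

```shell
#!/bin/sh
# comment_out_dpdk_lines (hypothetical helper): prints the given file with
# the three DPDK-related lines commented out; other lines pass through.
comment_out_dpdk_lines() {
  sed -E 's/^(modprobe vfio-pci|ifconfig ens2f[01] down)$/#&/' "$1"
}

# Demonstration on a temporary copy instead of the real /etc/rc.local.
tmp=$(mktemp)
printf 'modprobe vfio-pci\nifconfig ens2f0 down\nifconfig ens2f1 down\nexit 0\n' > "$tmp"
comment_out_dpdk_lines "$tmp"
rm -f "$tmp"
```

The helper only prints the result, so you can review it before replacing the real file.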

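5.Using Linux Bridge

For details on Linux Bridge, refer to the following past article.
Linux nmcliコマンドによるBonding VLAN Bridge Interface設定 - Metonymical Deflection
We skip the explanation and describe only the configuration steps.

5-1.Creating and attaching the br interfaces

This section is configured on the HostOS.
Create the br interfaces.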

nmcli connection add type bridge autoconnect yes con-name br300 ifname br300
nmcli connection modify br300 bridge.stp no
nmcli connection modify br300 ipv4.method disabled ipv6.method ignore
nmcli connection up br300
nmcli connection add type bridge-slave ifname ens2f0 master br300
nmcli con show
brctl show

nmcli connection add type bridge autoconnect yes con-name br301 ifname br301
nmcli connection modify br301 bridge.stp no
nmcli connection modify br301 ipv4.method disabled ipv6.method ignore
nmcli connection up br301
nmcli connection add type bridge-slave ifname ens2f1 master br301
nmcli con show
brctl show
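
The two command groups above differ only in the bridge name and the NIC. The following is a minimal sketch that generates them from those two parameters. bridge_cmds is a hypothetical helper name, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# bridge_cmds (hypothetical helper): prints the nmcli commands used above
# for one bridge/NIC pair. Nothing is executed; it only prints text.
bridge_cmds() {
  br="$1"; nic="$2"
  cat <<EOF
nmcli connection add type bridge autoconnect yes con-name $br ifname $br
nmcli connection modify $br bridge.stp no
nmcli connection modify $br ipv4.method disabled ipv6.method ignore
nmcli connection up $br
nmcli connection add type bridge-slave ifname $nic master $br
EOF
}

bridge_cmds br300 ens2f0
bridge_cmds br301 ens2f1
```

After reviewing the printed commands, they could be piped to `sh` on the HostOS, for example `bridge_cmds br300 ens2f0 | sh`.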

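Attach the br interfaces to the virtual machine.
[Figure: attaching the br interfaces to the virtual machine]

5-2.Installing and configuring VPP on the virtual machine

The following is configured on the virtual machine.
The installation procedure is the same as in "2-1.Preparation and installation".

5-3.Checking the bsf of the interfaces

The configuration procedure is the same, so only the key points are shown.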

[root@c77g118 ~]# dpdk-devbind -s

Network devices using kernel driver
===================================
0000:00:03.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci *Active*
0000:00:09.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=virtio_pci
0000:00:0a.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=virtio_pci

5-4.Settings applied at OS boot
vi /etc/rc.local

[root@c77g118 ~]# vi /etc/rc.local

modprobe vfio-pci
ifconfig eth1 down
ifconfig eth2 down
exit 0

[root@c77g118 ~]# chmod 755 /etc/rc.local

5-5.VPP startup.conf settings
vi /etc/vpp/startup.conf

[root@c77g118 ~]# vi /etc/vpp/startup.conf
unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  gid vpp
  startup-config /root/vpp.cfg
}

dpdk {
        dev default {
                num-rx-desc 512
                num-tx-desc 512
        }
        no-multi-seg
        dev 0000:00:09.0 {
                name eth1
                num-rx-queues 2
                num-tx-queues 2
        }
        dev 0000:00:0a.0 {
                name eth2
                num-rx-queues 2
                num-tx-queues 2
        }
}

cpu {
        ## Set logical CPU core where main thread runs, if main core is not set
        ## VPP will use core 1 if available
         main-core 1

        ## Set logical CPU core(s) where worker threads are running
        # corelist-workers 2-3,18-19
         corelist-workers 2-3
}

5-6.VPP vpp.cfg settings
vi /root/vpp.cfg

[root@c77g118 ~]# vi /root/vpp.cfg

set int ip address eth1 192.168.30.118/24
set int ip address eth2 192.168.31.118/24
set int state eth1 up
set int state eth2 up

5-7.Enabling VPP service autostart and rebooting
systemctl enable vpp

reboot

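6.Using OvS-DPDK

For details on OvS-DPDK, refer to the following past article.
CentOS7 ovs(Open vSwitch)+DPDKのビルドとネットワーク設定方法 - Metonymical Deflection
We skip the explanation and describe only the parts that differ from the article above.
<Note>
OvS and DPDK on the HostOS were set up with the following versions.
ovs: v2.13.0
DPDK: v19.11.2

6-1.OvS interface settings

This section is configured on the HostOS.

# Bind to DPDK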
dpdk-devbind.py --bind=vfio-pci ens2f0
dpdk-devbind.py --bind=vfio-pci ens2f1

# Uplink (connection to the physical switch) settings
ovs-vsctl add-port ovsbr0 dpdk300 \
vlan_mode=access tag=300 \
-- set Interface dpdk300 type=dpdk options:dpdk-devargs=0000:05:00.0
ovs-vsctl show

ovs-vsctl add-port ovsbr0 dpdk301 \
vlan_mode=access tag=301 \
-- set Interface dpdk301 type=dpdk options:dpdk-devargs=0000:05:00.1
ovs-vsctl show

# Downlink (connection to the virtual machine) settings
mkdir -p /usr/local/openvswitch/
touch /usr/local/openvswitch/vhuc300
ovs-vsctl add-port ovsbr0 vhuc300 \
vlan_mode=access tag=300 \
-- set Interface vhuc300 type=dpdkvhostuserclient \
options:vhost-server-path=/usr/local/openvswitch/vhuc300
ovs-vsctl show

mkdir -p /usr/local/openvswitch/
touch /usr/local/openvswitch/vhuc301
ovs-vsctl add-port ovsbr0 vhuc301 \
vlan_mode=access tag=301 \
-- set Interface vhuc301 type=dpdkvhostuserclient \
options:vhost-server-path=/usr/local/openvswitch/vhuc301
ovs-vsctl show

# Virtual machine xml file settings
These settings connect the downlinks to the virtual machine's interfaces.
They also include Hugepage and CPU Pinning settings; see the past article for details.

[root@c76g116 ~]# virsh edit c76g118

  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <shares>8192</shares>
    <vcpupin vcpu='0' cpuset='10'/>
    <vcpupin vcpu='1' cpuset='11'/>
    <vcpupin vcpu='2' cpuset='12'/>
    <vcpupin vcpu='3' cpuset='13'/>
    <emulatorpin cpuset='10-13'/>
  </cputune>

  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
    <numa>
      <cell id='0' cpus='0-3' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>

   <interface type='vhostuser'>
     <source type='unix' path='/usr/local/openvswitch/vhuc300' mode='server'/>
     <model type='virtio'/>
   </interface>
   <interface type='vhostuser'>
     <source type='unix' path='/usr/local/openvswitch/vhuc301' mode='server'/>
     <model type='virtio'/>
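   </interface>
Legend: blue = added, green = replaced

6-2.Installing and configuring VPP on the virtual machine

The following is configured on the virtual machine.
The installation procedure is the same as in "2-1.Preparation and installation".

6-3.Checking the bsf of the interfaces

The configuration procedure is the same, so only the key points are shown.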

[root@c77g118 ~]# dpdk-devbind -s

Network devices using kernel driver
===================================
0000:00:03.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci *Active*
0000:00:09.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=virtio_pci
0000:00:0a.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=virtio_pci

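Confirm that nothing has changed compared with the Linux Bridge case.
If the bsf numbers (such as 00:09.0) or the interface names have changed, change them accordingly in the following sections as well.
Pay particular attention to the bsf numbers, since they can change.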
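
To make this comparison less error-prone, the interface-to-bsf mapping can be extracted and diffed between setups. The following is a minimal sketch, assuming the `dpdk-devbind -s` line format shown above. list_bsf is a hypothetical helper name.

```shell
#!/bin/sh
# list_bsf (hypothetical helper): reads `dpdk-devbind -s` style text on stdin
# and prints "<interface> <bsf>" for every line that carries an if= field.
list_bsf() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^if=/) { print substr($i, 4), $1 }
  }'
}

# Sample lines copied from the output above.
list_bsf <<'EOF'
0000:00:03.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci *Active*
0000:00:09.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=virtio_pci
0000:00:0a.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=virtio_pci
EOF
```

Saving the result of `dpdk-devbind -s | list_bsf` in each setup and comparing the files with `diff` shows any change at a glance. This only applies while the devices are still bound to a kernel driver, as in the output above, because the if= field is what the helper keys on.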

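6-4.Settings applied at OS boot

Same as "5-4.Settings applied at OS boot".

6-5.VPP startup.conf settings

Same as "5-5.VPP startup.conf settings".

6-6.VPP vpp.cfg settings

Same as "5-6.VPP vpp.cfg settings".

6-7.Enabling VPP service autostart and rebooting

Same as "5-7.Enabling VPP service autostart and rebooting".

7.Using SR-IOV

For details on SR-IOV, refer to the following past articles.
CentOS7でSR-IOV設定 - Metonymical Deflection
CentOS8 小ネタ集その4:SR-IOVの設定 - Metonymical Deflection
We skip the explanation and describe only the configuration steps.

In addition to the above, set the MTU to 1501 or higher with the following settings.*6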

vi /etc/udev/rules.d/50-eth.rules

ACTION=="add", SUBSYSTEM=="net", KERNEL=="ens2f0", RUN+="/sbin/ip link set ens2f0 mtu 9000"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="ens2f1", RUN+="/sbin/ip link set ens2f1 mtu 9000"
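
The two rules above follow one pattern, so they can be generated from an MTU value and a list of NICs. The following is a minimal sketch. mtu_rules is a hypothetical helper name, and the check for 1500 or lower simply encodes the constraint from footnote *6.

```shell
#!/bin/sh
# mtu_rules (hypothetical helper): prints one udev rule per NIC in the same
# form as above, refusing MTU values of 1500 or lower (see footnote *6).
mtu_rules() {
  mtu="$1"; shift
  if [ "$mtu" -le 1500 ]; then
    echo "MTU must be 1501 or higher" >&2
    return 1
  fi
  for nic in "$@"; do
    printf 'ACTION=="add", SUBSYSTEM=="net", KERNEL=="%s", RUN+="/sbin/ip link set %s mtu %s"\n' \
      "$nic" "$nic" "$mtu"
  done
}

mtu_rules 9000 ens2f0 ens2f1
```

Redirecting the output to /etc/udev/rules.d/50-eth.rules would reproduce the file shown above.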
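
7-1.Attaching the SR-IOV VFs

This section is configured on the HostOS.
Attach SR-IOV VFs as the virtual machine's interfaces.
[Figure: attaching the SR-IOV VFs to the virtual machine]

7-2.Installing and configuring VPP on the virtual machine

The following is configured on the virtual machine.
The installation procedure is the same as in "2-1.Preparation and installation".

7-3.Checking the bsf of the interfaces

The configuration procedure is the same, so only the key points are shown.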

[root@c77g118 ~]# dpdk-devbind -s

Network devices using kernel driver
===================================
0000:00:03.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci,vfio-pci *Active*
0000:00:09.0 '82599 Ethernet Controller Virtual Function 10ed' if=eth1 drv=ixgbevf unused=vfio-pci
0000:00:0a.0 '82599 Ethernet Controller Virtual Function 10ed' if=eth2 drv=ixgbevf unused=vfio-pci
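
7-4.Settings applied at OS boot

Same as "5-4.Settings applied at OS boot".

7-5.VPP startup.conf settings

Same as "5-5.VPP startup.conf settings".

7-6.VPP vpp.cfg settings

Same as "5-6.VPP vpp.cfg settings".

7-7.Enabling VPP service autostart and rebooting

Same as "5-7.Enabling VPP service autostart and rebooting".

That's all.

8.Finally

The reference sites are the same as in the previous article.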

KVM            VMware ESXi6.7                 VMware Workstation Pro 15
Bare Metal     N/A                            N/A
Linux Bridge   vSwitch (not yet successful)   LAN segment
OvS-DPDK       ?                              N/A
SR-IOV         SR-IOV (not yet successful)    N/A

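Given the above, we are somewhat curious about how to run DPDK on an ESXi virtual switch.
As described at the URL below, it appears to be supported as nmlx5-core-PMD, but it is unclear to us whether there is an equivalent of a VMware-native DPDK-capable virtual switch.
https://docs.mellanox.com/pages/releaseview.action?pageId=15049806
We also wonder whether DPDK is what runs inside N-VDS.

The overall configuration diagram below also shows where the PMD is running.
[Figure: overall configuration diagram showing where the PMD runs]
Going forward, we expect the technology to evolve and deepen in a direction where only the advantages of both OvS-DPDK and SR-IOV can be enjoyed.
Concrete examples include vDPA (virtual data path acceleration) and Scalable IOV. We see them as technologies that make it easy to get near-Bare-Metal performance on k8s Pods while keeping cloud-native flexibility.

vDPA
Achieving network wirespeed in an open standard manner: introducing vDPA
Scalable IOV
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

To be able to catch up with these quickly, we think the fundamental skill to fully understand a configuration like this one and build it quickly will become essential.

In past articles we wrote about virtual appliances such as Cisco CSR1000v. In addition to checking their standalone performance, we thought they might be easier to evaluate if a relative measure of how well they perform could be derived by comparing them with VPP.
So we wrote this article thinking it would be handy to be able to set up VPP quickly in any environment and use it as a baseline when measuring performance.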

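*1:Incidentally, the leading "0000" is the PCI Domain number.

*2:The DPDK settings are located toward the bottom of startup.conf, but they are all commented out, so you can simply append your own settings.

*3:The file name and the path can apparently be anything.

*4:Unlike the previous article, VPP operates as an ordinary Router this time, so this setting alone is sufficient.

*5:The dpdk-devbind -u command can unbind a device from DPDK control, but the device often does not return to Kernel control, so reboot the whole machine.

*6:We observed that connectivity fails when the MTU is 1500 or lower, so set it to 1501 or higher.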