Metonymical Deflection

Casual notes on daily life, and occasionally IT infrastructure

How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI

This article describes how to configure vDPA (virtio data path acceleration) in a Kubernetes environment using the Accelerated Bridge CNI.*1
For more information on vDPA, please refer to the official Red Hat blog.

The difference from the previous article is that we are using the Accelerated Bridge CNI.
With the SR-IOV CNI (previous article), SR-IOV legacy mode was used; with the Accelerated Bridge CNI, SR-IOV switchdev mode is used.
This makes it possible to use a Linux Bridge.

For more information about Accelerated Bridge CNI, please refer to the following URL.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Overview

1-1.Environment
1.ControlPlane
VMWare              : VMware(R) Workstation 15 Pro 15.5.1 build-15018445 

2.Worker
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
Mellanox OFED                    : v5.3-1.0.0.1

3.ControlPlane&Worker common
OS                               : CentOS 8.3 (2011)
Kernel(ControlPlane)             : 4.18.0-240.el8.x86_64 
Kernel(Worker)                   : 5.12.12-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Kubernetes                       : 1.21.2
Docker-CE                        : 20.10.7
flannel                          : latest
Multus                           : latest
sriov-network-device-plugin      : latest
accelerated-bridge-cni           : latest
1-2.Overall flow
  1. Advance preparation
  2. Kernel update
  3. Build k8s Cluster & flannel
  4. Build and deploy vDPA and Accelerated Bridge CNI related
  5. Configuring Linux Bridge and deploy Pod
  6. Operation check

There is already plenty of documentation for steps 1-3, so we will skip the non-essential parts.
Steps 4-6 are the points that differ from the previous article.

1-3.Overall structure

A loop connection is made using a DAC (Direct Attach Cable). *2
fig.1
fig.1 is simplified and omits the internal architecture, so in reality, please picture the following configuration.

fig.2

Quoted from Red Hat's GitHub.

2.Advance preparation

Although not covered in detail here, SELinux has been disabled, the firewall has been disabled, and NTP time synchronization has been configured in advance.

2-1.Swap and Hosts file settings : CP (ControlPlane) & Worker
Disable Swap
vi /etc/fstab

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0
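
To disable swap immediately without waiting for a reboot, you can also run the standard swapoff command:

swapoff -a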

Hosts file settings
vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePage and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg
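
Once the OS has been rebooted (this happens after the OFED installation in section 2-4), you can confirm that these parameters took effect by checking the kernel command line:

cat /proc/cmdline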

Next, implement the mount settings for HugePage. It will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-3.SR-IOV VF settings : Worker

Configure the SR-IOV VF settings. You can increase the number of VFs, but for the sake of simplicity, we have set the number of VFs to "1". In addition, setting the MAC address is mandatory. *3

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
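
Note that on CentOS 8, /etc/rc.local is executed by systemd's rc-local unit, which only runs when the file is executable (hence the chmod above). You can check the unit state with:

systemctl status rc-local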
2-4.Install the Mellanox driver (OFED) : Worker

You can download the iso file from the Mellanox website (Mellanox Download Site).
Save the downloaded iso file to /root/tmp/.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0

3.Kernel update : Worker

As of June 11, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.
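
The commands below assume that the ELRepo repository has already been enabled on the Worker. If it has not, it can be set up with the standard ELRepo procedure for EL8:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm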

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *4
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel Version.
Install kernel-headers.
Remove the existing kernel-tools kernel-tools-libs
Install kernel-tools kernel-tools-libs
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System

4.Build k8s Cluster & flannel

4-1.Install Docker : CP&Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing  docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-2.Configuring the k8s repository : CP&Worker
cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
4-3.Install k8s : CP&Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
4-4.Configuring Docker : CP&Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
4-5.Building the k8sCluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16

Output Example
At the end you will see the following output; copy the kubeadm join command at the end.
It will be used when the Worker joins the CP.

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3 
4-6.Configuration after building k8s Cluster : CP
Copy the config file
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes

Output Example
Wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS     ROLES                  AGE   VERSION
c83g151.md.jp   Ready      control-plane,master   44s   v1.21.1
4-8.Joining a Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3

Output Example
Wait until the Worker's status also becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.1
c83g152.md.jp   Ready    <none>                 44s    v1.21.1
4-9.Change to SR-IOV switchdev mode : Worker
4-9-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) number of the PCI device.
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
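
If you need to look up the bsf number for a given interface name, ethtool can also be used:

ethtool -i ens2f0 | grep bus-info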
4-9-2.Changing the operating mode

Note that the bsf numbers of the two VFs (07:00.2 and 07:01.2) are slightly different. *5

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver for VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the PF operation mode to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver of VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
4-9-3.Check the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Changed to switchdev mode.

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable

The VF representors (ens2f0_0, ens2f1_0) have been added.

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, make sure that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g152 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g152 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
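
If hw-tc-offload happens to be shown as off, it can be enabled manually with ethtool:

ethtool -K ens2f0 hw-tc-offload on
ethtool -K ens2f1 hw-tc-offload on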
4-10.Enabling the vhost_vdpa module : Worker

In section "5. Build and deploy vDPA and Accelerated Bridge CNI related", we will build and deploy the sriov-device-plugin.
At that time, if the vhost_vdpa module is not loaded, the vDPA devices will not be recognized as Worker resources, so load it beforehand.

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output, we can confirm the following:

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Also, configure the module so that it is loaded at OS startup. *6
The unbind/bind, switchdev, and modprobe vhost_vdpa lines below have been added to the rc.local from section 2-3.

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0
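
As an aside, the module load alone could also be made persistent via systemd's modules-load.d mechanism instead of rc.local (a sketch of an alternative; this article keeps everything in rc.local so the whole sequence stays in one place):

echo vhost_vdpa > /etc/modules-load.d/vhost_vdpa.conf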

Reboot the CP & Worker again.

5.Build and deploy vDPA and Accelerated Bridge CNI related

5-1.Install Golang : CP
dnf -y install epel-release golang
5-2.Building the Docker image : CP

In this section, we will build the following two images.*7

  • multus
  • sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-dp
5-3.Copy the Docker image : CP

Copy the built Docker image to the Worker.

/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152

Output Example

[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs

Checking the Docker Image : CP&Worker

[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED        SIZE
nfvpe/sriov-device-plugin            latest       4bbc4ec2506d   29 hours ago   42.7MB
<none>                               <none>       9d6fcc3268de   29 hours ago   1.26GB
nfvpe/multus                         latest       5b78ae4af974   29 hours ago   331MB
<none>                               <none>       d4742876ce8d   29 hours ago   1.35GB
k8s.gcr.io/kube-apiserver            v1.21.2      106ff58d4308   6 days ago     126MB
k8s.gcr.io/kube-controller-manager   v1.21.2      ae24db9aa2cc   6 days ago     120MB
k8s.gcr.io/kube-scheduler            v1.21.2      f917b8c8f55b   6 days ago     50.6MB
k8s.gcr.io/kube-proxy                v1.21.2      a6ebd1c1ad98   6 days ago     131MB
golang                               alpine3.12   baed0e68a17f   11 days ago    301MB
mellanox/accelerated-bridge-cni      v0.1.0       9adde4aff112   2 weeks ago    11MB
quay.io/coreos/flannel               v0.14.0      8522d622299c   5 weeks ago    67.9MB
alpine                               3.12         13621d1b12d4   2 months ago   5.58MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   5 months ago   683kB
centos                               centos7      8652b9f0cb4c   7 months ago   204MB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   8 months ago   42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago   253MB
5-4.Building the Accelerated Bridge CNI : CP
cd /usr/src && \
git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
cd accelerated-bridge-cni/ && \
make

Output Example

[root@c83g151 vdpa-deployment]# cd /usr/src && \
> git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
> cd accelerated-bridge-cni/ && \
> make
Cloning into 'accelerated-bridge-cni'...
remote: Enumerating objects: 3426, done.
============ s n i p ============
github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge
Building accelerated-bridge...
Done!
## Build executable file
building goveralls...
running coverage tests...
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge   [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/cache        0.006s  coverage: 72.2% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/config       0.006s  coverage: 90.9% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager      0.010s  coverage: 53.7% of statements
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager/mocks        [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/plugin       [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/types        [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/utils        0.006s  coverage: 64.3% of statements
5-5.Deploying yaml files : CP

Deploy the following four files.

  1. /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
  2. /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
  3. /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
  4. /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml

Output Example

[root@c83g151 accelerated-bridge-cni]# kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
> kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
configmap/sriovdp-config created
daemonset.apps/kube-accelerated-bridge-cni-ds-amd64 created
5-6.Checking DaemonSet & ConfigMap & Pod : CP

Check the DaemonSet and ConfigMap that you deployed in section 5-5.

kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod

Output Example

[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get ds
NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-accelerated-bridge-cni-ds-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   12s
kube-flannel-ds                        2         2         2       2            2           <none>                          11m
kube-multus-ds-amd64                   2         2         2       2            2           kubernetes.io/arch=amd64        13s
kube-proxy                             2         2         2       2            2           kubernetes.io/os=linux          13m
kube-sriov-device-plugin-amd64         2         2         2       2            2           beta.kubernetes.io/arch=amd64   13s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      13m
extension-apiserver-authentication   6      13m
kube-flannel-cfg                     2      11m
kube-proxy                           2      13m
kube-root-ca.crt                     1      13m
kubeadm-config                       2      13m
kubelet-config-1.21                  1      13m
multus-cni-config                    1      13s
sriovdp-config                       1      12s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get pod
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-76v4h                     1/1     Running   0          13m
coredns-558bd4d5db-qrn74                     1/1     Running   0          13m
etcd-c83g151.md.jp                           1/1     Running   0          13m
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   0          14s
kube-accelerated-bridge-cni-ds-amd64-mxs92   1/1     Running   0          14s
kube-apiserver-c83g151.md.jp                 1/1     Running   0          13m
kube-controller-manager-c83g151.md.jp        1/1     Running   0          13m
kube-flannel-ds-2n457                        1/1     Running   0          11m
kube-flannel-ds-h9c9r                        1/1     Running   0          11m
kube-multus-ds-amd64-x5cdn                   1/1     Running   0          15s
kube-multus-ds-amd64-zt5pj                   1/1     Running   0          15s
kube-proxy-2585z                             1/1     Running   0          13m
kube-proxy-znl8n                             1/1     Running   0          11m
kube-scheduler-c83g151.md.jp                 1/1     Running   0          13m
kube-sriov-device-plugin-amd64-6cccp         1/1     Running   0          15s
kube-sriov-device-plugin-amd64-k8ktz         1/1     Running   0          15s
5-7.Checking the details of ConfigMap : CP

In the next section, "6. Configuring Linux Bridge and deploy Pod", vdpa_mlx_vhost will be referenced in the NetworkAttachmentDefinition and the Pod configuration, so check it here in advance.

cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

Note

resourceName : Any name can be used. You will specify this name explicitly in section 6-1.
vendors : The vendor part of the PCI Device ID. 15b3 indicates a Mellanox product. *8
devices : The device part of the PCI Device ID. 101e indicates a VF of the ConnectX-6 Dx.
drivers : Specifies the mlx5_core driver.
vdpaType : Specifies vhost. This option is an extension over the regular SR-IOV device plugin.

The PCI Device ID can be checked with the following command.

[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
5-8.Checking Worker Resources : CP

Check that vdpa_mlx_vhost is recognized as a Worker resource.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

Output Example

[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}

"2" is the number of VFs that were recognized.

Note
When this value is "0", or when the intel.com/vdpa_mlx_vhost line is not displayed at all, delete the sriov-device-plugin Pod once.
Since the sriov-device-plugin is deployed as a DaemonSet, the Pod will be automatically recreated after it is deleted.
This causes the sriov-device-plugin to attempt to re-register vdpa_mlx_vhost.
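
The discovery status can also be checked in the sriov-device-plugin logs (substitute the Pod name shown on your cluster, as in the listing below):

kubectl -n kube-system logs kube-sriov-device-plugin-amd64-rg8hm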

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp   <none>           <none>
kube-multus-ds-amd64-75rbf              1/1     Running   0          74m     192.168.11.152   c83g152.md.jp   <none>           <none>
kube-proxy-fdv9r                        1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp   <none>           <none>
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          74m     192.168.11.152   c83g152.md.jp   <none>           <none>
kube-sriov-device-plugin-amd64-rg8hm    1/1     Running   0          73m     192.168.11.152   c83g152.md.jp   <none>           <none>

[root@c83g151 vdpa-deployment]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-rg8hm
pod "kube-sriov-device-plugin-amd64-rg8hm" deleted

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp   <none>           <none>
kube-multus-ds-amd64-75rbf              1/1     Running   0          76m     192.168.11.152   c83g152.md.jp   <none>           <none>
kube-proxy-fdv9r                        1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp   <none>           <none>
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          76m     192.168.11.152   c83g152.md.jp   <none>           <none>
kube-sriov-device-plugin-amd64-kwc5z    1/1     Running   0          3s      192.168.11.152   c83g152.md.jp   <none>           <none>

Again, check the Worker resources.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

If it is still not recognized, please refer to section 4-10.
The two points are as follows.

  1. /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 must be recognized as vhost_vdpa devices.
  2. 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 must be controlled by the vhost_vdpa driver.

Unless the above two conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.

6.Configuring Linux Bridge and deploy Pod

6-1.Configuring the NetworkAttachmentDefinition : CP
vi 961nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa30",
  "bridge": "br30ens2f0",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.64",
    "rangeEnd": "192.168.30.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net31
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa31",
  "bridge": "br31ens2f1",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.128",
    "rangeEnd": "192.168.30.191"
  }
}'

kubectl apply -f 961nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io

Output Example

[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   4h24m
vdpa-mlx-vhost-net31   4h24m

Note

name : Set this name (vdpa-mlx-vhost-net30/31) in the Pod annotations.
k8s.v1.cni.cncf.io/resourceName : Set the name that we checked in section 5-7. It is also specified in the Pod's resources (requests/limits).
type : This is a required setting, and must be explicitly set to "accelerated-bridge" in order to use the Accelerated Bridge CNI. *9
bridge : The name of the Linux Bridge; this determines which bridge the Pod will connect to. *10
rangeStart & rangeEnd : In this configuration, the address ranges are split between vdpa-mlx-vhost-net30 and vdpa-mlx-vhost-net31 to avoid duplication.

For more information, please refer to the following URL.
accelerated-bridge-cni/configuration-reference.md at master · k8snetworkplumbingwg/accelerated-bridge-cni · GitHub
Note that the Accelerated Bridge CNI configures vlan and trunk differently from the SR-IOV CNI.
We assume this is because a Linux Bridge is used and VLAN tagging is left to the Linux Bridge.
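
For reference, since tagging is handled by the Linux Bridge, the VLAN configuration of each bridge port on the Worker can be inspected with the standard iproute2 tooling:

bridge vlan show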

6-2.Configuring the Linux Bridge : Worker

Configure in the order of (1)-(7) in fig.1.
fig.1

(1) This has already been set up in section 4-10; execute the modprobe vhost_vdpa command.
(2)(3) Create the first Bridge, and set PF ens2f0 as the uplink.
(4)(5) Create the second Bridge, and set PF ens2f1 as the uplink.
(6) This is automatically set when the Pod is deployed. The Accelerated Bridge CNI will configure VF rep ens2f0_0 and ens2f1_0 as downlinks.
(7) Create the Pod in the next section 6-3.
Worker
(1)
modprobe vhost_vdpa

(2)
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
(3)
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

(4)
nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
(5)
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1

CP
(6)(7) See the next section, 6-3.
kubectl apply -f 161vdpa.yaml
6-2-1.Creating a Linux Bridge : Worker
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
6-2-2.Checking the Linux Bridge : Worker
nmcli con show

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --
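
You can also confirm the bridge memberships at the kernel level, independently of NetworkManager:

bridge link show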
6-3.Deploying a Pod : CP

The key point is that the values defined in the Network Attachment Definition are specified for "annotations" and "resources".
Also, there are no settings specific to Accelerated Bridge CNI.

vi 161vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "ca:fe:c0:ff:ee:11"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net31",
        "mac": "ca:fe:c0:ff:ee:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]


kubectl apply -f 161vdpa.yaml
kubectl get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          7s
vdpa-pod02   1/1     Running   0          7s
6-4.Check Pod details : CP

In this section, we will check the details of the Pod status by using the kubectl describe command.

kubectl describe pod vdpa-pod01

fig.1
Make sure that the device-info parameters (/dev/vhost-vdpa-0, 07:00.2) match those in fig.1.

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Wed, 23 Jun 2021 00:03:55 +0900
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.7
IPs:
  IP:  10.244.1.7
Containers:
  vdpa-single01:
    Container ID:  docker://67e8a82543cd8e4db556e8bd502f42659260e2d50192a5a0f13860d1cefc09ef
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Wed, 23 Jun 2021 00:03:57 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gq69j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-gq69j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  39s   multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  39s   multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          39s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         39s   kubelet  Created container vdpa-single01
  Normal  Started         39s   kubelet  Started container vdpa-single01

The following messages are output to /var/log/messages on the Worker when the Pods are deployed.

[root@c83g152 ~]# tail -f /var/log/messages

Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.415300    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.419085    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-pod7cba503f_5336_425e_b32c_45869b6f9020.slice.
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-poddd78b5ac_06f3_48f1_bbc9_2443fa88c653.slice.
============ s n i p ============
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f0_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered forwarding state
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7254] device (ens2f0_0): Activation: starting connection 'ens2f0_0' (84f4ad71-e7dc-44ba-8755-2e3d6279b037)
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7258] device (ens2f0_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7273] device (ens2f0_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7298] device (ens2f0_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7303] device (br30ens2f0): bridge port ens2f0_0 was attached
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7304] device (ens2f0_0): Activation: connection 'ens2f0_0' enslaved, continuing activation
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7308] device (ens2f0_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7398] device (ens2f0_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7404] device (ens2f0_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7431] device (ens2f0_0): Activation: successful, device activated.
Jun 21 19:38:42 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 temp_17: renamed from ens2f0v0
Jun 21 19:38:42 c83g152 NetworkManager[1192]: <info>  [1624271922.7876] device (ens2f0v0): interface index 17 renamed iface from 'ens2f0v0' to 'temp_17'
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f1_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered forwarding state
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: renamed from temp_17
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: Link up
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:01.2 temp_18: renamed from ens2f1v0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708881    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.4/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f0_0 to the bridge br30ens2f0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.003540    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.65/24] from default/vdpa-mlx-vhost-net30
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: renamed from temp_18
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: Link up
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1334] device (temp_17): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1418] device (ens2f1v0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1466] device (ens2f1_0): Activation: starting connection 'ens2f1_0' (b0db37c4-3214-497b-80e2-c532ea2e54c2)
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1469] device (ens2f1_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1480] device (ens2f1_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1488] device (ens2f1_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1492] device (br31ens2f1): bridge port ens2f1_0 was attached
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1492] device (ens2f1_0): Activation: connection 'ens2f1_0' enslaved, continuing activation
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1497] device (ens2f1_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708792    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.5/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f1_0 to the bridge br31ens2f1
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.150099    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.129/24] from default/vdpa-mlx-vhost-net31
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1710] device (ens2f1_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1716] device (ens2f1_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: <info>  [1624271923.1739] device (ens2f1_0): Activation: successful, device activated.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223212683+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-1 are ignored in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223313924+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-0 are ignored in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223395596+09:00" level=warning msg="path in container /dev/vhost-vdpa-0 already exists in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223319641+09:00" level=warning msg="path in container /dev/vhost-vdpa-1 already exists in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
============ s n i p ============
Jun 21 19:38:53 c83g152 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
6-5.Check the MAC address of the Worker : Worker

Check that the MAC address specified in the annotations of the Pod is reflected in the VF of the Worker.

ip link show ens2f0

Output Example

[root@c83g152 ~]# ip link show ens2f0
15: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br30ens2f0 state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
6-6.Check the Linux Bridge of the Worker : Worker

Make sure that the VF rep (ens2f0_0, ens2f1_0) of the Worker is automatically set as the downlink for the Linux Bridge.

nmcli con show

Output Example

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
cni0                 eacb774c-9306-4c9d-8eb3-2382d9b4c5da  bridge    cni0
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
ens2f0_0             112da873-9159-412b-a576-b70eb1dcb97b  ethernet  ens2f0_0
ens2f1_0             714af871-890d-4e24-8ad6-bcea1ea7c208  ethernet  ens2f1_0
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --

7.Operation check : CP

7-1.Check the IP address of the Pod.

Since the IP address of the Pod is dynamically assigned, check it in advance.

kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  7m22s  multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  7m22s  multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  7m23s  multus   Add eth0 [10.244.1.6/24]
  Normal  AddedInterface  7m23s  multus   Add net1 [192.168.30.130/24] from default/vdpa-mlx-vhost-net31
7-2.Communication check
vdpa-pod01 192.168.30.66/24
vdpa-pod02 192.168.30.130/24

Execute a ping from vdpa-pod01(192.168.30.66) to 192.168.30.130.

kubectl exec -it vdpa-pod01 -- ping 192.168.30.130

Output Example

[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.130
PING 192.168.30.130 (192.168.30.130) 56(84) bytes of data.
64 bytes from 192.168.30.130: icmp_seq=1 ttl=64 time=0.510 ms
64 bytes from 192.168.30.130: icmp_seq=2 ttl=64 time=0.590 ms
64 bytes from 192.168.30.130: icmp_seq=3 ttl=64 time=0.551 ms
64 bytes from 192.168.30.130: icmp_seq=4 ttl=64 time=0.510 ms
^C
--- 192.168.30.130 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 63ms
rtt min/avg/max/mdev = 0.510/0.540/0.590/0.037 ms

We were able to confirm that the Pods are communicating with each other via the DAC.

You can also check that the Worker is learning the Pods' MAC addresses with bridge monitor fdb.
Before pinging, run the following command on the Worker.

bridge monitor fdb
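
To dump the current forwarding database non-interactively instead of monitoring events, you can also run:

bridge fdb show br br30ens2f0
bridge fdb show br br31ens2f1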

The ca:fe:c0:ff:ee:11/12 entries at the end of the output are the MAC addresses learned when the ping was run.
The other entries are MAC addresses learned when the Pods were deployed.

[root@c83g152 ~]# bridge monitor fdb
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b vlan 1 master cni0 permanent
Deleted 7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab vlan 1 master cni0 permanent
fe:42:e8:04:50:51 dev veth78a2b6ab master cni0
b2:4b:f8:68:fe:b0 dev vethdbd6a26b master cni0
7a:d9:02:4d:7c:8f dev cni0 master cni0 permanent
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 master br30ens2f0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 vlan 1 master br30ens2f0 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 master br31ens2f1 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 vlan 1 master br31ens2f1 permanent
ca:fe:c0:ff:ee:11 dev ens2f0_0 master br30ens2f0
ca:fe:c0:ff:ee:11 dev ens2f1 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f1_0 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f0 master br30ens2f0

8.Finally

We referred to the following website.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

This article falls under No.10.

No | vm(qemu)/k8s | k8s Pod/VMI | vDPA Framework | vDPA Type | SR-IOV mode | Related Articles
1 | vm | - | kernel | vhost | legacy | Not started
2 | vm | - | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3 | vm | - | kernel | virtio | legacy | Not started
4 | vm | - | kernel | virtio | switchdev | Not started
5 | vm | - | dpdk | vhost | legacy | Not started
6 | vm | - | dpdk | vhost | switchdev | Not started
7 | vm | - | dpdk | virtio | legacy | Not started
8 | vm | - | dpdk | virtio | switchdev | How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9 | k8s | pod | kernel | vhost | legacy | How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
10 | k8s | pod | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection (this article)
11 | k8s | pod | kernel | virtio | legacy | Not started
12 | k8s | pod | kernel | virtio | switchdev | Not started
13 | k8s | pod | dpdk | client | legacy | Not started
14 | k8s | pod | dpdk | client | switchdev | Not started
15 | k8s | pod | dpdk | server | legacy | Not started
16 | k8s | pod | dpdk | server | switchdev | Not started

In the next article, I would like to try setting up a VMI (Virtual Machine Instance) using KubeVirt.
However, at this time KubeVirt only supports vfio-pci, so we will put that on hold for now.

*1:After checking various documents, I found that the "v" in vDPA is given three different readings: virtual, vhost, and virtio, but they all seem to refer to the same thing. In this article, I have followed "Introduction to vDPA kernel framework" and used virtio.

*2:A loop connection is used in case a 100Gbps switch or another server with 100G NICs is not available. However, since we believe it is important that packets generated by the Pod are physically sent outside the host, we use the configuration shown in fig.1.

*3:In the case of VMs, the MAC address setting was mandatory, but in the case of k8s Pod, it is not a mandatory setting because the MAC address is written in the yaml.

*4:core and modules will be installed at the same time

*5:The "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so I don't think you need to worry too much about it.

*6:If you build the kernel from source and have enabled vhost_vdpa via make menuconfig, etc., this setting is not necessary.

*7:In the previous article, we also built sriov-cni at the same time. In this article, we do not build sriov-cni because we will not use it, but you can build it without any problem. Since we explicitly specify accelerated-bridge in the NetworkAttachmentDefinition settings in section 6-1, there is no problem even if sriov-cni and accelerated-bridge-cni coexist.

*8:PCI Devices

*9:In the previous article, it is set to "sriov".

*10:If you don't specify anything, the default Bridge will be cni0.