This article describes how to configure vDPA (virtio data path acceleration) in a Kubernetes environment using the Accelerated Bridge CNI.*1
For more information on vDPA, please refer to the official Red Hat blog.
The difference from the previous article is that we use the Accelerated Bridge CNI.
The SR-IOV CNI (previous article) used SR-IOV legacy mode, whereas the Accelerated Bridge CNI uses SR-IOV switchdev mode.
This makes it possible to use a Linux Bridge.
For more information about Accelerated Bridge CNI, please refer to the following URL.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni
The following is a list of related articles.
1.Overview
1-1.Environment
1. ControlPlane
   VMWare : VMware(R) Workstation 15 Pro 15.5.1 build-15018445
2. Worker
   IA server : ProLiant DL360p Gen8 or DL360 Gen9
   System ROM : P71 01/22/2018
   NIC : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
   Mellanox OFED : v5.3-1.0.0.1
3. ControlPlane&Worker common
   OS : CentOS8.3(2011)
   Kernel(ControlPlane) : 4.18.0-240.el8.x86_64
   Kernel(Worker) : 5.12.12-1.el8.elrepo.x86_64
   Installed Environment Groups :
     @^graphical-server-environment
     @container-management
     @development
     @virtualization-client
     @virtualization-hypervisor
     @virtualization-tools
   Kubernetes : 1.21.2
   Docker-CE : 20.10.7
   flannel : latest
   Multus : latest
   sriov-network-device-plugin : latest
   accelerated-bridge-cni : latest
1-2.Overall flow
- Advance preparation
- Kernel update
- Build k8s Cluster & flannel
- Build and deploy vDPA and Accelerated Bridge CNI related
- Configuring Linux Bridge and deploy Pod
- Operation check
Steps 1-3 are relatively well documented elsewhere, so we will skip the non-essential parts.
Steps 4-6 are the points that differ from the previous article.
1-3.Overall structure
Loop connection using a DAC (Direct Attached Cable). *2
fig.1
fig.1 is a simplified view and omits the internal architecture. In reality, please picture the following configuration.
fig.2
Quoted from Red Hat's github
2.Advance preparation
Although not shown here, SELinux has been disabled, the firewall has been disabled, and NTP time synchronization has been configured in advance.
2-1.Swap and Hosts file settings : CP (ControlPlane) & Worker
Disable Swap

vi /etc/fstab
#/dev/mapper/cl-swap     swap                    swap    defaults        0 0

Hosts file settings

vi /etc/hosts
127.0.0.1      localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1            localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePage and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg
Next, configure the mount settings for HugePage. The mount will be applied automatically the next time the OS boots.
vi /etc/fstab
nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-3.SR-IOV VF settings : Worker
Configure the SR-IOV VFs. You can increase the number of VFs, but for simplicity we set it to "1". Note that setting the MAC address is mandatory. *3
vi /etc/rc.local
echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-4.Install the Mellanox driver (OFED) : Worker
You can download the iso file from the Mellanox website (Mellanox Download Site).
Please save the downloaded iso file to /root/tmp/.
dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint
After the installation is complete, reboot.
reboot
After the reboot is complete, check the HugePage.
cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB
[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0
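As a quick sanity check, the HugePage numbers should be self-consistent: HugePages_Total pages of Hugepagesize kB should add up to the Hugetlb total. A small sketch using the sample values above:

```shell
# Sanity check on the sample values above:
# HugePages_Total (16) x Hugepagesize (1048576 kB) should equal
# the Hugetlb total (16777216 kB).
hugepages_total=16
hugepagesize_kb=1048576
hugetlb_kb=16777216
if [ $((hugepages_total * hugepagesize_kb)) -eq "$hugetlb_kb" ]; then
  echo "HugePage accounting is consistent"
fi
```

If the numbers do not line up, the kernel command line from section 2-2 most likely did not take effect.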
3.Kernel update : Worker
As of June 11, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.
3-1.Installing elrepo
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot
Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *4
Check the installed Kernel.
Reboot
3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel
Check the currently running kernel version.
Swap kernel-headers for kernel-ml-headers.
Remove the existing kernel-tools and kernel-tools-libs.
Install kernel-ml-tools and kernel-ml-tools-libs.
Check the installed kernels.
If you get the following output, you are good to go.
[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                    4.18.0-240.el8                  @anaconda
kernel-core.x86_64               4.18.0-240.el8                  @anaconda
kernel-devel.x86_64              4.18.0-240.el8                  @anaconda
kernel-ml.x86_64                 5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-core.x86_64            5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-devel.x86_64           5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-headers.x86_64         5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-modules.x86_64         5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-tools.x86_64           5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-ml-tools-libs.x86_64      5.12.12-1.el8.elrepo            @elrepo-kernel
kernel-modules.x86_64            4.18.0-240.el8                  @anaconda
kmod-kernel-mft-mlnx.x86_64      4.16.1-1.rhel8u3                @System
kmod-mlnx-ofa_kernel.x86_64      5.3-OFED.5.3.1.0.0.1.rhel8u3    @System
mlnx-ofa_kernel.x86_64           5.3-OFED.5.3.1.0.0.1.rhel8u3    @System
mlnx-ofa_kernel-devel.x86_64     5.3-OFED.5.3.1.0.0.1.rhel8u3    @System
4.Build k8s Cluster & flannel
4-1.Install Docker : CP&Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-2.Configuring the k8s repository : CP&Worker
cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
4-3.Install k8s : CP&Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
4-4.Configuring Docker : CP&Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
4-5.Building the k8sCluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16
Output Example
At the end you will see the following output; copy the kubeadm join command (highlighted in red in the original post).
It will be used when the Worker joins the CP.
To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
--discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3
4-6.Configuration after building k8s Cluster : CP
Copy the config file

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings

source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes
Output Example
Wait until the status becomes Ready.
[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE   VERSION
c83g151.md.jp   Ready    control-plane,master   44s   v1.21.1
4-8.Joining a Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
--discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3
Output Example
Wait until the Worker also becomes Ready.
[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.1
c83g152.md.jp   Ready    <none>                 44s    v1.21.1
4-9.Change to SR-IOV switchdev mode : Worker
4-9-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
Check the bsf (bus, slot, function) number of the PCI device.
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)
The output will look like the following
[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable
[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
4-9-2.Changing the operating mode
Note that the bsf numbers of the VFs (07:00.2 and 07:01.2) are slightly irregular.*5
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
Unbind the mlx5_core driver from the VFs:
- 07:00.2 (ens2f0v0)
- 07:01.2 (ens2f1v0)

Change the PF operation mode to switchdev:
- 07:00.0 (ens2f0)
- 07:00.1 (ens2f1)

Rebind the mlx5_core driver to the VFs:
- 07:00.2 (ens2f0v0)
- 07:01.2 (ens2f1v0)
4-9-3.Check the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
Changed to switchdev mode.
[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable
[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable
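If you want to verify the mode from a script (for example, before the Pods come up), the mode field can be extracted from the devlink output shown above. A minimal sketch; the eswitch_mode_of helper name is our own:

```shell
# Extract the eswitch mode from a `devlink dev eswitch show` output line,
# e.g. "pci/0000:07:00.0: mode switchdev inline-mode none encap enable".
eswitch_mode_of() {
  echo "$1" | sed -n 's/.* mode \([a-z]*\) .*/\1/p'
}

eswitch_mode_of "pci/0000:07:00.0: mode switchdev inline-mode none encap enable"
# On the Worker you would feed it live output instead:
#   eswitch_mode_of "$(devlink dev eswitch show pci/0000:07:00.0)"
```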
The VF representors have been added.
[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface
In addition, make sure that the HW offload function of the NIC is enabled.
ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g152 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: on
hw-tc-offload: on
[root@c83g152 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: on
hw-tc-offload: on
4-10.Enabling the vhost_vdpa module : Worker
In section "5. Build and deploy vDPA and Accelerated Bridge CNI related", we will build and deploy the sriov-device-plugin.
At that point, if the vhost_vdpa module is not loaded, the vDPA devices will not be recognized as Worker resources, so load it beforehand.
modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib
[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root        240, 0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root        240, 1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock
[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/
From the above output, we can confirm the following:
- /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
- 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver
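If you want to script this check, the vdpaN-to-PCI mapping can be recovered by parsing the sysfs symlink targets shown above. A minimal sketch; the pci_of_vdpa helper name is ours:

```shell
# Given a vdpaN symlink target such as
#   ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
# print the PCI address of the backing VF.
pci_of_vdpa() {
  echo "$1" | sed -n 's#.*/\(0000:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9]\)/vdpa[0-9]*/\{0,1\}$#\1#p'
}

pci_of_vdpa "../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/"
# On the Worker, the symlink target comes from e.g.
#   readlink /sys/bus/vdpa/drivers/vhost_vdpa/vdpa0
```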
Also, configure these settings to be applied at OS startup. *6
The switchdev and vhost_vdpa steps (highlighted in blue in the original post) have been added to the rc.local created earlier.
vi /etc/rc.local
echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0
Reboot the CP & Worker again.
5.Build and deploy vDPA and Accelerated Bridge CNI related
5-2.Building the Docker image : CP
In this section, we will build the following two images.*7
- multus
- sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-dp
5-3.Copy the Docker image : CP
Copy the built Docker image to the Worker.
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
Output Example
[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs
Checking the Docker Image : CP&Worker
[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED        SIZE
nfvpe/sriov-device-plugin            latest       4bbc4ec2506d   29 hours ago   42.7MB
<none>                               <none>       9d6fcc3268de   29 hours ago   1.26GB
nfvpe/multus                         latest       5b78ae4af974   29 hours ago   331MB
<none>                               <none>       d4742876ce8d   29 hours ago   1.35GB
k8s.gcr.io/kube-apiserver            v1.21.2      106ff58d4308   6 days ago     126MB
k8s.gcr.io/kube-controller-manager   v1.21.2      ae24db9aa2cc   6 days ago     120MB
k8s.gcr.io/kube-scheduler            v1.21.2      f917b8c8f55b   6 days ago     50.6MB
k8s.gcr.io/kube-proxy                v1.21.2      a6ebd1c1ad98   6 days ago     131MB
golang                               alpine3.12   baed0e68a17f   11 days ago    301MB
mellanox/accelerated-bridge-cni      v0.1.0       9adde4aff112   2 weeks ago    11MB
quay.io/coreos/flannel               v0.14.0      8522d622299c   5 weeks ago    67.9MB
alpine                               3.12         13621d1b12d4   2 months ago   5.58MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   5 months ago   683kB
centos                               centos7      8652b9f0cb4c   7 months ago   204MB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   8 months ago   42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago   253MB
5-4.Building the Accelerated Bridge CNI : CP
cd /usr/src && \
git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
cd accelerated-bridge-cni/ && \
make
Output Example
[root@c83g151 vdpa-deployment]# cd /usr/src && \
> git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
> cd accelerated-bridge-cni/ && \
> make
Cloning into 'accelerated-bridge-cni'...
remote: Enumerating objects: 3426, done.
============ s n i p ============
github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge
Building accelerated-bridge...
Done!
## Build executable file
building goveralls...
running coverage tests...
?   github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge [no test files]
ok  github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/cache 0.006s coverage: 72.2% of statements
ok  github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/config 0.006s coverage: 90.9% of statements
ok  github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager 0.010s coverage: 53.7% of statements
?   github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager/mocks [no test files]
?   github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/plugin [no test files]
?   github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/types [no test files]
ok  github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/utils 0.006s coverage: 64.3% of statements
5-5.Deploying yaml files : CP
Deploy the following four files.
- /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
- /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
- /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
- /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
Output Example
[root@c83g151 accelerated-bridge-cni]# kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
> kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
configmap/sriovdp-config created
daemonset.apps/kube-accelerated-bridge-cni-ds-amd64 created
5-6.Checking DaemonSet & ConfigMap & Pod : CP
Check the DaemonSet and ConfigMap that you deployed in section 5-5.
kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod
Output Example
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get ds
NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-accelerated-bridge-cni-ds-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   12s
kube-flannel-ds                        2         2         2       2            2           <none>                          11m
kube-multus-ds-amd64                   2         2         2       2            2           kubernetes.io/arch=amd64        13s
kube-proxy                             2         2         2       2            2           kubernetes.io/os=linux          13m
kube-sriov-device-plugin-amd64         2         2         2       2            2           beta.kubernetes.io/arch=amd64   13s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      13m
extension-apiserver-authentication   6      13m
kube-flannel-cfg                     2      11m
kube-proxy                           2      13m
kube-root-ca.crt                     1      13m
kubeadm-config                       2      13m
kubelet-config-1.21                  1      13m
multus-cni-config                    1      13s
sriovdp-config                       1      12s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get pod
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-76v4h                     1/1     Running   0          13m
coredns-558bd4d5db-qrn74                     1/1     Running   0          13m
etcd-c83g151.md.jp                           1/1     Running   0          13m
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   0          14s
kube-accelerated-bridge-cni-ds-amd64-mxs92   1/1     Running   0          14s
kube-apiserver-c83g151.md.jp                 1/1     Running   0          13m
kube-controller-manager-c83g151.md.jp        1/1     Running   0          13m
kube-flannel-ds-2n457                        1/1     Running   0          11m
kube-flannel-ds-h9c9r                        1/1     Running   0          11m
kube-multus-ds-amd64-x5cdn                   1/1     Running   0          15s
kube-multus-ds-amd64-zt5pj                   1/1     Running   0          15s
kube-proxy-2585z                             1/1     Running   0          13m
kube-proxy-znl8n                             1/1     Running   0          11m
kube-scheduler-c83g151.md.jp                 1/1     Running   0          13m
kube-sriov-device-plugin-amd64-6cccp         1/1     Running   0          15s
kube-sriov-device-plugin-amd64-k8ktz         1/1     Running   0          15s
5-7.Checking the details of ConfigMap : CP
In the next section, "6. Configuring Linux Bridge and deploy Pod", vdpa_mlx_vhost is referenced in the NetworkAttachmentDefinition and Pod configuration, so check it in advance.
cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [{
          "resourceName": "vdpa_ifcvf_vhost",
          "selectors": {
            "vendors": ["1af4"],
            "devices": ["1041"],
            "drivers": ["ifcvf"],
            "vdpaType": "vhost"
          }
        },
        {
          "resourceName": "vdpa_mlx_vhost",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["mlx5_core"],
            "vdpaType": "vhost"
          }
        },
        {
          "resourceName": "vdpa_mlx_virtio",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["mlx5_core"],
            "vdpaType": "virtio"
          }
        }
      ]
    }
Note
resourceName | Any name may be used. It is referenced explicitly in section 6-1. |
vendors | The vendor part of the PCI Device ID; 15b3 indicates a Mellanox product.*8 |
devices | The device part of the PCI Device ID; 101e indicates a VF of the ConnectX-6 Dx. |
drivers | Specifies the mlx5_core driver. |
vdpaType | Specifies vhost. This option is an extension over the regular SR-IOV device plugin. |
The PCI Device ID can be checked with the following command.
[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
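The cross-check against the ConfigMap selectors can also be scripted. The following sketch pulls the trailing [vendor:device] pair out of an `lspci -nn` line so it can be compared with the "vendors"/"devices" values in configMap-vdpa.yaml (the pci_id_of helper name is ours):

```shell
# Extract the trailing [vendor:device] pair from an `lspci -nn` line.
pci_id_of() {
  echo "$1" | grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]$' | tr -d '[]'
}

vf_line='07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]'
pci_id_of "$vf_line"   # vendor:device pair to compare with the ConfigMap selectors
```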
5-8.Checking Worker Resources : CP
Check that vdpa_mlx_vhost is recognized as a Worker resource.
kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
Output Example
[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}
"2" is the number of VFs that were recognized.
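If you only need this one value, for example in a monitoring script, it can also be extracted without jq. A sketch that greps the count out of kubectl's JSON output (the vdpa_count_of helper name is ours; here it is applied to a sample line from the output above):

```shell
# Pull the vdpa_mlx_vhost count out of `kubectl get node ... -o json`
# style output without jq.
vdpa_count_of() {
  echo "$1" | sed -n 's/.*"intel\.com\/vdpa_mlx_vhost": *"\([0-9]*\)".*/\1/p'
}

vdpa_count_of '  "intel.com/vdpa_mlx_vhost": "2",'
# Live: vdpa_count_of "$(kubectl get node c83g152.md.jp -o json)"
```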
Note
If this value is "0" or the intel.com/vdpa_mlx_vhost line is missing, delete the sriov-device-plugin Pod once.
Since sriov-device-plugin is deployed as a DaemonSet, the Pod is automatically recreated after deletion.
This causes sriov-device-plugin to attempt to re-register vdpa_mlx_vhost.
[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                  1/1   Running   2   4h8m   192.168.11.152   c83g152.md.jp
kube-multus-ds-amd64-75rbf             1/1   Running   0   74m    192.168.11.152   c83g152.md.jp
kube-proxy-fdv9r                       1/1   Running   2   4h8m   192.168.11.152   c83g152.md.jp
kube-sriov-cni-ds-amd64-2xfxw          1/1   Running   0   74m    192.168.11.152   c83g152.md.jp
kube-sriov-device-plugin-amd64-rg8hm   1/1   Running   0   73m    192.168.11.152   c83g152.md.jp
[root@c83g151 vdpa-deployment]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-rg8hm
pod "kube-sriov-device-plugin-amd64-rg8hm" deleted
[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                  1/1   Running   2   4h9m   192.168.11.152   c83g152.md.jp
kube-multus-ds-amd64-75rbf             1/1   Running   0   76m    192.168.11.152   c83g152.md.jp
kube-proxy-fdv9r                       1/1   Running   2   4h9m   192.168.11.152   c83g152.md.jp
kube-sriov-cni-ds-amd64-2xfxw          1/1   Running   0   76m    192.168.11.152   c83g152.md.jp
kube-sriov-device-plugin-amd64-kwc5z   1/1   Running   0   3s     192.168.11.152   c83g152.md.jp
Again, check the Worker resources.
kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
If it is still not recognized, please refer to section 4-10.
The two points are as follows.
- /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 must be recognized as vhost_vdpa devices.
- 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 must be controlled by the vhost_vdpa driver.
Unless the above two conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.
6.Configuring Linux Bridge and deploy Pod
6-1.Configuring the NetworkAttachmentDefinition : CP
vi 961nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
    "type": "accelerated-bridge",
    "cniVersion": "0.3.1",
    "name": "sriov-vdpa30",
    "bridge": "br30ens2f0",
    "trust": "on",
    "spoofchk": "off",
    "ipam": {
      "type": "host-local",
      "subnet": "192.168.30.0/24",
      "rangeStart": "192.168.30.64",
      "rangeEnd": "192.168.30.127"
    }
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net31
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
    "type": "accelerated-bridge",
    "cniVersion": "0.3.1",
    "name": "sriov-vdpa31",
    "bridge": "br31ens2f1",
    "trust": "on",
    "spoofchk": "off",
    "ipam": {
      "type": "host-local",
      "subnet": "192.168.30.0/24",
      "rangeStart": "192.168.30.128",
      "rangeEnd": "192.168.30.191"
    }
  }'

kubectl apply -f 961nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io
Output Example
[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   4h24m
vdpa-mlx-vhost-net31   4h24m
Note
name | Set this name (vdpa-mlx-vhost-net30/31) in the Pod annotations. |
k8s.v1.cni.cncf.io/resourceName | Set the name confirmed in section 5-7. It is also referenced from the Pod resources. |
type | Required; must be explicitly set to "accelerated-bridge" in order to use the Accelerated Bridge CNI.*9 |
bridge | The name of the Linux Bridge determines which bridge the Pod connects to.*10 |
rangeStart & rangeEnd | The address ranges of vdpa-mlx-vhost-net30 and vdpa-mlx-vhost-net31 are kept separate to avoid duplicate assignments. |
For more information, please refer to the following URL
accelerated-bridge-cni/configuration-reference.md at master · k8snetworkplumbingwg/accelerated-bridge-cni · GitHub
Note that the Accelerated Bridge CNI configures vlan and trunk differently from the SR-IOV CNI.
We assume this is because VLAN tagging is delegated to the Linux Bridge.
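As a small cross-check of the rangeStart/rangeEnd values above: the two definitions share 192.168.30.0/24, so their ranges must be disjoint. In this configuration that reduces to comparing the last octets:

```shell
# net30 uses 192.168.30.64-127, net31 uses 192.168.30.128-191; both sit in
# the shared 192.168.30.0/24, so the last octets must not overlap.
net30_start=64;  net30_end=127
net31_start=128; net31_end=191
if [ "$net30_end" -lt "$net31_start" ]; then
  echo "ranges are disjoint ($((net30_end - net30_start + 1)) addresses each)"
fi
```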
6-2.Configuring the Linux Bridge : Worker
Configure in the order of (1)-(7) in fig.1.
fig.1
(1) | Already configured in section 4-10 (the modprobe vhost_vdpa command). |
(2)(3) | Create the first bridge and set PF ens2f0 as its uplink. |
(4)(5) | Create the second bridge and set PF ens2f1 as its uplink. |
(6) | Set automatically when the Pod is deployed; the Accelerated Bridge CNI configures the VF representors ens2f0_0 and ens2f1_0 as downlinks. |
(7) | Create the Pods in the next section, 6-3. |
Worker

(1)
modprobe vhost_vdpa

(2)
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0

(3)
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

(4)
nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1

(5)
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1

CP

(6)(7) See next section 6-3.
kubectl apply -f 161vdpa.yaml
6-2-1.Creating a Linux Bridge : Worker
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
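Since the two bridge set-ups differ only in the bridge and PF names, the same commands can also be generated with a loop. A sketch that prints the nmcli commands so they can be reviewed before running (the gen_bridge_cmds helper name is ours):

```shell
# Print the nmcli commands for each "bridge uplink-PF" pair.
gen_bridge_cmds() {
  for pair in "br30ens2f0 ens2f0" "br31ens2f1 ens2f1"; do
    set -- $pair
    br=$1 pf=$2
    echo "nmcli connection add type bridge autoconnect yes con-name $br ifname $br"
    echo "nmcli connection modify $br bridge.stp no"
    echo "nmcli connection modify $br ipv4.method disabled ipv6.method ignore"
    echo "nmcli connection up $br"
    echo "nmcli connection add type bridge-slave ifname $pf master $br"
  done
}

gen_bridge_cmds            # review the generated commands first
# gen_bridge_cmds | sh     # then execute them on the Worker
```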
6-2-2.Checking the Linux Bridge : Worker
nmcli con show

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --
6-3.Deploying a Pod : CP
The key point is that the values defined in the NetworkAttachmentDefinition are specified in "annotations" and "resources".
Note that there are no Pod settings specific to Accelerated Bridge CNI.
vi 161vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "ca:fe:c0:ff:ee:11"
      }
    ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net31",
        "mac": "ca:fe:c0:ff:ee:12"
      }
    ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]

kubectl apply -f 161vdpa.yaml
kubectl get pod
Output Example
[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          7s
vdpa-pod02   1/1     Running   0          7s
6-4.Check Pod details : CP
In this section, we will check the details of the Pod status by using the kubectl describe command.
kubectl describe pod vdpa-pod01
fig.1
Make sure that the device-info parameters (/dev/vhost-vdpa-0, 07:00.2) match those shown in fig.1.
Output Example
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Wed, 23 Jun 2021 00:03:55 +0900
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.7
IPs:
  IP:  10.244.1.7
Containers:
  vdpa-single01:
    Container ID:  docker://67e8a82543cd8e4db556e8bd502f42659260e2d50192a5a0f13860d1cefc09ef
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Wed, 23 Jun 2021 00:03:57 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gq69j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-gq69j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  39s   multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  39s   multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          39s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         39s   kubelet  Created container vdpa-single01
  Normal  Started         39s   kubelet  Started container vdpa-single01
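The device-info block can also be pulled out of the Multus network-status annotation directly. A sketch, assuming `jq` is installed and using a `vdpa_info` helper name of our own (the sample JSON at the end is an abridged stand-in for the real annotation):

```shell
# Keep only the vDPA device-info block from a Multus network-status
# annotation. The jq filter and helper name are ours.
vdpa_info() {
  jq '.[] | select(."device-info".type == "vdpa") | ."device-info".vdpa'
}
# On the ControlPlane you would feed it the real annotation:
#   kubectl get pod vdpa-pod01 \
#     -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' | vdpa_info
printf '%s' '[{"name":"","interface":"eth0"},{"name":"default/vdpa-mlx-vhost-net30","device-info":{"type":"vdpa","vdpa":{"driver":"vhost","path":"/dev/vhost-vdpa-0","pci-address":"0000:07:00.2"}}}]' | vdpa_info
```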
The following message is output to /var/log/messages of the Worker when the Pod is deployed.
[root@c83g152 ~]# tail -f /var/log/messages
Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.415300 2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.419085 2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-pod7cba503f_5336_425e_b32c_45869b6f9020.slice.
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-poddd78b5ac_06f3_48f1_bbc9_2443fa88c653.slice.
============ s n i p ============
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f0_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered forwarding state
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7254] device (ens2f0_0): Activation: starting connection 'ens2f0_0' (84f4ad71-e7dc-44ba-8755-2e3d6279b037)
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7258] device (ens2f0_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7273] device (ens2f0_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7298] device (ens2f0_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7303] device (br30ens2f0): bridge port ens2f0_0 was attached
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7304] device (ens2f0_0): Activation: connection 'ens2f0_0' enslaved, continuing activation
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7308] device (ens2f0_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7398] device (ens2f0_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7404] device (ens2f0_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7431] device (ens2f0_0): Activation: successful, device activated.
Jun 21 19:38:42 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 temp_17: renamed from ens2f0v0
Jun 21 19:38:42 c83g152 NetworkManager[1192]: [1624271922.7876] device (ens2f0v0): interface index 17 renamed iface from 'ens2f0v0' to 'temp_17'
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f1_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered forwarding state
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: renamed from temp_17
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: Link up
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:01.2 temp_18: renamed from ens2f1v0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708881 6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.4/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f0_0 to the bridge br30ens2f0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.003540 6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.65/24] from default/vdpa-mlx-vhost-net30
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: renamed from temp_18
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: Link up
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1334] device (temp_17): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1418] device (ens2f1v0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1466] device (ens2f1_0): Activation: starting connection 'ens2f1_0' (b0db37c4-3214-497b-80e2-c532ea2e54c2)
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1469] device (ens2f1_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1480] device (ens2f1_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1488] device (ens2f1_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1492] device (br31ens2f1): bridge port ens2f1_0 was attached
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1492] device (ens2f1_0): Activation: connection 'ens2f1_0' enslaved, continuing activation
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1497] device (ens2f1_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708792 6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.5/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f1_0 to the bridge br31ens2f1
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.150099 6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.129/24] from default/vdpa-mlx-vhost-net31
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1710] device (ens2f1_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1716] device (ens2f1_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]: [1624271923.1739] device (ens2f1_0): Activation: successful, device activated.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223212683+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-1 are ignored in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223313924+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-0 are ignored in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223395596+09:00" level=warning msg="path in container /dev/vhost-vdpa-0 already exists in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223319641+09:00" level=warning msg="path in container /dev/vhost-vdpa-1 already exists in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
============ s n i p ============
Jun 21 19:38:53 c83g152 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
6-5.Check the Mac address of the Worker : Worker
Check that the MAC address specified in the annotations of the Pod is reflected in the VF of the Worker.
ip link show ens2f0
Output Example
[root@c83g152 ~]# ip link show ens2f0
15: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br30ens2f0 state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
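If you only want the MAC of VF 0 (for example, to compare it against the Pod annotation in a script), a small awk sketch works; the `vf0_mac` helper name is ours, and the sample line at the end stands in for the real `ip link show` output:

```shell
# Pull VF 0's MAC address out of 'ip link show <pf>' output.
# The awk filter and helper name are ours.
vf0_mac() { awk '$1 == "vf" && $2 == "0" { print $4; exit }'; }
# On the Worker:  ip link show ens2f0 | vf0_mac
printf '    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, spoof checking off\n' | vf0_mac
```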
6-6.Check the Linux Bridge of the Worker : Worker
Make sure that the VF representors (ens2f0_0, ens2f1_0) on the Worker are automatically attached as downlinks of the Linux Bridges.
nmcli con show
Output Example
[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
cni0                 eacb774c-9306-4c9d-8eb3-2382d9b4c5da  bridge    cni0
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
ens2f0_0             112da873-9159-412b-a576-b70eb1dcb97b  ethernet  ens2f0_0
ens2f1_0             714af871-890d-4e24-8ad6-bcea1ea7c208  ethernet  ens2f1_0
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --
7.Operation check : CP
7-1.Check the IP address of the Pod.
Since the IP address of the Pod is dynamically assigned, check it in advance.
kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add
Output Example
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  7m22s  multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  7m22s  multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  7m23s  multus   Add eth0 [10.244.1.6/24]
  Normal  AddedInterface  7m23s  multus   Add net1 [192.168.30.130/24] from default/vdpa-mlx-vhost-net31
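If you only need the net1 address (for the ping in the next step), it can be extracted from the AddedInterface event line. A sed sketch; the `net1_ip` helper name and pattern are ours, and the `echo` line stands in for the real `kubectl describe` output:

```shell
# Extract the bracketed net1 address from an AddedInterface event line.
# The sed pattern and helper name are ours.
net1_ip() { sed -n 's/.*Add net1 \[\([^]]*\)\].*/\1/p'; }
# Usage on the ControlPlane:  kubectl describe pod vdpa-pod01 | net1_ip
echo '  Normal  AddedInterface  7m22s  multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30' | net1_ip
```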
7-2.Communication check
vdpa-pod01 | 192.168.30.66/24 |
vdpa-pod02 | 192.168.30.130/24 |
Execute a ping from vdpa-pod01(192.168.30.66) to 192.168.30.130.
kubectl exec -it vdpa-pod01 -- ping 192.168.30.130
Output Example
[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.130
PING 192.168.30.130 (192.168.30.130) 56(84) bytes of data.
64 bytes from 192.168.30.130: icmp_seq=1 ttl=64 time=0.510 ms
64 bytes from 192.168.30.130: icmp_seq=2 ttl=64 time=0.590 ms
64 bytes from 192.168.30.130: icmp_seq=3 ttl=64 time=0.551 ms
64 bytes from 192.168.30.130: icmp_seq=4 ttl=64 time=0.510 ms
^C
--- 192.168.30.130 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 63ms
rtt min/avg/max/mdev = 0.510/0.540/0.590/0.037 ms
We were able to confirm that the Pods are communicating with each other via the DAC.
You can also check that the Worker is learning the Pods' MAC addresses by using bridge monitor fdb.
Before pinging, run the following command on the Worker.
bridge monitor fdb
The ca:fe:c0:ff:ee:11 and ca:fe:c0:ff:ee:12 entries near the end are the MAC addresses learned when the ping was run.
The other entries are MAC addresses learned when the Pods were deployed.
[root@c83g152 ~]# bridge monitor fdb
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b vlan 1 master cni0 permanent
Deleted 7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab vlan 1 master cni0 permanent
fe:42:e8:04:50:51 dev veth78a2b6ab master cni0
b2:4b:f8:68:fe:b0 dev vethdbd6a26b master cni0
7a:d9:02:4d:7c:8f dev cni0 master cni0 permanent
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 master br30ens2f0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 vlan 1 master br30ens2f0 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 master br31ens2f1 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 vlan 1 master br31ens2f1 permanent
ca:fe:c0:ff:ee:11 dev ens2f0_0 master br30ens2f0
ca:fe:c0:ff:ee:11 dev ens2f1 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f1_0 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f0 master br30ens2f0
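To reduce a `bridge fdb show` snapshot to just the dynamically learned entries (the permanent ones belong to the bridge ports themselves), a small awk sketch can be used; the `learned_fdb` helper name is ours, and the `printf` lines stand in for the real command output:

```shell
# Keep only dynamically learned FDB entries (lines without "permanent")
# and print "MAC on <port>". The awk filter and helper name are ours.
learned_fdb() { awk '!/permanent/ { print $1, "on", $3 }'; }
# On the Worker:  bridge fdb show | learned_fdb
printf '%s\n' \
  'ea:03:d9:6b:62:a3 dev ens2f0_0 master br30ens2f0 permanent' \
  'ca:fe:c0:ff:ee:11 dev ens2f0_0 master br30ens2f0' | learned_fdb
```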
8.Finally
We referred to the following website.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni
This article falls under No.10.
No | vm(qemu)/k8s | k8s Pod/VMI | vDPA Framework | vDPA Type | SR-IOV mode | Related Articles |
1 | vm | - | kernel | vhost | legacy | Not started |
2 | vm | - | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection |
3 | vm | - | kernel | virtio | legacy | Not started |
4 | vm | - | kernel | virtio | switchdev | Not started |
5 | vm | - | dpdk | vhost | legacy | Not started |
6 | vm | - | dpdk | vhost | switchdev | Not started |
7 | vm | - | dpdk | virtio | legacy | Not started |
8 | vm | - | dpdk | virtio | switchdev | How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection |
9 | k8s | pod | kernel | vhost | legacy | How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection |
10 | k8s | pod | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection (this article) |
11 | k8s | pod | kernel | virtio | legacy | Not started |
12 | k8s | pod | kernel | virtio | switchdev | Not started |
13 | k8s | pod | dpdk | client | legacy | Not started |
14 | k8s | pod | dpdk | client | switchdev | Not started |
15 | k8s | pod | dpdk | server | legacy | Not started |
16 | k8s | pod | dpdk | server | switchdev | Not started |
In the next article, I would like to try setting up a VMI (Virtual Machine Instance) using KubeVirt.
However, at this time KubeVirt only supports vfio-pci, so we will put that on hold for now.
*1:After checking various documents, I found that the "v" in vDPA is variously expanded as virtual, vhost, and virtio, but they all seem to refer to the same thing. In this article, I have followed Introduction to vDPA kernel framework and used virtio.
*2:A loop connection is used in case a 100Gbps switch or a server with 100GNICs is not available. However, since we believe it is important that packets generated by the Pod are physically sent to the outside, we use the configuration shown in fig.1.
*3:In the case of VMs, setting the MAC address was mandatory; for a k8s Pod it is not a separate manual step, because the MAC address is written in the yaml.
*4:core and modules will be installed at the same time
*5:The "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so I don't think you need to worry too much about it.
*6:If you built the Kernel from source and enabled vhost_vdpa with make menuconfig, etc., this setting is not necessary.
*7:In the previous article, we also built sriov-cni at the same time. In this article, we will not build sriov-cni because we will not use it, but you can build sriov-cni without any problem. In the 6-1.NetworkAttachmentDefinition setting, we will explicitly specify accelerated-bridge, so there is no problem even if sriov-cni and accelerated-bridge-cni are mixed.
*9:In the previous article, it is set to "sriov".
*10:If you don't specify anything, the default Bridge will be cni0.