Metonymical Deflection

Loose notes on daily life, and occasionally IT infrastructure

How to set up vDPA with vhost_vdpa for Kubernetes

This article describes how to configure vDPA (virtio data path acceleration) in a Kubernetes environment. *1
For more information on vDPA, please refer to the official Red Hat blog.

In this article, we will describe communication between Pods in Kubernetes (referred to below as "k8s") using the vhost_vdpa module.

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Overview

1-1.Environment
1.ControlPlane
VMWare              : VMware(R) Workstation 15 Pro 15.5.1 build-15018445 

2.Worker
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
Mellanox OFED                    : v5.3-1.0.0.1

3.ControlPlane&Worker common
OS                               : CentOS8.3(2011)
Kernel(ControlPlane)             : 4.18.0-240.el8.x86_64 
Kernel(Worker)                   : 5.12.7-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Kubernetes                       : 1.21.1
Docker-CE                        : 20.10.6
flannel                          : latest
Multus                           : latest
sriov-cni                        : latest
sriov-network-device-plugin      : latest
1-2.Overall flow
  1. Advance preparation
  2. Kernel update
  3. Build k8s Cluster & flannel
  4. Build and deploy vDPA (SR-IOV) related
  5. Deploy Pod
  6. Operation check
  7. Advanced configuration

There is already a fair amount of documentation covering steps 1-3, so we will skip the non-essential parts.
Step 7 describes more detailed settings.

1-3.Overall structure

A loop connection is made using a DAC (Direct Attached Cable). *2
fig.1
f:id:metonymical:20210603000858j:plain
fig.1 is a simplified diagram that omits the internal architecture. In reality, please picture a configuration like the following.

fig.2
f:id:metonymical:20210603000958j:plain

Quoted from Red Hat's GitHub repository (github.com/redhat-nfvpe/vdpa-deployment)

In the above GitHub repository, vDPA is implemented with SR-IOV in legacy mode.
For this reason, this article also describes the configuration in legacy mode. *3

2.Advance preparation

Although not specifically mentioned below, disabling SELinux, disabling the firewall, and configuring NTP time synchronization have been done in advance.

2-1.Swap and Hosts file settings : CP (ControlPlane) & Worker
Disable Swap
vi /etc/fstab

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0
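
If you want to disable swap immediately without waiting for a reboot, you can also run the standard command:

swapoff -a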

Hosts file settings
vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePage and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg
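
After the next reboot, you can optionally confirm that the parameters were applied (the exact output depends on your system; it should contain the hugepages and iommu options added above):

cat /proc/cmdline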

Next, configure the mount settings for HugePage so that it is mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-3.SR-IOV VF settings : Worker

Configure the SR-IOV VF settings; you can increase the number of VFs, but for the sake of simplicity we set the number of VFs to "1". In addition, we set the MAC address here. *4

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
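
If you want to apply and check the settings right away instead of waiting for a reboot, the same commands can also be run manually; the result can then be verified, for example, with:

cat /sys/class/net/ens2f0/device/sriov_numvfs
ip link show ens2f0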
2-4.Install the Mellanox driver (OFED) : Worker

You can download the iso file from the Mellanox website (Mellanox Download Site).
Please save the downloaded iso file to /root/tmp/.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint
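
Optionally, you can confirm the installed OFED version before rebooting; ofed_info is provided by the Mellanox packages:

ofed_info -s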

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0

3.Kernel update : Worker

As of June 11, 2021, the vDPA-related modules are being updated frequently, so install the latest Kernel.
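
Note: the commands below assume the ELRepo repository has already been added. If it has not, it can be added first, for example with:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm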

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *5
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel Version.
Swap kernel-headers for kernel-ml-headers.
Remove the existing kernel-tools and kernel-tools-libs.
Install kernel-ml-tools and kernel-ml-tools-libs.
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-core.x86_64                              5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-devel.x86_64                             5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-headers.x86_64                           5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-modules.x86_64                           5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-tools.x86_64                             5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Build k8s Cluster & flannel

4-1.Install Docker : CP&Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing  docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-2.Configuring the k8s repository : CP&Worker
cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
4-3.Install k8s : CP&Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
4-4.Configuring Docker : CP&Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
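
As an optional check, you can confirm that Docker is now using the systemd cgroup driver:

docker info | grep -i cgroup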
4-5.Building the k8s Cluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16

Output Example
At the end you will see the following output. Copy the kubeadm join command shown at the end; it will be used when the Worker joins the CP.

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3 
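
If you did not copy the join command, an equivalent one can be regenerated on the CP at any time with the standard kubeadm command:

kubeadm token create --print-join-command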
4-6.Configuration after building k8s Cluster : CP
Copy the config file
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes

Output Example
Wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS     ROLES                  AGE   VERSION
c83g151.md.jp   Ready      control-plane,master   44s   v1.21.1
4-8.Joining a Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3

Output Example
Workers also wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.1
c83g152.md.jp   Ready    <none>                 44s    v1.21.1
4-9.Enabling the vhost_vdpa module : Worker

sriov-cni and the related components will be built and deployed in section "5. Build and deploy vDPA (SR-IOV) related".
At that time, if the vhost_vdpa module is not enabled, the VFs will not be recognized as Worker resources, so enable it beforehand.

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output, we can confirm the following:

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Also, set it so that the module is loaded at OS startup. *6
The added lines (compared with the rc.local in section 2-3) are modprobe vhost_vdpa and the following sleep 1.

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0
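
As an alternative to rc.local, the module can also be loaded at boot through systemd's modules-load mechanism (either approach is fine):

echo vhost_vdpa > /etc/modules-load.d/vhost_vdpa.conf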

Reboot the CP & Worker again.

5.Build and deploy vDPA (SR-IOV) related

5-1.Install Golang : CP
dnf -y install epel-release golang
5-2.Building the Docker image : CP

In this section, we will build the following three images. *7

  • multus
  • sriov-cni
  • sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-cni && \
make sriov-dp
5-3.Copy the Docker image : CP

Copy the built Docker image to the Worker.

/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-cni root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152

Output Example

[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-cni root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs

Checking the Docker Image : CP&Worker

[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED          SIZE
nfvpe/sriov-cni                      latest       521ab1f3a5a1   16 minutes ago   9.47MB
<none>                               <none>       88062f2c13d4   16 minutes ago   561MB
nfvpe/multus                         latest       aa8d9becca0f   17 minutes ago   331MB
<none>                               <none>       e75c422aef6e   17 minutes ago   1.34GB
nfvpe/sriov-device-plugin            latest       0dd5f325c600   18 minutes ago   42.7MB
<none>                               <none>       3deb8b5405fa   18 minutes ago   1.26GB
quay.io/coreos/flannel               v0.14.0      8522d622299c   2 weeks ago      67.9MB
k8s.gcr.io/kube-apiserver            v1.21.1      771ffcf9ca63   3 weeks ago      126MB
k8s.gcr.io/kube-proxy                v1.21.1      4359e752b596   3 weeks ago      131MB
k8s.gcr.io/kube-controller-manager   v1.21.1      e16544fd47b0   3 weeks ago      120MB
k8s.gcr.io/kube-scheduler            v1.21.1      a4183b88f6e6   3 weeks ago      50.6MB
golang                               alpine3.12   24d827672eae   3 weeks ago      301MB
golang                               alpine       722a834ff95b   3 weeks ago      301MB
fedora                               32           c451de0d2441   5 weeks ago      202MB
alpine                               3.12         13621d1b12d4   7 weeks ago      5.58MB
alpine                               3            6dbb9cc54074   7 weeks ago      5.61MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   4 months ago     683kB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   7 months ago     42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago     253MB
5-4.Deploying yaml files : CP

Deploy the following four files.

  1. /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
  2. /usr/src/vdpa-deployment/deployment/sriovcni-vdpa-daemonset.yaml
  3. /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
  4. /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
cd /usr/src/vdpa-deployment && \
make deploy && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

Output Example

[root@c83g151 vdpa-deployment]# cd /usr/src/vdpa-deployment && \
> make deploy && \
> kubectl create -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
daemonset.apps/kube-sriov-cni-ds-amd64 created
configmap/sriovdp-config created
5-5.Checking DaemonSet & ConfigMap & Pod : CP

Check the DaemonSet and ConfigMap that you deployed in 5-4.

kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get ds
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-flannel-ds                  2         2         2       2            2                                     3h4m
kube-multus-ds-amd64             2         2         2       2            2           kubernetes.io/arch=amd64        6m52s
kube-proxy                       2         2         2       2            2           kubernetes.io/os=linux          3h4m
kube-sriov-cni-ds-amd64          2         2         2       2            2           beta.kubernetes.io/arch=amd64   6m52s
kube-sriov-device-plugin-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   6m52s


[root@c83g151 vdpa-deployment]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      3h4m
extension-apiserver-authentication   6      3h4m
kube-flannel-cfg                     2      3h4m
kube-proxy                           2      3h4m
kube-root-ca.crt                     1      3h4m
kubeadm-config                       2      3h4m
kubelet-config-1.21                  1      3h4m
multus-cni-config                    1      6m52s
sriovdp-config                       1      6m52s

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod
NAME                                    READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-7kql2                1/1     Running   2          178m
coredns-558bd4d5db-nq8k7                1/1     Running   2          178m
etcd-c83g151.md.jp                      1/1     Running   2          178m
kube-apiserver-c83g151.md.jp            1/1     Running   2          178m
kube-controller-manager-c83g151.md.jp   1/1     Running   2          178m
kube-flannel-ds-89v57                   1/1     Running   2          174m
kube-flannel-ds-zwd7n                   1/1     Running   2          177m
kube-multus-ds-amd64-75rbf              1/1     Running   0          33s
kube-multus-ds-amd64-zk6w9              1/1     Running   0          33s
kube-proxy-fdv9r                        1/1     Running   2          174m
kube-proxy-l6t7h                        1/1     Running   2          178m
kube-scheduler-c83g151.md.jp            1/1     Running   2          178m
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          33s
kube-sriov-cni-ds-amd64-ndmmr           1/1     Running   0          33s
kube-sriov-device-plugin-amd64-4lt4p    1/1     Running   0          33s
kube-sriov-device-plugin-amd64-gbplp    1/1     Running   0          33s
5-6.Checking the details of ConfigMap : CP

The resource vdpa_mlx_vhost is referenced by the NetworkAttachmentDefinition and the Pod configuration in "6. Deploy Pod", so check its definition here beforehand.

cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

Note

resourceName : Any name can be used. You specify this name explicitly in section 6-1.
vendors : The vendor part of the PCI Device ID. 15b3 indicates a Mellanox product. *8
devices : The device part of the PCI Device ID. 101e indicates a VF of the ConnectX-6 Dx.
drivers : Specifies the mlx5_core driver.
vdpaType : Specifies vhost. This option is an extension over the regular SR-IOV device plugin.

The PCI Device ID can be checked with the following command.

[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
5-7.Checking Worker Resources : CP

Check that vdpa_mlx_vhost is recognized as a Worker resource.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

Output Example

[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}

"2" is the number of VFs that were recognized.

Note
When this value is "0", or when the intel.com/vdpa_mlx_vhost line is not displayed at all, delete the sriov-device-plugin Pod once.
Since sriov-device-plugin is deployed as a DaemonSet, it will be automatically recreated after the Pod is deleted.
This will cause sriov-device-plugin to attempt to re-register vdpa_mlx_vhost.

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-rg8hm    1/1     Running   0          73m     192.168.11.152   c83g152.md.jp              

[root@c83g151 vdpa-deployment]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-rg8hm
pod "kube-sriov-device-plugin-amd64-rg8hm" deleted

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-kwc5z    1/1     Running   0          3s      192.168.11.152   c83g152.md.jp              

Again, check the Worker resources.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

If it is still not recognized, please refer to section 4-9.
The two points are as follows.

  1. /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 must be recognized as vhost_vdpa devices.
  2. 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 must be controlled by the vhost_vdpa driver.

Unless the above two conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.

6.Deploy Pod

Here we configure a NetworkAttachmentDefinition, the CRD for which is defined in "/usr/src/vdpa-deployment/deployment/multus-daemonset.yaml".
In this section, we will configure vlan, spoofchk, ipam, etc.
For more details, please refer to the following website.
sriov-cni/configuration-reference.md at rfe/vdpa · amorenoz/sriov-cni · GitHub

6-1.Configuring the NetworkAttachmentDefinition : CP

"vdpa-mlx-vhost-net30" is specified in the annotations of the Pod.

vi 96nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa30",
  "vlan": 30,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.64",
    "rangeEnd": "192.168.30.127"
  }
}'

kubectl apply -f 96nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io

Output Example

[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   14m
6-2.Deploying a Pod : CP

The key point is that the values defined in the Network Attachment Definition are specified for "annotations" and "resources".

vi 16vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "CA:FE:C0:FF:EE:11"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "CA:FE:C0:FF:EE:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]


kubectl apply -f 16vdpa.yaml
kubectl get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          16m
vdpa-pod02   1/1     Running   0          16m
6-3.Check Pod details : CP

In this section, we will check the details of the Pod status by using the kubectl describe command.

kubectl describe pod vdpa-pod01

fig.1
f:id:metonymical:20210603000858j:plain
Make sure that the device-info parameters (/dev/vhost-vdpa-0, 0000:07:00.2) match those in fig.1.
Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Thu, 03 Jun 2021 20:54:49 +0900
Labels:       
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.9"
                    ],
                    "mac": "26:9a:87:2a:70:70",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.71"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.9"
                    ],
                    "mac": "26:9a:87:2a:70:70",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.71"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.9
IPs:
  IP:  10.244.1.9
Containers:
  vdpa-single01:
    Container ID:  docker://cf57569807eb2de3d4901ff2ade55b845682d2c7a37ee88c7f6536498fd0b63e
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          
    Host Port:     
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Thu, 03 Jun 2021 20:54:51 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:                 
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9kqtb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-9kqtb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  26s   multus   Add eth0 [10.244.1.9/24]
  Normal  AddedInterface  25s   multus   Add net1 [192.168.30.71/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          25s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         25s   kubelet  Created container vdpa-single01
  Normal  Started         25s   kubelet  Started container vdpa-single01
6-4.Check the MAC address on the Worker : Worker

Check that the MAC address specified in the annotations of the Pod is reflected in the VF of the Worker.

ip link show ens2f0

Output Example

[root@c83g152 ~]# ip link show ens2f0
8: ens2f0:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, vlan 30, spoof checking off, link-state auto, trust on, query_rss off

7.Operation check : CP

7-1.Check the IP address of the Pod.

Since the IP address of the Pod is dynamically assigned, check it in advance.

kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  26m   multus   Add eth0 [10.244.1.9/24]
  Normal  AddedInterface  26m   multus   Add net1 [192.168.30.71/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  26m   multus   Add eth0 [10.244.1.8/24]
  Normal  AddedInterface  26m   multus   Add net1 [192.168.30.70/24] from default/vdpa-mlx-vhost-net30
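
In addition, the same information can be read from the k8s.v1.cni.cncf.io/network-status annotation that Multus adds to each Pod (shown in full in section 6-3), for example:

kubectl get pod vdpa-pod01 -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'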
7-2.Communication check
vdpa-pod01 192.168.30.71/24
vdpa-pod02 192.168.30.70/24

Execute a ping from vdpa-pod01(192.168.30.71) to 192.168.30.70.

kubectl exec -it vdpa-pod01 -- ping 192.168.30.70

Output Example

[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.70
PING 192.168.30.70 (192.168.30.70) 56(84) bytes of data.
64 bytes from 192.168.30.70: icmp_seq=1 ttl=64 time=0.447 ms
64 bytes from 192.168.30.70: icmp_seq=2 ttl=64 time=0.166 ms
64 bytes from 192.168.30.70: icmp_seq=3 ttl=64 time=0.227 ms
64 bytes from 192.168.30.70: icmp_seq=4 ttl=64 time=0.217 ms
^C
--- 192.168.30.70 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 91ms
rtt min/avg/max/mdev = 0.166/0.264/0.447/0.108 ms

We were able to confirm that the Pods are communicating with each other via the DAC.
That's it for the configuration.

In the next section, we will describe a configuration example for more detailed settings.

8.Advanced configuration

Depending on how you configure the ConfigMap and NetworkAttachmentDefinition, you can also set up a configuration like the one shown in fig.3.

fig.3
f:id:metonymical:20210603230304j:plain

The yaml files should make it clear what exactly is being done, so I will not add detailed explanations; pay attention to the parts that differ from the earlier yaml files (highlighted in red on the blog).
Please increase the number of SR-IOV VFs to 4 (2 per PF) in advance.

echo 2 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 2 > /sys/class/net/ens2f1/device/sriov_numvfs
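
Note: if the VFs created in section 2-3 are still enabled, the kernel does not allow changing sriov_numvfs directly from one non-zero value to another; reset it to 0 first, for example:

echo 0 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 0 > /sys/class/net/ens2f1/device/sriov_numvfs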
8-1.Configuring ConfigMap : CP
vi 83cm-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost11",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:00.2"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost12",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:00.3"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost21",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:01.2"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost22",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:01.3"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

kubectl apply -f 83cm-vdpa.yaml
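
As in section 5-7, if the new resources (e.g. intel.com/vdpa_mlx_vhost11) do not appear in the Worker's allocatable resources, delete the sriov-device-plugin Pod on the Worker once so that the DaemonSet recreates it and the plugin re-reads the updated ConfigMap (replace the Pod name below with the one shown by the first command):

kubectl -n kube-system get pod -o wide | grep sriov-device-plugin
kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-xxxxx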
8-2.Configuring NetworkAttachmentDefinition : CP
vi 93nA-vdpa11-22.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net11
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost11
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 100,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.100.0/24",
    "rangeStart": "192.168.100.64",
    "rangeEnd": "192.168.100.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net12
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost12
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 200,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.200.0/24",
    "rangeStart": "192.168.200.64",
    "rangeEnd": "192.168.200.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net21
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost21
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 100,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.100.0/24",
    "rangeStart": "192.168.100.128",
    "rangeEnd": "192.168.100.191"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net22
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost22
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 200,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.200.0/24",
    "rangeStart": "192.168.200.128",
    "rangeEnd": "192.168.200.191"
  }
}'

kubectl apply -f 93nA-vdpa11-22.yaml
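
As in section 6-1, you can confirm that the four NetworkAttachmentDefinitions were created:

kubectl get network-attachment-definitions.k8s.cni.cncf.io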
8-3.Configuring Pod : CP
vi 13vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net11",
        "mac": "0C:FE:C0:FF:EE:11"
      },
      {
        "name": "vdpa-mlx-vhost-net12",
        "mac": "0C:FE:C0:FF:EE:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost11: '1'
        intel.com/vdpa_mlx_vhost12: '1'
      limits:
        intel.com/vdpa_mlx_vhost11: '1'
        intel.com/vdpa_mlx_vhost12: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net21",
        "mac": "0C:FE:C0:FF:EE:21"
      },
      {
        "name": "vdpa-mlx-vhost-net22",
        "mac": "0C:FE:C0:FF:EE:22"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost21: '1'
        intel.com/vdpa_mlx_vhost22: '1'
      limits:
        intel.com/vdpa_mlx_vhost21: '1'
        intel.com/vdpa_mlx_vhost22: '1'
    command: ["sleep"]
    args: ["infinity"]

kubectl apply -f 13vdpa.yaml
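
Note: vdpa-pod01 and vdpa-pod02 reuse the Pod names from section 6-2, and most Pod spec fields cannot be updated in place, so if the Pods from section 6 are still running, delete them before applying this manifest:

kubectl delete -f 16vdpa.yaml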
8-4.Check Pod details : CP
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Thu, 03 Jun 2021 22:23:33 +0900
Labels:       
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.12"
                    ],
                    "mac": "c6:01:75:97:75:91",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net11",
                    "interface": "net1",
                    "ips": [
                        "192.168.100.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                },{
                    "name": "default/vdpa-mlx-vhost-net12",
                    "interface": "net2",
                    "ips": [
                        "192.168.200.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:12",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa1",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-1",
                            "pci-address": "0000:07:00.3"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks:
                [ { "name": "vdpa-mlx-vhost-net11", "mac": "0C:FE:C0:FF:EE:11" }, { "name": "vdpa-mlx-vhost-net12", "mac": "0C:FE:C0:FF:EE:12" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.12"
                    ],
                    "mac": "c6:01:75:97:75:91",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net11",
                    "interface": "net1",
                    "ips": [
                        "192.168.100.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                },{
                    "name": "default/vdpa-mlx-vhost-net12",
                    "interface": "net2",
                    "ips": [
                        "192.168.200.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:12",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa1",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-1",
                            "pci-address": "0000:07:00.3"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.12
IPs:
  IP:  10.244.1.12
Containers:
  vdpa-single01:
    Container ID:  docker://df698de4764a4209703f9df3a641167cdf49222860d1e41f1c85de7ba1bb5146
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          
    Host Port:     
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Thu, 03 Jun 2021 22:23:36 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost11:  1
      intel.com/vdpa_mlx_vhost12:  1
    Requests:
      intel.com/vdpa_mlx_vhost11:  1
      intel.com/vdpa_mlx_vhost12:  1
    Environment:                   
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xst8f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-xst8f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  8m3s  multus   Add eth0 [10.244.1.12/24]
  Normal  AddedInterface  8m2s  multus   Add net1 [192.168.100.64/24] from default/vdpa-mlx-vhost-net11
  Normal  AddedInterface  8m2s  multus   Add net2 [192.168.200.64/24] from default/vdpa-mlx-vhost-net12
  Normal  Pulled          8m2s  kubelet  Container image "centos:latest" already present on machine
  Normal  Created         8m2s  kubelet  Created container vdpa-single01
  Normal  Started         8m1s  kubelet  Started container vdpa-single01

8-5.Check Worker details : Worker
[root@c83g152 ~]# ip link show ens2f0
8: ens2f0:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 0c:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on, query_rss off
    vf 1     link/ether 0c:fe:c0:ff:ee:12 brd ff:ff:ff:ff:ff:ff, vlan 200, spoof checking off, link-state auto, trust on, query_rss off
[root@c83g152 ~]# ip link show ens2f1
9: ens2f1:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:11 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 0c:fe:c0:ff:ee:21 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on, query_rss off
    vf 1     link/ether 0c:fe:c0:ff:ee:22 brd ff:ff:ff:ff:ff:ff, vlan 200, spoof checking off, link-state auto, trust on, query_rss off

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Jun  3 22:07 ./
drwxr-xr-x 3 root root    0 Jun  3 22:07 ../
--w------- 1 root root 4096 Jun  3 22:28 bind
lrwxrwxrwx 1 root root    0 Jun  3 22:28 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jun  3 22:07 uevent
--w------- 1 root root 4096 Jun  3 22:28 unbind
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.3/vdpa1/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa2 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa2/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa3 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.3/vdpa3/

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  ens1f0     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.3             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.3             network        ConnectX Family mlx5Gen Virtual Function

[root@c83g152 ~]# ls /dev
autofs           full          lp3     ptp4    shm       vcs2   vcsu5
block            fuse          mapper  ptp5    snapshot  vcs3   vcsu6
bsg              hpet          mcelog  ptp6    snd       vcs4   vfio
btrfs-control    hpilo         mem     ptp7    stderr    vcs5   vga_arbiter
bus              hugepages     mqueue  ptp8    stdin     vcs6   vhci
char             hwrng         net     ptp9    stdout    vcsa   vhost-net
cl               infiniband    null    pts     tty       vcsa1  vhost-vdpa-0
console          initctl       nvram   random  tty0      vcsa2  vhost-vdpa-1
core             input         port    raw     tty1      vcsa3  vhost-vdpa-2
cpu              ipmi0         ppp     rfkill  tty10     vcsa4  vhost-vdpa-3
cpu_dma_latency  kmsg          ptmx    rtc     tty11     vcsa5  vhost-vsock
disk             kvm           ptp0    rtc0    tty12     vcsa6  watchdog
dm-0             log           ptp1    sda     tty13     vcsu   watchdog0
dm-1             loop-control  ptp10   sda1    tty14     vcsu1  zero
dri              lp0           ptp11   sda2    tty15     vcsu2
fb0              lp1           ptp2    sg0     tty16     vcsu3
fd               lp2           ptp3    sg1     tty17     vcsu4

9.Finally

We referred to the following websites.
https://github.com/redhat-nfvpe/vdpa-deployment
https://docs.google.com/document/d/1DgZuksLVIVD5ZpNUNH7zPUr-8t6GKKQICDLqIwQv-FA/edit

This time we used the legacy mode of SR-IOV, but next time we would like to use accelerated-bridge-cni and verify it in switchdev mode.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

Also, we thought about how many combinations of environments using vDPA exist, and it looks like there are at least 16 different ones.
Nos. 2 and 8 were described in previous articles, and No. 9 is covered in this article.

No | vm(qemu)/k8s | k8s Pod/VMI | vDPA Framework | vDPA Type | SR-IOV mode | Related Articles
1 | vm | - | kernel | vhost | legacy | Not started
2 | vm | - | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3 | vm | - | kernel | virtio | legacy | Not started
4 | vm | - | kernel | virtio | switchdev | Not started
5 | vm | - | dpdk | vhost | legacy | Not started
6 | vm | - | dpdk | vhost | switchdev | Not started
7 | vm | - | dpdk | virtio | legacy | Not started
8 | vm | - | dpdk | virtio | switchdev | How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9 | k8s | pod | kernel | vhost | legacy | How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection (this article)
10 | k8s | pod | kernel | vhost | switchdev | How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
11 | k8s | pod | kernel | virtio | legacy | Not started
12 | k8s | pod | kernel | virtio | switchdev | Not started
13 | k8s | pod | dpdk | client | legacy | Not started
14 | k8s | pod | dpdk | client | switchdev | Not started
15 | k8s | pod | dpdk | server | legacy | Not started
16 | k8s | pod | dpdk | server | switchdev | Not started

It would be inefficient to do all of these, so we will prioritize and verify those that are likely to be used frequently as use cases.


If you can understand the following article, I don't think this article's content is particularly difficult, which made me realize again how important it is to build up the basics.
How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection

In the future, when you encounter an environment that uses vDPA, I think it will be important to be able to work out which of these combinations it is built from.

Going cloud-native implies increasing abstraction, and as engineers I believe we have to avoid ending up in a state where we no longer understand anything about what lies underneath.

*1:After checking various documents, I found that the "v" in vDPA is expanded in three different ways: virtual, vhost, and virtio, but they all seem to refer to the same thing. In this article, I have followed "Introduction to vDPA kernel framework" and used virtio.

*2:A loop connection is used because a 100Gbps switch or another server with 100G NICs was not available. However, since we believe it is important that packets generated by the Pod are physically sent outside the host, we use the configuration shown in fig.1.

*3:As for switchdev mode, there is "accelerated-bridge-cni" provided by Mellanox, which we plan to verify in the future. GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

*4:In the case of VMs, the MAC address setting was mandatory, but in the case of k8s Pod, it is not a mandatory setting because the MAC address is written in the yaml.

*5:kernel-ml-core and kernel-ml-modules will be installed at the same time as dependencies

*6:If you build the Kernel from source and have already enabled vhost_vdpa via make menuconfig, etc., this setting is not necessary.

*7:In addition, you can also build dpdk-devel and dpdk-app. However, since we could not communicate with the external NW, we will not run DPDK in the Pod this time; instead we use a normal CentOS Pod for the communication check.

*8:PCI Devices