Metonymical Deflection

Casual notes on daily life, and occasionally IT infrastructure

How to set up baidu dperf on CentOS

This article describes how to set up baidu dperf.
GitHub - baidu/dperf: dperf is a DPDK based 100Gbps network performance and load testing software.

dperf is a high-performance HTTP load testing tool based on DPDK.
It is particularly well suited to TPut (throughput), CPS (connections per second), and CC (concurrent connections) load tests.

This article describes how to install and configure it on CentOS 7/8.
It also includes supplementary notes on the differences between using a Mellanox NIC and other NICs*1.

In my environment I was able to generate the following load, so configuration examples are given at the end.

TPut:93Gbps
CPS:5M
CC:300M

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)

OS                               : CentOS7.9(2009)
Kernel                           : 3.10.0-1160.el7.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @base
  @core
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools
DPDK                             :19.11.10

OS                               : CentOS8.5(2111)
Kernel                           : 4.18.0-348.el8.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
DPDK                             :20.11.4
1-2.Overall configuration

f:id:metonymical:20220211231825j:plain
Since I do not have a 100Gbps L2 switch, the two servers are connected directly, with one configured as the dperf client and the other as the dperf server.
It can also be built in a virtual environment.*2

1-3.Overall flow - Overview -
  1. Advance preparation
  2. Installation method 1: CentOS7.9 + DPDK19.11.10
  3. Installation method 2: CentOS8.5 + DPDK20.11.4
  4. Configuration
  5. Load testing
  6. Load test configuration examples

2.Advance preparation

2-1.Hugepages configuration

Allocate at least 8GB of hugepages, and increase the amount as needed.

vi /etc/default/grub

nopku transparent_hugepage=never default_hugepagesz=1G hugepagesz=1G hugepages=8

grub2-mkconfig -o /etc/grub2.cfg
vi /etc/fstab

nodev  /dev/hugepages    hugetlbfs pagesize=1GB    0 0
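The kernel parameters above are assumed to be appended to the existing GRUB_CMDLINE_LINUX entry; as a minimal sketch (the other options shown are placeholders), /etc/default/grub and a post-reboot check would look roughly like this:

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet nopku transparent_hugepage=never default_hugepagesz=1G hugepagesz=1G hugepages=8"

#After rebooting, confirm that the pages were reserved and the mount exists
grep Huge /proc/meminfo
mount | grep hugetlbfs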
2-2.uio_pci_generic configuration (for non-Mellanox NICs)
echo "uio_pci_generic" > /etc/modules-load.d/uio_pci_generic.conf
2-3.OFED installation (for Mellanox NICs)
#CentOS7.9
yum -y install tcl tk unbound
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-rhel7.9-x86_64.iso /mnt/
/mnt/mlnxofedinstall --upstream-libs --dpdk --with-mft --with-mstflint

#CentOS8.5
dnf -y install tcl tk unbound tcsh gcc-gfortran && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-rhel8.5-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --with-mft --with-mstflint
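If you want a quick sanity check after the OFED install (an optional, hedged step; ofed_info and ibv_devinfo come with MLNX_OFED/libibverbs), the following should show the installed version and the ConnectX-6 Dx as an mlx5 device:

ofed_info -s
ibv_devinfo | head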

3.Installation method 1: CentOS7.9 + DPDK19.11.10

3-1.Building DPDK
yum -y install numactl-devel libpcap-devel
mkdir dpdk
cd /root/dpdk/
wget http://fast.dpdk.org/rel/dpdk-19.11.10.tar.xz
tar xf dpdk-19.11.10.tar.xz
cd /root/dpdk/dpdk-stable-19.11.10

#The following settings are required for Mellanox NICs
sed -i -e "s/CONFIG_RTE_LIBRTE_MLX5_PMD=n/CONFIG_RTE_LIBRTE_MLX5_PMD=y/g" /root/dpdk/dpdk-stable-19.11.10/config/common_base
sed -i -e "s/CONFIG_RTE_LIBRTE_MLX5_DEBUG=n/CONFIG_RTE_LIBRTE_MLX5_DEBUG=y/g" /root/dpdk/dpdk-stable-19.11.10/config/common_base

export TARGET=x86_64-native-linuxapp-gcc
make install T=$TARGET -j4
3-2.Building dperf
cd /root/dpdk
wget https://github.com/baidu/dperf/archive/refs/heads/main.zip

unzip main.zip
cd dperf-main/
export TARGET=x86_64-native-linuxapp-gcc
make -j4 RTE_SDK=/root/dpdk/dpdk-stable-19.11.10 RTE_TARGET=$TARGET

4.Installation method 2: CentOS8.5 + DPDK20.11.4

4-1.Advance preparation*3
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo && \
dnf -y install numactl-devel meson ninja-build rdma-core && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm
4-2.Building DPDK
mkdir dpdk
cd /root/dpdk/
wget https://fast.dpdk.org/rel/dpdk-20.11.4.tar.xz
tar xf dpdk-20.11.4.tar.xz
cd /root/dpdk/dpdk-stable-20.11.4

meson build --prefix=/root/dpdk/dpdk-stable-20.11.4/mydpdk -Denable_kmods=true && \
ninja -C build install
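Before moving on to dperf, you can optionally confirm that the freshly installed DPDK is visible to pkg-config (this uses the same PKG_CONFIG_PATH as 4-3 below):

export PKG_CONFIG_PATH=/root/dpdk/dpdk-stable-20.11.4/mydpdk/lib64/pkgconfig/
pkg-config --modversion libdpdk
#Should print 20.11.4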
4-3.Building dperf
cd /root/dpdk
wget https://github.com/baidu/dperf/archive/refs/heads/main.zip

unzip main.zip
cd /root/dpdk/dperf-main/
export PKG_CONFIG_PATH=/root/dpdk/dpdk-stable-20.11.4/mydpdk/lib64/pkgconfig/
make
4-4.ldconfig configuration

Note
If the following error is output when starting dperf, run ldconfig.

[root@c85g151 dperf-main]# ./build/dperf -c test/http/client-cps.conf
./build/dperf: error while loading shared libraries: librte_ethdev.so.21: cannot open shared object file: No such file or directory
vi /etc/ld.so.conf.d/libdpdk.conf

/root/dpdk/dpdk-stable-20.11.4/mydpdk/lib64

ldconfig
ldconfig -p |grep dpdk

5.Configuration

5-1.dpdk-devbind configuration

For NICs other than Mellanox, dpdk-devbind is required.
For Mellanox NICs, dpdk-devbind is not needed.
Check the bsf number, then bind the NIC to the DPDK PMD driver.*4

lspci
lshw -businfo -c network
/root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -s
/root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -b uio_pci_generic 0000:03:00.0

<Example output>
The following output is from running dpdk-devbind on CentOS8.5 on VMware Workstation Pro 15.

[root@c85g151 dperf-main]# lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
02:01.0 Ethernet controller: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 01)
03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

[root@c85g151 dperf-main]# /root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -s
Network devices using kernel driver
===================================
0000:02:01.0 '82545EM Gigabit Ethernet Controller (Copper) 100f' if=ens33 drv=e1000 unused=uio_pci_generic *Active*
0000:03:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens192 drv=vmxnet3 unused=uio_pci_generic
0000:0b:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens192 drv=vmxnet3 unused=uio_pci_generic

[root@c85g151 dperf-main]# /root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -b uio_pci_generic 0000:03:00.0

[root@c85g151 dperf-main]# /root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -s
Network devices using DPDK-compatible driver
============================================
0000:03:00.0 'VMXNET3 Ethernet Controller 07b0' drv=uio_pci_generic unused=vmxnet3

Network devices using kernel driver
===================================
0000:02:01.0 '82545EM Gigabit Ethernet Controller (Copper) 100f' if=ens33 drv=e1000 unused=uio_pci_generic *Active*
0000:0b:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens192 drv=vmxnet3 unused=uio_pci_generic
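If you later need to return the NIC to the kernel, dpdk-devbind.py can also unbind it and rebind the original driver; a sketch using the vmxnet3 device above:

/root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -u 0000:03:00.0
/root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -b vmxnet3 0000:03:00.0
/root/dpdk/dpdk-stable-20.11.4/usertools/dpdk-devbind.py -s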
5-2.dperf client configuration

The following configuration has been modified from the sample config.

cd /root/dpdk/dperf-main
vi test/http/client-cps.conf

[root@c85g151 dperf-main]# vi test/http/client-cps.conf
mode                         client
tx_burst                     128
launch_num                   10
cpu                          0
payload_size                 1400
duration                     120s
cps                          400
cc                           2500
keepalive_request_interval   1ms
port         0000:00:08.0    100.64.12.155   100.64.12.156
client       16.0.0.1        100
server       48.0.0.1        1
listen       80              1

Note
Here are some points I noticed while configuring.

mode Select client or server.
tx_burst No change is required.
launch_num If errors such as Floating point exception appear when the number of CPU cores is increased, try lowering this value in steps such as 10, 6, 3, 1.
cpu Settings such as "0 1" or "0-4" are possible. The number of CPUs must match the number of server IP addresses.
payload_size Minimum 1 byte, maximum 1400 bytes. If the client side is set to 1400, a 1400-byte string is inserted into the HTTP GET request.
duration Because slow_start defaults to 30 seconds, set the server duration about 30 seconds longer than the client's (as in the samples: client 120s, server 150s).
cps Setting this to 90 or below may produce errors, so use 100 or higher.
cc cc is the number of concurrent connections. Increase this value if you also want to run a TPut test.
keepalive_request_interval When cc is large (e.g. 100M), setting this to 30s or 60s lowers the CPU load. For TPut tests, use a small value such as 1ms.
port 1st column: PCIe domain and bsf number. 2nd column: own IP address. 3rd column: GW IP address. (Not shown above) 4th column: GW MAC address.
client 1st column: starting IP address of the HTTP clients. 2nd column: number of IP addresses, up to 254.
server 1st column: starting IP address of the HTTP servers. 2nd column: number of IP addresses. This count must match the number of CPUs; for example, with "cpu 0 1" two CPU cores are assigned, so set the server address count to 2 as well (see the example after this list).
listen 1st column: listening port number. 2nd column: number of ports. For example, if this value is 4, the server listens on TCP 80, 81, 82 and 83. Increasing this number consumes hugepages; if hugepages run short, allocate 8GB or more.
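As a minimal illustration of the cpu/server rule (a single-port fragment with made-up addresses; other keys omitted): four CPU cores means the total server address count must also be 4.

cpu          0 1 2 3
port         0000:07:00.0    100.64.12.155   100.64.12.156
client       16.0.0.1        100
server       48.0.0.1        4
listen       80              1

With two port blocks, the server addresses are split across the ports (2 + 2 in the examples in section 7) so that the total still equals the CPU count.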

For details, refer to the following URL.
dperf/configuration.md at main · baidu/dperf · GitHub

5-3.dperf server configuration

The following configuration has been modified from the sample config.

cd /root/dpdk/dperf-main
vi test/http/server-cps.conf

[root@c85g154 dperf-main]# vi test/http/server-cps.conf
mode                        server
tx_burst                    128
cpu                         0
duration                    150s
payload_size                1400
keepalive                   1
port        0000:00:09.0    100.64.12.156   100.64.12.155
client      16.0.0.1        100
server      48.0.0.1        1
listen      80              1

Note
Only the points that differ from 5-2 are described here.

payload_size This is the HTTP content size. For TPut tests, set the server side to 1400 and the client side to 1. If the client side is also set to 1400, a string is inserted into the GET request, so roughly the same TPut is generated upstream and downstream; I observed that this kept the downstream (server-to-client) TPut from reaching its ceiling, so use server side 1400 and client side 1.
keepalive Enable this ("1") for CC and TPut tests.

For details, refer to the following URL.
dperf/configuration.md at main · baidu/dperf · GitHub

6.Load testing

When generating load, start the programs on the client side and the server side at almost the same time.
Note
In this configuration, each side's GW address is the other side's own IP address, so if the programs are not started at the same time, ARP resolution fails, "bad gateway" is displayed, and the program stops.
This is not a problem if the DUT*5 holds the GW address.
Alternatively, set the MAC address in the 4th column of the port line.
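For reference, a port line with the gateway MAC address in the 4th column would look like the following (the MAC shown is just a placeholder); with the MAC given explicitly, ARP resolution is no longer required at startup.

port         0000:07:00.0    100.64.12.155   100.64.12.156   b8:ce:f6:00:00:01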

6-1.Client Side
cd /root/dpdk/dperf-main
./build/dperf -c test/http/client-cps.conf

<Example output>

[root@c85g151 dperf-main]# ./build/dperf -c test/http/client-cps.conf
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:05:00.0 (socket 0)
EAL: No legacy callbacks, legacy socket not created
socket allocation succeeded, size 0.01GB num 131070

seconds 0                  cpuUsage 0
pktRx   0                  pktTx    0                  bitsRx   0                  bitsTx  0                  dropTx  0
arpRx   0                  arpTx    0                  icmpRx   0                  icmpTx  0                  otherRx 0          badRx 0
synRx   0                  synTx    0                  finRx    0                  finTx   0                  rstRx   0          rstTx 0
synRt   0                  finRt    0                  ackRt    0                  pushRt  0                  tcpDrop 0
skOpen  0                  skClose  0                  skCon    0                  skErr   0
httpGet 0                  http2XX  0                  httpErr  0
ierrors 0                  oerrors  0                  imissed  0

seconds 1                  cpuUsage 0
pktRx   0                  pktTx    0                  bitsRx   0                  bitsTx  0                  dropTx  0
arpRx   0                  arpTx    0                  icmpRx   0                  icmpTx  0                  otherRx 0          badRx 0
synRx   0                  synTx    0                  finRx    0                  finTx   0                  rstRx   0          rstTx 0
synRt   0                  finRt    0                  ackRt    0                  pushRt  0                  tcpDrop 0
skOpen  0                  skClose  0                  skCon    0                  skErr   0
httpGet 0                  http2XX  0                  httpErr  0
ierrors 0                  oerrors  0                  imissed  0

6-2.Server Side
cd /root/dpdk/dperf-main
./build/dperf -c test/http/server-cps.conf

<Example output>

[root@c85g154 dperf-main]# ./build/dperf -c test/http/server-cps.conf
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:05:00.0 (socket 0)
EAL: No legacy callbacks, legacy socket not created
socket allocation succeeded, size 0.78GB num 13107000

seconds 0                  cpuUsage 0
pktRx   0                  pktTx    0                  bitsRx   0                  bitsTx  0                  dropTx  0
arpRx   0                  arpTx    0                  icmpRx   0                  icmpTx  0                  otherRx 0          badRx 0
synRx   0                  synTx    0                  finRx    0                  finTx   0                  rstRx   0          rstTx 0
synRt   0                  finRt    0                  ackRt    0                  pushRt  0                  tcpDrop 0
skOpen  0                  skClose  0                  skCon    0                  skErr   0
httpGet 0                  http2XX  0                  httpErr  0
ierrors 0                  oerrors  0                  imissed  0

seconds 1                  cpuUsage 0
pktRx   0                  pktTx    0                  bitsRx   0                  bitsTx  0                  dropTx  0
arpRx   0                  arpTx    0                  icmpRx   0                  icmpTx  0                  otherRx 0          badRx 0
synRx   0                  synTx    0                  finRx    0                  finTx   0                  rstRx   0          rstTx 0
synRt   0                  finRt    0                  ackRt    0                  pushRt  0                  tcpDrop 0
skOpen  0                  skClose  0                  skCon    0                  skErr   0
httpGet 0                  http2XX  0                  httpErr  0
ierrors 0                  oerrors  0                  imissed  0

7.High-load test configuration examples

This section shows the configurations used to apply the following load in this setup.

TPut:93Gbps
CPS:5M
CC:300M

The configurations use a 2-port NIC, but a single port gave comparable performance.
(The single-port NIC configuration is shown at the end.)

7-1.TPut test

Client Side

[root@c85g151 dperf-main]# cat test/http/client-cps.conf
mode                         client
tx_burst                     128
launch_num                   3
cpu                          0 1 2 3
payload_size                 1
duration                     120s
cps                          500
cc                           10000
keepalive_request_interval   1ms
port         0000:07:00.0    100.64.12.155   100.64.12.156
client       16.0.0.1        200
server       48.0.0.1        2
port         0000:07:00.1    100.64.13.155   100.64.13.156
client       16.0.1.1        200
server       48.0.1.1        2
listen       80              1

Server Side

[root@c85g154 dperf-main]# cat test/http/server-cps.conf
mode                        server
tx_burst                    128
cpu                         0 1 2 3
duration                    150s
payload_size                1400
keepalive                   1
port        0000:07:00.0    100.64.12.156   100.64.12.155
client      16.0.0.1        200
server      48.0.0.1        2
port        0000:07:00.1    100.64.13.156   100.64.13.155
client      16.0.1.1        200
server      48.0.1.1        2
listen      80              1

TPut:93Gbps
f:id:metonymical:20220211172033p:plain
Note
When the client's payload_size was set to 1400 here, bitsRx and bitsTx were both about 74Gbps on both the client and the server.
f:id:metonymical:20220211233850p:plain

7-2.CPS test

Client Side

[root@c85g151 dperf-main]# cat test/http/client-cps.conf
mode                         client
tx_burst                     128
launch_num                   3
cpu                          0 1 2 3
payload_size                 1
duration                     120s
cps                          5.1m
port         0000:07:00.0    100.64.12.155   100.64.12.156
client       16.0.0.1        200
server       48.0.0.1        2
port         0000:07:00.1    100.64.13.155   100.64.13.156
client       16.0.1.1        200
server       48.0.1.1        2
listen       80              1

Server Side

[root@c85g154 dperf-main]# cat test/http/server-cps.conf
mode                        server
tx_burst                    128
cpu                         0 1 2 3
duration                    150s
payload_size                1
port        0000:07:00.0    100.64.12.156   100.64.12.155
client      16.0.0.1        200
server      48.0.0.1        2
port        0000:07:00.1    100.64.13.156   100.64.13.155
client      16.0.1.1        200
server      48.0.1.1        2
listen      80              1

CPS:5M
f:id:metonymical:20220211172212p:plain

7-3.CC test

Client Side

[root@c85g151 dperf-main]# cat test/http/client-cps.conf
mode                         client
tx_burst                     128
launch_num                   3
cpu                          0 1 2 3 4 5
payload_size                 1
duration                     1800s
cps                          1m
cc                           300m
keepalive_request_interval   60s
port         0000:07:00.0    100.64.12.155   100.64.12.156
client       16.0.0.1        200
server       48.0.0.1        3
port         0000:07:00.1    100.64.13.155   100.64.13.156
client       16.0.1.1        200
server       48.0.1.1        3
listen       80              4
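For reference, at cps 1m it takes roughly 300m / 1m = 300 seconds just to ramp up to 300M open connections, which is why duration is set to 1800s here; the 60s keepalive_request_interval keeps the per-connection request rate, and therefore the CPU load, low while the connections are held open.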

Server Side

[root@c85g154 dperf-main]# cat test/http/server-cps.conf
mode                        server
tx_burst                    128
cpu                         0 1 2 3 4 5
duration                    1800s
payload_size                1
keepalive                   1
port        0000:07:00.0    100.64.12.156   100.64.12.155
client      16.0.0.1        200
server      48.0.0.1        3
port        0000:07:00.1    100.64.13.156   100.64.13.155
client      16.0.1.1        200
server      48.0.1.1        3
listen      80              4

CC:300M
f:id:metonymical:20220211172302p:plain

7-4.Configuration example with a single-port NIC

As an example, the TPut test configuration from 7-1 is changed to a single-port configuration.
The second port block is commented out with #, and the server address count on the remaining port is changed from 2 to 4.

Client Side

[root@c85g151 dperf-main]# cat test/http/client-cps.conf
mode                         client
tx_burst                     128
launch_num                   3
cpu                          0 1 2 3
payload_size                 1
duration                     120s
cps                          500
cc                           10000
keepalive_request_interval   1ms
port         0000:07:00.0    100.64.12.155   100.64.12.156
client       16.0.0.1        200
server       48.0.0.1        4
#port         0000:07:00.1    100.64.13.155   100.64.13.156
#client       16.0.1.1        200
#server       48.0.1.1        2
listen       80              1

Server Side

[root@c85g154 dperf-main]# cat test/http/server-cps.conf
mode                        server
tx_burst                    128
cpu                         0 1 2 3
duration                    150s
payload_size                1400
keepalive                   1
port        0000:07:00.0    100.64.12.156   100.64.12.155
client      16.0.0.1        200
server      48.0.0.1        4
#port        0000:07:00.1    100.64.13.156   100.64.13.155
#client      16.0.1.1        200
#server      48.0.1.1        2
listen      80              1

That's all.

8.Finally

We referred to the following website.
GitHub - baidu/dperf: dperf is a DPDK based 100Gbps network performance and load testing software.

dperf was released only recently, but it can generate heavy load with simple configuration, so I think it is a load testing tool that is likely to attract attention.

When using Cisco TRex in ASTF mode, I ran into issues such as unstable connection establishment during CPS tests and TPut not rising as expected for HTTP traffic.
With dperf, however, I was able to generate TPut, CPS, and CC load stably.

For this reason, I plan to use dperf alongside TRex, especially for TCP and HTTP traffic.*6

Besides the settings introduced here, more detailed configuration is also possible, such as using socket_mem to run the client and server processes simultaneously on the same server, so I would like to experiment further.

*1:Intel NICs, vmxnet3, and so on

*2:However, you may not get the expected performance, so a bare-metal environment is recommended if you want to generate heavy load.

*3:CentOS 8 has reached end of life, so the repository configuration has been changed from "mirror.centos.org" to "vault.centos.org".

*4:The bsf number is also needed when configuring dperf, so make a note of which bsf number's NIC you bound.

*5:Short for Device Under Test; the device being measured or tested.

*6:TRex can load pcap files, so it is very useful for UDP traffic.

How to set up Scalable Function with vdpa for VMs

This article describes how to configure Scalable Function (referred to below as "SF") on CentOS8.
SF is a technology implemented using the subfunction support in the Linux kernel; a similar technology is Intel Scalable IOV.*1

This article describes how to set up inter-VM communication using SF instead of SR-IOV VFs, attaching the SFs to the VMs via the vhost_vdpa module.

For more information on SF, please refer to Home · Mellanox/scalablefunctions Wiki · GitHub.

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection

1.Overview

1-1.Environment
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.4(2105)
Kernel                           : 5.13.4-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.4-1.0.3.0
qemu-kvm                         : v6.1.0-rc0
ovs                              : v2.14.1
libvirtd                         : v7.5.0
1-2.Overall flow

Advance Preparation
Building the Kernel
Build the related application
Enable switchdev mode
Configure ovs and VM
Operation check

1-3.Overall structure

Loop connection using a DAC(Direct Attached Cable). *2
fig.1
f:id:metonymical:20210727105016p:plain
fig.1 is simplified and omits the internal architecture, so in reality the configuration looks more like the following.

fig.2
f:id:metonymical:20210727104945p:plain

Quoted from Red Hat's Blog
vDPA kernel framework part 3: usage for VMs and containers

The orange dotted lines (A) and (B) correspond to fig.1 and fig.2, respectively.
Furthermore, in fig.2, the actual traffic flow is described in blue and red letters. *3

2.Advance preparation

Although not described here, SELinux has been disabled, the firewall has been disabled, and NTP time synchronization has been configured in advance.

2-1.Install the Mellanox driver (OFED)

You can download the iso file from the Mellanox website: Mellanox Download Site
Please save the downloaded iso file to /root/tmp/.
The following command will install the Mellanox driver, but it will also install ovs v2.14.1 at the same time.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.4-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot
2-2.Prepare the Scalable Function

Use the mst tool to configure the NIC settings at the H/W level.
First, use the lshw command to check the bsf number of the NIC (0000:07:00.0, 0000:07:00.1).

lshw -businfo -c network
mst start
mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
mlxconfig -d 0000:07:00.1 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  ens1f0     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1     network        MT2892 Family [ConnectX-6 Dx]

[root@c84g155 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

[root@c84g155 ~]# mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

Device #1:
----------

Device type:    ConnectX6DX
Name:           MCX623106AS-CDA_Ax
Description:    ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Secure Boot; No Crypto
Device:         0000:07:00.0

Configurations:                              Next Boot       New
         PF_BAR2_ENABLE                      False(0)        False(0)
         PER_PF_NUM_SF                       True(1)         True(1)
         PF_TOTAL_SF                         236             236
         PF_SF_BAR_SIZE                      10              10
         SRIOV_EN                            True(1)         True(1)
         NUM_OF_VFS                          16              16
         NUM_PF_MSIX_VALID                   False(0)        False(0)
         PF_NUM_PF_MSIX                      1024            1024

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

0000:07:00.1 ens2f1 should be set in the same way.

Note
SRIOV_EN=1 and NUM_OF_VFS=16 are SR-IOV settings; they are not used this time, so you do not need to set them.

After completing the settings, reboot the system.

reboot

3.Building the Kernel

Kernel 5.13 or later is required. In addition, the SF modules and related options need to be enabled, so we build the kernel from source.

3-2.Installing epel, etc
dnf -y install epel-release && \
dnf -y install http://repo.okay.com.mx/centos/8/x86_64/release/okay-release-1-5.el8.noarch.rpm && \
dnf -y --enablerepo=powertools install dwarves libbpf-devel
3-3.Install Kernel nosrc.rpm and download the source
cd /usr/src && \
wget https://elrepo.org/linux/kernel/el8/SRPMS/kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm && \
dnf -y localinstall kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm

wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.13.4.tar.xz && \
cp linux-5.13.4.tar.xz /root/rpmbuild/SOURCES && \
rpmbuild -bp /root/rpmbuild/SPECS/kernel-ml-5.13.spec
3-4.Running builddep
dnf builddep -y kernel
3-5.Prepare for make
cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/ && \
make mrproper && \
make oldconfig

The "make oldconfig" command will prompt you to enable/disable/modulate each module, but it will output a huge amount of items, so keep pressing Enter.
Note
Before running the "make oldconfig" command, the existing config is first copied in with "cp /boot/config-4.18.0-305.3.1.el8.x86_64 .config".
Then, for configuration items that exist in both Kernel 4.18 and Kernel 5.13, the Kernel 4.18 settings are inherited.
The items newly added in Kernel 5.13 take their default values, so we configure them individually with the "make menuconfig" command.

3-6.Configuration by "make menuconfig"
make menuconfig

After executing the above command, the following screen will appear.
f:id:metonymical:20210726171324p:plain
Enable/modify the following modules.

MLX5_ESWITCH
MLX5_SF
VDPA

Make sure that the following items are set to [=y] or [=m], and if they are set to [=n], change the setting to [=m].*4
The configuration items correspond to the strings described in "Prompt". (e.g. Mellanox Technologies MLX5 SRIOV E-Switch support)
Please follow "Location" for the hierarchy of configuration items. (e.g. -> Device Drivers -> Network device support)

MLX5
│ Symbol: MLX5_ESWITCH [=y]                                                                                             │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:71                                                         │
│   Prompt: Mellanox Technologies MLX5 SRIOV E-Switch support                                                           │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE_EN [=y] && NET_SWITCHDEV [=y] │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│           -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│ (9)         -> Mellanox 5th generation network adapters (ConnectX series) Ethernet support (MLX5_CORE_EN [=y])        │

│ Symbol: MLX5_SF [=y]                                                                                                  │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:217                                                        │
│   Prompt: Mellanox Technologies subfunction device support using auxiliary device                                     │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE [=m] && MLX5_CORE_EN [=y]     │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│ (1)       -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│                                                                                                                       │
│                                                                                                                       │
│ Symbol: MLX5_SF_MANAGER [=y]                                                                                          │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:226                                                        │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_SF [=y] && MLX5_ESWITCH [=y]       │

VDPA
│ Symbol: VDPA [=m]                                          │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:2                          │
│   Prompt: vDPA drivers                                     │
│   Depends on: NET [=y]                                     │
│   Location:                                                │
│ (1) -> Device Drivers                                      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA [=y]                                     │
│ Type  : bool                                               │
│ Defined at drivers/vdpa/Kconfig:45                         │
│   Depends on: VDPA [=m]                                    │
│ Selects: VHOST_IOTLB [=m]                                  │
│ Selected by [m]:                                           │
│   - MLX5_VDPA_NET [=m] && VDPA [=m] && MLX5_CORE [=m]      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA_NET [=m]                                 │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:53                         │
│   Prompt: vDPA driver for ConnectX devices                 │
│   Depends on: VDPA [=m] && MLX5_CORE [=m]                  │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (2)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: MLX5_VDPA [=y]                                    │
│                                                            │
│                                                            │
│ Symbol: VHOST_VDPA [=m]                                    │
│ Type  : tristate                                           │
│ Defined at drivers/vhost/Kconfig:64                        │
│   Prompt: Vhost driver for vDPA-based backend              │
│   Depends on: VHOST_MENU [=y] && EVENTFD [=y] && VDPA [=m] │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (6)   -> VHOST drivers (VHOST_MENU [=y])                   │
│ Selects: VHOST [=m] && IRQ_BYPASS_MANAGER [=m]             │
│                                                            │
│                                                            │
│ Symbol: VP_VDPA [=m]                                       │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:63                         │
│   Prompt: Virtio PCI bridge vDPA driver                    │
│   Depends on: VDPA [=m] && PCI_MSI [=y]                    │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (8)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: VIRTIO_PCI_LIB [=y]                               │

After completing the configuration, save the file.
The configuration file will be saved to the following path.

/root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config
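As a quick, hedged check that the required symbols actually ended up enabled in the saved .config (symbol names as listed above):

cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/
grep -E 'CONFIG_(MLX5_ESWITCH|MLX5_SF|VDPA|MLX5_VDPA_NET|VHOST_VDPA)=' .config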
3-7.Modify the config file
sed -i -e 's/certs\/rhel.pem//g' /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config

The above command modifies the ".config" file as follows. If you do not do this, make will fail. *5

Before
CONFIG_SYSTEM_TRUSTED_KEYS="certs/rhel.pem"
After
CONFIG_SYSTEM_TRUSTED_KEYS=""
3-8.Build the Kernel
LOCALVERSION=-SFvDPA make -j32 && \
make modules_install && \
make install

Note

-SFvDPA This string is appended to the kernel version and displayed when the OS starts. e.g. CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA)
-j32 Specifies the number of parallel make jobs to speed up the build. A good rule of thumb is about twice the number of CPU cores (see the example below).
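If you prefer not to hard-code the job count, a common pattern (assuming the coreutils nproc command is available) is:

LOCALVERSION=-SFvDPA make -j$(($(nproc) * 2))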
3-9.Startup Kernel Configuration
grubby --default-index
grubby --default-kernel

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-index
0
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-kernel
/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

Normally, it should be set as above.
If it is not, please check with the following command and set it to boot on the new Kernel.

Check the Index number
grubby --info=ALL

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --info=ALL
index=0
kernel="/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.13.4-1.el8.x86_64-SFvDPA.img $tuned_initrd"
title="CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA"
index=1
kernel="/boot/vmlinuz-4.18.0-305.3.1.el8.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-4.18.0-305.3.1.el8.x86_64.img $tuned_initrd"
title="CentOS Linux (4.18.0-305.3.1.el8.x86_64) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-4.18.0-305.3.1.el8.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57.img"
title="CentOS Linux (0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-0-rescue"

Startup Kernel Settings
grubby --set-default-index=0

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --set-default-index=0
The default is /boot/loader/entries/f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA.conf with index 0 and kernel /boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

Reboot the system to start with the built Kernel.

reboot

4.Build the related application

Build and install qemu, libvirtd, and iproute2-next.

4-1.Enabling various repositories
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo && \
dnf -y install https://pkgs.dyn.su/el8/base/x86_64/raven-release-1.0-1.el8.noarch.rpm && \
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/raven.repo
4-2.Install the necessary packages
dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel rpcgen python3-docutils \
gnutls gnutls-devel libxml2-devel yajl-devel libtirpc-devel libudev-devel libpciaccess-devel \
usbredir-devel spice-server-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y install https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/meson-0.55.3-3.el8.noarch.rpm
4-3.Building qemu
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.1.0-rc0 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install

Checking Version after Installation

/usr/local/bin/qemu-system-x86_64 --version

[root@c84g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 6.0.90 (v6.1.0-rc0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.Building libvirtd
cd /usr/src && \
git clone https://github.com/libvirt/libvirt.git && \
cd libvirt/ && \
git checkout v7.5.0 && \
meson build && \
ninja -C build && \
ninja -C build install

Checking Version after Installation

libvirtd -V

[root@c84g155 ~]# libvirtd -V
libvirtd (libvirt) 7.5.0
4-5.Change qemu execution path
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-6.Change the user to run qemu
vi /etc/libvirt/qemu.conf

user = "root"  #comment out
group = "root"  #comment out
4-7.Change libvirt-sock settings

Add the following lines (the mkdir and ln -s lines are the additions)

vi /etc/rc.local

touch /var/lock/subsys/local
sleep 1
mkdir /var/run/libvirt
ln -s /var/local/run/libvirt/libvirt-sock /var/run/libvirt/libvirt-sock
sleep 1
exit 0

Granting execution permissions

chmod +x /etc/rc.d/rc.local
4-8.Building iproute2-next
cd /usr/src && \
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git iproute2-next && \
cd iproute2-next && \
./configure --include_dir /usr && \
make -j all && \
make install

Checking Version after Installation

devlink -V

[root@c84g155 ~]# devlink -V
devlink utility, iproute2-5.13.0

reboot once again.

reboot

5.Enabling switchdev mode

5-1.Changing the operating mode
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev

Note
0000:07:00.0 and 0000:07:00.1 are the bsf numbers of the NICs, as confirmed with the lshw command.

5-2.Checking the operation mode

Noteworthy points are indicated in red text.

devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
devlink port show

[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
5-3.Configure the Scalable Function
devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
devlink port function set pci/0000:07:00.0/32768 state active
devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
devlink port function set pci/0000:07:00.1/98304 state active

[root@c84g155 ~]# devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
pci/0000:07:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 state active
[root@c84g155 ~]# devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
pci/0000:07:00.1/98304: type eth netdev eth0 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 state active
5-4.Check the Scalable Function.

Noteworthy parts are shown in red text.

lshw -businfo -c network
devlink port show en7f0pf0sf11 -jp
devlink port show en7f1pf1sf12 -jp

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device        Class          Description
==========================================================
pci@0000:04:00.0  ens1f0        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.0  en7f0pf0sf11  network        Ethernet interface
pci@0000:07:00.1  en7f1pf1sf12  network        Ethernet interface
[root@c84g155 ~]# devlink port show en7f0pf0sf11 -jp
{
    "port": {
        "pci/0000:07:00.0/32768": {
            "type": "eth",
            "netdev": "en7f0pf0sf11",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 11,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:11",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
[root@c84g155 ~]# devlink port show en7f1pf1sf12 -jp
{
    "port": {
        "pci/0000:07:00.1/98304": {
            "type": "eth",
            "netdev": "en7f1pf1sf12",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 1,
            "sfnum": 12,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:12",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Note
At this stage, we have completed the configuration of SF, but we have not bound SF to the vdpa device.
Therefore, in the next section, we will enable the vdpa module, unbind the existing driver, and add the vdpa device.
This will be the main difference from the SR-IOV VF.

6.Configure ovs and VM

6-1.Overall Flow - Overview -

Configure the settings in the order (1)-(9) described in fig.1 below.
fig.1
f:id:metonymical:20210727105016p:plain

  1. Enable the vdpa module and add vdpa devices: (1)
  2. Initial configuration of ovs
  3. Configuration of br30-ovs: (2)(3)(4)
  4. Configuration of br31-ovs: (5)(6)(7)
  5. Configure and start virtual machine c77g153: (8)
  6. Configure and start virtual machine c77g159: (9)
6-2.Overall flow - Commands only -

We will run the following commands.
Detailed explanations will follow, but if you don't need the explanations, just execute the commands.

1.Enable the vdpa module and add vdpa devices
(1)
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

2.Initial configuration of ovs
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

3.Configuration of br30-ovs
(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11

4.Configuration of br31-ovs
(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

5.Configure and start virtual machine c77g153
(8)
virsh edit c77g153
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>

6.Configure and start virtual machine c77g159
(9)
virsh edit c77g159
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>
6-3.Enable the vdpa module and add vdpa devices : (1)

Enable the vdpa module, unbind the existing drivers, and add vdpa devices.

modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

Enable the vdpa module
Enable the vhost_vdpa module
Unbind the rdma driver of en7f0pf0sf11
Unbind the eth driver of en7f0pf0sf11
Unbind the rdma driver of en7f1pf1sf12
Unbind the eth driver of en7f1pf1sf12
Add the vdpa0 device
Add the vdpa1 device

Make sure you have the following settings.
Noteworthy parts are shown in red text.

devlink dev show
devlink port show
vdpa mgmtdev show
vdpa dev show
ls -Fal /dev/
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c84g155 ~]# devlink dev show
pci/0000:07:00.0
pci/0000:07:00.1
auxiliary/mlx5_core.sf.2
auxiliary/mlx5_core.sf.3
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.0/32768: type eth netdev en7f0pf0sf11 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:11 state active opstate attached
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
pci/0000:07:00.1/98304: type eth netdev en7f1pf1sf12 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:12 state active opstate attached
[root@c84g155 ~]# vdpa mgmtdev show
auxiliary/mlx5_core.sf.2:
  supported_classes net
auxiliary/mlx5_core.sf.3:
  supported_classes net
[root@c84g155 ~]# vdpa dev show
vdpa0: type network mgmtdev auxiliary/mlx5_core.sf.2 vendor_id 5555 max_vqs 16 max_vq_size 256
vdpa1: type network mgmtdev auxiliary/mlx5_core.sf.3 vendor_id 5555 max_vqs 16 max_vq_size 256
[root@c84g155 ~]# ls -Fal /dev/
total 0
drwxr-xr-x  22 root root          3660 Jul 27 09:45 ./
dr-xr-xr-x. 17 root root           244 Jul 26 13:17 ../
crw-r--r--   1 root root       10, 235 Jul 27 09:41 autofs
============ s n i p ============
crw-------   1 root root       10, 137 Jul 27 09:41 vhci
crw-------   1 root root       10, 238 Jul 27 09:48 vhost-net
crw-------   1 root root      240,   0 Jul 27 09:45 vhost-vdpa-0
crw-------   1 root root      240,   1 Jul 27 09:45 vhost-vdpa-1
[root@c84g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Jul 27 09:45 ./
drwxr-xr-x 3 root root    0 Jul 27 09:45 ../
--w------- 1 root root 4096 Jul 27 09:45 bind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jul 27 09:45 uevent
--w------- 1 root root 4096 Jul 27 09:45 unbind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.0/mlx5_core.sf.2/vdpa0/
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.1/mlx5_core.sf.3/vdpa1/

From the above output results, we can confirm the following

  • 0000:07:00.0, 0000:07:00.1 and auxiliary/mlx5_core.sf.2, auxiliary/mlx5_core.sf.3 are recognized as devlink devices.
  • pci/0000:07:00.0/32768 is recognized as en7f0pf0sf11 (as a devlink port).
  • auxiliary/mlx5_core.sf.2 and auxiliary/mlx5_core.sf.3 are recognized as vdpa management devices.
  • vdpa device auxiliary/mlx5_core.sf.2 is recognized as vdpa0.
  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices.
  • 0000:07:00.0/mlx5_core.sf.2/vdpa0 and 0000:07:00.1/mlx5_core.sf.3/vdpa1 are controlled by the vhost_vdpa driver
6-4.Initial configuration of ovs

Since ovs has already been installed, start the service from systemctl.*6

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

Start the ovs service
HW offload and tc-policy configuration
Restart the ovs service (to reflect the above settings)

Use the following command to check the settings.

ovs-vsctl get Open_vSwitch . other_config

[root@c84g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}

Note 1:
Here is a supplementary explanation of other_config:tc-policy.
The following options can be set for tc-policy.

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software
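For example, to require that every rule is offloaded to hardware (and fail if it cannot be), you could switch the policy to skip_sw and restart ovs; this is just a sketch and is not used in this article:

ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw
systemctl restart openvswitch
ovs-vsctl get Open_vSwitch . other_config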

Note 2:
If you want to remove the configuration, execute the command as follows.
"hw-offload" is the key, so specify any key you want to delete, such as "tc-policy" or "tc-policy".

ovs-vsctl remove Open_vSwitch . other_config tc-policy
6-5.Configuration of br30-ovs : (2)(3)(4)

Create the first bridge.

(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11

(2) Create a bridge
(3) Create the uplink (specify PF and set the interface for the external NW)
(4) Create the downlink (specify the SF representor and set up the interface for the VM)

Check the settings with the following command.

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    ovs_version: "2.14.1"
6-6.Configuration of br31-ovs : (5)(6)(7)

Create the second bridge.

(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

Same as (2), (3), and (4).

Check the settings with the following command. The br31-ovs entries are the newly added part.

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    Bridge br31-ovs
        Port br31-ovs
            Interface br31-ovs
                type: internal
        Port ens2f1
            Interface ens2f1
        Port en7f1pf1sf12
            Interface en7f1pf1sf12
    ovs_version: "2.14.1"
6-7.Configure and start virtual machine c77g153 : (8)

Please upload the qcow2 file to "/var/lib/libvirt/images/".
In this article, the qcow2 file with CentOS7.7 installed was prepared beforehand.
Additionally, once you have created a virtual machine with virt-manager, you will edit it with the "virsh edit" commands.

Login to the host OS via VNC or other means, and start virt-manager.
When creating a new virtual machine, delete the following [1]-[5] devices.*7
f:id:metonymical:20210412213049j:plain
After booting the VM, shut it down once.
After shutdown, the device configuration should look like the following.
The NICs listed here are not used for vDPA, but they allow you to ssh into the VM, so assign a management IP address to them if needed.
f:id:metonymical:20210412212616j:plain

After shutdown, use the virsh edit command to perform the following settings.

(8)
virsh edit c77g153

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
6-8.Configure and start virtual machine c77g159 : (9)

Same as 6-7, except /dev/vhost-vdpa-1.

(9)
virsh edit c77g159

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>
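Before booting, you can optionally confirm that libvirt accepted the vdpa interfaces (a hedged check; the grep pattern is only for readability):

virsh dumpxml c77g153 | grep -A2 "interface type='vdpa'"
virsh dumpxml c77g159 | grep -A2 "interface type='vdpa'"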

7.Operation check

7-1.advance preparation

Prepare five consoles on hostOS c84g155.

ConsoleA: tail -f /var/log/messages (to watch the logs that should be checked at VM startup)
ConsoleB: watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs (to check the packet counts for c77g153)
ConsoleC: watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs (to check the packet counts for c77g159)
ConsoleD: virsh start c77g153; virsh console c77g153 (console of virtual machine c77g153)
ConsoleE: virsh start c77g159; virsh console c77g159 (console of virtual machine c77g159)
7-2.Booting the VM

Before starting the VM, please execute the above commands on ConsoleA, B, and C.
Then, start c77g153.
Wait a few seconds and then start c77g159.
Send a ping from c77g153 or c77g159.
As an example, follow fig.1 and execute ping 192.168.30.159 from c77g153.

fig.1
f:id:metonymical:20210727105016p:plain

The following is the output result. Noteworthy points are indicated in red text.
ConsoleA

Jul 27 11:50:49 c84g155 systemd[1]: Starting Virtualization daemon...
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtualization daemon.
Jul 27 11:50:49 c84g155 kvm[3831]: 1 guest now active
Jul 27 11:50:49 c84g155 kvm[3832]: 0 guests now active
Jul 27 11:50:49 c84g155 systemd[1]: Listening on Virtual machine log manager socket.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual machine log manager.
Jul 27 11:50:49 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_set_status:1786:(pid 3806): performing device reset
Jul 27 11:50:49 c84g155 systemd-machined[1192]: New machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual Machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 kernel: cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
Jul 27 11:50:49 c84g155 kvm[3850]: 1 guest now active
Jul 27 11:50:59 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_set_status:1786:(pid 3803): performing device reset
Jul 27 11:50:59 c84g155 systemd-machined[1192]: New machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 systemd[1]: Started Virtual Machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 kvm[3892]: 2 guests now active
Jul 27 11:51:11 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_handle_set_map:475:(pid 3855): memory map update
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00002|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:20 c84g155 kernel: Mirror/redirect action on
Jul 27 11:51:26 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_handle_set_map:475:(pid 3897): memory map update
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br30-ovs is down
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br31-ovs is down
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler5)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler3)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)

ConsoleB

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=135.157s
  port  ens2f0: rx pkts=33, bytes=4723, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=34, bytes=4835, drop=0, errs=0, coll=0
           duration=135.163s
  port  en7f0pf0sf11: rx pkts=34, bytes=4699, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=167, bytes=16515, drop=0, errs=0, coll=0
           duration=135.161s

ConsoleC

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=140.882s
  port  ens2f1: rx pkts=34, bytes=4835, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33, bytes=4723, drop=0, errs=0, coll=0
           duration=140.875s
  port  en7f1pf1sf12: rx pkts=33, bytes=4591, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=168, bytes=16627, drop=0, errs=0, coll=0
           duration=140.887s

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
64 bytes from 192.168.30.159: icmp_seq=1 ttl=64 time=123 ms
64 bytes from 192.168.30.159: icmp_seq=2 ttl=64 time=0.809 ms
64 bytes from 192.168.30.159: icmp_seq=3 ttl=64 time=0.454 ms
64 bytes from 192.168.30.159: icmp_seq=4 ttl=64 time=0.457 ms

--- 192.168.30.159 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.454/25.142/123.539/49.198 ms

Note:

performing device reset: mlx5_vdpa is being initialized by mlx5_core.
memory map update: mlx5_core.sf is performing the memory mapping update for mlx5_vdpa. This is the most important message; communication will never succeed unless this log is output.
tc mirred to Houston: a few lines are output as a temporary error, but this is not a problem.
internal error parsing flow key: ovs reports multicast-related errors, but there is nothing wrong; they can safely be ignored.
en7f0pf0sf11, en7f1pf1sf12: the tx/rx packet counts and byte counts for each port are increasing.
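If you are not sure whether the critical "memory map update" message appeared, one simple way to check after starting the VMs is, for example, the following.

grep "memory map update" /var/log/messages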

8.Finally

We referred to the following websites.
https://github.com/Mellanox/scalablefunctions/wiki
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/devlink/devlink-port.rst?h=v5.12-rc3#n125
https://legacy.netdevconf.info/0x14/pub/slides/45/sf_mgmt_using_devlink_netdevconf_0x14.pdf
https://legacy.netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

We can think of SF as a lightweight SR-IOV VF with excellent scalability.
Of course, it has many more advantages than just that, as you can see from the URLs above.
SF shows its full value when used in conjunction with vDPA, and I do not expect it to take much time for such technologies to be widely adopted as the industry moves toward cloud-native architectures.

Intel Scalable IOV appeared in 2018, and the Sub Function / Scalable Function features of the Linux kernel are relatively new technologies that have emerged in the last few years.
SR-IOV, on the other hand, was introduced in 2007 and, as I recall, gradually gained adoption from around 2010, so it has already been in use for more than ten years.
It is said that low-layer technologies evolve more slowly than higher-layer technologies, but I believe we are now approaching a turning point toward these new technologies.
As an engineer, I think it is necessary to catch such trends quickly and turn them into skills that can be applied at a practical level.

*1:Scalable mode for Intel Scalable IOV can be enabled by adding the kernel parameter intel_iommu=sm_on. In addition, you need to use a NIC that supports the ifcvf driver (Intel SmartNIC N3000 series, N5000 series).

*2:A loop connection is used because a 100Gbps switch or a second server with a 100G NIC may not be available. However, since we believe it is important that packets generated by the VMs are physically sent outside the host, we use the configuration shown in fig.1.

*3:This is a description of what I understand. If the content is incorrect, please point it out.

*4:If it is [=n], change the setting to [=m]. [=m] indicates modularization. In this case, it is possible to enable it with modprobe commands after the OS is booted. Note, however, that if [=n] is set, it will be excluded from the build and cannot be enabled by modprobe commands.

*5:Strictly speaking, when you run "make modules_install", an error will be output.

*6:It was already installed in section 2-1, Installing the Mellanox driver (OFED).

*7:This is because related packages such as spice were not installed when qemu was built, and the virtual machine could not be started without removing these devices. Since this is not directly related to vDPA, we will not discuss how to deal with these issues.

Scalable Function の設定方法 vdpa + VM編

CentOS8によるvDPAを使用したScalable Function(以下、SF)の設定方法について記載しました。
SFはLinux KernelのSub Functionを使用して実装された技術となっており、類似の技術としてIntel Scalable IOVがあります。*1

本ブログでは、vhost vDPAモジュールを使用したVMにおいて、SR-IOV VFの代わりに、SFを使用したVM間通信の設定方法について記載していきます。

SFの詳細はHome · Mellanox/scalablefunctions Wiki · GitHubを参照してください。
以下は関連記事の一覧となります。

vDPA の設定方法 vhost_vdpa + VM編 - Metonymical Deflection
vDPA の設定方法 virtio_vdpa + VM編 - Metonymical Deflection
vDPA の設定方法 vhost_vdpa + kubernetes編 - Metonymical Deflection
vDPA の設定方法 vhost_vdpa + kubernetes + Accelerated Bridge CNI編 - Metonymical Deflection
vDPA の設定方法 番外編 - Metonymical Deflection

1.構成

1-1.環境
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
OS                               : CentOS8.4(2105)
Kernel                           : 5.13.4-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Mellanox OFED                    : v5.4-1.0.3.0
qemu-kvm                         : v6.1.0-rc0
ovs                              : v2.14.1
libvirtd                         : v7.5.0
1-2.全体の流れ

事前準備
Kernelのビルド
関連アプリケーションのビルド
switchdev modeの有効化
ovsとVMの設定
動作確認

1-3.全体構成

DAC(Direct Attached Cable)を使用してループ接続します。 *2
fig.1
f:id:metonymical:20210727105016p:plain
fig.1は簡易的に記載しており、内部アーキテクチャは省略しています。このため、実際は以下の構成をイメージして頂ければと思います。

fig.2
f:id:metonymical:20210727104945p:plain

Red HatのBlogより引用
vDPA kernel framework part 3: usage for VMs and containers

オレンジ色点線の(A)(B)が、それぞれfig.1とfig.2に対応しています。
さらに、fig.2においては、実際のトラフィックフローを青文字と赤文字で記載しています。*3

2.事前準備

特に記載しませんが、SELinux無効化、FW無効化、NTP時刻同期設定は事前に行っています。
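参考までに、あくまで一例ですが、以下のようなコマンドで設定できます(環境に合わせて読み替えてください)。

setenforce 0
sed -i -e 's/^SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
systemctl disable --now firewalld
systemctl enable --now chronyd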

2-1.Mellanoxドライバ(OFED)のインストール

isoファイルはMellanoxのサイトからDLしてください。Mellanox Download Site
DLしたisoファイルは、/root/tmp/に保存してください。
以下のコマンドにて、Mellanoxドライバをインストールしますが、ovs v2.14.1も同時にインストールされます。

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel8.4-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

インストールが完了したら、再起動してください。

reboot
2-2.Scalable Functionの準備

mstツールを使用して、H/WレベルでNICの設定を実施します。
始めにlshwコマンドでNICのbsf番号(0000:07:00.0, 0000:07:00.1)を確認します。

lshw -businfo -c network
mst start
mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024
mlxconfig -d 0000:07:00.1 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  ens1f0     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1     network        MT2892 Family [ConnectX-6 Dx]

[root@c84g155 ~]# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success

[root@c84g155 ~]# mlxconfig -d 0000:07:00.0 set PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=236 PF_SF_BAR_SIZE=10 SRIOV_EN=1 NUM_OF_VFS=16 NUM_PF_MSIX_VALID=0 PF_NUM_PF_MSIX=1024

Device #1:
----------

Device type:    ConnectX6DX
Name:           MCX623106AS-CDA_Ax
Description:    ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0 x16; Secure Boot; No Crypto
Device:         0000:07:00.0

Configurations:                              Next Boot       New
         PF_BAR2_ENABLE                      False(0)        False(0)
         PER_PF_NUM_SF                       True(1)         True(1)
         PF_TOTAL_SF                         236             236
         PF_SF_BAR_SIZE                      10              10
         SRIOV_EN                            True(1)         True(1)
         NUM_OF_VFS                          16              16
         NUM_PF_MSIX_VALID                   False(0)        False(0)
         PF_NUM_PF_MSIX                      1024            1024

 Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.

0000:07:00.1 ens2f1も同様に設定してください。

補足
SRIOV_EN=1 と NUM_OF_VFS=16でSR-IOVの設定を実施していますが、今回は使用しませんので、設定に含めなくても問題ありません。

設定が完了したら、再起動してください。

reboot
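再起動後、設定が反映されているかどうかは、例えば以下のようにqueryサブコマンドで確認できます(0000:07:00.1側も同様です)。

mst start
mlxconfig -d 0000:07:00.0 query | grep -E "PF_TOTAL_SF|PER_PF_NUM_SF|PF_SF_BAR_SIZE"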

3.Kernelビルド

Kernel5.13以上が必須となります。加えて、SFモジュールなどの有効化が必要となるため、ソースからビルドします。

3-2.epelなどのインストール
dnf -y install epel-release && \
dnf -y install http://repo.okay.com.mx/centos/8/x86_64/release/okay-release-1-5.el8.noarch.rpm && \
dnf -y --enablerepo=powertools install dwarves libbpf-devel
3-3.Kernel nosrc.rpmのインストールとソースのダウンロード
cd /usr/src && \
wget https://elrepo.org/linux/kernel/el8/SRPMS/kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm && \
dnf -y localinstall kernel-ml-5.13.4-1.el8.elrepo.nosrc.rpm

wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.13.4.tar.xz && \
cp linux-5.13.4.tar.xz /root/rpmbuild/SOURCES && \
rpmbuild -bp /root/rpmbuild/SPECS/kernel-ml-5.13.spec
3-4.builddepの実施
dnf builddep -y kernel
3-5.makeの準備
cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/ && \
make mrproper && \
make oldconfig

make oldconfig コマンドでは、各モジュールの有効化/無効化/モジュール化のプロンプトが表示されますが、膨大な量の項目が出力されますので、Enterキーを押し続けてください。
補足
make oldconfig コマンドを実行すると、最初に cp /boot/config-4.18.0-305.3.1.el8.x86_64 .config に相当する処理が行われます(現行Kernelのconfigがベースとして読み込まれます)。
そして、Kernel4.18とKernel5.13で設定項目が同じ場合にはKernel4.18の設定が引き継がれます。
さらに、Kernel5.13で新規追加された項目はDefault設定となっているため、この後、make menuconfig コマンドで個別に設定していきます。
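なお、make menuconfigの対話操作を避けたい場合は、カーネルソースに付属するscripts/configスクリプトで同様の設定を行う方法も考えられます。以下はあくまで一例で、依存関係は最後のmake olddefconfigで解決させています。

cd /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/ && \
./scripts/config --enable MLX5_ESWITCH --enable MLX5_SF --enable MLX5_SF_MANAGER && \
./scripts/config --module VDPA --module MLX5_VDPA_NET --module VHOST_VDPA --module VP_VDPA && \
make olddefconfig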

3-6.make menuconfigによる設定
make menuconfig

上記コマンドを実行すると以下の画面が表示されます。
f:id:metonymical:20210726171324p:plain
以下のモジュールを有効化/モジュール化します。

MLX5_ESWITCH
MLX5_SF
VDPA

以下の項目が[=y]もしくは[=m]となっていることを確認の上、[=n]になっていた場合は、[=m]に設定を変更してください。*4
設定項目はPromptに記載の文字列が該当します。(e.g. Mellanox Technologies MLX5 SRIOV E-Switch support)
設定項目の階層はLocationに従ってください。(e.g. -> Device Drivers -> Network device support)

MLX5
│ Symbol: MLX5_ESWITCH [=y]                                                                                             │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:71                                                         │
│   Prompt: Mellanox Technologies MLX5 SRIOV E-Switch support                                                           │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE_EN [=y] && NET_SWITCHDEV [=y] │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│           -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│ (9)         -> Mellanox 5th generation network adapters (ConnectX series) Ethernet support (MLX5_CORE_EN [=y])        │

│ Symbol: MLX5_SF [=y]                                                                                                  │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:217                                                        │
│   Prompt: Mellanox Technologies subfunction device support using auxiliary device                                     │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_CORE [=m] && MLX5_CORE_EN [=y]     │
│   Location:                                                                                                           │
│     -> Device Drivers                                                                                                 │
│       -> Network device support (NETDEVICES [=y])                                                                     │
│         -> Ethernet driver support (ETHERNET [=y])                                                                    │
│ (1)       -> Mellanox devices (NET_VENDOR_MELLANOX [=y])                                                              │
│                                                                                                                       │
│                                                                                                                       │
│ Symbol: MLX5_SF_MANAGER [=y]                                                                                          │
│ Type  : bool                                                                                                          │
│ Defined at drivers/net/ethernet/mellanox/mlx5/core/Kconfig:226                                                        │
│   Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MELLANOX [=y] && MLX5_SF [=y] && MLX5_ESWITCH [=y]       │

VDPA
│ Symbol: VDPA [=m]                                          │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:2                          │
│   Prompt: vDPA drivers                                     │
│   Depends on: NET [=y]                                     │
│   Location:                                                │
│ (1) -> Device Drivers                                      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA [=y]                                     │
│ Type  : bool                                               │
│ Defined at drivers/vdpa/Kconfig:45                         │
│   Depends on: VDPA [=m]                                    │
│ Selects: VHOST_IOTLB [=m]                                  │
│ Selected by [m]:                                           │
│   - MLX5_VDPA_NET [=m] && VDPA [=m] && MLX5_CORE [=m]      │
│                                                            │
│                                                            │
│ Symbol: MLX5_VDPA_NET [=m]                                 │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:53                         │
│   Prompt: vDPA driver for ConnectX devices                 │
│   Depends on: VDPA [=m] && MLX5_CORE [=m]                  │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (2)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: MLX5_VDPA [=y]                                    │
│                                                            │
│                                                            │
│ Symbol: VHOST_VDPA [=m]                                    │
│ Type  : tristate                                           │
│ Defined at drivers/vhost/Kconfig:64                        │
│   Prompt: Vhost driver for vDPA-based backend              │
│   Depends on: VHOST_MENU [=y] && EVENTFD [=y] && VDPA [=m] │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (6)   -> VHOST drivers (VHOST_MENU [=y])                   │
│ Selects: VHOST [=m] && IRQ_BYPASS_MANAGER [=m]             │
│                                                            │
│                                                            │
│ Symbol: VP_VDPA [=m]                                       │
│ Type  : tristate                                           │
│ Defined at drivers/vdpa/Kconfig:63                         │
│   Prompt: Virtio PCI bridge vDPA driver                    │
│   Depends on: VDPA [=m] && PCI_MSI [=y]                    │
│   Location:                                                │
│     -> Device Drivers                                      │
│ (8)   -> vDPA drivers (VDPA [=m])                          │
│ Selects: VIRTIO_PCI_LIB [=y]                               │

設定が一通り完了したら、保存してください。
設定ファイルは以下のパスに保存されます。

/root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config
3-7.configファイルの修正
sed -i -e 's/certs\/rhel.pem//g' /root/rpmbuild/BUILD/kernel-ml-5.13.4/linux-5.13.4-1.el8.x86_64/.config

上記コマンドは、.configファイルを以下のように修正しています。これを実行しないとmakeに失敗します。*5

Before
CONFIG_SYSTEM_TRUSTED_KEYS="certs/rhel.pem"
After
CONFIG_SYSTEM_TRUSTED_KEYS=""
3-8.Kernelのビルド
LOCALVERSION=-SFvDPA make -j32 && \
make modules_install && \
make install

補足

-SFvDPA OS起動時に表示される文字列です。e.g. CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA)
-j32 並列処理させるJob数を指定することによりmakeを高速化します。目安としてはCPUコア数の2倍程度
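なお、Job数をCPUコア数から自動算出したい場合は、あくまで一例ですが、以下のように指定することもできます。

LOCALVERSION=-SFvDPA make -j$(($(nproc) * 2))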
3-9.起動Kernelの設定
grubby --default-index
grubby --default-kernel

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-index
0
[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --default-kernel
/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

本来であれば、上記の設定になっているはずです。
もし、異なっていた場合には、以下のコマンドで確認の上、新Kernelで起動するように設定してください。

Index番号の確認
grubby --info=ALL

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --info=ALL
index=0
kernel="/boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.13.4-1.el8.x86_64-SFvDPA.img $tuned_initrd"
title="CentOS Linux (5.13.4-1.el8.x86_64-SFvDPA) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA"
index=1
kernel="/boot/vmlinuz-4.18.0-305.3.1.el8.x86_64"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-4.18.0-305.3.1.el8.x86_64.img $tuned_initrd"
title="CentOS Linux (4.18.0-305.3.1.el8.x86_64) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-4.18.0-305.3.1.el8.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57"
args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57.img"
title="CentOS Linux (0-rescue-f71e06bfa3c74d2c9b19e85ce38b4b57) 8"
id="f71e06bfa3c74d2c9b19e85ce38b4b57-0-rescue"

起動Kernelの設定
grubby --set-default-index=0

[root@c84g155 linux-5.13.4-1.el8.x86_64]# grubby --set-default-index=0
The default is /boot/loader/entries/f71e06bfa3c74d2c9b19e85ce38b4b57-5.13.4-1.el8.x86_64-SFvDPA.conf with index 0 and kernel /boot/vmlinuz-5.13.4-1.el8.x86_64-SFvDPA

ビルドしたKernelで起動させるため、再起動します。

reboot

4.関連アプリケーションのビルド

qemu, libvirtd, iproute2-nextをビルドしてインストールします。

4-1.各種リポジトリの有効化
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo && \
dnf -y install https://pkgs.dyn.su/el8/base/x86_64/raven-release-1.0-1.el8.noarch.rpm && \
sed -i -e 's/enabled=0/enabled=1/g' /etc/yum.repos.d/raven.repo
4-2.必要なパッケージのインストール
dnf -y install cmake gcc libnl3-devel libudev-devel make numactl numactl-devel \
pkgconfig valgrind-devel pandoc libibverbs libmlx5 libmnl-devel meson ninja-build \
glibc-utils glib2 glib2-devel pixman pixman-devel zlib zlib-devel rpcgen python3-docutils \
gnutls gnutls-devel libxml2-devel yajl-devel libtirpc-devel libudev-devel libpciaccess-devel \
usbredir-devel spice-server-devel && \
wget https://cbs.centos.org/kojifiles/packages/pyelftools/0.26/1.el8/noarch/python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y localinstall python3-pyelftools-0.26-1.el8.noarch.rpm && \
dnf -y install https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/meson-0.55.3-3.el8.noarch.rpm
4-3.qemuのビルド
cd /usr/src && \
git clone https://github.com/qemu/qemu.git && \
cd qemu/ && \
git checkout v6.1.0-rc0 && \
mkdir build && \
cd build/ && \
../configure --enable-vhost-vdpa --target-list=x86_64-softmmu && \
make -j && \
make install

インストール後のVersion確認

/usr/local/bin/qemu-system-x86_64 --version

[root@c84g155 ~]# /usr/local/bin/qemu-system-x86_64 --version
QEMU emulator version 6.0.90 (v6.1.0-rc0)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
4-4.libvirtdのビルド
cd /usr/src && \
git clone https://github.com/libvirt/libvirt.git && \
cd libvirt/ && \
git checkout v7.5.0 && \
meson build && \
ninja -C build && \
ninja -C build install

インストール後のVersion確認

libvirtd -V

[root@c84g155 ~]# libvirtd -V
libvirtd (libvirt) 7.5.0
4-5.qemu実行Pathの変更
mv /usr/libexec/qemu-kvm /usr/libexec/qemu-kvm.org
ln -s /usr/local/bin/qemu-system-x86_64 /usr/libexec/qemu-kvm
4-6.qemu実行ユーザの変更
vi /usr/local/etc/libvirt/qemu.conf

user = "root"  #comment out
group = "root"  #comment out
4-7.libvirt-sockの設定変更

青文字箇所を追記

vi /etc/rc.local

touch /var/lock/subsys/local
sleep 1
mkdir /var/run/libvirt
ln -s /var/local/run/libvirt/libvirt-sock /var/run/libvirt/libvirt-sock
sleep 1
exit 0

実行権限付与

chmod +x /etc/rc.d/rc.local
4-8.iproute2-nextのビルド
cd /usr/src && \
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git iproute2-next && \
cd iproute2-next && \
./configure --include_dir /usr && \
make -j all && \
make install

インストール後のVersion確認

devlink -V

[root@c84g155 ~]# devlink -V
devlink utility, iproute2-5.13.0

ここで、一旦再起動しておきます。

reboot

5.switchdev modeの有効化

5-1.動作モードの変更
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev

補足
0000:07:00.0 は、lshwコマンドで確認したNICのbsf番号です。

5-2.動作モードの確認

注目すべき箇所を赤文字で記載します。

devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1
devlink port show

[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap-mode basic
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
5-3.Scalable Functionの設定
devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
devlink port function set pci/0000:07:00.0/32768 state active
devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
devlink port function set pci/0000:07:00.1/98304 state active

[root@c84g155 ~]# devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
pci/0000:07:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
[root@c84g155 ~]# devlink port function set pci/0000:07:00.0/32768 state active
[root@c84g155 ~]# devlink port add pci/0000:07:00.1 flavour pcisf pfnum 1 sfnum 12
pci/0000:07:00.1/98304: type eth netdev eth0 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 hw_addr CA:FE:C0:FF:EE:12
[root@c84g155 ~]# devlink port function set pci/0000:07:00.1/98304 state active
5-4.Scalable Functionの確認

注目すべき箇所を赤文字で記載します。

lshw -businfo -c network
devlink port show en7f0pf0sf11 -jp
devlink port show en7f1pf1sf12 -jp

[root@c84g155 ~]# lshw -businfo -c network
Bus info          Device        Class          Description
==========================================================
pci@0000:04:00.0  ens1f0        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1        network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4          network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1        network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.0  en7f0pf0sf11  network        Ethernet interface
pci@0000:07:00.1  en7f1pf1sf12  network        Ethernet interface
[root@c84g155 ~]# devlink port show en7f0pf0sf11 -jp
{
    "port": {
        "pci/0000:07:00.0/32768": {
            "type": "eth",
            "netdev": "en7f0pf0sf11",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 11,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:11",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
[root@c84g155 ~]# devlink port show en7f1pf1sf12 -jp
{
    "port": {
        "pci/0000:07:00.1/98304": {
            "type": "eth",
            "netdev": "en7f1pf1sf12",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 1,
            "sfnum": 12,
            "splittable": false,
            "function": {
                "hw_addr": "ca:fe:c0:ff:ee:12",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

補足
現段階でSFの設定は完了しましたが、SFをvdpaデバイスにバインドしていません。
このため、次のセクションでは、vdpaモジュールの有効化、既存ドライバのアンバインド、vdpaデバイスの追加を実施します。
これがSR-IOV VFとは大きく異なるポイントになります。

6.ovs-dpdkとVMの設定

6-1.全体の流れ ~概要~

以下のfig.1に記載されている(1)-(9)の順に設定していきます。
fig.1
f:id:metonymical:20210727105016p:plain

  1. vdpaモジュールの有効化とvdpaデバイスの追加:(1)
  2. ovsの初期設定
  3. br30-ovsの設定:(2)(3)(4)
  4. br31-ovsの設定:(5)(6)(7)
  5. 仮想マシンc77g153の設定:(8)
  6. 仮想マシンc77g159の設定:(9)
6-2.全体の流れ ~コマンドのみ~

以下のコマンドを投入していきます。
詳細な解説は後述しますが、解説が不要な方はコマンドだけを実行してください。

1.vdpaモジュールの有効化とvdpaデバイスの追加
(1)
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

2.ovsの初期設定
systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

3.br30-ovsの設定
(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11
 
4.br31-ovsの設定
(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

5.仮想マシンc77g153の設定
(8)
virsh edit c77g153
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>

6.仮想マシンc77g159の設定
(9)
virsh edit c77g159
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>
6-3.vdpaモジュールの有効化とvdpaデバイスの追加:(1)

vdpaモジュールの有効化、既存ドライバのアンバインド、vdpaデバイスの追加を実施します。

modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
echo mlx5_core.rdma.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.rdma.3/driver/unbind
echo mlx5_core.eth.3 > /sys/bus/auxiliary/devices/mlx5_core.sf.3/mlx5_core.eth.3/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2
vdpa dev add name vdpa1 mgmtdev auxiliary/mlx5_core.sf.3

vdpaモジュール有効化
vhost_vdpaモジュール有効化
en7f0pf0sf11のrdmaドライバのアンバインド
en7f0pf0sf11のethドライバのアンバインド
en7f1pf1sf12のrdmaドライバのアンバインド
en7f1pf1sf12のethドライバのアンバインド
vdpa0デバイスの追加
vdpa1デバイスの追加

以下の設定になっていることを確認します。
注目すべき箇所は赤文字で記載します。

devlink dev show
devlink port show
vdpa mgmtdev show
vdpa dev show
ls -Fal /dev/
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c84g155 ~]# devlink dev show
pci/0000:07:00.0
pci/0000:07:00.1
auxiliary/mlx5_core.sf.2
auxiliary/mlx5_core.sf.3
[root@c84g155 ~]# devlink port show
pci/0000:07:00.0/65535: type eth netdev ens2f0 flavour physical port 0 splittable false
pci/0000:07:00.0/32768: type eth netdev en7f0pf0sf11 flavour pcisf controller 0 pfnum 0 sfnum 11 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:11 state active opstate attached
pci/0000:07:00.1/131071: type eth netdev ens2f1 flavour physical port 1 splittable false
pci/0000:07:00.1/98304: type eth netdev en7f1pf1sf12 flavour pcisf controller 0 pfnum 1 sfnum 12 splittable false
  function:
    hw_addr ca:fe:c0:ff:ee:12 state active opstate attached
[root@c84g155 ~]# vdpa mgmtdev show
auxiliary/mlx5_core.sf.2:
  supported_classes net
auxiliary/mlx5_core.sf.3:
  supported_classes net
[root@c84g155 ~]# vdpa dev show
vdpa0: type network mgmtdev auxiliary/mlx5_core.sf.2 vendor_id 5555 max_vqs 16 max_vq_size 256
vdpa1: type network mgmtdev auxiliary/mlx5_core.sf.3 vendor_id 5555 max_vqs 16 max_vq_size 256
[root@c84g155 ~]# ls -Fal /dev/
total 0
drwxr-xr-x  22 root root          3660 Jul 27 09:45 ./
dr-xr-xr-x. 17 root root           244 Jul 26 13:17 ../
crw-r--r--   1 root root       10, 235 Jul 27 09:41 autofs
============ s n i p ============
crw-------   1 root root       10, 137 Jul 27 09:41 vhci
crw-------   1 root root       10, 238 Jul 27 09:48 vhost-net
crw-------   1 root root      240,   0 Jul 27 09:45 vhost-vdpa-0
crw-------   1 root root      240,   1 Jul 27 09:45 vhost-vdpa-1
[root@c84g155 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Jul 27 09:45 ./
drwxr-xr-x 3 root root    0 Jul 27 09:45 ../
--w------- 1 root root 4096 Jul 27 09:45 bind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jul 27 09:45 uevent
--w------- 1 root root 4096 Jul 27 09:45 unbind
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.0/mlx5_core.sf.2/vdpa0/
lrwxrwxrwx 1 root root    0 Jul 27 09:45 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.1/mlx5_core.sf.3/vdpa1/

上記の出力結果より、以下のことが確認できます。

  • devlinkデバイスとして、0000:07:00.0, 0000:07:00.1, auxiliary/mlx5_core.sf.2, auxiliary/mlx5_core.sf.3 が認識されている
  • devlinkポートとして、pci/0000:07:00.0がen7f0pf0sf11 として認識されている
  • vdpaの管理デバイスとして、auxiliary/mlx5_core.sf.2, auxiliary/mlx5_core.sf.3 が認識されている
  • vdpaデバイスとして、auxiliary/mlx5_core.sf.2がvdpa0 として認識されている
  • /dev/vhost-vdpa-0, /dev/vhost-vdpa-1がvhost_vdpaデバイスとして認識されている
  • 0000:07:00.0/mlx5_core.sf.2/vdpa0, 0000:07:00.1/mlx5_core.sf.3/vdpa1がvhost_vdpaドライバに制御されている
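なお、5-3および本節で実施したSFやvdpaデバイスの設定は、OS再起動後には引き継がれないという理解です。毎回手動で投入するのが手間な場合は、あくまで一例ですが、4-7と同様に/etc/rc.localへ以下のような行を追記しておく方法も考えられます(ポートインデックス32768やauxiliaryデバイス番号mlx5_core.sf.2は環境により異なる場合があるため、適宜読み替えてください。もう一方のSF/vdpa1も同様です)。

vi /etc/rc.local

devlink port add pci/0000:07:00.0 flavour pcisf pfnum 0 sfnum 11
devlink port function set pci/0000:07:00.0/32768 hw_addr CA:FE:C0:FF:EE:11
devlink port function set pci/0000:07:00.0/32768 state active
sleep 1
modprobe vdpa
modprobe vhost_vdpa
echo mlx5_core.rdma.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.rdma.2/driver/unbind
echo mlx5_core.eth.2 > /sys/bus/auxiliary/devices/mlx5_core.sf.2/mlx5_core.eth.2/driver/unbind
vdpa dev add name vdpa0 mgmtdev auxiliary/mlx5_core.sf.2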
6-4.ovsの初期設定

ovsは、既にインストール済みなので*6、systemctlからサービスをスタートします。

systemctl start openvswitch
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=none
systemctl restart openvswitch

ovsサービスの起動
HW offloadとtc-policyの設定
ovsサービスの再起動(上記設定を反映させるため)

以下のコマンドで設定内容を確認します。

ovs-vsctl get Open_vSwitch . other_config

[root@c84g155 ~]# ovs-vsctl get Open_vSwitch . other_config
{hw-offload="true", tc-policy=none}

補足1
other_config:tc-policyについて、補足します。
tc-policyは、以下のオプションが設定可能です。

none adds a TC rule to both the software and the hardware (default)
skip_sw adds a TC rule only to the hardware
skip_hw adds a TC rule only to the software

補足2
設定を削除したい場合は、以下のようにコマンドを実行してください。
hw-offload がキーとなっていますので、tc-policyやhw-offloadなど、削除したい任意のキーを指定してください。

ovs-vsctl remove Open_vSwitch . other_config hw-offload
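なお、hw-offloadが実際に機能しているかどうかは、VM間で通信を発生させた後、あくまで一例ですが、以下のようなコマンドで確認できる場合があります(offloadedなフローが表示されれば、TCルールがHWへオフロードされています)。

ovs-appctl dpctl/dump-flows type=offloaded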
6-5.br30-ovsの設定:(2)(3)(4)

1つ目のブリッジを作成します。

(2)
ovs-vsctl add-br br30-ovs
(3)
ovs-vsctl add-port br30-ovs ens2f0
(4)
ovs-vsctl add-port br30-ovs en7f0pf0sf11

(2)ブリッジの作成
(3)アップリンクの作成(PFを指定し、外部NW向けのインターフェースを設定)
(4)ダウンリンクの作成(SFを指定し、VM向けのインターフェースを設定)

以下のコマンドで設定を確認します。

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    ovs_version: "2.14.1"
6-6.br31-ovsの設定:(5)(6)(7)

2つ目のブリッジを作成します。

(5)
ovs-vsctl add-br br31-ovs
(6)
ovs-vsctl add-port br31-ovs ens2f1
(7)
ovs-vsctl add-port br31-ovs en7f1pf1sf12

(2)(3)(4)と同様です。

以下のコマンドで設定を確認します。青文字が追加された部分です。

[root@c84g155 ~]# ovs-vsctl show
09598355-a1bf-4ce0-9edc-53c04d15ac8a
    Bridge br30-ovs
        Port br30-ovs
            Interface br30-ovs
                type: internal
        Port ens2f0
            Interface ens2f0
        Port en7f0pf0sf11
            Interface en7f0pf0sf11
    Bridge br31-ovs
        Port br31-ovs
            Interface br31-ovs
                type: internal
        Port ens2f1
            Interface ens2f1
        Port en7f1pf1sf12
            Interface en7f1pf1sf12
    ovs_version: "2.14.1"
6-7.仮想マシンc77g153の設定:(8)

/var/lib/libvirt/images/にqcow2ファイルをアップロードしておいてください。
本ブログでは、CentOS7.7をインストールしたqcow2ファイルを予め準備していました。
加えて、一度virt-managerで仮想マシンを作成した後、"virsh edit" コマンドで編集していきます。

VNCなどでホストOSにログインし、virt-managerを起動してください。
新規仮想マシンを作成する際、以下の[1]-[5]のデバイスを削除してください。*7
f:id:metonymical:20210412213049j:plain
VM起動後、一旦shutdownします。
shutdown後、以下のようなデバイス構成になっていればOKです。
ここに記載されているNICはvDPAでは使用しませんが、sshできるようになるため、必要であれば管理用IPをアサインしてください。
f:id:metonymical:20210412212616j:plain

shutdown後、virsh editコマンドで以下の設定を実施します。

(8)
virsh edit c77g153

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </interface>
6-8.仮想マシンc77g159の設定:(9)

/dev/vhost-vdpa-1以外は、6-7と同様です。

(9)
virsh edit c77g159

<devices>
============ s n i p ============
  <interface type='vdpa'>
    <source dev='/dev/vhost-vdpa-1'/>
  </interface>

7.動作確認

7-1.事前準備

ホストOS c84g155で5つのコンソールを準備してください。

ConsoleA:tail -f /var/log/messages(VM起動時に確認すべきログを参照するため)
ConsoleB:watch ovs-ofctl -O OpenFlow14 dump-ports br30-ovs(c77g153側のパケットカウントを確認するため)
ConsoleC:watch ovs-ofctl -O OpenFlow14 dump-ports br31-ovs(c77g159側のパケットカウントを確認するため)
ConsoleD:virsh start c77g153; virsh console c77g153(仮想マシンc77g153のコンソール用)
ConsoleE:virsh start c77g159; virsh console c77g159(仮想マシンc77g159のコンソール用)
7-2.VMの起動

VMを起動する前に、ConsoleA, B, Cでは上記のコマンドを実行しておいてください。
その後、c77g153を起動します。
数秒待ってから、c77g159を起動します。
c77g153 or c77g159からPingを飛ばしてください。
例として、fig.1に従い、c77g153から ping 192.168.30.159 を実行します。

fig.1
f:id:metonymical:20210727105016p:plain
以下、出力結果です。注目する箇所は赤文字で記載します。
ConsoleA

Jul 27 11:50:49 c84g155 systemd[1]: Starting Virtualization daemon...
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtualization daemon.
Jul 27 11:50:49 c84g155 kvm[3831]: 1 guest now active
Jul 27 11:50:49 c84g155 kvm[3832]: 0 guests now active
Jul 27 11:50:49 c84g155 systemd[1]: Listening on Virtual machine log manager socket.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual machine log manager.
Jul 27 11:50:49 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_set_status:1786:(pid 3806): performing device reset
Jul 27 11:50:49 c84g155 systemd-machined[1192]: New machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 systemd[1]: Started Virtual Machine qemu-1-c77g153.
Jul 27 11:50:49 c84g155 kernel: cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
Jul 27 11:50:49 c84g155 kvm[3850]: 1 guest now active
Jul 27 11:50:59 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_set_status:1786:(pid 3803): performing device reset
Jul 27 11:50:59 c84g155 systemd-machined[1192]: New machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 systemd[1]: Started Virtual Machine qemu-2-c77g159.
Jul 27 11:50:59 c84g155 kvm[3892]: 2 guests now active
Jul 27 11:51:11 c84g155 kernel: mlx5_core.sf mlx5_core.sf.2: mlx5_vdpa_handle_set_map:475:(pid 3855): memory map update
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:19 c84g155 ovs-vswitchd[3688]: ovs|00002|odp_util(handler10)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(3),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:11,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.153,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:20 c84g155 kernel: Mirror/redirect action on
Jul 27 11:51:26 c84g155 kernel: mlx5_core.sf mlx5_core.sf.3: mlx5_vdpa_handle_set_map:475:(pid 3897): memory map update
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br30-ovs is down
Jul 27 11:51:29 c84g155 kernel: tc mirred to Houston: device br31-ovs is down
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler5)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(1),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)
Jul 27 11:51:34 c84g155 ovs-vswitchd[3688]: ovs|00001|odp_util(handler3)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_state(0),ct_zone(0),ct_mark(0),ct_label(0),eth(src=ca:fe:c0:ff:ee:12,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=192.168.30.159,dst=224.0.0.22,proto=2,tos=0xc0,ttl=1,frag=no)

ConsoleB

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br30-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=135.157s
  port  ens2f0: rx pkts=33, bytes=4723, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=34, bytes=4835, drop=0, errs=0, coll=0
           duration=135.163s
  port  en7f0pf0sf11: rx pkts=34, bytes=4699, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=167, bytes=16515, drop=0, errs=0, coll=0
           duration=135.161s

ConsoleC

[root@c84g155 ~]# ovs-ofctl -O OpenFlow14 dump-ports br31-ovs
OFPST_PORT reply (OF1.4) (xid=0x2): 3 ports
  port LOCAL: rx pkts=0, bytes=0, drop=13, errs=0, frame=0, over=0, crc=0
           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
           duration=140.882s
  port  ens2f1: rx pkts=34, bytes=4835, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=33, bytes=4723, drop=0, errs=0, coll=0
           duration=140.875s
  port  en7f1pf1sf12: rx pkts=33, bytes=4591, drop=0, errs=0, frame=0, over=0, crc=0
           tx pkts=168, bytes=16627, drop=0, errs=0, coll=0
           duration=140.887s

ConsoleD

[root@c77g153 ~]# ping 192.168.30.159
PING 192.168.30.159 (192.168.30.159) 56(84) bytes of data.
64 bytes from 192.168.30.159: icmp_seq=1 ttl=64 time=123 ms
64 bytes from 192.168.30.159: icmp_seq=2 ttl=64 time=0.809 ms
64 bytes from 192.168.30.159: icmp_seq=3 ttl=64 time=0.454 ms
64 bytes from 192.168.30.159: icmp_seq=4 ttl=64 time=0.457 ms

--- 192.168.30.159 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 0.454/25.142/123.539/49.198 ms

<補足>

performing device reset:mlx5_coreによりmlx5_vdpaが初期化されています。
memory map update:mlx5_core.sfによりmlx5_vdpaのメモリマッピングとアップデートが実行されています。なお、このログが出力されない限り、絶対に通信することはできないため、最重要メッセージとなります。
tc mirred to Houston:一時的なエラーとして数行ほど出力されますが、特に問題ありません。
internal error parsing flow key:ovsがマルチキャスト関連のエラーを出していますが、特に問題はありません。気にしないでください。
en7f0pf0sf11, en7f1pf1sf12:各ポートのtx/rxのパケットカウントとバイトカウントが上昇していることが確認できます。
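このログが出力されたかどうかを後から確認したい場合は、例えば以下のようにgrepで確認できます。

grep "memory map update" /var/log/messages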


以上です。

8.最後に

以下のサイトを参考にさせて頂きました。
https://github.com/Mellanox/scalablefunctions/wiki
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/devlink/devlink-port.rst?h=v5.12-rc3#n125
https://legacy.netdevconf.info/0x14/pub/slides/45/sf_mgmt_using_devlink_netdevconf_0x14.pdf
https://legacy.netdevconf.info/0x14/pub/papers/45/0x14-paper45-talk-paper.pdf
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

SFはスケーラビリティに優れた軽量なSR-IOV VFと考えれば良いのではないかと思います。
もちろん、それだけには留まらない多くの利点があることは、上記URLを参照すれば、ご理解頂けると思います。
また、SFはvDPAと併用することにより、その真価が発揮されますが、クラウドネイティブ化を見据えたこのような技術の普及には、それほど多くの時間は掛からないと予想しています。

Intel Scalable IOVは2018年に登場しており、Linux KernelのSub FunctionやScalable Functionは、ここ数年に登場した比較的新しい技術となります。
一方で、SR-IOVは2007年に登場して、2010年頃から少しづつ普及したと記憶していますが、既に10年以上もの間、利用され続けている技術です。
低レイヤの技術は、上位レイヤの技術と比較して進化の速度が遅いとされていますが、今まさに新技術の転換期に差し掛かっているのだと思います。
私は技術者として、こういったトレンドを素早くキャッチし、実務レベルでも活用可能な技術力に昇華させておく必要があると考えています。

*1:Scalable IOVは、カーネルパラメータにintel_iommu=sm_onを設定することにより、scalable modeを有効化することができます。加えて、 ifcvfドライバに対応したNIC(Intel SmartNIC N3000 series , N5000 series)を使用する必要があります。

*2:対向機器に100Gbpsスイッチや100GNIC搭載サーバが用意できない場合を考慮し、ループ接続としています。但し、VMで生成したパケットが物理的に外部へ送信されることが重要と考えているため、fig.1の構成としています。

*3:私が理解した内容を記載しています。内容が誤っている場合には、ご指摘ください。

*4:[=m]はモジュール化されていることを示します。この場合、OS起動後にmodprobeで有効化することが可能です。しかし、[=n]の場合、ビルド対象から除外されるため、modprobeでも有効化することができませんので、注意してください。

*5:厳密には、make modules_installを実行したとき、エラーが出力されます。

*6:2-1.Mellanoxドライバ(OFED)のインストールで、インストール済みです。

*7:これはqemuのビルド時にspiceなどの関連パッケージをインストールしていないため、これらのデバイスを削除しないと仮想マシンが起動できませんでした。なお、vDPAとは直接関係がないため、これらの対処方法については割愛します。

Windows10 VLAN Interface設定 GUI編

GUIによるWindows10でのVLAN Interface設定方法について記載します。

過去記事Windows10 VLAN Interface設定 - Metonymical DeflectionではPowerShellによる設定方法を記載しました。
しかし、Wired PROSetにIntel PROSet Adapter Configuration Utility(以下、Utility)が同梱され、GUIにてVLAN Interfaceの設定が可能になりましたので、こちらの方法について記載します。

1.構成

1-1.環境
OS                               : Windows10pro Ver:20H2 Build:19042.1052
NIC                              : Intel I350
Driver                           : Intel PROSet 26.3

2.Intel PROSetのDL

以下のIntelサイトよりIntel PROSetをDLします。
「イーサネット製品」を選択の上、絞り込み条件にて「Windows10」を選択してください。
downloadcenter.intel.com

以下の画面でDLします。
f:id:metonymical:20210625081605j:plain

Wired_PROSet_XXX.zipとWired_Driver_XXX.zipが存在しますが、Wired_Driverの方はUtilityが同梱されていないため、Wired_PROSetの方をDLしてください。

3.Intel PROSetのインストール

zipファイル解凍後、exe形式のファイル*1を実行してください。
インストール時の注意点として、セットアップ・オプション時に
「インテル アドバンスト・ネットワークサービス」にチェックを入れてください。
f:id:metonymical:20210625085253p:plain

3-1.Utilityの起動

Utilityを起動します。
f:id:metonymical:20210625085755p:plain
そのまま起動しようとすると管理者権限が無いと実行不可のメッセージが表示されますので、Intel PROSet Adapter Configuration Utilityを右クリック>その他>管理者として実行をクリックします。

以下の画面が起動します。
f:id:metonymical:20210625091830p:plain

4.UtilityによるVLAN設定

左メニューからNICを選択後、チーム化/VLANタブ>新規作成をクリックします。
f:id:metonymical:20210625092040p:plain

新しいVLANの作成画面が表示されますので、VLAN-IDと名前を入力して、「VLANを作成」をクリックします。
f:id:metonymical:20210625092357p:plain

作成が完了すると以下の画面になります。
f:id:metonymical:20210625093122p:plain

  • 初めてVLAN Interfaceを作成する際、私の環境では「VLANを作成」をクリックしてから、元の画面に戻るまでに1~2分程度掛かりました。
  • ステータス:Disableとなっていますが、画面を更新するとEnableに変わりましたので、特に問題ありません。

2つ目以降のVLAN Interfaceを作成する際も同様の手順を繰り返すことにより作成可能です。
以下の例では、初回VLAN300を作成後、2つ目にVLAN301を作成しています。
f:id:metonymical:20210625093433p:plain

5.設定後の確認

作成後、コントロール パネル\ネットワークとインターネット\ネットワーク接続にて、以下のように表示されますので、IPアドレスなどの設定が可能です。
f:id:metonymical:20210625094328p:plain
また、デバイス名にて、

  • Intel(R) I350 Gigabit Network Connection - VLAN:VLAN300
  • Intel(R) I350 Gigabit Network Connection - VLAN:VLAN301

となっている箇所が追加されたVLAN Interfaceとなります。
デフォルトの名前は「イーサネット」となるため、必要に応じて名前を変更してください。*2

6.補足1:PowerShellとの同期

以下の通り、PowerShellの方とも同期が取れています。

PS C:\Users\metaphor_x13> Get-IntelNetAdapter

Location        Name                                                           ConnectionName            LinkStatus
--------        ----                                                           --------------            ----------
0:31:6:0        Intel(R) Ethernet Connection (7) I219-V                        local                     1.00 Gbps ...
5:0:0:0         Intel(R) I350 Gigabit Network Connection                                                 1.00 Gbps ...
5:0:1:0         Intel(R) I350 Gigabit Network Connection #2                                              1.00 Gbps ...


PS C:\Users\metaphor_x13> Get-IntelNetVLAN

VLANID VLANName                                       ParentName
------ --------                                       ----------
300    VLAN300                                        Intel(R) I350 Gigabit Network Connection
301    VLAN301                                        Intel(R) I350 Gigabit Network Connection

7.補足2:仮想マシンのEtherフレームにタグを付ける場合

応用的な使い方として、今回作成したVLAN InterfaceをVMWareWrkProの仮想ネットワークエディタで、以下のVMnet10,VMnet11のようにブリッジさせることも可能です。
f:id:metonymical:20210625100415p:plain
この設定により、仮想マシンから送信されたEtherフレームにVLANタグを付けて、外部NWと疎通することが可能になります。

8.補足3:仮想マシンでタグ付けしたい場合

VLAN作成時、「タグなし」にチェックを入れることでNativeVLANに設定することが可能です。
f:id:metonymical:20210625100340p:plain
しかし、この設定では、トラフィックが外部NWに出る際、必ずVLANタグが剥がされます。
意図的に仮想マシン上でタグ付けしたEtherフレームをタグ付けしたまま外部NWに送信したい場合であっても、VLANタグが剥がされます。*3

このようなケースで不都合が生じる場合は、以下のVLAN aware VMsを参照してください。
metonymical.hatenablog.com
経験則として、VLAN aware VMsを設定した物理ポートではレジストリをいじることになるため、VLAN Interfaceを作成した物理ポートとは別けた方が良いかなと思います。

私の場合、以下のように別けて使用しています。(7.補足2の図を参照)

VMnet9 VLAN aware VMsを設定した物理ポート:Intel(R) I350 Gigabit Network Connection #2
VMnet10,11 VLAN Interfaceを作成した物理ポート:Intel(R) I350 Gigabit Network Connection


以上です。

9.最後に

Intel NICのドライバを更新したら、Intel PROSet Adapter Configuration Utilityが同梱されていたので、記事に起こしました。
PowerShellを使うよりも、取っつきやすくなったかなと思います。

補足に諸々記載しましたが、どこでVLANタグの付け外しが処理されるのかを把握しておくことや自在に設定できることは大切かなと思います。

*1:32bit版であればPROWin32.exeなど

*2:個人的にNIC名に2バイト文字は使用したくない(=思わぬところでトラブルに遭遇した経験がある)ため必ず名前を変更しています。

*3:CiscoのNexus9000vやCSR1000vなどの仮想NW機器を使用する際、Nexus9000v側でVLANタグを付けたまま外部NWと疎通したい場合にVLANタグが剥がされると困るケースがあります。

How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI

This article describes how to configure vDPA (virtio data path acceleration) in a Kubernetes environment using the Accelerated Bridge CNI.*1
For more information on vDPA, please refer to the official Red Hat blog.

The difference from the previous article is that we use the Accelerated Bridge CNI.
With the SR-IOV CNI (previous article), SR-IOV Legacy mode was used, whereas the Accelerated Bridge CNI uses SR-IOV Switchdev mode.
This makes it possible to use a Linux Bridge.

For more information about Accelerated Bridge CNI, please refer to the following URL.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Overview

1-1.Environment
1.ControlPlane
VMWare              : VMware(R) Workstation 15 Pro 15.5.1 build-15018445 

2.Worker
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
Mellanox OFED                    : v5.3-1.0.0.1

3.ControlPlane&Worker common
OS                               : CentOS8.3(2011)
Kernel(ControlPlane)             : 4.18.0-240.el8.x86_64 
Kernel(Worker)                   : 5.12.12-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Kubernetes                       : 1.21.2
Docker-CE                        : 20.10.7
flannel                          : latest
Multus                           : latest
sriov-network-device-plugin      : latest
accelerated-bridge-cni           : latest
1-2.Overall flow
  1. Advance preparation
  2. Kernel update
  3. Build k8s Cluster & flannel
  4. Build and deploy vDPA and Accelerated Bridge CNI related
  5. Configuring Linux Bridge and deploy Pod
  6. Operation check

There is already plenty of documentation covering steps 1-3, so we will skip the non-essential parts.
Steps 4-6 are where this article differs from the previous one.

1-3.Overall structure

A loop connection is made using a DAC (Direct Attached Cable). *2
fig.1
f:id:metonymical:20210621235434j:plain
fig.1 is simplified and omits the internal architecture; the actual configuration is closer to the one shown in fig.2 below.

fig.2
f:id:metonymical:20210621220154j:plain

Quoted from Red Hat's github
github.com

2.Advance preparation

Although not specifically mentioned, SELinux disabling, FW disabling, and NTP time synchronization settings are done in advance.
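As a reference, one minimal example of these preparations is shown below (adjust to your environment).

setenforce 0
sed -i -e 's/^SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
systemctl disable --now firewalld
systemctl enable --now chronyd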

2-1.Swap and Hosts file settings : CP (ControlPlane) & Worker
Disable Swap
vi /etc/fstab

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0
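To turn swap off immediately without rebooting, you can also run the following in addition to commenting out the fstab entry.

swapoff -a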

Hosts file settings
vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePage and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, configure the mount settings for HugePages. They will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
2-3.SR-IOV VF settings : Worker

Configure the SR-IOV VF settings; you can increase the number of VFs, but for the sake of simplicity, we have set the number of VFs to "1". In addition, setting the MAC address is mandatory. *3

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
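To confirm that the VF was created and the MAC address was applied, something like the following can be used after the rc.local script has run (for example, after a reboot).

cat /sys/class/net/ens2f0/device/sriov_numvfs
ip link show ens2f0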
2-4.Install the Mellanox driver (OFED) : Worker

You can download the iso file from the Mellanox website. Mellanox Download Site
Please save the downloaded iso file to /root/tmp/.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0

3.Kernel update : Worker

As of June 11, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *4
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel Version.
Install kernel-headers.
Remove the existing kernel-tools kernel-tools-libs
Install kernel-tools kernel-tools-libs
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System

4.Build k8s Cluster & flannel

4-1.Install Docker : CP&Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing  docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-2.Configuring the k8s repository : CP&Worker
cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
4-3.Install k8s : CP&Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
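If you want to match the versions used in this article (Kubernetes 1.21.x) instead of pulling the latest packages, the versions can be pinned explicitly at install time, for example (adjust the version string to what the repository actually provides):

dnf -y install kubeadm-1.21.2 kubelet-1.21.2 kubectl-1.21.2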
4-4.Configuring Docker : CP&Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
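Whether the cgroup driver change has taken effect can be confirmed with docker info, which should report systemd after the restart:

docker info | grep -i "cgroup driver"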
4-5.Building the k8s Cluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16

Output Example
At the end you will see the following output; copy the kubeadm join command (shown in red on the original page).
It will be used when the Worker joins the CP.

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3 
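If you lose this join command or the token expires (the default lifetime is 24 hours), a new one can be generated on the CP at any time:

kubeadm token create --print-join-command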
4-6.Configuration after building k8s Cluster : CP
Copy the config file
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes

Output Example
Wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS     ROLES                  AGE   VERSION
c83g151.md.jp   Ready      control-plane,master   44s   v1.21.1
4-8.Joining a Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3

Output Example
Workers also wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.1
c83g152.md.jp   Ready    <none>                 44s    v1.21.1
4-9.Change to SR-IOV switchdev mode : Worker
4-9-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) number of the PCI device.
Check the status of 07:00.0 (ens2f0)
Check the status of 07:00.1 (ens2f1)

The output will look like the following

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
4-9-2.Changing the operating mode

Note that the bsf numbers of the two VFs are slightly different (07:00.2 and 07:01.2).*5

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver for VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the PF operation mode to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver of VF.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
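If the bsf numbers of the VFs are unclear on your own system, they can also be looked up from the PF's virtfn symlinks instead of lshw, for example:

readlink -f /sys/class/net/ens2f0/device/virtfn0
readlink -f /sys/class/net/ens2f1/device/virtfn0

Each command prints the sysfs path of the corresponding VF, ending in its PCI address (0000:07:00.2 and 0000:07:01.2 in this environment).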
4-9-3.Check the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Changed to switchdev mode.

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable

VF Representer has been added.

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, make sure that the HW offload function of the NIC is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g152 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g152 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
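If hw-tc-offload is shown as off, it can usually be enabled with ethtool -K (capital K sets the feature, lowercase k only displays it):

ethtool -K ens2f0 hw-tc-offload on
ethtool -K ens2f1 hw-tc-offload on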
4-10.Enabling the vhost_vdpa module : Worker

In section "5. Build and deploy vDPA and Accelerated Bridge CNI related", we will build and deploy the sriov-device-plugin.
At that time, if the vhost_vdpa module is not enabled, the VFs will not be recognized as Worker resources, so enable it beforehand.

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output results, we can confirm the following

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Also, configure it so that it is enabled at OS startup. *6
The switchdev-mode change and the modprobe vhost_vdpa lines (shown in blue on the original page) have been added to the existing rc.local created in section 2-3.

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0
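As in section 2-3, /etc/rc.local must be executable for these commands to run at boot; if you have not already done so, set the executable bit:

chmod +x /etc/rc.d/rc.local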

Reboot the CP and Worker again.

5.Build and deploy vDPA and Accelerated Bridge CNI related

5-1.Install Golang : CP
dnf -y install epel-release golang
5-2.Building the Docker image : CP

In this section, we will build the following two images.*7

  • multus
  • sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-dp
5-3.Copy the Docker image : CP

Copy the built Docker image to the Worker.

/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152

Output Example

[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs
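As the trace above shows, load-image.sh simply saves the image on the CP and pipes it into docker load on the Worker over ssh. If you prefer to do it by hand, the equivalent would be something like:

docker save nfvpe/sriov-device-plugin | ssh root@192.168.11.152 docker load
docker save nfvpe/multus | ssh root@192.168.11.152 docker load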

Checking the Docker Image : CP&Worker

[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED        SIZE
nfvpe/sriov-device-plugin            latest       4bbc4ec2506d   29 hours ago   42.7MB
<none>                               <none>       9d6fcc3268de   29 hours ago   1.26GB
nfvpe/multus                         latest       5b78ae4af974   29 hours ago   331MB
<none>                               <none>       d4742876ce8d   29 hours ago   1.35GB
k8s.gcr.io/kube-apiserver            v1.21.2      106ff58d4308   6 days ago     126MB
k8s.gcr.io/kube-controller-manager   v1.21.2      ae24db9aa2cc   6 days ago     120MB
k8s.gcr.io/kube-scheduler            v1.21.2      f917b8c8f55b   6 days ago     50.6MB
k8s.gcr.io/kube-proxy                v1.21.2      a6ebd1c1ad98   6 days ago     131MB
golang                               alpine3.12   baed0e68a17f   11 days ago    301MB
mellanox/accelerated-bridge-cni      v0.1.0       9adde4aff112   2 weeks ago    11MB
quay.io/coreos/flannel               v0.14.0      8522d622299c   5 weeks ago    67.9MB
alpine                               3.12         13621d1b12d4   2 months ago   5.58MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   5 months ago   683kB
centos                               centos7      8652b9f0cb4c   7 months ago   204MB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   8 months ago   42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago   253MB
5-4.Building the Accelerated Bridge CNI : CP
cd /usr/src && \
git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
cd accelerated-bridge-cni/ && \
make

Output Example

[root@c83g151 vdpa-deployment]# cd /usr/src && \
> git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
> cd accelerated-bridge-cni/ && \
> make
Cloning into 'accelerated-bridge-cni'...
remote: Enumerating objects: 3426, done.
============ s n i p ============
github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge
Building accelerated-bridge...
Done!
## Build executable file
building goveralls...
running coverage tests...
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge   [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/cache        0.006s  coverage: 72.2% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/config       0.006s  coverage: 90.9% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager      0.010s  coverage: 53.7% of statements
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager/mocks        [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/plugin       [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/types        [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/utils        0.006s  coverage: 64.3% of statements
5-5.Deploying yaml files : CP

Deploy the following four files.

  1. /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
  2. /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
  3. /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
  4. /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml

Output Example

[root@c83g151 accelerated-bridge-cni]# kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
> kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
configmap/sriovdp-config created
daemonset.apps/kube-accelerated-bridge-cni-ds-amd64 created
5-6.Checking DaemonSet & ConfigMap & Pod : CP

Check the DaemonSet and ConfigMap that you deployed in section 5-5.

kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod

Output Example

[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get ds
NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-accelerated-bridge-cni-ds-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   12s
kube-flannel-ds                        2         2         2       2            2                                     11m
kube-multus-ds-amd64                   2         2         2       2            2           kubernetes.io/arch=amd64        13s
kube-proxy                             2         2         2       2            2           kubernetes.io/os=linux          13m
kube-sriov-device-plugin-amd64         2         2         2       2            2           beta.kubernetes.io/arch=amd64   13s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      13m
extension-apiserver-authentication   6      13m
kube-flannel-cfg                     2      11m
kube-proxy                           2      13m
kube-root-ca.crt                     1      13m
kubeadm-config                       2      13m
kubelet-config-1.21                  1      13m
multus-cni-config                    1      13s
sriovdp-config                       1      12s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get pod
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-76v4h                     1/1     Running   0          13m
coredns-558bd4d5db-qrn74                     1/1     Running   0          13m
etcd-c83g151.md.jp                           1/1     Running   0          13m
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   0          14s
kube-accelerated-bridge-cni-ds-amd64-mxs92   1/1     Running   0          14s
kube-apiserver-c83g151.md.jp                 1/1     Running   0          13m
kube-controller-manager-c83g151.md.jp        1/1     Running   0          13m
kube-flannel-ds-2n457                        1/1     Running   0          11m
kube-flannel-ds-h9c9r                        1/1     Running   0          11m
kube-multus-ds-amd64-x5cdn                   1/1     Running   0          15s
kube-multus-ds-amd64-zt5pj                   1/1     Running   0          15s
kube-proxy-2585z                             1/1     Running   0          13m
kube-proxy-znl8n                             1/1     Running   0          11m
kube-scheduler-c83g151.md.jp                 1/1     Running   0          13m
kube-sriov-device-plugin-amd64-6cccp         1/1     Running   0          15s
kube-sriov-device-plugin-amd64-k8ktz         1/1     Running   0          15s
5-7.Checking the details of ConfigMap : CP

In the next section, "6. Configuring Linux Bridge and deploy Pod", vdpa_mlx_vhost will be referenced in the NetworkAttachmentDefinition and Pod configurations, so please check it in advance.

cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

Note

resourceName : This can be any name. We will explicitly specify this name in section 6-1.
vendors : The vendor part of the PCI Device ID. 15b3 indicates a Mellanox product.*8
devices : The device part of the PCI Device ID. 101e indicates a VF of the ConnectX-6 Dx.
drivers : Specifies the mlx5_core driver.
vdpaType : Specifies vhost. This option is an extension to the regular SR-IOV Device Plugin.

The PCI Device ID can be checked with the following command.

[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
5-8.Checking Worker Resources : CP

Check that vdpa_mlx_vhost is recognized as a Worker resource.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

Output Example

[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}

"2" is the number of VFs that were recognized.

Note
When this value is "0" or the intel.com/vdpa_mlx_vhost line is not displayed, delete the sriov-device-plugin Pod once.
Since sriov-device-plugin is deployed as a DaemonSet, it will be automatically recreated after the Pod is deleted.
This will cause sriov-device-plugin to attempt to re-register vdpa_mlx_vhost.

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-rg8hm    1/1     Running   0          73m     192.168.11.152   c83g152.md.jp              

[root@c83g151 vdpa-deployment]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-rg8hm
pod "kube-sriov-device-plugin-amd64-rg8hm" deleted

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-kwc5z    1/1     Running   0          3s      192.168.11.152   c83g152.md.jp              

Again, check the Worker resources.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

If it is still not recognized, please refer to section 4-10.
The two points are as follows.

  1. /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 must be recognized as vhost_vdpa devices.
  2. 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 must be controlled by the vhost_vdpa driver.

Unless the above two conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.

6.Configuring Linux Bridge and deploy Pod

6-1.Configuring the NetworkAttachmentDefinition : CP
vi 961nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa30",
  "bridge": "br30ens2f0",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.64",
    "rangeEnd": "192.168.30.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net31
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa31",
  "bridge": "br31ens2f1",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.128",
    "rangeEnd": "192.168.30.191"
  }
}'

kubectl apply -f 961nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io

Output Example

[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   4h24m
vdpa-mlx-vhost-net31   4h24m
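The full rendered definition can also be inspected afterwards; net-attach-def is the short name registered by the Multus CRD, so for example:

kubectl get net-attach-def vdpa-mlx-vhost-net30 -o yaml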

Note

name : Set this value (vdpa-mlx-vhost-net30/31) in the Pod annotations.
k8s.v1.cni.cncf.io/resourceName : Set the resourceName checked in section 5-7. It is also specified in the Pod's resources (requests/limits).
type : This is a required setting and must be explicitly set to "accelerated-bridge" in order to use the Accelerated Bridge CNI.*9
bridge : The name of the Linux Bridge set here determines which Bridge the Pod will connect to.*10
rangeStart & rangeEnd : In this configuration, the address ranges are split between vdpa-mlx-vhost-net30 and vdpa-mlx-vhost-net31 to avoid overlapping assignments.

For more information, please refer to the following URL
accelerated-bridge-cni/configuration-reference.md at master · k8snetworkplumbingwg/accelerated-bridge-cni · GitHub
Note that the Accelerated Bridge CNI configures vlan and trunk differently from the SR-IOV CNI.
We assume this is because a Linux Bridge is used and VLAN tagging is left to the Linux Bridge side.

6-2.Configuring the Linux Bridge : Worker

Configure in the order of (1)-(7) in fig.1.
fig.1
f:id:metonymical:20210621235434j:plain

(1) The settings have already been made in section 4-10. Execute the modprobe vhost_vdpa command.
(2)(3) Create the first Bridge, and set PF ens2f0 as the uplink.
(4)(5) Create the second Bridge, and set PF ens2f1 as the uplink.
(6) This is automatically set when the Pod is deployed. The Accelerated Bridge CNI will configure VF rep ens2f0_0 and ens2f1_0 as downlinks.
(7) Create the Pod in the next section 6-3.
Worker
(1)
modprobe vhost_vdpa

(2)
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
(3)
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

(4)
nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
(5)
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1

CP
(6)(7)See next section 6-3.
kubectl apply -f 161vdpa.yaml
6-2-1.Creating a Linux Bridge : Worker
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
6-2-2.Checking the Linux Bridge : Worker
nmcli con show

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --
6-3.Deploying a Pod : CP

The key point is that the values defined in the Network Attachment Definition are specified for "annotations" and "resources".
Also, there are no settings specific to Accelerated Bridge CNI.

vi 161vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "ca:fe:c0:ff:ee:11"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net31",
        "mac": "ca:fe:c0:ff:ee:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]


kubectl apply -f 161vdpa.yaml
kubectl get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          7s
vdpa-pod02   1/1     Running   0          7s
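Before looking at the Pod details in 6-4, you can also peek at the interfaces from inside the Pod. The centos image may not include the ip command, but sysfs is always available, so for example:

kubectl exec vdpa-pod01 -- ls /sys/class/net
kubectl exec vdpa-pod01 -- cat /sys/class/net/net1/address

The second command should print the MAC address specified in the annotations (ca:fe:c0:ff:ee:11).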
6-4.Check Pod details : CP

In this section, we will check the details of the Pod status by using the kubectl describe command.

kubectl describe pod vdpa-pod01

fig.1
f:id:metonymical:20210621235434j:plain
Make sure that the device-info parameters (/dev/vhost-vdpa-0, 0000:07:00.2) match fig.1.

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Wed, 23 Jun 2021 00:03:55 +0900
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.7
IPs:
  IP:  10.244.1.7
Containers:
  vdpa-single01:
    Container ID:  docker://67e8a82543cd8e4db556e8bd502f42659260e2d50192a5a0f13860d1cefc09ef
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Wed, 23 Jun 2021 00:03:57 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:                 <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gq69j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-gq69j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  39s   multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  39s   multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          39s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         39s   kubelet  Created container vdpa-single01
  Normal  Started         39s   kubelet  Started container vdpa-single01

The following message is output to /var/log/messages of the Worker when the Pod is deployed.

[root@c83g152 ~]# tail -f /var/log/messages

Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.415300    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.419085    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-pod7cba503f_5336_425e_b32c_45869b6f9020.slice.
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-poddd78b5ac_06f3_48f1_bbc9_2443fa88c653.slice.
============ s n i p ============
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f0_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered forwarding state
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7254] device (ens2f0_0): Activation: starting connection 'ens2f0_0' (84f4ad71-e7dc-44ba-8755-2e3d6279b037)
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7258] device (ens2f0_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7273] device (ens2f0_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7298] device (ens2f0_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7303] device (br30ens2f0): bridge port ens2f0_0 was attached
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7304] device (ens2f0_0): Activation: connection 'ens2f0_0' enslaved, continuing activation
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7308] device (ens2f0_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7398] device (ens2f0_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7404] device (ens2f0_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7431] device (ens2f0_0): Activation: successful, device activated.
Jun 21 19:38:42 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 temp_17: renamed from ens2f0v0
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7876] device (ens2f0v0): interface index 17 renamed iface from 'ens2f0v0' to 'temp_17'
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f1_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered forwarding state
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: renamed from temp_17
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: Link up
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:01.2 temp_18: renamed from ens2f1v0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708881    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.4/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f0_0 to the bridge br30ens2f0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.003540    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.65/24] from default/vdpa-mlx-vhost-net30
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: renamed from temp_18
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: Link up
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1334] device (temp_17): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1418] device (ens2f1v0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1466] device (ens2f1_0): Activation: starting connection 'ens2f1_0' (b0db37c4-3214-497b-80e2-c532ea2e54c2)
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1469] device (ens2f1_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1480] device (ens2f1_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1488] device (ens2f1_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1492] device (br31ens2f1): bridge port ens2f1_0 was attached
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1492] device (ens2f1_0): Activation: connection 'ens2f1_0' enslaved, continuing activation
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1497] device (ens2f1_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708792    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.5/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f1_0 to the bridge br31ens2f1
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.150099    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.129/24] from default/vdpa-mlx-vhost-net31
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1710] device (ens2f1_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1716] device (ens2f1_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1739] device (ens2f1_0): Activation: successful, device activated.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223212683+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-1 are ignored in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223313924+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-0 are ignored in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223395596+09:00" level=warning msg="path in container /dev/vhost-vdpa-0 already exists in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223319641+09:00" level=warning msg="path in container /dev/vhost-vdpa-1 already exists in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
============ s n i p ============
Jun 21 19:38:53 c83g152 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
6-5.Check the Mac address of the Worker : Worker

Check that the MAC address specified in the annotations of the Pod is reflected in the VF of the Worker.

ip link show ens2f0

Output Example

[root@c83g152 ~]# ip link show ens2f0
15: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br30ens2f0 state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
6-6.Check the Linux Bridge of the Worker : Worker

Make sure that the VF rep (ens2f0_0, ens2f1_0) of the Worker is automatically set as the downlink for the Linux Bridge.

nmcli con show

Output Example

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
cni0                 eacb774c-9306-4c9d-8eb3-2382d9b4c5da  bridge    cni0
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
ens2f0_0             112da873-9159-412b-a576-b70eb1dcb97b  ethernet  ens2f0_0
ens2f1_0             714af871-890d-4e24-8ad6-bcea1ea7c208  ethernet  ens2f1_0
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --

7.Operation check : CP

7-1.Check the IP address of the Pod.

Since the IP address of the Pod is dynamically assigned, check it in advance.

kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  7m22s  multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  7m22s  multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  7m23s  multus   Add eth0 [10.244.1.6/24]
  Normal  AddedInterface  7m23s  multus   Add net1 [192.168.30.130/24] from default/vdpa-mlx-vhost-net31
7-2.Communication check
vdpa-pod01 192.168.30.66/24
vdpa-pod02 192.168.30.130/24

Execute a ping from vdpa-pod01(192.168.30.66) to 192.168.30.130.

kubectl exec -it vdpa-pod01 -- ping 192.168.30.130

Output Example

[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.130
PING 192.168.30.130 (192.168.30.130) 56(84) bytes of data.
64 bytes from 192.168.30.130: icmp_seq=1 ttl=64 time=0.510 ms
64 bytes from 192.168.30.130: icmp_seq=2 ttl=64 time=0.590 ms
64 bytes from 192.168.30.130: icmp_seq=3 ttl=64 time=0.551 ms
64 bytes from 192.168.30.130: icmp_seq=4 ttl=64 time=0.510 ms
^C
--- 192.168.30.130 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 63ms
rtt min/avg/max/mdev = 0.510/0.540/0.590/0.037 ms

We were able to confirm that the Pods are communicating with each other via the DAC.

You can also check that the Worker is learning the Pod's MAC address in the bridge monitor fdb.
Before pinging, run the following command in Worker.

bridge monitor fdb

The ca:fe:c0:ff:ee:11 and ca:fe:c0:ff:ee:12 entries at the end (shown in red on the original page) are the MAC addresses learned when the ping was run.
The other entries are MAC addresses learned when the Pods were deployed.

[root@c83g152 ~]# bridge monitor fdb
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b vlan 1 master cni0 permanent
Deleted 7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab vlan 1 master cni0 permanent
fe:42:e8:04:50:51 dev veth78a2b6ab master cni0
b2:4b:f8:68:fe:b0 dev vethdbd6a26b master cni0
7a:d9:02:4d:7c:8f dev cni0 master cni0 permanent
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 master br30ens2f0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 vlan 1 master br30ens2f0 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 master br31ens2f1 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 vlan 1 master br31ens2f1 permanent
ca:fe:c0:ff:ee:11 dev ens2f0_0 master br30ens2f0
ca:fe:c0:ff:ee:11 dev ens2f1 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f1_0 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f0 master br30ens2f0
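bridge monitor fdb shows changes as they happen; if you just want a static snapshot of the learned entries, bridge fdb show can be used instead, for example:

bridge fdb show br br30ens2f0
bridge fdb show br br31ens2f1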

8.Finally

We referred to the following website.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

This article falls under No.10.

No  vm(qemu)/k8s  k8s Pod/VMI  vDPA Framework  vDPA Type  SR-IOV mode  Related Articles
1   vm            -            kernel          vhost      legacy       Not started
2   vm            -            kernel          vhost      switchdev    How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3   vm            -            kernel          virtio     legacy       Not started
4   vm            -            kernel          virtio     switchdev    Not started
5   vm            -            dpdk            vhost      legacy       Not started
6   vm            -            dpdk            vhost      switchdev    Not started
7   vm            -            dpdk            virtio     legacy       Not started
8   vm            -            dpdk            virtio     switchdev    How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9   k8s           pod          kernel          vhost      legacy       How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
10  k8s           pod          kernel          vhost      switchdev    How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection (this article)
11  k8s           pod          kernel          virtio     legacy       Not started
12  k8s           pod          kernel          virtio     switchdev    Not started
13  k8s           pod          dpdk            client     legacy       Not started
14  k8s           pod          dpdk            client     switchdev    Not started
15  k8s           pod          dpdk            server     legacy       Not started
16  k8s           pod          dpdk            server     switchdev    Not started

In the next article, I would like to try setting up a VMI (Virtual Machine Instance) using kubevirt.
However, at this time kubevirt only supports vfio-pci, so we will put it on hold for now.

*1:After checking various documents, I found that the "v" in vDPA is given three different meanings: virtual, vhost, and virtio, but they all appear to refer to the same thing. In this article, I have followed "Introduction to vDPA kernel framework" and use virtio.

*2:A loop connection is used in case a 100Gbps switch or a server with 100GNICs is not available. However, since we believe it is important that packets generated by the Pod are physically sent to the outside, we use the configuration shown in fig.1.

*3:In the case of VMs, the MAC address setting was mandatory, but in the case of k8s Pod, it is not a mandatory setting because the MAC address is written in the yaml.

*4:core and modules will be installed at the same time

*5:The "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so I don't think you need to worry too much about it.

*6:If you build the Kernel from source and have enabled vhost_vdpa with make menuconfig, etc., this setting is not necessary.

*7:In the previous article, we also built sriov-cni at the same time. In this article, we will not build sriov-cni because we will not use it, but you can build it without any problem. In section 6-1, the NetworkAttachmentDefinition explicitly specifies accelerated-bridge, so there is no problem even if sriov-cni and accelerated-bridge-cni coexist.

*8:PCI Devices

*9:In the previous article, it is set to "sriov".

*10:If you don't specify anything, the default Bridge will be cni0.

How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI

This article describes how to set up vDPA (virtio data path acceleration) in a Kubernetes environment using the Accelerated Bridge CNI.*1
For details on vDPA, please refer to The official Red Hat blog.

The difference from the previous article is that the Accelerated Bridge CNI is used.
With the SR-IOV CNI (previous article), SR-IOV Legacy mode was used, whereas
with the Accelerated Bridge CNI, SR-IOV Switchdev mode is used.
Along with this, it becomes possible to use a Linux Bridge.

For details on the Accelerated Bridge CNI, please refer to the following URL.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
How to set up vDPA - Extra edition - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Configuration

1-1.Environment
1.ControlPlane
VMWare              : VMware(R) Workstation 15 Pro 15.5.1 build-15018445 

2.Worker
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
Mellanox OFED                    : v5.3-1.0.0.1

3.ControlPlane&Worker common
OS                               : CentOS8.3(2011)
Kernel(ControlPlane)             : 4.18.0-240.el8.x86_64 
Kernel(Worker)                   : 5.12.12-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Kubernetes                       : 1.21.2
Docker-CE                        : 20.10.7
flannel                          : latest
Multus                           : latest
sriov-network-device-plugin      : latest
accelerated-bridge-cni           : latest
1-2.Overall flow
  1. Preliminary preparation
  2. Kernel update
  3. Build k8s Cluster & flannel
  4. Build and deploy vDPA and Accelerated Bridge CNI related
  5. Configure Linux Bridge & deploy Pods
  6. Operation check

Since relatively many documents already exist for steps 1 to 3, only the important points are covered.
Steps 4 to 6 are what differ from the previous article.

1-3.Overall configuration

A loop connection is made using a DAC (Direct Attached Cable). *2
fig.1
f:id:metonymical:20210621235434j:plain
fig.1 is drawn in a simplified form and omits the internal architecture, so please picture the following configuration as what is actually built.

fig.2
f:id:metonymical:20210621220154j:plain

Quoted from Red Hat's github
github.com

2.Preliminary preparation

Although not described here, SELinux has been disabled, the firewall has been disabled, and NTP time synchronization has been configured in advance.

2-1.swap and hosts file settings : CP(ControlPlane)&Worker
Disable swap
vi /etc/fstab

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0

hosts file settings
vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePages and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, configure the HugePage mount settings; the pages will be mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
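
After the reboot in 2-4, you can also confirm that the grub settings above took effect by checking the kernel command line (a minimal check; the parameter names follow the settings above).

grep -Eo 'default_hugepagesz=[^ ]+|hugepages=[^ ]+|intel_iommu=[^ ]+|iommu=[^ ]+' /proc/cmdline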
2-3.SR-IOV VF settings : Worker

Configure the SR-IOV VFs. Increasing the number of VFs is fine, but to keep the explanation simple, the number of VFs is set to "1" here. *3

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
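
After the next reboot, the VF settings applied by rc.local can be sanity-checked as follows (a minimal check using this article's interface names).

cat /sys/class/net/ens2f0/device/sriov_numvfs
cat /sys/class/net/ens2f1/device/sriov_numvfs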
2-4.Install the Mellanox driver (OFED) : Worker

Download the iso file from the Mellanox website: Mellanox Download Site
Save the downloaded iso file to /root/tmp/.
Install the Mellanox driver with the following commands.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePages.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0

3.Kernel update : Worker

As of June 21, 2021, the vDPA-related modules are being updated frequently, so install the latest Kernel.

3-2.Install the Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernels.
Install kernel-ml and kernel-ml-devel. *4
Check the installed Kernels again.
Reboot.
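
These commands assume that the ELRepo repository is already available on the Worker. If it is not, it can typically be added first, for example as follows (shown for reference; check the package URL against the current ELRepo documentation).

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm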

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel version.
Install kernel-ml-headers (swapped in for kernel-headers).
Remove the existing kernel-tools and kernel-tools-libs.
Install kernel-ml-tools and kernel-ml-tools-libs.
Check the installed Kernels.

If the output looks like the following, you are good.

[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-core.x86_64                              5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-devel.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-headers.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-modules.x86_64                           5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools.x86_64                             5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.12.12-1.el8.elrepo                          @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.3-OFED.5.3.1.0.0.1.rhel8u3                  @System

4.Build the k8s Cluster & flannel

4-1.Install Docker : CP & Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing  docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-3.Install k8s : CP & Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
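
The kubeadm/kubectl installation above assumes that a Kubernetes package repository has already been configured on CP & Worker. If it has not, a typical repository definition at the time of writing looked like the following (shown for reference only; adjust the baseurl and keys to the repository you actually use).

cat <<'EOF' > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF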
4-4.Configure Docker : CP & Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
4-5.Build the k8s Cluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16

<Example output>
The following output appears at the end; copy the kubeadm join command (highlighted in red in the original post).
It will be used when the Worker joins the CP.

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3 
4-6.Settings after building the k8s Cluster : CP
Copy the config file
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes

<Example output>
Wait until the node becomes Ready as shown below.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS     ROLES                  AGE   VERSION
c83g151.md.jp   Ready      control-plane,master   44s   v1.21.2
4-8.Join the Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3

<Example output>
Wait until the Worker also becomes Ready as shown below.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.2
c83g152.md.jp   Ready                           44s    v1.21.2
4-9.Change to SR-IOV switchdev mode : Worker
4-9-1.Check the current operation mode
lshw -businfo -c network
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

Check the bsf (bus, slot, function) numbers of the PCI devices.
Check the status of 07:00.0 (ens2f0).
Check the status of 07:00.1 (ens2f1).

The output will look like the following.

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode legacy inline-mode none encap disable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode legacy inline-mode none encap disable
4-9-2.Change the operation mode

Note that the bsf numbers are subtly different. *5

echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind

Unbind the mlx5_core driver from the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0

Change the operation mode of the PFs to switchdev.

07:00.0 ens2f0
07:00.1 ens2f1

Rebind the mlx5_core driver to the VFs.

07:00.2 ens2f0v0
07:01.2 ens2f1v0
4-9-3.Check the operation mode after the change
devlink dev eswitch show pci/0000:07:00.0
devlink dev eswitch show pci/0000:07:00.1

The mode has been changed to switchdev.

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode none encap enable

[root@c83g152 ~]# devlink dev eswitch show pci/0000:07:00.1
pci/0000:07:00.1: mode switchdev inline-mode none encap enable

VF representors have been added.

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device      Class          Description
========================================================
pci@0000:04:00.0  ens1f0      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1      network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4        network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1      network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2  ens2f0v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2  ens2f1v0    network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.0  ens2f0_0    network        Ethernet interface
pci@0000:07:00.1  ens2f1_0    network        Ethernet interface

In addition, confirm that the NIC's HW offload feature is enabled.

ethtool -k ens2f0 |grep tc
ethtool -k ens2f1 |grep tc

[root@c83g152 ~]# ethtool -k ens2f0 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on

[root@c83g152 ~]# ethtool -k ens2f1 |grep tc
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
hw-tc-offload: on
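
If hw-tc-offload is shown as off in your environment, it can usually be turned on with ethtool before proceeding (interface names follow this article's configuration).

ethtool -K ens2f0 hw-tc-offload on
ethtool -K ens2f1 hw-tc-offload on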
4-10.Enable the vhost_vdpa module : Worker

In section 5, "Build and deploy the vDPA & Accelerated Bridge CNI related components", the sriov-device-plugin will be built and deployed.
If the vhost_vdpa module is not enabled at that point, the devices will not be recognized as Worker resources, so enable it in advance.

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the output above, the following can be confirmed.

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Also, configure this so that it is enabled at OS boot time. *6
The newly added lines (shown in blue in the original post) are the unbind/switchdev/bind sequence and the modprobe vhost_vdpa line.

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind && \
devlink dev eswitch set pci/0000:07:00.0 mode switchdev && \
devlink dev eswitch set pci/0000:07:00.1 mode switchdev && \
echo 0000:07:00.2 > /sys/bus/pci/drivers/mlx5_core/bind && \
echo 0000:07:01.2 > /sys/bus/pci/drivers/mlx5_core/bind
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0

Once the preparations up to this point are complete, reboot the CP & Worker once more.

5.Build and deploy the vDPA & Accelerated Bridge CNI related components

5-1.Install Golang : CP
dnf -y install epel-release golang
5-2.Build the Docker images : CP

Here, we build the following two images. *7

  • multus
  • sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-dp
5-3.Copy the Docker images : CP

Copy the built Docker images to the Worker.

/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152

<Example output>

[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs

Check the Docker images : CP & Worker

[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED        SIZE
nfvpe/sriov-device-plugin            latest       4bbc4ec2506d   29 hours ago   42.7MB
<none>                               <none>       9d6fcc3268de   29 hours ago   1.26GB
nfvpe/multus                         latest       5b78ae4af974   29 hours ago   331MB
<none>                               <none>       d4742876ce8d   29 hours ago   1.35GB
k8s.gcr.io/kube-apiserver            v1.21.2      106ff58d4308   6 days ago     126MB
k8s.gcr.io/kube-controller-manager   v1.21.2      ae24db9aa2cc   6 days ago     120MB
k8s.gcr.io/kube-scheduler            v1.21.2      f917b8c8f55b   6 days ago     50.6MB
k8s.gcr.io/kube-proxy                v1.21.2      a6ebd1c1ad98   6 days ago     131MB
golang                               alpine3.12   baed0e68a17f   11 days ago    301MB
mellanox/accelerated-bridge-cni      v0.1.0       9adde4aff112   2 weeks ago    11MB
quay.io/coreos/flannel               v0.14.0      8522d622299c   5 weeks ago    67.9MB
alpine                               3.12         13621d1b12d4   2 months ago   5.58MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   5 months ago   683kB
centos                               centos7      8652b9f0cb4c   7 months ago   204MB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   8 months ago   42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago   253MB
5-4.Build the Accelerated Bridge CNI : CP
cd /usr/src && \
git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
cd accelerated-bridge-cni/ && \
make

<Example output>

[root@c83g151 vdpa-deployment]# cd /usr/src && \
> git clone https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni.git && \
> cd accelerated-bridge-cni/ && \
> make
Cloning into 'accelerated-bridge-cni'...
remote: Enumerating objects: 3426, done.
============ s n i p ============
github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge
Building accelerated-bridge...
Done!
## Build executable file
building goveralls...
running coverage tests...
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/cmd/accelerated-bridge   [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/cache        0.006s  coverage: 72.2% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/config       0.006s  coverage: 90.9% of statements
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager      0.010s  coverage: 53.7% of statements
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/manager/mocks        [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/plugin       [no test files]
?       github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/types        [no test files]
ok      github.com/k8snetworkplumbingwg/accelerated-bridge-cni/pkg/utils        0.006s  coverage: 64.3% of statements
5-5.Deploy the yaml files : CP

Deploy the following four files.

  1. /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
  2. /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
  3. /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
  4. /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml

<Example output>

[root@c83g151 accelerated-bridge-cni]# kubectl apply -f /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml && \
> kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml && \
> kubectl apply -f /usr/src/accelerated-bridge-cni/images/k8s-v1.16/accelerated-bridge-cni-daemonset.yaml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
configmap/sriovdp-config created
daemonset.apps/kube-accelerated-bridge-cni-ds-amd64 created
5-6.Check the DaemonSets, ConfigMaps & Pods : CP

Check the DaemonSets and ConfigMaps deployed in 5-5.

kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod

<Example output>

[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get ds
NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-accelerated-bridge-cni-ds-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   12s
kube-flannel-ds                        2         2         2       2            2                                     11m
kube-multus-ds-amd64                   2         2         2       2            2           kubernetes.io/arch=amd64        13s
kube-proxy                             2         2         2       2            2           kubernetes.io/os=linux          13m
kube-sriov-device-plugin-amd64         2         2         2       2            2           beta.kubernetes.io/arch=amd64   13s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      13m
extension-apiserver-authentication   6      13m
kube-flannel-cfg                     2      11m
kube-proxy                           2      13m
kube-root-ca.crt                     1      13m
kubeadm-config                       2      13m
kubelet-config-1.21                  1      13m
multus-cni-config                    1      13s
sriovdp-config                       1      12s
[root@c83g151 accelerated-bridge-cni]# kubectl -n kube-system get pod
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-76v4h                     1/1     Running   0          13m
coredns-558bd4d5db-qrn74                     1/1     Running   0          13m
etcd-c83g151.md.jp                           1/1     Running   0          13m
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   0          14s
kube-accelerated-bridge-cni-ds-amd64-mxs92   1/1     Running   0          14s
kube-apiserver-c83g151.md.jp                 1/1     Running   0          13m
kube-controller-manager-c83g151.md.jp        1/1     Running   0          13m
kube-flannel-ds-2n457                        1/1     Running   0          11m
kube-flannel-ds-h9c9r                        1/1     Running   0          11m
kube-multus-ds-amd64-x5cdn                   1/1     Running   0          15s
kube-multus-ds-amd64-zt5pj                   1/1     Running   0          15s
kube-proxy-2585z                             1/1     Running   0          13m
kube-proxy-znl8n                             1/1     Running   0          11m
kube-scheduler-c83g151.md.jp                 1/1     Running   0          13m
kube-sriov-device-plugin-amd64-6cccp         1/1     Running   0          15s
kube-sriov-device-plugin-amd64-k8ktz         1/1     Running   0          15s
5-7.Check the ConfigMap in detail : CP

In section 6, "Configure the Linux Bridge & deploy the Pods", "resourceName": "vdpa_mlx_vhost" is referenced by the Network Attachment Definition and Pod settings, so check it in advance.

cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

<Supplement>

resourceName: An arbitrary name. It is explicitly specified in 6-1, the NetworkAttachmentDefinition settings.
vendors: The vendor identifier of the PCI Device ID. 15b3 indicates a Mellanox product. *8
devices: The device identifier of the PCI Device ID. 101e indicates a VF of the ConnectX-6 Dx.
drivers: Specifies the mlx5_core driver.
vdpaType: Set to vhost. This option is where the standard SR-IOV Device plug-in has been extended for vDPA.

The PCI Device IDs can be checked with the following command.

[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
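
For reference, the same config.json can also be read back from the ConfigMap applied to the cluster instead of from the file (the dot in the key name is escaped in the jsonpath expression).

kubectl -n kube-system get cm sriovdp-config -o jsonpath='{.data.config\.json}'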
5-8.Check the Worker resources : CP

Confirm that vdpa_mlx_vhost is recognized as a Worker resource.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

Example output

[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}

"2" is the number of VFs that were recognized.

<Supplement>
If this value is "0", or if the intel.com/vdpa_mlx_vhost line is not shown at all, delete the sriov-device-plugin Pod once.
Because the sriov-device-plugin is deployed as a DaemonSet, the Pod is recreated automatically even after it is deleted.
This makes the sriov-device-plugin retry the registration of vdpa_mlx_vhost.

[root@c83g151 ~]# kubectl -n kube-system get pod -o wide |grep 152
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   2          28h   192.168.11.152   c83g152.md.jp              
kube-flannel-ds-h9c9r                        1/1     Running   2          29h   192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-x5cdn                   1/1     Running   2          28h   192.168.11.152   c83g152.md.jp              
kube-proxy-znl8n                             1/1     Running   2          29h   192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-6cccp         1/1     Running   2          28h   192.168.11.152   c83g152.md.jp              

[root@c83g151 ~]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-6cccp
pod "kube-sriov-device-plugin-amd64-6cccp" deleted

[root@c83g151 ~]# kubectl -n kube-system get pod -o wide |grep 152
kube-accelerated-bridge-cni-ds-amd64-2dvtm   1/1     Running   2          28h   192.168.11.152   c83g152.md.jp              
kube-flannel-ds-h9c9r                        1/1     Running   2          29h   192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-x5cdn                   1/1     Running   2          28h   192.168.11.152   c83g152.md.jp              
kube-proxy-znl8n                             1/1     Running   2          29h   192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-c7j84         1/1     Running   0          3s    192.168.11.152   c83g152.md.jp              

Check the Worker resources again.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

If it is still not recognized, refer back to 4-10, "Enable the vhost_vdpa module : Worker".
The key points are the following two:

  1. /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  2. 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Unless both of these conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.
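
Both conditions can be rechecked quickly with the same commands used in 4-10, for example:

ls -Fal /dev | grep vhost-vdpa
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa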

6.Configure the Linux Bridge & deploy the Pods

6-1.Configure the NetworkAttachmentDefinition : CP
vi 961nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa30",
  "bridge": "br30ens2f0",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.64",
    "rangeEnd": "192.168.30.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net31
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "accelerated-bridge",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa31",
  "bridge": "br31ens2f1",
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.128",
    "rangeEnd": "192.168.30.191"
  }
}'

kubectl apply -f 961nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io

Example output

[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   4h24m
vdpa-mlx-vhost-net31   4h24m

<Supplement>

name: vdpa-mlx-vhost-net30/31; referenced in the Pod annotations.
k8s.v1.cni.cncf.io/resourceName: The name confirmed in 5-7. It is also set in the Pod annotations.
type: Required. Because Accelerated Bridge CNI is used, it must be explicitly set to "accelerated-bridge". *9
bridge: Required in this configuration. Setting the name of a Linux Bridge determines which Bridge the Pod connects to. *10
rangeStart & rangeEnd: In this configuration, the address ranges are split between vdpa-mlx-vhost-net30 and vdpa-mlx-vhost-net31 so that they do not overlap.

For details, refer to the following URL.
accelerated-bridge-cni/configuration-reference.md at master · k8snetworkplumbingwg/accelerated-bridge-cni · GitHub
Note that with Accelerated Bridge CNI, the way vlan and trunk are configured differs from SR-IOV CNI.
Since a Linux Bridge is used, I assume that VLAN tagging is delegated to the Linux Bridge side.
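
For example, if Pod traffic needed to be VLAN-tagged under this assumption, one approach would be to enable VLAN filtering on the Linux Bridge and tag the VF rep port after the Pod has been created and the rep has been attached. The following is only a sketch with a hypothetical VLAN ID 100 and is not part of this article's configuration.

ip link set br30ens2f0 type bridge vlan_filtering 1
bridge vlan add dev ens2f0_0 vid 100 pvid untagged master
bridge vlan add dev ens2f0 vid 100 master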

6-2.Configure the Linux Bridge : Worker

Configure items (1)-(7) of fig.1 in order.
fig.1
f:id:metonymical:20210621235434j:plain

(1) Already configured in 4-10: run the modprobe vhost_vdpa command.
(2)(3) Create the first Bridge and configure PF ens2f0 as its uplink.
(4)(5) Create the second Bridge and configure PF ens2f1 as its uplink.
(6) Configured automatically at Pod creation: Accelerated Bridge CNI attaches the VF reps ens2f0_0 and ens2f1_0 as downlinks.
(7) Create the Pods in the next section, 6-3.
Worker
(1)
modprobe vhost_vdpa

(2)
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
(3)
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

(4)
nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
(5)
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1

CP
(6)(7) Described in the next section, 6-3.
kubectl apply -f 161vdpa.yaml
6-2-1.Create the Linux Bridges : Worker
nmcli connection add type bridge autoconnect yes con-name br30ens2f0 ifname br30ens2f0
nmcli connection modify br30ens2f0 bridge.stp no
nmcli connection modify br30ens2f0 ipv4.method disabled ipv6.method ignore
nmcli connection up br30ens2f0
nmcli connection add type bridge-slave ifname ens2f0 master br30ens2f0

nmcli connection add type bridge autoconnect yes con-name br31ens2f1 ifname br31ens2f1
nmcli connection modify br31ens2f1 bridge.stp no
nmcli connection modify br31ens2f1 ipv4.method disabled ipv6.method ignore
nmcli connection up br31ens2f1
nmcli connection add type bridge-slave ifname ens2f1 master br31ens2f1
6-2-2.Check the Linux Bridges : Worker
nmcli con show

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --
6-3.Deploy the Pods : CP

The point to note is that the values defined in the Network Attachment Definition are specified in the annotations and resources.
There are no settings specific to Accelerated Bridge CNI.

vi 161vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "ca:fe:c0:ff:ee:11"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net31",
        "mac": "ca:fe:c0:ff:ee:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]


kubectl apply -f 161vdpa.yaml
kubectl get pod

Example output

[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          7s
vdpa-pod02   1/1     Running   0          7s
6-4.Check the Pod details : CP

Here, check the Pod status in detail with the kubectl describe command.

kubectl describe pod vdpa-pod01

fig.1
f:id:metonymical:20210621235434j:plain
Confirm that the device-info parameters (/dev/vhost-vdpa-0, 07:00.2) match fig.1.
Example output

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Wed, 23 Jun 2021 00:03:55 +0900
Labels:       
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.7"
                    ],
                    "mac": "56:5b:99:80:a4:6f",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.66"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.7
IPs:
  IP:  10.244.1.7
Containers:
  vdpa-single01:
    Container ID:  docker://67e8a82543cd8e4db556e8bd502f42659260e2d50192a5a0f13860d1cefc09ef
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          
    Host Port:     
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Wed, 23 Jun 2021 00:03:57 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:                 
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gq69j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-gq69j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  39s   multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  39s   multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          39s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         39s   kubelet  Created container vdpa-single01
  Normal  Started         39s   kubelet  Started container vdpa-single01
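
If you only need the device-info portion rather than the full describe output, the network-status annotation can be extracted directly (a minimal example; jq is optional and the dots in the annotation key are escaped).

kubectl get pod vdpa-pod01 -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' | jq .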

When the Pods are deployed, the following messages are output to /var/log/messages on the Worker.

[root@c83g152 ~]# tail -f /var/log/messages

Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.415300    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 kubelet[2033]: I0621 19:38:41.419085    2033 topology_manager.go:187] "Topology Admit Handler"
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-pod7cba503f_5336_425e_b32c_45869b6f9020.slice.
Jun 21 19:38:41 c83g152 systemd[1]: Created slice libcontainer container kubepods-besteffort-poddd78b5ac_06f3_48f1_bbc9_2443fa88c653.slice.
============ s n i p ============
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f0_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br30ens2f0: port 2(ens2f0_0) entered forwarding state
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7254] device (ens2f0_0): Activation: starting connection 'ens2f0_0' (84f4ad71-e7dc-44ba-8755-2e3d6279b037)
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7258] device (ens2f0_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7273] device (ens2f0_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7298] device (ens2f0_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7303] device (br30ens2f0): bridge port ens2f0_0 was attached
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7304] device (ens2f0_0): Activation: connection 'ens2f0_0' enslaved, continuing activation
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7308] device (ens2f0_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7398] device (ens2f0_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7404] device (ens2f0_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7431] device (ens2f0_0): Activation: successful, device activated.
Jun 21 19:38:42 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 temp_17: renamed from ens2f0v0
Jun 21 19:38:42 c83g152 NetworkManager[1192]:   [1624271922.7876] device (ens2f0v0): interface index 17 renamed iface from 'ens2f0v0' to 'temp_17'
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered disabled state
Jun 21 19:38:42 c83g152 kernel: device ens2f1_0 entered promiscuous mode
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered blocking state
Jun 21 19:38:42 c83g152 kernel: br31ens2f1: port 2(ens2f1_0) entered forwarding state
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: renamed from temp_17
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:00.2 net1: Link up
Jun 21 19:38:42 c83g152 kernel: mlx5_core 0000:07:01.2 temp_18: renamed from ens2f1v0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708881    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.4/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f0_0 to the bridge br30ens2f0
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.003540    6118 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod01", UID:"7cba503f-5336-425e-b32c-45869b6f9020", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.65/24] from default/vdpa-mlx-vhost-net30
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: renamed from temp_18
Jun 21 19:38:43 c83g152 kernel: mlx5_core 0000:07:01.2 net1: Link up
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1334] device (temp_17): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1418] device (ens2f1v0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1466] device (ens2f1_0): Activation: starting connection 'ens2f1_0' (b0db37c4-3214-497b-80e2-c532ea2e54c2)
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1469] device (ens2f1_0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1480] device (ens2f1_0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1488] device (ens2f1_0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1492] device (br31ens2f1): bridge port ens2f1_0 was attached
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1492] device (ens2f1_0): Activation: connection 'ens2f1_0' enslaved, continuing activation
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1497] device (ens2f1_0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:42.708792    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.244.1.5/24]
Jun 21 19:38:43 c83g152 kubelet[2033]: 2021-06-21T19:38:42+09:00 INF Attaching rep ens2f1_0 to the bridge br31ens2f1
Jun 21 19:38:43 c83g152 kubelet[2033]: I0621 19:38:43.150099    6117 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"vdpa-pod02", UID:"dd78b5ac-06f3-48f1-bbc9-2443fa88c653", APIVersion:"v1", ResourceVersion:"3643", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add net1 [192.168.30.129/24] from default/vdpa-mlx-vhost-net31
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1710] device (ens2f1_0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1716] device (ens2f1_0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jun 21 19:38:43 c83g152 NetworkManager[1192]:   [1624271923.1739] device (ens2f1_0): Activation: successful, device activated.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-524994f068d6d26da4d1a605184e6de152bf73c5a986154266fe173dd7160fd8\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[2041]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: var-lib-docker-overlay2-397914421b4b0f2e6a93c8262d6bd7c747dbdfaff65f7f674be5513b35d2ca6c\x2dinit-merged.mount: Succeeded.
Jun 21 19:38:43 c83g152 systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223212683+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-1 are ignored in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223313924+09:00" level=warning msg="custom mrw permissions for device /dev/vhost-vdpa-0 are ignored in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223395596+09:00" level=warning msg="path in container /dev/vhost-vdpa-0 already exists in privileged mode" container=52d8b02b57b1f979bcff398fcaf5b276ad056b73fe93aff10bb104c35a4b31d6
Jun 21 19:38:43 c83g152 dockerd[2034]: time="2021-06-21T19:38:43.223319641+09:00" level=warning msg="path in container /dev/vhost-vdpa-1 already exists in privileged mode" container=b31ebb7e20322aed53aa95e5d30ff86d3d153edd2eb1ccfeceed24db0cc3a339
============ s n i p ============
Jun 21 19:38:53 c83g152 systemd[1]: NetworkManager-dispatcher.service: Succeeded.
6-5.Check the MAC addresses on the Worker : Worker

Confirm that the MAC address specified in the Pod annotations has been applied to the VF on the Worker.

ip link show ens2f0

Example output

[root@c83g152 ~]# ip link show ens2f0
15: ens2f0:  mtu 1500 qdisc mq master br30ens2f0 state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
6-6.Check the Linux Bridges on the Worker : Worker

Confirm that the Worker's VF reps (ens2f0_0, ens2f1_0) have been attached automatically.

nmcli con show

Example output

[root@c83g152 ~]# nmcli con show
NAME                 UUID                                  TYPE      DEVICE
System eno1          a60ea695-651d-46f8-9d93-5ac92f504956  ethernet  eno1
cni0                 eacb774c-9306-4c9d-8eb3-2382d9b4c5da  bridge    cni0
br11eno2             6862ba45-5873-4930-ac46-e3c7cfa3074a  bridge    br11eno2
br30ens2f0           5838f8cd-9f3e-441c-8f72-989e54f8780d  bridge    br30ens2f0
br31ens2f1           0737de11-7b3b-48a7-a742-ced4375ab3b7  bridge    br31ens2f1
docker0              12600c36-dadb-49c9-a082-02b0a51e0424  bridge    docker0
virbr0               a3aec808-57a6-4f35-94f0-0081c8571eab  bridge    virbr0
bridge-slave-eno2    acd7a4d7-585c-4d88-9456-49ac852475ba  ethernet  eno2
bridge-slave-ens2f0  d24f6a37-97cf-471a-abff-e53582effc05  ethernet  ens2f0
bridge-slave-ens2f1  aa37266c-312b-43fa-9e7b-244a7269c90d  ethernet  ens2f1
ens2f0_0             112da873-9159-412b-a576-b70eb1dcb97b  ethernet  ens2f0_0
ens2f1_0             714af871-890d-4e24-8ad6-bcea1ea7c208  ethernet  ens2f1_0
eno3                 700cd313-7aae-4eb0-9591-a3813e40ef2f  ethernet  --
eno4                 1e768d3b-37ff-47be-ad6a-907710def01f  ethernet  --
ens1f0               49155983-26e5-4070-834a-49b8a52fbd02  ethernet  --
ens1f1               cb97a04a-3130-47b9-af13-2293fca218ee  ethernet  --

7.Operation check : CP

7-1.Check the Pod IP addresses

The Pod IP addresses are assigned dynamically, so check them in advance.

kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add

Example output

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  7m22s  multus   Add eth0 [10.244.1.7/24]
  Normal  AddedInterface  7m22s  multus   Add net1 [192.168.30.66/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  7m23s  multus   Add eth0 [10.244.1.6/24]
  Normal  AddedInterface  7m23s  multus   Add net1 [192.168.30.130/24] from default/vdpa-mlx-vhost-net31
7-2.Connectivity check
vdpa-pod01 192.168.30.66/24
vdpa-pod02 192.168.30.130/24

Given the IP addresses above, run a ping from vdpa-pod01 to 192.168.30.130.

kubectl exec -it vdpa-pod01 -- ping 192.168.30.130

Example output

[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.130
PING 192.168.30.130 (192.168.30.130) 56(84) bytes of data.
64 bytes from 192.168.30.130: icmp_seq=1 ttl=64 time=0.510 ms
64 bytes from 192.168.30.130: icmp_seq=2 ttl=64 time=0.590 ms
64 bytes from 192.168.30.130: icmp_seq=3 ttl=64 time=0.551 ms
64 bytes from 192.168.30.130: icmp_seq=4 ttl=64 time=0.510 ms
^C
--- 192.168.30.130 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 63ms
rtt min/avg/max/mdev = 0.510/0.540/0.590/0.037 ms

This confirms that the Pods communicate with each other via the DAC.

You can also confirm with bridge monitor fdb on the Worker that the Pods' MAC addresses are being learned.
Run the following command on the Worker before sending the ping.

bridge monitor fdb

The ca:fe:c0:ff:ee:* entries at the end (shown in red in the original post) are the MAC addresses learned when the ping was executed.
The others are MAC addresses learned when the Pods started.

[root@c83g152 ~]# bridge monitor fdb
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b master cni0 permanent
82:9f:e7:63:77:47 dev vethdbd6a26b vlan 1 master cni0 permanent
Deleted 7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab master cni0 permanent
42:83:37:60:8a:8d dev veth78a2b6ab vlan 1 master cni0 permanent
fe:42:e8:04:50:51 dev veth78a2b6ab master cni0
b2:4b:f8:68:fe:b0 dev vethdbd6a26b master cni0
7a:d9:02:4d:7c:8f dev cni0 master cni0 permanent
7a:d9:02:4d:7c:8f dev cni0 vlan 1 master cni0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 master br30ens2f0 permanent
ea:03:d9:6b:62:a3 dev ens2f0_0 vlan 1 master br30ens2f0 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 master br31ens2f1 permanent
e6:2e:2e:d8:04:3d dev ens2f1_0 vlan 1 master br31ens2f1 permanent
ca:fe:c0:ff:ee:11 dev ens2f0_0 master br30ens2f0
ca:fe:c0:ff:ee:11 dev ens2f1 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f1_0 master br31ens2f1
ca:fe:c0:ff:ee:12 dev ens2f0 master br30ens2f0
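
To take a one-shot snapshot of the forwarding database instead of monitoring it, the same entries can also be listed per bridge (bridge names follow this article's configuration).

bridge fdb show br br30ens2f0
bridge fdb show br br31ens2f1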


That's all.

8.Finally

The following site was used as a reference.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

This article corresponds to No. 10 in the table below.

No vm(qemu)/k8s k8s Pod/VMI vDPA Framework vDPA Type SR-IOV mode Related Articles
1 vm - kernel vhost legacy Not started
2 vm - kernel vhost switchdev How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3 vm - kernel virtio legacy Not started
4 vm - kernel virtio switchdev Not started
5 vm - dpdk vhost legacy Not started
6 vm - dpdk vhost switchdev Not started
7 vm - dpdk virtio legacy Not started
8 vm - dpdk virtio switchdev How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9 k8s pod kernel vhost legacy How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection
10 k8s pod kernel vhost switchdev This article
11 k8s pod kernel virtio legacy Not started
12 k8s pod kernel virtio switchdev Not started
13 k8s pod dpdk client legacy Not started
14 k8s pod dpdk client switchdev Not started
15 k8s pod dpdk server legacy Not started
16 k8s pod dpdk server switchdev Not started

Next time, I would like to try configuring a vmi (Virtual Machine Instance) using kubevirt.
Since kubevirt currently supports only vfio-pci, this is on hold for now.

*1:After reviewing various documents, the "v" in vDPA is expanded in three ways (virtual, vhost, and virtio), but they all appear to mean the same thing. In this blog, following "Introduction to vDPA kernel framework", the notation is unified as virtio.

*2:A loop connection is used in consideration of cases where a 100Gbps switch or a server with a 100G NIC cannot be prepared as the peer device. However, since I consider it important that packets generated by the Pods are physically sent out of the host, the configuration in fig.1 is used.

*3:For VMs, setting the MAC address was mandatory, but for k8s Pods it is not, because the MAC address is written in the yaml.

*4:core and modules are installed at the same time.

*5:As an aside, the "0000" in front of the bsf number is called the Domain number. As far as I know, I have never seen a value other than "0000", so you do not need to worry about it much.

*6:If you have enabled vhost_vdpa (e.g., with make menuconfig) when building the Kernel from source, this setting is not necessary.

*7:In the previous article, sriov-cni was also built at the same time. It is not built here because it is not used, but building it causes no problems. Since accelerated-bridge is explicitly specified in 6-1, the NetworkAttachmentDefinition settings, there is no problem even if sriov-cni and accelerated-bridge-cni coexist.

*8:You can look this up at PCI Devices.

*9:In the previous article, this was set to "sriov".

*10:If nothing is specified, the default Bridge cni0 is used.

How to set up vDPA with vhost_vdpa for Kubernetes

This article describes how to configure vDPA (virtio data path acceleration) in kubernetes environment. *1
For more information on vDPA, please refer to The official Red Hat blog.

In this blog, we will describe the communication between Pods in Kubernetes (hereafter "k8s") using the vhost_vdpa module.

The following is a list of related articles.

How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
How to set up vDPA - appendix - - Metonymical Deflection
How to set up Scalable Function with vdpa for VMs - Metonymical Deflection

1.Overview

1-1.Environment
1.ControlPlane
VMWare              : VMware(R) Workstation 15 Pro 15.5.1 build-15018445 

2.Worker
IA server                        : ProLiant DL360p Gen8 or DL360 Gen9
System ROM                       : P71 01/22/2018
NIC                              : Mellanox ConnectX-6 Dx (MCX623106AS-CDAT)
Mellanox OFED                    : v5.3-1.0.0.1

3.ControlPlane&Worker common
OS                               : CentOS8.3(2011)
Kernel(ControlPlane)             : 4.18.0-240.el8.x86_64 
Kernel(Worker)                   : 5.12.7-1.el8.elrepo.x86_64
Installed Environment Groups     : 
  @^graphical-server-environment
  @container-management
  @development
  @virtualization-client
  @virtualization-hypervisor
  @virtualization-tools 
Kubernetes                       : 1.21.1
Docker-CE                        : 20.10.6
flannel                          : latest
Multus                           : latest
sriov-cni                        : latest
sriov-network-device-plugin      : latest
1-2.Overall flow
  1. Advance preparation
  2. Kernel update
  3. Build k8s Cluster & flannel
  4. Build and deploy vDPA (SR-IOV) related
  5. Deploy Pod
  6. Operation check
  7. Advanced configuration

There is a relatively large amount of documentation for steps 1-3, so we will skip the non-essential parts.
Section 7 describes more detailed settings.

1-3.overall structure

A loop connection is made using a DAC (Direct Attached Cable). *2
fig.1
f:id:metonymical:20210603000858j:plain
fig.1 is simplified and omits the internal architecture, so in practice please picture the following configuration.

fig.2
f:id:metonymical:20210603000958j:plain

Quoted from Red Hat's github
github.com

The implementation in the above GitHub repository uses SR-IOV legacy mode.
For this reason, this blog also describes the configuration in legacy mode. *3

2.Advance preparation

Although not described explicitly, disabling SELinux, disabling the firewall, and configuring NTP time synchronization are done in advance.

2-1.Swap and Hosts file settings : CP (ControlPlane) & Worker
Disable Swap
vi /etc/fstab

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0
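
Swap can also be turned off immediately without a reboot (a supplementary note; the fstab change above keeps it disabled across reboots):

swapoff -a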

Hosts file settings
vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 c80g105.md.jp c80g105
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.11.151 c83g151 c83g151.md.jp
192.168.11.152 c83g152 c83g152.md.jp
2-2.Enabling HugePage and IOMMU : Worker
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ default_hugepagesz=1G hugepagesz=1G hugepages=16\"/g" /etc/default/grub
sed -i -e "/GRUB_CMDLINE_LINUX=/s/\"$/ intel_iommu=on iommu=pt pci=realloc\"/g" /etc/default/grub
grub2-mkconfig -o /etc/grub2.cfg

Next, configure the mount settings for HugePages so that they are mounted automatically the next time the OS boots.

vi /etc/fstab

nodev  /dev/hugepages hugetlbfs pagesize=1GB    0 0
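
After the reboot performed in section 2-4, the kernel command line can be checked to confirm that the HugePage and IOMMU options were applied (a quick supplementary check):

cat /proc/cmdline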
2-3.SR-IOV VF settings : Worker

Configure the SR-IOV VFs. You can increase the number of VFs, but for the sake of simplicity we have set the number of VFs to "1". We also set the MAC address here. *4

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
exit 0

chmod +x /etc/rc.d/rc.local
2-4.Install the Mellanox driver (OFED) : Worker

You can download the iso file from the Mellanox website: Mellanox Download Site
Please save the downloaded iso file to /root/tmp/.

dnf -y install tcl tk unbound && \
mount -t iso9660 -o loop /root/tmp/MLNX_OFED_LINUX-5.3-1.0.0.1-rhel8.3-x86_64.iso /mnt && \
/mnt/mlnxofedinstall --upstream-libs --dpdk --ovs-dpdk --with-mft --with-mstflint

After the installation is complete, reboot.

reboot

After the reboot is complete, check the HugePage.

cat /proc/meminfo | grep Huge
grep hugetlbfs /proc/mounts

[root@c83g152 ~]# cat /proc/meminfo | grep Huge
AnonHugePages:    452608 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:      16
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

[root@c83g152 ~]# grep hugetlbfs /proc/mounts
nodev /dev/hugepages hugetlbfs rw,relatime,pagesize=1024M 0 0

3.Kernel update : Worker

As of June 11, 2021, the vDPA-related modules are being updated frequently, so install the latest kernel.
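
Note that the dnf commands in 3-2 below assume the ELRepo repository is already registered. If it is not, it can be added roughly as follows (a sketch based on the standard ELRepo installation instructions; adjust the package URL to your environment if necessary):

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm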

3-2.Installation of Kernel
dnf list installed | grep kernel
dnf -y --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel
dnf list installed | grep kernel
reboot

Check the currently installed Kernel.
Install kernel-ml and kernel-ml-devel *5
Check the installed Kernel.
Reboot

3-3.Install Kernel headers, etc.
uname -r
dnf -y swap --enablerepo=elrepo-kernel kernel-headers -- kernel-ml-headers && \
dnf -y remove kernel-tools kernel-tools-libs && \
dnf -y --enablerepo=elrepo-kernel install kernel-ml-tools kernel-ml-tools-libs
dnf list installed | grep kernel

Check the currently running Kernel Version.
Install kernel-headers.
Remove the existing kernel-tools kernel-tools-libs
Install kernel-tools kernel-tools-libs
Check the installed Kernel.

If you get the following output, you are good to go.

[root@c83g152 ~]# dnf list installed | grep kernel
kernel.x86_64                                      4.18.0-240.el8                                @anaconda
kernel-core.x86_64                                 4.18.0-240.el8                                @anaconda
kernel-devel.x86_64                                4.18.0-240.el8                                @anaconda
kernel-ml.x86_64                                   5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-core.x86_64                              5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-devel.x86_64                             5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-headers.x86_64                           5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-modules.x86_64                           5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-tools.x86_64                             5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-ml-tools-libs.x86_64                        5.12.7-1.el8.elrepo                           @elrepo-kernel
kernel-modules.x86_64                              4.18.0-240.el8                                @anaconda
kmod-kernel-mft-mlnx.x86_64                        4.16.1-1.rhel8u3                              @System
kmod-mlnx-ofa_kernel.x86_64                        5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel.x86_64                             5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System
mlnx-ofa_kernel-devel.x86_64                       5.2-OFED.5.2.2.2.0.1.rhel8u3                  @System

4.Build k8s Cluster & flannel

4-1.Install Docker : CP&Worker
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo && \
dnf -y install --allowerasing  docker-ce docker-ce-cli containerd.io && \
systemctl start docker && systemctl enable docker
4-2.Configuring the k8s repository : CP&Worker
cat > /etc/yum.repos.d/kubernetes.repo <<EOF
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
4-3.Install k8s : CP&Worker
CP
dnf -y install kubeadm kubectl

Worker
dnf -y install kubeadm

CP&Worker
systemctl start kubelet.service && \
systemctl enable kubelet.service
4-4.Configuring Docker : CP&Worker
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF

mkdir -p /etc/systemd/system/docker.service.d

systemctl daemon-reload && \
systemctl restart docker
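
After restarting Docker, it is worth confirming that the cgroup driver is now systemd (a quick check; the exact output format depends on the Docker version):

docker info | grep -i 'cgroup driver'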
4-5.Building the k8s Cluster : CP
kubeadm init --apiserver-advertise-address=192.168.11.151 --pod-network-cidr=10.244.0.0/16

Output Example
At the end you will see the following output; copy the kubeadm join command (shown in red in the original).
It will be used when the Worker joins the CP.

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3 
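
If this output is lost or the token expires, a new join command can be generated on the CP (a supplementary note):

kubeadm token create --print-join-command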
4-6.Configuration after building k8s Cluster : CP
Copy the config file
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Command completion settings
source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc
4-7.Install flannel : CP
cd /usr/src && \
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml && \
kubectl apply -f kube-flannel.yml

kubectl get nodes

Output Example
Wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS     ROLES                  AGE   VERSION
c83g151.md.jp   Ready      control-plane,master   44s   v1.21.1
4-8.Joining a Worker : Worker
kubeadm join 192.168.11.151:6443 --token 0gfh5j.vgu76alcycb2tc2e \
    --discovery-token-ca-cert-hash sha256:edcb1a3856838586a6ea7c99200daafa4fbb639e822838f4df81ce09d2faaac3

Output Example
Workers also wait until the status becomes Ready.

[root@c83g151 ~]# kubectl get nodes
NAME            STATUS   ROLES                  AGE    VERSION
c83g151.md.jp   Ready    control-plane,master   5m2s   v1.21.1
c83g152.md.jp   Ready                           44s    v1.21.1
4-9.Enabling the vhost_vdpa module : Worker

sriov-cni will be built and deployed in section "5. Build and deploy vDPA (SR-IOV) related".
At that point, if the vhost_vdpa module is not enabled, the VFs will not be recognized as a Worker resource, so enable the module beforehand.

modprobe vhost_vdpa
lsmod |grep vd
ls -Fal /dev
ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa

[root@c83g152 ~]# lsmod |grep vd
vhost_vdpa             24576  0
vhost                  57344  1 vhost_vdpa
mlx5_vdpa              45056  0
vhost_iotlb            16384  3 vhost_vdpa,vhost,mlx5_vdpa
vdpa                   16384  2 vhost_vdpa,mlx5_vdpa
irqbypass              16384  2 vhost_vdpa,kvm
mlx5_core            1216512  2 mlx5_vdpa,mlx5_ib

[root@c83g152 ~]# ls -Fal /dev
total 0
drwxr-xr-x  22 root root          3660 Apr  8 00:02 ./
dr-xr-xr-x. 17 root root           244 Apr  7 20:30 ../
crw-r--r--   1 root root       10, 235 Apr  7 23:28 autofs
drwxr-xr-x   2 root root           160 Apr  7 23:28 block/
drwxr-xr-x   2 root root           100 Apr  7 23:28 bsg/
============ s n i p ============
drwxr-xr-x   2 root root            60 Apr  7 23:28 vfio/
crw-------   1 root root       10, 127 Apr  7 23:28 vga_arbiter
crw-------   1 root root       10, 137 Apr  7 23:28 vhci
crw-------   1 root root       10, 238 Apr  7 23:28 vhost-net
crw-------   1 root root      240,   0 Apr  8 00:06 vhost-vdpa-0
crw-------   1 root root      240,   1 Apr  8 00:06 vhost-vdpa-1
crw-------   1 root root       10, 241 Apr  7 23:28 vhost-vsock

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Apr  8 00:06 ./
drwxr-xr-x 3 root root    0 Apr  7 23:49 ../
--w------- 1 root root 4096 Apr  8 00:07 bind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Apr  8 00:06 uevent
--w------- 1 root root 4096 Apr  8 00:07 unbind
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Apr  8 00:07 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa1/

From the above output results, we can confirm the following

  • /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 are recognized as vhost_vdpa devices
  • 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 are controlled by the vhost_vdpa driver

Also, configure it so that the module is enabled at OS startup. *6
The modprobe vhost_vdpa line has been added (shown in blue in the original).

vi /etc/rc.local

echo 1 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 1 > /sys/class/net/ens2f1/device/sriov_numvfs
sleep 1
ip link set ens2f0 vf 0 mac 00:11:22:33:44:00
ip link set ens2f1 vf 0 mac 00:11:22:33:44:10
sleep 1
modprobe vhost_vdpa
sleep 1
exit 0

Reboot the CP and Worker again.

5.Build and deploy vDPA (SR-IOV) related

5-1.Install Golang : CP
dnf -y install epel-release golang
5-2.Building the Docker image : CP

In this section, we will build the following three images. *7

  • multus
  • sriov-cni
  • sriov-dp
cd /usr/src && \
git clone https://github.com/redhat-nfvpe/vdpa-deployment.git

cd /usr/src/vdpa-deployment && \
make multus && \
make sriov-cni && \
make sriov-dp
5-3.Copy the Docker image : CP

Copy the built Docker image to the Worker.

/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-cni root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152

Output Example

[root@c83g151 vdpa-deployment]# /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-device-plugin root@192.168.11.152 && \
/usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-cni root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/sriov-cni root@192.168.11.152 && \
> /usr/src/vdpa-deployment/scripts/load-image.sh nfvpe/multus root@192.168.11.152
+ IMAGE=nfvpe/sriov-device-plugin
+ NODE=root@192.168.11.152
++ mktemp -d
+ temp=/tmp/tmp.Lh8BaezUtC
+ dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ save nfvpe/sriov-device-plugin /tmp/tmp.Lh8BaezUtC/image.tar
+ local image=nfvpe/sriov-device-plugin
+ local dest=/tmp/tmp.Lh8BaezUtC/image.tar
+ echo 'Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar'
Saving nfvpe/sriov-device-plugin into /tmp/tmp.Lh8BaezUtC/image.tar
+ docker save -o /tmp/tmp.Lh8BaezUtC/image.tar nfvpe/sriov-device-plugin
============ s n i p ============
+ echo 'Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152'
Loading /tmp/tmp.Z6emF9eiAs/image.tar into root@192.168.11.152
+ ssh root@192.168.11.152 'docker load'
Loaded image: nfvpe/multus:latest
+ rm -r /tmp/tmp.Z6emF9eiAs
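
For reference, as the trace above shows, load-image.sh is essentially equivalent to saving the image to a tar file and loading it on the Worker over ssh (a simplified sketch; the temporary path here is illustrative):

docker save -o /tmp/image.tar nfvpe/multus
ssh root@192.168.11.152 'docker load' < /tmp/image.tar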

Checking the Docker Image : CP&Worker

[root@c83g151 vdpa-deployment]# docker images
REPOSITORY                           TAG          IMAGE ID       CREATED          SIZE
nfvpe/sriov-cni                      latest       521ab1f3a5a1   16 minutes ago   9.47MB
<none>                               <none>       88062f2c13d4   16 minutes ago   561MB
nfvpe/multus                         latest       aa8d9becca0f   17 minutes ago   331MB
<none>                               <none>       e75c422aef6e   17 minutes ago   1.34GB
nfvpe/sriov-device-plugin            latest       0dd5f325c600   18 minutes ago   42.7MB
<none>                               <none>       3deb8b5405fa   18 minutes ago   1.26GB
quay.io/coreos/flannel               v0.14.0      8522d622299c   2 weeks ago      67.9MB
k8s.gcr.io/kube-apiserver            v1.21.1      771ffcf9ca63   3 weeks ago      126MB
k8s.gcr.io/kube-proxy                v1.21.1      4359e752b596   3 weeks ago      131MB
k8s.gcr.io/kube-controller-manager   v1.21.1      e16544fd47b0   3 weeks ago      120MB
k8s.gcr.io/kube-scheduler            v1.21.1      a4183b88f6e6   3 weeks ago      50.6MB
golang                               alpine3.12   24d827672eae   3 weeks ago      301MB
golang                               alpine       722a834ff95b   3 weeks ago      301MB
fedora                               32           c451de0d2441   5 weeks ago      202MB
alpine                               3.12         13621d1b12d4   7 weeks ago      5.58MB
alpine                               3            6dbb9cc54074   7 weeks ago      5.61MB
k8s.gcr.io/pause                     3.4.1        0f8457a4c2ec   4 months ago     683kB
k8s.gcr.io/coredns/coredns           v1.8.0       296a6d5035e2   7 months ago     42.5MB
k8s.gcr.io/etcd                      3.4.13-0     0369cf4303ff   9 months ago     253MB
5-4.Deploying yaml files : CP

Deploy the following four files.

  1. /usr/src/vdpa-deployment/deployment/multus-daemonset.yaml
  2. /usr/src/vdpa-deployment/deployment/sriovcni-vdpa-daemonset.yaml
  3. /usr/src/vdpa-deployment/deployment/sriovdp-vdpa-daemonset.yaml
  4. /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
cd /usr/src/vdpa-deployment && \
make deploy && \
kubectl apply -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

Output Example

[root@c83g151 vdpa-deployment]# cd /usr/src/vdpa-deployment && \
> make deploy && \
> kubectl create -f /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml
serviceaccount/sriov-device-plugin created
daemonset.apps/kube-sriov-device-plugin-amd64 created
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
daemonset.apps/kube-sriov-cni-ds-amd64 created
configmap/sriovdp-config created
5-5.Checking DaemonSet & ConfigMap & Pod : CP

Check the DaemonSet and ConfigMap that you deployed in 5-4.

kubectl -n kube-system get ds
kubectl -n kube-system get cm
kubectl -n kube-system get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get ds
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                   AGE
kube-flannel-ds                  2         2         2       2            2                                     3h4m
kube-multus-ds-amd64             2         2         2       2            2           kubernetes.io/arch=amd64        6m52s
kube-proxy                       2         2         2       2            2           kubernetes.io/os=linux          3h4m
kube-sriov-cni-ds-amd64          2         2         2       2            2           beta.kubernetes.io/arch=amd64   6m52s
kube-sriov-device-plugin-amd64   2         2         2       2            2           beta.kubernetes.io/arch=amd64   6m52s


[root@c83g151 vdpa-deployment]# kubectl -n kube-system get cm
NAME                                 DATA   AGE
coredns                              1      3h4m
extension-apiserver-authentication   6      3h4m
kube-flannel-cfg                     2      3h4m
kube-proxy                           2      3h4m
kube-root-ca.crt                     1      3h4m
kubeadm-config                       2      3h4m
kubelet-config-1.21                  1      3h4m
multus-cni-config                    1      6m52s
sriovdp-config                       1      6m52s

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod
NAME                                    READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-7kql2                1/1     Running   2          178m
coredns-558bd4d5db-nq8k7                1/1     Running   2          178m
etcd-c83g151.md.jp                      1/1     Running   2          178m
kube-apiserver-c83g151.md.jp            1/1     Running   2          178m
kube-controller-manager-c83g151.md.jp   1/1     Running   2          178m
kube-flannel-ds-89v57                   1/1     Running   2          174m
kube-flannel-ds-zwd7n                   1/1     Running   2          177m
kube-multus-ds-amd64-75rbf              1/1     Running   0          33s
kube-multus-ds-amd64-zk6w9              1/1     Running   0          33s
kube-proxy-fdv9r                        1/1     Running   2          174m
kube-proxy-l6t7h                        1/1     Running   2          178m
kube-scheduler-c83g151.md.jp            1/1     Running   2          178m
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          33s
kube-sriov-cni-ds-amd64-ndmmr           1/1     Running   0          33s
kube-sriov-device-plugin-amd64-4lt4p    1/1     Running   0          33s
kube-sriov-device-plugin-amd64-gbplp    1/1     Running   0          33s
5-6.Checking the details of ConfigMap : CP

In "6. Deploy Pod", vdpa_mlx_vhost is related to the Network Attachment Definition and the Pod configuration, so please check it beforehand.

cat /usr/src/vdpa-deployment/deployment/configMap-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

Note

  • resourceName : This can be any name. You can specify this name explicitly in section 6-1.
  • vendors : The vendor identifier of the PCI Device ID. 15b3 indicates a Mellanox product. *8
  • devices : The device identifier of the PCI Device ID. 101e indicates a VF of ConnectX-6 Dx.
  • drivers : Specifies the mlx5_core driver.
  • vdpaType : Specifies vhost. This option is an extension over the regular SR-IOV device plugin.

The PCI Device ID can be checked with the following command.

[root@c83g152 ~]# lspci -nn |grep Mellanox
07:00.0 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.1 Ethernet controller [0200]: Mellanox Technologies MT2892 Family [ConnectX-6 Dx] [15b3:101d]
07:00.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
07:01.2 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
5-7.Checking Worker Resources : CP

Check that vdpa_mlx_vhost is recognized as a Worker resource.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

Output Example

[root@c83g151 vdpa-deployment]# kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'
{
  "cpu": "16",
  "ephemeral-storage": "127203802926",
  "hugepages-1Gi": "16Gi",
  "hugepages-2Mi": "0",
  "intel.com/vdpa_mlx_vhost": "2",
  "memory": "148123456Ki",
  "pods": "110"
}

"2" is the number of VFs that were recognized.

Note
If this value is "0" or the intel.com/vdpa_mlx_vhost line is not displayed, delete the sriov-device-plugin pod once.
Since sriov-device-plugin is deployed as a DaemonSet, the Pod will be recreated automatically after it is deleted.
This causes sriov-device-plugin to attempt to re-register vdpa_mlx_vhost.

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h8m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          74m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-rg8hm    1/1     Running   0          73m     192.168.11.152   c83g152.md.jp              

[root@c83g151 vdpa-deployment]# kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-rg8hm
pod "kube-sriov-device-plugin-amd64-rg8hm" deleted

[root@c83g151 vdpa-deployment]# kubectl -n kube-system get pod -o wide |grep 152
kube-flannel-ds-89v57                   1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-multus-ds-amd64-75rbf              1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-proxy-fdv9r                        1/1     Running   2          4h9m    192.168.11.152   c83g152.md.jp              
kube-sriov-cni-ds-amd64-2xfxw           1/1     Running   0          76m     192.168.11.152   c83g152.md.jp              
kube-sriov-device-plugin-amd64-kwc5z    1/1     Running   0          3s      192.168.11.152   c83g152.md.jp              

Again, check the Worker resources.

kubectl get node c83g152.md.jp -o json | jq '.status.allocatable'

If it is still not recognized, please refer to section 4-9.
The two points are as follows.

  1. /dev/vhost-vdpa-0 and /dev/vhost-vdpa-1 must be recognized as vhost_vdpa devices.
  2. 0000:07:00.2/vdpa0 and 0000:07:01.2/vdpa1 must be controlled by the vhost_vdpa driver.

Unless the above two conditions are met, vdpa_mlx_vhost will not be recognized as a Worker resource.
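
As a supplementary check (a sketch using standard sysfs paths; the device names are from this environment), you can list the devices on the vdpa bus and see which driver each one is bound to:

ls /sys/bus/vdpa/devices/
readlink /sys/bus/vdpa/devices/vdpa0/driver
readlink /sys/bus/vdpa/devices/vdpa1/driver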

6.Deploy Pod

In "/usr/src/vdpa-deployment/deployment/multus-daemonset.yaml", configure the Network Attachment Definition defined as CRD.
In this section, we will configure vlan, spoofchk, ipam, etc.
For more details, please refer to the following website.
sriov-cni/configuration-reference.md at rfe/vdpa · amorenoz/sriov-cni · GitHub

6-1.Configuring the NetworkAttachmentDefinition : CP

"vdpa-mlx-vhost-net30" is specified in the annotations of the Pod.

vi 96nA-vdpa30.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net30
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa30",
  "vlan": 30,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.30.0/24",
    "rangeStart": "192.168.30.64",
    "rangeEnd": "192.168.30.127"
  }
}'

kubectl apply -f 96nA-vdpa30.yaml
kubectl get network-attachment-definitions.k8s.cni.cncf.io

Output Example

[root@c83g151 vdpa-deployment]# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME                   AGE
vdpa-mlx-vhost-net30   14m
6-2.Deploying a Pod : CP

The key point is that the values defined in the Network Attachment Definition are specified for "annotations" and "resources".

vi 16vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "CA:FE:C0:FF:EE:11"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net30",
        "mac": "CA:FE:C0:FF:EE:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost: '1'
      limits:
        intel.com/vdpa_mlx_vhost: '1'
    command: ["sleep"]
    args: ["infinity"]


kubectl apply -f 16vdpa.yaml
kubectl get pod

Output Example

[root@c83g151 vdpa-deployment]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
vdpa-pod01   1/1     Running   0          16m
vdpa-pod02   1/1     Running   0          16m
6-3.Check Pod details : CP

In this section, we will check the details of the Pod status by using the kubectl describe command.

kubectl describe pod vdpa-pod01

fig.1
f:id:metonymical:20210603000858j:plain
Make sure that the device-info parameters (/dev/vhost-vdpa-0, 07:00.2) match fig.1.
Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Thu, 03 Jun 2021 20:54:49 +0900
Labels:       
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.9"
                    ],
                    "mac": "26:9a:87:2a:70:70",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.71"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks: [ { "name": "vdpa-mlx-vhost-net30", "mac": "CA:FE:C0:FF:EE:11" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.9"
                    ],
                    "mac": "26:9a:87:2a:70:70",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net30",
                    "interface": "net1",
                    "ips": [
                        "192.168.30.71"
                    ],
                    "mac": "CA:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.9
IPs:
  IP:  10.244.1.9
Containers:
  vdpa-single01:
    Container ID:  docker://cf57569807eb2de3d4901ff2ade55b845682d2c7a37ee88c7f6536498fd0b63e
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          
    Host Port:     
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Thu, 03 Jun 2021 20:54:51 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost:  1
    Requests:
      intel.com/vdpa_mlx_vhost:  1
    Environment:                 
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9kqtb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-9kqtb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  26s   multus   Add eth0 [10.244.1.9/24]
  Normal  AddedInterface  25s   multus   Add net1 [192.168.30.71/24] from default/vdpa-mlx-vhost-net30
  Normal  Pulled          25s   kubelet  Container image "centos:latest" already present on machine
  Normal  Created         25s   kubelet  Created container vdpa-single01
  Normal  Started         25s   kubelet  Started container vdpa-single01
6-4.Check the Mac address of the Worker : Worker

Check that the MAC address specified in the annotations of the Pod is reflected in the VF of the Worker.

ip link show ens2f0

Output Example

[root@c83g152 ~]# ip link show ens2f0
8: ens2f0:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether ca:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, vlan 30, spoof checking off, link-state auto, trust on, query_rss off

7.Operation check : CP

7-1.Check the IP address of the Pod.

Since the IP addresses of the Pods are assigned dynamically, check them in advance.

kubectl describe pod vdpa-pod01 |grep Add
kubectl describe pod vdpa-pod02 |grep Add

Output Example

[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01 |grep Add
  Normal  AddedInterface  26m   multus   Add eth0 [10.244.1.9/24]
  Normal  AddedInterface  26m   multus   Add net1 [192.168.30.71/24] from default/vdpa-mlx-vhost-net30
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod02 |grep Add
  Normal  AddedInterface  26m   multus   Add eth0 [10.244.1.8/24]
  Normal  AddedInterface  26m   multus   Add net1 [192.168.30.70/24] from default/vdpa-mlx-vhost-net30
7-2.Communication check
vdpa-pod01 192.168.30.71/24
vdpa-pod02 192.168.30.70/24

Execute a ping from vdpa-pod01(192.168.30.71) to 192.168.30.70.

kubectl exec -it vdpa-pod01 -- ping 192.168.30.70

Output Example

[root@c83g151 vdpa-deployment]# kubectl exec -it vdpa-pod01 -- ping 192.168.30.70
PING 192.168.30.70 (192.168.30.70) 56(84) bytes of data.
64 bytes from 192.168.30.70: icmp_seq=1 ttl=64 time=0.447 ms
64 bytes from 192.168.30.70: icmp_seq=2 ttl=64 time=0.166 ms
64 bytes from 192.168.30.70: icmp_seq=3 ttl=64 time=0.227 ms
64 bytes from 192.168.30.70: icmp_seq=4 ttl=64 time=0.217 ms
^C
--- 192.168.30.70 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 91ms
rtt min/avg/max/mdev = 0.166/0.264/0.447/0.108 ms

We were able to confirm that the Pods can communicate with each other via the DAC.
That's it for the configuration.

In the next section, we will describe a configuration example for more detailed settings.

8.Advanced configuration

Depending on how the ConfigMap and NetworkAttachmentDefinition are configured, you can also set up a configuration like the one shown in fig.3.

fig.3
f:id:metonymical:20210603230304j:plain

The yaml files themselves should make it clear what is being done, so no detailed explanation is given here; the points that need attention are highlighted (shown in red in the original).
Please increase the number of SR-IOV VFs to 4 (2 per port) in advance.

echo 2 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 2 > /sys/class/net/ens2f1/device/sriov_numvfs
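
Note that sriov_numvfs generally cannot be changed directly from one non-zero value to another; in that case, reset it to 0 first and then run the echo 2 commands above (a supplementary sketch):

# reset the VF count to 0 before setting a new value
echo 0 > /sys/class/net/ens2f0/device/sriov_numvfs
echo 0 > /sys/class/net/ens2f1/device/sriov_numvfs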
8-1.Configuring ConfigMap : CP
vi 83cm-vdpa.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourceName": "vdpa_ifcvf_vhost",
                "selectors": {
                    "vendors": ["1af4"],
                    "devices": ["1041"],
                    "drivers": ["ifcvf"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost11",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:00.2"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost12",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:00.3"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost21",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:01.2"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_vhost22",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "pciAddresses": ["0000:07:01.3"],
                    "vdpaType": "vhost"
                }
            },
            {
                "resourceName": "vdpa_mlx_virtio",
                "selectors": {
                    "vendors": ["15b3"],
                    "devices": ["101e"],
                    "drivers": ["mlx5_core"],
                    "vdpaType": "virtio"
                }
            }
        ]
    }

kubectl apply -f 83cm-vdpa.yaml
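
After applying the new ConfigMap, it may be necessary to delete the sriov-device-plugin pod on the Worker once, as described in section 5-7, so that the new resource names are registered (the pod name is an example; check it with the first command):

kubectl -n kube-system get pod -o wide | grep sriov-device-plugin
kubectl -n kube-system delete pod <sriov-device-plugin pod name>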
8-2.Configuring NetworkAttachmentDefinition : CP
vi 93nA-vdpa11-22.yaml

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net11
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost11
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 100,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.100.0/24",
    "rangeStart": "192.168.100.64",
    "rangeEnd": "192.168.100.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net12
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost12
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 200,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.200.0/24",
    "rangeStart": "192.168.200.64",
    "rangeEnd": "192.168.200.127"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net21
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost21
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 100,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.100.0/24",
    "rangeStart": "192.168.100.128",
    "rangeEnd": "192.168.100.191"
  }
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-mlx-vhost-net22
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/vdpa_mlx_vhost22
spec:
  config: '{
  "type": "sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-vdpa",
  "vlan": 200,
  "trust": "on",
  "spoofchk": "off",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.200.0/24",
    "rangeStart": "192.168.200.128",
    "rangeEnd": "192.168.200.191"
  }
}'

kubectl apply -f 93nA-vdpa11-22.yaml
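
As in section 6-1, you can check that the four NetworkAttachmentDefinitions have been created:

kubectl get network-attachment-definitions.k8s.cni.cncf.io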
8-3.Configuring Pod : CP
vi 13vdpa.yaml

apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod01
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net11",
        "mac": "0C:FE:C0:FF:EE:11"
      },
      {
        "name": "vdpa-mlx-vhost-net12",
        "mac": "0C:FE:C0:FF:EE:12"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single01
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost11: '1'
        intel.com/vdpa_mlx_vhost12: '1'
      limits:
        intel.com/vdpa_mlx_vhost11: '1'
        intel.com/vdpa_mlx_vhost12: '1'
    command: ["sleep"]
    args: ["infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: vdpa-pod02
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "vdpa-mlx-vhost-net21",
        "mac": "0C:FE:C0:FF:EE:21"
      },
      {
        "name": "vdpa-mlx-vhost-net22",
        "mac": "0C:FE:C0:FF:EE:22"
      }
      ]'
spec:
  nodeName: c83g152.md.jp
  containers:
  - name: vdpa-single02
    image: centos:latest
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    resources:
      requests:
        intel.com/vdpa_mlx_vhost21: '1'
        intel.com/vdpa_mlx_vhost22: '1'
      limits:
        intel.com/vdpa_mlx_vhost21: '1'
        intel.com/vdpa_mlx_vhost22: '1'
    command: ["sleep"]
    args: ["infinity"]

kubectl apply -f 13vdpa.yaml
8-4.Check Pod details : CP
[root@c83g151 vdpa-deployment]# kubectl describe pod vdpa-pod01
Name:         vdpa-pod01
Namespace:    default
Priority:     0
Node:         c83g152.md.jp/192.168.11.152
Start Time:   Thu, 03 Jun 2021 22:23:33 +0900
Labels:       
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.12"
                    ],
                    "mac": "c6:01:75:97:75:91",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net11",
                    "interface": "net1",
                    "ips": [
                        "192.168.100.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                },{
                    "name": "default/vdpa-mlx-vhost-net12",
                    "interface": "net2",
                    "ips": [
                        "192.168.200.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:12",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa1",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-1",
                            "pci-address": "0000:07:00.3"
                        }
                    }
                }]
              k8s.v1.cni.cncf.io/networks:
                [ { "name": "vdpa-mlx-vhost-net11", "mac": "0C:FE:C0:FF:EE:11" }, { "name": "vdpa-mlx-vhost-net12", "mac": "0C:FE:C0:FF:EE:12" } ]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.244.1.12"
                    ],
                    "mac": "c6:01:75:97:75:91",
                    "default": true,
                    "dns": {}
                },{
                    "name": "default/vdpa-mlx-vhost-net11",
                    "interface": "net1",
                    "ips": [
                        "192.168.100.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:11",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa0",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-0",
                            "pci-address": "0000:07:00.2"
                        }
                    }
                },{
                    "name": "default/vdpa-mlx-vhost-net12",
                    "interface": "net2",
                    "ips": [
                        "192.168.200.64"
                    ],
                    "mac": "0C:FE:C0:FF:EE:12",
                    "dns": {},
                    "device-info": {
                        "type": "vdpa",
                        "version": "1.0.0",
                        "vdpa": {
                            "parent-device": "vdpa1",
                            "driver": "vhost",
                            "path": "/dev/vhost-vdpa-1",
                            "pci-address": "0000:07:00.3"
                        }
                    }
                }]
Status:       Running
IP:           10.244.1.12
IPs:
  IP:  10.244.1.12
Containers:
  vdpa-single01:
    Container ID:  docker://df698de4764a4209703f9df3a641167cdf49222860d1e41f1c85de7ba1bb5146
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:5528e8b1b1719d34604c87e11dcd1c0a20bedf46e83b5632cdeac91b8c04efc1
    Port:          
    Host Port:     
    Command:
      sleep
    Args:
      infinity
    State:          Running
      Started:      Thu, 03 Jun 2021 22:23:36 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      intel.com/vdpa_mlx_vhost11:  1
      intel.com/vdpa_mlx_vhost12:  1
    Requests:
      intel.com/vdpa_mlx_vhost11:  1
      intel.com/vdpa_mlx_vhost12:  1
    Environment:                   
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xst8f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-xst8f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  8m3s  multus   Add eth0 [10.244.1.12/24]
  Normal  AddedInterface  8m2s  multus   Add net1 [192.168.100.64/24] from default/vdpa-mlx-vhost-net11
  Normal  AddedInterface  8m2s  multus   Add net2 [192.168.200.64/24] from default/vdpa-mlx-vhost-net12
  Normal  Pulled          8m2s  kubelet  Container image "centos:latest" already present on machine
  Normal  Created         8m2s  kubelet  Created container vdpa-single01
  Normal  Started         8m1s  kubelet  Started container vdpa-single01

8-5.Check Worker details : Worker
[root@c83g152 ~]# ip link show ens2f0
8: ens2f0:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:10 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 0c:fe:c0:ff:ee:11 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on, query_rss off
    vf 1     link/ether 0c:fe:c0:ff:ee:12 brd ff:ff:ff:ff:ff:ff, vlan 200, spoof checking off, link-state auto, trust on, query_rss off
[root@c83g152 ~]# ip link show ens2f1
9: ens2f1:  mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:7e:1f:11 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 0c:fe:c0:ff:ee:21 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on, query_rss off
    vf 1     link/ether 0c:fe:c0:ff:ee:22 brd ff:ff:ff:ff:ff:ff, vlan 200, spoof checking off, link-state auto, trust on, query_rss off

[root@c83g152 ~]# ls -Fal /sys/bus/vdpa/drivers/vhost_vdpa
total 0
drwxr-xr-x 2 root root    0 Jun  3 22:07 ./
drwxr-xr-x 3 root root    0 Jun  3 22:07 ../
--w------- 1 root root 4096 Jun  3 22:28 bind
lrwxrwxrwx 1 root root    0 Jun  3 22:28 module -> ../../../../module/vhost_vdpa/
--w------- 1 root root 4096 Jun  3 22:07 uevent
--w------- 1 root root 4096 Jun  3 22:28 unbind
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa0 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.2/vdpa0/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa1 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:00.3/vdpa1/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa2 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.2/vdpa2/
lrwxrwxrwx 1 root root    0 Jun  3 22:28 vdpa3 -> ../../../../devices/pci0000:00/0000:00:03.0/0000:07:01.3/vdpa3/

[root@c83g152 ~]# lshw -businfo -c network
Bus info          Device     Class          Description
=======================================================
pci@0000:04:00.0  ens1f0     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:04:00.1  ens1f1     network        82599ES 10-Gigabit SFI/SFP+ Network Connection
pci@0000:03:00.0  eno1       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.1  eno2       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.2  eno3       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:03:00.3  eno4       network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:07:00.0  ens2f0     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.1  ens2f1     network        MT2892 Family [ConnectX-6 Dx]
pci@0000:07:00.2             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:00.3             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.2             network        ConnectX Family mlx5Gen Virtual Function
pci@0000:07:01.3             network        ConnectX Family mlx5Gen Virtual Function

[root@c83g152 ~]# ls /dev
autofs           full          lp3     ptp4    shm       vcs2   vcsu5
block            fuse          mapper  ptp5    snapshot  vcs3   vcsu6
bsg              hpet          mcelog  ptp6    snd       vcs4   vfio
btrfs-control    hpilo         mem     ptp7    stderr    vcs5   vga_arbiter
bus              hugepages     mqueue  ptp8    stdin     vcs6   vhci
char             hwrng         net     ptp9    stdout    vcsa   vhost-net
cl               infiniband    null    pts     tty       vcsa1  vhost-vdpa-0
console          initctl       nvram   random  tty0      vcsa2  vhost-vdpa-1
core             input         port    raw     tty1      vcsa3  vhost-vdpa-2
cpu              ipmi0         ppp     rfkill  tty10     vcsa4  vhost-vdpa-3
cpu_dma_latency  kmsg          ptmx    rtc     tty11     vcsa5  vhost-vsock
disk             kvm           ptp0    rtc0    tty12     vcsa6  watchdog
dm-0             log           ptp1    sda     tty13     vcsu   watchdog0
dm-1             loop-control  ptp10   sda1    tty14     vcsu1  zero
dri              lp0           ptp11   sda2    tty15     vcsu2
fb0              lp1           ptp2    sg0     tty16     vcsu3
fd               lp2           ptp3    sg1     tty17     vcsu4

9.Finally

We referred to the following website.
https://github.com/redhat-nfvpe/vdpa-deployment
https://docs.google.com/document/d/1DgZuksLVIVD5ZpNUNH7zPUr-8t6GKKQICDLqIwQv-FA/edit

This time we used the legacy mode of SR-IOV, but next time we would like to use accelerated-bridge-cni and verify it in switchdev mode.
GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

Also, we thought about how many combinations of environments using vDPA exist, and it looks like there are at least 16 different ones.
Nos. 2 and 8 were described in previous articles, and No. 9 is covered in this article.

No vm(qemu)/k8s k8s Pod/VMI vDPA Framework vDPA Type SR-IOV mode Related Articles
1 vm - kernel vhost legacy Not started
2 vm - kernel vhost switchdev How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection
3 vm - kernel virtio legacy Not started
4 vm - kernel virtio switchdev Not started
5 vm - dpdk vhost legacy Not started
6 vm - dpdk vhost switchdev Not started
7 vm - dpdk virtio legacy Not started
8 vm - dpdk virtio switchdev How to set up vDPA with virtio_vdpa for VMs - Metonymical Deflection
9 k8s pod kernel vhost legacy How to set up vDPA with vhost_vdpa for Kubernetes - Metonymical Deflection (this article)
10 k8s pod kernel vhost switchdev How to set up vDPA with vhost_vdpa for Kubernetes + Accelerated Bridge CNI - Metonymical Deflection
11 k8s pod kernel virtio legacy Not started
12 k8s pod kernel virtio switchdev Not started
13 k8s pod dpdk client legacy Not started
14 k8s pod dpdk client switchdev Not started
15 k8s pod dpdk server legacy Not started
16 k8s pod dpdk server switchdev Not started

It would be inefficient to do all of these, so we will prioritize and verify those that are likely to be used frequently as use cases.


If you understand the following article, the content covered this time should not be so difficult, which reminded me how important it is to build up the basics.
How to set up vDPA with vhost_vdpa for VMs - Metonymical Deflection

In the future, when you encounter an environment that uses vDPA, it will be important to be able to figure out which of these combinations it is composed of.

Becoming cloud-native implies increasing abstraction, and as engineers, I believe we have to avoid ending up in a state where we no longer understand what is happening underneath.

*1:After checking various documents, I found that the "v" in vDPA has three different meanings: virtual, vhost, and virtio, but they all seem to be the same. In this article, I have followed Introduction to vDPA kernel framework and used virtio.

*2:A loop connection is used in case a 100Gbps switch or a server with 100GNICs is not available. However, since we believe it is important that packets generated by the Pod are physically sent to the outside, we use the configuration shown in fig.1.

*3:As for swtichdev mode, there is "accelerated-bridge-cni" provided by Mellanox, which we plan to verify in the future. GitHub - k8snetworkplumbingwg/accelerated-bridge-cni

*4:In the case of VMs, the MAC address setting was mandatory, but in the case of k8s Pod, it is not a mandatory setting because the MAC address is written in the yaml.

*5:core and modules will be installed at the same time

*6:If you build the Kernel from source and have enabled vhost_vdpa with make menuconfig, etc., this setting is not necessary.

*7:In addition, you can also build dpdk-devel and dpdk-app. However, since we could not communicate with the external network, we will not run DPDK on the Pod this time and will use a normal CentOS Pod for the communication check.

*8:PCI Devices