Measuring Around 5: Tuning Linux Network Performance

Posted on January 22, 2019

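A Change of Pace

People who have spent too long with DPDK share one bad habit: they look down on the kernel network stack. All those interrupts and copies just can't compete with the thrill of flooring it and pinning the CPU at 100%. And as a "hardcore" software performance tuner, I never thought much of the kernel way of tweaking a parameter here and adjusting a setting there. Still... having jumped into this pit of my own free will, when the kernel needs tuning, it gets tuned. So today, for a change of pace, let's see how to tune network performance on Linux through configuration.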

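Know Your Hardware

Performance tuning has exactly one job: getting the most out of the resources you already have. Knowing your own hardware is therefore the prerequisite for everything else.

The hardware and software details are too many to list, and starting from the details... well, only a fool starts from the details. On the networking side, you need the big-picture answers to these questions: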

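Which NIC model is it
How many ports does the NIC have
How many queues does each port have
Which hardware offloading features are available
Which CPU model is it
How many cores does the CPU have
What ways are there to raise the CPU clock frequency
Which instruction set extensions does the CPU support
Is there NUMA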

Memory, storage and the rest can wait. Once the system configuration is clear, get a few tools ready.
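Two of the CPU questions above can be answered straight from /proc/cpuinfo. The following is only a minimal sketch: the helper names count_cpus and has_cpu_flag are made up for illustration, and the "processor" / "flags" line format is the x86 one (other architectures label these lines differently).

```shell
# Count the logical CPUs listed in a cpuinfo-format file (default: /proc/cpuinfo).
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Succeed if the given CPU flag (for example aes or avx2) appears on the
# first "flags" line of the file. Assumes the x86 cpuinfo layout.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$file" | tr ' \t' '\n\n' | grep -qx "$flag"
}
```

On the box itself this could be used as `count_cpus` and `has_cpu_flag aes && echo "AES instructions available"`; the latter is worth knowing when the workload is IPsec. For the NIC side, `ethtool -i eth3` reports the driver in use.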

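A Few Tools

htop
More intuitive than top; if you don't have it, top will do in a pinch. The catch is that some stripped-down Linux systems (OpenWRT, for example) do ship a top, but it is not the top we are used to...
ethtool
A tool very, very, very much worth digging into. My biggest regret of the past week is not ticking it when I built OpenWRT myself. I ended up making a complete mess of proxy configuration before the OpenWRT box could reach the internet and install it with opkg.
sysctl
Mainly used to change kernel parameters.
perf
Not much use here, honestly, but a system without perf feels like it is missing something.
The usual tools
cat, echo, ifconfig and the other basic built-in tools.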

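Let's Get to It

Test Topology and Setup

The topology is as follows:

Gateway Moon is the OpenWRT box whose performance needs tuning. An IPsec tunnel is set up between Gateway Moon and Gateway Sun, and the two gateways route traffic between Client Alice and Client Bob.

iperf3 runs as server on one client and as client on the other to measure bandwidth, which here means the forwarding performance of the IPsec tunnel.

Everything except Gateway Moon is currently a high-end server, so the bottleneck is bound to be this poor little Atom box.

For the detailed IPsec configuration, see here.

If all you want is to measure forwarding rate, none of this complexity is needed; two directly connected boxes are enough. I just happen to be working on IPsec tunnels lately.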

Enable NIC Multi-Queue

First check how many queues your NIC supports:

ethtool -l eth3

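RX and TX being 0 here means there are zero queues that can only receive or only transmit. Combined, below them, is the number of queues that can serve as both a transmit and a receive queue; that number generally tells you how many receive and transmit queues you can have.

Other refers to queues used for link interrupts or SR-IOV coordination, which are of no use in this scenario.

Enable multi-queue: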

ethtool -L eth3 combined 2

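You now have two receive queues and two transmit queues.

At this point cat /proc/interrupts should show the interrupt numbers for eth3-rx-0/1 (the exact interrupt names depend on the driver).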
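To avoid asking for more queues than the hardware offers, the maximum and current Combined values can be pulled out of the ethtool -l output. This is a rough sketch; combined_channels is a hypothetical helper name, and it assumes the usual "Pre-set maximums:" / "Current hardware settings:" layout of that output.

```shell
# Read `ethtool -l` output on stdin and print "<max> <current>"
# for the Combined channel count.
combined_channels() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        /^Combined:/                  { value[section] = $2 }
        END                           { print value["max"], value["cur"] }
    '
}
```

Typical use would be `ethtool -l eth3 | combined_channels`; if the first number is smaller than what you planned to pass to ethtool -L, lower the request.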

Enable Every NIC Offload You Can

First see which offloading capabilities you have:

ethtool -k eth3

Forget about anything marked [fixed]. If there is an offload you want to enable, RX checksum for example:

ethtool -K eth3 rx on

For the full list of capabilities and how to set them, see man ethtool.
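The ethtool -k listing can be long, so it helps to filter it down to the features that are currently off and not marked [fixed], since those are the only ones worth trying. A small sketch follows; the function name switchable_off_features is an assumption made for this example.

```shell
# Read `ethtool -k` output on stdin and list features that are off
# and not marked [fixed], one name per line.
switchable_off_features() {
    awk '
        /: off$/ {
            sub(/^[ \t]+/, "")   # drop the indentation of sub-features
            sub(/:.*/, "")       # keep only the feature name
            print
        }
    '
}
```

Run it as `ethtool -k eth3 | switchable_off_features`. The names printed are the ones shown in the listing; check man ethtool for the exact keyword that ethtool -K expects for each of them.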

NIC Queue Depth

Check the maximum supported depth and the current settings:

ethtool -g eth3

Change the receive queue depth to 4096 (this only works if the pre-set maximum reported above is at least 4096):

ethtool -G eth3 rx 4096
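Not every NIC accepts 4096, so one option is to clamp the requested depth to the pre-set maximum instead of hardcoding it. The sketch below assumes the standard ethtool -g layout; clamp_rx_ring is a hypothetical helper, not part of ethtool.

```shell
# Read `ethtool -g` output on stdin and print the requested RX depth ($1),
# clamped to the pre-set maximum reported by the driver.
clamp_rx_ring() {
    awk -v want="$1" '
        /^Pre-set maximums:/          { in_max = 1 }
        /^Current hardware settings:/ { in_max = 0 }
        in_max && /^RX:/              { max = $2 }
        END {
            result = (want + 0 > max + 0) ? max : want
            print result
        }
    '
}
```

It could then be combined with the command above as `ethtool -G eth3 rx "$(ethtool -g eth3 | clamp_rx_ring 4096)"`.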

RSS Queue Configuration

With multi-queue enabled, it is best to configure RSS as well. First look at the current RSS configuration:

ethtool -x eth3

You will see the RSS indirection table, the RSS hash key and the RSS hash function.

I won't explain what RSS is here. To spread packets fairly evenly across the first two RX queues:

ethtool -X eth3 equal 2

Look at the RSS indirection table again; it may have changed, or then again it may not :D

To make one RX queue receive more packets, configure per-queue weights:

ethtool -X eth3 weight 6 2

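This gives RX queue 0 a weight of 6 and RX queue 1 a weight of 2, so queue 0 will (in general) receive more packets than queue 1. Use this when you need finer-grained tuning.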
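As a rough worked example of what the weights mean: the indirection table entries are shared out in proportion to the weights, so the expected split can be estimated with a little arithmetic. The sketch below is only an approximation of that proportion; rss_share is a made-up helper, and the table size of 128 used in the example is an assumption (the real size is visible in the ethtool -x output).

```shell
# Print the approximate number of indirection-table entries per RX queue.
# Usage: rss_share TABLE_SIZE WEIGHT0 [WEIGHT1 ...]
rss_share() {
    size=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    queue=0
    for w in "$@"; do
        echo "queue $queue: $((size * w / total)) entries"
        queue=$((queue + 1))
    done
}
```

With an assumed 128-entry table, `rss_share 128 6 2` prints 96 entries for queue 0 and 32 for queue 1, which means roughly three quarters of the hash buckets point at queue 0.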

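RSS Hash Configuration

Here you decide which packet fields feed the RSS hash for each traffic type (IPv4-tcp, IPv4-udp, IPv6-tcp, Ethernet...).

Ever felt the dread of UDP traffic landing in the same queue no matter how you change the port number?

That is because: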

root@OpenWrt:~# ethtool -n eth0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA

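For UDP traffic only the source IP and destination IP are hashed... so if those two fields don't change, everything ends up in the same queue...

To include the source and destination ports in the RSS hash as well: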

ethtool -N eth3 rx-flow-hash udp4 sdfn

man ethtool explains what sdfn means:

m   Hash on the Layer 2 destination address of the rx packet.
v   Hash on the VLAN tag of the rx packet.
t   Hash on the Layer 3 protocol field of the rx packet.
s   Hash on the IP source address of the rx packet.
d   Hash on the IP destination address of the rx packet.
f   Hash on bytes 0 and 1 of the Layer 4 header of the rx packet.
n   Hash on bytes 2 and 3 of the Layer 4 header of the rx packet.
r   Discard all packets of this flow type. When  this  option  is
    set, all other options are ignored.
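The letters above can be decoded mechanically. For TCP and UDP, bytes 0 and 1 of the Layer 4 header are the source port and bytes 2 and 3 are the destination port, which is why f and n are the letters to add. A small sketch, with explain_hash_fields being a hypothetical helper written only for this illustration:

```shell
# Print one line per rx-flow-hash letter, e.g. `explain_hash_fields sdfn`.
explain_hash_fields() {
    letters=$1
    while [ -n "$letters" ]; do
        rest=${letters#?}              # everything after the first letter
        letter=${letters%"$rest"}      # the first letter itself
        case $letter in
            m) echo "m: L2 destination address" ;;
            v) echo "v: VLAN tag" ;;
            t) echo "t: L3 protocol field" ;;
            s) echo "s: source IP address" ;;
            d) echo "d: destination IP address" ;;
            f) echo "f: L4 bytes 0-1 (source port for TCP/UDP)" ;;
            n) echo "n: L4 bytes 2-3 (destination port for TCP/UDP)" ;;
            r) echo "r: discard all packets of this flow type" ;;
            *) echo "$letter: unknown" ;;
        esac
        letters=$rest
    done
}
```

Calling `explain_hash_fields sdfn` lists source IP, destination IP, source port and destination port, i.e. the usual 4-tuple.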

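Also note that if you are running an IPsec tunnel, the traffic may well be UDP/TCP before encryption or after decryption, but once encrypted it is all esp4 traffic.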

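N-tuple Filter Configuration

This depends on your actual use case. For example, on a web server you might pin the process handling HTTP traffic to CPU1 and also put the interrupts of RX queue 1 on CPU1. Steering all HTTP traffic with dst port 80 into RX queue 1 then pays off in fewer process switches and better cache hit rates.

First check whether the NIC supports it: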

ethtool -k eth3
...
ntuple-filters: off

Turn it on:

ethtool -K eth3 ntuple on

Add a filter rule:

ethtool -U eth3 flow-type tcp4 dst-port 80 action 1

To see how traffic is hitting the filters, run

ethtool -S eth3

and look at the fdir_match and fdir_miss counters.
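The two counters are easier to judge as a ratio. The following sketch turns them into a hit rate; fdir_hit_rate is a hypothetical helper, and it assumes the counters appear under exactly the names fdir_match and fdir_miss, which may differ between drivers.

```shell
# Read `ethtool -S` output on stdin and print the filter hit rate in percent.
fdir_hit_rate() {
    awk -F': *' '
        $1 ~ /fdir_match$/ { match_count = $2 }
        $1 ~ /fdir_miss$/  { miss_count = $2 }
        END {
            total = match_count + miss_count
            if (total == 0) { print "no fdir traffic counted"; exit }
            printf "%.1f%%\n", 100 * match_count / total
        }
    '
}
```

Use it as `ethtool -S eth3 | fdir_hit_rate`. The counters are cumulative, so for a single test run it is more meaningful to compare the values before and after the run.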

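Interrupt Distribution

Interrupt handling is one of the major bottlenecks of kernel packet reception. A common trick is to spread these interrupts across all CPU cores. The ideal setup is for a NIC to have as many receive queues as there are CPU cores in its NUMA node. Of course, with the box I am using, NUMA is more than I dare hope for...

First see which interrupts are registered:
cat /proc/interrupts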

Then bind the IRQ of a receive queue to the CPU core you want:

echo mask > /proc/irq/$IRQ/smp_affinity

Here mask is a bitmask, written in hexadecimal, of the CPUs allowed to handle the interrupt: mask=1 is CPU0, mask=2 is CPU1, mask=3 is CPU0 and CPU1.
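Working out the mask by hand gets error-prone beyond a few cores, mostly because the value has to be hexadecimal. A minimal sketch of the conversion, where cpu_mask is a made-up helper name and the result is only valid as a single word (up to 32 CPUs):

```shell
# Print the hexadecimal smp_affinity mask for the given CPU numbers.
# Usage: cpu_mask CPU [CPU ...]
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}
```

For example, `cpu_mask 0 1` gives 3 and `cpu_mask 4` gives 10 (hex), so binding an IRQ to CPU4 would be `cpu_mask 4 > /proc/irq/$IRQ/smp_affinity`.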

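Before setting interrupt affinity by hand, check whether the irqbalance daemon is already running on the system. If it is, stop it first.

Here is a script that is widely circulated online: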

# setting up irq affinity according to /proc/interrupts
# 2008-11-25 Robert Olsson
# 2009-02-19 updated by Jesse Brandeburg
#
# > Dave Miller:
# (To get consistent naming in /proc/interrupts)
# I would suggest that people use something like:
#             char buf[IFNAMSIZ+6];
#
#             sprintf(buf, "%s-%s-%d",
#                 	netdev->name,
#                            (RX_INTERRUPT ? "rx" : "tx"),
#                            queue->index);
#
#  Assuming a device with two RX and TX queues.
#  This script will assign:
#
#             eth0-rx-0  CPU0
#             eth0-rx-1  CPU1
#             eth0-tx-0  CPU0
#             eth0-tx-1  CPU1
#
set_affinity()
{
	MASK=$((1<<$VEC))
	printf "%s mask=%X for /proc/irq/%d/smp_affinity\n" $DEV $MASK $IRQ
	# smp_affinity is parsed as hexadecimal, so write the mask with %X
	# rather than echoing the decimal value
	printf "%X" $MASK > /proc/irq/$IRQ/smp_affinity
}
 
if [ "$1" = "" ] ; then
    echo "Description:"
    echo "	This script attempts to bind each queue of a multi-queue NIC"
    echo "	to the same numbered core, ie tx0|rx0 --> cpu0, tx1|rx1 --> cpu1"
    echo "usage:"
    echo "	$0 eth0 [eth1 eth2 eth3]"
fi
 
#
# Set up the desired devices.
#
 
for DEV in $*
do
  for DIR in  rx tx
  do
 	MAX=`grep $DEV-$DIR /proc/interrupts | wc -l`
 	if [ "$MAX" == "0" ] ; then
   	MAX=`egrep -i "$DEV:.*$DIR" /proc/interrupts | wc -l`
 	fi
 	if [ "$MAX" == "0" ] ; then
   	echo no vectors found on $DEV
   	exit 1
 	fi
 	for VEC in `seq 0 1 $MAX`
 	do
    	IRQ=`cat /proc/interrupts | grep -i $DEV-$DIR-$VEC"$"  \
			| cut  -d:  -f1 | sed "s/ //g"`
    	if [ -n  "$IRQ" ]; then
          set_affinity
    	else
           IRQ=`cat /proc/interrupts | egrep -i $DEV:v$VEC-$DIR"$"  \
			| cut  -d:  -f1 | sed "s/ //g"`
           if [ -n  "$IRQ" ]; then
             set_affinity
           fi
    	fi
 	done
  done
done

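Kernel Network Parameters

There is not much to say about this part. Below is a sample /etc/sysctl.conf you can use as a reference. If some of the numbers don't feel aggressive enough, bump them up yourself, and don't forget to apply the file with sysctl -p at the end.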

### GENERAL NETWORK SECURITY OPTIONS ###
 
# Number of times SYNACKs for passive TCP connection.
net.ipv4.tcp_synack_retries = 2
 
# Allowed local port range
net.ipv4.ip_local_port_range = 2000 65535
 
# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1
 
# Control Syncookies
net.ipv4.tcp_syncookies = 1
 
# Decrease the time default value for tcp_fin_timeout connection
net.ipv4.tcp_fin_timeout = 15
 
# Decrease the time default value for connections to keep alive
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
 
### TUNING NETWORK PERFORMANCE ###
 
# Default Socket Receive Buffer
net.core.rmem_default = 31457280
 
# Maximum Socket Receive Buffer
net.core.rmem_max = 67108864
 
# Default Socket Send Buffer
net.core.wmem_default = 31457280
 
# Maximum Socket Send Buffer
net.core.wmem_max = 33554432
 
# Increase number of incoming connections
net.core.somaxconn = 65535
 
# Increase number of incoming connections backlog
net.core.netdev_max_backlog = 65536
 
# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824
 
# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.udp_mem = 192576 256768 385152
# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 8192 87380 33554432
net.ipv4.udp_rmem_min = 131072
 
# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 8192 65536 33554432
net.ipv4.udp_wmem_min = 131072
 
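# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
# tcp_tw_recycle breaks connections from clients behind NAT and was removed
# in Linux 4.12, so it is left commented out here
# net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1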
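One detail in the file above is easy to miss: tcp_mem and udp_mem are counted in pages, while the rmem and wmem values are in bytes. A quick conversion helps when deciding whether a number is "aggressive enough". This is just an arithmetic sketch; pages_to_mib is a hypothetical helper and the 4096-byte page size is an assumption (check it with `getconf PAGESIZE`).

```shell
# Convert a page count to MiB, assuming 4096-byte pages.
pages_to_mib() {
    echo $(( $1 * 4096 / 1048576 ))
}
```

For instance, `pages_to_mib 786432` gives 3072, so the first tcp_mem threshold above corresponds to 3 GiB of TCP buffer memory, far more than a small gateway box is likely to have.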

Interrupt Coalescing

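Under heavy traffic, consider having the NIC "batch" its interrupts: merging several interrupt requests into one takes pressure off the CPU.

Check the NIC's current interrupt coalescing configuration: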

[root@server-P1 ~]# ethtool -c eno3
Coalesce parameters for eno3:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

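Changing these settings requires support from both the NIC hardware and the driver. If they can be changed, the simplest option is adaptive mode: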

ethtool -C eth3 adaptive-rx on

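Adaptive mode adjusts the parameters automatically as network load goes down or up, aiming for minimum latency under light load and maximum throughput under heavy load.

The other parameters mean the following:

rx-usecs: how many microseconds to delay the RX interrupt after a packet is received
rx-frames: the maximum number of packets to receive before raising an RX interrupt
rx-usecs-irq: how many microseconds to delay an RX interrupt while a previous interrupt is still being serviced by the host
and so on...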
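A back-of-the-envelope way to read rx-usecs: if an interrupt is held back for N microseconds, a queue can raise at most about 1,000,000 / N interrupts per second. The sketch below does only that division; max_irq_rate is a made-up helper, and real drivers may treat small values such as 0 or 1 specially, so take the result as a rough bound.

```shell
# Print the approximate upper bound on interrupts per second per queue
# for a given rx-usecs value.
max_irq_rate() {
    if [ "$1" -eq 0 ]; then
        echo "unbounded"            # no time-based coalescing
    else
        echo $((1000000 / $1))
    fi
}
```

So `max_irq_rate 50` gives 20000: with rx-usecs set to 50, one queue interrupts its CPU at most about twenty thousand times per second, at the cost of up to 50 microseconds of extra latency per packet.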

Receive Packet Steering(RPS)

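RPS is a software implementation of RSS. On a system with a multi-queue NIC it is quite redundant... so if the interrupt and RSS configuration above is working properly, remember to turn RPS off: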

echo 0 > /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

For a detailed explanation of RPS, and why it is redundant, see here.
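Since the rps_cpus file exists once per RX queue, turning RPS off means writing 0 to each of them. A small loop does it; this is a sketch in which disable_rps is a hypothetical helper, and its optional second argument (the sysfs root) is there only so the function can be tried against a scratch directory first.

```shell
# Write 0 to rps_cpus for every RX queue of a device.
# Usage: disable_rps DEV [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys/class/net)
disable_rps() {
    dev=$1
    root=${2:-/sys/class/net}
    for f in "$root/$dev"/queues/rx-*/rps_cpus; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        echo 0 > "$f"
    done
}
```

On the gateway this would simply be `disable_rps eth3`, run as root.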