Wednesday, 27 October 2021

How to fix chmod execute permissions

Problem

You've run something like the following and accidentally removed the execute permission from /bin/chmod:

        [ec2-user@ip-172-31-30-6 ~]$ sudo chmod -x /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rw-r--r-- 1 root root 54384 Jan 23  2020 /bin/chmod
        ...
        [root@ip-172-31-30-6 ~]# /bin/chmod +x /usr/bin/netstat 
        -bash: /bin/chmod: Permission denied

Now you can't execute chmod to change the permissions of any file on the system, including chmod itself. Below are several ways to fix it.

Solution

Use the ld.so and ld-linux.so* dynamic loader to execute chmod:

According to its man page [1], "The programs ld.so and ld-linux.so* find and load the shared libraries needed by a program, prepare the program to run, and then run it."

We can use this to execute chmod even though it no longer has execute permissions, and undo our mistake. This works because the file being executed is the loader itself, which only needs to be able to read /bin/chmod. Before doing so, we first need to find the dynamic loader binary. On Amazon Linux 2, I found it at /usr/lib64/ld-2.26.so.

        [ec2-user@ip-172-31-30-6 ~]$ sudo find /usr/lib64 -name "ld*.so*"
        /usr/lib64/ld-2.26.so
        /usr/lib64/ld-linux-x86-64.so.2
        ...

Now that we've found them, we can use either one to execute chmod:

        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rw-r--r-- 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ sudo /usr/lib64/ld-2.26.so /bin/chmod +x /bin/chmod

Finally, we verify that the issue is resolved and that we can execute chmod to our heart's content:

        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rwxr-xr-x 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ sudo /bin/chmod +x /usr/bin/netstat
        [ec2-user@ip-172-31-30-6 ~]$

Using Perl

Interestingly enough, Perl has a chmod function built in [2]. It calls the chmod system call directly rather than running /bin/chmod, so we can use it to fix the chmod binary.

An example is shown below:

        [ec2-user@ip-172-31-30-6 ~]$ sudo chmod -x /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rw-r--r-- 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ sudo perl -e 'chmod(0755, "/bin/chmod")'
        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rwxr-xr-x 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$

As you can see, chmod has execute permissions once again.

Rsync from another server

If you can rsync /bin/chmod from another server, you can use the following commands as an example to pull the file. Because rsync -a preserves file metadata (such as execute permissions), the copy arrives executable and can then replace the broken chmod file.

        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rw-r--r-- 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ rsync -av SOURCE_SERVER:/bin/chmod /tmp/chmod
        ...
        [ec2-user@ip-172-31-30-6 ~]$ sudo mv /tmp/chmod /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rwxr-xr-x 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$

Note that you'll need to update SOURCE_SERVER with the IP address or DNS hostname of the source server.

When I have done this in the past, both servers were running the same OS, i.e. Amazon Linux 2. I'm not sure whether this would work with a completely different OS.

Making a copy and replacing its contents

This solution requires making a copy of an existing binary which does have execute permissions, rsync'ing the contents of the broken chmod binary into that copy, and then using the copy to fix the /bin/chmod that's broken. This one is probably better explained with an example.

        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rw-r--r-- 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ sudo cp /bin/chown /bin/chmod2
        [ec2-user@ip-172-31-30-6 ~]$ sudo rsync /bin/chmod /bin/chmod2
        [ec2-user@ip-172-31-30-6 ~]$ sudo /bin/chmod2 +x /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$ sudo rm -f /bin/chmod2
        [ec2-user@ip-172-31-30-6 ~]$ ls -l /bin/chmod
        -rwxr-xr-x 1 root root 54384 Jan 23  2020 /bin/chmod
        [ec2-user@ip-172-31-30-6 ~]$
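
Why this works: cp creates the new file with the donor's mode, and overwriting the contents of an existing file leaves its mode alone. Below is a minimal sketch of that behaviour on scratch files, so nothing under /bin is touched. The function name and the temporary paths are made up for illustration, chmod is only used here to stage the scenario, and cat with a redirect stands in for the rsync step.

```shell
# Hypothetical demo: a copy keeps the donor's execute bits even after its
# contents are overwritten with another file's bytes.
# The body runs in a subshell so umask and variables don't leak out.
demo_copy_trick() (
    umask 022
    dir=$(mktemp -d) || exit 1
    printf 'pretend this is the broken binary\n' > "$dir/broken"
    chmod 644 "$dir/broken"            # no execute bits, like our /bin/chmod
    printf 'donor\n' > "$dir/donor"
    chmod 755 "$dir/donor"             # stands in for /bin/chown
    cp "$dir/donor" "$dir/copy"        # the copy gets the donor's mode
    cat "$dir/broken" > "$dir/copy"    # replace the contents, mode untouched
    ls -l "$dir/copy" | cut -c1-10     # permission string of the copy
    cmp -s "$dir/broken" "$dir/copy" && echo "contents match"
    rm -rf "$dir"
)
demo_copy_trick
```

If the first line printed is -rwxr-xr-x, the copy kept the donor's execute bits while carrying the other file's bytes.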

Using a Live CD

Unfortunately, I can't go into much detail on this one as it depends on what kind of OS you're running. Essentially, you'd boot the machine with the Live CD, mount the old root volume to a temporary location, and use the Live CD's version of chmod to make your broken chmod executable once again.

References:

[1] ld-linux(8) - Linux man page
https://linux.die.net/man/8/ld-linux

[2] chmod - Perldoc Browser
https://perldoc.perl.org/functions/chmod

Enabling TCP Keepalive Functionality For Legacy Linux Applications

Problem

You want to enable TCP keepalive functionality, but the application either doesn't support it or overrides the system-wide settings itself.

You may have tried to configure this using the sysctl parameters mentioned below, to no avail. As a result, the connection eventually times out or is closed on its own.

The sysctl parameters you may have tried to configure are:

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

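If you have configured the above-mentioned parameters and you still aren't seeing TCP Keepalive functionality enabled, then this article may be of use to you. These parameters only set the timing for sockets that have the SO_KEEPALIVE option enabled; they don't turn keepalives on for an application that never sets it.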

Solution

You can install the libkeepalive library and, using the LD_PRELOAD environment variable, instruct the application to load the library and enable TCP Keepalive functionality.

Quick Setup Guide

        [ec2-user@ip-172-31-30-6 ~]$ wget -O libkeepalive-0.3.tar.gz 'http://prdownloads.sourceforge.net/libkeepalive/libkeepalive-0.3.tar.gz?download'
        [ec2-user@ip-172-31-30-6 ~]$ tar zxf libkeepalive-0.3.tar.gz
        [ec2-user@ip-172-31-30-6 ~]$ cd libkeepalive-0.3
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ make
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ sudo cp libkeepalive.so /usr/lib64
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ export LD_PRELOAD=/usr/lib64/libkeepalive.so
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ export KEEPCNT=20
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ export KEEPIDLE=75
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ export KEEPINTVL=60
        [ec2-user@ip-172-31-30-6 libkeepalive-0.3]$ /path/to/myapplication

How It Works

When /path/to/myapplication executes, the dynamic loader preloads the libkeepalive.so library, which enables TCP keepalive functionality for newly created TCP sockets in accordance with the KEEPCNT, KEEPIDLE, and KEEPINTVL environment variables. I have tested this using the nc command to create a process that listens on TCP port 5000. Run the nc command via strace, and you'll see what happens when TCP Keepalives are not enabled:

        [ec2-user@ip-172-31-30-6 ~]$ strace nc -l 5000 2>&1 | grep setsockopt
        setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
        setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
        setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
        ^C
        [ec2-user@ip-172-31-30-6 ~]$

We see that the SO_KEEPALIVE socket option has not been enabled, nor have any of the other Keepalive related settings. Now let's review the difference when TCP Keepalives are enabled for the same command:

        [ec2-user@ip-172-31-30-6 ~]$ export LD_PRELOAD=/usr/lib64/libkeepalive.so
        [ec2-user@ip-172-31-30-6 ~]$ export KEEPCNT=20
        [ec2-user@ip-172-31-30-6 ~]$ export KEEPIDLE=75
        [ec2-user@ip-172-31-30-6 ~]$ export KEEPINTVL=60
        [ec2-user@ip-172-31-30-6 ~]$ strace nc -l 5000 2>&1 | grep setsockopt
        setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
        setsockopt(3, SOL_TCP, TCP_KEEPCNT, [20], 4) = 0
        setsockopt(3, SOL_TCP, TCP_KEEPIDLE, [75], 4) = 0
        setsockopt(3, SOL_TCP, TCP_KEEPINTVL, [60], 4) = 0
        setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
        setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
        setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
        setsockopt(4, SOL_TCP, TCP_KEEPCNT, [20], 4) = 0
        setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [75], 4) = 0
        setsockopt(4, SOL_TCP, TCP_KEEPINTVL, [60], 4) = 0
        setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
        ^C
        [ec2-user@ip-172-31-30-6 ~]$

After using telnet to connect to TCP port 5000, we can use the netstat (or ss) command to see that a Keepalive timer is running on the connection, which further confirms that we've successfully enabled TCP Keepalive functionality for nc.

        [ec2-user@ip-172-31-30-6 ~]$ sudo netstat -tnopea
        Active Internet connections (servers and established)
        Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name     Timer
        ...
        tcp        0      0 127.0.0.1:5000          127.0.0.1:36246         ESTABLISHED 1000       1886232    24629/nc             keepalive (71.16/0/0)
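
As a rough guide to what the three variables mean in practice: an idle connection to a dead peer should be dropped after approximately KEEPIDLE seconds plus KEEPCNT unanswered probes sent KEEPINTVL seconds apart. A small sketch of that arithmetic follows; the function name is hypothetical and the result is an approximation, not something libkeepalive reports.

```shell
# Approximate time (in seconds) before a dead peer is detected:
# idle time before the first probe, plus one interval per unanswered probe.
keepalive_timeout() {
    idle=$1; count=$2; intvl=$3
    echo $(( idle + count * intvl ))
}

keepalive_timeout 75 20 60   # KEEPIDLE, KEEPCNT, KEEPINTVL from the exports above
keepalive_timeout 60 6 10    # the sysctl values from the Problem section
```

With the values exported above that works out to roughly 21 minutes, compared to 2 minutes for the sysctl values listed in the Problem section.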

References:

[1] libkeepalive http://libkeepalive.sourceforge.net/#download

udev: renamed network interface eth0 to eth1

In the console logs you see the following message:

udev: renamed network interface eth0 to eth1

or you may see something like:

ena 0000:00:05.0 eth1: renamed from eth0

As a result, the network fails to start, and the host isn't accessible.

Solution

There are a couple of ways to solve this issue, but both will require rebooting the machine.

Short Term Solution

In the /etc/udev/rules.d directory, there is a udev rule file ending with "-persistent-net.rules". Usually the file name is prefixed with a number (such as 70), which defines the order in which udev rules are processed. Delete the file, and when the OS is next started, the file will be generated from scratch and the network interface will not be renamed to eth1.

$ sudo rm -vf /etc/udev/rules.d/70-persistent-net.rules
$ sudo reboot

If you also see the following message in the console log even though the network interface is attached as eth0, you'll need to check /etc/sysconfig/network-scripts (or the OS equivalent) to make sure that your network configuration scripts are named correctly, i.e. ifcfg-eth0 rather than ifcfg-eth1, and that the DEVICE and NAME parameters match the device name (eth0).

Bringing up interface eth1: Determining IP information for eth1... done.

An example is shown below of what it's supposed to look like:

[ec2-user@ip-172-31-28-71 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet

Long Term Solution

Depending on the operating system, in /lib/udev or /usr/lib/udev there is a shell script called "write_net_rules". Towards the bottom of this file, you'll find a section of code which renames the network device if a rule already exists for its name. To stop the issue from recurring, comment out that section of code before proceeding to the next step. When you're done, it should look like this:

#else
#        # if a rule using the current name already exists, find a new name
#        if interface_name_taken; then
#                INTERFACE="$basename$(find_next_available "$basename[0-9]*")"
#                # prevent INTERFACE from being "eth" instead of "eth0"
#                [ "$INTERFACE" = "${INTERFACE%%[ \[\]0-9]*}" ] && INTERFACE=${INTERFACE}0
#                echo "INTERFACE_NEW=$INTERFACE"
#        fi

Once you've done that, you'll need to remove the persistent-net.rules file as per the "Short Term Solution" above.

Why Is This Happening?

When you launch a new machine from an existing image or snapshot, udev will have an existing rule for eth0. An example of this rule is shown below.

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="01:23:45:67:89:ab", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

Simply put, when the conditions above are met, the device name is set to eth0.

When a new machine is launched which has not been patched, it launches with a network interface that has a different MAC address. Because of this, the new network interface will not match the above-mentioned udev rule.

The write_net_rules script will then add a new entry to the persistent-net.rules file for the new network interface. However, since a rule already exists for eth0, the script picks a device name which isn't already in use by increasing the interface number by one. This effectively forces eth0 to be renamed to eth1, because the interface fails to match the first rule and succeeds on the second.

If this is what's happening, you should see output similar to what's shown below in the persistent-net.rules file.

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="01:23:45:67:89:ab", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="ab:cd:ef:09:87:65", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

As long as the second rule exists, the machine will always boot and rename eth0 to eth1.

Also if you're wondering, Amazon Linux 1/2 disables the above-mentioned functionality in the ec2-net-utils package. This can be seen here:

https://github.com/aws/ec2-net-utils/blob/master/write_net_rules

How Does Udev Rule Matching Work?

Let's take the following udev rule and break it down.

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="01:23:45:67:89:ab", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

If a network device ( SUBSYSTEM=="net" ) is added ( ACTION=="add" ) to the system, and it's not a VLAN, i.e. eth0.130, or a sub-interface, i.e. eth0:0 ( DRIVERS=="?*" ), with a MAC address of 01:23:45:67:89:ab ( ATTR{address}=="01:23:45:67:89:ab" ), and it's an Ethernet device ( ATTR{type}=="1" ), and the kernel name of the device begins with "eth" ( KERNEL=="eth*" ), then set the name of the device to eth0 ( NAME="eth0" ).
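
When troubleshooting, the two fields that matter most in each rule are the MAC address and the name it maps to. Here is a minimal sketch that pulls those pairs out of a rules file so a stale entry is easy to spot. The helper name is made up for illustration, and the sample file simply reuses the two rules shown earlier.

```shell
# Hypothetical helper: print "MAC name" for every rule in a
# persistent-net.rules style file passed as the first argument.
list_net_rules() {
    sed -n 's/.*ATTR{address}=="\([^"]*\)".*NAME="\([^"]*\)".*/\1 \2/p' "$1"
}

# Try it against a scratch copy of the example rules from this article.
rules=$(mktemp)
cat > "$rules" <<'EOF'
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="01:23:45:67:89:ab", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="ab:cd:ef:09:87:65", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
EOF
list_net_rules "$rules"
rm -f "$rules"
```

On a real host you would point it at the file under /etc/udev/rules.d and compare the listed MAC addresses with the one on the attached interface.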

If you want to read more about udev rules, there are far better explanations on the interwebs. I recommend the following online resources.

https://linuxconfig.org/tutorial-on-how-to-write-basic-udev-rules-in-linux

http://www.linuxfromscratch.org/lfs/view/6.3/chapter07/network.html

Monday, 28 May 2018

Deploy an iPerf Speed Testing Server with Web Interface

Sometimes you just need to run network performance tests between two hosts. While there are tools available for running speed tests, such as iPerf, they're often annoying to set up, confusing to use, and require console access to the server itself in order to test.

Here's where the IITG web-based iPerf containers come into play. With this guide, you can set up iPerf servers in multiple locations and allow network support staff to perform network testing without the need to give them console access to the server itself.

Let's get started!

Installation

Install web interface, iperf-servers and iperf command

curl -L https://raw.githubusercontent.com/iitggithub/iperf-web/master/install.sh | bash

Configure iperf-web

If the URL you will be using to access the iperf web interface does not match the fully qualified domain name of your docker container host, make sure you set FQDN_SERVER_NAME to something more meaningful.

Set a variable to contain the hostname

FQDN_SERVER_NAME="`hostname`"

By default, iperf-web exposes port 80. That's not necessary since we'll use a self-signed certificate and HTTPS later on. You can manually edit the /data/iperf-web/docker-compose.yml file to stop exposing port 80 to the outside world, or just recreate it from scratch using the code below.

Reconfigure iperf-web docker-compose.yml file

sudo tee /data/iperf-web/docker-compose.yml <<EOF
server:
  image: iitgdocker/iperf-web:latest
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /data/iperf-web/images:/var/www/html/images
  environment:
    - VIRTUAL_HOST=${FQDN_SERVER_NAME}
EOF

Configure An Nginx Web Server

Assuming you don't already run your own webserver on the host, we'll use a docker container for that too. This one will automatically regenerate its configuration and reload nginx whenever a compatible container is detected. It is based on jwilder's nginx-proxy. You can find more detailed information on how to use it here: https://github.com/jwilder/nginx-proxy.

Create a self-signed SSL certificate (Optional)

mkdir -p /data/nginx/certs

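# -x509 produces a self-signed certificate rather than a signing request; -days 365 is an arbitrary validity period
openssl req -x509 -new -newkey rsa:2048 -nodes -days 365 -out /data/nginx/certs/${FQDN_SERVER_NAME}.crt -keyout /data/nginx/certs/${FQDN_SERVER_NAME}.key -subj "/C=/ST=/L=/O=IITG Blog/CN=${FQDN_SERVER_NAME}"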

Install Your Own Company Logo (Optional)

mkdir -p /data/nginx/images
cp <path_to_my_logo>.png /data/nginx/images

Install the nginx docker compose file

tee /data/nginx/docker-compose.yml <<EOF
proxy:
  image: jwilder/nginx-proxy:latest
  ports:
    - "80:80"
    - "443:443"
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
    - /data/nginx/certs:/etc/nginx/certs
EOF

Install the systemd service file

If you can make use of a systemd service file, here's the one you'll need for the nginx proxy

sudo tee /usr/lib/systemd/system/docker-nginx.service <<EOF
[Unit]
Description=Nginx web proxy
After=docker.service
Conflicts=shutdown.target

[Service]
StartLimitInterval=0
Restart=on-failure
TimeoutStartSec=0
WorkingDirectory=/data/nginx
ExecStartPre=-/usr/local/bin/docker-compose stop
ExecStartPre=-/usr/local/bin/docker-compose pull
ExecStart=/usr/local/bin/docker-compose up
ExecStop=-/usr/local/bin/docker-compose stop

[Install]
WantedBy=multi-user.target
EOF

Start the nginx proxy

sudo systemctl start docker-nginx
sudo systemctl enable docker-nginx

Restart the iperf-web server

sudo systemctl restart docker-iperf-web

Useful VMware CLI commands

If you're running on the free version of VMware ESX 5 then you know the pangs of having to manually perform operations which are much easier to do with VMware Essentials or higher licenses.

This isn't really a guide; it's more a collection of useful command line utilities which you can use to save yourself some time.

Initiate auto-shutdown of VMs

/bin/vmware-autostart.sh stop

Note: This requires VMware Tools to be installed and running on the VM, and auto shutdown rules to be configured

Initiate auto power on of VMs

/bin/vmware-autostart.sh start

Note: This requires VMware Tools to be installed and running on the VM, and auto shutdown rules to be configured

Unregister multiple VMs

for vm in <vm_name>; do vim-cmd /vmsvc/unregister /vmfs/volumes/<volume_name>/${vm}/${vm}.vmx; done

Note: This is the same as "Remove from inventory"

Register multiple VMs

for vm in <vm_name>; do vim-cmd /solo/registervm /vmfs/volumes/<volume_name>/${vm}/${vm}.vmx; done

Note: This is the same as "Add to inventory"

Power on multiple VMs

for vm in <vm_name>; do vim-cmd /vmsvc/power.on /vmfs/volumes/<volume_name>/${vm}/${vm}.vmx; done

Note: This is the same as selecting them and pressing CTRL+B

Rename a VM

/vmfs/volumes/rename_vm.sh <datastoreName> <directory_name_of_copied_VM> <old_vm_name> <new_vm_name>

Note: You'll need to download the script from the iitggithub page and scp it to the VM host datastore.

...and that's all we have so far. If you have any more tips and tricks, let me know in the comments below.

TCP Window Size Tuning Guide (Linux)

There are many ways to tune TCP window sizes. Below, I show you how I do it. Feel free to comment.
First of all, a few things need to be known prior to tuning. They are:

The amount of bandwidth available (in Kb/s)
The average ping response time between the source and destination hosts (in ms)
The maximum TCP segment size (in bytes)

For the purpose of this article, let's set some numbers of our own

Bandwidth (Kb/s)	20,000
Ping Time (ms)	340
Max Segment Size (bytes)	1300

Note: To make full use of receive or even transmit window size tuning, BOTH hosts should be tuned.

First of all, we need to find the current settings for the following parameters:

sysctl net.ipv4.tcp_window_scaling

sysctl net.ipv4.tcp_slow_start_after_idle

sysctl net.ipv4.tcp_rmem

sysctl net.ipv4.tcp_wmem

sysctl net.core.rmem_default

sysctl net.core.wmem_default

sysctl net.core.rmem_max

sysctl net.core.wmem_max

sysctl net.core.optmem_max

sysctl net.core.netdev_max_backlog

sysctl net.ipv4.tcp_congestion_control

sysctl net.ipv4.tcp_timestamps

sysctl net.ipv4.tcp_sack

As an example, this is the output when running those commands on our test system:

net.ipv4.tcp_window_scaling = 1

net.ipv4.tcp_slow_start_after_idle = 1

net.ipv4.tcp_rmem = 4096 87380 4194304

net.ipv4.tcp_wmem = 4096 16384 4194304

net.core.rmem_default = 124928

net.core.wmem_default = 124928

net.core.rmem_max = 124928

net.core.wmem_max = 124928

net.core.optmem_max = 20480

net.core.netdev_max_backlog = 1000

net.ipv4.tcp_congestion_control = cubic

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_sack = 1

We’re only going to do receive window tuning so ignore the wmem parameters (it’s the same process anyway but requires bi-directional access in order to test).

Calculating The Optimal Window Size

First we need to find our BDP or Bandwidth Delay Product.

Start by multiplying the amount of bandwidth available (in our case, 20000 Kb/s) by the ping response time (340 ms). This gives us 6800000 bits. Divide this by 8 and we have our BDP in bytes, which is 850000.

Now we need to find our unscaled window value.

Take 65535 (the largest window the 16-bit TCP window field can advertise without scaling) and divide it by our MSS (1300), then round the result down to the nearest even number. 65535 / 1300 is 50.41153846153846, which rounds down to 50. Then multiply this value by 1300 (our MSS) to find the optimal unscaled window value, which is 65000.

Still following? Hope so.

Multiply 65000 by 2 repeatedly until it is larger than our BDP (which is 850000) and you should arrive at 1040000, which is your optimal window size. I don’t use 1040000 directly because it isn’t a multiple of 1024. I usually divide it by 1024 and round the result up to the nearest whole number. In our case, 1040000 / 1024 is 1015.625, which when rounded up is 1016. 1024 * 1016 = 1040384, which after all that is now our default window size. That’s just the method I use, and you can do more research if you want to, but I find it easy to remember, especially since 1024 is usually the default minimum value given when you dump the current configuration.
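
The steps above can be sketched as a small shell function. This is just the arithmetic from this article, not an official formula, and the function and variable names are made up for illustration.

```shell
# Sketch of the window-size arithmetic described above.
# Arguments: bandwidth in Kb/s, ping time in ms, MSS in bytes.
window_size() {
    bw_kbit=$1; rtt_ms=$2; mss=$3
    bdp=$(( bw_kbit * rtt_ms / 8 ))         # Kb/s * ms = bits; / 8 = bytes
    segs=$(( 65535 / mss ))                 # whole segments that fit in 65535
    segs=$(( segs - segs % 2 ))             # round down to an even number
    [ "$segs" -gt 0 ] || return 1           # MSS too large for this method
    win=$(( segs * mss ))                   # unscaled window value
    while [ "$win" -le "$bdp" ]; do         # double until larger than the BDP
        win=$(( win * 2 ))
    done
    echo $(( (win + 1023) / 1024 * 1024 ))  # round up to a multiple of 1024
}

window_size 20000 340 1300
```

Run with the example numbers from this article, it prints 1040384, the same default window size arrived at by hand.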

We also need to set a minimum and a maximum window size. For the maximum, to save you some time and effort, use 16MB (16777216), which is the maximum for a 1Gb/s local network link. Of course, if the server has a faster local network link, i.e. 10Gb/s, the maximum should be at least double that.

The minimum receive window size can be set wherever you want, but set it too high and it will cause problems because there isn’t enough memory available to handle your minimum.

The Optimised Values

Note that this also includes write optimisation (wmem) values because I find myself referencing this documentation a lot.

# Turn on automatic TCP window size scaling

sysctl -w net.ipv4.tcp_window_scaling=1

# Disable TCP slow start after idle

sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# Set the min, default and maximum receive window sizes used during auto tuning

sysctl -w net.ipv4.tcp_rmem='20480 1040384 16777216'

sysctl -w net.ipv4.tcp_wmem='20480 1040384 16777216'

# Set the default socket buffer sizes here as well. TCP takes its default from tcp_rmem/tcp_wmem above; these apply to other protocols

sysctl -w net.core.rmem_default=1040384

sysctl -w net.core.wmem_default=1040384

# Set the maximum socket buffer sizes here as well. These cap what an application can request with SO_RCVBUF/SO_SNDBUF

sysctl -w net.core.rmem_max=16777216

sysctl -w net.core.wmem_max=16777216

# Set the maximum buffer size allowed per socket

# I recommend just setting this to your maximum window size

sysctl -w net.core.optmem_max=16777216

# Sets the maximum number of packets that will be buffered if the kernel can’t keep up

# There’s no real method, I just set it to something that’s a lot higher than default

sysctl -w net.core.netdev_max_backlog=65536

# Set the congestion control algorithm. Not sure which one is better but cubic seems like it’s better than reno

sysctl -w net.ipv4.tcp_congestion_control=cubic

# Enable timestamps as defined in RFC1323

sysctl -w net.ipv4.tcp_timestamps=1

# Enable selective acknowledgements (SACK)

sysctl -w net.ipv4.tcp_sack=1

# Force all new TCP connections to use the above settings

sysctl -w net.ipv4.route.flush=1

You can (and definitely should) store these values in a file like so:

tee /etc/sysctl.d/tcp_optimisations.conf <<EOF
# Turn on automatic TCP window size scaling
net.ipv4.tcp_window_scaling = 1

# Disable TCP slow start after idle
net.ipv4.tcp_slow_start_after_idle = 0

# Set the min, default and maximum receive window sizes used during auto tuning
net.ipv4.tcp_rmem = 20480 1040384 16777216
net.ipv4.tcp_wmem = 20480 1040384 16777216

# Set the default socket buffer sizes here as well. TCP takes its default from tcp_rmem/tcp_wmem above; these apply to other protocols
net.core.rmem_default = 1040384
net.core.wmem_default = 1040384

# Set the maximum socket buffer sizes here as well. These cap what an application can request with SO_RCVBUF/SO_SNDBUF
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Set the maximum buffer size allowed per socket
# I recommend just setting this to your maximum window size
net.core.optmem_max = 16777216

# Sets the maximum number of packets that will be buffered if the kernel can't keep up
# There's no real method, I just set it to something that's a lot higher than default
net.core.netdev_max_backlog = 65536

# Set the congestion control algorithm. Not sure which one is better but cubic seems like it's better than reno
net.ipv4.tcp_congestion_control = cubic

# Enable timestamps as defined in RFC1323
net.ipv4.tcp_timestamps = 1

# Enable selective acknowledgements (SACK)
net.ipv4.tcp_sack = 1

# Force all new TCP connections to use the above settings
net.ipv4.route.flush = 1
EOF

... and that's it! Happy tuning!

OpenZFS/Dedupe Put To The Ultimate Test

Overview

The aim of this task is to re-evaluate the viability of ZFS in our environment. Concerns were raised over the original test results, such as:

IO statistics grossly exceed expectations (Could have been influenced by system RAM)
Multiple ZFS deadlocks occurring (Possibly due to CPU/RAM contention on the test system)
Disk usage savings did not meet or exceed expectations (Possibly due to limiting the amount of user data copied to the machine)

Although the hardware is exactly the same as the previous ZFS test hardware, the results from this round of testing cannot be merged with previous results due to:

Changes in drive/raid configuration
ZFS record size is set to default (128K)
Test server does not have access to the Nimble CS215 SAN

Hardware Configuration

The hardware being used for this test is a Dell PowerEdge R720 server with the following hardware configuration:

2 x Intel Xeon E5-2667 (3.30GHz, 15MB Cache)
Intel C600 Chipset
Memory - 32 GB (4 x 8GB) 1600MHz DDR3 Registered RDIMMs
CentOS 7.1 operating system
PERC H710p Integrated RAID Controller, 1GB NVRAM
OS storage configuration - 600GB raw storage consisting of 2 x 600GB 15K RPM SAS 6Gbps 2.5in drives in RAID 1
Data Storage configuration - 1.8TB raw storage consisting of 6 x 600GB 15K RPM SAS 6Gbps 2.5in drives in RAID 10

ZFS Benchmark Setup

Because we're using the Dell PERC H710p Integrated RAID Controller, we can't configure ZFS exactly how it prefers to be configured. That would require JBOD mode, which the H710p card doesn't support. To remedy this, we would need to purchase a PowerVault MD12xx direct attached storage enclosure, which includes a PERC H810 RAID Adapter card that does support JBOD mode, or (even better) an HBA where ZFS has complete control over the entire world. The more information ZFS has about the disks it's using, the better it's able to manage and assess the health of the disks. You can read all about it at http://open-zfs.org/wiki/Hardware#Hardware_RAID_controllers.

We'll also be performing tests with ZFS primarycache set to "all", "metadata" and "none". Just so you are aware, these are the commands you'll need to know in order to change the primarycache setting for the entire pool.

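Changing the primarycache to "all"

zfs set primarycache=all zfs

Changing the primarycache to "metadata"

zfs set primarycache=metadata zfs

Changing the primarycache to "none"

zfs set primarycache=none zfs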

Create the ZFS pool using /dev/sdb for testing purposes. This will create a ~1.62TB storage pool.

zpool create zfs /dev/sdb

Create a ZFS volume with de-duplication turned on and atime turned off, as recommended by some best practices guides.

zfs create -o dedup=on -o atime=off zfs/data

Mount the ZFS volume at /data

zfs set mountpoint=/data zfs/data

Test Data

All access to /data is severed to ensure the data it contains remains consistent and the tests are not influenced by end users.

There are two types of test data:

Compile data
User data

Compile Data

This is a piece of software to be compiled. To be fair, this could be any piece of software which can be compiled under Linux, but the more organisation-specific it is, the more relatable it will be to the end users. Since this machine will store software that needs to be compiled anyway, it makes sense to perform a couple of timed compilation tests.

The test data is stored under /data (wherever is convenient) and will be accessed from a remote "helper" server via NFS. More on that later.

User Data

This is the reason why we're here: how much of this data can fit into 1.8TB of raw storage? Our users' data (known herein as user data) will be migrated via rsync to the test server. How much exactly? Enough to fill 100% of the space available. That should be quite a bit... we might not have enough data.

Testing Process

A total of 3 tests will be performed, all of which are outlined below:

Compilation Speed
Disk Space Consolidation
IO Tests

Compilation Speed

This test aims to compare ZFS using an organisation-specific metric that everyone is able to understand: software compilation times. The compilation test is executed three times and timed, and the resulting run times are then averaged. flock is used to make sure that only one compilation runs at a time, so that runs don't interfere with each other.

This test requires the use of a "helper" server. This server should be unused and ideally connected to the same network switch as the test server. This is to ensure that the testing environment remains consistent throughout the test.

Also, since we'll be accessing the data via NFS, it's important to set the mount options exactly as they are in production. Failure to do so could impact test results.

Mount the test directory via NFS

sudo mkdir -p /mnt/compile_test && sudo mount -t nfs -o "vers=3,rsize=8192,wsize=8192,soft" zfs:/data/compile_test /mnt/compile_test

Testing Procedure

From within the testing directory, execute the following:

flock /tmp/compile.lock -c "make veryclean && time make -j `grep processor -c /proc/cpuinfo`"

Results

The results show that there is very little performance difference when using ZFS with in-line deduplication. This is very interesting because the previous compile tests showed a 25% difference, which I now believe was because we were attempting a CPU-intensive workload on the same machine which was running ZFS. For me, it's validation that we should never use a ZFS server for anything other than ZFS unless it's absolutely necessary.

	Run 1	Run 2	Run 3	Average	% Difference
EXT4 Logical Volume	366	380	371	372	+00.00%
ZFS Primary Cache: all	388	364	370	374	+00.54%
ZFS Primary Cache: metadata	520	599	568	562	+51.08%
ZFS Primary Cache: none	626	654	634	638	+71.51%
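
The % Difference column is each average relative to the EXT4 average. A minimal sketch of that calculation, assuming awk is available (the function name is hypothetical):

```shell
# Percentage difference of an average run time against a baseline average.
pct_diff() {
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%+.2f%%\n", (val - base) / base * 100 }'
}

pct_diff 372 374   # ZFS Primary Cache: all
pct_diff 372 562   # ZFS Primary Cache: metadata
pct_diff 372 638   # ZFS Primary Cache: none
```

These reproduce the figures in the table, apart from the leading zero padding.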

Disk Space Consolidation

Given the original filesystem was ext4 based, obviously, it had 0 disk space consolidation savings. What we're really after is a figure that tells us how much user data we've been able to cram into our ZFS volume.

ZFS

Since we only have 1.8TB of raw space and some of that gets eaten up by system overheads, we're left with 1.62TB of available disk space for the ZFS pool. The ZFS pool seems to reserve some of this space so the pool can never grow beyond 90% capacity. This leaves us with only 1.46TB of usable disk space.

When deduplication and/or compression is enabled, it's hard to know exactly how much of that 1.46TB you've really used. The command below will tell you but to be honest, if the df command tells you the disk is full, it's damn well full.

This command will tell you how much of your 1.46TB has actually been consumed.

zpool get allocated zfs

Results

As was explained earlier, we could only fill the pool to just under 90% capacity. As you can see from the graph below, we managed to squeeze an impressive 5.19TB of real user data onto our 1.46TB of available disk space. That's a 71.87% reduction with a deduplication ratio of 3.59x! At the time of writing (and for comparison purposes) we have 8TB of user data currently spread across two hosts with roughly 300GB saved due to the in-line compression provided by the Nimble SAN.

	Disk Usage (GB)	% Difference
Local (EXT4)	1565	+00.00%
ZFS w/ de-dupe on	5190	-69.85%
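
The capacity and reduction figures in this section can be reproduced with a couple of one-liners. This is a sketch only: the function names are hypothetical and the 0.9 factor is simply the 90% ceiling described above, not a value read from the pool.

```shell
#!/usr/bin/env bash
# usable_tb AVAILABLE_TB : space left once the pool's 90% ceiling is applied.
usable_tb() {
    awk -v a="$1" 'BEGIN { printf "%.2f\n", a * 0.9 }'
}

# reduction_pct LOGICAL PHYSICAL : percentage saved by storing LOGICAL data in PHYSICAL space.
reduction_pct() {
    awk -v l="$1" -v p="$2" 'BEGIN { printf "%.2f\n", (l - p) / l * 100 }'
}

usable_tb 1.62            # prints 1.46
reduction_pct 5.19 1.46   # prints 71.87
reduction_pct 5190 1565   # prints 69.85, the magnitude shown in the table
```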

IO Testing

The basis for the IO testing is mostly stolen from http://www.storagereview.com/synology_rackstation_rs10613xs_review and http://www.storagereview.com/fio_flexible_i_o_tester_synthetic_benchmark. The FIO tests are customised slightly for our environment and were originally contained within one FIO test file. They've since been split into separate test files.

The test names are important because my shitty bash script uses them to pull out the information and populate CSV files to make it easier to copy and paste into this report.

A useful link to help you understand fio stats can be found here: http://tfindelkind.com/2015/08/24/fio-flexible-io-tester-part8-interpret-and-understand-the-resultoutput/

Below is the 4K Random 100% write test file.

[global]
ioengine=libaio
bs=4k
# This must be set to 0 for ZFS. 1 for all others.
direct=${DIRECT}
# This must be set to none for ZFS. posix for all others.
fallocate=${FALLOCATE}
rw=randrw
# Make sure fio will refill the IO buffers on every submit rather than just init
refill_buffers
# Setting to zero in an attempt to stop ZFS from skewing results via de-dupe.
#dedupe_percentage=0
# Setting to zero in an attempt to stop both ZFS and Nimble from skewing results via compression.
buffer_compress_percentage=0
norandommap
randrepeat=0
rwmixread=70
runtime=60
ramp_time=5
group_reporting
directory=${DIRECTORY}
filename=fio_testfile
time_based=1
runtime=60
[16t-rand-write-16q-4k]
name=4k100writetest-16t-16q
rw=randrw
bs=4k
rwmixread=0
numjobs=16
iodepth=16

In order to make use of the file, you'll need fio and libaio-devel installed. There's no rpm for fio so you need to download it and compile it yourself.

To execute a ZFS FIO test use the following

export DIRECT=0
export FALLOCATE=none
export DIRECTORY=<zfs_mount>
fio <FioTestFile>

To execute a non-ZFS FIO test (i.e. EXT4) use the following

export DIRECT=1
export FALLOCATE=posix
export DIRECTORY=<fs_mount>
fio <FioTestFile>
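
To avoid exporting the wrong values by accident, the two cases above could be wrapped up as follows. This is a sketch under assumptions: fio_settings and run_fio are hypothetical names, and anything that isn't "zfs" is treated the same way as EXT4.

```shell
#!/usr/bin/env bash
# fio_settings FSTYPE : prints the DIRECT and FALLOCATE values described above.
fio_settings() {
    case "$1" in
        zfs) echo "DIRECT=0 FALLOCATE=none" ;;
        *)   echo "DIRECT=1 FALLOCATE=posix" ;;
    esac
}

# run_fio FSTYPE DIRECTORY JOBFILE : sets the variables for a single fio run only,
# so nothing is left behind in the calling shell's environment.
run_fio() {
    # fio_settings output is deliberately unquoted so it splits into two assignments.
    env $(fio_settings "$1") DIRECTORY="$2" fio "$3"
}

# Usage (mount point and job file name are placeholders):
#   run_fio zfs /path/to/zfs_mount 4k-random-write.fio
```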

Results

What do the results say? Well, the default value for primarycache is "all". The results tell you that modifying this value is a bad idea so leave it set to the default value and instead buy enough RAM to store everything in memory. De-duplicated data is considered "metadata" so it will fight for a piece of your cache. In addition to this, you also need space for the actual hash table and whatever else ZFS stores in its cache. A more detailed explanation can be found here: http://open-zfs.org/wiki/Performance_tuning#Deduplication but the basic rule of thumb is more RAM = less problems.

The results also show that in the majority of cases, ZFS with the primary cache set to "all" performed better than the EXT4 logical volume. This was expected since the primary cache is using RAM after all.

4K Random 100% Read/Write Test [Throughput]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	51680	6545
ZFS Primary Cache: all	43770	56024
ZFS Primary Cache: metadata	3386	502
ZFS Primary Cache: none	1552	498

4K Random 100% Read/Write Test [Average Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	4	39
ZFS Primary Cache: all	5	4
ZFS Primary Cache: metadata	76	509
ZFS Primary Cache: none	165	509

4K Random 100% Read/Write Test [Max Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	733	739
ZFS Primary Cache: all	287	52
ZFS Primary Cache: metadata	1357	1435
ZFS Primary Cache: none	402	3080

4K Random 100% Read/Write Test [Standard Deviation]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	10	27
ZFS Primary Cache: all	9	5
ZFS Primary Cache: metadata	36	229
ZFS Primary Cache: none	32	349

8K Sequential 100% Read/Write Test [Throughput]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	86532	106972
ZFS Primary Cache: all	492187	189733
ZFS Primary Cache: metadata	7496	22249
ZFS Primary Cache: none	6758	22798

8K Sequential 100% Read/Write Test [Average Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	2	2
ZFS Primary Cache: all	1	1
ZFS Primary Cache: metadata	34	12
ZFS Primary Cache: none	37	11

8K Sequential 100% Read/Write Test [Max Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	260	23
ZFS Primary Cache: all	742	199
ZFS Primary Cache: metadata	1275	425
ZFS Primary Cache: none	2092	459

8K Sequential 100% Read/Write Test [Standard Deviation]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	3	0
ZFS Primary Cache: all	3	1
ZFS Primary Cache: metadata	25	13
ZFS Primary Cache: none	41	11

128K Sequential 100% Read/Write Test [Throughput]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	20830	25769
ZFS Primary Cache: all	287079	29784
ZFS Primary Cache: metadata	4456	35186
ZFS Primary Cache: none	4606	32107

128K Sequential 100% Read/Write Test [Average Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	12	9
ZFS Primary Cache: all	1	8
ZFS Primary Cache: metadata	57	7
ZFS Primary Cache: none	55	7

128K Sequential 100% Read/Write Test [Max Latency]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	119	40
ZFS Primary Cache: all	26	57
ZFS Primary Cache: metadata	236	86
ZFS Primary Cache: none	125	255

128K Sequential 100% Read/Write Test [Standard Deviation]

	16 Thread 16 Queue 100% Read	16 Thread 16 Queue 100% Write
EXT4 Logical Volume	5	2
ZFS Primary Cache: all	1	5
ZFS Primary Cache: metadata	27	5
ZFS Primary Cache: none	26	7

8K Random 70% Read 30% Write Test [Throughput]

	2 Threads 2 Queues	2 Threads 4 Queues	2 Threads 8 Queues	2 Threads 16 Queues	4 Threads 2 Queues	4 Threads 4 Queues	4 Threads 8 Queues	4 Threads 16 Queues	8 Threads 2 Queues	8 Threads 4 Queues	8 Threads 8 Queues	8 Threads 16 Queues	16 Threads 2 Queues	16 Threads 4 Queues	16 Threads 8 Queues	16 Threads 16 Queues
EXT4 Logical Volume	4004	4504	5767	6943	4328	5718	6838	7658	5694	6883	7719	9284	6754	7675	9375	9227
ZFS Primary Cache: all	50117	37957	45180	45199	46851	56708	56576	66382	38537	46762	46456	40270	38447	42264	40095	40322
ZFS Primary Cache: metadata	4699	345	355	333	562	528	498	496	833	810	837	779	846	798	723	887
ZFS Primary Cache: none	3943	351	343	345	556	474	456	542	665	695	672	746	916	864	1009	968

8K Random 70% Read 30% Write Test [Average Latency]

	2 Threads 4 Queues	2 Threads 8 Queues	2 Threads 16 Queues	4 Threads 2 Queues	4 Threads 4 Queues	4 Threads 8 Queues	4 Threads 16 Queues	8 Threads 2 Queues	8 Threads 4 Queues	8 Threads 8 Queues	8 Threads 16 Queues	16 Threads 2 Queues	16 Threads 4 Queues	16 Threads 8 Queues	16 Threads 16 Queues
EXT4 Logical Volume	1	1	3	1	1	3	5	1	3	5	10	3	5	9	23
ZFS Primary Cache: all	0	0	0	0	0	0	0	0	0	1	3	0	1	3	6
ZFS Primary Cache: metadata	23	45	96	14	31	64	129	19	40	77	165	39	82	179	288
ZFS Primary Cache: none	23	47	92	15	34	71	118	25	47	95	171	36	35	128	266

8K Random 70% Read 30% Write Test [Max Latency]

	2 Threads 2 Queues	2 Threads 4 Queues	2 Threads 8 Queues	2 Threads 16 Queues	4 Threads 2 Queues	4 Threads 4 Queues	4 Threads 8 Queues	4 Threads 16 Queues	8 Threads 2 Queues	8 Threads 4 Queues	8 Threads 8 Queues	8 Threads 16 Queues	16 Threads 2 Queues	16 Threads 4 Queues	16 Threads 8 Queues	16 Threads 16 Queues
EXT4 Logical Volume	95	138	166	299	142	162	326	544	211	230	520	1333	302	581	888	1589
ZFS Primary Cache: all	808	661	266	204	31	28	86	15	34	38	36	41	30	52	81	64
ZFS Primary Cache: metadata	914	770	693	963	327	551	865	1597	472	559	758	947	1570	2082	2458	2015
ZFS Primary Cache: none	1465	444	927	942	713	1478	4246	1202	1487	1523	1453	1597	1352	1195	1138	1346

8K Random 70% Read 30% Write Test [Standard Deviation]

	2 Threads 2 Queues	2 Threads 4 Queues	2 Threads 8 Queues	2 Threads 16 Queues	4 Threads 2 Queues	4 Threads 4 Queues	4 Threads 8 Queues	4 Threads 16 Queues	8 Threads 2 Queues	8 Threads 4 Queues	8 Threads 8 Queues	8 Threads 16 Queues	16 Threads 2 Queues	16 Threads 4 Queues	16 Threads 8 Queues	16 Threads 16 Queues
EXT4 Logical Volume	1	2	3	7	2	3	7	14	4	7	15	29	7	14	24	39
ZFS Primary Cache: all	1	1	0	0	0	0	0	0	0	0	0	0	1	1	2	4
ZFS Primary Cache: metadata	5	22	27	61	14	29	61	102	20	36	63	101	62	140	274	248
ZFS Primary Cache: none	8	17	41	51	21	63	128	68	58	83	145	138	56	45	109	191

Final Thoughts

All of the solutions below either support inline deduplication/compression or will in the very near future. I've had a couple of them priced just to give people an idea. The Dell pricing is indicative whereas the pricing for the Nimble SAN controller upgrade/All Flash Array is current as of May 9th 2017.

Obviously, more thought should go into picking one of these solutions than just price. Other than the amount of raw disk space each solution provides, what's below really isn't an apples to apples comparison. The SAN based solutions include other tools like 4-hour support, predictive analytics, proactive monitoring, easy to use storage management tools, capacity planning and much more, which do help to justify the additional expense. An in-house solution will mean in-house support and even with my current knowledge of ZFS, I'm not 100% confident that I could fix it in an emergency 100% of the time within a reasonable time frame.

With ZFS we still need to be concerned about its RAID configuration. From what I've read so far, it's not as complicated as it is for most servers/SANs (Nimble excluded), but it's still one more thing to learn and understand. While most Systems Administrators would be able to come up with a sensible configuration by themselves, I'm not sure all of them would. Of course, you can ask yourself the question "How often does one need to do that?" and it's safe to say not often. It's still something worth keeping in mind though.

Ultimately, ZFS would be very useful for us IF used correctly but the risks make it difficult to justify its use for production systems/storage in my opinion. These risks should be mitigated as OpenZFS matures but it's not quite there yet.

Hardware Recommendations

To recap, these are the hardware stats for the test server so you don't have to scroll up.

2 x Intel Xeon E5-2667 (3.30GHz, 15MB Cache)
Intel C600 Chipset
Memory - 32 GB (4 x 8GB) 1600MHz DDR3 Registered RDIMMs
CentOS 7.1 operating system
PERC H710p Integrated RAID Controller, 1GB NVRAM
OS storage configuration - 600GB raw storage consisting of 2 x 600GB 15K RPM SAS 6Gbps 2.5in drives in RAID 1
Data Storage configuration - 1.8TB raw storage consisting of 6 x 600GB 15K RPM SAS 6Gbps 2.5in drives in RAID 10

The test server performed quite well during testing. I'd probably recommend dual CPUs. ZFS runs quite a few processes so being able to run them on their own logical core is quite handy.

RAM-wise, I'd probably recommend going with 64GB of RAM to begin with. Having said that, if you can add more RAM, do it. The more RAM you have, the more data you can cache in it.

If you can stretch the budget a bit more, an SSD drive (one is ok) can be used as a secondary cache. This will help reduce the need to read from the spinning disks directly.

Finally, and this is an absolute MUST HAVE in my opinion: you need a SAS controller card capable of JBOD mode. While we didn't use one in testing (because we didn't have one), you really should look into it. ZFS has its own storage management system and it complements the ZFS filesystem perfectly. Let ZFS manage the RAID and you'll be rewarded.

Price Comparisons

Below is a price comparison of some of the options we have on the table at the time of writing. All prices are in AUD and are inclusive of GST.

Option	Price
Dell PowerVault 1220 DAS (11TB)	18157 *
Nimble Controller Upgrade (More Storage)	52953
Nimble All Flash Array (11TB)	73164

* Estimated Price