AKA: The Pragmatic Neckbeard 2: Talk ZFS to me
In this installment, we're going to talk ZFS on Arch Linux. ZFS is your friend with the huge house. ZFS is your friend with the fast car. ZFS is also your friend who's a bit high-maintenance. ZFS takes a bit of work, but the benefits (in my opinion) outweigh the work required for someone who's on a budget.

Why isn't ZFS default then?

ZFS is a bit of a bear to work with because of the way that it is installed on your system. ZFS and Linux have incompatible licenses. This prevents a distribution from packaging Linux with ZFS. That doesn't mean you can't build ZFS modules on your own and include them in a kernel. There's a difference between distributing binaries and distributing a script that lets you combine them yourself. From my research it appears that by compiling ZFS into your kernel for personal use, you are not in breach of Linux's GPLv2 license (I am not a lawyer, correct me if I'm wrong). That's why ZFS isn't the default, among other reasons.

Cool, how can I get it?

Getting ZFS is actually quite easy on Arch. Start by making sure your installation is up to date, then install the AUR package zfs-dkms.

By now, you must be thinking: "Wow, that was easy. I thought you said this was going to be a pain?!"

That's true. I did say that. Let's talk about updates. At the time of writing this, Linux, DKMS and ZFS don't quite jibe properly and every time your kernel updates, DKMS fails to install the ZFS modules. This can leave your system unable to load ZFS, and so unable to import your pool, on the next boot. The solution is to run sudo dkms autoinstall every time you install a new kernel or upgrade your existing one, and DKMS will recompile and install the ZFS and SPL modules into your system.
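One way to tell whether that rebuild is still needed is to look for the ZFS module in a kernel's module directory. This is only a minimal sketch: has_zfs_module is a name I made up for illustration, and it assumes modules live under /usr/lib/modules/<kernel release>, which is where Arch keeps them.

```shell
# has_zfs_module DIR
# Succeeds if a zfs kernel module file (zfs.ko, possibly compressed)
# exists anywhere under DIR. Hypothetical helper, not part of DKMS or ZFS.
has_zfs_module() {
    [ -n "$(find "$1" -name 'zfs.ko*' 2>/dev/null)" ]
}

# Assumed layout: Arch keeps modules in /usr/lib/modules/<kernel release>.
moddir="/usr/lib/modules/$(uname -r)"
if has_zfs_module "$moddir"; then
    echo "zfs module found for $(uname -r)"
else
    echo "no zfs module for $(uname -r); try: sudo dkms autoinstall"
fi
```

Keep in mind that right after a kernel upgrade, uname -r still reports the old kernel until you reboot, so you'd want to point the helper at the new kernel's directory instead.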

Awesome! Can I mkfs.zfs yet?

Nope. Actually, you will never mkfs.zfs. ZFS doesn't work that way.

Let's talk about ZFS on a conceptual level before we get into functional demonstrations.

How ZFS works

Think of ZFS not as a filesystem, but as a package of tools for dealing with storage. At its core, ZFS accomplishes four basic things: it provides a file system, snapshot management, volume management and pool management.

Think of ZFS like a beefed-up software RAID without the downsides of software RAID. Let's say we have ten 2TB drives. We also have two 128GB SSDs. From the ten drives, we can build an array with 16TB of storage. This is accomplished by creating the equivalent of a RAID 6 array, which ZFS calls raidz2. This means we sacrifice two drives' worth of capacity (4TB) for parity, and as a consequence, any two drives in the array can fail without losing any data.
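If you want to run the same numbers for your own drives, the arithmetic is simple enough to sketch in a couple of lines of shell. The function name is mine, and the result is only a rough figure: it ignores the space ZFS needs for metadata and padding.

```shell
# raidz_usable_tb DRIVES SIZE_TB PARITY
# Prints the approximate usable capacity of a single raidz group:
# (drives - parity) * size. Ignores metadata and padding overhead.
raidz_usable_tb() {
    echo $(( ($1 - $3) * $2 ))
}

raidz_usable_tb 10 2 2   # ten 2TB drives in raidz2 -> prints 16
```

Pass 1 or 3 as the last argument to see what raidz1 or raidz3 would leave you with instead.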

Now, we haven't allocated our two SSDs yet. Let's do that now. ZFS allows you to use your SSDs to cache often-accessed data and to hold data destined to be written to the array.

As I mentioned in a previous article, ZFS doesn't suffer from the "write hole" issue that plagues many RAID solutions. This comes from its copy-on-write design, which never overwrites live data in place. A related feature is the ZFS Intent Log (ZIL for short). Essentially, this feature records synchronous writes (the intent and the data itself) before ZFS commits that data to the main array. The ZIL normally resides on the array itself, but if we specify a dedicated "log" disk for this, in this case one of our SSDs, synchronous writes can be acknowledged at the speed of the SSD. For those writes, this gives you the benefit of having the speed of an SSD and the capacity of your spinning media.

Corrections: It's been brought to my attention that the paragraph above has been written as potentially misleading. To clarify the way the ZIL or SLOG works, it's best to read through this thread on the freenas forums. Thanks @Vitalius!

The cache works in a similar way. It stores copies of the most frequently accessed data on the array so that it's more easily accessible. This requires RAM to operate, but it's worth it in the end because this means that your most frequently accessed data can be read at full SSD speed.

Now, let's say we've nearly filled up our 16TB of storage and want to add more. No problem! The way zpools work, you can have multiple raidz objects within one pool and it will still display as one filesystem. This essentially means that ZFS will function as if you had used Linux software raid (mdadm) and LVM in conjunction to create one awesome pool, but that's not the end of it.

Let's say you've got a dataset in your ZFS pool that you are going to make some major changes to, but you aren't sure if that's what you're going to want in the end. ZFS will allow you to take snapshots and roll back to them, or even mount a snapshot separately from the main dataset.

Snapshots in ZFS are awesome, so let's drill down and explain in detail how they work. For this example, let's say we have a dataset named home and a brand new snapshot named test. So it will look like:

tank                20G
tank/home           20G
tank/home@test       8K

As you can see, the snapshot, test, has no difference in data from the main dataset and is taking up only 8K of additional space. This is the magic of copy-on-write. When a file is modified or deleted within the home dataset, ZFS writes the new data to a different physical location on disk, as is standard with copy-on-write, and updates the dataset's pointers to the new location. It then looks at the old data blocks, sees that the snapshot still needs them, and doesn't free that space. This is why snapshots grow over time: they take ownership of the original blocks of their parent dataset when data is changed.

ZFS also supports sending and receiving data as either full or incremental streams. This means you can take daily snapshots of a dataset and send just the changes to another computer with a ZFS pool over standard tools like netcat or ssh.

It's important to note that ZFS can be very rigid in some aspects. When it comes to physical disk configuration, ZFS trails behind its competitor, Btrfs, in that the RAID level and disk configuration cannot be reconfigured without destroying and recreating the zpool.

Using ZFS

Now that we've got a bit of knowledge about how ZFS works, we can move on to the fun stuff. Let's start by outlining my setup, so that those who are following can make proper adjustments.

I'm using two 3TB Western Digital Reds and a single 120GB Samsung 850 Pro for the array. I'm going to be over-provisioning the SSD and only using 100GB. This will allow wear leveling to occur more evenly, in theory. I've partitioned my SSD into two partitions, one 80GB partition to be used for the ZFS cache and one 20GB partition to be used for the ZIL. I'm going to be striping the two 3TB drives (the equivalent of RAID 0) because I have no mission-critical data on these drives. If you're using two drives and intend to store important data on them, put them in a mirror configuration. Remember you can always add storage at a later point.

So, let's look at my disks:

# fdisk -l
Disk /dev/sdc: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: D30AA847-F3AB-4B45-A69F-D522C16DB1B0

Device          Start        End    Sectors  Size Type
/dev/sdc1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sdc9  5860515840 5860532223      16384    8M Solaris reserved 1

Disk /dev/sdd: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7f8ac676

Device     Boot Start       End   Sectors   Size Id Type
/dev/sdd1  *     2048 250067789 250065742 119.2G 83 Linux

Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6495F007-EFC5-9D4C-95B2-499ABBFABC9E

Device          Start        End    Sectors  Size Type
/dev/sdb1        2048 5860515839 5860513792  2.7T Solaris /usr & Apple ZFS
/dev/sdb9  5860515840 5860532223      16384    8M Solaris reserved 1

Disk /dev/sda: 119.2 GiB, 128035676160 bytes, 250069680 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2eea4214

Device     Boot    Start       End   Sectors Size Id Type
/dev/sda1           2048  41945087  41943040  20G 83 Linux
/dev/sda2       41945088 209717247 167772160  80G 83 Linux

As you can see, I'm using a second 120GB SSD for my OS partition. I'm doing this so that if something goes terribly wrong with the ZFS array, I can still boot my PC and log in as root (or any other user that's not set up with $HOME=/home/*).

Now, my ZFS SSD is showing up as /dev/sda, and my two 3TB devices are /dev/sdb and /dev/sdc. This makes things easy. Let's create our pool.

# zpool create -O mountpoint=/media/tank -o ashift=12 tank /dev/sdb /dev/sdc log /dev/sda1 cache /dev/sda2

The -O mountpoint=/media/tank option tells ZFS to mount the pool's root dataset at /media/tank. Note the capital -O: mountpoint is a filesystem property, and lowercase -o is reserved for pool properties such as ashift. If this option is not specified, you'll wind up with a new folder in / with the same name as your zpool. As you can see, I've named my pool tank.

The -o ashift=12 option tells the pool to use 4096-byte sectors (2^12 = 4096) instead of relying on the 512-byte logical sector size the drives report. This will improve performance significantly on my system because the optimal I/O size is 4096.
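The ashift value is just the base-2 logarithm of the sector size you want ZFS to use, so you can work it out from the physical sector size that fdisk printed earlier. Here's a small sketch of that calculation; ashift_for is a helper name I invented, not a ZFS tool.

```shell
# ashift_for SECTOR_BYTES
# Prints the base-2 logarithm of a power-of-two sector size, which is the
# value ZFS expects for ashift. Fails if the size is not a power of two.
ashift_for() {
    size=$1
    shift_bits=0
    [ "$size" -ge 1 ] || return 1
    while [ "$size" -gt 1 ]; do
        [ $(( size % 2 )) -eq 0 ] || return 1
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

ashift_for 4096   # prints 12
ashift_for 512    # prints 9
```

My 3TB drives report a physical sector size of 4096 bytes, which is where the 12 comes from.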

Let's have a look at how tank's set up now.

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  sdb       ONLINE       0     0     0
	  sdc       ONLINE       0     0     0
	logs
	  sda1      ONLINE       0     0     0
	cache
	  sda2      ONLINE       0     0     0

errors: No known data errors

You should see something like this. Periodically checking in on this output will show you when you have issues with a disk in the array.
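If you'd rather not eyeball the full status every time, zpool status -x prints a single line, "all pools are healthy", when nothing is wrong. The sketch below wraps that in a check you could drop into a cron job or login script. The pools_healthy function is my own naming, and the real check only runs if the zpool command is actually installed.

```shell
# pools_healthy
# Reads the output of `zpool status -x` on stdin and succeeds only if it
# is the single line ZFS prints when nothing is wrong.
pools_healthy() {
    grep -qx 'all pools are healthy'
}

# Only run the real check where the zpool command exists.
if command -v zpool >/dev/null 2>&1; then
    if zpool status -x | pools_healthy; then
        echo "ok"
    else
        echo "a pool needs attention; run: zpool status"
    fi
fi
```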

One last thing we need to do to get ZFS to work after boot is enable zfs.target and zfs-import-cache.service so that everything is configured properly on boot.

# systemctl enable zfs.target
# systemctl enable zfs-import-cache.service

ZFS has two main commands that you'll use while working with a pool. It's split up very logically. If you're operating on the pool itself, you'll use the zpool command. If you're operating on the data within the pool, you'll use the zfs command. You can read the manpages for each if you're interested in finding out more about each command. Be warned that you can easily spend an hour in the manpages when working with ZFS.

Creating datasets

Now that we've got these 6TB of storage available to use, let's mount our /tmp on it.

# zfs create -o mountpoint=/tmp -o setuid=off -o sync=disabled -o devices=off tank/tmp

This will create a new zfs dataset, try to mount it at /tmp and set some security options on it.

-o sync=disabled will improve the performance of the /tmp filesystem, at the cost of data integrity on that dataset in the event of a sudden shutdown. Keep in mind that this won't affect the integrity of the rest of ZFS, just /tmp.

If you're using Arch Linux (or derivatives like Manjaro or Antergos), like I recommended in the previous article, you'll need to mask systemd's tmp.mount.

# systemctl mask tmp.mount

On restart, your system will mount the ZFS dataset on /tmp instead of the ramdisk that is used by default. It's important to note that /tmp will not be mounted if ZFS fails to load. Likewise, if the underlying /tmp directory already has contents, ZFS may not be able to mount a dataset there.

I also have my /home on a ZFS dataset, just to keep user data separate from the system. I'll go through the steps for doing that. First, log out of your normal user and log in as root. We need to make sure that no files in /home are open.

First step is to move /home somewhere else, but keep it on the same hard drive.

# mv /home /home-old

If you're only logged in as root and you get an error about files being in use, try restarting your PC.

Now we can create our home dataset:

# zfs create -omountpoint=/home tank/home

ZFS will automatically mount the dataset on /home and we can begin to copy our data back:

# rsync -av --progress /home-old/ /home/

Once all our data is sync'd across, we can delete the old home directory and make a snapshot of the current status of our home directory, for safe keeping.

# rm -rf /home-old
# zfs snapshot tank/home@initial-copy

We've now successfully set up our zpool.

Let's have a look at our filesystem usage:

# zfs list -t all
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    80.9G  5.19T    96K  /mnt/tank
tank/home               21.8G  5.19T  21.8G  /home
tank/home@initial-copy      0      -  21.8G  -
                        -- snip --
tank/tmp                3.06G  46.9G  3.06G  /tmp

This will show you a list of all datasets, zvols and snapshots that are on pools in your system.

Let's say you want to back up your home directory. ZFS makes that easier than ever. You've already got a snapshot, so I'll just send that to my FreeNAS server, for example. Create a new dataset on the server to hold your backups first. Setting gzip compression on the dataset will help keep the size of the data down.

# zfs send tank/home@initial-copy | ssh root@backup-host "zfs receive -F storage/backups/home"

This command will send all the data, over ssh, to your newly created dataset on the remote server (backup-host here is a stand-in for your own server's hostname or address). Keep in mind that you'll get a max of ~100MB/s on gigabit ethernet, so a large dataset can easily take a couple of hours to transfer. That's why we have incremental data transfers.
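To get a feel for how long a full send will take, you can do the division yourself. This is a back-of-the-envelope sketch that assumes a steady transfer rate and treats 1 GiB as 1024 MB, which is close enough for planning. The function name is made up.

```shell
# transfer_seconds SIZE_GIB RATE_MB_PER_S
# Rough whole-second estimate of how long a send takes at a steady rate.
# Treats 1 GiB as 1024 MB, which is close enough for planning.
transfer_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

transfer_seconds 1024 100   # 1 TiB at ~100MB/s -> prints 10485 (about 2.9 hours)
```

So a small home directory goes over in minutes, but once you're into terabytes, a full send really does eat an afternoon.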

Let's say we've worked for a few days and we've accumulated more data:

# zfs list -t all
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    83.0G  5.19T    96K  /mnt/tank
tank/home               23.9G  5.19T  23.8G  /home
tank/home@initial-copy  67.1M      -  21.8G  -
                        -- snip --
tank/tmp                3.06G  46.9G  3.06G  /tmp

Now as you can see, the snapshot is holding 67M of data on its own, meaning that much of the original data has since been changed or deleted, and about 2G of additional data has been created since the last snapshot. Let's snapshot again and send the data again.

# zfs snapshot tank/home@another-snapshot

Now that we've got our next snapshot, let's send only the changed data, to reduce the network usage. This is a bit more complicated, because we have to reference both snapshots. Essentially, we reference the snapshot that both sides have, and the snapshot we want the remote server to receive. This allows ZFS to calculate which blocks to send over to make both sides match.

# zfs send -i tank/home@initial-copy tank/home@another-snapshot | ssh root@backup-host "zfs receive -F storage/backups/home"

So, the argument to -i is the earlier snapshot and the final argument to send is the latest snapshot, if that wasn't obvious from the above command (backup-host is a stand-in for your own server's hostname). You'll notice that the transfer happens significantly faster because it's transferring less data overall. Whenever possible, use incremental transfers when backing up data over the network.
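If you end up doing this regularly, it can help to have something assemble the command for you so you don't mix up the two snapshots. Here's a minimal sketch that only prints the pipeline rather than running it. The function name and the root@backup-host target are placeholders of mine, so swap in your own.

```shell
# incremental_send_cmd DATASET OLD_SNAP NEW_SNAP REMOTE TARGET
# Prints (does not run) the pipeline for an incremental send. OLD_SNAP must
# already exist on both sides; NEW_SNAP is the one the remote should end up with.
incremental_send_cmd() {
    printf 'zfs send -i %s@%s %s@%s | ssh %s "zfs receive -F %s"\n' \
        "$1" "$2" "$1" "$3" "$4" "$5"
}

# root@backup-host is a placeholder for your own backup server.
incremental_send_cmd tank/home initial-copy another-snapshot \
    root@backup-host storage/backups/home
```

Printing first and running second is a cheap way to double-check the snapshot order before anything goes over the wire.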

What happens if you want to get data back from a snapshot that's since changed? We've got two different ways to do this. The first way is the nuclear option, rolling back. ZFS allows you to roll back an entire dataset to a previous point in time. This is simple and effective, but it also has drawbacks. This means that you'll destroy any snapshots and data in between the target snapshot and your current state. It's pretty messy, but can be done if you don't care about that data.
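Before rolling back, it's worth seeing exactly which snapshots you'd lose. The sketch below takes a list of snapshot names, oldest first, and prints everything newer than the one you want to roll back to. The snapshots_after helper is my own invention, and the sample names are the ones from this article.

```shell
# snapshots_after TARGET
# Reads snapshot names on stdin, oldest first, and prints the ones that
# come after TARGET, i.e. the ones a rollback to TARGET would destroy.
snapshots_after() {
    awk -v target="$1" 'found { print } $0 == target { found = 1 }'
}

printf '%s\n' tank/home@initial-copy tank/home@another-snapshot |
    snapshots_after tank/home@initial-copy   # prints tank/home@another-snapshot
```

On a real system you could feed it from zfs list -H -d 1 -t snapshot -o name -s creation tank/home, and the rollback itself would be zfs rollback -r tank/home@initial-copy, where -r is what permits destroying the newer snapshots.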

The second option is a bit more tedious, but is a cleaner method: mounting the snapshot and copying data. You can do this with a simple mount command:

# mount -t zfs tank/home@initial-copy /mnt/snapshot

It can be unmounted with:

# umount /mnt/snapshot

Now you can traverse the snapshot and copy data you need from it. This is beneficial in many ways, but it's a bit more time consuming, so your preferred method will heavily depend on the situation.

So we've covered most of the basics of ZFS so far, but what if I decide I don't want a dataset or snapshot anymore? That's easy. There's zfs destroy that will allow us to destroy snapshots or datasets in the same way:

# zfs destroy tank/home@initial-copy

This will destroy the snapshot, without confirmation, so be careful when using this command. It won't operate on a dataset that has snapshots or child datasets unless you pass the -r argument, which tells zfs destroy to operate recursively on the dataset, destroying any children or snapshots it encounters within that dataset. The -R argument goes further and also destroys dependent clones outside that dataset.
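Since there's no confirmation prompt, I'd suggest previewing a destroy first. zfs destroy accepts -n for a dry run and -v to list what would be removed. The wrapper below is a sketch. ZFS_CMD is a made-up override that I set to echo here so the example only prints the command line instead of touching a pool.

```shell
# preview_destroy DATASET
# Asks zfs what a recursive destroy would remove, without removing it.
# -n is a dry run, -v lists what would be destroyed, -r includes children
# and snapshots. ZFS_CMD is a made-up override used only for demonstration.
preview_destroy() {
    ${ZFS_CMD:-zfs} destroy -n -v -r "$1"
}

# With ZFS_CMD=echo the command line is printed instead of being executed.
ZFS_CMD=echo
preview_destroy tank/home   # prints: destroy -n -v -r tank/home
```

Leave ZFS_CMD unset to run the real dry run against your pool.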

ZFS volumes

Now that we've spent about 3000 words talking about native datasets, let's talk about virtual block devices or volumes, which ZFS calls zvol.

Creating a zvol is very similar to creating a dataset. All you need to do is specify -V followed by the desired size of the volume. You can also use -s to create a "sparse" volume, which doesn't take up space on the pool until it's needed. This is known as thin provisioning. To create a thinly provisioned zvol, you can use the following command:

# zfs create -s -V 50G tank/windows-vm

This will create a 50GB zvol in the pool tank and it will show up as /dev/zvol/tank/windows-vm. You can format and mount it like you would any physical disk, or you could use it to back a virtual machine, which we'll go over in part 4.

zvols can utilize snapshots in the same way that a normal dataset can, so we won't go into it. Just know that you can snapshot and roll back in the same way that you would with a normal dataset.

Before you go hog wild and start making zvols all over the place, it's critically important that you understand recommended block sizes and how they fit in with ZFS. ZFS has a default zvol block size of 8K. Linux filesystems (ext4, xfs, btrfs) automatically detect this and align themselves to it, but NTFS and other Windows filesystems don't. You can fix this by using the -b option and specifying a block size in bytes. The default NTFS cluster size is 4K, so using -b 4096 will result in a properly aligned NTFS partition for your zvol. If you're creating a new partition from the Windows installer GUI (you're setting up an OS drive), you'll want to do this. Otherwise, just leave off the -b option and format the zvol from the disk management utility with an NTFS cluster size of 8192. If the guest's clusters are smaller than the zvol's blocks, every small write from the guest turns into a read-modify-write of a larger ZFS block, and performance suffers.
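To make the -b advice concrete, here's a sketch that puts together the create command for a given guest cluster size. It only prints the command, and it refuses block sizes that aren't a power of two of at least 512 bytes. zvol_create_cmd is a name I made up for this example.

```shell
# zvol_create_cmd NAME SIZE BLOCK_BYTES
# Prints (does not run) a command that creates a sparse zvol whose block
# size matches the guest filesystem's cluster size.
# Rejects block sizes that are not a power of two of at least 512 bytes.
zvol_create_cmd() {
    n=$3
    [ "$n" -ge 512 ] || return 1
    while [ "$n" -gt 1 ]; do
        [ $(( n % 2 )) -eq 0 ] || return 1
        n=$(( n / 2 ))
    done
    echo "zfs create -s -V $2 -b $3 $1"
}

zvol_create_cmd tank/windows-vm 50G 4096
# prints: zfs create -s -V 50G -b 4096 tank/windows-vm
```

For the Windows installer case that gives you the 4K-aligned zvol described above.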


I hope you got something out of this article. As I mentioned before, I'm not much of a writer, so I'm working on improving my skills, which is one reason why I'm doing this series in the first place. As I get better, I'll go back and fix the earlier guides and make sure that they flow better and are easy to understand.

I also plan on releasing supplemental content on how zfs is tuned for different workloads, but that's a ways out. I'm still working out certain kinks on my own system.