FS: file system
GC: garbage collector
PP: protection period
Scope: This is a tutorial aimed at optimizing the NILFS2 FS. It is in no way static or unquestionable: FS optimization always depends on a number of factors, including the machine, usage pattern, kind of data usually handled and other system settings. So I'm counting on the input of everyone interested in this specific FS, which intrigued me for quite some time until I finally decided to give it a shot.
Introduction to NILFS2:
More about NILFS2:
Some interesting benchmarks:
Why tune it?
NILFS2 never deletes data. It writes data sequentially until a specified minimum threshold (i.e. of free space) is reached. At this point a GC is launched to release reclaimable segments (see next section), from the oldest to the newest. The GC then stops cleaning when the maximum threshold is reached. Optionally, the GC can run continuously (see next section).
This design allows us to draw some conclusions:
- it provides an easy way to keep a chronological log of changes, with easy snapshot creation and respective RO mounts (won't be covered here, but see the lscp example after this list), as long as the GC isn't run continuously;
- if the GC is run continuously, it can still provide protection for a given period of time (see next section);
- if the GC isn't run continuously, it has a large performance impact over time: eventually the GC needs to run frequently to keep available space above the specified threshold, especially when large write operations are required:
- under these conditions, it isn't appropriate for small and/or frequently filled partitions, because this means the GC needs to make room for new writes very often;
- the GC is slow by default (see next section), which probably aims at a low footprint, but makes write operations a painful process:
- increasing the threshold probably reduces the performance impact (needs to be tested) but triggers the GC sooner;
- reducing the threshold delays the GC but makes writes a lot more painful when the FS is running low on free space;
- it's a good idea to keep free space from dropping below the specified threshold, as this avoids triggering the GC.
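To illustrate the chronological log: nilfs-utils ships lscp, which lists the checkpoints NILFS2 creates as data is written. The output below is only illustrative (made-up device and values):

$ lscp /dev/sda2
                 CNO        DATE     TIME  MODE  FLG      NBLKINC       ICNT
                   1  2018-03-14 23:52:16   cp    -            11          3
                   2  2018-03-14 23:58:45   cp    -          1024          5

Each checkpoint can be turned into a snapshot and mounted RO, but, as said above, that won't be covered here.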
NILFS2 has its own tools (nilfs-utils) and provides a configuration file to tune the GC behaviour. The configuration file is discussed in the next section, and some of the tools are discussed in the sections that follow.
Configuration file parameters: the GC configuration file is located at /etc/nilfs_cleanerd.conf by default. I'll explain and present my view on the most important parameters this file contains (more info on this file).
About segments:
- segments are groups of sequential sectors;
- lssu is a tool which allows inspecting segments (see the example at the end of this section);
- nilfs-tune -l [device] allows one to query the superblock and retrieve some valuable information:
nilfs-tune 2.2.7
Filesystem volume name:   Data
Filesystem UUID:          8648d408-3b28-4f1e-8edb-3fe0bd09071e
Filesystem magic number:  0x3434
Filesystem revision #:    2.0
Filesystem features:      (none)
Filesystem state:         invalid or mounted
Filesystem OS type:       Linux
Block size:               4096
Filesystem created:       Wed Mar 14 23:52:16 2018
Last mount time:          Mon Mar 19 10:36:30 2018
Last write time:          Tue Mar 20 01:27:57 2018
Mount count:              17
Maximum mount count:      50
Reserve blocks uid:       0 (user root)
Reserve blocks gid:       0 (group root)
First inode:              11
Inode size:               128
DAT entry size:           32
Checkpoint size:          192
Segment usage size:       16
Number of segments:       20360
Device size:              170799398912
First data block:         1
# of blocks per segment:  2048
Reserved segments %:      5
Last checkpoint #:        1127
Last block address:       17356924
Last sequence #:          8475
Free blocks count:        24336384
Commit interval:          0
# of blks to create seg:  0
CRC seed:                 0x5d79decc
CRC check sum:            0xae05331e
CRC check data size:      0x00000118
Here, I can see the logical block is 4KB in size and each segment has 2048 blocks (8MB in size). This is important when tuning the GC parameters.
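And here's lssu, as promised. Again, the output is only illustrative (made-up values); the STAT column shows per-segment state flags and NBLOCKS the number of blocks in use:

$ lssu /dev/sda2
           SEGNUM        DATE     TIME STAT     NBLOCKS
                0  2018-03-14 23:52:16  -d-        2048
                1  2018-03-14 23:55:01  -d-        2048
                2  2018-03-19 10:36:30  ad-          45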
protection_period
- the garbage collector never cleans data newer than this relative time value, even if the FS needs more free space;
- nilfs-clean can always be run with a custom, or no, protection period;
- this value is set in seconds and defaults to 3600 (1 hour).
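For instance, a one-off cleaning pass with a custom 1-hour PP, or with no protection at all (the device is just an example; the 1h suffix follows the syntax used later in this tutorial):

$ sudo nilfs-clean -p 1h /dev/sda2
$ sudo nilfs-clean -p 0 /dev/sda2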
min_clean_segments / max_clean_segments
- these values are the minimum and maximum thresholds (i.e. of free space), respectively;
- the garbage collector starts cleaning the FS when less than min_clean_segments are available and stops cleaning when more than max_clean_segments are available;
- if min_clean_segments is set to zero, max_clean_segments is ignored and the GC runs continuously, always respecting the PP;
- these values can be set as an absolute number of segments or as a percentage of the total drive capacity:
- the defaults are 10% and 20%, respectively.
nsegments_per_clean
- number of segments cleared in each cleaning step;
- this value, in conjunction with cleaning_interval, dictates the cleaning speed of the GC when the available space is between min_clean_segments and max_clean_segments;
- the default value is 2:
- for 8MB segments (see "About segments"), this translates into a maximum of 16MB per clean cycle.
mc_nsegments_per_clean
- same as nsegments_per_clean when available space is below min_clean_segments;
- the default value is 4:
- for 8MB segments (see "About segments"), this translates into a maximum of 32MB per clean cycle.
cleaning_interval
- time interval, in seconds, between each cleaning step;
- this value, in conjunction with nsegments_per_clean, dictates the cleaning speed of the GC when the available space is between min_clean_segments and max_clean_segments;
- the default value is 5s:
- for 16MB cleaned per cycle (see nsegments_per_clean), this translates into 3.2MB/s (around 5min to clean 1GB).
mc_cleaning_interval
- same as cleaning_interval when available space is below min_clean_segments;
- the default is 1s:
- for 32MB cleaned per cycle (see mc_nsegments_per_clean), this translates into 32MB/s (around 30s to clean 1GB).
retry_interval
- time interval between a failed cleaning step and a new try;
- I inserted this parameter here because the manual states failures can happen due to high system load. I wonder if the GC is postponed during a transaction if there's enough available space (needs to be tested).
min_reclaimable_blocks
- minimum number of reclaimable blocks in a segment so it can be cleaned, when available space is between min_clean_segments and max_clean_segments;
- this can be set as an integer value or a percentage of the total number of blocks per segment;
- the default value is 10%:
- for 2048 4KB blocks per segment (see "About segments"), this translates into a minimum of 819KB in a segment so it can be reclaimed.
mc_min_reclaimable_blocks
- same as min_reclaimable_blocks when available space is less than min_clean_segments;
- the default value is 1%:
- for 2048 4KB blocks per segment (see min_reclaimable_blocks), this translates into a minimum of 82KB in a segment so it can be reclaimed.
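Putting it all together, here's a sketch of the relevant part of /etc/nilfs_cleanerd.conf with the defaults discussed above (the retry_interval value is my recollection of the man page default; double-check against your own file):

protection_period          3600
min_clean_segments         10%
max_clean_segments         20%
nsegments_per_clean        2
mc_nsegments_per_clean     4
cleaning_interval          5
mc_cleaning_interval       1
retry_interval             60
min_reclaimable_blocks     10%
mc_min_reclaimable_blocks  1%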
So, what's better for desktop usage? In my opinion:
- the GC shouldn't run continuously, in order to keep resource usage low, especially on laptops;
- the GC should operate fast enough that it doesn't stall a write operation when space runs short;
- min_clean_segments shouldn't be too low in order to avoid lack of space during a write operation;
- max_clean_segments shouldn't be too close to min_clean_segments in order to avoid running the GC too often;
- max_clean_segments shouldn't be too far from min_clean_segments in order to avoid running the GC for too long (speed also counts here);
- free space should be kept higher than min_clean_segments for the longest time possible.
Please take this reasoning with a grain of salt. It's too early for me to draw any definite conclusions, as all of this needs to be tested first.
So, in my case:
- the Data partition is 160GB, which means that, by default, min_clean_segments and max_clean_segments are set to 16GB and 32GB, respectively. Will I normally perform a 16GB transfer? No, I don't think so. I'll keep these defaults for now, as the GC operation isn't likely to collide with a transfer;
- regarding the number of segments and the cleaning interval, I'll keep it as is now, but I'll probably tweak it in the future, after running some tests:
- the tweak will probably be in direction of increasing the GC speed;
- note the above speeds presume all cleaned segments are full, which is a best case scenario (i.e. the actual speed will likely be lower than the estimated values above);
- I still have some doubts regarding nsegments_per_clean and mc_nsegments_per_clean (i.e. are these maximum values, or will no cleaning happen if fewer than this many segments are reclaimable?);
- minimum reclaimable blocks will also be maintained for now:
- note NILFS2 is very prone to fragmentation, because only changes to the file system are saved, which often leads to files being scattered across several segments as they get altered. That's why this FS only makes sense on a device with a fast seek time;
- on one hand, reclaiming smaller segments (with fewer blocks filled) avoids fragmentation and can increase the chance of finding available segments;
- on the other hand, reclaiming larger segments (with more blocks filled) makes the cleaning process faster without the need to reduce the cleaning interval too much.
A systemd unit to keep it slim: as you noticed, I haven't changed any defaults yet, but there's something bugging me... how do I keep free space from reaching min_clean_segments (unnecessarily, I mean) without changing the other parameters? Well, nilfs-clean can be run manually, after all. I can even define a custom PP. It's just a matter of automating this task.
Why not change the parameters?
Well, I don't want the GC running continuously, and I want to keep a relatively long history, but 1 hour of protection seems fine in case I'm running low on space.
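In other words, something as simple as this, run by hand (the device is mine; adjust to yours):

$ sudo nilfs-clean -p 1M /dev/sda2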
systemd handles device mounts through .mount units (more about this). A list of the active units of this type can be obtained:
[mbb@mbb-laptop ~]$ systemctl list-units -t mount
UNIT                          LOAD   ACTIVE SUB     DESCRIPTION
-.mount                       loaded active mounted Root Mount
boot-efi.mount                loaded active mounted /boot/efi
dev-hugepages.mount           loaded active mounted Huge Pages File System
dev-mqueue.mount              loaded active mounted POSIX Message Queue File System
media-Data_160GB.mount        loaded active mounted /media/Data_160GB
mnt-Data.mount                loaded active mounted /mnt/Data
proc-sys-fs-binfmt_misc.mount loaded active mounted Arbitrary Executable File Formats File System
run-user-1000-gvfs.mount      loaded active mounted /run/user/1000/gvfs
run-user-1000.mount           loaded active mounted /run/user/1000
sys-fs-fuse-connections.mount loaded active mounted FUSE Control File System
sys-kernel-config.mount       loaded active mounted Kernel Configuration File System
sys-kernel-debug.mount        loaded active mounted Kernel Debug File System
tmp.mount                     loaded active mounted /tmp

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

13 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
Mount units must be named after the mount point directories they control.
If you look at the output above, media-Data_160GB.mount is mounted at /media/Data_160GB and mnt-Data.mount is mounted at /mnt/Data.
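If in doubt about how a mount point maps to a unit name, systemd-escape can do the conversion:

$ systemd-escape -p --suffix=mount /mnt/Data
mnt-Data.mount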
Mount points created at runtime (independently of unit files or /etc/fstab) are also monitored by systemd and appear like any other mount unit.
Well, if you know a systemd mount unit will exist for each mount, and you also know how it's named, then it's very easy to write a unit which depends on a specific mount:
## nilfs2-autoclean unit file for /mnt/Data

[Unit]
Description=Trigger nilfs-clean on /mnt/Data
Requires=mnt-Data.mount
After=mnt-Data.mount

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c "nilfs-clean -p 1M $(cat /proc/mounts | grep /mnt/Data | cut -d ' ' -f1)"

[Install]
WantedBy=mnt-Data.mount
Note that the ExecStart line could have been a lot simpler (e.g. ExecStart=/usr/bin/nilfs-clean -p 1M /dev/sda2). I took this approach because I have other plans and want to keep this in mind (see next section).
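As a side note, findmnt resolves the backing device a bit more robustly than grepping /proc/mounts (which could match substrings of other mount points); an untested variant of the same line:

ExecStart=/usr/bin/sh -c "nilfs-clean -p 1M $(findmnt -n -o SOURCE /mnt/Data)"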
And that's it for now. Name this unit <some-name>.service, copy it to /etc/systemd/system/, then enable and start it. Your FS will be cleaned upon mount with the specified PP (in this case 1 month). However, if the drive ever fills and you suddenly need space, you'll still keep a record up to the PP specified in /etc/nilfs_cleanerd.conf (in this case 1 hour).
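For example (the unit name is just my choice):

# cp nilfs2-autoclean-data.service /etc/systemd/system/
# systemctl daemon-reload
# systemctl enable --now nilfs2-autoclean-data.service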
Future work: my intention is to write a tool to automate the management of systemd units like the one above for every desired NILFS2 FS. My plan is to have a script which takes mount points and PPs as arguments, writes them to a configuration file and creates the corresponding systemd units. The units are then triggered upon mount and call the script with a different option. The script then checks the FS type and, in case it's NILFS2, reads the configuration file and calls nilfs-clean with the specified device and PP.
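A rough, untested sketch of what I have in mind (all names, paths and the configuration format are my own choices):

#!/bin/sh
# nilfs-autoclean: manage per-mount nilfs-clean triggers (sketch, untested)
CONF=/etc/nilfs-autoclean.conf

case "$1" in
add)
    # nilfs-autoclean add <mount-point> <protection-period>
    mnt=$2; pp=$3
    mount_unit=$(systemd-escape -p --suffix=mount "$mnt")
    name=nilfs-autoclean-$(systemd-escape -p "$mnt").service
    # remember the PP for this mount point
    printf '%s %s\n' "$mnt" "$pp" >>"$CONF"
    # generate a unit like the one in the previous section
    cat >"/etc/systemd/system/$name" <<EOF
[Unit]
Description=Trigger nilfs-clean on $mnt
Requires=$mount_unit
After=$mount_unit

[Service]
Type=oneshot
ExecStart=/usr/local/bin/nilfs-autoclean run $mnt

[Install]
WantedBy=$mount_unit
EOF
    systemctl daemon-reload
    systemctl enable "$name"
    ;;
run)
    # called by the unit once the mount is up
    mnt=$2
    # only act on NILFS2 file systems
    [ "$(findmnt -n -o FSTYPE "$mnt")" = nilfs2 ] || exit 0
    dev=$(findmnt -n -o SOURCE "$mnt")
    pp=$(awk -v m="$mnt" '$1 == m { print $2 }' "$CONF")
    exec nilfs-clean -p "$pp" "$dev"
    ;;
esac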
...as some of you may know by now, it will probably be a long time before I get back to this, as I often leave ideas/intentions hanging due to lack of time. Nevertheless, the idea is here to be discussed and possibly taken up by someone else.
Conclusion: NILFS2 is a very straightforward, effective and safe FS. The way I see it, this FS is very good for data partitions, especially in shared environments. Many organizations use shared FSs for work, where many different people access and manipulate files. NILFS2 provides an easy and safe way to track changes and recover deleted content in such an environment. However, this FS has its downsides and can get slow on cramped partitions or on slow systems with little cache.
For desktop use, NILFS2 should be tuned to reclaim used space faster and avoid reaching the minimum threshold. This tutorial presents a first attempt at achieving those objectives and will be updated to reflect results and any additional optimizations that might emerge.
Thank you all for taking the time to read this. Please point out any misconceptions and share your own reasoning on the matter. I'm not at all sure of many of the claims I made in this text.
Update: Used space reported by the system
Update: Accounting snapshots on usage computations
Update: Retrieve history for a particular file
Update: Script to retrieve the history of a particular file