Tuning the NILFS2 file system

Used abbreviations:

FS: file system
GC: garbage collector
PP: protection period

Scope: This is a tutorial aimed at optimizing the NILFS2 FS. It is in no way static or unquestionable: FS optimization always depends on a number of factors, including the machine, the usage pattern, the kind of data usually handled and other system settings. So I'm counting on the input of everyone interested in this specific FS, which intrigued me for quite some time until I finally decided to give it a shot.

Introduction to NILFS2:

More about NILFS2:

Some interesting benchmarks:

Why tune it?
NILFS2 never deletes data directly. It writes data sequentially until a specified minimum threshold (e.g. of free space) is reached. At this point a GC is launched to release reclaimable segments (see next section), from the oldest to the newest. The GC then stops cleaning when the maximum threshold is reached. Optionally, the GC can run continuously (see next section).

This design allows us to draw some conclusions:

  • it provides an easy way to keep a chronological log of changes, with easy snapshot creation and respective RO mounts (won't be covered here), as long as the GC isn't run continuously;
  • if the GC is run continuously, it can still provide protection for a given period of time (see next section);
  • if the GC isn't run continuously, it has a large performance impact over time, because it reaches a point where the GC needs to run frequently to keep available space above the specified threshold, especially when large write operations are required:
    • under these conditions, it isn't appropriate for small and/or frequently filled partitions, because this means the GC needs to make room for new writes very often;
    • the GC is slow by default (see next section), which probably aims at a low footprint but makes write operations a painful process:
      • increasing the threshold probably reduces the performance impact (needs to be tested) but triggers the GC sooner;
      • reducing the threshold delays the GC but makes writes a lot more painful when the FS is running low on free space;
      • it's a good idea to keep free space from dropping below the specified threshold, as this avoids triggering the GC.

NILFS2 has its own tools (nilfs-utils) and provides a configuration file to tune the GC behaviour. The configuration file is discussed in the next section, and some of the tools are covered in the sections that follow.

Configuration file parameters: the GC configuration file is located at /etc/nilfs_cleanerd.conf by default. I'll explain and present my view on the most important parameters this file contains (more info on this file); a sample configuration with these defaults is shown right after the parameter list.

About segments:
  • segments are groups of sequential sectors;
  • lssu is a tool which allows inspecting segments;
  • nilfs-tune -l [device] allows one to query the superblock and retrieve some valuable information:
nilfs-tune 2.2.7
Filesystem volume name:   Data
Filesystem UUID:          8648d408-3b28-4f1e-8edb-3fe0bd09071e
Filesystem magic number:  0x3434
Filesystem revision #:    2.0
Filesystem features:      (none)
Filesystem state:         invalid or mounted
Filesystem OS type:       Linux
Block size:               4096
Filesystem created:       Wed Mar 14 23:52:16 2018
Last mount time:          Mon Mar 19 10:36:30 2018
Last write time:          Tue Mar 20 01:27:57 2018
Mount count:              17
Maximum mount count:      50
Reserve blocks uid:       0 (user root)
Reserve blocks gid:       0 (group root)
First inode:              11
Inode size:               128
DAT entry size:           32
Checkpoint size:          192
Segment usage size:       16
Number of segments:       20360
Device size:              170799398912
First data block:         1
# of blocks per segment:  2048
Reserved segments %:      5
Last checkpoint #:        1127
Last block address:       17356924
Last sequence #:          8475
Free blocks count:        24336384
Commit interval:          0
# of blks to create seg:  0
CRC seed:                 0x5d79decc
CRC check sum:            0xae05331e
CRC check data size:      0x00000118

Here, I can see the logical block is 4KB in size and each segment has 2048 blocks (8MB in size). This is important when tuning the GC parameters.
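
As a quick sanity check, the segment size can be derived directly from the superblock fields above. A minimal sketch, assuming the device is /dev/sda2 and the field names shown in the nilfs-tune output:

# segment size = block size * blocks per segment, parsed from the superblock dump
BLOCK=$(sudo nilfs-tune -l /dev/sda2 | awk -F: '/^Block size/ {gsub(/ /,"",$2); print $2}')
BLKSEG=$(sudo nilfs-tune -l /dev/sda2 | awk -F: '/blocks per segment/ {gsub(/ /,"",$2); print $2}')
echo "Segment size: $(( BLOCK * BLKSEG / 1024 / 1024 )) MB"   # 4096 * 2048 / 1024^2 = 8MB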

  • protection_period

    • the garbage collector never cleans data newer than this relative time value, even if the FS needs more free space;
    • nilfs-clean can always be run with a custom, or no, protection period;
    • this value is set in seconds and defaults to 3600 (1 hour).
  • min_clean_segments / max_clean_segments

    • these values are the minimum and maximum thresholds (e.g. of free space), respectively;
    • the garbage collector starts cleaning the FS when less than min_clean_segments are available and stops cleaning when more than max_clean_segments are available;
    • if min_clean_segments is set to zero, max_clean_segments is ignored and the GC runs continuously, always respecting the PP;
    • these values can be set in terms of FS space or a percentage of the total drive capacity:
      • the defaults are 10% and 20%, respectively.
  • nsegments_per_clean

    • number of segments cleared in each cleaning step;
    • this value, in conjunction with cleaning_interval, dictates the cleaning speed of the GC when the available space is between min_clean_segments and max_clean_segments;
    • the default value is 2:
      • for 8MB segments (see "About segments"), this translates into a maximum of 16MB per clean cycle.
  • mc_nsegments_per_clean

    • same as nsegments_per_clean when available space is below min_clean_segments;
    • the default value is 4:
      • for 8MB segments (see "About segments"), this translates into a maximum of 32MB per clean cycle.
  • cleaning_interval

    • time interval, in seconds, between each cleaning step;
    • this value, in conjunction with nsegments_per_clean, dictates the cleaning speed of the GC when the available space is between min_clean_segments and max_clean_segments;
    • the default value is 5s:
      • for 16MB cleaned per cycle (see nsegments_per_clean), this translates into 3.2MB/s (around 5min to clean 1GB).
  • mc_cleaning_interval

    • same as cleaning_interval when available space is below min_clean_segments;
    • the default is 1s:
      • for 32MB cleaned per cycle (see mc_nsegments_per_clean), this translates into 32MB/s (around 30s to clean 1GB).
  • retry_interval

    • time interval between a failed cleaning step and a new try;
    • I inserted this parameter here because the manual states failures can happen due to high system load. I wonder if the GC is postponed during a transaction if there's enough available space (needs to be tested).
  • min_reclaimable_blocks

    • minimum number of reclaimable blocks in a segment so it can be cleaned, when available space is between min_clean_segments and max_clean_segments;
    • this can be set as an integer value or a percentage of the total number of blocks per segment;
    • the default value is 10%:
      • for 2048 4KB blocks per segment (see "About segments"), this translates into a minimum of 819KB in a segment so it can be reclaimed.
  • mc_min_reclaimable_blocks

    • same as min_reclaimable_blocks when available space is less than min_clean_segments;
    • the default value is 1%:
      • for 2048 4KB blocks per segment (see min_reclaimable_blocks), this translates into a minimum of 82KB in a segment so it can be reclaimed.
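
For reference, here is a minimal sketch of /etc/nilfs_cleanerd.conf holding the defaults discussed above (my own summary, not a verbatim copy of the file shipped with nilfs-utils; retry_interval is left out because I didn't quote its default):

# defaults as described in the parameter list above
protection_period         3600
min_clean_segments        10%
max_clean_segments        20%
nsegments_per_clean       2
mc_nsegments_per_clean    4
cleaning_interval         5
mc_cleaning_interval      1
min_reclaimable_blocks    10%
mc_min_reclaimable_blocks 1%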

So, what's better for desktop usage? In my opinion:

  • the GC shouldn't run continuously, in order to keep resource usage low, especially on laptops;
  • the GC should operate fast enough, so it doesn't stall a write operation when there's lack of space;
  • min_clean_segments shouldn't be too low in order to avoid lack of space during a write operation;
  • max_clean_segments shouldn't be too close to min_clean_segments in order to avoid running the GC too often;
  • max_clean_segments shouldn't be too far from min_clean_segments in order to avoid running the GC for too long (speed also counts here);
  • free space should be kept higher than min_clean_segments for the longest time possible.

Please take this reasoning with a grain of salt. It's too early for me to take any definite conclusions, as all of this needs to be tested first.

So, in my case:

  • the Data partition is 160GB, which means that, by default, min_clean_segments and max_clean_segments are set to 16GB and 32GB, respectively (see the quick arithmetic after this list). Will I normally perform a 16GB transfer? No, I don't think so. I'll keep these defaults for now, as the GC operation isn't likely to collide with a transfer;
  • regarding the number of segments and the cleaning interval, I'll keep it as is now, but I'll probably tweak it in the future, after running some tests:
    • the tweak will probably be in direction of increasing the GC speed;
    • note the above speeds presume all cleaned segments are full, which is a best-case scenario (i.e. the actual speed will likely be lower than the estimated values above);
    • I still have some doubts regarding nsegments_per_clean and mc_nsegments_per_clean (i.e. are these maximum values, or will there be no cleaning at all if fewer than this many segments are reclaimable?);
  • minimum reclaimable blocks will also be maintained for now:
    • note NILFS2 is very prone to fragmentation, because only changes to the file system are saved, which often leads to files scattered across several segments as they get altered. That's why this FS only makes sense on a device with a fast seek time;
    • on one hand, reclaiming smaller segments (with fewer blocks filled) avoids fragmentation and can increase the chance of finding available segments;
    • on the other hand, reclaiming larger segments (with more blocks filled) makes the cleaning process faster without the need to reduce the cleaning interval too much.
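
For the record, the quick arithmetic behind the 16GB/32GB figures above (20360 segments of 8MB each, as reported by nilfs-tune) can be checked with bc:

echo "scale=1; 20360 * 0.10 * 8 / 1024" | bc   # min_clean_segments: ~15.9GB
echo "scale=1; 20360 * 0.20 * 8 / 1024" | bc   # max_clean_segments: ~31.8GB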

A systemd unit to keep it slim: as you noticed, I haven't changed any defaults yet, but there's something bugging me... how to keep free space from reaching min_clean_segments (unnecessarily, I mean) without changing the other parameters? Well, nilfs-clean can be run manually, after all. I can even define a custom PP. It's just a matter of automating this task.

Why not change the parameters?

Well, I don't want the GC running continuously and I want to keep a relatively long history, but 1 hour of protection seems fine in case I'm running low on space.

Systemd has .mount units which handle device mounts (more about this). A list of the active units of this type can be obtained:

[mbb@mbb-laptop ~]$ systemctl list-units -t mount
UNIT                          LOAD   ACTIVE SUB     DESCRIPTION                                  
-.mount                       loaded active mounted Root Mount                                   
boot-efi.mount                loaded active mounted /boot/efi                                    
dev-hugepages.mount           loaded active mounted Huge Pages File System                       
dev-mqueue.mount              loaded active mounted POSIX Message Queue File System              
media-Data_160GB.mount        loaded active mounted /media/Data_160GB                            
mnt-Data.mount                loaded active mounted /mnt/Data                                    
proc-sys-fs-binfmt_misc.mount loaded active mounted Arbitrary Executable File Formats File System
run-user-1000-gvfs.mount      loaded active mounted /run/user/1000/gvfs                          
run-user-1000.mount           loaded active mounted /run/user/1000                               
sys-fs-fuse-connections.mount loaded active mounted FUSE Control File System                     
sys-kernel-config.mount       loaded active mounted Kernel Configuration File System             
sys-kernel-debug.mount        loaded active mounted Kernel Debug File System                     
tmp.mount                     loaded active mounted /tmp                                         

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

13 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Mount units must be named after the mount point directories they control.

If you look at the output above, media-Data_160GB.mount is mounted at /media/Data_160GB and mnt-Data.mount is mounted at /mnt/Data.
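
If you're ever unsure how a path maps to a unit name, systemd-escape can derive it for you:

$ systemd-escape -p --suffix=mount /mnt/Data
mnt-Data.mount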

Mount points created at runtime (independently of unit files or /etc/fstab) will be monitored by systemd and appear like any other mount unit in systemd.

Well, if you know a systemd mount unit will exist for each mount, and you also know how it's named, then it's very easy to write a unit which depends on a specific mount:

## nilfs2-autoclean unit file for /mnt/Data

[Unit]
Description=Trigger nilfs-clean on /mnt/Data
Requires=mnt-Data.mount
After=mnt-Data.mount

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c "nilfs-clean -p 1M $(cat /proc/mounts | grep /mnt/Data | cut -d ' ' -f1)"

[Install]
WantedBy=mnt-Data.mount

Note that the ExecStart line could have been a lot simpler (e.g. ExecStart=/usr/bin/nilfs-clean -p 1M /dev/sda2). I took this approach because I have other plans and want to keep this in mind (see next section).

And that's it for now. Name this unit <some-name>.service, copy it to /etc/systemd/system/, then enable and start it. Your FS will be cleaned upon mount with the specified PP (in this case 1 month). However, if the drive ever fills and you suddenly need space, the cleaner daemon still only reclaims data older than the PP specified in /etc/nilfs_cleanerd.conf (in this case 1 hour), so you keep a record up to that point.
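
For example, assuming the unit was saved as nilfs2-autoclean-data.service (a name I just made up):

sudo cp nilfs2-autoclean-data.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now nilfs2-autoclean-data.service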

Future work: my intention is to write a tool to automate the management of systemd units like the one above for every desired NILFS2 FS. My plan is to have a script which takes mount points and PPs as arguments, writes them to a configuration file and creates the corresponding systemd units. The systemd units are then triggered upon mount and call the script with a different option. The script then checks for the FS type and, in case of being NILFS2, reads the configuration file and calls nilfs-clean with the specified device and PP.

My intention...

...as some of you may know by now, it will probably be a long time before I get back to this again, as I often leave ideas/intentions hanging due to lack of time. Nevertheless, the idea is here to be discussed and possibly endeavoured by someone else.

Conclusion: NILFS2 is a very straightforward, effective and safe FS. The way I see it, this FS is very good for data partitions, especially in shared environments. Many organizations use shared FSs for work, where many different people access and manipulate files. NILFS2 provides an easy and safe way to track changes and recover deleted content in such an environment. However, this FS has its downsides and can get slow on cramped partitions or on slow systems with little cache.

For desktop use, NILFS2 should be tuned to reclaim used space faster and to avoid reaching the minimum threshold. This tutorial presents a first attempt at achieving those objectives and will be updated to reflect the results achieved and to introduce any additional optimizations that might emerge.


Thank you all for taking the time to read this. Please point out any misconceptions and share your own reasoning on the matter. I'm not at all sure of many of the claims I made in this text.


Update: Used space reported by the system
Update: Accounting snapshots on usage computations
Update: Retrieve history for a particular file or directory
Update: Script to retrieve the history of a particular file or directory

Used space reported by the system: after some time of usage, during which I performed several writes and deletions for various tasks, the system reports 113GB used and 40GB of free space on the Data partition:

[mbb@mbb-laptop ~]$ df | grep /mnt/Data
/dev/sda2       160G  113G   40G  75% /mnt/Data

If I use the file manager to compute the space used by all the files and directories, I get around 63.27GB used (I had previously quoted 64.82, but that value was in GiB). So, the usage reported by the system clearly includes all the deleted data kept by the file system as "checkpoints" (i.e. reclaimable snapshots). What's the space used by the current data, and what's the "real" (or reclaimable) free space?

Once the virtual block size is known (see OP), one way to do this is to list checkpoints via the tool lscp, since it includes the number of used blocks for each checkpoint, and then multiply the two values:

  • Get the virtual block size
[mbb@mbb-laptop ~]$ sudo nilfs-tune -l /dev/sda2 | grep Block
Block size:               4096
  • List checkpoints
[mbb@mbb-laptop ~]$ lscp /dev/sda2
                 CNO        DATE     TIME  MODE  FLG      BLKCNT       ICNT
                   1  2018-03-14 23:52:16   cp    -            4          2
                   2  2018-03-15 00:02:08   cp    -            5          3
                   3  2018-03-15 00:02:46   cp    -            5          3
                   4  2018-03-15 00:03:41   cp    -            6          4
[...]
                5292  2018-03-27 14:51:27   cp    -     16758145     113035
                5293  2018-03-27 14:54:03   cp    -     16758145     113035
                5294  2018-03-27 14:54:10   cp    -     16758144     113034
                5295  2018-03-27 18:09:50   cp    -     16758146     113034

To list only the last checkpoint:

[mbb@mbb-laptop ~]$ lscp -r -n 1 /dev/sda2
                 CNO        DATE     TIME  MODE  FLG      BLKCNT       ICNT
                5295  2018-03-27 18:09:50   cp    -     16758146     113034
  • Compute usage

For 16758146 used blocks of 4KB each, the used/allocated space is 16758146*4/1024^2 = 63.93GB. Not exactly the same as reported by the file manager, but pretty close [1]; the difference may be due to real object size vs. allocated space, the second (also known as size on disk) always being equal to or larger than the first. Subtract this value from the total capacity and you get the free/reclaimable space.

[1] I still need to dig a bit more into this subject.
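
The same computation fits in one line (assuming /dev/sda2 and the 4096-byte block size read above):

lscp -r -n 1 /dev/sda2 | tail -n 1 | awk '{printf "Used: %.2f GB\n", $6 * 4096 / 1024^3}'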

Now wrap it all up in a simple script:

#!/bin/bash
## nilfs2-usage: get usage statistics from a NILFS2 file system
# TODO: include argument handling to choose a specific device and/or size units
#       try to exclude the need for elevated privilege
#       test the influence of snapshots
#       test accuracy of reported and calculated sizes
#       optimizations

DEBUG() { #print info for debugging purposes
    echo "i:" $i
    echo "DEVICE_SHORT:" $DEVICE_SHORT
    echo "BLOCK_BYTES:" $BLOCK_BYTES
    echo "CAPACITY_BYTES:" $CAPACITY_BYTES
    #echo "CAPACITY_KB:" $CAPACITY_KB
    #echo "CAPACITY_MB:" $CAPACITY_MB
    echo "CAPACITY_GB:" $CAPACITY_GB
    echo "USED_BLOCKS:" $USED_BLOCKS
    #echo "USED_BYTES:" $USED_BYTES
    #echo "USED_KB:" $USED_KB
    #echo "USED_MB:" $USED_MB
    echo "USED_GB:" $USED_GB
    #echo "FREE_BYTES:" $FREE_BYTES
    #echo "FREE_KB:" $FREE_KB
    #echo "FREE_MB:" $FREE_MB
    echo "FREE_GB:" $FREE_GB
}

GETINFO() { #variable assignment
    DEVICE_SHORT=$(echo "$i" | cut -d '/' -f3)
    BLOCK_BYTES=$(nilfs-tune -l "$i" | grep Block | awk '{print $3}') # block size of the device being processed

    # Check permissions
    if [ "${#BLOCK_BYTES}" = 0 ]; then
        exit
    fi
    
    CAPACITY_BYTES=$(lsblk -lb -o NAME,SIZE | grep $DEVICE_SHORT | awk '{print $2}')
    #CAPACITY_KB=$(echo "scale=2; $CAPACITY_BYTES / 1024" | bc)
    #CAPACITY_MB=$(echo "scale=2; $CAPACITY_BYTES / 1024^2" | bc)
    CAPACITY_GB=$(echo "scale=2; $CAPACITY_BYTES / 1024^3" | bc)
    USED_BLOCKS=$(lscp -r -n 1 $i | tail -n 1 | awk '{print $6}')
    #USED_BYTES=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES" | bc)
    #USED_KB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024" | bc)
    #USED_MB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024^2" | bc)
    USED_GB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024^3" | bc)
    #FREE_BYTES=$(echo "scale=2; $CAPACITY_BYTES - $USED_BYTES" | bc)
    #FREE_KB=$(echo "scale=2; $CAPACITY_KB - $USED_KB" | bc)
    #FREE_MB=$(echo "scale=2; $CAPACITY_MB - $USED_MB" | bc)
    FREE_GB=$(echo "scale=2; $CAPACITY_GB - $USED_GB" | bc)
}

# Print table header and contents
printf "%-16s %-16s %-16s %s\n" "DEVICE" "CAPACITY" "USED" "FREE"
for i in $(cat /proc/mounts | grep nilfs2 | cut -d ' ' -f1); do
    GETINFO
    #DEBUG
    printf "%-16s %-16s %-16s %s\n" "$i" "$CAPACITY_GB"\GB "$USED_GB"\GB "$FREE_GB"\GB
done

This is still very raw and I don't know when I'll come back to it to make it better, but it reports used and free space for all NILFS2 mounted devices, in GB. I haven't tested the influence of snapshots yet, though I believe the block count will include those [1]. It needs elevated privileges because it uses nilfs-tune to get the block size.

Example output:

[mbb@mbb-laptop ~]$ sudo nilfs2-usage.sh 
[sudo] password for mbb: 
DEVICE           CAPACITY         USED             FREE
/dev/sda2        159.06GB         63.92GB          95.14GB

Hope this clarifies the free space reported by the system for those using NILFS2.

[1] Influence of snapshots in space calculations

What happens if you get close to the "real" disk limit? Will the garbage collector start and free some space?

That can be configured AFAIK: https://nilfs.sourceforge.io/en/man5/nilfs_cleanerd.conf.5.html

As @anon23612428 pointed out that can be configured. There is an upper and a lower limit. The GC starts cleaning when the lower limit is reached and stops when the upper limit is reached. Default values are 10% of disk capacity for the lower limit and 20% for the upper limit. If the lower limit is set to zero or the upper limit can't be reached the GC runs continuously. The problem comes with the cleaning speed (also configurable), which can affect write speed if there isn't enough free space for the write, but this is something I still need to test.

Influence of snapshots in space calculations: snapshots have no influence on the number of used blocks reported for the last checkpoint. I just tested this by searching for a checkpoint with a higher number of used blocks, turning it into a snapshot and then running the script I posted:

  • Search for the checkpoint using lscp
    I chose checkpoint 5279, which has 16959590 used blocks (64.7GB)
5279  2018-03-27 00:18:23   cp    -     16959590     113054
  • Turn the checkpoint into a snapshot
[mbb@mbb-laptop ~]$ sudo chcp ss 5279
[sudo] password for mbb: 
[mbb@mbb-laptop ~]$ lscp -s
                 CNO        DATE     TIME  MODE  FLG      BLKCNT       ICNT
                5279  2018-03-27 00:18:23   ss    -     16959590     113054
  • Run the script (or check the used blocks of the last checkpoint)
[mbb@mbb-laptop ~]$ lscp -r -n 1
                 CNO        DATE     TIME  MODE  FLG      BLKCNT       ICNT
                5381  2018-03-28 15:20:01   cp    -     16758160     113055
[mbb@mbb-laptop ~]$ sudo nilfs2-usage.sh 
DEVICE           CAPACITY         USED             FREE
/dev/sda2        159.06GB         63.92GB          95.14GB

As you can see, the reported usage for the last checkpoint doesn't change, so the computation above becomes wrong once snapshots exist. To compute the correct usage it is necessary to know which content is common to the existing snapshots and the last checkpoint. This seems like a lot of work, something I need to dig into whenever I have time; dumpseg and lssu look like promising tools for this. Besides this, I just remembered that free space should also be computed considering the protection period set in /etc/nilfs_cleanerd.conf.

Update: Accounting snapshots on usage computations

Short question: do you use nilfs on a HDD, AHCI SSD or NVMe SSD?

I'm not really interested in nilfs' safety/snapshot features but in pure speed.
Do you notice any difference compared to the usual filesystems?
I've only used nilfs in VBox so I can't say much about it...

I use it on an AHCI SSD. I don't see much benefit in using it on a HDD, besides the snapshot feature. I don't notice any performance impact, but I haven't tested it properly either. There are benchmarks by other authors (see OP). I think performance will be better evaluated when the lower threshold is reached and the GC starts working. Then I'll probably need some more tuning, but I haven't got there yet.

nilfs2 for lower latency
f2fs for higher throughput
This is what Wikipedia has to say about nilfs2.

I'm aware of that, but I was more interested in traditional filesystems like ext4 or xfs.

I've found this:
https://www.phoronix.com/scan.php?page=article&item=linux414-fs-compare&num=1
and this:
https://www.phoronix.com/scan.php?page=article&item=linux-48-hdd&num=1

Some big differences but those are only benchmarks and I think they don't reflect the general desktop usage.
In some conditions nilfs seems to be slow, in others it's on par with the usual suspects.
Maybe in the end it doesn't really matter...

For HDDs I found mostly xfs and ext4 to be recommended, where xfs tends to be better with large files.
The performance of btrfs improves in benchmarks when compression is enabled, especially on HDDs. The benchmarks you linked test btrfs without compression.

I don't think these benchmarks are completely unrelated to real-life usage. But somebody could create a test whose results let a reader picture what they have to do with real usage. For example, install the same OS on different filesystems and measure boot times, or the startup times of some applications, or the build times of particular packages.

Probably this is part of what existing benchmarks already measure?

Yes, XFS seems to be particularly good for drives which hold VirtualBox disk images or similar large files.

What I meant is that the normal desktop user probably won't notice a difference in normal day-to-day usage, but it's different when you do specific tasks like databases, or copying/removing a lot of super small or very large files.

I generally use btrfs (just like you) for root and home, and ext4 for my data (because of compatibility). The log-based concept of nilfs is very interesting, but perhaps not of much use if you don't care about continuous snapshots or safety.

F2FS is my filesystem of choice for Linux-only USB sticks.

Agree, agree. :wink: And with benchmarks we see the difference between good concepts and how far the implementation has progressed.

Accounting snapshots on usage computations: as shown in a previous post, the used blocks reported for the last checkpoint don't include snapshots. Consequently, the used space computed from those values expresses the current usage only. That is, if the garbage collector is run, the used space reported by the system will still be higher because snapshots won't be cleared. So, how to know the "real" used space, so the correct free space can be computed?

lssu is a tool from the package nilfs-utils which allows listing segments. By default it lists each segment's index, date/time of creation, attributes/flags and number of addressed blocks (note that addressed blocks are not necessarily all in use):

              SEGNUM        DATE     TIME STAT     NBLOCKS
                   0  2018-03-15 22:36:37  -d-        2047
                   1  2018-03-15 22:36:37  -d-        2048
                   2  2018-03-15 22:36:37  -d-        2048
                   3  2018-03-15 22:36:38  -d-        2048
                   4  2018-03-15 22:36:38  -d-        2048
[...]
               14460  2018-03-29 12:42:13  -d-        2048
               14461  2018-03-29 12:52:15  -d-        2048
               14462  2018-03-29 12:52:20  ad-         138
               14463  ---------- --:--:--  ad-           0

lssu -l outputs the same list with an additional column containing usage information in the format <used blocks> ( <segment usage percentage> ):

           SEGNUM        DATE     TIME STAT     NBLOCKS       NLIVEBLOCKS
                0  2018-03-15 22:36:37 -d--        2047          3 (  0%)
                1  2018-03-15 22:36:37 -d--        2048          0 (  0%)
                2  2018-03-15 22:36:37 -d--        2048          0 (  0%)
                3  2018-03-15 22:36:38 -d--        2048          0 (  0%)
                4  2018-03-15 22:36:38 -d--        2048          0 (  0%)
[...]
            14460  2018-03-29 12:42:13 -d--        2048         13 (  0%)
            14461  2018-03-29 12:52:15 -d-p        2048       2048 (100%)
            14462  2018-03-29 12:52:20 ad-p         138        138 (100%)
            14463  ---------- --:--:-- ad-p           0          0 (100%)
Note the flags in the above outputs:

a: recent checkpoint; can't be cleaned
d: dirty state; being used
p: protected; can't be cleaned

Does lssu -l usage report account for snapshots?
Yes it does. This was my test:

  1. output lssu -l to a file;
  2. delete 10.3GB of data;
  3. output lssu -l to a different file;
  4. turn the checkpoint prior to the deletion into a snapshot;
  5. output lssu -l to a third file;
  6. compute the number of used blocks from the 3 files and calculate the used space using the known virtual block size of 4KB (see 2nd post).

The results were:

  • 64.80GB before the deletion;
  • 54.45GB after the deletion;
  • 64.80GB after creating the snapshot.

So, lssu -l does provide a way to compute used space considering snapshots. The drawback is that it is very slow (maybe the live block counts are computed on the fly; I don't know).
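
A minimal way to turn the lssu -l output into a figure, assuming /dev/sda2 and 4KB blocks (be warned it is slow, as shown further below):

sudo lssu -l /dev/sda2 | tail -n +2 | awk '{s += $6} END {printf "In use (incl. snapshots): %.2f GB\n", s * 4096 / 1024^3}'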

A clearer view: as I needed to parse the output, I decided to do it in a spreadsheet, since I'm more used to it. Then, once I had the data already there, I decided to chart it. The picture below shows the difference in the used blocks reported before the deletion, after the deletion and after creating the snapshot:
[chart: NILFS-snapshot_chart]
As can be seen, lssu -l doesn't count blocks whose data can be cleared; it counts reserved blocks as in use, which is exactly what I think is relevant for computing used and free space. Furthermore, the chart shows that protected segments (or their references) are always placed at the end, which is nice if one needs that kind of info. This can also be seen from the listings themselves, such as the output above or the list taken immediately after the deletion (note the p flag within the default protection period of 1 hour):

           SEGNUM        DATE     TIME STAT     NBLOCKS       NLIVEBLOCKS
[...]
            14434  2018-03-28 15:16:30 -d--        2048          0 (  0%)
            14435  2018-03-28 15:16:30 -d--        2048          1 (  0%)
            14436  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14437  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14438  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14439  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14440  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14441  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14442  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14443  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14444  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14445  2018-03-28 23:20:45 -d-p        2048       2048 (100%)
            14446  2018-03-28 23:20:50 ad-p        1973       1973 (100%)
            14447  ---------- --:--:-- ad-p           0          0 (100%)

lssu -l is very slow: I noticed it was fast for around the first 500 lines and then it slowed down. So I thought it might be worth getting the full list 500 lines at a time. This is what I've tried:

  • calling lssu -l directly with no amount of lines predefined;
  • 1 thread, 500 lines at a time;
  • 4 threads, 500 lines at a time;
  • 4 threads, 500 lines at a time and nice=-10;
  • 4 threads, 500 lines at a time, nice=-10 and assignment to a variable instead of a file;
  • 1 thread, 500 lines at a time, nice=-10 and assignment to a variable instead of a file;
  • 1 thread, 500 lines at a time, nice=-10, assignment to a variable and no IF constructs.

The following chart shows the results:
[chart: NILFS-lssu_chart]
So, the best time I got was about 4 minutes! I won't post the script because it is a bit cluttered and I don't think it's necessary. While making these tests I monitored the laptop activity and noticed the SSD read throughput was relatively constant at around 400MB/s. This explains why it takes so long and why more threads bring no benefit: the block usage stats aren't cached and are read from the disk on the fly.
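
For completeness, this is a minimal sketch of the chunked approach (not the exact test script), assuming /dev/sda2, 500 segments per call and a root shell:

NSEG=$(lssu /dev/sda2 | awk 'END {print $1}')   # highest segment number listed
for ((i = 0; i <= NSEG; i += 500)); do
    nice -n -10 lssu -l -i "$i" -n 500 /dev/sda2 | tail -n +2
done > /tmp/lssu-l.txt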

An updated script: I updated the previously posted script so it can optionally account for snapshots as well:

#!/bin/bash
## nilfs2-usage: get usage statistics from a NILFS2 file system
# TODO: include argument handling to choose a specific device and/or size units
#       try to exclude the need for elevated privilege
#       protection period should also be accounted for space calculations
#       optimizations

ERROR() { #error handler
    echo "Error: only option \"-s\" is allowed"
    exit
}

DEBUG() { #print info for debugging purposes
    echo "i:" $i
    echo "DEVICE_SHORT:" $DEVICE_SHORT
    echo "BLOCK_BYTES:" $BLOCK_BYTES
    echo "CAPACITY_BYTES:" $CAPACITY_BYTES
    #echo "CAPACITY_KB:" $CAPACITY_KB
    #echo "CAPACITY_MB:" $CAPACITY_MB
    echo "CAPACITY_GB:" $CAPACITY_GB
    echo "USED_BLOCKS:" $USED_BLOCKS
    #echo "USED_BYTES:" $USED_BYTES
    #echo "USED_KB:" $USED_KB
    #echo "USED_MB:" $USED_MB
    echo "USED_GB:" $USED_GB
    #echo "FREE_BYTES:" $FREE_BYTES
    #echo "FREE_KB:" $FREE_KB
    #echo "FREE_MB:" $FREE_MB
    echo "FREE_GB:" $FREE_GB
    echo "nSEG_RUN:" $nSEG_RUN
    echo "nSEGMENTS:" $nSEGMENTS
    echo "nRUNS:" $nRUNS
    echo "nSEG_LAST:" $nSEG_LAST
    echo "RESERVED_BLOCKS:" $RESERVED_BLOCKS
    echo "RESERVED_BYTES:" $RESERVED_BYTES
    echo "RESERVED_GB:" $RESERVED_GB
    #echo "USABLE_BYTES:" $USABLE_BYTES
    echo "USABLE_GB:" $USABLE_GB
    #echo "PROTECTED_BYTES:" $PROTECTED_BYTES
    echo "PROTECTED_GB:" $PROTECTED_GB
}

GETINFO() { #currently used and free space
    DEVICE_SHORT=$(echo "$i" | cut -d '/' -f3)
    BLOCK_BYTES=$(nilfs-tune -l "$i" | grep Block | awk '{print $3}') # block size of the device being processed

    # Check permissions
    if [ "${#BLOCK_BYTES}" = 0 ]; then
        exit
    fi
    
    CAPACITY_BYTES=$(lsblk -lb -o NAME,SIZE | grep $DEVICE_SHORT | awk '{print $2}')
    #CAPACITY_KB=$(echo "scale=2; $CAPACITY_BYTES / 1024" | bc)
    #CAPACITY_MB=$(echo "scale=2; $CAPACITY_BYTES / 1024^2" | bc)
    CAPACITY_GB=$(echo "scale=2; $CAPACITY_BYTES / 1024^3" | bc)
    USED_BLOCKS=$(lscp -r -n 1 $i | tail -n 1 | awk '{print $6}')
    #USED_BYTES=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES" | bc)
    #USED_KB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024" | bc)
    #USED_MB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024^2" | bc)
    USED_GB=$(echo "scale=2; $USED_BLOCKS * $BLOCK_BYTES / 1024^3" | bc)
    #FREE_BYTES=$(echo "scale=2; $CAPACITY_BYTES - $USED_BYTES" | bc)
    #FREE_KB=$(echo "scale=2; $CAPACITY_KB - $USED_KB" | bc)
    #FREE_MB=$(echo "scale=2; $CAPACITY_MB - $USED_MB" | bc)
    FREE_GB=$(echo "scale=2; $CAPACITY_GB - $USED_GB" | bc)
}

GETEXTRA() { #used and free space considering protected content (e.g. snapshots)
    nSEG_RUN=500
    nSEGMENTS=$(echo "$(lssu)" | awk 'END {print $1}')
    nRUNS=$(echo "scale=2; $nSEGMENTS / $nSEG_RUN" | bc | awk '{print ($0-int($0)>0)?int($0)+1:int($0)}')
    nSEG_LAST=$((nSEGMENTS - nSEG_RUN * (nRUNS - 1)))
    
    # Get "lssu -l" output and parse used blocks
    j=0
    RESERVED_BLOCKS=0
    while [ "$j" -lt "$(($nRUNS))" ]; do
        OUTPUT=$(nice -n -10 lssu -l -i $(($nSEG_RUN * $j)) -n $nSEG_RUN)
        
        # Parse used blocks
        for k in $(echo "$OUTPUT" | tail -n +2 | awk '{print $6}'); do
            RESERVED_BLOCKS=$(($RESERVED_BLOCKS + $k))
        done
        
        j=$(($j+1))
    done
    
    RESERVED_BYTES=$(echo "scale=2; $RESERVED_BLOCKS * $BLOCK_BYTES" | bc)
    RESERVED_GB=$(echo "scale=2; $RESERVED_BYTES / 1024^3" | bc)
    #USABLE_BYTES=$(echo "scale=2; $CAPACITY_BYTES - $RESERVED_BYTES" | bc)
    USABLE_GB=$(echo "scale=2; $CAPACITY_GB - $RESERVED_GB" | bc)
    #PROTECTED_BYTES=$(echo "scale=2; $RESERVED_BYTES - $USED_BYTES" | bc)
    PROTECTED_GB=$(echo "scale=2; $RESERVED_GB - $USED_GB" | bc)
}

# Argument handler
case "$1" in
    -s)
        EXTRA=1
        ;;
    *)
        if [ "$#" -ge 1 ]; then ERROR; fi
        EXTRA=0
esac

# Print table header
if [ "$EXTRA" -eq 0 ]; then
    printf "%-12s %-12s %-12s %s\n" "DEVICE" "CAPACITY" "USED" "FREE"
else
    printf "%-12s %-12s %-12s %-12s %-12s %-12s %s\n" "DEVICE" "CAPACITY" "USED" "FREE" "RESERVED" "USABLE" "PROTECTED"
fi

# Compute and print usage
for i in $(cat /proc/mounts | grep nilfs2 | cut -d ' ' -f1); do
    GETINFO
    if [ "$EXTRA" -eq 0 ]; then
        printf "%-12s %-12s %-12s %s\n" "$i" "$CAPACITY_GB"\GB "$USED_GB"\GB "$FREE_GB"\GB
    else
        GETEXTRA
        printf "%-12s %-12s %-12s %-12s %-12s %-12s %s\n" "$i" "$CAPACITY_GB"\GB "$USED_GB"\GB "$FREE_GB"\GB "$RESERVED_GB"\GB "$USABLE_GB"\GB "$PROTECTED_GB"\GB
    fi
    #DEBUG
done

It is now possible to calculate the "real" used space (reserved) by using the option -s.

Example output:

[mbb@mbb-laptop ~]$ time sudo nilfs2-usage.sh  -s
DEVICE       CAPACITY     USED         FREE         RESERVED     USABLE       PROTECTED
/dev/sda2    159.06GB     53.57GB      105.49GB     64.86GB      94.20GB      11.29GB

real    4m22,052s
user    0m6,905s
sys     0m50,195s

Cheers,

Retrieve history for a particular file or directory: as you might know, there are NILFS browsers out there. However, I haven't found any that suits me, because they're always attached to software I don't use. So I tried to find a way to get the history of a specific item, as that's one of the main advantages of having continuous snapshots.

dumpseg is a tool from the package nilfs-utils which allows inspecting a specific segment. It displays info about inodes and checkpoints, among other valuable info:

segment: segnum = 481
  sequence number = 481, next segnum = 482
  partial segment: blocknr = 985088, nblocks = 214
    creation time = 2018-03-16 12:58:51
    nfinfo = 16
    finfo
      ino = 16867, cno = 130, nblocks = 8, ndatblk = 7
        vblocknr = 970560, blkoff = 0, blocknr = 985089
        vblocknr = 970561, blkoff = 1, blocknr = 985090
[...]

As already mentioned, lssu lists the segments in use. This means that, once you know the inode of a particular file or directory, you can scan each segment for that specific value and get the checkpoints where it is referenced. Then you can see the dates of those checkpoints using lscp and mount any of them if you want to:

  • Get the inode for the item you want to retrieve the history for:
[mbb@mbb-laptop buildtest]$ ls -i /mnt/Data/mbb-home-backup/file-include.txt 
2026 /mnt/Data/mbb-home-backup/file-include.txt

For this example I chose file /mnt/Data/mbb-home-backup/file-include.txt, which is inode 2026.

  • Scan the inode for each segment lssu outputs:
[mbb@mbb-laptop buildtest]$ SEGLIST=$(lssu | tail +2 | awk '{print $1}'); for i in $(echo "$SEGLIST"); do sudo dumpseg /dev/sda2 $i | grep 'ino = 2026,'; done
[sudo] password for mbb: 
      ino = 2026, cno = 39, nblocks = 1, ndatblk = 1
      ino = 2026, cno = 481, nblocks = 1, ndatblk = 1
      ino = 2026, cno = 5388, nblocks = 1, ndatblk = 1
      ino = 2026, cno = 5537, nblocks = 1, ndatblk = 1
      ino = 2026, cno = 5662, nblocks = 1, ndatblk = 1

As can be seen above, 5 versions of this inode were found, referenced in checkpoints 39, 481, 5388, 5537 and 5662.

  • Use lscp to get the dates:
[mbb@mbb-laptop buildtest]$ lscp -i 38 -n $((5662 - 39)) | grep -e ' 39 ' -e ' 481 ' -e ' 5388 ' -e ' 5537 ' -e ' 5662 '
                  39  2018-03-16 12:51:47   cp    -       140228       2165
                 481  2018-03-16 13:22:26   cp    -      6982300      44806
                5388  2018-03-28 23:20:45   cp    -     14043778     110233
                5537  2018-03-29 19:00:12   cp    -     14044507     110155
                5662  2018-03-30 18:57:27   cp    -     16758200     113078

You can now turn any of those checkpoints into a snapshot and mount it to retrieve an older version of the file/directory [1]:

[mbb@mbb-laptop buildtest]$ sudo chcp ss 5388
[mbb@mbb-laptop buildtest]$ sudo mount -t nilfs2 /dev/sda2 /tmp/tmpmount/ -o ro,cp=5388
[mbb@mbb-laptop buildtest]$ md5sum /tmp/tmpmount/mbb-home-backup/file-include.txt 
73658bbb1998049f9774b9f253eb9cde  /tmp/tmpmount/mbb-home-backup/file-include.txt
[mbb@mbb-laptop buildtest]$ md5sum /mnt/Data/mbb-home-backup/file-include.txt 
d5d167f00753452e1dd9041d6e17f4b6  /mnt/Data/mbb-home-backup/file-include.txt

This begs for a script. I'll write it whenever I have time, as it may come in handy in the future.

[1] Inodes are mappings to disk blocks and, as such, their numbers can be reused once freed. For this reason, the item being searched for may not be present in the oldest checkpoints returned by dumpseg. In this case 5 checkpoints were returned, but the file was only present in the last 3 (from 5388 onward).

Update: Script to retrieve the history of a particular file or directory

Script to retrieve the history of a particular file or directory: this is a first attempt. I don't know when or if I'll come back to this to make it better.

#!/bin/bash
## nilfs2-history: retrieve history of a filesystem item
# TODO: add option '-c' to check for item's existence on each checkpoint found and output valid info only
#       allow multiple paths and/or regex
#       optimizations

ERROR() { #error handler
    case "$1" in
        1)
            echo "Error: you don't have permission to run this script; it needs to be run as root or super user."
            ;;
        *)
            echo "Error: item \""$1"\" not found or not on a NILFS2 file system"
    esac
    exit
}

DEBUG() { #print info for debugging purposes
    case "$1" in
        SEG_LIST)
            echo "DEVICE:" $DEVICE
            echo "INODE:" $INODE
            echo "SEGLIST:" "$SEGLIST"
            echo "waiting..."
            read
            ;;
        CHECK_LIST)
            echo "SEG:" $SEG
            echo "chkLIST:" $chkLIST
            echo "i:" $i
            echo "chkPOINT[$i]:" ${chkPOINT[$i]}
            echo "waiting..."
            read
            ;;
        trim_LIST)
            echo "chkPOINT[1]:" ${chkPOINT[1]}
            echo "i:" $i "chkPOINT[$i]:" ${chkPOINT[$i]}
            echo "chkPOINT[$i] - chkPOINT[1]:" $((${chkPOINT[$i]} - ${chkPOINT[1]}))
            echo "waiting..."
            read
            echo "chkLIST:" "$chkLIST"
            ;;
        *)
    esac
}

GETINFO() { #retrieve checkpoint list between first and last dates containing the inode
    DEVICE=$(df $inPATH | tail +2 | awk '{print $1}')
    INODE=$(ls -di $inPATH | awk '{print $1}')
    SEGLIST=$(lssu | tail +2 | awk '{print $1}')
    #DEBUG SEG_LIST
    
    # Check permissions
    if [ ! -w "$DEVICE" ]; then ERROR "1"; fi

    # Get checkpoint list
    i=0
    for SEG in $(echo "$SEGLIST"); do
        chkLIST=$(dumpseg $DEVICE $SEG | grep 'ino = '$INODE, | awk '{print $6}')
        if [ "${#chkLIST}" -gt 0 ]; then
            i=$(($i+1))
            chkPOINT[$i]=${chkLIST:0:$((${#chkLIST} - 1))}
            #DEBUG CHECK_LIST
        fi
    done
    chkLIST=$(lscp -a -i $((${chkPOINT[1]} - 1)) -n $((${chkPOINT[$i]} - ${chkPOINT[1]} + 1)) $DEVICE | tail +2)
    #DEBUG trim_LIST

}

# Test path
inPATH=$(find $1 -maxdepth 0 -fstype nilfs2 2> /dev/null)
if [ "${#inPATH}" = 0 ]; then ERROR $1; fi

# Get history
GETINFO

# Print table header, parse list and print results
printf "%-12s %-12s %-12s %-12s %s\n" "CHECKPOINT" "DATE" "TIME" "TYPE" "FLAGS"
i=1
while read LINE; do
    if [ "$(echo "$LINE" | awk '{print $1}')" = "${chkPOINT[$i]}" ]; then
        printf "%-12s %-12s %-12s %-12s %s\n" $(echo "$LINE" | awk '{print $1}') $(echo "$LINE" | awk '{print $2}') $(echo "$LINE" | awk '{print $3}') $(echo "$LINE" | awk '{print $4}') $(echo "$LINE" | awk '{print $5}')
        i=$(($i+1))
    fi
done <<< $chkLIST

# Print extra info
echo
echo "NOTE: the item you're looking for may not be present on the oldest checkpoints."
#echo "      use option '-c' to output valid checkpoints only."
echo "NOTE: nilfs2-history.sh only searches for one item;"
echo "      if you input more than one path or a regular expression as a path,"
echo "      only the inode of the first item will be searched."

It takes a while scanning all the segments for the intended inode, but for now it does its job.

Example output:

[mbb@mbb-laptop ~]$ time sudo nilfs2-history.sh /mnt/Data/mbb-home-backup/file-include.txt
CHECKPOINT   DATE         TIME         TYPE         FLAGS
39           2018-03-16   12:51:47     cp           -
481          2018-03-16   13:22:26     cp           -
5388         2018-03-28   23:20:45     cp           -
5537         2018-03-29   19:00:12     cp           -
5662         2018-03-30   18:57:27     cp           -

NOTE: the item you're looking for may not be present on the oldest checkpoints.
NOTE: nilfs2-history.sh only searches for one item;
      if you input more than one path or a regular expression as a path,
      only the inode of the first item will be searched.

real    1m52,399s
user    2m33,068s
sys     0m47,923s

The only thing I still want to know better is tuning the garbage collector. Will test it when I have time.

Cheers,

EDIT: three small changes to the script: it was throwing errors for directories (not sure if there's a history for these), it was wasting one cycle when looking for the checkpoints to print, and it was eating the first checkpoint found from the list, so it didn't return any result when only a single checkpoint was found.

EDIT: after using this filesystem for a while I needed this script and noticed it wasn't working. I already figured out why and solved the problem. However, the changes I made are extremely inefficient and need to be changed. I'll post an updated script whenever I have time.

I haven't tested the garbage collector yet, but today I reached the lower limit (16GB). As expected, the reported free space slowly increased until the upper limit (32GB) was reached. I haven't noticed any slowdowns, but I wasn't asking much from the file system anyway. CPU usage isn't noticeable and cleanerd takes less than 700KB of RAM. It took a while to clean the 16GB and, as expected, the speeds I deduced in the OP are maximum values, since the amount of space freed in each cleaning step varies (depending on the number of used blocks in the cleaned segments).

The next tests will be to load the file system with large transfers (enough to get well beyond the lower limit) and observe the throughput, CPU and RAM impact.
