- Add support for libarchive filters in dwarfsextract. - When hashing, start by only hashing the first, say, 4KiB, and only if the hashes are identical, hash the whole file - Add support for compressing uncompressed image formats, currently primarily FITS, using different formats (e.g. JPEG2K, Rice, Hcompress). - Fragment counts don't necessarily match up in presence of errors. - When opening a file again, check that its timestamp, size and potentially checksum did not change from when we first saw it. - Option to log to a file instead of stderr? - Use thrift definitions for all options to make them easily printable/storable? - Use Elias-Fano for delta-encoded lists in metadata? - Implement rewriting properly; keep order of blocks etc; ability to remove history?; ability to re-pack metadata?; ability to change other metadata properties (e.g. stuff like --set-owner, --set-group, --chmod, --no-create-ts, --set-time could all be done on existing metadata, but obviously wouldn't be undo-able) - Packaging of libs added via FetchContent - Re-assemble global bloom filter rather than merging? - Use smaller bloom filters for individual blocks? - Use bigger (non-resettable?) global bloom filter? - file discovery progress? - show defaults for categorized options - take a look at CPU measurements, those for nilsimsa ordering are probably wrong - segmenter tests with different granularities, block sizes, any other options - Bloom filters can be wasteful if lookback gets really long. Maybe we can use smaller bloom filters for individual blocks and one or two larger "global" bloom filters? It's going to be impossible to rebuild those from the smaller filters, though. - Compress long repetitions of the same byte more efficiently. Currently, segmentation finds an overlap after about one window size. This goes on and on repeatedly. So we end up with a *lot* of chunks pointing to the same segment. The smaller the window size, the larger the number of chunks. It's definitely a trade off, as storing large segments of repeating bytes is wasteful when mounting the image. Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF) hash values for window_size bytes to detect long sequences of identical bytes. OTHER intriguing idea: let a categorizer (could even be the incompressible categorizer, but also "sparse file" categorizer or something like that) detect these repetitions up front so the segmenter doesn't have to do it (and it can be optional). Then, we can customize the segmenter to run *extremely* fast in this case. - Wiki with use cases - Perl releases - Videos with shared streams - Backups of audio files - Compression of filesystem images for forensic purposes - Mounting lots of images with shared cache? - different scenarios for categorized files / chunks: - Video files - just store without compression, but perform segmentation, nothing special - keep in lookback buffer for longer, as it doesn't cost memory - needs parallelized segmenter (see below) - PCM audio (see also github #95) - segment first in case of e.g. different trims or other types of overlap - split into chunks (for parallel decompression) - compress each chunk as flac - headers to be saved separately - need to store original size and other information This is actually quite easy: - Identify PCM audio files (libmagic?) - Nilsimsa similarity works surprisingly well - We can potentially switch to larger window size for segmentation and use larger lookback - Run segmentation as usual - Compress each block using FLAC (hopefully we can configure how much header data and/or seek points etc. gets stored) or maybe even WAVPACK is we don't need perf - I *think* this can be done even with the current metadata format without any additions - The features needed should be largely orthogonal to the features needed for the scenarios below - Executables, Shared Libs, ... - run filter over blocks that contain a certain kind of binary data before compression - a single binary may contain machine code for different architectures, so we may have to store different parts of the binary in different blocks - actually quite similar to audio files above, except for the additional filter used during compression/decompression - JPG - just store a recompressed version - need to store original size - no need for segmentation except for exact - PDF - decompress contents - then segment - then compress along with other PDFs (or documents in general) - Other compressed format (gz, xz, ...) - decompress - segment - compress - essentially like PDF - maybe only do this for small files? (option for size limit?) - It should be possible to treat individual chunks differently, e.g. WAV-header should be stored independently from contents; at some point, we might look deeper into tar files and compress individual contents differently. - in the metadata, we need to know: - the fact that a stored inode is "special" (can be reflected in a single bit) - the type of "specialness" - the original file size - multi-threaded pre-matcher (for -Bn with n > 0) - pre-compute matches/cyclic hashes for completed blocks; these don't change and so we can do this with very little synchronization - there are two possible strategies: - split the input stream into chunks and then process each chunk in a separate thread, checking all n blocks - process the input stream in each thread and then only checking a subset of past blocks (this seems more wasteful, but each thread would only operate on a few instead of all bloom filters, which could be better from a cache locality pov) - similarity size limit to avoid similarity computation for huge files - store files without similarity hash first, sorted descending by size - use streaming interface for zstd decompressor - json metadata recovery - handle sparse files? - try to be more resilient to modifications of the input while creating fs - dwarfsck: - show which entries a block references - show partial metadata dumps at lower detail levels - make dwarfsck more usable - cleanup TODOs - folly: dynamic should support string_view - frozen: ViewBase.getPosition() should be const - docs, moar tests - extended attributes: - number of blocks - number of chunks - number of times opened? - per-file "hotness" (how often was a file opened); dump to file upon umount - --unpack option - readahead? - window-increment-shift seems silly to configure? - identify blocks that contain mostly binary data and adjust compressor? - metadata stripping (i.e. re-write metadata without owner/time info) - metadata repacking (e.g. just recompress/decompress the metadata block) /* scanner: bhw= - 388.3s 13.07 GiB bhw= 8 812.9s 7.559 GiB bhw= 9 693.1s 7.565 GiB bhw=10 651.8s 7.617 GiB bhw=11 618.7s 7.313 GiB bhw=12 603.6s 7.625 GiB bhw=13 591.2s 7.858 GiB bhw=14 574.1s 8.306 GiB bhw=15 553.8s 8.869 GiB bhw=16 541.9s 9.529 GiB lz4: <---- 1m29.535s / 9m31.212s lz4hc: 1 - 20.94s - 2546 MiB 2 - 21.67s - 2441 MiB 3 - 24.19s - 2377 MiB 4 - 27.29s - 2337 MiB 5 - 31.49s - 2311 MiB 6 - 36.39s - 2294 MiB 7 - 42.04s - 2284 MiB 8 - 48.67s - 2277 MiB 9 - 56.94s - 2273 MiB <---- 1m27.979s / 9m20.637s 10 - 68.03s - 2271 MiB 11 - 79.54s - 2269 MiB 12 - 94.84s - 2268 MiB zstd: 1 - 11.42s - 1667 MiB 2 - 12.95s - 1591 MiB <---- 2m8.351s / 15m25.752s 3 - 22.03s - 1454 MiB 4 - 25.64s - 1398 MiB 5 - 32.34s - 1383 MiB 6 - 41.45s - 1118 MiB <---- 2m4.258s / 14m28.627s 7 - 46.26s - 1104 MiB 8 - 53.34s - 1077 MiB 9 - 59.99s - 1066 MiB 10 - 63.3s - 1066 MiB 11 - 66.97s - 956 MiB <---- 2m3.496s / 14m17.862s 12 - 79.89s - 953 MiB 13 - 89.8s - 943 MiB 14 - 118.1s - 941 MiB 15 - 230s - 951 MiB 16 - 247.4s - 863 MiB <---- 2m11.202s / 14m57.245s 17 - 294.5s - 854 MiB 18 - 634s - 806 MiB 19 - 762.5s - 780 MiB 20 - 776.8s - 718 MiB <---- 2m16.448s / 15m43.923s 21 - 990.4s - 716 MiB 22 - 984.3s - 715 MiB <---- 2m18.133s / 15m55.263s lzma: level=6:dict_size=21 921.9s - 838.8 MiB <---- 5m11.219s / 37m36.002s */ Perl: 542 versions of perl found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written bench build real user ----------------------------------------------------------------------------------------------------- -rw-r--r-- 1 mhx users 14G Jul 27 23:11 perl-install-0.dwarfs 8:05 0:38 0:45 -rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs 6:34 0:14 1:24 -rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs 7:31 0:17 1:11 -rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs 10:11 0:11 0:59 -rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs 11:05 0:14 1:24 -rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs 11:53 0:13 1:15 -rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs 17:42 0:14 1:25 -rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs 20:52 0:20 2:14 -rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs 24:13 0:50 6:02 -rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs 34:40 0:51 5:50 -rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs 17:20 -rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs 71:49 soak: -7 (cache=1g) Passed with 542 of 542 combinations. real 75m21.191s user 68m3.903s sys 6m21.020s -9 (cache=1g) Passed with 542 of 542 combinations. real 118m48.371s user 107m35.685s sys 7m16.438s squashfs-opt real 81m36.957s user 62m37.369s sys 20m52.367s -1 (cache=2g) mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none real 2m19.927s user 0m16.813s sys 2m4.293s -7 (cache=2g) mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none real 2m24.346s user 0m17.007s sys 1m59.823s squash-default mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none real 8m41.594s user 1m25.346s sys 19m12.036s squash-opt mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none real 141m41.092s user 1m12.650s sys 59m18.194s