Information
ID: 605
PHID: PHID-TASK-rmlmclldrixkknshl67q
Author: Patrick
Status at Migration Time: open
Priority at Migration Time: Wishlist
Description
The current tar.xz compression code is a burden, since it literally takes hours. Building the VBox and KVM images through to a finished upload therefore currently takes more than a day.
Using tar with --xz and --mtime="2014-05-06 00:00:00" so the archives are deterministic.
Using --sparse
…
-S, --sparse
handle sparse files efficiently
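A quick way to see what --sparse buys you; this is a sketch with a throwaway file (not the actual image), assuming GNU tar and coreutils:

```shell
# Create a 100 MB file that is one big hole: large apparent size, ~zero disk use.
truncate -s 100M t605_sparse.img

# Without -S/--sparse, tar would store 100 MB of literal zeros; with it,
# the holes are recorded compactly in the archive.
tar --create --sparse --file t605_sparse.tar t605_sparse.img

# Apparent size of the input vs. size of the sparse-aware archive:
ls -l t605_sparse.img t605_sparse.tar
du -h t605_sparse.tar
```

The archive stays tiny because only the sparse map and the (empty) data regions are recorded.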
The replacement requirements:
- faster than the current one
- deterministic
- handles sparse files efficiently
- currently the compression reduces a sparse workstation qcow2 file with a real size of ~4.5 GB (and an apparent size of 100 GB) to a ~1.5 GB tar.xz
- the new file size should be similarly small
- (not 100 GB reduced to ~30 GB)
The priority is high, since this slowness reduces my motivation to create Non-Qubes-Whonix images.
Comments
HulaHoop
2017-01-15 02:27:14 UTC
Patrick
2017-01-15 05:09:41 UTC
quote @anonymous1
I’m not experienced as to how to improve xz compression, assuming you don’t want to experiment with lesser-known compressors.
What I do know is that freearc (with its many unique compression filters and technologies, such as srep) and nanozip are the best compressors around in terms of both speed and compression ratio. I don’t know the details of the ticket you mentioned, but I think srep might be the best tool to speed it up again. But then it is not deterministic; I’m not sure if there is a way to make it so
quote @anonymous1
Could it help trying tar implementation of other programs like 7zip?
quote @anonymous1
bsdtar or star didn’t help?
How can I speed up operations on sparse files with tar, gzip, rsync? - Unix & Linux Stack Exchange
HulaHoop
2017-01-15 15:13:37 UTC
OK, so it’s reproducibility > speed
Most compression algorithms are deterministic. Being “adaptive” in no way contradicts being “deterministic”: it only means varying behavior based on input, so if the input is the same, so will be the output.
You can easily verify this by compressing the same file several times using an algorithm of your choice (zip, gzip, bzip2, 7z, etc.) and comparing the outputs. For example on linux, you can run this command several times to compress the file /etc/fstab and compare if its checksum is the same each time: gzip < /etc/fstab | md5sum -
Though the algorithm itself is indeed deterministic, the implementation will sometimes store additional information (file permissions, timestamps, etc.) which can make it look like the output is not deterministic. Adding a touch on the file between the compress and decompress can generate a different zip even though the file’s content did not change. That being said, it’s still deterministic once all parameters are factored in.
Any Deterministic Compression Algorithms out There? - Software Engineering Stack Exchange
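The quoted check is easy to reproduce locally; this sketch uses a throwaway file instead of /etc/fstab:

```shell
# Compress the same input twice; a deterministic compressor (plus its
# header handling) must produce byte-identical output both times.
printf 'the same input every time\n' > t605_input.txt

gzip < t605_input.txt | md5sum > t605_sum1
gzip < t605_input.txt | md5sum > t605_sum2

# Reading from a redirected file, gzip stores no varying metadata between
# the two runs, so the checksums come out identical.
cmp t605_sum1 t605_sum2 && echo "deterministic: checksums match"
```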
The next-best algorithm in speed is gzip, and it explicitly supports disabling timestamps with gzip:!timestamp; see tar(1).
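For GNU gzip invoked directly (outside of tar), the corresponding switch is -n/--no-name, which omits the original file name and timestamp from the header; a small sketch with throwaway file names:

```shell
printf 'payload\n' > t605_file.txt

# -n/--no-name drops the stored file name and timestamp, so the output
# depends only on the content:
gzip -nc t605_file.txt > t605_a.gz
touch t605_file.txt          # bump the mtime only; content is unchanged
gzip -nc t605_file.txt > t605_b.gz

cmp t605_a.gz t605_b.gz && echo "identical despite the mtime change"
```

Without -n, the second archive would differ because the new mtime would be embedded in the gzip header.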
With lz4 and tar this may be possible as it is with gzip:
compression - Preserve timestamp when compressing files with lz4 on linux - Stack Overflow
anonymous1
2017-01-18 20:31:32 UTC
anonymous1
2017-01-19 08:38:09 UTC
anonymous1
2017-01-19 08:41:39 UTC
anonymous1
2017-01-19 20:29:45 UTC
@Patrick good news
tar has finally added support for SEEK_DATA/SEEK_HOLE for sparse file detection in latest version 1.29
GNU tar - News: tar 1.29 [Savannah]
Upgrading to this version should speed up your compression without changing any command. Please let me know how it goes
Patrick
2017-01-20 12:58:31 UTC
Patrick
2017-01-20 13:12:41 UTC
Patrick
2017-01-21 00:04:55 UTC
anonymous1
2017-01-21 03:25:58 UTC
HulaHoop
2017-01-23 00:10:43 UTC
anonymous1
2017-03-07 20:59:04 UTC
Patrick
2017-03-10 03:44:48 UTC
The xz command uses only about 10% CPU and about 90 MB RAM. iotop -a shows below 1%.
Building on Debian stretch with xz-utils 5.2.2-1.2 / tar 1.29b-1.1. Using ext4 as the file system.
Any idea why system load is so low? I’d like a much higher load so it goes faster.
The libvirt_compress function at the time of writing:
developer-meta-files/release/prepare_release at bb1907e319acda314a1c57df200ff1696f979971 · Kicksecure/developer-meta-files · GitHub
The compression command from bash xtrace.
tar --create --verbose --owner=0 --group=0 --numeric-owner --mode=go=rX,u+rw,a-s --sort=name --sparse '--mtime=2015-10-21 00:00Z' --xz --directory=/home/user/whonix_binary --file Whonix-Gateway-14.0.0.4.0.libvirt.xz Whonix-Gateway-14.0.0.4.0.qcow2 Whonix-Gateway-14.0.0.4.0.xml Whonix_external_network-14.0.0.4.0.xml Whonix_internal_network-14.0.0.4.0.xml
The whole prepare_release script now took 21 minutes for Whonix-Gateway. libvirt archive creation is still the part that takes the longest. Archive size: 1.2 GB.
By adding the environment variable XZ_OPT="-0", time is down to 4:45 min and size is up to 1.4 GB.
By adding the environment variable XZ_OPT="-2", time is down to 7:45 min and size is up to 1.3 GB.
(XZ_OPT="-0 --fail" makes it fail as expected. I did that as a test to see whether the environment variable XZ_OPT is honored.)
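The same honored-or-not check can be scripted; the option name in the second call is deliberately bogus, mirroring the "-0 --fail" test above:

```shell
printf 'data\n' > t605_opt.txt

# A valid option in XZ_OPT is picked up silently:
XZ_OPT="-0" xz -c t605_opt.txt > /dev/null && echo "XZ_OPT=-0 accepted"

# An invalid option makes xz fail, proving XZ_OPT is actually read
# (the option name here does not exist):
if XZ_OPT="--no-such-option" xz -c t605_opt.txt > /dev/null 2>&1; then
    echo "unexpected success"
else
    echo "bogus XZ_OPT rejected: the variable is honored"
fi
```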
! In T605#11756, @anonymous1 wrote:
@Patrick good news
tar has finally added support for SEEK_DATA/SEEK_HOLE for sparse file detection in latest version 1.29
GNU tar - News: tar 1.29 [Savannah]
Upgrading to this version should speed up your compression without changing any command. Please let me know how it goes
As per GNU tar - News: tar 1.29 [Savannah], it should automatically be using seek hole detection on systems that support it. How do I find out whether my system supports it, or how to enable it?
If you have other ideas to speed it up / shrink the size while keeping it reproducible, could you suggest changes to the prepare_release script, please? Perhaps by making a GitHub pull request?
anonymous1
2017-03-10 04:56:52 UTC
anonymous1
2017-03-10 05:03:47 UTC
anonymous1
2017-03-10 05:24:19 UTC
there is some related information here:
Utilizing multi core for tar+gzip/bzip compression/decompression - Stack Overflow
you could also try XZ Utils; it has had multi-threaded compression support for some time. Perhaps you could do this:
tar --use-compress-program=xz
but it may not be multi-threaded then. The documentation for xz states that:
Multi-threaded compression can be enabled with the --threads (-T) option.
I’m not sure if you can use this option from inside tar, though
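Two ways the -T option can reach xz from inside tar; a sketch with throwaway file names, assuming a reasonably modern GNU tar and xz:

```shell
printf 'data\n' > t605_tar.txt

# Variant 1: modern GNU tar passes a quoted command with its arguments
# through to the compressor (assumed supported by the tar in use):
tar --use-compress-program='xz --threads=0' -cf t605_a.tar.xz t605_tar.txt

# Variant 2: xz reads extra options from XZ_OPT, so plain --xz also works:
XZ_OPT="--threads=0" tar --xz -cf t605_b.tar.xz t605_tar.txt

# Both results are valid .xz streams:
xz -t t605_a.tar.xz && xz -t t605_b.tar.xz && echo "both archives valid"
```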
anonymous1
2017-03-10 05:31:18 UTC
anonymous1
2017-03-10 11:54:05 UTC
anonymous1
2017-03-10 12:25:58 UTC
It seems you could use something like this with xz utils:
export XZ_OPT="--threads=0"
-T threads, --threads=threads
Specify the number of worker threads to use. Setting threads to a special value 0 makes xz use as many threads as there are CPU cores on the system. The actual number of threads can be less than threads if the input file is not big enough for threading with the given settings or if using more threads would exceed the memory usage limit. Currently the only threading method is to split the input into blocks and compress them independently from each other. The default block size depends on the compression level and can be overridden with the --block-size=size option.
Patrick
2017-03-10 17:06:23 UTC
xz(1) — xz-utils — Debian testing — Debian Manpages
"--threads=0" results in 100% CPU usage, yay!
XZ_OPT="-0 --threads=0"
Time down to 1:35.
Size up to 1.4 GB.
XZ_OPT="-9 --extreme --threads=0"
XZ_OPT="-6 --threads=0"
4:28
1.2 GB
reproducible: yes
XZ_OPT="-6 --threads=0"
installed pxz
replaced --xz with --use-compress-program=pxz
(really uses pxz and not xz as per ps aux)
Looks like using tar with --use-compress-program=pxz really is not worth it.
Perhaps worth trying pxz directly without tar? But then we might be back to non-reproducibility. Needs testing. Does pxz support auto-detecting the maximum number of threads that can be used?
Patrick
2017-03-10 17:12:46 UTC
anonymous1
2017-03-10 17:50:03 UTC
anonymous1
2017-03-10 17:53:58 UTC
But I have a feeling it would produce different archives with different numbers of threads: single core vs dual core vs quad core vs custom VM cores
you could test this by setting the threads to 1, 2, 3, 8 and so on
Patrick
2017-03-10 18:11:32 UTC
anonymous1 (anonymous1):
anonymous1 added a comment.
But I have a feeling it would produce different archives with different number of threads, single core vs dual core vs quad core vs custom vm cores
Good point.
Just now tested --threads=1 vs --threads=8. Different checksum.
--threads=8 vs --threads=8, however, results in the same checksum.
Perhaps I should change --threads=0 to --threads=8? (Assuming a quad core with 2 threads per core?)
May not be a problem on slower machines. --threads=30 (for testing purposes, exceeding available threads) also worked for me.
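Those checksum observations can be reproduced with a throwaway input file; a sketch, not the release tooling:

```shell
# A compressible throwaway input:
head -c 1000000 /dev/zero > t605_big.bin

xz -c --threads=1 t605_big.bin | md5sum > t605_t1.sum
xz -c --threads=8 t605_big.bin | md5sum > t605_t8a.sum
xz -c --threads=8 t605_big.bin | md5sum > t605_t8b.sum
xz -c --threads=30 t605_big.bin | md5sum > t605_t30a.sum
xz -c --threads=30 t605_big.bin | md5sum > t605_t30b.sum

# Single- vs multi-threaded mode: different archive bytes.
cmp -s t605_t1.sum t605_t8a.sum || echo "threads=1 vs threads=8: different"
# The same --threads value is repeatable, even when 30 exceeds the core count.
cmp t605_t8a.sum t605_t8b.sum && echo "threads=8 is repeatable"
cmp t605_t30a.sum t605_t30b.sum && echo "threads=30 is repeatable"
```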
Patrick
2017-03-10 18:22:08 UTC
anonymous1 (anonymous1):
anonymous1 added a comment.
you could also try lowering or increasing the compression dictionary size to see how it affects the size and speed, however I don’t know the commands
Is this different from the -0 to -9 (--extreme) compression settings?
In other words… Do you think it is worth playing with various "--dict=" settings independently of the compression level setting for better speeds or smaller file sizes?
anonymous1
2017-03-10 19:02:58 UTC
I think the default settings are optimal
--threads=8 should still work on slower machines; however, it would work like --threads=4 or --threads=2, I guess. In that case choosing the default threads is up to you. Could you try with 4? It may not be too different from 8
anonymous1
2017-03-10 19:32:30 UTC
If you have 8 threads and if using more than 8 produces same checksum as 8, then what I said would be true
I would recommend 4 max, but it’s your choice
It’s also a good idea to test with same threads on different machines whether there is any variation or not
anonymous1
2017-03-10 19:40:42 UTC
Patrick
2017-03-10 19:53:59 UTC
! In T605#12649, @anonymous1 wrote:
I think the default settings are optimal
Okay.
--threads=8 should still work on slower machines, however it would work like --threads=4 or --threads=2 I guess. In that case choosing the default threads is up to you, could you try with 4? it may not be too different from 8
4 uses only 50% of CPU.
Done, made that 8:
https://github.com/Whonix/Whonix/commit/17581ebbd05cc04f5ed52637e675481ddecc0845
! In T605#12650, @anonymous1 wrote:
If you have 8 threads and if using more than 8 produces same checksum as 8, then what I said would be true
I would recommend 4 max, but it’s your choice
It’s also a good idea to test with same threads on different machines whether there is any variation or not
Theoretically, let’s say a single-core machine might produce a different checksum than a quad core due to threads. But I doubt that. It’s probably not using physical CPU threads but virtual CPU threads. top -H easily shows more than 500 virtual threads on a usual Linux system.
A few more threads than physical threads will probably have only a negligible performance penalty. 8 vs 4 should not matter on a slow system. (However, I speculate 10000 threads would cause significant overhead.)
! In T605#12665, @anonymous1 wrote:
I think in the worst case you could care less about a perfectly reproducible end archive (tar.xz) and instead focus on the extracted (tar) file being reproducible
Having the final file reproducible makes verification instructions and automation a lot easier.
Then it’s just “rebuild the libvirt.xz, and compare the hashes”.
Otherwise it’s "rebuild libvirt.qcow2, download libvirt.xz, extract the qcow2, and compare the hashes of the qcow2 files, not the libvirt.xz files."
Hypothetically, the compressed libvirt.xz could contain an exploit against xz that compromises the system during decompression. By having a reproducible libvirt.xz we can exclude that.
For now it does not really matter if libvirt.xz is reproducible. It’s very forward-thinking, since reproducible Whonix images are unfortunately still far away; see:
https://forums.whonix.org/t/is-whonix-reproducible-yet-backdoor-protection
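For illustration, the one-step verification amounts to a single hash comparison; all file names here are placeholders, not real release artifacts:

```shell
# Placeholders standing in for a local rebuild and the published download:
printf 'stand-in archive bytes\n' > t605_rebuilt.libvirt.xz
cp t605_rebuilt.libvirt.xz t605_downloaded.libvirt.xz

# One-step check: hash both archives and compare; with a reproducible
# libvirt.xz, nothing needs to be extracted first.
sha256sum t605_rebuilt.libvirt.xz t605_downloaded.libvirt.xz
cmp t605_rebuilt.libvirt.xz t605_downloaded.libvirt.xz \
    && echo "archives match: nothing needs to be extracted"
```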
anonymous1
2017-03-10 20:27:19 UTC
Could you please check how long it takes with 4 threads? Using 50% of CPU is expected; it does not necessarily mean it will take twice as long
What I expect is that a PC with 4 threads may not reproduce the same archive; even if --threads=8 doesn’t give any error, it may produce an archive as if you used --threads=4
If 4 threads doesn’t change the build time much, that would be a safer default
anonymous1
2017-03-10 20:51:50 UTC
Patrick
2017-03-10 21:13:40 UTC
anonymous1 (anonymous1):
anonymous1 added a comment.
Could you please check how long it takes with 4 threads? Using 50% of CPU is expected; it does not necessarily mean it will take twice as long
Takes 1 minute longer.
What I expect is that a PC with 4 threads may not reproduce the same archive; even if --threads=8 doesn’t give any error, it may produce an archive as if you used --threads=4
It doesn’t. I already tried a number that exceeds my physical cores by more than twice (30) and had the same checksum each time I used the same number of threads (30). I am pretty sure it’s virtual, not physical threads. At that level of abstraction, it would make little sense to see "wanted 30 threads, but just got 4 physical cores, will silently reduce threads to 8".
anonymous1
2017-03-10 21:19:18 UTC
Did you compare your --threads=30 archive with --threads=8 archive?
They may turn out to be the same, or you can try any number bigger than 8
If that’s the case, 4 is a safer default for reproducibility, with little impact on speed
anonymous1
2017-03-10 21:22:07 UTC
anonymous1
2017-03-10 21:36:16 UTC
I may be wrong; the best way to test this is maybe to create the same archive with half of the available cores in a VM. However, I can’t do this; I don’t have Debian stretch.
But at least one thing is clear: the memory requirement is directly proportional to the number of threads, and if the machine at hand does not meet those requirements, it will lower the number of threads
anonymous1
2017-03-10 21:48:25 UTC
anonymous1
2017-03-10 22:53:43 UTC
Sorry for all this confusion. I think it is only a difference in whether the program "tries" to operate in single-threaded or multi-threaded mode. When we use --threads=1 or don’t specify it (the default is 1), it compresses the whole file in a single block. However, setting --threads to 0 or anything bigger than 1 triggers the multi-threaded mode: the file is split into blocks depending on the compression level and then compressed, resulting in a difference in the archive file. How many threads are actually used is irrelevant. Changing the compression level or manually specifying the block sizes will change the outcome.
By setting --threads to 8 instead of 0 we actually enforce the multi-threaded mode (splitting the file into blocks) and prevent at least one cause of non-determinism: when --threads is set to 0 on a single-core machine, xz operates in "single-threaded mode" and compresses the file in a single block, whereas setting it to anything higher than 1 enforces "multi-threaded mode" without being multi-threaded at all, but still splits and compresses the file in blocks. You can see this with a VM.
So it should be safe to set --threads to 8 or 16 or higher; this option means: "use at most NUM threads"
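That explanation implies a testable property: in multi-threaded mode the output depends on the block layout (set by the compression level), not on how many worker threads actually run. A sketch with a throwaway file:

```shell
head -c 1000000 /dev/zero > t605_blocks.bin

xz -c --threads=2 t605_blocks.bin | md5sum > t605_mt2.sum
xz -c --threads=8 t605_blocks.bin | md5sum > t605_mt8.sum
xz -c --threads=1 t605_blocks.bin | md5sum > t605_st.sum

# Any --threads > 1 selects the same block layout at a given level,
# so the archives are identical regardless of the thread count:
cmp t605_mt2.sum t605_mt8.sum && echo "threads=2 == threads=8"
# --threads=1 compresses in a single block and therefore differs:
cmp -s t605_st.sum t605_mt2.sum || echo "single-threaded mode differs"
```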
anonymous1
2017-03-11 15:13:17 UTC
Patrick
2017-03-13 11:40:30 UTC
! In T605#12670, @anonymous1 wrote:
Did you compare your --threads=30 archive with --threads=8 archive?
Doesn’t make a difference in speed.
! In T605#12675, @anonymous1 wrote:
If you have time, could you check how long it takes with 5 or 6 threads? I think it will be nearly equal to 8; not for reproducibility reasons, just for efficient use of system resources. There is probably no reason to use 16 cores on a machine that supports it; that would be overkill
6 causes 75% CPU and takes 4:59 minutes.
Actually, I am for maximum system resource usage as the default value. Capping should be user opt-in, via custom settings through the operating system by the user. (Run in a VM or use other tools to add caps to the build script.)
Patrick
2019-04-04 18:17:58 UTC