Improving compression of Whonix image downloads

You should use a different archive/compression format, one that supports deduplication.

Whonix-Gateway and Whonix-Workstation share a large number of identical binary files, especially the ones in /usr.
The differences are small and mostly confined to /etc.

Last weekend I did the following test.
I unpacked your big Whonix-XFCE-14.0.1.4.4.libvirt.xz archive file.
This file has a compressed size of 1099884060 bytes, around 1100 MB.

Then I packed the two files
Whonix-Gateway-XFCE-14.0.1.4.4.qcow2
and
Whonix-Workstation-XFCE-14.0.1.4.4.qcow2
individually.
I ignored the xml files; they are so small that they do not matter for this comparison.

I used the following commands:

tar -cvf workstation.tar Whonix-Workstation-XFCE-14.0.1.4.4.qcow2 
xz -z workstation.tar 
tar -cvf gateway.tar Whonix-Gateway-XFCE-14.0.1.4.4.qcow2
xz -z gateway.tar 

This resulted in the following two files:

-rw-rw-r-- 1 XXXXXXXX XXXXXXXXX 498373356 Mar 30 02:37 gateway.tar.xz
-rw-rw-r-- 1 XXXXXXXX XXXXXXXXX 638927196 Mar 30 01:43 workstation.tar.xz

If you do the math you get the following:
498373356 + 638927196 = 1137300552 bytes,
compared to the 1099884060 bytes before, which even included all the additional *.xml files.

This shows that the compression isn't good.
As I said before, the two *.qcow2 files have a lot in common.
If you mount them, you can see that their directory trees are nearly identical.
Both use Debian, so both contain the same program and library files.
Only the configuration differs, plus a small amount of additional software only present on the Gateway or the Workstation.

This means that if you use a compression format that can find these file duplications, you can save around 500 MB.

You could try using a raw image file instead of qcow2; this might make it easier for the compression software to find the duplicates at the byte level.

Or you could build up the two images with a script on the user's side from just one big file. There are plenty of possibilities.
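
For example, zstd's long-range matching mode is one candidate. This is just a sketch I have not benchmarked against these images, but its match window can be made large enough to span from one image into the other:

tar -cf images.tar Whonix-Gateway-XFCE-14.0.1.4.4.qcow2 Whonix-Workstation-XFCE-14.0.1.4.4.qcow2
zstd --ultra -22 --long=31 -T0 images.tar -o images.tar.zst
# decompression must allow the same window size:
zstd -d --long=31 images.tar.zst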

1 Like

Deduplication would be very much welcome to save upload time (over Tor) and download traffic and user download time.

  • The input files (qcow2 images) are not yet deterministic/reproducible. Hopefully at some point in the future, as the Debian reproducible builds project progresses.
  • Nevertheless the output (the compressed archive) should be deterministic/reproducible. (I.e. the same images with the same compression command should end up as a byte-for-byte identical archive.)
  • qcow2 images are sparse files.
  • Not using gzip, because it cannot handle sparse files.
  • Previous related discussion: speed up libvirt tarball creation time, which might contain suggestions.
    • Compression time may not be a priority anymore since KVM builds are nowadays done by KVM maintainer @HulaHoop.
  • The related source code is here.
  • help-steps/variables contains:

[ -n "$XZ_OPT" ] || XZ_OPT="--threads=8"
export XZ_OPT
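
One knob that could be experimented with here (a hypothetical variation, untested): a bigger LZMA2 dictionary lets xz find matches that are further apart. Note, though, that --threads splits the input into independent blocks, which prevents matches across block boundaries:

[ -n "$XZ_OPT" ] || XZ_OPT="--lzma2=preset=9e,dict=512MiB"
export XZ_OPT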

Patches are welcome.

I have some suggestions.

Suggestion 1:

Make /usr its own partition and qcow2 file, and put all the Gateway- or Workstation-specific things that are not common into /opt, inside their own qcow2 images.

This leads to 3 qcow2 files:
gateway.qcow2
workstation.qcow2
and
base_or_common.qcow2

It is not perfect, because /lib and the kernel in /boot are still in gateway.qcow2 or workstation.qcow2, but the big chunk of data of around 400-500 MB resides in a common qcow2 file, which can be duplicated after downloading with a sparse-aware copy (virt-clone operates on whole VMs, not on bare disk files, so a plain cp is enough here):

cp --sparse=always base_or_common.qcow2 gateway_base.qcow2
mv base_or_common.qcow2 workstation_base.qcow2

This way, you will make the download 400-500 MB smaller.

When starting the virtual machine the user only needs to add two qcow2 images to the VM.

For gateway VM:
gateway.qcow2 and gateway_base.qcow2

and for workstation VM:
workstation.qcow2 and workstation_base.qcow2

On the positive side, this solution will work and is still easy enough for the typical user.
To make it even easier, just provide a script that does the cloning and renaming.
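
Such a script could be as small as this sketch (file names as used above):

#!/bin/sh
# post-download helper (sketch): turn the shared base image into two per-VM images
set -e
cp --sparse=always base_or_common.qcow2 gateway_base.qcow2
mv base_or_common.qcow2 workstation_base.qcow2
echo 'Created gateway_base.qcow2 and workstation_base.qcow2'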

For compiling Workstation- or Gateway-specific software: most open source projects that use autoconf ./configure scripts accept the option --prefix=/opt.
This makes sure that the software gets installed to /opt and not to /usr.
For example, you might compile Tor Browser and install it to /opt; if Firefox uses autoconf, you can build it with the following commands:

./configure --prefix=/opt
make
make install # this can be replaced by the thing you do when creating a deb package

Suggestion 2:

An alternative to the suggestion above is to store the system files in a raw image file with a maximum size of around 10 GB instead of the 100 GB it is now, and to keep only the partition for the /home directory in qcow2 sparse files.
Usually you don't need 100 GB partitions just for system files.
On Linux, big partitions are usually only needed for /home.

By using smaller raw image files instead of qcow2 image files, the compression software might achieve better compression of the duplicates.

In this case you will have to pack 4 virtual machine image files.
gateway_home.qcow2
workstation_home.qcow2
gateway_system.raw
workstation_system.raw

into one whonix.tar.xz archive.

Because gateway_system.raw and workstation_system.raw are in a raw image format, the compression software might be able to compress the duplicates much better.
At least that's what I assume; this has to be tested.
This will also keep the image files on the user's computer small, because system partitions do not grow as big as /home partitions.
For the latter you have qcow2, which can grow as needed.
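
The packing step itself would then just be something like this (names as above; tar's -S option handles the sparse files):

tar -cSf whonix.tar gateway_home.qcow2 workstation_home.qcow2 gateway_system.raw workstation_system.raw
xz -z whonix.tar # produces whonix.tar.xz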

The first suggestion will definitely work; the second might work too, but needs some testing.

EDIT:
Suggestion 3:

Maybe it's also possible to work with some sort of sophisticated diff and merge utility
that diffs out the common data of the two qcow2 files and later merges it back into them.
But I don't know whether such sophisticated diff and merge utilities exist.
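
One existing tool in this direction is xdelta3, which computes binary deltas between files; whether it copes well with qcow2 internals would have to be tested, but the idea would be to ship one image plus a delta and reconstruct the second image on the user's side:

# on the build side: encode the workstation image as a delta against the gateway image
xdelta3 -e -s gateway.qcow2 workstation.qcow2 workstation.vcdiff
# on the user's side: reconstruct the workstation image
xdelta3 -d -s gateway.qcow2 workstation.vcdiff workstation.qcow2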

2 Likes

A ton of development work. Incompatible with the current way the images are built (base image + installation of Whonix packages). Would require totally re-organizing all packages. A lot of software (like Tor Browser) that is supposed to be installed on the workstation would then be installed on the gateway too. Or it would somehow have to be moved to the right place at first boot. Most unrealistic.

I don't think 10 GB vs 100 GB makes any difference. Sparse files aren't an issue during compression. It could even be 1000 GB without making any difference.

I would really wonder if that is the case. Unused space is just a description, not actually initialized data.

Ideally the complexity of an installer script just for that very purpose can be avoided.

Ideally, we’d just find the proper compression tool and command line.

Okay, I didn't look at how much work it really is.
I just assumed that Whonix is basically Debian + edited config files, which are usually in /etc, not in /usr, plus some additional Tor related software.

Tor Browser won't be installed on the gateway, because Tor Browser is installed to /opt, not to /usr.
And /opt is in the specific Workstation image; the same applies to the config files in /etc and /var.

Workstation or Gateway specific image contains:
/home
/etc
/boot
/var
/opt # Tor software
/lib
/lib64
/dev
…

Specific means that /etc on the gateway and /etc on the workstation are not the same. The same applies to /opt and the rest.

common image contains:
/usr # command line tools, C++ libs, maybe Xorg and XFCE if you also want to have those on the gateway

The good thing about the Filesystem Hierarchy Standard of Unix/Linux is that config files are usually not in /usr.
This even makes it possible to mount /usr read-only or to put it somewhere as a network share.

Thus the amount of work basically depends on how many different binaries of the same software you have in /usr.
And I doubt that this is a lot, if most of the Debian files in /usr stay untouched.

Well, the problem is the duplicate files.
If you pack file system metadata together with a binary file and compress it via qcow2 sparse files, which do a compression of their own, then your two qcow2 files will always differ a lot, which makes it very difficult for the compression software to find duplicates.

If you use raw images, i.e. do not compress the image file, the file system metadata and the binaries inside stay independent.
This should allow better compression in the end.

Let's say you have a file in /usr called foolib.so on both the gateway and the workstation.
If you put it in a raw image file, the byte sequence will be the same on gateway and workstation.
If you put it in a sparse file, the way sparse files work will compress it together with the metadata:
foolib.so + filesystem metadata like date 12/01/2019 = compressed byte sequence A
foolib.so + filesystem metadata like date 13/01/2019 = compressed byte sequence B

If you hash A and B, the resulting hashes A' and B' will differ a lot.
How should xz or gzip compress A and B well, if they differ so much?

Thus the solution is to not compress the virtual machine image files and to only compress the archive for download.

The reason to put the system data into its own image file is that the user doesn't want to store a 100 GB big raw image file if he doesn't use all of that space inside Whonix.
Thus you need a growable image file just for /home.
But the system files can be put inside a raw image file for better compression of the duplicates.

1 Like

I understand that if the files are compressed individually, a lot less duplication can be found and compressed during archive creation. However, I hope that's not what's happening here.

That’s interesting.

We use the following to convert from raw to qcow2 during the build process.

  qemu-img \
     convert \
        -p \
        -O qcow2 \
        -S "$VMSIZE" \
        -o cluster_size=2M \
        -o preallocation=metadata \
        "$binary_image_raw" \
        "$binary_image_qcow2"

We are not using qemu-img with the -c option for compression. Quoting the qemu-img man page:

It can be optionally compressed ("-c" option)

I guess we wouldn't want to use compression at the image level anyhow, since that would lower performance. On the contrary, a long time ago -o preallocation=metadata was introduced for better performance.

Do you mean that our use of -o preallocation=metadata adds compression even though we are not using -c?

Quoting the qemu-img man page:

sparse_size indicates the consecutive number of bytes (defaults to 4k) that must contain only zeros for qemu-img to create a sparse image during conversion. If sparse_size is 0, the source will not be scanned for unallocated or zero sectors, and the destination image will always be fully allocated.

Also, Wikipedia does not indicate that sparse files are related to compression.

Things to be tested:

  • compress the raw images to an archive instead and see if that results in a smaller archive size
  • drop the sparse option -S, see if that helps
  • also drop -o cluster_size=2M and -o preallocation=metadata, see if that helps
1 Like

sudo --non-interactive -u user qemu-img convert -p -O qcow2 -S 100G -o cluster_size=2M -o preallocation=metadata /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.raw /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.qcow2

That command finished quite fast. 2:20 minutes. Doesn’t indicate compression.

Let’s look at the file sizes.

du -hs /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.raw

3.5G /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.raw

du -hs /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.qcow2 

2.4G /home/user/whonix_binary/Whonix-Gateway-XFCE-14.0.1.4.4.qcow2

Alright, 1.1G less might be due to compression.

1 Like

Doesn’t help.

Also doesn’t help.


That indeed minimized image sizes. Took 30 minutes. Compressed raw rather than qcow2 images:

du -sh Whonix-XFCE-14.0.1.4.4.libvirt.xz

1.1G Whonix-XFCE-14.0.1.4.4.libvirt.xz

Even smaller than the Whonix ova for VirtualBox, which contains vmdk images compressed by VirtualBox default.

du -sh Whonix-XFCE-14.0.1.4.4.ova

1.6G Whonix-XFCE-14.0.1.4.4.ova

1 Like

I think the -c option is only related to the qcow version 1 format, not qcow2.
The man page states:

1 Like

Is this really based on the two raw image files?
1.1 G looks a little big to me, and it is not smaller than the initial archive with the two qcow2 images.
If it could find the duplicates, the compression should be much better than these 1.1 GB.

1 Like

I am very sure. I did not have a workstation qcow2 image before that. (HulaHoop builds and uploads them.) To double check, I scrolled up just now.

sudo --non-interactive -u user tar --create --verbose --owner=0 --group=0 --numeric-owner --mode=go=rX,u+rw,a-s --sort=name --sparse '--mtime=2015-10-21 00:00Z' --xz --directory=/home/user/whonix_binary --file Whonix-XFCE-14.0.1.4.4.libvirt.xz WHONIX_BINARY_LICENSE_AGREEMENT Whonix-Gateway-XFCE-14.0.1.4.4.raw Whonix-Workstation-XFCE-14.0.1.4.4.raw Whonix-Gateway-XFCE-14.0.1.4.4.xml Whonix-Workstation-XFCE-14.0.1.4.4.xml Whonix_external_network-14.0.1.4.4.xml Whonix_internal_network-14.0.1.4.4.xml
tar: Option --mtime: Treating date '2015-10-21 00:00Z' as 2015-10-21 00:00:00
WHONIX_BINARY_LICENSE_AGREEMENT
Whonix-Gateway-XFCE-14.0.1.4.4.raw
Whonix-Workstation-XFCE-14.0.1.4.4.raw
Whonix-Gateway-XFCE-14.0.1.4.4.xml
Whonix-Workstation-XFCE-14.0.1.4.4.xml
Whonix_external_network-14.0.1.4.4.xml
Whonix_internal_network-14.0.1.4.4.xml

Perhaps it is because the raw images are themselves sparse files?

du -sh Whonix-Gateway-XFCE-14.0.1.4.4.raw
3.5G    Whonix-Gateway-XFCE-14.0.1.4.4.raw
user@host:~/whonix_binary$ du -sh --apparent-size Whonix-Gateway-XFCE-14.0.1.4.4.raw
100G    Whonix-Gateway-XFCE-14.0.1.4.4.raw

And they were also treated with zerofree beforehand.

1 Like

Okay, it seems that the compression can't find the duplicates even when raw image files are used.

I also did some tests: I converted the two qcow2 files to raw, put them in an archive with tar and compressed them again with xz.
My result was the following:

1099884060 Whonix-XFCE-14.0.1.4.4.libvirt.xz
1137252192 whonix_raw.tar.xz

So it's even bigger, and that is without the xml files.
I wonder if other compression preset levels or different chunk sizes might help; according to the xz manpage the default compression preset level is 6.

I also did a little research about virtual machine images and deduplication.
You might take a look at these papers:

3 Likes

I did a little further testing.
I created 3 directories

mkdir dupes
mkdir dupes/workstation
mkdir dupes/gateway

and mounted the two qcow2 files read-only with guestmount.

sudo guestmount -a Whonix-Workstation-XFCE-14.0.1.4.4.qcow2 -m /dev/sda1 --ro dupes/workstation/
sudo guestmount -a Whonix-Gateway-XFCE-14.0.1.4.4.qcow2 -m /dev/sda1 --ro dupes/gateway/                         

After that I ran fdupes as root to calculate how many files are duplicates and what their total uncompressed size is:

fdupes -r -m dupes/
86503 duplicate files (in 63077 sets), occupying 1890.7 megabytes

The summarized size means that this is the extra space the duplicates occupy, i.e. the size that could be avoided.

Or, in other words, as an example: a test folder with two identical files of around 7 MB each gives:

# ls -l test/
total 14072
-rw------- 1 root root 7188984 Apr  3 18:49 a.img
-rw------- 1 root root 7188984 Apr  3 18:49 b.img
fdupes -r -m test/
1 duplicate files (in 1 sets), occupying 7.2 megabytes

So we should keep in mind that fdupes shows the extra space that is occupied,
the space that could be avoided.

The two whonix folders have a combined size of

du -sm dupes
4327    dupes

This means that of these 4327 MB, 43.69 % is wasted space because of duplicates.
It could be 1890 MB smaller, with a combined size of only 2438 MB.

The two qcow2 files, by the way, have a size of:

2085    Whonix-Gateway-XFCE-14.0.1.4.4.qcow2
2536    Whonix-Workstation-XFCE-14.0.1.4.4.qcow2

4621 MB

I also checked how much space could be saved if the directory /usr were in its own partition in a common virtual machine image.
Here the result was:

fdupes -r -m  gateway/usr workstation/usr
75987 duplicate files (in 54722 sets), occupying 1604.6 megabytes

Total file size in gateway/usr and workstation/usr is:

du -sm  gateway/usr workstation/usr
1599    gateway/usr
1852    workstation/usr

= 3451 MB
Total file count in gateway/usr and workstation/usr is:

find gateway/usr/ -type  f | wc -l
67674

and

find workstation/usr/ -type  f | wc -l
73340

= 141014 files

What is strange is the number of duplicate files, 75987.
It seems to be higher than the number of files in each directory.
I don't know why.
141014 is also what fdupes shows at the beginning when I start it.
Maybe the 54722 sets are the real number?
That would at least make more sense.

This also means that if you put the non-duplicates into /opt, it wouldn't be a lot of files, but you would save 1604 MB.
The other 286 MB of the 1890 MB of duplicates are probably in /lib, /bin and /boot:
/lib has a size of 228 MB in the workstation image,
/bin 11 MB,
/boot 30 MB.
But files in /lib are needed at boot time, and the same applies to /bin and /boot, so putting them somewhere else is not an option.

3 Likes

Maybe these are symlinks, which are ignored by the find command with the -type f option.
In that case the number 54722 should be the real number of files.

If we assume that it is that way, then
workstation/usr has 73340 - 54722 = 18618 non-duplicate files
and
gateway/usr has 67674 - 54722 = 12952 non-duplicate files.
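
That theory could be checked quickly by counting the symlinks separately, for example:

find gateway/usr/ -type l | wc -l
find workstation/usr/ -type l | wc -l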

3 Likes

I have an idea.
Most of these non-duplicate files in /usr are very likely just additional software from Debian packages that is required for the workstation but not for the gateway.
And if they are plain Debian packages, then they are not Tor or Whonix specific.

And apt downloads and stores deb packages in /var/cache/apt/archives/ before installing them.
You could build the Workstation-specific image in such a way that it already contains these deb files in /var/cache/apt/archives/ but doesn't have them installed into /usr.
By having them already in the Workstation-specific image, they don't have to be downloaded via the slow Tor network.

And when the user boots up the workstation for the first time, a script could install these deb packages from this directory automatically.

By doing so, you could have a common /usr vm image file as mentioned before in suggestion 1 without a lot of work,
because all the Debian packages stay the same and unaltered.
Only a very small amount of Tor or Whonix specific software must go to /opt,
software like the Tor Browser or the Tor software itself.
But usually you build the Whonix or Tor specific software anyway, so it's only a change of the installation destination.

A common vm image file just for /usr and the duplicates is definitely the best way to keep the download size low.
And the non-duplicates stay as deb packages in /var… which can get installed by a script during first boot, along the lines of the sketch below.
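
Such a first-boot script could be quite small. A sketch, assuming all dependencies are among the cached packages:

#!/bin/sh
# first boot (sketch): install the pre-seeded packages offline
set -e
dpkg -i /var/cache/apt/archives/*.deb || apt-get --fix-broken --no-download --yes install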

1 Like

@Firefox dang, you really thought about this. It's good to see a thought-out proposal instead of the usual drive-by user requests.

2 Likes

Might be doable in theory, but…
KVM-only changes that are not compatible with VirtualBox are very, very unlikely to be contributed.
A third, shared image is very, very unlikely to materialize.
A third, shared image adds a lot of complexity and increases the size of the code base, meaning even fewer people would read and understand the source code.
The available labor is very low.

Changing the code for image compression is realistic. But changing the build process for something that benefits KVM only: I am certainly not going to write that code.

A common vm image also seems risky from a security point of view. Which VM would be authorized to make changes to it (update it)?

A shared image can make a lot of sense. Qubes OS TemplateVMs share their root image. Simplified:

  • TemplateBasedVMs can write to the root image (which is /usr and others) but these changes will not be shared with any other VMs and will be lost after shutdown
  • TemplateVMs can write to the root image persistently
  • after TemplateVM shutdown and TemplateBasedVMs reboot the updated root file system is available to the TemplateBasedVMs
  • this allows huge savings in disk space and centralized updates, i.e. only the TemplateVM has to be updated and shut down; after a reboot of the TemplateBasedVMs, everything will be up to date.
  • Template implementation | Qubes OS

But I don’t see anyone reinventing that for Whonix KVM only.

Development work, and new bugs.

Well, the principles are the same for both KVM and VirtualBox. As far as I know, KVM and VirtualBox are just different virtual machine software, each with its specific VM image format, and both understand the raw image format or can convert from raw to their VM-specific image format.

Thus building 3 raw images and converting them to qcow2 and vdi should be all that is needed to make them VM specific.
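
The conversion step itself is a qemu-img one-liner per target format (illustrative file names):

qemu-img convert -O qcow2 base_or_common.raw base_or_common.qcow2 # KVM
qemu-img convert -O vdi base_or_common.raw base_or_common.vdi # VirtualBox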

Common vm image also seems risky from a security point of view. Which VM would be authorized to make changes to it (update it)?

No, the common vm image is only shared for step 1, which is the step of downloading the files. The common vm image file makes the download smaller by removing the extra size taken up by duplicates; that's the whole point of this thread.

In step 2 the user clones the common image for the Whonix Workstation and renames the other copy for the Whonix Gateway, or vice versa, as described above.
Thus in step 3, when the user boots the Workstation and the Gateway, both have their own image file for /usr.
So to sum it up: at runtime and in later usage these are individual, independent images.

That is also needed because in step 4, when the Workstation or Gateway images are booted for the first time, a script automatically installs the non-duplicate deb packages, as described above, from /var/cache/apt/archives/, making the image files gateway_usr.qcow2 and workstation_usr.qcow2 completely independent and different from each other.

The only things to keep in mind are:
A. By cloning the common.qcow2 file, the UUID of the disk image changes.
B. The virtual machine config file may require a change to the new UUID if it is UUID-based and not filename-based.
C. Inside the Workstation or Gateway, the /etc/fstab file must be changed according to the UUID. But this should be doable by the script that also installs the deb files when booting the first time.

Steps A and B should be doable by a script the user has to run after download.
Step C can run automatically when the images are booted for the first time.

Most open source software offers a way to set a destination for the installation during the makefile creation process.

Thanks.

Firefox via Whonix Forum:

Well, the principles are the same for both KVM and VirtualBox. KVM and VirtualBox are just different virtual machine software with their specific VM image formats, and both understand the raw image format or can convert from raw to their VM-specific image format.

Thus building 3 raw images and converting them to qcow2 and vdi should be all that is needed to make them VM specific.

VirtualBox: This would have to be done on the user's machine. Something
simple (ova image to import) would be converted into something more
complex and platform specific. A script, one for Linux, another one for
Windows, and perhaps another one for macOS. Rather than "just import the
ova using VirtualBox" it's "extract, make the script executable, run the
script" (or in case of Windows perhaps an installer). Possible, but no
development resources for that.

Common vm image also seems risky from a security point of view. Which VM would be authorized to make changes to it (update it)?

No, the common vm image is only shared for step 1, which is the step of downloading the files. The common vm image file makes the download smaller by removing the extra size taken up by duplicates; that's the whole point of this thread.

I see.

In step 2 the user clones the common image for the Whonix Workstation and renames the other copy for the Whonix Gateway, or vice versa, as described above.
Thus in step 3, when the user boots the Workstation and the Gateway, both have their own image file for /usr.
So to sum it up: at runtime and in later usage these are individual, independent images.

That is also needed because in step 4, when the Workstation or Gateway images are booted for the first time, a script automatically installs the non-duplicate deb packages, as described above, from /var/cache/apt/archives/, making the image files gateway_usr.qcow2 and workstation_usr.qcow2 completely independent and different from each other.

I see.

The only things to keep in mind are:
A. By cloning the common.qcow2 file, the UUID of the disk image changes.
B. The virtual machine config file may require a change to the new UUID if it is UUID-based and not filename-based.
C. Inside the Workstation or Gateway, the /etc/fstab file must be changed according to the UUID. But this should be doable by the script that also installs the deb files when booting the first time.

Steps A and B should be doable by a script the user has to run after download.
Step C can run automatically when the images are booted for the first time.

A lot of code to be written.

Most open source software offers a way to set a destination for the installation during the makefile creation process.

Whonix packages using genmkfile also support setting DESTDIR. However,
make install is only used during the package creation process.
Software by Whonix is installed through packages. These use the
default paths as per FHS. Packages install to / as usual. No package,
just using make install → no good upgrade path. Packages provide a
very good upgrade path. Having the packages install to /opt would be a
lot of work updating all the paths, with new bugs. Also, package installation
requires the dependencies to be installed already. So Whonix packages
could go to the same directory for initial installation. (To resolve
dependencies and to install in the right order, a local apt repository would
be required.)

In theory it's all doable, but super complex and error prone. I would
veto such changes and suggest forking Whonix instead if this is desired.

I also don't think Whonix needs to invent something new as complex as
this here. Whonix isn't an outsider in using VM images. These are very
popular in data centers, which need to back up VM images and transfer them. So
someone must have sorted out deduplication and compression of VM images
in a generic way already.

These two blog posts indicate that just switching to another compression
algorithm could do the trick.

http://www.doublecloud.org/2012/06/best-tool-to-compress-virtual-machines/

If that does not suffice, we could also research huge compression
dictionary sizes and/or preprocessors that remove duplication.

1 Like