The other deduplication tool, “rep” from FreeArc, apparently produces deterministic results. It is not as powerful as srep, but the results are still significant and it is fast.
It is available as a separate executable, which could be run after tar like I did with srep above.
I should have found this earlier: srep is deterministic too if you change the hash to one of these options: -hash=md5 -hash=sha1 -hash=sha512
or disable it with -hash-. Is there any benefit to using hashes? Leaving them out will make it faster.
I think the road is clear to using this in Whonix Installer.
@Patrick, could you try srep for that ticket too? You may need to play with another option in case it doesn’t handle the file efficiently:
Default settings (-l512) allow to process files that are 10x larger than RAM size. Memory requirements are proportional to 1/L, so by increasing -l option value it’s possible to process even larger files. For example, with -l64k RAM usage will be about 1/1000 of filesize.
If you have more than 10 GB RAM it may not be necessary, otherwise try changing this option.
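To put rough numbers on that rule of thumb, here is a small Python sketch. It only encodes what the quoted documentation says (RAM proportional to 1/L, and about filesize/10 at -l512); the function name and the 100 GiB example size are made up for illustration, and real srep usage may differ.

```python
def estimate_srep_ram(file_size_bytes, l_bytes=512):
    """Rough RAM estimate for srep, assuming usage scales as 1/L.

    Anchored to the quoted rule of thumb: at -l512 a file about 10x
    larger than RAM can be processed, so RAM is roughly filesize / 10.
    """
    baseline_l = 512  # the default -l value named in the quote
    return file_size_bytes * baseline_l / (10 * l_bytes)


GIB = 1024 ** 3

if __name__ == "__main__":
    # Hypothetical 100 GiB input, three -l settings.
    for l_value in (512, 4096, 64 * 1024):
        ram = estimate_srep_ram(100 * GIB, l_value)
        print(f"-l{l_value}: roughly {ram / GIB:.2f} GiB of RAM")
```

With -l64k this gives a ratio of 1/1280, which matches the "about 1/1000 of filesize" figure in the quote.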
This link includes both 32-bit and 64-bit executables for Linux and Windows, along with the full source code necessary to build it.
Whonix KVM:
Btw there is one more requirement for these compression tools.
Usability. These compression tools should be installable from Debian
(and Fedora) repository. Otherwise we would have to demand from users to
download, verify and install software from the internet, which can go
wrong and makes the instructions less usable and more lengthy.
Thank you very much. That’s exceptional. A tarball, though, doesn’t hash in a consistent manner in its standard implementation, as far as I know, since it includes timestamps and other aspects dependent on the machine the archive was created on.
I’ll thus have to look into a Windows compatible solution to remove those first.
Are you thinking of using some command line tool(s) to fix the file attributes? At least I know that creating a tar using these commands creates the same file each time on my machine
If we have the same timestamps and filenames and create the tar using the same commands, 7za should create identical files: I found out that it doesn’t add a user name or IDs by default and uses zeros instead. But to use other options such as --mtime we need GNU tar itself, for example via Linux in a VM or Cygwin/MinGW. Most tar.exe binaries online are old and don’t support --mtime. Git for Windows includes the latest version of tar.exe, so it can be grabbed from there. You may also build it yourself.
We may also override the file mode with the --mode= option to ensure it will be the same. For me, 7za creates tars with mode 1777 while tar.exe creates them with 644.
For really deterministic tars, you should probably add --sort=name --owner=0 --group=0 --numeric-owner and use 00:00:00Z instead of 00:00:00 in mtime to specify the UTC timezone.
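For anyone who wants to try the same idea without GNU tar at hand, below is a rough Python sketch using the standard tarfile module. It is not the tar command line discussed here, just an approximation of the same normalizations: sorted names, a fixed UTC mtime, uid/gid 0 with no names, and a fixed mode. The helper names and the 2000-01-01 timestamp are my own choices for the example.

```python
import io
import os
import tarfile
import tempfile
from pathlib import Path

FIXED_MTIME = 946684800  # 2000-01-01 00:00:00 UTC, as seconds since the epoch


def _normalize(info):
    """Strip machine-dependent metadata from one archive member."""
    info.mtime = FIXED_MTIME
    info.uid = info.gid = 0          # comparable to --owner=0 --group=0
    info.uname = info.gname = ""     # no names stored, like --numeric-owner
    executable = info.isdir() or bool(info.mode & 0o100)
    info.mode = 0o755 if executable else 0o644
    return info


def deterministic_tar(src_dir):
    """Return tar bytes for src_dir that depend only on names and contents."""
    src = Path(src_dir)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(src.rglob("*")):  # comparable to --sort=name
            tar.add(path, arcname=path.relative_to(src).as_posix(),
                    recursive=False, filter=_normalize)
    return buf.getvalue()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.txt").write_text("hello")
        first = deterministic_tar(tmp)
        os.utime(Path(tmp) / "a.txt", (1, 1))  # change the on-disk timestamp
        print(first == deterministic_tar(tmp))
```

The demo at the bottom changes a file's timestamp between two runs and compares the resulting bytes.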
Let’s now compare whether 7za creates deterministic tars or not.
Use the stable Whonix OVAs, renamed to gateway.ova and workstation.ova. Set their timestamps to 2000-01-01 00:00:00 UTC. I used touch.exe from the “Git for Windows” package:
For really deterministic tars, you should probably add --sort=name --owner=0 --group=0 --numeric-owner and use 00:00:00Z instead of 00:00:00 in mtime to specify the UTC timezone.
This is related to recent work on using cowbuilder to build Whonix
packages as well as making orig.tar.xz / debian.tar.xz archive
generation deterministic. ( https://phabricator.whonix.org/T52 )
that mtime command ‘--mtime=2015-10-21 00:00Z’ does not work for me.
Do you mean it works but you see 02:00 instead? Maybe your system time zone was not set to UTC. In that case it should only be a cosmetic issue, I think.
--owner=root --group=root
This is apparently different from using --owner=0 --group=0, as it adds some IDs to the file; with the latter the IDs are saved as zero, which is cleaner to me. You can see that by creating a small tar file and opening it in Notepad.
that mtime command ‘--mtime=2015-10-21 00:00Z’ does not work for me.
Do you mean it works but you see 02:00 instead?
Yes.
Maybe your system time zone was not set to UTC. In that case it should only be a cosmetic issue, I think.
Right. Set time zone to UTC during genmkfile now.
--owner=root --group=root
This is apparently different from using --owner=0 --group=0, as it adds user/group names and IDs to the file as root; with the latter they are not added and the IDs are saved as zero, which is cleaner to me. You can see that by creating a small tar file and opening it in Notepad.
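To illustrate why seeing 02:00 is only cosmetic, here is a small Python sketch. It assumes the viewer's zone is UTC+2, which is my guess from the 02:00 reading; the function names are made up. The stored instant is the same, only the rendering changes.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MMZ' string as a UTC instant."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%MZ")
    return naive.replace(tzinfo=timezone.utc)


def shown_in(instant, offset_hours):
    """Render the same instant as a listing would in a fixed-offset zone."""
    zone = timezone(timedelta(hours=offset_hours))
    return instant.astimezone(zone).strftime("%Y-%m-%d %H:%M")


if __name__ == "__main__":
    instant = parse_utc("2015-10-21 00:00Z")
    print(int(instant.timestamp()))  # the value an archive would store
    print(shown_in(instant, 0))      # viewed with TZ=UTC
    print(shown_in(instant, 2))      # viewed in an assumed UTC+2 zone
```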
Using root/root and --numeric-owner is a safe bet, as it will effectively record 0 as values:
GNU ar and other tools from binutils have a deterministic
mode which will use zero for UIDs, GIDs, timestamps, and use consistent file modes for all files.
Even they say they are trying to achieve zeros, but somehow when I use root it doesn’t record zeros the way it does when I use 0. Maybe I’m missing something, or maybe they are. They also didn’t provide any recommendation for setting file modes on that page other than mentioning binutils.
As a side note, when I use --owner=0 --group=0, no names are added to the archive and the IDs are filled with zeros. With --owner=root --group=root, the IDs are definitely not filled with zeros on my machine and the name “root” is added twice. --numeric-owner removes the names but doesn’t change the IDs, and it doesn’t (need to) do anything if 0s are used instead.
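Instead of opening the tar in a text editor, the owner fields can be read straight out of the 512-byte header. The sketch below uses Python's tarfile module to build a ustar header and slices the uid, gid, uname and gname fields at their standard offsets. It does not reproduce what tar.exe does on Windows; the nonzero id 1000 is a made-up stand-in for whatever "root" happens to resolve to on a given machine.

```python
import tarfile


def header_fields(uid, gid, uname, gname):
    """Build a ustar header and return its raw owner-related fields."""
    info = tarfile.TarInfo("example.txt")
    info.uid, info.gid = uid, gid
    info.uname, info.gname = uname, gname
    block = info.tobuf(format=tarfile.USTAR_FORMAT)
    return {
        "uid": block[108:116],                    # octal digits + NUL
        "gid": block[116:124],
        "uname": block[265:297].rstrip(b"\0"),    # empty when no name stored
        "gname": block[297:329].rstrip(b"\0"),
    }


if __name__ == "__main__":
    # ids of zero, no names: what --owner=0 --group=0 is reported to give
    print(header_fields(0, 0, "", ""))
    # names present and a hypothetical nonzero id for comparison
    print(header_fields(1000, 1000, "root", "root"))
```

Comparing the two printed dictionaries shows the difference described above: in the first all id digits are zero and the name fields are empty, in the second the name appears twice and the ids are nonzero.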
This is from strip-nondeterminism source code for ar archives:
So it is not really a safe bet, or reasonable, to use “root” only to end up at “0”. That assumes root is 0, which may not be the case, especially in my case of using tar.exe on Windows. I just tried tar in Whonix and both commands produced the same output, but on Windows that is not the case, so the safest bet is to use “--owner=0 --group=0 --numeric-owner” to keep determinism across operating systems. You may want to report this “upstream”.
Turns out exporting TZ as UTC may not be required, but it looks safe and sane to do and will also prevent some confusion, so it is probably good to keep.
I don’t know. My approach is rather basic: I am following authoritative arguments here. I chose the Debian Reproducible Builds team as the experienced experts on the topic and follow their recommendations as long as they seem sensible. This was introduced here:
commit 0fe840b4dd3c82b88a2d62550de94d11c3f5731d
Author: Patrick Schleizer <adrelanos@riseup.net>
Date: Thu Jan 19 09:40:35 2017 +0000
add --mode=go=rX,u+rw,a-s to tar to avoid non-determinism
as suggested by https://wiki.debian.org/ReproducibleBuilds/VaryingPermissionsInTarballs
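As a rough illustration of what the mode string go=rX,u+rw,a-s is meant to achieve, here is a simplified Python sketch. It is my own approximation, not GNU tar's implementation: it works only on the nine permission bits and simply drops all special bits (setuid, setgid and, as a simplifying assumption, sticky too), which may differ from tar in corner cases.

```python
def normalize_mode(mode, is_dir):
    """Approximate go=rX,u+rw,a-s on a numeric permission value."""
    # u+rw: keep the owner's bits and make sure read and write are set
    user = (mode & 0o700) | 0o600
    # go=rX: group and others get read, plus execute only if the entry
    # is a directory or already had an execute bit for anyone
    group_other = 0o044
    if is_dir or mode & 0o111:
        group_other |= 0o011
    # a-s: setuid/setgid are not carried over (sticky is dropped here too)
    return user | group_other


if __name__ == "__main__":
    for mode, is_dir in [(0o600, False), (0o700, False), (0o1777, True)]:
        print(oct(mode), "->", oct(normalize_mode(mode, is_dir)))
```

The point is that files differing only in umask-dependent bits end up with the same recorded mode.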
Then the strategy is to keep testing it. Should issues arise (non-determinism reported), I’d investigate further. As for Whonix 14, Whonix deb reproducibility was only on a best effort basis. ( https://phabricator.whonix.org/T52 )
More progress is scheduled during development of Whonix 15. ( T615: use Reproducible Builds Experimental Toolchain by Debian )
(Or earlier if someone contributes.)
Having said that… You seem to be knowledgeable on the topic.
Please consider re-posting that question on the Debian reproducible builds mailing list.
( Reproducible-builds Info Page )
That would quite likely lead to a more educated answer to your question, and it would also be a great service to Whonix.
Your root vs 0 argument seems solid. Could you report it on the reproducible builds mailing list please?