sys-whonix 16 inconsistent connectivity issues (Q4.0 AND Q4.1)

First off, thanks for Whonix, and in particular Qubes-Whonix! I, and many others I work with, appreciate and depend on these projects on a daily basis.

I’m having an issue with Qubes-Whonix 16 on two machines:

  1. Machine 1 - Qubes 4.0, ran whonix-15 for the entirety of the 15 release’s lifetime. sys-whonix 15 worked wonderfully that whole time, on a number of different internet connections. The move to whonix-16 (a fresh install) was seemingly seamless: no customization, no bridges, defaults.

  2. Machine 2 - Qubes 4.1rc2, sys-whonix-16, defaults, no bridges.

On both machines, with whonix-15, as well as with the basic Tor Browser Bundle run in a VM through sys-firewall (clearnet), an attempt to make a Tor connection succeeds 99% of the time and works great: high speed, reliable.

On both machines, now with sys-whonix 16, one of two things happens when first enabling Tor: it either hangs at 30% or 45% (on the vast majority of attempts), or it succeeds (maybe 5% of the time). After a first successful connection, on re-connection or startup, Tor reaches 95% and then just hangs. Sometimes it completes and Tor status reports connected (100%), but connections through Tor are very unreliable (though they do sometimes happen). About 5% of the time I can connect, but Tor works for a few minutes and then dies. Restarting Tor works maybe 5% of the time, and a successful connection lasts only minutes.

Things I’ve tried:

  • multiple ISPs (cable, wireless, high-speed and reliable institutional, etc.); all show the problem.
  • enabled ICMP in sys-whonix via the ICMP fix.
  • disabled boot clock randomization (and verified it).
  • disabled IPv6 for sys-whonix, sys-firewall, and on my ISP/router, forcing IPv4.
  • clock verification (sys-whonix, sys-net, and the Xen host all agree; see the sketch after this list).
  • fresh re-installs (both of sys-whonix-16 on my Qubes 4.0 machine, and of Qubes 4.1 itself, using defaults).
  • verified clearnet connectivity works.
  • ran TBB and sys-whonix-15 in VMs that connect through sys-firewall (both of these other Tor methods work great 99% of the time, with high speed and reliability).
  • whonix templates are up to date (from connections I was lucky enough to make).
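
For reference, here is roughly how I compared the clocks, from dom0 (my sketch; assumes the standard qvm-run tool and my qube names):

date -u                           # dom0 / Xen host clock
qvm-run -p sys-whonix 'date -u'   # gateway clock
qvm-run -p sys-net 'date -u'      # net qube clock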

I have been running Qubes for many years, used TorVMs since before Qubes-Whonix was around, and can handle a fair bit of networking, but I am a bit puzzled about how to diagnose this. Nothing in any of the logs pops out as obvious. I have seen hints in the Qubes forums, and here, that others are experiencing a similar (or the same) issue with Qubes-Whonix 16, but perhaps they have not been able to capture it either.

I’m at a loss as to the best next diagnostic step, given this is an inconsistent issue. That said, it’s bad enough that it has been catastrophic for my workflows over the past couple of weeks. I have been running almost all of my daily traffic and work through qubes-sys-whonix since its inception (thanks again!).

Does anyone have any ideas, things to try next?

That’s going to be difficult, because:

There are a few more items here to check:

  • disable boot clock randomization (and verified it).
  • disable IPv6 for sys-whonix, sys-firewall, and with my ISP/router, forcing ipv4.

These would have been my top two guesses. Although I see you verified time was not an issue, you could try disabling the sdwdate service entirely in both gateway & workstation. Also try disabling IPv6 on the workstation/disp template as well.
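
Something like this (untested sketch; the sysctl change does not survive a reboot):

sudo systemctl stop sdwdate
sudo systemctl disable sdwdate
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1   # repeat per template/VM as needed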


@awokd Thanks for the ideas. sudo systemctl stop sdwdate (and disable) had no impact, and sudo anondate-get was ok. IPv6 elsewhere can’t have been the cause (and was not): sys-whonix itself has issues connecting to Tor, even when it seems to connect.

@Patrick If the standard TBB and whonix-15 always work flawlessly on the same machine, then it is really unlikely to be a network obstacle, right? I have tested numerous ISPs, none of which should be blocking Tor in my area of the world (and indeed none seem to be, since the other Tor options connect and operate fine without bridges). I’ve tried most of the diagnostics at the links you provided. anondate-get looks ok.

Upon first start (just after install), or if I clear the consensus via anon-consensus-del, it often hangs with:
sudo anon-log

NOTICE[Wed Dec 22 11:23:20 2021]: Vanguards 0.3.1 connected to Tor 0.4.6.8 using stem 1.8.0
NOTICE[Wed Dec 22 11:23:20 2021]: Tor needs descriptors: Cannot read /var/lib/tor/cached-microdesc-consensus: [Errno 2] No such file or directory: '/var/lib/tor/cached-microdesc-consensus'. Trying again...

which is not a permissions issue; the file simply does not exist until Tor has bootstrapped.
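
To reproduce from a clean state, I clear the cached consensus and re-check the logs (both tools ship with Whonix, as used above):

sudo anon-consensus-del   # delete the cached Tor consensus
sudo anon-log             # then check bootstrap progress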

But anon-log and variants of systemcheck show varied issues, depending on the stage at which I get lucky. After bootstrapping for the first time (on some lucky restart that makes it past 30%), Tor startup often gets stuck at 95%, and even when seemingly connected, Tor will not function at all.

I suspect either the new vanguards setup, or something like a packet-size or fragmentation parameter that is slightly off in gw-16 but not gw-15 and causes a lower-level issue with sys-whonix-16. Indeed, I saw an error about a programmed constant, the HS descriptor size, causing connections to be killed; I think this is the hidden service descriptor limit of 30 (CIRC_MAX_HSDESC_KILOBYTES) referenced here (Vanguards - Tor Anonymity Improvement), with relevant logs below. I’m not sure where in Qubes would be best to wireshark the system, but I guess sys-whonix is as good as anywhere?
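
For instance, something like this inside sys-whonix should surface ICMP errors and IPv4 fragments (my sketch; assumes tcpdump is installed and that eth0 faces the upstream NetVM):

sudo tcpdump -n -i eth0 'icmp or (ip[6:2] & 0x3fff != 0)'
# icmp: catches ICMP errors such as Fragmentation Needed
# ip[6:2] & 0x3fff != 0: matches packets with a fragment offset or the MF flag set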

Barring some weird interaction between components, this really appears to me to be a problem living in sys-whonix 16 itself, the Debian base, or the Tor install/setup within it.

It’s a bit ironic, but I may have to fall back on something like the old-school Qubes TorVM instead of Whonix.


Just one more diagnostic tidbit: If I boot a different hard drive on the same hardware and internet connections, VirtualBox-Whonix-16 and KVM-Whonix-16 work great.


When an existing connection fails (in this case, after an attempted connection from another non-whonix VM routed through sys-whonix):

...
Dec 22 19:43:33.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits
Dec 22 19:43:33.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Dec 22 19:43:36.000 [notice] Bootstrapped 100% (done): Done

vanguards.service:

Started Additional protections for Tor onion services.
NOTICE[Wed Dec 22 19:28:41 2021]: Creating new vanguard state file at: /var/lib/tor/vanguards.state
NOTICE[Wed Dec 22 19:28:41 2021]: Vanguards 0.3.1 connected to Tor 0.4.6.8 using stem 1.8.0
NOTICE[Wed Dec 22 19:38:52 2021]: Tor has been failing all circuits for 49 seconds!
NOTICE[Wed Dec 22 19:38:53 2021]: Circuit use resumed after 50 seconds.
WARNING[Wed Dec 22 19:40:47 2021]: Circ 434 exceeded CIRC_MAX_HSDESC_KILOBYTES: 35121 > 30720.
NOTICE[Wed Dec 22 19:40:47 2021]: We force-closed circuit 434
NOTICE[Wed Dec 22 19:43:32 2021]: Tor daemon connection closed. Trying again...
NOTICE[Wed Dec 22 19:43:33 2021]: Vanguards 0.3.1 connected to Tor 0.4.6.8 using stem 1.8.0

I’m assuming these errors are not relevant:

user@host:~$ systemcheck --verbose --leak-tests
[INFO] [systemcheck] sys-whonix | Whonix-Gateway | whonix-gw-16 TemplateBased ProxyVM | Wed 22 Dec 2021 07:54:45 PM UTC
[INFO] [systemcheck] Check sudo Result: OK
[INFO] [systemcheck] Whonix build version: 3:8.1-1
[INFO] [systemcheck] whonix-gateway-packages-dependencies-cli: 22.1-1
[INFO] [systemcheck] derivative_major_release_version /etc/whonix_version: 16
[INFO] [systemcheck] Whonix Support Status of this Major Version: Ok.
[WARNING] [systemcheck] Hardened Malloc: Disabled.
[INFO] [systemcheck] Spectre Meltdown Test: skipping since spectre_meltdown_check=false, ok.
[INFO] [systemcheck] Package Manager Consistency Check Result: Output of command dpkg --audit was empty, ok.
[INFO] [systemcheck] systemd journal check Result:
warnings:
########################################

########################################

failed:
########################################
Dec 22 19:28:31 host systemd[1]: apparmor.service: Failed with result 'exit-code'.
Dec 22 19:28:31 host systemd[1]: Failed to start Load AppArmor profiles.
########################################

errors:
########################################
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 00, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 01, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 03, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 04, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 05, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 06, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: ACPI Error: No handler or method for GPE 07, disabling event (20190816/evgpe-841)
Dec 22 19:28:31 host kernel: Error: Driver 'pcspkr' is already registered, aborting...

sudo anon-log

...
Dec 22 20:06:16.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits
Dec 22 20:06:16.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Dec 22 20:06:21.000 [notice] Bootstrapped 100% (done): Done
Dec 22 20:07:18.000 [notice] No circuits are opened. Relaxed timeout for circuit 3 (a General-purpose client 3-hop circuit in state doing handshakes with channel state open) to 60000ms. However, it appears the circuit has timed out anyway.
Dec 22 20:09:26.000 [warn] Guard Unnamed ($CC100DC017A25E244C35C964BB518E4308415ACC) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 59/171. Use counts are 0/1. 163 circuits completed, 1 were unusable, 103 collapsed, and 145 timed out. For reference, your timeout cutoff is 60 seconds.
Dec 22 20:09:29.000 [warn] Guard defconorg ($B956A82A9559D482E1ACFEABD898FDC3F2991005) is failing an extremely large amount of circuits. This could indicate a route manipulation attack, extreme network overload, or a bug. Success counts are 43/151. Use counts are 0/0. 143 circuits completed, 0 were unusable, 100 collapsed, and 168 timed out. For reference, your timeout cutoff is 60 seconds.
Dec 22 20:09:41.000 [warn] Guard defconorg ($B956A82A9559D482E1ACFEABD898FDC3F2991005) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 48/156. Use counts are 0/0. 148 circuits completed, 0 were unusable, 100 collapsed, and 169 timed out. For reference, your timeout cutoff is 60 seconds.

vanguards.service:

Started Additional protections for Tor onion services.
NOTICE[Wed Dec 22 20:06:15 2021]: Creating new vanguard state file at: /var/lib/tor/vanguards.state
NOTICE[Wed Dec 22 20:06:15 2021]: Vanguards 0.3.1 connected to Tor 0.4.6.8 using stem 1.8.0

or another instance:

Dec 22 20:12:41.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits
Dec 22 20:12:41.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Dec 22 20:12:47.000 [notice] Bootstrapped 100% (done): Done
Dec 22 20:13:44.000 [warn] Guard Unnamed ($CC100DC017A25E244C35C964BB518E4308415ACC) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 86/208. Use counts are 0/1. 198 circuits completed, 1 were unusable, 111 collapsed, and 173 timed out. For reference, your timeout cutoff is 60 seconds.
Dec 22 20:13:49.000 [warn] Guard defconorg ($B956A82A9559D482E1ACFEABD898FDC3F2991005) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 107/245. Use counts are 0/0. 232 circuits completed, 0 were unusable, 125 collapsed, and 215 timed out. For reference, your timeout cutoff is 60 seconds.
Dec 22 20:14:16.000 [notice] Guard LittleSister ($E08F2FD44CC16B7015138DC95507A660BEB8851D) is failing more circuits than usual. Most likely this means the Tor network is overloaded. Success counts are 88/151. Use counts are 5/5. 149 circuits completed, 0 were unusable, 61 collapsed, and 114 timed out. For reference, your timeout cutoff is 60 seconds.

vanguards.service:

Started Additional protections for Tor onion services.
NOTICE[Wed Dec 22 20:12:41 2021]: Vanguards 0.3.1 connected to Tor 0.4.6.8 using stem 1.8.0
NOTICE[Wed Dec 22 20:13:42 2021]: Tor has been failing all circuits for 30 seconds!
NOTICE[Wed Dec 22 20:13:43 2021]: Circuit use resumed after 31 seconds.
WARNING[Wed Dec 22 20:14:25 2021]: Circ 289 exceeded CIRC_MAX_HSDESC_KILOBYTES: 35121 > 30720.
NOTICE[Wed Dec 22 20:14:25 2021]: We force-closed circuit 289

All while I can easily connect via other setups: plain TBB, VirtualBox-Whonix, etc.


Not many ideas right now. Maybe later.

But it is certainly worthwhile to try:
Tor Generic Bug Reproduction

That makes this issue even weirder. For better maintainability, Whonix keeps the differences between the different virtualizers as minimal as possible.

For lack of better words, I’d put it this way: “Whonix does rather simple things.” I wouldn’t know off the top of my head how to even cause such a bug on purpose if I wanted to. It would require rather complicated research and implementation to cause an issue that is this hard to reproduce reliably in the same manner.


There was recently a Qubes-specific bug that broke connectivity in weird ways, with Qubes-Whonix among the affected components.

Therefore the only way forward I can see for now is Tor Generic Bug Reproduction.

Interesting, thanks. I suspect an issue at the lower network layers. Like that report, in the past I have tunneled a VPN through Tor (appvm → sys-vpn → sys-whonix → sys-firewall → sys-net) to hide the exit node from visited services that block Tor, but I have not tried it lately. Quite interestingly, I just tried tunneling Tor through my OpenVPN (appvm → sys-whonix → sys-vpn → sys-firewall → sys-net), via qvm-prefs sys-whonix netvm sys-vpn, and it seems to work just fine on 4.0 with whonix-16, masking this issue so that sys-whonix functions reliably. I’ve seen fragmentation and layer-2 issues before where packet-size parameters caused timeouts like this. I have a vague suspicion that the local network is selectively timing out or killing packets from sys-whonix, but I have no idea why. My LAN is not blocking any outgoing ports, and FascistFirewall does not help. I’ll check out the generic bug reproduction in more detail.
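
For reference, the reroute (and revert) from dom0 is just (sys-vpn is my qube name):

qvm-prefs sys-whonix netvm sys-vpn        # tunnel Tor through the VPN qube
qvm-prefs sys-whonix netvm sys-firewall   # back to the default path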


To do “Tor Generic Bug Reproduction”, I cloned an untouched debian-11 template, then set up networking:

debian-11-tor-test → sys-firewall → sys-net

sudo apt update
sudo apt full-upgrade
sudo apt install --no-install-recommends tor
sudo apt install --no-install-recommends vanguards
sudo vi /etc/tor/vanguards.conf # and edit control_socket = /run/tor/control 

The Tor process works great: Firefox SOCKSes through Tor, and torcheck says congratulations, you’re using Tor. All seems to work well.
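
For a quicker check without a browser (assumes curl is installed and Tor’s default SocksPort 9050):

curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/api/ip
# should return something like {"IsTor":true,...} when traffic exits through Tor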


I wiresharked sys-whonix, noticed lots of ICMP and fragmentation issues (as I suspected), and found something of a fix.

SOLUTION: While allowing ICMP alone did not fix the issue, the originally proposed solution to the ICMP problem did: https://github.com/Whonix/whonix-firewall/pull/7/files

In sys-whonix, do:
sudo vi /usr/bin/whonix-gateway-firewall

-   $iptables_cmd -A INPUT -m state --state ESTABLISHED -j ACCEPT
+   $iptables_cmd -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

followed by:
sudo whonix_firewall
and it fixes the issue entirely.

I see the discussion about this being a bit permissive (Have firewall accept ICMP Fragmentation Needed), but clearly accepting ICMP alone was not enough to keep the LAN from killing the connections and timing them all out.
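
If RELATED for all protocols is too permissive, a narrower variant (my untested sketch, not the patch from the pull request) would be to accept only RELATED ICMP, which should still cover Fragmentation Needed:

$iptables_cmd -A INPUT -p icmp -m state --state RELATED -j ACCEPT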


I’ve noticed iptables firewall issues like this in the past, when the default final rule is DROP instead of REJECT. REJECT tends to fail far more gracefully than DROP and is, in my opinion, the better option. The privacy/security reasons for using DROP are exaggerated, in my understanding, though I could be missing something. In fact, just replacing DROP with REJECT in whonix-gateway-firewall fixes the issue, even without the overly permissive RELATED change above (a sketch of the change follows below). I think this addresses the security concerns mentioned above while still functioning. @Patrick Could this be an acceptable solution?
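
For illustration, the change would look roughly like this (paraphrased; not the exact final rule as written in whonix-gateway-firewall):

-   $iptables_cmd -A INPUT -j DROP
+   $iptables_cmd -A INPUT -j REJECT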

In retrospect, I notice that this iptables firewall issue caused moderate interference on some networks, severe interference on others, and only minor interference on the rest (though it always existed). I would not have noticed on some of them, except that I have been routing all my traffic through Tor for many years and know how it should behave on the networks I regularly use. I have seen other unresolved forum posts that may be explained by this issue as well. In conclusion, while not every network or install will see this “bug”, I’d guess I’m not the only one with spotty performance due to this firewall configuration.


Also, thanks for taking the time to shepherd me through the troubleshooting!


Great analysis!

  1. This means there’s a Qubes-specific bug here requiring RELATED, since you already confirmed the issue does not happen with Whonix VirtualBox and Whonix KVM on the same hardware. Could you please summarize all your findings and report them on qubes-issues?

  2. RELATED would be useful as a separate line behind an opt-in configuration option (a hypothetical sketch follows below):

$iptables_cmd -A INPUT -m state --state RELATED -j ACCEPT

  3. Should be added to Tor Generic Bug Reproduction.

Help, patches welcome!
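
A hypothetical sketch of such an opt-in (the variable name is invented here; it assumes the existing /etc/whonix_firewall.d drop-in mechanism):

# /etc/whonix_firewall.d/50_user.conf
GATEWAY_ALLOW_INPUT_RELATED=1

# /usr/bin/whonix-gateway-firewall
if [ "$GATEWAY_ALLOW_INPUT_RELATED" = "1" ]; then
   $iptables_cmd -A INPUT -m state --state RELATED -j ACCEPT
fi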


I had an odd situation with a brand-new laptop running Qubes 4.1.2. On my WiFi network, with all access points in the same network segment, sys-whonix could reliably connect only through one particular hardware access point. Through the other access points the connection would almost never complete, and when it did, anon-whonix couldn’t connect to anything via Tor.

Tails connected to Tor with no problem using the same laptop and access points.

After finding this thread and adding

$iptables_cmd -A INPUT -m state --state RELATED -j ACCEPT

to /usr/bin/whonix-gateway-firewall, the laptop now connects to Tor quickly and reliably, regardless of the access point used.
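
As earlier in the thread, I applied the change without restarting the qube by re-running the firewall script:

sudo whonix_firewall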

Hardware that worked:

  • Buffalo (old)

Hardware that had difficulty included:

  • UniFi (new)
  • TP-Link (new)