Intermittent 503s reported by Varnish

mig5 · July 29, 2018, 3:57am

Working on a theory that it’s keepalive related and maybe some sort of ‘new’ thing since our OS version upgrade. Looking at experimenting with keepalive settings between nginx, varnish and apache soon, or possibly removing varnish altogether and using nginx cache instead. Thanks

mig5 · July 29, 2018, 4:18am

Splitting into a new thread.

I’ve disabled KeepAlive in Apache, working on a hunch. May or may not impact performance though.

Will monitor for any further 503s, and if there are any, I need to catch one ‘in the act’ with varnishlog to see the actual error reported.
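For anyone wanting to do the same, here is a rough sketch of watching for 503s with varnishlog, wrapped in Python. It assumes Varnish 4 or later (where `-q` accepts a VSL query and `RespStatus` is a valid tag) and that `varnishlog` is on the PATH. The function names are made up for illustration.

```python
import subprocess


def build_varnishlog_cmd(status=503):
    """Build a varnishlog command line that shows only one response status.

    -g request groups each client request with its backend fetches, so the
    backend-side error is shown next to the 503 sent to the client.
    -q filters transactions with a VSL query.
    """
    return ["varnishlog", "-g", "request", "-q", f"RespStatus == {status}"]


def watch(status=503):
    """Stream matching transactions until interrupted (Ctrl-C).

    Assumes varnishlog is installed and the caller may read Varnish's
    shared memory log (usually root or the varnish group).
    """
    subprocess.run(build_varnishlog_cmd(status), check=False)
```

Calling `watch()` on the Varnish host should print the full transaction for each 503 as it happens, which is usually enough to see whether the backend refused, timed out or closed the connection.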

mig5 · July 30, 2018, 3:59am

It wasn’t KeepAlive, but I am pretty sure it was related to PHP opcache (which indeed was a ‘new’ thing as of a few weeks ago as part of a big upgrade).

Since I adjusted an opcache setting about 6 or 7 hours ago, there’ve been no more random 503s. Hopefully I’ve caught it…

Please let me know in this thread if you encounter the instant 503 ‘guru meditation’ Varnish error again.

mig5 · July 30, 2018, 5:11am

Nope, it was not opcache, though it was an important thing to fix anyway.

Just saw it happen again. This time I counted the Apache processes and there were exactly 150, which is the MaxRequestWorkers limit. I normally expect to see this limit mentioned in the log, which is why I didn’t think it was this issue, but given the ‘coincidence’, I think it may be that. Have adjusted that value up. Will see how it goes…
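A minimal sketch of that check, for reference. It assumes the worker processes are named `apache2` (as on Debian; other distros use `httpd`) and that each worker is a separate process, as with the prefork MPM. The limit of 150 is the value mentioned above; the function names are hypothetical.

```python
import subprocess


def count_processes(ps_output, name="apache2"):
    """Count lines of `ps -eo comm=` output that match the process name."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def at_worker_limit(ps_output, limit=150, name="apache2"):
    """Return True if the process count has reached the configured limit.

    The Apache parent process is counted too, so treat this as approximate.
    """
    return count_processes(ps_output, name) >= limit


def current_ps_output():
    """Return one command name per line for all running processes."""
    result = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `at_worker_limit(current_ps_output())` from cron or a loop while the 503s are happening would confirm whether the pool is saturated at that moment.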

Pretty sure it’s Phabricator (as usual) causing some sort of huge spike in traffic. Whenever it occurs, there’s a huge spate of 503s on the Phabricator .onion, all or mostly on /diffusion URLs, e.g. http://phabricator.dds6qkxpwdeubwucdiaord2xgbbeyds25rbsgr73tbfpqpt4a6vjwsyd.onion/diffusion/WHONIX/history/pidgin/whonix_shared/usr/lib/timesync/30_run-sdwdate;12.0.0.0.7-developers-only
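A rough way to confirm that kind of pattern is to tally the 503s by the first path segment of the request. The sketch below assumes the access log is in the common/combined format (request line in double quotes, followed by the status code). The actual log format on this server is not stated in the thread, so the regex may need adjusting.

```python
import re
from collections import Counter

# Matches: "GET /some/path HTTP/1.1" 503 ...
# Assumes the common/combined access log format.
REQUEST_RE = re.compile(r'"[A-Z]+ (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')


def count_status_by_prefix(lines, status="503"):
    """Count responses with the given status, keyed by first path segment."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None or match.group("status") != status:
            continue
        # Drop any query string, then keep only the first path segment.
        path = match.group("path").split("?", 1)[0]
        first_segment = path.strip("/").split("/")[0]
        counts["/" + first_segment] += 1
    return counts
```

Feeding it the log file, for example `count_status_by_prefix(open(path))` with `path` pointing at the access log, and then calling `.most_common(5)` on the result shows at a glance whether /diffusion dominates.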

mig5 · August 13, 2018, 6:15am

We’ve made some adjustments that will technically remove all ‘Guru meditation’ errors, but may still cause some form of timeout when we get a big rush of traffic particularly to the wiki. However, we’re hoping the performance adjustments prevent those stampedes either way.

One other cause of the 503s turned out to be intermittent hardware issues, which are also being addressed but obviously can’t be fixed with software optimisations. So, fighting a multi-headed snake right now.

Closely monitoring since today’s changes.

torjunkie · August 19, 2018, 3:40am

Varnish stuff has disappeared. So that’s good news.

Now the bad news. It has been replaced by 502 bad gateway (nginx) errors instead here and there.

An immediate refresh loads successfully. It’s real whack-a-mole with servers, I gather.

I recently had 2-3 of these errors in a row, but it seems less common than the Varnish issue. Will report if it keeps popping up.

Patrick · August 19, 2018, 1:20pm

Same reason most likely.

Need to wait for the new server.

mig5 · August 19, 2018, 10:47pm

Yeah, the latest 502s coincide with the hardware CPU lockups, which for some reason seem to happen at roughly the same time of day (3PM-ish GMT+2). That’s perhaps some indication of a heavy task, but I can’t find it.
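One way to check the time-of-day pattern is to bucket the 502s by hour. This is only a sketch: it assumes nginx is writing the default combined log format, with timestamps like `[19/Aug/2018:15:02:11 +0200]`, and the function name is made up.

```python
import re
from collections import Counter
from datetime import datetime

# Matches: [19/Aug/2018:15:02:11 +0200] "GET / HTTP/1.1" 502 ...
# Assumes the combined access log format.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def errors_by_hour(lines, status="502"):
    """Count responses with the given status per hour of day (log-local time)."""
    hours = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match.group("status") != status:
            continue
        # %b expects English month abbreviations (the default C locale).
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hours[ts.hour] += 1
    return hours
```

If most of the counts land in one or two hours, that would back up the idea of something periodic (a cron job, backup or crawler) rather than random load.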

Also frequently coinciding with the 502s is a high rate of ‘crawling’ of the wiki via the v2 onion address by some unknown perpetrator. I’m half-inclined to disable the v2 onion, as they hit a lot of the Mediawiki ‘Special’ pages, which might create big queries (revision logs etc). My work at optimising Mediawiki with memcache hasn’t been quite enough, and I’m out of ideas.

We’ll see how we go with a new server…

torjunkie · August 21, 2018, 10:10am

Gee… I wonder who that could be…