
Offline Documentation Discussion


#48

How could I run it on the server? Install a browser on the server and remote VNC (over ssh etc…) to it?

A few posts back you and fortrasse were testing the pdf plugin on the mediawiki server if I’m not mistaken.


#49

You are mistaken. Installed on the server, yes. But tested just as normal through whonix.org web download.


#50

The print PDF button gives an identical file for the same version of a current page. The next step would be to test if an offline wiki with the PDF plugin gives a file with the same hash for the same page. That would be as close to reproducible as we can get.

So for a given version of the Whonix manual, a user would download a snapshot of the wiki tarball (from when it was used to generate the official PDF), set up an offline MediaWiki, generate a PDF, and check whether the file matches the (signed) official one provided.

At the moment Whonix and TAILS are at the same level on this. They just use ikiwiki instead, and commits are made via git pull requests instead of the easier MediaWiki editing, but they don’t have an easy-to-read file that’s reproducible for offline use.
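The comparison step could be sketched roughly as below. This is only an illustration: the function names and file paths are made up for the example, and verifying the signature on the official PDF is a separate step that is not shown.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pdfs_match(official_pdf, local_pdf):
    """True if the locally generated PDF is byte-identical to the official one."""
    return sha256_of_file(official_pdf) == sha256_of_file(local_pdf)
```

A user would call something like `pdfs_match("official.pdf", "local.pdf")` (hypothetical file names) after generating the PDF from the offline wiki.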

If you think this is too much work forget it.


#51

Subgraph handbook uses:

pandoc / po4a. Text in markdown, build produces multiple formats, translations, etc. Check out the Makefile in the repo.


#52

The PdfBook extension (already installed) supports creating a PDF of all pages in a category that includes documentation. So this can be considered solved.

PdfBook help docs:

https://www.mediawiki.org/wiki/Extension:PdfBook#Usage

Example I followed:

http://www.foo.bar/wiki/index.php?title=Category:foo&action=pdfbook

It works and gives a 16.6MB PDF. The result is OK but needs some cleanup. Translated pages are sometimes included as broken gibberish characters. I would suggest moving each set of translated pages under its own superset category, for example Documentation-DE and so on.

https://www.whonix.org/w/index.php?title=Category:Documentation&action=pdfbook
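For anyone scripting this, the URL pattern above can be built like so. A minimal sketch: `pdfbook_url` is a name invented for this example, and it only covers the `title=Category:...&action=pdfbook` form shown in this post.

```python
from urllib.parse import urlencode


def pdfbook_url(index_php_url, category):
    """Build a PdfBook URL for all pages in the given category."""
    # safe=":" keeps the colon in "Category:Name" readable, as in the example URL.
    query = urlencode(
        {"title": f"Category:{category}", "action": "pdfbook"}, safe=":"
    )
    return f"{index_php_url}?{query}"


print(pdfbook_url("https://www.whonix.org/w/index.php", "Documentation"))
```

The same helper would work for per-language categories such as Documentation-DE if they were created.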

Up to you if you want to convert the PDF into a simpler format for distribution in Whonix but I wouldn’t bother.


#53

It still needs to be documented. Otherwise no one can know about it.

Would be a break of the security model. I consider pdf a binary format. All binaries for now have been produced locally from source code on my developer hardware. (Except for binary Debian packages which are installed inside VM images as is from Debian repository.) Adding a binary file (the pdf) that I generated from the Whonix web server would break that model.

Also, the build script and instructions shouldn’t include “go to whonix.org webserver, click there”. To be professional, it all needs to be automated, scripted and built from source code.


#54

OK, let’s forget about shipping it with Whonix for now. Where can I add this as optional instructions on the wiki?


#55

It fits nowhere. New page Offline_Documentation?


#56

Done. I suggest linking it under the "Other Whonix Resources" section.


#57

Yes, please do.


#58

Instead of sacrificing ease of contribution or offline documentation quality I propose migrating to Dokuwiki.

Advantages: it supports most MediaWiki features while being much lighter, allows conversion from MediaWiki to its own format, and is databaseless, meaning it shouldn’t be a PITA to include in Whonix for users to use offline.

Disadvantage: it is included in every Debian version but stretch, though pinning can solve this. I don’t know if it’s a deal breaker, but a PHP-capable web server is required. In this case nginx, the most minimal, popular and actively developed solution, can do.


#59

Not an easy migration. A ton of work.

https://www.whonix.org/wiki/Dev/About_Infrastructure#Whonix_homepage_backend


#60

Hi all,

@Patrick had me look at offline documentation recently.

I’ve taken a stab at a new (but still, IMO, temporary/intermediate/imperfect) solution.

This method is as follows:

  1. Once a day, a server-side cronjob generates a collection of sitemap.xml files for the MediaWiki. The ‘parent’ sitemap file is at https://www.whonix.org/wiki/sitemap/sitemap-index-wiki.xml, and it contains a subset of links to other sitemaps - just the way MediaWiki likes to do things.

  2. The scrape-whonix-wiki.sh script (in the repo above) runs later in the day (still server-side), and uses the sitemap to ‘discover’ the URLs to all the wiki content.

It then scrapes those pages with a Python tool called ‘webpage2html’, does a whole heap of other ugly munging to fix most links, remove irrelevant parts of the page, add .html suffixes etc.

Then it commits the new version to the repo above. Thereby, it’s a server-side generated collection of HTML pages from the wiki that more or less looks like the real thing.
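The discovery step could look roughly like this in Python. It is a sketch, not the actual scrape-whonix-wiki.sh logic: `extract_locs`, `discover_page_urls` and the `fetch` callable are hypothetical names, and it assumes the sitemaps use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the wiki's sitemaps use the standard sitemaps.org 0.9 namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return every <loc> URL found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def discover_page_urls(index_xml, fetch):
    """Follow each child sitemap listed in the index and collect page URLs.

    fetch is a hypothetical callable that takes a URL and returns XML text.
    """
    pages = []
    for sitemap_url in extract_locs(index_xml):
        pages.extend(extract_locs(fetch(sitemap_url)))
    return pages
```

Passing `fetch` in as a parameter keeps the parsing separate from however the XML is actually retrieved on the server.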

Why webpage2html and not wget?

Because MediaWiki loads its CSS and javascript assets dynamically via PHP (from a collection of different sources including MediaWiki core, the skin in use, relevant Extensions, etc). Therefore, there are no ‘static’ assets that can be downloaded and served. A wget of the pages includes the /w/load.php?xxxxxxxxx for assets, which means the pages don’t look right, content all messed up etc.

webpage2html uses a different approach - it fetches the assets somehow and then adds that data inline into the html file itself.

The downside is that each html file is rather large (1.4MB on average). The upside is it actually looks OK and it’s entirely local.
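To illustrate the general idea of inlining (not webpage2html’s actual code, which I haven’t reproduced here), an image can be embedded as a base64 data URI. The function names and the `assets` mapping below are made up for this sketch, and using a regex on HTML is a simplification.

```python
import base64
import re


def to_data_uri(data, mime_type):
    """Encode raw bytes as a data: URI that can sit directly in an HTML attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html, assets):
    """Replace <img src="..."> values with data URIs.

    assets is a hypothetical mapping: src value -> (bytes, mime_type).
    Images whose src is not in the mapping are left untouched.
    """
    def replace(match):
        src = match.group(2)
        if src not in assets:
            return match.group(0)
        data, mime_type = assets[src]
        return f"{match.group(1)}{to_data_uri(data, mime_type)}{match.group(3)}"

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', replace, html)
```

Base64 makes the data about a third larger than the original bytes, and every page carries its own copy, which is consistent with the large per-file size.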

Clone it down in a VM and then open file:///home/user/whonix-wiki-html in a browser. It works quite well.

So, the pros and cons of this method:

Pros:

  1. Automatable via cron

  2. Dynamic discovery of pages (e.g any new pages created in the last 24 hours, will be in the new nightly sitemap xml files, which means the subsequent crawl picks up new content, along with edited content)

  3. Fully offline copy (some footer links etc might link to the main www.whonix.org site, but links to other content within content, should load the respective .html file locally. I skip some useless pages such as many of the Special: ones - I think the main content is what really matters here)

  4. Requires no technical knowledge from the user on how to set it up locally, unlike the documentation at https://www.whonix.org/wiki/Dev/Replicating_whonix.org which basically requires near-sysadmin knowledge. The user only needs to know how to git clone the GitHub repo above

  5. Can also be run by anyone anywhere (no need to rely on this GitHub version)

The negatives:

  1. Files are quite large. The repo is pretty quick to clone, 50-60MB or so, but the resulting local copy is maybe 850MB+, due to all the duplicated images in each .html file. Nothing I can do to fix this.

  2. Still relies on crawling the site to fetch content - sort of a security risk as described above - however, we partially mitigate that by running the script on the server itself, so it’s connecting to ‘localhost’ in essence (not literally localhost, but its own eth0 interface), making it pretty much impossible to MITM. However, whatever HTML it generates, comes from the live wiki, which might be compromised already. A risk that all other ‘export from mediawiki’ solutions already face.

Whilst this is maybe a bit further than previous attempts we’ve made, I personally still consider it a bandaid fix.

Ultimately the best form of ‘offline documentation’ which also resists watering-hole attacks, would be to follow the QubesOS example of using a Github repo with markdown docs. Collaboration through pull requests, and ‘published’ documentation merely a deployed version of those docs. Turns the solution on its head with ‘offline’ coming first, and publication coming second.

The cost is:

  1. A slightly larger learning curve (maybe) for sending pull requests for changes.
  2. And maybe a reliance on GitHub, although there are ways to mitigate that too (the GitHub repo could be a means to an end, a mirror of some more ‘pristine’ copy held somewhere else rather than a hard dependency).

It also means that you could ship the offline documentation with Whonix itself, (accessible through the home page of Tor Browser), making the website’s copy a fall-back copy only.

In short I think it’s worth the effort to move away from the current wiki approach in general for many reasons (am I offering to migrate all the content? No I am not necessarily :slight_smile: ) A long term goal anyway IMO.

My solution will update in the above wiki on a daily (or nightly) basis. Still fixing a couple small bugs here and there mostly relating to making sure links open local .html version instead of remote version (or local version without .html suffix, somehow a couple keep sneaking through my awful sed fu)
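For what it’s worth, the link fix-up could be expressed in Python roughly as follows instead of sed. This is a sketch under assumptions that are mine, not taken from the script: the output directory is flat, slashes in page names become underscores, and Special: pages and links with query strings are left pointing at the live site.

```python
import re

# Matches wiki links with or without the site prefix, plus an optional #fragment.
WIKI_LINK = re.compile(
    r'href="(?:https://www\.whonix\.org)?/wiki/([^"#?]+)(#[^"]*)?"'
)


def localize_links(html):
    """Point /wiki/Page links at a local Page.html file."""
    def replace(match):
        page = match.group(1)
        fragment = match.group(2) or ""
        if page.startswith("Special:"):
            return match.group(0)  # skipped pages keep their remote link
        # Assumption: flat output directory, so "/" in page names becomes "_".
        local_name = page.replace("/", "_")
        return f'href="{local_name}.html{fragment}"'

    return WIKI_LINK.sub(replace, html)
```

Handling the fragment separately is what keeps the .html suffix from landing after the `#section` part of a link.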


#61

I hope we can use something that respects freedom/privacy more, like GitLab or so. I wish there would be no more GitHub usage.


#62

@mig5 Thank You! This has long been a PITA for us.

Would using an HTML minifier and/or running the images through a lossy-compression tool before they are incorporated help?


#63

EDIT: Is there a way to dedupe the images by having all pages point to one common copy of the asset?


#64

If I had a way to do that I would obviously do it :slight_smile: that’s the conundrum. wget can’t fetch the assets as static assets because of MediaWiki’s stupid design. webpage2html is the complete opposite: it has a way to fetch those assets, but it can only load them entirely inline per URL.

I’ll try and experiment some more with just bash/sed/awk/cut/perl hacks, to try and ‘grab’ the <style> tag stuff and maybe load them into their own stylesheet or at least some ‘header.html’ and then see if I can somehow ‘include’ that in each html file.
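As a rough idea of what that experiment might look like in Python rather than shell: pull each inline `<style>` block out, store it once under a name derived from its content hash, and link to it instead. The names here are invented for the sketch, the regex approach is fragile on real-world HTML, and it only covers styles, not images.

```python
import hashlib
import re

STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.DOTALL | re.IGNORECASE)


def share_styles(pages):
    """Move inline <style> blocks into shared stylesheets.

    pages: dict of filename -> HTML text.
    Returns (rewritten_pages, stylesheets), where stylesheets maps a
    content-derived file name to CSS text, so identical CSS is stored once.
    """
    stylesheets = {}
    rewritten = {}

    def replace(match):
        css = match.group(1)
        digest = hashlib.sha256(css.encode("utf-8")).hexdigest()[:16]
        css_name = f"style-{digest}.css"
        stylesheets[css_name] = css
        return f'<link rel="stylesheet" href="{css_name}">'

    for name, html in pages.items():
        rewritten[name] = STYLE_BLOCK.sub(replace, html)
    return rewritten, stylesheets
```

Because the file name comes from the content hash, pages that carry the same CSS end up pointing at one common copy.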

Total bespoke job and I’m keen to try and avoid spending too much of Patrick’s money on a bandaid fix (IMO the money is better spent paying someone to move all the content out of MediaWiki entirely and into markdown)


#65

Indeed. :slight_smile:

Size aside… I am uncomfortable creating a deb of this and installing it by default. All my packages are built from source code except for packages installed by apt-get. The HTML offline documentation would be built on the server (lower trust level than my local machine), so it could be compromised. Does that make sense? @HulaHoop

The only way to reach the same security level would be to build from markdown (which can be verified to not include any strange character sequences which could exploit vulnerabilities). Then there are also images. And some content generator to convert markdown to something would be required. Tons of work.
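A check for strange character sequences in markdown could start from something like the sketch below. It only flags control, format (e.g. bidirectional overrides), private-use and surrogate characters by Unicode category. Which categories count as suspicious is my assumption, and passing this check is not proof that a file is safe.

```python
import unicodedata

# Assumption: these Unicode general categories are treated as suspicious.
# Cc = control, Cf = format (e.g. bidi overrides), Co = private use, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cs"}
ALLOWED_CONTROLS = {"\t"}


def find_suspicious_characters(text):
    """Return (line, column, code point) for each suspicious character."""
    findings = []
    # Split on "\n" only, so other line-break-like controls are still reported.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, char in enumerate(line, start=1):
            if char in ALLOWED_CONTROLS:
                continue
            if unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
                findings.append((line_no, col, f"U+{ord(char):04X}"))
    return findings
```

A build script could refuse to continue whenever this returns a non-empty list for any source file.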


#66

While we can browse the Whonix documentation inside Whonix-Workstation, on the host, etc., the question is: how can we browse the documentation inside Whonix-Gateway? For example, someone may want to copy and paste commands (because they are too long to type), or there may be no host or workstation available, just the gateway. So I was thinking: why don’t we save the Whonix documentation as an offline wiki inside the gateway? It could be updated by apt-get dist-upgrade, with each new Whonix version (12, 13, 14, etc.), or manually by the reader (if they can).

I don’t know if this is possible, but I would find it useful. I also don’t know of any programs that do this, how to do it, or whether it is easy. If anyone can enlighten me on this, I will be thankful. I would also like to hear your suggestions on how to view the documentation inside the gateway if offline documentation or an offline wiki is a bad idea.


#67

I have found 3 methods for doing this:

1. https://www.httrack.com/

2. wget, as explained for example at http://www.linuxjournal.com/content/downloading-entire-web-site-wget

3. Full-page screenshots, for example using this add-on: https://addons.mozilla.org/en-US/firefox/addon/fireshot/

(I would download the documentation as images so there is no need to view the website with a browser.)

The remaining question is: if I download the whole documentation and upload it to a server etc., how can I get these images inside the Whonix-Gateway .ova (I mean for the users)? I think I should get permission from Whonix, and how would I add it?