Yesterday, I was looking for a way to carry around a local copy of a wiki on my iPhone. It seems that there are several apps that let you carry around an offline copy of Wikipedia — in case you are stuck on an airplane or out of cellphone range and need to look something up — but those do not appeal to me because I am always around a 3G or WiFi signal. What I am more interested in is being able to carry around a copy of the office wiki because I can only get iPhone access to that in a small area within the building.
After some research, I found that there was no easy, direct way to do this. But a little experimentation showed that I might be able to roll my own offline reader with a web crawler and an ebook packager, which would let me browse the wiki with Stanza. Overall, my experimentation produced some good news and some bad news, all of which I will report here.
The Good News
The good news is that it looks like this is quite possible. Although I did my experiments by hand, the process can easily be scripted. Downloading and packaging do take a while (between 4 and 6 hours for our wiki of about 500 pages), but that is well within reach for nightly builds.
The heavy lifting is done by wget and Calibre. First, you tell wget to spider the wiki. You have to tell it to ignore certain pages, such as edit pages, revision histories, and pages requiring login. I even told it to ignore the Apache indexes of the upload directories. The command I used looked something like this:
wget --no-verbose -Q0 --mirror --restrict-file-names=windows \
--convert-links --page-requisites --no-parent \
--reject '*action=*,*oldid=*,*printable=*,*returnto=*,*redirect=*,*index.html*' \
This took some time, but it downloaded the most recent revision of every page, with the links between pages rewritten to point at the local copies on the filesystem. I also told it to use Windows-safe filenames because, although everything in my pipeline (my workstation and iPhone) is Unix-based, some apps get confused by a ‘?’ in a filename.
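To get a feel for which URLs those --reject patterns are meant to exclude, here is a rough sketch that applies the same glob patterns to a full URL string using a shell `case` statement. It only approximates wget's behavior: as far as I know, wget matches reject patterns against the file-name portion and may still fetch an HTML page to scan it for links before discarding it. The function name `is_rejected` and the sample URLs are made up for illustration.

```shell
# is_rejected URL: succeed (exit status 0) if the URL matches any of the
# glob patterns passed to --reject in the wget command above.
is_rejected() {
  case "$1" in
    *action=*|*oldid=*|*printable=*|*returnto=*|*redirect=*|*index.html*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical sample URLs, only for illustration.
for url in \
  'http://machinename/wiki/index.php/Main_Page' \
  'http://machinename/wiki/index.php?title=Main_Page&action=edit' \
  'http://machinename/wiki/index.php?title=Main_Page&oldid=1234'
do
  if is_rejected "$url"; then
    echo "skip  $url"
  else
    echo "keep  $url"
  fi
done
```

The ordinary article URL falls through to the default branch and is kept, while the edit and old-revision URLs match a pattern and are skipped.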
Next, I converted the downloaded archive to an ebook with:
html2epub --output=officewiki.epub ./machinename/wiki/index.php/Main_Page
This took about as long as the download.
Finally, I copied the resulting file to an HTTP-accessible location and pointed Stanza at it.
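Since the three steps above (mirror, package, publish) are the part that could run as a nightly build, here is a minimal sketch of a wrapper script. It is not what I actually ran. The function names, the `DRY_RUN` switch, and every argument value are hypothetical placeholders, and the wiki URL in particular has to be supplied by you. It calls `html2epub` as in the post. As far as I know, newer Calibre releases ship `ebook-convert` instead, so adjust that line to match your install.

```shell
#!/bin/sh
# Hypothetical nightly-build wrapper: mirror the wiki, package it as an
# epub, and copy the result to a web-served directory.

# run CMD ARGS...: execute the command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# build_wiki_ebook WIKI_URL START_PAGE OUT_EPUB PUBLISH_DIR
build_wiki_ebook() {
  wiki_url=$1      # URL where the crawl starts (placeholder)
  start_page=$2    # local path of the mirrored start page
  out_epub=$3      # name of the epub file to produce
  publish_dir=$4   # directory served over HTTP (placeholder)

  # Step 1: mirror the wiki. wget often exits non-zero when a few pages
  # fail, so a failure here is reported but does not abort the build.
  run wget --no-verbose -Q0 --mirror --restrict-file-names=windows \
      --convert-links --page-requisites --no-parent \
      --reject '*action=*,*oldid=*,*printable=*,*returnto=*,*redirect=*,*index.html*' \
      "$wiki_url" || echo "wget reported errors; continuing" >&2

  # Step 2: package the mirrored pages, starting from the main page.
  run html2epub --output="$out_epub" "$start_page" || return 1

  # Step 3: publish the epub where Stanza can fetch it.
  run cp "$out_epub" "$publish_dir/"
}
```

Setting DRY_RUN=1 before calling build_wiki_ebook prints the three commands instead of running them, which is a cheap way to check the arguments before committing to a multi-hour crawl. A cron entry could then call the function once a night.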
The Indifferent News
Stanza picked up the file immediately, but then had to process it. The epub file is 38MB, but it is a zip archive that decompresses to 68MB, and Stanza insisted on working from a decompressed copy. The decompression took 5 to 10 minutes, which is bearable if updates are infrequent, but I am not too happy about it.
The Bad News
Although the process seems to work and browsing the wiki ebook works about as well as can be expected, my iPhone backups are now so slow as to be unusable. I have been backing up for 2 hours now and the backup is still incomplete.
At present, I am not sure where else to take this. I could simply “scp” the HTML files to my jailbroken phone and store them in a place that is not backed up. There are jailbroken apps for reading arbitrary files on the phone. I would really like to avoid this and go the official route, though, because that would make the approach available to a wider audience.
My knowledge of ebook formats is limited, but I have to believe there is a format that lets you simply zip up raw HTML files, which would cut a large amount of time out of the packaging step. Perhaps there is also an ebook reader that does not need to unzip the archive on the phone and instead reads files directly from within the zip. This would save both processing and backup time.
So the initial results are promising, but not spectacular, and (in my opinion) not yet usable. More research will be required.