User:Dirk Hünniger/tug
Converting Wikipedia Articles to LaTeX
Abstract
It is often desirable to have access to Wikipedia's articles in LaTeX format. Translation by hand is usually very time consuming and error prone, so it is natural to look for algorithmic solutions to this problem. Our solution is currently available free of charge under an open source license for the Windows operating system as well as Debian Linux. We are not limited to Wikipedia but support all servers running the same wiki software (MediaWiki) as Wikipedia. In particular, it is also possible to process local wikis available only on private networks inside institutions.
Introduction
A wiki provides a very convenient way of working on a document with many contributors, without needing to learn the details of specialized version control and typesetting software. MediaWiki provides a function to export PDF files, but the possibilities for incorporating individual layout requirements are very limited and usually insufficient for professional publishers. The typographic quality of the output is also far less elaborate than that provided by LaTeX. Furthermore, the embedding of formulas as raster graphics is often criticized.
User Experience
In the default mode the program takes a URL of a web page on a server running MediaWiki and writes the PDF version of that page, generated with LaTeX, to the local hard disk. It is also possible to retrieve the corresponding LaTeX source code including the images. In the default mode the HTML generated by the MediaWiki server is evaluated. There is also an extended mode in which the source code of the wiki page, written in the wiki markup language, is processed. The wiki markup language provides a mechanism similar to LaTeX's \newcommand directive, called a template. In this mode it is possible to map templates to LaTeX commands and implement them using \newcommand or similar methods in the headers. This mechanism provides fine-grained control over the conversion process and thus gives the user the full flexibility of LaTeX.
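As a hedged illustration of such a mapping (the template name and command definition here are invented for this example and are not taken from the program's actual configuration), a wiki template such as {{Warnung|...}} could be mapped to a LaTeX command that the user implements in the document header:

```latex
% Hypothetical example: the wiki template {{Warnung|...}} is mapped
% to the LaTeX command \Warnung{...}, which the user defines in the
% header to control its exact appearance.
\newcommand{\Warnung}[1]{%
  \par\noindent\fbox{\parbox{\linewidth}{\textbf{Warning: }#1}}\par}
```

Because the implementation lives in the header, the same template can be rendered differently for different publications without touching the converted body text.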
On the History of the Problem
Quite a few attempts have been made to tackle this problem programmatically. We would like to emphasize the successful work of Hans Georg Kluge, who modified MediaWiki's original parser to produce LaTeX code. Unfortunately it needs to be installed on the server running the wiki, and Wikipedia currently does not install it, not least because the security of the code is still under discussion, which is a particular issue since it is written entirely in PHP. There were also quite a few attempts approaching the problem with regular expressions or Backus-Naur forms. Recently we were able to provide a simple proof, based on the pumping lemma, that the improper nesting of HTML tags often found on Wikipedia causes the grammar not to be context free, and thus renders it indescribable by Backus-Naur forms and regular expressions, ruling out most standard parsing technology. In our approach we decided to run all software on the user's machine only, and thus bypass any security concerns of Wikipedia. We opted for monadic parser combinators as the parsing technology, and were able to handle the non-context-free grammar well with that approach.
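A minimal sketch (illustrative only, not the program's actual parser) shows why monadic parser combinators handle this well: a hand-written combinator for a bold span can simply declare the closing marker optional, accepting the improperly closed markup often found on real wiki pages, something a strict context-free grammar with mandatory pairing cannot express as easily.

```haskell
-- Illustrative monadic parser combinators over plain Maybe;
-- the markup parsed here is the wiki bold marker '''.
import Control.Applicative (Alternative (..))

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, r) -> (f a, r)) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s')  <- pf s
    (a, s'') <- pa s'
    Just (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> do
    (a, s') <- p s
    runParser (f a) s'

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

-- Match a fixed string, e.g. the ''' marker of wiki bold markup.
lit :: String -> Parser String
lit t = Parser $ \s ->
  if take (length t) s == t then Just (t, drop (length t) s) else Nothing

-- One character that does not begin a bold marker.
plainChar :: Parser Char
plainChar = Parser $ \s -> case s of
  (c : r) | take 3 s /= "'''" -> Just (c, r)
  _                           -> Nothing

data Inline = Plain String | Bold String deriving (Show, Eq)

-- A bold span whose closing marker is optional: unclosed tags are
-- accepted and implicitly closed at the end of the input.
bold :: Parser Inline
bold = do
  _    <- lit "'''"
  body <- many plainChar
  _    <- lit "'''" <|> pure ""   -- tolerate a missing close
  return (Bold body)

inline :: Parser Inline
inline = bold <|> (Plain <$> some plainChar)

-- Parse a whole line of markup; Nothing if input remains.
parseWiki :: String -> Maybe [Inline]
parseWiki s = case runParser (many inline) s of
  Just (r, "") -> Just r
  _            -> Nothing
```

For example, `parseWiki "a '''b''' c"` yields `Just [Plain "a ", Bold "b", Plain " c"]`, while the malformed `"'''unclosed"` still parses as `Just [Bold "unclosed"]` instead of being rejected.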
Technical Details of the Implementation
The program is entirely written in the purely functional language Haskell. The necessary image processing is done with the ImageMagick library. We currently use xelatex as the default compiler, although the generated source (with tiny changes limited to the headers) also compiles with pdflatex as well as lualatex. Currently there is still no freely available font that covers the whole range of Unicode. A further problem in this respect is that certain code points used for some Asian characters are shared by more than one symbol, and Wikipedia does not always provide a means to find out which symbol is actually meant by a Unicode character. For now we use FreeSerif as the default font, thus neglecting the needs of Asian glyphs entirely. We also offer a computationally combined font, made of several fonts available under the same open source license, that actually covers the full Unicode range. With pdflatex we use just this one font with the CJK package and can thus handle the first 16 bits of the Unicode range. This approach allows the user to still use custom fonts like Utopia, Courier, etc. For xelatex we provide a set of fonts for the styles bold, italic, teletype, small caps, and combinations thereof. This approach basically also works with lualatex, but caused huge memory and CPU usage with lualatex in our tests.
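The 16-bit limit stems from the CJK package's subfont scheme, in which a large font is split into subfonts of 256 glyphs each, addressed by a two-hex-digit suffix. The following sketch (illustrative only; the program's real font handling is more involved, and `cjkSubfont` is a name invented for this example) shows how a code point would split into a subfont plane and glyph index under that scheme:

```haskell
-- Sketch of 256-glyph subfont addressing as used by the CJK
-- package: the high byte of a 16-bit code point selects the
-- subfont (two hex digits), the low byte the glyph within it.
import Data.Char (ord)
import Text.Printf (printf)

-- Subfont suffix and glyph index for a character, or Nothing for
-- code points beyond the 16-bit range reachable via pdflatex/CJK.
cjkSubfont :: Char -> Maybe (String, Int)
cjkSubfont c
  | n < 0x10000 = Just (printf "%02x" (n `div` 256), n `mod` 256)
  | otherwise   = Nothing
  where n = ord c
```

For instance, the character U+4E2D lands in subfont "4e" at index 45, whereas a code point such as U+1F600 lies outside the 16-bit range and is rejected, matching the limitation described above.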