Benutzer:Dirk Huenniger/wb2pdf/manual

About MediaWiki2LaTeX

The MediaWiki2LaTeX software converts Wikimedia articles and other pages (Wikipedia Books, Wikibooks, etc.) to other softcopy formats for offline use. It can render a single page or a whole collection of linked pages as a single output file.

MediaWiki2LaTeX was conceived and created by Dirk Huenniger over a period of several years, when the Wikimedia Foundation's (WMF) own offline content generation (OCG) software had broken down and was not available.

Once it became stable and a web interface was added, the WMF made hosting space available for an online service at wmflabs. The software is open source and can also be downloaded and installed locally. A Debian/Ubuntu Linux package is available if you do not want to compile it yourself.

However, its functionality is not yet fully complete and it remains under active development. Programmers with Haskell skills are especially invited to contribute.

Technically, MediaWiki2LaTeX is written for the most part in the functional language Haskell. It accepts either raw wikitext or served HTML web pages as input and typically converts them to the LaTeX typesetting language as an intermediate format.

The documentation on this page was originally written by me, Benutzer:Dirk Huenniger. I am the main developer of mediawiki2latex, so I probably know the technical details best. Other editors have since updated it and improved its readability.

Recent changes

  • 7 November 2016. MediaWiki2LaTeX servers can now run conversion requests in parallel.

Additional User Documentation

There is an independent guide to mediawiki2latex on the Edutech Wiki of the University of Geneva. It may provide complementary information.

Processing Multiple Articles

MediaWiki Transclusion Links

You can simply create a new page on the wiki and type something like this:

{{:MyPageOne}}
{{:MyPageTwo}}

The resulting page will display a concatenation of the pages MyPageOne and MyPageTwo.

MediaWiki Collections

You may create a collection using MediaWiki's Collection extension, which also has a "Book Creator" user front end. The resulting collection page can be processed from the web interface by selecting Template Expansion > Book / Collection, or from the command line using the option --bookmode. Keep in mind that the standard web interface has a time limit of 1 hour, which is enough for around 200 pages, although this limit can be changed, and a 4-hour (around 800 pages) service is also available online. The command line version does not have any limit.

Web Interface

Process Limits

Online services

There are two services, each configured with a different time limit and parallel process capability. Consequently each can process a different maximum size of collection or book: the standard service has a time limit of 1 hour (around 200 pages), and the extended service has a time limit of 4 hours (around 800 pages).

If you submit a request while the maximum number of requests is already running, you will see the error message "Not enough resources available to process your request! Your request has been dropped!" In this case you can try again later, try the other server, or, if you expect to be a frequent user and have some technical knowledge, install the software locally.

If either service accepts your request but times out while processing, it will fail with a timeout error.

Local installation

If you install mediawiki2latex locally there is no time limit. mediawiki2latex is open source software and thus free to use and even to modify.

URL to the Wiki to be converted

This is the full URL or web address of the Wikipedia article you wish to convert. You can open the page you want in your web browser and copy the contents of the address bar at the top of your browser. That is already the correct URL, which you can paste here.

Output Format

You can choose between the following output formats:

  • PDF: A PDF file of the article you selected by supplying its URL. The PDF file is created using the LaTeX typesetting software, which is often used to write books and articles in mathematics, physics and related fields. MediaWiki2LaTeX first converts the content to LaTeX source, which LaTeX then compiles to a PDF file.
  • LaTeX zip: A ZIP file of the LaTeX intermediate code for the article. This is useful if you want to change the layout using the LaTeX software yourself. In this case you will need to install the Ubuntu app on Windows or have a Debian- or Ubuntu-like operating system installed. In order to compile the source with LaTeX you will also have to install the mediawiki2latex package from your distro's repository.
  • EPUB: A file in the EPUB format, suitable for use with e-book readers.
  • ODT (Word Processor): An Open Document Text file. Useful for importing into your favorite word processing software if you want to modify the article offline.

Template Expansion

  • Standard: The default and recommended mode. Use this if you are unsure. The HTML web page generated by MediaWiki is processed, which in most cases renders a single page or article. However, for a Wikipedia Book in the Book: namespace, the Standard option renders the entire book.
  • Book / Collection: The HTML web page generated by MediaWiki is processed. All links on the first wiki page are also followed and the linked pages are processed as well, but not recursively after that. This allows Wikipedia Books in User: space to be rendered.
  • Expand templates by MediaWiki: The Wikitext source for the pages is processed. Templates are expanded by MediaWiki into Wikitext. The Wikitext is then parsed and processed further. Use this mode if you don't get the result you intended with the standard mode.
  • Expand templates internally: The Wikitext source for the pages is processed. Templates are not expanded automatically but are instead mapped to LaTeX commands using a default mapping file. If a template is not defined in the mapping file, an "unknown template" error message will be written into the output text. This mode can be useful if you intend to compile a wikibook on the English or German Wikibooks. If you know LaTeX and want to create a PDF file that looks exactly the way you want, you can also provide your own mapping file using the -t command line option.

Paper

The size of the page you wish to use. Sizes available are:

Size            mm           Inches
A4         210.0 × 297.0   8.27 × 11.69
A5         148.0 × 210.0   5.83 × 8.27
B5         176.0 × 250.0   6.93 × 9.84
Letter     215.9 × 279.4   8.50 × 11.00
Legal      215.9 × 355.6   8.50 × 14.00
Executive  184.2 × 266.7   7.25 × 10.50

Vector Graphics

Some images might be provided in vector graphics format, allowing lossless arbitrary scaling. Since most PDF tools do not support that well, the "Rasterize" option, which is the default, converts vector images to raster (bitmap) graphics format. You can override this behavior using the "Keep Vector Form" option.

Command Line Interface

Overview of parameters:

  -V, -?, -v    --version, --help     show version number
  -o FILE       --output=FILE         output FILE (REQUIRED)
  -f START:END  --featured=START:END  run selftest on featured article numbers from START to END
  -x CONFIG     --hex=CONFIG          hex encoded full configuration for run
  -s PORT       --server=PORT         run in server mode listen on the given port
  -t FILE       --templates=FILE      user template map FILE
  -r INTEGER    --resolution=INTEGER  maximum image resolution in dpi INTEGER
  -u URL        --url=URL             input URL (REQUIRED)
  -p PAPER      --paper=PAPER         paper size, one of A4,A5,B5,letter,legal,executive
  -m            --mediawiki           use MediaWiki to expand templates
  -h            --html                use MediaWiki generated html as input (default)
  -k            --bookmode            use book-namespace mode for expansion
  -z            --zip                 output zip archive of latex source
  -b            --epub                output epub file
  -d            --odt                 output odt file
  -g            --vector              keep vector graphics in vector form
  -i            --internal            use internal template definitions
  -l DIRECTORY  --headers=DIRECTORY   use user supplied latex headers
  -c DIRECTORY  --copy=DIRECTORY      copy LaTeX tree to DIRECTORY
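
As a rough illustration of how the options above combine, the following Python sketch assembles the argument list for one conversion run. The helper name build_command, the example URL and the file name are hypothetical. Only the flags are taken from the overview above. The sketch assumes that the executable is on the PATH as mediawiki2latex and that PDF output needs no format flag.

```python
import shlex

# Output-format flags from the parameter overview.
# Assumption: PDF is produced when no format flag is given.
FORMAT_FLAGS = {"pdf": None, "zip": "--zip", "epub": "--epub", "odt": "--odt"}
PAPER_SIZES = {"A4", "A5", "B5", "letter", "legal", "executive"}


def build_command(url, output, fmt="pdf", paper="A4"):
    """Return the argument list for one mediawiki2latex run (hypothetical helper)."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unknown output format: {fmt}")
    if paper not in PAPER_SIZES:
        raise ValueError(f"unknown paper size: {paper}")
    args = ["mediawiki2latex", f"--url={url}", f"--output={output}", f"--paper={paper}"]
    flag = FORMAT_FLAGS[fmt]
    if flag is not None:
        args.append(flag)
    return args


# Example values only; nothing is executed here.
cmd = build_command("https://en.wikipedia.org/wiki/Haskell", "haskell.epub", fmt="epub")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to start the conversion. Passing a list rather than a single string keeps the URL away from the shell.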

--version

Shows version and help information

--output=FILE

Set the output file to which the result will be written. On Windows you must ensure that the file is not currently open in any other software, since it will not be writable in that case. The file extension is not evaluated, so you have to set other parameters to define the output format.

--featured

This option is not implemented and the parameter might go away.

--hex

This parameter takes the whole configuration of mediawiki2latex as a single hex-encoded string. It is only used by the mediawiki2latex server when it calls its subprocesses. This is necessary to avoid shell injection attacks, as the shell will only see a hex-encoded string and will not try to run any script from it.
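
The idea behind hex encoding can be sketched in a few lines of Python. This is only an illustration of the principle: the actual serialization format that mediawiki2latex uses for its configuration is not described here, and the function names and the hostile example string are hypothetical.

```python
def to_hex(config_text: str) -> str:
    """Encode arbitrary text as lowercase hex digits of its UTF-8 bytes."""
    return config_text.encode("utf-8").hex()


def from_hex(hex_text: str) -> str:
    """Reverse to_hex."""
    return bytes.fromhex(hex_text).decode("utf-8")


# A made-up input containing shell metacharacters.
hostile = "https://example.org/wiki/Page; rm -rf ~ #"
encoded = to_hex(hostile)

# The encoded form contains only the characters 0-9 and a-f,
# none of which has a special meaning to a shell.
print(set(encoded) <= set("0123456789abcdef"))
print(from_hex(encoded) == hostile)
```

Whatever the user typed, the string handed to the subprocess consists of hex digits only, and the receiving process can recover the original text exactly.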

--server=PORT

Run the mediawiki2latex web interface as an HTTP server, listening on PORT.

--templates=FILE

Define a custom mapping file from MediaWiki templates to LaTeX commands. An example is given in the file templates.user. The original wikitext will be parsed by mediawiki2latex. MediaWiki will not be used to expand any templates. An "Unknown Template" error message will be added to the output PDF file wherever a template is encountered that is not listed in the mapping file.

--resolution=INTEGER

By default all images with a resolution higher than 300 dpi are scaled down to 300 dpi in order to reduce the size of the resulting PDF file. With this parameter you can replace the default with the resolution you need. This is helpful if you need to produce a PDF file that is small enough to be uploaded to a file hosting website.
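
The effect of a resolution cap can be illustrated with a short calculation. The sketch below assumes that the resolution of an image is its pixel width divided by its printed width in inches. How mediawiki2latex actually determines the resolution is not stated here, and the function name downscale is hypothetical.

```python
def downscale(width_px, height_px, printed_width_in, max_dpi=300):
    """Return pixel dimensions after capping the resolution at max_dpi.

    Assumption: dpi = pixel width / printed width in inches.
    """
    dpi = width_px / printed_width_in
    if dpi <= max_dpi:
        # Already at or below the cap: leave the image untouched.
        return width_px, height_px
    factor = max_dpi / dpi
    return round(width_px * factor), round(height_px * factor)


# A 3000 x 2000 pixel image printed 5 inches wide has 600 dpi,
# so both dimensions are halved to reach 300 dpi.
print(downscale(3000, 2000, 5.0))               # (1500, 1000)
# The same image with a cap of 150 dpi.
print(downscale(3000, 2000, 5.0, max_dpi=150))  # (750, 500)
```

Since the number of pixels falls with the square of the scale factor, halving the resolution reduces the uncompressed image data to roughly a quarter.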

--url=URL

The URL for the main page you wish to convert

--paper=PAPER

The size of the page you wish to use in the PDF. Supported values are the European DIN formats A4, A5 and B5 as well as the American formats letter, legal and executive. In LaTeX it is possible to define more paper sizes in case you need them.

--mediawiki

Use MediaWiki to expand the MediaWiki templates in the wikitext source, then parse and process the resulting expanded wikitext source with mediawiki2latex

--html

Use MediaWiki to generate an HTML page from the wikitext source, then parse and process the resulting HTML with mediawiki2latex.

--bookmode

This mode is for processing collections made with the MediaWiki Collection extension. This includes the pages found in the Book namespace on Wikipedia as well as user defined collections in the User namespace. mediawiki2latex will follow all links in the wikitext, but not recursively. For each link it will load the HTML. It will stitch together all HTML loaded, then parse and process that. This option can be combined with --mediawiki or --internal or --templates causing the download of wikicode instead of HTML.
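
The phrase "follow all links, but not recursively" can be made concrete with a small Python sketch. It works on an in-memory dictionary instead of downloading anything, and it is not the actual implementation. The function name collect_pages, the regular expression and the sample pages are hypothetical. Whether the collection page itself is part of the output is not stated here, so the sketch returns only the linked pages.

```python
import re

# Matches the target of a wiki link such as [[Title]] or [[Title|label]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def collect_pages(root, wiki):
    """Return the titles linked from the root page, one level deep only."""
    titles = []
    for match in LINK.finditer(wiki[root]):
        title = match.group(1).strip()
        if title in wiki and title not in titles:
            titles.append(title)
    return titles


# A made-up wiki: the book page links to two chapters,
# and one chapter links to an appendix.
wiki = {
    "Book:Demo": "[[Chapter 1]]\n[[Chapter 2|Second chapter]]",
    "Chapter 1": "Text with a link to [[Appendix]].",
    "Chapter 2": "More text.",
    "Appendix": "Not linked from the book page.",
}

pages = collect_pages("Book:Demo", wiki)
stitched = "\n".join(wiki[title] for title in pages)
print(pages)  # ['Chapter 1', 'Chapter 2']
```

The appendix is linked only from Chapter 1, not from the book page, so it is not collected.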

--zip

Create a zip file of the LaTeX intermediate code generated.

--epub

Create an ePub file of the article as output. Essentially an intermediate HTML file will be created. Images will be processed as usual and mathematical formulas will be rendered as images. This intermediate result will be converted to an ePub file by calibre.

--odt

Create an ODT file of the article as output. ODT stands for Open Document Text and can be imported by common word processing software. It is native to OpenOffice and LibreOffice. The same approach with an intermediate HTML file is used as described above for ePub, but the ODT file is created by LibreOffice.

--vector

Include the source vector file in the PDF output instead of converting it to raster (bitmap) format, which is the default. PDF processing and viewing software usually does not work well with vector graphics, so this is not recommended.

--internal

Same as --templates, but uses a default template definition file compiled into the mediawiki2latex executable. This might be useful on German and English wikibooks, since the template definition file contains some reasonable definitions for many templates on these sites.

--headers=DIRECTORY

Copy a directory with custom header files into the temporary LaTeX document tree before running XeLaTeX. This way you can define custom layouts and your own LaTeX commands with \newcommand, which is useful in combination with the --templates option described above.

--copy=DIRECTORY

Copy the LaTeX (and possibly HTML) intermediate files to the given directory. This option is useful if you want to edit the LaTeX document manually and compile it yourself. mediawiki2latex will still do everything requested, including compiling the sources and creating the output files, and will additionally copy the directory immediately before the compile step.

Wiki Source Page Code

MediaWiki2LaTeX is sensitive to some features in the Wikimedia page source. In some cases you can improve the rendering by following the tips given here. When you change a page for this purpose, it is recommended that you add an HTML comment to the source code so that other editors do not undo the change, something like this:

<!-- This parameter added to improve print rendering. Please do not remove. -->

Tables

There are some rules for the typesetting of tables:

  • Tables can include horizontal and vertical lines and a frame surrounding the table. These will be drawn if and only if the template prettytable or the attribute class="wikitable" is present in the header of the table.
  • It can be useful to reduce the font size for a whole table. This can be achieved by writing latexfontsize="scriptsize" into the header of the table.
  • In contrast to the tolerant behavior of MediaWiki, mediawiki2latex requires a new table to start on a new line.
  • You can define the width of columns in a table using the width attribute with a value in percent (%) in the attributes of the cells of the table.
  • Table headings are supported. In a large table spanning several pages, it is often required to repeat the header (that is, some rows at the beginning of the table) at the beginning of each page. This is done by marking some cells as header cells using the exclamation mark (!) instead of the vertical bar (|) in the wiki syntax. The program considers the first few rows to be part of the header as long as they continuously contain header cells.

List of Figures

A table of images, their authors and licenses is automatically created in the appendix. In order to determine the name of the author, the information template on the description page of the image is analyzed, thus it needs to be present and to have a valid author entry.

Server Configuration

Images

Size of Files and Image Resolution

Often the application in which you want to use the generated PDF allows only a maximum file size. This often applies to print-on-demand services. You can reduce the output file size by setting the images to a lower resolution, at the cost of some quality. Typical printing machines used for industrial book production today use a resolution of 300 dpi, so a higher resolution is usually not necessary. You can enter the maximum allowed resolution in the Graphical User Interface (this feature may not be implemented yet). All images with higher resolutions will be reduced accordingly.

Width of Images

The width of an image will usually be as large as possible, determined by the width of the page and the margins. You may modify this behavior by using a px value when including the image in the wiki source text. 400 pixels correspond to the maximum available width. Thus writing 200px will reduce the image to one half of the maximum width.
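
The mapping from a px value to a share of the available width can be written as a one-line rule. The following Python sketch is an illustration only. The function name width_fraction is hypothetical, and capping values above 400px at the full width is an assumption that the text does not state explicitly.

```python
MAX_WIDTH_PX = 400  # per the text: 400 pixels correspond to the full available width


def width_fraction(px):
    """Return the share of the available width used by an image given as <px>px."""
    if px <= 0:
        raise ValueError("px must be positive")
    # Assumption: values above 400px are capped at the full width.
    return min(px / MAX_WIDTH_PX, 1.0)


print(width_fraction(200))  # 0.5
print(width_fraction(100))  # 0.25
print(width_fraction(800))  # 1.0
```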

Wrapping Images

The former template [[Vorlage:Latex Wrapfigure|Latex Wrapfigure]] can be used to let text wrap around an image. It takes two parameters, image and width. Width is between 0.0 and 1.0, where 1.0 means the full width of the text, 0.5 means half the width of the text, and so on. Image has to be a link to an image in wiki notation, including the leading and trailing double square brackets. See also the section on user-defined templates in this document and the manual of the wrapfig LaTeX package (which provides the wrapfigure environment) on CTAN.

Templates

Automated Expansion

By default all templates are expanded by MediaWiki. This is what the setting Template Expansion = MediaWiki in the GUI means.

Manual Expansion for PDF and LaTeX Output

It is hard for an algorithm to determine how a MediaWiki template should be converted to LaTeX code. This is because templates make extensive use of HTML in order to produce good-looking output in a web browser, which is very different from the "what you get is what you mean" style that LaTeX uses. Still, all templates are expanded algorithmically by default, as explained above. We recommend another way of dealing with templates, which we explain now. You have to set Template Inclusion=normal in the GUI. In this case only a limited number of templates are taken into account by wb2pdf. All other templates will cause the message UNKNOWN TEMPLATE to appear in the resulting file. It is advisable to search the output files for this string in order to make sure that all templates were processed correctly. To extend the template processor with custom templates, you have to modify the file templates.user in the directory wb2pdf/trunk/latex.

[
["mywikitemplate1","MyLaTeXTemplate","paramx","3","paramy"],
["print version cover","LaTeXNullTemplate"],
["GCC_take_home","LaTeXGCCTakeTemplate","1"]
]

The file contains a list of sublists. The first item in each sublist is the name of the template in the wiki. The second is the name of the template in LaTeX. The remaining elements of the sublist are the parameters in the wiki that are passed to the template in LaTeX. You also have to modify templates.tex in the directory wb2pdf/trunk/document/main to add a definition for the LaTeX version of the template. When modifying templates.user, be aware that each entry ends with a comma except for the last entry. Furthermore, umlauts and other non-ASCII characters have to be encoded in decimal UTF-8 notation, for example:

"\195\156berschriftensimulation 5"


This is not a big problem, since the Unknown Template error message in the file main.tex in the directory wb2pdf/trunk/document/main uses exactly this format (decimal UTF-8 notation), so you just need to copy and paste the name from there.
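
If you prefer to produce the notation yourself, the conversion can be sketched in Python as follows. The function name to_decimal_utf8 is hypothetical. The sketch simply writes every non-ASCII byte of the UTF-8 encoding as a backslash followed by its decimal value, which reproduces the example above.

```python
def to_decimal_utf8(name: str) -> str:
    """Write each non-ASCII UTF-8 byte of name as a backslash plus its decimal value."""
    parts = []
    for byte in name.encode("utf-8"):
        if byte < 128:
            parts.append(chr(byte))    # plain ASCII is kept as it is
        else:
            parts.append(f"\\{byte}")  # for example 195 becomes \195
    return "".join(parts)


print(to_decimal_utf8("Überschriftensimulation 5"))
# prints \195\156berschriftensimulation 5
```

The sketch does not treat quotes, backslashes or control characters in a template name, and it does not cover the case of a digit directly following an escaped byte. Names of that kind are best copied from the error message as described above.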

If you need more freedom in defining how a template is processed, you can also edit the source code of the template processor. To do so, modify the function templateProcessor in the file LatexRenderer.hs and recompile. This requires the Glasgow Haskell Compiler as well as its package manager (cabal). Many examples of custom templates are given in LatexRenderer.hs. The file is written in the purely functional programming language Haskell, so some knowledge of Haskell will help you to define the processing of your custom template. LatexRenderer.hs is essentially a code generator that writes code in the LaTeX typesetting language, which you will also need to learn in order to extend the custom template abilities of wb2pdf.

Manual Expansion for EPUB, ODT and HTML Output

You need to modify the function templateProcessor in the file HtmlRenderer.hs and recompile. After that you need to run mediawiki2latex with the -i command line option. More hints can be found on the discussion page: click Diskussion at the top of this page.