# Benutzer:Dirk Huenniger/wb2pdf/parser

Zur Navigation springen Zur Suche springen

## Theoretical considerations

### Nonexistance of a BNF

The task is to parse pages written by humans in mediawikis markup language. In particular this language allows for the use of HTML. And as it is written by humans tags are often closed in the wrong order. In particular you will often see:

<i><b></i></b>


In before writing a parser it is very interesting to know whether there is a backus naur form (BNF) for this language. It is well known that the language has to be context free for a BNF to exist. We will proof that a language allowing this kind of improper bracketing is not context-free. We start with the pumping lemma as given in Wikipedia.

If a language L is context-free, then there exists some integer p ≥ 1 such that any string s in L with |s| ≥ p (where p is a pumping length) can be written as

s = uvxyz

with substrings u, v, x, y and z, such that

1. |vxy| ≤ p,
2. |vy| ≥ 1, and
3. uv nxy nz is in L for every integer n ≥ 0.

We define the Language L by.

${\displaystyle h\in L{\text{ iff }}\exists n\in \mathbb {N} :h=(^{n}[^{n})^{n}]^{n}}$

with ${\displaystyle (^{n}}$ we mean the string that conisits of the character '(' repeated n times. Thus we write string concatenation a multiplication. We show that assuming that L is context free causes a contradiction. So let L be context-free and p ≥ 1. We take the string. Therefor a BNF doesn't exist.

${\displaystyle g:=(^{p}[^{p})^{p}]^{p}}$

Since |vxy| ≤ p. The string vxy consists only of at most two kinds of different brackets. In particular there are only tree possibilities. '([' or '[)' or ')]'. Either v is nonempty or y is nonempty. In either case. ${\displaystyle uv^{2}xy^{2}z\notin L}$. Since for a string to be part of L it must have the same number of opening and closing brackets for each kind, which can not be fulfilled in this case. Thus this is a contradiction. So the language is not context-free.

### How to correct impropper bracketing

For further processing it is very useful to have correctly bracketed expressions. So we need to correct the brackting in some way. I will first start with a few examples to give you an Idea on our way of doing this.

(c)->(c)
(c[c]c)->(c[c]c)
(c[c)c]->(c[c])[c]
(c[c{c)c]c}->(c[c{c}])[{c}]{c}


So we keep a stack of brackets currently open. And if we meet a closing bracket that whos matching opening bracket is on the stack but no its top. We close all bracktes above it on the stack. Close the bracket in question. Remove it from the stack and open all other brackets we just closed again. Once we got a token stream which is properly bracketet is very easy to transform it into a tree of nested environments.

## Technical implementation

Up to now we looked at brackets consisting only of a single chacter. In case of HTML tags we also have to consider attributs. That is a Map of key value pairs assoviated with each opening bracket. Still there are other possibilities of data that can be associated with an opening bracket, which we won't explain in detail.

data StartData = TagAttr String (Map String String)
| ...


In our parsetree we want to have nested enviroments. Which brings us to a definition like:

data EnvType = Italic
| Bold
| Tag
| ...


Next we need to take into account very generic bracket. Here we make use of the parsec parser combinator library. Within this library a parse a basically that tries to fit the beginning of an input stream and consumses some of it if it fits ands return a result for further processing or just realizates that it does not fit to the beginning of that particular imput stream. So we can try many parsers until one succeds. That is trying to find out whether the current input stream begin with an opening bracket and to which enviroments that bracket belongs. As we go further we will ceratainly also have to consider closing brackets. And we will also have to deal with the case that the current stream does not start with any opening or closing bracket. In this case we will always insert a single Character in output token stream. This way we ensure that we will always produce a result, even if the inputfile does not conform to any particular grammer. But first of all lets define our first version of a generic brackets:

data MyParser tok = MyParser{start :: GenParser tok () StartData,
end :: GenParser tok () (), allowed :: [EnvType],
self :: EnvType}


So we got start for opening brackets. It Parser for a stream of type tok, were usually tok=Char and in case it matches the begining of the stream returns a result of type StartDate (as defined above) consuming some of the input stream. The parser end is the closing bracket. It is does not return a result or more precisely return a result of the null type (). And of course it has got an environment self associated with it. Unfornutatly an enviroments are not allowed to be arbitarily nested. Particularly a string that should be considerd as an opening bracket in one environment should be considert as plain chareters in an other environmnet. So we have to define in which environment that particlar instance of MyParser is allowed to match. This brings us to the defintion

data MyParser tok = MyParser{start :: GenParser tok () StartData,
end :: GenParser tok () (), allowed :: [EnvType],
self :: EnvType,
allowed= [EnvType]}