Benutzer:Dirk Huenniger/wb2pdf/parser

Theoretical considerations

Nonexistance of a BNF

The task is to parse pages written by humans in mediawikis markup language. In particular this language allows for the use of HTML. And as it is written by humans tags are often closed in the wrong order. In particular you will often see:

<i><b></i></b>

In before writing a parser it is very interesting to know whether there is a backus naur form (BNF) for this language. It is well known that the language has to be context free for a BNF to exist. We will proof that a language allowing this kind of improper bracketing is not context-free. We start with the pumping lemma as given in Wikipedia.

If a language L is context-free, then there exists some integer p ≥ 1 such that any string s in L with |s| ≥ p (where p is a pumping length) can be written as

s = uvxyz

with substrings u, v, x, y and z, such that

1. |vxy| ≤ p,

2. |vy| ≥ 1, and

3. uvⁿxyⁿz is in L for every integer n ≥ 0.

We define the Language L by.

$h\in L{\text{ iff }}\exists n\in \mathbb {N} :h=(^{n}[^{n})^{n}]^{n}$

with $(^{n}$ we mean the string that conisits of the character '(' repeated n times. Thus we write string concatenation a multiplication. We show that assuming that L is context free causes a contradiction. So let L be context-free and p ≥ 1. We take the string. Therefor a BNF doesn't exist.

$g:=(^{p}[^{p})^{p}]^{p}$

Since |vxy| ≤ p. The string vxy consists only of at most two kinds of different brackets. In particular there are only tree possibilities. '([' or '[)' or ')]'. Either v is nonempty or y is nonempty. In either case. $uv^{2}xy^{2}z\notin L$ . Since for a string to be part of L it must have the same number of opening and closing brackets for each kind, which can not be fulfilled in this case. Thus this is a contradiction. So the language is not context-free.

How to correct impropper bracketing

For further processing it is very useful to have correctly bracketed expressions. So we need to correct the brackting in some way. I will first start with a few examples to give you an Idea on our way of doing this.

(c)->(c)
(c[c]c)->(c[c]c)
(c[c)c]->(c[c])[c]
(c[c{c)c]c}->(c[c{c}])[{c}]{c}

So we keep a stack of brackets currently open. And if we meet a closing bracket that whos matching opening bracket is on the stack but no its top. We close all bracktes above it on the stack. Close the bracket in question. Remove it from the stack and open all other brackets we just closed again. Once we got a token stream which is properly bracketet is very easy to transform it into a tree of nested environments.

Technical implementation

Up to now we looked at brackets consisting only of a single chacter. In case of HTML tags we also have to consider attributs. That is a Map of key value pairs assoviated with each opening bracket. Still there are other possibilities of data that can be associated with an opening bracket, which we won't explain in detail.

data StartData = TagAttr String (Map String String)
               | ...

In our parsetree we want to have nested enviroments. Which brings us to a definition like:

data EnvType = Italic
             | Bold
             | Tag
             | ...

Next we need to take into account very generic bracket. Here we make use of the parsec parser combinator library. Within this library a parse a basically that tries to fit the beginning of an input stream and consumses some of it if it fits ands return a result for further processing or just realizates that it does not fit to the beginning of that particular imput stream. So we can try many parsers until one succeds. That is trying to find out whether the current input stream begin with an opening bracket and to which enviroments that bracket belongs. As we go further we will ceratainly also have to consider closing brackets. And we will also have to deal with the case that the current stream does not start with any opening or closing bracket. In this case we will always insert a single Character in output token stream. This way we ensure that we will always produce a result, even if the inputfile does not conform to any particular grammer. But first of all lets define our first version of a generic brackets:

data MyParser tok = MyParser{start :: GenParser tok () StartData,
                             end :: GenParser tok () (), allowed :: [EnvType],
                             self :: EnvType}

So we got start for opening brackets. It Parser for a stream of type tok, were usually tok=Char and in case it matches the begining of the stream returns a result of type StartDate (as defined above) consuming some of the input stream. The parser end is the closing bracket. It is does not return a result or more precisely return a result of the null type (). And of course it has got an environment self associated with it. Unfornutatly an enviroments are not allowed to be arbitarily nested. Particularly a string that should be considerd as an opening bracket in one environment should be considert as plain chareters in an other environmnet. So we have to define in which environment that particlar instance of MyParser is allowed to match. This brings us to the defintion

data MyParser tok = MyParser{start :: GenParser tok () StartData,
                             end :: GenParser tok () (), allowed :: [EnvType],
                             self :: EnvType,
                             allowed= [EnvType]}

Understanding the Parser

The central algorithms of the parser is mything. mything may either return RRight (s,i) or BBad (s,i) where s is some kind of stack we will learn about later and i is the resulting parsetree. If mything is call from toplevel in a reasonable way it can hopefully be prove that it will terminate within a reasonable time return RRight (s,i). If it is called on an inner level it might also return BBad (s,i). Our desription of the grammer is based on environment. The parser tries to generate a list cosisting of tokens and subenvironment where each subenvironment contains a list of the same type, namely Anything. Our grammer definition lets us define a parser that has to match at the begining of an environment as well as one that has to match at the end. It also allows us to define a bad parser for each environment. If the bad parser matches within an environment the environment is disregared and we backtrack to the begging of the enviroment and try other parsers from that position. So in mything we first of all have to check weather bad matches (mtnotbad) if this is the case we return Bad (s,i) and we are done The environments currenty open are kept on the stack s. So mtnotbad has just to take the the bad parser of top of s and try to match it and if it matchs return BBad otherwise RRight. So if the bad parser didn't match we try closing brackes. We do so by calling the mtcl function. We call iom2 to get the index of match we first of all reduce the list of currently open environment to the list of parsers for closing of the environments. We apply indexofmatch2 to the list. Either we get pzero than there was not matching closing bracket or we get the index of the match on the stack. That is the index of the enviroment just about to close on the stack. So iom2 return either pzero or the stackindex to the closing environment. If its on top of the stack we are done, otherwise we have to ensure that nither the current enviroment nor the closing environement are preseving enviroment (that is something like verbatim in latex, so if somebody writes \end{itemize} within verbatim latex shall not close the enviroment but rather path out the string unchanged). If this is not mtcl retruns pzero, signaling that not closing bracket was found. If that is all ok we call mything from mtcl and thus cause a recursion. We pass mything the new stack with the just closed enviromentent removed and a new current parsetree with prober tokens inserted to singal necessary closing and opening of brackets. If we close and environment that is not on top of the stack, we first of all have to close all brackets higher on the stack, than we can close the bracket we wanted to close and than we have to reopen all brackets we just had to close except the one we wanted to close. the opening brackets are handeld by obrk, the closing one by cbrk. We have to consider that we are contructing the parsetree in reverse order. Usually one would add any new toke to the end of the list of allready parsed tokes. But this is not effienct in Haskell we can only at to the front of a list in an efficient manner. So we build the list in reverse order so we first have to add closing brackets to the front and than we have to add opening brackets to the front. After pasing the whole text mything will finally return and its result will be returned by the initial call the mything and that well return the final result. But we also have to consider the case that mtcl does not match in the iniital call to mything in this case we will try opening bracket (mtps) if this also does not match we match a single character and return call mything recursively. So are left to consider mtps in detail. It v is the current lenght of the stack. mtps just tries to match the starting parser adds and opening bracket to the result and calls mything recusively. If mything returns BBad and the current stackindex is in the stack return with bad we know that the enviroment we are trying to open will cause its bad parser to match later. So we know this opening bracket should be disconsiderd. So we return pzero. In case the current stackindex is not in the stack return by the call to mything with BBad, we know that our current bracket was not problem and we thrown on the error by returning BBAd ourselfs. In case of RRight we can just return that.