Parser

To run the parser we call parse. This requires a grammar and the output from the lexer (which is a list of tokens).

parse : (act : Grammar tok c ty) -> (xs : List tok) ->
      Either (ParseError tok) (ty, List tok)

If successful this returns ‘Right’ with a pair of

  • the parse result.
  • the unparsed tokens (the remaining input).

otherwise it returns ‘Left’ with the error message.

So we need to define the grammar for our parser, this is done using the following ‘Grammar’ data structure. This is a combinator structure, similar in principle to the recogniser combinator for the lexer, which was discussed on the previous page.

As with the Recogniser the Grammar type is dependent on a boolean ‘consumes’ value which allows us to ensure that complicated Grammar structures will always consume input.

||| Description of a language's grammar. The `tok` parameter is the type
||| of tokens, and the `consumes` flag is True if the language is guaranteed
||| to be non-empty - that is, successfully parsing the language is guaranteed
||| to consume some input.
public export
data Grammar : (tok : Type) -> (consumes : Bool) -> Type -> Type where
   Empty : (val : ty) -> Grammar tok False ty
   Terminal : String -> (tok -> Maybe a) -> Grammar tok True a
   NextIs : String -> (tok -> Bool) -> Grammar tok False tok
   EOF : Grammar tok False ()

   Fail : Bool -> String -> Grammar tok c ty
   Commit : Grammar tok False ()
   MustWork : Grammar tok c a -> Grammar tok c a

   SeqEat : Grammar tok True a -> Inf (a -> Grammar tok c2 b) ->
            Grammar tok True b
   SeqEmpty : {c1, c2 : Bool} ->
              Grammar tok c1 a -> (a -> Grammar tok c2 b) ->
              Grammar tok (c1 || c2) b
   Alt : {c1, c2 : Bool} ->
         Grammar tok c1 ty -> Grammar tok c2 ty ->
         Grammar tok (c1 && c2) ty

So an example of a grammer type may look something like this: Grammar (TokenData ExpressionToken) True Integer. This is a complicated type name and a given parser will need to use it a lot. So to reduce the amount of typing we can use the following type synonym (similar to what is done in Idris 2 : https://github.com/edwinb/Idris2/blob/master/src/Parser/Support.idr )

public export
Rule : Type -> Type
Rule ty = Grammar (TokenData ExpressionToken) True ty

Parser Example

On the previous page we implemented a lexer to ‘lex’ a very simple expression, on this page we will go on to add a parser for this running example.

The syntax we are aiming at is something like this (expressed in Backus–Naur form (BNF)):
<expr> ::= <integer literal>
 |  <expr>'+'<expr>
 |  <expr>'-'<expr>
 |  <expr>'*'<expr>
 |  '('<expr>')'

<integer literal> ::= <digit>|<integer literal><digit>
To start, here is a grammar to parse an integer literal (that is, a sequence of numeric characters).
export
intLiteral : Rule Integer
intLiteral
  = terminal (\x => case tok x of
                  Number i => Just i
                  _ => Nothing)

In order to try this out, here is a temporary function, this calls parse which takes two parameters:

  • The grammar (in this case intLiteral)
  • The token list from the lexer.
test1 : String -> Either (ParseError (TokenData ExpressionToken))
                      (Integer, List (TokenData ExpressionToken))
test1 s = parse intLiteral (fst (lex expressionTokens s))

As required, if we pass it a string which is a number literal then it will return the number in the ‘Right’ option.

*parserEx> test1 "123"
Right (123, []) : Either (ParseError (TokenData ExpressionToken))
                       (Integer, List (TokenData ExpressionToken))

If we pass it a string which is not a number literal then it will return an error message.

*parserEx> test1 "a"
Left (Error "End of input"
          []) : Either (ParseError (TokenData ExpressionToken))
                       (Integer, List (TokenData ExpressionToken))

If we pass it a number followed by something else, then it will still be successful, this is because we are not specifically checking for end-of-file.

*parserEx> test1 "123a"
Right (123, []) : Either (ParseError (TokenData ExpressionToken))
                       (Integer, List (TokenData ExpressionToken))
*parserEx>
The intLiteral function above uses the terminal function to construct the grammar
||| Succeeds if running the predicate on the
||| next token returns Just x, returning x.
||| Otherwise fails.
export
terminal : (tok -> Maybe a) -> Grammar tok True a
terminal = Terminal

This is defined here : https://github.com/idris-lang/Idris-dev/blob/master/libs/contrib/Text/Parser/Core.idr Idris 2 uses a slightly different version which stores an error message like “Expected integer literal” which can be output if the rule fails : https://github.com/edwinb/Idris2/blob/master/src/Text/Parser/Core.idr

The ‘terminal’ function is also used to construct the other elements of the grammar that we require, for instance, opening parenthesis:
openParen : Rule Integer
openParen = terminal (\x => case tok x of
                  OParen => Just 0
                  _ => Nothing)

Integer value is not really relevant for parenthesis so 0 (zero) is used as a default value.

As before, this can be tested with a function like this:

test2 : String -> Either (ParseError (TokenData ExpressionToken))
                      (Integer, List (TokenData ExpressionToken))
test2 s = parse openParen (fst (lex expressionTokens s))

We can see below that it correctly parses an open parenthesis and gives an error for anything else:

*parserEx> test2 "("
Right (0, []) : Either (ParseError (TokenData ExpressionToken))
                     (Integer, List (TokenData ExpressionToken))
*parserEx> test2 "123"
Left (Error "Unrecognised token"
          [MkToken 0
                   0
                   (Number 123)]) : Either (ParseError (TokenData ExpressionToken))
                                           (Integer,
                                            List (TokenData ExpressionToken))

Now we have two Grammars we can try combining them. The following test looks for openParen followed by intLiteral, the two Grammars are combined using <*>. The map const part uses the integer value from the first.

The following test is looking for ( followed by a number:

test3 : String -> Either (ParseError (TokenData ExpressionToken))
                      (Integer, List (TokenData ExpressionToken))
test3 s = parse (map const openParen <*> intLiteral) (fst (lex expressionTokens s))

We can see below that ( followed by a number is successfully parsed but, as required, other token lists are not:

*parserEx> test3 "(123"
Right (0, []) : Either (ParseError (TokenData ExpressionToken))
                     (Integer, List (TokenData ExpressionToken))
*parserEx> test3 "(("
Left (Error "Unrecognised token"
          [MkToken 0
                   (case fspan (\ARG => not (intToBool (prim__eqChar ARG '\n')))
                               "(" of
                      (incol, "") => c + cast (length incol)
                      (incol, b) => cast (length incol))
                   OParen]) : Either (ParseError (TokenData ExpressionToken))
                                     (Integer, List (TokenData ExpressionToken))

*parserEx> test3 "123"
Left (Error "Unrecognised token"
          [MkToken 0
                   0
                   (Number 123)]) : Either (ParseError (TokenData ExpressionToken))
                                           (Integer,
                                            List (TokenData ExpressionToken))
*parserEx> test3 "123("
Left (Error "Unrecognised token"
          [MkToken 0 0 (Number 123),
           MkToken 0
                   (case fspan (\ARG => not (intToBool (prim__eqChar ARG '\n')))
                               "321" of
                      (incol, "") => c + cast (length incol)
                      (incol, b) => cast (length incol))
                   OParen]) : Either (ParseError (TokenData ExpressionToken))
                                     (Integer, List (TokenData ExpressionToken))
The closing parenthesis is constructed in the same way.
closeParen : Rule Integer
closeParen = terminal (\x => case tok x of
                   CParen => Just 0
                   _ => Nothing)
Now we can generate a Grammar for an expression inside parenthesis like this.
paren : Rule Integer -> Rule Integer
paren exp = openParen *> exp <* closeParen

The use of *> and <* instead of <*> is an easy way to use the integer value from the inner expression.

So lets try to use this to define a grammar which recognises either:

  • An integer literal
  • An integer literal inside parenthesis
  • An integer literal inside parenthesis inside parenthesis
  • … and so on.
This requires a recursively defined structure like this:
partial
expr : Rule Integer
expr = intLiteral <|> (paren expr)

This is a valid grammar because every time it is called it is guaranteed to consume a token. However, as an Idris structure, it is problematic due to the recursion. Defining it as partial allows it to compile but at runtime we are likely to get a crash with an (unhelpful) error message like killed by signal 11.

So a better method is to use do notation as described in the following section.

Monadic Combinator

In addition to <|> and <*> there is a >>= combinator, which is defined like this:

||| Sequence two grammars. If either consumes some input, the sequence is
||| guaranteed to consume some input. If the first one consumes input, the
||| second is allowed to be recursive (because it means some input has been
||| consumed and therefore the input is smaller)
export %inline
(>>=) : {c1 : Bool} ->
      Grammar tok c1 a ->
      inf c1 (a -> Grammar tok c2 b) ->
      Grammar tok (c1 || c2) b
(>>=) {c1 = False} = SeqEmpty
(>>=) {c1 = True} = SeqEat
This allows us to use do notation for the previous parenthesis example:
expr = intLiteral <|> do
       openParen
       r <- expr
       closeParen
       pure r

This can be more flexible than using the <*> operator. Also it is defined using Inf so we can implement recursively defined grammars as above.

Implementing the Arithmetic Operators

Now for the operations, in this case: +, - and *. The syntax we require for these infix operators would require recursive grammars like this:
expr = expr (op "+") expr

As already explained, do notation can allow us to use recursive definitions but not necessarily left-recursion like this.

In order to work up to this gradually lets start with prefix operators (sometimes known as Polish notation) then modify later for infix operators.

So prefix operators would have this sort of form:
expr = (add (op "+") expr expr)

where op is defined like this:

||| Matches if this is an operator token and string matches, that is,
||| it is the required type of operator.
op : String -> Rule Integer
op s = terminal (\x => case tok x of
                         (Operator s1) => if s==s1 then Just 0 else Nothing
                         _ => Nothing)

and add is defined like this:

Where:

  • x is the add operator.
  • y is the first operand.
  • z is the second operand.
addInt : Integer -> Integer -> Integer
addInt a b = a+b

export
add : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
add x y z = map addInt (x *> y) <*> z

The resulting integer will be the sum of the two operands.

The other operators are defined in a similar way:
subInt : Integer -> Integer -> Integer
subInt a b = a-b

export
sub : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
sub x y z = map subInt (x *> y) <*> z

multInt : Integer -> Integer -> Integer
multInt a b = a*b

export
mult : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
mult x y z = map multInt (x *> y) <*> z

So the top level Grammar can now be defined as follows. Note that this is partial as it is a potentially infinite structure and so not total.

partial
expr : Rule Integer
expr = (add (op "+") expr expr)
     <|> (sub (op "-") expr expr)
     <|> (mult (op "*") expr expr)
     <|> intLiteral <|> (paren expr)
To make the testing easier we can use this function:
partial
test : String -> Either (ParseError (TokenData ExpressionToken))
                (Integer, List (TokenData ExpressionToken))
test s = parse expr (fst (lex expressionTokens s))
First test a valid (prefix) expression:
*parserEx> test "+1*6(4)"
Right (25,
     []) : Either (ParseError (TokenData ExpressionToken))
            (Integer, List (TokenData ExpressionToken))

Then an invalid syntax:

*parserEx> test "))"
Left (Error "Unrecognised token"
          [MkToken 0 0 CParen,
           MkToken 0
                   (case fspan (\ARG =>
                                  not (intToBool (prim__eqChar ARG
                                                               '\n')))
                               ")" of
                      (incol, "") => c + cast (length incol)
                      (incol, b) => cast (length incol))
                   CParen]) : Either (ParseError (TokenData ExpressionToken))
                                     (Integer,
                                      List (TokenData ExpressionToken))

However if we try something that is invalid, but starts with a valid token, then it will return Right (to indicate success)

*parserEx> test "1))"
Right (1,
     [MkToken 0
              (case fspan (\ARG =>
                             not (intToBool (prim__eqChar ARG '\n')))
                          "1" of
                 (incol, "") => c + cast (length incol)
                 (incol, b) => cast (length incol))
              CParen,
      MkToken 0
              (case fspan (\ARG =>
                             not (intToBool (prim__eqChar ARG '\n')))
                          ")" of
                 (incol, "") => c + cast (length incol)
                 (incol, b) => cast (length incol))
              CParen]) : Either (ParseError (TokenData ExpressionToken))
                                (Integer,
                                 List (TokenData ExpressionToken))

Infix Notation

So far we have implemented a prefix notation for operators (like this: + expr expr) but the aim is to implemented an infix notation (like this: expr + expr). To do this we must be able to deal with potentially infinite data structures (see Codata Types here Types and Functions).

First alter the grammar to have infix operations:

addInt : Integer -> Integer -> Integer
addInt a b = a+b

export
add : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
add x y z = map addInt (x <* y) <*> z

subInt : Integer -> Integer -> Integer
subInt a b = a-b

export
sub : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
sub x y z = map subInt (x <* y) <*> z


multInt : Integer -> Integer -> Integer
multInt a b = a*b

export
mult : Grammar tok c1 Integer ->
    Grammar tok c2 Integer ->
    Grammar tok c3 Integer ->
    Grammar tok ((c1 || c2) || c3) Integer
mult x y z = map multInt (x <* y) <*> z

partial
expr : Rule Integer
expr = (add expr (op "+") expr)
     <|> (sub expr (op "-") expr)
     <|> (mult expr (op "*") expr)
     <|> intLiteral <|> (paren expr)

However, if this was run, the code would not terminate.

Implement Left Recursion Elimination

Left factoring, like this, is a general problem.

If we have a rule like this:

A -> (x<*>y) <|> (x<*>z)

If x<*>y fails but x<*>z would succeed a problem is that, x<*>y has already consumed x, so now x<*>z will fail.

so we could write code to backtrack. That is try x<*>y without consuming so that, if the first token succeeds but the following tokens fail, then the first tokens would not be consumed.

Another approach is left factoring:

Left Factoring

Replace the rule with two rules (that is we add a non-terminal symbol) so for example, instead of:

A -> (x<*>y) <|> (x<*>z)

we add an extra rule to give:

A -> x<*>N
N -> y <|> z

That is we convert a general context-free grammar to a LL(1) grammar. Although not every context-free grammar can be converted to a LL(1) grammar.

This still does not solve the infinite recursion issue and there is another problem: the precedence of the operators +, - and * is not explicit.

To resolve this we can alter the example to have a BNF like this: Where braces { ... } mean the contents can occur 1 or more times.
'expression' ::=  ('+'|'-') 'term' {('+'|'-') 'term'}

'term' ::=  'factor' {'*' 'factor'}

'factor' ::=
   number
   | '(' 'expression' ')'
expr : Rule Integer

factor : Rule Integer
factor = intLiteral <|> do
              openParen
              r <- expr
              closeParen
              pure r

term : Rule Integer
term = map multInt factor <*> (
        (op "*")
        *> factor)
     <|> factor

expr = map addInt term <*> (
        (op "+")
        *> term)
     <|> map subInt term <*> (
        (op "-")
        *> term)
     <|> term

calc : String -> Either (ParseError (TokenData ExpressionToken))
                      (Integer, List (TokenData ExpressionToken))
calc s = parse expr (fst (lex expressionTokens s))

lft : (ParseError (TokenData ExpressionToken)) -> IO ()
lft (Error s lst) = putStrLn ("error:"++s)

rht : (Integer, List (TokenData ExpressionToken)) -> IO ()
rht i = putStrLn ("right " ++ (show i))

main : IO ()
main = do
  putStr "alg>"
  x <- getLine
  either lft rht (calc x) -- eliminator for Either

As the following tests show, this now handles infix operators and precedence.

For example ‘*’ is an infix operator:
*algebraREPL> :exec
alg>2*3
right (6, [])
Check that atomic number literals work:
*algebraREPL> :exec
alg>2
right (2, [])
Check that ‘*’ has a higher precedence than ‘+’.
*algebraREPL> :exec
alg>2+3*4
right (14, [])
Also ‘*’ has a higher precedence than ‘+’ when the order is reversed.
*algebraREPL> :exec
alg>3*4+2
right (14, [])
Also precedence can be overridden by parenthesis.
*algebraREPL> :exec
alg>(2+3)*4
right (20, [])

This is still not correct because it does not correctly parse multiple sums or terms like 1+2+3 or 1*2*3.