« Typeclassopedia v2: moving my efforts to the Haskell wiki » Announcing diagrams 0.5

Parsing context-sensitive languages with Applicative

Posted on January 5, 2012
Tagged , , , , , ,

Many parser combinator libraries in Haskell (such as parsec) have both a Monad interface as well as an Applicative interface. (Actually, to be really useful, you also need MonadPlus along with Monad, or Alternative along with Applicative, in order to encode choice; from now on when I talk about Monad and Applicative note that I really have MonadPlus or Alternative in mind as well.) The Applicative interface is often nicer, but it is less powerful than the Monad interface: in particular, using Applicative you can only parse context-free languages, whereas Monad lets you parse arbitrary context-sensitive languages. Intuitively, this is because the structure of Applicative computations cannot depend on intermediate results, whereas Monad computations allow you to choose which computation to run next based on intermediate results.

This is a good bit of intuition, with only one minor caveat: it isn’t true! I believe it was two years ago, during the second Hac phi, when I first learned from Edward Kmett how Applicative (by which I mean, of course, Alternative) can be used to parse arbitrary context-sensitive languages. The question just came up again in the #haskell IRC channel, and I figured it would be useful to have this all written down somewhere. In particular, Reid Barton gave a nice sketch which I decided to turn into some working code.

Here’s the key insight: normally, grammars are defined as finite objects: a finite set of terminals, a finite set of nonterminals, and a finite set of productions. However, Haskell’s general recursion means that we can write down a "grammar" with an infinite set of production rules. This is what lets us get away with parsing context-sensitive languages with Applicative: we just make a different production rule for every possible input!

First, some imports. Notice that I do not import Control.Monad.

> import Text.Parsec
> import Text.Parsec.String
> import Control.Arrow ((&&&))
> import Control.Applicative hiding ((<|>))
> import Data.List (group)

The usual guard function is for MonadPlus, but we can make something equivalent for Alternative.

> guard' :: Alternative f => Bool -> f ()
> guard' True  = pure ()
> guard' False = empty

And now for the meat of the example. parseArbitrary takes an arbitrary predicate on Strings built from lowercase letters and turns it into a parser. The created parser will accept Strings for which the predicate evaluates to True (returning ()) and fail for any other string.

> parseArbitrary :: (String -> Bool) -> Parser ()
> parseArbitrary p =

If we encounter eof, we simply ensure that the predicate holds of the empty string.

>       (eof <* guard' (p [])) 

Otherwise, we choose between 26 alternatives based on the next character in the input. If the character c is encountered, we make a recursive call to parseArbitrary (p . (c:)). The remainder of the input must satisfy (p . (c:)), that is, it must consist of some String s such that (c:s) satisfies the predicate p.

>   <|> foldr (<|>) parserZero 
>         (map (\c -> char c *> 
>                     parseArbitrary (p . (c:))
>              ) 
>              ['a'..'z']
>         )

For any given predicate p, you can think of parseArbitrary p as an infinite tree with a 26-way branch at each node. Each node "remembers" the path taken to reach it from the root of the tree, in the form of prepend functions composed with the original predicate. We have constructed an infinite grammar: each node in the tree corresponds to a production, one for every possible input prefix.

Let’s try it out. Here’s a function which only accepts Strings of the form "aaabbbccc", with an equal number of a’s, b’s, and c’s. This is a well-known example of a language which is not context-free (easily shown using the pumping lemma for context-free languages).

> f :: String -> Bool
> f s 
>   | [('a',na), ('b',nb), ('c',nc)] 
>     <- map (head &&& length). group $ s
> 
>     = na == nb && nb == nc
> 
>   | otherwise = False

Now we make f into a parser and test it on some example inputs:

> p = parseArbitrary f
> 
> main = do
>   parseTest p "aaa"
>   parseTest p "aaabbbcc"
>   parseTest p "aaaabcccc"
>   parseTest p "aaaaabbbbbccccc"

The last test succeeds by returning (). For the first three, we get error messages like this:

parse error at (line 1, column 4):
unexpected end of input
expecting "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y" or "z"

Obviously, these are not very helpful. But what were you expecting? After all, this is one of those things that is interesting in theory, but in practice amounts to an awful hack that no one would ever want to use in real code.

In the end, it’s still true that Applicative can only parse context-free languages as long as we restrict ourselves to finite grammars—which any sensible programmer would do anyway.

[ETA: it looks like using infinite grammars is not as impractical as I thought—see the comments below!]