« New ko-fi page: help me attend ICFP! » Competitive programming in Haskell: introduction to dynamic programming

Competitive programming in Haskell: parsing with an NFA

Posted on May 24, 2023
Tagged , , ,

In my previous post, I challenged you to solve Chemist’s Vows. In this problem, we have to decide which words can be made by concatenating atomic element symbols. So this is another parsing problem; but unlike the previous problem, element symbols are not prefix-free. For example, B and Be are both element symbols. So, if we see BE…, we don’t immediately know whether we should parse it as Be, or as B followed by an element that starts with E (such as Er).

A first try

A parsing problem, eh? Haskell actually shines in this area because of its nice parser combinator libraries. The Kattis environment does in fact have the parsec package available; and even on platforms that don’t have parsec, we can always use the Text.ParserCombinators.ReadP module that comes in base. So let’s try throwing one of those packages at the problem and see what happens!

If we try using parsec, we immediately run into problems; honestly, I don’t even know how to solve the problem using parsec. The problem is that <|> represents left-biased choice. If we parse p1 <|> p2 and parser p1 succeeds, then we will never consider p2. But for this parsing problem, because the symbols are not prefix-free, sometimes we can’t know which of two options we should have picked until later.

ReadP, on the other hand, explicitly has both biased and unbiased choice operators, and can return a list of possible parses instead of just a single parse. That sounds promising! Here’s a simple attempt using ReadP: to parse a single element, we use an unbiased choice over all the element names; then we use many parseElement <* eof to parse each word, and check whether there are any successful parses at all.

{-# LANGUAGE OverloadedStrings #-}

import           Control.Arrow
import           Data.Bool
import qualified Data.ByteString.Lazy.Char8   as C
import           Text.ParserCombinators.ReadP (ReadP, choice, eof, many,
                                               readP_to_S, string)

main = C.interact $
  C.lines >>> drop 1 >>> map (solve >>> bool "NO" "YES") >>> C.unlines

solve :: C.ByteString -> Bool
solve s = case readP_to_S (many parseElement <* eof) (C.unpack s) of
  [] -> False
  _  -> True

elements :: [String]
elements = words $
  "h he li be b c n o f ne na mg al si p s cl ar k ca sc ti v cr mn fe co ni cu zn ga ge as se br kr rb sr y zr nb mo tc ru rh pd ag cd in sn sb te i xe cs ba hf ta w re os ir pt au hg tl pb bi po at rn fr ra rf db sg bh hs mt ds rg cn fl lv la ce pr nd pm sm eu gd tb dy ho er tm yb lu ac th pa u np pu am cm bk cf es fm md no lr"

parseElement :: ReadP String
parseElement = choice (map string elements)

Unfortunately, this fails with a Time Limit Exceeded error (it takes longer than the allotted 5 seconds). The problem is that backtracking and trying every possible parse like this is super inefficient. One of the secret test inputs is almost cerainly constructed so that there are an exponential number of ways to parse some prefix of the input, but no way to parse the entire thing. As a simple example, the string crf can be parsed as either c rf (carbon + rutherfordium) or cr f (chromium + fluorine), so by repeating crf \(n\) times we can make a string of length \(3n\) which has \(2^n\) different parses. If we fed this string to the ReadP solution above, it would quickly succeed with more or less the first thing that it tried. However, if we stick a letter on the end that does not occur in any element symbol (such as q), the result will be an unparseable string, and the ReadP solution will spend a very long time backtracking through exponentially many parses that all ultimately fail.

Solution

The key insight is that we don’t really care about all the different possible parses; we only care whether the given string is parseable at all. At any given point in the string, there are only two possible states we could be in: we could be finished reading one element symbol and about to start reading the next one, or we could be in the middle of reading a two-letter element symbol. We can just scan through the string and keep track of the set of (at most two) possible states; in other words, we will simulate an NFA which accepts the language of strings composed of element symbols.

First, some setup as before.

{-# LANGUAGE OverloadedStrings #-}

import           Control.Arrow              ((>>>))
import           Data.Array                 (Array, accumArray, (!))
import           Data.Bool                  (bool)
import qualified Data.ByteString.Lazy.Char8 as C
import           Data.List                  (partition, nub)
import           Data.Set                   (Set)
import qualified Data.Set                   as S

main = C.interact $
  C.lines >>> drop 1 >>> map (solve >>> bool "NO" "YES") >>> C.unlines

elements :: [String]
elements = words $
  "h he li be b c n o f ne na mg al si p s cl ar k ca sc ti v cr mn
fe co ni cu zn ga ge as se br kr rb sr y zr nb mo tc ru rh pd ag cd
in sn sb te i xe cs ba hf ta w re os ir pt au hg tl pb bi po at rn
fr ra rf db sg bh hs mt ds rg cn fl lv la ce pr nd pm sm eu gd tb dy
ho er tm yb lu ac th pa u np pu am cm bk cf es fm md no lr"

Now, let’s split the element symbols into one-letter and two-letter symbols:

singles, doubles :: [String]
(singles, doubles) = partition ((==1).length) elements

We can now make boolean lookup arrays that tell us whether a given letter occurs as a single-letter element symbol (single) and whether a given letter occurs as the first letter of a two-letter symbol (lead). We also make a Set of all two-letter element symbols, for fast lookup.

mkAlphaArray :: [Char] -> Array Char Bool
mkAlphaArray cs = accumArray (||) False ('a', 'z') (zip cs (repeat True))

single, lead :: Array Char Bool
[single, lead] = map (mkAlphaArray . map head) [singles, doubles]

doubleSet :: Set String
doubleSet = S.fromList doubles

Now for simulating the NFA itself. There are two states we can be in: START means we are about to start and/or have just finished reading an element symbol; SEEN c means we have seen the first character of some element (c) and are waiting to see another.

data State = START | SEEN Char
  deriving (Eq, Ord, Show)

Our transition function takes a character c and a state and returns a set of all possible next states (we just use a list since these sets will be very small). If we are in the START state, we could end up in the START state again if c is a single-letter element symbol; we could also end up in the SEEN c state if c is the first letter of any two-letter element symbol. On the other hand, if we are in the SEEN x state, then we have to check whether xc is a valid element symbol; if so, we return to START.

delta :: Char -> State -> [State]
delta c START    = [START | single!c] ++ [SEEN c | lead!c]
delta c (SEEN x) = [START | [x,c] `S.member` doubleSet]

We can now extend delta to act on a set of states, giving us the set of all possible resulting states; the drive function then iterates this one-letter transition over an entire input string. Finally, to solve the problem, we start with the singleton set [START], call drive using the input string, and check whether START (which is also the only accepting state) is an element of the resulting set of states.

trans :: Char -> [State] -> [State]
trans c sts = nub (sts >>= delta c)

drive :: C.ByteString -> ([State] -> [State])
drive = C.foldr (\c -> (trans c >>>)) id

solve :: C.ByteString -> Bool
solve s = START `elem` drive s [START]

And that’s it! This solution is accepted in 0.27 seconds (out of a maximum allowed 5 seconds).

For next time