Competitive programming in Haskell: parsing with an NFA
In my previous post, I challenged you to solve Chemist’s Vows. In this problem, we have to decide which words can be made by concatenating atomic element symbols. So this is another parsing problem; but unlike the previous problem, element symbols are not prefix-free. For example, B
and Be
are both element symbols. So, if we see BE…
, we don’t immediately know whether we should parse it as Be
, or as B
followed by an element that starts with E
(such as Er
).
A first try
A parsing problem, eh? Haskell actually shines in this area because of its nice parser combinator libraries. The Kattis environment does in fact have the parsec
package available; and even on platforms that don’t have parsec
, we can always use the Text.ParserCombinators.ReadP
module that comes in base
. So let’s try throwing one of those packages at the problem and see what happens!
If we try using parsec
, we immediately run into problems; honestly, I don’t even know how to solve the problem using parsec
. The problem is that <|>
represents left-biased choice. If we parse p1 <|> p2
and parser p1
succeeds, then we will never consider p2
. But for this parsing problem, because the symbols are not prefix-free, sometimes we can’t know which of two options we should have picked until later.
ReadP
, on the other hand, explicitly has both biased and unbiased choice operators, and can return a list of possible parses instead of just a single parse. That sounds promising! Here’s a simple attempt using ReadP
: to parse a single element, we use an unbiased choice
over all the element names; then we use many parseElement <* eof
to parse each word, and check whether there are any successful parses at all.
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import Data.Bool
import qualified Data.ByteString.Lazy.Char8 as C
import Text.ParserCombinators.ReadP (ReadP, choice, eof, many,
readP_to_S, string)
main = C.interact $
C.lines >>> drop 1 >>> map (solve >>> bool "NO" "YES") >>> C.unlines
solve :: C.ByteString -> Bool
solve s = case readP_to_S (many parseElement <* eof) (C.unpack s) of
[] -> False
_ -> True
elements :: [String]
elements = words $
"h he li be b c n o f ne na mg al si p s cl ar k ca sc ti v cr mn fe co ni cu zn ga ge as se br kr rb sr y zr nb mo tc ru rh pd ag cd in sn sb te i xe cs ba hf ta w re os ir pt au hg tl pb bi po at rn fr ra rf db sg bh hs mt ds rg cn fl lv la ce pr nd pm sm eu gd tb dy ho er tm yb lu ac th pa u np pu am cm bk cf es fm md no lr"
parseElement :: ReadP String
parseElement = choice (map string elements)
Unfortunately, this fails with a Time Limit Exceeded error (it takes longer than the allotted 5 seconds). The problem is that backtracking and trying every possible parse like this is super inefficient. One of the secret test inputs is almost cerainly constructed so that there are an exponential number of ways to parse some prefix of the input, but no way to parse the entire thing. As a simple example, the string crf
can be parsed as either c rf
(carbon + rutherfordium) or cr f
(chromium + fluorine), so by repeating crf
\(n\) times we can make a string of length \(3n\) which has \(2^n\) different parses. If we fed this string to the ReadP
solution above, it would quickly succeed with more or less the first thing that it tried. However, if we stick a letter on the end that does not occur in any element symbol (such as q
), the result will be an unparseable string, and the ReadP
solution will spend a very long time backtracking through exponentially many parses that all ultimately fail.
Solution
The key insight is that we don’t really care about all the different possible parses; we only care whether the given string is parseable at all. At any given point in the string, there are only two possible states we could be in: we could be finished reading one element symbol and about to start reading the next one, or we could be in the middle of reading a two-letter element symbol. We can just scan through the string and keep track of the set of (at most two) possible states; in other words, we will simulate an NFA which accepts the language of strings composed of element symbols.
First, some setup as before.
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow ((>>>))
import Data.Array (Array, accumArray, (!))
import Data.Bool (bool)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.List (partition, nub)
import Data.Set (Set)
import qualified Data.Set as S
main = C.interact $
C.lines >>> drop 1 >>> map (solve >>> bool "NO" "YES") >>> C.unlines
elements :: [String]
elements = words $
"h he li be b c n o f ne na mg al si p s cl ar k ca sc ti v cr mn
fe co ni cu zn ga ge as se br kr rb sr y zr nb mo tc ru rh pd ag cd
in sn sb te i xe cs ba hf ta w re os ir pt au hg tl pb bi po at rn
fr ra rf db sg bh hs mt ds rg cn fl lv la ce pr nd pm sm eu gd tb dy
ho er tm yb lu ac th pa u np pu am cm bk cf es fm md no lr"
Now, let’s split the element symbols into one-letter and two-letter symbols:
singles, doubles :: [String]
(singles, doubles) = partition ((==1).length) elements
We can now make boolean lookup arrays that tell us whether a given letter occurs as a single-letter element symbol (single
) and whether a given letter occurs as the first letter of a two-letter symbol (lead
). We also make a Set
of all two-letter element symbols, for fast lookup.
mkAlphaArray :: [Char] -> Array Char Bool
mkAlphaArray cs = accumArray (||) False ('a', 'z') (zip cs (repeat True))
single, lead :: Array Char Bool
[single, lead] = map (mkAlphaArray . map head) [singles, doubles]
doubleSet :: Set String
doubleSet = S.fromList doubles
Now for simulating the NFA itself. There are two states we can be in: START
means we are about to start and/or have just finished reading an element symbol; SEEN c
means we have seen the first character of some element (c
) and are waiting to see another.
data State = START | SEEN Char
deriving (Eq, Ord, Show)
Our transition function takes a character c
and a state and returns a set of all possible next states (we just use a list since these sets will be very small). If we are in the START
state, we could end up in the START
state again if c
is a single-letter element symbol; we could also end up in the SEEN c
state if c
is the first letter of any two-letter element symbol. On the other hand, if we are in the SEEN x
state, then we have to check whether xc
is a valid element symbol; if so, we return to START
.
delta :: Char -> State -> [State]
delta c START = [START | single!c] ++ [SEEN c | lead!c]
delta c (SEEN x) = [START | [x,c] `S.member` doubleSet]
We can now extend delta
to act on a set of states, giving us the set of all possible resulting states; the drive
function then iterates this one-letter transition over an entire input string. Finally, to solve the problem, we start with the singleton set [START]
, call drive
using the input string, and check whether START
(which is also the only accepting state) is an element of the resulting set of states.
trans :: Char -> [State] -> [State]
trans c sts = nub (sts >>= delta c)
drive :: C.ByteString -> ([State] -> [State])
drive = C.foldr (\c -> (trans c >>>)) id
solve :: C.ByteString -> Bool
solve s = START `elem` drive s [START]
And that’s it! This solution is accepted in 0.27 seconds (out of a maximum allowed 5 seconds).