« Lightweight, efficiently sampleable enumerations in Haskell » Lightweight invertible enumerations in Haskell

Competitive Programming in Haskell: Scanner

Posted on May 22, 2019
Tagged , , ,

In my previous post I explored solving a simple competitive programming problem in Haskell. The input of the problem just consisted of a bunch of lines containing specific data, so that we could parse it using lines and words. There is another common class of problems, however, which follow this pattern:

The first line of the input consists of an integer \(T\). Each of the next \(T\) lines consists of…

That is, the input contains integers which are not input data per se but just tell you how many things are to follow. This is really easy to process in an imperative language like Java or C++. For example, in Java we might write code like this:

Scanner in = new Scanner(System.in);
int T = in.nextInt();
for (int i = 0; i < T; i++) {
   // process each line
}

Occasionally, we can get away with completely ignoring the extra information in Haskell. For example, if the input consists of a number \(T\) followed by \(T\) lines, each of which contains a number \(n\) followed by a list of \(n\) numbers, we can just write

main = interact $
  lines >>> drop 1 >>> map (words >>> drop 1 >>> map read) >>> ...

That is, we can ignore the first line containing \(T\) since the end-of-file will tell us how many lines there are; and we can ignore the \(n\) at the beginning of each line, since the newline character tells us when the list on that line is done.

Sometimes, however, this isn’t possible, especially when there are multiple test cases, or when a single test case has multiple parts, each of which can have a variable length. For example, consider Popular Vote, which describes its input as follows:

The first line of input contains a single positive integer \(T \leq 500\) indicating the number of test cases. The first line of each test case also contains a single positive integer \(n\) indicating the number of candidates in the election. This is followed by \(n\) lines, with the \(i\)th line containing a single nonnegative integer indicating the number of votes candidate \(i\) received.

How would we parse this? We could still ignore \(T\)—just keep reading until the end of the file—but there’s no way we can ignore the \(n\) values. Since the values for each test case are all on separate lines instead of on one line, there’s otherwise no way to know when one test case ends and the next begins.

Once upon a time, I would have done this using splitAt and explicit recursion, like so:

type Election = [Int]

readInput :: String -> [Election]
readInput = lines >>> drop 1 {- ignore T -} >>> map read >>> go
  where
    go :: [Int] -> [Election]
    go []     = []
    go (n:xs) = votes : go rest
      where (votes,rest) = splitAt n xs

However, this is really annoying to write and easy to get wrong. There are way too many variable names to keep track of (n, xs, votes, rest, go) and for more complex inputs it becomes simply unmanageable. You might think we should switch to using a real parser combinator library—parsec is indeed installed in the environment Kattis uses to run Haskell solutions—and although sometimes a full-blown parser combinator library is needed, in this case it’s quite a bit more heavyweight than we would like. I can never remember which modules I have to import to get parsec set up; there’s a bunch of boilerplate needed to set up a lexer; and so on. Using parsec is only worth it if we’re parsing something really complex.

Scanner

The heart of the issue is that we want to be able to specify a high-level description of the sequence of things we expect to see in the input, without worrying about managing the stream of tokens explicitly. Another key insight is that 99% of the time, we don’t need the ability to deal with parse failure or the ability to parse multiple alternatives. With these insights in mind, we can create a very simple Scanner abstraction, which is just a Stateful computation over a list of tokens:

type Scanner = State [String]

runScanner :: Scanner a -> String -> a
runScanner s = evalState s . words

To run a scanner, we just feed it the entire input as a String, which gets chopped into tokens using words. (Of course in some scenarios we might want to use lines instead of words, or even do more complex tokenization.)

Note since Scanner is just a type synonym for State [String], it is automatically an instance of Functor, Applicative, and Monad (but not Alternative).

So let’s develop a little Scanner DSL. The most fundamental thing we can do is read the next token.

str :: Scanner String
str = get >>= \case { s:ss -> put ss >> return s }

(This uses the LambdaCase extension, though we could easily rewrite it without.) str gets the current list of tokens, puts it back without the first token, and returns the first token. Note that I purposely didn’t include a case for the empty list. You might think we want to include a case for the empty token list and have it return the empty string or something like that. But since the input will always be properly formatted, if this scenario ever happens it means my program has a bug—e.g. perhaps I misunderstood the description of the input format. In this scenario I want it to crash loudly, as soon as possible, rather than continuing on with some bogus data.

We can now add some scanners for reading specific token types other than String, simply by mapping the read function over the output of str:

int :: Scanner Int
int = read <$> str

integer :: Scanner Integer
integer = read <$> str

double :: Scanner Double
double = read <$> str

Again, these will crash if they see a token in an unexpected format, and that is a very deliberate choice.

Now, as I explained earlier, a very common pattern is to have an integer \(n\) followed by \(n\) copies of something. So let’s make a combinator to encapsulate that pattern:

numberOf :: Scanner a -> Scanner [a]
numberOf s = int >>= flip replicateM s

numberOf s expects to first see an Int value \(n\), and then it runs the provided scanner \(n\) times, returning a list of the results.

It’s also sometimes useful to have a way to repeat a Scanner some unknown number of times until encountering EOF (for example, the input for some problems doesn’t specify the number of test cases up front the way that Popular Vote does). This is similar to the many combinator from Alternative.

many :: Scanner a -> Scanner [a]
many s = get >>= \case { [] -> return []; _ -> (:) <$> s <*> many s }

many s repeats the scanner s as many times as it can, returning a list of the results. In particular it first peeks at the current token list to see if it is empty. If so, it returns the empty list of results; if there are more tokens, it runs s once and then recursively calls many s, consing the results together.

Finally, it’s quite common to want to parse a specific small number of something, e.g. two double values representing a 2D coordinate pair. We could just write replicateM 2 double, but this is common enough that I find it helpful to define dedicated combinators with short names:

two, three, four :: Scanner a -> Scanner [a]
[two, three, four] = map replicateM [2..4]

The complete file can be found on GitHub. As I continue this series I’ll be putting more code into that repository. Note I do not intend to make this into a Hackage package, since that wouldn’t be useful: you can’t tell Kattis to go download a package from Hackage before running your submission. However, it is possible to submit multiple files at once, so you can include Scanner.hs in your submission and just import Scanner at the top of your main module.

Examples

So what have we gained? Writing the parser for Popular Vote is now almost trivial:

type Election = [Int]

main = interact $ runScanner elections >>> ...

elections :: Scanner [Election]
elections = numberOf (numberOf int)

In practice I would probably just inline the definition of elections directly: interact $ runScanner (numberOf (numberOf int)) >>> …

As a slightly more involved example, chosen almost at random, consider Board Wrapping:

On the first line of input there is one integer, \(N \leq 50\), giving the number of test cases (moulds) in the input. After this line, \(N\) test cases follow. Each test case starts with a line containing one integer \(n, 1 \leq n \leq 600\), which is the number of boards in the mould. Then \(n\) lines follow, each with five floating point numbers \(x,y,w,h,v\) where \(0 \leq x,y,w,h \leq 10000\) and \(-90^{\circ} &lt; v \leq 90^{\circ}\). The \(x\) and \(y\) are the coordinates of the center of the board and \(w\) and \(h\) are the width and height of the board, respectively. \(v\) is the angle between the height axis of the board to the \(y\)-axis in degrees, positive clockwise.

Here’s how I would set up the input, using Scanner and a custom data type to represent boards.

import Scanner

type V = [Double]     -- 2D vectors/points
newtype A = A Double  -- angle (radians)
                      -- newtype helps avoid conversion errors

fromDeg :: Double -> A
fromDeg d = A (d * pi / 180)

data Board = Board { boardLoc :: V, boardDims :: V, boardAngle :: A }

board :: Scanner Board
board = Board
  <$> two double
  <*> two double
  <*> ((fromDeg . negate) <$> double)

main = interact $
  runScanner (numberOf (numberOf board)) >>> ...