19 Parsing Structured Text

19.1 Simple Parsers	365
19.2 Parsing Structured Values	370
19.3 Implementing a Calculator	375
19.4 Parser Debugging and Error Reporting	380

> def parser[T: P] = P(
    ("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
  )

> fastparse.parse("hello seattle", parser(using _))
res0: fastparse.Parsed[(String, String)] = Success(
  value = ("hello", "seattle"),
  index = 13
)

> fastparse.parse("hello     world", parser(using _))
res1: fastparse.Parsed[(String, String)] = Success(
  value = ("hello", "world"),
  index = 15
)
19.1.scala

Snippet 19.1: parsing simple text formats using the FastParse library

One common programming task is parsing structured text. This chapter will introduce how to parse text in Scala using the FastParse library, before diving into an example where we write a simple arithmetic parser in Scala. This will allow you to work competently with unusual data formats, query languages, or source code for which you do not already have an existing parser at hand.

We will build upon the parsing techniques learned in this chapter as part of Chapter 20: Implementing a Programming Language.

19.1 Simple Parsers

For this chapter, we will include the FastParse library as a dependency via Scala CLI. The simplest FastParse parser is shown below:

$ scala --import com.lihaoyi::fastparse:3.1.1 --repl

> import fastparse.*, NoWhitespace.*

> def parser[T: P] = P( "hello" )
19.2.scala

Here, we import the fastparse library and define a parser with the P(...) function and [T: P] context bound. Every FastParse parser must be defined with these, and in this case our def parser parses a single string, "hello", and nothing else. You can use this parser by calling fastparse.parse on it:

> fastparse.parse("hello", parser(using _))
res2: fastparse.Parsed[Unit] = Success(value = (), index = 5)

> fastparse.parse("goodbye", parser(using _))
res3: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "goodbye")
19.3.scala

We can see that parsing "hello" succeeded, returning () (which is Unit, meaning "no value" in Scala) and parsing until index 5. On the other hand, trying to parse "goodbye" failed at row 1 character 1: the first character. We can make it print out what it expected by asking for the trace.

> val Parsed.Failure(expected, index, extra) =
    fastparse.parse("goodbye", parser(using _)).runtimeChecked

> println(extra.trace().longMsg)
Expected parser:1:1 / "hello":1:1, found "goodbye"
19.4.scala

The .trace().longMsg tells us that it expected "hello" but instead found "goodbye". Asking for the .trace().msg or .trace().longMsg gives a more detailed error message, but at a cost: FastParse needs to re-parse the entire input with additional instrumentation. This takes additional time, and so is not done by default unless you ask for it.

While in the above example we pattern-matched using val Parsed.Failure(...) = since we know the parse failed, in production code you would handle the result of a parse using a match block as below:

fastparse.parse("goodbye", parser(using _)) match
  case Parsed.Success(value, index) => ...
  case Parsed.Failure(expected, index, extra) => ...
19.5.scala

You can then handle the error however you like: logging it, showing it to the user, etc.

19.1.1 Partial & Incomplete Parses

fastparse.parse also fails if there's not enough input to parse:

> fastparse.parse("hel", parser(using _))
res4: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "hel")
19.6.scala

On the other hand, if there's too much input, fastparse.parse succeeds but with the index showing how much it actually parsed:

> fastparse.parse("hellogoodbye", parser(using _))
res5: fastparse.Parsed[Unit] = Success(value = (), index = 5)
19.7.scala

If we want to catch the case where we didn't completely parse the input, it's straightforward to compare the success index (5) against the length of the input string (12) to see whether or not the input was fully consumed. We can also use the End operator, which we'll discuss later in Sequence Parsers (19.1.3).

19.1.2 Alternative Parsers

a | b can parse anything a or b can parse, tried left-to-right (i.e. a gets priority, and b is only tried if a fails):

> def parser[T: P] = P( "hello" | "goodbye" )

> fastparse.parse("hello", parser(using _))
res6: fastparse.Parsed[Unit] = Success(value = (), index = 5)

> fastparse.parse("goodbye", parser(using _))
res7: fastparse.Parsed[Unit] = Success(value = (), index = 7)

> fastparse.parse("dunno", parser(using _))
res8: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "dunno")
19.8.scala

As you can see, parsing either "hello" or "goodbye" works, but parsing "dunno" makes it fail. Again, you can call .trace().longMsg for info:

> val Parsed.Failure(expected, index, extra) =
    fastparse.parse("dunno", parser(using _)).runtimeChecked

> println(extra.trace().longMsg)
Expected parser:1:1 / ("hello" | "goodbye"):1:1, found "dunno"
19.9.scala

Above, we can see that it expected "hello" | "goodbye" at row 1 character 1 but instead found a "dunno". You can chain more alternatives together with |, and FastParse will try all of them left-to-right.

19.1.3 Sequence Parsers

a ~ b parses a followed by b:

> def parser[T: P] = P( "hello" ~ "goodbye" )

> fastparse.parse("hellogoodbye", parser(using _))
res9: fastparse.Parsed[Unit] = Success(value = (), index = 12)

> fastparse.parse("hello", parser(using _))
res10: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "")

> fastparse.parse("goodbye", parser(using _))
res11: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "goodbye")
19.10.scala

The parser above parses "hellogoodbye" all at once.

If you give it just "hello", it fails since it's looking for "goodbye" at row 1 character 6, but found no input since it had reached the end of the string
If you try to parse just "goodbye", it fails since it's looking for "hello" at row 1 character 1, but instead found "goodbye"

Like |, ~ is chainable: you can ~ together as many parsers as you want in sequence and they'll each run one after the other. If you want to ensure your parser is able to consume all the given input, you can end with a ~ End parser. This will fail the parse if there is input left over:

> def parser[T: P] = P( "hello" ~ "goodbye" ~ End )

> fastparse.parse("hellogoodbye", parser(using _))
res12: fastparse.Parsed[Unit] = Success(value = (), index = 12)

> fastparse.parse("hellogoodbyeworld", parser(using _))
res13: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:13, found "world")

> val Parsed.Failure(msg, idx, extra) =
    fastparse.parse("hellogoodbyeworld", parser(using _)).runtimeChecked

> extra.trace().longMsg
res14: String = "Expected parser:1:1 / end-of-input:1:13, found \"world\""
19.11.scala

Here, parsing "hellogoodbyeworld" failed because it was expecting to have parsed until the End of the input string, but instead it found more characters left over. This is something you usually only want to do at the "end" of the parser when you know nothing should be left over, and saves you from always needing to check if the success-index and input-length line up.

19.1.4 Combining Alternatives with Sequences

You can combine | and ~, which is when things start getting interesting:

> def parser[T: P] = P(("hello" | "goodbye") ~ " " ~ ("world" | "seattle") ~ End)

This passes on the following inputs:

> fastparse.parse("hello world", parser(using _))
res15: fastparse.Parsed[Unit] = Success(value = (), index = 11)

> fastparse.parse("hello seattle", parser(using _))
res16: fastparse.Parsed[Unit] = Success(value = (), index = 13)

> fastparse.parse("goodbye world", parser(using _))
res17: fastparse.Parsed[Unit] = Success(value = (), index = 13)
19.12.scala

And fails on the following:

> fastparse.parse("hello universe", parser(using _)) // Not "world" or "seattle"
res18: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:7, found "universe")

> fastparse.parse("helloworld", parser(using _)) // Missing required " " blank space
res19: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "world")

> fastparse.parse("hello  world", parser(using _)) // Too many blank spaces
res20: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:7, found " world")

> fastparse.parse("i love seattle", parser(using _)) // Not a hello or goodbye
res21: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "i love sea")

> fastparse.parse("hello seattle moo", parser(using _)) // unexpected extra text
res22: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:14, found " moo")
19.13.scala

If you want more verbose error logging for why the parse failed, you can ask for the .trace().longMsg as we did earlier.

We now have a FastParse parser which is about as powerful as the regex (hello|goodbye) (world|seattle), though with more informative errors when the parse fails.

19.1.5 Repeated Parsers

You can call .rep(n) on any parser to repeat it:

> def parser[T: P] = P(
    ("hello" | "goodbye") ~ " ".rep(1) ~ ("world" | "seattle") ~ End
  )

> fastparse.parse("hello world", parser(using _))
res23: fastparse.Parsed[Unit] = Success(value = (), index = 11)

> fastparse.parse("hello     world", parser(using _))
res24: fastparse.Parsed[Unit] = Success(value = (), index = 15)

> fastparse.parse("helloworld", parser(using _))
res25: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "world")
19.14.scala

Here, .rep(1) means it repeats the " " parser at least once.

If we have one or more spaces between the "hello" and "world", the parser consumes all of them
If we have no spaces at all, it fails with a message that it was looking for a " " but instead found a "world" at character 6

You can also pass in explicit min=..., max=... arguments if you want to bound it to a particular range, or exactly=... if you want it to repeat exactly N times.

19.1.6 Optional Parsers

Marking a parser as optional is done using .?:

> def parser[T: P] = P(
    ("hello" | "goodbye") ~ (" ".rep(1) ~ ("world" | "seattle")).? ~ End
  )
19.15.scala

Here, you can see that the " ".rep(1) parser as well as the ("world" | "seattle") parser are all optional. That means that the parser lets you include or omit the entire " world" suffix:

> fastparse.parse("hello     world", parser(using _))
res26: fastparse.Parsed[Unit] = Success(value = (), index = 15)

> fastparse.parse("hello", parser(using _))
res27: fastparse.Parsed[Unit] = Success(value = (), index = 5)
19.16.scala

If a blank space " " is present, the parser expects "world" or "seattle" to follow:

> fastparse.parse("hello ", parser(using _))
res28: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found " ")
19.17.scala

This is because when we try to parse "hello " with a trailing space:

" ".rep(1) succeeds in parsing the space
("world" | "seattle") then fails since there's no more input after the space.
Since the whole trailing-space-world is optional, the parse backtracks to character 6 just after "hello"
FastParse then tries to continue the parse without the optional portion, now expecting to see the End of input
Since it isn't the end, with one more " " at index 6, the parse fails

This last parser is similar to the regex (hello|goodbye)( +(world|seattle))? The next few sections of this chapter will cover features of the FastParse library that go beyond what a regex is capable of.

19.2 Parsing Structured Values

So far, all our parsers have been of type Parser[Unit], returning () in the Success result. Unit in Scala means "no value": i.e. these parsers parse the input, check that it matches what the parser expects, but doesn't return any concrete value representing the parsed input. This is the default because most parsers don't care about most of the things they're parsing. For example, If you are parsing Java you don't care about all the whitespace, all the { or }s, (s or )s, ,s or ;s, // comments or /**/ comments. FastParse thus avoids building data structures that will ultimately be ignored or discarded.

19.2.1 Capturing Strings

In the cases where we care about the text we're parsing, we must capture it using the .! operator:

> def parser[T: P] = P(
    ("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
  )

> fastparse.parse("hello seattle", parser(using _))
res29: fastparse.Parsed[(String, String)] = Success(
  value = ("hello", "seattle"),
  index = 13
)

> fastparse.parse("hello     world", parser(using _))
res30: fastparse.Parsed[(String, String)] = Success(
  value = ("hello", "world"),
  index = 15
)
19.18.scala

As you can see, we added .! to capture both the ("hello" | "goodbye") as well as the ("world" | "seattle"), and we successfully parsed it to get the results as a tuple containing the two strings. We did not .! capture the one-or-more-spaces parser in the middle, so the spaces it parsed did not appear in the output.

19.2.2 Parsing Case Classes

Often we don't want tuples of strings, and would prefer some kind of object with named fields that contains the data we want, such as an instance of a case class. Below we define such a case class Phrase to represent our phrases, and use .map to transform the parsed tuple of (String, String) into a Phrase:

> case class Phrase(isHello: Boolean, place: String)

> def parser[T: P]: P[Phrase] = P(
    ("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
  ).map:
    case ("hello", place) => Phrase(true, place)
    case ("goodbye", place) => Phrase(false, place)

> val Parsed.Success(result, index) =
    fastparse.parse("goodbye   seattle", parser(using _)).runtimeChecked

> result.isHello
res31: Boolean = false

> result.place
res32: String = "seattle"
19.19.scala

On successful parses, we thus get a Phrase object we can work with. We can then conveniently refer to parts of the parsed data via .isHello or .place, rather than indexing via (0) and (1) we would need if we were working directly with tuples. This is much less error-prone than using a regex with capturing groups to retrieve portions of parsed string by index (result.group(0), result.group(1), ...).

19.2.3 Modularizing and Typing our Parsers

The above example is getting a bit long, but it is easy to break it up. We define smaller parsers using the same def foo[T: P] = P(...) syntax, and make use of them in the "main" parser:

> def prefix[T: P]: P[String] = P( "hello" | "goodbye" ).!

> def suffix[T: P]: P[String] = P( "world" | "seattle" ).!

> def ws[T: P]: P[Unit] = P( " ".rep(1) ) // white-space

> def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
    case ("hello", place) => Phrase(true, place)
    case ("goodbye", place) => Phrase(false, place)

> val Parsed.Success(result, index) =
    fastparse.parse("goodbye   world", parser(using _)).runtimeChecked

val result: Phrase = Phrase(isHello = false, place = "world")
val index: Int = 15
19.20.scala

Here we can see that the individual prefix and suffix parsers are P[String] rather than P[Unit]. This means they will return a String if their parse succeeds. ws is still P[Unit] since it did not capture anything, prefix ~ ws ~ suffix is still P[(String, String)], which we .map into a P[Phrase].

FastParse parsers generally parse typed values - case classes, tuples, Seqs, primitives, and so on. That makes working with their results very convenient: the compiler can make sure your parsers are producing results that your downstream code can use, and mistakes are compile errors rather than runtime exceptions. This makes it very easy to safely combine simpler parsers to form complex ones, something that is difficult or error-prone when working with untyped regexes.

For example, if we made a mistake and in the .map call assumed that the prefix ~ " ".rep(1) ~ suffix produced a 3-tuple (String, String, String), the compiler will catch it right away, before any code is run:

> def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
    case ("hello", spaces, place) => Phrase(true, place)
    case ("goodbye", spaces, place) => Phrase(false, place)
-- [E007] Type Mismatch Error: -------------------------------------------------
2 |  case ("hello", spaces, place) => Phrase(true, place)
  |                                                    ^^^^^
  |                                                   Found:    (place : Any)
  |                                                   Required: String

> // second attempt to diagnose - don't use wrongly-typed `place` as argument
  def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
    case ("hello", spaces, place) => Phrase(true, ???)
    case ("goodbye", spaces, place) => Phrase(false, ???)
-- [E029] Pattern Match Exhaustivity Warning: ----------------------------------
2 |  case ("hello", spaces, place) => Phrase(true, ???)
  |  ^
  |  match may not be exhaustive.
  |
  |  It would fail on pattern case: (_, _)
-- Error: ----------------------------------------------------------------------
2 |  case ("hello", spaces, place) => Phrase(true, ???)
  |       ^
  |this case is unreachable since type (String, String) is not a subclass of class Tuple3
19.21.scala

In the first expression, the error should be unexpected, as we would assume place should be a String, and not be of type Any. This can be a hint that the pattern might be wrong. Indeed if we replace the place argument with a placeholder, then we see the root of the problem:

19.2.4 Using Sub-Parsers Independently

A prefix or suffix are themselves P[String] parsers, we can use them on their own to parse input:

> fastparse.parse("hello", prefix(using _))
res33: fastparse.Parsed[String] = Success(value = "hello", index = 5)

> fastparse.parse("goodbye", prefix(using _))
res34: fastparse.Parsed[String] = Success(value = "goodbye", index = 7)

> fastparse.parse("moo", prefix(using _))
res35: fastparse.Parsed[String] = Parsed.Failure(Position 1:1, found "moo")
19.22.scala

With FastParse, every P[T] can have its components easily broken out into separate parts, each of which is statically typed as P[String], P[Unit], P[(String, String)], P[Phrase], etc.. The compiler can help ensure that the combinations and transformations are valid, and every sub-parser can be used and tested independently.

In the end, all these def prefix, def suffix, def parser definitions are just methods returning a ParsingRun[T] type, abbreviated as P[T]. They can be defined anywhere, as part of objects or classes or within method bodies. You can refactor parts of a large parser into smaller parsers, assign them meaningful names, and in general manage them as you would any other piece of code.

19.2.5 Recursive Parsers

This ability to reference parsers by name inside another parser means parsers can be recursive! For example here we change Phrase into a tree-like object: it's either a Word containing a string, or a Pair containing two other phrases. This can be modeled as a enum Phrase with two cases:

> enum Phrase:
    case Word(s: String)
    case Pair(lhs: Phrase, rhs: Phrase)
19.23.scala

Here, we wrap everything in {}s, so the REPL will execute the three statements as one "block" rather than separate commands. This is necessary here and for the subsequent definitions, which are recursive and thus only make sense when defined together. We can now parse these Phrases as follows, again using {}s to define the parser as a single block:

> {
  def prefix[T: P] = P( "hello" | "goodbye" ).!.map(Phrase.Word(_))
  def suffix[T: P] = P( "world" | "seattle" ).!.map(Phrase.Word(_))
  def ws[T: P] = P( " ".rep(1) )
  def parened[T: P] = P( "(" ~ parser ~ ")" )
  def parser[T: P]: P[Phrase] = P((parened | prefix) ~ ws ~ (parened | suffix)).map:
    case (lhs, rhs) => Phrase.Pair(lhs, rhs)
  }
19.24.scala

Compared to our earlier non-recursive Phrase parser, this has the following major changes:

We introduce a new parened parser, which wraps parser with a "(" before it and ")" after
Inside parser, prefix is now (parened | prefix) and suffix is now (parened | suffix)
prefix and suffix have their result transformed using .map(Word(_)) to turn them from P[String]s into P[Word]s, which is a subtype of P[Phrase]
parser has its result transformed using .map to become a P[Pair], also a subtype of P[Phrase]

Thus, parser and parened are now mutually recursive: each one can call the other as part of their parse. The definitions of prefix, suffix and ws themselves are unchanged from what we saw earlier. We can pass in various combinations of phrases and parentheses to see it in action:

> fastparse.parse(
    "(hello  world)   ((goodbye seattle) world)",
    parser(using _)
  )
res36: fastparse.Parsed[Phrase] = Success(
  value = Pair(
    lhs = Pair(lhs = Word("hello"), rhs = Word("world")),
    rhs = Pair(
      lhs = Pair(lhs = Word("goodbye"), rhs = Word("seattle")),
      rhs = Word("world")
    )
  ),
  index = 42
)
19.25.scala

We now have a working parser that can parse a given tree-shaped input string into a tree-shaped data-structure. We just needed to write 7 short lines of code to define the shape of the input we want to parse, and the FastParse library did the rest of the work of parsing the string and transforming it into the data structure we need.

See example 19.1 - Phrases

19.3 Implementing a Calculator

As a capstone project for this chapter, we will implement a small arithmetic evaluator. This is a common programming interview challenge, but we will add a twist: it must work on the English representations of numbers! For example, while a traditional arithmetic evaluator may evaluate the following to return 18:

(1 + 2) * (9 - 3)

For this exercise, we will instead parse input such as:

(one plus two) times (nine minus three)

To simplify things, we will limit the numbers to the range zero to nine, avoiding all the complexities around parsing English numbers like fifteen or twelve or one hundred twenty eight.

19.3.1 Defining our Syntax Tree

The first thing we need to do is define exactly what structure we want to parse our arithmetic expressions into. Since we only care about binary operations on numbers, with the numbers being integers, we could model that structure as follows:

> enum Expr:
    case BinOp(left: Expr, op: String, right: Expr)
    case Number(value: Int)
19.26.scala

It is always a judgement call how much information you want to model in your syntax tree. Here we ignore whitespace, line numbers, or parentheses, since they are not necessary for evaluation. Exactly how much detail about the input text you want to model in your syntax tree depends on what you need to do with it.

19.3.2 Parsing Literals

To begin let's implement a parser for simple numbers:

> def number[T: P]: P[Expr] = P(
    "zero" | "one" | "two" | "three" | "four" |
    "five" | "six" | "seven" | "eight" | "nine"
  ).!.map:
    case "zero"  => Expr.Number(0); case "one"   => Expr.Number(1)
    case "two"   => Expr.Number(2); case "three" => Expr.Number(3)
    case "four"  => Expr.Number(4); case "five"  => Expr.Number(5)
    case "six"   => Expr.Number(6); case "seven" => Expr.Number(7)
    case "eight" => Expr.Number(8); case "nine"  => Expr.Number(9)
19.27.scala

Here, we manually list out all the possible numbers, and inside the map we pattern-match on the value and assign each string to the integer value it represents. You can see that due to our map call, number is now a P[Expr]: on success, the parse returns an Int. This can be tested as follows:

> fastparse.parse("seven", number(using _))
res37: fastparse.Parsed[Expr] = Success(value = Number(7), index = 5)

> fastparse.parse("zero", number(using _))
res38: fastparse.Parsed[Expr] = Success(value = Number(0), index = 4)

> fastparse.parse("lol", number(using _))
res39: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "lol")
19.28.scala

Next, we need to define how to parse operators: let's restrict it to the four basic arithmetic operators for now, and use .! on the operator parser so we can capture which operator was parsed:

> def operator[T: P]: P[String] = P( "plus" | "minus" | "times" | "divide" ).!

The last terminal parser we will need is to parse whitespace; for now we will use the same def ws parser we used earlier:

> def ws[T: P] = P( " ".rep(1) )

19.3.3 Parsing Arithmetic

Next, we have to parse the recursive tree-like structure, so we can handle inputs like:

(one plus two) times (nine minus three)

The syntax we want to be able to parse is a left-hand-side expression, followed by an operator, followed by a right-hand-side expression. Each side's expression is either a number, or the parser itself surrounded by parentheses. Let's call this def expr:

def expr[T: P]: P[Expr] = P( "(" ~ parser ~ ")" | number )
def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
19.29.scala

Now, we're almost done. But if you try to run this, you get a compile error:

> {
  def expr[T: P] = P( "(" ~ parser ~ ")" | number )
  def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
  }
-- [E172] Type Error: ----------------------------------------------------------
3 |def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
  |                                                               ^
  |No given instance of type fastparse.Implicits.Sequencer[(Expr, String), Expr,
  |Expr] was found for parameter s of method ~ in package fastparse
19.30.scala

The reason for the failure is that the expected result of def parser is P[Expr], but the ~ operator cannot find a given instance of Sequencer that can combine all the sub-parsers to the required type. Looking closer at the expression, it is made of three sub-parsers which return values:

(expr: P[Expr]) ~ ws ~ (operator: P[String]) ~ ws ~ (expr: P[Expr])

On its own, the expression combines to P[(Expr, String, Expr)]. To convert it to a P[Expr], we need to map it. In this case, let's make the map function combine the lhs and rhs results, depending on what the operator was:

> {
  def expr[T: P] = P( "(" ~ parser ~ ")" | number )
  def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr ).map:
    case (lhs, op, rhs) => Expr.BinOp(lhs, op, rhs)
  }
19.31.scala

Now our parser is able to parse the recursive syntax of our arithmetic expressions:

> val t = fastparse.parse(
    "(one plus two) times (three plus four)",
    parser(using _)
  ).get.value
t: Expr = BinOp(
  left = BinOp(left = Number(1), op = "plus", right = Number(2)),
  op = "times",
  right = BinOp(left = Number(3), op = "plus", right = Number(4))
)
19.32.scala

19.3.4 Manipulating the Syntax Tree

Now that we have our input string parsed into an Expr tree, we can use pattern matching to traverse and manipulate. For example, we can pretty-print it back into a String, or evaluate it into a Int value:

> def stringify(e: Expr): String = e match
    case Expr.BinOp(left, op, right) => s"(${stringify(left)} $op ${stringify(right)})"
    case Expr.Number(0) => "zero";   case Expr.Number(1) => "one"
    case Expr.Number(2) => "two";    case Expr.Number(3) => "three"
    case Expr.Number(4) => "four";   case Expr.Number(5) => "five"
    case Expr.Number(6) => "six";    case Expr.Number(7) => "seven"
    case Expr.Number(8) => "eight";  case Expr.Number(9) => "nine"

> stringify(t)
res40: String = "((one plus two) times (three plus four))"

> def evaluate(e: Expr): Int = e match
    case Expr.BinOp(left, "plus", right) => evaluate(left) + evaluate(right)
    case Expr.BinOp(left, "minus", right) => evaluate(left) - evaluate(right)
    case Expr.BinOp(left, "times", right) => evaluate(left) * evaluate(right)
    case Expr.BinOp(left, "divide", right) => evaluate(left) / evaluate(right)
    case Expr.Number(n) => n

> evaluate(t)
res41: Int = 21
19.33.scala

In general, once we have the input parsed into a syntax tree structure, it is easy to recursively traverse the tree using pattern matching to do useful things: evaluating it, pretty-printing it, validating it, and so on.

We have now written a simple FastParse parser for "English-like" arithmetic expressions, letting us evaluate the arithmetic expression to an integer or pretty-print it back as a string. The entire program is 40 short lines of code, as follows:

Arithmetic.scala//| mvnDeps:
//| - com.lihaoyi::fastparse:3.1.1
import fastparse.*, NoWhitespace.*

enum Expr:
  case BinOp(left: Expr, op: String, right: Expr)
  case Number(value: Int)

def number[T: P] = P(
  "zero" | "one" | "two" | "three" | "four" |
  "five" | "six" | "seven" | "eight" | "nine"
).!.map:
  case "zero"  => Expr.Number(0); case "one"   => Expr.Number(1)
  case "two"   => Expr.Number(2); case "three" => Expr.Number(3)
  case "four"  => Expr.Number(4); case "five"  => Expr.Number(5)
  case "six"   => Expr.Number(6); case "seven" => Expr.Number(7)
  case "eight" => Expr.Number(8); case "nine"  => Expr.Number(9)

def ws[T: P] = P( " ".rep(1) )
def operator[T: P] = P( "plus" | "minus" | "times" | "divide" ).!
def expr[T: P] = P( "(" ~ parser ~ ")" | number )
def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr ).map:
  case (lhs, op, rhs) => Expr.BinOp(lhs, op, rhs)19.34.scala

Traversals.scala//| moduleDeps: [Arithmetic.scala]
def stringify(e: Expr): String = e match
  case Expr.BinOp(left, op, right) => s"(${stringify(left)} $op ${stringify(right)})"
  case Expr.Number(0) => "zero";  case Expr.Number(1) => "one"
  case Expr.Number(2) => "two";   case Expr.Number(3) => "three"
  case Expr.Number(4) => "four";  case Expr.Number(5) => "five"
  case Expr.Number(6) => "six";   case Expr.Number(7) => "seven"
  case Expr.Number(8) => "eight"; case Expr.Number(9) => "nine"

def evaluate(e: Expr): Int = e match
  case Expr.BinOp(left, "plus", right) => evaluate(left)   + evaluate(right)
  case Expr.BinOp(left, "minus", right) => evaluate(left)  - evaluate(right)
  case Expr.BinOp(left, "times", right) => evaluate(left)  * evaluate(right)
  case Expr.BinOp(left, "divide", right) => evaluate(left) / evaluate(right)
  case Expr.Number(n) => n19.35.scala

See example 19.2 - Arithmetic

19.4 Parser Debugging and Error Reporting

The happy path of correctly parsing input is only half of your experience working with parsers. The other half is when things go wrong: your parsers are not parsing what you want them to, or your users are submitting invalid input and causing the parse to fail. In both cases, we need to figure out why a parser is misbehaving. This section will go through two techniques: Debugging Parsers via .log (19.4.1), and Error Reporting with Cuts (19.4.2).

19.4.1 Debugging Parsers via .log

Consider the arithmetic parser we implemented earlier. This works great when the input is correct, but less great when the input is erroneous:

> fastparse.parse("(two plus ten) times seven", parser(using _))
res42: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "(two plus ")
19.36.scala

In the parse above, our parser fails with the message Position 1:1, found "(two plus ". Our input was wrong starting from the first character. That tells us nothing about what is wrong with the input!

19.4.1.1 Adding Logs

In order to find out more, we can use the .log method to insert logging on the various parser methods:

Arithmetic.scala def number[T: P] = P(
   "zero" | "one" | "two" | "three" | "four" |
   "five" | "six" | "seven" | "eight" | "nine"
 ).!.map:
   case "zero"  => Expr.Number(0); case "one"   => Expr.Number(1)
   ...
+.log

 def ws[T: P] = P( " ".rep(1) )

-def operator[T: P] = P( "plus" | "minus" | "times" | "divide" ).!
+def operator[T: P] = P( "plus" | "minus" | "times" | "divide" ).!.log19.37.scala

Arithmetic.scala-def expr[T: P] = P( "(" ~ parser ~ ")" | number )
+def expr[T: P] = P( "(" ~ parser ~ ")" | number ).log

 def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr ).map:
   case (lhs, op, rhs) => Expr.BinOp(lhs, op, rhs)
+.log19.38.scala

19.4.1.2 Logging a Successful Parse

By appending .log onto a parser, we are telling FastParse to print out exactly what it is doing. For example, here is the log of successfully parsing a small input:

> fastparse.parse("one plus two", parser(using _))
+parser:1:1, cut
  +expr:1:1, cut
    +number:1:1
    -number:1:1:Success(1:4)
  -expr:1:1:Success(1:4, cut)
  +operator:1:5, cut
  -operator:1:5:Success(1:9, cut)
  +expr:1:10, cut
    +number:1:10
    -number:1:10:Success(1:13)
  -expr:1:10:Success(1:13, cut)
-parser:1:1:Success(1:13, cut)
res43: fastparse.Parsed[Expr] = Success(
  value = BinOp(left = Number(1), op = "plus", right = Number(2)),
  index = 12
)
19.39.scala

Every line starting with a + indicates at which line and column in the input text a parser began, and every line starting with - indicates where a parser completed either with a Success or Failure.

Above, we can see parser beginning 1:1 (row 1 column 1), expr at 1:1, number at 1:1, number succeeding at 1:4, and so on. Eventually, we see parser succeed at 1:13, producing our final result.

19.4.1.3 Logging a Failed Parse

If we feed our earlier incorrect input to our logged parser, we can see the following logs:

> fastparse.parse("(two plus ten) times seven", parser(using _))
+parser:1:1, cut
  +expr:1:1, cut
    +parser:1:2
      +expr:1:2
        +number:1:2
        -number:1:2:Success(1:5)
      -expr:1:2:Success(1:5)
      +operator:1:6
      -operator:1:6:Success(1:10)
      +expr:1:11
        +number:1:11
        -number:1:11:Failure(number:1:11 / ("zero" | "one" ...):1:11 ..."ten) times")
      -expr:1:11:Failure(expr:1:11 / ("(" | number):1:11 ..."ten) times")
    -parser:1:2:Failure(parser:1:2 / expr:1:11 / ("(" | number):1:11 ..."two plus t")
    +number:1:1
    -number:1:1:Failure(number:1:1 / ("zero" | "one" | "two" ...):1:1 ..."(two plus ")
  -expr:1:1:Failure(expr:1:1 / ("(" ~ parser | number):1:1 ..."(two plus ", cut)
-parser:1:1:Failure(parser:1:1 / expr:1:1 / ("(" ~ parser | number):1:1 ..."(two plus ")
19.40.scala

Here, we can see it successfully parsing all the way until row 1:11 (row 1 column 11), at which the number parser fails: it expected a number of the form "zero" | "one" ..., but instead saw a ten. We can then see it backtrack out of the parse, eventually failing at 1:1. Now the problem is obvious: ten is not a valid number in our parser.

In this small parser, we can add logs throughout the code. In a larger parser we may have to be a bit more selective in what we log to avoid drowning in logging output. In either case, by adding logs to the part of the parser that might be misbehaving, we can get visibility in what the parser is doing. This will help us narrow down exactly which part of the parser or which part of the input is incorrect.

Instrumenting our parser with .logs works fine for the developer trying to investigate a misbehavior in the parser. But if the mistake was in user input, that user likely would not have the freedom to sprinkle .logs throughout our code. What then?

19.4.2 Error Reporting with Cuts

The reason the erroneous input above gives a poor error message is as follows:

The parser fails at 1:11 because ten is not a "(" nor is it a valid number
It backtracks to 1:2 where it was trying to parse parser, finds no other alternatives, and fails
It backtracks out to 1:1 where it earlier tried to parse "(", tries to parse number, and fails
Only when that fails does the parse fail since there is nowhere left to backtrack to

When a parse fails, FastParse backtracks to try the other alternatives defined by |s, in an attempt to find one that can successfully parse. When all of alternatives fail, FastParse does not know which one was "meant" to succeed. Thus can only provide the vague error saying "something is wrong at 1:1".

To produce a better error message, we need to limit how much backtracking FastParse will do.

19.4.2.1 Minimal Example: No Cuts

The basic idea of a cut is a point in which the parser cannot backtrack. For example, let us consider the following standalone parser nocut:

def alpha[T: P] = P( CharIn("a-z") )
def nocut[T: P] = P( "val " ~ alpha.rep(1).! | "def " ~ alpha.rep(1).!)

val Parsed.Success("abcd", _) = parse("val abcd", nocut(using _)).runtimeChecked
val failure = parse("val 1234", nocut(using _)).asInstanceOf[Parsed.Failure]
val trace = failure.trace().longAggregateMsg
failure.index // 0
trace //  Expected nocut:1:1 / ("val " ~ alpha.rep(1) | "def "):1:1, found "val 1234"
19.41.scala

Above we have a simple parser: it either parses a val or def, a space, and its (lower-case only) name. On success this works as expected. On failure, it backtracks all the way to the start of the input. This is a minimized example of the backtracking behavior we saw earlier.

The reason it needs to backtrack is that FastParse does not know that after successfully parsing "val ", only the left branch of the alternative is viable regardless of what input text is being parsed. Thus it has no choice but to offer both alternatives in the error message. We might know that after seeing val , only "val " ~ alpha.rep(1).! could possibly succeed, but FastParse does not know that.

19.4.2.2 Minimal Example: Cuts

To solve this issue, we can add cuts using the ~/ operator:

def alpha[T: P] = P( CharIn("a-z") )
def hascuts[T: P] = P( "val " ~/ alpha.rep(1).! | "def " ~/ alpha.rep(1).!)

val Parsed.Success("abcd", _) = parse("val abcd", hascuts(using _)).runtimeChecked
val failure = parse("val 1234", hascuts(using _)).asInstanceOf[Parsed.Failure]
val trace = failure.trace().longAggregateMsg
failure.index // 4
trace // Expected hascuts:1:1 / alpha:1:5 / [a-z]:1:5, found "1234"
19.42.scala

~/ is similar to the Sequence operator ~. The difference is that once the parse has crossed a cut, it can no longer backtrack past that point. This basically makes FastParse "commit" to the current branch of the alternative, so it knows that if the parse fails the error must be within the current branch and not in any of the other branches.

Once we have added the cut ~/, FastParse no longer backtracks to index 0 on failure. Instead, it shows a much more precise error: at index 4, expecting one of the small set of alphanumeric characters.

19.4.2.3 Arithmetic Parser Cuts

Going back to our Arithmetic parser above, we know that once we see an open (, there is no point backtracking: an open ( must be followed by a parser, then a closing ). FastParse doesn't know this, but we can tell it via a cut after the open (:

Arithmetic.scala-def expr[T: P] = P( "(" ~ parser ~ ")" | number )
+def expr[T: P] = P( "(" ~/ parser ~ ")" | number )19.43.scala

After we add the cut ~/, our error message is improved greatly:

> val result = fastparse.parse("(two plus ten) times seven", parser(using _))
result: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "(two plus ")

> val Parsed.Failure(msg, idx, extra) = result.runtimeChecked

> println(extra.trace().msg)
Expected ("(" ~ parser | number):1:1, found "(two plus "
19.44.scala

We now get a precise error telling us that the "ten" in our string is problematic. Using .trace().msg to get more information, we can see that the parser expected either a "(" or a number at column 11, but instead found "ten". This tells us precisely where the problematic input is and what the parser was expecting!

In general, adding cuts greatly helps improve the error reporting when your parser is given invalid input. Whenever you have a parser that can backtrack:

Alternative a | b | c parsers
Repeated a.rep parsers
Optional a.? parsers

It is worth considering if there is a point in the parse past which backtracking is futile. If there is, adding a cut ~/ would greatly improve the error messages the parser produces.

Logging and Cuts are two complementary techniques that help figure out why the parser is not doing what you expect. In the inevitable situation where something is going wrong, it is good to have both in your toolbox.

19.5 Conclusion

We have now walked through the basics of parsing structured text into useful values: parsing arithmetic expressions into a syntax tree that we can traverse and evaluate in various ways. Note this is a minimal example of what you can do with the FastParse library. The actual FastParse documentation has a lot more details in how to write parsers, and is available online at:

https://com-lihaoyi.github.io/fastparse

FastParse can be used in any Scala project, using the following dependency:

Millmvn"com.lihaoyi::fastparse:3.1.1"

FastParse parsers are flexible and performant enough to build production parsers for complex, real-world languages like CSS, JSON, Scala, or Python. While in this chapter we parsed a strawman arithmetic syntax, there are many real-world use cases for writing your own parser: handling uncommon data formats, writing a simple query language, or parsing your existing source code for static analysis.

Next, in Chapter 20: Implementing a Programming Language we will use FastParse as the basis for writing a simple programming language interpreter.

Exercise: Extend our arithmetic parser to allow parsing chained operators like one plus two plus three plus four which should evaluate to 10. For now, ignore operator precedence, such that one plus two times three plus four evaluates to 13 rather than 11.

See example 19.3 - ArithmeticChained

Exercise: Extend our arithmetic parser to support operator precedence with chained operators, such that one plus two times three plus four evaluates to 11 instead of 13. One possible approach is using precedence climbing.

See example 19.4 - ArithmeticPrecedence

Exercise: Modify the arithmetic parser to evaluate the arithmetic expression during the parse and return a single Int once the parse completes, rather than first parsing into an Expr structure that we need to recursively traverse and evaluate later. What exact type our parser methods return is up to us, depending on what we do in the .map calls.

See example 19.5 - ArithmeticDirect

Discuss Chapter 19 online at https://www.handsonscala.com/discuss/19