| 19.1 Simple Parsers | 359 |
| 19.2 Parsing Structured Values | 364 |
| 19.3 Implementing a Calculator | 368 |
| 19.4 Parser Debugging and Error Reporting | 373 |
> def parser[T: P] = P(
("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
)
> fastparse.parse("hello seattle", parser(using _))
res0: fastparse.Parsed[(String, String)] = Success(
value = ("hello", "seattle"),
index = 13
)
> fastparse.parse("hello world", parser(using _))
res1: fastparse.Parsed[(String, String)] = Success(
value = ("hello", "world"),
index = 15
)
19.1.scala
Snippet 19.1: parsing simple text formats using the FastParse library
One common programming task is parsing structured text. This chapter will introduce how to parse text in Scala using the FastParse library, before diving into an example where we write a simple arithmetic parser in Scala. This will allow you to work competently with unusual data formats, query languages, or source code for which you do not already have an existing parser at hand.
We will build upon the parsing techniques learned in this chapter as part of Chapter 20: Implementing a Programming Language.
For this chapter, we will include the FastParse library as a dependency via Scala CLI. The simplest FastParse parser is shown below:
$ scala --import com.lihaoyi::fastparse:3.1.1 --repl
> import fastparse.*, NoWhitespace.*
> def parser[T: P] = P( "hello" )
19.2.scala
Here, we import the fastparse library and define a parser with the P(...)
function and [T: P] context bound. Every FastParse parser must be defined with
these, and in this case our def parser parses a single string, "hello", and
nothing else. You can use this parser by calling fastparse.parse on it:
> fastparse.parse("hello", parser(using _))
res2: fastparse.Parsed[Unit] = Success(value = (), index = 5)
> fastparse.parse("goodbye", parser(using _))
res3: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "goodbye")
19.3.scala
We can see that parsing "hello" succeeded, returning () (which is Unit,
meaning "no value" in Scala) and parsing until index 5. On the other hand,
trying to parse "goodbye" failed at row 1 character 1: the first
character. We can make it print out what it expected by asking for the trace.
> val Parsed.Failure(expected, index, extra) =
fastparse.parse("goodbye", parser(using _)).runtimeChecked
> println(extra.trace().longMsg)
Expected parser:1:1 / "hello":1:1, found "goodbye"
19.4.scala
The .trace().longMsg tells us that it expected "hello" but instead found
"goodbye". Asking for the .trace().msg or .trace().longMsg gives a more
detailed error message, but at a cost: FastParse needs to re-parse the entire
input with additional instrumentation. This takes additional time, and so is not
done by default unless you ask for it.
While in the above example we pattern-matched using val Parsed.Failure(...) =
since we know the parse failed, in production code you would handle the result
of a parse using a match block as below:
fastparse.parse("goodbye", parser(using _)) match
case Parsed.Success(value, index) => ...
case Parsed.Failure(expected, index, extra) => ...
19.5.scala
You can then handle the error however you like: logging it, showing it to the user, etc.
fastparse.parse also fails if there's not enough input to parse:
> fastparse.parse("hel", parser(using _))
res4: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "hel")
19.6.scala
On the other hand, if there's too much input, fastparse.parse succeeds but
with the index showing how much it actually parsed:
> fastparse.parse("hellogoodbye", parser(using _))
res5: fastparse.Parsed[Unit] = Success(value = (), index = 5)
19.7.scala
If we want to catch the case where we didn't completely parse the input, it's
straightforward to compare the success index (5) against the length of the
input string (12) to see whether or not the input was fully consumed. We can
also use the End operator, which we'll discuss later in
Sequence Parsers (19.1.3).
a | b can parse anything a or b can parse, tried left-to-right
(i.e. a gets priority, and b is only tried if a fails):
> def parser[T: P] = P( "hello" | "goodbye" )
> fastparse.parse("hello", parser(using _))
res6: fastparse.Parsed[Unit] = Success(value = (), index = 5)
> fastparse.parse("goodbye", parser(using _))
res7: fastparse.Parsed[Unit] = Success(value = (), index = 7)
> fastparse.parse("dunno", parser(using _))
res8: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "dunno")
19.8.scala
As you can see, parsing either "hello" or "goodbye" works, but parsing
"dunno" makes it fail. Again, you can call .trace().longMsg for info:
> val Parsed.Failure(expected, index, extra) =
fastparse.parse("dunno", parser(using _)).runtimeChecked
> println(extra.trace().longMsg)
Expected parser:1:1 / ("hello" | "goodbye"):1:1, found "dunno"
19.9.scala
Above, we can see that it expected "hello" | "goodbye" at row 1 character
1 but instead found a "dunno". You can chain more alternatives together with
|, and FastParse will try all of them left-to-right.
a ~ b parses a followed by b:
> def parser[T: P] = P( "hello" ~ "goodbye" )
> fastparse.parse("hellogoodbye", parser(using _))
res9: fastparse.Parsed[Unit] = Success(value = (), index = 12)
> fastparse.parse("hello", parser(using _))
res10: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "")
> fastparse.parse("goodbye", parser(using _))
res11: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "goodbye")
19.10.scala
The parser above parses "hellogoodbye" all at once.
If you give it just "hello", it fails since it's looking for "goodbye" at
row 1 character 6, but found no input since it had reached the end of the
string
If you try to parse just "goodbye", it fails since it's looking for
"hello" at row 1 character 1, but instead found "goodbye"
Like |, ~ is chainable: you can ~ together as many parsers as you want in
sequence and they'll each run one after the other. If you want to ensure your
parser is able to consume all the given input, you can end with a ~ End
parser. This will fail the parse if there is input left over:
> def parser[T: P] = P( "hello" ~ "goodbye" ~ End )
> fastparse.parse("hellogoodbye", parser(using _))
res12: fastparse.Parsed[Unit] = Success(value = (), index = 12)
> fastparse.parse("hellogoodbyeworld", parser(using _))
res13: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:13, found "world")
> val Parsed.Failure(msg, idx, extra) =
fastparse.parse("hellogoodbyeworld", parser(using _)).runtimeChecked
> extra.trace().longMsg
res14: String = "Expected parser:1:1 / end-of-input:1:13, found \"world\""
19.11.scala
Here, parsing "hellogoodbyeworld" failed because it was expecting to have
parsed until the End of the input string, but instead it found more characters
left over. This is something you usually only want to do at the "end" of the
parser when you know nothing should be left over, and saves you from always
needing to check if the success-index and input-length line up.
You can combine | and ~, which is when things start getting interesting:
> def parser[T: P] = P(("hello" | "goodbye") ~ " " ~ ("world" | "seattle") ~ End)
This passes on the following inputs:
> fastparse.parse("hello world", parser(using _))
res15: fastparse.Parsed[Unit] = Success(value = (), index = 11)
> fastparse.parse("hello seattle", parser(using _))
res16: fastparse.Parsed[Unit] = Success(value = (), index = 13)
> fastparse.parse("goodbye world", parser(using _))
res17: fastparse.Parsed[Unit] = Success(value = (), index = 13)
19.12.scala
And fails on the following:
> fastparse.parse("hello universe", parser(using _)) // Not "world" or "seattle"
res18: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:7, found "universe")
> fastparse.parse("helloworld", parser(using _)) // Missing required " " blank space
res19: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "world")
> fastparse.parse("hello world", parser(using _)) // Too many blank spaces
res20: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:7, found " world")
> fastparse.parse("i love seattle", parser(using _)) // Not a hello or goodbye
res21: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:1, found "i love sea")
> fastparse.parse("hello seattle moo", parser(using _)) // unexpected extra text
res22: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:14, found " moo")
19.13.scala
If you want more verbose error logging for why the parse failed, you can ask for
the .trace().longMsg as we did earlier.
We now have a FastParse parser which is about as powerful as the regex
(hello|goodbye) (world|seattle), though with more informative errors when the
parse fails.
You can call .rep(n) on any parser to repeat it:
> def parser[T: P] = P(
("hello" | "goodbye") ~ " ".rep(1) ~ ("world" | "seattle") ~ End
)
> fastparse.parse("hello world", parser(using _))
res23: fastparse.Parsed[Unit] = Success(value = (), index = 11)
> fastparse.parse("hello world", parser(using _))
res24: fastparse.Parsed[Unit] = Success(value = (), index = 15)
> fastparse.parse("helloworld", parser(using _))
res25: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found "world")
19.14.scala
Here, .rep(1) means it repeats the " " parser at least once.
If we have one or more spaces between the "hello" and "world", the parser
consumes all of them
If we have no spaces at all, it fails with a message that it was looking for a
" " but instead found a "world" at character 6
You can also pass in explicit min=..., max=... arguments if you want to
bound it to a particular range, or exactly=... if you want it to repeat
exactly N times.
Marking a parser as optional is done using .?:
> def parser[T: P] = P(
("hello" | "goodbye") ~ (" ".rep(1) ~ ("world" | "seattle")).? ~ End
)
19.15.scala
Here, you can see that the " ".rep(1) parser as well as the ("world" | "seattle") parser are all optional. That means that the parser lets you include
or omit the entire " world" suffix:
> fastparse.parse("hello world", parser(using _))
res26: fastparse.Parsed[Unit] = Success(value = (), index = 15)
> fastparse.parse("hello", parser(using _))
res27: fastparse.Parsed[Unit] = Success(value = (), index = 5)
19.16.scala
If a blank space " " is present, the parser expects "world" or "seattle"
to follow:
> fastparse.parse("hello ", parser(using _))
res28: fastparse.Parsed[Unit] = Parsed.Failure(Position 1:6, found " ")
19.17.scala
This is because when we try to parse "hello " with a trailing space:
" ".rep(1) succeeds in parsing the space
("world" | "seattle") then fails since there's no more input after the
space.
Since the whole trailing-space-world is optional, the parse backtracks to
character 6 just after "hello"
FastParse then tries to continue the parse without the optional portion, now
expecting to see the End of input
Since it isn't the end, with one more " " at index 6, the parse fails
This last parser is similar to the regex (hello|goodbye)( +(world|seattle))?
The next few sections of this chapter will cover features of the FastParse
library that go beyond what a regex is capable of.
So far, all our parsers have been of type Parser[Unit], returning () in the
Success result. Unit in Scala means "no value": i.e. these parsers parse the
input, check that it matches what the parser expects, but doesn't return any
concrete value representing the parsed input. This is the default because most
parsers don't care about most of the things they're parsing. For example, If you
are parsing Java you don't care about all the whitespace, all the { or }s,
(s or )s, ,s or ;s, // comments or /**/ comments. FastParse thus
avoids building data structures that will ultimately be ignored or discarded.
In the cases where we care about the text we're parsing, we must capture it
using the .! operator:
> def parser[T: P] = P(
("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
)
> fastparse.parse("hello seattle", parser(using _))
res29: fastparse.Parsed[(String, String)] = Success(
value = ("hello", "seattle"),
index = 13
)
> fastparse.parse("hello world", parser(using _))
res30: fastparse.Parsed[(String, String)] = Success(
value = ("hello", "world"),
index = 15
)
19.18.scala
As you can see, we added .! to capture both the ("hello" | "goodbye") as
well as the ("world" | "seattle"), and we successfully parsed it to get the
results as a tuple containing the two strings. We did not .! capture the
one-or-more-spaces parser in the middle, so the spaces it parsed did not appear
in the output.
Often we don't want tuples of strings, and would prefer some kind of object with
named fields that contains the data we want, such as an instance of a
case class. Below we define such a case class Phrase to represent our phrases, and
use .map to transform the parsed tuple of (String, String) into a Phrase:
> case class Phrase(isHello: Boolean, place: String)
> def parser[T: P]: P[Phrase] = P(
("hello" | "goodbye").! ~ " ".rep(1) ~ ("world" | "seattle").! ~ End
).map:
case ("hello", place) => Phrase(true, place)
case ("goodbye", place) => Phrase(false, place)
> val Parsed.Success(result, index) =
fastparse.parse("goodbye seattle", parser(using _)).runtimeChecked
> result.isHello
res31: Boolean = false
> result.place
res32: String = "seattle"
19.19.scala
On successful parses, we thus get a Phrase object we can work with. We can
then conveniently refer to parts of the parsed data via .isHello or .place,
rather than indexing via (0) and (1) we would need if we were working directly with
tuples. This is much less error-prone than using a regex with capturing groups
to retrieve portions of parsed string by index (result.group(0),
result.group(1), ...).
The above example is getting a bit long, but it is easy to break it up. We
define smaller parsers using the same def foo[T: P] = P(...) syntax, and make
use of them in the "main" parser:
> def prefix[T: P]: P[String] = P( "hello" | "goodbye" ).!
> def suffix[T: P]: P[String] = P( "world" | "seattle" ).!
> def ws[T: P]: P[Unit] = P( " ".rep(1) ) // white-space
> def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
case ("hello", place) => Phrase(true, place)
case ("goodbye", place) => Phrase(false, place)
> val Parsed.Success(result, index) =
fastparse.parse("goodbye world", parser(using _)).runtimeChecked
val result: Phrase = Phrase(isHello = false, place = "world")
val index: Int = 15
19.20.scala
Here we can see that the individual prefix and suffix parsers are
P[String] rather than P[Unit]. This means they will return a String if
their parse succeeds. ws is still P[Unit] since it did not capture anything,
prefix ~ ws ~ suffix is still P[(String, String)], which we .map into a
P[Phrase].
FastParse parsers generally parse typed values - case classes, tuples, Seqs,
primitives, and so on. That makes working with their results very convenient:
the compiler can make sure your parsers are producing results that your
downstream code can use, and mistakes are compile errors rather than runtime
exceptions. This makes it very easy to safely combine simpler parsers to form
complex ones, something that is difficult or error-prone when working with
untyped regexes.
For example, if we made a mistake and in the .map call assumed that the
prefix ~ " ".rep(1) ~ suffix produced a 3-tuple (String, String, String),
the compiler will catch it right away, before any code is run:
> def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
case ("hello", spaces, place) => Phrase(true, place)
case ("goodbye", spaces, place) => Phrase(false, place)
-- [E007] Type Mismatch Error: -------------------------------------------------
2 | case ("hello", spaces, place) => Phrase(true, place)
| ^^^^^
| Found: (place : Any)
| Required: String
> // second attempt to diagnose - don't use wrongly-typed `place` as argument
def parser[T: P] = P( prefix ~ ws ~ suffix ).map:
case ("hello", spaces, place) => Phrase(true, ???)
case ("goodbye", spaces, place) => Phrase(false, ???)
-- [E029] Pattern Match Exhaustivity Warning: ----------------------------------
2 | case ("hello", spaces, place) => Phrase(true, ???)
| ^
| match may not be exhaustive.
|
| It would fail on pattern case: (_, _)
-- Error: ----------------------------------------------------------------------
2 | case ("hello", spaces, place) => Phrase(true, ???)
| ^
|this case is unreachable since type (String, String) is not a subclass of class Tuple3
19.21.scala
In the first expression, the error should be unexpected, as we would assume place
should be a String, and not be of type Any. This can be a hint that the pattern might be wrong.
Indeed if we replace the place argument with a placeholder,
then we see the root of the problem:
A prefix or suffix are themselves P[String] parsers, we can use them on
their own to parse input:
> fastparse.parse("hello", prefix(using _))
res33: fastparse.Parsed[String] = Success(value = "hello", index = 5)
> fastparse.parse("goodbye", prefix(using _))
res34: fastparse.Parsed[String] = Success(value = "goodbye", index = 7)
> fastparse.parse("moo", prefix(using _))
res35: fastparse.Parsed[String] = Parsed.Failure(Position 1:1, found "moo")
19.22.scala
With FastParse, every P[T] can have its components easily broken out into
separate parts, each of which is statically typed as P[String], P[Unit],
P[(String, String)], P[Phrase], etc.. The compiler can help ensure that the
combinations and transformations are valid, and every sub-parser can be used and
tested independently.
In the end, all these def prefix, def suffix, def parser definitions are
just methods returning a ParsingRun[T] type, abbreviated as P[T]. They can
be defined anywhere, as part of objects or classes or within method bodies. You
can refactor parts of a large parser into smaller parsers, assign them
meaningful names, and in general manage them as you would any other piece of
code.
This ability to reference parsers by name inside another parser means parsers
can be recursive! For example here we change Phrase into a tree-like object:
it's either a Word containing a string, or a Pair containing two other
phrases. This can be modeled as a enum Phrase with two cases:
> enum Phrase:
case Word(s: String)
case Pair(lhs: Phrase, rhs: Phrase)
19.23.scala
Here, we wrap everything in {}s, so the REPL will execute the three statements
as one "block" rather than separate commands. This is necessary here and for the
subsequent definitions, which are recursive and thus only make sense when
defined together. We can now parse these Phrases as follows, again using {}s
to define the parser as a single block:
> {
def prefix[T: P] = P( "hello" | "goodbye" ).!.map(Phrase.Word(_))
def suffix[T: P] = P( "world" | "seattle" ).!.map(Phrase.Word(_))
def ws[T: P] = P( " ".rep(1) )
def parened[T: P] = P( "(" ~ parser ~ ")" )
def parser[T: P]: P[Phrase] = P((parened | prefix) ~ ws ~ (parened | suffix)).map:
case (lhs, rhs) => Phrase.Pair(lhs, rhs)
}
19.24.scala
Compared to our earlier non-recursive Phrase parser, this has the following
major changes:
We introduce a new parened parser, which wraps parser with a "(" before
it and ")" after
Inside parser, prefix is now (parened | prefix) and suffix is now
(parened | suffix)
prefix and suffix have their result transformed using .map(Word(_)) to turn
them from P[String]s into P[Word]s, which is a subtype of P[Phrase]
parser has its result transformed using .map to become a P[Pair], also a
subtype of P[Phrase]
Thus, parser and parened are now mutually recursive: each one can call the
other as part of their parse. The definitions of prefix, suffix and ws
themselves are unchanged from what we saw earlier. We can pass in various
combinations of phrases and parentheses to see it in action:
> fastparse.parse(
"(hello world) ((goodbye seattle) world)",
parser(using _)
)
res36: fastparse.Parsed[Phrase] = Success(
value = Pair(
lhs = Pair(lhs = Word("hello"), rhs = Word("world")),
rhs = Pair(
lhs = Pair(lhs = Word("goodbye"), rhs = Word("seattle")),
rhs = Word("world")
)
),
index = 42
)
19.25.scala
We now have a working parser that can parse a given tree-shaped input string into a tree-shaped data-structure. We just needed to write 7 short lines of code to define the shape of the input we want to parse, and the FastParse library did the rest of the work of parsing the string and transforming it into the data structure we need.
As a capstone project for this chapter, we will implement a small arithmetic
evaluator. This is a common programming interview challenge, but we will add a
twist: it must work on the English representations of numbers! For example,
while a traditional arithmetic evaluator may evaluate the following to return
18:
(1 + 2) * (9 - 3)
For this exercise, we will instead parse input such as:
(one plus two) times (nine minus three)
To simplify things, we will limit the numbers to the range zero to nine,
avoiding all the complexities around parsing English numbers like fifteen or
twelve or one hundred twenty eight.
The first thing we need to do is define exactly what structure we want to parse our arithmetic expressions into. Since we only care about binary operations on numbers, with the numbers being integers, we could model that structure as follows:
> enum Expr:
case BinOp(left: Expr, op: String, right: Expr)
case Number(value: Int)
19.26.scala
It is always a judgement call how much information you want to model in your syntax tree. Here we ignore whitespace, line numbers, or parentheses, since they are not necessary for evaluation. Exactly how much detail about the input text you want to model in your syntax tree depends on what you need to do with it.
To begin let's implement a parser for simple numbers:
> def number[T: P]: P[Expr] = P(
"zero" | "one" | "two" | "three" | "four" |
"five" | "six" | "seven" | "eight" | "nine"
).!.map:
case "zero" => Expr.Number(0); case "one" => Expr.Number(1)
case "two" => Expr.Number(2); case "three" => Expr.Number(3)
case "four" => Expr.Number(4); case "five" => Expr.Number(5)
case "six" => Expr.Number(6); case "seven" => Expr.Number(7)
case "eight" => Expr.Number(8); case "nine" => Expr.Number(9)
19.27.scala
Here, we manually list out all the possible numbers, and inside the map we
pattern-match on the value and assign each string to the integer value it
represents. You can see that due to our map call, number is now a P[Int]:
on success, the parse returns an Int. This can be tested as follows:
> fastparse.parse("seven", number(using _))
res37: fastparse.Parsed[Expr] = Success(value = Number(7), index = 5)
> fastparse.parse("zero", number(using _))
res38: fastparse.Parsed[Expr] = Success(value = Number(0), index = 4)
> fastparse.parse("lol", number(using _))
res39: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "lol")
19.28.scala
Next, we need to define how to parse operators: let's restrict it to the four
basic arithmetic operators for now, and use .! on the operator parser
so we can capture which operator was parsed:
> def operator[T: P]: P[String] = P( "plus" | "minus" | "times" | "divide" ).!
The last terminal parser we will need is to parse whitespace; for now we will
use the same def ws parser we used earlier:
> def ws[T: P] = P( " ".rep(1) )
Next, we have to parse the recursive tree-like structure, so we can handle inputs like:
(one plus two) times (nine minus three)
The syntax we want to be able to parse is a left-hand-side expression, followed
by an operator, followed by a right-hand-side expression. Each side's expression
is either a number, or the parser itself surrounded by parentheses. Let's
call this def expr:
def expr[T: P]: P[Expr] = P( "(" ~ parser ~ ")" | number )
def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
19.29.scala
Now, we're almost done. But if you try to run this, you get a compile error:
> {
def expr[T: P] = P( "(" ~ parser ~ ")" | number )
def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
}
-- [E172] Type Error: ----------------------------------------------------------
3 |def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr )
| ^
|No given instance of type fastparse.Implicits.Sequencer[(Expr, String), Expr,
|Expr] was found for parameter s of method ~ in package fastparse
19.30.scala
The reason for the failure is that the expected result of def parser is P[Expr], but the ~ operator cannot find a given instance of Sequencer that can combine all the sub-parsers to the required type. Looking closer at the expression, it is made of three sub-parsers which return values:
(expr: P[Expr]) ~ ws ~ (operator: P[String]) ~ ws ~ (expr: P[Expr])
On its own, the expression combines to P[(Int, String, Int)]. To convert it to a P[Expr], we
need to map it. In this case, let's make the map function combine the lhs
and rhs results, depending on what the operator was:
> {
def expr[T: P] = P( "(" ~ parser ~ ")" | number )
def parser[T: P]: P[Expr] = P( expr ~ ws ~ operator ~ ws ~ expr ).map:
case (lhs, op, rhs) => Expr.BinOp(lhs, op, rhs)
}
19.31.scala
Now our parser is able to parse the recursive syntax of our arithmetic
expressions:
> val t = fastparse.parse(
"(one plus two) times (three plus four)",
parser(using _)
).get.value
t: Expr = BinOp(
left = BinOp(left = Number(1), op = "plus", right = Number(2)),
op = "times",
right = BinOp(left = Number(3), op = "plus", right = Number(4))
)
19.32.scalaNow that we have our input string parsed into an Expr tree, we can use pattern
matching to traverse and manipulate. For example, we can pretty-print it back
into a String, or evaluate it into a Int value:
> def stringify(e: Expr): String = e match
case Expr.BinOp(left, op, right) =>
s"(${stringify(left)} $op ${stringify(right)})"
case Expr.Number(0) => "zero"; case Expr.Number(1) => "one"
case Expr.Number(2) => "two"; case Expr.Number(3) => "three"
case Expr.Number(4) => "four"; case Expr.Number(5) => "five"
case Expr.Number(6) => "six"; case Expr.Number(7) => "seven"
case Expr.Number(8) => "eight"; case Expr.Number(9) => "nine"
> stringify(t)
res40: String = "((one plus two) times (three plus four))"
> def evaluate(e: Expr): Int = e match
case Expr.BinOp(left, "plus", right) => evaluate(left) + evaluate(right)
case Expr.BinOp(left, "minus", right) => evaluate(left) - evaluate(right)
case Expr.BinOp(left, "times", right) => evaluate(left) * evaluate(right)
case Expr.BinOp(left, "divide", right) => evaluate(left) / evaluate(right)
case Expr.Number(n) => n
> evaluate(t)
res41: Int = 21
19.33.scala
In general, once we have the input parsed into a syntax tree structure, it is easy to recursively traverse the tree using pattern matching to do useful things: evaluating it, pretty-printing it, validating it, and so on.
We have now written a simple FastParse parser for "English-like" arithmetic expressions, letting us evaluate the arithmetic expression to an integer or pretty-print it back as a string. The entire program is 40 short lines of code, as follows:
Arithmetic.scala19.34.scala//| mvnDeps://| - com.lihaoyi::fastparse:3.1.1importfastparse.*,NoWhitespace.*enumExpr:caseBinOp(left:Expr,op:String,right:Expr)caseNumber(value:Int)defnumber[T:P]=P("zero"|"one"|"two"|"three"|"four"|"five"|"six"|"seven"|"eight"|"nine").!.map:case"zero"=>Expr.Number(0);case"one"=>Expr.Number(1)case"two"=>Expr.Number(2);case"three"=>Expr.Number(3)case"four"=>Expr.Number(4);case"five"=>Expr.Number(5)case"six"=>Expr.Number(6);case"seven"=>Expr.Number(7)case"eight"=>Expr.Number(8);case"nine"=>Expr.Number(9)defws[T:P]=P(" ".rep(1))defoperator[T:P]=P("plus"|"minus"|"times"|"divide").!defexpr[T:P]=P("("~parser~")"|number)defparser[T:P]:P[Expr]=P(expr~ws~operator~ws~expr).map:case(lhs,op,rhs)=>Expr.BinOp(lhs,op,rhs)
Traversals.scala19.35.scala//| moduleDeps: [./Arithmetic.scala]defstringify(e:Expr):String=ematchcaseExpr.BinOp(left,op,right)=>s"(${stringify(left)}$op${stringify(right)})"caseExpr.Number(0)=>"zero";caseExpr.Number(1)=>"one"caseExpr.Number(2)=>"two";caseExpr.Number(3)=>"three"caseExpr.Number(4)=>"four";caseExpr.Number(5)=>"five"caseExpr.Number(6)=>"six";caseExpr.Number(7)=>"seven"caseExpr.Number(8)=>"eight";caseExpr.Number(9)=>"nine"defevaluate(e:Expr):Int=ematchcaseExpr.BinOp(left,"plus",right)=>evaluate(left)+evaluate(right)caseExpr.BinOp(left,"minus",right)=>evaluate(left)-evaluate(right)caseExpr.BinOp(left,"times",right)=>evaluate(left)*evaluate(right)caseExpr.BinOp(left,"divide",right)=>evaluate(left)/evaluate(right)caseExpr.Number(n)=>n
The happy path of correctly parsing input is only half of your experience working with parsers. The other half is when things go wrong: your parsers are not parsing what you want them to, or your users are submitting invalid input and causing the parse to fail. In both cases, we need to figure out why a parser is misbehaving. This section will go through two techniques: Debugging Parsers via .log (19.4.1), and Error Reporting with Cuts (19.4.2).
Consider the arithmetic parser we implemented earlier. This works great when the input is correct, but less great when the input is erroneous:
> fastparse.parse("(two plus ten) times seven", parser(using _))
res42: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "(two plus ")
19.36.scala
In the parse above, our parser fails with the message Position 1:1, found "(two plus ". Our input was wrong starting from the first character. That tells us
nothing about what is wrong with the input!
In order to find out more, we can use the .log method to insert logging on the
various parser methods:
Arithmetic.scala19.37.scaladefnumber[T:P]=P("zero"|"one"|"two"|"three"|"four"|"five"|"six"|"seven"|"eight"|"nine").!.map:case"zero"=>Expr.Number(0);case"one"=>Expr.Number(1)...+.logdefws[T:P]=P(" ".rep(1))-defoperator[T:P]=P("plus"|"minus"|"times"|"divide").!+defoperator[T:P]=P("plus"|"minus"|"times"|"divide").!.log
Arithmetic.scala19.38.scala-defexpr[T:P]=P("("~parser~")"|number)+defexpr[T:P]=P("("~parser~")"|number).logdefparser[T:P]:P[Expr]=P(expr~ws~operator~ws~expr).map:case(lhs,op,rhs)=>Expr.BinOp(lhs,op,rhs)+.log
By appending .log onto a parser, we are telling FastParse to print out exactly
what it is doing. For example, here is the log of successfully parsing a small
input:
> fastparse.parse("one plus two", parser(using _))
+parser:1:1, cut
+expr:1:1, cut
+number:1:1
-number:1:1:Success(1:4)
-expr:1:1:Success(1:4, cut)
+operator:1:5, cut
-operator:1:5:Success(1:9, cut)
+expr:1:10, cut
+number:1:10
-number:1:10:Success(1:13)
-expr:1:10:Success(1:13, cut)
-parser:1:1:Success(1:13, cut)
res43: fastparse.Parsed[Expr] = Success(
value = BinOp(left = Number(1), op = "plus", right = Number(2)),
index = 12
)
19.39.scala
Every line starting with a + indicates at which line and column in the input
text a parser began, and every line starting with - indicates where a parser
completed either with a Success or Failure.
Above, we can see parser beginning 1:1 (row 1 column 1), expr at 1:1,
number at 1:1, number succeeding at 1:4, and so on. Eventually, we see
parser succeed at 1:13, producing our final result.
If we feed our earlier incorrect input to our logged parser, we can see the following logs:
> fastparse.parse("(two plus ten) times seven", parser(using _))
+parser:1:1, cut
+expr:1:1, cut
+parser:1:2
+expr:1:2
+number:1:2
-number:1:2:Success(1:5)
-expr:1:2:Success(1:5)
+operator:1:6
-operator:1:6:Success(1:10)
+expr:1:11
+number:1:11
-number:1:11:Failure(number:1:11 / ("zero" | "one" ...):1:11 ..."ten) times")
-expr:1:11:Failure(expr:1:11 / ("(" | number):1:11 ..."ten) times")
-parser:1:2:Failure(parser:1:2 / expr:1:11 / ("(" | number):1:11 ..."two plus t")
+number:1:1
-number:1:1:Failure(number:1:1 / ("zero" | "one" | "two" ...):1:1 ..."(two plus ")
-expr:1:1:Failure(expr:1:1 / ("(" ~ parser | number):1:1 ..."(two plus ", cut)
-parser:1:1:Failure(parser:1:1 / expr:1:1 / ("(" ~ parser | number):1:1 ..."(two plus ")
19.40.scala
Here, we can see it successfully parsing all the way until row 1:11 (row 1
column 11), at which the number parser fails: it expected a number of the form
"zero" | "one" ..., but instead saw a ten. We can then see it backtrack out
of the parse, eventually failing at 1:1. Now the problem is obvious: ten is
not a valid number in our parser.
In this small parser, we can add logs throughout the code. In a larger parser
we may have to be a bit more selective in what we log to avoid drowning in
logging output. In either case, by adding logs to the part of the parser that
might be misbehaving, we can get visibility in what the parser is doing. This
will help us narrow down exactly which part of the parser or which part of the
input is incorrect.
Instrumenting our parser with .logs works fine for the developer trying to
investigate a misbehavior in the parser. But if the mistake was in user input,
that user likely would not have the freedom to sprinkle .logs throughout our
code. What then?
The reason the erroneous input above gives a poor error message is as follows:
1:11 because ten is not a "(" nor is it a valid
number1:2 where it was trying to parse parser, finds no other
alternatives, and fails1:1 where it earlier tried to parse "(", tries to
parse number, and failsWhen a parse fails, FastParse backtracks to try the other alternatives defined
by |s, in an attempt to find one that can successfully parse. When all of
alternatives fail, FastParse does not know which one was "meant" to succeed.
Thus can only provide the vague error saying "something is wrong at 1:1".
To produce a better error message, we need to limit how much backtracking FastParse will do.
The basic idea of a cut is a point in which the parser cannot backtrack. For
example, let us consider the following standalone parser nocut:
def alpha[T: P] = P( CharIn("a-z") )
def nocut[T: P] = P( "val " ~ alpha.rep(1).! | "def " ~ alpha.rep(1).!)
val Parsed.Success("abcd", _) = parse("val abcd", nocut(using _)).runtimeChecked
val failure = parse("val 1234", nocut(using _)).asInstanceOf[Parsed.Failure]
val trace = failure.trace().longAggregateMsg
failure.index // 0
trace // Expected nocut:1:1 / ("val " ~ alpha.rep(1) | "def "):1:1, found "val 1234"
19.41.scala
Above we have a simple parser: it either parses a val or def, a space, and
its (lower-case only) name. On success this works as expected. On failure, it
backtracks all the way to the start of the input. This is a minimized example of
the backtracking behavior we saw earlier.
The reason it needs to backtrack is that FastParse does not know that after
successfully parsing "val ", only the left branch of the alternative is viable
regardless of what input text is being parsed. Thus it has no choice but to
offer both alternatives in the error message. We might know that after seeing
val , only "val " ~ alpha.rep(1).! could possibly succeed, but FastParse
does not know that.
To solve this issue, we can add cuts using the ~/ operator:
def alpha[T: P] = P( CharIn("a-z") )
def hascuts[T: P] = P( "val " ~/ alpha.rep(1).! | "def " ~/ alpha.rep(1).!)
val Parsed.Success("abcd", _) = parse("val abcd", hascuts(using _)).runtimeChecked
val failure = parse("val 1234", hascuts(using _)).asInstanceOf[Parsed.Failure]
val trace = failure.trace().longAggregateMsg
failure.index // 4
trace // Expected hascuts:1:1 / alpha:1:5 / [a-z]:1:5, found "1234"
19.42.scala
~/ is similar to the Sequence operator ~. The difference is that once the
parse has crossed a cut, it can no longer backtrack past that point. This
basically makes FastParse "commit" to the current branch of the alternative, so
it knows that if the parse fails the error must be within the current branch and
not in any of the other branches.
Once we have added the cut ~/, FastParse no longer backtracks to index 0 on
failure. Instead, it shows a much more precise error: at index 4, expecting
one of the small set of alphanumeric characters.
Going back to our Arithmetic parser above, we know that once we see an open (,
there is no point backtracking: an open ( must be followed by a parser,
then a closing ). FastParse doesn't know this, but we can tell it via a cut
after the open (:
Arithmetic.scala19.43.scala-defexpr[T:P]=P("("~parser~")"|number)+defexpr[T:P]=P("("~/parser~")"|number)
After we add the cut ~/, our error message is improved greatly:
> val result = fastparse.parse("(two plus ten) times seven", parser(using _))
result: fastparse.Parsed[Expr] = Parsed.Failure(Position 1:1, found "(two plus ")
> val Parsed.Failure(msg, idx, extra) = result.runtimeChecked
> println(extra.trace().msg)
Expected ("(" ~ parser | number):1:1, found "(two plus "
19.44.scala
We now get a precise error telling us that the "ten" in our string is
problematic. Using .trace().msg to get more information, we can see that the
parser expected either a "(" or a number column 11, but instead found
"ten". This tells us precisely where the problematic input is and what the
parser was expecting!
In general, adding cuts greatly helps improve the error reporting when your parser is given invalid input. Whenever you have a parser that can backtrack:
a | b | c parsersa.rep parsersa.? parsersIt is worth considering if there is a point in the parse past which backtracking
is futile. If there is, adding a cut ~/ would greatly improve the error
messages the parser produces.
Logging and Cuts are two complementary techniques that help figure out why the parser is not doing what you expect. In the inevitable situation where something is going wrong, it is good to have both in your toolbox.
We have now walked through the basics of parsing structured text into useful values: parsing arithmetic expressions into a syntax tree that we can traverse and evaluate in various ways. Note this is a minimal example of what you can do with the FastParse library. The actual FastParse documentation has a lot more details in how to write parsers, and is available online at:
FastParse can be used in any Scala project, using the following dependency:
Millmvn"com.lihaoyi::fastparse:3.1.1"
FastParse parsers are flexible and performant enough to build production parsers for complex, real-world languages like CSS, JSON, Scala, or Python. While in this chapter we parsed a strawman arithmetic syntax, there are many real-world use cases for writing your own parser: handling uncommon data formats, writing a simple query language, or parsing your existing source code for static analysis.
Next, in Chapter 20: Implementing a Programming Language we will use FastParse as the basis for writing a simple programming language interpreter.
Exercise: Extend our arithmetic parser to allow parsing chained operators like one plus two plus three plus four which should evaluate to 10. For now, ignore
operator precedence, such that one plus two times three plus four evaluates
to 13 rather than 11.
Exercise: Extend our arithmetic parser to support operator precedence with chained
operators, such that one plus two times three plus four evaluates to 11
instead of 13. One possible approach is using
precedence climbing.
Exercise: Modify the arithmetic parser to evaluate the arithmetic expression during the
parse and return a single Int once the parse completes, rather than first
parsing into an Expr structure that we need to recursively traverse and
evaluate later. What exact type out parser methods return is up to us,
depending on what we do in the .map calls.