8

JSON and Binary Data Serialization


8.1 Manipulating JSON
8.2 JSON Serialization of Scala Data Types
8.3 Writing your own Generic Serialization Methods
8.4 Binary Serialization

> val output = ujson.Arr(
    ujson.Obj("hello" -> "world", "answer" -> 42),
    true
  )

> output(0)("hello") = "goodbye"

> output(0)("tags") = ujson.Arr("cool", "yay", "nice")

> println(output)
[{"hello":"goodbye","answer":42,"tags":["cool","yay","nice"]},true]
8.1.scala

Snippet 8.1: manipulating a JSON tree structure in the Scala REPL

Data serialization is an important tool in any programmer's toolbox. While variables and classes are enough to store data within a process, most data tends to outlive a single program process: whether saved to disk, exchanged between processes, or sent over the network. This chapter will cover how to serialize your Scala data structures to two common data formats - textual JSON and binary MessagePack - and how you can interact with the structured data in a variety of useful ways.

The JSON workflows we learn in this chapter will be used later in Chapter 12: Working with HTTP APIs and Chapter 14: Simple Web and API Servers, while the binary serialization techniques we learn here will be used later in Chapter 17: Multi-Process Applications.

The easiest way to work with JSON and binary serialization is through the uPickle library (and its uJson structured data type).

To begin, download the sample JSON data at:

To use uJson and uPickle in the Scala CLI REPL, add them as a library dependency via the command line flag --dep com.lihaoyi::upickle:4.4.2.
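For reference, a typical invocation might look like the following (a sketch: it assumes the Scala CLI launcher is installed as scala-cli, and the exact command name may differ on your system):

$ scala-cli repl --dep com.lihaoyi::upickle:4.4.2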

To explore the libraries, you can type ujson.<tab> and upickle.<tab> to see the listing of available operations.

8.1 Manipulating JSON

Given a JSON string, you can parse it into a ujson.Value using ujson.read:

> val jsonString = os.read(os.pwd / "ammonite-releases.json")
jsonString: String = """[
  {
    "url": "https://api.github.com/repos/.../releases/17991367",
    "assets_url": "https://api.github.com/repos/.../releases/17991367/assets",
    "upload_url": "https://uploads.github.com/repos/.../releases/17991367/assets",
...
"""

> val data = ujson.read(jsonString)
data: ujson.Value.Value = Arr(
  ArrayBuffer(
    Obj(
      Map(
        "url" -> Str("https://api.github.com/repos/.../17991367"),
        "assets_url" -> Str("https://api.github.com/repos/.../17991367/assets"),
...
8.2.scala

You can also construct JSON data structures directly using the ujson.* constructors. The constructors for primitive types like numbers, strings, and booleans are optional and can be elided:

> val small = ujson.Arr(
    ujson.Obj("hello" -> ujson.Str("world"), "answer" -> ujson.Num(42)),
    ujson.Bool(true)
  )

> val small = ujson.Arr(
    ujson.Obj("hello" -> "world", "answer" -> 42),
    true
  )
8.3.scala

These can be serialized back to a string using the ujson.write function, or written directly to a file without needing to first serialize them to a String in-memory:

> println(ujson.write(small))
[{"hello":"world","answer":42},true]

> os.write(os.pwd / "out.json", small)

> os.read(os.pwd / "out.json")
res0: String = "[{\"hello\":\"world\",\"answer\":42},true]"
8.4.scala
See example 8.1 - Create

8.1.1 The ujson.Value Data Type

A ujson.Value can be one of several types:

sealed trait Value

case class Str(value: String) extends Value
case class Obj(value: mutable.LinkedHashMap[String, Value]) extends Value
case class Arr(value: mutable.ArrayBuffer[Value]) extends Value
case class Num(value: Double) extends Value

sealed trait Bool extends Value
case object False extends Bool
case object True extends Bool
case object Null extends Value
8.5.scala

Value is a sealed trait, indicating that this set of classes and objects encompasses all the possible JSON values. You can conveniently cast a ujson.Value to a specific sub-type and get its internal data by using the .bool, .num, .arr, .obj, or .str methods:

> data.
apply     boolOpt            num       render       update
arr       formatted          numOpt    str          value
arrOpt    httpContentType    obj       strOpt       writeBytesTo
bool      isNull             objOpt    transform
8.6.scala

If you are working with your JSON as an opaque tree - indexing into it by index or key, updating elements by index or key - you can do that directly using the data(...) and data(...) = ... syntax.

8.1.2 Querying and Modifying JSON

You can look up entries in the JSON data structure using data(...) syntax, similar to how you look up entries in an Array or Map:

> data(0)
res1: ujson.Value = Obj(
  Map(
    "url" -> Str("https://api.github.com/repos/.../17991367"),
    "assets_url" -> Str("https://api.github.com/repos/.../17991367/assets"),
...

> data(0)("url")
res2: ujson.Value = Str(
  "https://api.github.com/repos/lihaoyi/Ammonite/releases/17991367"
)

> data(0)("author")("id")
res3: ujson.Value = Num(2.0607116E7)
8.7.scala

ujson.Values are mutable:

> println(small)
[{"hello":"world","answer":42},true]

> small(0)("hello") = "goodbye"

> small(0)("tags") = ujson.Arr("cool", "yay", "nice")

> println(small)
[{"hello":"goodbye","answer":42,"tags":["cool","yay","nice"]},true]
8.8.scala

8.1.3 Extracting Typed Values

If you want to assume your JSON value is of a particular type and do type-specific operations like "iterate over array", "get length of string", or "get keys of object", you need to use .arr, .str, or .obj to cast your JSON structure to the specific type and extract the value.

For example, fetching and manipulating the fields of a ujson.Obj requires use of .obj:

> small(0).obj.remove("hello")

> small.arr.append(123)

> println(small)
[{"answer":42,"tags":["cool","yay","nice"]},true,123]
8.9.scala

Extracting values as primitive types requires use of .str or .num. Note that ujson.Nums are stored as doubles. You can call .toInt to convert the ujson.Nums to integers:

> data(0)("url").str
res6: String = "https://api.github.com/repos/.../releases/17991367"

> data(0)("author")("id").num
res7: Double = 2.0607116E7

> data(0)("author")("id").num.toInt
res8: Int = 20607116
8.10.scala
See example 8.2 - Manipulate

If the JSON value is not of the str or num type we are expecting, the call throws a runtime exception.
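For example, calling .num on a value that is actually a JSON string fails at runtime, while the corresponding ...Opt methods (numOpt, strOpt, and so on) return an Option instead of throwing. The exact exception type and message shown below are illustrative and may vary between uPickle versions:

> data(0)("url").num
ujson.Value$InvalidData: Expected ujson.Num (data: ...)
  ...

> println(data(0)("url").numOpt)
None

> println(data(0)("url").strOpt)
Some(https://api.github.com/repos/lihaoyi/Ammonite/releases/17991367)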

8.1.4 Traversing JSON

To traverse the tree structure of the ujson.Value (8.1.1), we can use a recursive function. For example, here is one that recursively traverses the data we parsed earlier and collects all the ujson.Str nodes in the JSON structure.

> def traverse(v: ujson.Value): Iterable[String] = v match
    case a: ujson.Arr => a.arr.map(traverse).flatten
    case o: ujson.Obj => o.obj.values.map(traverse).flatten
    case s: ujson.Str => Seq(s.str)
    case _ => Nil

> traverse(data)
res9: Iterable[String] = ArrayBuffer(
  "https://api.github.com/repos/.../releases/17991367",
  "https://api.github.com/repos/.../releases/17991367/assets",
  "https://uploads.github.com/repos/.../releases/17991367/assets",
  "https://github.com/.../releases/tag/1.6.8",
...
8.11.scala
See example 8.3 - Traverse

8.2 JSON Serialization of Scala Data Types

Often you do not just want dynamically-typed JSON trees: rather, you usually want Scala collections or case classes, with fields of known types. Serializing values of type T is done by looking up given serializers of type ReadWriter[T]. Some of these serializers are provided by the library, while others you have to define yourself in your own code.

8.2.1 Serializing Scala Builtins

Given ReadWriters are already defined for most common Scala data types: Ints, Doubles, Strings, Seqs, Lists, Maps, tuples, etc. You can thus serialize and deserialize collections of primitives and other builtin types automatically.

> val numbers = upickle.read[Seq[Int]]("[1, 2, 3, 4]")
numbers: Seq[Int] = List(1, 2, 3, 4)

> upickle.write(numbers)
res10: String = "[1,2,3,4]"

> val tuples = upickle.read[Seq[(Int, Boolean)]](
    "[[1, true], [2, false]]"
  )
tuples: Seq[(Int, Boolean)] = List((1, true), (2, false))

> upickle.write(tuples)
res11: String = "[[1,true],[2,false]]"
8.12.scala

Serialization is done via the Typeclass Inference technique we covered in Chapter 5: Notable Scala Features, and thus can work for arbitrarily deep nested data structures:

> val input = """{"weasel": ["i", "am"], "baboon": ["i", "r"]}"""

> val parsed = upickle.read[Map[String, Seq[String]]](input)
parsed: Map[String, Seq[String]] = Map(
  "weasel" -> List("i", "am"),
  "baboon" -> List("i", "r")
)

> upickle.write(parsed)
res12: String = "{\"weasel\":[\"i\",\"am\"],\"baboon\":[\"i\",\"r\"]}"
8.13.scala

8.2.2 Serializing Case Classes

To convert a JSON structure into a case class, there are a few steps:

  1. Define a case class representing the fields and types you expect to be present in the JSON
  2. Define a given upickle.ReadWriter for that case class
  3. Use upickle.read to deserialize the JSON structure.

For example, the author value in the JSON data we saw earlier has the following fields:

> println(ujson.write(data(0)("author"), indent = 4))
{
    "login": "Ammonite-Bot",
    "id": 20607116,
    "node_id": "MDQ6VXNlcjIwNjA3MTE2",
    "gravatar_id": "",
    "type": "User",
    "site_admin": false,
    ...
}
8.14.scala

Which can be (partially) modeled as the following case class:

> case class Author(login: String, id: Int, site_admin: Boolean) derives upickle.ReadWriter

For every case class you want to serialize, you have to define a contextual upickle.ReadWriter to mark it as serializable. With Scala 3 we can use the derives keyword, which generates a given ReadWriter that serializes and deserializes the case class with its field names mapped to corresponding JSON object keys, but you could also do it manually via a Mapped Serializer (8.2.3) if you need more flexibility or customization.

> val author = upickle.read[Author](data(0)("author")) // read uJson
author: Author = Author(
  login = "Ammonite-Bot",
  id = 20607116,
  site_admin = false
)

> author.login
res14: String = "Ammonite-Bot"

> val author2 = upickle.read[Author](  // read directly from a String
    """{"login": "lihaoyi", "id": 313373, "site_admin": true}"""
  )
author2: Author = Author(login = "lihaoyi", id = 313373, site_admin = true)

> upickle.write(author2)
res15: String = "{\"login\":\"lihaoyi\",\"id\":313373,\"site_admin\":true}"
8.15.scala

Once you have defined a ReadWriter[Author], you can then also serialize and de-serialize Authors as part of any larger data structure:

> upickle.read[Map[String, Author]]("""{
    "haoyi": {"login": "lihaoyi", "id": 1337, "site_admin": true},
    "bot": {"login": "ammonite-bot", "id": 31337, "site_admin": false}
  }""")
res16: Map[String, Author] = Map(
  "haoyi" -> Author(login = "lihaoyi", id = 1337, site_admin = true),
  "bot" -> Author(login = "ammonite-bot", id = 31337, site_admin = false)
)
8.16.scala

In general, you can serialize any arbitrarily nested tree of case classes, collections, and primitives, as long as every value within that structure is itself serializable.
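For instance, a hypothetical Release case class (an illustration, not part of the GitHub data model shown earlier) that nests an Author and a sequence of strings needs nothing beyond its own derives clause to become serializable:

> case class Release(tag_name: String, author: Author, assets: Seq[String]) derives upickle.ReadWriter

> val release = Release("1.6.8", Author("lihaoyi", 313373, true), Seq("2.12", "2.13"))

> println(upickle.write(release))
{"tag_name":"1.6.8","author":{"login":"lihaoyi","id":313373,"site_admin":true},"assets":["2.12","2.13"]}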

8.2.3 Mapped Serializers

uPickle allows you to easily construct given serializers for new types based on existing ones. For example, by default uPickle does not have support for serializing os.Paths:

> upickle.write(os.pwd)
-- [E172] Type Error: ----------------------------------------------------------
1 |upickle.write(os.pwd)
  |                     ^
  |No given instance of type upickle.Writer[os.Path] was found for a
  |context parameter of method write in trait Api.
8.17.scala

The compiler will try to find compatible givens and suggest you import them. However, because os.Paths can be trivially converted to and from Strings, we can use the bimap function to construct a ReadWriter[os.Path] from the existing ReadWriter[String]:

> given pathRw: upickle.ReadWriter[os.Path] =
    upickle.readwriter[String].bimap[os.Path](
      p => ... /* convert os.Path to String */,
      s => ... /* convert String to os.Path */
    )
8.18.scala

bimap needs you to specify what your existing serializer is (here String) and what new type you want to serialize (os.Path), and provide functions to convert back and forth between the two types. In this case, we could use the following converters:

> given pathRw: upickle.ReadWriter[os.Path] =
    upickle.readwriter[String].bimap[os.Path](
      p => p.toString,
      s => os.Path(s)
    )
8.19.scala

With this given pathRw defined, we can now serialize and deserialize os.Paths. This applies recursively, so any case classes or collections containing os.Paths can now be serialized as well:

> val str = upickle.write(os.pwd)
str: String = "\"/Users/lihaoyi/test\""

> upickle.read[os.Path](str)
res17: os.Path = /Users/lihaoyi/test

> val str2 = upickle.write(Array(os.pwd, os.home, os.root))
str2: String = "[\"/Users/lihaoyi/test\",\"/Users/lihaoyi\",\"/\"]"

> upickle.read[Array[os.Path]](str2)
res18: Array[os.Path] = Array(
  /Users/lihaoyi/test, /Users/lihaoyi, /
)
8.20.scala

If you want more flexibility in how your JSON is deserialized into your data type, you can use upickle.readwriter[ujson.Value].bimap to work with the raw ujson.Values:

> given thingRw: upickle.ReadWriter[Thing] =
    upickle.readwriter[ujson.Value].bimap[Thing](
      p => ... /* convert a Thing to ujson.Value */,
      s => ... /* convert a ujson.Value to Thing */
    )
8.21.scala

You then have full freedom in how you want to convert a ujson.Value into a Thing, and how you want to serialize the Thing back into a ujson.Value.
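As a concrete sketch, suppose Thing is a small case class with a single numeric field (a made-up type, purely for illustration); the two conversion functions could then be written as follows:

> case class Thing(value: Int)

> given thingRw: upickle.ReadWriter[Thing] =
    upickle.readwriter[ujson.Value].bimap[Thing](
      thing => ujson.Obj("value" -> thing.value), // Thing -> ujson.Value
      json => Thing(json("value").num.toInt)      // ujson.Value -> Thing
    )

> println(upickle.write(Thing(42)))
{"value":42}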

8.3 Writing your own Generic Serialization Methods

You can define your own methods that are able to serialize (or deserialize) values of various types by making them generic with a context bound of Reader, Writer, or ReadWriter.

8.3.1 uPickle Context Bounds

The key context bounds relevant to the uPickle serialization library are:

  • def foo[T: upickle.Reader]: allows use of upickle.read[T]

  • def foo[T: upickle.Writer]: allows use of upickle.write[T]

  • def foo[T: upickle.ReadWriter]: allows use of both upickle.read[T] and upickle.write[T]

As we discussed in Chapter 5: Notable Scala Features, the context bound syntax above is equivalent to the following context parameter:

def foo[T](using reader: upickle.Reader[T])

This allows the compiler to infer the parameter if it is not explicitly provided, and saves us the inconvenience of having to pass serializers around manually.

8.3.2 Generic Serialization Methods

Using context bounds, we can write generic methods that can operate on any input type, as long as that type is JSON serializable. For example, if we want to write a method that serializes a value and prints out the JSON to the console, we can do that as follows:

> case class Asset(id: Int, name: String) derives upickle.ReadWriter

> def myPrintJson[T: upickle.Writer](t: T) = println(upickle.write(t))
8.22.scala
> myPrintJson(Asset(1, "hello"))
{"id":1,"name":"hello"}

> myPrintJson(Seq(1, 2, 3))
[1,2,3]

> myPrintJson(Seq(Asset(1, "hello"), Asset(2, "goodbye")))
[{"id":1,"name":"hello"},{"id":2,"name":"goodbye"}]
8.23.scala

If we want to write a method that reads input from the console and parses it to JSON of a particular type, we can do that as well:

> def myReadJson[T: upickle.Reader](): T =
    print("Enter some JSON: ")
    upickle.read[T](Console.in.readLine())

> myReadJson[Seq[Int]]()
Enter some JSON: [1, 2, 3, 4, 5]
res19: Seq[Int] = List(1, 2, 3, 4, 5)

> myReadJson[Author]()
Enter some JSON: {"login": "Haoyi", "id": 1337, "site_admin": true}
res20: Author = Author("Haoyi", 1337, true)
8.24.scala

Note that when calling myReadJson(), we have to pass in the type parameter [Seq[Int]] or [Author] explicitly, whereas when calling myPrintJson() the compiler can infer the type parameter from the type of the argument: Asset(1, "hello"), Seq(1, 2, 3), etc.

In general, we do not need a context bound when we are writing code that operates on a single concrete type, as the compiler can already infer the correct concrete serializer. We only need a context bound if the method is generic, to indicate to the compiler that it should be callable only with concrete types that have a given Reader[T] or Writer[T] available.
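For example, a helper that only ever serializes the concrete Author type defined earlier needs no context bound, since the compiler already knows which serializer to use (printAuthorJson is just an illustrative name):

> def printAuthorJson(author: Author) = println(upickle.write(author))

> printAuthorJson(Author("lihaoyi", 313373, true))
{"login":"lihaoyi","id":313373,"site_admin":true}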

We will be using this ability to write generic methods dealing with serialization and de-serialization to write generic RPC (Remote Procedure Call) logic in Chapter 17: Multi-Process Applications.

8.3.3 Why Context Bounds?

The advantage of using context bounds over other ways of serializing data types is that they allow the serialization logic to be inferred statically. That has three consequences:

8.3.3.1 Performance with Convenience

Because uPickle's serializers are resolved at compile time using Scala's givens, you get the convenience of reflection-based frameworks with the performance of hand-written serialization code.

  • Unlike hand-written serializers, the compiler does most of the busy-work constructing the serialization logic for you. You only need to teach it how to serialize and deserialize your basic primitives and collections and it will know how to serialize all combinations of these without additional boilerplate.

  • Unlike reflection-based serializers, uPickle's serializers are fast: they avoid runtime reflection, which has significant overhead in most languages, and can be optimized by the compiler to generate lean and efficient code to execute at run time.

8.3.3.2 Compile-Time Error Reporting

The compiler is able to reject non-serializable data types early during compilation, rather than blowing up later after the code has been deployed to production. For example, trying to serialize the standard output stream System.out results in a compile error telling us that what we are doing is invalid before the code even runs:

> myPrintJson(System.out)
-- [E172] Type Error: ----------------------------------------------------------
1 |myPrintJson(System.out)
  |                       ^
  |No given instance of type upickle.Writer[java.io.PrintStream] was
  |found for a context parameter of method myPrintJson.
8.25.scala

8.3.3.3 Security

Because every upickle.read call has a statically-specified type, we will never deserialize a value of unexpected type: this rules out a class of security issues where an attacker can force your code to accidentally deserialize an unsafe object able to compromise your application.

For example, if we accidentally try to deserialize a sun.misc.Unsafe instance from JSON, we get an immediate compile time error:

> myReadJson[sun.misc.Unsafe]()
-- [E172] Type Error: ----------------------------------------------------------
1 |myReadJson[sun.misc.Unsafe]()
  |                             ^
  |No given instance of type upickle.Reader[sun.misc.Unsafe] was found
  |for a context parameter of method myReadJson.
8.26.scala

In general, the Scala language allows you to check the serializability of your data structures at compile time, avoiding an entire class of bugs and security vulnerabilities. Rather than finding your serialization logic crashing or misbehaving in production due to an unexpected value appearing in your data structure, the Scala compiler surfaces these issues at compile time, making them much easier to diagnose and fix.

8.4 Binary Serialization

Apart from serializing Scala data types as JSON, uPickle also supports serializing them to MessagePack binary blobs. These are often more compact than JSON, especially for binary data that would otherwise need to be Base64 encoded to fit in a JSON string, at the expense of losing human readability.

8.4.1 writeBinary and readBinary

Serializing data structures to binary blobs is done via the writeBinary and readBinary methods:

> val blob = upickle.writeBinary(Author("haoyi", 31337, true))
blob: Array[Byte] = Array(-125, -91, 108, 111, ...)

> upickle.readBinary[Author](blob)
res21: Author = Author(login = "haoyi", id = 31337, site_admin = true)
8.27.scala

writeBinary and readBinary work on any data type that can be converted to JSON. The following example demonstrates serialization and de-serialization of Map[Int, List[Author]]s:

> val data = Map(
    1 -> Nil,
    2 -> List(Author("haoyi", 1337, true), Author("lihaoyi", 31337, true))
  )

> val blob2 = upickle.writeBinary(data)
blob2: Array[Byte] = Array(-126, 1, -112, 2, -110, ...)

> upickle.readBinary[Map[Int, List[Author]]](blob2)
res22: Map[Int, List[Author]] = Map(
  1 -> List(),
  2 -> List(
    Author(login = "haoyi", id = 1337, site_admin = true),
    Author(login = "lihaoyi", id = 31337, site_admin = true)
  )
)
8.28.scala

Unlike JSON, MessagePack binary blobs are not human readable by default: Array(-110, -110, 1, -112, ...) is not something you can quickly skim to see what it contains! If you are working with a third-party server returning MessagePack binaries with an unknown or unusual structure, this can make it difficult to figure out what a blob contains so you can deserialize it properly.

8.4.2 MessagePack Structures

To help work with MessagePack blobs of unknown structure, uPickle comes with the uPack library, which lets you read blobs into an in-memory upack.Msg structure (similar to ujson.Value) that is easy to inspect:

> upack.read(blob)
res23: upack.Msg = Obj(
  Map(Str("login") -> Str("haoyi"), Str("id") -> Int32(31337), Str("site_admin") -> True)
)
8.29.scala
> upack.read(blob2)
res24: upack.Msg = Obj(
  Map(
    Int32(1) -> Arr(ArrayBuffer()),
    Int32(2) -> Arr(
      ArrayBuffer(
        Obj(
          Map(
            Str("login") -> Str("haoyi"),
            Str("id") -> Int32(1337),
            Str("site_admin") -> True
          )
        ),
...
8.30.scala

Reading the binary blobs into upack.Msgs is a great debugging tool, and can help you figure out what is going on under the hood if your writeBinary/readBinary serialization is misbehaving.

Like ujson.Values, you can manually construct upack.Msgs from scratch using their constituent parts: upack.Arr, upack.Obj, upack.Bool, etc. This can be useful if you need to interact with third-party systems and need full control over the MessagePack messages you are sending:

> val msg = upack.Obj(
    upack.Str("login") -> upack.Str("haoyi"),
    upack.Str("id") -> upack.Int32(31337),
    upack.Str("site_admin") -> upack.True
  )

> val blob3 = upack.write(msg)
blob3: Array[Byte] = Array(-125, -91, 108, 111, ...)

> val deserialized = upickle.readBinary[Author](blob3)
deserialized: Author = Author(
  login = "haoyi",
  id = 31337,
  site_admin = true
)
8.31.scala

8.5 Conclusion

Data serialization is one of the core tools in any programmer's toolbox. This chapter introduces you to the basics of working with data serialization in a Scala program, using the uPickle library. uPickle focuses on providing convenient serialization for built-in data structures and user-defined case classes, though with Mapped Serializers (8.2.3) you can extend it yourself to support any arbitrary data type. For more details on using the uPickle serialization library to work with JSON or MessagePack data, you can refer to the reference documentation:

uPickle is also available for you to use in projects built using Mill or other build tools at the following coordinates:

Millmvn"com.lihaoyi::upickle:4.4.2"

We will use the JSON APIs we learned in this chapter later in Chapter 12: Working with HTTP APIs, Chapter 14: Simple Web and API Servers, and use the MessagePack binary serialization techniques in Chapter 17: Multi-Process Applications.

There are many other JSON or binary serialization libraries in the Scala ecosystem. For simplicity the rest of this book will be using uPickle, but you can try these other libraries if you wish:

This flow chart covers most of the common workflows working with textual JSON and binary MessagePack data in Scala:

[Flow chart omitted. It shows the conversions: String <-> ujson.Value via ujson.read / ujson.write; String <-> case class via upickle.read / upickle.write; ujson.Value -> case class via upickle.read and case class -> ujson.Value via upickle.writeJs; case class <-> Array[Byte] via upickle.writeBinary / upickle.readBinary; Array[Byte] <-> upack.Msg via upack.read / upack.write. Strings, ujson.Values, case classes (via upickle.stream), Array[Byte]s, and upack.Msgs are all Writable, and can be passed to os.write, requests.post, and similar APIs.]

Exercise: Given a normal (non-case) class Foo(val i: Int, val s: String) with two public fields, use the bimap method we saw earlier to define a given ReadWriter for it, allowing instances to be serialized to JSON objects {"i": ..., "s": ...}.

See example 8.7 - BiMapClass

Exercise: Often JSON data structures have fields that you do not care about, which make skimming through the JSON verbose and tedious: e.g. the ammonite-releases.json we receive from GitHub comes loaded with lots of verbose and often-not-very-useful URLs, shown below.

"followers_url": "https://api.github.com/users/Ammonite-Bot/followers",
"following_url": "https://api.github.com/users/Ammonite-Bot/following{/other_user}",
"gists_url": "https://api.github.com/users/Ammonite-Bot/gists{/gist_id}",
8.32.json

Write a method that takes a ujson.Value, and removes any values which are strings beginning with "https://". You can do so either in a mutable or immutable style: either modifying the ujson.Value in place, or constructing and returning a new ujson.Value with those values elided.

See example 8.8 - TraverseFilter
Discuss Chapter 8 online at https://www.handsonscala.com/discuss/8