Part III: Web Services


11 Scraping Websites
12 Working with HTTP APIs
13 Fork-Join Parallelism with Futures
14 Simple Web and API Servers
15 Querying SQL Databases

The third part of this book covers using Scala in a world of servers and clients, systems and services. We will explore using Scala both as a client and as a server, exchanging HTML and JSON over HTTP or Websockets. This part builds towards two capstone projects: a parallel web crawler and an interactive chat website, each representing common use cases you are likely to encounter using Scala in a networked, distributed environment.

11

Scraping Websites


11.1 Scraping Wikipedia
11.2 MDN Web Documentation
11.3 Scraping MDN
11.4 Putting it Together

@ import $ivy.`org.jsoup:jsoup:1.13.1`, org.jsoup._ // Jsoup version shown is an assumption

@ val doc = Jsoup.connect("http://en.wikipedia.org/").get()

@ doc.title()
res2: String = "Wikipedia, the free encyclopedia"

@ val headlines = doc.select("#mp-itn b a")
headlines: select.Elements =
<a href="/wiki/Bek_Air_Flight_2100" title="Bek Air Flight 2100">Bek Air Flight 2100</a>
<a href="/wiki/Assassination_of_..." title="Assassination of ...">2018 killing</a>
<a href="/wiki/State_of_the_..." title="State of the...">upholds a ruling</a>
...
</> 11.1.scala

Snippet 11.1: scraping Wikipedia's front-page links using the Jsoup third-party library in the Scala REPL

The user-facing interface of most networked systems is a website. In fact, often that is the only interface! This chapter will walk you through using the Jsoup library from Scala to scrape human-readable HTML pages, unlocking the ability to extract data from websites that do not provide access via an API.

Apart from scraping third-party websites, Jsoup is also a useful tool for testing the HTML user interfaces that we will encounter in Chapter 14: Simple Web and API Servers. This chapter is also a chance to get more familiar with using Java libraries from Scala, a necessary skill to take advantage of the broad and deep Java ecosystem. Lastly, it is an exercise in doing non-trivial interactive development in the Scala REPL, which is a great place to prototype and try out pieces of code that are not ready to be saved in a script or project.
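As a rough illustration of calling a Java library such as Jsoup from Scala, the sketch below parses a small, made-up HTML string instead of downloading a live page, then extracts link targets and link text with a CSS selector. It assumes Scala 2.13 or later and that the Jsoup library is already on the classpath; the object name `ScrapeSketch`, the HTML content, and the `#news` selector are all invented for this example.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object ScrapeSketch {
  // Made-up HTML standing in for a downloaded page (an assumption for this sketch)
  val html: String =
    """<div id="news">
      |  <b><a href="/wiki/First" title="First">first headline</a></b>
      |  <b><a href="/wiki/Second" title="Second">second headline</a></b>
      |</div>
      |<a href="/wiki/Other">unrelated link</a>""".stripMargin

  // Jsoup.parse builds a Document from a String, with no network access involved
  val doc = Jsoup.parse(html)

  // select returns a Java collection of elements; .asScala lets us use
  // ordinary Scala collection operations such as map on it
  val links: List[(String, String)] =
    doc.select("#news b a").asScala.toList.map(a => (a.attr("href"), a.text()))
}
```

Here `ScrapeSketch.links` contains only the two links nested under the `#news` element; the unrelated link outside it is not matched by the selector.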

12

Working with HTTP APIs


12.1 The Task: Github Issue Migrator
12.2 Creating Issues and Comments
12.3 Fetching Issues and Comments
12.4 Migrating Issues and Comments

@ requests.post(
    "https://api.github.com/repos/lihaoyi/test/issues",
    data = ujson.Obj("title" -> "hello"),
    headers = Map("Authorization" -> s"token $token")
  )
res1: requests.Response = Response(
  "https://api.github.com/repos/lihaoyi/test/issues",
  201,
  "Created",
...
</> 12.1.scala

Snippet 12.1: interacting with Github's HTTP API from the Scala REPL

HTTP APIs have become the standard for any organization that wants to let external developers integrate with their systems. This chapter will walk you through how to access HTTP APIs in Scala, building up to a simple use case: migrating Github issues from one repository to another using Github's public API.
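To make the shape of such a request more concrete, here is a minimal sketch that builds (but does not send) a request resembling the one in Snippet 12.1. It deliberately swaps in the JDK's built-in `java.net.http` classes (Java 11 or later) rather than the `requests` library used in this book, so that it runs without extra dependencies. The helper name `CreateIssueRequest.build`, its parameters, and the hand-written JSON body are assumptions made for illustration only.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.net.http.HttpRequest.BodyPublishers

object CreateIssueRequest {
  // Hypothetical helper: describes a POST that would create an issue.
  // Nothing is sent over the network; we only construct the request object.
  def build(repo: String, title: String, token: String): HttpRequest = {
    // Hand-written JSON for brevity; this is only safe for titles that
    // contain no quotes or backslashes. A JSON library would handle escaping.
    val body = s"""{"title": "$title"}"""
    HttpRequest
      .newBuilder(URI.create(s"https://api.github.com/repos/$repo/issues"))
      .header("Authorization", s"token $token") // same header as Snippet 12.1
      .header("Content-Type", "application/json")
      .POST(BodyPublishers.ofString(body))
      .build()
  }
}
```

A call such as `CreateIssueRequest.build("lihaoyi/test", "hello", token)` yields a request object whose method, URL, and headers can be inspected before anything is sent.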

We will build upon the techniques learned in this chapter in Chapter 13: Fork-Join Parallelism with Futures, where we will write a parallel web crawler that uses the Wikipedia JSON API to walk the graph of articles and the links between them.

13

Fork-Join Parallelism with Futures


13.1 Parallel Computation using Futures
13.2 N-Ways Parallelism
13.3 Parallel Web Crawling
13.4 Asynchronous Futures
13.5 Asynchronous Web Crawling

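import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration.Inf
import scala.concurrent.ExecutionContext.Implicits.global // assumption: any implicit ExecutionContext works

// fetchLinks(title) is assumed to be defined elsewhere and to return the titles an article links to
def fetchAllLinksParallel(startTitle: String, depth: Int): Set[String] = {
  var seen = Set(startTitle)    // every title discovered so far
  var current = Set(startTitle) // titles to expand in this round
  for (i <- Range(0, depth)) {
    // Fork: start one Future per title so the fetches run in parallel
    val futures = for (title <- current) yield Future{ fetchLinks(title) }
    // Join: block until every fetch in this round has completed
    val nextTitleLists = futures.map(Await.result(_, Inf))
    current = nextTitleLists.flatten.filter(!seen.contains(_))
    seen = seen ++ current
  }
  seen
}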
</> 13.1.scala

Snippet 13.1: a simple parallel web crawler implemented using Scala Futures

The Scala standard library ships with a Futures API. Futures make parallel and asynchronous programming much easier to manage than the traditional techniques of threads, locks, and callbacks.

This chapter dives into Scala's Futures: how to use them, how they work, and how you can use them to parallelize data processing workflows. It culminates in using Futures together with the techniques we learned in Chapter 12: Working with HTTP APIs to write a high-performance concurrent web crawler in a straightforward and intuitive way.
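The fork-join pattern named in this chapter's title can be sketched in a few lines: start one Future per unit of work, then wait for all of them and combine the results. The example below is only an illustration under stated assumptions: `slowSquare` is a made-up stand-in for an expensive computation, `sumOfSquaresParallel` is a hypothetical name, and the global execution context is used for simplicity.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ForkJoinSketch {
  // Made-up stand-in for an expensive computation (assumption for this sketch)
  def slowSquare(n: Int): Long = {
    Thread.sleep(50)
    n.toLong * n
  }

  def sumOfSquaresParallel(inputs: Seq[Int]): Long = {
    // Fork: each Future starts running as soon as it is created
    val futures: Seq[Future[Long]] = inputs.map(n => Future { slowSquare(n) })
    // Join: combine the Futures into one and block until all have completed
    val results: Seq[Long] = Await.result(Future.sequence(futures), Duration.Inf)
    results.sum
  }
}
```

Because the Futures are all created before any result is awaited, the individual computations can overlap in time instead of running one after another; no explicit threads, locks, or callbacks appear in the code.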

14

Simple Web and API Servers


14.1 A Minimal Webserver
14.2 Serving HTML
14.3 Forms and Dynamic Data
14.4 Dynamic Page Updates via API Requests
14.5 Real-time Updates with Websockets

object MinimalApplication extends cask.MainRoutes {
  @cask.get("/")
  def hello() = {
    "Hello World!"
  }

  @cask.post("/do-thing")
  def doThing(request: cask.Request) = {
    request.text().reverse
  }

  initialize()
}
</> 14.1.scala

Snippet 14.1: a minimal Scala web application, using the Cask web framework

Web and API servers are the backbone of internet systems. While in the last few chapters we learned to access these systems from a client's perspective, this chapter will teach you how to provide such APIs and websites from the server's perspective. We will walk through a complete example of building a simple real-time chat website serving both HTML web pages and JSON API endpoints. We will revisit this website in Chapter 15: Querying SQL Databases, where we will convert its simple in-memory datastore into a proper SQL database.

15

Querying SQL Databases


15.1 Setting up Quill and PostgreSQL
15.2 Mapping Tables to Case Classes
15.3 Querying and Updating Data
15.4 Transactions
15.5 A Database-Backed Chat Website

@ ctx.run(query[City].filter(_.population > 5000000).filter(_.countryCode == "CHN"))
res16: List[City] = List(
  City(1890, "Shanghai", "CHN", "Shanghai", 9696300),
  City(1891, "Peking", "CHN", "Peking", 7472000),
  City(1892, "Chongqing", "CHN", "Chongqing", 6351600),
  City(1893, "Tianjin", "CHN", "Tianjin", 5286800)
)
</> 15.1.scala

Snippet 15.1: using the Quill database query library from the Scala REPL

Most modern systems are backed by relational databases. This chapter will walk you through the basics of using a relational database from Scala, using the Quill query library. We will work through small self-contained examples of how to store and query data within a Postgres database, and then convert the interactive chat website we implemented in Chapter 14: Simple Web and API Servers to use a Postgres database for data storage.
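To make the shape of the query in Snippet 15.1 concrete, the sketch below applies the same two filters to an ordinary in-memory `List[City]`. This is not how Quill executes the query (Quill translates it into SQL that runs in the database), but it shows what the query computes. The field names `id`, `name`, and `district` are assumptions inferred from the output above, and the last two rows are made up so that each filter has something to exclude.

```scala
object CityFilterSketch {
  // countryCode and population appear in Snippet 15.1; the other field
  // names are assumptions made for this sketch
  case class City(id: Int, name: String, countryCode: String, district: String, population: Int)

  val cities: List[City] = List(
    City(1890, "Shanghai", "CHN", "Shanghai", 9696300),
    City(1891, "Peking", "CHN", "Peking", 7472000),
    City(1892, "Chongqing", "CHN", "Chongqing", 6351600),
    City(1893, "Tianjin", "CHN", "Tianjin", 5286800),
    // Made-up rows, included only so that each filter excludes something
    City(-1, "Smallville", "CHN", "Example", 1000),
    City(-2, "Bigtown", "XXX", "Example", 9000000)
  )

  // The same two predicates as the Quill query, evaluated in memory
  val result: List[City] =
    cities.filter(_.population > 5000000).filter(_.countryCode == "CHN")
}
```

`CityFilterSketch.result` holds the four rows shown in Snippet 15.1: the first made-up row fails the population filter and the second fails the country code filter.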