7

Files and Subprocesses


7.1 Paths
7.2 Filesystem Operations
7.3 Folder Syncing
7.4 Simple Subprocess Invocations
7.5 Interactive and Streaming Subprocesses

> os.walk(os.pwd).filter(os.isFile).map(p => (os.size(p), p)).sortBy(-_._1).take(5)
res0: IndexedSeq[(Long, os.Path)] = ArraySeq(
  (6340270L, /Users/lihaoyi/test/post/Reimagining/GithubHistory.gif),
  (6008395L, /Users/lihaoyi/test/post/SmartNation/routes.json),
  (5499949L, /Users/lihaoyi/test/post/slides/Why-You-Might-Like-Scala.js.pdf),
  (5461595L, /Users/lihaoyi/test/post/slides/Cross-Platform-Development-in-...),
  (4576936L, /Users/lihaoyi/test/post/Reimagining/FluentSearch.gif)
)
7.1.scala

Snippet 7.1: a short Scala code snippet to find the five largest files in a directory tree

Working with files and subprocesses is one of the most common things you do in programming: from the Bash shell, to Python or Ruby scripts, to large applications written in a compiled language. At some point everyone will have to write to a file or talk to a subprocess. This chapter will walk you through how to perform basic file and subprocess operations in Scala.

This chapter finishes with two small projects: building a simple file synchronizer, and building a streaming subprocess pipeline. These projects will form the basis for Chapter 17: Multi-Process Applications and Chapter 18: Building a Real-time File Synchronizer.

Throughout this chapter we will be using the OS-Lib library. All functionality within this library comes from the os package.

OS-Lib needs to be added as a library dependency to use with the Scala CLI REPL, which is handled by a command line flag --dep com.lihaoyi::os-lib:0.11.6.

We will be using the following sample folder full of files for these examples:

  • sample-blog.zip (https://github.com/handsonscala/handsonscala/tree/v2/resources/7)

7.1 Paths

Most operations we will be working with involve filesystem paths: we read data from a path, write data to a path, copy files from one path to another, or list a folder path to see what files are inside. Paths are represented by the os.Path type, with separate os.RelPath and os.SubPath types to model relative paths and sub-paths respectively.

The separation between the different kinds of paths allows additional compile-time checks that help avoid common mistakes when working with files. This helps prevent both correctness bugs as well as security issues such as directory traversal attacks.

There are three paths that are built in to the os package: os.pwd, os.root, os.home. These refer to your process working directory, filesystem root, and user home folder respectively. You can also get the sequence of path segments using .segments, or the last segment of the path (usually the file or folder name), using .last.

> os.pwd
res1: os.Path = /Users/lihaoyi/test

> os.root
res2: os.Path = /

> os.home
res3: os.Path = /Users/lihaoyi
7.2.scala
> os.home.segments.toList
res4: List[String] = List(
  "Users", "lihaoyi"
)

> os.home.last
res5: String = "lihaoyi"
7.3.scala

7.1.1 Constructing Paths

To create a new path based on an existing path, you can use the / operator provided by the library to append path segments one at a time. If you pass a literal string, you can also use / inside the string as a separator, e.g. appending multiple path segments via the string "Github/blog" is also allowed:

> os.home / "Github" / "blog"
res6: os.Path = /Users/lihaoyi/Github/blog
7.4.scala
> os.home / "Github/blog"
res7: os.Path = /Users/lihaoyi/Github/blog
7.5.scala

If you want to treat the string as a relative path or subpath, you need to use the os.RelPath (7.1.2) or os.SubPath (7.1.3) constructor in order to construct a relative path of more than one segment. This helps avoid confusion between working with individual path segments as Strings and working with more general relative paths as os.SubPaths or os.RelPaths.

The special os.up path segment lets you move up one level:

> os.pwd / os.up
res8: os.Path = /Users/lihaoyi
7.6.scala
> os.pwd / os.up / os.up
res9: os.Path = /Users
7.7.scala

os.Path(...) is used to parse os.Paths from strings. This is useful for paths coming in from elsewhere, e.g. read from a file or command-line arguments. Note that by default the os.Path(...) constructor only allows absolute paths. If you want to parse a string that is a relative path into an absolute os.Path, you have to provide a base path from which that relative path begins.

> os.Path("/Users/lihaoyi")
res10: os.Path = /Users/lihaoyi

> os.Path("post")
java.lang.IllegalArgumentException:
requirement failed: post is not an
absolute path
7.8.scala
> os.Path("post", base = os.pwd)
res12: os.Path =
  /Users/lihaoyi/test/post

> os.Path("../mill", base = os.pwd)
res13: os.Path = /Users/lihaoyi/mill
7.9.scala

7.1.2 Relative Paths

To work with relative paths on disk, you can use os.RelPath:

> os.RelPath("post")
res14: os.RelPath = post
7.10.scala
> os.RelPath("../hello/world")
res15: os.RelPath = ../hello/world
7.11.scala

This helps ensure you do not mix up os.Paths which are always absolute and os.RelPaths which are always relative. To combine a relative path and an absolute path, you can use the same / operator. Relative paths themselves can also be combined using /, in any order, but trying to combine an absolute os.Path on the right side of a relative os.RelPath is a compile error:

> val relHello = os.RelPath("../hello")
relHello: os.RelPath = ../hello

> os.home / relHello
res16: os.Path = /Users/hello

> relHello / os.RelPath("post")
res17: relHello.ThisType = ../hello/post
7.12.scala
> relHello / os.pwd
-- [E007] Type Mismatch Error: ---------
1 |relHello / os.pwd
  |           ^^^^^^
  |           Found:    os.Path
  |           Required: os.PathChunk
1 error found
7.13.scala

If you want the relative path between two absolute paths, you can use .relativeTo:

> val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = /Users/lihaoyi/Github

> val usersPath = os.Path("/Users")
usersPath: os.Path = /Users

> githubPath.relativeTo(usersPath)
res18: os.RelPath = lihaoyi/Github

> usersPath.relativeTo(githubPath)
res19: os.RelPath = ../..
7.14.scala
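
As a small illustration of how .relativeTo and the / operator fit together, the sketch below re-roots a path from one base folder onto another. The rebase helper and the folder names are hypothetical, introduced here only for illustration; the dependency version is the one given by the --dep flag earlier in this chapter.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Hypothetical helper (not part of OS-Lib): re-root `path` from `oldBase` onto `newBase`.
// `relativeTo` yields an os.RelPath, which `/` can append to any absolute os.Path.
def rebase(path: os.Path, oldBase: os.Path, newBase: os.Path): os.Path =
  newBase / path.relativeTo(oldBase)

val original = os.root / "Users" / "lihaoyi" / "Github" / "blog"
val moved = rebase(original, os.root / "Users" / "lihaoyi", os.root / "backup")
// moved is the absolute path <root>/backup/Github/blog
```

Because the intermediate value is an os.RelPath, it may contain leading .. segments when path is not inside oldBase; the next section shows how os.SubPath rules that out.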

7.1.3 Sub Paths

os.SubPaths are a special case of relative paths, where there cannot be any .. segments at the start. Similar to relative paths, sub-paths can be created between absolute os.Paths using .subRelativeTo.

> os.SubPath("post")
res20: os.SubPath = post

> os.SubPath("../hello/world")
java.lang.IllegalArgumentException:
ups must be zero, but it is 1
in ../hello/world
7.15.scala
> val p1 = os.Path("/Users/lihaoyi/Github")
p1: os.Path = /Users/lihaoyi/Github

> val p2 = os.Path("/Users")
p2: os.Path = /Users

> p1.subRelativeTo(p2)
res22: os.SubPath = lihaoyi/Github
7.16.scala

os.SubPath is useful for cases where you have a relative path that should always be "within" a particular base folder. This can help rule out a whole class of directory traversal attacks where an unexpected .. in a relative path allows the attacker to read your /etc/passwd or some other sensitive files.
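
To make the directory traversal point concrete, here is a minimal sketch of resolving an untrusted string inside a base folder. The resolveInside helper and the /srv/uploads folder are hypothetical names; the sketch relies only on the behavior shown above, where os.SubPath(...) throws an IllegalArgumentException for a string that would escape via .. segments.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Hypothetical helper (not part of OS-Lib): resolve `untrusted` inside `base`,
// returning None if the string is not a valid sub-path (e.g. it starts with "..").
def resolveInside(base: os.Path, untrusted: String): Option[os.Path] =
  try Some(base / os.SubPath(untrusted))
  catch case _: IllegalArgumentException => None

val uploads = os.root / "srv" / "uploads"              // assumed base folder
val accepted = resolveInside(uploads, "images/cat.png") // stays inside uploads
val rejected = resolveInside(uploads, "../../etc/passwd") // None: escapes the base
```

Returning an Option rather than letting the exception propagate is just one design choice; a web server might instead translate the failure into a 400 response.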

7.2 Filesystem Operations

This section will walk through the most commonly used filesystem operations that we will use throughout the rest of this book.

Note that most filesystem operations only accept absolute os.Paths. os.SubPaths or os.RelPaths must first be converted into an absolute os.Path before being used in such an operation. Many of these commands have variants that let you configure the operation, e.g. os.read lets you pass in an offset to read from and a count of characters to read, os.makeDir has os.makeDir.all to recursively create necessary folders, os.remove has os.remove.all to recursively remove a folder and its contents, and so on.

7.2.1 Queries

7.2.1.1 os.list

> os.list(os.pwd)
res23: IndexedSeq[os.Path] =
ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post
)
7.17.scala

os.list allows you to list out the direct children of a folder. Note that the order of results returned from os.list may vary depending on the filesystem your computer is using. If you need the results in a particular order, you should sort them yourself using .sorted on the returned collection.

7.2.1.2 os.walk

> os.walk(os.pwd)
res24: IndexedSeq[os.Path] =
ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/post/Interview.md,
  /Users/lihaoyi/test/post/Hub,
  /Users/lihaoyi/test/post/Hub/icon.png,
...
7.18.scala

os.walk does a recursive listing. This may take a significant amount of time and memory for large folder trees; you can use os.walk.stream (7.2.4) to process the results in a streaming fashion.

7.2.1.3 os.stat

> os.stat(os.pwd / ".gitignore")
res25: os.StatInfo = StatInfo(
  size = 178L,
  mtime = 2025-10-06T17:30:43.400592225Z,
  ctime = 2025-10-06T17:30:43Z,
  atime = 2025-10-07T21:06:34.799575364Z,
  fileType = File
)
7.19.scala

os.stat fetches the filesystem metadata for an individual file or folder. You can also perform more specific queries such as os.isFile, os.isDir, os.mtime, os.size, etc. if you have a particular question you want to answer.

7.2.2 Actions

7.2.2.1 os.read, os.write

> os.write(os.pwd / "new.md", "Hi")

> os.list(os.pwd)
res26: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/new.md,
)

> os.read(os.pwd / "new.md")
res27: String = "Hi"
7.20.scala

os.write writes to a file. You can write any datatype implementing the Writable interface: Strings, Array[Byte]s, or even the ujson.Values we will meet in Chapter 8: JSON and Binary Data Serialization and Scalatags HTML templates we will meet in Chapter 9: Self-Contained Scala Scripts.

os.read reads a file as a String. You also have os.read.lines to read the lines of a file as an IndexedSeq[String], and os.read.bytes to read a file as an Array[Byte].
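
The read and write variants can be sketched together as follows. The readThreeWays function and the notes.txt file name are hypothetical; the file is written into a fresh folder from os.temp.dir() so that the example does not touch the sample blog folder.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Writes a small file into a fresh temporary folder, then reads it back three ways.
def readThreeWays(): (String, IndexedSeq[String], Array[Byte]) =
  val file = os.temp.dir() / "notes.txt"
  os.write(file, "first\nsecond\n")
  os.write.append(file, "third\n") // append to the existing file instead of replacing it
  (os.read(file), os.read.lines(file), os.read.bytes(file))

val result = readThreeWays()
// result._1 is the whole file as one String
// result._2 has one entry per line, without the trailing newlines
// result._3 is the raw bytes of the file
```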

7.2.2.2 os.move

> os.move(
    os.pwd / "new.md",
    os.pwd / "newer.md"
  )

> os.list(os.pwd)
res28: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/newer.md
)
7.21.scala

7.2.2.3 os.copy

> os.copy(
    os.pwd / "newer.md",
    os.pwd / "newer-2.md"
  )

> os.list(os.pwd)
res29: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/newer-2.md,
  /Users/lihaoyi/test/newer.md
)
7.22.scala

7.2.2.4 os.remove

> os.remove(os.pwd / "newer.md")
res30: Boolean = true

> os.list(os.pwd)
res31: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/newer-2.md
)
7.23.scala

7.2.2.5 os.makeDir

> os.makeDir(os.pwd / "new-folder")

> os.list(os.pwd)
res32: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/.gitignore,
  /Users/lihaoyi/test/post,
  /Users/lihaoyi/test/new-folder,
  /Users/lihaoyi/test/newer-2.md
)
7.24.scala

7.2.3 Combining Operations

Many filesystem operations return collections of results: the files in a folder, the lines of a file, and so on. You can operate on these collections the same way we did in Chapter 4: Scala Collections, chaining together calls to .map/.filter/etc. to get the result we want.

For example, if we wanted to walk a folder recursively and find the largest files within it, we can do that as follows:

> os.walk(os.pwd).filter(os.isFile).map(p => (os.size(p), p)).sortBy(-_._1).take(5)
res33: IndexedSeq[(Long, os.Path)] = ArraySeq(
  (6340270L, /Users/lihaoyi/test/post/Reimagining/GithubHistory.gif),
  (6008395L, /Users/lihaoyi/test/post/SmartNation/routes.json),
  (5499949L, /Users/lihaoyi/test/post/slides/Why-You-Might-Like-Scala.js.pdf),
  (5461595L, /Users/lihaoyi/test/post/slides/Cross-Platform-Development-in-...),
  (4576936L, /Users/lihaoyi/test/post/Reimagining/FluentSearch.gif)
)
7.25.scala
See example 7.1 - LargestFiles

Note that this snippet loads all files into an in-memory data structure before filtering and transforming it. To avoid loading everything into memory, most such operations also provide a .stream variant.

7.2.4 Streaming

The .stream version of filesystem operations allows you to process their output in a streaming fashion, one element at a time, without holding all the output in memory at once. This lets you process large results without running out of memory.

For example, you can use os.read.lines.stream to stream the lines of a file, or os.walk.stream to stream the recursive contents of a folder:

> os.read.lines.stream(os.pwd / ".gitignore").foreach(println)
target/
*.iml
.idea
.settings
...

> os.walk.stream(os.pwd).foreach(println)
/Users/lihaoyi/test/.gitignore
/Users/lihaoyi/test/post
/Users/lihaoyi/test/post/Programming Interview.md
/Users/lihaoyi/test/post/Hub
/Users/lihaoyi/test/post/Hub/Search.png
7.26.scala

*.stream operations return a Generator type. These are similar to iterators, except they ensure that resources are always released after processing. Most collection operations like .foreach, .map, .filter, .toArray, etc. are available on Generators.

7.2.5 Transforming Streams

Like the Views we saw in Chapter 4: Scala Collections, Generators allow you to combine several transformations like .map, .filter, etc. into one traversal that runs when you finally call a method like .foreach or .toList. Here's an example that combines os.read.lines.stream, filter, and map to list out the lines in .gitignore which start with a "." character, with the "." stripped:

> os.read.lines.stream(os.pwd / ".gitignore").filter(_.startsWith(".")).map(_.drop(1)).toList
res34: List[String] = List(
  "idea",
  "settings",
  "classpath",
  ...
7.27.scala
> os.read.lines.stream(os.pwd / ".gitignore").collect{case s".$str" => str}.toList
res35: List[String] = List(
  "idea",
  "settings",
  "classpath",
  ...
7.28.scala

Because we are using .stream, the reading/filtering/mapping occurs one line at a time. We do not accumulate the results in-memory until the end when we call .toList: this can be useful when the data being read is larger than can comfortably fit in memory all at once.
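
The same idea applies to os.walk.stream. As a rough sketch, the function below finds the single largest file in a folder tree while keeping only the best candidate seen so far, instead of building a collection of every (size, path) pair. The largestFile name is hypothetical, and the fold is just one way to express "keep the maximum" over a Generator.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Streams the folder tree and remembers only the largest file seen so far.
// Returns None when the tree contains no files.
def largestFile(root: os.Path): Option[(Long, os.Path)] =
  os.walk.stream(root)
    .filter(p => os.isFile(p))
    .map(p => (os.size(p), p))
    .foldLeft(Option.empty[(Long, os.Path)]) {
      case (Some(best), current) if best._1 >= current._1 => Some(best)
      case (_, current) => Some(current)
    }
```

Calling largestFile(os.pwd) traverses the tree once; nothing is read from disk until foldLeft starts consuming the Generator.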

Now that we have gone through the basic filesystem operations, let's examine a use case.

7.3 Folder Syncing

For this use case, we will write a program that will take a source and destination folder, and efficiently update the destination folder to look like the source folder: adding any files that may be missing, and modifying files where necessary to make sure they have the same contents. For simplicity, we will ignore deletions and symbolic links. We will also assume that naively deleting the entire destination directory and re-copying the source over is too inefficient, and that we want to synchronize the two folders on a per-file/folder basis.

Let us start by defining the method signature of our folder synchronizer:

> def sync(src: os.Path, dest: os.Path) =
         { /* nothing yet */ }
7.29.scala

7.3.1 Walking the Filesystem

The first thing we need to do is recursively walk all contents of the source folder, so we can see what we have on disk that we might need to sync:

 > def sync(src: os.Path, dest: os.Path) =
-    { /* nothing yet */ }
+    for srcSubPath <- os.walk(src) do
+      val subPath = srcSubPath.subRelativeTo(src)
+      val destSubPath = dest / subPath
+      println((os.isDir(srcSubPath), os.isDir(destSubPath)))

 def sync(src: os.Path, dest: os.Path): Unit
7.30.scala

Running this on our source folder with files and destination folder without files, we get the following:

> sync(os.pwd / "post", os.pwd / "post-copy")
(false, false)
(true, false)
(false, false)
(false, false)
...
7.31.scala

For now, the destination folder doesn't exist, so isDir returns false on all of the paths.

7.3.2 Copying New Files Over

The next step is to start syncing over files. We walk over the src and the corresponding paths in dest together, and if they differ, copy the source sub-path over the destination sub-path:

 > def sync(src: os.Path, dest: os.Path) =
     for srcSubPath <- os.walk(src) do
       val subPath = srcSubPath.subRelativeTo(src)
       val destSubPath = dest / subPath
-      println((os.isDir(srcSubPath), os.isDir(destSubPath)))
+      (os.isDir(srcSubPath), os.isDir(destSubPath)) match
+        case (false, true) | (true, false) =>
+          os.copy.over(srcSubPath, destSubPath, createFolders = true)
+
+        case _ => // do nothing

 def sync(src: os.Path, dest: os.Path): Unit
7.32.scala

Since isDir returns false both when the path refers to a file and when the path is empty (nothing exists there), the above snippet uses .copy.over to delete the destination path and copy over the contents of the source in the following circumstances:

  • The source path contains a file and the destination path contains a folder
  • The source path contains a folder and the destination path contains a file
  • The source path contains a folder and the destination path is empty

There are three cases the above code does not support:

  • If the source path is empty and the destination path contains a folder. This is a "delete" which we will ignore for now; supporting this is left as an exercise for the reader

  • If the source path is a folder and the destination path is a folder. In this case doing nothing is fine: os.walk will enter the source path folder and process all the files within it recursively

  • If the source path is a file and the destination path is a file, but they have different contents

We will handle the last case next.

7.3.3 Updating Files

The last case we need to support is when both source and destination paths contain files, but with different contents. We can do that by handling the (false, false) case in our pattern match, but only if the destination path is empty or the destination path has a file with different contents than the source path:

 > def sync(src: os.Path, dest: os.Path) =
     for srcSubPath <- os.walk(src) do
       val subPath = srcSubPath.subRelativeTo(src)
       val destSubPath = dest / subPath
       (os.isDir(srcSubPath), os.isDir(destSubPath)) match
         case (false, true) | (true, false) =>
           os.copy.over(srcSubPath, destSubPath, createFolders = true)

+        case (false, false)
+          if !os.exists(destSubPath)
+          || !os.read.bytes(srcSubPath).sameElements(os.read.bytes(destSubPath)) =>
+
+          os.copy.over(srcSubPath, destSubPath, createFolders = true)
+
         case _ => // do nothing

 def sync(src: os.Path, dest: os.Path): Unit
7.33.scala
See example 7.2 - FileSync

We are again using os.copy.over, which conveniently overwrites any file that was already at the destination path: a more fine-grained file syncer may want to update the file in place if only part of it has changed. We use .sameElements to compare the two Arrays rather than ==, as Arrays in Scala use reference equality and so == will always return false for two different Arrays even if they have the same contents.

Note that we do not need to run os.exists(srcSubPath) to see if it exists, as the fact that it got picked up by os.walk tells us that it does. While the file may potentially get deleted between os.walk picking it up and os.read.bytes reading its contents, for now we ignore such race conditions and assume that the file system is static while the sync is ongoing. Dealing with such concurrent modification will be the topic of Chapter 18: Building a Real-time File Synchronizer.
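
For reference, the steps above can be assembled into a single definition without the diff markers. This is a sketch of the same logic rather than a new algorithm; the only point worth calling out is that the guard uses the short-circuiting || operator, so the destination file is only read when it actually exists.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// One-way sync: make `dest` contain everything in `src`, copying only what differs.
// Deletions and symbolic links are ignored, as in the rest of this section.
def sync(src: os.Path, dest: os.Path): Unit =
  for srcSubPath <- os.walk(src) do
    val subPath = srcSubPath.subRelativeTo(src)
    val destSubPath = dest / subPath
    (os.isDir(srcSubPath), os.isDir(destSubPath)) match
      // file vs. folder mismatch, or a folder missing at the destination
      case (false, true) | (true, false) =>
        os.copy.over(srcSubPath, destSubPath, createFolders = true)

      // source is a file: copy if the destination is missing or its bytes differ
      case (false, false)
          if !os.exists(destSubPath)
          || !os.read.bytes(srcSubPath).sameElements(os.read.bytes(destSubPath)) =>
        os.copy.over(srcSubPath, destSubPath, createFolders = true)

      case _ => // folder on both sides, or identical files: nothing to do
```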

7.3.4 Testing Our File Syncer

We can now run sync on our two folders, and then os.walk the dest path and see all our files in place:

> sync(os.pwd / "post", os.pwd / "post-copy")

> os.walk(os.pwd / "post-copy")
res36: IndexedSeq[os.Path] = ArraySeq(
  /Users/lihaoyi/test/post-copy/Optimizing Scala.md,
  /Users/lihaoyi/test/post-copy/Programming Interview.md,
  /Users/lihaoyi/test/post-copy/Scala Vectors.md,
  /Users/lihaoyi/test/post-copy/Reimagining/,
  /Users/lihaoyi/test/post-copy/Reimagining/GithubSearch.png,
...
7.34.scala

To test incremental updates, we can try adding an entry to the src folder, run the sync, and then see that our file has been synced over to dest:

> os.write(os.pwd / "post/ABC.txt", "Hello World")

> sync(os.pwd / "post", os.pwd / "post-copy")

> os.exists(os.pwd / "post-copy/ABC.txt")
res39: Boolean = true

> os.read(os.pwd / "post-copy/ABC.txt")
res40: String = "Hello World"
7.35.scala

By appending some content to one of the files in src, we can verify that modifications to that file also get synced over when sync is run:

> os.write.append(os.pwd / "post/ABC.txt", "\nI am Cow")

> sync(os.pwd / "post", os.pwd / "post-copy")

> os.read(os.pwd / "post-copy/ABC.txt")
res41: String = """Hello World
I am Cow"""
7.36.scala

This example is greatly simplified: we do not consider deletions, permissions, symbolic links, or concurrency/parallelism concerns. Our file synchronizer runs in-process on a single computer, and cannot be easily used to synchronize files over a network. Nevertheless, it should give you a good sense of how working with the filesystem via Scala's OS-Lib library works.

7.4 Simple Subprocess Invocations

The other major operating-system feature that you will likely use is process management. Most complex systems are made of multiple processes: often the tool you need is not usable as a library within your program, but can be easily started as a subprocess to accomplish the task at hand.

A common pattern is to spawn a subprocess, wait for it to complete, and process its output. This is done through the os.call function:

os.call(cmd: os.Shellable,
        cwd: Path = null,
        env: Map[String, String] = null,
        stdin: ProcessInput = os.Pipe,
        stdout: ProcessOutput = os.Pipe,
        stderr: ProcessOutput = os.Inherit,
        mergeErrIntoOut: Boolean = false,
        timeout: Long = Long.MaxValue,
        check: Boolean = true,
        propagateEnv: Boolean = true,
        shutdownGracePeriod: Long = 100,
        destroyOnExit: Boolean = true): os.CommandResult
7.37.scala

os.call takes a lot of optional parameters, but at its simplest you pass in the command you want to execute, and it returns you an os.CommandResult object, which contains the exit code, stdout, and stderr of the completed process. For example, we can run git status within a Git repository and receive the output of the Git subprocess:

> val gitStatus = os.call(cmd = ("git", "status"))
gitStatus: os.CommandResult = CommandResult(
  command = ArraySeq("git", "status"),
  exitCode = 0,
...

> gitStatus.exitCode
res43: Int = 0

> gitStatus.out.text()
res44: String = """On branch main
Your branch is up to date with 'origin/main'.
Changes to be committed:
...
7.38.scala

Common things to customize include:

  • cwd to set the location of the subprocess's current working directory; defaults to the os.pwd of the host process

  • env to customize the environment variables passed to the subprocess; defaults to inheriting those of the host process

  • stdin/stderr/stdout to pass data into the subprocess's standard input stream, and redirect where its standard output and error streams go. Defaults to not taking standard input, and collecting the output in the os.CommandResult, and forwarding any errors to the terminal

  • check = false to avoid throwing an exception if the subprocess has a non-zero exit code

While here we are passing Strings into os.call(), you can also pass Seq[String]s, Option[String]s, and os.Paths. We will now explore some simple use cases using os.call to do useful work.
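
The cwd, env, and check parameters can be sketched in combination as follows. The runIn function and the GREETING variable are hypothetical, and the example assumes a Unix-like system where sh is available on the PATH.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Runs two small shell commands inside `dir` (assumes a Unix-like `sh`).
// Returns the echoed environment variable and the exit code of a failing command.
def runIn(dir: os.Path, greeting: String): (String, Int) =
  val echoed = os.call(
    cmd = ("sh", "-c", "echo $GREETING"),
    cwd = dir,                        // working directory of the subprocess
    env = Map("GREETING" -> greeting) // extra environment variable for the subprocess
  ).out.text().trim

  // check = false: a non-zero exit code is returned instead of throwing an exception
  val failed = os.call(cmd = ("sh", "-c", "exit 3"), cwd = dir, check = false)
  (echoed, failed.exitCode)
```

With check left at its default of true, the second call would throw rather than return, which is usually what you want when a failure should abort the program.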

7.4.1 Use Case: remove non-current branches from a Git repo

Often, someone using the Git version control system ends up creating one branch for every task they are working on. After merging those branches into the default branch, e.g. main, the branch names still hang around, and it can be tedious to run git branch -D over and over to remove them. Let's use Git subprocesses to help remove all branches except the current branch from the local Git repo.

To do this, first we run git branch to see the current branches, and get the output as a series of lines:

> val gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()
gitBranchLines: Vector[String] = Vector("  561", "  571", "  595", "* main")
7.39.scala

Next, we find all the branches whose lines start with two spaces "  " (the current branch is instead marked with "* "), and remove the leading whitespace:

> val otherBranches = gitBranchLines.filter(_.startsWith("  ")).map(_.drop(2))
otherBranches: Vector[String] = Vector("561", "571", "595")
7.40.scala

You can also write this using pattern matching and .collect, as:

> val otherBranches = gitBranchLines.collect{case s"  $branch" => branch}
otherBranches: Vector[String] = Vector("561", "571", "595")
7.41.scala

Lastly, we run git branch -D on each such branch to remove them. We can then see only the current * main branch remaining:

> for branch <- otherBranches do os.call(cmd = ("git", "branch", "-D", branch))

> val gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()
gitBranchLines: Vector[String] = Vector("* main")
7.42.scala

While everything we did here using subprocesses could also have been done in-process, e.g. using a library like JGit, it is sometimes easier to use a command-line tool you are already familiar with rather than having to learn an entirely new programmatic API. In this case, the git command-line interface is something many developers use every day, and being able to easily re-use those same commands from within your Scala program is a great advantage in readability.

The full code for removing all non-current branches from a Git repo is as follows:

RemoveBranches.scala
def gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()

def main() =
  pprint.log(gitBranchLines)

  val otherBranches = gitBranchLines.collect{case s"  $branchName" => branchName}
  for branch <- otherBranches do os.call(cmd = ("git", "branch", "-D", branch))
  pprint.log(otherBranches)
7.43.scala

7.4.2 Use Case: Curl to a local file

The previous example did not configure the subprocess' stdin or stdout streams at all: the git branch operations get all their necessary input from the command line arguments, and returned their output as stdout to the host process. In this next example, we will spawn a curl subprocess and configure stdout to send the output directly to a file without buffering in-memory.

stdout and stderr can be passed the following data types:

  • os.Pipe: the default for stdout, aggregates the output into the .out field in the CommandResult

  • os.Inherit: the default for stderr, forwards the output to the parent process' output. Useful for forwarding error messages or logs straight to your terminal

  • os.Path: a file on disk to forward the output to, which we will use below

We can thus spawn a curl subprocess with the stdout redirected to a local file:

> val url = "https://api.github.com/repos/com-lihaoyi/cask/releases"

> os.call(cmd = ("curl", "-L", url), stdout = os.pwd / "github.json")
res45: os.CommandResult = CommandResult(
  command = ...,
  exitCode = 0,
  chunks = Vector()
)
7.44.scala

After the curl completes, we can now spawn an ls -lh subprocess to get the metadata of the file we just downloaded:

> os.call(cmd = ("ls", "-lh", "github.json")).out.text()
res46: String = """-rw-r--r-- 1 lihaoyi staff 1.8M Jun  3 13:30 github.json
"""
7.45.scala

7.4.3 Streaming Gzip

os.call allows you to set both the stdin as well as stdout, using the subprocess to process data from one file to another in a streaming fashion. Here, we are using the gzip command-line tool to compress the github.json file we downloaded earlier into a github.json.gz:

> os.call(cmd = "gzip", stdin = os.pwd / "github.json", stdout = os.pwd / "github.json.gz")
res47: os.CommandResult = CommandResult(
  command = ArraySeq("gzip"),
  exitCode = 0,
  chunks = Vector()
)

> os.call(cmd = ("ls", "-lh", "github.json.gz")).out.text()
res48: String = """-rw-r--r-- 1 lihaoyi staff 47K Jun  3 13:30 github.json.gz
"""
7.46.scala

This lets you use subprocesses to handle large files and large amounts of data without having to load either the input or the output into the host process's memory. This is useful if the files are large and memory is limited.

os.call only allows you to spawn a single subprocess at a time, and only returns once that subprocess has completed. This is great for simple commands, like the curl and git commands we saw earlier, and can even be used for simple single-stage streaming commands like gzip. However, if we want more control over how the subprocess is used, or to wire up multiple subprocesses to form a multi-stage pipeline, we need os.spawn.

7.5 Interactive and Streaming Subprocesses

os.spawn(cmd: os.Shellable,
         cwd: Path = null,
         env: Map[String, String] = null,
         stdin: os.ProcessInput = os.Pipe,
         stdout: os.ProcessOutput = os.Pipe,
         stderr: os.ProcessOutput = os.Inherit,
         mergeErrIntoOut: Boolean = false,
         shutdownGracePeriod: Long = 100,
         destroyOnExit: Boolean = true): os.SubProcess
7.47.scala

os.spawn takes a similar set of arguments as os.call, but instead of returning a completed os.CommandResult, it instead returns an os.SubProcess object. This represents a subprocess that runs in the background while your program continues to execute, and you can interact with it via its stdin, stdout and stderr streams.

7.5.1 Interacting with a Subprocess

The first big use case for os.spawn is to spawn a long-lived worker process running concurrently with the host process. This background process may not be streaming large amounts of data, but simply has its own in-memory state you want to interact with and preserve between interactions.

A simple example of this is to keep a Python process running in the background:

> val sub = os.spawn(cmd = ("python3", "-u", "-c", "while True: print(eval(input()))"))

This tiny snippet of Python code in the -c parameter reads input lines from stdin, evals them as Python code, and prints the result over stdout back to the host process. By spawning a subprocess, we can write Strings to the subprocess' input streams using sub.stdin.write and sub.stdin.writeLine, and read String lines from its standard output using sub.stdout.readLine():

> sub.stdin.writeLine("1 + 2 + 4")

> sub.stdin.flush()

> sub.stdout.readLine()
res49: String = "7"
7.48.scala
> sub.stdin.writeLine("'a' + 'b'")

> sub.stdin.flush()

> sub.stdout.readLine()
res50: String = "ab"
7.49.scala

stdin and stdout also support reading and writing binary data. When we are done with the subprocess, we can destroy it:

> sub.isAlive()
res51: Boolean = true

> sub.destroy()

> sub.isAlive()
res52: Boolean = false
7.50.scala

This usage pattern is handy in a few cases:

  • The process you are delegating work to is slow to initialize, so you do not want to spawn a new one every time, e.g. a Python process can take tens to hundreds of milliseconds to start

  • The process you are delegating work to has its own in-memory state that needs to be preserved, and throwing away the data by spawning a new subprocess each time simply doesn't work

  • The work you are delegating is untrusted, and you want to run it in a subprocess to take advantage of the sandboxing and security features provided by the operating system

7.5.2 Streaming distinct contributors in a Git repo history

Apart from interactively exchanging data with a subprocess over stdin and stdout, the other common use case for os.spawn is to connect multiple subprocesses via their standard input and standard output. This lets them exchange and process data in a streaming fashion.

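The first streaming use case we will visit is finding the distinct contributors to a Git repository. This takes two steps: running git log to print the commit history, and running grep to keep only the lines containing "Author: ". To run these steps in parallel, in a streaming pipeline, you can use os.spawn as follows: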

[Diagram: git log -> grep "Author: " -> output]
> {
  val gitLog = os.spawn(cmd = ("git", "log"))
  val grepAuthor = os.spawn(cmd = ("grep", "Author: "), stdin = gitLog.stdout)
  val output = grepAuthor.stdout.lines().distinct
  }
output: Vector[String] = Vector(
  "Author: Li Haoyi",
  "Author: Guillaume Galy",
  "Author: Nik Vanderhoof",
...
7.51.scala

Here, we spawn one gitLog subprocess, and pass the stdout of gitLog into the stdin of a grepAuthor subprocess. At that point, both os.SubProcesses are running in the background, the output of one feeding into the input of the other. The grepAuthor subprocess exposes a grepAuthor.stdout attribute that you can use to read the output, either interactively via readLine as we saw earlier, or in bulk via methods like .text() or .lines().
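
The same wiring can be tried out without a Git repository. The sketch below is an illustrative three-stage pipeline rather than part of the example above: it assumes the standard printf, sort and uniq commands are available on your PATH, and the distinctLines name is hypothetical.

```scala
//> using dep com.lihaoyi::os-lib:0.11.6

// Hypothetical helper: de-duplicates lines by streaming them through
// `sort` and `uniq`, each running as its own subprocess.
def distinctLines(input: String): Vector[String] = {
  // printf echoes its argument; avoid `%` and backslashes in `input`,
  // since printf would interpret them as format directives.
  val source = os.spawn(cmd = ("printf", input))
  val sorted = os.spawn(cmd = "sort", stdin = source.stdout)
  val unique = os.spawn(cmd = "uniq", stdin = sorted.stdout)
  // Reading the last stage's stdout drains the whole pipeline.
  unique.stdout.lines()
}

@main def pipelineDemo(): Unit = {
  println(distinctLines("b\na\nb\n")) // Vector(a, b)
}
```

Each stage only ever sees its own stdin and stdout, so adding, removing or reordering stages is a matter of changing which stdout is passed as the next stdin.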

7.5.3 Streaming Subprocess Pipelines

Subprocess pipelines do not need to start or end in-memory, or even on your local computer. Here is an example of downloading some data from api.github.com, and re-uploading it to httpbin.org, in a streaming fashion using curl on both ends:

[Diagram: api.github.com -> curl -> curl -X PUT -> httpbin.org]
> {
  val url = "https://api.github.com/repos/com-lihaoyi/cask/releases"
  val download = os.spawn(cmd = ("curl", "-L", url))

  val upload = os.spawn(
    cmd = ("curl", "-X", "PUT", "-d", "@-", "https://httpbin.org/anything"),
    stdin = download.stdout
  )

  val contentLength = upload.stdout.lines().filter(_.contains("Content-Length"))
  }
val contentLength: Vector[String] = Vector(
  "    \"Content-Length\": \"1759343\", "
)
7.52.scala

Looking at the JSON output of the final upload response, we can see that the "Content-Length" of the output is 1,759,343 bytes which roughly matches the 1.8M number we saw earlier. For now, we are just using a quick .contains check on Strings to find the relevant line in the JSON: we will see how to do JSON processing in a more structured way later in Chapter 8: JSON and Binary Data Serialization.

As each subprocess doesn't know anything about its neighbours, it is easy to change the topology of the pipeline. For example, if we want to introduce base64 and gzip steps in between the curl download and curl -X PUT upload, we could do that as follows:

 > {
   val url = ...
   val download = ...

+  val base64 = os.spawn(cmd = "base64", stdin = download.stdout)
+  val gzip = os.spawn(cmd = "gzip", stdin = base64.stdout)
   val upload = os.spawn(
     cmd = ("curl", "-X", "PUT", "-d", "@-", "https://httpbin.org/anything"),
-    stdin = download.stdout
+    stdin = gzip.stdout
   )

   val contentLength = upload.stdout.lines().filter(_.contains("Content-Length"))
   }
 val contentLength: Vector[String] = Vector("    \"Content-Length\": \"201864\", ")
7.53.scala
[Diagram: api.github.com -> curl -> base64 -> gzip -> curl -X PUT -> httpbin.org]

os.spawn thus allows you to put together quite sophisticated subprocess pipelines in a straightforward manner. In the example above, all four stages run in parallel, with the download, base64 encoding, gzip compression, and upload taking place concurrently in four separate subprocesses. The data is processed in a streaming fashion as it is piped between the four processes, allowing us to perform the whole computation without ever accumulating the complete dataset in memory.

7.6 Conclusion

In this chapter, we have seen how we can use the os package to make working with the filesystem quick and easy, and how to conveniently set up subprocess pipelines to accomplish what you want. From synchronizing files to doing network data plumbing, all the tools you need are just a few lines of code away. We will be using this package to work with files and subprocesses throughout the rest of this book.

To use OS-Lib in a project built using Mill or another build tool, you need the following coordinates:

Mill    mvn"com.lihaoyi::os-lib:0.11.6"
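
As a rough illustration of where these coordinates go, here is a minimal build.mill sketch. The module name app and the Scala version are placeholders, and the mvnDeps key assumes a recent Mill release that supports the mvn"..." syntax; check the documentation for the Mill version you are using.

```scala
// build.mill: a minimal sketch, assuming a recent Mill release
package build
import mill.*, scalalib.*

// `app` is a placeholder module name
object app extends ScalaModule {
  def scalaVersion = "3.7.1" // placeholder; use your project's Scala version
  def mvnDeps = Seq(
    mvn"com.lihaoyi::os-lib:0.11.6"
  )
}
```

In a standalone Scala CLI script, the equivalent is a `//> using dep com.lihaoyi::os-lib:0.11.6` directive at the top of the file, which plays the same role as the --dep flag used at the start of this chapter.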

In the interest of time, the scope of this chapter is limited. The OS-Lib library documentation is a much more thorough reference:

We will revisit the file synchronizer and subprocess pipelines we built in this chapter later in this book, in Chapter 17: Multi-Process Applications and Chapter 18: Building a Real-time File Synchronizer.

Exercise: Modify the folder sync function we wrote earlier to support deletes: if a file or folder exists in the destination folder but not in the source folder, we should delete it from the destination folder when synchronizing them.

See example 7.7 - FileSyncDelete

Exercise: Add another stage to our streaming subprocess pipeline that uses tee to stream the compressed data to a local file while still uploading it to httpbin.org.

See example 7.8 - StreamingDownloadProcessReuploadTee
Discuss Chapter 7 online at https://www.handsonscala.com/discuss/7