7.1 Paths
7.2 Filesystem Operations
7.3 Folder Syncing
7.4 Simple Subprocess Invocations
7.5 Interactive and Streaming Subprocesses
> os.walk(os.pwd).filter(os.isFile).map(p => (os.size(p), p)).sortBy(-_._1).take(5)
res0: IndexedSeq[(Long, os.Path)] = ArraySeq(
(6340270L, /Users/lihaoyi/test/post/Reimagining/GithubHistory.gif),
(6008395L, /Users/lihaoyi/test/post/SmartNation/routes.json),
(5499949L, /Users/lihaoyi/test/post/slides/Why-You-Might-Like-Scala.js.pdf),
(5461595L, /Users/lihaoyi/test/post/slides/Cross-Platform-Development-in-...),
(4576936L, /Users/lihaoyi/test/post/Reimagining/FluentSearch.gif)
)
7.1.scala
Snippet 7.1: a short Scala code snippet to find the five largest files in a directory tree
Working with files and subprocesses is one of the most common things you do in programming: from the Bash shell, to Python or Ruby scripts, to large applications written in a compiled language. At some point everyone will have to write to a file or talk to a subprocess. This chapter will walk you through how to perform basic file and subprocess operations in Scala.
This chapter finishes with two small projects: building a simple file synchronizer, and building a streaming subprocess pipeline. These projects will form the basis for Chapter 17: Multi-Process Applications and Chapter 18: Building a Real-time File Synchronizer.
Throughout this chapter we will be using the
OS-Lib library.
All functionality within this library comes from the os package.
To use OS-Lib in the Scala CLI REPL, it needs to be added as a library dependency
via the command line flag --dep com.lihaoyi::os-lib:0.11.6.
We will be using the following sample folder full of files for these examples:
Most operations we will be working with involve filesystem paths: we read data
from a path, write data to a path, copy files from one path to another, or list
a folder path to see what files are inside. This is represented by the os.Path
type, with separate os.RelPath and os.SubPath types to model relative paths
and sub-paths respectively.
The separation between the different kinds of paths allows additional compile-time checks that help avoid common mistakes when working with files. This helps prevent both correctness bugs as well as security issues such as directory traversal attacks.
There are three paths that are built in to the os package: os.pwd,
os.root, os.home. These refer to your process working directory, filesystem
root, and user home folder respectively. You can also get the sequence of path
segments using .segments, or the last segment of the path (usually the file or
folder name), using .last.
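As a small sketch (assuming OS-Lib is on the classpath, e.g. via the --dep flag above), the built-in paths and the .segments/.last inspection methods might be exercised like this; the exact values printed depend on where you run it:

```scala
// The three built-in paths; their values depend on the running process.
val wd = os.pwd             // the process working directory
println(os.root)            // the filesystem root, e.g. /
println(os.home)            // the current user's home folder
println(wd.segments.toList) // every path segment as a String
println(wd.last)            // the final segment: the file or folder name
```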
To create a new path based on an existing path, you can use the / operator
provided by the library to add additional path segments. If you pass a literal
string, you can also use / within the string as a separator, e.g.
adding multiple path segments via the string "Github/blog" is also allowed:
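A minimal sketch of both forms, assuming OS-Lib is available; the two spellings produce the same path:

```scala
// Extending an existing path with the / operator, one segment at a time
val chained = os.home / "Github" / "blog"
// A literal string containing "/" adds multiple segments at once
val multi = os.home / "Github/blog"
assert(chained == multi)
assert(multi.last == "blog")
```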
If you want to treat the string as a relative path or subpath, you need to use the
os.RelPath (7.1.2) or os.SubPath (7.1.3) constructor in order
to construct a relative path of more than one segment. This helps avoid
confusion between working with individual path segments as Strings and working
with more general relative paths as os.SubPaths or os.RelPaths.
The special os.up path segment lets you move up one level:
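A quick sketch (with a hypothetical /Users/lihaoyi/Github path): each os.up removes the last segment as the path is constructed:

```scala
val repo = os.root / "Users" / "lihaoyi" / "Github"
assert(repo / os.up == os.root / "Users" / "lihaoyi")            // one level up
assert(repo / os.up / os.up / "test" == os.root / "Users" / "test") // up twice, then down
```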
os.Path(...) is used to parse os.Paths from strings. This is useful for
paths coming in from elsewhere, e.g. read from a file or command-line arguments.
Note that by default the os.Path(...) constructor only allows absolute paths.
If you want to parse a string that is a relative path into an absolute
os.Path, you have to provide a base path from which that relative path will
begin.
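A sketch of both cases, using hypothetical example paths: absolute strings parse directly, while relative strings need a base path as the second argument:

```scala
val parsed = os.Path("/Users/lihaoyi")               // absolute string, parses directly
assert(parsed == os.root / "Users" / "lihaoyi")
val rebased = os.Path("lihaoyi/Github", os.root / "Users") // relative string + base path
assert(rebased == os.root / "Users" / "lihaoyi" / "Github")
```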
To work with relative paths on disk, you can use os.RelPath:
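For instance, a relative path can be parsed from a string, and may begin with .. segments (this is a small sketch using made-up segment names):

```scala
val rel = os.RelPath("hello/world")
assert(rel.segments.toList == List("hello", "world"))
val up = os.RelPath("../../wisdom")
assert(up.ups == 2)       // the number of leading ".." segments
assert(up.last == "wisdom")
```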
This helps ensure you do not mix up os.Paths which are always absolute and
os.RelPaths which are always relative. To combine a relative path and an
absolute path, you can use the same / operator. Relative paths themselves can
also be combined using /, in any order, but trying to combine an absolute
os.Path on the right side of a relative os.RelPath is a compile error:
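A brief sketch of the legal combinations (the illegal one is shown commented out, since it would not compile):

```scala
val abs = os.root / "Users"
val rel = os.RelPath("lihaoyi/Github")
assert(abs / rel == os.root / "Users" / "lihaoyi" / "Github") // absolute / relative: ok
assert(os.RelPath("a") / os.RelPath("b") == os.RelPath("a/b")) // relative / relative: ok
// rel / abs  // compile error: an absolute path cannot appear on the right of /
```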
If you want the relative path between two absolute paths, you can use
.relativeTo:
> val githubPath = os.Path("/Users/lihaoyi/Github")
githubPath: os.Path = /Users/lihaoyi/Github
> val usersPath = os.Path("/Users")
usersPath: os.Path = /Users
> githubPath.relativeTo(usersPath)
res18: os.RelPath = lihaoyi/Github
> usersPath.relativeTo(githubPath)
res19: os.RelPath = ../..
7.14.scala
os.SubPaths are a special case of relative paths, where there cannot be any
.. segments at the start. Similar to relative paths, sub-paths can be created
between absolute os.Paths using .subRelativeTo.
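Re-using the githubPath and usersPath values from the earlier snippet, a sketch of .subRelativeTo might look like this; note that the reverse direction would need .. segments and therefore throws:

```scala
val githubPath = os.Path("/Users/lihaoyi/Github")
val usersPath = os.Path("/Users")
assert(githubPath.subRelativeTo(usersPath) == os.SubPath("lihaoyi/Github"))
// usersPath.subRelativeTo(githubPath) // throws: the result would start with ".."
```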
os.SubPath is useful for cases where you have a relative path that should
always be "within" a particular base folder. This can help rule out a whole
class of
directory traversal attacks
where an unexpected .. in a relative path allows the attacker to read your
/etc/passwd or some other sensitive files.
This section will walk through the most commonly used filesystem operations that we will use throughout the rest of this book.
Note that most filesystem operations only accept absolute os.Paths.
os.SubPaths or os.RelPaths must first be converted into an absolute
os.Path before being used in such an operation. Many of these commands have
variants that let you configure the operation, e.g. os.read lets you pass in
an offset to read from and a count of characters to read, os.makeDir has
os.makeDir.all to recursively create necessary folders, os.remove.all to
recursively remove a folder and its contents, and so on.
7.2.1.1 os.list
os.list allows you to list out the direct children of a folder. Note that the
results returned from os.list are sorted by file name.
7.2.1.2 os.walk
os.walk does a recursive listing. This may take a significant amount of time and
memory for large folder trees; you can use os.walk.stream (7.2.4) to process the
results in a streaming fashion.
7.2.1.3 os.stat
os.stat fetches the filesystem metadata for an individual file or folder. You
can also perform more specific queries such as os.isFile, os.isDir, os.mtime,
os.size, etc. if you have a particular question you want to answer.
7.2.2.1 os.read, os.write
os.write writes to a file. You can write any datatype implementing the
Writable interface; os.read reads a file as a String.
7.2.2.2 os.move | 7.2.2.3 os.copy
7.2.2.4 os.remove | 7.2.2.5 os.makeDir
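The operations above can be exercised together in a small sketch, here run inside a throwaway temporary folder (os.temp.dir()) so nothing on disk is disturbed:

```scala
val dir = os.temp.dir()                      // fresh temporary folder
os.write(dir / "hello.txt", "Hello")         // create a file with some contents
os.makeDir.all(dir / "a" / "b")              // recursively create nested folders
assert(os.list(dir).map(_.last).sorted == Seq("a", "hello.txt"))
assert(os.isFile(dir / "hello.txt") && os.isDir(dir / "a" / "b"))
assert(os.size(dir / "hello.txt") == 5)
os.copy(dir / "hello.txt", dir / "hello2.txt")
assert(os.read(dir / "hello2.txt") == "Hello")
os.remove.all(dir)                           // recursively delete the whole folder
assert(!os.exists(dir))
```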
Many filesystem operations return collections of results: the files in a folder,
the lines of a file, and so on. You can operate on these collections the same
way we did in Chapter 4: Scala Collections, chaining together calls to
.map/.filter/etc. to get the result we want.
For example, if we wanted to walk a folder recursively and find the largest files within it, we can do that as follows:
> os.walk(os.pwd).filter(os.isFile).map(p => (os.size(p), p)).sortBy(-_._1).take(5)
res33: IndexedSeq[(Long, os.Path)] = ArraySeq(
(6340270L, /Users/lihaoyi/test/post/Reimagining/GithubHistory.gif),
(6008395L, /Users/lihaoyi/test/post/SmartNation/routes.json),
(5499949L, /Users/lihaoyi/test/post/slides/Why-You-Might-Like-Scala.js.pdf),
(5461595L, /Users/lihaoyi/test/post/slides/Cross-Platform-Development-in-...),
(4576936L, /Users/lihaoyi/test/post/Reimagining/FluentSearch.gif)
)
7.25.scala
Note that this snippet loads all files into an in-memory data structure before
filtering and transforming it. To avoid loading everything into memory, most
such operations also provide a .stream variant.
The .stream version of filesystem operations allows you to process their
output in a streaming fashion, one element at a time, without all the output
in-memory at once. This lets you process large results without running out of
memory.
For example, you can use os.read.lines.stream to stream the lines of a file, or os.walk.stream to stream the recursive contents of a folder:
> os.read.lines.stream(os.pwd / ".gitignore").foreach(println)
target/
*.iml
.idea
.settings
...
> os.walk.stream(os.pwd).foreach(println)
/Users/lihaoyi/test/.gitignore
/Users/lihaoyi/test/post
/Users/lihaoyi/test/post/Programming Interview.md
/Users/lihaoyi/test/post/Hub
/Users/lihaoyi/test/post/Hub/Search.png
7.26.scala
*.stream operations return a
Generator type. These are similar to
iterators, except they ensure that resources are always released after
processing. Most collection operations like .foreach, .map, .filter,
.toArray, etc. are available on Generators.
Like the Views we saw in Chapter 4: Scala Collections, Generators allow
you to combine several transformations like .map, .filter, etc. into one
traversal that runs when you finally call a method like .foreach or .toList.
Here's an example that combines os.read.lines.stream, filter, and map to
list out the lines in .gitignore which start with a "." character, with the
"." stripped:
> os.read.lines.stream(os.pwd / ".gitignore").filter(_.startsWith(".")).map(_.drop(1)).toList
res34: List[String] = List(
"idea",
"settings",
"classpath",
...
7.27.scala
> os.read.lines.stream(os.pwd / ".gitignore").collect{case s".$str" => str}.toList
res35: List[String] = List(
"idea",
"settings",
"classpath",
...
7.28.scala
Because we are using .stream, the reading/filtering/mapping occurs one line at
a time. We do not accumulate the results in-memory until the end when we call
.toList: this can be useful when the data being read is larger than can
comfortably fit in memory all at once.
Now that we have gone through the basic filesystem operations, let's examine a use case.
For this use case, we will write a program that will take a source and destination folder, and efficiently update the destination folder to look like the source folder: adding any files that may be missing, and modifying files where necessary to make sure they have the same contents. For simplicity, we will ignore deletions and symbolic links. We will also assume that naively deleting the entire destination directory and re-copying the source over is too inefficient, and that we want to synchronize the two folders on a per-file/folder basis.
Let us start by defining the method signature of our folder synchronizer:
> def sync(src: os.Path, dest: os.Path) =
{ /* nothing yet */ }
7.29.scala
The first thing we need to do is recursively walk all contents of the source folder, so we can see what we have on disk that we might need to sync:
> def sync(src: os.Path, dest: os.Path) =
- { /* nothing yet */ }
+ for srcSubPath <- os.walk(src) do
+ val subPath = srcSubPath.subRelativeTo(src)
+ val destSubPath = dest / subPath
+ println((os.isDir(srcSubPath), os.isDir(destSubPath)))
def sync(src: os.Path, dest: os.Path): Unit
7.30.scala
Running this on our source folder with files and destination folder without files, we get the following:
> sync(os.pwd / "post", os.pwd / "post-copy")
(false, false)
(true, false)
(false, false)
(false, false)
...
7.31.scala
For now, the destination folder doesn't exist, so isDir returns false on all
of the paths.
The next step is to start syncing over files. We walk over the src and the
corresponding paths in dest together, and if they differ, copy the source
sub-path over the destination sub-path:
> def sync(src: os.Path, dest: os.Path) =
for srcSubPath <- os.walk(src) do
val subPath = srcSubPath.subRelativeTo(src)
val destSubPath = dest / subPath
- println((os.isDir(srcSubPath), os.isDir(destSubPath)))
+ (os.isDir(srcSubPath), os.isDir(destSubPath)) match
+ case (false, true) | (true, false) =>
+ os.copy.over(srcSubPath, destSubPath, createFolders = true)
+
+ case _ => // do nothing
def sync(src: os.Path, dest: os.Path): Unit
7.32.scala
Since isDir returns false both when the path refers to a file as well as if
the path is empty, the above snippet uses os.copy.over to delete the destination
path and copy over the contents of the source whenever one of the two paths is a
folder and the other is not.
There are three cases the above code does not support:
If the source path is empty and the destination path contains a folder. This is a "delete" which we will ignore for now; supporting this is left as an exercise for the reader
If the source path is a folder and the destination path is a folder. In this
case doing nothing is fine: os.walk will enter the source path folder and
process all the files within it recursively
If the source path is a file and the destination path is a file, but they have different contents
We will handle the last case next.
The last case we need to support is when both source and destination paths
contain files, but with different contents. We can do that by handling the
(false, false) case in our pattern match, but only if the destination path is empty
or the destination path has a file with different contents than the source path:
> def sync(src: os.Path, dest: os.Path) =
for srcSubPath <- os.walk(src) do
val subPath = srcSubPath.subRelativeTo(src)
val destSubPath = dest / subPath
(os.isDir(srcSubPath), os.isDir(destSubPath)) match
case (false, true) | (true, false) =>
os.copy.over(srcSubPath, destSubPath, createFolders = true)
+ case (false, false)
+ if !os.exists(destSubPath)
+ || !os.read.bytes(srcSubPath).sameElements(os.read.bytes(destSubPath)) =>
+
+ os.copy.over(srcSubPath, destSubPath, createFolders = true)
+
case _ => // do nothing
def sync(src: os.Path, dest: os.Path): Unit
7.33.scala
We are again using os.copy.over, which conveniently over-writes any file that
was already at the destination path: a more fine-grained file syncer may want to
update the file in place if only part of it has changed. We use .sameElements
to compare the two Arrays rather than ==, as Arrays in Scala use reference
equality and so == will always return false for two different Arrays even
if they have the same contents.
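A tiny standalone sketch makes the distinction concrete:

```scala
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)
assert(a != b)            // Arrays use reference equality: different objects, "unequal"
assert(a.sameElements(b)) // element-by-element comparison sees identical contents
```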
Note that we do not need to run os.exists(srcSubPath) to see if it exists, as
the fact that it got picked up by os.walk tells us that it does. While the
file may potentially get deleted between os.walk picking it up and
os.read.bytes reading its contents, for now we ignore such race conditions and
assume that the file system is static while the sync is ongoing. Dealing with
such concurrent modification will be the topic of Chapter 18: Building a Real-time File Synchronizer.
We can now run sync on our two folders, and then os.walk the dest path and
see all our files in place:
> sync(os.pwd / "post", os.pwd / "post-copy")
> os.walk(os.pwd / "post-copy")
res36: IndexedSeq[os.Path] = ArraySeq(
/Users/lihaoyi/test/post-copy/Optimizing Scala.md,
/Users/lihaoyi/test/post-copy/Programming Interview.md,
/Users/lihaoyi/test/post-copy/Scala Vectors.md,
/Users/lihaoyi/test/post-copy/Reimagining/,
/Users/lihaoyi/test/post-copy/Reimagining/GithubSearch.png,
...
7.34.scala
To test incremental updates, we can try adding an entry to the src folder, run
the sync, and then see that our file has been synced over to dest:
> os.write(os.pwd / "post/ABC.txt", "Hello World")
> sync(os.pwd / "post", os.pwd / "post-copy")
> os.exists(os.pwd / "post-copy/ABC.txt")
res39: Boolean = true
> os.read(os.pwd / "post-copy/ABC.txt")
res40: String = "Hello World"
7.35.scala
By appending some content to one of the files in src, we can verify that
modifications to that file also get synced over when sync is run:
> os.write.append(os.pwd / "post/ABC.txt", "\nI am Cow")
> sync(os.pwd / "post", os.pwd / "post-copy")
> os.read(os.pwd / "post-copy/ABC.txt")
res41: String = """Hello World
I am Cow"""
7.36.scala
This example is greatly simplified: we do not consider deletions, permissions, symbolic links, or concurrency/parallelism concerns. Our file synchronizer runs in-process on a single computer, and cannot be easily used to synchronize files over a network. Nevertheless, it should give you a good sense of how working with the filesystem via Scala's OS-Lib library works.
The other major operating-system feature that you will likely use is process management. Most complex systems are made of multiple processes: often the tool you need is not usable as a library within your program, but can be easily started as a subprocess to accomplish the task at hand.
A common pattern is to spawn a subprocess, wait for it to complete, and process its output. This is done through the os.call function:
os.call(cmd: os.Shellable,
cwd: Path = null,
env: Map[String, String] = null,
stdin: ProcessInput = os.Pipe,
stdout: ProcessOutput = os.Pipe,
stderr: ProcessOutput = os.Inherit,
mergeErrIntoOut: Boolean = false,
timeout: Long = Long.MaxValue,
check: Boolean = true,
propagateEnv: Boolean = true,
shutdownGracePeriod: Long = 100,
destroyOnExit: Boolean = true): os.CommandResult
7.37.scala
os.call takes a lot of optional parameters, but at its simplest you pass
in the command you want to execute, and it returns you an os.CommandResult
object, which contains the exit code, stdout, and stderr of the completed
process. For example, we can run git status within a Git repository and
receive the output of the Git subprocess:
> val gitStatus = os.call(cmd = ("git", "status"))
gitStatus: os.CommandResult = CommandResult(
command = ArraySeq("git", "status"),
exitCode = 0,
...
> gitStatus.exitCode
res43: Int = 0
> gitStatus.out.text()
res44: String = """On branch main
Your branch is up to date with 'origin/main'.
Changes to be committed:
...
7.38.scala
Common things to customize include:
cwd to set the location of the subprocess's current working directory;
defaults to the os.pwd of the host process
env to customize the environment variables passed to the subprocess;
defaults to inheriting those of the host process
stdin/stdout/stderr to pass data into the subprocess's standard input
stream, and redirect where its standard output and error streams go. Defaults
to not taking standard input, collecting the output in the
os.CommandResult, and forwarding any errors to the terminal
check = false to avoid throwing an exception if the subprocess has a
non-zero exit code
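As a sketch of the last point, with check = false a failing command becomes an ordinary result value rather than an exception (this assumes an ls binary on the PATH and that the made-up file name does not exist):

```scala
// A non-existent file makes ls exit non-zero; check = false suppresses the exception
val failed = os.call(cmd = ("ls", "no-such-file-here"), check = false)
assert(failed.exitCode != 0)
// The default check = true would have thrown for the call above
val ok = os.call(cmd = "ls")
assert(ok.exitCode == 0)
```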
While here we are passing Strings into os.call(), you can also pass
Seq[String]s, Option[String]s, and os.Paths. We will now explore some
simple use cases using os.call to do useful work.
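First, a quick sketch of how those types flatten into the final command line: a Seq expands into multiple arguments, and an Option into zero or one (here illustrated with echo, assumed to be on the PATH):

```scala
val words = Seq("hello", "world")
val maybeBang: Option[String] = Some("!")
val nothing: Option[String] = None
// Runs: echo hello world !   (the None contributes no argument)
val res = os.call(cmd = ("echo", words, maybeBang, nothing))
assert(res.out.text().trim == "hello world !")
```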
Often, someone using the Git version control system ends up creating one branch
for every task they are working on.
After merging those branches into the default branch, e.g. main,
the branch names still hang around, and it can be tedious to run git branch -D
over and over to remove them. Let's use Git subprocesses to help remove all
branches except the current branch from the local Git repo.
To do this, first we run git branch to see the current branches, and get the
output as a series of lines:
> val gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()
gitBranchLines: Vector[String] = Vector("  561", "  571", "  595", "* main")
7.39.scala
Next, we find all the branches whose lines start with " ", and remove the
whitespace:
> val otherBranches = gitBranchLines.filter(_.startsWith("  ")).map(_.drop(2))
otherBranches: Vector[String] = Vector("561", "571", "595")
7.40.scala
You can also write this using pattern matching and .collect, as:
> val otherBranches = gitBranchLines.collect{case s"  $branch" => branch}
otherBranches: Vector[String] = Vector("561", "571", "595")
7.41.scala
Lastly, we run git branch -D on each such branch to remove them. We can then
see only the current * main branch remaining:
> for branch <- otherBranches do os.call(cmd = ("git", "branch", "-D", branch))
> val gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()
gitBranchLines: Vector[String] = Vector("* main")
7.42.scala
While everything we did here using subprocesses could also have been done
in-process, e.g. using a library like JGit,
it is sometimes easier to use a command-line tool you are already familiar with
rather than having to learn an entirely new programmatic API. In this case, the
git command-line interface is something many developers use every day, and
being able to easily re-use those same commands from within your Scala program
is a great advantage in readability.
The full code for removing all non-current branches from a Git repo is as follows:
RemoveBranches.scala
7.43.scala
def gitBranchLines = os.call(cmd = ("git", "branch")).out.lines()

def main() =
  pprint.log(gitBranchLines)
  val otherBranches = gitBranchLines.collect{case s"  $branchName" => branchName}
  for branch <- otherBranches do os.call(cmd = ("git", "branch", "-D", branch))
  pprint.log(otherBranches)
The previous example did not configure the subprocess' stdin or stdout
streams at all: the git branch operations get all their necessary input from
the command line arguments, and returned their output as stdout to the host
process. In this next example, we will spawn a curl subprocess and configure
stdout to send the output directly to a file without buffering in-memory.
stdout and stderr can be passed the following data types:
os.Pipe: the default for stdout, aggregates the output into the .out
field in the CommandResult
os.Inherit: the default for stderr, forwards the output to the parent
process' output. Useful for forwarding error messages or logs straight to your
terminal
os.Path: a file on disk to forward the output to, which we will use below
We can thus spawn a curl subprocess with the stdout redirected to a local
file:
> val url = "https://api.github.com/repos/com-lihaoyi/cask/releases"
> os.call(cmd = ("curl", "-L", url), stdout = os.pwd / "github.json")
res45: os.CommandResult = CommandResult(
command = ...,
exitCode = 0,
chunks = Vector()
)
7.44.scala
After the curl completes, we can now spawn a ls -lh subprocess to get the
metadata of the file we just downloaded:
> os.call(cmd = ("ls", "-lh", "github.json")).out.text()
res46: String = """-rw-r--r-- 1 lihaoyi staff 1.8M Jun 3 13:30 github.json
"""
7.45.scala
os.call allows you to set both the stdin as well as stdout, using the
subprocess to process data from one file to another in a streaming fashion.
Here, we are using the gzip command-line tool to compress the github.json
file we downloaded earlier into a github.json.gz:
> os.call(cmd = "gzip", stdin = os.pwd / "github.json", stdout = os.pwd / "github.json.gz")
res47: os.CommandResult = CommandResult(
command = ArraySeq("gzip"),
exitCode = 0,
chunks = Vector()
)
> os.call(cmd = ("ls", "-lh", "github.json.gz")).out.text()
res48: String = """-rw-r--r-- 1 lihaoyi staff 47K Jun 3 13:30 github.json.gz
"""
7.46.scala
This lets you use subprocesses to handle large files and large amounts of data without having to load either the input or the output into the host process's memory. This is useful if the files are large and memory is limited.
os.call only allows you to spawn a single subprocess at a time, and only
returns once that subprocess has completed. This is great for simple commands,
like the curl and git commands we saw earlier, and can even be used for
simple single-stage streaming commands like gzip. However, if we want more
control over how the subprocess is used, or to wire up multiple subprocesses to
form a multi-stage pipeline, we need os.spawn.
os.spawn(cmd: os.Shellable,
cwd: Path = null,
env: Map[String, String] = null,
stdin: os.ProcessInput = os.Pipe,
stdout: os.ProcessOutput = os.Pipe,
stderr: os.ProcessOutput = os.Inherit,
mergeErrIntoOut: Boolean = false,
shutdownGracePeriod: Long = 100,
destroyOnExit: Boolean = true): os.SubProcess
7.47.scala
os.spawn takes a similar
set of arguments as os.call, but instead of returning a completed
os.CommandResult, it instead returns an os.SubProcess object. This
represents a subprocess that runs in the background while your program continues
to execute, and you can interact with it via its stdin, stdout and stderr
streams.
The first big use case for os.spawn is to spawn a long-lived worker
process running concurrently with the host process. This background process may
not be streaming large amounts of data, but simply has its own in-memory state
you want to interact with and preserve between interactions.
A simple example of this is to keep a Python process running in the background:
> val sub = os.spawn(cmd = ("python3", "-u", "-c", "while True: print(eval(input()))"))
This tiny snippet of Python code in the -c parameter reads input lines from
stdin, evals them as Python code, and prints the result over stdout back
to the host process. By spawning a subprocess, we can write Strings to the
subprocess' input streams using sub.stdin.write and sub.stdin.writeLine, and
read String lines from its standard output using sub.stdout.readLine():
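A sketch of one round-trip exchange with the worker (this assumes python3 is available on the PATH, as in the spawn call above):

```scala
// Spawn the same eval-loop Python worker as above, unbuffered via -u
val sub = os.spawn(cmd = ("python3", "-u", "-c", "while True: print(eval(input()))"))
sub.stdin.writeLine("1 + 2")                   // send a line of Python for evaluation
sub.stdin.flush()
assert(sub.stdout.readLine() == "3")           // read back the printed result
sub.stdin.writeLine("'hello' + ' ' + 'world'") // state-free string expression
sub.stdin.flush()
assert(sub.stdout.readLine() == "hello world")
sub.destroy()                                  // shut the worker down when done
```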
stdin and stdout also support reading and writing binary data. When we are
done with the subprocess, we can destroy it:
> sub.isAlive()
res51: Boolean = true
> sub.destroy()
> sub.isAlive()
res52: Boolean = false
7.50.scalaThis usage pattern is handy in a few cases:
The process you are delegating work to is slow to initialize, so you do not want to spawn a new one every time. e.g. a Python process can take 10s to 100s of milliseconds to start
The process you are delegating work to has its own in-memory state that needs to be preserved, and throwing away the data by spawning a new subprocess each time simply doesn't work
The work you are delegating is untrusted, and you want to run it in a subprocess to take advantage of the sandboxing and security features provided by the operating system
Apart from interactively exchanging data with a subprocess over stdin and
stdout, the other common use case for os.spawn is to connect multiple
subprocesses via their standard input and standard output. This lets them
exchange and process data in a streaming fashion.
The first streaming use case we will visit is to find the distinct contributors
to a Git repository. To run these steps in parallel, in a streaming pipeline,
you can use os.spawn as follows:
> {
val gitLog = os.spawn(cmd = ("git", "log"))
val grepAuthor = os.spawn(cmd = ("grep", "Author: "), stdin = gitLog.stdout)
val output = grepAuthor.stdout.lines().distinct
}
output: Vector[String] = Vector(
"Author: Li Haoyi",
"Author: Guillaume Galy",
"Author: Nik Vanderhoof",
...
7.51.scala
Here, we spawn one gitLog subprocess, and pass the stdout of gitLog into
the stdin of a grepAuthor subprocess. At that point, both os.SubProcesses
are running in the background, the output of one feeding into the input of the
other. The grepAuthor subprocess exposes a grepAuthor.stdout attribute that
you can use to read the output, either interactively via readLine as we saw
earlier, or in bulk via methods like .text() or .lines().
Subprocess pipelines do not need to start or end in-memory, or even on your
local computer. Here is an example of downloading some data from
api.github.com, and re-uploading it to httpbin.org, in a streaming fashion
using curl on both ends:
> {
val url = "https://api.github.com/repos/com-lihaoyi/cask/releases"
val download = os.spawn(cmd = ("curl", "-L", url))
val upload = os.spawn(
cmd = ("curl", "-X", "PUT", "-d", "@-", "https://httpbin.org/anything"),
stdin = download.stdout
)
val contentLength = upload.stdout.lines().filter(_.contains("Content-Length"))
}
val contentLength: Vector[String] = Vector(
" \"Content-Length\": \"1759343\", "
)
7.52.scala
Looking at the JSON output of the final upload response, we can see that the
"Content-Length" of the output is 1,759,343 bytes which roughly matches the
1.8M number we saw earlier. For now, we are just using a quick .contains
check on Strings to find the relevant line in the JSON: we will see how to do
JSON processing in a more structured way later in Chapter 8: JSON and Binary Data Serialization.
As each subprocess doesn't know anything about its neighbours, it is easy to
change the topology of the pipeline. For example, if we want to introduce
base64 and gzip steps in between the curl download and curl -X PUT
upload, we could do that as follows:
> {
val url = ...
val download = ...
+ val base64 = os.spawn(cmd = "base64", stdin = download.stdout)
+ val gzip = os.spawn(cmd = "gzip", stdin = base64.stdout)
val upload = os.spawn(
cmd = ("curl", "-X", "PUT", "-d", "@-", "https://httpbin.org/anything"),
- stdin = download.stdout
+ stdin = gzip.stdout
)
val contentLength = upload.stdout.lines().filter(_.contains("Content-Length"))
}
val contentLength: Vector[String] = Vector(" \"Content-Length\": \"201864\", ")
7.53.scala
os.spawn thus allows you to put together quite sophisticated subprocess
pipelines in a straightforward manner. In the example above, all four stages run
in parallel, with the download, base 64 encoding, gzip compression, and upload
taking place concurrently in four separate subprocesses. The data is processed in
a streaming fashion as it is piped between the four processes, allowing us to
perform the whole computation without ever accumulating the complete dataset in
memory.
In this chapter, we have seen how we can use the os package to make working
with the filesystem quick and easy, and how to conveniently set up subprocess
pipelines to accomplish what you want. From synchronizing files to doing network
data plumbing, all the tools you need are just a few lines of code away. We will
be using this package to work with files and subprocesses throughout the rest of
this book.
To use OS-Lib in a project built using Mill or another build tool, you need the following coordinates:
Mill
mvn"com.lihaoyi::os-lib:0.11.6"
In the interest of time, the scope of this chapter is limited. The OS-Lib library documentation is a much more thorough reference:
We will revisit the file synchronizer and subprocess pipelines we built in this chapter later in this book, in Chapter 17: Multi-Process Applications and Chapter 18: Building a Real-time File Synchronizer.
Exercise: Modify the folder sync function we wrote earlier to support deletes: if a file or folder exists in the destination folder but not in the source folder, we should delete it from the destination folder when synchronizing them.
See example 7.7 - FileSyncDelete
Exercise: Add another stage to our streaming subprocess pipeline that uses tee to
stream the compressed data to a local file while still uploading it to
httpbin.org.