Today, we'll continue our conversation from last time about testing and how to make it better.

Sorting records, continued

Recall that we considered generating random tests and validating properties of the output for the sort_records function.
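As a refresher, here is a minimal sketch of what that setup could look like in Python. The record shape ((name, score) pairs sorted by score), the generator, and this stand-in sort_records are assumptions for illustration; the important part is the two properties we check on every randomly generated input.

```python
import random

def sort_records(records):
    # Stand-in for the function under test: sort (name, score) records by score.
    return sorted(records, key=lambda record: record[1])

def random_records():
    # Randomly generate an input: a list of 0-20 records with random scores.
    n = random.randint(0, 20)
    return [(f"user{i}", random.randint(-100, 100)) for i in range(n)]

def check_properties(inp, out):
    # Property 1: the output is in non-decreasing order of the sort key.
    assert all(out[i][1] <= out[i + 1][1] for i in range(len(out) - 1))
    # Property 2: the output is a permutation of the input (nothing added, dropped, or changed).
    assert sorted(inp) == sorted(out)

for _ in range(1000):
    records = random_records()
    check_properties(records, sort_records(records))
```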

Have we proven anything about the properties of our records sorting function?

(Think, then click!)

No! We still aren't covering the complete input space.
We've shown that sort_records works on a potentially large number of specific inputs, but we have not done anything to reason about all possible inputs.


Question: if our property checks say a specific input and output combination for sort_records is correct, is that output necessarily correct?

(Think, then click!)

Yes, since the properties we check compose to make up the full specification of the function.


We can make a distinction relevant to these two answers: soundness vs. completeness.

An analysis is sound if everything it accepts is actually correct: whenever the analysis reports that a program (or a specific input/output pair) meets the specification, it really does.

An analysis is complete if it finds all instances of failure (that is, if the program fails to meet the specification on some input, the analysis must find that failure).

:star: This distinction comes from proof theory: a proof system is sound if everything that is provable is in fact true, and complete if everything that is true has a proof (Source: Cornell soundness and completeness lecture notes).

Other examples

Consider another example function: integer prime factorization.

The task of this function: given an integer, express that integer as the product of primes.

For example:

4 -> 2 * 2
1 -> 1
12 -> 2 * 2 * 3

What is the type signature of this function?

(Think, then click!)

Several are possible; one is:

Int -> List<Int>


We'll again break down the task of property-based testing into two concerns: (1) generating inputs, and (2) checking the validity of the output the function calculates for that input.

How could we generate inputs to this function?

(Think, then click!)

A simple random number generator.


What do we need to check for validity?

(Think, then click!)

We need to check two properties: (1) the product of the returned factors equals the original input, and (2) every returned factor is prime.

One problem we encounter with this example is that the second property is more difficult to check! Determining whether a number is prime can be computationally expensive as numbers get larger.

We might test just a subset of the properties: for example, that the product is correct and that no factor other than 2 is even.
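Concretely, here is a sketch of what both the full check and the cheaper subset check could look like in Python (the primality test uses trial division, which is what becomes expensive for large factors; the example inputs are invented for illustration):

```python
import math

def is_prime(n):
    # Trial division: simple, but increasingly expensive as n grows.
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

def check_full(n, factors):
    # Full specification: the factors multiply back to n, and every factor is prime.
    assert math.prod(factors) == n
    assert all(is_prime(f) for f in factors)

def check_subset(n, factors):
    # Cheaper partial check: correct product, and no factor other than 2 is even.
    assert math.prod(factors) == n
    assert all(f == 2 or f % 2 == 1 for f in factors)

# Both checks accept a correct answer:
check_full(12, [2, 2, 3])
# But the subset check also accepts this wrong answer, since 9 is odd but not prime:
check_subset(18, [2, 9])
```

Keep that last line in mind for the next question.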

Is this analysis sound? Is it complete?

(Think, then click!)

No, it is not sound, because we might allow an input-output pair to "pass" even if a large factor is not actually a prime. It's also still not complete, since we don't check all inputs.


Assorted testing topics

Next, we'll go through an assortment of topics related to testing, with an eye toward :dark_sunglasses: some industry terminology.

"Test-driven development"

The approximate formula:

  1. Take your specification from the client.
  2. Define the data types your program needs, with only as much detail as you need to write tests.
  3. Convert the specifications into tests. Run the tests; any new ones should fail (because you haven't implemented the functionality yet). A sketch of what such a test might look like appears after this list.
  4. Implement the simplest code that passes the new tests. Check that they pass.
  5. Improve/optimize the code while maintaining that the tests pass.
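For instance, suppose the client's specification says a discount function should take 10% off orders of $100 or more. A test written before any implementation exists might look like the following (the module and function names here are hypothetical); running it should fail until discount.py is actually written:

```python
# test_discount.py -- written before discount.py exists, so this should fail at first.
from discount import apply_discount  # hypothetical module under test

def test_orders_of_100_or_more_get_ten_percent_off():
    assert apply_discount(100.0) == 90.0
    assert apply_discount(250.0) == 225.0

def test_smaller_orders_are_unchanged():
    assert apply_discount(99.0) == 99.0
```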

While not every situation or team will follow this exact test-driven development formula, the idea of writing tests early (before writing your implementation, if you can) is a good one.

Anecdote: a class I had as an undergraduate actually required us to turn in our test suite far before the assignment was due, and we were graded on how many bad implementations our test suite could uncover!

Distinction: unit vs. integration testing

"Unit" testing tends to mean checking a single, specific function in your codebase.

Large scale software projects also aim to have "integration" tests that intend to check functionality across numerous functions, typically with the goal of modeling a production flow of data. For example, an integration test for a web-based application may test that a piece of data coming in from the network is correctly parsed, analyzed, then written to the database.

A challenge with integration tests is that some inputs and outputs of our computer system may be difficult to simulate in a testing environment. For example, we wouldn't want our tests to write fake data to a real production database.

Mocking objects/designing for testing

To test code written for production without actually impacting production, integration tests often need to "mock" some components of the system.

:star: An analogy from Wikipedia: you can think of mock objects as being similar to crash test dummies for testing cars, except for software!

System designers could choose to mock objects for the following reasons: because the real code would modify a production resource (which could also have privacy implications), because the test is trying to simulate something that happens intermittently (like network traffic or a timer going off), or because it would be impractical for an automated test to wait for certain inputs (user interaction, sensor or Bluetooth data).

Mock objects share the same interface as the "real" objects (for example, in an object-oriented language like Java, they might subclass the same base class or implement the same interface) but have a minimal implementation.

Often, the production implementation code itself might need "hooks" to allow engineers to swap in a testing configuration.
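Here is a minimal sketch of such a hook in Python, using constructor injection: the pipeline talks to whatever database object it is given, so a test can swap in an in-memory fake. (All of the class and method names are invented for illustration; many teams instead use a library such as Python's unittest.mock.)

```python
class RealDatabase:
    # Production implementation: would talk to the actual database (omitted here).
    def insert(self, table, row):
        raise NotImplementedError("would write to the production database")

class FakeDatabase:
    # Mock/fake: same interface as RealDatabase, but a minimal in-memory implementation.
    def __init__(self):
        self.rows = []
    def insert(self, table, row):
        self.rows.append((table, row))

class IngestPipeline:
    # The code under test. Taking the database as a constructor argument is the "hook".
    def __init__(self, db):
        self.db = db
    def handle_message(self, message):
        parsed = message.strip().split(",")
        self.db.insert("events", parsed)

def test_handle_message_writes_parsed_row():
    db = FakeDatabase()
    IngestPipeline(db).handle_message("click,button_a\n")
    assert db.rows == [("events", ["click", "button_a"])]
```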

When designing a complex system, it's a good idea to come up with a testing plan (including whether you'll need to mock objects) upfront.

Metric: Test coverage

One way to measure the "success" of a test suite is to look at its *coverage*: the proportion of lines of code executed by a run of the test suite (or sometimes, the proportion of blocks in the control flow graph executed).

Different languages have different tools and libraries for automatically calculating test coverage. Some teams require new code to strictly improve the total test coverage metric for a project.

What do you think is a reasonable test coverage percentage? Why?

Google Code Coverage Best Practices: "Although there is no “ideal code coverage number,” at Google we offer the general guidelines of 60% as “acceptable”, 75% as “commendable” and 90% as “exemplary.”"

Does 100% test coverage mean the code always meets its specification? Why or why not?
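As a hint, consider this small made-up example: a single test executes every line of the function (100% line coverage) and passes, yet the function is still wrong, because coverage only measures which lines ran, not whether their behavior was checked against the right specification.

```python
def absolute_value(x):
    # Buggy: returns x unchanged for negative inputs, so the sign is wrong.
    if x < 0:
        return x
    return x

def test_absolute_value():
    # Both branches execute, so line coverage is 100%...
    assert absolute_value(3) == 3
    # ...but this assertion encodes the wrong expected value, so the test still passes.
    assert absolute_value(-2) == -2
```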

Coverage-guided fuzzing

:star: Note, fuzzing is widely used! See OSS-Fuzz, a continuous fuzzing service for open-source software.

One distinction between fuzzing and PBT is that fuzzing systems tend to have less programmer specification on how to generate inputs (though the terminology is quite fuzzy here in practical use).

One way that fuzzing systems get better without having programmers specify exact input generation strategies is by automatically searching for inputs that achieve higher test coverage.

For example, if we were fuzzing a Linux command-line utility, purely random inputs would mostly fail early, in the code that parses the input strings. That means the inputs fail before they reach the interesting "business logic" of the program. Fuzzing systems therefore try to build on inputs that achieve progressively higher coverage, since higher coverage indicates that the inputs have gotten "deeper" into the code.
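Here is a toy sketch of that feedback loop. The coverage signal is faked by having the target record which branches it reached; real fuzzers (e.g., AFL or libFuzzer) get this by instrumenting the compiled program, and their mutation strategies are far more sophisticated than the one-character edits used here.

```python
import random
import string

def target(data, hits):
    # Toy program under test: `hits` records which branches were reached.
    hits.add(0)
    if data.startswith("c"):
        hits.add(1)
        if data.startswith("cm"):
            hits.add(2)
            if data.startswith("cmd"):
                raise RuntimeError("bug reached")

def mutate(s):
    # Append or replace one random character.
    c = random.choice(string.ascii_lowercase)
    if s and random.random() < 0.5:
        i = random.randrange(len(s))
        return s[:i] + c + s[i + 1:]
    return s + c

def fuzz(rounds=100_000):
    corpus = [""]   # seed input(s)
    seen = set()    # union of all branches reached so far
    for _ in range(rounds):
        candidate = mutate(random.choice(corpus))
        hits = set()
        try:
            target(candidate, hits)
        except RuntimeError:
            return candidate             # found a crashing input
        if not hits <= seen:             # candidate reached new coverage: keep it
            seen |= hits
            corpus.append(candidate)
    return None

print(fuzz())  # usually prints some string starting with "cmd"
```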

Test-case reduction

Note that whether you run into an issue manually or find it via an automated technique like fuzzing or PBT, bug reports are more useful when they are concise, replicable, and actionable.

Filing good bug reports is a skill! Bug reports should be as simple/concise as possible in outlining how to reproduce the bug (that is, a randomly generated input should be minimized).

Bug reports should have all of the necessary context (OS, software versions, exact commands run) for the reader to be able to reliably replicate the bug on their end. Including expected results helps to make the bug actionable.
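One simple way to automate that minimization is sketched below: greedily try deleting chunks of the failing input and keep any deletion after which the bug still reproduces. (Real tools, such as delta debugging, C-Reduce, or the shrinking built into property-based testing libraries, are much smarter; `triggers_bug` here is a stand-in for "re-run the program and check whether the failure still occurs".)

```python
def reduce_input(data, triggers_bug):
    # Greedy reduction: repeatedly drop chunks of `data` while the bug still reproduces.
    chunk = max(1, len(data) // 2)
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]   # try removing data[i:i+chunk]
            if triggers_bug(candidate):
                data = candidate                      # keep the smaller failing input
            else:
                i += chunk                            # that chunk was needed; move on
        chunk //= 2
    return data

# Toy usage: pretend the "bug" fires whenever the input contains both "<" and ">".
failing = "lots of padding <x> even more padding"
print(reduce_input(failing, lambda s: "<" in s and ">" in s))  # prints a much shorter input
```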

For a fun video example of manual test case reduction, see this blog post and video by my PhD advisor, Adrian Sampson.

Differential testing

Sometimes, it's hard to describe the correct output with properties for every generated input.

For an example close to my heart, consider compilers that must generate assembly code (e.g., x86-64). There are many different valid sequences of x86 instructions to implement a specific high-level piece of code.

Differential testing is the idea that while the exact outputs of multiple distinct systems need not be the same, some behavior of those outputs should be.

In the case of compilers, differential testing involves compiling the same input program with multiple distinct compilers (e.g., GCC, LLVM, Intel's ICC) and checking that when executed, the programs produce the same output.

Differential testing is useful if you have multiple versions of your implementation (e.g., even an unoptimized vs. an optimized version of the same program) and can compare some aspect of the output behavior between the systems.
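As a minimal sketch of that last idea: below, two independent implementations of the same (invented) specification are fed the same random inputs, and the test only asserts that their observable outputs agree, without ever stating what the correct answer is.

```python
import random

def sum_naive(xs):
    # Straightforward reference implementation.
    total = 0
    for x in xs:
        total += x
    return total

def sum_clever(xs):
    # A second, independently written implementation of the same specification.
    return sum(sorted(xs))

# Differential testing: we never say what the correct sum is, only that the two
# implementations must agree on every randomly generated input.
for _ in range(10_000):
    xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
    assert sum_naive(xs) == sum_clever(xs), xs
```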