Demystifying Fuzzer Behaviour

Day 1 · Dec. 27, 2025, 11:55–12:35 · Ground · en · Science
Despite how it's often portrayed in blogs, scientific articles, or corporate test planning, fuzz testing isn't a magic bug printer; just saying "we fuzz our code" says nothing about how _effectively_ the code was tested. Yet how fuzzers and programs interact is deeply mythologised and poorly understood, even by seasoned professionals. This talk analyses a number of recent works and case studies that reveal the relationship between fuzzers, their inputs, and programs to explain _how_ fuzzers work.

Fuzz testing (or, "fuzzing") is a testing technique that passes randomly-generated inputs to a subject under test (SUT). The term was coined in 1988 by Miller to describe sending random byte sequences to Unix utilities [1], but the approach was arguably preceded in 1971 by Breuer's work on fault detection in sequential circuits [2] and in 1972 by Purdom's parser testing by generating sentences from grammars [3]. Curiously, all three take different approaches to generating inputs based on knowledge about the SUT, though none of them use feedback from the SUT to make decisions about new inputs.
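To make the idea concrete, here's a minimal Miller-style fuzzer sketched in Python (not from the talk, just an illustration): it pipes uniformly random bytes into a program's standard input and saves any input that makes the process die from a signal. The `./sut` path is a placeholder for whatever program you point it at.

```python
import random
import subprocess

TARGET = ["./sut"]  # placeholder: any program that reads from stdin

def random_input(max_len: int = 4096) -> bytes:
    """A blob of uniformly random bytes, Miller-style: no structure, no feedback."""
    return bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))

def fuzz_once(run_id: int) -> bool:
    """Send one random input to the target; save it if the process died from a signal."""
    data = random_input()
    proc = subprocess.run(TARGET, input=data,
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if proc.returncode < 0:  # negative return code: killed by a signal (e.g. SIGSEGV)
        with open(f"crash-{run_id}.bin", "wb") as f:
            f.write(data)
        return True
    return False

if __name__ == "__main__":
    for i in range(100_000):
        fuzz_once(i)
```

Note that every run is independent: nothing learned from one input shapes the next, which is exactly the property later feedback-driven fuzzers abandoned.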

Fuzzing wasn't yet popular, but industry was catching on. Between the late 90s and 2013, a number of strategies appeared in industry [4]. Some had success with constraint solvers, observing runtime behaviour or exploiting knowledge of a target's structure to produce higher-quality inputs. Others took a different route, taking an existing input and tweaking it slightly ("mutating") to compensate for the low likelihood of pure random generation producing structured inputs. None was as successful, or as popular, as American Fuzzy Lop, or "AFL", released in 2013. It combined coverage observations for inputs (Ormandy, 2007) with concepts from evolutionary novelty search [5] into a tool which could, starting from very few initial inputs, evolve over many rounds of mutation to reach new, untested code.
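The core loop AFL popularised can be sketched in a few lines. This is not AFL's implementation, only an illustration of the idea: mutate an existing input, run it, and keep it in the corpus only if it reaches an edge no earlier input has hit. The `run_with_coverage` callback is a stand-in for the edge-coverage feedback AFL actually collects through compile-time instrumentation and a shared-memory bitmap.

```python
import random
from typing import Callable, Set

def mutate(data: bytes) -> bytes:
    """One small, AFL-style mutation: flip a bit, overwrite a byte, or insert a few bytes."""
    buf = bytearray(data) or bytearray(b"\x00")
    pos = random.randrange(len(buf))
    choice = random.randrange(3)
    if choice == 0:
        buf[pos] ^= 1 << random.randrange(8)              # bit flip
    elif choice == 1:
        buf[pos] = random.randrange(256)                  # random byte overwrite
    else:
        buf[pos:pos] = bytes(random.randrange(256)        # short random insertion
                             for _ in range(random.randrange(1, 8)))
    return bytes(buf)

def evolve(seeds: list, run_with_coverage: Callable[[bytes], Set], rounds: int = 100_000) -> list:
    """Keep any mutant that reaches an edge no previous input has reached (novelty search)."""
    corpus = list(seeds)
    seen: Set = set()
    for seed in corpus:
        seen |= run_with_coverage(seed)
    for _ in range(rounds):
        candidate = mutate(random.choice(corpus))
        edges = run_with_coverage(candidate)
        if edges - seen:                 # new coverage: this input exercised untested code
            seen |= edges
            corpus.append(candidate)     # it becomes a new starting point for mutation
    return corpus
```

Even from a single seed, repeated mutation plus the "keep only novelty" rule is enough to gradually discover inputs that reach deeper into the program.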

Despite its power, this advancement made it far harder to understand how fuzzers even worked. Now all you had to do was point the tool at a program and it would start testing, and coverage would go up; users were only responsible for writing "harnesses", code which processes fuzzer-produced inputs and passes them to the SUT. Though there have been a few real advances to fuzzing since (or, at least, strategies which combine previous methods more effectively), fuzzing research has mostly dead-ended, with new methods squeezing only minor improvements out of older ones. This, and inadequate harness writing, stems from the opacity of how fuzzers operate internally: without understanding what these tools do from first principles, there's no clear "right" and "wrong" way to do things because there is no mental model to test them against.
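As an example of what a harness looks like in practice, here is a minimal sketch using Atheris, Google's coverage-guided fuzzer for Python; `json.loads` stands in for whatever SUT you actually care about. The harness's only job is to turn the fuzzer's raw bytes into a well-formed call on the SUT's API and to swallow the exceptions that are expected for malformed input.

```python
import sys
import atheris

with atheris.instrument_imports():
    import json  # stand-in SUT; replace with the library you want to test

def TestOneInput(data: bytes) -> None:
    """Harness: convert the fuzzer's raw bytes into a call on the SUT."""
    fdp = atheris.FuzzedDataProvider(data)
    text = fdp.ConsumeUnicodeNoSurrogates(len(data))
    try:
        json.loads(text)
    except json.JSONDecodeError:
        pass  # expected for malformed input; real bugs surface as other exceptions or crashes

atheris.Setup(sys.argv, TestOneInput)
atheris.Fuzz()
```

Even this tiny harness bakes in decisions that shape what the fuzzer can find: how the raw bytes are decoded and which exceptions count as "expected" both constrain the input space the SUT actually sees.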

This talk isn't about new bugs, new fuzzers, or new harness generation tools. Its purpose is to uncover the mechanisms of fuzzer input production across different classes of SUT and the harnesses written for them, highlighting recent papers which have clarified our understanding of how fuzzers and SUTs interact. By the end, you will have a better understanding of why modern fuzzers work, what their limitations are, and how you can write better fuzzers and harnesses yourself.
