Fuzz testing is a core component of modern software assurance, yet some bugs remain elusive to it.
Google's libwebp, integral to Chrome and Safari, recently received attention for a severe security flaw (CVE-2023-4863) that was found to be exploited in the wild. Notably, the bug was not flagged by Google's expansive fuzzing setup, including OSS-Fuzz.
As Ben Hawkes from Isosceles details, the bug resides in WebP’s lossless compression method, VP8L. Memory allocations based on pre-calculated fixed buffer sizes resulted in a heap buffer overflow.
Fuzzers are usually able to uncover such complex memory corruptions given enough time and solvable path constraints. However, this specific bug escaped previous fuzzing attempts.
The question the fuzzing community, and we ourselves, were asking: is it possible to find this specific vulnerability with fuzzing? And if so, why was it not found by Google's OSS-Fuzz initiative? This article attempts to answer these questions and also offers guidance for better fuzzing campaigns.
Note: The first iteration of this article claimed that the fuzzing setup found the libwebp vulnerability; however, this was due to an error where an already partially corrupted input file was accidentally added as a seed during the campaign.
As the lead developer behind AFL++, I wanted to understand why certain bugs manage to escape automated tools and how we can refine the tools for better bug coverage.
If you are new to fuzzing, you might want to read up on AFL++ first:
* https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/fuzzing_in_depth.md
* https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/afl-fuzz_approach.md
* https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.cmplog.md
* https://github.com/AFLplusplus/AFLplusplus/blob/stable/instrumentation/README.laf-intel.md
Installing AFL++ is straightforward if you have a full, modern LLVM compiler and the other necessary build tools installed (see https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/INSTALL.md):
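A typical source build, sketched here under the assumption that clang/LLVM, gcc and the usual build tools are already present:

```shell
git clone https://github.com/AFLplusplus/AFLplusplus
cd AFLplusplus
make all           # `make distrib` would additionally build QEMU mode etc.
sudo make install
```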
To test, we used the libwebp version just before the bug was fixed:
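A sketch of obtaining that revision; the hash below is the upstream fix commit as referenced in the public advisories, so its parent is the last vulnerable state (verify the hash before relying on it):

```shell
git clone https://github.com/webmproject/libwebp
cd libwebp
# parent of the "Fix OOB write in BuildHuffmanTable" fix commit
git checkout 902bc9190331343b2017211debcec8d2ab87e17a~1
```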
Rather than relying on the pre-existing harnesses or constructing a new one, we used the example tool that comes with the library, because we knew it can be used to trigger the bug. Hypothetically, those harnesses could have too narrow a scope for fuzzing, which could be the reason the bug wasn't found.
To optimize for speed, we moved the forkserver into the dwebp code, just before the input is parsed. Finally, we changed the compiler in makefile.unix to AFL++'s compiler.
We wanted AFL++ to focus solely on the code path that leads to this specific bug. For that, we used source code coverage compilation flags (`-fprofile-arcs -ftest-coverage -lgcov --coverage`). Then, we ran the target examples/dwebp with the problematic bad.webp file: `examples/dwebp bad.webp -o /dev/null`.
Since a crash does not generate a coverage file, we had also modified the source code beforehand to force one to be written:
After forcing the coverage file to be written, we used the gcov utility to parse and analyze the generated coverage data. We identified the functions involved in the crash path so that we could instrument only those, guiding the fuzzer towards the vulnerability more efficiently.
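A sketch of that analysis step; the exact source paths depend on the libwebp revision:

```shell
# produce per-function coverage summaries from the .gcda files
gcov -f -o src/dec src/dec/*.c > coverage.txt
# functions with non-zero line coverage were on the crash path
grep -B1 "Lines executed:[1-9]" coverage.txt | grep "^Function"
```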
Doing this can significantly speed up the fuzzing process for this goal. While the fuzzer could eventually find the bug without this focus, directing it saves a considerable amount of time in this specific experiment.
We created a file called libwebp.list that contains the names of the functions to be instrumented, which are relevant to the crash path:
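A sketch of the file and how it is wired up via the `AFL_LLVM_ALLOWLIST` mechanism; the function names below are an illustrative subset derived from our coverage run, not the complete list:

```shell
cat > libwebp.list <<'EOF'
# AFL_LLVM_ALLOWLIST syntax: one `fun:` (or `src:`) entry per line
fun: VP8LDecodeImage
fun: DecodeImageStream
fun: ReadHuffmanCodes
fun: ReadHuffmanCode
fun: VP8LBuildHuffmanTable
fun: BuildHuffmanTable
EOF
# tell the AFL++ compiler to instrument only these functions
export AFL_LLVM_ALLOWLIST=$(pwd)/libwebp.list
```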
For a thorough fuzzing campaign we want fuzzing instances that have a different view on coverage. For this we compile additional instances with context-sensitive coverage (CTX) and n-gram coverage with a window size of 8 (NGRAM-8).
Additionally we want fuzzing instances that help solve path constraints. For this we compile additional instances with CMPLOG and COMPCOV.
Finally we add an instance compiled with ASAN to easily detect any kind of memory corruption.
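The variant builds could be produced roughly like this, assuming afl-clang-fast is already set as the compiler in makefile.unix; the output binary names are our own convention:

```shell
for variant in CTX NGRAM-8; do       # different coverage views
  make -f makefile.unix clean
  AFL_LLVM_INSTRUMENT=$variant make -f makefile.unix
  cp examples/dwebp dwebp.$variant
done

make -f makefile.unix clean          # CMPLOG: log comparison operands
AFL_LLVM_CMPLOG=1 make -f makefile.unix
cp examples/dwebp dwebp.cmplog

make -f makefile.unix clean          # COMPCOV/laf-intel: split comparisons
AFL_LLVM_LAF_ALL=1 make -f makefile.unix
cp examples/dwebp dwebp.laf

make -f makefile.unix clean          # ASAN: catch silent memory corruption
AFL_USE_ASAN=1 make -f makefile.unix
cp examples/dwebp dwebp.asan
```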
Note that in a real-world fuzzing campaign you would add a honggfuzz instance as well as a libFuzzer instance with "value profile" enabled, and likely also a LibAFL instance and concolic execution fuzzers like SymCC and TritonDSE. This is omitted here so as not to make this article overly complex.
Initially, we gathered a few seed files from libwebp and a few random files we downloaded, and placed them in a directory named ‘seeds’. To ensure the fuzzer had a diverse set of inputs, we ran the deduplication command:
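With afl-cmin this could look as follows, keeping only inputs that produce unique coverage in the instrumented dwebp:

```shell
afl-cmin -i seeds -o seeds_unique -- examples/dwebp @@ -o /dev/null
```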
With five unique seed files, we were ready to start fuzzing. We set up eight different `screen` sessions, each with a different configuration, aiming for a diverse fuzzing setup.
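As an illustration, six of the eight sessions might be started like this; the variant binary names, the `seeds_unique` directory and the shared afl-env.sh defaults are assumptions from our setup:

```shell
. ./afl-env.sh
screen -dmS main   afl-fuzz -i seeds_unique -o out -M main   -- ./dwebp.plain   @@ -o /dev/null
screen -dmS ctx    afl-fuzz -i seeds_unique -o out -S ctx    -- ./dwebp.CTX     @@ -o /dev/null
screen -dmS ngram  afl-fuzz -i seeds_unique -o out -S ngram -p explore -- ./dwebp.NGRAM-8 @@ -o /dev/null
screen -dmS cmplog afl-fuzz -i seeds_unique -o out -S cmplog -c ./dwebp.cmplog -l 2 -- ./dwebp.plain @@ -o /dev/null
screen -dmS laf    afl-fuzz -i seeds_unique -o out -S laf    -- ./dwebp.laf    @@ -o /dev/null
screen -dmS asan   afl-fuzz -i seeds_unique -o out -S asan   -- ./dwebp.asan   @@ -o /dev/null
```

One `-M` main instance plus `-S` secondaries writing into the same output directory lets AFL++ synchronize finds between all instances.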
Clearly, a profound understanding of and experience with AFL++'s capabilities is a prerequisite. For those keen on delving deeper, the document at https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/fuzzing_in_depth.md offers thorough insights.
To standardize the environment for all fuzzing instances, we created an afl-env.sh script with default environment variables:
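A sketch of such a script; the specific variables and values chosen here are our assumptions, all of them documented AFL++ environment variables:

```shell
cat > afl-env.sh <<'EOF'
# shared defaults for every fuzzing instance
export AFL_AUTORESUME=1        # resume if the output directory exists
export AFL_IMPORT_FIRST=1      # sync the other instances' finds at startup
export AFL_TESTCACHE_SIZE=250  # in-memory test case cache (MB)
export AFL_SKIP_CPUFREQ=1      # don't require the performance governor
EOF
. ./afl-env.sh
```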
After 3 days, we reviewed the results and found no crash.
Only when an input is added that already contains two thirds of the required specially set-up tables is the vulnerability found, and even then it still needs about 40 hours to trigger.
Analysing the exact table setup that is needed for this vulnerability reveals why:
// Size of huffman_tables buffer = 654 + 630 + 630 + 630 + 410 = 2954 elements
// To overflow we "just" exceed this number!
static uint32_t code_lengths_counts[5][16] = {
// 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
{0, 1, 1, 0, 0, 0, 0, 0, 0, 3, 229, 41, 1, 1, 1, 2}, // size = 654
{0, 1, 1, 0, 0, 0, 0, 0, 0, 7, 241, 1, 1, 1, 1, 2}, // size = 630
{0, 1, 1, 0, 0, 0, 0, 0, 0, 7, 241, 1, 1, 1, 1, 2}, // size = 630
{0, 1, 1, 0, 0, 0, 0, 0, 0, 7, 241, 1, 1, 1, 1, 2}, // size = 630
{0, 1, 1, 1, 1, 1, 0, 0, 0, 11, 5, 1, 10, 4, 2, 2}, // size = 414
};
The fuzzer would need to find a very specific combination of bytes to create a table that has just the right size to trigger this bug. Below that number there is no crash; above it, the invalid table is detected and processing is aborted.
If there were a program flow specific to the crash, a fuzzer would be very good at finding it.
If a specific but common value were required (e.g. 0xffffffff), a fuzzer would be very good at finding that as well.
However, finding a long list of specific values, many of which are not 0, 1, or 255, is a challenge. The more of these values that have to be found without any help from path constraint solving, coverage feedback, or dynamically learning values from the target at runtime, the closer the task gets to simply impossible.
In this specific case the fuzzer would have needed to guess about 20 bytes with very specific values without any help, and it would take thousands of years (figuratively speaking; in reality even longer) until it randomly found such a setup.
Other mechanisms like concolic solvers are likewise unable to find such issues. Likely only formal verification methods that prove memory safety would have shown this vulnerability, but these are not something that can be easily set up and applied.
The technique of instrumenting only the execution path of a vulnerability, to see whether fuzzing can find it and, if not, why it fails, is very valuable for getting this information in a short amount of time. Without the optimization performed in Step 2, it would have taken around one core-year to come to the same conclusion.
This libwebp vulnerability is an important reminder that fuzzing is no silver bullet. Some bugs are impossible to find with fuzzing; other bugs need triggers that a fuzzer would not produce, or a large amount of data.
And this is not the first time high-profile vulnerabilities were not found by fuzzers, and could not have been.
Sometimes it is because of harness limitations, as with the infamous Heartbleed vulnerability.
Sometimes it is because the vulnerability type is not covered by a fuzzer, as with the infamous Shellshock vulnerability.
And sometimes it is because the fuzzer is just not able to trigger the bug condition, as in this libwebp vulnerability or, for example, the memory corruption in punycode certificate parsing in OpenSSL.
Manual code inspection, static analysis tools, automated testing and hands-on security tests together are required for a comprehensive approach to identifying vulnerabilities in software. Relying on fuzzing alone is setting yourself up for failure.
OSS-Fuzz had not discovered the vulnerability despite the presence of harnesses in the tests/fuzzer/ subdirectory of the libwebp project. We found that two of the seven harnesses (simple_api_fuzzer and advanced_api_fuzzer) could indeed trigger the crash with bad.webp:
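Replaying the crashing file through those harnesses confirms this; libFuzzer binaries execute file arguments once, and the build steps (which follow the OSS-Fuzz scripts) are omitted here:

```shell
./simple_api_fuzzer bad.webp     # reports a heap-buffer-overflow under ASAN
./advanced_api_fuzzer bad.webp   # same crash, reached via the advanced API
```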
It simply had no chance to find the specific setup required to trigger the bug, so this is not OSS-Fuzz's fault.
However, we found a few issues with how OSS-Fuzz performs its fuzzing which prevent it from reaching its full potential:
1. Static instrumentation: Originally, I had implemented a feature for the AFL++ integration in OSS-Fuzz that would randomly instrument the code differently each time it was compiled. This increased the probability of reaching new coverage and, subsequently, bugs. However, due to complications, this feature was removed by Google's team, limiting the fuzzing to a standard set of instrumentation options. This limits the views on coverage and path constraints in OSS-Fuzz.
2. Corpus synchronization issues: OSS-Fuzz uses ClusterFuzz to handle the actual fuzzing workload. An instance works independently, and its results are afterwards merged back into the main corpus. The method used for merging is libFuzzer's, which sees much less coverage than AFL++ (especially with AFL++ features like COMPCOV), and hence 10-20% of the coverage AFL++ finds is lost each time.
3. Finally, OSS-Fuzz runs ClusterFuzz in a CI-style fashion, running single instances for a few hours only and then terminating them. With a large corpus or a slow target, the startup time for intelligent fuzzers that first perform a calibration is huge and leaves little time for actual fuzzing.
Some bugs cannot be effectively found with CI-based fuzzing and instead need a long-running fuzzing campaign: one that uses different techniques to solve path constraints (CMPLOG, COMPCOV, libFuzzer's value profile and, for small and medium projects, one or two concolic execution frameworks), while also using different coverage options like CTX and/or NGRAM, and running multiple different instances in parallel.