Developing Fuzzer Targets for ICU APIs

Contents

  1. Directory and naming conventions
  2. General structure of a fuzzer target
  3. Makefile.in changes
  4. Fuzzer seed corpus
  5. Guidelines and tips
  6. How to locally reproduce fuzzer findings

This document describes how to develop a fuzzer target for an ICU API and how to integrate it into the ICU build process.

Directory and naming conventions

Fuzzer targets reside exclusively in the directory source/test/fuzzer/, and their file names end with _fuzzer.cpp. Only files with this suffix are recognized and executed as fuzzer targets by the OSS-Fuzz system.

General structure of a fuzzer target

As a minimum, a fuzzer target contains the function

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  ...
}

This function is expected and invoked by the fuzzer system. The data parameter contains the fuzzer-controlled data of size size bytes. Part or all of this data is then passed into the ICU API under test.

Fuzzer target collator_rulebased_fuzzer.cpp illustrates the basic elements.

// © 2019 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html

#include <cstring>
#include <memory>

#include "fuzzer_utils.h"
#include "unicode/coll.h"
#include "unicode/localpointer.h"
#include "unicode/locid.h"
#include "unicode/tblcoll.h"

IcuEnvironment* env = new IcuEnvironment();

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  UErrorCode status = U_ZERO_ERROR;

  size_t unistr_size = size/2;
  std::unique_ptr<char16_t[]> fuzzbuff(new char16_t[unistr_size]);
  std::memcpy(fuzzbuff.get(), data, unistr_size * 2);
  icu::UnicodeString fuzzstr(false, fuzzbuff.get(), unistr_size);

  icu::LocalPointer<icu::RuleBasedCollator> col1(
      new icu::RuleBasedCollator(fuzzstr, status));

  return 0;
}

The ICU API under test is the RuleBasedCollator(const UnicodeString &rules, UErrorCode &status) constructor. The code interprets the fuzzer data as a UnicodeString and passes it to the constructor; nothing more is needed. Specific error handling or return value verification is not required because memory errors triggered by the input are reported as memory/address sanitizer findings.

Makefile.in changes

ICU fuzzer targets are built and executed by the OSS-Fuzz project. On the ICU side they are compiled to ensure that the code is syntactically correct and, as a sanity check, executed in the most basic manner, i.e. with minimal test data and without ASAN or MSAN analysis.

Add the new fuzzer target to the list of targets in the FUZZER_TARGETS variable in Makefile.in. The new fuzzer target will then be built and executed as part of a normal ICU4C unit test run. Note that each fuzzer target becomes an executable of its own. To that end it is linked with the code in fuzzer_driver.cpp, which contains the main() function.
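
To illustrate the role of such a driver, the following is a minimal sketch of how a standalone driver could feed the contents of a file to a fuzzer target. It is not the content of fuzzer_driver.cpp, which may work differently; the names FuzzEntryPoint and RunTargetOnFile are hypothetical, and the entry point is passed as a function pointer only to keep the sketch self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Signature of the entry point that every fuzzer target defines
// (LLVMFuzzerTestOneInput).
using FuzzEntryPoint = int (*)(const uint8_t* data, size_t size);

// Hypothetical helper: reads the whole file at `path` into memory and
// passes the bytes to `target` exactly once. Returns the target's return
// value, or -1 if the file cannot be opened.
int RunTargetOnFile(FuzzEntryPoint target, const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return -1;
  }
  std::vector<uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
  return target(bytes.data(), bytes.size());
}
```

In a real driver, main() would presumably call something like this with LLVMFuzzerTestOneInput and a path taken from the command line.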

Fuzzer seed corpus

Any fuzzer seed data for a fuzzer target goes into a file named <fuzzer_target>_seed_corpus.txt. In many cases the input parameter of the ICU API under test is of type UnicodeString, in which case the seed data should be in UTF-16 format. As an example, see collator_rulebased_fuzzer_seed_corpus.txt.
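
Because the target shown earlier copies the input bytes directly into a char16_t buffer, the byte order of the seed file has to match the machine that runs the fuzzer. The following sketch shows what UTF-16 seed bytes look like, assuming a little-endian host (an assumption, not something this document states). The function name EncodeUtf16LE is hypothetical, and the rule string in the usage note is only an illustrative input.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serializes UTF-16 code units as little-endian
// bytes, low byte first, as they would appear in a seed corpus file.
std::vector<uint8_t> EncodeUtf16LE(const std::u16string& text) {
  std::vector<uint8_t> bytes;
  bytes.reserve(text.size() * 2);
  for (char16_t unit : text) {
    bytes.push_back(static_cast<uint8_t>(unit & 0xFF));
    bytes.push_back(static_cast<uint8_t>(unit >> 8));
  }
  return bytes;
}
```

For example, EncodeUtf16LE(u"&a<b") yields eight bytes, two per code unit; writing them to a file in binary mode gives seed data in the layout described above.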

Guidelines and tips

  • Leave all randomness to the fuzzer. If a random selection of any kind is needed (e.g., of a locale), then use bytes from the fuzzer data to make the selection.
  • In many cases ICU unit tests can provide seed data or at least ideas for seed data. If the API under test requires a Unicode string, make sure that the seed data is in UTF-16 encoding. This can be achieved, for example, with the ‘iconv’ command or with an editor that saves text in UTF-16.
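
The first guideline can be sketched as follows: one byte of the fuzzer data is consumed to choose a locale, and the remaining bytes stay available for the API under test. This is a minimal illustration, not code from an existing ICU fuzzer target; the locale list kLocaleNames and the function PickLocaleName are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical candidate list for illustration only; a real target might
// instead choose among the locales that ICU reports as available.
constexpr std::array<const char*, 4> kLocaleNames = {
    "en_US", "de_DE", "ja_JP", "ar_EG"};

// Consumes the first input byte to choose a locale name. Advances `data`
// and shrinks `size` so the remaining bytes can feed the API under test.
// Falls back to the first entry when the input is empty.
const char* PickLocaleName(const uint8_t*& data, size_t& size) {
  if (size == 0) {
    return kLocaleNames[0];
  }
  const size_t index = data[0] % kLocaleNames.size();
  ++data;
  --size;
  return kLocaleNames[index];
}
```

Since the selection depends only on the input bytes, the same test data always selects the same locale, which keeps a finding reproducible.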

How to locally reproduce fuzzer findings

At this time, reproducing fuzzer findings requires Docker to be installed on the local machine and the OSS-Fuzz project to be cloned into a local git workspace.

  1. Install Docker (Ubuntu):

    sudo apt install docker.io
    
  2. Download OSS-Fuzz and switch into the directory oss-fuzz/. In a local git workspace directory, clone the fuzzer system:

    git clone https://github.com/google/oss-fuzz.git
    cd oss-fuzz/
    
  3. Build the Docker image for ICU. In some setups root permissions may be required to connect to the Docker daemon.

    [sudo] python infra/helper.py build_image icu
    

    A prompt will appear: Pull latest base images (compiler/runtime)? (y/N) Respond: ‘N’. If you are curious then respond with ‘y’ (won’t hurt).

  4. Build the ICU fuzzers:

    [sudo] python infra/helper.py build_fuzzers --sanitizer [address | memory | undefined] icu
    

    Check that the fuzzer targets were built successfully: ls -l build/out/icu

  5. Reproduce the fuzzer finding. First, get the test data the fuzzer used when it found the issue. In the fuzzer bug report, look for ‘Reproducer Testcase’; clicking the link downloads the test data. Then execute

    [sudo] python infra/helper.py reproduce icu <icu_fuzzer> <testdata>
    

    Concrete example:

    sudo python infra/helper.py reproduce icu uregex_open_fuzzer  ~/Downloads/clusterfuzz-testcase-minimized-uregex_open_fuzzer-5732067058384896
    

Limitations: When reproducing a fuzzer finding as outlined above, the fuzzer environment uses the current ICU trunk from https://github.com/unicode-org/icu.git. Thus it is not possible to modify the code locally to try out a possible fix. What can be done is to redirect Docker to clone ICU from a forked ICU repository. Open the file oss-fuzz/projects/icu/Dockerfile and adjust the line with git clone --depth 1 https://github.com/unicode-org/icu.git icu accordingly. Then modify the code in the forked repository and repeat the steps above, beginning with step 3 (build the Docker image).

This of course is still a tedious way of reproducing and working on a fuzzer finding. Ticket ICU-20734 aims to introduce a fuzzer driver that can reproduce certain fuzzer findings in a local ICU workspace.