CarpetFuzz


CarpetFuzz is an NLP-based fuzzing assistance tool for generating valid option combinations.

The basic idea of CarpetFuzz is to use natural language processing (NLP) to identify and extract relationships (e.g., conflicts or dependencies) among program options from each option's description in the documentation, and then to filter out invalid combinations so that fewer option combinations need to be fuzzed.
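
As a rough, purely illustrative example (not CarpetFuzz's actual matching logic), option descriptions in manpages often state such relationships in plain English, e.g. "This option can only be used with ..." (a dependency) or "... cannot be used together with ..." (a conflict). A simple grep over one of the sample manpages shipped in this repository gives a feel for the kind of sentences the NLP pipeline analyzes (the path is assumed from the tests/ directory listed below):

# Illustration only -- CarpetFuzz parses these sentences with NLP, not grep
grep -inE "only be used|cannot be used|used together with|requires" tests/manpages/tiffcp.1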

For more details, please refer to our paper from USENIX Security'23.

The CarpetFuzz-experiments repository contains the data sets, scripts, and documentation required to reproduce our results in the paper.

Prerequisites

Any mainstream computer should be sufficient to run CarpetFuzz; a machine with a 1-core CPU, 8 GB of RAM, and a 128 GB hard drive is enough.

Structure

Directory         Description
dataset           Training dataset used to obtain the model
fuzzer            Modified fuzzer that switches option combinations on the fly (submodule)
images            Images used in README.md
models            Models used to extract relationships
output            CarpetFuzz's output files
pict              Microsoft's pairwise tool (submodule)
scripts           Python scripts that identify and extract relationships and rank combinations by their dry-run coverage
scripts/utils     General-purpose utility classes
tests             Sample files for testing CarpetFuzz
tests/dict        Sample dictionary file used to generate stubs (covering 49 programs)
tests/manpages    Sample manpage files

We have structured our code carefully and provided extensive comments to aid comprehension. The implementations of CarpetFuzz's components can be found in the following functions:

Section   Component                  File                              Function
3.2       EDR Identification         scripts/find_relationship.py      identifyExplicitRSentences
3.3       IDR Identification         scripts/find_relationship.py      identifyImplicitRSentences
3.4       Relationship Extraction    scripts/find_relationship.py      extractRelationships
3.5       Combination                scripts/generate_combination.py   main
3.5       Prioritization             scripts/rank_combination.py       main

Supported Environments

We recommend running CarpetFuzz on Linux. We have tested it on the following operating systems:

  • Ubuntu 18.04
  • Ubuntu 20.04

While our testing has primarily focused on these operating systems, it may also work on other Linux distributions. Ensure that your system meets the following requirements:

  • Linux operating system (Ubuntu 18.04 or 20.04 is recommended)
  • Python 3.6 or higher
  • LLVM 12.0.0 or higher
  • Required dependencies (detailed instructions will be provided during the installation process)

Please note that CarpetFuzz may not work, or may encounter compatibility issues, on non-Linux systems. We recommend running CarpetFuzz on a supported Linux distribution for the best experience and performance.
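
Before installing, you can quickly check the Python and LLVM requirements (a suggestion on our part; adjust the binary names if your distribution ships versioned ones such as clang-12 or llvm-config-12):

# Quick environment check (binary names may differ per distribution)
python3 --version       # should report 3.6 or higher
clang --version         # should report LLVM/clang 12.0.0 or higher
llvm-config --version   # alternative way to query the LLVM version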

Installation

For easy installation, we offer a ready-to-use Docker image:

sudo docker pull 4ugustus/carpetfuzz 

or you can compile the image yourself using the Dockerfile we provide.

# Download CarpetFuzz repo with the submodules
git clone --recursive https://github.com/waugustus/CarpetFuzz
cd CarpetFuzz
# Build image
sudo docker build -t 4ugustus/carpetfuzz:latest .

Alternatively, you can build CarpetFuzz yourself:

# Download CarpetFuzz repo with the submodules
git clone --recursive https://github.com/waugustus/CarpetFuzz
cd CarpetFuzz

# Build CarpetFuzz-fuzzer (LLVM 11.0+ is recommended)
pushd fuzzer
make clean all
popd

# Build Microsoft pict
pushd pict
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
pushd build && ctest -v && popd
popd

# Install required pip modules (virtualenv is recommended)
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 -m spacy download en_core_web_sm-3.0.0 --direct
echo -e "import nltk\nnltk.download('averaged_perceptron_tagger')\nnltk.download('omw-1.4')\nnltk.download('punkt')\nnltk.download('wordnet')"|python3

# Download AllenNLP's parser model
wget -P models/ https://allennlp.s3.amazonaws.com/models/elmo-constituency-parser-2020.02.10.tar.gz
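
As an optional sanity check (our suggestion, not part of the official steps), you can verify that the spaCy model, the NLTK corpora, and the downloaded AllenNLP parser archive are in place before running the scripts:

# Optional sanity check; run inside the virtualenv created above
python3 -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
python3 -c "from nltk.corpus import wordnet; wordnet.synsets('option'); print('NLTK corpora OK')"
ls -lh models/elmo-constituency-parser-2020.02.10.tar.gz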

Usage (Minimal Working Example)

We use the program tiffcp from the paper as an example:

# Step 1 ( < 5mins )
# Create container
sudo docker run -it 4ugustus/carpetfuzz bash
# Libtiff has already been built
cd /root/programs/libtiff

# Step 2
# Use CarpetFuzz to analyze the relationships from the manpage file  ( < 10mins )
python3 ${CarpetFuzz}/scripts/find_relationship.py --file $PWD/build_carpetfuzz/share/man/man1/tiffcp.1
# Based on the relationship, use pict to generate 6-wise combinations  ( depends on #OPT )
python3 ${CarpetFuzz}/scripts/generate_combination.py --relation ${CarpetFuzz}/output/relation/relation_tiffcp.json
# Rank each combination with its dry-run coverage ( < 10mins )
python3 ${CarpetFuzz}/scripts/rank_combination.py --combination ${CarpetFuzz}/output/combination/combination_tiffcp.txt --dict ${CarpetFuzz}/tests/dict/dict.json --bindir $PWD/build_carpetfuzz/bin --seeddir input

# Step 3
# Fuzz with the ranked stubs
${CarpetFuzz}/fuzzer/afl-fuzz -i input/ -o output/ -K ${CarpetFuzz}/output/stubs/ranked_stubs_tiffcp.txt -- $PWD/build_carpetfuzz/bin/tiffcp @@
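
Once the fuzzer has run for a while, crashes can be inspected as with other AFL-style fuzzers. The exact output layout depends on the fuzzer version, so treat the paths below as an assumption:

# Inspect the crashes found so far (path is an assumption; depending on the
# fuzzer version it may be output/crashes/ or output/default/crashes/)
ls output/crashes/ 2>/dev/null || ls output/default/crashes/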

If you built CarpetFuzz yourself, replace Step 1 with the following:

(Note: starting with libtiff v4.5.0, manpages are no longer generated during compilation. We are not aware of the developers' reasons for this change. If you need these manpage files, the easiest workaround is to revert to an earlier version.)

(Update: you can run sphinx-build -b man source/rst/dir build/man/dir to generate the manpages for newer versions of libtiff. Thanks to @Mist1987 for providing this method.)

# Step 1 (without docker)
# Set the environment
export CarpetFuzz=/path/to/CarpetFuzz
# Download and build the tiffcp repo with CarpetFuzz-fuzzer
git clone https://gitlab.com/libtiff/libtiff
cd libtiff
git reset --hard b51bb
sh ./autogen.sh
CC=${CarpetFuzz}/fuzzer/afl-clang-fast CXX=${CarpetFuzz}/fuzzer/afl-clang-fast++ ./configure --prefix=$PWD/build_carpetfuzz --disable-shared
make -j;make install;make clean
# Prepare the seed
mkdir input
cp ${CarpetFuzz}/fuzzer/testcases/images/tiff/* input/

FAQ

  1. How to find the manpage file of a new program?

    In our experience, manpage files are typically located in the share directory of the build directory, such as /your_build_dir/share/man/man1; see the command below for a quick way to locate them.
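
    If you are unsure where the manpage ended up, a find over the build directory (our suggestion, with /your_build_dir as a placeholder) usually locates it:

    # Locate installed manpage files under the build directory
    find /your_build_dir -name "*.1" -path "*man*"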

  2. How to know which option combination triggered a crash?

    You can extract the corresponding argv index from the crash filename. For instance, the filename id:000000,sig:07,src:000000,argv:000334,op:argv,pos:0 indicates that the crash was triggered by argv:000334. You can then find the corresponding argv in line 336 (i.e., 334+2) of the ranked_stubs file, as shown below.
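
    For example, reusing the tiffcp stub file from the example above (a minimal sketch):

    # Print the stub for argv:000334; per the answer above, index 334 maps
    # to line 336 (i.e., 334 + 2) of the ranked_stubs file
    sed -n '336p' ${CarpetFuzz}/output/stubs/ranked_stubs_tiffcp.txt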

  3. How to reduce memory consumption when using pict to combine a large number of options?

    When a program has a large number of options (e.g., gm), PICT consumes a significant amount of memory (more than 128GB). In such cases, you can restrict the number of options by sorting all individual options by their dry-run coverage and selecting the top 50 options with the highest coverage for combination. The whole process is handled by the simplify_relation.py script:

    # Restrict the number of options based on their coverage
    python3 ${CarpetFuzz}/scripts/simplify_relation.py --relation ${CarpetFuzz}/output/relation/relation_gm.json --dict ${CarpetFuzz}/tests/dict/dict.json --bindir $PWD/build_carpetfuzz/bin --seeddir input
    

CVEs found by CarpetFuzz

CarpetFuzz has found 56 crashes on our real-world dataset, of which 42 are 0-days. So far, the following 30 CVE IDs have been assigned:

CVE               Program              Type
CVE-2022-0865     tiffcp               assertion failure
CVE-2022-0907     tiffcrop             segmentation violation
CVE-2022-0909     tiffcrop             floating point exception
CVE-2022-0924     tiffcp               heap buffer overflow
CVE-2022-1056     tiffcrop             heap buffer overflow
CVE-2022-1622     tiffcp               segmentation violation
CVE-2022-1623     tiffcp               segmentation violation
CVE-2022-2056     tiffcrop             floating point exception
CVE-2022-2057     tiffcrop             floating point exception
CVE-2022-2058     tiffcrop             floating point exception
CVE-2022-2953     tiffcrop             heap buffer overflow
CVE-2022-3597     tiffcrop             heap buffer overflow
CVE-2022-3598     tiffcrop             heap buffer overflow
CVE-2022-3599     tiffcrop             heap buffer overflow
CVE-2022-3626     tiffcrop             heap buffer overflow
CVE-2022-3627     tiffcrop             heap buffer overflow
CVE-2022-4450     openssl-asn1parse    double free
CVE-2022-4645     tiffcp               heap buffer overflow
CVE-2022-29977    img2sixel            assertion failure
CVE-2022-29978    img2sixel            floating point exception
CVE-2023-0795     tiffcrop             segmentation violation
CVE-2023-0796     tiffcrop             segmentation violation
CVE-2023-0797     tiffcrop             segmentation violation
CVE-2023-0798     tiffcrop             segmentation violation
CVE-2023-0799     tiffcrop             heap use after free
CVE-2023-0800     tiffcrop             heap buffer overflow
CVE-2023-0801     tiffcrop             heap buffer overflow
CVE-2023-0802     tiffcrop             heap buffer overflow
CVE-2023-0803     tiffcrop             heap buffer overflow
CVE-2023-0804     tiffcrop             heap buffer overflow

Credit

Thanks to Ying Li (@Fr3ya) and Zhiyu Zhang (@QGrain) for their valuable contributions to this project.

Citing this paper

In case you would like to cite CarpetFuzz, you may use the following BibTeX entry:

@inproceedings{carpetfuzz,
  title = {CarpetFuzz: Automatic Program Option Constraint Extraction from Documentation for Fuzzing},
  author = {Wang, Dawei and Li, Ying and Zhang, Zhiyu and Chen, Kai},
  booktitle = {Proceedings of the 32nd USENIX Conference on Security Symposium},
  publisher = {USENIX Association},
  address = {Anaheim, CA, USA},
  pages = {},
  year = {2023}
}


carpetfuzz's Issues

No tiffcrop ??

We would like to try tiffcp and tiffcrop, but "tiffcrop.1" is not found.
We downloaded "tiffcrop.1" from Arch Linux (https://archlinux.org/packages/extra/x86_64/libtiff/), but your generate_combination.py cannot handle it well.

It seems to us that the artifacts are not synchronized with the paper you published at USENIX Security'23.

Manpage problem

Hi, after installing libtiff, execute the following command:

# Step 2
# Use CarpetFuzz to analyze the relationships from the manpage file  ( < 10mins )
python3 ${CarpetFuzz}/scripts/find_relationship.py --file $PWD/build_carpetfuzz/share/man/man1/tiffcp.1

The error message "Cannot find /tiffcp.1 file" was displayed. However, I couldn't find the tiffcp.1 file in the installation directory of libtiff (/build_carpetfuzz or /share/man/man1). After copying the /test/manpages/tiffcp.1 file (the ONLY .1 file available) to the corresponding directory, the subsequent execution succeeded.
So, my question is whether you wrote the manpages, such as tiffcp.1, yourself using Groff, or whether I simply haven't found tiffcp.1 in the correct way.

How to fuzz the target not recorded in dict.json

I want to fuzz some targets that are not recorded in CarpetFuzz/tests/dict/dict.json.
Do I have to add dictionary entries to the JSON file manually?
Or can they be added automatically with the NLP model?

Thanks

How to know the argument used by a crash poc?

Hello,

I tried to reproduce a crash that CarpetFuzz found. The PoC filenames in crashes look like "crashes/id:000000,sig:07,src:000000,argv:000334,op:argv,pos:0".

I guessed that the argv used is at around line 334/335/336 of the ranked_stubs file.

I ran the following commands to extract the argv.

$ cat ranked_stubs_xmllint.txt  | head -n 334
$ cat ranked_stubs_xmllint.txt  | head -n 335
$ cat ranked_stubs_xmllint.txt  | head -n 336

I copied and ran the command as shown in the screenshot (omitted here).

But I still cannot reproduce the crash.
May I ask how to reproduce the crashes that were found?

Thanks

Unable to get instrumentation information

Hello! When using this tool, I ran into a problem where no instrumentation information is collected (screenshot omitted).
The fuzzer warns that GCC mode is being used, while on another machine LLVM-PCGUARD is selected and testing works normally (screenshot omitted).
Is this mode chosen automatically by afl++? When I try to specify LLVM-PCGUARD mode explicitly, I get an error (screenshot omitted).
My clang version is shown in the screenshot (omitted).
How can I solve this problem?

how to solve "File is not a zip file"

Hello, when reproducing this work using Docker and running the "Minimal Working Example", I met this problem. Could you tell me the reason for it?
sudo docker pull 4ugustus/carpetfuzz

/usr/local/lib/python3.8/dist-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Traceback (most recent call last):
  File "/root/CarpetFuzz/scripts/find_relationship.py", line 108, in <module>
    implicit_rsent_list = identifyImplicitRSentences(program, opt_desc_dict)
  File "/root/CarpetFuzz/scripts/find_relationship.py", line 43, in identifyImplicitRSentences
    topic_sent_list = nlp_util.extractTopicSentList(program, opt_desc_dict)
  File "/root/CarpetFuzz/scripts/utils/nlp_util.py", line 178, in extractTopicSentList
    predicate, object, prt = self.__getPredAndObj(topic_sentence)
  File "/root/CarpetFuzz/scripts/utils/nlp_util.py", line 700, in __getPredAndObj
    prt = find_prep_result[0] if len(wordnet.synsets("%s_%s" % (predicate, find_prep_result[0]))) > 0 else ""
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 89, in __load
    corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/reader/wordnet.py", line 1176, in __init__
    self.provenances = self.omw_prov()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/reader/wordnet.py", line 1285, in omw_prov
    fileids = self._omw_reader.fileids()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 81, in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/usr/local/lib/python3.8/dist-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/usr/local/lib/python3.8/dist-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/usr/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Too long command lines

Impressive work :)

I had a problem replicating the experiment on some subjects: the generated command lines are too long.

Here is one manpage file example that I had a problem with: gm.1

The generated command line looks like:

64
gm animate -adjoin -enhance -help -cycle 8 -wave 30x30 -sampling-factor 2x1 -random-threshold "Intensity 10x30" -rotate "90>" -box "#1f1f1f" -colorspace HWB -loop 8 -map best -metric RMSE -sample 20x20+5+5 -profile -modulate 120,90 -channel -colormap private -colorize 7,21,50 -strokewidth 10 -flop -immutable -mosaic -set "attr 16" -silent -virtual-pixel Constant -shave 1x1 -preserve-timestamp -size 384x256+64 -shadow 20x1 -extent 20x20+5+5 -solarize -pointsize 8 -chop 20x20+5+5 -linewidth -iconic -mode concatenate -encoding AdobeCustom -output-directory -units PixelsPerInch -iconGeometry 20x20+5+5 -morph -scene -average -trim -colors 8 -update 1 -highlight-style -shade 2x1 -verbose -implode 8 -operator "All Multiply 50%" -screen -noise "10|Uniform" -resize 50x30 -snaps 4 -view viewstring -dispose "Background Overwrite the image area with the background color" -window root -dither -compose Minus -ping -compress Fax -affine -spread 4 -backdrop -foreground "#1f1f1f" -version -write /tmp/foo -matte -tile -font PostScript -filter Hamming -displace 50x50 -ordered-dither "Green 3x3" -red-primary 50,30 -equalize -lat 20x20+5 -level 10,1.0,250 -flatten -limit "Width 1920" -raise 50x30 -gravity Center -median 50 -density 50x50 -segment 0.015x1.5 -pen "#efefef01" -emboss 50 -negate -clip -scale 20x20+5+5 -unsharp 0x1.0+1.0+0.05 -maximum-error -resample 30x30 -transparent blue   @@
gm SUB -motion-blur 50x1 -mask -transform -mattecolor "#efefef01" -scenes 0-12 -text-font -file -borderwidth 50 -delay -noop -opaque "#efefef01" -depth 16 -blur 50 -shear -contrast -blue-primary 10,10 -strip -frame -gamma 2.0 -endian Native -define tiff:group-three-options=4 -pause 1 -display -stereo -thumbnail 38x25 -stegano 2 -gaussian 30 -highlight-color "#1f1f1f" -list Type -window-group -roll 50+30 -title "MIFF:bird.miff 512x480" -label "MIFF:bird.miff 512x480" -black-threshold 50%,50%,50%,50% -name -render -hald-clut 8 -recolor "1 0 0, 0 1 0, 0 0 1" -border 50x50 -asc-cdl 1.0,0.0,1.0:1.0,0.0,1.0:1.0,0.0,1.0:1.0 -debug Error -region 20x20+5+5 -despeckle -flip -white-point 30,30 -preview Shade -background "#1f1f1f" -stroke blue -fuzz 50% -auto-orient -interlace Line -coalesce -convolve 20,20 -charcoal 10 -descend -monochrome -quality 2 -watermark 30x30 -white-threshold 50%,50%,50%,50% -log "%u %m:%l %e" -magnify -affine -draw "circle x0,y0 x1,y1" -crop 20x20+5+5 -dissolve 50 -type Grayscale -comment "MIFF:bird.miff 512x480" -normalize -green-primary 30,40 -texture -swirl 10 -repage 20x20+5+5 -paint 10 -treedepth 5 -deconstruct -format ART -process Convert=1,2,3 -threshold 50% -minify 8 -edge 50 -orient undefined -visual StaticColor -intent Relative -append -monitor -remote -use-pixmap -create-directories -fill "#efefef01" -geometry 20x20+5+5 -shared-memory -authenticate password -antialias   @@
...

I guess this is not the expected behavior, and I would like to know your opinion.
(I also think the number of generated command lines is lower than I expected.)

I changed the combination order given to the pict execution from 6 (the default value) to 4,
because my machine (with 128GB RAM) could not handle high combination orders on manpages with complex command lines (like gm).
Could the decreased value cause the "long command lines" problem?

Segmentation fault with some programs

When testing with CarpetFuzz, some programs crash with a segmentation fault (screenshot omitted).
A quick look with gdb suggests the problem may be at src/afl-fuzz-argv.c:52 (screenshot omitted).
How can this problem be solved?

How to generate a dictionary file

Hi, thanks for this open-source project; I ran it successfully on ffmpeg.
I have a question: in Step 2, the --dict parameter used when ranking each combination covers 49 programs. If the program under test is not included, how can I obtain a dict file for it?

Python package dependency conflict problem

Hello, I have a problem with package conflicts when I use the requirements file to install the Python packages. For example, package A depends on one version range of package B while package C depends on a different version range of package B, and there is no intersection between the two ranges (screenshot omitted).

May I ask what your operating environment is like?
