CarpetFuzz


CarpetFuzz is an NLP-based fuzzing assistance tool for generating valid option combinations.

The basic idea of CarpetFuzz is to use natural language processing (NLP) to identify and extract relationships (e.g., conflicts or dependencies) among program options from each option's description in the documentation, and then to filter out invalid combinations so that fewer option combinations need to be fuzzed.
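
As a rough, purely illustrative example (not CarpetFuzz's actual matching logic), option descriptions in manpages often state such relationships in plain English, e.g. "This option can only be used with ..." (a dependency) or "... cannot be used together with ..." (a conflict). A simple grep over one of the sample manpages shipped in this repository gives a feel for the kind of sentences the NLP pipeline analyzes (the path is assumed from the tests/ directory listed below):

# Illustration only -- CarpetFuzz parses these sentences with NLP, not grep
grep -inE "only be used|cannot be used|used together with|requires" tests/manpages/tiffcp.1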

For more details, please refer to our paper from USENIX Security'23.

The CarpetFuzz-experiments repository contains the data sets, scripts, and documentation required to reproduce our results in the paper.

Prerequisites

Any mainstream computer should be sufficient to run CarpetFuzz; a machine with a 1-core CPU, 8 GB of RAM, and a 128 GB hard drive is enough.

Structure

Directory         Description
dataset           Training dataset used to obtain the model
fuzzer            Modified fuzzer that switches option combinations on the fly (submodule)
images            Images used in README.md
models            Models used to extract relationships
output            CarpetFuzz's output files
pict              Microsoft's pairwise tool (submodule)
scripts           Python scripts that identify and extract relationships and rank combinations by their dry-run coverage
scripts/utils     General-purpose utility classes
tests             Sample files for testing CarpetFuzz
tests/dict        Sample dictionary file used to generate stubs (covering 49 programs)
tests/manpages    Sample manpage files

We have structured our code carefully and provided extensive comments to aid comprehension. The implementations of CarpetFuzz's components can be found in the following functions:

Section   Component                  File                              Function
3.2       EDR Identification         scripts/find_relationship.py      identifyExplicitRSentences
3.3       IDR Identification         scripts/find_relationship.py      identifyImplicitRSentences
3.4       Relationship Extraction    scripts/find_relationship.py      extractRelationships
3.5       Combination                scripts/generate_combination.py   main
3.5       Prioritization             scripts/rank_combination.py       main

Supported Environments

We recommend running CarpetFuzz on Linux. We have tested it on the following operating systems:

  • Ubuntu 18.04
  • Ubuntu 20.04

While our testing has primarily focused on these operating systems, it may also work on other Linux distributions. Ensure that your system meets the following requirements:

  • Linux operating system (Ubuntu 18.04 or 20.04 is recommended)
  • Python 3.6 or higher
  • LLVM 12.0.0 or higher
  • Required dependencies (detailed instructions will be provided during the installation process)

Please note that CarpetFuzz may not work, or may encounter compatibility issues, on non-Linux systems. We recommend running CarpetFuzz on a supported Linux distribution for the best experience and performance.
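
Before installing, you can quickly check the Python and LLVM requirements (a suggestion on our part; adjust the binary names if your distribution ships versioned ones such as clang-12 or llvm-config-12):

# Quick environment check (binary names may differ per distribution)
python3 --version       # should report 3.6 or higher
clang --version         # should report LLVM/clang 12.0.0 or higher
llvm-config --version   # alternative way to query the LLVM version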

Installation

For easy installation, we offer a ready-to-use Docker image:

sudo docker pull 4ugustus/carpetfuzz 

or you can compile the image yourself using the Dockerfile we provide.

# Download CarpetFuzz repo with the submodules
git clone --recursive https://github.com/waugustus/CarpetFuzz
cd CarpetFuzz
# Build image
sudo docker build -t 4ugustus/carpetfuzz:latest .

Alternatively, you can build CarpetFuzz yourself:

# Download CarpetFuzz repo with the submodules
git clone --recursive https://github.com/waugustus/CarpetFuzz
cd CarpetFuzz

# Build CarpetFuzz-fuzzer (LLVM 11.0+ is recommended)
pushd fuzzer
make clean all
popd

# Build Microsoft pict
pushd pict
cmake -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
pushd build && ctest -v && popd
popd

# Install required pip modules (virtualenv is recommended)
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 -m spacy download en_core_web_sm-3.0.0 --direct
echo -e "import nltk\nnltk.download('averaged_perceptron_tagger')\nnltk.download('omw-1.4')\nnltk.download('punkt')\nnltk.download('wordnet')"|python3

# Download AllenNLP's parser model
wget -P models/ https://allennlp.s3.amazonaws.com/models/elmo-constituency-parser-2020.02.10.tar.gz
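
As an optional sanity check (our suggestion, not part of the official steps), you can verify that the spaCy model, the NLTK corpora, and the downloaded AllenNLP parser archive are in place before running the scripts:

# Optional sanity check; run inside the virtualenv created above
python3 -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
python3 -c "from nltk.corpus import wordnet; wordnet.synsets('option'); print('NLTK corpora OK')"
ls -lh models/elmo-constituency-parser-2020.02.10.tar.gz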

Usage (Minimal Working Example)

We use the program tiffcp from the paper as an example:

# Step 1 ( < 5mins )
# Create container
sudo docker run -it 4ugustus/carpetfuzz bash
# Libtiff has already been built
cd /root/programs/libtiff

# Step 2
# Use CarpetFuzz to analyze the relationships from the manpage file  ( < 10mins )
python3 ${CarpetFuzz}/scripts/find_relationship.py --file $PWD/build_carpetfuzz/share/man/man1/tiffcp.1
# Based on the relationship, use pict to generate 6-wise combinations  ( depends on #OPT )
python3 ${CarpetFuzz}/scripts/generate_combination.py --relation ${CarpetFuzz}/output/relation/relation_tiffcp.json
# Rank each combination with its dry-run coverage ( < 10mins )
python3 ${CarpetFuzz}/scripts/rank_combination.py --combination ${CarpetFuzz}/output/combination/combination_tiffcp.txt --dict ${CarpetFuzz}/tests/dict/dict.json --bindir $PWD/build_carpetfuzz/bin --seeddir input

# Step 3
# Fuzz with the ranked stubs
${CarpetFuzz}/fuzzer/afl-fuzz -i input/ -o output/ -K ${CarpetFuzz}/output/stubs/ranked_stubs_tiffcp.txt -- $PWD/build_carpetfuzz/bin/tiffcp @@
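
Once the fuzzer has run for a while, crashes can be inspected as with other AFL-style fuzzers. The exact output layout depends on the fuzzer version, so treat the paths below as an assumption:

# Inspect the crashes found so far (path is an assumption; depending on the
# fuzzer version it may be output/crashes/ or output/default/crashes/)
ls output/crashes/ 2>/dev/null || ls output/default/crashes/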

If you built CarpetFuzz yourself, replace Step 1 with the following:

(Note: starting with libtiff v4.5.0, manpages are no longer generated during compilation. We are not aware of the developers' reasons for this change. If you need these manpage files, the easiest workaround is to revert to an earlier version.)

(Update: you can run sphinx-build -b man source/rst/dir build/man/dir to generate the manpages for newer versions of libtiff. Thanks to @Mist1987 for providing this method.)

# Step 1 (without docker)
# Set the environment
export CarpetFuzz=/path/to/CarpetFuzz
# Download and build the tiffcp repo with CarpetFuzz-fuzzer
git clone https://gitlab.com/libtiff/libtiff
cd libtiff
git reset --hard b51bb
sh ./autogen.sh
CC=${CarpetFuzz}/fuzzer/afl-clang-fast CXX=${CarpetFuzz}/fuzzer/afl-clang-fast++ ./configure --prefix=$PWD/build_carpetfuzz --disable-shared
make -j;make install;make clean
# Prepare the seed
mkdir input
cp ${CarpetFuzz}/fuzzer/testcases/images/tiff/* input/

FAQ

  1. How to find the manpage file of a new program?

    In our experience, manpage files are typically located in the share directory of the build directory, such as /your_build_dir/share/man/man1; see the command below for a quick way to locate them.
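
    If you are unsure where the manpage ended up, a find over the build directory (our suggestion, with /your_build_dir as a placeholder) usually locates it:

    # Locate installed manpage files under the build directory
    find /your_build_dir -name "*.1" -path "*man*"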

  2. How to know which option combination triggered a crash?

    You can extract the corresponding argv index from the crash filename. For instance, the filename id:000000,sig:07,src:000000,argv:000334,op:argv,pos:0 indicates that the crash was triggered by argv:000334. You can then find the corresponding argv in line 336 (i.e., 334+2) of the ranked_stubs file, as shown below.
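
    For example, reusing the tiffcp stub file from the example above (a minimal sketch):

    # Print the stub for argv:000334; per the answer above, index 334 maps
    # to line 336 (i.e., 334 + 2) of the ranked_stubs file
    sed -n '336p' ${CarpetFuzz}/output/stubs/ranked_stubs_tiffcp.txt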

  3. How to reduce memory consumption when using pict to combine a large number of options?

    When a program has a large number of options (e.g., gm), PICT consumes a significant amount of memory (more than 128GB). In such cases, you can restrict the number of options by sorting all individual options by their dry-run coverage and selecting the top 50 options with the highest coverage for combination. The whole process is handled by the simplify_relation.py script:

    # Restrict the number of options based on their coverage
    python3 ${CarpetFuzz}/scripts/simplify_relation.py --relation ${CarpetFuzz}/output/relation/relation_gm.json --dict ${CarpetFuzz}/tests/dict/dict.json --bindir $PWD/build_carpetfuzz/bin --seeddir input
    

CVEs found by CarpetFuzz

CarpetFuzz has found 56 crashes on our real-world dataset, of which 42 are 0-days. So far, the following 30 CVE IDs have been assigned:

CVE               Program              Type
CVE-2022-0865     tiffcp               assertion failure
CVE-2022-0907     tiffcrop             segmentation violation
CVE-2022-0909     tiffcrop             floating point exception
CVE-2022-0924     tiffcp               heap buffer overflow
CVE-2022-1056     tiffcrop             heap buffer overflow
CVE-2022-1622     tiffcp               segmentation violation
CVE-2022-1623     tiffcp               segmentation violation
CVE-2022-2056     tiffcrop             floating point exception
CVE-2022-2057     tiffcrop             floating point exception
CVE-2022-2058     tiffcrop             floating point exception
CVE-2022-2953     tiffcrop             heap buffer overflow
CVE-2022-3597     tiffcrop             heap buffer overflow
CVE-2022-3598     tiffcrop             heap buffer overflow
CVE-2022-3599     tiffcrop             heap buffer overflow
CVE-2022-3626     tiffcrop             heap buffer overflow
CVE-2022-3627     tiffcrop             heap buffer overflow
CVE-2022-4450     openssl-asn1parse    double free
CVE-2022-4645     tiffcp               heap buffer overflow
CVE-2022-29977    img2sixel            assertion failure
CVE-2022-29978    img2sixel            floating point exception
CVE-2023-0795     tiffcrop             segmentation violation
CVE-2023-0796     tiffcrop             segmentation violation
CVE-2023-0797     tiffcrop             segmentation violation
CVE-2023-0798     tiffcrop             segmentation violation
CVE-2023-0799     tiffcrop             heap use after free
CVE-2023-0800     tiffcrop             heap buffer overflow
CVE-2023-0801     tiffcrop             heap buffer overflow
CVE-2023-0802     tiffcrop             heap buffer overflow
CVE-2023-0803     tiffcrop             heap buffer overflow
CVE-2023-0804     tiffcrop             heap buffer overflow

Credit

Thanks to Ying Li (@Fr3ya) and Zhiyu Zhang (@QGrain) for their valuable contributions to this project.

Citing this paper

In case you would like to cite CarpetFuzz, you may use the following BibTeX entry:

@inproceedings{carpetfuzz,
  title = {CarpetFuzz: Automatic Program Option Constraint Extraction from Documentation for Fuzzing},
  author = {Wang, Dawei and Li, Ying and Zhang, Zhiyu and Chen, Kai},
  booktitle = {Proceedings of the 32nd USENIX Conference on Security Symposium},
  publisher = {USENIX Association},
  address = {Anaheim, CA, USA},
  pages = {},
  year = {2023}
}


carpetfuzz's Issues

No tiffcrop ??

We would like to try tiffcp and tiffcrop, but "tiffcrop.1" is not found.
We downloaded "tiffcrop.1" from Arch Linux (https://archlinux.org/packages/extra/x86_64/libtiff/), but your generate_combination.py cannot handle it well.

It seems to us that the artifacts are not synchronized with the paper you published at USENIX Security'23.

Manpage problem

Hi, after installing libtiff, execute the following command:

# Step 2
# Use CarpetFuzz to analyze the relationships from the manpage file  ( < 10mins )
python3 ${CarpetFuzz}/scripts/find_relationship.py --file $PWD/build_carpetfuzz/share/man/man1/tiffcp.1

The error message "Cannot find /tiffcp.1 file" was displayed. However, I couldn't find the tiffcp.1 file in the installation directory of libtiff (/build_carpetfuzz or /share/man/man1). After copying the /test/manpages/tiffcp.1 file (the ONLY .1 file available) to the corresponding directory, the subsequent execution succeeded.
So, my question is whether you wrote the manpages, such as tiffcp.1, yourself using Groff, or whether I simply haven't found tiffcp.1 in the correct way.

How to fuzz the target not recorded in dict.json

I want to fuzz some targets that are not recorded in CarpetFuzz/tests/dict/dict.json.
Do I have to add dictionary entries to the JSON file manually?
Or can they be added automatically with the NLP model?

Thanks

How to know the argument used by a crash poc?

Hello,

I tried to reproduce a crash that CarpetFuzz found. The PoC filenames in crashes look like "crashes/id:000000,sig:07,src:000000,argv:000334,op:argv,pos:0".

I guessed that the argv used is at around line 334/335/336 of the ranked_stubs file.

I ran the following commands to extract the argv.

$ cat ranked_stubs_xmllint.txt  | head -n 334
$ cat ranked_stubs_xmllint.txt  | head -n 335
$ cat ranked_stubs_xmllint.txt  | head -n 336

I copied and ran the command as shown in the screenshot (omitted here).

But I still cannot reproduce the crash.
May I ask how to reproduce the crashes that were found?

Thanks

Unable to get instrumentation information

Hello! When using this tool, I ran into a problem where no instrumentation information is collected (screenshot omitted).
The fuzzer warns that GCC mode is being used, while on another machine LLVM-PCGUARD is selected and testing works normally (screenshot omitted).
Is this mode chosen automatically by afl++? When I try to specify LLVM-PCGUARD mode explicitly, I get an error (screenshot omitted).
My clang version is shown in the screenshot (omitted).
How can I solve this problem?

how to solve "File is not a zip file"

Hello, when reproducing this work using Docker and running the "Minimal Working Example", I met this problem. Could you tell me the reason for it?
sudo docker pull 4ugustus/carpetfuzz

/usr/local/lib/python3.8/dist-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Traceback (most recent call last):
  File "/root/CarpetFuzz/scripts/find_relationship.py", line 108, in <module>
    implicit_rsent_list = identifyImplicitRSentences(program, opt_desc_dict)
  File "/root/CarpetFuzz/scripts/find_relationship.py", line 43, in identifyImplicitRSentences
    topic_sent_list = nlp_util.extractTopicSentList(program, opt_desc_dict)
  File "/root/CarpetFuzz/scripts/utils/nlp_util.py", line 178, in extractTopicSentList
    predicate, object, prt = self.__getPredAndObj(topic_sentence)
  File "/root/CarpetFuzz/scripts/utils/nlp_util.py", line 700, in __getPredAndObj
    prt = find_prep_result[0] if len(wordnet.synsets("%s_%s" % (predicate, find_prep_result[0]))) > 0 else ""
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 89, in __load
    corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/reader/wordnet.py", line 1176, in __init__
    self.provenances = self.omw_prov()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/reader/wordnet.py", line 1285, in omw_prov
    fileids = self._omw_reader.fileids()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/usr/local/lib/python3.8/dist-packages/nltk/corpus/util.py", line 81, in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/usr/local/lib/python3.8/dist-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/usr/local/lib/python3.8/dist-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/usr/lib/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Too long command lines

Impressive work :)

I had a problem replicating the experiment on some subjects: the generated command lines are too long.

Here is one manpage file example that I had a problem with: gm.1

The generated command line looks like:

64
gm animate -adjoin -enhance -help -cycle 8 -wave 30x30 -sampling-factor 2x1 -random-threshold "Intensity 10x30" -rotate "90>" -box "#1f1f1f" -colorspace HWB -loop 8 -map best -metric RMSE -sample 20x20+5+5 -profile -modulate 120,90 -channel -colormap private -colorize 7,21,50 -strokewidth 10 -flop -immutable -mosaic -set "attr 16" -silent -virtual-pixel Constant -shave 1x1 -preserve-timestamp -size 384x256+64 -shadow 20x1 -extent 20x20+5+5 -solarize -pointsize 8 -chop 20x20+5+5 -linewidth -iconic -mode concatenate -encoding AdobeCustom -output-directory -units PixelsPerInch -iconGeometry 20x20+5+5 -morph -scene -average -trim -colors 8 -update 1 -highlight-style -shade 2x1 -verbose -implode 8 -operator "All Multiply 50%" -screen -noise "10|Uniform" -resize 50x30 -snaps 4 -view viewstring -dispose "Background Overwrite the image area with the background color" -window root -dither -compose Minus -ping -compress Fax -affine -spread 4 -backdrop -foreground "#1f1f1f" -version -write /tmp/foo -matte -tile -font PostScript -filter Hamming -displace 50x50 -ordered-dither "Green 3x3" -red-primary 50,30 -equalize -lat 20x20+5 -level 10,1.0,250 -flatten -limit "Width 1920" -raise 50x30 -gravity Center -median 50 -density 50x50 -segment 0.015x1.5 -pen "#efefef01" -emboss 50 -negate -clip -scale 20x20+5+5 -unsharp 0x1.0+1.0+0.05 -maximum-error -resample 30x30 -transparent blue   @@
gm SUB -motion-blur 50x1 -mask -transform -mattecolor "#efefef01" -scenes 0-12 -text-font -file -borderwidth 50 -delay -noop -opaque "#efefef01" -depth 16 -blur 50 -shear -contrast -blue-primary 10,10 -strip -frame -gamma 2.0 -endian Native -define tiff:group-three-options=4 -pause 1 -display -stereo -thumbnail 38x25 -stegano 2 -gaussian 30 -highlight-color "#1f1f1f" -list Type -window-group -roll 50+30 -title "MIFF:bird.miff 512x480" -label "MIFF:bird.miff 512x480" -black-threshold 50%,50%,50%,50% -name -render -hald-clut 8 -recolor "1 0 0, 0 1 0, 0 0 1" -border 50x50 -asc-cdl 1.0,0.0,1.0:1.0,0.0,1.0:1.0,0.0,1.0:1.0 -debug Error -region 20x20+5+5 -despeckle -flip -white-point 30,30 -preview Shade -background "#1f1f1f" -stroke blue -fuzz 50% -auto-orient -interlace Line -coalesce -convolve 20,20 -charcoal 10 -descend -monochrome -quality 2 -watermark 30x30 -white-threshold 50%,50%,50%,50% -log "%u %m:%l %e" -magnify -affine -draw "circle x0,y0 x1,y1" -crop 20x20+5+5 -dissolve 50 -type Grayscale -comment "MIFF:bird.miff 512x480" -normalize -green-primary 30,40 -texture -swirl 10 -repage 20x20+5+5 -paint 10 -treedepth 5 -deconstruct -format ART -process Convert=1,2,3 -threshold 50% -minify 8 -edge 50 -orient undefined -visual StaticColor -intent Relative -append -monitor -remote -use-pixmap -create-directories -fill "#efefef01" -geometry 20x20+5+5 -shared-memory -authenticate password -antialias   @@
...

I guess this is not the expected behavior, and I would like to know your opinion.
(I also think the number of generated command lines is lower than I expected.)

I changed the combination order given to the pict execution from 6 (the default value) to 4,
because my machine (with 128GB RAM) could not handle high combination orders on manpages with complex command lines (like gm).
Could the decreased value cause the "long command lines" problem?

Segmentation fault with some programs

When testing with CarpetFuzz, some programs crash with a segmentation fault (screenshot omitted).
A quick look with gdb suggests the problem may be at src/afl-fuzz-argv.c:52 (screenshot omitted).
How can this problem be solved?

How to generate a dictionary file

Hi, thanks for this open-source project; I ran it successfully on ffmpeg.
I have a question: in Step 2, the --dict parameter used when ranking each combination covers 49 programs. If the program under test is not included, how can I obtain a dict file for it?

Python package dependency conflict problem

Hello, I have a problem with package conflicts when I use the requirements file to install the Python packages. For example, package A depends on one version range of package B while package C depends on a different version range of package B, and there is no intersection between the two ranges (screenshot omitted).

May I ask what your operating environment is like?
