
sdhash's Introduction

sdhash

sdhash is a tool that allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. It is designed to provide quick results during triage and initial investigation phases. It has been in active development since 2010 with the explicit goal of becoming fast, scalable, and reliable.

There are two general classes of problems where sdhash can provide significant benefits: fragment identification and version correlation.

In fragment identification, we search for a smaller piece of data inside a bigger piece of data (“needle-in-a-haystack”). For example:

Block vs. file correlation: given a chunk of data (disk block, network packet, RAM page, etc.), we can search a reference collection of files to identify whether the chunk came from any of them.

File vs. RAM/disk image: given a file and a target image, we can efficiently determine if any pieces of the file can be found on the image (that includes deallocated storage).

In version correlation, we are interested in correlating data objects (files) that are comparable in size, so that similar ones can be viewed as versions of each other. There are two basic scenarios in which this is useful: identifying related documents and identifying code versions.

In all cases, the tool is used the same way; however, the interpretation of the results may differ based on the circumstances.

Current version info:

sdhash 4.0 by Vassil Roussev, Candice Quates [sdhash.org] 12/2013

Usage: sdhash  
Configuration:
  -r [ --deep ]                   generate SDBFs from directories and files
  -f [ --target-list ]            generate SDBFs from list(s) of filenames
  -c [ --compare ]                compare SDBFs in file, or two SDBF files
  -g [ --gen-compare ]            compare all pairs in source data
  -t [ --threshold ] arg (=1)     only show results >=threshold
  -b [ --block-size ] arg         hashes input files in nKB blocks
  -p [ --threads ] arg            restrict compute threads to N threads
  -s [ --sample-size ] arg (=0)   sample N filters for comparisons
  -z [ --segment-size ] arg       set file segment size, 128MB default
  -o [ --output ] arg             send output to files
  --separator arg (=pipe)         for comparison results: pipe csv tab
  --hash-name arg                 set name of hash on stdin
  --fast                          shrink sdbf filters for speedup
  --large                         create larger (1M content) filters
  --validate                      parse SDBF file to check if it is valid
  --details                       parse SDBF-LG file for contents
  --index                         generate indexes while hashing
  --index-search arg              search directory of reference indexes
  --config-file arg (=sdhash.cfg) use config file
  --verbose                       warnings, debug and progress output
  --version                       show version info
  -h [ --help ]                   produce help message
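
To illustrate how the options above combine, here is a minimal Python sketch that assembles command lines for hashing a file and for comparing two SDBF files. The helper names (`hash_cmd`, `compare_cmd`, `run`) are hypothetical and not part of sdhash, and the sketch assumes an `sdhash` binary is on the PATH. Only flags listed in the help text are used.

```python
import subprocess

SEPARATORS = ("pipe", "csv", "tab")  # values listed for --separator in the help text

def hash_cmd(path, output_name):
    """Command line that hashes one file and sends the output to files (-o)."""
    return ["sdhash", "-o", output_name, path]

def compare_cmd(sdbf_a, sdbf_b, threshold=1, separator="pipe"):
    """Command line that compares two SDBF files (-c), showing results >= threshold."""
    if separator not in SEPARATORS:
        raise ValueError("separator must be one of: " + ", ".join(SEPARATORS))
    return ["sdhash", "-c", sdbf_a, sdbf_b,
            "-t", str(threshold), "--separator", separator]

def run(cmd):
    """Run a command and return its stdout as text (assumes sdhash is installed)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

For example, `run(compare_cmd("a.sdbf", "b.sdbf", threshold=0))` would execute `sdhash -c a.sdbf b.sdbf -t 0 --separator pipe` and return whatever the tool prints.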

Tutorial: http://roussev.net/sdhash/tutorial/sdhash-tutorial.html

Papers/Version history/etc: http://sdhash.org/

sdhash's Issues

Wrong result in malware samples comparison

I'm analyzing some malware samples to check their similarity. I found two samples that score 100 even though they are completely different. Using 010 Editor, I found that the only bytes they share are the PE MS-DOS header and some runs of zero bytes, so their similarity should be close to 0. The two samples are in this zip archive: samples.zip.
THESE ARE REAL WINDOWS MALWARE SAMPLES. DO NOT EXECUTE THEM OUTSIDE A CONTROLLED ENVIRONMENT. The password of the archive is: "infected". What I'd like to understand is whether this is a bug in the sdhash algorithm, a bug in the sdhash implementation, or the expected behavior.

sdhash crashes on Win 8.1 when comparing

I am able to generate hashes with no issues, but the application crashes when attempting to compare these hashes:

Unhandled exception at 0x00007FFFE9738B9C in sdhash.exe: Microsoft C++ exception: 
boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >
at memory location 0x000000000062D438.

Need to recompile blooms.proto

Hi! I've tried to compile this on an Ubuntu 16.04 machine, but it doesn't work unless I recompile the protocol buffer files. The command line I run is

$ protoc blooms.proto --cpp_out=sdbf/

Compilation error

Hello,

I'm trying to compile the code (Linux) but get the following error message:

sdhash-src/../sdbf/sdbf_defines.h:9:25: fatal error: openssl/bio.h: No such file or directory
compilation terminated.
make: *** [sdhash-src/sdhash.o] Error 1

It seems the bio.h is missing. Any idea of how to solve this issue?

Thanks!
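
The error means the compiler could not find the OpenSSL development headers; the Ubuntu build notes further down this page install them through the libssl-dev package. As a quick way to check whether a header is present before building, here is a minimal Python sketch. The function name `find_header` and the default include directories are assumptions for illustration; a real compiler may search additional paths.

```python
import os

# Assumed default search locations; the compiler's real search path may differ.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")

def find_header(relative_path, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first existing path for a header such as 'openssl/bio.h', or None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling `find_header("openssl/bio.h")` returns the header's location if it is in one of the listed directories, and `None` otherwise.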

No compare value outputted

$ ./sdhash JdbcTemplate.java > jdbc
$ ./sdhash DataSourceUtils.java > ds
$ ./sdhash -c jdbc ds -t 0

This generates no output. There are valid hashes in jdbc and ds but comparing them generates no output whatsoever. I've tried every combination, several different types and sizes of files and I get no output. If I run:

$ ./sdhash JdbcTemplate.java

It prints the hash but refuses to compare them.

I'm on Windows 7 and have tried both the precompiled 32-bit and 64-bit versions.

Building on fairly modern Ubuntu (18.04 LTS) with process and script

With the supplied install instructions, I was unable to build sdhash.

Here is a series of instructions that work on a fairly modern Ubuntu build:

Install dependencies

sudo apt install libssl-dev libevent-pthreads-2.1-6 libomp-dev g++ unzip

Head over to https://github.com/protocolbuffers/protobuf/releases/tag/v2.5.0 and download the protobuf-2.5.0.zip file. Extract the file and enter the protobuf-2.5.0 folder.

Install protobuf's dependencies:

sudo apt-get install autoconf automake libtool curl make g++

Once these have been installed, you can install protobuf.

  1. Make sure that you are still in the protobuf folder
    
  2. ./configure
    
  3. make
    
  4. sudo make install
    

Excellent! Now you have protobuf set up and you are ready to install sdhash.

  1. Exit the protobuf-2.5.0 folder: cd ../
    
  2. Clone sdhash repo: git clone https://github.com/sdhash/sdhash.git
    
  3. Enter sdhash folder: cd sdhash
    
  4. Compile: make
    
  5. Install: make install
    

Final step: run 'ldconfig' to fix the "error while loading shared library" problem.

You can also just execute this script:

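# Run as root, or prefix the apt-get, make install, and ldconfig lines with sudo.
# Build dependencies for sdhash and protobuf
apt-get update
apt-get -y install libssl-dev libevent-pthreads-2.1-6 libomp-dev g++
apt-get -y install autoconf automake libtool curl make unzip
# Build and install protobuf 2.5.0
wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.zip
unzip protobuf-2.5.0.zip
cd protobuf-2.5.0
./configure
make
make install
cd ..
# Build and install sdhash
git clone https://github.com/sdhash/sdhash.git
cd sdhash
make
make install
# Refresh the shared library cache so the newly installed libraries are found
ldconfig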

This was tested on a brand-new DigitalOcean 18.04.3 droplet.

make swig-py unable to locate boost package

Hi all.

First of all, let me thank the developers for going through the effort of setting up Python wrapping functionality for the tool. It is really helpful to be able to access sdhash functionality from a different framework.

I have encountered a problem when compiling the Python library with the default swig-py hook in the makefile, as the compilation fails with:
swig/python/../../sdbf/sdbf_set.h:7:35: fatal error: boost/thread/thread.hpp: No such file or directory
#include <boost/thread/thread.hpp>

As it appears the compiler is unable to locate the boost files, we solved the problem by explicitly adding the directory with -Iexternal -Lexternal/stage/lib

So in the makefile we changed the swig/python/sdbf_wrap.o: rule to

swig/python/sdbf_wrap.o: sdbf.i $(LIBSDBF)
	swig -c++ -python swig/python/sdbf.i
	g++ -std=c++0x -fPIC -c swig/python/sdbf_wrap.cxx -o swig/python/sdbf_wrap.o -I/usr/include/python2.7 -Iexternal -Lexternal/stage/lib

That worked for us, and we are now able to access the sdhash library functions after moving the .so file into the standard Python library folder.

I am not sure whether this problem is specific to our machine or setup, but I wanted to post our solution in case other people encounter the same issue.

Python3 compatibility

Hi,
Thank you for the implementation!
I have tried to compile it with Python3 and I have managed to overcome the string class problems by doing the following.

  1. Change the Makefile in the root dir to point to python3.
  2. Add #define SWIG_PYTHON_STRICT_BYTE_CHAR in sdbf.i file located in swig/python dir.
    This tells swig to accept and return only byte chars.

Example code from python3:

#!/usr/bin/python3
# Import our module, living in the same directory as _sdbf_class.so
import sdbf_class

# Name a few standalone objects to hash
name = bytes('sdbf_class.py','utf-8')
name2 = bytes( 'sdbf_class.pyc','utf-8')

# Create new objects from these names, in "regular" non-block mode.
test1 = sdbf_class.sdbf(name,0)
test2 = sdbf_class.sdbf(name2,0)

print(test1.name().decode('utf-8'))
print(test1.to_string().decode('utf-8'))
print(test2.to_string().decode('utf-8'))

I hope this is useful.

Result Delimiter Conflict

sdbf result files use ':' as a delimiter, so sdhash crashes if the filepath stored in the result contains the character ':'.

sdhash.exe -o bug_result "e:\Received\usbtreeview.zip"
sdhash.exe --validate "bug_result.sdbf"

Fix: Change the delimiter used by sdhash or sanitize the delimiter from the filepath.
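
The second option, sanitizing the delimiter out of the filepath, could look like the following minimal Python sketch. The function name `sanitize_name` and the choice of `_` as the replacement character are assumptions for illustration, not something sdhash provides, and how the sanitized name is handed to sdhash is left open here.

```python
def sanitize_name(path, delimiter=":", replacement="_"):
    """Replace the field delimiter in a path so it cannot be mistaken for a separator."""
    if delimiter in replacement:
        raise ValueError("replacement must not contain the delimiter")
    return path.replace(delimiter, replacement)
```

With the path from the report, `sanitize_name(r"e:\Received\usbtreeview.zip")` yields a name with the drive-letter colon replaced by an underscore.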

Errors while compiling on ARM

Hello,
I'm getting some errors while compiling on ARM. The first one is related to -msse4.2, which I was able to solve.
The second one is the problematic one:
g++ -fPIC -fopenmp -fno-strict-aliasing -D_FILE_OFFSET_BITS=64 -D_LARGE_FILE_API -D_BSD_SOURCE -I./external -c sdbf/bf_utils.cc -o sdbf/bf_utils.o
sdbf/bf_utils.cc:12:23: fatal error: smmintrin.h: No such file or directory
#include <smmintrin.h>
^
compilation terminated.

Can anyone please help?
Thanks!

Comparison shows no results

Just got sdhash installed. It generates SDBF files, but when I compare two files, nothing is returned at all:

remnux@remnux:~/Desktop$ sdhash -o nwiz.sdbf Nwiz.dll
remnux@remnux:~/Desktop$ sdhash -c nwiz.sdbf cerber.sdbf
remnux@remnux:~/Desktop$

Have I installed something improperly, or am I using the command line options improperly?

error during the compare instruction

I downloaded the sdhash code (version 4.0) and compiled it on Ubuntu 12.04.

I installed all the requirements, and it compiled successfully.

Generating signatures works correctly, but when I run the compare command

sdhash -c a.sdbf b.sdbf

it crashes with the following error:
"Illegal instruction (core dumped)"

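The report does not identify the cause. One thing worth ruling out, purely as an assumption, is a CPU that lacks SSE4.2 support, since another report on this page mentions the build using -msse4.2. A minimal Python sketch for checking the CPU flags on Linux follows; the function names are hypothetical, and the check only covers /proc/cpuinfo-style text.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from /proc/cpuinfo-style text into a set."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags

def supports_sse42(cpuinfo_text):
    """True if the 'sse4_2' flag is listed (the name Linux uses for SSE4.2 on x86)."""
    return "sse4_2" in cpu_flags(cpuinfo_text)
```

On a Linux machine this can be called as `supports_sse42(open("/proc/cpuinfo").read())`.
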
sdhash score for same files comes out to be zero

We have a directory containing a large number of system files (a Linux filesystem). We created two different folders with the same files in them. When we ran sdhash recursively to compare the same files across the two folders, the sdhash score came out as 000. This occurred with a large number of files whose size is > 512 bytes and < 1024 bytes. We also saw the same score (000) with some files whose size is > 1024 bytes. We are not sure why sdhash gives a score of 000 to two copies of the same file, where even the hashes are exactly the same. Can you please let us know where we might be going wrong, or whether this is a limitation?

Example of one such file:
sample shadowconfig file

ubuntu@common-machine:~$ ls -lrt
-rw-rw-r-- 1 ubuntu ubuntu 885 Feb 20 08:33 shadowconfig
-rw-rw-r-- 1 ubuntu ubuntu 397 Feb 25 06:20 test1.sdbf
-rw-rw-r-- 1 ubuntu ubuntu 397 Feb 25 06:20 test2.sdbf
ubuntu@common-machine:~$ sdhash -c test1.sdbf test2.sdbf -t 0
shadowconfig|shadowconfig|000
ubuntu@common-machine:~$
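
When checking many such results, it can help to parse the comparison output programmatically. Below is a minimal Python sketch that splits a pipe-separated result line like the one above into its two names and an integer score. The function name `parse_result` is hypothetical, and the sketch assumes the score is always the last field and that the first name does not itself contain the separator.

```python
def parse_result(line, sep="|"):
    """Split 'name_a|name_b|score' into (name_a, name_b, score as int)."""
    # Take the score from the right so a separator inside the second name
    # does not shift the fields.
    rest, score = line.rstrip("\n").rsplit(sep, 1)
    name_a, name_b = rest.split(sep, 1)
    return name_a, name_b, int(score)
```

For the line shown above, `parse_result("shadowconfig|shadowconfig|000")` gives the two names and a score of 0.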
