Giter VIP home page Giter VIP logo

fleur's Introduction

Fleur - A bloom filter implementation in C

Fleur implements a Bloom Filter library that is fully compatible with DCSO's Go and python implementations.

Requirements

Fleur has only been tested under Ubuntu 20.04

sudo apt-get install cmake ninja-build gcc

Compilation

git clone [email protected]:hashlookup/fleur.git
git submodule update --init --recursive
cmake -GNinja -DTARGET_GROUP=production . 
ninja -v

Running tests

cmake -GNinja -DTARGET_GROUP=test . 
ninja -v

Each suite test's executable are then located under ./test/suite_n.

LibFleur usage

struct BloomFilter fleur_initialize(uint64_t n, double p, char *buf);
struct BloomFilter fleur_bloom_filter_from_file(FILE* f);

int fleur_add(BloomFilter * bf, char *buf, size_t buf_size);
int fleur_check(BloomFilter * bf, char *buf, size_t buf_size);
int fleur_join(BloomFilter * src, BloomFilter* dst);
int fleur_bloom_filter_to_file(BloomFilter * bf, FILE* of);
void fleur_set_data(BloomFilter * bf, char* buf, size_t buf_size );
void fleur_fingerprint(BloomFilter * bf, char *buf, size_t buf_size, uint64_t **fingerprint);

void fleur_destroy_filter(BloomFilter * bf);

void fleur_print_header(header * h);
void fleur_print_filter(BloomFilter * bf);
int fleur_check_header(header * h);

Fleur command line tool usage

NAME:
   Fleurcli - Utility to work with bloom filters

USAGE:
   fleurcli [-c] command [command options] [arguments...] bloomfilter.file

VERSION:
   0.1

COMMANDS:
     create         Create a new Bloom filter and store it in the given filename.
     insert         Inserts new values into an existing Bloom filter.
     check          Checks values against an existing Bloom filter.
     set-data       Sets the data associated with the Bloom filter.
     get-data       Prints the data associated with the Bloom filter.
     show           Shows various details about a given Bloom filter.

Example interactions:

$./fleurcli -c create -p 0.0001 -n 100000 mytest.bloom
$bloom show mytest.bloom
File:				/home/jlouis/Git/fleur/fleurcli/mytest.bloom
Capacity:			100000
Elements present:	0
FP probability:		1.00e-04
Bits:				1917011
Hash functions:		14
$echo toto | ./fleurcli -c insert mytest.bloom 
$echo titi | ./fleurcli -c insert mytest.bloom 
$echo babar | ./fleurcli -c insert mytest.bloom
$echo tutu | ./fleurcli -c check mytest.bloom
$echo titi | ./fleurcli -c check mytest.bloom
titi
$echo titi | bloom check mytest.bloom 
titi
$./fleurcli -c show mytest.bloom 
Filter details:
 n: 100000 
 p: 0.000100
 k: 14 
 m: 1917011 
 N: 3 
 M: 29954
 Data: (null).%

Performances

Querying on 2438 sha1 file hashes against ~800MB hashlookup filter, finding for both implementations 2176 known files:

/usr/bin/time -v  bash -c "cat test.txt | bloom check ../hashlookup-full.bloom"
...
	Command being timed: "bash -c cat test.txt | bloom check ../hashlookup-full.bloom"
	User time (seconds): 1.23
	System time (seconds): 0.24
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.47
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 859620
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 8449
	Voluntary context switches: 568
	Involuntary context switches: 15
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
/usr/bin/time -v  bash -c "cat test.txt | ./fleurcli -c check ../hashlookup-full.bloom"
...
	Command being timed: "bash -c cat test.txt | ./fleurcli -c check ../hashlookup-full.bloom"
	User time (seconds): 0.01
	System time (seconds): 0.33
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.35
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 830524
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 207736
	Voluntary context switches: 15
	Involuntary context switches: 30
	Swaps: 0
	File system inputs: 8
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Acknowledgment

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.

fleur's People

Contributors

adulau avatar gallypette avatar satta avatar wllm-rbnt avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

fleur's Issues

fleur segfaults on insertion using corrupted bloomfilter files

fleur segfauts on insertion with the following corrupted (gzipped) bloomfilter files:
crash0_insert.bin.gz
crash1_insert.bin.gz

The command used to reproduce the crashes is echo bla | ./fleurcli -c insert crash0_insert.bin

Valgind output:

[...]
==1424957== Invalid read of size 8
==1424957==    at 0x403D94: Add (fleur.c:116)
==1424957==    by 0x40221E: main (fleurcli.c:147)
==1424957==  Address 0x233413bc23a9218 is not stack'd, malloc'd or (recently) free'd
[...]

Implement join

fleur should be able to merge two bloom filter of same dimension following bloom's description:

// Join adds the items of another Bloom filter with identical dimensions to
// the receiver. That is, all elements that are described in the
// second filter will also described by the receiver, and the number of elements
// of the receiver will grow by the number of elements in the added filter.
// Note that it is implicitly assumed that both filters are disjoint! Otherwise
// the number of elements in the joined filter must _only_ be considered an
// upper bound and not an exact value!
// Joining two differently dimensioned filters may yield unexpected results and
// hence is not allowed. An error will be returned in this case, and the
// receiver will be left unaltered.

don't calloc in each fingerprint

Could be faster still if it didn't calloc/free a "Fingerprint" buffer each time you add or query an entry. Since its individual BloomFilters are not thread-safe anyway, it could allocate the buffer once at BloomFilter creation time and be done with it.

Heap allocation weirdnesses

  • allocate only once,
  • no static

The usage of static variables makes the BloomFilter code unnecessarily dangerous to use when multiple threads are used even if every thread just wants to allocate and use totally separate bloom filters. Actually it could even cause issues with two bloom filters created after each other since even Initialize just returns a pointer to a single static BloomFilter :o I can't come up with a reason why you'd consider making anything in there static.

Strip out FN1-A related code

DCSO uses only uses FNV1.

void FNV1a64Update(FNV164_CTX *context, const unsigned char *input, unsigned int inputLen) can be removed
as well as the alternate switch in fnv_64_buf.

use prefixes / suffixes to avoid clashes

using good prefix, or even suffix for C library is a must, and come up with something unique (and "fleur" is kind of cool), probably good idea to prefix with it, and then it can be used without any current or future symbol clashes.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.