Giter VIP home page Giter VIP logo

dejavu-1's Introduction

Build Status Issues Go Report Card License GoDoc Donate PayPal Donate Bitcoin Avaivable For Hire

Déjà vu

Quickly detect already witnessed data, ideal for deduplication.

Limited memory of witnessed data, oldest are forgotten. Library is thread safe. Offers deterministic and probabilistic (over an order of magnitude less memory consuming) implementation. The probabilistic implementation uses bloom filters, meaning false positives are possible but not false negatives.

Installation

Download binary release

Compiled binaries for many platforms are available and can be downloaded for the latest release.

Extract the binary for your platform and add it to your system path.

Compile from source

Requires golang environment/workspace.

# compile and install library
go get github.com/f483/dejavu

# compile and install binary
go install github.com/f483/dejavu/dejavu

Command line usage

$ dejavu -h
Usage: dejavu [OPTION]... [FILE]...

Concatenate FILE(s) and filter or output duplicate lines.

With no FILE, or when FILE is -, read standard input.

Options:
  -D	use deterministic mode instead of probabilistic
	WARNING requires order of magnitude more memory
  -d	output only duplicates instead of filtering
  -f float
    	chance of false positive, between 0.0 and 1.0
	only for probabilistic mode (default 1e-06)
  -l uint
    	limit after which entries are forgotton (default 1000000)
  -o string
    	output file, defaults to stdout
  -v	output version information and exit

Examples:
  dejavu
	default probabilistic deduplication from stdin to std out with
	1mil entry limit and 1/1mil chance of false positive (~8M mem usage)
  dejavu -o s f - g
	deduplicat f, then stdin, then g, to output s
  dejavu -l 10000000 -fp 0.000000001
	probabilistic deduplication with 10mil entry limit
	and 1/1bil chance of false positive (~70M mem usage)
  dejavu -d -D -l 65536
	output duplicates and avoid false positives with deterministic mode
	lower entry limit to avoid excessive memory usage

Implementation:
  Efficient probabilistic and deterministic duplicate detection with O(1) 
  detection time and O(n) memory usage in relation to entry limit. Default
  probabilistic implementation uses bloom filters, meaning false
  positives are possible but not false negatives.

Author: Fabian Barkhau <[email protected]>
Project: https://github.com/f483/dejavu
License: MIT https://raw.githubusercontent.com/f483/dejavu/master/LICENSE

Library usage (golang)

Probabilistic example

package main

import (
	"fmt"
	"github.com/f483/dejavu"
)

func main() {

	// probably remembers last 65536 with 0.000001 chance of false positive
	p := dejavu.NewProbabilistic(65536, 0.000001)

	fmt.Println(p.Witness([]byte("bar"))) // entry added
	fmt.Println(p.Witness([]byte("bar"))) // probably remembers entry
}

Deterministic example

package main

import (
	"fmt"
	"github.com/f483/dejavu"
)

func main() {

	// always remembers last 1024 entries
	d := dejavu.NewDeterministic(1024)

	fmt.Println(d.Witness([]byte("foo"))) // entry added
	fmt.Println(d.Witness([]byte("foo"))) // remembers entry
}

Performance

Linear memory usage: O(n)

Probabilistic

0.000001 chance of false positive.

Benchmark Memory

Deterministic

Benchmark Memory

Constant witness time: O(1)

Probabilistic

0.000001 chance of false positive.

Benchmark Time

Deterministic

Benchmark Time

dejavu-1's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.