Giter VIP home page Giter VIP logo

liceum's Introduction

Travis branch Clippy Linting Result

Liceum

Advanced license detector written in Rust

Usage

First we need to generate the datasets from a directory:

$ liceum -g /path/to/licenses

Output is placed in $PWD/cache and consists of ngrams.json and hashes.ssdeep files.

$ liceum /some/project
{
   "/some/project/LICENSE": {
      "mit"
   }
}

Prerequisites

ssdeep file

Theory

The code uses two independent search paths to produce precise data with close to zero chances for false positives.

Ssdeep

ssdeep is a fuzzy hashing mechanism capable of producing similarity index between two files. Files with index greater than 75 (out of 100) are considered during the license recognition.

Unique ngrams

Ngrams are extracted from the license corpuses and are crosschecked for uniqueness by constructing a graph mapping corpus nodes to ngram nodes.

                            +------+                             +------+       +------+
                            |File A|                             |File B|       |File C|
                            +---+--+                             +--+---+       +---+--+
                                |                                   |               |
                                |                                   |               |
          +---------------------------------------------+           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       |           |               |
          |                     |                       +-----------+>--------------+
          |                     |                       |                           |
          |                     |                       |                           |
          |                     |                       |                           |
          |                     |                       |                           |
          |                     |                       |                           |
          |                     |                       |                           |
+---------v---------+   +-------v--------+   +----------v---------+   +-------------v-------------+
|[This, is, example]|   |is, example, of]|   |[example, of, ngram]|   |[of, ngram, categorization]|
+-------------------+   +----------------+   +--------------------+   +---------------------------+

We start by iterating over each ngram and checking the number of edges from that ngram to corpus nodes. If the number is one, that means that that particular ngram is unique to the corpus (at that particular level, but more on that shortly), if the number is greater than one then we do an edge cleanup. Edge cleanup consists of removing edges to already finished corpuses. Corpus is considered finished when N unique ngrams were found for that corpus. At the end of each iteration we increase the level value to distinguish between corpuses consisting solely of unique ngrams (1st order) or "redeemed unique" ngrams (nth order, n > 1). "Redeemed unique" means that these ngrams were selected because an edge from that ngram to some now finished corpus was removed during previous iteration.

Let's assume that we're ok with a single ngram per corpus so we start iterating through the ngrams:

1: (This, is, example) -> File A

First ngram is assigned to File A and since there's no other ngram we proceed to next iteration, after the edge cleanup we get:

                                                                 +------+       +------+
                                                                 |File B|       |File C|
                                                                 +--+---+       +---+--+
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                                    |               |
                                                        +-----------+>--------------+
                                                        |                           |
                                                        |                           |
                                                        |                           |
                                                        |                           |
                                                        |                           |
                                                        |                           |
                                             +----------v---------+   +-------------v-------------+
                                             |[example, of, ngram]|   |[of, ngram, categorization]|
                                             +--------------------+   +---------------------------+

Third ngram has a single edge to File B now:

2: (example, of, ngram) -> File B

The next iteration the graph looks like this after edge cleanup:

          +------+
          |File C|
          +---+--+
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
              |
+-------------v-------------+
|[of, ngram, categorization]|
+---------------------------+

Which gives us:

3: (example, of, ngram) -> File C

The level information is finally used by the search algorithm to pick the best candidate (lowest level).

License

GPL-3.0

liceum's People

Contributors

shaded-enmity avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.