credentialleakDB


A database structure to store leaked credentials.

Think: our own, internal HaveIBeenPwned database.

Why?

  1. To quickly find duplicates before passing the data on for further processing
  2. To load diverse credential breaches into a common structure and run common queries on them
  3. To quickly generate statistics on credential leaks
  4. To have a well-defined interface for passing data on to other automation steps

Documentation

Installation

Docker

Via pip and venv

git clone https://github.com/EC-DIGIT-CSIRC/credentialLeakDB.git
cd credentialLeakDB
# create a virtualenv
virtualenv --python=python3.7 venv
source venv/bin/activate
pip install -r requirements.txt

Next, make sure the following files exist:

  • VIPs.txt ... a newline-separated list of email addresses that you consider VIPs.
  • api/config.py ... see below
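
As a rough illustration of the expected VIPs.txt format, the following sketch reads a newline-separated list of addresses into a set. The function name load_vips, the skipping of blank lines and the lower-casing of entries are assumptions made for this example, not the project's own code.

```python
from pathlib import Path


def load_vips(path="VIPs.txt"):
    """Read one email address per line; blank lines are skipped (assumption)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Normalise to lower case so membership tests are case-insensitive.
    return {line.strip().lower() for line in lines if line.strip()}
```

A caller could then check an address with `address.lower() in load_vips()`.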

Database structure

Search in Confluence for "credentialLeakDB" in the Automation space.

SQL structure: db.sql

The EER diagram was intentionally simplified a lot. If we end up storing billions of repeated text records, we can go back to more normalization. For now, however, this seems to be enough.

EER Diagram

Meaning of the fields

Table leak

Column            | Type                     | Collation | Nullable | Description
------------------+--------------------------+-----------+----------+------------
id                | integer                  |           | not null | Primary key. Auto-generated.
breach_ts         | timestamp with time zone |           |          | If known, the timestamp when the breach happened.
source_publish_ts | timestamp with time zone |           |          | The timestamp when the source (e.g. Spycloud) published the data.
ingestion_ts      | timestamp with time zone |           | not null | The timestamp when we ingested the data.
summary           | text                     |           | not null | A short summary (slug) of the leak, used for display.
ticket_id         | text                     |           |          |
reporter_name     | text                     |           |          | The name of the reporter who sent us the notification, e.g. CERT-EU or Spycloud. Who sent us the data?
source_name       | text                     |           |          | The name of the source this leak came from: either the name of a collection or some other name.
Indexes:
    "leak_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "leak_data" CONSTRAINT "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)

Table leak_data

Column               | Type    | Collation | Nullable | Description
---------------------+---------+-----------+----------+------------
id                   | integer |           | not null | Primary key. Auto-generated.
leak_id              | integer |           | not null | References leak(id).
email                | text    |           | not null | The email address associated with the leak.
password             | text    |           | not null | Either the encrypted or the unencrypted password. If the unencrypted password is available, that is what goes in this field.
password_plain       | text    |           |          | The plaintext password, if known.
password_hashed      | text    |           |          | The hashed password, if known.
hash_algo            | text    |           |          | The hashing algorithm, for example "md5" or "sha1", if we can determine it and the password_hashed field is set.
ticket_id            | text    |           |          | References the ticket system's ID of the ticket associated with handling this credential leak. This ticket could contain information on how we contacted the affected user.
email_verified       | boolean |           |          | Whether the email address was verified to exist and be active.
password_verified_ok | boolean |           |          | Was the password still valid / active?
ip                   | inet    |           |          | IP address of the client PC in case of a password stealer.
domain               | text    |           |          | Domain of the user's email address.
browser              | text    |           |          | If the password was leaked via password stealer malware, the user's browser goes here. Otherwise empty.
malware_name         | text    |           |          | If the password was leaked via password stealer malware, the malware name goes here. Otherwise empty.
infected_machine     | text    |           |          | If the password was leaked via password stealer malware, the infected (Windows) PC name (some ID for the machine) goes here.
dg                   | text    |           | not null | The affected DG (in other organisations, this would be called "department").
count_seen           | integer |           |          | How often we have already seen this unique combination (leak, email, password, domain), i.e. a duplicate counter.
Indexes:
    "leak_data_pkey" PRIMARY KEY, btree (id)
    "constr_unique_leak_data_leak_id_email_password_domain" UNIQUE CONSTRAINT, btree (leak_id, email, password, domain)
    "idx_leak_data_unique_leak_id_email_password_domain" UNIQUE, btree (leak_id, email, password, domain)
    "idx_leak_data_dg" btree (dg)
    "idx_leak_data_email" btree (upper(email))
    "idx_leak_data_email_password_machine" btree (email, password, infected_machine)
    "idx_leak_data_malware_name" btree (malware_name)
Foreign-key constraints:
    "leak_data_leak_id_fkey" FOREIGN KEY (leak_id) REFERENCES leak(id)

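The unique index on (leak_id, email, password, domain) together with the count_seen column suggests an upsert-style duplicate counter. The sketch below illustrates that idea using Python's built-in sqlite3 module as a stand-in for PostgreSQL, with a cut-down table. The function name record_credential and the reduced schema are hypothetical, and the project's real insert logic may differ.

```python
import sqlite3

# Cut-down stand-in for leak_data; the real table lives in PostgreSQL (db.sql).
SCHEMA = """
CREATE TABLE leak_data (
    id INTEGER PRIMARY KEY,
    leak_id INTEGER NOT NULL,
    email TEXT NOT NULL,
    password TEXT NOT NULL,
    domain TEXT,
    count_seen INTEGER DEFAULT 1,
    UNIQUE (leak_id, email, password, domain)
);
"""

# Needs SQLite 3.24 or newer for ON CONFLICT ... DO UPDATE.
UPSERT = """
INSERT INTO leak_data (leak_id, email, password, domain)
VALUES (?, ?, ?, ?)
ON CONFLICT (leak_id, email, password, domain)
DO UPDATE SET count_seen = leak_data.count_seen + 1;
"""


def record_credential(conn, leak_id, email, password, domain):
    """Insert a row, or bump count_seen if the combination already exists."""
    conn.execute(UPSERT, (leak_id, email, password, domain))
    row = conn.execute(
        "SELECT count_seen FROM leak_data "
        "WHERE leak_id = ? AND email = ? AND password = ? AND domain = ?",
        (leak_id, email, password, domain),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = record_credential(conn, 1, "user@example.com", "hunter2", "example.com")
second = record_credential(conn, 1, "user@example.com", "hunter2", "example.com")
print(first, second)  # prints: 1 2
```

Note that a NULL domain never conflicts under a unique constraint, so rows without a domain would not be counted as duplicates in this sketch.
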
Usage of the API

Here is how to use the API endpoints: start the server (follow the installation instructions below) and go to $servername/docs, where $servername is the domain or IP address you installed it under. The docs/ endpoint hosts a Swagger / OpenAPI 3 UI.

GET parameters

These are pretty self-explanatory thanks to the swagger UI.

POST and PUT

For HTTP POST (a.k.a INSERT into DB) you will need to provide the following JSON info:

leak object

{
  "id": 0,
  "ticket_id": "string",
  "summary": "string",
  "reporter_name": "string",
  "source_name": "string",
  "breach_ts": "2021-03-29T12:21:56.370Z",
  "source_publish_ts": "2021-03-29T12:21:56.370Z"
}

The id field only needs to be filled in when PUTing data (a.k.a. an UPDATE statement). When POSTing a new leak row to the HTTP POST /leak endpoint, leave it out: the id is the internal, automatically generated primary key and will be assigned for you. The response is a JSON object whose data array contains a dict with the new id, such as:

{
  "meta": {
    "version": "0.5",
    "duration": 0.006,
    "count": 1
  },
  "data": [
    {
      "id": 18
    }
  ],
  "error": null
}

Meaning: the API version was 0.5, the query took 0.006 s (6 ms), and there was one result. The data array contains one element, id=18, so the ID of the inserted leak object is 18. You can now reference this ID when inserting leak_data objects.
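
The request and response shapes described above can be sketched with the Python standard library alone. This example only builds the POST request for /leak without sending it, and parses a response envelope like the one shown. The helper names, the base URL and the sample field values are assumptions for illustration, and any API key your deployment requires is not covered here.

```python
import json
import urllib.request


def build_leak_post(base_url, leak):
    """Build (not send) a POST request for the /leak endpoint."""
    # The id is assigned by the database, so drop it if the caller passed one.
    body = {key: value for key, value in leak.items() if key != "id"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/leak",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def inserted_ids(response_text):
    """Return the ids found in the "data" array of a response envelope."""
    envelope = json.loads(response_text)
    if envelope.get("error"):
        raise RuntimeError(f"API reported an error: {envelope['error']}")
    return [item["id"] for item in envelope["data"]]


request = build_leak_post(
    "http://localhost:8080",  # assumed host and port
    {"id": 0, "summary": "example leak", "reporter_name": "Spycloud"},
)
sample = (
    '{"meta": {"version": "0.5", "duration": 0.006, "count": 1},'
    ' "data": [{"id": 18}], "error": null}'
)
print(request.get_method(), request.full_url, inserted_ids(sample))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a server.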

leak_data object

As with the leak object, the id field only needs to be filled in when PUTing data (a.k.a. an UPDATE statement); leave it out when POSTing a new leak_data row. Note well: the leak_id field must be filled in here, so you first have to create a leak object and only afterwards the leak_data object.

{
  "id": 0,
  "leak_id": 0,
  "email": "user@example.com",
  "password": "string",
  "password_plain": "string",
  "password_hashed": "string",
  "hash_algo": "string",
  "ticket_id": "string",
  "email_verified": true,
  "password_verified_ok": true,
  "ip": "string",
  "domain": "string",
  "browser": "string",
  "malware_name": "string",
  "infected_machine": "string",
  "dg": "string"
}
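
A minimal sketch of assembling such a body for a POST, under the assumption that only the not-null columns (leak_id, email, password, dg) are mandatory and that the domain may be derived from the email address. The function build_leak_data_payload and the sample values are hypothetical, not part of the project.

```python
def build_leak_data_payload(leak_id, email, password, dg, **optional):
    """Assemble a leak_data body for POSTing; id is deliberately left out."""
    if "id" in optional:
        raise ValueError("omit id when POSTing; it is only needed for PUT")
    payload = {"leak_id": leak_id, "email": email, "password": password, "dg": dg}
    # Derive the domain from the email unless the caller supplied one.
    payload["domain"] = optional.pop("domain", email.rsplit("@", 1)[-1])
    # Skip optional fields that were not provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_leak_data_payload(
    18, "user@example.com", "hunter2", "EXAMPLE-DG", hash_algo=None, browser="Firefox"
)
print(payload)
```

Here 18 stands for the id returned when the leak object was created.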

import/csv/ endpoint

Also pretty self-explanatory. You first need to create a leak object, then pass its ID as a GET-style parameter and upload the CSV in Spycloud format via the form.

Installation

  1. Install git and check out this repository:
apt install git
git clone ...
cd credentialLeakDB
  2. Install PostgreSQL:
# in Ubuntu:
apt install postgresql-12
# alternatively, if you are on Debian 10, you can also use postgresql-11; both work:
# apt install postgresql-11
  3. As user postgres, create the database and the role:
sudo su - postgres
createdb credentialleakdb
createuser credentialleakdb
psql -c "ALTER ROLE credentialleakdb WITH PASSWORD '<insert some random password here>'" template1
  4. Create the DB structure:
psql -U credentialleakdb credentialleakdb < db.sql
  5. Set the env vars:
export PORT=8080
export DBNAME=credentialleakdb
export DBUSER=credentialleakdb
export DBPASSWORD='<insert the password you gave the user>'
export DBHOST=localhost
  6. Create a virtual environment if it does not exist yet:
virtualenv --python=python3.7 venv
source venv/bin/activate
pip install -r requirements.txt
  7. Start the program from the main directory:
export PYTHONPATH=$(pwd); uvicorn --reload --host 0.0.0.0 --port $PORT api.main:app
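
Before starting the server it can help to check that the variables exported in the steps above are actually set. The following sketch does that and collects the database settings into a dict. Both helper names are hypothetical, and how the application itself reads these variables is not shown in this README.

```python
import os

REQUIRED = ("PORT", "DBNAME", "DBUSER", "DBPASSWORD", "DBHOST")


def missing_vars(env=os.environ):
    """List required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def db_settings(env=os.environ):
    """Collect the DB* variables into a dict of connection settings."""
    return {
        "dbname": env["DBNAME"],
        "user": env["DBUSER"],
        "password": env["DBPASSWORD"],
        "host": env["DBHOST"],
    }


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Database host:", db_settings()["host"])
```

The dict keys were chosen to resemble common PostgreSQL connection keywords; adjust them to whatever driver you use.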

Configuration

Please copy the file config.SAMPLE.py to api/config.py and adjust accordingly. Here you can set API keys etc.

Contributors

aaronkaplan, austindimmer, dependabot[bot]

credentialleakdb's Issues

Enrichers

  • implement relevant enrichers which are open sourced

Abstract MQ logic

  • implement abstract MQ logic. So far it only hands over in-memory structures

Refactor Spycloud parser

  • include new parser with pandas
  • move into pydantic struct
  • return iterator of pydantic structs
  • return error msg in case it did not work
  • pass on to further MQ

stuff for the re-write branch until 21st

hard bugs

  • insert_leak_data() pytest failure
  • idf vs. api/model.py? What counts? merge + refactor
  • store() and convert_to_output()
  • bug in enrich (int)
  • make sure mappings between inDF and iDF are ok.
  • CSV file - ; or , as delimiter
  • make sure that if a row can't be enriched properly, that we set needs_human_intervention = True

soft bugs

  • make sure coverage is always > 80% !!! How to do that with the abstract classes?
  • What to do when CED is not here? mock answers?

docs

  • update CHANGELOG
  • update docs
