gollector's Introduction

Gollector

Tool for the collection (and planned enhancement) of domain names from different sources. The purpose of gollector is to enable the analysis of different vantage points of domain name collection, such as zone files, passive DNS logs and more.

IMPORTANT: The performance of the tool depends heavily on the optimizations set up in the Postgres database. A couple of optimizations have been implemented in gollector, but in order to rely on index-only scans, indexes must be added manually.

Components

gollector consists of various components, which can be run independently of each other. The core of the tool is a cache process that provides a gRPC API to the other components to insert entries in an underlying (PostgreSQL) database. A set of collector processes can run in parallel. See the README files of the individual components for more details.

How to configure

Each component is configured individually with a .yml configuration file. To get started, copy one of the template configuration files in the config/ directory.

Running the tool

The tool can be compiled and run with Golang, or run using Docker containers.

Golang

  • Golang (tested with version 1.13)
  • A running PostgreSQL database

Docker-compose

All components are dockerized and can be run with docker-compose. Note that the cache is expected to be running for any of the collectors to work, so the order in which the Docker containers are started matters. The following is an example:

$ docker-compose build cache zones
$ docker-compose up -d cache
...
...
$ docker-compose up -d zones

Make sure the correct environment variables are set (in the shell or via a .env file in the root of the project) before running docker-compose.
Take a look at docker-compose.yml for the environment variables to set.

Contribute

Protobuf

After updating the Protobuf file (api/proto/api.proto), run the following to generate the associated Go source code:

$ cd api/proto
$ protoc --go_out=. --go-grpc_out=. api.proto    

gollector's People

Contributors

gianmarcomennecozzi, kdhageman, mrtrkmn

gollector's Issues

DB query

Skip querying the DB if the cache size limit has not been reached.
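
One possible reading of this idea, sketched in Go: if every entry written to the DB is also added to the cache, and the cache only drops entries once its size limit is reached, then a cache miss before any eviction means the entry is not in the DB either, so the query can be skipped. The type and function names below are hypothetical and not taken from the gollector code base, and the sketch assumes the cache starts out in sync with the DB and has a positive limit.

```go
package main

import "fmt"

// boundedCache is a hypothetical in-memory set of names with a size limit.
// It remembers whether it ever had to evict an entry.
type boundedCache struct {
	limit   int // must be positive
	entries map[string]struct{}
	evicted bool // true once an entry was dropped to make room
}

func newBoundedCache(limit int) *boundedCache {
	return &boundedCache{limit: limit, entries: make(map[string]struct{})}
}

// add stores name, evicting an arbitrary entry when the cache is full.
func (c *boundedCache) add(name string) {
	if _, ok := c.entries[name]; ok {
		return
	}
	if len(c.entries) >= c.limit {
		for victim := range c.entries {
			delete(c.entries, victim)
			break
		}
		c.evicted = true
	}
	c.entries[name] = struct{}{}
}

// needsDBLookup reports whether the DB must be queried to decide if name is
// already stored. A hit never needs the DB; a miss only needs it when
// entries may have been evicted.
func (c *boundedCache) needsDBLookup(name string) bool {
	if _, ok := c.entries[name]; ok {
		return false
	}
	return c.evicted
}

func main() {
	c := newBoundedCache(2)
	c.add("a.example")
	fmt.Println(c.needsDBLookup("b.example")) // false: cache never overflowed
	c.add("b.example")
	c.add("c.example") // forces an eviction
	fmt.Println(c.needsDBLookup("d.example")) // true: a miss is now ambiguous
}
```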

Better handling of configuration

docker-compose requires all environment variables to be set, even when only building the application, which is unnecessary. We should move the environment variable check into the application itself, rather than doing it in docker-compose.
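
A minimal sketch of what such an in-application check could look like, using only the Go standard library. The variable names are placeholders chosen for illustration; the actual names are the ones listed in docker-compose.yml.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingEnv returns the names in required that are unset or empty
// according to lookup (os.LookupEnv in production, a stub in tests).
func missingEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, key := range required {
		if val, ok := lookup(key); !ok || val == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	// Hypothetical variable names, for illustration only.
	required := []string{"EXAMPLE_DB_HOST", "EXAMPLE_DB_PASSWORD"}
	if missing := missingEnv(required, os.LookupEnv); len(missing) > 0 {
		fmt.Fprintf(os.Stderr, "missing environment variables: %s\n", strings.Join(missing, ", "))
		os.Exit(1)
	}
	fmt.Println("configuration ok")
}
```

Because the check runs at application start-up, building the images no longer depends on the variables being present.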

Pass tests on GitHub Actions

A couple of significant requirements:

  • Some tests assume that data sources (such as a zone file accessed via HTTP) are in place; these sources must either (1) be mocked, or (2) the tests must be skipped by GitHub Actions (GA).
  • Some tests require interaction with a PostgreSQL database, which should be provided by GA.

Index Start Date

I think there is a problem with retrieving the last stored certificate from the DB. I suspect this function doesn't work correctly (https://github.com/aau-network-security/gollector/blob/master/app/ct/main.go#L135).

The way ct should work: it should insert (n) certificates into the DB every time it is run. The function linked above should get the last entered certificate from the DB, so that the next (n) certificates are scanned starting from that one.

The way ct is working right now: the first time I run ct, it enters the first 100 certificates. When I run it again, it doesn't store the next 100 certificates in the DB.

The function linked above also returns an error in the tests:
https://github.com/aau-network-security/gollector/pull/39/checks#step:6:111
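
The intended resume behaviour can be sketched as a small pure function. This is an illustration under assumptions, not the code in app/ct/main.go: the function name and parameters are hypothetical, and it assumes entries are addressed by 0-based indices with an inclusive start and end, as in the get-entries call of RFC 6962.

```go
package main

import "fmt"

// nextRange computes the inclusive [start, end] range of log entry indices
// to fetch in the next run.
//
//	lastStored: index of the last certificate stored in the DB
//	hasStored:  false when the DB holds no certificate for this log yet
//	batch:      number of certificates to insert per run (n)
//	treeSize:   current number of entries in the log
//
// ok is false when there is nothing new to fetch.
func nextRange(lastStored int64, hasStored bool, batch, treeSize int64) (start, end int64, ok bool) {
	if hasStored {
		start = lastStored + 1 // continue right after the last stored entry
	}
	if batch <= 0 || start >= treeSize {
		return 0, 0, false
	}
	end = start + batch - 1
	if end > treeSize-1 {
		end = treeSize - 1 // do not request entries beyond the log
	}
	return start, end, true
}

func main() {
	fmt.Println(nextRange(0, false, 100, 1000)) // first run: 0 99 true
	fmt.Println(nextRange(99, true, 100, 1000)) // second run: 100 199 true
}
```

With this shape, the second run described in the issue would request entries 100 to 199 instead of repeating or skipping the first batch.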

Separate cache from other processes

With high-performance IPC between a process and a cache container (high throughput, on the order of 100,000 messages per second, and low latency, (far) below a millisecond), it would be possible to run multiple measurements simultaneously.

Current situation

+------------------+           +------------------+
|                  |           |                  |
|     Process 1    |           |     Process 2    |
|                  |           |                  |
|  +------------+  |           |  +------------+  |
|  |            |  |           |  |            |  |
|  |   Cache    |  |           |  |    Cache   |  |
|  |            |  |           |  |            |  |
+--+------+-----+--+           +--+-----+------+--+
          |                             |
          |                             |
          |                             |
          |                             |
          |                             |
          +--------------+--------------+
                         |
                 +-------+--------+
                 |                |
                 |   Persistent   |
                 |                |
                 +----------------+

Suggested alternative

+------------------+           +------------------+
|                  |           |                  |
|     Process 1    |           |     Process 2    |
|                  |           |                  |
+-------+----------+           +---------+--------+
        |                                |
        +--------------+    +------------+
                       |    |    High-performance IPC
                       |    |
                 +-----+----+-----+
                 |                |
                 |     Cache      |
                 |                |
                 +-------+--------+
                         |
                         |  Asynchronous, but reliable
                 +-------+--------+
                 |                |
                 |   Persistent   |
                 |                |
                 +----------------+
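
The suggested alternative can be approximated in a single Go program, with goroutines standing in for the separate processes and a buffered channel standing in for the asynchronous link to the persistent store. This is only a sketch of the data flow: the IPC layer itself is not shown, all names are hypothetical, and the "reliable" part (retrying failed writes) is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedCache is a hypothetical stand-in for the standalone cache: it
// deduplicates names from several producers and hands new names to a
// background writer.
type sharedCache struct {
	mu      sync.Mutex
	seen    map[string]struct{}
	pending chan string
	done    chan struct{}
}

// newSharedCache starts a writer goroutine that calls persist for every
// queued name, decoupling producers from the persistent store.
func newSharedCache(buffer int, persist func(name string)) *sharedCache {
	c := &sharedCache{
		seen:    make(map[string]struct{}),
		pending: make(chan string, buffer),
		done:    make(chan struct{}),
	}
	go func() {
		defer close(c.done)
		for name := range c.pending {
			persist(name)
		}
	}()
	return c
}

// Insert records name and reports whether it was new. It blocks only when
// the queue to the writer is full. It must not be called after Close.
func (c *sharedCache) Insert(name string) bool {
	c.mu.Lock()
	_, dup := c.seen[name]
	if !dup {
		c.seen[name] = struct{}{}
	}
	c.mu.Unlock()
	if dup {
		return false
	}
	c.pending <- name
	return true
}

// Close stops the writer after every queued name has been persisted.
func (c *sharedCache) Close() {
	close(c.pending)
	<-c.done
}

func main() {
	var stored []string // written only by the writer goroutine
	c := newSharedCache(16, func(name string) { stored = append(stored, name) })

	var wg sync.WaitGroup
	for p := 0; p < 2; p++ { // two "processes" reporting overlapping names
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, n := range []string{"a.example", "b.example", "c.example"} {
				c.Insert(n)
			}
		}()
	}
	wg.Wait()
	c.Close()
	fmt.Println(len(stored)) // 3: each distinct name is persisted once
}
```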

DB Libraries

Use just one library to interact with the DB. pq should be the best one.

Test FTP over SSH

I am not convinced the current implementation is correct. The only zone file accessible over FTP is the .com one.

Improve error reporting

Errors are currently somewhat difficult to debug. Possible improvements:

  • Wrap errors to make printed error messages clearer
  • Report messages to Sentry/Rollbar and send emails when errors occur
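
For the first point, a minimal sketch of error wrapping with the Go standard library (the %w verb of fmt.Errorf together with errors.Is). The function names and the sentinel error are made up for the example and do not refer to existing gollector code.

```go
package main

import (
	"errors"
	"fmt"
)

// errZoneUnavailable is a hypothetical sentinel error for this sketch.
var errZoneUnavailable = errors.New("zone file unavailable")

// fetchZone stands in for a low-level operation that fails.
func fetchZone(tld string) error {
	return errZoneUnavailable
}

// collectZone wraps the underlying error with the context of the failing step.
func collectZone(tld string) error {
	if err := fetchZone(tld); err != nil {
		return fmt.Errorf("collect zone %q: %w", tld, err)
	}
	return nil
}

// isUnavailable reports whether err was caused by errZoneUnavailable,
// regardless of how many times it has been wrapped.
func isUnavailable(err error) bool {
	return errors.Is(err, errZoneUnavailable)
}

func main() {
	err := collectZone("com")
	fmt.Println(err)                // collect zone "com": zone file unavailable
	fmt.Println(isUnavailable(err)) // true
}
```

The printed message shows which step failed, while callers can still test for the original cause.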

Mount volume in cache to persist issued certificates

When running the cache over TLS, the certmagic lib automatically obtains certificates. The issued certificates are stored on disk, but because we currently do not mount a volume to persist those certificates, they disappear whenever the container is closed, and with a restart a new cert is issued. As a result, the rate limit of Let's Encrypt may be hit, locking us out of running on TLS for a few days.

The docker-compose config should mount a volume to the correct location where the certs are stored by certmagic

Collectors

Make sure the implementation used for CT logs also works for all the other collector components we have.

Add the notion of a "dataset"

We must be able to distinguish between multiple measurements from different vantage points, knowing exactly which data point belongs to which data set/vantage point.

Add status API call for cache

It might be useful to retrieve the current state of the entries contained in the cache process via a gRPC call. To go even further, a monitoring tool could read this state and visualize the growth of the number of entries over time.

cache add

Add items to the cache:

  • when they are found in the DB
  • after they have been written to the DB (could be done in the post-hook)
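
The two cases above can be sketched as a single get-or-insert helper. The store interface and the in-memory implementation are hypothetical stand-ins for the real DB layer, and the cache update is done inline here rather than in a post-hook.

```go
package main

import "fmt"

// store is a hypothetical abstraction of the DB used by this sketch.
type store interface {
	Find(name string) (id int64, found bool)
	Insert(name string) (id int64)
}

// memStore is an in-memory store so that the example runs without a DB.
type memStore struct {
	rows   map[string]int64
	nextID int64
}

func (s *memStore) Find(name string) (int64, bool) {
	id, ok := s.rows[name]
	return id, ok
}

func (s *memStore) Insert(name string) int64 {
	s.nextID++
	s.rows[name] = s.nextID
	return s.nextID
}

// getOrInsert returns the ID for name and whether the cache answered.
// On a cache miss the name is added to the cache in both cases:
// when it is found in the DB and after it has been written to the DB.
func getOrInsert(cache map[string]int64, db store, name string) (int64, bool) {
	if id, ok := cache[name]; ok {
		return id, true
	}
	id, found := db.Find(name)
	if !found {
		id = db.Insert(name)
	}
	cache[name] = id
	return id, false
}

func main() {
	db := &memStore{rows: map[string]int64{"known.example": 7}, nextID: 7}
	cache := map[string]int64{}
	fmt.Println(getOrInsert(cache, db, "known.example")) // 7 false (found in DB)
	fmt.Println(getOrInsert(cache, db, "known.example")) // 7 true  (now cached)
	fmt.Println(getOrInsert(cache, db, "new.example"))   // 8 false (written to DB)
}
```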
