dataleach's Introduction

Dataleach

This project focuses on pulling in data from various web resources and parsing them for specific identifiers, such as IP addresses.

Configurations

The dataleach processes two types of configuration. The first is a system configuration, which tells the dataleach where to find site-specific configuration files and defines basic output settings such as output file names. The second is a set of site-specific files, each of which dictates how one site is treated.

System Configuration

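System configuration files dictate how the dataleach operates in general. The following sample looks for site configuration files with the .conf extension in the data directory and writes output to the data directory.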

[INPUT]
DIR=data
EXTENSION=.conf

[OUTPUT]
DIR=data
FORMAT=%Y%M%D.%h

The configuration is divided into two sections.

  1. INPUT - Defines where to look for sources.
    1. DIR - The directory to use
    2. EXTENSION - The extension to look for
  2. OUTPUT - Output directory information
    1. DIR - The path to which site specific outputs will be written
    2. FORMAT - Specification of the output file name format.
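
As a rough illustration of how the two sections could be read, here is a minimal sketch using Python's standard configparser module. The function names load_system_config and find_site_configs are hypothetical and are not taken from the dataleach source. The sketch returns the FORMAT value unchanged and makes no attempt to interpret its tokens.

```python
import configparser
from pathlib import Path

# The sample system configuration shown above.
SAMPLE = """\
[INPUT]
DIR=data
EXTENSION=.conf

[OUTPUT]
DIR=data
FORMAT=%Y%M%D.%h
"""


def load_system_config(text):
    """Parse the system configuration text into a plain dictionary."""
    # interpolation=None keeps the '%' characters in FORMAT literal;
    # configparser's default interpolation would raise an error on '%Y'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        "input_dir": Path(parser["INPUT"]["DIR"]),
        "extension": parser["INPUT"]["EXTENSION"],
        "output_dir": Path(parser["OUTPUT"]["DIR"]),
        "format": parser["OUTPUT"]["FORMAT"],
    }


def find_site_configs(input_dir, extension):
    """Return the files in input_dir whose names end with extension, sorted."""
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file() and path.name.endswith(extension)
    )


settings = load_system_config(SAMPLE)
```

Calling find_site_configs(settings["input_dir"], settings["extension"]) would then list the candidate site configuration files, assuming the data directory exists.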

Site Configuration

Site configuration files dictate how the dataleach processes a site. Below is a sample configuration file that extracts the google.com web page and places its content in output/source1.

[DETAILS]
name=source1
type=WEB_SOURCE
address=http://www.google.com

[PROCESS]
search=([0-9]{1,3}\.){3}[0-9]{1,3}
filter=<?.*>

[IO]
output_dir=output/source1

The configuration is divided into three sections.

  1. DETAILS - The metadata about the source.
    1. name - The name of the source.
    2. type - The type of source being referenced.
    3. address - The root URL to be accessed.
  2. PROCESS - How the returned data should be processed.
    1. search - A regular expression representing the data that should be extracted from the site.
    2. filter - A regular expression representing data that should be filtered out immediately, before searching.
  3. IO - The file IO that should be used for this source.
    1. output_dir - The directory to which this source's output is written.
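
The filter-then-search order described for the PROCESS section can be sketched as follows. This is only an illustration under stated assumptions: the function names load_process_rules and extract are hypothetical, the example.com address and the example source name are placeholders, and the two patterns are illustrative stand-ins chosen for this sketch (a tag-stripping filter and an IPv4-shaped search), not the values from the sample file above.

```python
import configparser
import re

# Hypothetical site configuration; patterns are illustrative assumptions.
SITE_CONFIG = r"""[DETAILS]
name=example
type=WEB_SOURCE
address=http://www.example.com

[PROCESS]
search=(?:[0-9]{1,3}\.){3}[0-9]{1,3}
filter=<[^>]*>

[IO]
output_dir=output/example
"""


def load_process_rules(text):
    """Compile the search and filter expressions from the PROCESS section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    process = parser["PROCESS"]
    return re.compile(process["search"]), re.compile(process["filter"])


def extract(text, search, filter_pattern):
    """Remove filtered content first, then search what remains."""
    cleaned = filter_pattern.sub("", text)
    return [match.group(0) for match in search.finditer(cleaned)]


search, filter_pattern = load_process_rules(SITE_CONFIG)
matches = extract("<b>10.0.0.1</b> and <i>192.168.1.20</i>", search, filter_pattern)
print(matches)
```

The search pattern here uses a non-capturing group so that each match is the whole address rather than its last repeated part.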

dataleach's People

Contributors

janies

dataleach's Issues

Add non-text processing to WebSites class

Right now we assume everything is text, but we will also have to work with XML and JSON data. It might be worth separating the text, JSON, and XML parsing into subclasses that can be passed into the WebSites class, or we could instantiate the right parser automatically based on the content returned by a web request.
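
One possible shape for the automatic option is sketched below. It is not the project's design: it uses plain functions in a lookup table rather than subclasses, and the names PARSERS, to_text, and the text_from_* helpers are hypothetical. It assumes the caller already has the response body as a string and the Content-Type header value.

```python
import json
import xml.etree.ElementTree as ET


def text_from_plain(body):
    """Plain text needs no conversion."""
    return body


def text_from_json(body):
    """Flatten every scalar value in a JSON document, one per line."""
    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)
        elif node is not None:
            yield str(node)
    return "\n".join(walk(json.loads(body)))


def text_from_xml(body):
    """Collect the non-empty text nodes of an XML document, one per line."""
    root = ET.fromstring(body)
    return "\n".join(t.strip() for t in root.itertext() if t.strip())


# Media type to converter; anything unlisted is treated as plain text.
PARSERS = {
    "application/json": text_from_json,
    "application/xml": text_from_xml,
    "text/xml": text_from_xml,
}


def to_text(body, content_type):
    """Pick a converter from the Content-Type value and apply it."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";")[0].strip().lower()
    return PARSERS.get(media_type, text_from_plain)(body)
```

Converting everything to text first would let the existing search and filter expressions run unchanged, at the cost of losing the document structure.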

Support Max file sizes

Right now, there is no restriction on the size of output files. We need to enforce this to make sure that we can produce output that is useful on constrained systems.
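
A minimal sketch of one way to enforce such a limit is shown below. The issue does not say what should happen when the limit is reached, so this sketch assumes the simplest policy: stop writing before the cap would be exceeded. The function name write_capped and the max_bytes parameter are hypothetical.

```python
def write_capped(path, lines, max_bytes):
    """Write lines to path until the next line would exceed max_bytes.

    Returns the number of lines actually written.
    """
    written = 0
    count = 0
    with open(path, "wb") as handle:
        for line in lines:
            data = (line + "\n").encode("utf-8")
            # Stop before the cap is exceeded so no partial line is written.
            if written + len(data) > max_bytes:
                break
            handle.write(data)
            written += len(data)
            count += 1
    return count
```

Other policies, such as rolling over to a new output file, would fit the same interface.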

Support Authentication

At present we do not support authentication, but we have moved to Requests, which makes adding it feasible. The user should be able to supply a username and password, or a private key, as configuration parameters.
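
As a sketch of how such parameters might be mapped onto Requests, the function below turns a hypothetical AUTH section into keyword arguments. The section name and the option names username, password, cert_file, and key_file are assumptions made for this example. The auth tuple and the cert tuple are the forms that requests.get accepts for HTTP Basic authentication and for a client-side certificate with its private key.

```python
import configparser

# Hypothetical site configuration with an assumed AUTH section.
SITE_CONFIG = """\
[DETAILS]
name=source1
address=https://example.com/feed

[AUTH]
username=alice
password=s3cret
"""


def request_kwargs(parser):
    """Build keyword arguments for requests.get from the AUTH section."""
    kwargs = {}
    if not parser.has_section("AUTH"):
        return kwargs
    auth = parser["AUTH"]
    if "username" in auth and "password" in auth:
        # A (user, password) tuple selects HTTP Basic authentication.
        kwargs["auth"] = (auth["username"], auth["password"])
    if "cert_file" in auth and "key_file" in auth:
        # A (certificate, private key) tuple selects a client certificate.
        kwargs["cert"] = (auth["cert_file"], auth["key_file"])
    return kwargs


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SITE_CONFIG)
kwargs = request_kwargs(parser)
# With Requests installed:
# requests.get(parser["DETAILS"]["address"], **kwargs)
```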
