
godap's Introduction

GODAP: The Data Analysis Pipeline

(a port of the Ruby-based DAP)


DAP was created to transform text-based data on the command-line, specializing in transforms that are annoying or difficult to do with existing tools.

DAP reads data using an input plugin, transforms it through a series of filters, and prints it out again using an output plugin. Every record is treated as a document (aka: hash/dict) and filters are used to reduce, expand, and transform these documents as they pass through. Think of DAP as a mashup between sed, awk, grep, csvtool, and jq, with map/reduce capabilities.

DAP was written to process terabyte-sized public scan datasets. This Go version of dap supports parallel processing of data. Results are forwarded to stdout, and output ordering is not guaranteed: results are highly likely to be out of order relative to the input data stream.


Install Go version 1.12 or higher.

go get

godap supports pcap and geoip, which provide an input and filters, respectively. To enable support for these, you must pass a libpcap or libgeoip tag to your go get command. You must also have those libraries installed on your system (libpcap-dev or libgeoip).

For example:

go get -tags="libpcap libgeoip"

This compiles in support for both the pcap input and the geoip filters (geo_ip and geo_ip_org).


Quick Setup for GeoIP Lookups

Note: The documentation below assumes you've properly set up $GOPATH and $PATH (usually $GOPATH/bin:$PATH) per the official Go documentation.

$ go get
$ sudo bash
# mkdir -p /var/lib/geoip && cd /var/lib/geoip && wget && gunzip GeoLiteCity.dat.gz && mv GeoLiteCity.dat geoip.dat
$  echo | godap lines + geo_ip line + json
{"line":"","line.country_code":"US","line.country_code3":"USA","line.country_name":"United States","line.latitude":"38.0","line.longitude":"-97.0"}

Where dap gets fun is doing transforms, like just grabbing the country code:

$  echo | godap lines + geo_ip line + select line.country_code3 + lines

Inputs, filters and outputs

The general syntax when calling godap is godap <input> + (<filter +> <filter +> <filter +> ...) + <output>, where input, filter and output correspond to one of the supported features below. Filters are optional, though an input and output are required. Each feature component is separated by +. Component options are specified immediately after the component declaration. For example, streaming from a wifi adapter and spitting out json documents would resemble: godap pcap iface=en0 rfmon=true + json. Component options with spaces or other complexities can be specified using shell-like quoting. For example, for a bpf pcap filter on the pcap component: godap pcap iface=en0 'filter="tcp port 80"' + json.
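The component syntax described above can be sketched in Go as follows. This is a minimal illustration rather than godap's actual parser: the `component` type and `parsePipeline` function are hypothetical names, and the sketch assumes the shell has already resolved quoting so that each option arrives as a single argument.

```go
package main

import (
	"fmt"
	"strings"
)

// component is a hypothetical representation of one pipeline stage:
// a name (e.g. "pcap") plus its raw option strings (e.g. "iface=en0").
type component struct {
	Name string
	Opts []string
}

// parsePipeline splits command-line arguments on "+" into components.
// The first argument of each group is taken as the component name.
func parsePipeline(args []string) ([]component, error) {
	var out []component
	var cur []string

	flush := func() error {
		if len(cur) == 0 {
			return fmt.Errorf("empty component in pipeline")
		}
		out = append(out, component{Name: cur[0], Opts: cur[1:]})
		cur = nil
		return nil
	}

	for _, a := range args {
		if a == "+" {
			if err := flush(); err != nil {
				return nil, err
			}
			continue
		}
		cur = append(cur, a)
	}
	if err := flush(); err != nil {
		return nil, err
	}
	// Filters are optional, but an input and an output are required.
	if len(out) < 2 {
		return nil, fmt.Errorf("need at least an input and an output")
	}
	return out, nil
}

func main() {
	args := strings.Fields("pcap iface=en0 rfmon=true + json")
	comps, err := parsePipeline(args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range comps {
		fmt.Printf("%s %v\n", c.Name, c.Opts)
	}
}
```

The first component would be treated as the input, the last as the output, and anything in between as filters.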


  • pcap

Specifies that the input stream is a packet capture. Currently supports streaming in from a file or interface.

NOTE: godap MUST be built using -tags="libpcap" for pcap input support. If not, the pcap input will be unavailable.

Option | Description | Value | Default
iface | The interface to read packets from. If iface is specified, file must not be specified | string interface id | <none>
file | The pcap file to read from. If file is specified, iface must not be specified | string filename | <none>
promisc | Whether to capture in promiscuous mode | boolean | false
timeout | The capture timeout | integer | -1 (inf)
snaplen | The snap length | integer | 65536
rfmon | Whether to capture in monitor mode (applicable only to adapters which support it) | boolean | false


  • Pull packets in from monitor mode: godap pcap iface=en0 rfmon=true + json

  • Read pcap (or pcap-ng) file contents and convert to json: godap pcap file=foo.pcap + json

  • Live capture in promiscuous mode: godap pcap iface=en0 promisc=true + json

  • json

Specifies that the input stream is represented as JSON data.

Option | Description | Value | Default
file | The file to stream from. If not specified, stdin is assumed. Can also be - for stdin. | string filename | stdin


$  echo '{"a":2}' | godap json + lines

  • lines

Specifies that the input stream is represented as newline-terminated plaintext.

Option | Description | Value | Default
file | The file to stream from. If not specified, stdin is assumed. Can also be - for stdin. | string filename | stdin


$  echo hello world | godap lines + json
{"line":"hello world"}


  • rename

Renames a document field.

Option | Description | Value | Default
<document key> | The document key to rename | string <destination key> | <none>


$  echo world | godap lines + rename line=hello + json

  • not_exists

Prevents a document from further processing if a specified key is present.

Option | Description | Value | Default
<document key> | The document key that must not exist | <none> | <none>


$  echo '{"foo":"bar"}' | godap json + not_exists foo + json

  • split_comma

Extracts comma-separated fields in the specified key's value into new documents.

Option | Description | Value | Default
<document key> | The document key which will be split | <none> | <none>


$  echo '{"foo":"bar,baz"}' | godap json + split_comma foo + json

  • field_split_line

Extracts fields separated by a newline from the source key's value into new fields of the same document. Each new key is named <origkey>.f### where ### is an incremental integer indicating the original field position from left to right.

Option | Description | Value | Default
<document key> | The document key which will be split | <none> | <none>


$  echo '{"foo":"bar\nbaz"}' | godap json + field_split_line foo + json

  • not_empty

Filters out a document if the value for a given key is empty.

Option | Description | Value | Default
<document key> | The document key to filter on | <none> | <none>


$  echo '{"foo":"bar,baz"}' | godap json + not_empty foo + json

  • field_split_tab

Splits a key into multiple new subkeys, each containing a field from the original value split by \t. Each new key is named <origkey>.f### where ### is an incremental integer indicating the original field position from left to right.

Option | Description | Value | Default
<document key> | The document key to split | <none> | <none>


$  echo '{"foo":"bar\tbaz"}' | godap json + field_split_tab foo + json
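The field_split_* behaviour described above can be sketched as a single helper parameterised by separator. This is an illustration under stated assumptions, not godap's code: `fieldSplit` is a hypothetical name, and the sketch assumes field numbering starts at 1 with no padding, which the `f###` notation does not pin down.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldSplit adds one new key per separated field of doc[key], leaving
// the original key in place.
// ASSUMPTION: keys are numbered from 1 without padding ("foo.f1",
// "foo.f2"); the real numbering format may differ.
func fieldSplit(doc map[string]interface{}, key, sep string) {
	s, ok := doc[key].(string)
	if !ok {
		return // missing or non-string values are left untouched
	}
	for i, field := range strings.Split(s, sep) {
		doc[fmt.Sprintf("%s.f%d", key, i+1)] = field
	}
}

func main() {
	doc := map[string]interface{}{"foo": "bar\tbaz"}
	fieldSplit(doc, "foo", "\t") // use "\n" for the newline variant
	fmt.Println(doc["foo.f1"], doc["foo.f2"])
}
```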

  • truncate

Sets the value of the specified key to the empty string.

Option | Description | Value | Default
<document key> | The key to truncate | <none> | <none>


$  echo '{"foo":"bar\tbaz"}' | godap json + truncate foo + json

  • flatten

Flattens nested JSON under the specified key into properties of the top-level document. The original key/value are left untouched (you can remove them using the remove filter).

Option | Description | Value | Default
<document key> | The key to flatten | <none> | <none>


$ echo '{"foo":{"bar": "baz"}}' | godap json + flatten foo + json
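A flatten step of this kind might look like the sketch below. It is illustrative only: `flatten` is a hypothetical helper, it handles a single level of nesting, and the dotted `<key>.<subkey>` naming is an assumption, since the exact property naming is not spelled out here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatten copies each entry of the nested object at doc[key] to a
// top-level entry, leaving the original nested value in place.
// ASSUMPTION: new keys are named "<key>.<subkey>"; godap's actual
// naming may differ. Only one level of nesting is handled.
func flatten(doc map[string]interface{}, key string) {
	nested, ok := doc[key].(map[string]interface{})
	if !ok {
		return // nothing to flatten
	}
	for k, v := range nested {
		doc[key+"."+k] = v
	}
}

func main() {
	doc := make(map[string]interface{})
	if err := json.Unmarshal([]byte(`{"foo":{"bar": "baz"}}`), &doc); err != nil {
		panic(err)
	}
	flatten(doc, "foo")
	out, _ := json.Marshal(doc)
	fmt.Println(string(out))
}
```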

  • insert

Adds a new key and value to the document.

Option | Description | Value | Default
<document key> | The key to insert | <document value> | <empty string>


$  echo '{"foo":"bar\tbaz"}' | godap json + insert a=b + json

  • field_split_array

Splits a field that contains an array data type value into multiple new fields. Each new key is named <origkey>.f### where ### is an incremental integer indicating the original field position from left to right. The array can contain multiple different data types.

Option | Description | Value | Default
<document key> | The key to split | <none> | <none>


$  echo '{"foo":["a",2]}' | godap json + field_split_array foo + json

  • exists

Ensures the specified key exists in the source document. If it does not, the document is removed from the pipeline.

Option | Description | Value | Default
<document key> | The key that must exist | <none> | <none>


$  echo '{"foo":"bar\tbaz"}' | godap json + exists a + json
$  echo '{"foo":"bar\tbaz"}' | godap json + exists foo + json

  • split_line

Splits a given key's value into multiple new documents with the same key name, each document containing a field extracted from the source key's value separated by a newline.

Option | Description | Value | Default
<document key> | The key to split | <none> | <none>


$  echo '{"foo":"bar\nbaz"}' | godap json + split_line foo + json
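Unlike the field_split_* filters, the split_* filters emit several documents from one. The sketch below shows one way that could work; `splitDoc` is a hypothetical name, and copying the remaining keys into every new document is an assumption about the behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDoc returns one document per separated field of doc[key], with
// doc[key] replaced by that field.
// ASSUMPTION: all other keys are copied into every new document.
func splitDoc(doc map[string]interface{}, key, sep string) []map[string]interface{} {
	s, ok := doc[key].(string)
	if !ok {
		// Missing or non-string values pass through unchanged.
		return []map[string]interface{}{doc}
	}
	var out []map[string]interface{}
	for _, field := range strings.Split(s, sep) {
		clone := make(map[string]interface{}, len(doc))
		for k, v := range doc {
			clone[k] = v
		}
		clone[key] = field
		out = append(out, clone)
	}
	return out
}

func main() {
	doc := map[string]interface{}{"foo": "bar\nbaz"}
	for _, d := range splitDoc(doc, "foo", "\n") { // use "," for the comma variant
		fmt.Println(d["foo"])
	}
}
```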

  • select

Keeps only the specified keys in the resulting document. Multiple key names can be specified.

Option | Description | Value | Default
<document key> | The key to keep | <none> | <none>


$  echo '{"foo":"bar", "baz":"qux", "a":"b"}' | godap json + select foo + json
$  echo '{"foo":"bar", "baz":"qux", "a":"b"}' | godap json + select foo baz + json
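As a rough illustration of the select behaviour, the sketch below builds a new document holding only the requested keys. `selectKeys` is a hypothetical helper, not godap's implementation; silently skipping keys that are absent is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// selectKeys returns a new document containing only the requested keys.
// ASSUMPTION: keys missing from the source document are skipped.
func selectKeys(doc map[string]interface{}, keys ...string) map[string]interface{} {
	out := make(map[string]interface{}, len(keys))
	for _, k := range keys {
		if v, ok := doc[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	doc := make(map[string]interface{})
	if err := json.Unmarshal([]byte(`{"foo":"bar", "baz":"qux", "a":"b"}`), &doc); err != nil {
		panic(err)
	}
	out, _ := json.Marshal(selectKeys(doc, "foo", "baz"))
	fmt.Println(string(out)) // prints {"baz":"qux","foo":"bar"}
}
```

The remove filter is the mirror image: delete the listed keys and keep everything else.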

  • remove

Removes the specified keys from the source document.

Option | Description | Value | Default
<document key> | The key to remove | <none> | <none>


$  echo '{"foo":"bar", "baz":"qux", "a":"b"}' | godap json + remove foo baz + json

  • include

Ensures the value of a document key includes a specified string.

Option | Description | Value | Default
<document key> | The key to check | string contains_str | <none>


$  echo '{"foo":"bar", "baz":"qux", "a":"b"}' | godap json + include a=c + json
$  echo '{"foo":"bar", "baz":"qux", "a":"b"}' | godap json + include a=b + json

  • transform

Transforms the value of a document key using the specified transform.

Option | Description | Value | Default
<document key> | The key to transform | utf8encode or ascii or base64encode or base64decode or upcase or downcase or hexencode | <none>


$  echo '{"foo":"bar"}' | godap json + transform foo=base64encode + json
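A table-driven sketch of how a few of these transforms could be applied is shown below. It is not godap's implementation: `transforms` and `applyTransform` are hypothetical names, only a subset of the listed transforms is covered, and the standard base64 alphabet and in-place replacement of the value are assumptions.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// transforms maps a transform name to a string-to-string function.
// ASSUMPTION: standard (padded) base64 and lowercase hex output.
var transforms = map[string]func(string) (string, error){
	"base64encode": func(s string) (string, error) {
		return base64.StdEncoding.EncodeToString([]byte(s)), nil
	},
	"base64decode": func(s string) (string, error) {
		b, err := base64.StdEncoding.DecodeString(s)
		return string(b), err
	},
	"upcase":    func(s string) (string, error) { return strings.ToUpper(s), nil },
	"downcase":  func(s string) (string, error) { return strings.ToLower(s), nil },
	"hexencode": func(s string) (string, error) { return hex.EncodeToString([]byte(s)), nil },
}

// applyTransform rewrites doc[key] in place using the named transform.
// Non-string or missing values are left untouched.
func applyTransform(doc map[string]interface{}, key, name string) error {
	fn, ok := transforms[name]
	if !ok {
		return fmt.Errorf("unknown transform %q", name)
	}
	s, ok := doc[key].(string)
	if !ok {
		return nil
	}
	out, err := fn(s)
	if err != nil {
		return err
	}
	doc[key] = out
	return nil
}

func main() {
	doc := map[string]interface{}{"foo": "bar"}
	if err := applyTransform(doc, "foo", "base64encode"); err != nil {
		panic(err)
	}
	fmt.Println(doc["foo"]) // prints YmFy
}
```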

  • reverse

Reverses the string contents of one or more fields. If the field is not a string, this is a no-op.

Option | Description | Value | Default
<field name> | The name of the string field to reverse | <none> | <none>


$ echo '{"foo":"baz","bar":"qux"}' | godap json + reverse foo bar + json
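The reverse behaviour, including the no-op for non-string fields, can be sketched as below. `reverseFields` is a hypothetical helper; reversing by Unicode code point rather than by byte is an assumption about how multi-byte text should be treated.

```go
package main

import "fmt"

// reverseString reverses s by Unicode code point.
// ASSUMPTION: rune-wise reversal; a byte-wise reversal would differ
// for multi-byte characters.
func reverseString(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// reverseFields reverses each named string field in place.
// Fields that are missing or not strings are left untouched.
func reverseFields(doc map[string]interface{}, keys ...string) {
	for _, k := range keys {
		if s, ok := doc[k].(string); ok {
			doc[k] = reverseString(s)
		}
	}
}

func main() {
	doc := map[string]interface{}{"foo": "baz", "bar": "qux"}
	reverseFields(doc, "foo", "bar")
	fmt.Println(doc["foo"], doc["bar"]) // prints zab xuq
}
```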

  • join

Joins one or more source fields into a destination field, separated by a configurable separator (a comma by default). This filter will attempt to cast the source field types to a string representation.

Option | Description | Value | Default
source | A comma-separated list of source field names | <string field name comma separated> | <none>
dest | A destination field which will receive the join result | <string field name> | <none>
sep | A separator to join the fields by | <string> | ,


$ echo '{"foo":"baz","bar":"qux"}' | godap json + join source=foo,bar dest=example sep=# + json
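The join options map onto a small helper like the one sketched below. `joinFields` is a hypothetical name rather than godap's function; using `fmt.Sprint` for the string cast and skipping absent source fields are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// joinFields concatenates the string forms of the source fields into
// doc[dest], separated by sep.
// ASSUMPTION: missing source fields are skipped, and non-string values
// are rendered with fmt.Sprint.
func joinFields(doc map[string]interface{}, sources []string, dest, sep string) {
	parts := make([]string, 0, len(sources))
	for _, k := range sources {
		if v, ok := doc[k]; ok {
			parts = append(parts, fmt.Sprint(v))
		}
	}
	doc[dest] = strings.Join(parts, sep)
}

func main() {
	doc := map[string]interface{}{"foo": "baz", "bar": "qux"}
	// Mirrors: join source=foo,bar dest=example sep=#
	joinFields(doc, strings.Split("foo,bar", ","), "example", "#")
	fmt.Println(doc["example"]) // prints baz#qux
}
```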

  • recog

Runs an input field through a given database, adding recog match fields to the input document.

Option | Description | Value | Default
<field_name>=<database_name> | One or more space-separated pairs of a field name, "=", and a recog database name (the name from the recog xml matches attribute) | <string field name "=" string database name, space separated> | <none>


$ echo "9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.2" | godap lines + recog line=dns.versionbind + json
{"line":"9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.2","line.recog.os.cpe23":"cpe:/o:redhat:enterprise_linux:6","line.recog.os.family":"Linux","line.recog.os.product":"Enterprise Linux","line.recog.os.vendor":"Red Hat","line.recog.os.version":"6","line.recog.os.version.version":"9","line.recog.service.cpe23":"cpe:/a:isc:bind:9.8.2rc1","line.recog.service.family":"BIND","line.recog.service.product":"BIND","line.recog.service.vendor":"ISC","line.recog.service.version":"9.8.2rc1"}



godap's Issues

Add warc support to godap?

Currently, if someone were to download this warc file from dap:

They could parse it with dap:

$ bzcat iawide.warc.bz2 | dap warc  + json | head -n 1 | jq  'keys'

However if they were to try this with godap:

$ bzcat iawide.warc.bz2 | ./dappy warc  + json | head -n 1 | jq  'keys'
bzcat: Can't open input file iawide.warc.bz2: No such file or directory.
Error: Invalid input plugin: warc

  Usage: ./dappy [input] + [filter] + [output]

Example: echo world | ./dappy lines stdin + rename line=hello + json stdout

Looking at supported types:

$ ./dappy --inputs
 * json
 * lines

It's not there.

We could use this golang library:

Which actually supports reading compressed warc files.

A sample script:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"

	// ASSUMPTION: the library link was lost from this issue; this import
	// path is a guess at a package exposing warc.NewReader and ReadRecord.
	"github.com/slyrz/warc"
)

// godapWarc is the JSON document emitted for each WARC record.
type godapWarc struct {
	Type          string `json:"warc_type"`
	TargetUri     string `json:"warc_target_uri"`
	Id            string `json:"warc_record_id"`
	ContentLength string `json:"content_length"`
	Date          string `json:"warc_date"`
	ContentType   string `json:"content_type"`
	PayloadDigest string `json:"warc_payload_digest"`
	IpAddress     string `json:"warc_ip_address"`
	Content       string `json:"content"`
}

func main() {
	reader, err := warc.NewReader(os.Stdin)
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	for {
		record, err := reader.ReadRecord()
		if err != nil {
			break // stop on end of input or on a read error
		}
		// Read the record body into memory so it can be emitted as a string.
		buf := new(bytes.Buffer)
		buf.ReadFrom(record.Content)
		warcRec := &godapWarc{
			Type:          record.Header["warc-type"],
			TargetUri:     record.Header["warc-target-uri"],
			Id:            record.Header["warc-record-id"],
			ContentLength: record.Header["content-length"],
			ContentType:   record.Header["content-type"],
			Date:          record.Header["warc-date"],
			PayloadDigest: record.Header["warc-payload-digest"],
			IpAddress:     record.Header["warc-ip-address"],
			Content:       buf.String(),
		}
		warcJSON, _ := json.Marshal(warcRec)
		fmt.Println(string(warcJSON))
	}
}

Which when run:

$ cat ~/Downloads/iawide.warc.bz2  | go run warc_reader2.go  | head -n 1 | jq 'keys'

$ cat ~/Downloads/iawide.warc.bz2  | go run warc_reader2.go  | head -n 1
{"warc_type":"warcinfo","warc_target_uri":"","warc_record_id":"\u003curn:uuid:88fbcbee-f24e-47c1-b0c4-f7a9530ceb74\u003e","content_length":"442","warc_date":"2011-02-25T18:32:19Z","content_type":"application/warc-fields","warc_payload_digest":"","warc_ip_address":"","content":"software: Heritrix/3.0.1-SNAPSHOT-20110127.213729\r\nip:\r\nhostname:\r\nformat: WARC File Format 1.0\r\nconformsTo:\r\noperator: [email protected]\r\nisPartOf: wide\r\ndescription: seeds.txt\r\nrobots: obey\r\nhttp-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot +\r\n\r\n"}

Can spit out similar content to dap.

Maybe we could add this to filters part of the factory?

Add expand as a filter to dap?

Currently in dap:

$ echo '{"id": 1, "info.name":"jon snow", "info.dead": false, "info.age": 29}' | dap json + expand info + json
{"id":1,"info.name":"jon snow","info.dead":false,"info.age":29,"info":{"name":"jon snow","dead":false,"age":29}}

However trying go dap:

$ echo '{"id": 1, "info.name":"jon snow", "info.dead": false, "info.age": 29}' | ./dappy json + expand info + json
Error: Invalid filter plugin: expand

We probably could use some variation of this script (inspiration taken from the dap) in ./filter/simple.go:

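package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
	"regexp"
)

func main() {
	// Read a single JSON document from stdin.
	bytes, _ := ioutil.ReadAll(os.Stdin)

	myMap := make(map[string]interface{})
	_ = json.Unmarshal(bytes, &myMap)

	// Match keys of the form "<prefix>.<sub_key>"; the prefix is the first
	// argument, quoted so that it is matched literally.
	pattern := fmt.Sprintf("^%s\\.(?P<sub_key>.+)$", regexp.QuoteMeta(os.Args[1]))
	r := regexp.MustCompile(pattern)

	// Collect the matching sub keys into a nested map.
	tmp := make(map[string]interface{})
	for k, v := range myMap {
		match := r.FindStringSubmatch(k)
		if len(match) > 0 {
			tmp[match[1]] = v
		}
	}
	myMap[os.Args[1]] = tmp

	jsonString, _ := json.Marshal(myMap)
	fmt.Println(string(jsonString))
}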


sample run:

$ echo '{"id": 1, "info.name":"jon snow", "info.dead": false, "info.age": 29}' | dap json + expand info + json
{"id":1,"info.name":"jon snow","info.dead":false,"info.age":29,"info":{"name":"jon snow","dead":false,"age":29}}

add/change godaps support with geoip/mmdb

Currently under usage:

$ go get
$ sudo bash
# mkdir -p /var/lib/geoip && cd /var/lib/geoip && wget && gunzip GeoLiteCity.dat.gz && mv GeoLiteCity.dat geoip.dat

However, running curl -I on the URL returns a 404:

cam-mbp-5971:dap ssikdar$ curl -I
HTTP/1.1 404 Not Found

Looking at the current downloads, it looks like the GeoLite2-City archive is now the file to wget and decompress.

Looking at the contents:

$ ls ~/Downloads/GeoLite2-City_20190409
COPYRIGHT.txt		GeoLite2-City.mmdb	LICENSE.txt		README.txt

godap will probably also need to change to use an mmdb library like this?
