
pygrok's Introduction

pygrok

Join the chat at https://gitter.im/garyelephant/pygrok

A Python library to parse strings and extract information from structured/unstructured data

What can I use Grok for?

  • parsing and matching patterns in a string (log, message, etc.)
  • avoiding complex regular expressions
  • extracting information from structured/unstructured data

Installation

    $ pip install pygrok

or download, uncompress and install pygrok from here:

    $ tar zxvf pygrok-xx.tar.gz
    $ cd pygrok_dir
    $ sudo python setup.py install

Getting Started

from pygrok import Grok
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old and weighs %{NUMBER:weight} kilograms'
grok = Grok(pattern)
print(grok.match(text))

# {'gender': 'male', 'age': '25', 'name': 'gary', 'weight': '68.5'}

Pretty Cool !

Numbers can be converted from string to int or float if you use the %{pattern:name:type} syntax, such as %{NUMBER:age:int}.

from pygrok import Grok
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age:int} years old and weighs %{NUMBER:weight:float} kilograms'
grok = Grok(pattern)
print(grok.match(text))

# {'gender': 'male', 'age': 25, 'name': 'gary', 'weight': 68.5}

Now age is of type int and weight is of type float.
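
As a rough illustration of what the type suffix does, the sketch below converts captured strings after a match. It uses only the standard re module with plain named groups standing in for the grok references; `CONVERTERS` and `convert_fields` are hypothetical names for this example, not pygrok's internals.

```python
import re

# Hypothetical converter table; the README only mentions int and float.
CONVERTERS = {"int": int, "float": float}

def convert_fields(captures, types):
    """Return a copy of `captures` with each typed field converted."""
    converted = dict(captures)
    for field, type_name in types.items():
        value = converted.get(field)
        if value is not None:
            converted[field] = CONVERTERS[type_name](value)
    return converted

# Plain named groups stand in for %{NUMBER:age:int} and %{NUMBER:weight:float}.
regex = re.compile(
    r"(?P<age>\d+) years old and weighs (?P<weight>\d+(?:\.\d+)?) kilograms"
)
m = regex.search("gary is male, 25 years old and weighs 68.5 kilograms")
result = convert_fields(m.groupdict(), {"age": "int", "weight": "float"})
print(result)
```

Fields that did not take part in the match stay None instead of being converted.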

Awesome !

Some of the patterns you can use are listed here:

`WORD` means \b\w+\b in regular expression.
`NUMBER` means (?:%{BASE10NUM})
`BASE10NUM` means (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))

Other patterns include `IP`, `HOSTNAME`, `URIPATH`, `DATE`, `TIMESTAMP_ISO8601`, and `COMMONAPACHELOG`.
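
To make the nesting concrete (`NUMBER` refers to `BASE10NUM`), here is a minimal sketch of how such references could be expanded into an ordinary regular expression. It is an assumption about the general technique, not pygrok's code: `PATTERNS`, `REFERENCE` and `expand` are names made up for this example, and `BASE10NUM` is simplified so that it runs on the standard re module without atomic groups.

```python
import re

# Simplified stand-ins for the shipped definitions (no atomic group, no lookbehind).
PATTERNS = {
    "WORD": r"\b\w+\b",
    "BASE10NUM": r"[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)",
    "NUMBER": r"(?:%{BASE10NUM})",
}

# Matches %{NAME} and %{NAME:field}.
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")

def expand(pattern):
    """Replace %{NAME} and %{NAME:field} references until none remain."""
    def substitute(match):
        name, field = match.group(1), match.group(2)
        body = PATTERNS[name]
        # A field name becomes a named group; otherwise the group is non-capturing.
        return f"(?P<{field}>{body})" if field else f"(?:{body})"

    while REFERENCE.search(pattern):
        pattern = REFERENCE.sub(substitute, pattern)
    return pattern

regex = re.compile(expand("%{WORD:name} is %{NUMBER:age} years old"))
captures = regex.search("gary is 25 years old").groupdict()
print(captures)
```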

See All patterns here

You can also define custom patterns; see these code examples.

More details

Because the Python re module did not support the atomic grouping syntax (?>...) before Python 3.11, pygrok requires the regex module to be installed.

pygrok is inspired by Grok, developed by Jordan Sissel. It is not a wrapper around Jordan Sissel's Grok; it was implemented entirely by me.

Grok is a simple tool that lets you easily parse strings, logs and other files. With grok, you can turn unstructured log and event data into structured data. Pygrok does the same thing.

I recommend having a look at the logstash grok filter; it explains how Grok-like tools work.

The pattern files come from the logstash grok filter's pattern files.

Contribute

  • You are encouraged to fork, improve the code, then make a pull request.
  • Issue tracker

Get Help

mail:[email protected]
twitter:@garyelephant

Contributors

Thanks to all contributors

pygrok's People

Contributors

garyelephant, gitter-badger, jerryleooo, moebiuseye, rs-natano, tmessi

pygrok's Issues

grok.findall()

I occasionally use this to filter events in syslog and kern.log. Splitting the log file by "\n" and then grokking each line is slow, so I added a .findall() method.

Is this repo still being maintained? Would a pull request with a .findall() method be welcome or useful to anyone?
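
For readers wondering what such a method might look like, here is a minimal sketch of the idea using the standard re module: one multi-line regex is scanned over the whole text instead of splitting it into lines first. The `findall` helper and the `LINE` pattern are hypothetical and are not the code from the proposed pull request.

```python
import re

# Stand-in for a compiled grok pattern: kernel lines that mention "error".
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) kernel: (?P<message>.*error.*)$",
    re.MULTILINE,
)

def findall(regex, text):
    """Return one dict of named captures for every match in `text`."""
    return [m.groupdict() for m in regex.finditer(text)]

log = (
    "10:00:01 kernel: usb error -71\n"
    "10:00:02 kernel: link up\n"
    "10:00:03 kernel: disk error\n"
)
events = findall(LINE, log)
print(events)
```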

regex_core.error issues when using complex grok patterns using EPEL python-regex-2015.06.24

pygrok seems to be having issues processing complex grok patterns such as:

SYSLOGBASE

    match_obj = re.search(py_regex_pattern, text)
  File "/usr/lib64/python2.6/site-packages/regex.py", line 251, in search
    return _compile(pattern, flags, kwargs).search(string, pos, endpos,
  File "/usr/lib64/python2.6/site-packages/regex.py", line 499, in _compile
    caught_exception.pos)
_regex_core.error: bad fuzzy constraint at position 19

PEP 517 compatibility

Description

Run pip install pygrok to see the deprecation notice.

Message

  DEPRECATION: promise is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

PEP8

It's hard to contribute when the code doesn't follow PEP8 rules.

Pygrok New API

From v1.0, pygrok will get a big performance improvement and the API will change. The new API will probably look like this:

from pygrok import Grok

pattern = 'load average: %{NUMBER:load_1:float}, %{NUMBER:load_2:float}, %{NUMBER:load_3:float}'

text = 'load average: 1.88, 1.73, 1.49'

grok = Grok(pattern)

m = grok.match(text)

This will output parsed data:

{'load_1': 1.88, 'load_2': 1.73, 'load_3': 1.49}

With a precompiled regex pattern inside, you can invoke grok.match thousands of times without much performance loss. This will address #3.
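
The compile-once idea can be sketched with the standard re module as follows. `LoadAverageParser` and the simplified `NUMBER` regex are illustrative assumptions, not the planned pygrok implementation.

```python
import re

NUMBER = r"[0-9]+(?:\.[0-9]+)?"  # simplified stand-in for %{NUMBER}

class LoadAverageParser:
    """Compiles its regex once in __init__ and reuses it in match()."""

    def __init__(self):
        self.regex = re.compile(
            rf"load average: (?P<load_1>{NUMBER}), "
            rf"(?P<load_2>{NUMBER}), (?P<load_3>{NUMBER})"
        )

    def match(self, text):
        m = self.regex.search(text)
        if m is None:
            return None
        # Mirrors the :float suffix from the example above.
        return {name: float(value) for name, value in m.groupdict().items()}

parser = LoadAverageParser()
# The same compiled regex serves every call.
results = [parser.match("load average: 1.88, 1.73, 1.49") for _ in range(1000)]
print(results[0])
```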

_regex_core.error when using grok patterns

I'm having an issue when I use pygrok to process an AWS ELB log. I'm using pygrok 0.7.4 and regex (2015.11.22). This is the error:

match_obj = re.search(py_regex_pattern, text)
  File "build/bdist.linux-x86_64/egg/regex.py", line 265, in search
  File "build/bdist.linux-x86_64/egg/regex.py", line 490, in _compile
_regex_core.error: bad fuzzy constraint at position 1554

I think this may be related to issue #5.

Infinite loop on PATH

This takes forever; it is pretty much an infinite loop. If you change the log slightly to remove the numbers, or keep only a single-digit number, it works.

log = '"/var/tmp/testfile_3234234_232.log"'
p = r"^\"%{PATH:path}\" yeh$"
grok = Grok(p)
print (grok.match(log))

Bad Fuzzy constraint error when casting value to type

Logstash supports this type of syntax:

%{NUMBER:key:int}

I'm using pygrok to write unit tests (for a logstash project I'm working on), and my grok patterns rely on this to cast values to integers.

This makes pygrok complain about the syntax, which in a chain reaction causes me to complain about it here. 😄

Here's an extract from my unit test script:

--------------------------------------------------
PATTERN == %{DATA:test:int}
[TEXT]  1989
Traceback (most recent call last):
  File "./ci/grok.py", line 20, in <module>
    result=pygrok.grok_match(text,pattern)
  File "/usr/lib/python2.7/site-packages/pygrok/pygrok.py", line 58, in grok_match
    match_obj = re.search(py_regex_pattern, text)
  File "/usr/lib64/python2.7/site-packages/regex.py", line 265, in search
    return _compile(pattern, flags, kwargs).search(string, pos, endpos,
  File "/usr/lib64/python2.7/site-packages/regex.py", line 490, in _compile
    caught_exception.pos)
_regex_core.error: bad fuzzy constraint at position 3

I may find the time to look into the code and make a pull request for this.
If someone can point me to the file(s) I should look into, that would certainly make it easier for me. 😉

move middleware to the right

The middleware should be moved to the right side of the pipeline, probably between "cache" and "bulk".

Otherwise, other middleware that changes the client response won't get handled, eg large objects.

Release current version to pypi

Hi,

The latest version on PyPI is from Sep 24, 2016. In the meantime some changes have happened. Could you please push a new release from the current master?

TypeError: match() takes exactly 2 arguments (3 given)

Your example says:
from pygrok import Grok
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age:int} years old and weighs %{NUMBER:weight:float} kilograms'
grok = Grok(pattern)
print grok.match(text, pattern)

##{'gender': 'male', 'age': 25, 'name': 'gary', 'weight': 68.5}

When I tried it, I got the following error:

from pygrok import Grok
text = 'gary is male, 25 years old and weighs 68.5 kilograms'
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age:int} years old and weighs %{NUMBER:weight:float} kilograms'
grok = Grok(pattern)
print grok.match(text, pattern)
Traceback (most recent call last):
File "", line 1, in
TypeError: match() takes exactly 2 arguments (3 given)

pygrok doesn't allow for unnamed capture groups

Logstash grok allows for capture groups that don't have names, but pygrok does not.

example:
log message
May 29 13:47:44 192.168.93.1 1527601664.182105556 SDCEG_SB flows src=192.168.92.250 dst=192.168.93.244 protocol=icmp type=8 pattern: allow all

grok pattern:
Note the third group doesn't have a name.
%{SYSLOGTIMESTAMP:timestamp} %{IP:source} %{DATA:} %{NUMBER:epoch_time} %{WORD:device} flows src=%{IP:src_ip} dst=%{IP:dst_ip} (mac=%{MAC:mac} )?protocol=%{WORD:protocol} (?:(type=%{POSINT:protocol_type} )|(sport=%{POSINT:src_port} dport=%{POSINT:dst_port} ))pattern: (?<pattern>.*)

Results from http://grokdebug.herokuapp.com/ (with named capture only turned on) and from running it on logstash:

{
  "timestamp": [
    [
      "May 29 13:47:44"
    ]
  ],
  "source": [
    [
      "192.168.93.1"
    ]
  ],
  "epoch_time": [
    [
      "1527601664.182105556"
    ]
  ],
  "device": [
    [
      "SDCEG_SB"
    ]
  ],
  "src_ip": [
    [
      "192.168.92.250"
    ]
  ],
  "dst_ip": [
    [
      "192.168.93.244"
    ]
  ],
  "mac": [
    [
      null
    ]
  ],
  "protocol": [
    [
      "icmp"
    ]
  ],
  "protocol_type": [
    [
      "8"
    ]
  ],
  "src_port": [
    [
      null
    ]
  ],
  "dst_port": [
    [
      null
    ]
  ],
  "pattern": [
    [
      "allow all"
    ]
  ]
}

results from pygrok:

$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pygrok import Grok
>>> message = 'May 29 13:47:44 192.168.93.1  1527601664.182105556 SDCEG_SB flows src=192.168.92.250 dst=192.168.93.244 protocol=icmp type=8 pattern: allow all'
>>> pattern = '%{SYSLOGTIMESTAMP:timestamp} %{IP:source} %{DATA:} %{NUMBER:epoch_time} %{WORD:device} flows src=%{IP:src_ip} dst=%{IP:dst_ip} (mac=%{MAC:mac} )?protocol=%{WORD:protocol} (?:(type=%{POSINT:protocol_type} )|(sport=%{POSINT:src_port} dport=%{POSINT:dst_port} ))pattern: (?<pattern>.*)'
>>> 
>>> print(Grok(pattern).match(message))
None

but when I add a name to the third capture group it works in pygrok:

>>> pattern = '%{SYSLOGTIMESTAMP:timestamp} %{IP:source} %{DATA:thing} %{NUMBER:epoch_time} %{WORD:device} flows src=%{IP:src_ip} dst=%{IP:dst_ip} (mac=%{MAC:mac} )?protocol=%{WORD:protocol} (?:(type=%{POSINT:protocol_type} )|(sport=%{POSINT:src_port} dport=%{POSINT:dst_port} ))pattern: (?<pattern>.*)'
>>> print(Grok(pattern).match(message))
{'src_port': None, 'epoch_time': '1527601664.182105556', 'timestamp': 'May 29 13:47:44', 'src_ip': '192.168.92.250', 'thing': '', 'mac': None, 'source': '192.168.93.1', 'pattern': 'allow all', 'device': 'SDCEG_SB', 'protocol': 'icmp', 'dst_ip': '192.168.93.244', 'protocol_type': '8', 'dst_port': None}

EDIT:
Upon further review, it appears that pygrok does not accept unnamed patterns that have a : in them,
i.e. %{DATA} vs %{DATA:}

>>> from pygrok import Grok
>>> pattern = '%{DATA}'
>>> print(Grok(pattern).match('bob'))
{}
>>> pattern = '%{DATA:}'
>>> print(Grok(pattern).match('bob'))
None
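
Until this is handled in the library, one possible workaround is to strip the empty name before building the pattern, since the session above shows %{DATA} being accepted. The sketch below is an assumption about a pre-processing step, and `drop_empty_names` is a hypothetical helper, not part of pygrok.

```python
import re

# Matches references with an empty field name, such as %{DATA:}.
EMPTY_NAME = re.compile(r"%\{(\w+):\}")

def drop_empty_names(pattern):
    """Rewrite %{NAME:} as %{NAME}, leaving named references untouched."""
    return EMPTY_NAME.sub(r"%{\1}", pattern)

cleaned = drop_empty_names("%{IP:source} %{DATA:} %{NUMBER:epoch_time}")
print(cleaned)
```

The cleaned string could then be passed to Grok(...) in place of the original pattern.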

Parsing Cisco Extended Access Control Lists sometimes fails silently

I've written a grok pattern to parse Cisco Extended ACLs. Mostly this works fine; however, for 0.2% of my ACLs the resulting variable (grok_line = grok.match(line)) is None.

I included my script (parse_cisco_acl.py) as well as some example ACLs which fail to parse (infile.txt).

If you have any pointers on how I can debug this myself, I'll look into it.

parse_cisco_acl.py:

#!/usr/bin/env python
"""Reads a text-file, uses grok to extract fields from each line."""

from pygrok import Grok
import csv
import os

# Source for the ACL specification:
# Cisco ASA 5500 Series Command Reference, Version 8.2(5)
# https://www.cisco.com/c/en/us/td/docs/security/asa/asa82/command/reference/cmd_ref.pdf
# 
# 1 access-list id [line line-number] [extended] {deny | permit}
# 2 {protocol | object-group protocol_obj_grp_id}
# 3 {src_ip mask | interface ifc_name | host hostname | object-group network_obj_grp_id}
# 4 [object-group service_obj_grp_id | operator port]
# 5 {dst_ip mask | interface ifc_name | host hostname | object-group network_obj_grp_id}
# 6 [object-group service_obj_grp_id | operator port | object-group icmp_type_obj_grp_id]
# 7 [log [[level] [interval secs] | disable | default]]
# 8 [inactive | time-range time_range_name]
# 9 (hitcnt=hitcount) hashcode
#

acl_patterns = {'ANY': '\b(?:any(4)?)\b',
'LOGSTATUS': '\b(?:disable|default)\b',
'RANGE': '(?:%{IPORHOST}|%{POSINT})',
'ACL_LINE1': 'access-list\s%{USERNAME:policy_id}\sline\s\d+\sextended\s%{WORD:action}',
'ACL_LINE2': '((object-group|object)\s%{USERNAME:dst_service}|%{WORD:protocol})',
'ACL_LINE3': '((object-group|object)\s%{USERNAME:src_object_group}|interface\s%{USERNAME:interface}|host\s%{IP:src_ip}|%{ANY:src_host}|%{IP:src_ip}\s%{IP:src_mask})',
'ACL_LINE4': '((object-group|object)\s%{USERNAME:dst_service}|%{WORD:operator}\s%{RANGE:rangestart}(\s%{RANGE:rangeend})?)',
'ACL_LINE5': '((object-group|object)\s%{USERNAME:dst_object_group}|interface\s%{USERNAME:interface}|host\s%{IP:dst_ip}|%{ANY:dst_host}|%{IP:dst_ip}\s%{IP:dst_mask})',
'ACL_LINE6': '((object-group|object)\s%{USERNAME:dst_service}|%{WORD:operator}\s%{RANGE:rangestart}(\s%{RANGE:rangeend})?|%{USERNAME:icmptype})',
'ACL_LINE7': '(log\s((%{INT:loglevel})?(\sinterval\s%{NUMBER:loginterval})?|%{LOGSTATUS:logstatus}))',
'ACL_LINE8': 'inactive|\stime-range\s%{USERNAME:timerange}',
'ACL_LINE9': '\(hitcnt=%{NUMBER:hitcnt}\)\s%{USERNAME:hashcode}((\s|\r|\n)*)?'
}

pattern = '(%{ACL_LINE1:line1}\s%{ACL_LINE2:line2}\s%{ACL_LINE3:line3}(\s%{ACL_LINE4:line4})?\s%{ACL_LINE5:line5}(\s%{ACL_LINE6:line6})?(\s%{ACL_LINE7:line7})?(\s%{ACL_LINE8:line8})?\s%{ACL_LINE9:line9}|%{ACL_LINE1:line1}\s%{ACL_LINE2:line2}\s%{ACL_LINE3:line3}(\s%{ACL_LINE4:line4})?(\s%{ACL_LINE6:line6})?(\s%{ACL_LINE7:line7})?(\s%{ACL_LINE8:line8})?\s%{ACL_LINE9:line9})'



infile="infile.txt"

def main():
  grok = Grok(pattern, custom_patterns=acl_patterns)
  infilept = open(infile, "r")
  gpf=0

  """Parse all lines of input file with grok"""
  for line in infilept:
    if "line" in line and "remark" not in line:
      grok_line = grok.match(line)
      if grok_line is None:
        print("Grok parse failure:\n"+line)
        gpf+=1
        continue
  print("Finished parsing file, number of grok parse failures: "+str(gpf))

if __name__ == "__main__":
    main()

infile.txt:

access-list vlan123-in line 335 extended permit tcp 1.2.3.4 255.255.0.0 1.2.1.1 255.255.255.128 range 2217 2223 log disable (hitcnt=1583) 0x60a9c03b
access-list vlan123-in line 403 extended permit ip any4 1.5.6.0 255.255.255.0 (hitcnt=185048) 0xdc331198
access-list vlan123-in line 404 extended permit tcp any4 object-group foo eq 5723 log disable (hitcnt=0) 0x86f049d0
access-list vlan123-in line 404 extended permit tcp any4 host 1.2.1.6 eq 5723 log disable (hitcnt=0) 0xadc8cd80
access-list vlan123-in line 405 extended permit tcp any4 host 1.2.1.7 eq www (hitcnt=0) 0x14fd6d81
access-list vlan123-in line 517 extended permit icmp any4 any4 (hitcnt=85402033) 0x0674f896
access-list vlan123-in line 726 extended permit tcp any4 object-group foo eq 2003 (hitcnt=4616243) 0x75d35eaf
access-list vlan123-in line 932 extended permit tcp 1.2.1.0 255.255.128.0 host 1.5.6.4 eq netbios-ssn log default (hitcnt=0) 0x8337e616
access-list vlan123-in line 1008 extended deny ip any4 object foobar (hitcnt=134) 0x985e0953
access-list vlan123-in line 1008 extended deny ip any4 host 1.5.6.2 (hitcnt=134) 0x985e0953
access-list vlan123-in line 1162 extended permit ip any4 object-group foo log disable (hitcnt=17503) 0x06dabb44
access-list vlan22-in line 6 extended deny tcp any4 object bar eq 4786 (hitcnt=0) 0x22500030
access-list vlan22-in line 6 extended deny tcp any4 1.2.2.0 255.255.254.0 eq 4786 (hitcnt=0) 0x22500030
access-list vlan22-in line 35 extended deny tcp any4 host 1.2.1.6 eq https (hitcnt=0) 0x24ad7386
access-list vlan22-in line 36 extended deny tcp any4 host 1.2.1.6 eq https (hitcnt=530) 0x0f90e0f2
access-list vlan22-in line 330 extended permit tcp any4 host 1.2.6.2 eq www (hitcnt=0) 0xe5729237
access-list vlan22-in line 331 extended permit tcp any4 host 1.2.6.1 eq https (hitcnt=0) 0x6d7e0e31
access-list vlan22-in line 385 extended permit ip any4 1.4.5.0 255.255.254.0 (hitcnt=235) 0x37911bb7
access-list foo-in line 16 extended permit tcp host 1.2.1.8 1.5.8.0 255.255.128.0 range 48000 48010 log disable (hitcnt=0) 0x8be36362
access-list foo-in line 42 extended deny ip any4 object-group foo (hitcnt=434091) 0x72e5d595
access-list foo-in line 42 extended deny ip any4 9.2.4.0 255.255.254.0 (hitcnt=0) 0x94e6be14
access-list foo-in line 193 extended permit ip any4 object-group bar (hitcnt=11383) 0x0a04d092
access-list foo-in line 393 extended permit object icmp-echo any4 any4 (hitcnt=205452616) 0x8536e84b
access-list foo-in line 393 extended permit icmp any4 any4 echo (hitcnt=205452616) 0x8536e84b
access-list foo-in line 395 extended permit object icmp-time-exceeded any4 any4 (hitcnt=577) 0xd8718f72
access-list vlan66-in line 34 extended permit icmp any4 any4 (hitcnt=192) 0xc66ea5c5 
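
One thing that may be worth checking in the script above: the `ANY` and `LOGSTATUS` patterns are written as ordinary (non-raw) string literals, and in such a literal Python turns `\b` into a backspace character before any regex engine sees it. The sketch below shows the difference with the standard re module only. Whether this accounts for the failures in this report is an assumption, since it does not run pygrok.

```python
import re

plain = '\b(?:any(4)?)\b'   # as written in the script: \b becomes backspace (\x08)
raw = r'\b(?:any(4)?)\b'    # raw string: \b reaches the engine as a word boundary

line = "permit ip any4 1.5.6.0 255.255.255.0"
plain_hit = re.search(plain, line)   # needs literal backspaces around "any4"
raw_hit = re.search(raw, line)       # matches the word "any4"

print(repr(plain[0]), plain_hit, raw_hit.group(0))
```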

Deprecation warning due to invalid escape sequences in Python 3.8

find . -iname '*.py'  | xargs -P 4 -I{} python3.8 -Wall -m py_compile {}

./tests/test_pygrok.py:89: DeprecationWarning: invalid escape sequence \[
  pat = '%{HOSTNAME:host} %{IP:client_ip} %{NUMBER:delay}s - \[%{DATA:time_stamp}\]' \
./pygrok/pygrok.py:85: DeprecationWarning: invalid escape sequence \w
  if re.search('%{\w+(:\w+)?}', py_regex_pattern) is None:
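
A common way to silence this kind of warning is to write the patterns as raw strings, so the backslashes are passed through to the regex engine unchanged. Below is a minimal sketch using the check quoted from pygrok/pygrok.py, run here with the standard re module; the `REFERENCE` name is made up for this example.

```python
import re

# Raw-string form of the check quoted above; no invalid escape sequences remain.
REFERENCE = re.compile(r'%{\w+(:\w+)?}')

has_reference = REFERENCE.search('%{DATA:time_stamp}') is not None
no_reference = REFERENCE.search('plain text') is None
print(has_reference, no_reference)
```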

Non-ASCII characters in the default patterns cause a Grok exception

The exception is as follows:

Traceback (most recent call last):
  File "/Users/thuhak.zhou/PycharmProjects/dnslog-parser/dnslog-parser.py", line 68, in <module>
    query_pat = Grok('%{WORD}')
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pygrok/pygrok.py", line 15, in __init__
    self.predefined_patterns = _reload_patterns(DEFAULT_PATTERNS_DIRS)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pygrok/pygrok.py", line 83, in _reload_patterns
    patterns = _load_patterns_from_file(os.path.join(dir, f))
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pygrok/pygrok.py", line 94, in _load_patterns_from_file
    for l in f:
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3122: ordinal not in range(128)

Where the problem occurs:

In grok-patterns:

MONTH \b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b

Multiple Patterns

Does pygrok support matching a string against multiple patterns?

Is there an example of how to do that?
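
A minimal sketch of one way to do this without library support: try each pattern in order and return the first hit. `first_match` and `regex_matcher` are hypothetical helpers. With pygrok, the matchers could presumably be a list of bound methods such as Grok(p).match, assuming match returns a dict on success and None on failure as shown elsewhere on this page.

```python
import re

def first_match(matchers, text):
    """Try each matcher in order; return (index, captures) for the first hit."""
    for index, matcher in enumerate(matchers):
        captures = matcher(text)
        if captures is not None:
            return index, captures
    return None

def regex_matcher(source):
    """Wrap a regex so it behaves like a match method returning dict or None."""
    compiled = re.compile(source)

    def match(text):
        m = compiled.search(text)
        return m.groupdict() if m else None

    return match

matchers = [
    regex_matcher(r"(?P<user>\w+) logged in"),
    regex_matcher(r"(?P<user>\w+) logged out after (?P<minutes>\d+) minutes"),
]
hit = first_match(matchers, "gary logged out after 25 minutes")
print(hit)
```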

Put it on PyPI

Hi, it would be great if you put this project on PyPI so that it can be simply installed.

use WSGIContext

Use the swift.common.wsgi.WSGIContext helper class to ensure that the response is ready for you to consume before actually consuming it. See the catch_errors middleware for a simple example. staticweb is a more complex example.

regex_core.error: multiple repeat at position 1146

I was just playing with pygrok; I ran the following, which might be a useful basis for a test, and got the titular error.

#!/usr/bin/env python3
import os
import gc
import itertools as it
import pygrok as pg

fs = it.chain(*(os.scandir(d) for d in 
                pg.DEFAULT_PATTERNS_DIRS))
ps = it.chain(*(open(f.path).readlines() for f in fs))
gc.collect() # release file descriptors, I think. xonsh got upset without it.

for p in ps:
    try:
        pg.Grok(p)
    except Exception as e:
        print(p, e)

The offending pattern is

SHOREWALL (%{SYSLOGTIMESTAMP:timestamp}) (%{WORD:nf_host}) kernel:.*Shorewall:(%{WORD:nf_action1})?:(%{WORD:nf_action2})?.*IN=(%{USERNAME:nf_in_interface})?.*(OUT= *MAC=(%{COMMONMAC:nf_dst_mac}):(%{COMMONMAC:nf_src_mac})?|OUT=%{USERNAME:nf_out_interface}).*SRC=(%{IPV4:nf_src_ip}).*DST=(%{IPV4:nf_dst_ip}).*LEN=(%{WORD:nf_len}).?*TOS=(%{WORD:nf_tos}).?*PREC=(%{WORD:nf_prec}).?*TTL=(%{INT:nf_ttl}).?*ID=(%{INT:nf_id}).?*PROTO=(%{WORD:nf_protocol}).?*SPT=(%{INT:nf_src_port}?.*DPT=%{INT:nf_dst_port}?.*)
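
The `.?*` sequences in this pattern apply a quantifier to an item that is already quantified, which the standard re module also rejects as a multiple repeat. The sketch below reproduces that on a shortened fragment. Treating `.*?` as the intended form is an assumption, and `compiles` is a helper made up for this example.

```python
import re

def compiles(source):
    """Return True if `source` is a valid regex for the standard re module."""
    try:
        re.compile(source)
        return True
    except re.error:
        return False

broken_ok = compiles(r"LEN=(\w+).?*TOS=")  # quantifier applied to a quantifier
fixed_ok = compiles(r"LEN=(\w+).*?TOS=")   # lazy "any characters" instead
print(broken_ok, fixed_ok)
```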

fix md5 etag mismatch on response

I think there are a couple of options here.

  1. Simplest is to drop the etag header from the response.

  2. Generate a new etag that is derived from the object etag and something about grok. eg Etag: "-". This will likely still cause an error in the swift CLI tool, but it gives clients that know about grok a better chance to use conditional requests with grok

Elaborate errors when grok_match failed

Raise GrokParseFailure exception and elaborate error in error message when:

  • NoSuchPattern : pattern not found
  • NoSuchType : type beyond int or float
  • WrongType : string could not be converted to corresponding type(int, float)
  • any other errors
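
The error cases listed above could be sketched as a small exception hierarchy. The class names come from the list; the `convert` helper, its signature and the message wording are assumptions made for illustration.

```python
class GrokParseFailure(Exception):
    """Base class for the failures listed above."""

class NoSuchPattern(GrokParseFailure):
    """A referenced pattern name is not defined."""

class NoSuchType(GrokParseFailure):
    """The requested type is neither int nor float."""

class WrongType(GrokParseFailure):
    """The captured string could not be converted to the requested type."""

CONVERTERS = {"int": int, "float": float}

def convert(field, value, type_name):
    """Convert one captured value, raising a descriptive error on failure."""
    if type_name not in CONVERTERS:
        raise NoSuchType(f"field {field!r}: unsupported type {type_name!r}")
    try:
        return CONVERTERS[type_name](value)
    except ValueError as exc:
        raise WrongType(
            f"field {field!r}: cannot convert {value!r} to {type_name}"
        ) from exc

print(convert("age", "25", "int"))
```

Because every class derives from GrokParseFailure, callers can catch the base class when they do not care which case occurred.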

Parsing patterns fails if LANG environment variables not set

The file grok-patterns contains an ä character, and pygrok fails to read the file if the LANG environment variable is not set to something that supports it.
Since the files are shipped as UTF-8, the encoding should be enforced when reading them.
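
A minimal sketch of enforcing the encoding, assuming the pattern files use one "NAME regex" entry per line: passing encoding="utf-8" to open() makes the result independent of the locale. `load_patterns` is a hypothetical loader written for this example, not pygrok's own function.

```python
import os
import tempfile

def load_patterns(path):
    """Read 'NAME regex' lines from a pattern file, always decoding as UTF-8."""
    patterns = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            name, _, body = line.partition(" ")
            patterns[name] = body.strip()
    return patterns

# Demonstration with a temporary file that contains a non-ASCII character.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("# months\nMONTH \\b(?:M(?:a|ä)?r(?:ch|z)?)\\b\n")
loaded = load_patterns(tmp.name)
os.remove(tmp.name)
print(loaded)
```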

0.7.4 to pypi

The latest version of pygrok is not on PyPI. It would be nice if it were up there so we don't have to install it via the GitHub link.

grok support matrix?

What is the feature parity between pygrok and the original logstash grok?

If full, this should be written somehow (and have tests to prove..).

If partial, can you please add a support matrix, so it's clear to users what's supported and what's not, and also so it's easier to contribute back?

Parsing Speed 200x slower than regex

I was incorporating pygrok into a project to parse blacklisted IPs. The data is only about 0.5 MB in total, in 2 or 3 files. It took 70 seconds to parse every row in the files. I didn't realize this was due to pygrok, so I added multi-threading, and the speed is still considerably slow. The files are on an HDD, not an SSD, but for a 0.5 MB file it shouldn't take that long either way.

I have now replaced pygrok with my own regex pattern, and it loaded and parsed the records within 0.04 seconds. Why is the speed so different? I understand grok patterns also run on regular expressions at the lower level. Is there anything I can do to help investigate this?

I just did a test with 2.2 MB; it took 200 seconds. The same files took 0.187 seconds with regex. Why is the difference so large?

Additional Info
Pattern : %{IP:HOST} %{GREEDYDATA:DESC}
The file excerpt:

46.55.xxx.197 Malicious Host
221.xxx.13.22 Malicious Host
222.xxx.190.71 Malicious Host
