
iocextract's Issues

Various URL extraction issues

Catch-all issue for invalid URLs I find coming through extraction.

http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e

URLs with wildcard/regex:

https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/

Extracts part of the match as a second URL:

i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f

URL is not extracted correctly

When I ran the sample script on a single line of text, the whole line was returned instead of just the extracted URL.

import iocextract

content = \
"""
All the bots are on hxxp://example.com/bad/url these days.
"""

for url in iocextract.extract_urls(content):
    print(url)

The output result is as follows.

$ python3 test.py

All the bots are on hxxp://example.com/bad/url these days.
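For comparison, here is a minimal stdlib-only sketch of hxxp extraction and refanging. The pattern and helper name are illustrative only; iocextract's real URL regex is far more involved:

```python
import re

# Illustrative pattern only -- not iocextract's actual implementation.
HXXP_RE = re.compile(r"\bhxxps?://\S+", re.IGNORECASE)

def refang_hxxp(text):
    """Find hxxp/hxxps URLs and rewrite the defanged scheme back to http."""
    return [re.sub(r"^hxxp", "http", m.group(0), flags=re.IGNORECASE)
            for m in HXXP_RE.finditer(text)]

print(refang_hxxp("All the bots are on hxxp://example.com/bad/url these days."))
```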

PyPi License Mismatch

Hey, just letting you know that on PyPI your package is listed as BSD. This is likely due to your configuration in the setup.py classifiers. Cheers!

Can't decode url throw an error

Traceback (most recent call last):
  File "extract.py", line 18, in <module>
    for i in iocextract.extract_encoded_urls(f.read(), refang=True):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 174: invalid start byte

I created a simple Python script to find URLs in files in the current directory with iocextract, but it throws an error when using extract_encoded_urls.
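One possible workaround (assuming lossy decoding of the input is acceptable) is to drop undecodable bytes before handing the text to the extractor:

```python
# Simulated file content containing an invalid UTF-8 start byte (0xc1).
raw = b"hxxp://example.com/payload \xc1 more text"

# errors="ignore" drops bytes that can't be decoded instead of raising
# UnicodeDecodeError; errors="replace" would keep a U+FFFD marker instead.
text = raw.decode("utf-8", errors="ignore")
print(text)
```

The same applies when opening the file: `io.open(path, 'r', encoding='utf-8', errors='ignore')`, which is how iocextract's own CLI reads `--input`.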

Fails to parse this url correctly

The url is:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>

The trailing > is always stripped off the URL even though it is part of it. When I run extract_iocs I get:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip

I can give the real url that I discovered this issue with, but it is malicious so I didn't want to include it here.

Extracting URLs that have been base64 encoded

Currently, it seems like iocextract extracts only the first URL found in a base64 encoded string.

For example for the following string (original):
'https://google.com https://amazon.com https://microsoft.com http://google.com http://amazon.com http://microsoft.com'
the base64 encoded string is: 'aHR0cHM6Ly9nb29nbGUuY29tIGh0dHBzOi8vYW1hem9uLmNvbSBodHRwczovL21pY3Jvc29mdC5jb20gaHR0cDovL2dvb2dsZS5jb20gaHR0cDovL2FtYXpvbi5jb20gaHR0cDovL21pY3Jvc29mdC5jb20g'
and only the first found URL is returned.

If I change the sequence of the URLs in the original string and then encode it with base64, iocextract will return whichever URL occurs first this time.

Can you please fix this and return all the URLs existing in a base64 encoded string?
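A stdlib-only sketch of the desired behavior: decode the whole payload first, then scan the decoded text, so every URL is returned rather than just the first. The simple findall pattern is a stand-in for iocextract's real regex:

```python
import base64
import re

original = ("https://google.com https://amazon.com https://microsoft.com "
            "http://google.com http://amazon.com http://microsoft.com")
encoded = base64.b64encode(original.encode()).decode()

# Decode the full base64 blob, then extract from the decoded text.
decoded = base64.b64decode(encoded).decode()
urls = re.findall(r"https?://\S+", decoded)
print(urls)  # all six URLs, not just the first
```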

Failed to parse URL correctly

A URL which is surrounded by Japanese characters is not parsed correctly.

print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']

# My expectation is ['http://example.com']

I'm not sure how to fix it. But I think checking TLD might work well.
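One conceivable post-processing step (illustrative only, and it would break legitimately internationalized URLs) is to truncate the match at the first character outside a conservative ASCII URL alphabet:

```python
import re

# Roughly the characters RFC 3986 allows in a URL.
URL_CHARS_RE = re.compile(r"[A-Za-z0-9:/?#\[\]@!$&'()*+,;=._~%-]+")

def trim_url(url):
    """Cut an extracted URL at the first non-URL character."""
    m = URL_CHARS_RE.match(url)
    return m.group(0) if m else url

print(trim_url("http://example.com』あああああ"))
```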

Refang excepts in certain cases

We do the urlparse try/except before modifying the URL, which may cause it to error out after we prepend the scheme. We need to move all the URL modifications before the urlparse test.

File redirection doesn't work

If I run iocextract.py --input info.txt, it correctly prints indicators to what seems to be standard out. However, iocextract.py --input info.txt | less simply gives the "you've got nothing END" in less. It looks like however you're getting the handle to STDOUT, it isn't the actual STDOUT handle.

Tested on OS X 10.14.6 with Python 3.7.6.

Review documentation

Need to review the documentation and verify it's still up to date. Also, it appears to be failing in certain sections.

extract_unencoded_url is too greedy when parsing Windows command lines

I'm parsing input containing examples of PowerShell or cmd.exe command lines. When a command flag with a slash comes after a URL, the flag is included in the extracted URL.

Here is an example:

list(iocextract.extract_unencoded_urls("command.exe https://pypi.org/project/iocextract/ /f"))
  # => ['https://pypi.org/project/iocextract/ /f']

The trailing /f should not be included in the extracted URL.
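A rough post-filter sketch (an assumption about one possible fix, not a proposed patch): since a URL can't contain an unescaped space, splitting the match on whitespace and keeping the first token would drop the trailing flag:

```python
def first_url_token(candidate):
    """Keep only the first whitespace-delimited token of a URL match."""
    return candidate.split()[0]

print(first_url_token("https://pypi.org/project/iocextract/ /f"))
```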

IPv4 extraction doesn't recognize netstat command input

iocextract doesn't seem to recognize any IPv4 addresses in netstat output, since they all end with .<port number> or the protocol name. For example, 10.1.1.117.4222 and 10.1.1.117.https.
It pulls out IPv6 addresses just fine, though.

This would be a super useful addition to have when triaging host events from a DFIR standpoint :)

Any suggested workaround, or is there a possible patch that would cover this?
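As a possible interim workaround, here is a small sketch that pulls the address portion out of netstat-style `a.b.c.d.port` tokens. The regex is illustrative and not part of iocextract:

```python
import re

# Grab the first four dotted octets of tokens like 10.1.1.117.4222 or
# 10.1.1.117.https from netstat-style output.
NETSTAT_IPV4_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})(?=\.|\b)")

def extract_netstat_ipv4s(text):
    return NETSTAT_IPV4_RE.findall(text)

print(extract_netstat_ipv4s("tcp4 0 0 10.1.1.117.4222 10.1.1.5.https ESTABLISHED"))
```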

Exception with some unicode in URLs

Traceback (most recent call last):
  File "iocextract", line 11, in <module>
    sys.exit(main())
  File "local/lib/python2.7/site-packages/iocextract.py", line 433, in main
    for ioc in extract_urls(args.input.read(), refang=args.refang, strip=args.strip_urls):
  File "local/lib/python2.7/site-packages/iocextract.py", line 155, in extract_urls
    url = refang_url(url.group(1))
  File "local/lib/python2.7/site-packages/iocextract.py", line 395, in refang_url
    return parsed.geturl()
  File "/usr/lib64/python2.7/urlparse.py", line 134, in geturl
    return urlunparse(self)
  File "/usr/lib64/python2.7/urlparse.py", line 231, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "/usr/lib64/python2.7/urlparse.py", line 242, in urlunsplit
    url = '//' + (netloc or '') + url
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 17: ordinal not in range(128)

Example url:

https://secure.comodo.net/CPS0C��U���<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+�����h0f0>+��0��2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+��0���http://ocsp.comodoca.com0���U����0���[email protected]

refang_url converts unknown schemes (such as 'tcp') to 'http'

It seems that refanging URLs with a scheme not listed in this line: https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L626
replaces the scheme with 'http': https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L631.

Maybe a hard-coded conversion mapping could be used, e.g.:

refang_schemes = {
    'http': ['hxxp'],
    'https': ['hxxps'],
    'ftp': ['ftx', 'fxp'],
    'ftps': ['ftxs', 'fxps']
}
for scheme, fanged in refang_schemes.items():
    if parsed.scheme in fanged:
        parsed = parsed._replace(scheme=scheme)
        url = parsed.geturl().replace(scheme + ':///', scheme + '://')

        try:
            _ = urlparse(url)
        except ValueError:
            # Last resort on ipv6 fail.
            url = url.replace('[', '').replace(']', '')

        parsed = urlparse(url)

        break

This is not as catch-all as the current solution, but on the other hand it does not alter the indicator.

Example:

In [1]: import iocextract                                                                              

In [2]: content = """tcp://example[.]com:8989/bad"""                                                   

In [3]: list(iocextract.extract_urls(content))                                                         
Out[3]: ['tcp://example[.]com:8989/bad', 'tcp://example[.]com:8989/bad']

In [4]: list(iocextract.extract_urls(content, refang=True))                                            
Out[4]: ['http://example.com:8989/bad', 'http://example.com:8989/bad']

Note: This behavior is shown in the output examples in the README.rst in the 'Usage' section related to refang.

Handle extraction from all files in a directory

It'd be great to be able to provide a directory path to iocextract and have it iterate over all files, extracting IOC's from each as it goes.

For example, I have a directory of malicious SLK files and I want to quickly dump all the URLs. Right now I have to use something like for i in ls; do iocextract --extract-urls --input $i; done

Passing a dir to --input obviously throws an exception due to the argument's use of io.open:

 File "iocextract.py", line 442, in <lambda>
    parser.add_argument('--input', type=lambda x: io.open(x, 'r', encoding='utf-8', errors='ignore'),
IOError: [Errno 21] Is a directory: '/home/adam/research/malware/campaigns/slk-droppers'

Would you be okay with re-working --input to accept a file as input, stdin as an optional positional argument, and add a --dir argument for folders? I can put in a PR if so - or if you have any other suggestions for this use case, that'd be great :D

Add defang function

Add a defang function that accepts a normal URL/domain/IP and returns a defanged version.

Example input/output:

Input                        Output
http://example.com/path.ext  hxxp://example[.]com/path.ext
http://example.com/          hxxp://example[.]com/
example.com                  example[.]com
127.0.0.1                    127[.]0[.]0[.]1

I need this for ThreatIngestor; it makes the most sense to include it here.
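A minimal sketch of what such a function might look like, derived only from the example table above (the exact transformation set would be up to the maintainers):

```python
import re

def defang(ioc):
    """Defang a URL, domain, or IP: hxxp scheme, bracketed host dots."""
    ioc = re.sub(r"^http", "hxxp", ioc)
    scheme, sep, rest = ioc.partition("://")
    if sep:
        # Bracket dots only in the host portion, leaving the path intact.
        host, slash, path = rest.partition("/")
        return scheme + sep + host.replace(".", "[.]") + slash + path
    return ioc.replace(".", "[.]")

print(defang("http://example.com/path.ext"))
```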

Improve extraction for non-defanged URLs

"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls while their non-defanged counterparts don't"

Issue comment: #34 (comment)

how do I add a ioc_type label with the output?

This is probably more of a feature request...
Is there a way with the "extract_iocs" function to have it output the IOC Type next to the IOC?

I have a workaround, but I have to call each function individually.

import iocextract
import pandas as pd
# here 'glob' holds the text being scanned
hashes = pd.DataFrame(iocextract.extract_sha256_hashes(glob), columns=['ioc'])
hashes['ioc_type'] = "sha256_hash"
hashes
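The workaround generalizes: pair each extractor with a label and collect rows in one pass. In this sketch the extractors are simple stand-in regexes rather than iocextract's functions:

```python
import re

# Stand-in patterns; in practice these would be iocextract's extract_* functions.
EXTRACTORS = {
    "md5_hash": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256_hash": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}

def labeled_iocs(text):
    """Return dicts pairing each extracted IOC with its type label."""
    rows = []
    for ioc_type, pattern in EXTRACTORS.items():
        for ioc in pattern.findall(text):
            rows.append({"ioc": ioc, "ioc_type": ioc_type})
    return rows

print(labeled_iocs("seen: d41d8cd98f00b204e9800998ecf8427e"))
```

The resulting rows can be loaded straight into `pandas.DataFrame(rows)`.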

BUG: --extract-ipv4s does not work

Unfortunately it doesn't work. I ran it for quite a while, but aside from stressing one CPU core at 100%, nothing happened; the IPs were not written to the file.
iocextract --input '/home/user/des.txt' --output '/home/user/k1.txt' --extract-ipv4s

Found IPs being parsed as URLs

Hey! I'm currently working with iocextract to read from a text file and convert to a query. I just ran into an issue where the IPs were being extracted as IPs, but they were also being extracted and formatted as URLs.
Input: 101.28[.]225[.]248 ---> Output: RemoteIP =~ "101.28.225.248" or RemoteUrl has "http://101.28.225.248"

Improve IPv6 extraction

Things that look like timestamps, and things like 1:6:0, are getting through. If we can't improve the regex to catch these, maybe add a filter on the iterator?
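One possible filter on the iterator (a sketch, not a proposed patch): validate each candidate with the stdlib before yielding it, which rejects timestamp-like strings such as 1:6:0:

```python
import ipaddress

def is_real_ipv6(candidate):
    """Return True only if the stdlib parses the candidate as IPv6."""
    try:
        ipaddress.IPv6Address(candidate)
        return True
    except ValueError:
        return False

print(is_real_ipv6("2001:db8::1"), is_real_ipv6("1:6:0"))
```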

URL bracket regex is too loose

CDATA[^h00ps://online\(.)americanexpress\(.)com/myca/.*?request_type=authreg_acctAccountSummary]]>

Should stop at the first character not in [\w-\[\]\(\)] when looking backwards. In this case the ^.

Even tighter, we can stop at the first character not in [\w] if it's before a ://.

Add support for custom regex

  • Add a function that takes a list of regex strings as input, compiles them, and runs them against a data input, yielding results.
  • Add a flag to the CLI that takes a file and reads out regex into a list, then passes it to the above function and prints results.

Binary Extraction

I'm looking at how I might use something like this to pull indicators directly from malware binaries. Wondering if it could essentially run strings and extract IOCs. Would also be nice to use this as a Python library.

Extract domain names without URI scheme

I was trying to pull out a list of domains from a text file input (sample of input / expected output below), but I think iocextract doesn't recognize anything without a URI scheme.

Is it possible to include an --extract-domains, or have --extract-urls optionally ignore the scheme for instance? Just random thoughts, not sure the best way to handle this given how complicated the regex is.

If it's any help, this pattern ([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63} should match pretty much any domain name up to the TLD.

matches:

google.com
foo.mywebsite.io
hack-the-planet.com
asdf-fdsa.foo-bar.com
foo-bar.domain.name.com

Sample Input

GLOBAL
Pool    Location    Total Fee/Donations Hashrate    Miners  Link
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP   DE,FR,US,CA,SG  0.6 %   86.79 MH/s  7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL  USA/EU/Asia 0.4-0.6 %   642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL    USA/EU/Asia 1 % 105.52 MH   6155
 minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU  1-1.5 % 26.50 MH/s  37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL   US/UK/AU/JP 0.4 %    API problem     API problem
monero.hashvault.pro

Was hoping to get output of:

  • supportXMR.com
  • xmrpool.net
  • monero.hashvault.pro
  • minergate.com
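As a sanity check, here is a different, simpler sketch pattern (my own illustration, not the one proposed above, and the maintainers would still want to vet it against false positives) run against a slice of the sample input:

```python
import re

# Dotted labels ending in an alphabetic TLD; numeric "TLDs" like 0.6 won't match.
DOMAIN_RE = re.compile(r"\b(?:[A-Za-z0-9_-]+\.)+[A-Za-z]{2,63}\b")

sample = """GLOBAL
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL
xmrpool.net
monero.hashvault.pro"""

print(DOMAIN_RE.findall(sample))
```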

base64 strings

Hey,

I was looking to use this for decoding some base64 strings inside JSON, and it did not seem to find the following when using refang.

  },
      "data": {
        ".dockerconfigjson": "ewoJImF1dGhzIjogewoJCSJjZGUtZG9ja2VyLXJlZ2lzdHJ5LmVpYy5mdWxsc3RyZWFtLmFpIjogewoJCQkiYXV0aCI6ICJZMlJsTFhKbFoybHpkSEo1T21Oa1pTMXlaV2RwYzNSeWVRPT0iCgkJfQoJfQp9"
      },

Any way to improve this at all?

Subdomains and IPs in URLs are not always parsed correctly

Given defanged URLs with an IP address or a subdomain such as:

hXXps://192.168.149[.]100/api/info
hXXps://subdomain.example[.]com/some/path

The GENERIC_URL_RE regex returns the correct results. However, since they are also parsed with the BRACKET_URL_RE regex, additional invalid results are also returned:

http://149.100/api/info
http://example.com/some/path

A simple change seems to fix the problem, assuming I'm not missing some false-positive scenario.

diff --git a/iocextract.py b/iocextract.py
index 8fdb374..dcd25dd 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -66,7 +66,7 @@ GENERIC_URL_RE = re.compile(r"""
 BRACKET_URL_RE = re.compile(r"""
         \b
         (
-            [\:\/\\\w\[\]\(\)-]+
+            [\.\:\/\\\w\[\]\(\)-]+
             (?:
                 \x20?
                 [\(\[]

SHA1 extracts

It appears that the SHA1 extract only pulls the first 32 characters, so the result looks like an MD5 hash.

URL path defang and Email extraction

I noticed that with a URL like hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com, refanging only fixed the netloc portion of the URL, leaving the path defanged. I also made a change to the email regex.

What do you think?

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..fc2d80b 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -124,7 +124,7 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)
 
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
+EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+[\s]*@[\s]*[a-zA-Z0-9-]+[[]*\.[]]*[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -247,7 +247,7 @@ def extract_emails(data):
     :rtype: Iterator[:class:`str`]
     """
     for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+        yield email.group(0).replace(" ", "").replace("[.]", ".")
 
 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -420,6 +420,7 @@ def refang_url(url):
     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):
         parsed = parsed._replace(netloc=parsed.netloc.replace('[', '').replace(']', ''))
+        parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))
 
     return parsed.geturl()

ModuleNotFoundError: No module named 'iocextract'

I installed it on Arch Linux, but unfortunately I only get an error message.

Steps:

sudo pipx install iocextract --force
iocextract -h
$ /usr/bin/iocextract -h
Traceback (most recent call last):
  File "/usr/bin/iocextract", line 5, in <module>
    from iocextract import main
ModuleNotFoundError: No module named 'iocextract'
