
iocextract's Issues

Various URL extraction issues

Catch-all issue for invalid URLs I find coming through extraction.

http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e

URLs with wildcard/regex:

https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/

Extracts part of the match as a second URL:

i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f

URL is not extracted correctly

When I ran the sample script on a single line of text, the whole line was returned instead of just the extracted URL.

import iocextract

content = \
"""
All the bots are on hxxp://example.com/bad/url these days.
"""

for url in iocextract.extract_urls(content):
    print(url)

The output result is as follows.

$ python3 test.py

All the bots are on hxxp://example.com/bad/url these days.
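For comparison, here is a minimal stdlib-only sketch of hxxp extraction and refanging. The pattern and helper name are illustrative only; iocextract's real URL regex is far more involved:

```python
import re

# Illustrative pattern only -- not iocextract's actual implementation.
HXXP_RE = re.compile(r"\bhxxps?://\S+", re.IGNORECASE)

def refang_hxxp(text):
    """Find hxxp/hxxps URLs and rewrite the defanged scheme back to http."""
    return [re.sub(r"^hxxp", "http", m.group(0), flags=re.IGNORECASE)
            for m in HXXP_RE.finditer(text)]

print(refang_hxxp("All the bots are on hxxp://example.com/bad/url these days."))
```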

PyPi License Mismatch

Hey, just letting you know that on PyPI your package is listed as BSD. This is likely due to your configuration in the setup.py classifiers. Cheers!

Can't decode url throw an error

Traceback (most recent call last):
  File "extract.py", line 18, in <module>
    for i in iocextract.extract_encoded_urls(f.read(), refang=True):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 174: invalid start byte

I created a simple Python script to find URLs in files in the current directory with iocextract, but it throws an error when using extract_encoded_urls.
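One possible workaround (assuming lossy decoding of the input is acceptable) is to drop undecodable bytes before handing the text to the extractor:

```python
# Simulated file content containing an invalid UTF-8 start byte (0xc1).
raw = b"hxxp://example.com/payload \xc1 more text"

# errors="ignore" drops bytes that can't be decoded instead of raising
# UnicodeDecodeError; errors="replace" would keep a U+FFFD marker instead.
text = raw.decode("utf-8", errors="ignore")
print(text)
```

The same applies when opening the file: `io.open(path, 'r', encoding='utf-8', errors='ignore')`, which is how iocextract's own CLI reads `--input`.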

Fails to parse this url correctly

The url is:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>

The trailing > is always stripped off the URL even though it is part of it. When I run extract_iocs I get:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip

I can give the real url that I discovered this issue with, but it is malicious so I didn't want to include it here.

Extracting URLs that have been base64 encoded

Currently, it seems like iocextract extracts only the first URL found in a base64 encoded string.

For example for the following string (original):
'https://google.com https://amazon.com https://microsoft.com http://google.com http://amazon.com http://microsoft.com'
the base64 encoded string is: 'aHR0cHM6Ly9nb29nbGUuY29tIGh0dHBzOi8vYW1hem9uLmNvbSBodHRwczovL21pY3Jvc29mdC5jb20gaHR0cDovL2dvb2dsZS5jb20gaHR0cDovL2FtYXpvbi5jb20gaHR0cDovL21pY3Jvc29mdC5jb20g'
and only the first found URL is returned.

If I change the sequence of the URLs in the original string and then encode it with base64, iocextract will return whichever URL occurs first this time.

Can you please fix this and return all the URLs existing in a base64 encoded string?
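A stdlib-only sketch of the desired behavior: decode the whole payload first, then scan the decoded text, so every URL is returned rather than just the first. The simple findall pattern is a stand-in for iocextract's real regex:

```python
import base64
import re

original = ("https://google.com https://amazon.com https://microsoft.com "
            "http://google.com http://amazon.com http://microsoft.com")
encoded = base64.b64encode(original.encode()).decode()

# Decode the full base64 blob, then extract from the decoded text.
decoded = base64.b64decode(encoded).decode()
urls = re.findall(r"https?://\S+", decoded)
print(urls)  # all six URLs, not just the first
```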

Failed to parse URL correctly

A URL which is surrounded by Japanese characters is not parsed correctly.

print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']

# My expectation is ['http://example.com']

I'm not sure how to fix it. But I think checking TLD might work well.
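One conceivable post-processing step (illustrative only, and it would break legitimately internationalized URLs) is to truncate the match at the first character outside a conservative ASCII URL alphabet:

```python
import re

# Roughly the characters RFC 3986 allows in a URL.
URL_CHARS_RE = re.compile(r"[A-Za-z0-9:/?#\[\]@!$&'()*+,;=._~%-]+")

def trim_url(url):
    """Cut an extracted URL at the first non-URL character."""
    m = URL_CHARS_RE.match(url)
    return m.group(0) if m else url

print(trim_url("http://example.com』あああああ"))
```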

Refang excepts in certain cases

We do the urlparse try/except before modifying the URL, which may cause it to error out after we prepend the scheme. We need to move all the URL modifications before the urlparse test.

File redirection doesn't work

If I run iocextract.py --input info.txt, it correctly prints indicators to what seems to be standard out. However, iocextract.py --input info.txt | less simply gives the "you've got nothing END" in less. It looks like however you're getting the handle to STDOUT, it isn't the actual STDOUT handle.

Tested on OS X 10.14.6 with Python 3.7.6.

Review documentation

Need to review the documentation and verify it's still up to date. Also, it appears to be failing in certain sections.

extract_unencoded_url is too greedy when parsing Windows command lines

I'm parsing input containing examples of PowerShell or cmd.exe command lines. When a command flag with a slash comes after a URL, the flag is included in the extracted URL.

Here is an example:

list(iocextract.extract_unencoded_urls("command.exe https://pypi.org/project/iocextract/ /f"))
  # => ['https://pypi.org/project/iocextract/ /f']

The trailing /f should not be included in the extracted URL.
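A rough post-filter sketch (an assumption about one possible fix, not a proposed patch): since a URL can't contain an unescaped space, splitting the match on whitespace and keeping the first token would drop the trailing flag:

```python
def first_url_token(candidate):
    """Keep only the first whitespace-delimited token of a URL match."""
    return candidate.split()[0]

print(first_url_token("https://pypi.org/project/iocextract/ /f"))
```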

IPv4 extraction doesn't recognize netstat command input

iocextract doesn't seem to recognize any IPv4 addresses in netstat output, since they all end with .<port number> or the protocol name. For example, 10.1.1.117.4222 and 10.1.1.117.https.
It pulls out IPv6 addresses just fine, though.

This would be a super useful addition to have when triaging host events from a DFIR standpoint :)

Any suggested workaround, or is there a possible patch that would cover this?
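As a possible interim workaround, here is a small sketch that pulls the address portion out of netstat-style `a.b.c.d.port` tokens. The regex is illustrative and not part of iocextract:

```python
import re

# Grab the first four dotted octets of tokens like 10.1.1.117.4222 or
# 10.1.1.117.https from netstat-style output.
NETSTAT_IPV4_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})(?=\.|\b)")

def extract_netstat_ipv4s(text):
    return NETSTAT_IPV4_RE.findall(text)

print(extract_netstat_ipv4s("tcp4 0 0 10.1.1.117.4222 10.1.1.5.https ESTABLISHED"))
```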

Exception with some unicode in URLs

Traceback (most recent call last):
  File "iocextract", line 11, in <module>
    sys.exit(main())
  File "local/lib/python2.7/site-packages/iocextract.py", line 433, in main
    for ioc in extract_urls(args.input.read(), refang=args.refang, strip=args.strip_urls):
  File "local/lib/python2.7/site-packages/iocextract.py", line 155, in extract_urls
    url = refang_url(url.group(1))
  File "local/lib/python2.7/site-packages/iocextract.py", line 395, in refang_url
    return parsed.geturl()
  File "/usr/lib64/python2.7/urlparse.py", line 134, in geturl
    return urlunparse(self)
  File "/usr/lib64/python2.7/urlparse.py", line 231, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "/usr/lib64/python2.7/urlparse.py", line 242, in urlunsplit
    url = '//' + (netloc or '') + url
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 17: ordinal not in range(128)

Example url:

https://secure.comodo.net/CPS0C��U���<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+�����h0f0>+��0��2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+��0���http://ocsp.comodoca.com0���U����0���[email protected]

refang_url converts unknown schemes (such as 'tcp') to 'http'

It seems that refanging URLs with a scheme not listed in this line: https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L626
replaces the scheme with 'http': https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L631.

Maybe a hard-coded conversion mapping could be used, e.g.:

refang_schemes = {
    'http': ['hxxp'],
    'https': ['hxxps'],
    'ftp': ['ftx', 'fxp'],
    'ftps': ['ftxs', 'fxps']
}
for scheme, fanged in refang_schemes.items():
    if parsed.scheme in fanged:
        parsed = parsed._replace(scheme=scheme)
        url = parsed.geturl().replace(scheme + ':///', scheme + '://')

        try:
            _ = urlparse(url)
        except ValueError:
            # Last resort on ipv6 fail.
            url = url.replace('[', '').replace(']', '')

        parsed = urlparse(url)

        break

This is not as catch-all as the current solution, but on the other hand it does not alter the indicator.

Example:

In [1]: import iocextract                                                                              

In [2]: content = """tcp://example[.]com:8989/bad"""                                                   

In [3]: list(iocextract.extract_urls(content))                                                         
Out[3]: ['tcp://example[.]com:8989/bad', 'tcp://example[.]com:8989/bad']

In [4]: list(iocextract.extract_urls(content, refang=True))                                            
Out[4]: ['http://example.com:8989/bad', 'http://example.com:8989/bad']

Note: This behavior is shown in the output examples in the README.rst in the 'Usage' section related to refang.

Handle extraction from all files in a directory

It'd be great to be able to provide a directory path to iocextract and have it iterate over all files, extracting IOC's from each as it goes.

For example, I have a directory of malicious SLK files and I want to quickly dump all the URLs. Right now I have to use something like for i in ls; do iocextract --extract-urls --input $i; done

Passing a dir to --input obviously throws an exception due to the argument's use of io.open:

 File "iocextract.py", line 442, in <lambda>
    parser.add_argument('--input', type=lambda x: io.open(x, 'r', encoding='utf-8', errors='ignore'),
IOError: [Errno 21] Is a directory: '/home/adam/research/malware/campaigns/slk-droppers'

Would you be okay with re-working --input to accept a file as input, stdin as an optional positional argument, and add a --dir argument for folders? I can put in a PR if so - or if you have any other suggestions for this use case, that'd be great :D

Add defang function

Add a defang function that accepts a normal URL/domain/IP and returns a defanged version.

Example input/output:

Input                        Output
http://example.com/path.ext  hxxp://example[.]com/path.ext
http://example.com/          hxxp://example[.]com/
example.com                  example[.]com
127.0.0.1                    127[.]0[.]0[.]1

I need this for ThreatIngestor; it makes the most sense to include it here.
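A minimal sketch of what such a function might look like, derived only from the example table above (the exact transformation set would be up to the maintainers):

```python
import re

def defang(ioc):
    """Defang a URL, domain, or IP: hxxp scheme, bracketed host dots."""
    ioc = re.sub(r"^http", "hxxp", ioc)
    scheme, sep, rest = ioc.partition("://")
    if sep:
        # Bracket dots only in the host portion, leaving the path intact.
        host, slash, path = rest.partition("/")
        return scheme + sep + host.replace(".", "[.]") + slash + path
    return ioc.replace(".", "[.]")

print(defang("http://example.com/path.ext"))
```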

Improve extraction for non-defanged URLs

"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls while their non-defanged counterparts don't"

Issue comment: #34 (comment)

how do I add a ioc_type label with the output?

This is probably more of a feature request...
Is there a way with the "extract_iocs" function to have it output the IOC Type next to the IOC?

I have a workaround, but I have to call each function individually.

import iocextract
import pandas as pd
# here 'glob' holds the text being scanned
hashes = pd.DataFrame(iocextract.extract_sha256_hashes(glob), columns=['ioc'])
hashes['ioc_type'] = "sha256_hash"
hashes
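The workaround generalizes: pair each extractor with a label and collect rows in one pass. In this sketch the extractors are simple stand-in regexes rather than iocextract's functions:

```python
import re

# Stand-in patterns; in practice these would be iocextract's extract_* functions.
EXTRACTORS = {
    "md5_hash": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256_hash": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}

def labeled_iocs(text):
    """Return dicts pairing each extracted IOC with its type label."""
    rows = []
    for ioc_type, pattern in EXTRACTORS.items():
        for ioc in pattern.findall(text):
            rows.append({"ioc": ioc, "ioc_type": ioc_type})
    return rows

print(labeled_iocs("seen: d41d8cd98f00b204e9800998ecf8427e"))
```

The resulting rows can be loaded straight into `pandas.DataFrame(rows)`.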

BUG: --extract-ipv4s does not work

Unfortunately it doesn't work. I ran it for quite a while, but aside from stressing one CPU core at 100%, nothing happened; the IPs were not written to the file.
iocextract --input '/home/user/des.txt' --output '/home/user/k1.txt' --extract-ipv4s

Found IPs being parsed as URLs

Hey! I'm currently working with iocextract to read from a text file and convert to a query. I just ran into an issue where the IPs were being extracted as IPs, but they were also being extracted and formatted as URLs.
Input: 101.28[.]225[.]248 ---> Output: RemoteIP =~ "101.28.225.248" or RemoteUrl has "http://101.28.225.248"

Improve IPv6 extraction

Things that look like timestamps, and things like 1:6:0, are getting through. If we can't improve the regex to catch these, maybe add a filter on the iterator?
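One possible filter on the iterator (a sketch, not a proposed patch): validate each candidate with the stdlib before yielding it, which rejects timestamp-like strings such as 1:6:0:

```python
import ipaddress

def is_real_ipv6(candidate):
    """Return True only if the stdlib parses the candidate as IPv6."""
    try:
        ipaddress.IPv6Address(candidate)
        return True
    except ValueError:
        return False

print(is_real_ipv6("2001:db8::1"), is_real_ipv6("1:6:0"))
```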

URL bracket regex is too loose

CDATA[^h00ps://online\(.)americanexpress\(.)com/myca/.*?request_type=authreg_acctAccountSummary]]>

Should stop at the first character not in [\w-\[\]\(\)] when looking backwards. In this case the ^.

Even tighter, we can stop at the first character not in [\w] if it's before a ://.

Add support for custom regex

  • Add a function that takes a list of regex strings as input, compiles them, and runs them against a data input, yielding results.
  • Add a flag to the CLI that takes a file and reads out regex into a list, then passes it to the above function and prints results.

Binary Extraction

I'm looking at how I might use something like this to pull indicators directly from malware binaries. Wondering if it could essentially run strings and extract IOCs. Would also be nice to use this as a Python library.

Extract domain names without URI scheme

I was trying to pull out a list of domains from a text file input (sample of input / expected output below), but I think iocextract doesn't recognize anything without a URI scheme.

Is it possible to include an --extract-domains, or have --extract-urls optionally ignore the scheme for instance? Just random thoughts, not sure the best way to handle this given how complicated the regex is.

If it's any help, this pattern ([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63} should match pretty much any domain name up to the TLD.

matches:

google.com
foo.mywebsite.io
hack-the-planet.com
asdf-fdsa.foo-bar.com
foo-bar.domain.name.com

Sample Input

GLOBAL
Pool    Location    Total Fee/Donations Hashrate    Miners  Link
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP   DE,FR,US,CA,SG  0.6 %   86.79 MH/s  7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL  USA/EU/Asia 0.4-0.6 %   642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL    USA/EU/Asia 1 % 105.52 MH   6155
 minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU  1-1.5 % 26.50 MH/s  37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL   US/UK/AU/JP 0.4 %    API problem     API problem
monero.hashvault.pro

Was hoping to get output of:

  • supportXMR.com
  • xmrpool.net
  • monero.hashvault.pro
  • minergate.com
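As a sanity check, here is a different, simpler sketch pattern (my own illustration, not the one proposed above, and the maintainers would still want to vet it against false positives) run against a slice of the sample input:

```python
import re

# Dotted labels ending in an alphabetic TLD; numeric "TLDs" like 0.6 won't match.
DOMAIN_RE = re.compile(r"\b(?:[A-Za-z0-9_-]+\.)+[A-Za-z]{2,63}\b")

sample = """GLOBAL
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL
xmrpool.net
monero.hashvault.pro"""

print(DOMAIN_RE.findall(sample))
```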

base64 strings

Hey,

I was looking to use this for decoding some base64 strings inside JSON, and it did not seem to find the following when using refang.

  },
      "data": {
        ".dockerconfigjson": "ewoJImF1dGhzIjogewoJCSJjZGUtZG9ja2VyLXJlZ2lzdHJ5LmVpYy5mdWxsc3RyZWFtLmFpIjogewoJCQkiYXV0aCI6ICJZMlJsTFhKbFoybHpkSEo1T21Oa1pTMXlaV2RwYzNSeWVRPT0iCgkJfQoJfQp9"
      },

Any way to improve this at all?

Subdomains and IPs in URLs are not always parsed correctly

Given defanged URLs with an IP address or a subdomain such as:

hXXps://192.168.149[.]100/api/info
hXXps://subdomain.example[.]com/some/path

The GENERIC_URL_RE regex returns the correct results. However, since they are also parsed with the BRACKET_URL_RE regex, additional invalid results are also returned:

http://149.100/api/info
http://example.com/some/path

A simple change seems to fix the problem, assuming I'm not missing some false-positive scenario.

diff --git a/iocextract.py b/iocextract.py
index 8fdb374..dcd25dd 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -66,7 +66,7 @@ GENERIC_URL_RE = re.compile(r"""
 BRACKET_URL_RE = re.compile(r"""
         \b
         (
-            [\:\/\\\w\[\]\(\)-]+
+            [\.\:\/\\\w\[\]\(\)-]+
             (?:
                 \x20?
                 [\(\[]

SHA1 extracts

It appears that the SHA1 extract only pulls the first 32 characters, so the result looks like an MD5 hash.

URL path defang and Email extraction

I noticed that with a URL like hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com, refanging only fixed the netloc portion of the URL, leaving the path defanged. I also made a change to the email regex.

What do you think?

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..fc2d80b 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -124,7 +124,7 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)
 
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
+EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+[\s]*@[\s]*[a-zA-Z0-9-]+[[]*\.[]]*[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -247,7 +247,7 @@ def extract_emails(data):
     :rtype: Iterator[:class:`str`]
     """
     for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+        yield email.group(0).replace(" ", "").replace("[.]", ".")
 
 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -420,6 +420,7 @@ def refang_url(url):
     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):
         parsed = parsed._replace(netloc=parsed.netloc.replace('[', '').replace(']', ''))
+        parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))
 
     return parsed.geturl()

ModuleNotFoundError: No module named 'iocextract'

I installed it on Arch Linux, but unfortunately I only get an error message.

Steps:

sudo pipx install iocextract --force
iocextract -h
$ /usr/bin/iocextract -h
Traceback (most recent call last):
  File "/usr/bin/iocextract", line 5, in <module>
    from iocextract import main
ModuleNotFoundError: No module named 'iocextract'
