inquest / iocextract
Defanged Indicator of Compromise (IOC) Extractor.
Home Page: https://inquest.readthedocs.io/projects/iocextract/
License: GNU General Public License v2.0
Catch-all issue for invalid URLs I find that come through extraction.
http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e
URLs with wildcard/regex:
https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/
Extracts part of the match as a second URL:
i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f
When I ran the sample script with one line of text, all of the text was displayed instead of the extracted URL.
import iocextract

content = """
All the bots are on hxxp://example.com/bad/url these days.
"""

for url in iocextract.extract_urls(content):
    print(url)
The output result is as follows.
$ python3 test.py
All the bots are on hxxp://example.com/bad/url these days.
Hey, just letting you know that on PyPI your package is listed as BSD. This is likely due to the classifiers configuration in setup.py. Cheers!
Traceback (most recent call last):
File "extract.py", line 18, in <module>
for i in iocextract.extract_encoded_urls(f.read(), refang=True):
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 174: invalid start byte
I created a simple Python script to find URLs in files in the current directory with iocextract, but it throws an error when using extract_encoded_urls.
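One possible workaround (an assumption about the use case, not a documented iocextract option) is to read the file with errors="ignore" so undecodable bytes are dropped before the text ever reaches extract_encoded_urls:

```python
import io

def read_lossy_utf8(path):
    # Decode as UTF-8 but silently drop bytes (like the 0xc1 in the
    # traceback) that are not valid UTF-8, instead of raising
    # UnicodeDecodeError.
    with io.open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

# The same decode behaviour, shown on raw bytes:
raw = b"hxxp://example[.]com \xc1 trailing junk"
text = raw.decode("utf-8", errors="ignore")
# The invalid byte is dropped; the defanged URL survives intact.
```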
In progress...
Example iocextract --input 'https://toast.home.us/random' --output '/home/user/k1.txt' --extract-ipv4s
The url is:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>
the trailing > is always stripped off the URL, even though it is part of it. When I run extract_iocs I get:
https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip
I can give the real url that I discovered this issue with, but it is malicious so I didn't want to include it here.
Currently, it seems like iocextract extracts only the first URL found in a base64 encoded string.
For example for the following string (original):
'https://google.com https://amazon.com https://microsoft.com http://google.com http://amazon.com http://microsoft.com'
the base64 encoded string is: 'aHR0cHM6Ly9nb29nbGUuY29tIGh0dHBzOi8vYW1hem9uLmNvbSBodHRwczovL21pY3Jvc29mdC5jb20gaHR0cDovL2dvb2dsZS5jb20gaHR0cDovL2FtYXpvbi5jb20gaHR0cDovL21pY3Jvc29mdC5jb20g'
and only the first found URL is returned.
If I change the order of the URLs in the original string and then base64-encode it, iocextract returns whichever URL occurs first that time.
Can you please fix this and return all the URLs present in a base64-encoded string?
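As a workaround until that's fixed, you can decode the base64 blob yourself and run extraction over the decoded text. The sketch below uses a plain regex as a stand-in for iocextract.extract_urls:

```python
import base64
import re

original = ("https://google.com https://amazon.com https://microsoft.com "
            "http://google.com http://amazon.com http://microsoft.com")
encoded = base64.b64encode(original.encode()).decode()

# Decode the blob first, then extract from the plain text; all six
# URLs are recovered, not just the first.
decoded = base64.b64decode(encoded).decode()
urls = re.findall(r"https?://\S+", decoded)
```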
Example: https://twitter.com/ClearskySec/status/1001833343581900800
c2: www.nubpubwizard.jetos\.com
c2: worktrs.wikaba\.com
Spoofs host header
A URL which is surrounded by Japanese characters is not parsed correctly.
print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']
# My expectation is ['http://example.com']
I'm not sure how to fix it. But I think checking TLD might work well.
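A possible post-processing workaround (a sketch, not the library's actual fix) is to cut each extracted candidate at the first character that cannot appear in an RFC 3986 URL:

```python
import re

# Rough allow-list of characters that may appear in a URL; anything
# outside it (such as the CJK closing bracket) terminates the match.
URL_CHARS_RE = re.compile(r"[A-Za-z0-9:/?#\[\]@!$&'()*+,;=._~%-]+")

def trim_url(candidate):
    match = URL_CHARS_RE.match(candidate)
    return match.group(0) if match else candidate

trim_url("http://example.com』あああああ")  # → 'http://example.com'
```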
We do the urlparse try/except before modifying the URL, which may cause it to error out after we prepend the scheme. Need to move all the URL modifications before the urlparse test.
Identify and 'refang' emails formatted as follows:
identifier[@]domain[.com]
Improve the YARA regex to correctly extract things outside a standard rule { } format. This should include:
Related context: plyara/plyara#53.
If I run iocextract.py --input info.txt, it correctly prints indicators to what seems to be standard out. However, iocextract.py --input info.txt | less
simply gives the "you've got nothing END" message in less. It looks like the handle you're getting to STDOUT isn't the actual STDOUT handle.
Tested on OS X 10.14.6 with Python 3.7.6.
Pretty much the title; I discovered this in a downstream project, https://github.com/s0md3v/Photon, and commented on it there as well. Thought I'd leave the comment here too. The rest of the defang REs seem to work fine, but the backslash one caused a lot of hangs when I was using it.
Test against: http://myexample.com/dir/../path/escaping/../too/many/../dots/../in/../the/path/../cause/this/to/fail
Need to review the documentation and verify it's still up to date. Also, it appears to be failing in certain sections.
I'm parsing input containing examples of PowerShell or cmd.exe command lines. When a command flag with a slash comes after a URL, the flag is included in the extracted URL.
Here is an example:
list(iocextract.extract_unencoded_urls("command.exe https://pypi.org/project/iocextract/ /f"))
# => ['https://pypi.org/project/iocextract/ /f']
The trailing /f
should not be included in the extracted URL.
iocextract doesn't seem to recognize any IPv4 addresses from netstat output, since they all end with a port number or protocol name, for example 10.1.1.117.4222 and 10.1.1.117.https.
It pulls out IPv6 addresses just fine, though.
This would be a super useful addition when triaging host events from a DFIR standpoint :)
Any suggested workaround, or is there a possible patch that would cover this?
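Until there is a proper patch, one workaround (a pre-processing sketch, not part of iocextract) is to strip the trailing port or protocol from netstat output before extraction:

```python
import re

def strip_netstat_ports(text):
    # Rewrite netstat-style 'a.b.c.d.port' / 'a.b.c.d.proto' into
    # 'a.b.c.d' so a plain IPv4 regex can match.
    return re.sub(
        r"\b((?:\d{1,3}\.){3}\d{1,3})\.(?:\d+|[a-z]+)\b",
        r"\1",
        text,
    )

strip_netstat_ports("10.1.1.117.4222")   # → '10.1.1.117'
strip_netstat_ports("10.1.1.117.https")  # → '10.1.1.117'
```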
Traceback (most recent call last):
File "iocextract", line 11, in <module>
sys.exit(main())
File "local/lib/python2.7/site-packages/iocextract.py", line 433, in main
for ioc in extract_urls(args.input.read(), refang=args.refang, strip=args.strip_urls):
File "local/lib/python2.7/site-packages/iocextract.py", line 155, in extract_urls
url = refang_url(url.group(1))
File "local/lib/python2.7/site-packages/iocextract.py", line 395, in refang_url
return parsed.geturl()
File "/usr/lib64/python2.7/urlparse.py", line 134, in geturl
return urlunparse(self)
File "/usr/lib64/python2.7/urlparse.py", line 231, in urlunparse
return urlunsplit((scheme, netloc, url, query, fragment))
File "/usr/lib64/python2.7/urlparse.py", line 242, in urlunsplit
url = '//' + (netloc or '') + url
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 17: ordinal not in range(128)
Example url:
https://secure.comodo.net/CPS0C��U���<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+�����h0f0>+��0��2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+��0���http://ocsp.comodoca.com0���U����0���[email protected]
Add two new flags:
--extract-ipv4s
--extract-ipv6s
Issue seems to be around this defanged format:
firstname[.]lastname[@]domainname[.]org
When refanged, seeing the following:
For some reason, the username format of first.last is getting chopped off to just last.
It seems that refanging URLs with a scheme not listed in this line: https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L626
replaces the scheme with 'http': https://github.com/InQuest/python-iocextract/blob/4da913206d8e94a6a3b137c011c89e9707cb3966/iocextract.py#L631
Maybe a hard-coded conversion mapping could be used, e.g.:
refang_schemes = {
    'http': ['hxxp'],
    'https': ['hxxps'],
    'ftp': ['ftx', 'fxp'],
    'ftps': ['ftxs', 'fxps']
}

for scheme, fanged in refang_schemes.items():
    if parsed.scheme in fanged:
        parsed = parsed._replace(scheme=scheme)
        url = parsed.geturl().replace(scheme + ':///', scheme + '://')

        try:
            _ = urlparse(url)
        except ValueError:
            # Last resort on ipv6 fail.
            url = url.replace('[', '').replace(']', '')

        parsed = urlparse(url)
        break
This is not as catch-all as the current solution, but on the other hand it does not alter the indicator.
Example:
In [1]: import iocextract
In [2]: content = """tcp://example[.]com:8989/bad"""
In [3]: list(iocextract.extract_urls(content))
Out[3]: ['tcp://example[.]com:8989/bad', 'tcp://example[.]com:8989/bad']
In [4]: list(iocextract.extract_urls(content, refang=True))
Out[4]: ['http://example.com:8989/bad', 'http://example.com:8989/bad']
Note: This behavior is shown in the output examples in the README.rst in the 'Usage' section related to refang.
It'd be great to be able to provide a directory path to iocextract and have it iterate over all files, extracting IOC's from each as it goes.
For example, I have a directory of malicious SLK files and I want to quickly dump all the URLs. Right now I have to use something like for i in `ls`; do iocextract --extract-urls --input $i; done
Passing a dir to --input obviously throws an exception due to the argument's use of io.open:
File "iocextract.py", line 442, in <lambda>
parser.add_argument('--input', type=lambda x: io.open(x, 'r', encoding='utf-8', errors='ignore'),
IOError: [Errno 21] Is a directory: '/home/adam/research/malware/campaigns/slk-droppers'
Would you be okay with re-working --input to accept a file as input, stdin as an optional positional argument, and add a --dir
argument for folders? I can put in a PR if so - or if you have any other suggestions for this use case, that'd be great :D
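For reference, a directory-walking front end could be as small as the sketch below (the name iter_input_files and the proposed --dir flag are assumptions, not the current CLI):

```python
from pathlib import Path

def iter_input_files(target):
    # If given a directory, yield every regular file under it
    # (recursively); if given a single file, yield just that file.
    target = Path(target)
    if target.is_dir():
        yield from (p for p in sorted(target.rglob("*")) if p.is_file())
    else:
        yield target

# Each yielded path can then be opened and passed to the extractor
# exactly as --input handles a single file today.
```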
Add a defang
function that accepts a normal URL/domain/IP and returns a defanged version.
Example input/output:
Input | Output |
---|---|
http://example.com/path.ext | hxxp://example[.]com/path.ext |
http://example.com/ | hxxp://example[.]com/ |
example.com | example[.]com |
127.0.0.1 | 127[.]0[.]0[.]1 |
I need this for ThreatIngestor, makes the most sense to include it here.
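A minimal sketch of what such a defang function might look like, matching the table above only (not iocextract's implementation, and ignoring edge cases like IPv6 literals and ports):

```python
import re

def defang(ioc):
    # Neutralise the scheme, then bracket the dots in the host
    # portion while leaving any path untouched.
    ioc = re.sub(r"^http", "hxxp", ioc)
    scheme, sep, rest = ioc.partition("://")
    if sep:
        host, slash, path = rest.partition("/")
        return scheme + sep + host.replace(".", "[.]") + slash + path
    return ioc.replace(".", "[.]")

defang("http://example.com/path.ext")  # → 'hxxp://example[.]com/path.ext'
defang("127.0.0.1")                    # → '127[.]0[.]0[.]1'
```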
This doesn't get extracted: 78.128.76.]]]165.
Sometimes it is necessary to simply extract the domains, or the domains and subdomains.
And a question: are the newer, longer domain extensions (TLDs) included?
"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls
while their non-defanged counterparts don't"
Issue comment: #34 (comment)
If I have a URL with a port - e.g. 1.1.1.1:449 I'm seeing a URL getting extracted in the format of:
http://1.1.1.1:449.
Is that desired behavior?
This is probably more of a feature request...
Is there a way with the "extract_iocs" function to have it output the IOC Type next to the IOC?
I have a work around, but I have to call each function individually.
import iocextract
import pandas as pd
hashes = pd.DataFrame(iocextract.extract_sha256_hashes(glob), columns=['ioc'])
hashes['ioc_type'] = "sha256_hash"
hashes
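A slightly more general version of that workaround: map each extractor callable to a type label and tag every IOC in one pass. The stand-in regex lambdas below are assumptions for self-containment; in practice you would plug in iocextract.extract_urls, extract_md5_hashes, and so on:

```python
import re

# Stand-in extractors; swap in the real iocextract functions here.
extractors = {
    "md5_hash": lambda text: re.findall(r"\b[a-fA-F0-9]{32}\b", text),
    "url": lambda text: re.findall(r"https?://\S+", text),
}

def extract_typed_iocs(text):
    # Yield (type, ioc) pairs so each indicator carries its label.
    for ioc_type, extract in extractors.items():
        for ioc in extract(text):
            yield ioc_type, ioc

list(extract_typed_iocs("see http://example.com and "
                        "d41d8cd98f00b204e9800998ecf8427e"))
```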
Unfortunately it doesn't work, I ran it for quite a while but except for stressing one CPU core 100% nothing happened, the IPs were not written to the file.
iocextract --input '/home/user/des.txt' --output '/home/user/k1.txt' --extract-ipv4s
Hey! Currently working with iocextract to read from a text file and convert to a query. I just ran into an issue where the IPs were being extracted as IPs, but they were also being extracted and formatted as URLs.
Input: 101.28[.]225[.]248 ---> Output: RemoteIP =~ "101.28.225.248" or RemoteUrl has "http://101.28.225.248"
Things that look like timestamps, and things like 1:6:0, are getting through. If we can't improve the regex to catch these, maybe add a filter on the iterator?
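A filter on the iterator could lean on the stdlib's own address parsing, which already rejects timestamp-like strings (a sketch of such a filter, not existing iocextract behaviour):

```python
import ipaddress

def valid_ipv4s(candidates):
    # Drop anything the stdlib refuses to parse as an IPv4 address,
    # such as '1:6:0', '12:30:45', or out-of-range octets.
    for c in candidates:
        try:
            ipaddress.IPv4Address(c)
        except ValueError:
            continue
        yield c

list(valid_ipv4s(["10.1.1.117", "1:6:0", "12:30:45", "999.1.1.1"]))
# → ['10.1.1.117']
```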
CDATA[^h00ps://online\(.)americanexpress\(.)com/myca/.*?request_type=authreg_acctAccountSummary]]>
Should stop at the first character not in [\w-\[\]\(\)] when looking backwards; in this case, the ^. Even tighter, we can stop at the first character not in [\w] if it's before a ://.
Looking at how I might use something like this to pull indicators directly from malware binaries. Wondering if it could essentially run strings and extract the IOCs. Would also be nice to use this as a Python library.
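A rough Python stand-in for the Unix strings step is tiny, so pairing it with the library form would be straightforward (the function below is a sketch, not an existing iocextract feature):

```python
import re

def strings(data, min_len=6):
    # Pull runs of printable ASCII out of a binary blob, ready to
    # feed to an IOC extractor, like the Unix ``strings`` tool does.
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

blob = b"\x00\x01hxxp://bad.example[.]com/payload\xff\xfe"
strings(blob)  # → ['hxxp://bad.example[.]com/payload']
```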
https://docs.python.org/3/library/re.html#re.X
Will improve readability and maintainability of the regexes.
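For illustration, the same pattern in compact and re.X form (a generic IPv4-ish example, not one of iocextract's actual regexes):

```python
import re

SAMPLE = "src 10.1.1.117 dst 192.168.0.1"

compact = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

verbose = re.compile(r"""
    \b
    (?: \d{1,3} \. ){3}   # three dot-terminated octets
    \d{1,3}               # final octet
    \b
""", re.X)

# Both find the same matches; only the verbose one can carry comments.
```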
I was trying to pull out a list of domains from a text file input (sample of input / expected output below), but iocextract doesn't seem to recognize anything without a URI scheme.
Is it possible to include an --extract-domains, or have --extract-urls optionally ignore the scheme for instance? Just random thoughts, not sure the best way to handle this given how complicated the regex is.
If it's any help, this pattern ([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63}
should match pretty much any domain name up to the TLD.
matches:
google.com
foo.mywebsite.io
hack-the-planet.com
asdf-fdsa.foo-bar.com
foo-bar.domain.name.com
Sample Input
GLOBAL
Pool Location Total Fee/Donations Hashrate Miners Link
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP DE,FR,US,CA,SG 0.6 % 86.79 MH/s 7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL USA/EU/Asia 0.4-0.6 % 642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL USA/EU/Asia 1 % 105.52 MH 6155
minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU 1-1.5 % 26.50 MH/s 37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL US/UK/AU/JP 0.4 % API problem API problem
monero.hashvault.pro
Was hoping to get output of:
supportXMR.com
xmrpool.net
monero.hashvault.pro
minergate.com
As in title; e.g.
'hxxps://example.com'
is refanged as 'http://example.com'
Hey,
I was looking to use this for decoding some base64 strings inside JSON, and it did not seem to find the following when using refang.
},
"data": {
".dockerconfigjson": "ewoJImF1dGhzIjogewoJCSJjZGUtZG9ja2VyLXJlZ2lzdHJ5LmVpYy5mdWxsc3RyZWFtLmFpIjogewoJCQkiYXV0aCI6ICJZMlJsTFhKbFoybHpkSEo1T21Oa1pTMXlaV2RwYzNSeWVRPT0iCgkJfQoJfQp9"
},
Any way to improve this at all?
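One manual pre-processing approach (an assumption about the use case; the registry hostname below is a placeholder, not the real decoded value) is to decode the suspected-base64 JSON values yourself before handing the text to the extractor:

```python
import base64
import json

# Stand-in for a Kubernetes-style secret with a base64-encoded
# .dockerconfigjson value.
blob = {
    ".dockerconfigjson": base64.b64encode(
        b'{"auths": {"registry.example.test": {"auth": "dXNlcjpwYXNz"}}}'
    ).decode()
}

hosts = []
for value in blob.values():
    try:
        decoded = base64.b64decode(value).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        continue  # not valid base64 / not text
    data = json.loads(decoded)
    # Registry hostnames live in the keys of the "auths" object.
    hosts.extend(data.get("auths", {}))

# hosts == ['registry.example.test']
```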
Given defanged URLs with an IP address or a subdomain such as:
hXXps://192.168.149[.]100/api/info
hXXps://subdomain.example[.]com/some/path
The GENERIC_URL_RE
regex returns the correct results. However, since they are also parsed with the BRACKET_URL_RE
regex additional invalid results are also returned:
http://149.100/api/info
http://example.com/some/path
A simple change seems to fix the problem, assuming I'm not missing some false positive scenario.
diff --git a/iocextract.py b/iocextract.py
index 8fdb374..dcd25dd 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -66,7 +66,7 @@ GENERIC_URL_RE = re.compile(r"""
BRACKET_URL_RE = re.compile(r"""
\b
(
- [\:\/\\\w\[\]\(\)-]+
+ [\.\:\/\\\w\[\]\(\)-]+
(?:
\x20?
[\(\[]
It appears that the SHA1 extractor only pulls the first 32 characters, so the result looks like an MD5 hash.
Require at least one [\w] character immediately following the ://. Exceptions for [\s\[\(]* to catch defangs.
I noticed that with a URL like hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com, refanging only fixed the netloc portion of the URL. Also, made a change to the email regex.
What do you think?
diff --git a/iocextract.py b/iocextract.py
index 814ad8a..fc2d80b 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -124,7 +124,7 @@ IPV6_RE = re.compile(r"""
\b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
""", re.IGNORECASE | re.VERBOSE)
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
+EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+[\s]*@[\s]*[a-zA-Z0-9-]+[[]*\.[]]*[a-zA-Z0-9-.]+)")
MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -247,7 +247,7 @@ def extract_emails(data):
:rtype: Iterator[:class:`str`]
"""
for email in EMAIL_RE.finditer(data):
- yield email.group(0)
+ yield email.group(0).replace(" ", "").replace("[.]", ".")
def extract_hashes(data):
"""Extract MD5/SHA hashes.
@@ -420,6 +420,7 @@ def refang_url(url):
# Fix example[.]com, but keep RFC 2732 URLs intact.
if not _is_ipv6_url(url):
parsed = parsed._replace(netloc=parsed.netloc.replace('[', '').replace(']', ''))
+ parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))
return parsed.geturl()
I installed it on Arch Linux, but unfortunately I only get an error message.
Steps:
sudo pipx install iocextract --force
iocextract -h
$ /usr/bin/iocextract -h
Traceback (most recent call last):
File "/usr/bin/iocextract", line 5, in <module>
from iocextract import main
ModuleNotFoundError: No module named 'iocextract'