mlsecproject / combine
Tool to gather Threat Intelligence indicators from publicly available sources
Home Page: https://www.mlsecproject.org/
License: GNU General Public License v3.0
We perform this lookup each time an indicator appears rather than reusing the result for the same indicator from earlier in the same run. In other words, if four different feeds each list 8.8.8.8, we perform that lookup four times. This is inefficient.
Instead, we should create a list of all the unique indicators in the current dataset, enrich those, and then map back to the original data.
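A minimal sketch of that approach (the function name and record shapes are hypothetical, not combine's actual API): enrich each distinct indicator once, then map the cached result back onto every row.

```python
def enrich_unique(rows, enrich):
    """Enrich each distinct indicator once, then map results back.

    rows is a list of tuples whose first element is the indicator;
    enrich is any callable. Names and shapes here are hypothetical,
    not combine's actual API.
    """
    cache = {}
    for row in rows:
        indicator = row[0]
        if indicator not in cache:
            cache[indicator] = enrich(indicator)
    # Append the cached enrichment to every original row.
    return [row + (cache[row[0]],) for row in rows]
```

With this shape, four feeds listing 8.8.8.8 cost one lookup instead of four.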
(venv)kmaxwell@newton:~/src/combine$ python winnow.py
Traceback (most recent call last):
File "winnow.py", line 88, in <module>
winnow('crop.json', 'winnowed.json')
File "winnow.py", line 78, in winnow
ipaddr = IPAddress(addr)
File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 307, in __init__
'address from %r' % addr)
netaddr.core.AddrFormatError: failed to detect a valid IP address from u'199.222.35.192.in-addr.arpa.'
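The crash comes from feeding a reverse-DNS name to netaddr's IPAddress. One way to guard against it, sketched below with a hypothetical helper, is to detect in-addr.arpa names and recover the original IPv4 address (or skip the entry):

```python
def to_ip(addr):
    """Recover an IPv4 string from a reverse-DNS (in-addr.arpa) name.

    Returns the dotted quad with octets reversed back into normal
    order, or None if addr is not an in-addr.arpa name. Hypothetical
    helper, not part of winnow.py.
    """
    suffix = '.in-addr.arpa'
    name = addr.rstrip('.')
    if not name.endswith(suffix):
        return None
    octets = name[:-len(suffix)].split('.')
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        return None
    # in-addr.arpa stores the octets in reverse order.
    return '.'.join(reversed(octets))
```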
We should have a proper test suite to detect regressions, support TDD, and help with refactoring.
@technoskald does just switching around the URL in the text files work?
python combine.py
Fetching inbound URLs
Fetching outbound URLs
Storing raw feeds in harvest.json
Loading raw feed data from harvest.json
Parsing feed from http://www.projecthoneypot.org/list_of_ips.php?rss=1
---snip---
Parsing feed from http://www.nothink.org/blacklist/blacklist_malware_irc.txt
Storing parsed data in crop.json
Reading processed data from crop.json
Output regular data as CSV to harvest.csv
Traceback (most recent call last):
File "combine.py", line 41, in
if args.tiq-test:
AttributeError: 'Namespace' object has no attribute 'tiq'
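The underlying issue is that argparse converts dashes in option names to underscores for the attribute name, so args.tiq-test is parsed by Python as the subtraction args.tiq - test. A minimal demonstration:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--tiq-test', action='store_true')
args = parser.parse_args(['--tiq-test'])

# argparse stores '--tiq-test' under the attribute 'tiq_test';
# 'args.tiq-test' is the expression (args.tiq - test), hence the
# AttributeError on 'tiq' above.
assert args.tiq_test
```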
Alienvault data contains both inbound and outbound indicators. The notes will indicate which is which.
We have some of these, but we need to evaluate the list for possible additions.
http://1d4.us/archive/network-28-07-2014.txt
http://1d4.us/archive/network-29-07-2014.txt
http://1d4.us/archive/ssh-28-07-2014.txt.txt
http://1d4.us/archive/ssh-29-07-2014.txt.txt
http://1d4.us/archive/ssh-today.txt
http://1d4.us/archive/today.txt
http://atlas-public.ec2.arbor.net/public/ssh_attackers
http://bitcash.cz/misc/log/blacklist
http://charles.the-haleys.org/ssh_dico_attack_hdeny_format.php/hostsdeny.txt
http://cybercrime-tracker.net/all.php
http://danger.rulez.sk/projects/bruteforceblocker/blist.php
http://feodotracker.abuse.ch/blocklist.php?download=ipblocklist
http://jeroen.steeman.org/FS-PlainText
http://lists.blocklist.de/lists/all.txt
http://lists.clean-mx.com/pipermail/phishwatch/20140729.txt
http://lists.clean-mx.com/pipermail/phishwatch/20140730.txt
http://lists.clean-mx.com/pipermail/viruswatch/20140729.txt
http://lists.clean-mx.com/pipermail/viruswatch/20140730.txt
http://malc0de.com/bl/IP_Blacklist.txt
http://multiproxy.org/txt_all/proxy.txt
http://osint.bambenekconsulting.com/feeds/goz-iplist.txt
http://rules.emergingthreats.net/fwrules/emerging-PF-CC.rules
http://rules.emergingthreats.net/open/snort-2.9.0/rules/emerging-tor.rules
http://stefan.gofferje.net/sipblocklist.zone
http://torstatus.blutmagie.de/ip_list_all.php/Tor_ip_list_ALL.csv
http://torstatus.blutmagie.de/ip_list_exit.php/Tor_ip_list_EXIT.csv
http://un1c0rn.net/?module=hosts&action=list&page=1
...
http://un1c0rn.net/?module=hosts&action=list&page=200
http://vmx.yourcmc.ru/BAD_HOSTS.IP4
http://vxvault.siri-urz.net/URL_List.php
http://www.autoshun.org/files/shunlist.csv
http://www.ciarmy.com/list/ci-badguys.txt
http://www.cruzit.com/xwbl2txt.php
http://www.falconcrest.eu/IPBL.aspx
http://www.infiltrated.net/blacklisted
http://www.infiltrated.net/vabl.txt
http://www.infiltrated.net/voipabuse/netblocks.txt
http://www.infiltrated.net/webattackers.txt
http://www.malwaredomainlist.com/hostslist/ip.txt
http://www.michaelbrentecklund.com/whm-cpanel-cphulk-banlist-whm-cpanel-cphulk-blacklist/
http://www.nothink.org/blacklist/blacklist_malware_dns.txt
http://www.nothink.org/blacklist/blacklist_malware_http.txt
http://www.nothink.org/blacklist/blacklist_malware_irc.txt
http://www.nothink.org/blacklist/blacklist_ssh_day.txt
http://www.openbl.org/lists/base_1days.txt
http://www.spamhaus.org/drop/drop.txt
http://www.spamhaus.org/drop/edrop.txt
http://www.stopforumspam.com/downloads/listed_ip_1_all.zip
http://www.stopforumspam.com/downloads/toxic_ip_cidr.txt
http://www.voipbl.org/update/
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=atma
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=spyware
https://blocklist.sigmaprojects.org/api.cfc?method=getList&lists=webexploit
https://isc.sans.edu/api/sources/attacks/10000/2014-07-30
https://isc.sans.edu/api/topips/records/1000/2014-07-30
https://lists.malwarepatrol.net/cgi/getfile?receipt=f1377916320&product=8&list=smoothwall
https://palevotracker.abuse.ch/blocklists.php?download=ipblocklist
https://raw.githubusercontent.com/EmergingThreats/et-open-bad-ip-list/master/IPs.txt
https://reputation.alienvault.com/reputation.generic
https://security.berkeley.edu/aggressive_ips/ips
https://spyeyetracker.abuse.ch/blocklist.php?download=ipblocklist
https://www.dan.me.uk/torlist/
https://www.gpf-comics.com/dnsbl/export.php
https://www.maxmind.com/en/anonymous_proxies
https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
Dumping results
Traceback (most recent call last):
File "winnower.py", line 150, in <module>
winnow('crop.json', 'crop.json', 'enriched.json')
File "winnower.py", line 146, in winnow
json.dump(enriched, f, indent=2)
File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
for chunk in iterable:
File "/usr/lib/python2.7/json/encoder.py", line 431, in _iterencode
for chunk in _iterencode_list(o, _current_indent_level):
File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 313, in _iterencode_list
yield buf + _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 22: invalid continuation byte
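One way to avoid this, assuming the bad bytes (like the 0xe7 above) come straight from a fetched feed, is to decode defensively before serializing. The helper name is hypothetical:

```python
import json


def to_text(value, encoding='utf-8'):
    """Decode feed bytes defensively before JSON serialization.

    Bytes that are not valid UTF-8 are decoded with replacement
    characters instead of raising. Hypothetical helper, not part
    of winnower.py.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors='replace')
    return value
```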
it's apache2 btw
As users will also need to get this data, we need a quick script (or at least documentation) to fetch and store it.
Some sources provide a "last observed" date that we should handle. Specifically, we should exclude observations made more than 24 hours ago.
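A sketch of such a filter; the record shape and the timestamp format are assumptions, not combine's actual format:

```python
from datetime import datetime, timedelta


def fresh_only(rows, now=None, max_age_hours=24):
    """Drop observations whose 'last observed' timestamp is older
    than the cutoff.

    rows: iterable of (indicator, last_seen) pairs, last_seen in
    'YYYY-MM-DD HH:MM:SS' form. Hypothetical shape for illustration.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(hours=max_age_hours)
    return [(ind, seen) for ind, seen in rows
            if datetime.strptime(seen, '%Y-%m-%d %H:%M:%S') >= cutoff]
```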
[
"www.google.com",
null,
"outbound",
"http://www.nothink.org/blacklist/blacklist_malware_http.txt",
"",
"2014-08-01"
]
That null should say DNS instead. This is similar to #30.
From @alexcpsec in #21:
I would separate the enrichments by "groups" (for the lack of a better name) in a config file. And the groups would have a list of the sources that would be harvested by them.
And we start these groups out as "inbound" and "outbound".
If too generic (i.e., too much work for now), it is fine. But I think this would give you a lot of flexibility for further research (like a "CnC" group, a "malware download" group, etc.).
Currently we separate by inbound/outbound, which is fine for the initial release but can be enhanced.
.dns.isValidDomain <- function(domain) {
# Check domains
retval = !is.na(domain)
domainLengths = sapply(domain[retval], nchar, USE.NAMES=F)
retval[retval] = (domainLengths > 0) & (domainLengths <= 253)
rm(domainLengths)
if(length(retval[retval]) > 0) {
retval[retval] = sapply(
str_split(domain[retval], fixed(".")),
function(x) {
labelSizes = nchar(x)
return(length(x) <= 127 && all(labelSizes > 0) && all(labelSizes <= 63))
},
USE.NAMES=F
)
}
# If it is a valid IP address, it is not a domain
retval[retval] = !isIPv4(domain[retval])
# If it has a slash, it is not a domain
retval[retval] = !grepl("/", domain[retval], fixed=T)
return(retval)
}
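For comparison, an equivalent check sketched in Python (not part of combine; the length limits and the IPv4/slash rejections mirror the R validator above):

```python
import re


def is_valid_domain(domain):
    """RFC-1035-style checks: total length <= 253, at most 127 labels,
    each label 1-63 characters; dotted-quad IPs and anything containing
    a slash are rejected, mirroring the R code above."""
    if not domain or len(domain) > 253 or '/' in domain:
        return False
    if re.match(r'^\d{1,3}(\.\d{1,3}){3}$', domain):
        return False  # looks like an IPv4 address, not a domain
    labels = domain.rstrip('.').split('.')
    return (len(labels) <= 127 and
            all(0 < len(label) <= 63 for label in labels))
```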
Add a blurb on the README making it clear people should bring their own Farsight DNSDB API keys.
Maybe the tiq-test option should have a "compulsory" -e?
Storing parsed data in crop.json
Reading processed data from crop.json
Output regular data as CSV to harvest.csv
Traceback (most recent call last):
File "combine.py", line 42, in <module>
tiq_output('crop.json', 'enrich.json')
File "/Users/alexcp/src/combine/baler.py", line 19, in tiq_output
with open(enr_file, 'rb') as f:
IOError: [Errno 2] No such file or directory: 'enrich.json'
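A simple guard, with a hypothetical helper name, would be to check for the enriched file before attempting to bale it instead of crashing with IOError:

```python
import os


def have_enrichment(enr_file):
    """combine.py calls tiq_output unconditionally; checking the
    file's existence first lets us skip the enriched output cleanly
    when the enrichment step was never run (e.g. no -e flag).
    Hypothetical helper, not combine's actual code."""
    return os.path.exists(enr_file)
```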
We should think about alternative strategies for enrichment (e.g. not just maxhits).
While working on a repro for #49, I got:
(venv)kmaxwell@newton:~/src/combine$ python thresher.py
Loading raw feed data from harvest.json
[...]
Parsing feed from http://www.autoshun.org/files/shunlist.csv
Traceback (most recent call last):
File "thresher.py", line 189, in <module>
thresh('harvest.json', 'crop.json')
File "thresher.py", line 166, in thresh
harvest += thresher_map[site](response[2], response[0], 'inbound')
File "thresher.py", line 108, in process_autoshun
date = line.split(',')[1].split()[0]
IndexError: list index out of range
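The Autoshun list includes lines (such as a header) without a comma-separated date field, which is one way to hit this IndexError. A defensive parse, with a hypothetical helper name:

```python
def parse_autoshun_line(line):
    """Extract the date token from an Autoshun CSV line, tolerating
    header, blank, and malformed lines by returning None instead of
    raising IndexError. Sketch only, not thresher.py's actual code."""
    fields = line.split(',')
    if len(fields) < 2:
        return None  # no second field at all (e.g. header line)
    tokens = fields[1].split()
    return tokens[0] if tokens else None
```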
Review all the outbound sources to remove URL-based intel for now.
Add an option for additional sources from CSV.
Also classify them as "inbound" and "outbound".
Not everybody has DNSDB-type access. Fail gracefully if they don't, preferably falling back to some other source.
We need something more configurable so we can set a proper User-Agent or the like.
There are probably other adjustments we may want to make (header info, etc.) to make it easier to download/scrape the sources.
Can I suggest using {}-style string formatting for Python3 forward compat?
We should select the mix of public and semi-private feeds we are going to use on the presentation, and adapt the 'harvester' code as necessary to be able to gather them.
I don't believe that we need to have full fledged tool implementation for the initial milestone, but at least the minimum we require to prove the concept for the CFP.
This appears to be a version issue with grequests?
Please make sure the CSV output has quotes (") around the strings and numbers (i.e., output everything as strings).
We had settled on a non-quoted format before, but that confuses the parsers in R when I have the AS name information in the enriched versions. I am exporting/importing all my data with quotes because of this.
Please adjust accordingly.
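With Python's csv module this is one flag, quoting=csv.QUOTE_ALL, which quotes every field including numbers (the row values below are illustrative):

```python
import csv
import io

# QUOTE_ALL forces quotes around every field, numeric or not,
# which is what the R parsers expect.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(['8.8.8.8', 'IPv4', 'inbound', 15169])
```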
For each IP address, get the ASN and hostnames (if enrichment is enabled).
@alexcpsec: How do we want to handle multiple names for an IP address?
Implement a plugin system to thresh new sources (and likely for baling as well).
This is not a "first release" feature (i.e. post-DEFCON).
We need proper documentation for users.
MOAR work!
Here is how things look on the tiq-test
data directory right now:
aperture-2:data alexcp$ ls
enriched population raw
aperture-2:data alexcp$ ls raw
public_inbound public_outbound
aperture-2:data alexcp$ ls raw/pu
public_inbound/ public_outbound/
aperture-2:data alexcp$ ls raw/public_inbound/
20140615.csv.gz 20140618.csv.gz 20140622.csv.gz 20140625.csv.gz 20140628.csv.gz 20140701.csv.gz 20140704.csv.gz 20140707.csv.gz 20140710.csv.gz 20140713.csv.gz
20140616.csv.gz 20140619.csv.gz 20140623.csv.gz 20140626.csv.gz 20140629.csv.gz 20140702.csv.gz 20140705.csv.gz 20140708.csv.gz 20140711.csv.gz 20140714.csv.gz
20140617.csv.gz 20140620.csv.gz 20140624.csv.gz 20140627.csv.gz 20140630.csv.gz 20140703.csv.gz 20140706.csv.gz 20140709.csv.gz 20140712.csv.gz 20140715.csv.gz
Basically we have the following structure:
data/[DATATYPE]/[DATAGROUP]/[YYYYMMDD].csv.gz
considering that:
DATATYPE should be either raw or enriched. The names are references to what to expect in the data structure of the CSVs inside (as described in the README). Disregard the population type; it is not a target for this presentation.
DATAGROUP refers to the group name of the combine output (currently the "inbound" and "outbound" separation). They can be whatever you like; I am using public_inbound and public_outbound for the presentation data.
YYYYMMDD is the way dates should be represented in the whole world.
Please note the CSVs are gzipped. The code expects that as well.
Enriching mail.TIKTIKZ.COM
Traceback (most recent call last):
File "winnower.py", line 150, in <module>
winnow('crop.json', 'crop.json', 'enriched.json')
File "winnower.py", line 138, in winnow
e_data = (addr, addr_type, direction, source, note, date, enrich_DNS(ipaddr, date, dnsdb))
File "winnower.py", line 53, in enrich_DNS
records = dnsdb.query_rrset(address, rrtype='A')
File "/home/kmaxwell/src/combine/dnsdb_query.py", line 55, in query_rrset
return self._query(path)
File "/home/kmaxwell/src/combine/dnsdb_query.py", line 77, in _query
http = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 396, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 258, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /lookup/rrset/name/95.85.191.8/A
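urllib refuses a bare path with no scheme, which suggests the DNSDB server base URL never got prepended before the request. A sketch of the fix; the base URL shown is an assumption and should really come from the dnsdb-query configuration:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Assumed base URL; in dnsdb_query.py this comes from the config file.
DNSDB_SERVER = 'https://api.dnsdb.info'


def full_url(path):
    """urllib raises "unknown url type" for a bare path; joining it
    onto the API server's base URL yields an absolute URL it accepts."""
    return urljoin(DNSDB_SERVER + '/', path.lstrip('/'))
```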
Possibly controlled via a config option?
There are a few blacklists on this website: http://www.nothink.org/honeypots.php
Also malware samples, even though we don't need those at the moment.
Our logging sucks.
Reaper should be able to read local files.
For the "threshing" step, we need to define a normalized data model. This should be aligned with whatever MLSec already uses for ease of "baling" but does not necessarily need to be the same.
We need to define what we are going to analyze on the feeds.
Above all, the metrics and comparisons performed must be helpful to the overall public and be able to tell a good story.
Some starting points have been put in the Wiki: https://github.com/mlsecproject/combine/wiki/Sources-for-Presentation
Feature request to support Collective Intelligence Framework feeds. A fine intermediate step would be to allow importing from local files.
Which leads to things like:
Enriching 150.164.082.010
Traceback (most recent call last):
File "combine.py", line 38, in <module>
winnow('crop.json', 'crop.json', 'enrich.json')
File "/home/kmaxwell/src/combine/winnower.py", line 122, in winnow
ipaddr = IPAddress(addr)
File "/home/kmaxwell/src/combine/venv/local/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 307, in __init__
'address from %r' % addr)
netaddr.core.AddrFormatError: failed to detect a valid IP address from u'150.164.082.010'
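netaddr rejects zero-padded octets like 082 because leading zeros are ambiguous (octal vs. decimal). One option is to normalize such indicators before parsing; the helper name is hypothetical:

```python
def normalize_ipv4(addr):
    """Strip leading zeros from each octet ('150.164.082.010' ->
    '150.164.82.10') so strict parsers like netaddr accept it.
    Returns None for anything that is not four decimal octets.
    Hypothetical helper, not part of winnower.py."""
    parts = addr.split('.')
    if len(parts) != 4 or not all(p.isdigit() and int(p) <= 255 for p in parts):
        return None
    # int() discards the leading zeros in each octet.
    return '.'.join(str(int(p)) for p in parts)
```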
Anything worth using?
https://github.com/sherpasurfing/SHERPASURFING/tree/master/enrichmentdatasets
Is Farsight providing an API key for this project? Are they aware that thousands of people may be hitting them on this key each day?
Today, indicators that for some reason do not match our "IPv4" or "FQDN" validation just stay there without a type. An example:
$ cat harvest.csv | grep -v FQDN | grep -v IPv4
"entity","type","direction","source","notes","date"
"2001:41d0:8:dcd4::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:5f18:8f82::5f18:8f82","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2002:c3d3:9a9f::c3d3:9a9f","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:145::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a00:1210:fffe:72::1","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:238:20a:202:1000::25","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a01:540:2:bd5d:d849:1e69:7736:be41","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:a90f:3bd1:d8d9:3485","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:140:3:b86c:62e8:3e0e:a0fb","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:501b:91a5:76ff:8fa8","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2a03:7380:2380:0:95db:5adb:685d:a0f0","","inbound","http://www.blocklist.de/lists/apache.txt","","2014-09-04"
"2001:41d0:1:c9b2::1","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"2a01:430:17:1::ffff:376","","inbound","http://www.blocklist.de/lists/bots.txt","","2014-09-04"
"Export","","inbound","http://virbl.org/download/virbl.dnsbl.bit.nl.txt","","2014-09-04"
"ckaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","","outbound","http://www.nothink.org/blacklist/blacklist_malware_dns.txt","","2014-09-04"
We are not interested (for now) in IPv6, and the other entries seem like parsing errors.
I believe we should filter out indicators that do not match a specific type.
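A minimal filter along those lines, assuming the six-column row shape shown in the CSV above:

```python
def filter_typed(rows, allowed=('IPv4', 'FQDN')):
    """Keep only records whose type matched a recognized class;
    untyped rows (IPv6 addresses, parser debris like 'Export') are
    dropped. Row shape assumed to be
    (entity, type, direction, source, notes, date)."""
    return [r for r in rows if r[1] in allowed]
```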