
python-email-crawler's Introduction

Python Email Crawler

This Python script searches Google for the given keywords, crawls the webpages in the results, and returns all emails found.

Requirements

  • sqlalchemy
  • urllib2 (part of the Python 2 standard library)

If you don't have sqlalchemy, simply sudo pip install sqlalchemy.

Usage

Start the search with a keyword. We use "iphone developers" as an example.

python email_crawler.py "iphone developers"

The search and crawling process will take quite a while, as it retrieves up to 500 search results (from Google) and crawls up to 2 levels deep. It should crawl around 10,000 webpages :)

After the process finishes, run this command to get the list of emails:

python email_crawler.py --emails

The emails will be saved in ./data/emails.csv

python-email-crawler's People

Contributors

dcondrey, jundaong, kevingatera, samwize


python-email-crawler's Issues

URL with UTF-8 char triggers crash

The crawler crashes when a URL containing a non-ASCII character (e.g. 'ß') is encountered.

Crash log:

...
[17:11:08] INFO::email_crawler - Crawling https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-weißenbach
[17:11:09] ERROR::email_crawler - EXCEPTION: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'
Traceback (most recent call last):
  File "email_crawler.py", line 217, in <module>
    crawl(arg)
  File "email_crawler.py", line 81, in crawl
    email_set = find_emails_2_level_deep(uncrawled.url)
  File "email_crawler.py", line 143, in find_emails_2_level_deep
    db.enqueue(link, list(email_set))
  File "/home/tomas/python-email-crawler2/database.py", line 35, in enqueue
    res = self.connection.execute(s)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
ProgrammingError: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'] [parameters: ('https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-wei\xc3\x9fenbach',)]

However, it may be that the issue is located in the SQLAlchemy library; I don't know for sure.
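A possible workaround on the caller's side, sketched under the assumption that the URL can be decoded to text before it reaches db.enqueue (the helper below is hypothetical, not part of the project):

```python
def to_unicode(url, encoding="utf-8"):
    """Hypothetical helper: ensure a URL is a text string rather than bytes,
    so pysqlite never receives an 8-bit bytestring parameter."""
    if isinstance(url, bytes):
        return url.decode(encoding, errors="replace")
    return url

# The URL from the crash log, as raw bytes the crawler might extract:
raw = b"https://www.paginegialle.it/valle-aurina-bz/alpinwellt-wei\xc3\x9fenbach"
print(to_unicode(raw))  # decodes the 'ss' character instead of crashing in SQLite
```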

Redundant url searches

When scanning a large site, if the same URL appears on multiple pages, it will be scanned once for each page it appears on.

If there were a way to remember that a URL has already been scanned, so it can be skipped when found a second (or later) time, it would save a lot of time.
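A minimal sketch of the suggested bookkeeping (not the project's actual code): keep a set of visited URLs and skip any URL seen before.

```python
# Track every URL that has been scanned so repeats can be skipped.
visited = set()

def should_crawl(url):
    """Return True only the first time a URL is seen."""
    if url in visited:
        return False
    visited.add(url)
    return True
```

The first call for a given URL returns True; every later call for the same URL returns False.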

Crawl based on a CSV list rather than sys arg?

Is it possible to crawl for emails based on a business name and address in a CSV file?
For example, could I use the info below to crawl for emails from all these places?

BorrowerName | BorrowerAddress | BorrowerCity | BorrowerState | BorrowerZip
PACIFIC COAST FIRE | 470 DIVISION ST | CAMPBELL | CA | 95008
CAREY VISION MEDICAL GROUP, INC. | 25621 Deerfield Drive | Los Altos Hills | CA | 94022
POINT DIGITAL FINANCE, INC. | 444 High Street, FL 4 | Palo Alto | CA | 94301
ASSOCIATED PATHOLOGY MEDICAL GROUP, INC. | 105a cooper ct | los gatos | CA | 95032
GT4 EVENTS | 475 Vandell Way | Campbell | CA | 95008
COULTER CONSTRUCTION, INC. | 1961 Old Middlefield Way | Mountain View | CA | 94043
ZENFOLIO, INC. | 3515-A Edison Way | Menlo Park | CA | 94025

Thank you!
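One possible approach (a sketch, not a supported feature; the column names follow the table above): build one search keyword per CSV row, then run the crawler once for each keyword.

```python
import csv
import io

def keywords_from_csv(text):
    """Yield one Google search keyword per business row (sketch)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield "%s %s %s" % (row["BorrowerName"], row["BorrowerCity"],
                            row["BorrowerState"])

sample = ("BorrowerName,BorrowerCity,BorrowerState\n"
          "PACIFIC COAST FIRE,CAMPBELL,CA\n")
for kw in keywords_from_csv(sample):
    print(kw)  # each keyword could then be passed to email_crawler.py
```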

Error when executing (TypeError: expected String or buffer)

user@meta:~/mailcrawl$ git clone https://github.com/samwize/python-email-crawler.git  
Cloning into 'python-email-crawler'...
remote: Enumerating objects: 63, done.
remote: Total 63 (delta 0), reused 0 (delta 0), pack-reused 63
Unpacking objects: 100% (63/63), done.
user@meta:~/mailcrawl$ cd python-email-crawler/
user@meta:~/mailcrawl/python-email-crawler$ python email_crawler.py test
[11:51:12] INFO::email_crawler - ----------------------------------------
[11:51:12] INFO::email_crawler - Keywords to Google for: test
[11:51:12] INFO::email_crawler - ----------------------------------------
[11:51:12] INFO::email_crawler - Crawling http://www.google.com/search?q=test&start=0
[11:51:12] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=test&start=0
HTTP Error 503: Service Unavailable
[11:51:12] ERROR::email_crawler - EXCEPTION: expected string or buffer 
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module>
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer
user@meta:~/mailcrawl/python-email-crawler$ 

I get the above when running the command.
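The 503 from Google is the likely trigger: the page body comes back as None, and the regex then fails with this TypeError. A defensive sketch (the name google_url_regex mirrors the traceback; the pattern itself is an assumption, not the script's real one):

```python
import re

# Placeholder pattern; the real one in email_crawler.py may differ.
google_url_regex = re.compile(r'href="(http[^"]+)"')

def extract_links(data):
    """Return result links, tolerating a None body after an HTTP error."""
    if not data:            # e.g. HTTP Error 503 left nothing to parse
        return []
    return google_url_regex.findall(data)
```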

Recognised images as emails

Example:

ogo_website_200x60_1fe006a7-6795-4ef6-ab21-77ce25ef0772_160x@2x.png

A possible solution may be to add an allow-list of TLDs.
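A sketch of that allow-list idea (the TLD set is an example, not a project constant): drop "emails" whose TLD is not on the list, which filters out image names like the one above.

```python
ALLOWED_TLDS = {"com", "net", "org", "edu", "it", "de", "uk"}  # example list

def looks_like_email(candidate):
    """Crude filter for regex-extracted email candidates (hypothetical)."""
    if "@" not in candidate:
        return False
    tld = candidate.rsplit("@", 1)[-1].split(".")[-1].lower()
    return tld in ALLOWED_TLDS
```

An image filename such as `logo_160x@2x.png` ends in "png", which is not in the allow-list, so it is rejected.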

sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None

I can't get the script to work. What does this mean?

"sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None"

The output I get:

python email_crawler.py "south london photographers"
james-larkins-MacBook-Pro:~ jameslarkin$ python email_crawler.py "south london photographers"
/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python: can't open file 'email_crawler.py': [Errno 2] No such file or directory
james-larkins-MacBook-Pro:~ jameslarkin$ cd /users/python
james-larkins-MacBook-Pro:python jameslarkin$ python email_crawler.py "south london photographers"
Traceback (most recent call last):
  File "email_crawler.py", line 31, in <module>
    db.connect()
  File "/Users/python/database.py", line 15, in connect
    self.connection = self.engine.connect()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1649, in connect
    return self._connection_cls(self, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 59, in __init__
    self.connection = connection or engine.raw_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1707, in raw_connection
    return self.pool.unique_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 220, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 425, in __init__
    rec = self._connection_record = pool._do_get()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 855, in _do_get
    return self._create_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 368, in __connect
    connection = self.__pool._creator()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 279, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None None
james-larkins-MacBook-Pro:python jameslarkin$

It keeps searching forever.

It just searches everywhere... I sometimes think it searches every single website.

Nothing is wrong; it just does not stop. I can see the websites it is searching, but after 1 to 1.5 hours it is still crawling unrelated websites with no end in sight.

Are my keywords wrong? What can I do?

0 emails saved

After crawling, I always get the same "0 emails" message:

[11:41:50] INFO::email_crawler - ========================================
[11:41:50] INFO::email_crawler - Processing...
[11:41:50] INFO::email_crawler - There are 0 emails
[11:41:50] INFO::email_crawler - All emails saved to ./data/emails.csv
[11:41:50] INFO::email_crawler - ========================================

Licence

What is the licence for this script?

Should I assume that, because it is on GitHub, it is GPL?

Empty spaces in the URL are not handled properly

I have an example URL, http://www.website.com/Search/in/Alderley Edge, and it fails to fetch the page even though it exists. I got this as the response:

[22:57:43] ERROR::email_crawler - Exception at url: http://www.website.com/Search/in/Alderley Edge HTTP Error 400: Bad Request
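A sketch of one fix: percent-encode unsafe characters (such as spaces) before fetching, so the server receives a valid request line.

```python
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2, where the crawler runs

def encode_url(url):
    """Percent-encode a URL while keeping the URL delimiters intact."""
    return quote(url, safe=":/?&=")

print(encode_url("http://www.website.com/Search/in/Alderley Edge"))
```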

Doesn't collect any emails; the "Keywords to Google for" is blank.

I ran this program with several commands, and not only did it not save anything:

dave@dave-HP-EliteBook-8560w:~/Code/Downloaded/python-email-crawler-master$ python email_crawler.py "iphone developers"
[01:53:42] INFO::email_crawler - ----------------------------------------
[01:53:42] INFO::email_crawler - Keywords to Google for: [
[01:53:42] INFO::email_crawler - ----------------------------------------
[01:53:42] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=0
[01:53:42] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=10
[01:53:43] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=20
[01:53:43] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=30

It gave me a corrupted csv when I tried to generate one:

\D0\CF�ࡱ�\E1\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00;\00�\00\FE\FF    \00�\00\00\00\00\00\00\00\00\00\00\00�\00\00\00�\00\00\00\00\00\00\00\00�\00\00\FE\FF\FF\FF\00\00\00\00\FE\FF\FF\FF\00\00\00\00\00\00\00\00\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\F

What's happening here?

Improve protection against SQL injection and other injection attacks

Looking at the source, I think it needs improvement to guard against SQL injection and other kinds of injection. Maybe that would be good.

def test(self):
    c = CrawlerDb()
    c.connect()
    # c.enqueue(['a12222', '11'])
    # c.enqueue(['dddaaaaaa2', '22'])
    c.enqueue('111')
    c.enqueue('222')
    website = c.dequeue()
    c.crawled(website)
    website = c.dequeue()
    c.crawled(website, "a,b")
    print '---'
    c.dequeue()

CrawlerDb().test()

Not sure why this is messing up...

Traceback (most recent call last):
  File "email_crawler.py", line 12, in <module>
    logging.config.dictConfig(LOGGING)
  File "C:\Python27\lib\logging\config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "C:\Python27\lib\logging\config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

Can anyone help?
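One possible workaround (an assumption, not an official fix): configure the console handler with the standard logging.StreamHandler, which needs no curses, instead of the bundled ColorStreamHandler that the traceback shows failing on the missing _curses module.

```python
import logging
import logging.config

# Minimal replacement config; the "console" handler name mirrors the
# traceback, the rest is a plain-StreamHandler sketch.
LOGGING = {
    "version": 1,
    "handlers": {
        "console": {"class": "logging.StreamHandler", "level": "INFO"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
logging.config.dictConfig(LOGGING)
logging.getLogger("email_crawler").info("console logging works without curses")
```

Colored output is lost, but the crawler's log messages still reach the console.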

SyntaxError: multiple exception types must be parenthesized

I am getting this error right after I execute the script. Here's the try/except that generates it:

try:
    logger.info("Crawling %s" % url)
    request = urllib2.urlopen(req)
except urllib2.URLError, e:
    logger.error("Exception at url: %s\n%s" % (url, e))

Any idea what might be wrong? Thanks.
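This is the Python 2 except syntax being run under Python 3, where `except X, e:` is invalid. A sketch of the Python 3 equivalent (urllib2 was split into urllib.request and urllib.error in Python 3):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Fetch a page, logging and swallowing URL errors (sketch)."""
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:   # Python 3: `as e`, not `, e`
        print("Exception at url: %s\n%s" % (url, e))
        return None
```

Running the unmodified script under Python 2.7 also avoids the SyntaxError.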

except urllib2.URLError, e:

HaoyuedeMacBook:python-email-crawler haoyue$ python email_crawler.py "iphone developers"
  File "email_crawler.py", line 96
    except urllib2.URLError, e:
                           ^
SyntaxError: invalid syntax

I ran this code two weeks ago and it worked well, but it threw a SyntaxError today when I tried to run it again. What does this mean? Sorry, I am using this crawler for research; thank you for your code.

TypeError: expected string or buffer

Sometimes it runs, sometimes it doesn't.

[14:22:38] INFO::email_crawler - Crawling http://www.google.com.au/search?q=electrician&start=0
[14:22:39] ERROR::email_crawler - Exception at url: http://www.google.com.au/search?q=electrician&start=0
HTTP Error 503: Service Unavailable
[14:22:39] ERROR::email_crawler - EXCEPTION: expected string or buffer 

How to search for multiple keywords

Hey,

Is it possible to search for multiple keywords at the same time?

For instance I would like to enter the keywords that I need in the beginning, so it will work on the keywords one by one.

Thanks for the help
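Not built in, as far as the README shows, but a small wrapper can loop over a keyword list and invoke the crawler once per keyword (a sketch; each entry becomes a separate run):

```python
def commands_for(keywords):
    """Build one email_crawler.py invocation per keyword (sketch)."""
    return [["python", "email_crawler.py", kw] for kw in keywords]

for cmd in commands_for(["iphone developers", "android developers"]):
    print(" ".join(cmd))
    # subprocess.call(cmd)  # uncomment to actually run each crawl
```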

Search results don't change

No matter what keywords I select to search, the program only searches for the keywords from the first time the program was run.

Example:

The first test search was "iphone developers"; the results were as expected.

The second search was "android developers"; it still searched as if it were looking for "iphone developers".

ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

When I run it, I get an error:

Traceback (most recent call last):
  File "index.py", line 12, in <module>
    logging.config.dictConfig(LOGGING)
  File "C:\Python27\lib\logging\config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "C:\Python27\lib\logging\config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

urllib

Could you please advise on:

a) how to resolve:

python email_crawler.py "iphone developers"
  File "email_crawler.py", line 96
    except urllib2.URLError, e:
                           ^
SyntaxError: invalid syntax

b) how to crawl emails from a specific country, e.g. all emails from .it or .com domains

Thanks
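For (b), one sketch is to filter result URLs by hostname TLD before crawling (this assumes access to the list of result URLs; it is not a built-in flag of the script):

```python
from urllib.parse import urlparse

def in_country(url, tld=".it"):
    """Keep only URLs whose hostname ends with the given country TLD."""
    host = urlparse(url).hostname or ""
    return host.endswith(tld)
```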
