
python-email-crawler's Introduction

Python Email Crawler

This Python script searches Google for the given keywords, crawls the webpages in the results, and returns all emails found.

Requirements

  • sqlalchemy
  • urllib2 (part of the Python 2 standard library)

If you don't have sqlalchemy, simply sudo pip install sqlalchemy.

Usage

Start the search with a keyword. We use "iphone developers" as an example.

python email_crawler.py "iphone developers"

The search and crawling process will take quite a while, as it retrieves up to 500 search results (from Google) and crawls up to 2 levels deep. It should crawl around 10,000 webpages :)

After the process finishes, run this command to get the list of emails:

python email_crawler.py --emails

The emails will be saved in ./data/emails.csv

python-email-crawler's People

Contributors

dcondrey, jundaong, kevingatera, samwize


python-email-crawler's Issues

URL with UTF-8 char triggers crash

The crawler crashes when a URL containing a non-ASCII character (e.g. 'ß') is encountered.

Crash log:

...
[17:11:08] INFO::email_crawler - Crawling https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-weißenbach
[17:11:09] ERROR::email_crawler - EXCEPTION: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'
Traceback (most recent call last):
  File "email_crawler.py", line 217, in <module>
    crawl(arg)
  File "email_crawler.py", line 81, in crawl
    email_set = find_emails_2_level_deep(uncrawled.url)
  File "email_crawler.py", line 143, in find_emails_2_level_deep
    db.enqueue(link, list(email_set))
  File "/home/tomas/python-email-crawler2/database.py", line 35, in enqueue
    res = self.connection.execute(s)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
ProgrammingError: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'] [parameters: ('https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-wei\xc3\x9fenbach',)]

However, it may be that the issue is located in the SQLAlchemy library; I don't know for sure.
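A possible workaround on the caller's side, sketched under the assumption that the URL can be decoded to text before it reaches db.enqueue (the helper below is hypothetical, not part of the project):

```python
def to_unicode(url, encoding="utf-8"):
    """Hypothetical helper: ensure a URL is a text string rather than bytes,
    so pysqlite never receives an 8-bit bytestring parameter."""
    if isinstance(url, bytes):
        return url.decode(encoding, errors="replace")
    return url

# The URL from the crash log, as raw bytes the crawler might extract:
raw = b"https://www.paginegialle.it/valle-aurina-bz/alpinwellt-wei\xc3\x9fenbach"
print(to_unicode(raw))  # decodes the 'ss' character instead of crashing in SQLite
```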

Redundant url searches

When scanning a large site, if the same URL appears on multiple pages, it will be scanned once for each page it appears on.

If there were a way to remember that a URL has already been scanned, so it can be skipped when found a second (or later) time, it would save a lot of time.
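A minimal sketch of the suggested bookkeeping (not the project's actual code): keep a set of visited URLs and skip any URL seen before.

```python
# Track every URL that has been scanned so repeats can be skipped.
visited = set()

def should_crawl(url):
    """Return True only the first time a URL is seen."""
    if url in visited:
        return False
    visited.add(url)
    return True
```

The first call for a given URL returns True; every later call for the same URL returns False.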

Crawl based on a CSV list rather than sys arg?

Is it possible to crawl for emails based on a business name and address in a CSV file?
For example, could I use the info below to crawl for emails from all these places?

BorrowerName | BorrowerAddress | BorrowerCity | BorrowerState | BorrowerZip
PACIFIC COAST FIRE | 470 DIVISION ST | CAMPBELL | CA | 95008
CAREY VISION MEDICAL GROUP, INC. | 25621 Deerfield Drive | Los Altos Hills | CA | 94022
POINT DIGITAL FINANCE, INC. | 444 High Street, FL 4 | Palo Alto | CA | 94301
ASSOCIATED PATHOLOGY MEDICAL GROUP, INC. | 105a cooper ct | los gatos | CA | 95032
GT4 EVENTS | 475 Vandell Way | Campbell | CA | 95008
COULTER CONSTRUCTION, INC. | 1961 Old Middlefield Way | Mountain View | CA | 94043
ZENFOLIO, INC. | 3515-A Edison Way | Menlo Park | CA | 94025

Thank you!
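One possible approach (a sketch, not a supported feature; the column names follow the table above): build one search keyword per CSV row, then run the crawler once for each keyword.

```python
import csv
import io

def keywords_from_csv(text):
    """Yield one Google search keyword per business row (sketch)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield "%s %s %s" % (row["BorrowerName"], row["BorrowerCity"],
                            row["BorrowerState"])

sample = ("BorrowerName,BorrowerCity,BorrowerState\n"
          "PACIFIC COAST FIRE,CAMPBELL,CA\n")
for kw in keywords_from_csv(sample):
    print(kw)  # each keyword could then be passed to email_crawler.py
```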

Error when executing (TypeError: expected String or buffer)

user@meta:~/mailcrawl$ git clone https://github.com/samwize/python-email-crawler.git  
Cloning into 'python-email-crawler'...
remote: Enumerating objects: 63, done.
remote: Total 63 (delta 0), reused 0 (delta 0), pack-reused 63
Unpacking objects: 100% (63/63), done.
user@meta:~/mailcrawl$ cd python-email-crawler/
user@meta:~/mailcrawl/python-email-crawler$ python email_crawler.py test
[11:51:12] INFO::email_crawler - ----------------------------------------
[11:51:12] INFO::email_crawler - Keywords to Google for: test
[11:51:12] INFO::email_crawler - ----------------------------------------
[11:51:12] INFO::email_crawler - Crawling http://www.google.com/search?q=test&start=0
[11:51:12] ERROR::email_crawler - Exception at url: http://www.google.com/search?q=test&start=0
HTTP Error 503: Service Unavailable
[11:51:12] ERROR::email_crawler - EXCEPTION: expected string or buffer 
Traceback (most recent call last):
  File "email_crawler.py", line 212, in <module>
    crawl(arg)
  File "email_crawler.py", line 65, in crawl
    for url in google_url_regex.findall(data):
TypeError: expected string or buffer
user@meta:~/mailcrawl/python-email-crawler$ 

I get the above when running the command.
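The 503 from Google is the likely trigger: the page body comes back as None, and the regex then fails with this TypeError. A defensive sketch (the name google_url_regex mirrors the traceback; the pattern itself is an assumption, not the script's real one):

```python
import re

# Placeholder pattern; the real one in email_crawler.py may differ.
google_url_regex = re.compile(r'href="(http[^"]+)"')

def extract_links(data):
    """Return result links, tolerating a None body after an HTTP error."""
    if not data:            # e.g. HTTP Error 503 left nothing to parse
        return []
    return google_url_regex.findall(data)
```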

Recognised images as emails

Example:

ogo_website_200x60_1fe006a7-6795-4ef6-ab21-77ce25ef0772_160x@2x.png

A possible solution may be to add an allow-list of TLDs.
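A sketch of that allow-list idea (the TLD set is an example, not a project constant): drop "emails" whose TLD is not on the list, which filters out image names like the one above.

```python
ALLOWED_TLDS = {"com", "net", "org", "edu", "it", "de", "uk"}  # example list

def looks_like_email(candidate):
    """Crude filter for regex-extracted email candidates (hypothetical)."""
    if "@" not in candidate:
        return False
    tld = candidate.rsplit("@", 1)[-1].split(".")[-1].lower()
    return tld in ALLOWED_TLDS
```

An image filename such as `logo_160x@2x.png` ends in "png", which is not in the allow-list, so it is rejected.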

sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None

I can't get the script to work. What does this mean?

"sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None"

The output I get:

python email_crawler.py "south london photographers"
james-larkins-MacBook-Pro:~ jameslarkin$ python email_crawler.py "south london photographers"
/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python: can't open file 'email_crawler.py': [Errno 2] No such file or directory
james-larkins-MacBook-Pro:~ jameslarkin$ cd /users/python
james-larkins-MacBook-Pro:python jameslarkin$ python email_crawler.py "south london photographers"
Traceback (most recent call last):
  File "email_crawler.py", line 31, in <module>
    db.connect()
  File "/Users/python/database.py", line 15, in connect
    self.connection = self.engine.connect()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1649, in connect
    return self._connection_cls(self, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 59, in __init__
    self.connection = connection or engine.raw_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1707, in raw_connection
    return self.pool.unique_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 220, in unique_connection
    return _ConnectionFairy(self).checkout()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 425, in __init__
    rec = self._connection_record = pool._do_get()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 855, in _do_get
    return self._create_connection()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/pool.py", line 368, in __connect
    connection = self.__pool._creator()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 279, in connect
    return self.dbapi.connect(*cargs, **cparams)
sqlalchemy.exc.OperationalError: (OperationalError) unable to open database file None None
james-larkins-MacBook-Pro:python jameslarkin$

It keeps searching forever.

It just searches everywhere... I sometimes think it searches every single website.

Nothing is wrong; it just does not stop. I can see the websites it is searching, but after 1 to 1.5 hours it is still crawling unrelated websites with no end in sight.

Are my keywords wrong? What can I do?

0 emails saved

After crawling, I always get the same "0 emails" message:

[11:41:50] INFO::email_crawler - ========================================
[11:41:50] INFO::email_crawler - Processing...
[11:41:50] INFO::email_crawler - There are 0 emails
[11:41:50] INFO::email_crawler - All emails saved to ./data/emails.csv
[11:41:50] INFO::email_crawler - ========================================

Licence

What is the licence for this script?

Should I assume that, because it is on GitHub, it is GPL?

Empty spaces in the URL are not handled properly

I have an example URL, http://www.website.com/Search/in/Alderley Edge, and it fails to fetch the page even though it exists. I got this as the response:

[22:57:43] ERROR::email_crawler - Exception at url: http://www.website.com/Search/in/Alderley Edge HTTP Error 400: Bad Request
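A sketch of one fix: percent-encode unsafe characters (such as spaces) before fetching, so the server receives a valid request line.

```python
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2, where the crawler runs

def encode_url(url):
    """Percent-encode a URL while keeping the URL delimiters intact."""
    return quote(url, safe=":/?&=")

print(encode_url("http://www.website.com/Search/in/Alderley Edge"))
```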

Doesn't collect any emails; the "Keywords to Google for" is blank.

I ran this program with several commands, and not only did it not save anything:

dave@dave-HP-EliteBook-8560w:~/Code/Downloaded/python-email-crawler-master$ python email_crawler.py "iphone developers"
[01:53:42] INFO::email_crawler - ----------------------------------------
[01:53:42] INFO::email_crawler - Keywords to Google for: [
[01:53:42] INFO::email_crawler - ----------------------------------------
[01:53:42] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=0
[01:53:42] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=10
[01:53:43] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=20
[01:53:43] INFO::email_crawler - Crawling http://www.google.com/search?q=%5B%5D&start=30

It gave me a corrupted csv when I tried to generate one:

\D0\CF�ࡱ�\E1\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00;\00�\00\FE\FF    \00�\00\00\00\00\00\00\00\00\00\00\00�\00\00\00�\00\00\00\00\00\00\00\00�\00\00\FE\FF\FF\FF\00\00\00\00\FE\FF\FF\FF\00\00\00\00\00\00\00\00\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\FF\F

What's happening here?

Improve protection against SQL injection and other injection attacks

Looking at the source, I think it needs improvement to guard against SQL injection and other kinds of injection. Maybe that would be good.

def test(self):
    c = CrawlerDb()
    c.connect()
    # c.enqueue(['a12222', '11'])
    # c.enqueue(['dddaaaaaa2', '22'])
    c.enqueue('111')
    c.enqueue('222')
    website = c.dequeue()
    c.crawled(website)
    website = c.dequeue()
    c.crawled(website, "a,b")
    print '---'
    c.dequeue()

CrawlerDb().test()

Not sure why this is messing up...

Traceback (most recent call last):
  File "email_crawler.py", line 12, in <module>
    logging.config.dictConfig(LOGGING)
  File "C:\Python27\lib\logging\config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "C:\Python27\lib\logging\config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

Can anyone help?
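One possible workaround (an assumption, not an official fix): configure the console handler with the standard logging.StreamHandler, which needs no curses, instead of the bundled ColorStreamHandler that the traceback shows failing on the missing _curses module.

```python
import logging
import logging.config

# Minimal replacement config; the "console" handler name mirrors the
# traceback, the rest is a plain-StreamHandler sketch.
LOGGING = {
    "version": 1,
    "handlers": {
        "console": {"class": "logging.StreamHandler", "level": "INFO"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
logging.config.dictConfig(LOGGING)
logging.getLogger("email_crawler").info("console logging works without curses")
```

Colored output is lost, but the crawler's log messages still reach the console.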

SyntaxError: multiple exception types must be parenthesized

I am getting this error right after I execute the script. Here's the try/except that generates it:

try:
    logger.info("Crawling %s" % url)
    request = urllib2.urlopen(req)
except urllib2.URLError, e:
    logger.error("Exception at url: %s\n%s" % (url, e))

Any idea what might be wrong? Thanks.
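This is the Python 2 except syntax being run under Python 3, where `except X, e:` is invalid. A sketch of the Python 3 equivalent (urllib2 was split into urllib.request and urllib.error in Python 3):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Fetch a page, logging and swallowing URL errors (sketch)."""
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:   # Python 3: `as e`, not `, e`
        print("Exception at url: %s\n%s" % (url, e))
        return None
```

Running the unmodified script under Python 2.7 also avoids the SyntaxError.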

except urllib2.URLError, e:

HaoyuedeMacBook:python-email-crawler haoyue$ python email_crawler.py "iphone developers"
  File "email_crawler.py", line 96
    except urllib2.URLError, e:
                           ^
SyntaxError: invalid syntax

I ran this code two weeks ago and it worked well, but it threw a SyntaxError today when I tried to run it again. What does this mean? Sorry, I am using this crawler for research; thank you for your code.

TypeError: expected string or buffer

Sometimes it runs, sometimes it doesn't.

[14:22:38] INFO::email_crawler - Crawling http://www.google.com.au/search?q=electrician&start=0
[14:22:39] ERROR::email_crawler - Exception at url: http://www.google.com.au/search?q=electrician&start=0
HTTP Error 503: Service Unavailable
[14:22:39] ERROR::email_crawler - EXCEPTION: expected string or buffer 

How to search for multiple keywords

Hey,

Is it possible to search for multiple keywords at the same time?

For instance I would like to enter the keywords that I need in the beginning, so it will work on the keywords one by one.

Thanks for the help
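Not built in, as far as the README shows, but a small wrapper can loop over a keyword list and invoke the crawler once per keyword (a sketch; each entry becomes a separate run):

```python
def commands_for(keywords):
    """Build one email_crawler.py invocation per keyword (sketch)."""
    return [["python", "email_crawler.py", kw] for kw in keywords]

for cmd in commands_for(["iphone developers", "android developers"]):
    print(" ".join(cmd))
    # subprocess.call(cmd)  # uncomment to actually run each crawl
```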

Search results don't change

No matter what keywords I select to search, the program only searches for the keywords from the first time the program was run.

Example:

The first test search was "iphone developers"; the results were as expected.

The second search was "android developers"; it still searched as if it were looking for "iphone developers".

ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

When I run it, I get an error:

Traceback (most recent call last):
  File "index.py", line 12, in <module>
    logging.config.dictConfig(LOGGING)
  File "C:\Python27\lib\logging\config.py", line 794, in dictConfig
    dictConfigClass(config).configure()
  File "C:\Python27\lib\logging\config.py", line 576, in configure
    '%r: %s' % (name, e))
ValueError: Unable to configure handler 'console': Cannot resolve 'ColorStreamHandler.ColorStreamHandler': No module named _curses

urllib

Could you please advise on:

a) how to resolve:

python email_crawler.py "iphone developers"
  File "email_crawler.py", line 96
    except urllib2.URLError, e:
                           ^
SyntaxError: invalid syntax

b) how to crawl emails from a specific country, e.g. all emails from .it or .com domains

Thanks
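For (b), one sketch is to filter result URLs by hostname TLD before crawling (this assumes access to the list of result URLs; it is not a built-in flag of the script):

```python
from urllib.parse import urlparse

def in_country(url, tld=".it"):
    """Keep only URLs whose hostname ends with the given country TLD."""
    host = urlparse(url).hostname or ""
    return host.endswith(tld)
```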
