
googlesearch's Introduction

Google Search

Google search from Python.

https://python-googlesearch.readthedocs.io/en/latest/

Note: this project is not affiliated with Google in any way.

Usage example

# Get the first 20 hits for: "Breaking Code" WordPress blog
from googlesearch import search
for url in search('"Breaking Code" WordPress blog', stop=20):
    print(url)

Installing

pip install google

googlesearch's People

Contributors

alphagolf33, alxndr, ashrafibrahim03, brodul, cclauss, depierre, fxfitz, galenwong, geovrodri, grhawk, hugovk, jackson-zhipeng-chang, mariovilas, nzachow, prasadkatti, remohammadi, thespeedx, valhallasw, yashagarwal41


googlesearch's Issues

Unable to see filtered results

Google tacks "&filter=0" onto the URL to represent choosing not to omit similar results. I want to add this functionality but am not sure where to start. I'm fairly new to using git in general; is it possible to make my own branch with this feature? Thanks!
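
A possible starting point, assuming the extra_params argument (visible in the search() signature quoted in a traceback further down this page) is appended verbatim to the query string:

from googlesearch import search

# Hedged sketch: pass filter=0 through extra_params so Google does not
# omit results it considers similar (assuming extra_params values are
# forwarded into the search URL, as the name suggests).
for url in search('"Breaking Code" WordPress blog', stop=20,
                  extra_params={'filter': '0'}):
    print(url)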

urllib.error.HTTPError: HTTP Error 429: Too Many Requests

try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")

xx = []
query = ["zenobase", "exist.io", "gyrosco.pe", "quantimodo"]
print(query)
# Also tried a custom user agent (user_agent="rua"), and requests.get(link,
# headers={'User-agent': 'mymerge bot 0.1'}) with the python requests package.
for q in query:
    for j in search(q, tld="com", num=10, stop=2, pause=20):
        xx.append(j)
        print(xx)

This ran fine before my modifications. It may have been caused by my previously setting num=3000 and pause=2.
Using user_agent="rua" did not help. Browsers can still reach Google fine.
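
A workaround sketch (not part of the library): retry with increasing delays when Google answers 429. Note that once Google has rate-limited an IP, the block can last hours, so backing off may not be enough.

import time
from urllib.error import HTTPError
from googlesearch import search

def search_with_backoff(q, retries=3, delay=60.0, **kwargs):
    # Retry a query, sleeping progressively longer after each 429.
    for attempt in range(retries):
        try:
            return list(search(q, **kwargs))
        except HTTPError as e:
            if e.code != 429 or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

urls = search_with_backoff("zenobase", num=10, stop=10, pause=20)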

Fewer hits than expected

I am doing this:

for url in search("Microsoft", stop = 500):

It gives me 137 links instead of 500. Any hints as to why?

How to get title and snippet from result?

Hi,

First of all thank you for making such wonderful and easy to use library.

I am able to make it work, but now I want to get the title and snippet of each result as well, to use further.

Can I do that with this library?

Below is the simple code, which gives me only the URLs.

from googlesearch import search

for url in search('best usb powered monitor', tld='com', stop=10):
    print(url)

And what do I need to consider if I am going to use this library in an enterprise application?

Thanks,
Jayesh (Kod Factory)
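
The library itself only yields URLs. One workaround sketch (not part of googlesearch; requests and BeautifulSoup are extra dependencies): fetch each result and read its <title>.

import requests
from bs4 import BeautifulSoup
from googlesearch import search

for url in search('best usb powered monitor', tld='com', stop=10):
    try:
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Fall back to the bare URL when the page has no <title>.
        title = soup.title.get_text(strip=True) if soup.title else url
    except requests.RequestException:
        title = url
    print(title, '->', url)

Snippets, though, exist only on Google's results page itself, so getting them would mean parsing that HTML directly.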

BeautifulSoup complains about parser

To be completely honest, I have no idea why BeautifulSoup is yelling at me over normal usage of your library, but I get the warning below the first time I try to iterate through the generator that search() returns.

It looks to me like you're calling it correctly, or at least using it the way the error message indicates, so something must be wrong...?

Any thoughts?

Ideally I'd like to squash this error message.

No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

HTTPError: Too Many Requests

Running the sample code

from googlesearch import search
for url in search('"Breaking Code" WordPress blog', stop=20):
    print(url)

I always obtain this error

File "\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Too Many Requests

Search takes too long

Thank you, author, for your work; it's a great library. But it is slow:
a simple search request takes about 3 seconds, while other libraries (for example, PyGoogler) take about 1 second.
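
Most of that time is likely the pause parameter, which (if I recall the source correctly; this is an assumption) defaults to about 2 seconds between HTTP requests to avoid being blocked. A sketch that trades safety for speed:

from googlesearch import search

# Lowering pause speeds things up, but raises the risk of HTTP 429.
for url in search('"Breaking Code" WordPress blog', stop=10, pause=0.5):
    print(url)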

Cannot import search

I've read issue #39 but it did me no good.
I'm using Python 3.5. When I type

>>> from googlesearch import search

this is what I get:

ImportError: cannot import name 'search'

Everything else is fine; I can import the library itself, just not search:

>>> dir(googlesearch)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']

>>> googlesearch
<module 'googlesearch' from 'C:\\Users\\abdelmalek\\AppData\\Local\\Programs\\Python\\Python35\\lib\\site-packages\\googlesearch\\__init__.py'>

What about pagination?

Hello,

Thank you for this module, I really like it, but does it have pagination support?

Y.
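
There is no explicit page object, but the start, stop and num parameters in the search() signature (quoted in a traceback further down this page) can emulate pages. A sketch, with the caveat that each call issues a fresh search:

from googlesearch import search

page_size = 10
for page in range(3):  # first three "pages" of results
    first = page * page_size
    for url in search('"Breaking Code" WordPress blog',
                      start=first, stop=first + page_size, num=page_size):
        print(page, url)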

Even with the .com TLD, getting results from a local Google URL

Hi,

I was trying to use the script to parse results, but I am facing issues when I set the TLD to com: the results are being returned from the local Google URL (in my case .co.in).

Can you please help?

Regards,

Nitin

Scraping

I want to retrieve the metadata along with the URLs. Is there a way this can be done with this code?

Searching in Persian doesn't produce consistent results

I have tried this module, but I have noticed that when I search Persian terms, the results are not consistent.

Besides giving very different results from what I get from Google itself (in a private window, not logged in), sometimes the order of the results differs (even when I search again after 2 minutes).

Is this normal behavior?

Add Documentation

Including:

  • Installation
  • Link to package installation (e.g. PyPI)
  • Version
  • Methods and Classes
  • Pagination
  • Etc.

ngd function is not working.

Trying to call the ngd function with some basic terms

    term1 = 'macbeth'
    term2 = 'shakespeare'
    relationship = ngd(term1, term2)
    print(relationship)

And I'm getting the error below. I've tried a couple of other terms, but this query should definitely come up with some results...

  File "remake.py", line 132, in <module>
    relationship = ngd(term1, term2)
  File "/lib/python3.7/site-packages/googlesearch/__init__.py", line 766, in ngd
    lhits1 = math.log10(hits(term1))
  File "/lib/python3.7/site-packages/googlesearch/__init__.py", line 745, in hits
    tag = soup.find_all(attrs={"class": "sd", "id": "resultStats"})[0]
IndexError: list index out of range
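
The IndexError says the resultStats element that hits() scrapes for the result count is missing, so a change in Google's markup, not the terms, is the likely culprit. For reference, a sketch of the Normalized Google Distance that ngd() appears to compute (index_size, the assumed number of pages Google indexes, is a guess):

import math

def ngd_sketch(hits1, hits2, hits_both, index_size=1e10):
    # NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
    #             / (log N - min(log f(x), log f(y)))
    lx, ly = math.log10(hits1), math.log10(hits2)
    lxy = math.log10(hits_both)
    return (max(lx, ly) - lxy) / (math.log10(index_size) - min(lx, ly))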

Missing Catch for Google's 503

I feel the library is missing a parameter that lets the user supply a function to be run when it detects that Google returned a 503. That way, a user could implement a function that receives the HTML, deals with the reCAPTCHA, and keeps up the fast pace of crawling.
When this situation happens, Google issues this redirect: https://www.google.com/sorry/index
I've attached the HTML from that page (Error Page.zip) to use as a base for coding, or I can even help add this improvement myself; just let me know what you prefer.
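
A sketch of the suggested hook, wrapped around the existing API from the outside rather than patched into it (the on_blocked callback is hypothetical):

from urllib.error import HTTPError
from googlesearch import search

def search_with_503_hook(q, on_blocked, **kwargs):
    # Yield results normally; hand the CAPTCHA page to the caller on 503.
    try:
        for url in search(q, **kwargs):
            yield url
    except HTTPError as e:
        if e.code == 503:
            on_blocked(e.read())  # HTTPError bodies are readable
        else:
            raise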

Cannot import name search

I've installed google via pip for both Python 2 (system distro) and Python 3 (Anaconda). I'm using Ubuntu 16.04. When I try to import anything, like search, it says

ImportError: cannot import name 'search'

That's after doing

from google import search

The same thing happens in Python 2 and 3.

When I import google and check the 'dir', it says:
(python3)

In [84]: dir(google)
Out[84]: ['__doc__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

(python2)

In [3]: dir(google)
Out[3]: ['__doc__', '__name__', '__path__']

Any ideas what's going on? The __init__.py file looks fine where it's installed, which I found from google.__path__. But I thought that for most packages, .__file__ was the way to get the file path of the module.

Add an option to not exclude results

Message: In order to show you the most relevant results, we have omitted some entries very similar to the 392 already displayed. If you like, you can repeat the search with the omitted results included.

Appending ?filter=0 prevents Google from filtering these out.

License

Hi Mario, do you have a license for your code?!

Unable to use custom date ranges

The tbs parameter doesn't work with custom date ranges.

tbs has no effect when you input

cdr%3A1%2Ccd_min%3A1%2F1%2F2019%2Ccd_max%3A2%2F1%2F2019
or
cdr:1,cd_min:01/01/2019,cd_max:02/01/2019

It just returns the default results. These strings work when entered directly into the Google URL via Chrome. I've mainly been trying to get this to work for the search_news function, but the issue applies to everything.

Thanks

Python 3.7.3 Compatibility

Hello, I've just installed google-2.0.2 from pip. I ran the sample code and nothing printed at all. I cloned the repository from GitHub, tried again, and still nothing printed from the for loop. All I want is to perform a Google search programmatically and get a URL back. The following is what happened:

>>> for url in search('"Breaking Code" WordPress blog', stop=20):
...     print(url)
...
>>>

Would it be possible to enable proxy support?

When using the code on a VPS like digital ocean, I run into issues like this:

    result = func(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

This is even with an explicit user agent.
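
Until the library grows a proxy option, a workaround sketch: googlesearch fetches pages with urllib, and urlopen() honors a globally installed opener, so a ProxyHandler can route its traffic through a proxy (the proxy address below is hypothetical):

import urllib.request
from googlesearch import search

proxy = urllib.request.ProxyHandler({
    'http': 'http://203.0.113.10:8080',   # hypothetical proxy
    'https': 'http://203.0.113.10:8080',
})
# install_opener() affects every subsequent urlopen() call,
# including the ones googlesearch makes internally.
urllib.request.install_opener(urllib.request.build_opener(proxy))

for url in search('"Breaking Code" WordPress blog', stop=10):
    print(url)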

domain search issue

Hi @MarioVilas !
So, I found that you filter out URLs that contain "google", like 'image.google.com'; I'm just wondering what the reason for that is? I need to search the 'support.google.com' domain now, and it works pretty well after I removed the "if 'google' not in o.netloc" check.
Thanks!

Error from BeautifulSoup while running google.py script

While running on Windows with Py2.6 I get this:

$ python google.py test
c:\Python26\lib\site-packages\beautifulsoup4-4.3.2-py2.6.egg\bs4\builder\_htmlparser.py:163: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
Traceback (most recent call last):
  File "google.py", line 274, in <module>
    for url in search(query, **params):
  File "google.py", line 196, in search
    soup = BeautifulSoup(html)
  File "build\bdist.win32\egg\bs4\__init__.py", line 196, in __init__
  File "build\bdist.win32\egg\bs4\__init__.py", line 210, in _feed
  File "build\bdist.win32\egg\bs4\builder\_htmlparser.py", line 164, in feed
HTMLParser.HTMLParseError: junk characters in start tag: u'{t:119}); class=gbzt', at line 1, column 37580

Suggestion to have a simpler date-range API

Useful package, thanks! One suggestion is to offer an easier date-range API than relying on the Google version.

The first issue is that this relies on the tbs parameter, which is a little cryptic. Maybe you could have a daterange parameter that formats the tbs appropriately, e.g. I use the code below. I am not sure, though, whether tbs is used for other things apart from dates.

Secondly, the tbs date range seems to be ignored if one uses old user agents. This includes your default and random user agents. There is a package, fake_useragent, which grabs the latest ones.

from urllib.parse import urlencode  # missing from the original snippet

def get_tbs(fromDate, toDate):
    """Return the Google search tbs parameter for a date range.

    :param fromDate: python date
    :param toDate:   python date
    :return: urlencoded tbs value (excluding the leading "tbs=")
    """
    # Format both dates as m/d/yyyy, the format Google expects.
    fromDate = fromDate.strftime("%m/%d/%Y")
    toDate = toDate.strftime("%m/%d/%Y")
    return urlencode(dict(tbs=f"cdr:1,cd_min:{fromDate},cd_max:{toDate}"))[4:]
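
For example (the dates are arbitrary):

from datetime import date

tbs = get_tbs(date(2019, 1, 1), date(2019, 2, 1))
# -> 'cdr%3A1%2Ccd_min%3A01%2F01%2F2019%2Ccd_max%3A02%2F01%2F2019'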

Search returning a different number of results than specified

Hey,

I have been using your code for my undergrad research thesis and it is working fine. However, I am having an issue: the search function returns a different number of URLs than specified.

search(query, lang='en', stop=10, pause=3.0) should return 10 results, but sometimes it returns 9, 14, or 6 results.

No such file or directory: '/home/wsgi/.google-cookie'

Hi Mario,

Thanks for putting time into building this; it's been incredibly useful. I'm trying to run your package on a remote AWS EC2 instance, and I've been unable to because I get the following error:

<type 'exceptions.IOError'>
[Errno 2] No such file or directory: '/home/wsgi/.google-cookie'

I looked through your source code and saw that you actually do reference .google-cookie here:

cookie_jar = LWPCookieJar(os.path.join(home_folder, '.google-cookie'))
try:
    cookie_jar.load()
except Exception:

I haven't been able to figure out how to get this working on my EC2 instance. Do you have any suggestions for what the issue is? Is there a specific cookie I need to add to my environment? Any help you can provide would be greatly appreciated.

Thanks.

-Daniel
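
A hedged workaround sketch: the cookie-jar path is built from the home folder when the module is imported (as the excerpt above shows), so pointing HOME at a writable directory before the import may sidestep the error on servers where the WSGI user's home does not exist:

import os
os.environ['HOME'] = '/tmp'  # any directory the process can write to

# Import only after HOME is set, since the path is computed at import time.
from googlesearch import search

for url in search('"Breaking Code" WordPress blog', stop=5):
    print(url)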

First try was an EPIC Fail!

This happened the first time, the 2nd, the 3rd, and so on...

Traceback (most recent call last):
  File "googleParse.py", line 7, in <module>
    for url in google.search('google 1.9.1 python', tld='com.pk', lang='es', stop=5):
AttributeError: 'module' object has no attribute 'search'

--stop not working as expected when --stop < --num

I was expecting the following to return 5 results, but it returns more. Looking at the code, this is because --num defaults to 10, and everything on the page is returned before --stop is applied.

$ google --stop 5 arsenal
https://www.arsenal.com/
https://twitter.com/arsenal
http://www.skysports.com/arsenal
https://www.youtube.com/channel/UCpryVRk_VDudG8SHXgWcG0w
https://www.facebook.com/Arsenal/
https://en.wikipedia.org/wiki/Arsenal_F.C.
https://www.independent.co.uk/topic/Arsenal
https://www.premierleague.com/clubs/1/Arsenal/overview
https://www.theguardian.com/football/arsenal

Also, in this case I gave a --num option, but since only 4 results came back, it looks like there must have been a duplicate among the 5 returned?

$ google --stop 5 --num 5 arsenal
https://www.arsenal.com/
https://twitter.com/arsenal
http://www.skysports.com/arsenal
https://www.youtube.com/channel/UCpryVRk_VDudG8SHXgWcG0w

Python 3.6+ usage

In the documentation, please add that pip3 cannot currently be used to install the package (at least, it did not work on my computer).

Moreover, the program fetches the search URLs over HTTPS with certificate verification, which breaks on OS X Yosemite+ with Python 3.6+. To fix this, Python users on OS X need to install updated certificates, which do not come by default with the Mac installation of Python. The command to run is /Applications/Python\ 3.6/Install\ Certificates.command, as documented in /Applications/Python\ 3.6/ReadMe.rtf.

Specifically, if users do not install this and attempt to run (as I did), they will run into errors akin to "URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed".

For more information, especially about the latter, see this StackOverflow post: https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error

TypeError: '<' not supported between instances of 'int' and 'str'

Hey Mario,

I know this is a common issue in Python, but I'm not sure how to fix it in your source code.

File "/googlesearch/__init__.py", line 345, in search
    while not stop or start < stop:
TypeError: '<' not supported between instances of 'int' and 'str'

Thanks for putting this together. Can't wait to try it!
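
The failing comparison is start < stop, which suggests stop (or start) was passed as a string rather than an int. A minimal sketch of the fix:

from googlesearch import search

# for url in search('...', stop='20'):  # str vs int -> TypeError
for url in search('"Breaking Code" WordPress blog', stop=20):  # int works
    print(url)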

Fetch the next n URLs

Hi,

I'd like a way to fetch the next n URLs.
What I am trying to do: after fetching the URLs, I fetch the contents from each web source, but if the results are not appropriate, I fetch new URLs for the same search string.
For example, first I fetched 10 URLs, and now I need the next 5. Please help me out.

Searching Google Based Websites

Issue:
Script is not able to search for or find google-based websites as results

Reproducible Use:
x=google.search("site:google.com apple")

Expected Result:
"feedproxy.google.com/~r/zdnet/Apple/~3/5eQ2k0HRMgc/"

Actual Result:
No site found (in fact a ban from Google for 3-6 hours)

tld `org` doesn't work

for website in search("en.une UNE 66926", tld="org",num=10, stop=3, pause=4):
    print(website)

Output

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-60-393ef278d240> in <module>()
      1 # UNE
----> 2 for website in search("en.une UNE 66926", tld="org",num=10, stop=3, pause=4):
      3     print(website)

~/miniconda3/envs/jay/lib/python3.6/site-packages/googlesearch/__init__.py in search(query, tld, lang, tbs, safe, num, start, stop, domains, pause, only_standard, extra_params, tpe, user_agent)
    286 
    287         # Request the Google Search results page.
--> 288         html = get_page(url, user_agent)
    289 
    290         # Parse the response and process every anchored URL.

~/miniconda3/envs/jay/lib/python3.6/site-packages/googlesearch/__init__.py in get_page(url, user_agent)
    152     request.add_header('User-Agent', user_agent)
    153     cookie_jar.add_cookie_header(request)
--> 154     response = urlopen(request)
    155     cookie_jar.extract_cookies(response, request)
    156     html = response.read()

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    530         for processor in self.process_response.get(protocol, []):
    531             meth = getattr(processor, meth_name)
--> 532             response = meth(req, response)
    533 
    534         return response

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in http_response(self, request, response)
    640         if not (200 <= code < 300):
    641             response = self.parent.error(
--> 642                 'http', request, response, code, msg, hdrs)
    643 
    644         return response

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in error(self, proto, *args)
    568         if http_err:
    569             args = (dict, 'default', 'http_error_default') + orig_args
--> 570             return self._call_chain(*args)
    571 
    572 # XXX probably also want an abstract factory that knows when it makes

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found
