
googlesearch's Introduction

Google Search

Google search from Python.

https://python-googlesearch.readthedocs.io/en/latest/

Note: this project is not affiliated with Google in any way.

Usage example

# Get the first 20 hits for: "Breaking Code" WordPress blog
from googlesearch import search
for url in search('"Breaking Code" WordPress blog', stop=20):
    print(url)

Installing

pip install google

googlesearch's People

Contributors

alphagolf33, alxndr, ashrafibrahim03, brodul, cclauss, depierre, fxfitz, galenwong, geovrodri, grhawk, hugovk, jackson-zhipeng-chang, mariovilas, nzachow, prasadkatti, remohammadi, thespeedx, valhallasw, yashagarwal41


googlesearch's Issues

Unable to see filtered results

Google tacks "&filter=0" onto the URL to represent choosing not to omit similar results. I want to add this functionality but am not sure where to start. I'm fairly new to using git in general; is it possible to make my own branch with this feature? Thanks!
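
A possible starting point, assuming the extra_params argument (visible in the search() signature quoted in a traceback further down this page) is appended verbatim to the query string:

from googlesearch import search

# Hedged sketch: pass filter=0 through extra_params so Google does not
# omit results it considers similar (assuming extra_params values are
# forwarded into the search URL, as the name suggests).
for url in search('"Breaking Code" WordPress blog', stop=20,
                  extra_params={'filter': '0'}):
    print(url)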

urllib.error.HTTPError: HTTP Error 429: Too Many Requests

try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")

xx = []
query = ["zenobase", "exist.io", "gyrosco.pe", "quantimodo"]
print(query)
# Also tried a custom user agent (user_agent="rua"), and requests.get(link,
# headers={'User-agent': 'mymerge bot 0.1'}) with the python requests package.
for q in query:
    for j in search(q, tld="com", num=10, stop=2, pause=20):
        xx.append(j)
        print(xx)

This ran fine before my modifications. It may have been caused by my previously setting num=3000 and pause=2.
Using user_agent="rua" did not help. Browsers can still reach Google fine.
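
A workaround sketch (not part of the library): retry with increasing delays when Google answers 429. Note that once Google has rate-limited an IP, the block can last hours, so backing off may not be enough.

import time
from urllib.error import HTTPError
from googlesearch import search

def search_with_backoff(q, retries=3, delay=60.0, **kwargs):
    # Retry a query, sleeping progressively longer after each 429.
    for attempt in range(retries):
        try:
            return list(search(q, **kwargs))
        except HTTPError as e:
            if e.code != 429 or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

urls = search_with_backoff("zenobase", num=10, stop=10, pause=20)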

Fewer hits than expected

I am doing this:

for url in search("Microsoft", stop = 500):

It gives me 137 links instead of 500. Any hints as to why?

How to get title and snippet from result?

Hi,

First of all thank you for making such wonderful and easy to use library.

I am able to make it work, but now I want to get the title and snippet of each result as well, to use further.

Can I do that with this library?

Below is the simple code, which gives me only the URLs.

from googlesearch import search

for url in search('best usb powered monitor', tld='com', stop=10):
    print(url)

And what do I need to consider if I am going to use this library in an enterprise application?

Thanks,
Jayesh (Kod Factory)
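
The library itself only yields URLs. One workaround sketch (not part of googlesearch; requests and BeautifulSoup are extra dependencies): fetch each result and read its <title>.

import requests
from bs4 import BeautifulSoup
from googlesearch import search

for url in search('best usb powered monitor', tld='com', stop=10):
    try:
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Fall back to the bare URL when the page has no <title>.
        title = soup.title.get_text(strip=True) if soup.title else url
    except requests.RequestException:
        title = url
    print(title, '->', url)

Snippets, though, exist only on Google's results page itself, so getting them would mean parsing that HTML directly.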

BeautifulSoup complains about parser

To be completely honest, I have no idea why BeautifulSoup is yelling at me over normal usage of your library, but I get the warning below the first time I try to iterate through the generator that search() returns.

It looks to me like you're calling it correctly, or at least using it the way the error message indicates, so something must be wrong...?

Any thoughts?

Ideally I'd like to squash this error message.

No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

HTTPError: Too Many Requests

Running the sample code

from googlesearch import search
for url in search('"Breaking Code" WordPress blog', stop=20):
    print(url)

I always obtain this error

File "\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Too Many Requests

Search takes too long

Thank you, author, for your work; it's a great library. But it is slow:
a simple search request takes about 3 seconds, while other libraries (for example, PyGoogler) take about 1 second.
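
Most of that time is likely the pause parameter, which (if I recall the source correctly; this is an assumption) defaults to about 2 seconds between HTTP requests to avoid being blocked. A sketch that trades safety for speed:

from googlesearch import search

# Lowering pause speeds things up, but raises the risk of HTTP 429.
for url in search('"Breaking Code" WordPress blog', stop=10, pause=0.5):
    print(url)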

Cannot import search

I've read issue #39 but it did me no good.
I'm using Python 3.5. When I type

>>> from googlesearch import search

this is what I get:

ImportError: cannot import name 'search'

Everything else is fine; I can import the library itself, just not search:

>>> dir(googlesearch)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']

>>> googlesearch
<module 'googlesearch' from 'C:\\Users\\abdelmalek\\AppData\\Local\\Programs\\Python\\Python35\\lib\\site-packages\\googlesearch\\__init__.py'>

What about pagination?

Hello,

Thank you for this module, I really like it, but does it have pagination support?

Y.
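
There is no explicit page object, but the start, stop and num parameters in the search() signature (quoted in a traceback further down this page) can emulate pages. A sketch, with the caveat that each call issues a fresh search:

from googlesearch import search

page_size = 10
for page in range(3):  # first three "pages" of results
    first = page * page_size
    for url in search('"Breaking Code" WordPress blog',
                      start=first, stop=first + page_size, num=page_size):
        print(page, url)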

Even with the .com TLD, getting results from a local Google URL

Hi,

I was trying to use the script to parse results, but I am facing issues when I set the TLD to com: the results are being returned from the local Google URL (in my case .co.in).

Can you please help?

Regards,

Nitin

Scraping

I want to retrieve the metadata along with the URLs. Is there a way this can be done with this code?

Searching in Persian doesn't produce consistent results

I have tried this module, but I have noticed that when I search Persian terms, the results are not consistent.

Besides giving very different results from what I get from Google itself (in a private window, not logged in), sometimes the order of the results differs (even when I search again after 2 minutes).

Is this normal behavior?

Add Documentation

Including:

  • Installation
  • Link to package installation (e.g. PyPI)
  • Version
  • Methods and Classes
  • Pagination
  • Etc.

ngd function is not working.

Trying to call the ngd function with some basic terms

    term1 = 'macbeth'
    term2 = 'shakespeare'
    relationship = ngd(term1, term2)
    print(relationship)

And I'm getting the error below. I've tried a couple of other terms, but this query should definitely come up with some results...

  File "remake.py", line 132, in <module>
    relationship = ngd(term1, term2)
  File "/lib/python3.7/site-packages/googlesearch/__init__.py", line 766, in ngd
    lhits1 = math.log10(hits(term1))
  File "/lib/python3.7/site-packages/googlesearch/__init__.py", line 745, in hits
    tag = soup.find_all(attrs={"class": "sd", "id": "resultStats"})[0]
IndexError: list index out of range
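
The IndexError says the resultStats element that hits() scrapes for the result count is missing, so a change in Google's markup, not the terms, is the likely culprit. For reference, a sketch of the Normalized Google Distance that ngd() appears to compute (index_size, the assumed number of pages Google indexes, is a guess):

import math

def ngd_sketch(hits1, hits2, hits_both, index_size=1e10):
    # NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
    #             / (log N - min(log f(x), log f(y)))
    lx, ly = math.log10(hits1), math.log10(hits2)
    lxy = math.log10(hits_both)
    return (max(lx, ly) - lxy) / (math.log10(index_size) - min(lx, ly))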

Missing Catch for Google's 503

I feel the library is missing a parameter that lets the user supply a function to be run when it detects that Google returned a 503. That way, a user could implement a function that receives the HTML, deals with the reCAPTCHA, and keeps up the fast pace of crawling.
When this situation happens, Google issues this redirect: https://www.google.com/sorry/index
I've attached the HTML from that page (Error Page.zip) to use as a base for coding, or I can even help add this improvement myself; just let me know what you prefer.
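
A sketch of the suggested hook, wrapped around the existing API from the outside rather than patched into it (the on_blocked callback is hypothetical):

from urllib.error import HTTPError
from googlesearch import search

def search_with_503_hook(q, on_blocked, **kwargs):
    # Yield results normally; hand the CAPTCHA page to the caller on 503.
    try:
        for url in search(q, **kwargs):
            yield url
    except HTTPError as e:
        if e.code == 503:
            on_blocked(e.read())  # HTTPError bodies are readable
        else:
            raise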

Cannot import name search

I've installed google via pip for both Python 2 (system distro) and Python 3 (Anaconda). I'm using Ubuntu 16.04. When I try to import anything, like search, it says

ImportError: cannot import name 'search'

That's after doing

from google import search

The same thing happens in Python 2 and 3.

When I import google and check the 'dir', it says:
(python3)

In [84]: dir(google)
Out[84]: ['__doc__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

(python2)

In [3]: dir(google)
Out[3]: ['__doc__', '__name__', '__path__']

Any ideas what's going on? The __init__.py file looks fine where it's installed, which I found from google.__path__. But I thought that for most packages, .__file__ was the way to get the file path of the module.

Add an option to not exclude results

Message: In order to show you the most relevant results, we have omitted some entries very similar to the 392 already displayed. If you like, you can repeat the search with the omitted results included.

Appending ?filter=0 prevents Google from filtering these out.

License

Hi Mario, do you have a license for your code?!

Unable to use custom date ranges

The tbs parameter doesn't work with custom date ranges.

tbs has no effect when you input

cdr%3A1%2Ccd_min%3A1%2F1%2F2019%2Ccd_max%3A2%2F1%2F2019
or
cdr:1,cd_min:01/01/2019,cd_max:02/01/2019

It just returns the default results. These strings work when entered directly into the Google URL via Chrome. I've mainly been trying to get this to work for the search_news function, but the issue applies to everything.

Thanks

Python 3.7.3 Compatibility

Hello, I've just installed google-2.0.2 from pip. I ran the sample code and nothing printed at all. I cloned the repository from GitHub, tried again, and still nothing printed from the for loop. All I want is to perform a Google search programmatically and get a URL back. The following is what happened:

>>> for url in search('"Breaking Code" WordPress blog', stop=20):
...     print(url)
...
>>>

Would it be possible to enable proxy support?

When using the code on a VPS like digital ocean, I run into issues like this:

    result = func(*args)
  File "/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

This is even with an explicit user agent.
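
Until the library grows a proxy option, a workaround sketch: googlesearch fetches pages with urllib, and urlopen() honors a globally installed opener, so a ProxyHandler can route its traffic through a proxy (the proxy address below is hypothetical):

import urllib.request
from googlesearch import search

proxy = urllib.request.ProxyHandler({
    'http': 'http://203.0.113.10:8080',   # hypothetical proxy
    'https': 'http://203.0.113.10:8080',
})
# install_opener() affects every subsequent urlopen() call,
# including the ones googlesearch makes internally.
urllib.request.install_opener(urllib.request.build_opener(proxy))

for url in search('"Breaking Code" WordPress blog', stop=10):
    print(url)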

domain search issue

Hi @MarioVilas !
So, I found that you filter out URLs that contain "google", like 'image.google.com'; I'm just wondering what the reason for that is? I need to search the 'support.google.com' domain now, and it works pretty well after I removed the "if 'google' not in o.netloc" check.
Thanks!

Error from BeautifulSoup while running google.py script

While running on Windows with Py2.6 I get this:

$ python google.py test
c:\Python26\lib\site-packages\beautifulsoup4-4.3.2-py2.6.egg\bs4\builder\_htmlparser.py:163: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
Traceback (most recent call last):
  File "google.py", line 274, in <module>
    for url in search(query, **params):
  File "google.py", line 196, in search
    soup = BeautifulSoup(html)
  File "build\bdist.win32\egg\bs4\__init__.py", line 196, in __init__
  File "build\bdist.win32\egg\bs4\__init__.py", line 210, in _feed
  File "build\bdist.win32\egg\bs4\builder\_htmlparser.py", line 164, in feed
HTMLParser.HTMLParseError: junk characters in start tag: u'{t:119}); class=gbzt', at line 1, column 37580

Suggestion to have a simpler date-range API

Useful package, thanks! One suggestion is to offer an easier date-range API than relying on the Google version.

The first issue is that this relies on the tbs parameter, which is a little cryptic. Maybe you could have a daterange parameter that formats the tbs appropriately, e.g. I use the code below. I am not sure, though, whether tbs is used for other things apart from dates.

Secondly, the tbs date range seems to be ignored if one uses old user agents. This includes your default and random user agents. There is a package, fake_useragent, which grabs the latest ones.

from urllib.parse import urlencode  # missing from the original snippet

def get_tbs(fromDate, toDate):
    """Return the Google search tbs parameter for a date range.

    :param fromDate: python date
    :param toDate:   python date
    :return: urlencoded tbs value (excluding the leading "tbs=")
    """
    # Format both dates as m/d/yyyy, the format Google expects.
    fromDate = fromDate.strftime("%m/%d/%Y")
    toDate = toDate.strftime("%m/%d/%Y")
    return urlencode(dict(tbs=f"cdr:1,cd_min:{fromDate},cd_max:{toDate}"))[4:]
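
For example (the dates are arbitrary):

from datetime import date

tbs = get_tbs(date(2019, 1, 1), date(2019, 2, 1))
# -> 'cdr%3A1%2Ccd_min%3A01%2F01%2F2019%2Ccd_max%3A02%2F01%2F2019'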

Search returning a different number of results than specified

Hey,

I have been using your code for my undergrad research thesis and it is working fine. However, I am having an issue: the search function returns a different number of URLs than specified.

search(query, lang='en', stop=10, pause=3.0) should return 10 results, but sometimes it returns 9, 14, or 6 results.

No such file or directory: '/home/wsgi/.google-cookie'

Hi Mario,

Thanks for putting time into building this; it's been incredibly useful. I'm trying to run your package on a remote AWS EC2 instance, and I've been unable to because I get the following error:

<type 'exceptions.IOError'>
[Errno 2] No such file or directory: '/home/wsgi/.google-cookie'

I looked through your source code and saw that you actually do reference .google-cookie here:

cookie_jar = LWPCookieJar(os.path.join(home_folder, '.google-cookie'))
try:
    cookie_jar.load()
except Exception:

I haven't been able to figure out how to get this working on my EC2 instance. Do you have any suggestions for what the issue is? Is there a specific cookie I need to add to my environment? Any help you can provide would be greatly appreciated.

Thanks.

-Daniel
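
A hedged workaround sketch: the cookie-jar path is built from the home folder when the module is imported (as the excerpt above shows), so pointing HOME at a writable directory before the import may sidestep the error on servers where the WSGI user's home does not exist:

import os
os.environ['HOME'] = '/tmp'  # any directory the process can write to

# Import only after HOME is set, since the path is computed at import time.
from googlesearch import search

for url in search('"Breaking Code" WordPress blog', stop=5):
    print(url)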

First try was an EPIC Fail!

This happened the first time, the 2nd, the 3rd, and so on...

Traceback (most recent call last):
  File "googleParse.py", line 7, in <module>
    for url in google.search('google 1.9.1 python', tld='com.pk', lang='es', stop=5):
AttributeError: 'module' object has no attribute 'search'

--stop not working as expected when --stop < --num

I was expecting the following to return 5 results, but it returns more. Looking at the code, this is because --num defaults to 10, and everything on the page is returned before --stop is applied.

$ google --stop 5 arsenal
https://www.arsenal.com/
https://twitter.com/arsenal
http://www.skysports.com/arsenal
https://www.youtube.com/channel/UCpryVRk_VDudG8SHXgWcG0w
https://www.facebook.com/Arsenal/
https://en.wikipedia.org/wiki/Arsenal_F.C.
https://www.independent.co.uk/topic/Arsenal
https://www.premierleague.com/clubs/1/Arsenal/overview
https://www.theguardian.com/football/arsenal

Also, in this case I gave a --num option, but since only 4 results came back, it looks like there must have been a duplicate among the 5 returned?

$ google --stop 5 --num 5 arsenal
https://www.arsenal.com/
https://twitter.com/arsenal
http://www.skysports.com/arsenal
https://www.youtube.com/channel/UCpryVRk_VDudG8SHXgWcG0w

Python 3.6+ usage

In the documentation, please add that pip3 cannot currently be used to install the package (at least, it did not work on my computer).

Moreover, the program fetches the search URLs over HTTPS with certificate verification, which breaks on OS X Yosemite+ with Python 3.6+. To fix this, Python users on OS X need to install updated certificates, which do not come by default with the Mac installation of Python. The command to run is /Applications/Python\ 3.6/Install\ Certificates.command, as documented in /Applications/Python\ 3.6/ReadMe.rtf.

Specifically, if users do not install this and attempt to run (as I did), they will run into errors akin to "URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed".

For more information, especially about the latter, see this StackOverflow post: https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error

TypeError: '<' not supported between instances of 'int' and 'str'

Hey Mario,

I know this is a common issue in Python, but I'm not sure how to fix it in your source code.

File "/googlesearch/__init__.py", line 345, in search
    while not stop or start < stop:
TypeError: '<' not supported between instances of 'int' and 'str'

Thanks for putting this together. Can't wait to try it!
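
The failing comparison is start < stop, which suggests stop (or start) was passed as a string rather than an int. A minimal sketch of the fix:

from googlesearch import search

# for url in search('...', stop='20'):  # str vs int -> TypeError
for url in search('"Breaking Code" WordPress blog', stop=20):  # int works
    print(url)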

Fetch the next n URLs

Hi,

I'd like a way to fetch the next n URLs.
What I am trying to do: after fetching the URLs, I fetch the contents from each web source, but if the results are not appropriate, I fetch new URLs for the same search string.
For example, first I fetched 10 URLs, and now I need the next 5. Please help me out.

Searching Google Based Websites

Issue:
Script is not able to search for or find google-based websites as results

Reproducible Use:
x=google.search("site:google.com apple")

Expected Result:
"feedproxy.google.com/~r/zdnet/Apple/~3/5eQ2k0HRMgc/"

Actual Result:
No site found (in fact a ban from Google for 3-6 hours)

tld `org` doesn't work

for website in search("en.une UNE 66926", tld="org",num=10, stop=3, pause=4):
    print(website)

Output

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-60-393ef278d240> in <module>()
      1 # UNE
----> 2 for website in search("en.une UNE 66926", tld="org",num=10, stop=3, pause=4):
      3     print(website)

~/miniconda3/envs/jay/lib/python3.6/site-packages/googlesearch/__init__.py in search(query, tld, lang, tbs, safe, num, start, stop, domains, pause, only_standard, extra_params, tpe, user_agent)
    286 
    287         # Request the Google Search results page.
--> 288         html = get_page(url, user_agent)
    289 
    290         # Parse the response and process every anchored URL.

~/miniconda3/envs/jay/lib/python3.6/site-packages/googlesearch/__init__.py in get_page(url, user_agent)
    152     request.add_header('User-Agent', user_agent)
    153     cookie_jar.add_cookie_header(request)
--> 154     response = urlopen(request)
    155     cookie_jar.extract_cookies(response, request)
    156     html = response.read()

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    530         for processor in self.process_response.get(protocol, []):
    531             meth = getattr(processor, meth_name)
--> 532             response = meth(req, response)
    533 
    534         return response

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in http_response(self, request, response)
    640         if not (200 <= code < 300):
    641             response = self.parent.error(
--> 642                 'http', request, response, code, msg, hdrs)
    643 
    644         return response

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in error(self, proto, *args)
    568         if http_err:
    569             args = (dict, 'default', 'http_error_default') + orig_args
--> 570             return self._call_chain(*args)
    571 
    572 # XXX probably also want an abstract factory that knows when it makes

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

~/miniconda3/envs/jay/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found
