
google-search-results-python's People

Contributors

ajsierra117, dimitryzub, elizost, gbcfxs, gutoarraes, hartator, heyalexej, ilyazub, justinrobertohara, jvmvik, kennethreitz, lf2225, manoj-nathwani, paplorinc


google-search-results-python's Issues

Cannot increase the offset between returned results using pagination

I am trying to use the pagination feature based on the code at (https://github.com/serpapi/google-search-results-python#pagination-using-iterator). I want to request 20 results per API call, but pagination iterates by 10 results by default instead of 20, meaning my requests end up overlapping.

I think I have found a solution to this. Looking inside the package, pagination.py has a Pagination class that takes a page_size argument controlling the size of the offset between returned results.
The Pagination class is used in serp_api_client.py inside the pagination method starting on line 170, but the page_size argument isn't passed through there. I added page_size=10 on lines 170 and 174, and now I can set the page size by calling search.pagination(page_size=20). Can this change be made in the code?
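
A minimal sketch of the intended usage, assuming pagination() is patched to accept and forward a page_size argument (the API key is a placeholder):

from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google",
    "q": "coffee",
    "api_key": "YOUR_API_KEY",  # placeholder
})

# With the patch, each page advances the offset by 20 instead of 10,
# so consecutive requests no longer overlap.
for page in search.pagination(page_size=20):
    print(len(page.get("organic_results", [])))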

[Google Jobs API] Support for Pagination

Because Google Jobs does not return the serpapi_pagination key and instead expects the start parameter for paging, the current version of the library does not support pagination for Google Jobs. Pagination support should be added for Google Jobs. The iterator currently stops as soon as the key is missing:

# stop if backend miss to return serpapi_pagination
if not 'serpapi_pagination' in result:
  raise StopIteration

# stop if no next page
if not 'next' in result['serpapi_pagination']:
    raise StopIteration
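
Until pagination support is added, a workaround is to drive the start parameter manually. A rough sketch, assuming the standard Google Jobs response key jobs_results and the default page step of 10:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google_jobs",
    "q": "barista new york",
    "api_key": os.getenv("API_KEY"),
    "start": 0,
}

all_jobs = []
while True:
    results = GoogleSearch(params).get_dict()
    jobs = results.get("jobs_results", [])
    if not jobs:
        break  # no more pages: Google Jobs simply stops returning results
    all_jobs.extend(jobs)
    params["start"] += 10  # Google Jobs paginates in steps of 10

print(f"Collected {len(all_jobs)} job results")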


SSLCertVerificationError [SSL: CERTIFICATE_VERIFY_FAILED] error

A user reported receiving this error:

SSLCertVerificationError Traceback (most recent call last)
/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
698 # Make the request on the httplib connection object.
--> 699 httplib_response = self._make_request(
700 conn,

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1125)

The solution for them was to turn off the VPN.
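
If the error persists after disabling the VPN, a quick check is whether the machine can reach the API host using certifi's CA bundle. This is only a diagnostic sketch, not part of the library:

import certifi
import requests

# Success here suggests the certificate store is fine and a VPN/proxy was
# intercepting TLS; the same failure points at the local certificate setup.
response = requests.get("https://serpapi.com", verify=certifi.where(), timeout=30)
print(response.status_code)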

macOS installation issue

Installing the package via pip fails:

Collecting google-search-results
  Using cached https://files.pythonhosted.org/packages/08/eb/38646304d98db83d85f57599d2ccc8caf325961e8792100a1014950197a6/google_search_results-1.5.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/setup.py", line 7, in <module>
        with open(path.join(here, 'SHORT_README.rst'), encoding='utf-8') as f:
      File "/usr/local/Cellar/python@2/2.7.15_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 898, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/SHORT_README.rst'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/

Running macOS Catalina and Python 2.7.

~ ❯❯❯ pip --version
pip 19.0.2 from /usr/local/lib/python2.7/site-packages/pip (python 2.7)
~ ❯❯❯ python --version
Python 2.7.15

Provide a more convenient way to paginate via the Python package

Currently, the way to paginate searches is to get serpapi_pagination.current and increase the start (or offset) parameter in a loop, just like with regular HTTP requests to serpapi.com/search without an API wrapper.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

print(f"Current page: {results['serpapi_pagination']['current']}")

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

while 'next' in results['serpapi_pagination']:
    search.params_dict[
        "start"] = results['serpapi_pagination']['current'] * 10
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for news_result in results["news_results"]:
        print(
            f"Title: {news_result['title']}\nLink: {news_result['link']}\n"
        )

A more convenient approach for an official API wrapper would be to provide a function like search.paginate(callback: Callable) which properly calculates the offset for the specific search engine and loops through pages until the end.

import os
from serpapi import GoogleSearch

def print_results(results):
  print(f"Current page: {results['serpapi_pagination']['current']}")

  for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
search.paginate(print_results)
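
A rough sketch of how such a paginate() helper could work under the hood; the function name, the fixed offset arithmetic, and the page_size default are assumptions for illustration, not the library's implementation:

from typing import Callable
from serpapi import GoogleSearch

def paginate(search: GoogleSearch, callback: Callable[[dict], None], page_size: int = 10):
    """Invoke `callback` for every page until the API stops returning a next page."""
    results = search.get_dict()
    callback(results)

    while 'next' in results.get('serpapi_pagination', {}):
        # Offset-based engines only; token-based engines (e.g. YouTube) would
        # need the `next` URL parsed instead of a simple start increment.
        search.params_dict['start'] = results['serpapi_pagination']['current'] * page_size
        results = search.get_dict()
        callback(results)

# usage: paginate(GoogleSearch(params), print_results)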

@jvmvik @hartator What do you think?

{'error':'We couldn't find your API Key.'}

from serpapi.google_search_results import GoogleSearchResults

client = GoogleSearchResults({"q": "coffee", "serp_api_key": "************************"})

result = client.get_dict()

I tried giving my API key from serpstack, yet I am left with this error. Any help would be much appreciated.

Knowledge Graph object not being sent in response.

Some queries that return a knowledge graph both in my own Google search and in the SerpApi Playground are not returning the 'knowledge_graph' key in my own application.

Code:

params = {
    'q': 'Aspen Pumps Ltd',
    'engine': 'google',
    'api_key': <api_key>,
    'num': 100
  }

result_set = GoogleSearchResults(params).get_dict()

print(result_set.keys())

Evaluation:

dict_keys(['search_metadata', 'search_parameters', 'search_information', 'ads', 'shopping_results', 'organic_results', 'related_searches', 'pagination', 'serpapi_pagination'])

Manual Results:

https://www.google.com
Screenshot 2019-07-31 at 15 48 55

https://serpapi.com/playground
Screenshot 2019-07-31 at 15 49 10

SerpApiClient.get_search_archive fails with format='html'

SerpApiClient.get_search_archive assumes all results must be loaded as a JSON, so it fails when using format='html'

GoogleSearchResults({}).get_search_archive(search_id='5df0db57ab3f5837994cd5a1', format='html')
---------------------------------------------------------------------------                                                                                                                                   JSONDecodeError                           Traceback (most recent call last)
<ipython-input-8-b6d24cb47bf7> in <module>
----> 1 GoogleSearchResults({}).get_search_archive(search_id='5df0db57ab3f5837994cd5a1', format='html')

C:\ProgramData\Anaconda3\lib\site-packages\serpapi\serp_api_client.py in get_search_archive(self, search_id, format)
78             dict|string: search result from the archive
79         """
---> 80         return json.loads(self.get_results("/searches/{0}.{1}".format(search_id, format)))
81
82     def get_account(self):

C:\ProgramData\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352             parse_int is None and parse_float is None and
353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
355     if cls is None:
356         cls = JSONDecoder

C:\ProgramData\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
337
338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340         end = _w(s, end).end()
341         if end != len(s):

C:\ProgramData\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
355             obj, end = self.scan_once(s, idx)
356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
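
A possible fix is to decode JSON only when a JSON payload was requested. A sketch of what a patched get_search_archive could look like (not the current library code):

import json

def get_search_archive(self, search_id, format='json'):
    """Return an archived search; skip JSON decoding when HTML was requested."""
    raw = self.get_results("/searches/{0}.{1}".format(search_id, format))
    if format == 'html':
        return raw  # raw HTML string
    return json.loads(raw)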

Connection issue

Hi,
One of the users of my code gets the following error when creating a client (screenshot attached).
I suppose it is related to their machine settings, as it doesn't happen to other users.
Thanks for helping
P.S. I am fairly new to coding.

ImportError: cannot import name 'GoogleSearch' from 'serpapi'

After creating a subscriber account on SerpApi, I was given an API key and installed the package with "pip install google-search-results".

But whenever I try to run my Django app, I get this error:

File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 64, in wrapper
fn(*args, **kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/core/management/commands/runserver.py", line 125, in inner_run
autoreload.raise_last_exception()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 87, in raise_last_exception
raise _exception[1]
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/core/management/init.py", line 394, in execute
autoreload.check_errors(django.setup)()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 64, in wrapper
fn(*args, **kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/init.py", line 24, in setup
apps.populate(settings.INSTALLED_APPS)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/apps/registry.py", line 116, in populate
app_config.import_models()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/apps/config.py", line 269, in import_models
self.models_module = import_module(models_module_name)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/Users/MyProjects/topsearch/topsearch/searchapp/models.py", line 3, in
from serpapi import GoogleSearch
ImportError: cannot import name 'GoogleSearch' from 'serpapi' (/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/serpapi/__init__.py)
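
GoogleSearch only exists in newer releases of google-search-results; older releases export GoogleSearchResults instead, so upgrading the package usually resolves this. A defensive sketch that works with both (placeholder API key):

try:
    from serpapi import GoogleSearch
except ImportError:
    # Older google-search-results releases only expose GoogleSearchResults.
    from serpapi import GoogleSearchResults as GoogleSearch

search = GoogleSearch({"q": "coffee", "api_key": "YOUR_API_KEY"})
print(search.get_dict().get("search_metadata"))

It is also worth checking that no local file or directory named serpapi shadows the installed package.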

How to resolve the Connection aborted error when calling SerpApi

Hi,
A new scraper here.
In my API call, I get the following error. Would you please let me know if I am doing anything wrong here? Thanks a lot.

https://serpapi.com/search
---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 978             conn.connect()
    979 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
    370             server_hostname=server_hostname,
--> 371             ssl_context=context,
    372         )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)
    385         if HAS_SNI and server_hostname is not None:
--> 386             return context.wrap_socket(sock, server_hostname=server_hostname)
    387 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    406                          server_hostname=server_hostname,
--> 407                          _context=self, _session=session)
    408 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in __init__(self, sock, keyfile, certfile, server_side, cert_reqs, ssl_version, ca_certs, do_handshake_on_connect, family, type, proto, fileno, suppress_ragged_eofs, npn_protocols, ciphers, server_hostname, _context, _session)
    816                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 817                     self.do_handshake()
    818 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self, block)
   1076                 self.settimeout(None)
-> 1077             self._sslobj.do_handshake()
   1078         finally:

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self)
    688         """Start the SSL/TLS handshake."""
--> 689         self._sslobj.do_handshake()
    690         if self.context.check_hostname:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    726             retries = retries.increment(
--> 727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    728             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    409             if read is False or not self._is_method_retryable(method):
--> 410                 raise six.reraise(type(error), error, _stacktrace)
    411             elif read is not None:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    733             if value.__traceback__ is not tb:
--> 734                 raise value.with_traceback(tb)
    735             raise value

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 978             conn.connect()
    979 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
    370             server_hostname=server_hostname,
--> 371             ssl_context=context,
    372         )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)
    385         if HAS_SNI and server_hostname is not None:
--> 386             return context.wrap_socket(sock, server_hostname=server_hostname)
    387 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    406                          server_hostname=server_hostname,
--> 407                          _context=self, _session=session)
    408 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in __init__(self, sock, keyfile, certfile, server_side, cert_reqs, ssl_version, ca_certs, do_handshake_on_connect, family, type, proto, fileno, suppress_ragged_eofs, npn_protocols, ciphers, server_hostname, _context, _session)
    816                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 817                     self.do_handshake()
    818 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self, block)
   1076                 self.settimeout(None)
-> 1077             self._sslobj.do_handshake()
   1078         finally:

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self)
    688         """Start the SSL/TLS handshake."""
--> 689         self._sslobj.do_handshake()
    690         if self.context.check_hostname:

ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-26-45ac328ca8f8> in <module>
      1 question = 'where to get best coffee'
----> 2 results = performSearch(question)

<ipython-input-25-5bc778bad4e2> in performSearch(question)
     12 
     13     search = GoogleSearch(params)
---> 14     results = search.get_dict()
     15     return results

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_dict(self)
    101             (alias for get_dictionary)
    102         """
--> 103         return self.get_dictionary()
    104 
    105     def get_object(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_dictionary(self)
     94             Dict with the formatted response content
     95         """
---> 96         return dict(self.get_json())
     97 
     98     def get_dict(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_json(self)
     81         """
     82         self.params_dict["output"] = "json"
---> 83         return json.loads(self.get_results())
     84 
     85     def get_raw_json(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_results(self, path)
     68             Response text field
     69         """
---> 70         return self.get_response(path).text
     71 
     72     def get_html(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_response(self, path)
     57             url, parameter = self.construct_url(path)
     58             print(url)
---> 59             response = requests.get(url, parameter, timeout=self.timeout)
     60             return response
     61         except requests.HTTPError as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/api.py in get(url, params, **kwargs)
     73     """
     74 
---> 75     return request('get', url, params=params, **kwargs)
     76 
     77 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/api.py in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)
     62 
     63 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    540         }
    541         send_kwargs.update(settings)
--> 542         resp = self.send(prep, **send_kwargs)
    543 
    544         return resp

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    653 
    654         # Send the request
--> 655         r = adapter.send(request, **kwargs)
    656 
    657         # Total elapsed time of the request (approximately)

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    496 
    497         except (ProtocolError, socket.error) as err:
--> 498             raise ConnectionError(err, request=request)
    499 
    500         except MaxRetryError as e:

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
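
Connection resets like this are usually transient network or proxy issues. A small retry wrapper around the call is one way to make a script more resilient; this is a sketch, not part of the library:

import time
import requests
from serpapi import GoogleSearch

def get_dict_with_retry(params, retries=3, backoff=5):
    """Retry the search a few times when the connection is reset."""
    for attempt in range(retries):
        try:
            return GoogleSearch(params).get_dict()
        except requests.exceptions.ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # simple linear backoff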

You need a valid browser to continue exploring our API

This is the error message you get when you don't supply a private key. I think information on this site should be provided regarding:

  1. How to get an API key
  2. Is a key free or how much does it cost
  3. Are there limits to using the key (hits/hour or whatever)

The service provided by the repo is very valuable, but whether I can use it or not depends on the answers to these questions.

get_html() Returns JSON Instead of HTML

A customer reported the get_html() method for this library returns a JSON response instead of the expected HTML.

I may be misunderstanding something about what the get_html method is intended to do, but I checked this locally and the customer's report appears to be correct:

Screenshot 2024-01-05 at 9 34 23 AM Screenshot 2024-01-05 at 9 42 56 AM
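
A minimal way to reproduce the report (the API key is a placeholder); if the bug is present, the printed prefix is JSON rather than an HTML document:

from serpapi import GoogleSearch

search = GoogleSearch({"q": "coffee", "api_key": "YOUR_API_KEY"})
html = search.get_html()
print(html[:80])  # expected: an HTML document; reported: a JSON string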

Python 3.8+, Fatal Python error: Segmentation fault when calling requests.get(URL, params) with docker python-3.8.2-slim-buster/openssl 1.1.1d and python-3.9.10-slim-buster/openssl 1.1.1d

Here's the trace:

Python 3.9.10 (main, Mar  1 2022, 21:02:54) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://serpapi.com', {"api_key":VALID_API_KEY, "engine": "google_jobs", "q": "Barista"})
Fatal Python error: Segmentation fault

Current thread 0x0000ffff8e999010 (most recent call first):
  File "/usr/local/lib/python3.9/ssl.py", line 1173 in send
  File "/usr/local/lib/python3.9/ssl.py", line 1204 in sendall
  File "/usr/local/lib/python3.9/http/client.py", line 1001 in send
  File "/usr/local/lib/python3.9/http/client.py", line 1040 in _send_output
  File "/usr/local/lib/python3.9/http/client.py", line 1280 in endheaders
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 395 in request
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496 in _make_request
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790 in urlopen
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 486 in send
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703 in send
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589 in request
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 59 in request
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 73 in get
  File "<stdin>", line 1 in <module>
Segmentation fault

This is not specific to one engine, it also applies to google_images if I swap the engine.

Dockerfile:

FROM python:3.9.10-slim-buster

ENV PYTHONUNBUFFERED 1
ENV PYTHONDONTWRITEBYTECODE 1

# OLD: RUN apt-get update && apt-get upgrade -y && apt-get install gcc -y && apt-get install apt-utils -y

# Install build-essential for celery worker otherwise it says gcc not found
RUN apt-get update \
  # dependencies for building Python packages
  && apt-get install -y build-essential \
  # psycopg2 dependencies
  && apt-get install -y libpq-dev \
  # Additional dependencies
  && apt-get install -y telnet netcat \
  # cleaning up unused files
  && apt-get purge -y --auto-remove -o APT::AutoRemove::RecommendsImportant=false \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY ./compose/local/flask/start /start
RUN sed -i 's/\r$//g' /start
RUN chmod +x /start

# COPY ./compose/local/flask/celery/worker/start /start-celeryworker
# RUN sed -i 's/\r$//g' /start-celeryworker
# RUN chmod +x /start-celeryworker

# COPY ./compose/local/flask/celery/beat/start /start-celerybeat
# RUN sed -i 's/\r$//g' /start-celerybeat
# RUN chmod +x /start-celerybeat

# COPY ./compose/local/flask/celery/flower/start /start-flower
# RUN sed -i 's/\r$//g' /start-flower
# RUN chmod +x /start-flower

COPY . .

# COPY entrypoint.sh /usr/local/bin/
# ENTRYPOINT ["entrypoint.sh"]

docker-compose.yml:

version: "3.9"

services:
  flask_app:
    restart: always
    container_name: flask_app
    image: meder/flask_live_app:1.0.0
    command: /start
    build: .
    ports:
      - "4000:4000"
    volumes:
      - .:/app
    env_file:
      - local.env
    environment:
      - FLASK_ENV=development
      - FLASK_APP=app.py
    depends_on:
      - db
  db:
    container_name: flask_db
    image: postgres:16.1-alpine
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=USER
      - POSTGRES_PASSWORD=PW
      - POSTGRES_DB=DB
    volumes: 
      - postgres_data:/var/lib/postgresql/data
  redis:
    container_name: redis
    image: redis:7.2-alpine
    ports:
      - "6379:6379"
volumes:
  postgres_data: {}

And here is requirements.txt (I didn't update it after moving from the original 3.8.2 to 3.9.10):

flask==3.0.0
psycopg2-binary==2.9.9
google-search-results==2.4.2

The above trace came from opening a bash shell into my Docker container and running requests.get after importing it, like so:

docker exec -it flask_app bash

The host machine runs this fine, but it uses LibreSSL 2.8.3 / Python 3.8.16. Based on other tickets/issues here, it seems there may be something on the SSL side of the backend that's triggering this. I would appreciate some insight.

Someone ran into this on Stack Overflow and the selected answer was updating the timeout: https://stackoverflow.com/questions/74774784/cheerypy-server-is-timing-out (no guarantee this is the same issue, just a reference).
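
If it does turn out to be timeout-related, as in that Stack Overflow thread, one low-effort thing to try inside the container is an explicit timeout on the raw request. Purely a diagnostic sketch (the API key is a placeholder):

import requests

# An explicit timeout avoids indefinite hangs and mirrors the workaround
# suggested in the linked Stack Overflow answer.
response = requests.get(
    "https://serpapi.com/search.json",
    params={"api_key": "VALID_API_KEY", "engine": "google_jobs", "q": "Barista"},
    timeout=30,
)
print(response.status_code)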

[Pagination] Pagination isn't correct and it skips index by one


Since the start value starts from 0, the correct second page should be 10 and not 11.

This behaviour is also causing pages to be skipped, so customers are getting confusing results (screenshot attached).

Intercom Link
First recognized by @marm123.

I think the start increment in pagination.py needs to be replaced by:

self.client.params_dict['start'] += 0

I don't know whether this would cause errors on other engines, but it may also fix the issue for every other engine.

[Version] Update PyPi to include the most up-to-date version

Currently, PyPi allows users to easily install our library using pip. However, the library has since been updated to remove the print method (screenshot below), while the PyPi version still includes it, causing confusion among users. Some of them think that the printed URL should contain the data from their search and contact us about SerpApi not working, while others simply ask for the print to be removed for clarity.

Current state (screenshot attached):

The user confused about the data not being available in the printed link.
Another user confused about the data not being available in the printed link

The user asking to remove the print method for clarity (they installed it through PyPi)
Another user asking to remove the print method

Link for account details for PyPi

[Feature request] Make `async: True` do everything under the hood

From a user perspective, the less setup required the better. I personally find the second example (example.py) more user-friendly, especially for less technical users.

The user just has to add async: True and doesn't have to spend another hour or so figuring out how Queue or something else works.

@jvmvik @ilyazub @hartator what do you guys think?

@aliayar @marm123 @schaferyan have you guys noticed similar issues for the users or have any users requested similar things?


What if instead of this:

# async batch requests: https://github.com/serpapi/google-search-results-python#batch-asynchronous-searches

from serpapi import YoutubeSearch
from queue import Queue
import os, re, json

queries = [
    'burly',
    'creator',
    'doubtful'
]

search_queue = Queue()

for query in queries:
    params = {
        'api_key': '...',                 
        'engine': 'youtube',              
        'device': 'desktop',              
        'search_query': query,          
        'async': True,                   # ❗
        'no_cache': 'true'
    }
    search = YoutubeSearch(params)       
    results = search.get_dict()         
    
    if 'error' in results:
        print(results['error'])
        break

    print(f"Add search to the queue with ID: {results['search_metadata']}")
    search_queue.put(results)

data = []

while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    print(f'Get search from archive: {search_id}')
    search_archived = search.get_search_archive(search_id)
    
    print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

    if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
        for video_result in search_archived.get('video_results', []):
            data.append({
                'title': video_result.get('title'),
                'link': video_result.get('link'),
                'channel': video_result.get('channel').get('name'),
            })
    else:
        print(f'Requeue search: {search_id}')
        search_queue.put(result)

Users can do something like this and we handle everything under the hood:

# example.py
# testable example
# example import: from serpapi import async_search

from async_search import async_search
import json

queries = [
    'burly',
    'creator',
    'doubtful',
    'minecraft' 
]

# or as we typically pass params dict
data = async_search(queries=queries, api_key='...', engine='youtube', device='desktop')

print(json.dumps(data, indent=2))
print('All searches completed')

Under the hood code example:

# async_search.py
# testable example

from serpapi import YoutubeSearch
from queue import Queue
import os, re

search_queue = Queue()

def async_search(queries, api_key, engine, device):
    data = []
    for query in queries:
        params = {
            'api_key': api_key,                 
            'engine': engine,              
            'device': device,              
            'search_query': query,          
            'async': True,                  
            'no_cache': 'true'
        }
        search = YoutubeSearch(params)       
        results = search.get_dict()         
        
        if 'error' in results:
            print(results['error'])
            break

        print(f"Add search to the queue with ID: {results['search_metadata']}")
        search_queue.put(results)

    while not search_queue.empty():
        result = search_queue.get()
        search_id = result['search_metadata']['id']

        print(f'Get search from archive: {search_id}')
        search_archived = search.get_search_archive(search_id)
        
        print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

        if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
            for video_result in search_archived.get('video_results', []):
                data.append({
                    'title': video_result.get('title'),
                    'link': video_result.get('link'),
                    'channel': video_result.get('channel').get('name'),
                })
        else:
            print(f'Requeue search: {search_id}')
            search_queue.put(result)
            
    return data

Is there a specific reason we haven't done it before?

Pagination iterator doesn't work for APIs with token-based pagination

For several APIs, parsing serpapi_pagination.next is the only way to update params_dict with correct values. Incrementing params.start won't work for Google Scholar Profiles, Google Maps, or YouTube.

# increment page
self.start += self.page_size

Google Scholar Profiles

The Google Scholar Profiles API has pagination.next_page_token instead of serpapi_pagination.next.

pagination.next is a next page URI like https://serpapi.com/search.json?after_author=0QICAGE___8J&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity where after_author is set to next_page_token.

Google Maps

In Google Maps Local Results API there's only serpapi_pagination.next with a URI like https://serpapi.com/search.json?engine=google_maps&ll=%4040.7455096%2C-74.0083012%2C14z&q=Coffee&start=20&type=search

YouTube

In YouTube Search Engine Results API there's serpapi_pagination.next_page_token similar to Google Scholar Profiles. serpapi_pagination.next is a URI with sp parameter set to next_page_token.

@jvmvik What do you think about parsing serpapi_pagination.next in Pagination#__next__?

- self.start += self.page_size
+ self.client.params_dict.update(dict(parse.parse_qsl(parse.urlsplit(result['serpapi_pagination']['next']).query)))

Here's an example of endless pagination of Google Scholar Authors (scraped 190 pages and manually stopped).
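
A sketch of what parsing serpapi_pagination.next could look like inside Pagination.__next__; the helper name and the fallback to the pagination key are assumptions for illustration:

from urllib.parse import parse_qsl, urlsplit

def apply_next_page(client, result):
    """Copy the query parameters of the `next` URL into the client's params."""
    pagination = result.get('serpapi_pagination') or result.get('pagination') or {}
    next_url = pagination.get('next')
    if not next_url:
        raise StopIteration
    client.params_dict.update(dict(parse_qsl(urlsplit(next_url).query)))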

google scholar pagination not returning final results page

I am using the pagination method with the google_scholar engine to return all results for a search term. When I use a for loop to iterate the pagination and put the results in a list, it doesn't return the final page of results, instead stopping at the penultimate page (code snippet and terminal output below).

import serpapi
import os
from loguru import logger
from dotenv import load_dotenv

load_dotenv()

search_string = '"Singer Instruments" PhenoBooth'

# Pagination allows iterating through all pages of results
logger.info("Initialising search through serpapi")
search = serpapi.GoogleSearch(
    {
        "engine": "google_scholar",
        "q": search_string,
        "api_key": os.getenv("SERPAPI_KEY"),
        "as_ylo": 1900,
    }
)
pages = search.pagination(start=0, page_size=20)

# get dict for each page of results and store in list
results_list = []
page_number = 1
for page in pages:
    logger.info(f"Retrieving results page {page_number}")
    results_list.append(page)
    page_number += 1

gscholar_results = results_list[0]["search_information"]["total_results"]
print(f"results reported by google scholar: {gscholar_results}")

paper_count = 0
for page in results_list:
    for paper in page["organic_results"]:
        paper_count += 1

print(f"number of papers in results: {paper_count}")

Screenshot 2021-07-30 at 17 05 11

If I check my searches on serpapi.com, results are being generated for all pages (see the example below). So the problem is not that the results aren't generated; they're just not coming out of the pagination iterator for some reason.

Screenshot 2021-07-30 at 17 06 21

Exception not handled on SerpApiClient.get_json

I am experiencing unexpected behaviour when running thousands of queries. For some reason, the API sometimes returns an empty response. It happens at random (perhaps 1 time out of 10,000).

When this happens, the SerpApiClient.get_json method does not handle the empty response, so json.loads() raises a JSONDecodeError.

I attach an image to clarify the issue.


It seems to be a problem with the API service. I'm not sure whether it should be solved with exception handling, by handling status code 204 (empty response), or whether there is a bug on the servers.

To reproduce the exception:

import json
json.loads('')

Do you recommend any guidelines for handling the problem in the meantime, while you review the issue in the source code?

Thanks.
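
Until the client handles empty responses itself, a thin wrapper is one way to cope. A sketch; the retry count and delay are arbitrary:

import json
import time
from serpapi import GoogleSearch

def get_json_safe(params, retries=3, delay=2):
    """Return the parsed response, retrying when the API answers with an empty body."""
    for attempt in range(retries):
        try:
            return GoogleSearch(params).get_json()
        except json.JSONDecodeError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)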

How to get "related articles" links from google scholar via serpapi?

I am using SerpApi to fetch Google Scholar papers. There is always a link called "related articles" under each article, but SerpApi doesn't appear to expose any URL for fetching the data behind those links.

Screenshot 2022-07-14 at 3 05 07 AM

Serp API result :

Screenshot 2022-07-14 at 3 15 16 AM

Can I directly call this URL https://scholar.google.com/scholar?q=related:gemrYG-1WnEJ:scholar.google.com/&scioq=Multi-label+text+classification+with+latent+word-wise+label+information&hl=en&as_sdt=0,21 using SerpApi?

Python package should not include tests

When installing via pip, the installation includes the tests directory:

mkdir deps
python3 -m pip install --target deps "google-search-results==2.4.2"
ls -1 deps/tests

Outputs:

__init__.py
__pycache__
test_account_api.py
(etc)

Tests should be excluded.
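
The usual fix is to exclude the tests package when building the distribution. A setuptools sketch, assuming the project's setup.py uses find_packages:

from setuptools import setup, find_packages

setup(
    name="google-search-results",
    packages=find_packages(exclude=["tests", "tests.*"]),  # keep tests out of the sdist/wheel
)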

google scholar pagination skips result 20

When retrieving results from Google Scholar using the pagination() method, the first article on the second page of google scholar is always missing.

I think this is caused by the following snippet in the update() method of google-search-results-python/serpapi/pagination.py:

def update(self):
    self.client.params_dict["start"] = self.start
    self.client.params_dict["num"] = self.num
    if self.start > 0:
        self.client.params_dict["start"] += 1

This seems to mean that for all pages except the first, paginate increases start by 1. So for the first page it requests results starting at 0 and ending at 19 (if page_size=20), but for the second page it requests results starting at 21 and ending at 40, skipping result 20.

If I delete the if statement, the code seems to work as intended and I get result 19 back.
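
For reference, this is what update() looks like without the off-by-one branch; a sketch of the suggested change, not the released code:

def update(self):
    # Pass the zero-based offset straight through: the second page then
    # starts at result 20 (with page_size=20) instead of 21.
    self.client.params_dict["start"] = self.start
    self.client.params_dict["num"] = self.num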

KeyError when Calling Answer Box

I've attempted to get the results from the answer box using the documentation here.

I noticed the Playground does not return these results either.

Is there any way to get this URL also?

Output Returned when Attempting to Run the Sample Provided:

from serpapi import GoogleSearch

params = {
  "q": "What's the definition of transparent?",
  "hl": "en",
  "gl": "us",
  "api_key": ""
}

search = GoogleSearch(params)
results = search.get_dict()
answer_box = results['answer_box']
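
Since Google does not return an answer box for every query, guarding the lookup avoids the KeyError. A small sketch (the API key is left empty as in the sample):

from serpapi import GoogleSearch

params = {
  "q": "What's the definition of transparent?",
  "hl": "en",
  "gl": "us",
  "api_key": ""
}

results = GoogleSearch(params).get_dict()
answer_box = results.get("answer_box")  # None when no answer box is returned
if answer_box is None:
    print("No answer box for this query")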

Different results from serpapi (Google Trends) versus Google Trends site

I'm having an issue (which is causing a serious headache / project issue for us) where the results from serpapi are different from those returned by Google Trends when querying the website directly.

Below is a simple code snippet to reproduce:

from serpapi import GoogleSearch
import pandas as pd

PARAMS = {'engine': 'google_trends',
          'data_type' : 'RELATED_QUERIES',
          'q' : "health insurance",
          'geo': "IE",
          'date' : "2022-01-01 2022-12-31",
          'hl' : 'en-GB',
          'csv' : True,
          'api_key' : '[Key]'}

search = GoogleSearch(PARAMS) 
results = search.get_dict() 

rel = results['related_queries']['top']
df = pd.DataFrame(rel)

df[["query", "value"]]

df.to_csv("serpapi_results.csv")

Attached is a screenshot showing the difference in results, along with both CSV files:

Also attached is a screenshot of the same query on the Google Trends site.

serpapi_results.csv
relatedQueries_google_results.csv

Can you let me know if this is a known issue or if I've made some mistake in my API call?
Thanks,
Ronan

[Discuss] Wrapper longer response times caused by some overhead/additional processing

@jvmvik this issue is for discussion.

I'm not 100% sure what the cause is, but there might be some overhead or additional processing in the wrapper that causes longer response times. Or is it as it should be? Let me know if that's the case.

Results of making 50 requests:

Direct requests to serpapi.com/search.json: ~7.192448616027832 seconds
Requests to serpapi.com through the API wrapper: ~135.2969319820404 seconds
Async batch requests with Queue: ~24.80349826812744 seconds

Making a direct request to serpapi.com/search.json:

import aiohttp
import asyncio
import os
import json
import time

async def fetch_results(session, query):
    params = {
        'api_key': '...',
        'engine': 'youtube',
        'device': 'desktop',
        'search_query': query,
        'no_cache': 'true'
    }
    
    url = 'https://serpapi.com/search.json'
    async with session.get(url, params=params) as response:
        results = await response.json()

    data = []

    if 'error' in results:
        print(results['error'])
    else:
        for result in results.get('video_results', []):
            data.append({
                'title': result.get('title'),
                'link': result.get('link'),
                'channel': result.get('channel').get('name'),
            })

    return data

async def main():
    # 50 queries
    queries = [
        'burly',
        'creator',
        'doubtful',
        'chance',
        'capable',
        'window',
        'dynamic',
        'train',
        'worry',
        'useless',
        'steady',
        'thoughtful',
        'matter',
        'rotten',
        'overflow',
        'object',
        'far-flung',
        'gabby',
        'tiresome',
        'scatter',
        'exclusive',
        'wealth',
        'yummy',
        'play',
        'saw',
        'spiteful',
        'perform',
        'busy',
        'hypnotic',
        'sniff',
        'early',
        'mindless',
        'airplane',
        'distribution',
        'ahead',
        'good',
        'squeeze',
        'ship',
        'excuse',
        'chubby',
        'smiling',
        'wide',
        'structure',
        'wrap',
        'point',
        'file',
        'sack',
        'slope',
        'therapeutic',
        'disturbed'
    ]

    data = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for query in queries:
            task = asyncio.ensure_future(fetch_results(session, query))
            tasks.append(task)

        start_time = time.time()
        results = await asyncio.gather(*tasks)
        end_time = time.time()

        data = [item for sublist in results for item in sublist]

    print(json.dumps(data, indent=2, ensure_ascii=False))
    print(f'Script execution time: {end_time - start_time} seconds') # ~7.192448616027832 seconds

asyncio.run(main())

Same code but using the wrapper YoutubeSearch (not 100% sure if valid comparison):

import aiohttp
import asyncio
from serpapi import YoutubeSearch
import os
import json
import time

async def fetch_results(session, query):
    params = {
        'api_key': '...',
        'engine': 'youtube',
        'device': 'desktop',
        'search_query': query,
        'no_cache': 'true'
    }
    search = YoutubeSearch(params)
    results = search.get_json()

    data = []

    if 'error' in results:
        print(results['error'])
    else:
        for result in results.get('video_results', []):
            data.append({
                'title': result.get('title'),
                'link': result.get('link'),
                'channel': result.get('channel').get('name'),
            })

    return data

async def main():
    queries = [
        'burly',
        'creator',
        'doubtful',
        'chance',
        'capable',
        'window',
        'dynamic',
        'train',
        'worry',
        'useless',
        'steady',
        'thoughtful',
        'matter',
        'rotten',
        'overflow',
        'object',
        'far-flung',
        'gabby',
        'tiresome',
        'scatter',
        'exclusive',
        'wealth',
        'yummy',
        'play',
        'saw',
        'spiteful',
        'perform',
        'busy',
        'hypnotic',
        'sniff',
        'early',
        'mindless',
        'airplane',
        'distribution',
        'ahead',
        'good',
        'squeeze',
        'ship',
        'excuse',
        'chubby',
        'smiling',
        'wide',
        'structure',
        'wrap',
        'point',
        'file',
        'sack',
        'slope',
        'therapeutic',
        'disturbed'
    ]

    data = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for query in queries:
            task = asyncio.ensure_future(fetch_results(session, query))
            tasks.append(task)
        
        start_time = time.time()
        results = await asyncio.gather(*tasks)
        end_time = time.time()

        data = [item for sublist in results for item in sublist]

    print(json.dumps(data, indent=2, ensure_ascii=False))
    print(f'Script execution time: {end_time - start_time} seconds') # ~135.2969319820404 seconds

Using async batch requests with Queue:

from serpapi import YoutubeSearch
from urllib.parse import (parse_qsl, urlsplit)
from queue import Queue
import os, re, json
import time

# 50 queries
queries = [
    'burly',
    'creator',
    'doubtful',
    'chance',
    'capable',
    'window',
    'dynamic',
    'train',
    'worry',
    'useless',
    'steady',
    'thoughtful',
    'matter',
    'rotten',
    'overflow',
    'object',
    'far-flung',
    'gabby',
    'tiresome',
    'scatter',
    'exclusive',
    'wealth',
    'yummy',
    'play',
    'saw',
    'spiteful',
    'perform',
    'busy',
    'hypnotic',
    'sniff',
    'early',
    'mindless',
    'airplane',
    'distribution',
    'ahead',
    'good',
    'squeeze',
    'ship',
    'excuse',
    'chubby',
    'smiling',
    'wide',
    'structure',
    'wrap',
    'point',
    'file',
    'sack',
    'slope',
    'therapeutic',
    'disturbed'
]

search_queue = Queue()

for query in queries:
    params = {
        'api_key': '...',                 
        'engine': 'youtube',              
        'device': 'desktop',              
        'search_query': query,          
        'async': True,                   
        'no_cache': 'true'
    }

    search = YoutubeSearch(params)       # where data extraction happens
    results = search.get_dict()          # JSON -> Python dict
    
    if 'error' in results:
        print(results['error'])
        break

    print(f"Add search to the queue with ID: {results['search_metadata']}")
    search_queue.put(results)

data = []

start_time = time.time()

while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    print(f'Get search from archive: {search_id}')
    search_archived = search.get_search_archive(search_id)
    
    print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

    if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
        for video_result in search_archived.get('video_results', []):
            data.append({
                'title': video_result.get('title'),
                'link': video_result.get('link'),
                'channel': video_result.get('channel').get('name'),
            })
    else:
        print(f'Requeue search: {search_id}')
        search_queue.put(result)
        
print(json.dumps(data, indent=2))
print('All searches completed')

execution_time = time.time() - start_time
print(f'Script execution time: {execution_time} seconds') # ~24.80349826812744 seconds
