
scrapy-zyte-api's Issues

Scrapy errors due to decompression attempt based on Content-Encoding

For Zyte API requests with httpResponseHeaders=True, some websites return headers like Content-Encoding: gzip.

In the current setup, all of the headers in httpResponseHeaders from the Zyte API response are used to create the Response in the download handler.

When a response carrying the Content-Encoding: gzip header is then processed by the downloader middlewares, scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware errors out, since it attempts to decompress the gzipped response.

That decompression should not happen, since Zyte API has already decompressed the contents.
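
A possible project-level workaround, until this is handled in the plugin, is sketched below. The middleware name, the 595 priority (so its process_response runs before HttpCompressionMiddleware at 590), and the zyte_api / zyte_api_automap meta checks are assumptions, not part of scrapy-zyte-api.

class StripZyteAPIContentEncodingMiddleware:
    """Drop Content-Encoding from Zyte API responses so HttpCompressionMiddleware
    does not try to decompress an already-decoded body."""

    def process_response(self, request, response, spider):
        if request.meta.get("zyte_api") or request.meta.get("zyte_api_automap"):
            if "Content-Encoding" in response.headers:
                del response.headers["Content-Encoding"]
        return response

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.StripZyteAPIContentEncodingMiddleware": 595,
}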

Review fingerprinting after the addition of new keys

We might not want keys like device or sessionContext to affect fingerprinting.

Maybe we need to add a test that ensures every new parameter added to the Zyte API spec is taken into account and mapped for the different purposes it may have in this code base, to make this aspect harder to miss.

Inconsistencies in stats when performing additional requests inside Page Objects

Additional requests (ARs) inside page objects (POs) currently work when ZYTE_API_TRANSPARENT_MODE=True is set in the scrapy-zyte-api settings.

However, this causes some inconsistencies in Scrapy stats:

{
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 
 'scrapy-zyte-api/429': 0,
 'scrapy-zyte-api/attempts': 1,
 'scrapy-zyte-api/error_ratio': 0.0,
 'scrapy-zyte-api/errors': 0,
 'scrapy-zyte-api/fatal_errors': 0,
 'scrapy-zyte-api/processed': 1,
 'scrapy-zyte-api/status_codes/200': 1,
 'scrapy-zyte-api/success': 1,
 'scrapy-zyte-api/success_ratio': 1.0,
 'scrapy-zyte-api/throttle_ratio': 0.0,
}

The scrapy-zyte-api/* stats should also reflect the ARs performed inside POs.

Look into unnecessary warnings on default headers

e.g.

… [scrapy_zyte_api._params] WARNING: Request <GET …> defines header b'Accept-Encoding', which cannot be mapped into the Zyte API requestHeaders parameter.
… [scrapy_zyte_api._params] WARNING: Request <GET …> defines header b'User-Agent', which cannot be mapped into the Zyte API customHttpRequestHeaders parameter.

HTTP cache not working in some cases

I have seen 2 people now having trouble with HTTP cache in combination with scrapy-zyte-api.

They set HTTPCACHE_ENABLED to True, and they get NotSupported("Response content isn't text").

I could not reproduce the issue myself with the tutorial spider, so I am not sure why it happens.

If someone affected by this issue can provide a minimal, reproducible example, that would be great!

Support for old Scrapy is incompatible with scrapy-poet

We added a scrapy-poet>=0.9.0 dep in #89, and scrapy-poet (since 0.3.0) requires Scrapy >= 2.6.0. This is both stricter than the current scrapy>=2.0.1 dep in scrapy-zyte-api and silently breaks the tests for older Scrapy versions, because they now install Scrapy 2.9.0.

JSON decoding errors

json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 85 (char 84)
These errors seem to come directly from Zyte API rather than from the spider code.
The stack trace is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.10/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/app/python/lib/python3.10/site-packages/scrapy_zyte_api/handler.py", line 131, in _download_request
    api_response = await self._client.request_raw(
  File "/app/python/lib/python3.10/site-packages/zyte_api/aio/client.py", line 119, in request_raw
    result = await request()
  File "/app/python/lib/python3.10/site-packages/tenacity/_asyncio.py", line 86, in async_wrapped
    return await fn(*args, **kwargs)
  File "/app/python/lib/python3.10/site-packages/tenacity/_asyncio.py", line 48, in __call__
    do = self.iter(retry_state=retry_state)
  File "/app/python/lib/python3.10/site-packages/tenacity/__init__.py", line 351, in iter
    return fut.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/app/python/lib/python3.10/site-packages/tenacity/_asyncio.py", line 51, in __call__
    result = await fn(*args, **kwargs)
  File "/app/python/lib/python3.10/site-packages/zyte_api/aio/client.py", line 103, in request
    response = await resp.json()
  File "/app/python/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1120, in json
    return loads(stripped.decode(encoding))
  File "/usr/local/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 85 (char 84)

Setting TWISTED_REACTOR in Linux

We are writing a spider that uses this plugin and need the TWISTED_REACTOR setting to be twisted.internet.asyncioreactor.AsyncioSelectorReactor. Once we upgraded to Scrapy 2.6.0, setting this via the spider's custom_settings worked when running in our local dev environment (macOS Big Sur 11.6). However, in our production Linux environment, using the same code, we get an error:

Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.7/dist-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/dist-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/dist-packages/scrapy/crawler.py", line 243, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.7/dist-packages/scrapy/crawler.py", line 85, in __init__
    verify_installed_reactor(reactor_class)
  File "/usr/local/lib/python3.7/dist-packages/scrapy/utils/reactor.py", line 90, in verify_installed_reactor
    raise Exception(msg)
Exception: The installed reactor (twisted.internet.epollreactor.EPollReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

The Linux environment is:

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Is this difference in behavior expected? Are there any suggested fixes? We'd like to use this plugin for our use of the API but using the twisted.internet.asyncioreactor.AsyncioSelectorReactor has proven difficult.

For reference, this is how we are running the spiders:

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

def crawl(spider_name):
    # Load project settings
    project_settings = get_project_settings()

    # Load spider
    spider_loader = spiderloader.SpiderLoader.from_settings(project_settings)
    spider = spider_loader.load(spider_name)

    # Create the crawler runner
    runner = CrawlerRunner(project_settings)
    runner.crawl(spider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    # Run the twisted reactor
    reactor.run()
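
One hedged suggestion (an assumption based on how Scrapy handles TWISTED_REACTOR, not a confirmed diagnosis): CrawlerRunner does not install the reactor from TWISTED_REACTOR for you, and from twisted.internet import reactor installs the platform default (epoll on Linux) as a side effect. Installing the asyncio reactor explicitly before that import should avoid the mismatch:

from scrapy.utils.reactor import install_reactor

# With CrawlerRunner, Scrapy does not install TWISTED_REACTOR automatically,
# so install the asyncio reactor before anything imports twisted.internet.reactor.
install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor  # now resolves to AsyncioSelectorReactor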

A way to limit the Zyte API requests

I'm thinking of a feature where the user can indicate the max number of Zyte API requests for a given spider crawl. When this number is reached, the spider shuts down.

The main use case for this is to limit the costs per crawl.
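
A rough sketch of how this could be implemented as an extension (not an existing feature; the MAX_ZYTE_API_REQUESTS setting name and the reliance on the scrapy-zyte-api/processed stat are assumptions):

from scrapy import signals
from scrapy.exceptions import NotConfigured


class MaxZyteAPIRequests:
    """Close the spider once a maximum number of Zyte API requests is reached."""

    def __init__(self, crawler, max_requests):
        self.crawler = crawler
        self.max_requests = max_requests

    @classmethod
    def from_crawler(cls, crawler):
        max_requests = crawler.settings.getint("MAX_ZYTE_API_REQUESTS")
        if not max_requests:
            raise NotConfigured
        ext = cls(crawler, max_requests)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        sent = self.crawler.stats.get_value("scrapy-zyte-api/processed", 0)
        if sent >= self.max_requests:
            self.crawler.engine.close_spider(spider, "max_zyte_api_requests")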

Cover extending retries in the docs

The docs currently cover an example that extends retries to HTTP 521 responses.

Another common need is to increase retries. We should include an example of that as well.
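
For reference, an increased-retries example could mirror the HTTP 521 one along these lines. This is a sketch: it assumes RetryFactory exposes a temporary_download_error_stop attribute and a build() method, following the pattern used in the plugin docs; exact names may differ across zyte-api versions.

from tenacity import stop_after_attempt
from zyte_api.aio.retry import RetryFactory


class CustomRetryFactory(RetryFactory):
    # Retry temporary download errors up to 10 times instead of the default.
    temporary_download_error_stop = stop_after_attempt(10)


ZYTE_API_RETRY_POLICY = CustomRetryFactory().build()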

Expose error info to downstream downloader middlewares upon Zyte API errors

..so that further processing may happen, like retries.

As of revision a188fcc, here's the code I'm referring to:

except RequestError as er:
    error_message = self._get_request_error_message(er)
    logger.error(
        f"Got Zyte API error ({er.status}) while processing URL ({request.url}): {error_message}"
    )
    raise IgnoreRequest()

Here's a sample log line a job may emit from the above code (in this example the job used Smart Browser, and the error indicated a ban that should be retried from the user's end):

[scrapy_zyte_api.handler] Got Zyte API error (520) while processing URL (): There is a downloading problem which might be temporary. Retry in N seconds from 'Retry-After' header.

In such cases the spider may want to retry these requests, but it cannot easily do so due to the download handler's behavior (raising IgnoreRequest).

Two possible approaches I can think of are:

  1. To unpack that zyte_api.aio.errors.RequestError exception into a scrapy.http.Request anyway and return it.
  2. To raise an exception different than scrapy.exceptions.IgnoreRequest and include the error details. (maybe just raise the same zyte_api.aio.errors.RequestError already?)

IMHO approach 2 may be a better one.
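
As a sketch of what approach 2 would enable on the user side, assuming the handler re-raised zyte_api.aio.errors.RequestError (or an exception wrapping it) instead of IgnoreRequest, a downloader middleware could then retry bans itself:

from zyte_api.aio.errors import RequestError


class RetryZyteAPIBansMiddleware:
    def process_exception(self, request, exception, spider):
        # 520 is the "temporary downloading problem" error shown in the log above.
        if isinstance(exception, RequestError) and exception.status == 520:
            return request.replace(dont_filter=True)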

Track retries in the crawler's stats

Background

Retries issued by zyte_api.aio.retry.RetryFactory are somewhat hidden. They are logged as DEBUG messages (so they are not seen by default in new projects with LOG_LEVEL: INFO) and, I believe, tracked generally under the scrapy-zyte-api/attempts stat. Only after the retries fail are they logged as errors and tracked as scrapy-zyte-api/fatal_errors.

scrapy-zyte-api/error_types/* also tracks the kind of errors and the amount we've seen, but that stat doesn't tell which were retries.

Suggestion

Would it be possible to explicitly track in the stats the retries issued by the retry policy, segregated by error type? And how many of those result in fatal_errors, also segregated by error type? This way, we could better track what's going on behind the scenes and use those stats to debug unusual behaviors.

This could be especially helpful for custom retry policies, in which case we might not know which codes are retried unless we look at the code or try to infer it from the existing stats.

test_max_requests fails randomly

I believe the root cause is a bug in the max request implementation, which is not reliable and can cause fewer requests to be sent.

zapi_req_count = self._crawler.stats.get_value("scrapy-zyte-api/processed", 0)
download_req_count = sum(
    [
        len(slot.transferring)
        for slot_id, slot in downloader.slots.items()
        if slot_id.startswith(self._slot_prefix)
    ]
)
total_requests = zapi_req_count + download_req_count

Looking at the upstream code, it seems a request remains in slot.transferring from right before handler.download_request is called until right after the response_downloaded signal handlers have run.

https://github.com/scrapy/scrapy/blob/1d11ea3a54607b436f9a88f07911902a4882f0e8/scrapy/core/downloader/__init__.py#L201-L234

And "scrapy-zyte-api/processed" is incremented at the end of handler.download_request:

self._update_stats(api_params)

So a request can have been counted into "scrapy-zyte-api/processed" but still be counted in slot.transferring.

Hence the random test failures where 4 requests are sent instead of 5: of those 4, one is counted twice.

Allow to access api_response

There should be a way to get the complete API response. Some options (see the sketch after this list):

  • store it in meta
  • create a custom Response class, store api response in an attribute
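
A minimal sketch of the second option (class and attribute names are illustrative, not the plugin's actual API): a Response subclass that carries the full API response dict, which the download handler would populate and callbacks would read.

from scrapy.http import TextResponse


class ZyteAPITextResponse(TextResponse):
    """TextResponse that also keeps the complete Zyte API response."""

    def __init__(self, *args, raw_api_response=None, **kwargs):
        self.raw_api_response = raw_api_response
        super().__init__(*args, **kwargs)


# In a callback: response.raw_api_response.get("browserHtml"), etc.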

Allow disabling AutoThrottle bypassing

The downloader middleware of scrapy-zyte-api was created to prevent AutoThrottle from affecting requests driven through Zyte API, and instead let Zyte API itself control throttling on the server side, sending HTTP 429 responses when a spider hits a website too hard.

Relying on Zyte API to handle per-website throttling should usually be the best solution: Zyte API has a better picture of the traffic a website can support, and central throttling control allows running multiple spiders against the same domain in parallel without increasing the overall concurrency to the upstream website.

However, some users might want to let AutoThrottle do its thing anyway. We could implement a setting to let them do just that.

Injection leaks into scrapy-poet additional requests

Not sure if the fix needs to be here or in scrapy-poet.

To reproduce the issue, create a test.py file with the following code:

from logging import getLogger

import attrs
from scrapy import Spider
from scrapy_poet import DummyResponse
from web_poet import (
    BrowserResponse,
    HttpClient,
    ItemPage,
    field,
    handle_urls,
)


logger = getLogger(__name__)

class DebugDownloaderMiddleware:
    def process_request(self, request, spider):
        logger.debug(f"{request} meta: {request.meta}")


class DebugDownloaderMiddleware2(DebugDownloaderMiddleware):
    pass


@attrs.define
@handle_urls("books.toscrape.com")
class TestPage(ItemPage[dict]):
    response: BrowserResponse
    http: HttpClient

    @field
    async def field(self):
        return await self.http.get(url="https://quotes.toscrape.com")


class TestSpider(Spider):
    name = "test"
    start_urls = ["https://books.toscrape.com"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            DebugDownloaderMiddleware: 1,
            "scrapy_poet.InjectionMiddleware": 543,
            DebugDownloaderMiddleware2: 544,
            "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 600,
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "REQUEST_FINGERPRINTER_CLASS": "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
            "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
        },
        "ZYTE_API_KEY": "YOUR_API_KEY",
        "ZYTE_API_TRANSPARENT_MODE": True,

        "SCRAPY_POET_PROVIDERS": {
            "scrapy_zyte_api.providers.ZyteApiProvider": 1100,
        },

        "ZYTE_API_LOG_REQUESTS": True,

    }

    async def parse(
        self,
        response: DummyResponse,
        page: TestPage,
    ):
        yield await page.to_item()

And install scrapy and scrapy-zyte-api[provider].

Then run scrapy runspider test.py.

Unexpectedly, the request to https://quotes.toscrape.com gets browserHtml set to True after going through the scrapy_poet.InjectionMiddleware downloader middleware:

2023-09-28 12:05:23 [test] DEBUG: <GET https://quotes.toscrape.com> meta: {}
2023-09-28 12:05:23 [scrapy_poet.downloadermiddlewares] DEBUG: Using DummyResponse instead of downloading <GET https://quotes.toscrape.com>
2023-09-28 12:05:23 [test] DEBUG: <GET https://quotes.toscrape.com> meta: {'zyte_api': {'browserHtml': True}, 'zyte_api_default_params': False}

It stops happening when you remove the BrowserResponse dependency from the page object.

So, something in the injection middleware or the scrapy-zyte-api provider is leaking the Zyte API params into additional requests somehow.

handler.py: `assert self._cookie_jars is not None` raises when Cookies are disabled

In a spider configured with 'COOKIES_ENABLED': 'False', I receive the assertion error from assert self._cookie_jars is not None.

assert self._cookie_jars is not None # typing

The setup code implies that having cookies disabled is an expected condition:

if not self._cookies_enabled:

In this codepath, self._cookie_jars is never set away from None, but the response handler insists that there is a cookiejar before returning the response.

This was added last month in 979a240

Bypass AutoThrottle

See if we can make it so that the AutoThrottle extension, when enabled, does not impact Zyte API requests, which handle website-specific throttling internally already.

Stat counter for page types requested

scrapy-autoextract logged the types of requested pages in the stats (code reference). For example:

autoextract/product/pages/count 4710
autoextract/product/pages/errors    28
autoextract/product/pages/success   4682
autoextract/productList/pages/count 290
autoextract/productList/pages/html  155
autoextract/productList/pages/success   290

It'd be nice if we could also track similar things in the stats for scrapy-zyte-api. We could include other things like httpResponseBody, browserHtml, etc.

The main motivation: during crawling, it would be great to know the ratio between productNavigation and product pages, which is useful for debugging the crawling behavior.
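
A rough sketch of how such per-output stats could be recorded in the download handler (the stat key format and the helper function are assumptions, not existing code):

def update_output_stats(stats, api_params):
    """Increment a stat for each Zyte API output requested in api_params."""
    tracked = (
        "httpResponseBody",
        "browserHtml",
        "screenshot",
        "product",
        "productNavigation",
        "article",
    )
    for key in tracked:
        if api_params.get(key):
            stats.inc_value(f"scrapy-zyte-api/request_args/{key}")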

Sending requests with manually-defined parameters is ignoring zyte_api meta parameter

I tried using the "Sending requests with manually-defined parameters" feature combined with the scrapy-poet integration, but it seems that scrapy_zyte_api.providers.ZyteApiProvider ignores the zyte_api parameter from my spider's request object:

https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/providers.py#L75-L88

        zyte_api_meta = {}
        if html_requested:
            zyte_api_meta["browserHtml"] = True
        for item_type, kw in item_keywords.items():
            if item_type in to_provide:
                zyte_api_meta[kw] = True
        api_request = Request(
            url=request.url,
            meta={
                "zyte_api": zyte_api_meta,
                "zyte_api_default_params": False,
            },
            callback=NO_CALLBACK,
        )

Is it possible to change zyte_api_meta = {} to zyte_api_meta = request.meta.get('zyte_api', {})?

Some discrepancies in request/response stats

Starting with scrapy-zyte-api >= 0.9.0, we're now able to request item extraction from Zyte API directly in the spider:

import scrapy
from scrapy_poet import DummyResponse
from zyte_common_items import Product


class SampleSpider(scrapy.Spider):
    name = "sample"

    def start_requests(self):
        yield scrapy.Request(
            "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            self.parse_product,
        )

    def parse_product(
        self,
        response: DummyResponse,
        product: Product,
    ):
        yield product

In the Scrapy crawl stats, we have these:

{
 ...
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'response_received_count': 2,
 'scrapy-zyte-api/attempts': 1,
 'scrapy-zyte-api/status_codes/200': 1,
 'scrapy-zyte-api/success': 1,
 'scrapy_poet/dummy_response_count': 1,
 ...
}

These are kind of correct since DummyResponse is technically a response so our spider would have received 2 responses (i.e. DummyResponse and Product). Although I'd probably prefer to have a new downloader/dummy_response_count stat in lieu of the existing scrapy_poet/dummy_response_count just so that they are closer to one another (which helps when visually inspecting them):

{
 ...
 'downloader/dummy_response_count': 1,
 'downloader/response_count': 2,
 ...
}

Another discrepancy is in Scrapy Cloud since it would appear that we have 2 requests:

[Screenshot: the Scrapy Cloud Requests tab shows 2 requests]

That is because scrapinghub-entrypoint-scrapy's sh_scrapy.HubstorageDownloaderMiddleware (ref) records requests in the Requests tab based on the responses received:

def process_response(self, request, response, spider):
    self.pipe_writer.write_request(
        url=response.url,
        status=response.status,
        method=request.method,
        rs=len(response.body),
        duration=request.meta.get('download_latency', 0) * 1000,
        parent=request.meta.setdefault(HS_PARENT_ID_KEY),
        fp=request_fingerprint(request),
    )
    ...

An option would be to update https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy to record the requests based on processing the actual requests instead of the responses. This seems quite straightforward, though I'm not sure about the reasoning behind the original implementation.

Error when enabling httpResponseHeaders

I believe this is a known issue already, but I don’t see it in the tracker.

from scrapy import Request, Spider


class ToScrapeSpider(Spider):
    name = 'toscrape_com'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler',
            'https': 'scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        yield Request(
            'https://toscrape.com',
            meta={
                'zyte_api': {
                    'httpResponseBody': True,
                    'httpResponseHeaders': True,
                },
            },
        )

    def parse(self, response):
        response.text

causes

2022-05-25 11:00:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://toscrape.com>
Traceback (most recent call last):
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: <200 https://toscrape.com/>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 60, in process_response
    response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 62, in process_response
    decoded_body = self._decode(response.body, encoding.lower())
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 82, in _decode
    body = gunzip(body)
  File "/home/adrian/.local/share/venv/docs.zyte.com/lib/python3.10/site-packages/scrapy/utils/gz.py", line 27, in gunzip
    chunk = f.read1(8196)
  File "/usr/lib/python3.10/gzip.py", line 314, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.10/gzip.py", line 488, in read
    if not self._read_gzip_header():
  File "/usr/lib/python3.10/gzip.py", line 436, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'<!')

concurrency handling

If I'm not mistaken, currently the download handler uses AsyncClient from python-zyte-api, which limits concurrency to 15 by default. Probably it would make sense to match it to Scrapy concurrency settings, instead of having this fixed limit.

Alternatively, we may use a separate setting to control Zyte API max concurrency.

Implementation note: don't forget about create_session concurrency options as well.
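
A sketch of the suggested approach, assuming the handler keeps using AsyncClient and create_session from zyte_api.aio.client (the n_conn and connection_pool_size arguments mirror what the handler already passes; the ZYTE_API_MAX_CONCURRENCY setting name is an assumption):

from zyte_api.aio.client import AsyncClient, create_session


def build_client(crawler):
    """Build the Zyte API client with concurrency matching the Scrapy settings."""
    settings = crawler.settings
    concurrency = settings.getint("ZYTE_API_MAX_CONCURRENCY") or settings.getint(
        "CONCURRENT_REQUESTS"
    )
    client = AsyncClient(api_key=settings.get("ZYTE_API_KEY"), n_conn=concurrency)
    session = create_session(connection_pool_size=concurrency)
    return client, session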

ZYTE_API_RETRY_POLICY doesn't work with Scrapy Cloud deployments

When deploying a Scrapy project to Scrapy Cloud, the process loads everything in the settings.py file and pickles all of its variables. That includes ZYTE_API_RETRY_POLICY, which is not pickleable, so it's not possible to deploy a project with a custom retry policy defined in settings.py.

Traceback (most recent call last):
  File "/usr/local/bin/shub-image-info", line 8, in <module>
    sys.exit(shub_image_info())
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 209, in shub_image_info
    _run_usercode(None, ['scrapy', 'shub_image_info'] + sys.argv[1:],
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 138, in _run_usercode
    settings = populate_settings(apisettings_func(), spider)
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/settings.py", line 243, in populate_settings
    return _populate_settings_base(apisettings, _load_default_settings, spider)
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/settings.py", line 172, in _populate_settings_base
    settings = get_project_settings().copy()
  File "/usr/local/lib/python3.10/site-packages/scrapy/settings/__init__.py", line 349, in copy
    return copy.deepcopy(self)
  File "/usr/local/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/local/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/local/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/lib/python3.10/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle '_thread._local' object
{"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}

{"status": "error", "message": "Internal error"}
Deploy log location: /tmp/shub_deploy_tn3yzm9m.log
Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'

Some workarounds for this problem are:

  • Define ZYTE_API_RETRY_POLICY inside the update_settings method of the spiders. This works for deploying because the class is not instantiated until the spider is running. However, it's not a nice solution for the overall project.

  • Make tenacity.AsyncRetrying pickleable. Not sure if this is even possible.

However, I think the proper solution would be to allow ZYTE_API_RETRY_POLICY to contain a str with the import path to the policy, similar to how other Scrapy settings work:
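
A sketch of that resolution logic, assuming the plugin used scrapy.utils.misc.load_object the way Scrapy itself resolves class-path settings:

from scrapy.utils.misc import load_object


def load_retry_policy(settings):
    """Accept either a retry policy object or its import path as a string."""
    policy = settings.get("ZYTE_API_RETRY_POLICY")
    if isinstance(policy, str):
        policy = load_object(policy)
    return policy


# settings.py would then only need a pickleable string, e.g.:
# ZYTE_API_RETRY_POLICY = "myproject.retries.CUSTOM_RETRY_POLICY"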

Template cache policy for failed browser automation

It would be good to either ship as a default cache policy, or provide in the documentation, a cache policy that refuses to cache responses with failed browser automations:

from scrapy.extensions.httpcache import DummyPolicy


class ZyteAPIPolicy(DummyPolicy):

    def should_cache_response(self, response, request):
        if not super().should_cache_response(response, request):
            return False

        # Do not cache responses where any browser action reported an error.
        actions = (getattr(response, "raw_api_response", None) or {}).get("actions", [])
        if any("error" in action for action in actions):
            return False

        return True

Two Zyte API extract requests with 'from __future__ import annotations'

With from __future__ import annotations, the spider sends two Zyte API requests:

[scrapy_zyte_api.handler] DEBUG: Sending Zyte API extract request: {"httpResponseBody": true, "httpResponseHeaders": true, "customHttpRequestHeaders": [{"name": "Accept", "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}, {"name": "Accept-Language", "value": "en"}, {"name": "Accept-Encoding", "value": "gzip, deflate, br"}], "url": "https://gizmodo.com"}
[scrapy_zyte_api.handler] DEBUG: Sending Zyte API extract request: {"article": true, "url": "https://gizmodo.com"}

The code to reproduce

from __future__ import annotations
from typing import Iterable

from scrapy import Request, Spider
from scrapy_poet import DummyResponse
from zyte_common_items import Article


class TestSpider(Spider):
    name: str = "test_spider"

    def start_requests(self) -> Iterable[Request]:
        for url in ["https://gizmodo.com"]:
            yield Request(
                url=url,
                callback=self.parse_test,
            )

    def parse_test(self, response: DummyResponse, article: Article) -> Iterable[Request]:
        proba = article.get_probability()
        if proba is None or proba >= 0.1:
            yield article

Without from __future__ import annotations, everything works as expected:

[scrapy_zyte_api.handler] DEBUG: Sending Zyte API extract request: {"article": true, "url": "https://gizmodo.com"}

The from __future__ import annotations line was needed to annotate, for example, this Pydantic model validator:

from typing import Optional

import pydantic
from pydantic import Field, model_validator
from scrapy import Spider


class SpiderParams(pydantic.BaseModel):
    one_param: Optional[bool] = Field()
    another_param: Optional[int] = Field()

    @model_validator(mode="after")
    def set_another_param_based_on_one_param(self) -> TestSpider:
        if self.one_param:
            self.another_param = 1
        return self


class TestSpider(SpiderParams, Spider):
    name: str = "test_spider"
    ...

The current Python version is

3.10.5 (main, Jun 27 2022, 15:08:49) [GCC 7.5.0]

ZyteApiProvider could make an unneeded API request

In the example below, ZyteApiProvider makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...

Support agent headers for HTTP requests

It turns out Zyte API can use the User-Agent header, and potentially other headers that we currently ignore during automatic request parameter mapping.

We need to allow specifying such headers in a way that does not cause a warning when using automatic request parameter mapping.

We might also want to expose some setting to allow picking those headers as part of automatic request parameter mapping, instead of requiring users to set customHttpRequestHeaders manually on requests needing them.

Review httpResponseHeaders automated mapping

Zyte API now allows requesting httpResponseHeaders as the only output. We may need to update our automatic request parameter mapping logic accordingly.

Also find out if it is possible to make such a request a browser-based request without also asking for a browser-specific output (browserHtml, screenshot), as that could also impact how we handle automated mapping for it. See what effect setting other browser-specific parameters, like actions, javascript or requestHeaders, has on this scenario.

Using browser actions raises a TypeError

Hi team.

Following the example to use Browser Actions, we're getting the following error:

File ".../lib/python3.11/site-packages/scrapy_zyte_api/_annotations.py", line 62, in <genexpr> return tuple(frozenset(action.items()) for action in value) TypeError: unhashable type: 'dict'

We're using the same code from the documentation:

from typing import Annotated

import attrs  # needed for @attrs.define below; Product and BasePage come from the rest of the docs example

from scrapy_zyte_api import Actions, actions


@attrs.define
class MyPageObject(BasePage):
    product: Product
    actions: Annotated[
        Actions,
        actions(
            [
                {
                    "action": "click",
                    "selector": {"type": "css", "value": "button#openDescription"},
                    "delay": 0,
                    "button": "left",
                    "onError": "return",
                },
                {
                    "action": "waitForTimeout",
                    "timeout": 5,
                    "onError": "return"
                },
            ]
        ),
    ]

Raise error on unsuccessful automation

When doing a browser automation that is not successful, it would be really helpful to either raise an error, change the status code from 200 to something more indicative of an error, or both.
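
A sketch of the raise-an-error option, assuming scrapy-zyte-api responses expose the full API response via a raw_api_response attribute (the middleware and exception names are illustrative):

class FailedActionsError(Exception):
    pass


class RaiseOnFailedActionsMiddleware:
    def process_response(self, request, response, spider):
        # Fail loudly if any browser action reported an error.
        actions = (getattr(response, "raw_api_response", None) or {}).get("actions", [])
        if any("error" in action for action in actions):
            raise FailedActionsError(f"Browser actions failed for {response.url}")
        return response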

Unsupported URL scheme 'https': The object should be created from async function

Following the notes for the settings file, we are experiencing an issue where the http and https handlers do not load as expected. Generically, we receive the exception: The object should be created from async function.

There are log lines mentioning asyncio, and aiohttp paths are referenced, so it seems like we are successfully loading asyncio. Do you have any thoughts on what this could be?

Relevant log lines:

2022-09-21 15:20:17 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: wheel_pricing)
2022-09-21 15:20:17 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.9.4 (default, Sep 20 2022, 14:25:08) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Linux-5.4.0-94-generic-x86_64-with-glibc2.27

2022-09-21 15:20:17 [scrapy.crawler] INFO: Overridden settings: {..., 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-09-21 15:20:17 [asyncio] DEBUG: Using selector: EpollSelector
2022-09-21 15:20:17 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-09-21 15:20:17 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop


2022-09-21 15:20:17 [scrapy.core.downloader.handlers] ERROR: Loading "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler" for scheme "https"
Traceback (most recent call last):
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/__init__.py", line 52, in _load_handler
    dh = create_instance(
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/utils/misc.py", line 166, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/http11.py", line 53, in from_crawler
    return cls(crawler.settings, crawler)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy_zyte_api/handler.py", line 58, in __init__
    self._session = create_session(connection_pool_size=self._client.n_conn)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/zyte_api/aio/client.py", line 32, in create_session
    kwargs["connector"] = TCPConnector(limit=connection_pool_size)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/connector.py", line 708, in __init__
    super().__init__(keepalive_timeout=keepalive_timeout,
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/connector.py", line 207, in __init__
    loop = get_running_loop()
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/aiohttp/helpers.py", line 276, in get_running_loop
    raise RuntimeError("The object should be created from async function")
RuntimeError: The object should be created from async function

2022-09-21 15:20:18 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.walmart.com/browse/auto-tires/wheels-and-rims/91083_4375198>
Traceback (most recent call last):
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/utils/defer.py", line 67, in mustbe_deferred
    result = f(*args, **kw)
  File "/home/me/.local/share/virtualenvs/scrapy_wheel_pricing-xzY05OPg/lib/python3.9/site-packages/scrapy/core/downloader/handlers/__init__.py", line 74, in download_request
    raise NotSupported(f"Unsupported URL scheme '{scheme}': {self._notconfigured[scheme]}")
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The object should be created from async function

When you trace the code, you find that the http and https keys are dropped from the downloaders dict after the first exception, and the second exception is raised because the dict no longer has those keys.

  • scrapy 2.6.2
  • scrapy-zyte-api 0.5.1
  • Python 3.9.4
  • Ubuntu 18.04

Enable httpResponseHeaders by default if httpResponseBody is enabled

for Scrapy it's probably an anti-pattern to request httpResponseBody, but not httpResponseHeaders, as the encoding detection may not work properly in this case. — @kmike

We could probably set it to True by default, but still allow users to disable it on purpose if they wish (I don’t see why, but it should be trivial to allow them).
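
A sketch of what that default could look like in the parameter-mapping code, with a hypothetical setting name so users could still opt out:

def apply_http_response_headers_default(api_params, settings):
    """Request headers alongside the body unless the user opts out."""
    if api_params.get("httpResponseBody") and settings.getbool(
        "ZYTE_API_DEFAULT_HTTP_RESPONSE_HEADERS", True
    ):
        api_params.setdefault("httpResponseHeaders", True)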

Log the actual request sent to Zyte API

When requests go through Zyte API, it may be useful for debugging purposes to log exactly which parameters are sent to Zyte API (i.e. the JSON API request body).

This should become especially useful once we merge #41.
