
itemloaders's Introduction

https://scrapy.org/img/scrapylogo.png

Scrapy


Overview

Scrapy is a BSD-licensed, fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, Twitter, mailing list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

itemloaders's People

Contributors

allait, ammarnajjar, curita, dacjames, dangra, digenis, ejulio, elacuesta, eliasdorneles, gallaecio, harshasrinivas, jdemaeyer, kmike, lopuhin, mabelvj, matthijsy, ngash, noviluni, nyov, pablohoffman, pkufranky, redapple, rmax, smirecki, sortafreel, sudshekhar, vmruiz, void, wrar, yarikoptic


itemloaders's Issues

Unexpected behaviour when adding a scrapy.Item subclass as a value to an ItemLoader

Sample code

import scrapy
from itemloaders import ItemLoader
from itemloaders.processors import Identity


class TestItem1(scrapy.Item):
    list_of_test_items = scrapy.Field()


class TestItem2(scrapy.Item):
    count = scrapy.Field()
    price = scrapy.Field()


class TestItemLoader(ItemLoader):
    default_item_class = TestItem1

    list_of_test_items_in = Identity()
    list_of_test_items_out = Identity()


il = TestItemLoader()
il.add_value('list_of_test_items', TestItem2(count=1, price=50))
print(il.load_item())

Expected output

{'list_of_test_items': [{'count': 1, 'price': 50}]}

Actual output

{'list_of_test_items': ['count', 'price']}

This is because _BaseItem was removed from _ITERABLE_SINGLE_VALUES: https://github.com/scrapy/itemloaders/blob/master/itemloaders/utils.py#L9

In Scrapy, _BaseItem is included in _ITERABLE_SINGLE_VALUES: https://github.com/scrapy/scrapy/blob/2.2/scrapy/utils/misc.py#L20
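
For reference, here is a sketch of what a fix in itemloaders/utils.py could look like (my own sketch, not the actual patch; it uses itemadapter's is_item to recognise item objects, since _BaseItem is not available outside Scrapy):

from itemadapter import is_item

_ITERABLE_SINGLE_VALUES = dict, str, bytes


def arg_to_iter(arg):
    # Treat item-like objects as single values again, the way
    # scrapy.utils.misc.arg_to_iter treats _BaseItem.
    if arg is None:
        return []
    if is_item(arg) or isinstance(arg, _ITERABLE_SINGLE_VALUES):
        return [arg]
    if hasattr(arg, '__iter__'):
        return arg
    return [arg]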

Extend ItemLoader processors

Currently there are three ways to add an ItemLoader processor:

  • The default_input/output_processor on the ItemLoader class
  • The field_name_in/out on the ItemLoader class
  • The input/output_processor on the scrapy.Field

Personally I use the input/output_processor on the scrapy.Field combined with the default_input/output_processor a lot, and I use them in combination. Often I just want to add one more processor after the default ones. Since input/output_processor on scrapy.Field overrides the defaults, this is quite hard to do.
So I would propose adding another way to register input/output processors. I would like to have something like add_input/add_output on the scrapy.Field, which would append the specified processor to the default processor.

I implemented this in my own ItemLoader class but think it would be useful for the Scrapy core. My implementation is as follows (original source: https://github.com/scrapy/scrapy/blob/master/scrapy/loader/__init__.py#L69). Of course the same can be done for get_output_processor.

def get_input_processor(self, field_name):
    proc = getattr(self, '%s_in' % field_name, None)
    if not proc:
        override_proc = self._get_item_field_attr(field_name, 'input_processor')
        extend_proc = self._get_item_field_attr(field_name, 'add_input')
        if override_proc and extend_proc:
            raise ValueError(f'Not allowed to define input_processor and add_input to {field_name}')
        if override_proc:
            return override_proc
        elif extend_proc:
            return Compose(self.default_input_processor, extend_proc)
        return self.default_input_processor
    return proc
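
For illustration, a field declaration using this hypothetical add_input key could look like the following (add_input is only the proposed name, not an existing feature):

import scrapy
from itemloaders.processors import MapCompose


class ProductItem(scrapy.Item):
    # input_processor replaces the default input processor entirely;
    # the proposed add_input would run *after* the default one instead.
    name = scrapy.Field(add_input=MapCompose(str.strip))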

I am not sure whether add_input is a good name; extend_input_processor is probably clearer, but it is quite long. I would like to hear whether more people want this feature and what you all think the naming should be.

ValueError: XPath error: Unknown return type: re.Pattern in //tr[starts-with(td[1]/text(), "Цена:")]/td[2]/text()

The error occurs when a compiled re.Pattern is passed as the re argument of add_xpath/add_css.

itemloaders==1.0.5

Code:

import re

import itemloaders

re_price = re.compile(r'([\d\s]*\d)')
loader = itemloaders.ItemLoader(item=dict(), response=response)
loader.add_xpath(
    'price',
    '//tr[starts-with(td[1]/text(), "Цена:")]/td[2]/text()',
    re=re_price)  # <- the re argument is a compiled re.Pattern

Traceback

ERROR: Spider error processing <GET http://www.avangard-voda.ru/catalog/osveschenie/galogenovye/78/1006030/> (referer: http://www.avangard-voda.ru/catalog/osveschenie/galogenovye/78/)
Traceback (most recent call last):
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/parsel/selector.py", line 254, in xpath
    result = xpathev(query, namespaces=nsp,
  File "src/lxml/etree.pyx", line 1599, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 300, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 93, in lxml.etree._XPathContext.registerVariables
  File "src/lxml/extensions.pxi", line 612, in lxml.etree._wrapXPathObject
lxml.etree.XPathResultError: Unknown return type: re.Pattern

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/scrapy/spiders/crawl.py", line 116, in _parse_response
    for request_or_item in iterate_spider_output(cb_res):
  File "/home/user/projects/amon/amon_long/amon_spider_muxin.py", line 245, in parse_item
    for i, item in iter_goods(response, url):
  File "/home/user/projects/amon/amon_long/spiders/avangard_voda_ru.py", line 53, in _parse_responce
    loader.add_xpath(
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/itemloaders/__init__.py", line 349, in add_xpath
    values = self._get_xpathvalues(xpath, **kw)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/itemloaders/__init__.py", line 386, in _get_xpathvalues
    return flatten(self.selector.xpath(xpath, **kw).getall() for xpath in xpaths)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/parsel/utils.py", line 21, in flatten
    return list(iflatten(x))
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/parsel/utils.py", line 27, in iflatten
    for el in x:
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/itemloaders/__init__.py", line 386, in <genexpr>
    return flatten(self.selector.xpath(xpath, **kw).getall() for xpath in xpaths)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/parsel/selector.py", line 260, in xpath
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "/home/user/projects/amon/venv/lib/python3.10/site-packages/parsel/selector.py", line 254, in xpath
    result = xpathev(query, namespaces=nsp,
  File "src/lxml/etree.pyx", line 1599, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 300, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 93, in lxml.etree._XPathContext.registerVariables
  File "src/lxml/extensions.pxi", line 612, in lxml.etree._wrapXPathObject
ValueError: XPath error: Unknown return type: re.Pattern in //tr[starts-with(td[1]/text(), "Цена:")]/td[2]/text()
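
A possible workaround (not from the original report): pass the pattern's source string instead of the compiled object, since the re keyword is also forwarded to selector.xpath() as an XPath variable and lxml can only wrap plain values such as strings:

import re

import itemloaders

re_price = re.compile(r'([\d\s]*\d)')
loader = itemloaders.ItemLoader(item=dict(), response=response)
loader.add_xpath(
    'price',
    '//tr[starts-with(td[1]/text(), "Цена:")]/td[2]/text()',
    re=re_price.pattern)  # pattern text instead of the compiled re.Pattern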

Add support for pre-commit

Hello,

I would like to suggest implementing pre-commit in this repo, with checks for Black, isort, and a linter like flake8 or ruff.

Adding checks for Black and isort can ensure that code formatting is consistent across the codebase. Additionally, a validator such as flake8 or ruff can help catch other potential issues such as unused imports or undefined variables.

I believe adding support for these checks would greatly benefit the project and improve code quality.

Response and other context is not passed to nested loader

When passing the response to an item loader:

loader = ItemLoader(item=Product(), response=response)

The response can be used in an input processor via the loader_context param:

def make_absolute_url(url, loader_context):
    return loader_context['response'].urljoin(url)

However, when using a nested loader:

loader = ItemLoader(item=Product(), response=response)
nested_loader = loader.nested_xpath('...')

The input processor fails with the exception:

  File "C:\migr\crawlers\crawlers\items.py", line 16, in make_absolute_url
    return loader_context['response'].urljoin(url)
AttributeError: 'NoneType' object has no attribute 'urljoin'

The response is not passed to the input processor, and I believe the reason is https://github.com/scrapy/scrapy/blob/acd2b8d43b5ebec7ffd364b6f335427041a0b98d/scrapy/loader/__init__.py#L55,
where a new context is created without reusing any of the current context.

To me, this is unexpected behavior. Since the loader is nested, I presumed that all context except the selector would be preserved unless explicitly overwritten in the call to nested_xpath. But maybe there is an explanation for not passing response and/or other context on to the nested loader. I can file a pull request if you agree that the behavior should be changed.
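
A possible stopgap (not from the original report), assuming extra keyword arguments to nested_xpath are forwarded into the child loader's context, is to pass the response along explicitly:

loader = ItemLoader(item=Product(), response=response)
# Extra keyword arguments to nested_xpath end up in the nested loader's
# context, so input processors can find the response again.
nested_loader = loader.nested_xpath('...', response=response)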

Feature Request: Adding "add_jmes" and "replace_jmes" method to ItemLoader

So, currently the ItemLoader class has six methods for loading values:

  • add_xpath()
  • replace_xpath()
  • add_css()
  • replace_css()
  • add_value()
  • replace_value()

Could we add two more methods for loading data through JMESPath selectors? Currently I have to use the SelectJmes processor for this, and it ends up looking really ugly and taking a lot of line real estate.
I did a hack where I extended the ItemLoader to include those two extra methods, which are desperately needed.

When there is JSON data to parse instead of HTML, JMESPath selectors are the best way to go, so there should be support for them in the ItemLoader class as well.
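
For reference, here is a sketch of what such an extension could look like (my own sketch, assuming a parsel version whose Selector provides .jmespath() and with the jmespath package installed):

from itemloaders import ItemLoader
from itemloaders.utils import arg_to_iter
from parsel.utils import flatten


class JmesItemLoader(ItemLoader):
    # ItemLoader with add_jmes/replace_jmes, mirroring add_css/add_xpath.

    def add_jmes(self, field_name, jmes, *processors, **kw):
        values = self._get_jmesvalues(jmes)
        self.add_value(field_name, values, *processors, **kw)

    def replace_jmes(self, field_name, jmes, *processors, **kw):
        values = self._get_jmesvalues(jmes)
        self.replace_value(field_name, values, *processors, **kw)

    def _get_jmesvalues(self, jmes_queries):
        # Accept a single expression or an iterable of expressions.
        jmes_queries = arg_to_iter(jmes_queries)
        return flatten(self.selector.jmespath(q).getall() for q in jmes_queries)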

Allow None values in Itemloaders/Items

Summary

I would like to pass None values to the ItemLoader and store them in an Item. Right now, None values are discarded, so working with the resulting Item does not work properly.

Motivation

Sometimes values are not available on every parsed page, and when the selector returns None, the database pipeline (Postgres) raises a KeyError: 'fieldname'.

I solved this problem by filling in a placeholder string that is later changed to None, but this seems like a hacky solution.

Add fallback selectors to ItemLoader

In some cases it is common to have fallback selectors for certain fields.
As a result, we end up writing code like

loader = MyLoader(response=response)
loader.add_css('my_field', 'selector1')
loader.add_css('my_field', 'selector2') # fallback 1
loader.add_css('my_field', 'selector3') # fallback 2

However, perhaps a better way would be

loader = MyLoader(response=response)
loader.add_css('my_field', [
    'selector1',
    'selector2', # fallback 1
    'selector3', # fallback 2
])

The API above would be the equivalent of the first example.
However, @cathalgarvey also shared a nice idea: stop at the first matching selector.

loader = MyLoader(response=response)
loader.add_css('my_field', [
    'selector1',
    'selector2', # fallback 1
    'selector3', # fallback 2
], selectors_as_preferences=True)

Then, if selector1 yields a result, the other ones are not attempted; otherwise we fall back to selector2 and so on.

The same API should be applied to loader.add_xpath.
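
A rough sketch of how the "preferences" variant could be implemented in a custom loader today (my own sketch, not an agreed API; it relies only on the loader's public selector attribute):

from itemloaders import ItemLoader


class FallbackItemLoader(ItemLoader):

    def add_css_fallback(self, field_name, css_list, *processors, **kw):
        # Try the CSS selectors in order and keep only the first one
        # that yields any values.
        for css in css_list:
            values = self.selector.css(css).getall()
            if values:
                self.add_value(field_name, values, *processors, **kw)
                break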

Optimizing wrap_loader_context()

When working on a loader-heavy project I found that a lot of inspect calls (and just a lot of function calls in general) are made for every field of every loader. wrap_loader_context() calls get_func_args() for each processor, which in turn does most of the aforementioned work.

My tests show that a simple @lru_cache(1024) on get_func_args() is enough, and that the cache in this case will contain one or a few entries for each processor used by the spider (so 1024 should be enough; though if it isn't, the cache will fail to give the performance improvement, so this is debatable).

Another option would be changing get_value() and get_output_value() so that wrap_loader_context() isn't called this often, but get_value() takes the processors as arguments.
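
For reference, a minimal sketch of the caching idea (not an actual patch; it assumes get_func_args lives in itemloaders.utils, and a real patch would apply the decorator where the function is defined so existing callers pick it up):

from functools import lru_cache

from itemloaders.utils import get_func_args

# Memoize the signature inspection so wrap_loader_context() does not
# re-inspect the same processor callable for every field of every loader.
cached_get_func_args = lru_cache(maxsize=1024)(get_func_args)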

*_{css,xpath} taking multiple selectors

When adding typing I discovered some code that is strange (to me), which is covered by tests and has been present since CSS/XPath support was added in 2013. Quoting the tests:

l.replace_css('name', ['p::text', 'div::text'])
l.get_css(['p::text', 'div::text'])

Yet it was always documented that these functions take a "selector" in the form of a "str".

Looks like we need to at least document this, as we will need to type these as Union[str, Iterable[str]]. We could remove it instead but I have no idea if it's used (I also have no idea about the actual use cases for this).

Fix empty __init__ limitation for dataclasses

Hm, that's an interesting workaround. It means that a user can't create Product(name='Plasma Tv', price=999.98) manually. I'm not sure I understand all the consequences. It seems we don't test init=False dataclasses in itemadapter (//cc @elacuesta); I'm not sure what the behavior is, e.g. with serialization, if fields are missing.

Overall, this looks like an issue in itemloaders which can be fixed, not a problem with dataclasses or itemadapter. Fixing it is out of scope for this PR though.

What do you think about asking users to provide default field values instead, and saying "currently" when explaining this limitation? We may want to lift it in the future.

Originally posted by @kmike in #13

Add n argument to TakeFirst processor

What do you think about adding an "n" argument to the TakeFirst processor?

  • By default it would be None, meaning the existing behaviour.
  • TakeFirst(3) would mean "return a list with at most 3 non-empty values" (a sketch of this follows below).
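
A minimal sketch of the idea (TakeFirstN is a hypothetical name; with n=None it mimics the current TakeFirst behaviour):

class TakeFirstN:

    def __init__(self, n=None):
        self.n = n

    def __call__(self, values):
        non_empty = [v for v in values if v is not None and v != '']
        if self.n is None:
            # existing TakeFirst behaviour: a single value (or nothing)
            return non_empty[0] if non_empty else None
        return non_empty[:self.n]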

Update Getting Started documentation

The first two paragraphs in Getting Started docs aren't that great for an introduction.
We should update them a bit
#4 (comment)

Here they are, for reference

To use an Item Loader, you must first instantiate it. You can either
instantiate it with a dict-like object (item) or without one, in
which case an item is automatically instantiated in the Item Loader __init__ method
using the item class specified in the :attr:ItemLoader.default_item_class
attribute.

Then, you start collecting values into the Item Loader, typically using
CSS or XPath Selectors. You can add more than one value to
the same item field; the Item Loader will know how to "join" those values later
using a proper processing function.

Assertion error after nested

If the XPath passed to nested_xpath matches nothing
Code

info = loader.nested_xpath('.//div[@class="product-card__more-info"]')
info.add_xpath('artikul', 'p[1][span/text() = "Код товара:"]/text()')

Assertion in _get_xpathvalues:

assert self.selector
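
A small guard works around the assertion when the nested XPath matches nothing (my own workaround sketch):

info = loader.nested_xpath('.//div[@class="product-card__more-info"]')
# If the nested XPath matched nothing, info.selector is an empty
# SelectorList, which is falsy and trips `assert self.selector`.
if info.selector:
    info.add_xpath('artikul', 'p[1][span/text() = "Код товара:"]/text()')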

KeyError with the initialization of an Item Field defined with None using ItemLoader

Using Scrapy 1.5.0.
I took a look at the FAQ section and nothing there was relevant.
The same goes for issues with the keyword KeyError on GitHub, Reddit, or Google Groups.

As you can see below, it seems to me that there is an inconsistency when we load an Item or initialize it with a value of None or an empty string. First we add a value to our field (here title) through an ItemLoader. Then the loader creates an item with the load_item() method. Once that's done, we can't access the field if the value was None or an empty string. The inconsistency is that the other method (initializing an Item directly with a field set to None or an empty string) doesn't raise a KeyError (the field is set).

The TakeFirst class doesn't return any value when given None or an empty string, which prevents the load_item() method of the ItemLoader class from adding an entry for the field.

Here is a minimal source code example that demonstrates the inconsistency.

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose

class MyItem(scrapy.Item):
    title = scrapy.Field()

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose()

class MySpider(scrapy.Spider):
    name = "My"
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        titles = ['fake title', '', None]

        for title in titles:
            # First case: with ItemLoader
            loader = MyItemLoader(item=MyItem())
            loader.add_value('title', title)
            loaded_item = loader.load_item()

            # Second case: without ItemLoader
            item = MyItem(title=title)

            if title in ('', None):
                # inconsistency!
                assert not 'title' in loaded_item
                assert 'title' in item

We're using Python 3.5 for our project and found the following workaround to prevent this error.
We introduce a new class (DefaultAwareItem) which fills in unset fields where default metadata has been set.

import scrapy

class DefaultAwareItem(scrapy.Item):
    """Item class aware of 'default' metadata of its fields.

    For instance to work, each field, which must have a default value, must
    have a new `default` parameter set in field constructor, e.g.::

        class MyItem(DefaultAwareItem):
            my_defaulted_field = scrapy.Field()
            # Identical to:
            #my_defaulted_field = scrapy.Field(default=None)
            my_other_defaulted_field = scrapy.Field(default='a value')

    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        for field_name, field_metadata in self.fields.items():
            self.setdefault(field_name, field_metadata.get('default'))

class MyItem(DefaultAwareItem):
    title = scrapy.Field()
    title_explicitely_set = scrapy.Field(default="empty title")

Fluent Interface call-style for ItemLoader methods

Hi, it would be very helpful if you added the ability to call ItemLoader's methods (add_xpath, add_css, etc.) fluently:

loader = (ItemLoader(selector=Selector(html_data))
	.add_xpath('name', '//div[@class="product_name"]/text()')
	.add_xpath('name', '//div[@class="product_title"]/text()')
	.add_css('price', '#price::text')
	.add_value('last_updated', 'today')
)

item = loader.load_item()

This is really a very tiny change: just add return self to the appropriate methods.
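
As a sketch, the same effect can be had today in a subclass without touching the library (my own illustration):

from itemloaders import ItemLoader


class FluentItemLoader(ItemLoader):
    # Each add_* method returns self so calls can be chained.

    def add_value(self, *args, **kwargs):
        super().add_value(*args, **kwargs)
        return self

    def add_xpath(self, *args, **kwargs):
        super().add_xpath(*args, **kwargs)
        return self

    def add_css(self, *args, **kwargs):
        super().add_css(*args, **kwargs)
        return self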

Calling get_output_value causes loader to assign an "empty" value to field

Description

If a field in an ItemLoader has not been set, calling get_output_value mistakenly assigns an empty value to that field. See the example below.

Steps to Reproduce

from scrapy.loader import ItemLoader
from scrapy import Item, Field

class MyItem(Item):
    field = Field()

loader = ItemLoader(MyItem())

Then, simply loading the item correctly produces an empty dict

>>> loader.load_item()
{}

But calling get_output_value before that instantiates the field:

>>> loader.get_output_value("field")
[]

>>> loader.load_item()
{'field': []}
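
A likely cause (my reading, not confirmed in the report) is that the loader stores collected values in a defaultdict(list) and get_output_value indexes into it directly, creating the missing key as a side effect. A minimal sketch of a fix, assuming that internal layout:

from itemloaders import ItemLoader
from itemloaders.common import wrap_loader_context


class PatchedItemLoader(ItemLoader):

    def get_output_value(self, field_name):
        proc = self.get_output_processor(field_name)
        proc = wrap_loader_context(proc, self.context)
        # Use .get() instead of indexing so the defaultdict is not
        # mutated and an unset field stays unset.
        value = self._values.get(field_name, [])
        return proc(value)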

The re-introduction of nested item support caused a significant performance degradation

I have a CPU-bound Scrapy project that becomes 50% slower after #29.

I believe the problem is that is_item is not a cheap call, and it will potentially become more expensive as itemadapter extends support to additional types.

I think we may be able to reimplement this function so that it does not call is_item at all, but still does not treat item-like objects as sequences.

input_processor called only once

Hi guys,

I am using an input processor on a custom ItemLoader, along with one declared inside a Field of the Item used by that ItemLoader. The problem is that only the ItemLoader's input processor fires. Is this the desired behavior?

Related code:

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

# `utils` is the project's own helpers module; its import is omitted here,
# as in the original snippet.


class ProductsItem(scrapy.Item):
    # this input_processor is not called
    currency = scrapy.Field(input_processor=MapCompose(utils.get_unified_currency_name))


class CustomProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # this input processor (currency_in) works fine
    currency_in = MapCompose(lambda x: x[0])


class MySpider(scrapy.Spider):
    # ....
    def parse(self, response):
        for product in response.css('bla bla'):
            loader = CustomProductLoader(ProductsItem(), product, response=response)
            loader.add_css('currency', 'bla bla')

Item default value is appended to the processor_in output in ItemLoader

Description

If I create an Item using a dataclass and define a default value, that default value ends up prepended to the list of values passed to the output processor. Should the default value not be overridden, or at least come at the end of that list, when the input processor result is not None? Is this intended behavior?

If so, it means that the user has to create a new processor, e.g. TakeSecond(), which works exactly like TakeFirst() except that it takes the second value in the array, in order to provide a non-None default value while still taking the first value in case the input processor received None.

Steps to Reproduce

import scrapy
from dataclasses import dataclass
from typing import Optional

from itemloaders.processors import TakeFirst, MapCompose, Join, Compose, Identity
from scrapy.loader import ItemLoader


# Item definition
@dataclass
class ArticleItem:
    user_rating: Optional[float] = -999


# Item Loader definition
class RappiLoader(ItemLoader):

    default_output_processor = TakeFirst()
    
    user_rating_in = MapCompose(lambda x: x.get('score') if isinstance(x, dict) else None)
    

Expected behavior:
resulting item should be ArticleItem(user_rating=4.5)

Actual behavior:
Result is ArticleItem(user_rating=-999)

If I inspect the value that gets fed to the output processor it is [-999, 4.5]. That's why when using TakeFirst(), it seems as if the input processor did not receive a valid value from the spider, which is not the case.

Reproduces how often:
Every time

Versions

Scrapy : 2.6.2
lxml : 4.8.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.4.0
Python : 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) - [Clang 13.0.1 ]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1s 1 Nov 2022)
cryptography : 37.0.4
Platform : macOS-13.1-x86_64-i386-64bit
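
My reading (not confirmed in the report): when the loader instantiates the item class, the pre-populated default field values are copied into the loader's internal value list before any scraped value is added, which is why the output processor sees [-999, 4.5]. A possible workaround is to keep the dataclass default as None, which that initial-item pass skips, and apply the real fallback after loading:

from dataclasses import dataclass
from typing import Optional

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


@dataclass
class ArticleItem:
    # A None default adds nothing to the loader's value list, so only the
    # scraped value reaches the output processor.
    user_rating: Optional[float] = None


class RappiLoader(ItemLoader):
    default_item_class = ArticleItem
    default_output_processor = TakeFirst()
    user_rating_in = MapCompose(lambda x: x.get('score') if isinstance(x, dict) else None)


loader = RappiLoader()
loader.add_value('user_rating', {'score': 4.5})
item = loader.load_item()
if item.user_rating is None:
    item.user_rating = -999.0  # apply the real default explicitly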

Speed regression in ItemLoader since 1.0

Hi

TL;DR: ItemLoader does not reuse response.selector when you pass the response to it as an argument, and it looks like it was reusing it up to Scrapy 1.0.

Recently we were trying to upgrade our setup to a newer version of Scrapy (we are on 1.0.5).
We noticed a huge slowdown in a lot of our spiders.

I dug in a bit and found that the bottleneck was ItemLoader, in particular when you create many ItemLoaders (one for each item), pass the response to every one of them, and the response size is relatively big.

The slowdown goes back to version 1.1, so on Scrapy 1.1 and above the performance degradation is present.

Test spider code (testfile is just a random file of 1MB size):

# -*- coding: utf-8 -*-
import os.path

import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field

HERE = os.path.abspath(os.path.dirname(__file__))


class Product(Item):
    url = Field()
    name = Field()


class TestSpeedSpider(scrapy.Spider):
    name = 'test_speed'

    def start_requests(self):
        file_path = HERE + '/testfile'

        file_url = 'file://' + file_path
        yield scrapy.Request(file_url)

    def parse(self, response):
        for i in xrange(0, 300):
            loader = ItemLoader(item=Product(), response=response)
            loader.add_value('name', 'item {}'.format(i))
            loader.add_value('url', 'http://site.com/item{}'.format(i))

            product = loader.load_item()

            yield product
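
A possible mitigation (not part of the original report) is to reuse the already-parsed response.selector instead of letting every ItemLoader build a new Selector from the response body:

# Inside parse(): the response's selector is parsed once and cached, so
# reusing it avoids re-parsing the 1 MB body for each of the 300 loaders.
loader = ItemLoader(item=Product(), selector=response.selector)

Note that with this approach the response object itself is no longer available in the loader context.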

Feature Request: Uniform MapCompose-like processor

It took me some time to figure out that MapCompose flattens iterables returned by the specified functions. Even though I then realized that this behavior is well documented, I still wonder: why is that? To be honest, it feels appropriate only for highly specific use cases. It works fine for unary values, but if people start using it on multi-valued fields (e.g., tuples) the results might be counter-intuitive.

The following code is a slightly adapted version of MapCompose, mostly just using append instead of += for value-concatenation:

# Imports as in the Scrapy 1.x loader code this is adapted from.
from scrapy.loader.common import wrap_loader_context
from scrapy.utils.datatypes import MergeDict


class ListCompose(object):

    def __init__(self, *functions, **default_loader_context):
        self.functions = functions
        self.default_loader_context = default_loader_context

    def __call__(self, values, loader_context=None):
        if loader_context:
            context = MergeDict(loader_context, self.default_loader_context)
        else:
            context = self.default_loader_context
        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
        for func in wrapped_funcs:
            next_values = []
            for v in values:
                next_values.append(func(v))
            values = next_values
        return values

Since nearly all my fields are initially created by add_xpath (an approach where Compose turns out to be pointless), I ended up using ListCompose a lot (just like MapCompose, it works for unary fields as well).

[NestedItemTest.test_scrapy_item] test failing on python 3.9

Hello, I'm getting a failed test on my machine, specifically the NestedItemTest.test_scrapy_item

============================================================================ test session starts =============================================================================
platform darwin -- Python 3.9.0, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /Users/chrisconte/itemloaders
plugins: Faker-8.6.0, anyio-2.1.0, betamax-0.8.1
collected 81 items                                                                                                                                                           

test_base_loader.py ......................................                                                                                                             [ 46%]
test_loader_initialization.py ..........                                                                                                                               [ 59%]
test_nested_items.py ...F                                                                                                                                              [ 64%]
test_nested_loader.py .....                                                                                                                                            [ 70%]
test_output_processor.py ..                                                                                                                                            [ 72%]
test_processors.py .....                                                                                                                                               [ 79%]
test_select_jmes.py .                                                                                                                                                  [ 80%]
test_selector_loader.py ..............                                                                                                                                 [ 97%]
test_utils_misc.py .                                                                                                                                                   [ 98%]
test_utils_python.py .                                                                                                                                                 [100%]

================================================================================== FAILURES ==================================================================================
______________________________________________________________________ NestedItemTest.test_scrapy_item _______________________________________________________________________

self = <test_nested_items.NestedItemTest testMethod=test_scrapy_item>

    def test_scrapy_item(self):
        try:
            from scrapy import Field, Item
        except ImportError:
            self.skipTest("Cannot import Field or Item from scrapy")
    
        class TestItem(Item):
            foo = Field()
    
>       self._test_item(TestItem(foo='bar'))

test_nested_items.py:50: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_nested_items.py:12: in _test_item
    self.assertEqual(il.load_item(), {'item_list': [item]})
E   AssertionError: {'item_list': ['foo']} != {'item_list': [{'foo': 'bar'}]}
E   - {'item_list': ['foo']}
E   + {'item_list': [{'foo': 'bar'}]}
E   ?                +     ++++++++
========================================================================== short test summary info ===========================================================================
FAILED test_nested_items.py::NestedItemTest::test_scrapy_item - AssertionError: {'item_list': ['foo']} != {'item_list': [{'foo': 'bar'}]}
======================================================================== 1 failed, 80 passed in 1.55s ======================================================================

Import of old scrapylib processor functions?

Old scrapylib had a few ItemLoader processors that were dropped with the codebase.

I preserved them here when the scrapylib repo disappeared.
They are mostly date/time parsing helpers and some cleaners. I wonder if they'd be useful to add here - or not. (Are they duplicating features that exist elsewhere and that I may have overlooked?)

I could see them become a namespace such as itemloaders.processors.extra - which wouldn't be auto-imported with itemloaders.processors and could have external dependencies that don't automatically become required - such as the dateutil parser lib used here?
Or perhaps name it itemloaders.processorlib - as in stdlib - for a place to have a few generically useful processor functions?
But perhaps it's also okay if they just disappear. I'm not sure, really.
