Comments (6)
I had done something similar in parsel: scrapy/parsel#27.
from itemloaders.
@voith that is really nice! You went a step further: you get JSON out of HTML using XPath and then use JMESPath to extract what you want. However, my request is for pure JSON data that contains no HTML whatsoever. I am scraping a lot of sites built with the React framework, and all of the data I need from these sites is in the JSON. APIs that return JSON are another reason to use JMESPath. Just as XPath/CSS selectors are favored over regular expressions, JMESPath is favored over turning the JSON data into a Python dict and accessing it directly with loops and key indexes. JMESPath provides a query string that can be used the same way as XPath/CSS selectors.
@IAlwaysBeCoding I understand your requirement. The above PR can also work with JMESPath alone.
My requirement was that I needed to chain selectors. I've scraped sites that return JSON with some embedded HTML in the response, so it would be nice to have nested selectors that would cover everyone's needs.
But the problem is that building such nested selectors is not that easy. My PR is not a foolproof solution either. It's just a POC to take the idea forward (I suspect there will be several bugs in my implementation). There was quite a discussion in scrapy/parsel#25 because of the complexity.
I think this feature should go to the ItemLoader repository: https://github.com/scrapy/itemloaders/
But I want it very much! The current workaround is probably defining a separate Item type for that (JSON) data item and then setting SelectJmes as input_processor.
That way, the item unfortunately is specific to the page.
I have this working in a project, and the path forward requires 4 changes in 3 repositories, 2 of them in Scrapy:
- scrapy/scrapy#4961
- scrapy/parsel#181
- Add JMESPath support to itemloaders
- Add JMESPath support to Scrapy (response.add_jmespath, otherwise you would need to use response.selector.add_jmespath)
I’m moving this to itemloaders since we have scrapy/scrapy#5894 for the remaining Scrapy work.