Comments (5)
@xciser77,
I didn't run into that. But perhaps the site renders its content with JavaScript after the page loads.
Try using ExecuteJavascriptMiddleware in your code:
https://roach-php.dev/docs/downloader-middleware/#executing-javascript
from core.
I am trying that. Do I only have to include use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware
in my spider, or do I also have to declare something in the downloader middleware?
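Going by the linked docs section, the use statement only imports the class name; the middleware also has to be listed in the spider's $downloaderMiddleware array. A minimal sketch, assuming the spatie/browsershot package and Puppeteer that the middleware relies on are installed (the class name ExampleSpider and the title lookup are hypothetical, for illustration only):

```php
<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

// Hypothetical spider name, for illustration only.
class ExampleSpider extends BasicSpider
{
    public array $startUrls = [
        'https://www.douglas.nl/nl/p/5009960042',
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        // Registering the class here is what activates the middleware;
        // importing it with `use` alone has no effect.
        ExecuteJavascriptMiddleware::class,
    ];

    public function parse(Response $response): Generator
    {
        // Yield plain strings, not Crawler objects.
        yield $this->item([
            'title' => $response->filterXPath('//title')->text(),
        ]);
    }
}
```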
Do you have an example repository illustrating this issue? Because the interactive shell uses the same mechanism to download a site's HTML as a spider does.
I've got this link (https://www.douglas.nl/nl/p/5009960042) and I am trying to get the prices.
`<?php

namespace App\Spiders;

use Generator;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\ExecuteJavascriptMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class douglasnl extends BasicSpider
{
    public array $startUrls = [
        'https://www.douglas.nl/nl/p/5009960042'
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 1;

    public int $requestDelay = 2;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $product_id = $response->filterXpath('//link[@rel="canonical"]')->link();
        $prizes = $response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);

        yield $this->item([
            'product_id' => $product_id,
            'prizes' => $prizes
        ]);
    }
}
`
If I try the filterXpath call ($response->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')->eq(0);) in the interactive shell, I get two results. In my Laravel app I get an error instead ( 0 => "Line 175, Col 44974: No match in entity table for 'Gabbana'"). This is how I run the spider:
Roach::startSpider(douglasnl::class); $items = Roach::collectSpider(douglasnl::class);
So I've tested this locally and you actually get back the same result both times. The difference is that the REPL processes the raw HTML a little bit before showing the results.
The issue is that you're yielding the entire Crawler object in your spider instead of just the string contents of the node. So your parse method should look something like this instead:
/**
 * @return Generator<ParseResult>
 */
public function parse(Response $response): Generator
{
    $product_id = $response->filterXpath('//link[@rel="canonical"]')
        ->link()
        // Return the actual URI string instead of the `Link` object.
        ->getUri();

    $prizes = $response
        ->filterXpath('//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]')
        ->eq(0)
        // Return the actual text contents of the node instead of the entire
        // `Crawler` object.
        ->text();

    yield $this->item([
        'product_id' => $product_id,
        'prizes' => $prizes
    ]);
}
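To see the difference between a node object and its string contents in isolation, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath as a stand-in for Roach's Response. That substitution is an assumption made so the snippet runs without Roach installed, and the HTML fragment and its text are placeholders, not the real page:

```php
<?php

// Placeholder markup, not copied from the real product page.
$html = <<<'HTML'
<html>
<head><link rel="canonical" href="https://www.douglas.nl/nl/p/5009960042"></head>
<body>
<div class="product-detail__variant-row product-detail__variant-row--spread-content">  placeholder variant text  </div>
</body>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// These are node objects, comparable to the Crawler/Link objects above.
$linkNode = $xpath->query('//link[@rel="canonical"]')->item(0);
$rowNode = $xpath->query(
    '//div[@class="product-detail__variant-row product-detail__variant-row--spread-content"]'
)->item(0);

// Extract plain strings before building the item.
$item = [
    'product_id' => $linkNode->getAttribute('href'),
    'prizes' => trim($rowNode->textContent),
];

echo json_encode($item), PHP_EOL;
```

The item now holds only strings, so it can be serialized or stored without dragging the parsed document along.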
Another thing is that you should call either Roach::startSpider or Roach::collectSpider, but not both, since that would actually cause the spider to run twice. Roach::collectSpider already starts the spider.
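A minimal sketch of the single-call version, assuming the spider class from the question lives in App\Spiders and that this runs somewhere in the Laravel app where Roach is already set up:

```php
<?php

use App\Spiders\douglasnl;
use RoachPHP\Roach;

// collectSpider() runs the spider itself and returns the scraped items,
// so there is no separate Roach::startSpider() call.
$items = Roach::collectSpider(douglasnl::class);

foreach ($items as $item) {
    // all() returns the item's data as a plain array.
    print_r($item->all());
}
```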