core's People

Contributors

aaronflorey, bangnokia, chrismcintosh, corbeil, daanra, daikazu, dependabot[bot], github-actions[bot], inxilpro, ksassnowski, michaelrog, ndeblauw, ricardobarantini, sweptsquash

core's Issues

Testing how a spider scrapes a given HTML file

Hello there,

Just a question: is there a simple way to feature test a spider by giving it some HTML and inspecting what it returns, e.g. making assertions against what collectSpider would return?

Many thanks

Seb
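(A minimal sketch of one way to do this, with stated assumptions: the RoachPHP\Http\Request and Response constructor signatures below are assumed to match the package's public classes, and your spider is assumed to be constructible directly; otherwise resolve it from your container.)

use GuzzleHttp\Psr7\Response as GuzzleResponse;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;

// Build a Response around a local HTML fixture instead of a live request.
$html = file_get_contents(__DIR__ . '/fixtures/page.html');

$request = new Request('GET', 'https://example.com', static function () {
    yield; // dummy parse callback, never invoked here
});
$response = new Response(new GuzzleResponse(200, [], $html), $request);

// Call the parse method directly and collect everything it yields;
// each result wraps either a scraped item or a follow-up request.
$spider = new MySpider();
$results = iterator_to_array($spider->parse($response), false);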

Question: Best way to process items that depend on another...?

Hello,

I am taking a look at this package, using the Laravel integration, actually.

I have a specific need. I would like to scrape a page, but the page has three components:

  • a parent component (usually)...let's say a country
  • the component itself...let's say a city
  • child components (usually)...let's say citizens

The country and citizen components are provided via links on the main page, with the latter possibly having a few bits of data that the main citizen page might not provide.

I'd like to parse the city page, but I'm not quite sure how to handle processing everything.

I would like to be able to grab the country link, creating it or updating it in my database, and then passing its ID along to the city processing, so that the city can be inserted/updated to become a part of the country.

Finally, I'd like to be able to process any of the citizens, again inserting/updating them in my database, as needed, all with reference to the city ID to which they belong. (And, ideally, handling some of the "extra" bits of data that might exist on the main city page.)

I can't quite figure out if I should have three different spiders, with the city spider calling the country and citizen spiders...or one city spider with different parser methods...? 🤔 In any case, I can't figure out how to pass the country/city database IDs along...and in the case of a single spider, I can't figure out how to make the item processors process one component versus another.

Any help or suggestions? I took a look at https://github.com/ksassnowski/roach-example-project, which was quite helpful, in general, but I didn't see how it could help with these particular problems.

Thanks in advance for your help. 🤓
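(A sketch of one way to structure this as a single spider with several parse methods. Since there is no obvious supported way to pass database IDs between parse methods in flight, this version keys rows by page URL and re-resolves them; the selectors and the Country model are hypothetical.)

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class CitySpider extends BasicSpider
{
    public function parse(Response $response): \Generator
    {
        // 1. Visit the country page first so its row exists before
        //    any city is saved (hypothetical selector).
        $countryUrl = $response->filter('a.country')->link()->getUri();

        yield $this->request('GET', $countryUrl, 'parseCountry');
    }

    public function parseCountry(Response $response): \Generator
    {
        // Key the row by the page URL so later parse methods can
        // re-resolve it without in-flight state (hypothetical model).
        Country::updateOrCreate(
            ['source_url' => $response->getRequest()->getUri()],
            ['name' => $response->filter('h1')->text()],
        );

        // 2. Queue the city pages this country links to.
        foreach ($response->filter('a.city')->links() as $link) {
            yield $this->request('GET', $link->getUri(), 'parseCity');
        }
    }

    public function parseCity(Response $response): \Generator
    {
        // 3. The city page links back to its country, so the parent row
        //    can be re-resolved by URL instead of passing an ID along.
        $countryUrl = $response->filter('a.country')->link()->getUri();
        $country = Country::firstWhere('source_url', $countryUrl);

        yield $this->item([
            'country_id' => $country?->id,
            'name' => $response->filter('h1')->text(),
        ]);
    }
}

The same URL-keyed lookup extends naturally to the citizen pages.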

Request Metrics

Is there currently a way to get metrics like name_lookup, time_connect, tls_handshake, etc?

[Laravel Sail] ExecuteJavascriptMiddleware not firing

I want to use this middleware, but it is not firing. I ran into an issue because I am using Laravel Sail on an M1 MacBook, and installing Puppeteer failed because Chromium is not arm64-ready:

The chromium binary is not available for arm64.

so I did the following.

I installed spatie/browsershot and ran

sail PUPPETEER_EXPERIMENTAL_CHROMIUM_MAC_ARM=1 npm i puppeteer

and everything seemed to install correctly, but ExecuteJavascriptMiddleware doesn't appear to be called, so I still get the:

<noscript>You need to enable JavaScript to run this app.</noscript>\n

version of the page returned.

I put breakpoints in ExecuteJavascriptMiddleware, but they never fire.

Am I doing something wrong, or have I missed a step?

I am using the Laravel adapter, so I thought the middleware was already injected into the container. Am I wrong?

Trying to parse the first page of a paginated result (Call to undefined method Generator::value())

I am trying to scrape a page that has paginated links at the bottom. In the Roach docs I found that you can override initialRequests() to find other URLs to scrape.

This is working as expected:

class ExampleSpider extends BasicSpider
{
    public function parseOverview(Response $response): \Generator
    {
        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying the second parameter,
            // all article pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

    public function parse(Response $response): \Generator
    {
        $items = $response->filter('.product-item')->each(function (Crawler $product, $i) {

            $productName = $product->filter('.product-item-link');
            $array['product_name'] = $productName->count() ? $productName->text() : null;

            $link = $product->filter('.product-item-link');
            $array['link'] = $link->count() ? $link->link()->getUri() : null;

            $imageUrl = $product->filter('.product-image-photo');
            $array['image_url'] = $imageUrl->count() ? $imageUrl->image()->getUri() : null;

            $salePrice = $product->filter('.price-final_price .price');
            $array['sale_price'] = $salePrice->count() ? $salePrice->text() : null;

            $regularPrice = $product->filter('.old-price span.price');
            $array['regular_price'] = $regularPrice->count() ? $regularPrice->text() : null;

            $attributeSize = $product->filter('.attribute.size');
            $array['attribute_size'] = $attributeSize->count() ? $attributeSize->text() : null;

            $savings = $product->filter('.sticker-wrapper');
            $array['savings'] = $savings->count() ? $savings->text() : null;

            return $array;
        });

        foreach ($items as $item) {
            if (!$item) {
                continue;
            }
            yield $this->item($item);
        }
    }

    /** @return Request[] */
    protected function initialRequests(): array
    {
        return [
            new Request(
                'GET',
                'https://www.example.com/5-pages', // Has 5 pages
                [$this, 'parseOverview']
            ),
            new Request(
                'GET',
                'https://www.example.com/1-page', // Has 1 page (no pagination)
                [$this, 'parseOverview']
            ),
        ];
    }
}

However, this only scrapes the pages that are gathered by the parseOverview() method. I would also like to use the $response object from the first page (https://www.example.com/5-pages) itself, and not only:

  1. https://www.example.com/5-pages?page=2
  2. https://www.example.com/5-pages?page=3
  3. https://www.example.com/5-pages?page=4
  4. https://www.example.com/5-pages?page=5

So I figured, as we have the first page already in the Response, I'll try running the $this->parse() method on the $response object in the parseOverview() method:

public function parseOverview(Response $response): \Generator
    {
        yield $this->parse($response); // Here I try yielding the parse() method using the response object from the first page

        $pageUrls = array_map(
            function (Link $link) {
                return $link->getUri();
            },
            $response
                ->filter('.pages-items li a')
                ->links(),
        );

        foreach ($pageUrls as $pageUrl) {
            // Since we’re not specifying the second parameter,
            // all article pages will get handled by the
            // spider’s `parse` method.
            yield $this->request('GET', $pageUrl);
        }
    }

However, when running the Spider I get the following error: Call to undefined method Generator::value()

I tried adding the first page URL to the $pageUrls array, but then I get a DuplicatedRequest. That behaviour makes sense, because I do not want to fire the request twice when we already have a working Response object.

What do you recommend to change to make sure I get the data of the first page also?
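(A likely cause, for what it's worth: yield $this->parse($response) yields the Generator object returned by parse() as a single value, which the engine then tries to treat as a parse result, hence the Generator::value() error. Delegating with yield from forwards each value the inner generator produces instead. A sketch of the changed method:)

public function parseOverview(Response $response): \Generator
{
    // Delegate to parse() so each item it yields is forwarded,
    // instead of yielding the Generator object itself.
    yield from $this->parse($response);

    // ...then queue the paginated URLs as before.
}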

Conflict with Laravel 9

Hello.

Your requirements could not be resolved to an installable set of packages.

Problem 1
- Root composer.json requires roach-php/core ^2.0 -> satisfiable by roach-php/core[2.0.0].
- roach-php/core 2.0.0 requires monolog/monolog ^3.1 -> found monolog/monolog[3.1.0, 3.2.0, 3.3.0, 3.3.1] but the package is fixed to 2.9.1 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.

Provide more built-in middleware

Right now, the package only ships with a bare minimum of built-in middleware.

Downloader Middleware

  • RobotsTxt
  • Set user agent
  • Request deduplication
  • Cookies

Spider Middleware

  • Maximum crawl depth

Extensions

  • Maximum requests

Add Retry Middleware

Is it possible to add retry middleware? Scrapy also comes with retry middleware.

League Container and Symfony container conflicts

I am using roach-php in a Symfony 6 project. I am trying to inject the EntityManagerInterface into my ItemProcessorInterface implementation to save the object in the DB. But doing that seems to create some kind of conflict between the containers:

Alias (Doctrine\ORM\EntityManagerInterface) is not being managed by the container or delegates
in (League) Container.php, 188

This also happens if I inject dependencies into the Spider class. Is there any workaround for this? Maybe telling Symfony to ignore these classes and using the League container instead? I have no idea how to do that, since the League container is instantiated inside vendor/roach-php.

Interactive shell vs. real code

I was trying roach-php in a Laravel project. When I try a filter in the interactive shell, I get the data I want.
But if I use the same filter in my spider file, I don't get the data.

It looks like the interactive shell retrieves the remote data differently: if I dd() the returned array in Laravel and look at the remote HTML data, there is less information available than in the interactive shell.

Can I use config values to get the same results in real code as in the interactive shell?

How to pass extra header for blocked sites?

Hey, first of all, thanks for this excellent package; I'm a newbie to PHP and Laravel. Some sites block crawlers, so this package returns
The current node list is empty.
So I probably need to pass header information with the GET request (a Mozilla/Chrome user agent string, etc.).

I use this:

public array $downloaderMiddleware = [
    [RoachPHP\Downloader\Middleware\UserAgentMiddleware::class, ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)']],
];

But it does not work. The output is an Illuminate container error referencing:

App\Spiders\RoachPHP\Downloader\Middleware\UserAgentMiddleware

How can I do that with this package?
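(For what it's worth, the error above suggests the class name string was resolved relative to the spider's namespace, i.e. it was missing a leading backslash or a use import. A sketch of the registration using an imported ::class constant, reusing the property format from the snippet above:)

use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    // The ::class constant always resolves to the fully qualified name,
    // so the container can locate the middleware.
    public array $downloaderMiddleware = [
        [UserAgentMiddleware::class, ['userAgent' => 'Mozilla/5.0 (compatible; RoachPHP/0.1.0)']],
    ];
}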

How do I access items once all item pipelines are finished?

I am running unit tests and I want to get all scraped items into an array. I plan to show the results in a Vue component, so I need to return the results from a Laravel controller.

I am getting phpUnit logs that the terms have been successfully crawled. However the below results produces a null result.

$customs = Roach::startSpider(CustomSpider::class);

//roach.INFO: Run starting [] []
//roach.INFO: Item scraped {"name":"xxx"} []
//roach.INFO: Item scraped {"name":"xxx2"} []


foreach ($customs as $cus) {
    dd($cus);
}

// foreach() argument must be of type array|object, null given

Please help. I have read your docs, but they talk about handling data within the generator (the item pipeline) and say nothing about exporting results.

Thanks
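(A note that may help here: Roach::startSpider() runs the spider but returns nothing, while Roach::collectSpider() returns the items after they have passed through the item pipeline. A sketch, where the all() accessor on the collected items is an assumption to verify against your version of ItemInterface:)

use RoachPHP\Roach;

// collectSpider() returns the processed items; startSpider() does not.
$items = Roach::collectSpider(CustomSpider::class);

$results = [];
foreach ($items as $item) {
    $results[] = $item->all(); // assumed accessor for the raw data array
}

// $results can then be returned from a Laravel controller, e.g.
// return response()->json($results);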

Missing ext-exif dependency

In my project, when I run composer require roach-php/core, I get this error:

Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - spatie/image[1.5.3, ..., 1.7.6] require php ^7.0 -> your php version (8.0.15) does not satisfy that requirement.
    - spatie/image[1.7.7, ..., 1.10.6, 2.0.0, ..., 2.2.1] require ext-exif -> it is missing from your system. Install or enable PHP's exif extension.
    - spatie/browsershot[3.52.0, ..., 3.52.2] require symfony/process ^4.2|^5.0 -> found symfony/process[v4.2.0, ..., v4.4.37, v5.0.0, ..., v5.4.3] but these were not loaded, likely because it conflicts with another require.
    - roach-php/core 0.3.0 requires spatie/browsershot ^3.52 -> satisfiable by spatie/browsershot[3.52.0, 3.52.1, 3.52.2, 3.52.3].
    - spatie/browsershot 3.52.3 requires spatie/image ^1.5.3|^2.0 -> satisfiable by spatie/image[1.5.3, ..., 1.10.6, 2.0.0, 2.1.0, 2.2.0, 2.2.1].
    - Root composer.json requires roach-php/core ^0.3.0 -> satisfiable by roach-php/core[0.3.0].

To enable extensions, verify that they are enabled in your .ini files:
    - /usr/local/etc/php/php.ini
    - /usr/local/etc/php/conf.d/docker-php-ext-opcache.ini
    - /usr/local/etc/php/conf.d/docker-php-ext-sodium.ini
    - /usr/local/etc/php/conf.d/docker-php-ext-xdebug.ini
You can also run `php --ini` in a terminal to see which files are used by PHP in CLI mode.
Alternatively, you can run Composer with `--ignore-platform-req=ext-exif` to temporarily ignore these required extensions.
You can also try re-running composer require with an explicit version constraint, e.g. "composer require roach-php/core:*" to figure out if any version is installable, or "composer require roach-php/core:^2.1" if you know which you need.

I am using the php:8.0-cli-alpine Docker image. Same result with the php:8.0-cli image (Debian-based, I think).

When I use

RUN docker-php-ext-install exif

everything seems to work. Should this extension be added to composer.json and to the docs?

Unable to mark scans as failed

Maybe I am missing something in the documentation, but when you send the spider out to, say, https://1013polebeauty.com, it would be nice to be able to mark this domain as invalid so we do not crawl it again.

I looked into the downloader middleware and spider middleware; however, the response already seems to be processed at that point. Is there any way to look at the response code and, if there is an issue, yield something or update a Laravel model?
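(A sketch of one possible approach: a downloader response middleware that inspects the status code and drops the response. drop() and the ResponseMiddlewareInterface shape mirror the ExecuteJavascript middleware quoted near the end of these issues; getStatus() is assumed to exist on Response, and the model update is a hypothetical illustration.)

use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
use RoachPHP\Http\Response;
use RoachPHP\Support\Configurable;

final class MarkInvalidDomainsMiddleware implements ResponseMiddlewareInterface
{
    use Configurable;

    public function handleResponse(Response $response): Response
    {
        if ($response->getStatus() >= 400) {
            // Hypothetical Eloquent update marking the domain as invalid:
            // Domain::where('url', $response->getRequest()->getUri())
            //     ->update(['invalid' => true]);

            return $response->drop('Domain returned an error status');
        }

        return $response;
    }
}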

Event listeners get registered multiple times

It seems the event system is not able to handle the execution of multiple spiders.
I have written a console command which triggers multiple runs of various spiders.

For example, there are spiders A and B which are called by Roach::startSpider in the same command.

They get executed and the requests are performed once, which is correct, but it seems the event listeners are registered once per spider, so spider B gets logged twice when the LoggerExtension is active:

local.INFO: Run starting  
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx1"} 
local.INFO: Run finished  
local.INFO: Run starting  
local.INFO: Run starting  
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx2"} 
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx2"} 
local.INFO: Run finished  
local.INFO: Run finished  

If there are three spiders running in one script call, event listeners are getting registered thrice and the output is as follows:

local.INFO: Run starting  
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx1"} 
local.INFO: Run finished  
local.INFO: Run starting  
local.INFO: Run starting  
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx2"} 
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx2"} 
local.INFO: Run finished  
local.INFO: Run finished  
local.INFO: Run starting  
local.INFO: Run starting  
local.INFO: Run starting  
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx3"} 
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx3"} 
local.INFO: Dispatching request {"uri":"xxxxxxxxxxxxxxxxxx3"} 
local.INFO: Run finished  
local.INFO: Run finished 
local.INFO: Run finished 

With 5 running spiders, each event was handled 5 times (and so on)...
The problem also occurs when the same spider is called multiple times in the same script.

Crawling entire site?

Hello,

Does roach-php support following links and crawling an entire site? I read the documentation on Scraping versus Crawling and understand the difference between the two... but throughout the documentation, both terms "scraper" and "crawler" are used to describe roach-php. So my question is: does roach-php support crawling? If so, I can't seem to find it anywhere in the documentation.

Thank you.

Handle different HTTP verbs

The Request class already accepts a $method parameter, but none of the helper methods to create a request do. Creating more complicated requests should be possible by using the Request::withGuzzleRequest method to directly configure the underlying Guzzle request. This needs to be documented, however.
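(Since the Request class already accepts a $method parameter, as noted above, here is a sketch of issuing a non-GET request directly from initialRequests(); attaching a body or headers would then go through the withGuzzleRequest method mentioned above, whose exact signature is not shown here.)

use RoachPHP\Http\Request;

/** @return Request[] */
protected function initialRequests(): array
{
    return [
        // The constructor's $method parameter selects the HTTP verb.
        new Request('POST', 'https://example.com/search', [$this, 'parse']),
    ];
}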

RequestSchedulerInterface is not instantiable

I'm giving this project a go, and I get the following error when attempting to run my spider.

I tried PHP 8.0 and 8.1.

Illuminate\Contracts\Container\BindingResolutionException
Target [RoachPHP\Scheduling\RequestSchedulerInterface] is not instantiable while building [RoachPHP\Core\Engine].

Run namespace and Request serialization

Would you accept a PR that adds a namespace to a run?

My use case is that I want to write a RequestSchedulerInterface implementation that runs in Redis so I can resume runs, but there's no way to make the run consistent across runs without passing through something from the spider. So my solution is to pass the spider class name into the Run, and then I can pass that to the scheduler.

Another thing that I need to support is serialization of requests, which isn't currently possible since the callable is stored as a closure. So I propose we store the callable and convert it to a closure later; that way I can serialize the request for later use.

As I said in my other issue #102, I assume there's reasoning behind the way things are currently done, so I won't make a PR until I get your go-ahead to make these outlandish changes!

Defining Different Parse Methods does not work

I have a spider which gets some links from the response.

I call $this->request('GET', $fullUrl, 'parseSportDetailPage'); in the parse method and also define
the parseSportDetailPage method.

public function parseSportDetailPage(Response $response)
{
    dd(123);
}

But it's never reached. I tried dd(123) inside parseSportDetailPage and it didn't stop; nothing was thrown, and there are no logs either.
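(One thing worth checking, sketched below: $this->request() only produces a value, and in a generator an un-yielded value is simply discarded, so the request is never scheduled unless the parse method actually yields it.)

public function parse(Response $response): \Generator
{
    // Without `yield`, the request object is created and then thrown away.
    yield $this->request('GET', $fullUrl, 'parseSportDetailPage');
}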

Middleware for Downloader

While having RequestMiddleware and ResponseMiddleware is awesome, what are your thoughts on a Downloader middleware that takes a request and returns a response?

An example use case is the JavascriptMiddleware, which could take the request, render the page, and return the response itself, instead of running the request through the downloader and then overwriting the response. Another use case would be caching requests between runs.

I'm sure there's a reason why this isn't already a thing, so I was wondering what would need to be done to get this implemented. I'm happy to contribute to get it done.

How to pass data to the spider?

I need to pass data to the spider about the previous request.

To be specific, I want to store which sites contain a link to the current site in their content. But I cannot send any extra information to my crawler, such as information about the previous request.

I wanted something like this:

    \RoachPHP\Roach::startSpider(MySpider::class,
        new \RoachPHP\Spider\Configuration\Overrides([
            'startUrls' => 'https://masaf.ir',
            'reference' => 'https://masaf.ir'
        ]),
    );

But Overrides only overrides the spider's conventional configuration. :(


In other, simpler words: I want to make the spider configurable somehow and send data to it.
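(Later versions of the package accept a context array alongside the overrides; the "Overriding Spider conf or passing context doesn't work" issue further down uses the same API. A sketch, assuming that argument is available in your version:)

use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

Roach::startSpider(
    MySpider::class,
    new Overrides(startUrls: ['https://masaf.ir']),
    context: ['reference' => 'https://masaf.ir'],
);

// Inside the spider, the run context is then available as $this->context:
// $reference = $this->context['reference'] ?? null;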

Argument #1 ($request) must be of type RoachPHP\Http\Request, string given

Hey,
I have no idea why I get this error, but it started when I began starting the spiders from a Laravel job instead of directly from the Kernel.php file.

TypeError: RoachPHP\Core\Engine::scheduleRequest(): Argument #1 ($request) must be of type RoachPHP\Http\Request, string given, called in /home/forge/clarken.nomess.se/vendor/roach-php/core/src/Core/Engine.php on line 65 and defined in /home/forge/clarken.nomess.se/vendor/roach-php/core/src/Core/Engine.php:125

Fix broken HTML before it is parsed

Heya,

The site I'm crawling has some really bad HTML issues that mess up parsing, and I need to manipulate the HTML before handing it over to the parser. Is that possible, and if so, how?

Thanks for any suggestions!
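(A sketch of one possible approach: a downloader response middleware that rewrites the body before the spider sees it. The interface and withBody() usage mirror the ExecuteJavascript middleware quoted near the end of these issues; getBody() is assumed to expose the raw HTML.)

use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
use RoachPHP\Http\Response;
use RoachPHP\Support\Configurable;

final class FixBrokenHtmlMiddleware implements ResponseMiddlewareInterface
{
    use Configurable;

    public function handleResponse(Response $response): Response
    {
        $html = (string) $response->getBody();

        // Example repair: let DOMDocument swallow and re-serialize the
        // broken markup (libxml is very forgiving about bad HTML).
        $dom = new \DOMDocument();
        @$dom->loadHTML($html, \LIBXML_NOERROR | \LIBXML_NOWARNING);

        return $response->withBody($dom->saveHTML());
    }
}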

Is there any way of accessing the first item inside a processor or an extension?

Description
I'm looking for a way of accessing the first item inside a Processor or inside an Extension.

It would give me the ability to initialize a CSV file with the item props as the header, which is only needed for the first item.

Proposed solution
If it's inside an Extension, I'd be looking for a FirstItem Event, to which I can subscribe and access the data before it's processed in the pipeline.

Considered alternatives
If it's inside a Processor, I'd be looking for an $item->isFirst() method added directly to the ItemInterface, or a firstItem() method on the ItemProcessorInterface.

Additional context
(Screenshots of the three proposed solutions were attached to the original issue.)

Scraping and crawling with Laravel Dusk

Hello!

Awesome idea crafting this project, I'm really looking forward to using it when scraping data.

Some websites rely on Javascript heavily and require interactivity to reach certain pieces of information. Is there any way of using something like Laravel Dusk's interactivity features with Roach?

Requeuing requests

Hi!

How can I implement something like a RetryMiddleware that retries requests after certain errors?

I see that the Guzzle promise currently has no onRejected callback assigned, and exceptions thrown in the Client are not handled.

Overriding Spider conf or passing context doesn't work

I'm calling another spider from my main spider with overrides and context, but the second spider doesn't pick up either of them.

class MainSpider extends BasicSpider
{
    // ...

    public function parse(Response $response): Generator
    {
        // ...

        Roach::startSpider(
            AnotherSpider::class,
            new Overrides(startUrls: ['https://github.com']),
            context: ['name' => 'test'],
        );
    }
}

class AnotherSpider extends BasicSpider
{
    // here I have to redefine $startUrls because the spider doesn't apply the overrides...

    public function parse(Response $response): Generator
    {
        dd($this->context); // returns an empty array
    }
}

Using Proxies, possible?

Hi,

I cannot find it in your documentation but I would like to know if there is a way to change IP before running:

Roach::collectSpider();

Scraping without changing IP is not possible in most cases.

Thank you,
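(Purely a sketch, and not from the documentation: Roach dispatches requests through Guzzle, so if the Request constructor's $options array is forwarded as Guzzle request options (an assumption worth verifying for your version), Guzzle's standard 'proxy' option would be one way in:)

use RoachPHP\Http\Request;

/** @return Request[] */
protected function initialRequests(): array
{
    return [
        new Request('GET', 'https://example.com', [$this, 'parse'], [
            // Guzzle's standard proxy option; only effective if Roach
            // forwards these options to the Guzzle client.
            'proxy' => 'http://user:pass@proxy.example.com:8080',
        ]),
    ];
}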

Pass Context into Request Middleware

Is there any way to pass context into request middleware or an item processor?

Within an item processor, if you need a bit of context from the spider before you can save the item to the database, there seems to be no way to access any metadata or the Request/Response objects.

Overriding Not Working

Hi, hope all is well.
I'm trying to pass the URL dynamically using overrides, but it's not working and the response is null.
Thank you!

Roach::startSpider(
    LoremIpsumSpider::class,
    new Overrides(startUrls: ['https://sinarahmannejad.com']),
);

$result = Roach::collectSpider(LoremIpsumSpider::class);

Symfony 5.4.1 issue

Hello, I'm having a problem installing the core package in a Symfony project:

Your requirements could not be resolved to an installable set of packages.

Problem 1
- spatie/browsershot[3.52.0, ..., 3.57.6] require spatie/image ^1.5.3|^2.0 -> satisfiable by spatie/image[1.5.3, ..., v1.x-dev, 2.0.0, ..., 2.2.5].
- roach-php/core[dev-configuration-overrides, dev-execute-javascript, 0.1.0, ..., 0.2.0] require psy/psysh ^0.10.8 -> satisfiable by psy/psysh[v0.10.8, ..., 0.10.x-dev].
- spatie/image[1.5.3, ..., 1.7.6] require php ^7.0 -> your php version (8.1.16) does not satisfy that requirement.
- spatie/image[1.7.7, ..., v1.x-dev, 2.0.0, ..., 2.2.5] require ext-exif * -> it is missing from your system. Install or enable PHP's exif extension.
- psy/psysh[v0.10.8, ..., 0.10.x-dev] require symfony/console ~5.0|~4.0|~3.0|^2.4.2|~2.3.10 -> found symfony/console[v2.3.10, ..., 2.8.x-dev, v3.0.0-BETA1, ..., 3.4.x-dev, v4.0.0-BETA1, ..., 4.4.x-dev, v5.0.0-BETA1, ..., 5.4.x-dev] but it conflicts with your root composer.json require (^6.0).
- roach-php/core[dev-ci-stuff, 1.0.0] require psr/container ^2.0 -> found psr/container[dev-master, 2.0.0, 2.0.1, 2.0.2, 2.0.x-dev (alias of dev-master)] but the package is fixed to 1.1.2 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
- roach-php/core[dev-spider-testing-helpers, dev-custom_item_classes, dev-main, 1.1.0, ..., 1.x-dev, 2.0.0, ..., 2.0.1] require guzzlehttp/guzzle ^7.4.5 -> found guzzlehttp/guzzle[dev-master, 7.4.5, 7.5.0, 7.5.x-dev (alias of dev-master)] but the package is fixed to 7.4.1 (lock file version) by a partial update and that version does not match. Make sure you list it as an argument for the update command.
- roach-php/core 0.3.0 requires spatie/browsershot ^3.52 -> satisfiable by spatie/browsershot[3.52.0, ..., 3.57.6].
- Root composer.json requires roach-php/core * -> satisfiable by roach-php/core[dev-spider-testing-helpers, dev-custom_item_classes, dev-ci-stuff, dev-configuration-overrides, dev-execute-javascript, dev-main, 0.1.0, 0.2.0, 0.3.0, 1.0.0, ..., 1.x-dev, 2.0.0, 2.0.1, 9999999-dev].

To enable extensions, verify that they are enabled in your .ini files:
- /usr/local/etc/php/conf.d/docker-fpm.ini
- /usr/local/etc/php/conf.d/docker-php-ext-gd.ini
- /usr/local/etc/php/conf.d/docker-php-ext-intl.ini
- /usr/local/etc/php/conf.d/docker-php-ext-opcache.ini
- /usr/local/etc/php/conf.d/docker-php-ext-pcntl.ini
- /usr/local/etc/php/conf.d/docker-php-ext-pdo_mysql.ini
- /usr/local/etc/php/conf.d/docker-php-ext-redis.ini
- /usr/local/etc/php/conf.d/docker-php-ext-sodium.ini
- /usr/local/etc/php/conf.d/docker-php-ext-xdebug.ini
- /usr/local/etc/php/conf.d/docker-php-ext-zip.ini
- /usr/local/etc/php/conf.d/xdebug.ini
You can also run php --ini in a terminal to see which files are used by PHP in CLI mode.
Alternatively, you can run Composer with --ignore-platform-req=ext-exif to temporarily ignore these required extensions.

Use the option --with-all-dependencies (-W) to allow upgrades, downgrades and removals for packages currently locked to specific versions.

[Feature Request] Composing Spiders

Hey,

First of all, thanks for this great package!

In the docs, there is an example of how to parse a set of articles from an overview page:

public function parse(Response $response): Generator
{
    $links = $response->filter('header + div a')->links();

    foreach ($links as $link) {
        yield $this->request('GET', $link->getUri(), 'parseBlogPage');
    }
}

public function parseBlogPage(Response $response): Generator
{
    $title = $response->filter('h1')->text();
    $publishDate = $response
        ->filter('time')
        ->attr('datetime');
    $excerpt = $response
        ->filter('.blog-content div > p:first-of-type')
        ->text();

    yield $this->item(compact('title', 'publishDate', 'excerpt'));
}

In a use case of mine, I would like to do something similar but split the parsing of the overview page and a specific blog page up into two separate Spiders. In the Spider that finds different articles, I would then like to delegate the parsing of a specific blog page to another Spider. For example, I'd like to do something like this:

class BlogOverviewSpider extends BasicSpider
{
    public function parse(Response $response): Generator
    {
        $pages = $response
            ->filter('main > div:first-child a')
            ->links();

        foreach ($pages as $page) {
            // Here the spider() method would use the parse result of a specific Spider class
            yield $this->spider(BlogPageSpider::class, overrides: new Overrides(startUrls: [$page->getUri()]));
        }
    }
}

class BlogPageSpider extends BasicSpider
{
    public function parse(Response $response): Generator
    {
        yield $this->item([/* ... */]);
    }
}

Here's a simplified example that's a bit more realistic and that demonstrates its usefulness.

Scraping metadata from different Git repositories
class RepositoryOverviewSpider extends BasicSpider
{
    public function parse(Response $response): Generator
    {
        $repositories = $response
            ->filter('main > div:first-child a')
            ->links();

        foreach ($repositories as $repository) {
            if ($this->isGithubRepository($repository->getUri())) {
                yield $this->spider(GithubRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            } elseif ($this->isGitlabRepository($repository->getUri())) {
                yield $this->spider(GitlabRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            } else {
                yield $this->spider(GenericRepositorySpider::class, overrides: new Overrides(startUrls: [$repository->getUri()]));
            }
        }
    }
}

Here, each repository Spider could define its own authentication scheme and its own specific parsing method.

I could not find any way of using the result of another Spider in the docs. Most of the logic of starting a Spider seems to be locked behind a private API in the RoachPHP\Roach class.

Maybe I've missed something and you can already compose Spiders in some way. If not, I think it could be a great feature.

If you also see the merit in this, I could try taking a stab at implementing this myself.

Custom Object in pipeline

How do I pass a custom object from one method to another?

For example, in the first method (parse) I grab a collection of items, and in a foreach loop I add a new entry to the DB and create an Eloquent object. Next, for each element, I would like to issue another request and parse it in another method, but I need to pass the object to that parse method as well.
How can I do it?

public function parse(Response $response): Generator
{
    $linksObject = $response->filter('a.make');

    foreach ($linksObject as $obj) {
        $carBrand = Brand::firstOrCreate(['name' => trim($obj->textContent)]);
        $url = $obj->getAttribute('href');

        $this->request(
            'GET',
            self::BASE_URI.$url,
            'getModels',
        );
    }
}

public function getModels(Response $response, Brand $brand): \Generator
{
    $models = $response->filter('.model-selector-box');

    foreach ($models as $model) {
        $url = $model->getAttribute('href');
        $model = trim($model->getElementsByTagName('h2')[0]->nodeValue);
        $realUrl = self::BASE_URI.$url;

        yield $this->request(
            'GET',
            $realUrl,
            'getGeneration',
        );
    }
}
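(Two things worth noting, sketched below: the request created in parse() is never yielded, so it is silently discarded; and the engine only passes the Response to a parse method, so the extra Brand $brand parameter will not be filled in. One workaround that needs no extra API is to key the brand by the URL being requested and re-resolve it in getModels(); the models_url column is hypothetical.)

public function parse(Response $response): Generator
{
    foreach ($response->filter('a.make') as $obj) {
        $url = self::BASE_URI.$obj->getAttribute('href');

        // Store the URL on the row so the next parse method can find it again.
        Brand::updateOrCreate(
            ['name' => trim($obj->textContent)],
            ['models_url' => $url], // hypothetical column
        );

        yield $this->request('GET', $url, 'getModels');
    }
}

public function getModels(Response $response): \Generator
{
    // Re-resolve the brand from the URL of the request that produced
    // this response, instead of expecting it as a parameter.
    $brand = Brand::firstWhere('models_url', $response->getRequest()->getUri());

    // ...continue as before, now with $brand available.
}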

Resolve middleware via the container

I can't find where Middlewares are resolved. I would like to add a few parameters to the underlying Browsershot instance since the URL I am trying to parse does not render a few items until they get into the viewport.

        $this->app->bind(ExecuteJavascriptMiddleware::class, function () {
            return new ExecuteJavascriptMiddleware(
                app(LoggerInterface::class),
                function ($url) {
                    return Browsershot::url($url)->windowSize(1900, 50000)
                        ->setIncludePath('/Users/israel/.nvm/versions/node/v14.17.5/bin');
                }
            );
        });

It seems the ExecuteJavascriptMiddleware class is not resolved via the Laravel container, so I can't override it. Since this is a framework-agnostic package, how can I do this? The class is also final, so I can't extend the main ExecuteJavascriptMiddleware class and just update its behaviour, which would have been great.

Xpath not working

Hello,

I was trying to test your library but cannot get a basic XPath query working on Google (as an example).

    public function parse(Response $response): Generator
    {
        $html = $response->filterXpath('//div[contains(@id, "center_col")]')->each(function (Crawler $node) {
            return $node->text();
        });
        yield $this->item([
            'html' => $html,
        ]);
    }

It returns an empty array, which is strange because a matching element is present in the page.

Any idea why it is not working please?

Thank you.

Scraping multiple elements within a page and relative queries is not available

Hi! Right now the package allows scraping a list of single pages.

But what if we need to repeatedly select a list of items within a page and do an additional filter on each of them? For example:

public function parse(Response $response): Generator
{
  $data = $response->filter('.preview');
  
  foreach ($data as $item) {  
      $cover = $item->filter('img')->attr('src');
      $number = $item->filter('span:nth-of-type(1)')->text();
      $published = $item->filter('span.published')->text();
      
      yield $this->item(compact('cover', 'number', 'published'));
  } 
}

Here I want to select a list of items on every page by the general .preview filter, and then apply an additional filter to each $item to get the data I want.

At the moment, as I understand it, the package does not support such functionality, although in Scrapy this is possible and is a powerful tool for working with page data.
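(For what it's worth, Symfony's DomCrawler already supports relative queries through each(), which hands every matched node to the callback as its own Crawler; the pagination example earlier in these issues uses the same pattern. A sketch:)

use Symfony\Component\DomCrawler\Crawler;

public function parse(Response $response): \Generator
{
    // each() yields a Crawler per matched node, so relative filters
    // like $item->filter('img') work per item.
    $items = $response->filter('.preview')->each(static function (Crawler $item) {
        return [
            'cover' => $item->filter('img')->attr('src'),
            'number' => $item->filter('span:nth-of-type(1)')->text(),
            'published' => $item->filter('span.published')->text(),
        ];
    });

    foreach ($items as $item) {
        yield $this->item($item);
    }
}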

spatie/robots-txt overwrites default Laravel robots.txt

I'm not sure if this issue is supposed to be reported here or in the https://github.com/spatie/robots-txt package, but since I'm using this package and it depends on spatie/robots-txt, I will write it here.
I just discovered that all the URLs on my website are not indexed and are blocked by robots.txt. After digging in, the only thing I found that could be overwriting Laravel's default robots.txt is spatie/robots-txt. I'm not using spatie/robots-txt directly; I'm just using https://github.com/roach-php.
Any help or confirmation of this issue would be appreciated.

item->get('key'); does not work as expected

After running the spider like this:

$items = Roach::collectSpider(SpainOnAForkSpider::class);

I do

foreach ($items as $item) {
    debug($item->get('json'));
}

the result is:

null

and a dump of $items looks like this:

^ array:1 [▼
  0 => RoachPHP\ItemPipeline\Item {#524 ▼
    -data: array:2 [▼
      0 => "json"
      1 => "{"@context":"https://schema.org","@graph":[{"@type":"Article","@id":"https://spainonafork.com/authentic-spanish-seafood-paella-recipe/#article","isPartOf":{"@id ▶"
    ]
    -dropReason: ""
    -dropped: false
  }
]
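(The dump offers a clue: the item's data is an indexed list (0 => "json", 1 => "{...}"), which is what yielding $this->item(['json', $json]) would produce. With an associative array, get('json') should return the value. A sketch of the likely fix in the spider:)

// Yield the data keyed by name, not as a plain list:
yield $this->item(['json' => $json]);

// Afterwards, $item->get('json') returns the JSON string instead of null.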

ExecuteJavascriptMiddleware not waiting long enough - option request for wait until network idle

I'm finding the ExecuteJavascript middleware extremely useful, but sometimes it doesn't wait long enough, and activity is still happening in the DOM.

I ended up having to copy over the middleware and chain ->waitUntilNetworkIdle() onto the Browsershot instance in the constructor in order to get the markup from the DOM; otherwise lots of critical data doesn't get rendered.

I might have missed this as an option somewhere; there is likely a better way to expose this as an option rather than an entirely separate middleware file.

<?php

declare(strict_types=1);

/**
 * Copyright (c) 2022 Kai Sassnowski
 *
 * For the full copyright and license information, please view
 * the LICENSE file that was distributed with this source code.
 *
 * @see https://github.com/roach-php/roach
 */

namespace App\Roach\Middleware;

use Psr\Log\LoggerInterface;
use RoachPHP\Http\Response;
use RoachPHP\Downloader\Middleware\ResponseMiddlewareInterface;
use RoachPHP\Support\Configurable;
use Spatie\Browsershot\Browsershot;
use Throwable;

final class ExecuteJavascriptNetworkIdleMiddleware implements ResponseMiddlewareInterface
{
    use Configurable;

    /**
     * @var callable(string): Browsershot
     */
    private $getBrowsershot;

    /**
     * @param null|callable(string): Browsershot $getBrowsershot
     */
    public function __construct(
        private LoggerInterface $logger,
        ?callable $getBrowsershot = null,
    ) {
        $this->getBrowsershot = $getBrowsershot ?: static fn (string $uri): Browsershot => Browsershot::url($uri)->waitUntilNetworkIdle();
    }

    public function handleResponse(Response $response): Response
    {
        $browsershot = $this->configureBrowsershot(
            $response->getRequest()->getUri(),
        );

        try {
            $body = $browsershot->bodyHtml();
        } catch (Throwable $e) {
            $this->logger->info('[ExecuteJavascriptMiddleware] Error while executing javascript', [
                'message' => $e->getMessage(),
                'trace' => $e->getTraceAsString(),
            ]);

            return $response->drop('Error while executing javascript');
        }

        return $response->withBody($body);
    }

    /**
     * @psalm-suppress MixedArgument, MixedAssignment
     */
    private function configureBrowsershot(string $uri): Browsershot
    {
        $browsershot = ($this->getBrowsershot)($uri);

        if (!empty($this->option('chromiumArguments'))) {
            $browsershot->addChromiumArguments($this->option('chromiumArguments'));
        }

        if (null !== ($chromePath = $this->option('chromePath'))) {
            $browsershot->setChromePath($chromePath);
        }

        if (null !== ($binPath = $this->option('binPath'))) {
            $browsershot->setBinPath($binPath);
        }

        if (null !== ($nodeModulePath = $this->option('nodeModulePath'))) {
            $browsershot->setNodeModulePath($nodeModulePath);
        }

        if (null !== ($includePath = $this->option('includePath'))) {
            $browsershot->setIncludePath($includePath);
        }

        if (null !== ($nodeBinary = $this->option('nodeBinary'))) {
            $browsershot->setNodeBinary($nodeBinary);
        }

        if (null !== ($npmBinary = $this->option('npmBinary'))) {
            $browsershot->setNpmBinary($npmBinary);
        }

        return $browsershot;
    }

    private function defaultOptions(): array
    {
        return [
            'chromiumArguments' => [],
            'chromePath' => null,
            'binPath' => null,
            'nodeModulePath' => null,
            'includePath' => null,
            'nodeBinary' => null,
            'npmBinary' => null,
        ];
    }
}

Thank you!

Duplicate requests being dispatched even with RequestDeduplicationMiddleware in place

I have a list of URLs in the database and am scraping specific information from these URLs.
I have split the URLs into batches of 50 and dispatch a job with the database offset to start from.

Each job gets its 50 URLs from the database, and the spider starts sending requests: 2 concurrent requests with a 1-second delay.
At some point it starts sending duplicate requests, as can be seen below, and the deduplication middleware doesn't report/drop these requests. Not sure what's going on here. Any thoughts?

[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
