
phpcrawl's Introduction

Now archived due to fundamental issues. Replaced by SuperSimpleCrawler

phpcrawl


composer require brittainmedia/phpcrawl

Example usage:

use PHPCrawl\Enums\PHPCrawlerAbortReasons;
use PHPCrawl\Enums\PHPCrawlerMultiProcessModes;
use PHPCrawl\Enums\PHPCrawlerUrlCacheTypes;
use PHPCrawl\PHPCrawler;
use PHPCrawl\PHPCrawlerDocumentInfo;

// New custom crawler
$crawler = new class() extends PHPCrawler {

    /**
     * @param PHPCrawlerDocumentInfo $PageInfo
     * @return int
     */
    public function handleDocumentInfo($PageInfo): int
    {
        // Print the URL of the document
        echo "URL: " . $PageInfo->url . PHP_EOL;

        // Print the http-status-code
        echo "HTTP-statuscode: " . $PageInfo->http_status_code . PHP_EOL;

        // Print the number of found links in this document
        echo "Links found: " . count($PageInfo->links_found_url_descriptors) . PHP_EOL;

        // ...

        // Continue crawling (a negative return value aborts the process)
        return 1;
    }
};

$crawler->setURL($url = 'https://bbc.co.uk/news');

// Optional
//$crawler->setProxy($proxy_host, $proxy_port, $proxy_username, $proxy_password);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule('#text/html#');

// Ignore links to ads...
$advertFilterRule = "/\bads\b|2o7|a1\.yimg|ad(brite|click|farm|revolver|server|tech|vert)|at(dmt|wola)|banner|bizrate|blogads|bluestreak|burstnet|casalemedia|coremetrics|(double|fast)click|falkag|(feedster|right)media|googlesyndication|hitbox|httpads|imiclk|intellitxt|js\.overture|kanoodle|kontera|mediaplex|nextag|pointroll|qksrv|speedera|statcounter|tribalfusion|webtrends/";
$crawler->addURLFilterRule($advertFilterRule);

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Limit the number of requests (here: just one)
$crawler->setRequestLimit(1);

/**
 * 3 - The crawler only follows links to pages or files located in or below the same path as the root-url.
 * E.g. if the root-url is
 * "http://www.foo.com/bar/index.html",
 * the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html",
 * but not links to "http://www.foo.com/page.html".
 *
 */
$crawler->setFollowMode(3);

// Follow redirects until actual content is received
$crawler->setFollowRedirectsTillContent(TRUE);

// tmp directory
$crawler->setWorkingDirectory(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'phpcrawl' . DIRECTORY_SEPARATOR);

// Cache
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY);

// File crawling - Store to file or set limit for large files
#$crawler->addStreamToFileContentType('##');
#$crawler->setContentSizeLimit(500000); // Google only crawls pages 500kb and below?

//Decides whether the crawler should obey "nofollow"-tags, we will obey
$crawler->obeyNoFollowTags(true);

//Decides whether the crawler should obey robots.txt, we will not obey!
$crawler->obeyRobotsTxt(false);

// Delay between requests (in seconds) to avoid being blocked
$crawler->setRequestDelay(0.5);

// Identify as a regular browser (or use a bot user-agent string instead)
$crawler->setUserAgentString('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0');

// Multiprocess (optional) - forces use of PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE and requires link priorities!
$crawler->addLinkPriority("/news/", 10);
$crawler->addLinkPriority("/\.jpeg/", 5);
// Use goMultiProcessed() *instead of* go() below:
//$crawler->goMultiProcessed(PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

echo 'Finished crawling site: ' . $url . PHP_EOL;
echo 'Summary:' . PHP_EOL;
echo 'Links followed: ' . $report->links_followed . PHP_EOL;
echo 'Documents received: ' . $report->files_received . PHP_EOL;
echo 'Bytes received: ' . $report->bytes_received . ' bytes' . PHP_EOL;
echo 'Process runtime: ' . $report->process_runtime . ' sec' . PHP_EOL;
echo 'Process memory: ' . $report->memory_peak_usage . ' bytes' . PHP_EOL;
echo 'Server connect time: ' . $report->avg_server_connect_time . ' sec' . PHP_EOL;
echo 'Server response time: ' . $report->avg_server_response_time . ' sec' . PHP_EOL;
echo 'Server transfer rate: ' . $report->avg_proc_data_transfer_rate . ' bytes/sec' . PHP_EOL;

$abortReason = $report->abort_reason;
switch ($abortReason) {
    case PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH:
        echo 'Crawling-process aborted because everything is done/passed through.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_TRAFFICLIMIT_REACHED:
        echo 'Crawling-process aborted because the traffic limit set by user was reached.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_FILELIMIT_REACHED:
        echo 'Crawling-process aborted because the file limit set by user was reached.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_USERABORT:
        echo 'Crawling-process aborted because the handleDocumentInfo-method returned a negative value.' . PHP_EOL;
        break;
    default:
        echo 'Unknown abort reason.' . PHP_EOL;
        break;
}

Initially a copy of http://phpcrawl.cuab.de/, forked from mmerian, for use with Composer.

Since the main project now appears to be abandoned (no updates for four years), any changes and fixes will be made in this repository.

Latest updates

  • 0.9 - PHP 7 only.
  • 0.10 - PHP 8 compatible (please submit issues).
  • Introduced namespaces
  • Lots of bug fixes
  • Refactored various class sections

Now archived...


phpcrawl's Issues

Method createPEMCertificate is in example.php but doesn't exist

Hi,

In your example I see a call to createPEMCertificate:

$crawler->createPEMCertificate($passPhrase, $certificateData);

But that method doesn't exist in the code.
How can I connect properly to an https site?
Because if I remove the call to createPEMCertificate I get this error with example.php:
Warning: stream_socket_client(): Peer certificate CN=`www.example.org' did not match expected CN=`12.123.123.123' in /vendor/brittainmedia/phpcrawl/libs/PHPCrawlerHTTPRequest.php on line 569
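
For reference, a generic plain-PHP sketch (not part of the phpcrawl API) of how peer verification can be controlled around a stream_socket_client() call like the one in PHPCrawlerHTTPRequest.php; disabling verification is insecure and only reasonable for testing:

// Hypothetical standalone example; phpcrawl may not expose a setting for this,
// so the library's own call would need a context argument like the one below.
$context = stream_context_create([
    'ssl' => [
        // Either tell PHP which certificate name to expect...
        'peer_name' => 'www.example.org',
        // ...or (testing only!) disable verification entirely:
        // 'verify_peer'      => false,
        // 'verify_peer_name' => false,
    ],
]);

$socket = stream_socket_client(
    'ssl://www.example.org:443',
    $errno,
    $errstr,
    30,
    STREAM_CLIENT_CONNECT,
    $context
);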

Typed property can be null

  • Lib version: 0.9.13
  • PHP: 7.4.22 (Ubuntu 20.04)

The property PHPCrawlerStatusHandler::$crawlerStatus is typed as PHPCrawlerStatus, but because getCrawlerStatus() never actually checks on the filesystem that the crawler-status file exists, it naively assumes the file is always present. When it is not, getCrawlerStatus() assigns the null returned by PHPCrawlerUtils::deserializeFromFile($this->working_directory . 'crawlerstatus.tmp') to the property, resulting in the following PHP error:

[Emergency] Uncaught TypeError: Typed property PHPCrawl\ProcessCommunication\PHPCrawlerStatusHandler::$crawlerStatus must be an instance of PHPCrawl\PHPCrawlerStatus, null used

Hacking the file and removing the typing "works", but doesn't explain why the file is missing in the first place. At a minimum, I'd expect the library to check that the file exists before using its presence or content as a signal to perform some other task.

Interestingly, despite the typing, the logic in getCrawlerStatus() still expects $this->crawlerStatus to be null in some circumstances.

The logic would then look something like the following:

    /**
     * Returns/reads the current crawler-status
     *
     * @return PHPCrawlerStatus The current crawlerstatus as a PHPCrawlerStatus-object
     * @throws \LogicException
     */
    public function getCrawlerStatus(): PHPCrawlerStatus
    {
        $crawlFile = sprintf('%s/crawlerstatus.tmp', $this->working_directory);

        if (!file_exists($crawlFile)) {
            throw new \LogicException('Crawler status file not found!');
        }

        // Get crawler-status from file
        if ($this->write_status_to_file) {
            $this->crawlerStatus = PHPCrawlerUtils::deserializeFromFile($crawlFile);
            if ($this->crawlerStatus == null) {
                $this->crawlerStatus = new PHPCrawlerStatus();
            }
        }

        return $this->crawlerStatus;
    }

The easiest fix however is simply to permit the property to be null:

class PHPCrawlerStatusHandler
{
    /**
     * @var PHPCrawlerStatus|null
     */
    protected ?PHPCrawlerStatus $crawlerStatus;
    // ...
}
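
One caveat with the nullable-property fix: a typed property that is never assigned is "uninitialized" in PHP 7.4+, and reading it throws its own error, so it may be worth giving it an explicit default; a minimal sketch:

class PHPCrawlerStatusHandler
{
    // The explicit null default avoids "must not be accessed before
    // initialization" if the property is read before it is ever assigned.
    protected ?PHPCrawlerStatus $crawlerStatus = null;
}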

Unexpected 'PHPCrawlerStatus'

I initially tried version 0.9.9, but the same error occurred there.
I installed it via Composer.
I think PHP hates me.

Run via OSPanel.
PHP Version 7.3.

My folder structure looks like this:

  • public (index.php)
  • tmp
  • vendor

The code is taken from example.php.

index.php:

require_once '../vendor/autoload.php';

use PHPCrawl\PHPCrawler;
use PHPCrawl\PHPCrawlerDocumentInfo;

/**
 * Class MyCrawler
 */
class MyCrawler extends PHPCrawler
{
    /**
     * @param PHPCrawlerDocumentInfo $DocInfo
     * @return int|void
     */
    public function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>")..
        if (PHP_SAPI === 'cli') {
            $lb = "\n";
        } else {
            $lb = "<br />";
        }


        // Print the URL and the HTTP-status-Code
        echo 'Page requested: ' . $DocInfo->url . ' (' . $DocInfo->http_status_code . ')' . $lb;
        // Print the referring URL
        echo 'Referer-page: ' . $DocInfo->referer_url . $lb;
        // Print whether the content of the document was received or not
        if ($DocInfo->received == true) {
            echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
        } else {
            echo "Content not received" . $lb;
        }

        echo 'Error: ' . var_export($DocInfo->error_string, TRUE);

        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example
        echo $lb;
        flush();
    }
}

$crawler = new MyCrawler();

$crawler->setURL('https://google.com/');
$crawler->enableCookieHandling(true);
$crawler->setTrafficLimit(1000 * 1024);
$crawler->setWorkingDirectory("../tmp/");
$crawler->go();

And I catch the error: Parse error: syntax error, unexpected 'PHPCrawlerStatus' (T_STRING), expecting function (T_FUNCTION) or const (T_CONST) in D:\Software\OSPanel\domains\web_scrapping\vendor\brittainmedia\phpcrawl\libs\ProcessCommunication\PHPCrawlerStatusHandler.php on line 18

What's the problem?
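
For what it's worth, this parse error is exactly what PHP before 7.4 emits for typed class properties, which were introduced in PHP 7.4, so the library appears to require at least PHP 7.4 here. A minimal sketch reproducing the error on PHP 7.3 (assuming PHPCrawlerStatusHandler.php line 18 declares such a property):

// On PHP <= 7.3 the property type below triggers:
// "syntax error, unexpected 'PHPCrawlerStatus' (T_STRING), expecting function (T_FUNCTION) or const (T_CONST)"
class PHPCrawlerStatusHandler
{
    protected PHPCrawlerStatus $crawlerStatus; // typed properties require PHP >= 7.4
}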

Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url

I'm crawling a site and there's an issue with one malformed a-href tag:
<a href="http://example “O” and 'I' Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">example.com</a>

This leads to this error:
Fatal error: Uncaught Exception: PHPCrawlerUtils::splitURL Failed to parse url: ... in ./vendor/brittainmedia/phpcrawl/libs/Utils/PHPCrawlerUtils.php on line 51

Is there a way to make this a non-fatal error, so I can just log it and continue processing?
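
One possible workaround (untested, and assuming the broken href always contains whitespace, which a valid absolute URL never does) would be to ignore such links via the addURLFilterRule() method shown in the README example; whether the filter is applied before the URL is parsed is an open question:

// Sketch: skip any discovered link that contains whitespace.
$crawler->addURLFilterRule('#\s#');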

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned

I get this error when I call $report = $crawler->getProcessReport();

Fatal error: Uncaught TypeError: Return value of PHPCrawl\PHPCrawlerBenchmark::getElapsedTime() must be of the type float or null, none returned in vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php:91
Stack trace:
#0 vendor/brittainmedia/phpcrawl/libs/PHPCrawler.php(958): PHPCrawl\PHPCrawlerBenchmark::getElapsedTime('crawling_proces...')
#1 /myScript.php(197): PHPCrawl\PHPCrawler->getProcessReport()
#2 {main}
  thrown in /var/hpwsites/u_ruiten/website/html/webroot/pub/crawl/vendor/brittainmedia/phpcrawl/libs/PHPCrawlerBenchmark.php on line 91

I'm using version 0.9.5.
What could be the cause of this?

$DocInfo->received returns false if the requested page returns a 301

I'm using this part of code:

public function handleDocumentInfo($DocInfo)
{
    // Print if the content of the document was received or not
    if ($DocInfo->received == true) {
        echo "Content received: " . $DocInfo->bytes_received . " bytes" . $lb;
    } else {
        echo "Content not received for url:" . $lb . $DocInfo->url . " (" . $DocInfo->http_status_code . ")" . " Referer-page: " . $DocInfo->referer_url;
        print_r($DocInfo);
    }
}

Then I see this in the output:
"Content not received for url: http://mysite.com/abcd.html (301) Referer-page: https://mysite.com/1234.html"
So I'm crawling an https site, but some internal URLs still point to the http site.
Due to the redirect, $DocInfo->received doesn't return true.

How can I fix this, so that a redirect is handled just like any other 'normal' page?
Is there a way to replace "http://" with "https://", or can I update the code somewhere so that it follows the 301 redirect and processes the page the redirect leads to?
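
One setting worth trying, shown in the README example above, is to have the crawler follow redirects until actual content is returned, so the target of the 301 should be received instead of the bare redirect response; a minimal sketch:

// Follow 3xx responses until real content is reached, instead of
// reporting the redirect itself as "not received".
$crawler->setFollowRedirectsTillContent(true);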

PHP 8 readResponseContentChunk Multi-process

https://github.com/crispy-computing-machine/phpcrawl/blob/master/libs/PHPCrawlerHTTPRequest.php#L852

Needs to be replaced with working code:

// Initialise the buffer up front so the non-chunked branch can append to it too
$response = '';
if (strpos($headers, 'Transfer-Encoding: chunked') !== false) {
    while (!feof($socket)) {
        // Read the chunk size (in hexadecimal)
        $chunkSizeHex = rtrim(fgets($socket));
        // Convert the chunk size to an integer
        $chunkSize = hexdec($chunkSizeHex);

        // If the chunk size is 0, it means we've reached the last chunk
        if ($chunkSize === 0) {
            break;
        }

        // Read the chunk data
        $chunkData = '';
        while ($chunkSize > 0) {
            $buffer = fread($socket, $chunkSize);
            $chunkData .= $buffer;
            $chunkSize -= strlen($buffer);
        }

        // Add the chunk data to the response
        $response .= $chunkData;

        // Read the trailing CRLF after the chunk data
        fgets($socket);
    }
} else {
    // If the response is not chunked, read the response normally
    while (!feof($socket)) {
        $response .= fread($socket, 128);
    }
}  

Notice: Uninitialized string offset: 0 in libs/Utils/PHPCrawlerUtils.php

Starting at line 291 of PHPCrawlerUtils.php, and then repeated on lines 324 and 327, references to $link[0] occur; e.g.:
elseif ($link[0] === '/')

These are triggering an uninitialized string offset notice. Removing the [0] index removes the notice and things seem to work fine, but I don't want to modify your library locally if I don't have to. Am I missing something?

BTW, this is occurring just running your example.php file.
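
A defensive guard (a sketch, not necessarily the fix the library should adopt) would be to test for an empty string before indexing, since reading $link[0] on an empty string is exactly what raises the notice:

// Only inspect the first character when the link is non-empty.
if ($link !== '' && $link[0] === '/') {
    // ... handle a root-relative link
}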

problem

Got this error when running example.php:

Fatal error: Uncaught Error: Class 'PHPCrawl\PHPCrawlerHTTPRequest' not found in C:\xampp\htdocs\bzdemo\libs\PHPCrawler.php on line 245

Subprocesses not working in Docker containers...

It doesn't seem like the application ever completes when running more than one process.
I'm not sure if the processes are running; MPMODE_PARENT_EXECUTES_USERCODE doesn't seem to be working either, as I don't see console log output for the main process.

When I override the initChildProcess() function and print_r $this, I can see the crawler object for the child process.

But I am not sure the child process is doing anything or talking back to the main process.

Basically, if I run with ->go(), the whole scan is done fairly quickly. If I run with goMultiProcessed(), nothing appears to happen and the process never finishes.
