mmerian / phpcrawl
Copy of http://phpcrawl.cuab.de/ for use with Composer
License: GNU General Public License v2.0
...and when autoloading the classes the wrong one is loaded (the one missing some of the methods used by the script). PHPCrawlerUtils appears in libs/Utils (the right one) but also in libs/.
Framework: Laravel
PHPCrawl version: 0.83
Issue:
I'm trying to enable obeyRobotsTxt, but the wrong PHPCrawlerUtils class is loaded. obeyRobotsTxt calls PHPCrawlerRobotsTxtParser::parseRobotsTxt, which in turn calls PHPCrawlerUtils::getURIContent. That method is not found because the autoloader uses this class:
vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php // does not contain getURIContent
instead of this one, which it should use:
vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php // does contain getURIContent
error:
Call to undefined method PHPCrawlerUtils::getURIContent()
autoload warning:
Warning: Ambiguous class resolution, "PHPCrawlerUtils" was found in both "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php" and "/Users/macmini2/securityscan/vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php", the first will be used.
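A possible workaround, sketched under assumptions: PHP only asks the autoloader for a class that is not yet defined, so loading the intended file explicitly before the crawler starts should keep the ambiguous classmap entry from being used. The helper name preloadClass is hypothetical, the path is the one from the warning above, and the script is assumed to sit in the project root next to vendor/. If another file in the package includes the other copy directly, this would not help.

```php
<?php
/**
 * Loads a class file explicitly unless the class is already defined.
 * Returns true when the class exists afterwards. (Hypothetical helper.)
 */
function preloadClass($className, $file)
{
    // Second argument false: do not trigger the autoloader here,
    // because it would pick the ambiguous copy.
    if (!class_exists($className, false) && is_file($file)) {
        require_once $file;
    }
    return class_exists($className, false);
}

// Assumed layout: this script lives in the project root, next to vendor/.
$utilsFile = __DIR__ . '/vendor/mmerian/phpcrawl/libs/Utils/PHPCrawlerUtils.class.php';
if (!preloadClass('PHPCrawlerUtils', $utilsFile)) {
    echo "PHPCrawlerUtils could not be preloaded from $utilsFile\n";
}
```

Composer's exclude-from-classmap autoload setting may be another way to keep the unwanted copy out of the generated classmap; that has not been tried here.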
The following regexes in the prepareHTMLChunk function lead to a completely empty HTML source for many pages:
$html_source = preg_replace("#^(?:(?!<script).)*<\/script># Uis", "", $html_source);
$html_source = preg_replace("#<\!--.*(?:-->|$)# Uis", "", $html_source);
$html_source = preg_replace("#^(?:(?!<\!--).)*-->#Uis", "", $html_source);
My regex skills are not good enough to debug it.
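One thing worth checking, as an unconfirmed guess: preg_replace() returns null when PCRE hits an internal error (for example a backtrack or JIT stack limit on a large page), and assigning that null back to $html_source would empty it. The sketch below wraps the call so the original text is kept on failure and the error code from preg_last_error() is logged. The function name safePregReplace is hypothetical and not part of PHPCrawl.

```php
<?php
/**
 * preg_replace() wrapper that never loses the subject string.
 * On a PCRE error the original subject is returned unchanged.
 */
function safePregReplace($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // null signals a PCRE error; preg_last_error() says which one.
        error_log('preg_replace failed, preg_last_error() = ' . preg_last_error());
        return $subject;
    }
    return $result;
}

// Usage with the comment-stripping regex quoted above.
$html = "<p>kept</p><!-- dropped --><p>also kept</p>";
echo safePregReplace("#<\!--.*(?:-->|$)#Uis", "", $html), "\n";
// prints: <p>kept</p><p>also kept</p>
```

If the log shows an error code for the affected pages, the limits (pcre.backtrack_limit, pcre.jit) would be the next place to look; if it does not, the regexes themselves are removing the content.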
PHP Warning: Declaration of MyCrawler::handleDocumentInfo($DocInfo) should be compatible with PHPCrawler::handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) in /var/www/srclast/PHPCrawl/rsclast.class.php on line 10
Page requested: https://security.alibaba.com/top.htm?spm=0.0.0.0.gqgp1o&time= ()
Referer-page:
Content not received
Summary:
Links followed: 1
Documents received: 0
Bytes received: 0 bytes
Process runtime: 1.7769010066986 sec
root@ydxred:/var/www/srclast/PHPCrawl#
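Regarding the warning only (not the "Content not received" result): PHP versions before 7.2 report it when an overriding method drops the parameter type that the parent declares. Repeating the parent's type hint is accepted by all versions. The sketch below uses stand-in classes (FakeDocumentInfo and FakeCrawler, both hypothetical) so that it runs without the library; in real code the type would be PHPCrawlerDocumentInfo and the parent PHPCrawler.

```php
<?php
// Stand-in for PHPCrawlerDocumentInfo (hypothetical).
class FakeDocumentInfo
{
    public $url = '';
}

// Stand-in for PHPCrawler (hypothetical).
class FakeCrawler
{
    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
    }
}

class MyCrawler extends FakeCrawler
{
    // Same parameter type as the parent, so the declarations are compatible.
    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
        return $PageInfo->url;
    }
}

$info = new FakeDocumentInfo();
$info->url = 'http://example.com/';
$crawler = new MyCrawler();
echo $crawler->handleDocumentInfo($info), "\n";
```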
Hello
Is phpcrawl still being maintained by the author?
I have tried to reach him by e-mail about the class, but without success.
Hello
I'm using the following code:
class MyCrawler extends PHPCrawler {
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently
        // received document.
        // As example we just print out the URL of the document
        // weekly can be supplied as a parameter
        // return -1;
    }
}
$crawler = new MyCrawler();
$crawler->setURL('http://example.com');
$crawler->setWorkingDirectory("/dev/shm/");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();
How can I access the PHPCrawlerDocumentInfo->links_found after the crawl is complete?
Thanks in advance.
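One way to approach this, sketched under assumptions: handleDocumentInfo() is called once per received document, so the subclass can copy each document's links_found into a property of its own and read that property after go() returns. The snippet uses stand-in classes (FakeDocumentInfo and FakeCrawler, both hypothetical) so that it runs without the library; with PHPCrawl installed, the same property and override would go into the class extending PHPCrawler. The link entries here are plain strings for brevity, which is probably simpler than what the library actually stores.

```php
<?php
// Stand-in for PHPCrawlerDocumentInfo (hypothetical, only the fields used here).
class FakeDocumentInfo
{
    public $url = '';
    public $links_found = array();
}

// Stand-in for PHPCrawler (hypothetical): go() feeds canned documents to the handler.
class FakeCrawler
{
    protected $documents = array();

    public function addDocument(FakeDocumentInfo $doc)
    {
        $this->documents[] = $doc;
    }

    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
    }

    public function go()
    {
        foreach ($this->documents as $doc) {
            $this->handleDocumentInfo($doc);
        }
    }
}

class CollectingCrawler extends FakeCrawler
{
    // Links of every handled document, keyed by the document URL.
    public $linksByPage = array();

    public function handleDocumentInfo(FakeDocumentInfo $PageInfo)
    {
        $this->linksByPage[$PageInfo->url] = $PageInfo->links_found;
    }
}

$doc = new FakeDocumentInfo();
$doc->url = 'http://example.com/';
$doc->links_found = array('http://example.com/a', 'http://example.com/b');

$crawler = new CollectingCrawler();
$crawler->addDocument($doc);
$crawler->go();

// The collected links are still available after the crawl has finished.
print_r($crawler->linksByPage);
```

This relies on the handler and the reading code running in the same process. If the crawl were split across several processes, the links would presumably have to be written to a file or database instead.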
Does it work with HTML5?