xuxueli / xxl-crawler
A distributed web crawler framework. (分布式爬虫框架 XXL-CRAWLER)
Home Page: http://www.xuxueli.com/xxl-crawler
License: Apache License 2.0
The current selection of a dropdown select box cannot be read, although it can be retrieved in the browser Console.
Using reflection directly is probably not a good approach!
After using SeleniumPhantomjsPageLoader, the baseUri of the Document object parsed by Jsoup is empty.
Something like the following, or a node generated by JS.
The Document and Element objects can be obtained elsewhere; invoking them through an anonymous function is not very convenient.
Can PageFieldSelect use complex types? I would like to crawl 1-to-N data structures.
Implement simple login authorization with just a username and password, then obtain a token or session to crawl other pages.
Jsoup limits page content to 1 MB by default; setting the limit to 0 makes it unlimited.
Reference documents:
http://chenrongrong.info/2014/11/16/Jsoup-skill.html
http://leobluewing.iteye.com/blog/1997906
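The 1 MB limit mentioned above can be lifted through Jsoup's maxBodySize setting; a minimal sketch, assuming a plain Jsoup fetch of the project home page (network access required):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FullPageFetch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.xuxueli.com/xxl-crawler")
                .maxBodySize(0)   // 0 disables the default 1 MB body-size limit
                .timeout(10_000)  // 10 s connect/read timeout
                .get();
        System.out.println(doc.title());
    }
}
```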
With site-wide spreading enabled, the URLs selected in the JsoupUtil.findLinks() method are incomplete: the href obtained from the tags is a relative path, not an absolute one. The values obtained with all three methods below are relative paths, so URL validation fails and the spread crawl fails. Has anyone run into this situation?
tips: collect data with JS rendering, i.e. the "selenium + phantomjs" approach
The crawled URL is http://www.bootcss.com/
How can a connect timeout for a particular URL be detected and handled, or the URL re-added to the pending crawl queue?
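For the relative-path problem above, Jsoup can resolve hrefs against a baseUri when one is supplied at parse time; a small sketch (the sample HTML is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinks {
    public static void main(String[] args) {
        // supply the page URL as baseUri so relative hrefs can be resolved
        String html = "<a href='/javascript/'>JS</a>";
        Document doc = Jsoup.parse(html, "http://www.bootcss.com/");
        Element link = doc.select("a").first();
        System.out.println(link.attr("href"));   // the relative path as written in the tag
        System.out.println(link.absUrl("href")); // the resolved absolute URL
    }
}
```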
// ------- pagevo ----------
if (!crawler.getRunConf().validWhiteUrl(link)) { // limit unvalid-page parse, only allow spread child
return false;
}
This block returns false; if the user has configured a retry count, that causes meaningless retries. It should return true here.
How to implement crawling for POST requests where the same URL returns different result pages depending on the parameters.
Could the class that parses page results return the current crawler object? Then, after one page has been processed, new URLs could be added to the crawler's URL queue. That would improve on the current behavior, where URLs can only be added when the crawler is initialized (or crawled via coarse site-wide spreading).
Suggestion: consider the regex selector first.
As the title says.
LocalRunData uses a LinkedBlockingQueue to record the URLs to crawl. Since this queue is already thread-safe, is the volatile keyword still needed?
Hello, I could not find a POST request template among the test cases.
Here is my calling code:
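On the volatile question: volatile would only guard the field reference itself. If the LinkedBlockingQueue field is final and never reassigned, the queue's internal lock already gives every consumer a happens-before view of each producer's offer, so volatile adds nothing. A minimal stdlib sketch (class name and URL are made up):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class UrlPoolDemo {
    // final is sufficient: the queue's internal ReentrantLock publishes
    // elements safely across threads without volatile on the field
    private final LinkedBlockingQueue<String> urlPool = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        UrlPoolDemo demo = new UrlPoolDemo();
        Thread producer = new Thread(() -> demo.urlPool.offer("http://example.com/page/1"));
        producer.start();
        producer.join();
        // take() sees the element published by the producer thread
        System.out.println(demo.urlPool.take());
    }
}
```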
Map<String, String> dataMap = new HashMap<>();
dataMap.put("category", "**");
dataMap.put("currentPage", "1");
dataMap.put("pageSize", "30");

Map<String, String> headerMap = new HashMap<>();
headerMap.put("Accept-Encoding", "gzip");
headerMap.put("Content-Type", "application/json;charset=UTF-8");
headerMap.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");

XxlCrawler xxlCrawler = new XxlCrawler.Builder()
        .setUrls(url)
        .setAllowSpread(false)
        .setIfPost(true)
        .setHeaderMap(headerMap)
        .setParamMap(dataMap)
        .setPageParser(new PageParser() {
            @Override
            public void parse(Document html, Element pageVoElement, Object pageVo) {
                XxlJobLogger.log("html:{}", html);
            }
        })
        .build();
xxlCrawler.start(true);
return SUCCESS;
This is the error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400
In the JsoupUtil utility class, the loadPageSource() method never calls requestBody on the Connection. Some APIs only accept parameters via Connection.requestBody(), and in that case no data can be fetched.
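For APIs that only accept a raw body, Jsoup's Connection does expose requestBody(); a hedged sketch of how such a POST could be issued (the URL and JSON payload are placeholders):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RawBodyPost {
    public static void main(String[] args) throws Exception {
        String json = "{\"currentPage\":1,\"pageSize\":30}"; // placeholder payload
        Connection.Response resp = Jsoup.connect("http://example.com/api/list") // placeholder URL
                .header("Content-Type", "application/json;charset=UTF-8")
                .requestBody(json)          // raw request body instead of form params
                .ignoreContentType(true)    // accept non-HTML responses such as JSON
                .method(Connection.Method.POST)
                .execute();
        System.out.println(resp.body());
    }
}
```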
In a multithreaded run, tryFinish() can misjudge the running state of the CrawlerThreads and stop the crawler prematurely. This was observed by running XxlCrawlerTest with 3 threads and printing logs:
The probability is small, about once in 10 runs. A likely cause: thread-3 calls tryFinish() and reads isRunning == false for all 3 CrawlerThreads; just at that moment, thread-1 calls crawler.getRunData().getUrl() and sets running to true (which thread-3 can no longer observe); finally thread-3 finds runData.getUrlNum()==0 to be true, so isEnd becomes true, and the crawler is stopped by mistake.
Suggested fix: in tryFinish(), first check runData.getUrlNum()==0, then inspect each CrawlerThread's state one by one, so that a thread entering crawler.getRunData().getUrl() is not missed because its latest running state was read too early:

public void tryFinish() {
    boolean isEnd = runData.getUrlNum() == 0;
    boolean isRunning = false;
    for (CrawlerThread crawlerThread : crawlerThreads) {
        if (crawlerThread.isRunning()) {
            isRunning = true;
            break;
        }
    }
    isEnd = isEnd && !isRunning;
    if (isEnd) {
        logger.info(">>>>>>>>>>> xxl crawler is finished.");
        stop();
    }
}

Also declare the flag as volatile so reads always see the latest write:

private volatile boolean running;
Shouldn't the following code in com.xuxueli.crawler.thread.CrawlerThread#processPage return false instead?
if (!crawler.getRunConf().validWhiteUrl(pageRequest.getUrl())) { // limit unvalid-page parse, only allow spread child, finish here
return true;
}
May I ask whether there is a feature for crawling content after logging in?
setWhiteUrlRegexs("https://www.kuaidaili.com/free/inha/\\b[1-2]/")
For example, this pattern does not match both URLs; whiteUrlRegexs.length is 1.
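A single pattern can in fact cover both pages, since [1-2] is a character class matching either digit; a quick stdlib check of the regex from the issue:

```java
import java.util.regex.Pattern;

public class WhiteUrlRegexCheck {
    public static void main(String[] args) {
        // the single regex passed to setWhiteUrlRegexs in the issue
        String regex = "https://www.kuaidaili.com/free/inha/\\b[1-2]/";
        System.out.println(Pattern.matches(regex, "https://www.kuaidaili.com/free/inha/1/")); // true
        System.out.println(Pattern.matches(regex, "https://www.kuaidaili.com/free/inha/2/")); // true
        System.out.println(Pattern.matches(regex, "https://www.kuaidaili.com/free/inha/3/")); // false
    }
}
```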