
xxl-crawler's Issues

Whole-site spread crawling does not work correctly

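I enabled whole-site spread crawling, but JsoupUtil.findLinks() does not collect all of the URLs: the href values read from the tags are relative paths, not absolute paths. All three calls below return relative paths, so URL validation fails and the spread crawl fails. Has anyone run into this?
Tip: the data is collected with JS rendering, using the "selenium + phantomjs" approach.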

  1. item.absUrl("abs:href");
  2. item.attr("abs:href");
  3. item.attr("href");

The URL being crawled is http://www.bootcss.com/
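
For context: jsoup resolves `attr("abs:href")` and `absUrl("href")` against the document's base URI, and `absUrl` expects the plain attribute name (`"href"`), not `"abs:href"`. If the rendered page source is parsed from a string without a base URI (an assumption about the selenium + phantomjs path, not confirmed by this report), jsoup has nothing to resolve against. A minimal fallback sketch using only `java.net.URI` is shown below; the class name `LinkResolver`, the method `toAbsolute`, and the sample path `/p/bootstrap/` are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of xxl-crawler or jsoup.
class LinkResolver {

    /**
     * Resolves a possibly relative href against the URL of the page it was found on.
     * Returns an empty string when the href is blank or cannot be parsed.
     * The page URL should carry a path (for example a trailing slash).
     */
    static String toAbsolute(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return "";
        }
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            // Malformed page URL or href (for example, one containing spaces).
            return "";
        }
    }

    public static void main(String[] args) {
        String page = "http://www.bootcss.com/";
        System.out.println(toAbsolute(page, "/p/bootstrap/"));       // root-relative link
        System.out.println(toAbsolute(page, "css/site.css"));        // path-relative link
        System.out.println(toAbsolute(page, "https://example.com/")); // already absolute
    }
}
```

Links resolved this way can then be passed to the usual URL validation before being queued.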

Thread-safety question

LocalRunData uses a LinkedBlockingQueue to record the URLs that still need to be crawled. That queue is already thread-safe, so is the volatile keyword still needed?
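
As a general Java note: `volatile` on a field only affects the visibility of the reference itself, not the queue's contents, whose thread safety comes from `LinkedBlockingQueue`. If the reference is assigned once and never replaced, declaring it `final` is usually enough. The sketch below illustrates this; the class `UrlQueueSketch` and its field name are illustrative and are not copied from LocalRunData.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative stand-in for a run-data holder; not the real LocalRunData.
class UrlQueueSketch {

    // final: the reference is published safely at construction and never reassigned,
    // so volatile adds nothing here. The queue handles concurrent access itself.
    private final LinkedBlockingQueue<String> unVisitedUrlQueue = new LinkedBlockingQueue<>();

    void addUrl(String url) {
        unVisitedUrlQueue.add(url);
    }

    int size() {
        return unVisitedUrlQueue.size();
    }

    /** Starts several producer threads and returns the queue size after all have finished. */
    static int produceConcurrently(int threads, int perThread) {
        UrlQueueSketch data = new UrlQueueSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    data.addUrl("http://example.com/" + id + "/" + i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return data.size();
    }
}
```

`volatile` would matter only if the field were reassigned to a different queue while other threads were reading it.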

POST request returns 400

Hello, I could not find an example of a POST request in the test cases.

This is my calling code:
`Map<String,String> dataMap = new HashMap<>();
dataMap.put("category","**");
dataMap.put("currentPage","1");
dataMap.put("pageSize","30");

Map<String,String> headerMap = new HashMap<>();
headerMap.put("Accept-Encoding","gzip");
headerMap.put("Content-Type","application/json;charset=UTF-8");
headerMap.put("User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");

XxlCrawler xxlCrawler = new XxlCrawler.Builder()
        .setUrls(url)
        .setAllowSpread(false)
        .setIfPost(true)
        .setHeaderMap(headerMap)
        .setParamMap(dataMap)
        .setPageParser(new PageParser() {
            @Override
            public void parse(Document html, Element pageVoElement, Object pageVo) {
                XxlJobLogger.log("html:{}",html);
            }
        })
        .build();
xxlCrawler.start(true);
return SUCCESS;`

This is the error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=400
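
One plausible cause, offered as an assumption rather than a confirmed diagnosis: the request declares `Content-Type: application/json`, but if the parameter map is passed to jsoup as form data, the body goes out form-encoded, and a server expecting JSON may reject it with 400. The sketch below only shows how differently the same parameters look in the two encodings. The class `PostBodySketch` and its methods are hypothetical, and the JSON escaping is deliberately minimal (quotes and backslashes only). It requires Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for comparing request bodies; not part of xxl-crawler or jsoup.
class PostBodySketch {

    /** Encodes the parameters as application/x-www-form-urlencoded. */
    static String toFormBody(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Encodes the parameters as a flat JSON object with string values. */
    static String toJsonBody(Map<String, String> params) {
        StringJoiner body = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add("\"" + escape(entry.getKey()) + "\":\"" + escape(entry.getValue()) + "\"");
        }
        return body.toString();
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("currentPage", "1");
        params.put("pageSize", "30");
        System.out.println(toFormBody(params));
        System.out.println(toJsonBody(params));
    }
}
```

If the endpoint accepts form posts, dropping the JSON Content-Type header may be enough; if it requires JSON, the body itself has to be sent as JSON.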

[issue] With multiple threads, tryFinish() occasionally misjudges the current running state

  • issue description

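With multiple threads, tryFinish() can misjudge the running state of the CrawlerThreads and stop the crawler early. Below is the log from running XxlCrawlerTest with 3 threads:
[image: log output]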

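The probability is low, roughly one run in ten. The likely cause is as follows:
thread-3 calls tryFinish() and reads isRunning as false for all 3 CrawlerThreads. At exactly that moment, thread-1 calls crawler.getRunData().getUrl() and sets running to true, which thread-3 can no longer see. thread-3 then evaluates runData.getUrlNum()==0 as true, so isEnd becomes true and the crawler is wrongly judged to be finished:
[image]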

  • solution
  1. Rewrite tryFinish() to check runData.getUrlNum()==0 first and only then read the state of each CrawlerThread, so that a call to crawler.getRunData().getUrl() in between cannot leave the running state stale:
public void tryFinish(){
    boolean isEnd = runData.getUrlNum()==0;
    boolean isRunning = false;
    for (CrawlerThread crawlerThread: crawlerThreads) {
        if (crawlerThread.isRunning()) {
            isRunning = true;
            break;
        }
    }
    isEnd = isEnd && !isRunning;
    if (isEnd) {
        logger.info(">>>>>>>>>>> xxl crawler is finished.");
        stop();
    }
}
  2. Add the volatile keyword to the running field of CrawlerThread to guarantee visibility:
private volatile boolean running;
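
Reordering the checks narrows the window, but a gap may remain between a thread taking a URL from the queue and setting running to true. A different technique, not the one proposed in this issue, is to keep a single counter of URLs that are either queued or being processed, so that moving a URL from "queued" to "in progress" never changes the value being tested. The sketch below assumes this design; the class `PendingWorkTracker` and all of its methods are hypothetical.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical alternative to checking per-thread running flags.
class PendingWorkTracker {

    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Number of URLs that are queued OR currently being processed.
    private final AtomicInteger pending = new AtomicInteger();

    /** Counts the URL first, then makes it visible to workers. */
    void addUrl(String url) {
        pending.incrementAndGet();
        queue.add(url);
    }

    /** Takes a URL if one is available; the pending count is deliberately left unchanged. */
    String pollUrl() {
        return queue.poll();
    }

    /** Called by a worker after it has processed a URL and queued any child URLs. */
    void markDone() {
        pending.decrementAndGet();
    }

    /** True only when nothing is queued and nothing is in progress. */
    boolean isFinished() {
        return pending.get() == 0;
    }
}
```

A worker would call `pollUrl()`, process the page, add any discovered links with `addUrl()`, and only then call `markDone()`, so the counter cannot reach zero while work is still in flight.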

Question about com.xuxueli.crawler.thread.CrawlerThread#processPage

In com.xuxueli.crawler.thread.CrawlerThread#processPage, wouldn't it be more appropriate for the following code to return false?

if (!crawler.getRunConf().validWhiteUrl(pageRequest.getUrl())) {     // limit unvalid-page parse, only allow spread child, finish here
    return true;
}
