
cssci's Introduction

cssci

Crawls data from CSSCI (the Chinese Social Sciences Citation Index).

  1. Fixed the reference-delimiter problem and changed the delimiter from a semicolon to a Tab.
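The delimiter change can be illustrated with a small sketch. It assumes (the source does not say so) that semicolons can occur inside a single reference, which makes them ambiguous as separators; the function names `join_references` and `split_references` are hypothetical and not taken from the project.

```python
def join_references(references):
    """Join reference strings into one field, separated by Tab characters."""
    # Collapse any whitespace (including stray tabs or newlines) inside each
    # reference so that a Tab only ever appears as the separator.
    cleaned = [" ".join(ref.split()) for ref in references]
    return "\t".join(cleaned)


def split_references(field):
    """Split a Tab-delimited field back into individual references."""
    return field.split("\t") if field else []
```

With a Tab as the separator, a reference such as `Zhang S; Li W. Title A. 2010` survives a round trip intact, whereas a semicolon delimiter would split it in two.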

=========================

CSSCI Crawler

  1. Built with Python3 + BeautifulSoup4 + urllib/Selenium3 + PhantomJS + SQLite

  2. Supports distributed, silent crawling

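Crawler roles:

Intelligence officer: gathers intelligence;

Dispatcher: schedules tasks;

Engineer: executes tasks and updates the intelligence;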

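  1. Intelligence gathering

Periodically refresh the list of articles matching each search keyword, using the Document ID as the globally unique identifier.

Storage: the fields are RID (global Record ID, used to merge different data tables), DocID, url, Status (whether the article has already been stored), and Last Update (time of the most recent update).

The status is maintained by the engineers.

Update: using the ID as the key, update the url but not the stored content (the citation count does need to be updated).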
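The storage and update rules above can be sketched with the standard `sqlite3` module. This is a minimal illustration, not the project's actual schema: the table name `intelligence`, the column names, the `cited_count` column, and the helpers `open_db` and `record_article` are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema based on the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS intelligence (
    rid         INTEGER PRIMARY KEY AUTOINCREMENT,  -- RID: global Record ID
    doc_id      TEXT NOT NULL UNIQUE,               -- DocID: globally unique identifier
    url         TEXT NOT NULL,
    status      INTEGER NOT NULL DEFAULT 0,         -- 0 = not stored, 1 = stored
    cited_count INTEGER NOT NULL DEFAULT 0,         -- assumed column for the citation count
    last_update TEXT NOT NULL                       -- ISO 8601 timestamp
)
"""


def open_db(path=":memory:"):
    """Open the database and create the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_article(conn, doc_id, url, cited_count=0):
    """Insert a new article, or refresh url and citation count of a known one.

    The status column is deliberately left alone: it belongs to the engineers.
    """
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commit both statements together
        conn.execute(
            "INSERT OR IGNORE INTO intelligence (doc_id, url, cited_count, last_update) "
            "VALUES (?, ?, ?, ?)",
            (doc_id, url, cited_count, now),
        )
        conn.execute(
            "UPDATE intelligence SET url = ?, cited_count = ?, last_update = ? "
            "WHERE doc_id = ?",
            (url, cited_count, now, doc_id),
        )
```

The insert-then-update pair is used here instead of an `ON CONFLICT` clause so the sketch also runs on older SQLite builds.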

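  2. Task scheduling

Based on the current article list and each article's status, generate task lists of roughly 50-100 articles each.

Query the number of engineers currently available and assign the task lists to them.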
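A minimal sketch of the scheduling step, assuming the pending Document IDs are already available as a list and that engineers are identified by simple names. The functions `build_task_lists` and `assign_tasks` and the round-robin policy are illustrative choices, not the project's documented behavior.

```python
def build_task_lists(pending_doc_ids, batch_size=100):
    """Split pending Document IDs into task lists of at most batch_size items."""
    return [
        pending_doc_ids[i:i + batch_size]
        for i in range(0, len(pending_doc_ids), batch_size)
    ]


def assign_tasks(task_lists, engineers):
    """Distribute task lists over the available engineers in round-robin order."""
    if not engineers:
        raise ValueError("no engineers available")
    assignments = {name: [] for name in engineers}
    for index, task_list in enumerate(task_lists):
        assignments[engineers[index % len(engineers)]].append(task_list)
    return assignments
```

Note that the final task list may hold fewer than 50 articles when the number of pending articles is not a multiple of `batch_size`.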

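  3. Task execution

Receive a task list and open each URL to scrape a fixed set of content; where possible, detect changes in the page elements and raise an alert. Add the scraped content to the database. Once the whole task list is finished, update the intelligence status (stored or not stored) in a single batch.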
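The engineer's loop might look roughly like the sketch below. It is written under several assumptions: the table layouts in `ensure_tables`, the `parse` callable (which would wrap BeautifulSoup in practice), and the `required_fields` check used as a crude way to notice page-layout changes are all hypothetical.

```python
import sqlite3
import urllib.request


def ensure_tables(conn):
    """Create minimal versions of the two tables this sketch relies on."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS intelligence ("
        "doc_id TEXT PRIMARY KEY, url TEXT, status INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (doc_id TEXT PRIMARY KEY, title TEXT)"
    )


def fetch_html(url, timeout=30):
    """Download a page with urllib and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_task_list(conn, task_list, parse, fetch=fetch_html, required_fields=("title",)):
    """Scrape every (doc_id, url) pair, then mark the stored ones in one batch.

    Returns (stored_doc_ids, warnings).
    """
    stored, warnings = [], []
    for doc_id, url in task_list:
        try:
            record = parse(fetch(url))
        except Exception as exc:  # network or parsing failure: skip this article
            warnings.append((doc_id, "fetch/parse failed: %s" % exc))
            continue
        # An empty required field suggests that the page layout has changed.
        missing = [name for name in required_fields if not record.get(name)]
        if missing:
            warnings.append((doc_id, "missing fields: %s" % ", ".join(missing)))
            continue
        conn.execute(
            "INSERT OR REPLACE INTO articles (doc_id, title) VALUES (?, ?)",
            (doc_id, record["title"]),
        )
        stored.append(doc_id)
    with conn:  # single batch update of the status once the list is done
        conn.executemany(
            "UPDATE intelligence SET status = 1 WHERE doc_id = ?",
            [(doc_id,) for doc_id in stored],
        )
    return stored, warnings
```

Passing `fetch` and `parse` as parameters keeps the loop testable without network access and lets the same loop be reused with a Selenium-based fetcher.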

cssci's People

Contributors

jiudong90
