
wanfangdata's Introduction

WanFangData

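Python 3 compatibility branch

The syntax has been modified so that the code runs on Python 3, but the main code structure and content are unchanged. The original proxy component has been replaced with FP-Server, which I wrote recently.

To crawl content successfully, you need to modify the spider code by hand to match the site's current structure.

Parts whose compatibility changes are complete: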

  • WFSpider

Environment setup:

  • Install the MongoDB database
  • Install the Python dependencies (Windows may require additional packages)
    pip install -r requirements.txt
    
  • Install and run FP-Server

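Crawler

☝️ This crawler targets only the "Journals" module of the Wanfang Data Knowledge Service Platform website. To crawl other modules, WFbase and WFindex need some modification.

The crawler is built on Scrapy and MongoDB.

The spiders directory contains 5 spiders. Run them in this order:

  • WFbase crawls the top-level journal categories

  • WFindex crawls the journal index, i.e. the second-level categories (around three thousand)

  • WFcore crawls the article pages (for my own needs only 2017 was crawled; as of June that came to about 230,000 records, and a conservative estimate puts 2016 at over one million)

The other two spiders, the patch ones, are corrective patches that have already been merged into WFcore and can be ignored.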
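The run order above can be sketched as a small driver script. This is only an illustration: it assumes the spider names in the list are also the names registered with `scrapy crawl`, and the helpers `build_commands` and `run_all` are hypothetical, not part of the project.

```python
import subprocess
import sys

# Spider names taken from the list above. Using them as `scrapy crawl`
# arguments is an assumption about how the project registers them.
SPIDER_ORDER = ["WFbase", "WFindex", "WFcore"]


def build_commands(spiders=SPIDER_ORDER):
    """Return one `scrapy crawl` command per spider, in run order."""
    return [[sys.executable, "-m", "scrapy", "crawl", name] for name in spiders]


def run_all(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# To start the crawl, call this from the Scrapy project root:
# run_all(build_commands())
```

Running the spiders sequentially matters here because each stage presumably reads what the previous one stored in MongoDB.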

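About the settings:

  • TARGET_YEAR = ['2017'] The target years (issues are filtered by journal publication date); the crawl runs in reverse order, starting from the most recent issue

  • USE_PROXY = 0 Whether to use a proxy

  • DOWNLOAD_DELAY = 1 If no proxy is used, this must not be less than 1; otherwise the server rejects the IP and the site cannot be accessed for one hour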
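How these three settings interact can be shown with a minimal sketch. The setting names and values come from the list above; the helpers `check_delay` and `select_issues` and the `(year, issue_number)` tuples are hypothetical and only illustrate the rules, not the project's actual code.

```python
# Values as listed above; in the project they live in the Scrapy settings module.
TARGET_YEAR = ["2017"]
USE_PROXY = 0
DOWNLOAD_DELAY = 1


def check_delay(use_proxy, download_delay):
    """Hypothetical check: without a proxy the delay must be at least 1 second."""
    if use_proxy:
        return True
    return download_delay >= 1


def select_issues(issues, target_years):
    """Keep issues from the target years, newest first.

    `issues` is an assumed list of (year, issue_number) tuples.
    """
    wanted = set(target_years)
    kept = [issue for issue in issues if issue[0] in wanted]
    return sorted(kept, reverse=True)
```

For example, `check_delay(USE_PROXY, 0.5)` returns False, which corresponds to the case where the server would block the IP.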

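Web

The crawled data is presented through a web front end built on Flask.

Main features:

  1. Search by title, author, and so on

    • Home page (screenshot: home)
    • Search results (screenshot: searchresult.png)
    • Item details (screenshot: item2)
  2. Enter any time range and field to see the Top 100 most popular authors and keywords, along with the corresponding article counts

    • Popular query (screenshot: popular.png)
    • Query results (screenshot: popular_res2.png)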
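The Top 100 query in feature 2 can be sketched in plain Python. This is an in-memory illustration, not the project's implementation (which presumably queries MongoDB). The record fields `date`, `domain`, `authors`, and `keywords` and the function `top_counts` are assumptions made for the example.

```python
from collections import Counter
from datetime import date


def top_counts(records, field, start, end, domain=None, n=100):
    """Count values of `field` over records in [start, end], optionally in one domain.

    Each record is an assumed dict with a `date`, a `domain` string, and
    list-valued `authors` and `keywords` fields.
    Returns (value, article_count) pairs, most frequent first.
    """
    counter = Counter()
    for rec in records:
        if not (start <= rec["date"] <= end):
            continue
        if domain is not None and rec["domain"] != domain:
            continue
        # set() ensures each article counts a value at most once
        counter.update(set(rec[field]))
    return counter.most_common(n)
```

Calling `top_counts(records, "keywords", date(2017, 1, 1), date(2017, 6, 30), domain="medicine")` would return the most frequent keywords for that range and field, each paired with its article count.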

wanfangdata's People

Contributors

karmenzind, dependabot[bot]
