thesaurusspider's Introduction

Sogou, Baidu, and QQ input method dictionary crawlers

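Crawlers written in Python that download the dictionary (lexicon) files of the Sogou, Baidu, and QQ input methods. The code is split across several folders.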

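Each input method has both a single-threaded and a multi-threaded crawler. The multi-threaded version is much faster than the single-threaded one. A thread count of 5 to 10 is recommended, or keep the default of 5.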
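
A minimal sketch of the multi-threaded pattern described above, not the project's actual code. It is written for Python 3, where the module is named `queue` (the project itself uses the Python 2 name `Queue`). The names `run_workers` and `fetch` are hypothetical; `fetch` stands for whatever function downloads one item.

```python
import queue
import threading


def run_workers(tasks, fetch, thread_num=5):
    """Apply `fetch` to every task using `thread_num` worker threads.

    Returns two lists: (task, result) pairs and (task, exception) pairs.
    """
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    done, failed = [], []
    lock = threading.Lock()  # protects the two result lists

    def worker():
        while True:
            try:
                # Non-blocking get: the thread exits once the queue is empty.
                task = task_queue.get_nowait()
            except queue.Empty:
                return
            try:
                result = fetch(task)
            except Exception as exc:  # record the failure and keep going
                with lock:
                    failed.append((task, exc))
            else:
                with lock:
                    done.append((task, result))

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, failed
```

For example, `done, failed = run_workers(urls, download_one, thread_num=5)` would process `urls` with five threads, where `download_one` is a function you supply.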

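The crawlers use only modules that ship with Python (urllib2, Queue, re, threading, and so on) and have no third-party dependencies. To run a download, set baseDir in the main function of singleThreadDownload.py (single-threaded download) or multiThreadDownload.py (multi-threaded download) to your own download path. Note that baseDir must not end with a /.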
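
The trailing-slash rule presumably matters because the scripts build file paths from baseDir by string concatenation; that is an assumption, not something stated here. As a sketch, joining path parts with `os.path.join` gives the same result with or without the trailing slash. The function name `dict_path` and its parameters are hypothetical.

```python
import os


def dict_path(base_dir, category, filename):
    """Build the target path for one downloaded dictionary file."""
    # os.path.join adds a separator only where one is missing, so
    # "/data/dicts" and "/data/dicts/" produce the same path.
    return os.path.join(base_dir, category, filename)
```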

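If a file fails to download or a page fails to parse, a download log is written to the download root directory. It records the URLs of those files and pages to make debugging easier.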
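
One way such a failure log could be written is sketched below. The file name `download.log`, the function `log_failure`, and the line layout are all assumptions made for illustration; the project's real log format may differ.

```python
import os
import threading
import time

_log_lock = threading.Lock()  # serializes writes from worker threads


def log_failure(base_dir, kind, url, reason=""):
    """Append one failed URL to download.log in the download root.

    `kind` is a free-form label such as "download" or "parse".
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = "{} [{}] {} {}".format(stamp, kind, url, reason).rstrip() + "\n"
    with _log_lock:
        with open(os.path.join(base_dir, "download.log"), "a", encoding="utf-8") as f:
            f.write(line)
```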

For implementation details, see this article.

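The downloaded dictionary files are not plain text. Each input method uses its own custom binary format. For decoding the dictionary files and converting them to text, see this repository.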

Update, 2017.01.13

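The page layout of the Baidu input method dictionary site has changed. Download links are now produced by JavaScript, and the site applies some anti-crawler measures (it returns 500 and 502 errors). 500 (Internal Server Error) and 502 (Bad Gateway) normally signal server-side failures, but some sites also return them deliberately to deter crawlers, and the Baidu dictionary site is one of them.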

Solution:

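1. Although the download link is produced by JavaScript, inspecting the Request URL in the HTTP request headers sent when the download link is clicked shows that the real download link is still a static one: https://shurufa.baidu.com/dict_innerid_download?innerid=, where the value after innerid= is the identifier of the dictionary file, which can be read from the page.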

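2. The 500 and 502 anti-crawler responses are handled by simply sending the request again. After returning 500 or 502, the Baidu dictionary site goes on to return 200, so the server is not actually failing. It looks more like these status codes are returned with some probability to deter crawlers.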
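
The two steps above can be sketched as follows, in Python 3 (`urllib.request` rather than the project's `urllib2`). Only the download URL prefix comes from the text above; the function names, the retry count, the delay, and the timeout are assumptions, and extracting the innerid from the page is left out because the page markup is not described here.

```python
import time
import urllib.error
import urllib.request

DOWNLOAD_URL = "https://shurufa.baidu.com/dict_innerid_download?innerid="


def build_download_url(inner_id):
    """Return the static download link for one dictionary ID."""
    return DOWNLOAD_URL + str(inner_id)


def fetch_with_retry(url, opener=urllib.request.urlopen, max_tries=5, delay=1.0):
    """Return the response body, retrying when the server answers 500 or 502.

    `opener` is injectable so the function can be exercised without network.
    """
    for attempt in range(1, max_tries + 1):
        try:
            with opener(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Other status codes, or running out of tries, are real failures.
            if err.code not in (500, 502) or attempt == max_tries:
                raise
            time.sleep(delay)  # brief pause before asking again
```

A call such as `fetch_with_retry(build_download_url(inner_id))` would then return the raw bytes of the dictionary file, or raise the last `HTTPError` if every attempt fails.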

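Note: because the Baidu input method site applies anti-crawler measures, the request user-agent is no longer fixed, to lower the chance of getting 502 or 500 errors. It is generated with the third-party library user-agent instead, which must be installed before use with easy_install user-agent or pip install user-agent.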
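
A sketch of how a varying user-agent might be attached to each request, in Python 3. It uses `generate_user_agent` from the user-agent package when it is installed. The fallback list, the helper names `random_user_agent` and `build_request`, and the sample strings are assumptions added only so the sketch runs without the package.

```python
import random
import urllib.request

try:
    # Third-party package: pip install user-agent
    from user_agent import generate_user_agent
except ImportError:
    generate_user_agent = None

# Illustrative placeholders, used only when the package is missing.
FALLBACK_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def random_user_agent():
    """Return a user-agent string that changes between calls."""
    if generate_user_agent is not None:
        return generate_user_agent()
    return random.choice(FALLBACK_AGENTS)


def build_request(url):
    """Create a Request carrying a freshly chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": random_user_agent()})
```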

thesaurusspider's People

Contributors

efeiefei, wulc

thesaurusspider's Issues

Multi-threaded download always hangs

Every time, the download hangs right after it finishes 全宋词六.
Is this caused by a deadlock, or is some page failing to parse?
