illness_crawel

2019.08.19

Crawls the Chinese, English, and French editions of https://www.msdmanuals.com.
Uses the Scrapy framework and Selenium.

2020.02.2

Added crawling of https://medlineplus.gov/ency/.

2020.04.23

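Added crawling of https://reference.medscape.com/drug/.
The original plan was to build a dict mapping each page's local save path to its URL, then iterate over that one large dict at the end to download the text.
It turned out that populating the whole dict took a fairly long time, because it required four large for loops,
so the dict was dropped: each path and URL is recorded as a tuple, and the text is downloaded from that.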
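
A minimal sketch of the tuple approach described above, not the project's actual code. The names iter_targets and download_all, the slugs argument, and the injected fetch callable are all hypothetical; the point is only that (path, url) tuples can be produced one at a time and consumed immediately, so no large dict has to be filled first.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, Tuple


def iter_targets(out_dir: str, slugs: Iterable[str], base_url: str) -> Iterator[Tuple[Path, str]]:
    """Yield one (local_path, url) tuple at a time instead of filling a dict."""
    for slug in slugs:
        yield Path(out_dir) / f"{slug}.txt", f"{base_url}{slug}"


def download_all(targets: Iterable[Tuple[Path, str]], fetch: Callable[[str], str]) -> int:
    """Save the text of every target and return how many files were written.

    fetch is a hypothetical callable (url -> text); in a real crawler it
    would wrap the HTTP request and the HTML-to-text extraction.
    """
    count = 0
    for path, url in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
        count += 1
    return count
```

Because iter_targets is a generator, downloading can start as soon as the first tuple is available.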

2024.02.18

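Added crawling of https://www.orpha.net/consor/cgi-bin/Disease_Search_List.php?lng=EN.
After all the sub-URLs are crawled, a dict stores each name and url, so that at download time the crawled content can be saved with name as the file name. A thread pool is used so that downloading runs while crawling continues.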
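
The following is a hedged sketch of that pattern using the standard library's ThreadPoolExecutor. The function names, the safe_filename helper, and the injected fetch callable are assumptions made for illustration; the original code may be organized differently.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def safe_filename(name: str) -> str:
    """Replace whitespace and characters that are unsafe in file names."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")


def save_page(name: str, url: str, out_dir: str, fetch: Callable[[str], str]) -> Path:
    """Download one page and save it under a file named after `name`."""
    path = Path(out_dir) / f"{safe_filename(name)}.txt"
    path.write_text(fetch(url), encoding="utf-8")
    return path


def crawl_and_download(links: Iterable[Tuple[str, str]], out_dir: str,
                       fetch: Callable[[str], str], max_workers: int = 8) -> Dict[str, str]:
    """Record name -> url in a dict and download each page as soon as it is found.

    `links` may be a generator that is still crawling; each (name, url)
    pair is submitted to the pool immediately, so downloads overlap
    with the crawl.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    name_to_url: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for name, url in links:
            name_to_url[name] = url
            futures.append(pool.submit(save_page, name, url, out_dir, fetch))
        for future in futures:
            future.result()  # re-raise any error from a worker thread
    return name_to_url
```

Two different names that map to the same sanitized file name would overwrite each other here; a real crawler may need to handle that case.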

Tips

A particularly good way to extract page content in document order is:

import re

# subDiv is the already-parsed section of the page that holds the content
patt1 = re.compile(r'<p>(.*?)</p>|<li>(.*?)</li>|<h3>(.*?)</h3>|<h4>(.*?)</h4>', re.S)
subcontent = patt1.findall(str(subDiv))

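The idea is to first compile a single regular expression that covers every tag whose content you want,
then run findall on the already-parsed part of the page; the text comes back in the order it appears on the page.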
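
One detail worth illustrating: with several capture groups, findall returns one tuple per match, and only the group for the alternative that matched is non-empty. The sketch below (the function name ordered_text and the sample HTML are made up for illustration) flattens those tuples into a plain list of strings in page order.

```python
import re

# Same pattern as above: one alternative per tag of interest.
PATT = re.compile(r'<p>(.*?)</p>|<li>(.*?)</li>|<h3>(.*?)</h3>|<h4>(.*?)</h4>', re.S)


def ordered_text(html: str) -> list:
    """Return the inner text of every matched tag, in document order.

    Each findall result is a 4-tuple with at most one non-empty group,
    so joining the groups yields the captured text for that match.
    """
    return ["".join(groups) for groups in PATT.findall(html)]


sample = (
    "<h3>Overview</h3><p>First paragraph.</p>"
    "<ul><li>Item one</li><li>Item two</li></ul>"
    "<h4>Notes</h4><p>Last.</p>"
)
print(ordered_text(sample))
# ['Overview', 'First paragraph.', 'Item one', 'Item two', 'Notes', 'Last.']
```

Note that this pattern only matches bare tags; a tag with attributes, such as a p element with a class, would be skipped unless the pattern is extended.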

Contributors

zhaodi-wen