
crawl_taobao's Introduction

crawl_taobao

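Uses selenium + pyquery to scrape information on user-specified Taobao products and saves it to MongoDB for later data analysis.

The crawl comes in two flavours: single-process crawling (CRAWL.py) and multi-process crawling (where the key point is how to construct Taobao's request URLs).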


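Notes:

1. Multi-process crawling is strongly recommended: 100 pages in 20 seconds (with no page rendering);
2. The code contains a lot of cookies. They are there to get past Taobao's login, and Taobao uses many of them, so I exported them all in order with a very handy extension, <font color=red>EditThisCookie</font>. One catch is that some values in the exported cookie dictionaries are not strings, so I added two lines of code to turn the non-string values (e.g. false and true) into strings;

3. Large sites tend to have strong anti-scraping measures. Taobao can tell through certain mechanisms whether the browser is being driven by selenium, for example <font color=red>the window.navigator.webdriver property</font>: true indicates selenium, undefined means it is absent.
`Solutions:` 
A (recommended): pass an options argument (an experimental option) to Chrome() and execute JS code to mask window.navigator.webdriver;
B: inject code into the page to mask it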
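The two solutions above can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the `excludeSwitches` option is an assumption about which experimental option is meant, and `HIDE_WEBDRIVER_JS` and `build_driver` are hypothetical names. Whether Taobao is still satisfied by this is not something the sketch can guarantee.

```python
# JavaScript that redefines navigator.webdriver so that it reads as undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def build_driver():
    """Start Chrome with the automation hints reduced (hypothetical helper)."""
    # selenium is third-party; it is imported here so that the module itself
    # loads even where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Solution A: an experimental option (assumed to be the one intended).
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    driver = webdriver.Chrome(options=options)

    # Solution B: inject the script so it runs before any page script does.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
    return driver
```

Calling `build_driver()` needs a local Chrome and a matching driver; the returned object is an ordinary selenium Chrome driver.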

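##PS: Taobao sometimes asks for login verification and sometimes does not. As far as I remember, the sequence was: log in once manually in a normal browser, then try to log in through the selenium-controlled browser, which is bound to fail (logging in by hand in the controlled browser at that point fails too). Leave that browser open, and after a while log in again in the same browser; this time it succeeds.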
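For the cookie clean-up mentioned in note 2, one possible approach is to parse the exported text as JSON instead of pasting it into Python source, then stringify whatever is not already a string. This is only a sketch: the field names in the sample (`name`, `value`, `secure`, `httpOnly`, `id`) are assumed to match a typical EditThisCookie export, and both helper names are hypothetical.

```python
import json


def stringify_cookies(exported_json):
    """Parse an exported cookie list and turn every non-string value into a string."""
    cookies = json.loads(exported_json)
    return [
        # json.dumps keeps the original spelling: False -> "false", 1 -> "1".
        {key: value if isinstance(value, str) else json.dumps(value)
         for key, value in cookie.items()}
        for cookie in cookies
    ]


def cookie_jar(cookies):
    """Collapse the cookie list into a plain name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}
```

The list form suits adding cookies to a selenium session one by one, while the name-to-value mapping is the shape usually passed to an HTTP client.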


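Update:

By analysing the page source, the network requests and so on, I worked out how to construct the page URLs needed for multi-process requests. On opening one I was surprised to find that the response is JSONP, which is easy to scrape: the json module, combined with a regular expression at most, is enough to get the product information (it does make one wonder whether Taobao serves it to mislead crawlers). Note that the URL carries a jsonp parameter; removing it makes the response plain JSON, so the URLs I construct leave the jsonp parameter out.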
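A small helper can cover both response shapes, with and without the JSONP wrapper. This is a sketch using only the standard library; `parse_jsonp` is a hypothetical name, and the regular expression assumes the callback name consists of word characters, `$` or `.`.

```python
import json
import re

# callbackName( ...payload... ) with an optional trailing semicolon.
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Return the decoded payload of a JSONP or plain JSON response body."""
    match = _JSONP_RE.match(text)
    # No wrapper found: assume the text is already plain JSON.
    body = match.group(1) if match else text
    return json.loads(body)
```

For example, `parse_jsonp('jsonp1084({"a": 1})')` and `parse_jsonp('{"a": 1}')` both return the same dictionary.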

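The URL is here
On the left you can see search?data-key=xxxx&data-value=xxxx (to make the URL easier to spot, right-click the Name column and enable domain). The values 44, 88 and 132 show that each page advances the value by 44. There are also the parameters bcoffset and ntoffset, which are built as arithmetic sequences.
One parameter that stands out is _ksTS, which is clearly a timestamp. As I remembered it, timestamps are either 13 or 16 digits long, yet this one has the form of 13 digits, an underscore, then 4 digits. Out of curiosity I printed time.time() and found that it is indeed a 17-digit float, so the rest was easy.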
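The construction described above might look like the following sketch. It makes several assumptions: the base URL and the parameter names are taken from the example URL that appears elsewhere in this document, only a subset of the parameters is included, bcoffset and ntoffset are left out because their step sizes are not stated, and `time.time_ns()` is used in place of `time.time()` so that the digits are not affected by float rounding. All function names are hypothetical.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
PAGE_STEP = 44  # the data-value parameter grows by 44 per page


def make_ksts(now_ns=None):
    """Build a _ksTS value: 13-digit millisecond timestamp, '_', 4 more digits."""
    if now_ns is None:
        now_ns = time.time_ns()
    millis = now_ns // 1_000_000
    # The next four decimal digits after the milliseconds.
    suffix = (now_ns // 100) % 10_000
    return f"{millis}_{suffix:04d}"


def page_offsets(n_pages):
    """Offsets for the first n_pages pages: 44, 88, 132, ..."""
    return [PAGE_STEP * page for page in range(1, n_pages + 1)]


def build_search_url(keyword, offset, now_ns=None):
    """Assemble a search URL without the jsonp parameter."""
    params = {
        "data-key": "s",
        "data-value": offset,
        "ajax": "true",
        "_ksTS": make_ksts(now_ns),
        "q": keyword,
        "ie": "utf8",
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

A list of URLs for the first hundred pages would then be `[build_search_url(keyword, offset) for offset in page_offsets(100)]`.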

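Results:

The single-process run took 16 minutes. The likely cause is that it uses selenium, so every page has to be rendered, and I also added several time.sleep calls;
As shown in the figure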


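The multi-process run took 20 seconds, because it requests the URLs (derived from the pattern found above) directly with requests, and the page behind each URL turns out to be JSONP with no rendering at all
![As shown in the figure](https://github.com/HELL-TO-HEAVEN/crawl_taobao/blob/master/url%E9%A1%B5%E9%9D%A2.png)

As shown in the figure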
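The multi-process variant can be outlined with a process pool mapped over the page URLs. This is a sketch rather than the repository's script: `HEADERS`, `COOKIES`, `fetch_page` and `crawl` are hypothetical names, and the header and cookie values are placeholders.

```python
from multiprocessing import Pool

# Placeholder request settings (assumptions, not real values).
HEADERS = {"User-Agent": "Mozilla/5.0"}
COOKIES = {}


def fetch_page(url):
    """Download one result page and return its body as text."""
    # requests is third-party; imported here so the module loads without it.
    import requests

    response = requests.get(url, headers=HEADERS, cookies=COOKIES, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(urls, processes=4, fetch=fetch_page):
    """Fetch all URLs, spreading the work over a pool of worker processes."""
    urls = list(urls)
    if not urls:
        return []
    if processes <= 1:
        # Serial fallback, handy for debugging a single request.
        return [fetch(url) for url in urls]
    with Pool(processes) as pool:
        # map() keeps the results in the same order as the input URLs.
        return pool.map(fetch, urls)
```

When run as a script, the call to `crawl(...)` should sit under `if __name__ == "__main__":`, since some platforms start worker processes by re-importing the main module.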

Next comes the data analysis.

crawl_taobao's People

Contributors

hell-to-heaven

Forkers

moreinterest

crawl_taobao's Issues

Link problem 1

When I use the link you gave

https://s.taobao.com/search?data-key=s&data-value=88&ajax=true&_ksTS=1553678998014_1170&q=%E5%A5%B3%E7%94%9F%E5%8C%85%E5%8C%85&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20190327&ie=utf8&bcoffset=0&ntoffset=6&p4ppushleft=1,48&s=44

the first request returns the correct JSON result.

After several more requests I get the following instead. Could you explain what causes this?

jsonp1084({ "rgv587_flag": "sm", "url": "https://s.taobao.com:443//search/_____tmd_____/punish?x5secdata=5e0c8e1365474455070961b803bd560607b52cabf5960afff39b64ce58073f78f68ede033dd239842063c29628191423773f1e4d712042da0b04859e7922f0cd8026dade1c87a609bb9a0d1f6d96ccc82f5139971b530ff3588b8c9e848299a945cb0ce36cef9f62cbd52852a03cf8ba461ee819ca12264cfd380e1ff9a318179636e1a04e92032cedffa50d0c9c3a56ca522008f6bcd1f4835bb9c2bad45d584508cfef73bf1804c75ad242ec660d4179a52a883b8d79c21b1904b01749f64ff68ede033dd239842063c29628191423c2512d1f2d6af203e486d1ae27585a506b8357c40b8852e10bee2dd322fdfa01b85d13ca384528f05b373d3a77a70575ad921bb1d36afdc5973c0455682491a957f7918a4f2572499cc398910575bb4ae5b2a48d9c0185c8d8521d59b4860b9243a2952e026506275152d2dce642e18a4440bf0b3e57db00024c36b841c1cc35ed81c65bebf3b9df46dd6afed6f199892c38573d94a1e033206e485398b2371f33c4d8e2e6b00ba097d8786478a58309bf29683bdadfe452f4353351418c615a273b6e8b188bb7af7caa7e645102b4841a499e722a7130246d94117211e574f22642edfb1867297bc9ec176ba721c72a441511e4fcdd0d33bd6a9d458dec482e0775e4606af9cfdadc50136f2b781f756cfd2b8845113d81d26788388259064f6a0620897ec8ecffd194f0fb13d29e615e815cecf23176496dbc510004ab5d070fdb0e255d6e9e16386749f104e9ab5118104b06bd98670e84056d61db95dbf743ff9761853023bd90f0ce9e8444d2aaa1b605f68fb734ef2550c1ba27f72d11cf8009a28d14bdf5c9af931ea6bc8275bc5f704a82cfb7fd46579fa85baaa2c888b0229e42374601dd62e2b4dbec1f913bb6c38b3ffae92bb4849e1e5228f1f7537b163a6f81af25ce770acee6cb26249891a48318440dcccb46da6a5d37f1c5d442933a57e108109ecec26e236a1f74684a9b38ad08537023e2489dac27a32b7ebd17f1677d315655dfa8fe93649d17ad99ed1db57996a9511714bfc34a0304&x5step=2" })
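A response of the kind shown above can at least be recognised in code before it is mistaken for product data. The sketch below assumes that the `rgv587_flag` key marks such a response and that its `url` field points at a verification page; both are inferences from this single sample (the URL contains `punish`), not documented behaviour. The function names are hypothetical.

```python
def is_blocked(payload):
    """Return True if a decoded response looks like the redirect shown above."""
    return isinstance(payload, dict) and "rgv587_flag" in payload


def redirect_url(payload):
    """Return the URL carried by such a response, or None for normal data."""
    return payload.get("url") if is_blocked(payload) else None
```

A crawler could check `is_blocked(...)` on every decoded page and pause or stop when it returns True, rather than storing the response as a result.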
