

-Python-Crawler-Turorial-1 

The idea behind a crawler

When we open a web page, we are requesting a resource over the HTTP protocol; the server then returns an HTML document, and the browser renders it. So all we need to do is mimic a browser: send a request, receive the document, and extract the content we need.

A simple crawler

import urllib

# Python 2: urllib.urlopen sends a GET request and returns a file-like response
response = urllib.urlopen("http://www.baidu.com")
print response.read()

  • First, we import Python's urllib library
  • Then, we request a URL
  • And we can see a whole page of HTML code. It's that simple with Python
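The snippet above is Python 2; in Python 3, `urllib.urlopen` moved to `urllib.request.urlopen`. A minimal sketch of the same request (the helper name `fetch` is ours, not part of the tutorial):

```python
from urllib.request import urlopen

def fetch(url):
    # urlopen returns a file-like object; read() yields bytes,
    # so decode to get a str
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# html = fetch("http://www.baidu.com")
# print(html)
```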

Going further 1

Spoofing the User-Agent

import urllib2

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Host": "aljun.me"
}
# Note: the class is Request (capital R) and the keyword argument is headers
request = urllib2.Request("http://xxx.com", headers=header)
response = urllib2.urlopen(request)
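For reference, the same User-Agent spoofing in Python 3 uses `urllib.request.Request`. Building the request does not hit the network, so the headers can be inspected right away (note that `Request` normalizes header names, e.g. to "User-agent"):

```python
from urllib.request import Request

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Constructing the Request only stores the headers; nothing is sent yet
request = Request("http://xxx.com", headers=header)
print(request.get_header("User-agent"))  # the UA string set above
```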

Sending cookies

import urllib2
import cookielib

# A CookieJar stores cookies received in responses and
# attaches them to later requests
cookie = cookielib.CookieJar()

# The jar must be passed to the HTTPCookieProcessor
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

urllib2.install_opener(opener)

response = urllib2.urlopen("http://xxx.com")
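In Python 3, `cookielib` was renamed `http.cookiejar` and `urllib2`'s opener machinery lives in `urllib.request`. A sketch of the same setup (building the opener is network-free):

```python
import http.cookiejar
import urllib.request

# The CookieJar collects cookies from responses and re-sends them
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)

# response = urllib.request.urlopen("http://xxx.com")
print(len(cookie))  # no request made yet, so the jar is empty: 0
```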

Sending data

import urllib
import urllib2

data = {
    "username": "xxx",
    "password": "xxx"
}
# The POST body must be URL-encoded; passing data makes this a POST request
request = urllib2.Request("http://xxx.com", urllib.urlencode(data))
response = urllib2.urlopen(request)
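The dict cannot be passed to the request directly; it has to be URL-encoded first (`urllib.urlencode` in Python 2, `urllib.parse.urlencode` in Python 3). A network-free Python 3 sketch:

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {
    "username": "xxx",
    "password": "xxx",
}

# urlencode turns the dict into "username=xxx&password=xxx";
# urlopen expects bytes for the body, hence .encode()
body = urlencode(data).encode("ascii")
request = Request("http://xxx.com", data=body)
print(request.get_method())  # a request carrying a body defaults to POST
```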

Downloading images

import urllib

path = "xxx.png"
url = "http://zhaduixueshe.com/static/pic/discovery.png"

# urlretrieve downloads the resource at url and saves it to path
urllib.urlretrieve(url, path)

  • The officially recommended approach: very fast and easy to use
  • Requests library link
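In Python 3 the same function lives at `urllib.request.urlretrieve`; a small wrapper sketch (the helper name `download` is ours):

```python
from urllib.request import urlretrieve

def download(url, path):
    # urlretrieve streams the resource at url into the local file at path
    # and returns (local_path, http_headers)
    filename, headers = urlretrieve(url, path)
    return filename

# download("http://zhaduixueshe.com/static/pic/discovery.png", "xxx.png")
```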

Regular expressions: re

import urllib2
import re

# \d+ matches the digits in the image file name (the original pattern
# was missing the backslashes)
reg = re.compile(r'http.*(\d+)\.jpg')
response = urllib2.urlopen("http://xxx.com")
# Note: findall takes the pattern first, then the string
result = reg.findall(response.read())

  • Used to match patterns in the document
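The snippet above only sketches the idea; here is a self-contained version run on a sample string (the HTML fragment is made up for illustration):

```python
import re

# A made-up fragment standing in for response.read()
html = '<img src="http://xxx.com/img/001.jpg"> <img src="http://xxx.com/img/002.jpg">'

# \d+ captures the digits in the file name; [^"]*? keeps the match
# inside a single src attribute
reg = re.compile(r'http[^"]*?(\d+)\.jpg')
result = reg.findall(html)
print(result)  # ['001', '002']
```

With one capture group, `findall` returns the captured digits rather than the whole match.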

BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# Output:
<html>
<head>
<title>Home</title>
</head>
<body>
<h1>I am a heading</h1>
<img src="xxx">
</body>
</html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

  • A professional library for parsing HTML
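For comparison, a dependency-free sketch of the same kind of extraction with the standard library's html.parser (a much more limited tool than BeautifulSoup; the `TitleParser` class is ours):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>The Dormouse's story</title></head></html>")
print(parser.title)  # The Dormouse's story
```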

pyquery

pyquery library link

  • pyquery is built on jQuery's selector syntax, so it is a natural fit for developers coming from the front end

Working with JSON responses

In [1]: import urllib2

In [2]: response=urllib2.urlopen("http://aljun.me/like")

In [3]: print response.read()
{
"liked": 1647
}
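The response above is JSON, so rather than reading it as raw text it is cleaner to parse it with the standard json module; a sketch using the same payload:

```python
import json

# The raw body returned by http://aljun.me/like in the session above
raw = '{"liked": 1647}'

data = json.loads(raw)  # parse the JSON string into a dict
print(data["liked"])  # 1647
```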

Contributors

xinrui-fang
