

-Python-Crawler-Turorial-1 

The idea behind a crawler

When we open a web page, we are requesting a resource over the HTTP protocol; the server then returns an HTML document, and the browser renders it. So all we need to do is mimic a browser: send a request, receive the document, and extract the content we need.

A simple crawler

import urllib

# Python 2: urllib.urlopen sends a GET request and returns a file-like response
response = urllib.urlopen("http://www.baidu.com")
print response.read()

  • First, we import Python's urllib library
  • Then, we request a URL
  • And we can see a whole page of HTML code. It's that simple with Python
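The snippet above is Python 2; in Python 3, `urllib.urlopen` moved to `urllib.request.urlopen`. A minimal sketch of the same request (the helper name `fetch` is ours, not part of the tutorial):

```python
from urllib.request import urlopen

def fetch(url):
    # urlopen returns a file-like object; read() yields bytes,
    # so decode to get a str
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# html = fetch("http://www.baidu.com")
# print(html)
```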

Going further 1

Spoofing the User-Agent

import urllib2

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Host": "aljun.me"
}
# Note: the class is Request (capital R) and the keyword argument is headers
request = urllib2.Request("http://xxx.com", headers=header)
response = urllib2.urlopen(request)
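For reference, the same User-Agent spoofing in Python 3 uses `urllib.request.Request`. Building the request does not hit the network, so the headers can be inspected right away (note that `Request` normalizes header names, e.g. to "User-agent"):

```python
from urllib.request import Request

header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Constructing the Request only stores the headers; nothing is sent yet
request = Request("http://xxx.com", headers=header)
print(request.get_header("User-agent"))  # the UA string set above
```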

Sending cookies

import urllib2
import cookielib

# A CookieJar stores cookies received in responses and
# attaches them to later requests
cookie = cookielib.CookieJar()

# The jar must be passed to the HTTPCookieProcessor
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

urllib2.install_opener(opener)

response = urllib2.urlopen("http://xxx.com")
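In Python 3, `cookielib` was renamed `http.cookiejar` and `urllib2`'s opener machinery lives in `urllib.request`. A sketch of the same setup (building the opener is network-free):

```python
import http.cookiejar
import urllib.request

# The CookieJar collects cookies from responses and re-sends them
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)

# response = urllib.request.urlopen("http://xxx.com")
print(len(cookie))  # no request made yet, so the jar is empty: 0
```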

Sending data

import urllib
import urllib2

data = {
    "username": "xxx",
    "password": "xxx"
}
# The POST body must be URL-encoded; passing data makes this a POST request
request = urllib2.Request("http://xxx.com", urllib.urlencode(data))
response = urllib2.urlopen(request)
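The dict cannot be passed to the request directly; it has to be URL-encoded first (`urllib.urlencode` in Python 2, `urllib.parse.urlencode` in Python 3). A network-free Python 3 sketch:

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {
    "username": "xxx",
    "password": "xxx",
}

# urlencode turns the dict into "username=xxx&password=xxx";
# urlopen expects bytes for the body, hence .encode()
body = urlencode(data).encode("ascii")
request = Request("http://xxx.com", data=body)
print(request.get_method())  # a request carrying a body defaults to POST
```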

Downloading images

import urllib

path = "xxx.png"
url = "http://zhaduixueshe.com/static/pic/discovery.png"

# urlretrieve downloads the resource at url and saves it to path
urllib.urlretrieve(url, path)

  • The officially recommended approach: very fast and easy to use
  • Requests library link
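In Python 3 the same function lives at `urllib.request.urlretrieve`; a small wrapper sketch (the helper name `download` is ours):

```python
from urllib.request import urlretrieve

def download(url, path):
    # urlretrieve streams the resource at url into the local file at path
    # and returns (local_path, http_headers)
    filename, headers = urlretrieve(url, path)
    return filename

# download("http://zhaduixueshe.com/static/pic/discovery.png", "xxx.png")
```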

Regular expressions: re

import urllib2
import re

# \d+ matches the digits in the image file name (the original pattern
# was missing the backslashes)
reg = re.compile(r'http.*(\d+)\.jpg')
response = urllib2.urlopen("http://xxx.com")
# Note: findall takes the pattern first, then the string
result = reg.findall(response.read())

  • Used to match patterns in the document
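The snippet above only sketches the idea; here is a self-contained version run on a sample string (the HTML fragment is made up for illustration):

```python
import re

# A made-up fragment standing in for response.read()
html = '<img src="http://xxx.com/img/001.jpg"> <img src="http://xxx.com/img/002.jpg">'

# \d+ captures the digits in the file name; [^"]*? keeps the match
# inside a single src attribute
reg = re.compile(r'http[^"]*?(\d+)\.jpg')
result = reg.findall(html)
print(result)  # ['001', '002']
```

With one capture group, `findall` returns the captured digits rather than the whole match.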

BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# Output:
<html>
<head>
<title>Home</title>
</head>
<body>
<h1>I am a heading</h1>
<img src="xxx">
</body>
</html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

  • A professional library for parsing HTML
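For comparison, a dependency-free sketch of the same kind of extraction with the standard library's html.parser (a much more limited tool than BeautifulSoup; the `TitleParser` class is ours):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>The Dormouse's story</title></head></html>")
print(parser.title)  # The Dormouse's story
```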

pyquery

pyquery library link

  • pyquery is built on jQuery's selector syntax, so it is a natural fit for developers coming from the front end

Working with JSON responses

In [1]: import urllib2

In [2]: response=urllib2.urlopen("http://aljun.me/like")

In [3]: print response.read()
{
"liked": 1647
}
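The response above is JSON, so rather than reading it as raw text it is cleaner to parse it with the standard json module; a sketch using the same payload:

```python
import json

# The raw body returned by http://aljun.me/like in the session above
raw = '{"liked": 1647}'

data = json.loads(raw)  # parse the JSON string into a dict
print(data["liked"])  # 1647
```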

Contributors

xinrui-fang
