When we open a web page, we are using the HTTP protocol to request a resource; the server then returns an HTML document, and the browser renders it. So all we have to do is imitate a browser, send a request, receive that document, and extract the content we need.
import urllib

# Request the page and print the raw HTML it returns
response = urllib.urlopen("http://www.baidu.com")
print response.read()
- First, we import Python's urllib library
- Then we request a URL
- And we can see a whole page of HTML code. It's that simple with Python.
import urllib2

# Custom headers make the request look like it comes from a real browser
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Host": "aljun.me"
}
request = urllib2.Request("http://xxx.com", headers=header)
response = urllib2.urlopen(request)
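Continuing the snippet above, a quick way to check what came back; the inspection methods are standard urllib2, and the URL is still the placeholder:

print response.getcode()   # HTTP status code, e.g. 200
print response.geturl()    # the URL actually fetched, after redirects
print response.info()      # the response headers
html = response.read()     # the body itself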
import urllib2
import cookielib

# A CookieJar holds the cookies the server sets, across requests
# (e.g. a site might set something like bdshare_firstime=1455378744638)
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response = urllib2.urlopen("http://xxx.com")
import urllib
import urllib2

data = {
    "username": "xxx",
    "password": "xxx"
}
# POST data must be urlencoded into a query string first
request = urllib2.Request("http://xxx.com", urllib.urlencode(data))
response = urllib2.urlopen(request)
import urllib

# Download a file straight to disk
path = "xxx.png"
url = "http://zhaduixueshe.com/static/pic/discovery.png"
urllib.urlretrieve(url, path)
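urlretrieve also takes an optional progress callback as its third argument; a minimal sketch continuing the snippet above (the hook name is ours, the three-argument signature is urllib's):

# Called once per block: block count so far, block size, total file size
def report(count, block_size, total_size):
    if total_size > 0:
        print "%d%%" % min(100, count * block_size * 100 / total_size)

urllib.urlretrieve(url, path, report)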
- The officially recommended approach: very fast and pleasant to use
- Requests library link
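For comparison, the same fetch and login with Requests might look like this (a minimal sketch; the URLs and form fields are the placeholders used above):

import requests

# GET with browser-like headers
header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0"}
r = requests.get("http://xxx.com", headers=header)
print r.status_code
print r.text

# A Session keeps cookies across requests, so a login "sticks"
s = requests.Session()
s.post("http://xxx.com", data={"username": "xxx", "password": "xxx"})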
import urllib2
import re

# \d+ captures the digits in the filename, \. matches a literal dot
reg = re.compile(r'http.*?(\d+)\.jpg')
response = urllib2.urlopen("http://xxx.com")
result = reg.findall(response.read())
- Regular expressions are used to match patterns in the document
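A quick check of what that pattern captures, on a made-up page fragment:

import re

reg = re.compile(r'http.*?(\d+)\.jpg')
sample = '<img src="http://xxx.com/img/123.jpg"><img src="http://xxx.com/img/456.jpg">'
print reg.findall(sample)
# ['123', '456']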
from bs4 import BeautifulSoup

# html_doc is the HTML string fetched earlier
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
###
<html>
<head>
<title>Home</title>
</head>
<body>
<h1>I am the heading</h1>
<img src="xxx">
</body>
</html>
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
- A professional library for parsing HTML
pyquery library link
- pyquery is built on jQuery's selector syntax, which makes it a natural fit for people coming from front-end development
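A minimal sketch of pyquery on the small sample page from the BeautifulSoup section (the selectors are ordinary jQuery syntax):

from pyquery import PyQuery as pq

html_doc = '<html><head><title>Home</title></head><body><h1>I am the heading</h1><img src="xxx"></body></html>'
d = pq(html_doc)
print d('title').text()     # Home
print d('img').attr('src')  # xxx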
In [1]: import urllib2
In [2]: response=urllib2.urlopen("http://aljun.me/like")
In [3]: print response.read()
{
"liked": 1647
}
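Since that endpoint returns JSON, the body can be parsed directly; a minimal sketch, assuming the response shown above:

import json
import urllib2

response = urllib2.urlopen("http://aljun.me/like")
data = json.loads(response.read())
print data["liked"]   # e.g. 1647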