
maoyan's Introduction

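Python 3 Web Crawler Development in Practice (Python3 网络爬虫开发实战)

This book explains how to develop web crawlers with Python 3. It begins with a detailed walkthrough of environment setup and crawler fundamentals. It then covers request libraries such as urllib and requests; parsing tools such as Beautiful Soup, XPath, and pyquery; and ways of storing data in text files and various databases. Next, it uses several case studies to show how to crawl Ajax data and how to crawl dynamic websites with Selenium and Splash. It goes on to cover crawler techniques, including crawling through proxies and maintaining a dynamic proxy pool, using ADSL dial-up proxies, solving image, GeeTest, touch-click, and grid CAPTCHAs, crawling sites through simulated login, and maintaining a Cookies pool. In addition, with the mobile internet in mind, the book explores App crawling with tools such as Charles, mitmdump, and Appium, then introduces the pyspider and Scrapy frameworks and distributed crawling, and finally covers Bloom Filter efficiency optimization, crawler deployment with Docker and Scrapyd, and crawler management with Gerapy.

This book is published and distributed by Turing Education (Posts & Telecom Press). All rights reserved; reproduction is prohibited.

Author: Cui Qingcai (崔庆才)

Purchase link:

Join the reader group:

Video resources:

Three Hands-On Python 3 Crawler Case Studies

Do It Yourself! Hands-On Python 3 Web Crawler Case Studies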

maoyan's People

Contributors

germey

maoyan's Issues

The get_one_page function needs a response.close() call

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # The body has already been downloaded, so .text stays readable after close().
            response.close()
            return response.text
        return None
    except RequestException:
        return None
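
The snippet above closes the response only on the 200 path. A minimal sketch of one alternative, assuming a requests version in which Response objects work as context managers: the `with` block closes the response on every path. The `http` and `timeout` parameters are hypothetical additions for this sketch and are not part of the original function.

```python
import requests
from requests.exceptions import RequestException

# Same User-Agent string as in the issue above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/65.0.3325.162 Safari/537.36'
}


def get_one_page(url, http=requests, timeout=10):
    """Return the page HTML, or None on a non-200 status or a request error.

    `http` and `timeout` are hypothetical parameters for this sketch; `http`
    is anything with a requests-style get(), such as the module or a Session.
    """
    try:
        # Leaving the `with` block calls response.close() on every path.
        with http.get(url, headers=HEADERS, timeout=timeout) as response:
            if response.status_code == 200:
                return response.text
            return None
    except RequestException:
        return None
```

Called as get_one_page('https://maoyan.com/board/4'), it behaves like the original function, except that the connection is also released when the status is not 200.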

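Runs successfully but returns no results

The program runs without errors but produces no results.
While debugging, I found that the regex match returns nothing:
items = re.findall(pattern, html) leaves items empty.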
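
An empty result from re.findall can mean either that the server returned a different page or that the pattern no longer fits the markup. A rough sketch for telling these cases apart follows. It assumes board entries are wrapped in `<dd>` tags, as in the page source quoted in a later issue; the function name and the returned labels are hypothetical.

```python
import re


def diagnose_empty_match(html, pattern):
    """Return a short label hinting at why re.findall(pattern, html) is empty."""
    if not html:
        # The fetch step returned None or an empty string.
        return "no-html"
    if re.findall(pattern, html):
        return "matched"
    if "<dd" not in html:
        # No board entries at all: probably not the expected page.
        return "no-dd-entries"
    # Entries exist, so the pattern itself no longer fits the markup.
    return "pattern-mismatch"
```

A "no-dd-entries" result suggests inspecting the saved HTML first, since the site may have served something other than the board page.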

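Why is Titanic, ranked 25th, the only entry that is not scraped?

I first saved the page source to a folder and then scraped from the local files. Only Titanic, ranked 25th, is not scraped. Why is that?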

import re
import time
import json

import requests  # missing from the original listing

maoyan_url_base = 'https://maoyan.com/board/4?offset='
pattern = re.compile('<dd>.*?<i class="board-index.*?>(.*?)</i>.*?title="(.*?)".*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>', re.S)
proxies = {
    'http': 'http://127.0.0.1:10809',
    'https': 'http://127.0.0.1:10809'
}


def get_one_page_url(url):
    """Fetch the source of one page through the proxy."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }
    result = requests.get(url, headers=headers, proxies=proxies)
    if result.status_code == 200:
        return result.text
    else:
        return None


def store_html(html_txt, filename):
    """Save the source of one page."""
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html_txt)


def get_store_html(filename):
    """Read the contents of a saved file."""
    with open(filename, 'r', encoding='utf-8') as f:
        html = f.read()
    return html


def store_10_html():
    """Save the source of ten pages."""
    # The original loop was range(1, 2), which saved only one page;
    # range(10) matches the docstring and read_10_txts().
    for i in range(10):
        url = maoyan_url_base + str(i*10)
        filename = f"maoyan/maoyan_page{i}.txt"
        html = get_one_page_url(url)
        if html is not None:  # skip pages that did not return 200
            store_html(html, filename)
        time.sleep(1)


def scrap_web(filename):
    """Parse the rank, title, etc. from a saved page and yield one dict per film."""
    html = get_store_html(filename)
    results = re.findall(pattern, html)
    # Groups: 0 = rank, 1 = title, 2 = cast, 3 = release date, 4 and 5 = score parts
    for result in results:
        score = result[4] + result[5]
        score = score.strip()
        # The original printed each tuple here; yielding lets read_10_txts() iterate.
        yield {
            'index': result[0],
            'title': result[1],
            'actor': result[2].strip()[3:],  # drop the 3-character cast label
            'time': result[3][5:],  # drop the 5-character release-date label
            'score': score
        }

# def scrap_10_webs():
#     """Parse the source of ten saved pages."""
#     for i in range(10):
#         filename = f"maoyan/maoyan_page{i}.txt"
#         content = scrap_web(filename)
#         scrap_web(filename)


def write_to_json(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def read_10_txts():
    for i in range(10):
        filename = f"maoyan/maoyan_page{i}.txt"
        for item in scrap_web(filename):
            write_to_json(item)


if __name__ == "__main__":
    for item in scrap_web('maoyan/maoyan_page2.txt'):
        print(item)

In the page source, the 25th entry has the same format as the other entries, yet it is not scraped. Has anyone run into the same problem?
Page source:
                <dd>
                        <i class="board-index board-index-25">25</i>
    <a href="/films/267" title="泰坦尼克号" class="image-link" data-act="boarditem-click" data-val="{movieId:267}">
      <img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
      <img data-src="https://p0.meituan.net/moviemachine/e7dd6b1f77fba08c1f20a3b20b156621642576.jpg@160w_220h_1e_1c" alt="泰坦尼克号" class="board-img" />
    </a>
    <div class="board-item-main">
      <div class="board-item-content">
              <div class="movie-item-info">
        <p class="name"><a href="/films/267" title="泰坦尼克号" data-act="boarditem-click" data-val="{movieId:267}">泰坦尼克号</a></p>
        <p class="star">
                主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩
        </p>
<p class="releasetime">上映时间:1998-04-03</p>    </div>
    <div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">4</i></p>        
    </div>

      </div>
    </div>

                </dd>
                <dd>
                        <i class="board-index board-index-26">26</i>
    <a href="/films/899" title="当幸福来敲门" class="image-link" data-act="boarditem-click" data-val="{movieId:899}">
      <img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
      <img data-src="https://p0.meituan.net/moviemachine/e5daa8748733820faab91102bd0bc4507730353.jpg@160w_220h_1e_1c" alt="当幸福来敲门" class="board-img" />
    </a>
    <div class="board-item-main">
      <div class="board-item-content">
              <div class="movie-item-info">
        <p class="name"><a href="/films/899" title="当幸福来敲门" data-act="boarditem-click" data-val="{movieId:899}">当幸福来敲门</a></p>
        <p class="star">
                主演:威尔·史密斯,贾登·史密斯,坦迪·牛顿
        </p>
<p class="releasetime">上映时间:2008-01-17</p>    </div>
    <div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">3</i></p>        
    </div>

      </div>
    </div>

                </dd>
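
One possible explanation, which the issue itself does not confirm: with re.S, each lazy `.*?` in the pattern can run past a closing `</dd>`. If an earlier entry on the page lacks one of the fields (a score, for example), its match borrows that field from the following entry, and the following entry never appears as a result of its own. A minimal sketch of a way to rule this out is to isolate each `<dd>...</dd>` block first and extract fields only inside it. It assumes the markup quoted above; `parse_board`, `DD_BLOCK`, and `FIELDS` are hypothetical names.

```python
import re

# One match per <dd>...</dd> block, so no field can leak across entries.
DD_BLOCK = re.compile(r'<dd>(.*?)</dd>', re.S)

# Per-field patterns, each applied inside a single block only.
FIELDS = {
    'index': re.compile(r'<i class="board-index[^"]*">(.*?)</i>', re.S),
    'title': re.compile(r'title="(.*?)"'),
    'actor': re.compile(r'class="star">(.*?)</p>', re.S),
    'time': re.compile(r'class="releasetime">(.*?)</p>', re.S),
    'integer': re.compile(r'class="integer">(.*?)</i>', re.S),
    'fraction': re.compile(r'class="fraction">(.*?)</i>', re.S),
}


def parse_board(html):
    """Yield one dict per <dd> block; a field missing from a block becomes None."""
    for block in DD_BLOCK.findall(html):
        item = {}
        for name, field_pattern in FIELDS.items():
            match = field_pattern.search(block)
            item[name] = match.group(1).strip() if match else None
        yield item
```

If an entry on the saved page shows None for some field, the single combined pattern would have merged that entry with the next one, which would account for a missing neighbor.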

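Cannot run

I don't know why, but the copied code does not run on my machine. The debugger shows that the program never enters the main function and only runs through the loop.


Found the cause: accessing this site now requires verification (dragging a slider CAPTCHA).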

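The User-Agent value in headers needs to be changed

The User-Agent value in headers needs to be changed; running the code as-is gets blocked.
After some trial and error, the following value works:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}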
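
A small sketch of one way to apply such a header to every request, assuming a requests.Session is used instead of module-level requests.get calls. `make_session` is a hypothetical helper name, and whether this particular User-Agent still passes the site's checks is not something the sketch can guarantee.

```python
import requests

# User-Agent string reported to work in the issue above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36')


def make_session(user_agent=USER_AGENT):
    """Return a requests.Session whose requests all carry `user_agent`."""
    session = requests.Session()
    # Session-level headers are merged into every request sent through it.
    session.headers.update({'User-Agent': user_agent})
    return session
```

With this, session = make_session() followed by session.get(url) sends the header without repeating the headers dict at each call site, and changing the value later means editing one constant.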
