
geek_crawler's Introduction

geek_crawler

Geek Time (极客时间) recently ran a promotion in which a company could claim three free courses for every employee. Our company's leadership applied for the benefit on our behalf (if yours hasn't, ask them to help set it up; see the activity link).

The free courses are only valid for 30 days, and with regular work during weekdays there is no way to finish three courses in that time. So I wrote this script to automatically save every column course visible under an account to local files.

💥 This project is for learning and exchange only. Do not use it for any commercial purpose or in any way that harms the interests of others. 💥

How to Use

  1. Clone the repository locally

    git clone git@github.com:zhengxiaotian/geek_crawler.git
  2. Run the script directly from a terminal or from PyCharm (note: the code targets Python 3 and must be run with Python 3)

    # Install the one third-party dependency, requests, before the first run
    pip install requests
    python geek_crawler.py
  3. Enter your account and password

    E:\geek_crawler (master -> origin)
    λ python geek_crawler.py
    Enter your Geek Time account (phone number): *************
    Enter your Geek Time password: ************
  4. Crawling completes

    2020-04-28 19:32:41,624 - geek_crawler.py[line:307] - INFO: Requesting the article-info endpoint:
    2020-04-28 19:32:41,633 - geek_crawler.py[line:320] - INFO: Request parameters: {'id': 225554, 'include_neighbors': 'true', 'is_freelyread': 'true'}
    2020-04-28 19:32:42,047 - geek_crawler.py[line:349] - INFO: ----------------------------------------
    2020-04-28 19:32:47,131 - geek_crawler.py[line:478] - INFO: Crawl finished normally.

    Snipaste_2020-04-29_08-55-08.png

    PS: If an API error interrupts the crawl, check the corresponding error message in the log and simply rerun the script to continue. Articles that were already fetched are recorded in a local file and will not be downloaded again.
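The resume behaviour described in the note above can be sketched as a small local record of finished article ids; the file name and helper names here are illustrative, not the script's actual implementation:

```python
import os

RECORD_FILE = "downloaded.txt"  # hypothetical name for the local progress record

def load_downloaded(path=RECORD_FILE):
    """Return the set of article ids that were already saved locally."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_downloaded(article_id, path=RECORD_FILE):
    """Append one finished article id so a rerun can skip it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{article_id}\n")

# On a rerun, ids found in the record are skipped; new ones are appended.
done = load_downloaded()
for aid in ["1001", "1002"]:
    if aid in done:
        continue
    # ... fetch and save the article here ...
    mark_downloaded(aid)
```

Because the record is append-only, an interrupted run loses at most the article that was in flight.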

Results

Snipaste_2020-04-29_08-44-44.png

Snipaste_2020-04-28_19-31-52.png

Feature List

  • After you enter your account and password, every column (text + audio) visible under that account is saved locally;

  • Output can be saved as either Markdown or HTML documents;

  • Certain courses can be excluded from fetching (e.g. skip courses you already have);

  • Fetch only courses with specified names;

  • Save each article's comments to local files together with the body;

  • Download videos and save them as MP4 files;
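A hedged sketch of how these options might be wired up as configuration; the variable names mirror those that appear in the issues below but are otherwise assumptions about the script's internals:

```python
# Hypothetical configuration knobs; the names (exclude, file_type, get_comments)
# mirror the issue reports below but are illustrative only.
exclude = ["左耳听风"]   # course titles to skip, e.g. ones already downloaded
file_type = "md"         # "md" for Markdown output, "html" for HTML
get_comments = True      # also save each article's comments

# The main entry point would then receive them, e.g.:
# run(cellphone, pwd, exclude=exclude, get_comments=get_comments, file_type=file_type)
```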

geek_crawler's People

Contributors

zhengxiaotian


geek_crawler's Issues

Crawling fails for certain columns

Error message:
File "geek_crawler.py", line 483, in save_to_file
with open(file_path, 'w', encoding='utf-8') as f:
OSError: [Errno 22] Invalid argument: 'D:\0-git-time\geek_crawler-master\JavaScript核心原理解析\20 _ (0, eval)("x = 100") :一行让严格模式形同虚设的破坏性设计(上).md'

Could the author take a look?
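The OSError above comes from characters in the article title (here ':' and '"') that Windows forbids in file names. A minimal sketch of sanitizing the title before opening the file; the function name is illustrative, not part of the script:

```python
import re

def sanitize_filename(name: str) -> str:
    # Replace the characters Windows rejects in file names: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

title = '20 _ (0, eval)("x = 100") :一行让严格模式形同虚设的破坏性设计(上)'
filename = sanitize_filename(title) + '.md'
```

Calling this on every title before `open(file_path, 'w', ...)` would make the save step safe on Windows.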

Downloading videos

Is there any handling for downloading videos? Could you share it? Is the last item in the feature list actually implemented?

File extension is always .md

In the main function, the original call:
run(cellphone, pwd, exclude=exclude, get_comments=get_comments)
should be changed to:
run(cellphone, pwd, exclude=exclude, get_comments=get_comments, file_type=file_type)
so that file_type is actually passed through instead of silently defaulting to Markdown.

The returned article list never exceeds 100 items

When a column contains more than 100 articles, the script saves at most 100 of them.
Looking at the code, the list inside the response from `data = res.json().get('data', {})` in the `_articles` method holds at most 100 entries.
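A generic cursor-pagination loop can work around such a cap by re-requesting until a short page comes back. The `prev`/`size` parameter names are assumptions, not confirmed fields of the Geek Time API, so `fetch` below is a stand-in for the real request:

```python
def fetch_all(fetch, size=100):
    """Page through a capped list endpoint until a short page signals the end.
    `fetch(prev, size)` stands in for the real POST to the article-list API;
    the cursor/size names are hypothetical."""
    items, prev = [], 0
    while True:
        page = fetch(prev=prev, size=size)
        items.extend(page)
        if len(page) < size:
            return items
        prev = page[-1]["id"]  # resume after the last id seen

# Simulated endpoint with 250 articles, capped at `size` per response:
DATA = [{"id": i} for i in range(1, 251)]

def fake_fetch(prev, size):
    rest = [a for a in DATA if a["id"] > prev]
    return rest[:size]

articles = fetch_all(fake_fetch)  # three requests: 100 + 100 + 50 items
```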

Crawl error

Could the author take a look:

/Users/bo/PycharmProjects/pythonProject/main.py[line:550] - ERROR: An error occurred during the request; details: Traceback (most recent call last):
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 547, in <module>
run(cellphone, pwd, exclude=exclude, get_comments=get_comments)
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 513, in run
geek._article(aid, pro, file_type=file_type, get_comments=get_comments) # fetch a single article's info
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 341, in _article
self.save_to_file(
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 449, in save_to_file
os.mkdir(dir_path)
FileNotFoundError: [Errno 2] No such file or directory: 'A/B测试从0到1'
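The course title 'A/B测试从0到1' contains a '/', so os.mkdir treats 'A' as a missing parent directory. A sketch of a safer directory helper (the function name is illustrative), replacing path separators and using os.makedirs:

```python
import os
import tempfile

def safe_course_dir(base: str, title: str) -> str:
    """Create the course directory even when the title contains path separators.
    Without this, 'A/B测试从0到1' makes os.mkdir look for a missing parent 'A'."""
    safe = title.replace("/", "_").replace("\\", "_")
    path = os.path.join(base, safe)
    os.makedirs(path, exist_ok=True)  # exist_ok also makes reruns harmless
    return path

base = tempfile.mkdtemp()
course_dir = safe_course_dir(base, "A/B测试从0到1")
```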

How do I fix the following error?

Requesting the login endpoint:
Request parameters: {'country': 86, 'cellphone': '*******', 'password': '********', 'captcha': '', 'remember': 1, 'platform': 3, 'appid': 1, 'source': ''}
An error occurred during the request; details: Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 686, in urlopen
self._prepare_proxy(conn)
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 952, in prepare_proxy
conn.connect()
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connection.py", line 389, in connect
self.sock = ssl_wrap_socket(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/util/ssl_.py", line 397, in ssl_wrap_socket
ssl_sock = context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1040, in _create
self.do_handshake()
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 745, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/util/retry.py", line 474, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='account.geekbang.org', port=443): Max retries exceeded with url: /account/ticket/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)')))
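The handshake fails because a certificate in the chain is self-signed, typically a corporate proxy intercepting TLS. One common fix is to point requests at a CA bundle that includes the proxy's root certificate; `REQUESTS_CA_BUNDLE` is a real environment variable that requests honours, while the bundle path below is a placeholder for your own exported file:

```python
import os

# Placeholder path: export the proxy's root CA certificate and point to it here.
# With REQUESTS_CA_BUNDLE set, the script's existing requests calls verify
# against this bundle instead of failing on the self-signed certificate.
ca_bundle = os.environ.setdefault("REQUESTS_CA_BUNDLE", "corp-root-ca.pem")
```

Alternatively, a `requests.Session` can set `session.verify = ca_bundle` explicitly for the same effect, scoped to one session.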

Downloading specified courses

I made a simple change on top of the original code to download only specified courses.
Change 1: reuse the existing exclude variable to hold the courses you want to download, around line 539:

 # repurpose exclude to hold the courses to crawl
    exclude = ['快速上手C++数据结构与算法']

Change 2: around line 297, change

if product.get('title', '')  in self.exclude:
to
if product.get('title', '')  not in self.exclude:
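The patch above turns exclude into an include-only whitelist, which loses the original exclude behaviour. A small helper (both knob names are illustrative) could support both modes without flipping the operator:

```python
def should_fetch(title, targets=None, exclude=None):
    """Whitelist mode wins when `targets` is given; otherwise fall back to
    the original blacklist behaviour. Both parameter names are illustrative."""
    if targets:                        # include-only mode
        return title in targets
    return title not in (exclude or [])

keep = should_fetch("快速上手C++数据结构与算法", targets=["快速上手C++数据结构与算法"])
skip = should_fetch("左耳听风", targets=["快速上手C++数据结构与算法"])
```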

Failure on Python 3.8

$ python geek_crawler.py
Traceback (most recent call last):
File "geek_crawler.py", line 12, in <module>
import requests
ModuleNotFoundError: No module named 'requests'

Some courses cannot be downloaded

I have 50-odd courses (only 3 are video courses; the rest are text), but only 20-odd of them download. What could be preventing the rest from downloading?
