
geek_crawler's Introduction

geek_crawler

Geek Time (极客时间) recently ran a promotion in which a company could claim three free courses for every employee. Our company's leadership applied for the benefit on our behalf (if yours hasn't, ask them to help set it up; see the activity link).

The free courses are only valid for 30 days, and with regular work during weekdays there is no way to finish three courses in that time. So I wrote this script to automatically save every column course visible under an account to local files.

💥 This project is for learning and exchange only. Do not use it for any commercial purpose or in any way that harms the interests of others. 💥

How to Use

  1. Clone the repository locally

    git clone git@github.com:zhengxiaotian/geek_crawler.git
  2. Run the script directly from a terminal or from PyCharm (note: the code targets Python 3 and must be run with Python 3)

    # Install the one third-party dependency, requests, before the first run
    pip install requests
    python geek_crawler.py
  3. Enter your account and password

    E:\geek_crawler (master -> origin)
    λ python geek_crawler.py
    Enter your Geek Time account (phone number): *************
    Enter your Geek Time password: ************
  4. Crawling completes

    2020-04-28 19:32:41,624 - geek_crawler.py[line:307] - INFO: Requesting the article-info endpoint:
    2020-04-28 19:32:41,633 - geek_crawler.py[line:320] - INFO: Request parameters: {'id': 225554, 'include_neighbors': 'true', 'is_freelyread': 'true'}
    2020-04-28 19:32:42,047 - geek_crawler.py[line:349] - INFO: ----------------------------------------
    2020-04-28 19:32:47,131 - geek_crawler.py[line:478] - INFO: Crawl finished normally.

    Snipaste_2020-04-29_08-55-08.png

    PS: If an API error interrupts the crawl, check the corresponding error message in the log and simply rerun the script to continue. Articles that were already fetched are recorded in a local file and will not be downloaded again.
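The resume behaviour described in the note above can be sketched as a small local record of finished article ids; the file name and helper names here are illustrative, not the script's actual implementation:

```python
import os

RECORD_FILE = "downloaded.txt"  # hypothetical name for the local progress record

def load_downloaded(path=RECORD_FILE):
    """Return the set of article ids that were already saved locally."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_downloaded(article_id, path=RECORD_FILE):
    """Append one finished article id so a rerun can skip it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{article_id}\n")

# On a rerun, ids found in the record are skipped; new ones are appended.
done = load_downloaded()
for aid in ["1001", "1002"]:
    if aid in done:
        continue
    # ... fetch and save the article here ...
    mark_downloaded(aid)
```

Because the record is append-only, an interrupted run loses at most the article that was in flight.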

Results

Snipaste_2020-04-29_08-44-44.png

Snipaste_2020-04-28_19-31-52.png

Feature List

  • After you enter your account and password, every column (text + audio) visible under that account is saved locally;

  • Output can be saved as either Markdown or HTML documents;

  • Certain courses can be excluded from fetching (e.g. skip courses you already have);

  • Fetch only courses with specified names;

  • Save each article's comments to local files together with the body;

  • Download videos and save them as MP4 files;
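A hedged sketch of how these options might be wired up as configuration; the variable names mirror those that appear in the issues below but are otherwise assumptions about the script's internals:

```python
# Hypothetical configuration knobs; the names (exclude, file_type, get_comments)
# mirror the issue reports below but are illustrative only.
exclude = ["左耳听风"]   # course titles to skip, e.g. ones already downloaded
file_type = "md"         # "md" for Markdown output, "html" for HTML
get_comments = True      # also save each article's comments

# The main entry point would then receive them, e.g.:
# run(cellphone, pwd, exclude=exclude, get_comments=get_comments, file_type=file_type)
```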

geek_crawler's People

Contributors

zhengxiaotian


geek_crawler's Issues

Crawling fails for certain columns

Error message:
File "geek_crawler.py", line 483, in save_to_file
with open(file_path, 'w', encoding='utf-8') as f:
OSError: [Errno 22] Invalid argument: 'D:\0-git-time\geek_crawler-master\JavaScript核心原理解析\20 _ (0, eval)("x = 100") :一行让严格模式形同虚设的破坏性设计(上).md'

Could the author take a look?
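The OSError above comes from characters in the article title (here ':' and '"') that Windows forbids in file names. A minimal sketch of sanitizing the title before opening the file; the function name is illustrative, not part of the script:

```python
import re

def sanitize_filename(name: str) -> str:
    # Replace the characters Windows rejects in file names: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

title = '20 _ (0, eval)("x = 100") :一行让严格模式形同虚设的破坏性设计(上)'
filename = sanitize_filename(title) + '.md'
```

Calling this on every title before `open(file_path, 'w', ...)` would make the save step safe on Windows.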

Downloading videos

Is there any handling for downloading videos? Could you share it? Is the last item in the feature list actually implemented?

File extension is always .md

In the main function, the original call:
run(cellphone, pwd, exclude=exclude, get_comments=get_comments)
should be changed to:
run(cellphone, pwd, exclude=exclude, get_comments=get_comments, file_type=file_type)
so that file_type is actually passed through instead of silently defaulting to Markdown.

The returned article list never exceeds 100 items

When a column contains more than 100 articles, the script saves at most 100 of them.
Looking at the code, the list inside the response from `data = res.json().get('data', {})` in the `_articles` method holds at most 100 entries.
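A generic cursor-pagination loop can work around such a cap by re-requesting until a short page comes back. The `prev`/`size` parameter names are assumptions, not confirmed fields of the Geek Time API, so `fetch` below is a stand-in for the real request:

```python
def fetch_all(fetch, size=100):
    """Page through a capped list endpoint until a short page signals the end.
    `fetch(prev, size)` stands in for the real POST to the article-list API;
    the cursor/size names are hypothetical."""
    items, prev = [], 0
    while True:
        page = fetch(prev=prev, size=size)
        items.extend(page)
        if len(page) < size:
            return items
        prev = page[-1]["id"]  # resume after the last id seen

# Simulated endpoint with 250 articles, capped at `size` per response:
DATA = [{"id": i} for i in range(1, 251)]

def fake_fetch(prev, size):
    rest = [a for a in DATA if a["id"] > prev]
    return rest[:size]

articles = fetch_all(fake_fetch)  # three requests: 100 + 100 + 50 items
```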

Crawl error

Could the author take a look:

/Users/bo/PycharmProjects/pythonProject/main.py[line:550] - ERROR: An error occurred during the request; details: Traceback (most recent call last):
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 547, in <module>
run(cellphone, pwd, exclude=exclude, get_comments=get_comments)
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 513, in run
geek._article(aid, pro, file_type=file_type, get_comments=get_comments) # fetch a single article's info
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 341, in _article
self.save_to_file(
File "/Users/bo/PycharmProjects/pythonProject/main.py", line 449, in save_to_file
os.mkdir(dir_path)
FileNotFoundError: [Errno 2] No such file or directory: 'A/B测试从0到1'
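The course title 'A/B测试从0到1' contains a '/', so os.mkdir treats 'A' as a missing parent directory. A sketch of a safer directory helper (the function name is illustrative), replacing path separators and using os.makedirs:

```python
import os
import tempfile

def safe_course_dir(base: str, title: str) -> str:
    """Create the course directory even when the title contains path separators.
    Without this, 'A/B测试从0到1' makes os.mkdir look for a missing parent 'A'."""
    safe = title.replace("/", "_").replace("\\", "_")
    path = os.path.join(base, safe)
    os.makedirs(path, exist_ok=True)  # exist_ok also makes reruns harmless
    return path

base = tempfile.mkdtemp()
course_dir = safe_course_dir(base, "A/B测试从0到1")
```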

How do I fix the following error?

Requesting the login endpoint:
Request parameters: {'country': 86, 'cellphone': '*******', 'password': '********', 'captcha': '', 'remember': 1, 'platform': 3, 'appid': 1, 'source': ''}
An error occurred during the request; details: Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 686, in urlopen
self._prepare_proxy(conn)
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 952, in prepare_proxy
conn.connect()
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connection.py", line 389, in connect
self.sock = ssl_wrap_socket(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/util/ssl_.py", line 397, in ssl_wrap_socket
ssl_sock = context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 500, in wrap_socket
return self.sslsocket_class._create(
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1040, in _create
self.do_handshake()
File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/requests-2.24.0-py3.8.egg/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/connectionpool.py", line 745, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.8/site-packages/urllib3-1.26.0.dev0-py3.8.egg/urllib3/util/retry.py", line 474, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='account.geekbang.org', port=443): Max retries exceeded with url: /account/ticket/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123)')))
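The handshake fails because a certificate in the chain is self-signed, typically a corporate proxy intercepting TLS. One common fix is to point requests at a CA bundle that includes the proxy's root certificate; `REQUESTS_CA_BUNDLE` is a real environment variable that requests honours, while the bundle path below is a placeholder for your own exported file:

```python
import os

# Placeholder path: export the proxy's root CA certificate and point to it here.
# With REQUESTS_CA_BUNDLE set, the script's existing requests calls verify
# against this bundle instead of failing on the self-signed certificate.
ca_bundle = os.environ.setdefault("REQUESTS_CA_BUNDLE", "corp-root-ca.pem")
```

Alternatively, a `requests.Session` can set `session.verify = ca_bundle` explicitly for the same effect, scoped to one session.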

Downloading specified courses

I made a simple change on top of the original code to download only specified courses.
Change 1: reuse the existing exclude variable to hold the courses you want to download, around line 539:

 # repurpose exclude to hold the courses to crawl
    exclude = ['快速上手C++数据结构与算法']

Change 2: around line 297, change

if product.get('title', '')  in self.exclude:
to
if product.get('title', '')  not in self.exclude:
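The patch above turns exclude into an include-only whitelist, which loses the original exclude behaviour. A small helper (both knob names are illustrative) could support both modes without flipping the operator:

```python
def should_fetch(title, targets=None, exclude=None):
    """Whitelist mode wins when `targets` is given; otherwise fall back to
    the original blacklist behaviour. Both parameter names are illustrative."""
    if targets:                        # include-only mode
        return title in targets
    return title not in (exclude or [])

keep = should_fetch("快速上手C++数据结构与算法", targets=["快速上手C++数据结构与算法"])
skip = should_fetch("左耳听风", targets=["快速上手C++数据结构与算法"])
```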

Failure on Python 3.8

$ python geek_crawler.py
Traceback (most recent call last):
File "geek_crawler.py", line 12, in <module>
import requests
ModuleNotFoundError: No module named 'requests'

Some courses cannot be downloaded

I have 50-odd courses (only 3 are video courses; the rest are text), but only 20-odd of them download. What could be preventing the rest from downloading?
