nanmicoder / mediacrawler Goto Github PK

小红书笔记 | 评论爬虫、抖音视频 | 评论爬虫、快手视频 | 评论爬虫、B 站视频｜评论爬虫、微博帖子｜评论爬虫

License: Other

Python 100.00%

mediacrawler's Introduction

免责声明：

大家请以学习为目的使用本仓库，爬虫违法违规的案件：https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China

本仓库的所有内容仅供学习和参考之用，禁止用于商业用途。任何人或组织不得将本仓库的内容用于非法用途或侵犯他人合法权益。本仓库所涉及的爬虫技术仅用于学习和研究，不得用于对其他平台进行大规模爬虫或其他非法行为。对于因使用本仓库内容而引起的任何法律责任，本仓库不承担任何责任。使用本仓库的内容即表示您同意本免责声明的所有条款和条件。

点击查看更为详细的免责声明。点击跳转

仓库描述

小红书爬虫，抖音爬虫， 快手爬虫， B站爬虫， 微博爬虫...。
目前能抓取小红书、抖音、快手、B站、微博的视频、图片、评论、点赞、转发等信息。

原理：利用playwright搭桥，保留登录成功后的上下文浏览器环境，通过执行JS表达式获取一些加密参数通过使用此方式，免去了复现核心加密JS代码，逆向难度大大降低

功能列表

平台	关键词搜索	指定帖子ID爬取	二级评论	指定创作者主页	登录态缓存	IP代理池	生成评论词云图
小红书	✅	✅	✅	✅	✅	✅	✅
抖音	✅	✅	✅	✅	✅	✅	✅
快手	✅	✅	✅	✅	✅	✅	✅
B 站	✅	✅	✅	✅	✅	✅	✅
微博	✅	✅	❌	❌	✅	✅	✅

使用方法

创建并激活 python 虚拟环境

# 进入项目根目录
cd MediaCrawler

# 创建虚拟环境
# 注意python 版本需要3.7 - 3.9 高于该版本可能会出现一些依赖包兼容问题
python -m venv venv

# macos & linux 激活虚拟环境
source venv/bin/activate

# windows 激活虚拟环境
venv\Scripts\activate

安装依赖库

pip install -r requirements.txt

安装 playwright浏览器驱动

playwright install

运行爬虫程序

### 项目默认是没有开启评论爬取模式，如需评论请在config/base_config.py中的 ENABLE_GET_COMMENTS 变量修改
### 一些其他支持项，也可以在config/base_config.py查看功能，写的有中文注释

# 从配置文件中读取关键词搜索相关的帖子并爬取帖子信息与评论
python main.py --platform xhs --lt qrcode --type search

# 从配置文件中读取指定的帖子ID列表获取指定帖子的信息与评论信息
python main.py --platform xhs --lt qrcode --type detail

# 打开对应APP扫二维码登录
  
# 其他平台爬虫使用示例，执行下面的命令查看
python main.py --help

数据保存

支持保存到关系型数据库（Mysql、PgSQL等）
- 执行 python db.py 初始化数据库数据库表结构（只在首次执行）
支持保存到csv中（data/目录下）
支持保存到json中（data/目录下）

开发者服务

知识星球：沉淀高质量常见问题、最佳实践文档、多年编程+爬虫经验分享，提供付费知识星球服务，主动提问，作者会定期回答问题 (每天 1 快钱订阅我的知识服务)

星球精选文章：
MediaCrawler视频课程：

如果你想很快入门这个项目，或者想了具体实现原理，我推荐你看看这个视频课程，从设计出发一步步带你如何使用，门槛大大降低，同时也是对我开源的支持，如果你能支持我的课程，我将会非常开心～
课程售价非常非常的便宜，几杯咖啡的事儿.
课程介绍飞书文档链接：https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh

感谢下列Sponsors对本仓库赞助

感谢 JetBrains 对本项目的支持！

- 通过注册这个款免费的GPT助手，帮我获取GPT4额度作为支持。也是我每天在用的一款chrome AI助手插件

成为赞助者，展示你的产品在这里，联系作者：[email protected]

MediaCrawler爬虫项目交流群：

扫描下方我的个人微信，备注：github，拉你进MediaCrawler项目交流群(请一定备注：github，会有wx小助手自动拉群)

如果图片展示不出来，可以直接添加我的微信号：yzglan

运行报错常见问题Q&A

遇到问题先自行搜索解决下，现在AI很火，用ChatGPT大多情况下能解决你的问题免费的ChatGPT

➡️➡️➡️ 常见问题

dy和xhs使用Playwright登录现在会出现滑块验证 + 短信验证，手动过一下

项目代码结构

➡️➡️➡️ 项目代码结构说明

代理IP使用说明

➡️➡️➡️ 代理IP使用说明

词云图相关操作说明

➡️➡️➡️ 词云图相关说明

手机号登录说明

➡️➡️➡️ 手机号登录说明

打赏

免费开源不易，如果项目帮到你了，可以给我打赏哦，您的支持就是我最大的动力！

爬虫入门课程

我新开的爬虫教程Github仓库 CrawlerTutorial ，感兴趣的朋友可以关注一下，持续更新，主打一个免费.

项目贡献者

感谢你们的贡献，让项目变得更好！（贡献比较多的可以加我wx，免费拉你进我的知识星球，后期还有一些其他福利。）

_{程序员阿江-Relakkes}	_leantli	_Rosyrain	_{Bao Zhuhan}	_zhounan	_HIRO
_PeanutSplash	_Ermeng	_{Henry He}	_{leonardoqiuyu}	_jayeeliu	_ZuWard
_Zendrix	_{zhangzhenpeng}	_{Sam Tan}	_xbsheng	_Martin	_zhihuiio
_Ren	_{Wang Tianci}	_Styunlen	_Schofi	_Klu5ure	_Kermit
_KEXNA	_{Jian Chang}	_tianqing

star 趋势图

如果该项目对你有帮助，star一下 ❤️❤️❤️

参考

xhs客户端 ReaJason的xhs仓库
短信转发参考仓库
内网穿透工具 ngrok

免责声明

1. 项目目的与性质

本项目（以下简称“本项目”）是作为一个技术研究与学习工具而创建的，旨在探索和学习网络数据采集技术。本项目专注于自媒体平台的数据爬取技术研究，旨在提供给学习者和研究者作为技术交流之用。

2. 法律合规性声明

本项目开发者（以下简称“开发者”）郑重提醒用户在下载、安装和使用本项目时，严格遵守中华人民共和国相关法律法规，包括但不限于《中华人民共和国网络安全法》、《中华人民共和国反间谍法》等所有适用的国家法律和政策。用户应自行承担一切因使用本项目而可能引起的法律责任。

3. 使用目的限制

本项目严禁用于任何非法目的或非学习、非研究的商业行为。本项目不得用于任何形式的非法侵入他人计算机系统，不得用于任何侵犯他人知识产权或其他合法权益的行为。用户应保证其使用本项目的目的纯属个人学习和技术研究，不得用于任何形式的非法活动。

4. 免责声明

开发者已尽最大努力确保本项目的正当性及安全性，但不对用户使用本项目可能引起的任何形式的直接或间接损失承担责任。包括但不限于由于使用本项目而导致的任何数据丢失、设备损坏、法律诉讼等。

5. 知识产权声明

本项目的知识产权归开发者所有。本项目受到著作权法和国际著作权条约以及其他知识产权法律和条约的保护。用户在遵守本声明及相关法律法规的前提下，可以下载和使用本项目。

6. 最终解释权

关于本项目的最终解释权归开发者所有。开发者保留随时更改或更新本免责声明的权利，恕不另行通知。

mediacrawler's People

Contributors

Stargazers

Watchers

Forkers

fairyworld supervipcard helioscanlin zenghj chenpython wangzhiyuanawe wjw136 bigfacecat2017 garyrandom annoymousbear lixiaolevae smart-xing666 dogwars chiwenheng kxianghui minry gptpage alawnchen xuelainiao linfangzhi laokpa gnose gaochaozhu clickear xtuyaowu allison199005 wobangnidashui kekewind leizeng redraiment leonhan01 dongshixiaohehe crewcutbro tanpenggood-fork msliu98 huangxianliang1985 zuoyou6 caffeecoffee imwangxm ccbond boyyongxin sunqi6734 github-yxb moyun712 1614229511 mmxff yangrq1018 yacey gongsong 29745560 lyong9102 jayeeliu jacktou maserliumz abu-outis etongle-open bpm108 foursking renaissancezyc xseven0908 imperialpalace lin657542270 wanghaisheng yarntime pbuff07 elvisvern bumynm701 suiiiyiiing coder1943 being1943 seedx194 18171404632 stanleymr qurel conquener skipgip himoko karakong ganjunhong aturret zhanfish colornote zhouayi lu-huibin painterv arreow cxy-csx licesun roronoaye glhouse hongyukeji viviki00 jadegeek eadwin solmyr118 wangdong2023 iuvluyonghao liduos antionyliu kugarliyifan

mediacrawler's Issues

获取着获取着会卡住，不知道为什么，另外爬取数据会写到指定文件吗？

很好用，谢谢大佬～期待贡献

报错信息

爬取抖音评论报错： MediaCrawler ERROR aweme_id: xxx get comments failed, error: Expecting value: line 1 column 1 (char 0)
应该如何解决

获取到的笔记图片是有水印的，如何解析拉取到的数据呢

你好，
我获取到的数据地址是这样的：https://sns-img-bd.xhscdn.com/cebf4761-7ec0-9911-fbdf-77fd3cb01c31
这是一张图，如何把解析成图片呢？

我尝试下载下来，添加.jpg后缀，可以查看到图片。

另外，
获取到的笔记图片是有水印的，还是我中间有些操作不正确，导致水印没有去除？

望回复。。

报错Full list of missing libraries:

File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 44, in start
self.browser_context = await self.launch_browser(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 184, in launch_browser
browser_context = await chromium.launch_persistent_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\async_api_generated.py", line 14727, in launch_persistent_context
await self._impl_obj.launch_persistent_context(
File "e:\miniconda3\Lib\site-packages\playwright_impl_browser_type.py", line 155, in launch_persistent_context
from_channel(await self._channel.send("launchPersistentContext", params)),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 61, in send
return await self._connection.wrap_api_call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 461, in wrap_api_call
return await cb()
^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 96, in inner_send
result = next(iter(done)).result()
^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Host system is missing dependencies!

Full list of missing libraries:
chrome_elf.dll

有接口能通过抖音评论id，查询评论详情吗

咨询，试了一下 dy 的抓取，看了下数据库内容，好像没有抓到那种评论里的回复。为了表达清楚，我截个图示意一下，谢谢。

如下图，我指这样的评论里的回复，爬取数据里没有爬这里面的内容，对嘛？

无法抓取视频评论

看到3周前其他issue里也有同样问题，作者回答解决了但现在还有相同报错，换了多个关键词测试结果相同，报错如下：

2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7241024491999022392 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7257898852668296485 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7246635327694179623 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7042503193409880590 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7256294498760674612 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7229990392005922081 get comments failed, error: Expecting value: l
ine 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler INFO Douyin Crawler finished ...

大佬,有讨论群吗?我太小白了,readme对我而言太简洁了

请问有方法可以爬取标签以及对应的热度吗？

大佬你好，

想请问一下有没有可能把 #标签这类的话题浏览次数爬下来啊？类似下图这种

码农高天的一期视频用的你们的项目，如何评价他说的问题？

视频链接：【【Code Review】传参的时候有这么多细节要考虑？冗余循环变量你也写过么？-哔哩哔哩】 https://b23.tv/LdSoAJP

playwright._impl._api_types.Error: Browser closed.

File "e:\miniconda3\Lib\site-packages\playwright\async_api_generated.py", line 14727, in launch_persistent_context
await self._impl_obj.launch_persistent_context(
File "e:\miniconda3\Lib\site-packages\playwright_impl_browser_type.py", line 155, in launch_persistent_context
from_channel(await self._channel.send("launchPersistentContext", params)),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 61, in send
return await self._connection.wrap_api_call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 461, in wrap_api_call
return await cb()
^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright_impl_connection.py", line 96, in inner_send
result = next(iter(done)).result()
^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Browser closed.
==================== Browser output: ====================
C:\Users\xiazhiqiang\AppData\Local\ms-playwright\chromium-1060\chrome-win\chrome.exe --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\browser_data\xhs_user_data_dir --remote-debugging-pipe about:blank
pid=6572
[pid=6572]
[pid=6572] starting temporary directories cleanup
=========================== logs ===========================
C:\Users\xiazhiqiang\AppData\Local\ms-playwright\chromium-1060\chrome-win\chrome.exe --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\browser_data\xhs_user_data_dir --remote-debugging-pipe about:blank
pid=6572
[pid=6572]
[pid=6572] starting temporary directories cleanup

关于redis

大佬，请问一下，抖音评论爬取是不是还不可以通过redis存储到数据库中？
能麻烦加一个windows 的 redis配置过程吗。
please😜

部分笔记在web端会出现-510001，"当前内容无法展示"，但是实际在app端可以查看，这种情况可以获取吗？

请问这个爬取的数据可以直接导出为csv吗？因为我在导到数据库中时遇到了挺多问题

以下是遇到的一些问题：

asyncmy.errors.OperationalError: (1054, "Unknown column 'nickname' in 'field list'")

tortoise.exceptions.OperationalError: (1054, "Unknown column 'add_ts' in 'field list'")
………………一些字段缺失（我在sql内补了一些）

tortoise.exceptions.OperationalError: (1054, "Unknown column 'image_list' in 'field list'")
（有一些python的list、dict类型的我不知道在sqln内需要设置成什么）

tortoise.exceptions.OperationalError: (1366, "Incorrect string value: '\xF0\x9F\x8C\xB0' for column 'nickname' at row 1")（我去调整了sql的collation为“utf-8_general_ci”）

asyncmy.errors.DataError: (1406, "Data too long for column 'avatar' at row 1")

tortoise.exceptions.OperationalError: (1406, "Data too long for column 'avatar' at row 1")

Ubuntu 环境下，QR code扫描不出来

在linux系统下，QR code显示之后，扫描完全没有反映。

dy登录失败，没有弹出二维码

@NanmiCoder @tanpenggood

C:\Users\caps\.vitualenvs\crawler\Scripts\python.exe main.py --platform dy --lt qrcode 
2023-07-26  22:58:50 MediaCrawler ERROR login dialog box does not pop up automatically, error: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================ 
2023-07-26  22:58:50 MediaCrawler INFO login dialog box does not pop up automatically, we will manually click the login button 
Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 90, in popup_login_dialog
    await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 10)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 8266, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_page.py", line 368, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 322, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 47, in <module>
    asyncio.run(main())
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 42, in main
    await crawler.start()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\core.py", line 62, in start
    await login_obj.begin()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 45, in begin
    await self.popup_login_dialog()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 95, in popup_login_dialog
    await login_button_ele.click()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 15419, in click
    await self._impl_obj.click(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_locator.py", line 160, in click
    return await self._frame.click(self._selector, strict=True, **params)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 489, in click
    await self._channel.send("click", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//p[text() = '登录']")
  locator resolved to <p class="lqiPv8cB">登录</p>
attempting click action
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #1
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #2
  waiting 20ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #3
  waiting 100ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #4
  waiting 100ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #5
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #6
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #7
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #8
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #9
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #10
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #11
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #12
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #13
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #14
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #15
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #16
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #17
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #18
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #19
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #20
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #21
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #22
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #23
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #24
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #25
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #26
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #27
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #28
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #29
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #30
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #31
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #32
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #33
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #34
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #35
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #36
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #37
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #38
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #39
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #40
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #41
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #42
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #43
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #44
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #45
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #46
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #47
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #48
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #49
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #50
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #51
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #52
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #53
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #54
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #55
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #56
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #57
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #58
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #59
  waiting 500ms
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #60
  waiting 500ms
============================================================

进程已结束,退出代码1

抖音无法登陆

dy不管使用哪种方法登录都同样报错,xhs正常

好像是playwright的问题,我这边用playwright无法打开抖音首页,把index_url换成www.douyin.com/discover
就正常了

二维码登录

二维码登录不可用，需要滑块验证了
<Page url='https://www.xiaohongshu.com/website-login/captcha?redirectPath=>

请问短时间内同一个账号能获取多少数据

感谢作者的repo，很好用！

我在获取的时候同一个账号在短时间内没法获取太多的数据，换IP也不行，请问作者存在这种情况吗，需要怎么样解决。

抖音二维码登录问题！

你好：
试了一下抖音二维码登录，也是需要短信验证？

为什么运行后没有反应，也没有报错

Proxies Issues

这个是必须要给proxy吗？

dy数据爬取失败

qrcode与cookie模式皆报错，报错信息如图

小红书扫码成功后报错

小红书扫码后报错

如下：
2023-07-08 18:04:13 root INFO Begin login xiaohongshu by qrcode ...
2023-07-08 18:04:23 root INFO waiting for scan code login, remaining time is 20s
Traceback (most recent call last):
File "/Users/username/work/github_test/MediaCrawler/main.py", line 58, in
asyncio.run(main())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/Users/username/work/github_test/MediaCrawler/main.py", line 39, in main
await crawler.start()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/core.py", line 82, in start
await login_obj.begin()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 48, in begin
await self.login_by_qrcode()
File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 155, in login_by_qrcode
login_flag: bool = await self.check_login_state(no_logged_in_session)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 88, in async_wrapped
return await fn(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 47, in call
do = self.iter(retry_state=retry_state)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/init.py", line 326, in iter
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1271e2ef0 state=finished returned bool>]

python main.py --platform dy --lt qrcode 失败

执行命令 python main.py --platform dy --lt qrcode
其中 --lt 后面能跟的参数都尝试了，结果总是错误：
Traceback (most recent call last):
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 51, in
asyncio.get_event_loop().run_until_complete(main())
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 45, in main
await crawler.start()
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 66, in start
await self.search()
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 79, in search
posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 129, in search_info_by_keyword
return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 78, in get
await self.__process_req_params(params, headers)
File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 56, in __process_req_params
"webid": douyin_js_obj.call("get_web_id"),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_abstract_runtime_context.py", line 37, in call
return self._call(name, *args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 92, in _call
return self.eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 78, in eval
return self.exec(code)
^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_abstract_runtime_context.py", line 18, in exec
return self.exec(source)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 88, in exec
return self._extract_result(output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs_external_runtime.py", line 167, in _extract_result
raise ProgramError(value)
execjs._exceptions.ProgramError: SyntaxError: 缺少 ';'

sql插入失败

xhs.model里
"title": note_item.get("title") or note_item.get("desc", "")
有的只有desc没有title,就会导致title字符串过长溢出导致插入失败

请问能获取最新的数据吗？

就是模拟输入关键词后搜索最新的相关关键词数据

我在使用cookie登录抖音后遇到了execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.的错误

详细的报错信息如下：(base) yyyy:~/Union/MediaCrawler$ python main.py --platform dy --lt cookie
/yyyy/MediaCrawler/main.py:51: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
2023-08-11 15:08:42 MediaCrawler INFO Begin login douyin by cookie ...
2023-08-11 15:08:48 MediaCrawler INFO login finished then check login state ...
2023-08-11 15:08:48 MediaCrawler INFO Login successful then wait for 5 seconds redirect ...
2023-08-11 15:08:53 MediaCrawler INFO Begin search douyin keywords
2023-08-11 15:08:53 MediaCrawler INFO Current keyword: 健身
Traceback (most recent call last):
File "/yyyy/MediaCrawler/main.py", line 51, in
asyncio.get_event_loop().run_until_complete(main())
File "/yyyy/anaconda3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/yyyy/MediaCrawler/main.py", line 45, in main
await crawler.start()
File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 66, in start
await self.search()
File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 79, in search
posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 129, in search_info_by_keyword
return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 78, in get
await self.__process_req_params(params, headers)
File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 38, in __process_req_params
douyin_js_obj = execjs.compile(open('libs/douyin.js').read())
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/init.py", line 61, in compile
return get().compile(source, cwd)
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 21, in get
return get_from_environment() or _find_available_runtime()
File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.

登录已过期

請問大佬, 我使用掃碼登入或者cookie登入
可以順利登入
不過當開始抓取不到一組資料的時候
就會直接被登出顯示以下資訊
media_platform.xhs.exception.DataFetchError: 登录已

是為什麼呢？

使用cookies是否可以不重复登录

您好，在爬取小红书时，我尝试在第一次中使用QRcode登录并且获得到cookies，后续尝试使用获得到的cookies免登陆但是失败了，请问是我的操作有问题，还是免登录的上下文环境不止cookies呢，或者是其他原因？

抖音的爬虫只有爬取评论这些吗？

介绍上写的是可以爬取抖音视频，我看一下源码，发现没有下载视频的代码

二维码扫描登录show方法不关闭图片程序会卡住，不再往下执行

tools.utils.show_qrcode()

def show_qrcode(qr_code: str):
    """parse base64 encode qrcode image and show it"""
    qr_code = qr_code.split(",")[1]
    qr_code = base64.b64decode(qr_code)
    image = Image.open(BytesIO(qr_code))

    # Add a square border around the QR code and display it within the border to improve scanning accuracy.
    width, height = image.size
    new_image = Image.new('RGB', (width + 20, height + 20), color=(255, 255, 255))
    new_image.paste(image, (10, 10))
    draw = ImageDraw.Draw(new_image)
    draw.rectangle((0, 0, width + 19, height + 19), outline=(0, 0, 0), width=1)
    new_image.show()

login_flag: bool = await self.check_login_state(no_logged_in_session)
        if not login_flag:
            # wait 2s
            # login_flag: bool = await self.check_login_state(no_logged_in_session)

大佬可以指定关键词或者指定达人吗

下载的内容都是首页的不受指定的内容，可以指定吗

有偿寻求技术支持

关于js逆向，混淆函数等微信 zl47895462

数据输出卡住

运行之后发现只能输出大概200+的数据,然后就不在输出了,主题和评论加起来200+,这是小红书限制了还是其他什么原因呢

爬取能正常运行，但是在爬取评论时一条评论信息也爬取不到

报错信息如下 MediaCrawler ERROR aweme_id: 7266050530072481076 get comments failed, error: Expecting value: line 1 column 1 (char 0),
按理说即使又抖音的反爬取机制，但是也应该有一两条数据，但是一条也没有，以下是我修改过后的保存到本地csv的代码：
import json
from typing import Dict, List

from tortoise import fields
from tortoise.models import Model
import os
import config
from tools import utils
import pandas as pd

class DouyinBaseModel(Model):
id = fields.IntField(pk=True, autoincrement=True, description="自增ID")
user_id = fields.CharField(null=True, max_length=64, description="用户ID")
sec_uid = fields.CharField(null=True, max_length=128, description="用户sec_uid")
short_user_id = fields.CharField(null=True, max_length=64, description="用户短ID")
user_unique_id = fields.CharField(null=True, max_length=64, description="用户唯一ID")
nickname = fields.CharField(null=True, max_length=64, description="用户昵称")
avatar = fields.CharField(null=True, max_length=255, description="用户头像地址")
user_signature = fields.CharField(null=True, max_length=500, description="用户签名")
ip_location = fields.CharField(null=True, max_length=255, description="评论时的IP地址")
add_ts = fields.BigIntField(description="记录添加时间戳")
last_modify_ts = fields.BigIntField(description="记录最后修改时间戳")

class Meta:
    abstract = True

class DouyinAweme(DouyinBaseModel):
aweme_id = fields.CharField(max_length=64, index=True, description="视频ID")
aweme_type = fields.CharField(max_length=16, description="视频类型")
title = fields.CharField(null=True, max_length=500, description="视频标题")
desc = fields.TextField(null=True, description="视频描述")
create_time = fields.BigIntField(description="视频发布时间戳", index=True)
liked_count = fields.CharField(null=True, max_length=16, description="视频点赞数")
comment_count = fields.CharField(null=True, max_length=16, description="视频评论数")
share_count = fields.CharField(null=True, max_length=16, description="视频分享数")
collected_count = fields.CharField(null=True, max_length=16, description="视频收藏数")

class Meta:
    table = "douyin_aweme"
    table_description = "抖音视频"

def __str__(self):
    return f"{self.aweme_id} - {self.title}"

def save_data_to_excel(data: Dict, sheet_name: str):
file_path = 'D:\douyin.xlsx'
if not os.path.exists(file_path):
df = pd.DataFrame(columns=list(data.keys()))
df.to_excel(file_path, sheet_name=sheet_name,index=False, engine='openpyxl')
else:
with pd.ExcelFile(file_path) as xls:

        df_old = pd.read_excel(xls, sheet_name=sheet_name, engine='openpyxl')

        # 使用 pd.concat 替代 append 方法
        df_new = pd.DataFrame([data])
        df_combined = pd.concat([df_old, df_new], ignore_index=True)

        df_combined.to_excel(file_path, sheet_name=sheet_name, index=False, engine='openpyxl')

async def save_aweme_to_excel(aweme_data: Dict):
save_data_to_excel(aweme_data, "aweme")

async def save_comment_to_excel(comment_data: Dict):
save_data_to_excel(comment_data, "comments")

async def save_aweme_to_excel(aweme_data: Dict):

file_path = 'D:\douyin.xlsx'

if not os.path.exists(file_path):

raise Exception(f"File not found: {file_path}")

if not os.path.exists(file_path):

df = pd.DataFrame(columns=list(aweme_data.keys()))

df.to_excel(file_path, sheet_name='aweme', index=False, engine='openpyxl')

else:

df = pd.read_excel(file_path, sheet_name='aweme', engine='openpyxl')

df = df.append(aweme_data, ignore_index=True)

df.to_excel(file_path, sheet_name='aweme', index=False, engine='openpyxl')

async def save_comment_to_excel(comment_data: Dict):

file_path = 'D:\douyin.xlsx'

if not os.path.exists(file_path):

raise Exception(f"File not found: {file_path}")

if not os.path.exists(file_path):

df = pd.DataFrame(columns=list(comment_data.keys()))

df.to_excel(file_path, sheet_name='comments', index=False, engine='openpyxl')

else:

df = pd.read_excel(file_path, sheet_name='comments', engine='openpyxl')

df = df.append(comment_data, ignore_index=True)

df.to_excel(file_path, sheet_name='comments', index=False, engine='openpyxl')

class DouyinAwemeComment(DouyinBaseModel):
comment_id = fields.CharField(max_length=64, index=True, description="评论ID")
aweme_id = fields.CharField(max_length=64, index=True, description="视频ID")
content = fields.TextField(null=True, description="评论内容")
create_time = fields.BigIntField(description="评论时间戳")
sub_comment_count = fields.CharField(max_length=16, description="评论回复数")

class Meta:
    table = "douyin_aweme_comment"
    table_description = "抖音视频评论"

def __str__(self):
    return f"{self.comment_id} - {self.content}"

async def update_douyin_aweme(aweme_item: Dict):
aweme_id = aweme_item.get("aweme_id")
user_info = aweme_item.get("author", {})
interact_info = aweme_item.get("statistics", {})
local_db_item = {
"aweme_id": aweme_id,
"aweme_type": aweme_item.get("aweme_type"),
"title": aweme_item.get("desc", ""),
"desc": aweme_item.get("desc", ""),
"create_time": aweme_item.get("create_time"),
"user_id": user_info.get("uid"),
"sec_uid": user_info.get("sec_uid"),
"short_user_id": user_info.get("short_id"),
"user_unique_id": user_info.get("unique_id"),
"user_signature": user_info.get("signature"),
"nickname": user_info.get("nickname"),
"avatar": user_info.get("avatar_thumb", {}).get("url_list", [""])[0],
"liked_count": interact_info.get("digg_count"),
"collected_count": interact_info.get("collect_count"),
"comment_count": interact_info.get("comment_count"),
"share_count": interact_info.get("share_count"),
"ip_location": aweme_item.get("ip_label", ""),
"last_modify_ts": utils.get_current_timestamp(),
}
print(f"douyin aweme id:{aweme_id}, title:{local_db_item.get('title')}")
if config.IS_SAVED_DATABASED:
if not await DouyinAweme.filter(aweme_id=aweme_id).exists():
local_db_item["add_ts"] = utils.get_current_timestamp()
await DouyinAweme.create(**local_db_item)
else:
await DouyinAweme.filter(aweme_id=aweme_id).update(**local_db_item)
else:
await save_aweme_to_excel(local_db_item)

async def batch_update_dy_aweme_comments(aweme_id: str, comments: List[Dict]):
if not comments:
return
for comment_item in comments:
await update_dy_aweme_comment(aweme_id, comment_item)

async def update_dy_aweme_comment(aweme_id: str, comment_item: Dict):
comment_aweme_id = comment_item.get("aweme_id")
if aweme_id != comment_aweme_id:
print(f"comment_aweme_id: {comment_aweme_id} != aweme_id: {aweme_id}")
return
user_info = comment_item.get("user", {})
comment_id = comment_item.get("cid")
avatar_info = user_info.get("avatar_medium", {}) or user_info.get("avatar_300x300", {}) or user_info.get(
"avatar_168x168", {}) or user_info.get("avatar_thumb", {}) or {}
local_db_item = {
"comment_id": comment_id,
"create_time": comment_item.get("create_time"),
"ip_location": comment_item.get("ip_label", ""),
"aweme_id": aweme_id,
"content": comment_item.get("text"),
"content_extra": json.dumps(comment_item.get("text_extra", [])),
"user_id": user_info.get("uid"),
"sec_uid": user_info.get("sec_uid"),
"short_user_id": user_info.get("short_id"),
"user_unique_id": user_info.get("unique_id"),
"user_signature": user_info.get("signature"),
"nickname": user_info.get("nickname"),
"avatar": avatar_info.get("url_list", [""])[0],
"sub_comment_count": comment_item.get("reply_comment_total", 0),
"last_modify_ts": utils.get_current_timestamp(),
}
print(f"douyin aweme comment: {comment_id}, content: {local_db_item.get('content')}")
if config.IS_SAVED_DATABASED:
if not await DouyinAwemeComment.filter(comment_id=comment_id).exists():
local_db_item["add_ts"] = utils.get_current_timestamp()
await DouyinAwemeComment.create(**local_db_item)
else:
await DouyinAwemeComment.filter(comment_id=comment_id).update(**local_db_item)
else:
await save_comment_to_excel(local_db_item)

Invalid port Bug

报以下错误

Begin search xiaohongshu keywords:  健身
Traceback (most recent call last):
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 339, in normalize_port
    port_as_int = int(port)
ValueError: invalid literal for int() with base 10: ':1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 35, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "main.py", line 30, in main
    await crawler.start()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 70, in start
    note_res = await self.search_posts()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 134, in search_posts
    posts_res = await self.xhs_client.get_note_by_keyword(keyword=self.keywords)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 110, in get_note_by_keyword
    return await self.post(uri, data)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 77, in post
    return await self.request(method="POST", url=f"{self._host}{uri}",
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 53, in request
    async with httpx.AsyncClient(proxies=self.proxies) as client:
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1408, in __init__
    self._mounts: typing.Dict[URLPattern, typing.Optional[AsyncBaseTransport]] = {
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1409, in <dictcomp>
    URLPattern(key): None
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_utils.py", line 397, in __init__
    url = URL(pattern)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urls.py", line 113, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 246, in urlparse
    parsed_port: typing.Optional[int] = normalize_port(port, scheme)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 341, in normalize_port
    raise InvalidURL("Invalid port")
httpx.InvalidURL: Invalid port

针对某账号爬虫/ 无需爬取评论

请问有方法可以针对某账号爬虫或关闭爬取评论吗？

如何手动更换登陆账号而不是等它自然失效

要更换账号扫码时删掉整个user_data_dir吗

Help me: 项目启动报错

software	version
OS	mac
python	3.11

$ python main.py --platform xhs --lt qrcode
Traceback (most recent call last):
  File "/Users/xxx/tp-code/MediaCrawler/main.py", line 8, in <module>
    from media_platform.douyin import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/__init__.py", line 1, in <module>
    from .core import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/core.py", line 17, in <module>
    from .login import DouYinLogin
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/login.py", line 6, in <module>
    import aioredis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/__init__.py", line 1, in <module>
    from aioredis.client import Redis, StrictRedis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/client.py", line 32, in <module>
    from aioredis.connection import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/connection.py", line 33, in <module>
    from .exceptions import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/exceptions.py", line 14, in <module>
    class TimeoutError(asyncio.TimeoutError, builtins.TimeoutError, RedisError):
TypeError: duplicate base class TimeoutError

这有哪些应用场景？

如题

小红书功能反馈

请问能不能更新一个，只从固定用户主页搜索全部笔记的功能。
另外，好像笔记下的评论不能完全爬取保存完整。
不过确实挺好用的，是我找了这么多，唯一可以用的项目

小红书扫码失败了

今天开始，小红书扫码无法登陆，弹出图片，扫码，直接提示失败，重新登录，然后就是二维码过期了。

视频爬取

作者你好，感谢开源代码，看到代码里面现在是爬取评论和笔记，如果想要爬取小红书或者抖音平台中的视频在哪个部分呢，现在代码中有吗~

请问XHS可以进行关键字搜索嘛

get comments failed, error: Expecting value: line 1 column 1 (char 0)

get comments failed, error: Expecting value: line 1 column 1 (char 0)
其它博主下面的评论就没事，某一个就不行，不知道杂回事

在服务器山运行出错

在服务器上运行没有权限访问，这个在哪配置呢？

2023-09-23  22:30:23 MediaCrawler INFO Begin create browser context ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin create xiaohongshu API client ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin to ping xhs... 
2023-09-23  22:30:25 httpx INFO HTTP Request: POST https://edith.xiaohongshu.com/api/sns/web/v1/search/notes "HTTP/1.1 200 OK" 
2023-09-23  22:30:25 MediaCrawler ERROR Ping xhs failed: 您当前登录的账号没有权限访问, and try to login again... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu by qrcode ... 
2023-09-23  22:30:25 MediaCrawler INFO waiting for scan code login, remaining time is 20s 
<PIL.Image.Image image mode=RGB size=175x175 at 0x7F693C062E00>

douyin/client.py的第142行导致get_video_by_id错误

async def get_video_by_id(self, aweme_id: str):
"""
DouYin Video Detail API
:param aweme_id:
:return:
"""
params = {
"aweme_id": aweme_id
}
headers = copy.copy(self.headers)
headers["Cookie"] = "s_v_web_id=verify_leytkxgn_kvO5kOmO_SdMs_4t1o_B5ml_BUqtWM1mP6BF;"
del headers["Origin"]
return await self.get("/aweme/v1/web/aweme/detail/", params, headers)

nanmicoder / mediacrawler Goto Github PK

mediacrawler's Introduction

仓库描述

功能列表

使用方法

创建并激活 python 虚拟环境

安装依赖库

安装 playwright浏览器驱动

运行爬虫程序

数据保存

开发者服务

感谢下列Sponsors对本仓库赞助

MediaCrawler爬虫项目交流群：

运行报错常见问题Q&A

项目代码结构

代理IP使用说明

词云图相关操作说明

手机号登录说明

打赏

爬虫入门课程

项目贡献者

star 趋势图

参考

免责声明

1. 项目目的与性质

2. 法律合规性声明

3. 使用目的限制

4. 免责声明

5. 知识产权声明

6. 最终解释权

mediacrawler's People

Contributors

Stargazers

Watchers

Forkers

mediacrawler's Issues

dy不管使用哪种方法登录都同样报错,xhs正常

async def save_aweme_to_excel(aweme_data: Dict):

file_path = 'D:\douyin.xlsx'

if not os.path.exists(file_path):

raise Exception(f"File not found: {file_path}")

if not os.path.exists(file_path):

df = pd.DataFrame(columns=list(aweme_data.keys()))

df.to_excel(file_path, sheet_name='aweme', index=False, engine='openpyxl')

else:

df = pd.read_excel(file_path, sheet_name='aweme', engine='openpyxl')

df = df.append(aweme_data, ignore_index=True)

df.to_excel(file_path, sheet_name='aweme', index=False, engine='openpyxl')

async def save_comment_to_excel(comment_data: Dict):

file_path = 'D:\douyin.xlsx'

if not os.path.exists(file_path):

raise Exception(f"File not found: {file_path}")

if not os.path.exists(file_path):

df = pd.DataFrame(columns=list(comment_data.keys()))

df.to_excel(file_path, sheet_name='comments', index=False, engine='openpyxl')

else:

df = pd.read_excel(file_path, sheet_name='comments', engine='openpyxl')

df = df.append(comment_data, ignore_index=True)

df.to_excel(file_path, sheet_name='comments', index=False, engine='openpyxl')

Recommend Projects

Recommend Topics

Recommend Org