
MediaCrawler: Introduction

Disclaimer:

Please use this repository for learning purposes only. Cases of crawlers prosecuted as illegal in China: https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China

All content in this repository is for learning and reference only and must not be used commercially. No person or organization may use the content of this repository for illegal purposes or to infringe the lawful rights of others. The crawling techniques involved here are for learning and research only and must not be used for large-scale crawling of other platforms or for any other illegal activity. This repository accepts no liability for any legal consequences arising from the use of its content. By using the content of this repository, you agree to all terms and conditions of this disclaimer.

Click here to view the more detailed disclaimer.

Repository Description

Crawlers for Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, and more.
Currently able to crawl videos, images, comments, likes, shares, and related information from Xiaohongshu, Douyin, Kuaishou, Bilibili, and Weibo.

How it works: Playwright acts as a bridge. The browser context from a successful login is preserved, and encrypted request parameters are obtained by evaluating JS expressions inside that context. This avoids re-implementing the platforms' core encryption JS and greatly reduces the reverse-engineering effort.

Feature List

Platform | Keyword search | Crawl by post ID | Second-level comments | Creator homepage crawl | Login state cache | IP proxy pool | Comment word cloud
Xiaohongshu
Douyin
Kuaishou
Bilibili
Weibo

Usage

Create and activate a Python virtual environment

# Enter the project root directory
cd MediaCrawler

# Create a virtual environment
# Note: Python 3.7 - 3.9 is required; newer versions may hit dependency compatibility issues
python -m venv venv

# macOS & Linux: activate the virtual environment
source venv/bin/activate

# Windows: activate the virtual environment
venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Install the Playwright browser drivers

playwright install

Run the crawler

### Comment crawling is disabled by default; to enable it, change the ENABLE_GET_COMMENTS variable in config/base_config.py
### Other options are also documented in config/base_config.py (with Chinese comments)

# Search posts by the keywords in the config file and crawl the post info and comments
python main.py --platform xhs --lt qrcode --type search

# Read the post ID list from the config file and crawl the info and comments of those posts
python main.py --platform xhs --lt qrcode --type detail

# Scan the QR code in the corresponding app to log in

# For usage examples on other platforms, run the command below
python main.py --help
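For example, enabling comment crawling is a one-line change in config/base_config.py (an illustrative excerpt; check the file itself for the exact variable names and defaults):

```python
# config/base_config.py (excerpt, illustrative)
ENABLE_GET_COMMENTS = True  # crawl comments along with posts
```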

Data Storage

  • Save to a relational database (MySQL, PostgreSQL, etc.)
    • Run python db.py to initialize the database table schema (needed only on first run)
  • Save to CSV (under the data/ directory)
  • Save to JSON (under the data/ directory)
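The storage backend is also selected in config/base_config.py; as a hedged example (an illustrative excerpt, values assumed, verify against the file itself):

```python
# config/base_config.py (excerpt, illustrative)
SAVE_DATA_OPTION = "csv"  # one of: "csv", "db", "json"
```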

Developer Services

Thanks to the following sponsors for supporting this repository.


- Registering with this free GPT assistant helps me earn GPT-4 quota as support. It is also the Chrome AI assistant extension I use every day.

Become a sponsor and show your product here; contact the author: [email protected]

MediaCrawler project chat group:

Scan my personal WeChat QR code below and add the note "github"; you will be pulled into the MediaCrawler project group (please be sure to write "github" in the note, and a WeChat helper bot will add you automatically).

If the image does not load, add my WeChat ID directly: yzglan


Common Runtime Errors Q&A

When you hit an error, try to resolve it yourself first; in most cases an AI assistant such as ChatGPT can solve your problem. Free ChatGPT

➡️➡️➡️ Frequently Asked Questions

Logging in to dy and xhs with Playwright currently triggers a slider CAPTCHA plus SMS verification; complete them manually.

Project Code Structure

➡️➡️➡️ Code structure guide

Proxy IP Usage

➡️➡️➡️ Proxy IP usage guide

Word Cloud Operations

➡️➡️➡️ Word cloud guide

Phone Number Login

➡️➡️➡️ Phone number login guide

Donations

Free open source is not easy. If this project helped you, consider a donation; your support is my biggest motivation!

Donate via WeChat

Donate via Alipay

Crawler Beginner Course

I have opened a new crawler tutorial repository on GitHub, CrawlerTutorial. Feel free to follow it; it is continuously updated and completely free.

Project Contributors

Thank you for your contributions to making the project better! (Frequent contributors can add me on WeChat; I will add you to my knowledge-planet group for free, with other perks later.)

NanmiCoder
程序员阿江-Relakkes
leantli
leantli
Rosyrain
Rosyrain
BaoZhuhan
Bao Zhuhan
nelzomal
zhounan
Hiro-Lin
HIRO
PeanutSplash
PeanutSplash
Ermeng98
Ermeng
henryhyn
Henry He
Akiqqqqqqq
leonardoqiuyu
jayeeliu
jayeeliu
ZuWard
ZuWard
Zzendrix
Zendrix
chunpat
zhangzhenpeng
tanpenggood
Sam Tan
xbsheng
xbsheng
yangrq1018
Martin
zhihuiio
zhihuiio
renaissancezyc
Ren
Tianci-King
Wang Tianci
Styunlen
Styunlen
Schofi
Schofi
Klu5ure
Klu5ure
keeper-jie
Kermit
kexinoh
KEXNA
aa65535
Jian Chang
522109452
tianqing

Star Trend

  • If this project helps you, give it a star ❤️❤️❤️

Star History Chart

References

Disclaimer

1. Purpose and Nature of the Project

This project (hereinafter "the Project") was created as a tool for technical research and learning, aimed at exploring and studying web data collection techniques. It focuses on researching data-crawling techniques for social media platforms and is intended for learners and researchers as a means of technical exchange.

2. Legal Compliance Statement

The developers of the Project (hereinafter "the Developers") solemnly remind users to strictly comply with the relevant laws and regulations of the People's Republic of China when downloading, installing, and using the Project, including but not limited to the Cybersecurity Law of the People's Republic of China and the Counter-Espionage Law of the People's Republic of China, along with all other applicable national laws and policies. Users shall bear all legal liability that may arise from their use of the Project.

3. Restrictions on Use

The Project must not be used for any illegal purpose or for any commercial activity unrelated to learning and research. It must not be used for any form of unauthorized intrusion into others' computer systems, nor for any act that infringes the intellectual property or other lawful rights of others. Users shall ensure that their use of the Project is solely for personal learning and technical research and not for any form of illegal activity.

4. Disclaimer of Liability

The Developers have made every effort to ensure the legitimacy and safety of the Project, but assume no liability for any direct or indirect loss of any kind that may result from a user's use of it, including but not limited to any data loss, equipment damage, or legal proceedings arising from the use of the Project.

5. Intellectual Property Statement

The intellectual property of the Project belongs to the Developers. The Project is protected by copyright law, international copyright treaties, and other intellectual property laws and treaties. Users may download and use the Project provided they comply with this statement and the relevant laws and regulations.

6. Right of Final Interpretation

The right of final interpretation of the Project belongs to the Developers. The Developers reserve the right to change or update this disclaimer at any time without notice.


MediaCrawler's Issues

Error message

Crawling Douyin comments fails with: MediaCrawler ERROR aweme_id: xxx get comments failed, error: Expecting value: line 1 column 1 (char 0)
How should this be resolved?
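This "Expecting value" message is json.loads failing on an empty or non-JSON response body (typically the platform's risk control returning nothing). A defensive parse, as a sketch only (the helper name is ours, not the project's):

```python
import json


def parse_json_response(text: str):
    """Return the parsed JSON body, or None for empty/non-JSON responses.

    Feeding an empty body to json.loads raises exactly
    "Expecting value: line 1 column 1 (char 0)".
    """
    if not text or not text.strip():
        return None  # empty body: likely rate-limited or blocked
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # HTML error page or captcha challenge, not JSON
```

When None comes back, re-logging in or slowing the request rate is usually more productive than retrying immediately.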

Error: Full list of missing libraries

File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 44, in start
    self.browser_context = await self.launch_browser(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\media_platform\xhs\core.py", line 184, in launch_browser
    browser_context = await chromium.launch_persistent_context(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\async_api\_generated.py", line 14727, in launch_persistent_context
    await self._impl_obj.launch_persistent_context(
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_browser_type.py", line 155, in launch_persistent_context
    from_channel(await self._channel.send("launchPersistentContext", params)),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Host system is missing dependencies!

Full list of missing libraries:
chrome_elf.dll

Unable to crawl video comments

The same problem appears in another issue from 3 weeks ago. The author replied that it was resolved, but the same error still occurs; I tested multiple keywords with the same result. The error is:

2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7241024491999022392 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7257898852668296485 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7246635327694179623 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7042503193409880590 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7256294498760674612 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler ERROR aweme_id: 7229990392005922081 get comments failed, error: Expecting value: line 1 column 1 (char 0),
2023-09-03 23:32:26 MediaCrawler INFO Douyin Crawler finished ...

playwright._impl._api_types.Error: Browser closed.

File "e:\miniconda3\Lib\site-packages\playwright\async_api\_generated.py", line 14727, in launch_persistent_context
    await self._impl_obj.launch_persistent_context(
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_browser_type.py", line 155, in launch_persistent_context
    from_channel(await self._channel.send("launchPersistentContext", params)),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
File "e:\miniconda3\Lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._api_types.Error: Browser closed.
==================== Browser output: ====================
C:\Users\xiazhiqiang\AppData\Local\ms-playwright\chromium-1060\chrome-win\chrome.exe --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=C:\Users\xiazhiqiang\Desktop\MediaCrawler-main\browser_data\xhs_user_data_dir --remote-debugging-pipe about:blank
pid=6572
[pid=6572]
[pid=6572] starting temporary directories cleanup

About Redis

Hi, is it currently still not possible to store crawled Douyin comments into the database via Redis?
Could you please add a walkthrough for configuring Redis on Windows?
please😜

Can the crawled data be exported directly to CSV? I ran into quite a few problems when saving it to the database.

Here are some of the problems I encountered:

asyncmy.errors.OperationalError: (1054, "Unknown column 'nickname' in 'field list'")

tortoise.exceptions.OperationalError: (1054, "Unknown column 'add_ts' in 'field list'")
......some fields were missing (I added a few in the SQL myself)

tortoise.exceptions.OperationalError: (1054, "Unknown column 'image_list' in 'field list'")
(some values are Python list/dict types, and I don't know what column type to use for them in SQL)

tortoise.exceptions.OperationalError: (1366, "Incorrect string value: '\xF0\x9F\x8C\xB0' for column 'nickname' at row 1") (I changed the SQL collation to "utf-8_general_ci")

asyncmy.errors.DataError: (1406, "Data too long for column 'avatar' at row 1")

tortoise.exceptions.OperationalError: (1406, "Data too long for column 'avatar' at row 1")
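Two of these are generic MySQL issues rather than project bugs: emoji need a utf8mb4 character set (utf8_general_ci cannot store 4-byte characters, hence the "Incorrect string value" error), and Python list/dict values are usually serialized to JSON text before insertion. A sketch of the serialization side (the helper name is ours, for illustration):

```python
import json


def to_db_value(value):
    """Serialize Python list/dict fields (e.g. image_list) to JSON text
    so they fit a TEXT/JSON column; other values pass through unchanged."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, ensure_ascii=False)
    return value
```

For the emoji error, converting the table to utf8mb4 (e.g. ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4) is the usual fix; the "Data too long" errors suggest widening the avatar column (e.g. to TEXT).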

dy login fails; the QR code does not pop up

@NanmiCoder @tanpenggood

C:\Users\caps\.vitualenvs\crawler\Scripts\python.exe main.py --platform dy --lt qrcode 
2023-07-26  22:58:50 MediaCrawler ERROR login dialog box does not pop up automatically, error: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================ 
2023-07-26  22:58:50 MediaCrawler INFO login dialog box does not pop up automatically, we will manually click the login button 
Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 90, in popup_login_dialog
    await self.context_page.wait_for_selector(dialog_selector, timeout=1000 * 10)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 8266, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_page.py", line 368, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 322, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 10000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//div[@id='login-pannel']") to be visible
============================================================

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 47, in <module>
    asyncio.run(main())
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\caps\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\main.py", line 42, in main
    await crawler.start()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\core.py", line 62, in start
    await login_obj.begin()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 45, in begin
    await self.popup_login_dialog()
  File "C:\Users\caps\PycharmProjects\MediaCrawler\media_platform\douyin\login.py", line 95, in popup_login_dialog
    await login_button_ele.click()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\async_api\_generated.py", line 15419, in click
    await self._impl_obj.click(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_locator.py", line 160, in click
    return await self._frame.click(self._selector, strict=True, **params)
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_frame.py", line 489, in click
    await self._channel.send("click", locals_to_params(locals()))
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 461, in wrap_api_call
    return await cb()
  File "C:\Users\caps\.vitualenvs\crawler\lib\site-packages\playwright\_impl\_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//p[text() = '登录']")
  locator resolved to <p class="lqiPv8cB">登录</p>
attempting click action
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
retrying click action, attempt #1
  waiting for element to be visible, enabled and stable
  element is visible, enabled and stable
  scrolling into view if needed
  done scrolling
  <div id="captcha_container">…</div> intercepts pointer events
[attempts #2 through #60 repeat the identical sequence, waiting 20ms, 100ms, then 500ms between retries; every attempt is blocked by the captcha_container overlay]
============================================================

Process finished with exit code 1

Douyin login fails

dy throws the same error whichever login method I use; xhs works fine.

It seems to be a Playwright problem — on my machine Playwright cannot open the Douyin home page. Changing index_url to www.douyin.com/discover
made it work.

QR code login

QR code login is broken — a slider captcha is now required
<Page url='https://www.xiaohongshu.com/website-login/captcha?redirectPath=>

Xiaohongshu: error after a successful QR code scan

Details:
2023-07-08 18:04:13 root INFO Begin login xiaohongshu by qrcode ...
2023-07-08 18:04:23 root INFO waiting for scan code login, remaining time is 20s
Traceback (most recent call last):
  File "/Users/username/work/github_test/MediaCrawler/main.py", line 58, in <module>
    asyncio.run(main())
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/Users/username/work/github_test/MediaCrawler/main.py", line 39, in main
    await crawler.start()
  File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/core.py", line 82, in start
    await login_obj.begin()
  File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 48, in begin
    await self.login_by_qrcode()
  File "/Users/username/work/github_test/MediaCrawler/media_platform/xhs/login.py", line 155, in login_by_qrcode
    login_flag: bool = await self.check_login_state(no_logged_in_session)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 88, in async_wrapped
    return await fn(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/_asyncio.py", line 47, in __call__
    do = self.iter(retry_state=retry_state)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1271e2ef0 state=finished returned bool>]
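For readers hitting this: the final tenacity.RetryError means check_login_state kept returning False (the QR code was never confirmed in time) until the retry budget ran out. A stdlib sketch of the retry-until-true pattern tenacity applies here (helper names are hypothetical, not project code):

```python
# Hypothetical stdlib stand-ins for tenacity's behavior; not project code.
class RetryError(Exception):
    """Raised when every attempt returned a falsy result."""

def retry_on_false(fn, attempts: int = 3):
    result = None
    for _ in range(attempts):
        result = fn()
        if result:              # logged in -> stop retrying
            return result
    # tenacity wraps the last outcome instead of the call raising itself,
    # hence "RetryError[<Future ... returned bool>]" in the traceback
    raise RetryError(f"last result: {result!r}")

def check_login_state() -> bool:
    return False                # stand-in: the QR code was never scanned in time

try:
    retry_on_false(check_login_state)
    outcome = "logged in"
except RetryError:
    outcome = "RetryError"
print(outcome)  # RetryError
```

So the fix is not in the retry machinery but in completing the scan (or raising the retry timeout) so the check returns True.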

python main.py --platform dy --lt qrcode fails

Running python main.py --platform dy --lt qrcode —
I tried every value --lt accepts, but it always fails with:
Traceback (most recent call last):
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 51, in <module>
    asyncio.get_event_loop().run_until_complete(main())
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
    return future.result()
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\main.py", line 45, in main
    await crawler.start()
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 66, in start
    await self.search()
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\core.py", line 79, in search
    posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 129, in search_info_by_keyword
    return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 78, in get
    await self.__process_req_params(params, headers)
  File "C:\Users\nuc8\Downloads\MediaCrawler-main\MediaCrawler-main\media_platform\douyin\client.py", line 56, in __process_req_params
    "webid": douyin_js_obj.call("get_web_id"),
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_abstract_runtime_context.py", line 37, in call
    return self._call(name, *args)
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_external_runtime.py", line 92, in _call
    return self.eval("{identifier}.apply(this, {args})".format(identifier=identifier, args=args))
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_external_runtime.py", line 78, in eval
    return self.exec_(code)
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_abstract_runtime_context.py", line 18, in exec_
    return self._exec(source)
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_external_runtime.py", line 88, in _exec
    return self._extract_result(output)
  File "C:\Users\nuc8\AppData\Local\Programs\Python\Python311\Lib\site-packages\execjs\_external_runtime.py", line 167, in _extract_result
    raise ProgramError(value)
execjs._exceptions.ProgramError: SyntaxError: 缺少 ';'

SQL insert fails

In xhs.model:
"title": note_item.get("title") or note_item.get("desc", "")
Some notes have only desc and no title, so the fallback title string exceeds the column width and the insert fails.
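A minimal sketch of the fix this thread implies — clamp the fallback title to the column width before inserting. Here 255 is an assumed VARCHAR limit and note_item is a sample payload; the real limit lives in the xhs model definition:

```python
# note_item is a sample payload; MAX_TITLE_LEN = 255 is an assumed column width.
MAX_TITLE_LEN = 255
note_item = {"desc": "x" * 1000}  # a note with no "title", only a long "desc"

title = (note_item.get("title") or note_item.get("desc", ""))[:MAX_TITLE_LEN]
print(len(title))  # 255
```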

Using cookie login for Douyin, I hit execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.

Full output: (base) yyyy:~/Union/MediaCrawler$ python main.py --platform dy --lt cookie
/yyyy/MediaCrawler/main.py:51: DeprecationWarning: There is no current event loop
asyncio.get_event_loop().run_until_complete(main())
2023-08-11 15:08:42 MediaCrawler INFO Begin login douyin by cookie ...
2023-08-11 15:08:48 MediaCrawler INFO login finished then check login state ...
2023-08-11 15:08:48 MediaCrawler INFO Login successful then wait for 5 seconds redirect ...
2023-08-11 15:08:53 MediaCrawler INFO Begin search douyin keywords
2023-08-11 15:08:53 MediaCrawler INFO Current keyword: 健身
Traceback (most recent call last):
  File "/yyyy/MediaCrawler/main.py", line 51, in <module>
    asyncio.get_event_loop().run_until_complete(main())
  File "/yyyy/anaconda3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/yyyy/MediaCrawler/main.py", line 45, in main
    await crawler.start()
  File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 66, in start
    await self.search()
  File "/yyyy/MediaCrawler/media_platform/douyin/core.py", line 79, in search
    posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
  File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 129, in search_info_by_keyword
    return await self.get("/aweme/v1/web/general/search/single/", params, headers=headers)
  File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 78, in get
    await self.__process_req_params(params, headers)
  File "/yyyy/MediaCrawler/media_platform/douyin/client.py", line 38, in __process_req_params
    douyin_js_obj = execjs.compile(open('libs/douyin.js').read())
  File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/yyyy/anaconda3/lib/python3.10/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.
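Both execjs failures in these threads usually come down to which JavaScript runtime PyExecJS selects: RuntimeUnavailableError means no runtime was found at all, and on Windows the "缺少 ';'" (missing ';') SyntaxError typically means the legacy JScript engine was picked instead of Node.js, which cannot parse the modern JS in libs/douyin.js. A quick stdlib check (assumes installing Node.js is the intended fix):

```python
import shutil

# PyExecJS picks whatever runtime it finds (Node.js, JScript on Windows, ...).
# No runtime -> RuntimeUnavailableError; the JScript fallback -> syntax errors
# on modern JS such as libs/douyin.js.
node_path = shutil.which("node")
if node_path is None:
    print("Node.js not found on PATH - install it so execjs can use it")
else:
    print("Node.js available at", node_path)
```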

Login expired

I can log in successfully with either QR code or cookie login, but as soon as crawling starts, before even one batch of data comes back, I get logged out with:

media_platform.xhs.exception.DataFetchError: 登录已

Why is that?

Can cookies be reused to avoid logging in again?

Hi — when crawling Xiaohongshu I logged in via QR code the first time and captured the cookies, then tried to skip login on later runs by replaying those cookies, but it failed. Am I doing something wrong, or does the logged-in context involve more than just cookies, or is it something else?
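One likely answer to the question above: a logged-in Playwright context is more than its cookies — localStorage tokens matter too, which is exactly what Playwright's storage_state captures and what this project's login-state cache relies on. A sketch of the JSON shape involved (all field values below are illustrative, not real tokens; a real run would pass the file to browser.new_context(storage_state=...)):

```python
import json
import os
import tempfile

# Illustrative storage_state: note it holds cookies AND localStorage,
# which is why replaying cookies alone may fail to restore a login.
state = {
    "cookies": [{
        "name": "web_session", "value": "example", "domain": ".xiaohongshu.com",
        "path": "/", "expires": -1, "httpOnly": True, "secure": True, "sameSite": "Lax",
    }],
    "origins": [{
        "origin": "https://www.xiaohongshu.com",
        "localStorage": [{"name": "example_key", "value": "example_value"}],
    }],
}

path = os.path.join(tempfile.gettempdir(), "xhs_state.json")
with open(path, "w") as f:
    json.dump(state, f)

# A later run restores BOTH cookies and localStorage from this file.
with open(path) as f:
    restored = json.load(f)
```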

The QR-code window opened by show_qrcode blocks: until the image viewer is closed, the program hangs and never continues

tools.utils.show_qrcode():

import base64
from io import BytesIO

from PIL import Image, ImageDraw


def show_qrcode(qr_code: str):
    """parse base64 encode qrcode image and show it"""
    qr_code = qr_code.split(",")[1]
    qr_code = base64.b64decode(qr_code)
    image = Image.open(BytesIO(qr_code))

    # Add a square border around the QR code and display it within the border to improve scanning accuracy.
    width, height = image.size
    new_image = Image.new('RGB', (width + 20, height + 20), color=(255, 255, 255))
    new_image.paste(image, (10, 10))
    draw = ImageDraw.Draw(new_image)
    draw.rectangle((0, 0, width + 19, height + 19), outline=(0, 0, 0), width=1)
    new_image.show()

login.login_by_qrcode

Suggestion: would the experience be better if the QR code were shown asynchronously while a loop keeps checking the login state?

login_flag: bool = await self.check_login_state(no_logged_in_session)
        if not login_flag:
            # wait 2s
            # login_flag: bool = await self.check_login_state(no_logged_in_session)

Data output stalls

After running, only about 200+ records come out (posts and comments combined) and then nothing more is written. Is this a Xiaohongshu limit, or something else?

Crawling runs, but not a single comment gets fetched

The error is: MediaCrawler ERROR aweme_id: 7266050530072481076 get comments failed, error: Expecting value: line 1 column 1 (char 0)
Even allowing for Douyin's anti-crawling measures there should be at least one or two comments, but there are none at all. Below is my modified code for saving the data locally:
import json
from typing import Dict, List

from tortoise import fields
from tortoise.models import Model
import os
import config
from tools import utils
import pandas as pd

class DouyinBaseModel(Model):
    id = fields.IntField(pk=True, autoincrement=True, description="自增ID")
    user_id = fields.CharField(null=True, max_length=64, description="用户ID")
    sec_uid = fields.CharField(null=True, max_length=128, description="用户sec_uid")
    short_user_id = fields.CharField(null=True, max_length=64, description="用户短ID")
    user_unique_id = fields.CharField(null=True, max_length=64, description="用户唯一ID")
    nickname = fields.CharField(null=True, max_length=64, description="用户昵称")
    avatar = fields.CharField(null=True, max_length=255, description="用户头像地址")
    user_signature = fields.CharField(null=True, max_length=500, description="用户签名")
    ip_location = fields.CharField(null=True, max_length=255, description="评论时的IP地址")
    add_ts = fields.BigIntField(description="记录添加时间戳")
    last_modify_ts = fields.BigIntField(description="记录最后修改时间戳")

    class Meta:
        abstract = True

class DouyinAweme(DouyinBaseModel):
    aweme_id = fields.CharField(max_length=64, index=True, description="视频ID")
    aweme_type = fields.CharField(max_length=16, description="视频类型")
    title = fields.CharField(null=True, max_length=500, description="视频标题")
    desc = fields.TextField(null=True, description="视频描述")
    create_time = fields.BigIntField(description="视频发布时间戳", index=True)
    liked_count = fields.CharField(null=True, max_length=16, description="视频点赞数")
    comment_count = fields.CharField(null=True, max_length=16, description="视频评论数")
    share_count = fields.CharField(null=True, max_length=16, description="视频分享数")
    collected_count = fields.CharField(null=True, max_length=16, description="视频收藏数")

    class Meta:
        table = "douyin_aweme"
        table_description = "抖音视频"

    def __str__(self):
        return f"{self.aweme_id} - {self.title}"

def save_data_to_excel(data: Dict, sheet_name: str):
    file_path = r'D:\douyin.xlsx'
    if not os.path.exists(file_path):
        # Write the first record too, not just the header row
        df = pd.DataFrame([data])
        df.to_excel(file_path, sheet_name=sheet_name, index=False, engine='openpyxl')
    else:
        with pd.ExcelFile(file_path) as xls:
            df_old = pd.read_excel(xls, sheet_name=sheet_name, engine='openpyxl')

        # Use pd.concat instead of the removed DataFrame.append
        df_new = pd.DataFrame([data])
        df_combined = pd.concat([df_old, df_new], ignore_index=True)
        df_combined.to_excel(file_path, sheet_name=sheet_name, index=False, engine='openpyxl')

async def save_aweme_to_excel(aweme_data: Dict):
    save_data_to_excel(aweme_data, "aweme")


async def save_comment_to_excel(comment_data: Dict):
    save_data_to_excel(comment_data, "comments")

class DouyinAwemeComment(DouyinBaseModel):
    comment_id = fields.CharField(max_length=64, index=True, description="评论ID")
    aweme_id = fields.CharField(max_length=64, index=True, description="视频ID")
    content = fields.TextField(null=True, description="评论内容")
    create_time = fields.BigIntField(description="评论时间戳")
    sub_comment_count = fields.CharField(max_length=16, description="评论回复数")

    class Meta:
        table = "douyin_aweme_comment"
        table_description = "抖音视频评论"

    def __str__(self):
        return f"{self.comment_id} - {self.content}"

async def update_douyin_aweme(aweme_item: Dict):
    aweme_id = aweme_item.get("aweme_id")
    user_info = aweme_item.get("author", {})
    interact_info = aweme_item.get("statistics", {})
    local_db_item = {
        "aweme_id": aweme_id,
        "aweme_type": aweme_item.get("aweme_type"),
        "title": aweme_item.get("desc", ""),
        "desc": aweme_item.get("desc", ""),
        "create_time": aweme_item.get("create_time"),
        "user_id": user_info.get("uid"),
        "sec_uid": user_info.get("sec_uid"),
        "short_user_id": user_info.get("short_id"),
        "user_unique_id": user_info.get("unique_id"),
        "user_signature": user_info.get("signature"),
        "nickname": user_info.get("nickname"),
        "avatar": user_info.get("avatar_thumb", {}).get("url_list", [""])[0],
        "liked_count": interact_info.get("digg_count"),
        "collected_count": interact_info.get("collect_count"),
        "comment_count": interact_info.get("comment_count"),
        "share_count": interact_info.get("share_count"),
        "ip_location": aweme_item.get("ip_label", ""),
        "last_modify_ts": utils.get_current_timestamp(),
    }
    print(f"douyin aweme id:{aweme_id}, title:{local_db_item.get('title')}")
    if config.IS_SAVED_DATABASED:
        if not await DouyinAweme.filter(aweme_id=aweme_id).exists():
            local_db_item["add_ts"] = utils.get_current_timestamp()
            await DouyinAweme.create(**local_db_item)
        else:
            await DouyinAweme.filter(aweme_id=aweme_id).update(**local_db_item)
    else:
        await save_aweme_to_excel(local_db_item)

async def batch_update_dy_aweme_comments(aweme_id: str, comments: List[Dict]):
    if not comments:
        return
    for comment_item in comments:
        await update_dy_aweme_comment(aweme_id, comment_item)


async def update_dy_aweme_comment(aweme_id: str, comment_item: Dict):
    comment_aweme_id = comment_item.get("aweme_id")
    if aweme_id != comment_aweme_id:
        print(f"comment_aweme_id: {comment_aweme_id} != aweme_id: {aweme_id}")
        return
    user_info = comment_item.get("user", {})
    comment_id = comment_item.get("cid")
    avatar_info = user_info.get("avatar_medium", {}) or user_info.get("avatar_300x300", {}) or user_info.get(
        "avatar_168x168", {}) or user_info.get("avatar_thumb", {}) or {}
    local_db_item = {
        "comment_id": comment_id,
        "create_time": comment_item.get("create_time"),
        "ip_location": comment_item.get("ip_label", ""),
        "aweme_id": aweme_id,
        "content": comment_item.get("text"),
        "content_extra": json.dumps(comment_item.get("text_extra", [])),
        "user_id": user_info.get("uid"),
        "sec_uid": user_info.get("sec_uid"),
        "short_user_id": user_info.get("short_id"),
        "user_unique_id": user_info.get("unique_id"),
        "user_signature": user_info.get("signature"),
        "nickname": user_info.get("nickname"),
        "avatar": avatar_info.get("url_list", [""])[0],
        "sub_comment_count": comment_item.get("reply_comment_total", 0),
        "last_modify_ts": utils.get_current_timestamp(),
    }
    print(f"douyin aweme comment: {comment_id}, content: {local_db_item.get('content')}")
    if config.IS_SAVED_DATABASED:
        if not await DouyinAwemeComment.filter(comment_id=comment_id).exists():
            local_db_item["add_ts"] = utils.get_current_timestamp()
            await DouyinAwemeComment.create(**local_db_item)
        else:
            await DouyinAwemeComment.filter(comment_id=comment_id).update(**local_db_item)
    else:
        await save_comment_to_excel(local_db_item)
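The error quoted at the top of this thread — Expecting value: line 1 column 1 (char 0) — is json.loads choking on an empty response body, which is what Douyin returns when a request is blocked by anti-crawl measures. A small guard (a hypothetical helper, not part of the project) makes that failure explicit instead of raising:

```python
import json
from typing import Any, Optional

def safe_json(text: str, default: Optional[Any] = None) -> Any:
    """Decode JSON, returning `default` for empty or non-JSON bodies."""
    if not text or not text.strip():
        return default
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return default

print(safe_json(""))          # None  (what a blocked comment request yields)
print(safe_json('{"a": 1}'))  # {'a': 1}
```

With such a guard the crawler can log "blocked / empty response" per aweme_id rather than dying in the JSON decoder.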

Invalid port bug

The following error is raised:

Begin search xiaohongshu keywords:  健身
Traceback (most recent call last):
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 339, in normalize_port
    port_as_int = int(port)
ValueError: invalid literal for int() with base 10: ':1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 35, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "main.py", line 30, in main
    await crawler.start()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 70, in start
    note_res = await self.search_posts()
  File "/home/MediaCrawler/media_platform/xhs/core.py", line 134, in search_posts
    posts_res = await self.xhs_client.get_note_by_keyword(keyword=self.keywords)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 110, in get_note_by_keyword
    return await self.post(uri, data)
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 77, in post
    return await self.request(method="POST", url=f"{self._host}{uri}",
  File "/home/MediaCrawler/media_platform/xhs/client.py", line 53, in request
    async with httpx.AsyncClient(proxies=self.proxies) as client:
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1408, in __init__
    self._mounts: typing.Dict[URLPattern, typing.Optional[AsyncBaseTransport]] = {
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_client.py", line 1409, in <dictcomp>
    URLPattern(key): None
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_utils.py", line 397, in __init__
    url = URL(pattern)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urls.py", line 113, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 246, in urlparse
    parsed_port: typing.Optional[int] = normalize_port(port, scheme)
  File "/home/MediaCrawler/env/lib/python3.8/site-packages/httpx/_urlparse.py", line 341, in normalize_port
    raise InvalidURL("Invalid port")
httpx.InvalidURL: Invalid port
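The ':1' in the traceback suggests httpx was handed a malformed proxy string — it expects a full URL such as http://host:port, not an ip:port:user:pass tuple. A stdlib sketch of a pre-flight check (the helper and example values are hypothetical):

```python
from urllib.parse import urlparse

def valid_proxy(url: str) -> bool:
    """Hypothetical pre-flight check for proxy strings handed to httpx."""
    try:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https", "socks5") and parsed.port is not None
    except ValueError:  # non-numeric port remainder, e.g. "8080:user:pw"
        return False

print(valid_proxy("http://127.0.0.1:7890"))        # True
print(valid_proxy("http://1.2.3.4:8080:user:pw"))  # False
```

Credentials belong in the userinfo part of the URL (http://user:pw@host:port), not appended after the port.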

Help me: the project fails to start

Software version:
OS: macOS
Python: 3.11
$ python main.py --platform xhs --lt qrcode
Traceback (most recent call last):
  File "/Users/xxx/tp-code/MediaCrawler/main.py", line 8, in <module>
    from media_platform.douyin import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/__init__.py", line 1, in <module>
    from .core import DouYinCrawler
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/core.py", line 17, in <module>
    from .login import DouYinLogin
  File "/Users/xxx/tp-code/MediaCrawler/media_platform/douyin/login.py", line 6, in <module>
    import aioredis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/__init__.py", line 1, in <module>
    from aioredis.client import Redis, StrictRedis
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/client.py", line 32, in <module>
    from aioredis.connection import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/connection.py", line 33, in <module>
    from .exceptions import (
  File "/Users/xxx/miniconda3/lib/python3.11/site-packages/aioredis/exceptions.py", line 14, in <module>
    class TimeoutError(asyncio.TimeoutError, builtins.TimeoutError, RedisError):
TypeError: duplicate base class TimeoutError
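This is a known incompatibility between aioredis 2.x and Python 3.11+: asyncio.TimeoutError became an alias of the builtin TimeoutError, so aioredis's exception class lists the same base twice and Python rejects the class definition. The alias can be verified directly:

```python
import asyncio
import builtins
import sys

# On Python >= 3.11 these are the same class, which makes
#   class TimeoutError(asyncio.TimeoutError, builtins.TimeoutError, RedisError)
# a duplicate-base class definition in aioredis/exceptions.py.
alias = asyncio.TimeoutError is builtins.TimeoutError
modern = sys.version_info >= (3, 11)
print(alias, modern)
```

Practical workarounds include running the Python 3.7–3.9 range the README recommends, or swapping the unmaintained aioredis for redis-py's asyncio support.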

Xiaohongshu feature feedback

Could you add a feature that crawls all notes from a specified user's profile page?
Also, the comments under a note don't seem to be fetched and saved completely.
That said, this works really well — of everything I tried, it's the only usable project.

Xiaohongshu QR code login fails

Since today, scanning the Xiaohongshu QR code no longer logs in: the image pops up, I scan it, it immediately reports failure and asks me to log in again, and then the QR code expires.

Video downloading

Hi, thanks for open-sourcing this. The code currently crawls comments and notes — which part would handle downloading the actual videos from Xiaohongshu or Douyin? Is that in the code yet?

Error when running on a server

When running on a server I get "no permission to access" — where is this configured?

2023-09-23  22:30:23 MediaCrawler INFO Begin create browser context ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin create xiaohongshu API client ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin to ping xhs... 
2023-09-23  22:30:25 httpx INFO HTTP Request: POST https://edith.xiaohongshu.com/api/sns/web/v1/search/notes "HTTP/1.1 200 OK" 
2023-09-23  22:30:25 MediaCrawler ERROR Ping xhs failed: 您当前登录的账号没有权限访问, and try to login again... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu ... 
2023-09-23  22:30:25 MediaCrawler INFO Begin login xiaohongshu by qrcode ... 
2023-09-23  22:30:25 MediaCrawler INFO waiting for scan code login, remaining time is 20s 
<PIL.Image.Image image mode=RGB size=175x175 at 0x7F693C062E00>

Line 142 of douyin/client.py breaks get_video_by_id

async def get_video_by_id(self, aweme_id: str):
    """
    DouYin Video Detail API
    :param aweme_id:
    :return:
    """
    params = {
        "aweme_id": aweme_id
    }
    headers = copy.copy(self.headers)
    headers["Cookie"] = "s_v_web_id=verify_leytkxgn_kvO5kOmO_SdMs_4t1o_B5ml_BUqtWM1mP6BF;"
    del headers["Origin"]
    return await self.get("/aweme/v1/web/aweme/detail/", params, headers)
