douyin_comments's Introduction

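Usage: run douyin_comments.py. When a verification pop-up appears on the page, complete it manually and quickly, then wait while the page scrolls on its own. When the run finishes, the results are saved in a folder. The adjustable parameters are listed below.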

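# Profile page to scrape
account_home_page = 'https://www.douyin.com/user/MS4wLjABAAAAKouSmCULyRPvwO2ECzsUljHEmlAxvRIJSy3Q30VEuu0'
# Number of videos to scrape
video_num = 10
# Maximum number of videos to scroll through, to limit the program's running time
max_video_num = 100
# Number of comments to scrape
comment_num = 10
# Maximum number of comments to scroll through, to limit the program's running time
max_comment_num = 50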

If you run into an error, check the issues, which describe the steps for running the program.

Import the required modules

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

Douyin profile page link and the number of videos whose comments should be fetched

account_home_page= 'https://www.douyin.com/user/MS4wLjABAAAAABJTNtdE9bZKmIZfL_pR15F8X0VNK591ffRA9pXXZsw'
video_num = 10
comment_num = 20
video_or_comment = "video"

Scroll-down function, used to scroll the page and load more content

def drop_down(video_or_comment):
    # Pick the element locator and the cap on loaded elements for this page type
    if video_or_comment == "video":
        locator = (By.CSS_SELECTOR, '.Eie04v01')
        max_num = 10 * video_num
    elif video_or_comment == "comment":
        locator = (By.XPATH, '//*[@id="douyin-right-container"]/div[2]/div/div[1]/div[5]/div/div/div[4]/*/div/div[2]/div')
        max_num = 10 * comment_num
    else:
        raise ValueError(f"unknown page type: {video_or_comment!r}")

    prev_height = 0
    while True:
        # Get current scroll height
        curr_height = driver.execute_script("return document.documentElement.scrollHeight")

        # Scroll down the page
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")

        # Wait for some time to let the page load
        time.sleep(3)

        # Get the new scroll height
        new_height = driver.execute_script("return document.documentElement.scrollHeight")

        # Count the elements actually loaded instead of extrapolating from the first scroll
        new_num = len(driver.find_elements(*locator))

        # Break if we have reached the bottom of the page or loaded enough elements
        if (new_height == curr_height == prev_height) or (new_num >= max_num):
            break

        # Update the previous scroll height
        prev_height = curr_height
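
The stopping rule of the scroll loop can be exercised without a browser by passing the page operations in as callables. The sketch below is one possible arrangement, not the repository's code: `scroll_until_loaded`, its parameters, and the `max_rounds` safety cap are hypothetical names introduced for illustration. It recounts the loaded items after every scroll and stops once enough items are present or the page height has stayed the same for two consecutive scrolls.

```python
import time
from typing import Callable


def scroll_until_loaded(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    count_items: Callable[[], int],
    max_items: int,
    pause: float = 3.0,
    max_rounds: int = 200,  # hypothetical safety cap, not in the original script
) -> int:
    """Scroll until enough items are loaded or the page stops growing.

    Returns the number of items counted after the last scroll.
    """
    stable_rounds = 0
    loaded = count_items()
    for _ in range(max_rounds):
        if loaded >= max_items:
            break
        height_before = get_height()
        scroll_to_bottom()
        time.sleep(pause)       # give the page time to load more content
        loaded = count_items()  # recount rather than estimate
        if get_height() == height_before:
            stable_rounds += 1
            if stable_rounds >= 2:  # height unchanged twice in a row
                break
        else:
            stable_rounds = 0
    return loaded
```

With Selenium, `get_height` could wrap `driver.execute_script("return document.documentElement.scrollHeight")`, `scroll_to_bottom` could wrap the `window.scrollTo` call, and `count_items` could wrap `len(driver.find_elements(...))` with the selector for videos or comments.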

Convert the like-count string in a comment to a number

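def str2num(s):
    # Convert a like-count string such as '1.2万' or '356' to an integer
    try:
        if s[-1] == '万':
            # '万' means ten thousand
            n = int(float(s[:-1]) * 10000)
        else:
            n = int(s)
    except (IndexError, ValueError, OverflowError):
        # Empty or non-numeric text counts as 0
        n = 0
    return n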
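
Binary floating point cannot represent most decimal fractions exactly, so a product such as `float(...) * 10000` can in principle land just below the intended integer before `int()` truncates it. As a hedged alternative, the sketch below does the same conversion with `decimal.Decimal`. The name `parse_count` is hypothetical, and only the '万' suffix from the original function is handled.

```python
from decimal import Decimal, InvalidOperation


def parse_count(s: str) -> int:
    """Convert a count such as '1.2万' or '356' to an int; return 0 if unparsable."""
    s = s.strip()
    multiplier = 1
    if s.endswith('万'):  # '万' means ten thousand
        multiplier = 10000
        s = s[:-1]
    try:
        return int(Decimal(s) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        # Empty strings, labels, NaN and infinities all fall back to 0
        return 0
```

For example, `parse_count('1.2万')` gives 12000 and `parse_count('')` gives 0.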

Extract the text and like count from a comment

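def extract(comment_splitted):
    # comment_splitted holds the lines of one comment block's visible text
    try:
        text = comment_splitted[1]
        # The like count sits at a different offset depending on whether
        # the block ends with the '回复' (reply) label
        if comment_splitted[-1] == '回复':
            like = str2num(comment_splitted[-3])
        else:
            like = str2num(comment_splitted[-4])
    except IndexError:
        # Block too short to contain both a text and a like count
        text = ""
        like = 0
    return text, like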

Create a Chrome browser instance and open the user's profile page

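driver = webdriver.Chrome()
driver.get(account_home_page)
# Give the page time to load and leave time to complete any verification pop-up by hand
time.sleep(10)
drop_down("video")
# Wait up to 10 seconds for elements to appear in later lookups
driver.implicitly_wait(10)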

Get the user ID and create a folder named after it for storing the comments

account_id = driver.find_element(By.XPATH, '//*[@id="douyin-right-container"]/div[2]/div/div/div[1]/div[2]/div[1]/h1/span/span/span/span/span/span').text
account_path = './' + account_id
if not os.path.exists(account_path):
    os.mkdir(account_path)
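
A display name taken from the page may contain characters that are not valid in folder names on some platforms. The following is a minimal sketch of one way to guard against that, assuming the name should simply have such characters replaced. `safe_dir_name`, `make_account_dir`, and the `unknown_account` fallback are hypothetical and not part of the original script.

```python
import os
import re


def safe_dir_name(name: str, fallback: str = "unknown_account") -> str:
    """Replace characters that are invalid in file names on common platforms."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', '_', name).strip(' .')
    return cleaned or fallback


def make_account_dir(base_dir: str, account_id: str) -> str:
    """Create (if needed) and return the folder used to store one account's comments."""
    path = os.path.join(base_dir, safe_dir_name(account_id))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path
```

`os.makedirs(..., exist_ok=True)` makes the separate `os.path.exists` check unnecessary and also creates missing parent folders.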

Collect the link and like count of each video, then sort the videos by like count

lis = driver.find_elements(By.CSS_SELECTOR, '.Eie04v01')
url_likes = []
for li in lis:
    url = li.find_element(By.CSS_SELECTOR, 'a').get_attribute('href')
    video_like = str2num(li.find_element(By.XPATH, './div/a/div/span/span').text)
    url_likes.append([url, video_like])

url_likes.sort(key=lambda x : x[-1], reverse=True)
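
The ranking step can be checked on its own with made-up data. In this small sketch, `top_by_likes` is a hypothetical helper and the URLs are placeholders rather than real video links.

```python
def top_by_likes(url_likes, n):
    """Return the n entries with the highest like count, highest first.

    Each entry is a [url, like_count] pair; the input list is left unchanged.
    """
    return sorted(url_likes, key=lambda item: item[-1], reverse=True)[:n]


# Placeholder data for illustration only
videos = [
    ["https://example.com/video/a", 120],
    ["https://example.com/video/b", 45000],
    ["https://example.com/video/c", 980],
]
top_two = top_by_likes(videos, 2)
```

Slicing already copes with `n` being larger than the list, so a separate `min(n, len(url_likes))` is not required.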

Loop over the links of the top videos, scroll each page to load its comments, and write the comments to a file

num = 0
for url, video_like in url_likes[:min(video_num, len(url_likes))]:
    num += 1
    driver.get(url)
    time.sleep(3)
    drop_down("comment")

    comments = driver.find_elements(By.XPATH, '//*[@id="douyin-right-container"]/div[2]/div/div[1]/div[5]/div/div/div[4]/*/div/div[2]/div')
    text_likes = []
    for comment in comments:
        comment_splitted = comment.text.split(sep='\n')
        print(comment_splitted)
        text_likes.append(extract(comment_splitted))

    text_likes.sort(key= lambda x : x[-1], reverse=True)
    try:
        video_name = driver.find_element(By.XPATH, '//*[@id="douyin-right-container"]/div[2]/div/div[1]/div[3]/div/div[1]/div/h2/span').text.split(sep=' ')[0]
    except Exception:
        # Fall back to a placeholder when the title element cannot be found
        video_name = 'unknown video'
    comment_path = account_path + '/' + str(num) + '.txt'
    # The labels written below mean: video name, video likes, video URL, comments, likes
    with open(comment_path, 'w', encoding='utf-8') as f:
        f.write('视频名称:' + video_name + '\n')
        f.write('视频点赞数:' + str(video_like) + '\n')
        f.write('视频地址:' + url + '\n')
        f.write('评论:' + '\n')
        for i in range(min(comment_num, len(text_likes))):
            f.write('点赞数:' + str(text_likes[i][-1]) + '  ' + text_likes[i][0] + '\n')
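
The file-writing step can also be separated from the browser code so that it is testable in isolation. The sketch below assumes the same line layout and labels as the script above. `write_comment_report` and its parameter names are hypothetical.

```python
def write_comment_report(path, video_name, video_like, url, text_likes, comment_num):
    """Write one video's header and its top comments to a UTF-8 text file.

    text_likes is a list of (text, like_count) pairs, already sorted by likes.
    Returns the number of comment lines written.
    """
    selected = text_likes[:comment_num]
    with open(path, 'w', encoding='utf-8') as f:  # closed even if a write fails
        f.write('视频名称:' + video_name + '\n')        # video name
        f.write('视频点赞数:' + str(video_like) + '\n')  # video likes
        f.write('视频地址:' + url + '\n')              # video URL
        f.write('评论:' + '\n')                       # comments
        for text, like in selected:
            f.write('点赞数:' + str(like) + '  ' + text + '\n')
    return len(selected)
```

Using a `with` block means the file is closed even when an exception interrupts the loop, which a bare `open` and `close` pair does not guarantee.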

douyin_comments's People

Contributors

zjrandom951

douyin_comments's Issues

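There seems to be a bug

[screenshot]

The program keeps scrolling down through the videos without limit until it reaches the very bottom, and then raises an error.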

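Error when running the code

Thank you very much for sharing! I ran into some problems while using it and hope you can help.
I downloaded the Google driver, and running the code does bring up a browser window.
(It seems that no verification window popped up, only a login window.)
Once the window appeared, it stayed on the user's profile page and did not open any individual video or scroll down automatically.
Shortly afterwards, the code raised an error: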
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
GetHandleVerifier [0x00007FF6267B7012+3522402]
(No symbol) [0x00007FF6263D8352]
(No symbol) [0x00007FF626285ABB]
(No symbol) [0x00007FF6262CBF0E]
(No symbol) [0x00007FF6262CC08C]
(No symbol) [0x00007FF62630E437]
(No symbol) [0x00007FF6262EF09F]
(No symbol) [0x00007FF62630BDA3]
(No symbol) [0x00007FF6262EEE03]
(No symbol) [0x00007FF6262BF4D4]
(No symbol) [0x00007FF6262C05F1]
GetHandleVerifier [0x00007FF6267E9B9D+3730157]
GetHandleVerifier [0x00007FF62683F02D+4079485]
GetHandleVerifier [0x00007FF6268375D3+4048163]
GetHandleVerifier [0x00007FF62650A649+718233]
(No symbol) [0x00007FF6263E4A3F]
(No symbol) [0x00007FF6263DFA94]
(No symbol) [0x00007FF6263DFBC2]
(No symbol) [0x00007FF6263CF2E4]
BaseThreadInitThunk [0x00007FFB653B7344+20]
RtlUserThreadStart [0x00007FFB66C826B1+33]

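Hello, I'd like to ask how to run the program

[screenshot]

Following the .md instructions, I changed only the user link parameter, and the webdriver version matches. However, once the browser launches, it just scrolls down the profile page automatically and then reports the error shown in the screenshot. Could you explain how this is meant to be used? Thank you!

Comments are empty

Hello! I ran your code. The video name, video like count, and video URL come out fine, but no comments were scraped. Why is that?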
