
A YouTube video history metadata crawler (ytcrawl)

Author: Honglin Yu (email: [email protected])

About

ytcrawl is a crawler for YouTube videos' viewcount history (it can also crawl subscriber, share, and watch-time histories when available). Since mid-2013, YouTube has published videos' daily viewcount history (when uploaders make it public; see the image below). This can be precious data for computational social science research, and this crawler aims to help researchers download it efficiently.

(Figure: dailyViewcount.png, an example of a video's public daily viewcount display on YouTube)

Warning

The first run of single_crawl can be slow because the crawler needs to update cookies, etc. It should be much faster afterwards.

Citation

This crawler was first used in the following project:

Honglin Yu, Lexing Xie and Scott Sanner, "Twitter-driven YouTube Views: Beyond Individual Influencers", to appear in ACM Multimedia 2014

Basic functionalities

  1. There are two ways to use the crawler: single_crawl and batch_crawl.
  2. When using batch_crawl, YouTube's server limits crawling to about 36,000 videos per hour (going faster produces many "quota exceeded" errors). The actual number of downloaded viewcount histories depends on how many videos' data are public (more than 60% in earlier experiments).
  3. It fully logs the crawling status, so it is easy to continue crawling from a breakpoint.
  4. Optionally, it can send you an email to report errors or to announce that a job has finished.

Usage

  1. Put the code somewhere, add its folder to the Python library path, and import Crawler, for example,
>>> import sys
>>> sys.path.append('/path/to/ytcrawl')
>>> from crawler import Crawler
  2. Create a crawler object (how to configure this object is explained in the Configuration section below),
>>> c = Crawler()
  3. Begin to crawl,

  • single_crawl (a short sketch using the returned dictionary follows this list),
# crawl a single video by ID "OQSNhk5ICTI".
# For this video, only the daily viewcount is available
# (the other fields are empty in the output below).
>>> print c.single_crawl("OQSNhk5ICTI")
{'numShare': [], 'numSubscriber': [], 'watchTime': [], 'uploadDate': datetime.date(2010, 1, 9), 'dailyViewcount': [16, 11, 2, 1, 0, ..., 4800, 4891, 4844, 5427, 5505, 5068, 4663]}
  • batch_crawl,
# crawl all the videos listed in the input file.
# The crawled data and log files are stored in "/output/path".
>>> c.batch_crawl("/path/to/videoID/file", "/output/path") 
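
As a quick illustration of working with single_crawl's result, here is a minimal sketch; it uses only the dictionary keys shown in the example output above (note that the ... in that output stands for truncated data, so a real result contains the full list):

>>> data = c.single_crawl("OQSNhk5ICTI")
>>> # total views to date: the sum of the daily viewcount history
>>> print sum(data['dailyViewcount'])
>>> # uploadDate is a datetime.date object
>>> print data['uploadDate'].isoformat()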

Input/Output Format

  • For single_crawl
    • Input : a videoID
    • Output : a dictionary containing all the information that could be crawled
  • For batch_crawl
    • Input videoID file: each line of the input file is a videoID
    • Output folder (a progress-checking sketch follows this list):
      • data folder : contains the XML responses from YouTube, stored in a "hash folder structure". One can extract the information after crawling.
      • key.done : videos whose information has been crawled
      • key.disabled : videos whose statistics are not public (cannot be crawled)
      • key.invalidrequest : if you see many errors here (for nearly all the videos), please report it on GitHub. These videos will be crawled again in the next run.
      • key.nostatyet : no statistics for the video yet, often because the video was newly uploaded
      • key.quotalimit : crawling was too fast and exceeded YouTube's quota. These videos can be crawled again in the next run.
      • key.private : the video is private (cannot be crawled)
      • key.notfound : the videoID is wrong or the video has disappeared from YouTube
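
For example, here is a minimal sketch (not part of ytcrawl, and assuming the key.* log files list one videoID per line, matching the input file format) that reports which videos from the input file have not been crawled successfully yet:

import os

def remaining_videos(videoid_file, output_path):
    # all requested videoIDs: one per line in the input file
    with open(videoid_file) as f:
        requested = set(line.strip() for line in f if line.strip())
    # videoIDs already crawled successfully, as logged in key.done
    done = set()
    done_file = os.path.join(output_path, 'key.done')
    if os.path.exists(done_file):
        with open(done_file) as f:
            done = set(line.strip() for line in f if line.strip())
    return requested - done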

Configuration

One should configure the following parameters before crawling,

c = Crawler()
c._crawl_delay_time = 1 # set the crawling delay to 1 second
c._cookie_update_delay_time = 1 # set the cookie-updating delay to 1 second
  • _crawl_delay_time: the delay between requests (in seconds). Default: 0.1, i.e. 10 requests per second, which corresponds to the roughly 36,000-videos-per-hour limit mentioned above.
  • _cookie_update_delay_time: the delay between cookie updates (in seconds). Default: 0.1.
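
For example, if many videos end up in key.quotalimit, a larger delay may help; 0.5 seconds here is only an illustrative value, not a recommendation from the author:

>>> c = Crawler()
>>> c._crawl_delay_time = 0.5 # 2 requests per second
>>> c._cookie_update_delay_time = 0.5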

Continue from Breakpoint

Just rerun the crawler with the same parameters; it automatically skips the videos that have already been crawled.
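
For example, rerunning the batch_crawl call from the Usage section resumes the job; videos already recorded in key.done are skipped:

>>> c = Crawler()
>>> c.batch_crawl("/path/to/videoID/file", "/output/path")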

Optional: Email Reminder

As noted under "Basic functionalities", the crawler can optionally send you an email to report errors or to announce that a job has finished.

LICENSE

BSD 3-Clause
