youtube-recommendations's People

Contributors

cwalker4

youtube-recommendations's Issues

MBFC scraper is broken

scripts/data_preparation/scrape_mbfc.py no longer works. This is non-urgent, but the script should eventually be updated to handle the page change that likely broke it.

Reconstruct full BFS tree from truncated representation

We store our crawl data in a table with the following schema (mod some key handling):

CREATE TABLE recommendations (
  video_id text NOT NULL,
  search_id integer NOT NULL,
  recommendation text,
  depth integer
);

e.g. SELECT * FROM recommendations LIMIT 10 gives:

8wK7ZyxdELM|1|i5uB9ERXG3o|0
8wK7ZyxdELM|1|h8ftTlzYev0|0
8wK7ZyxdELM|1|DbypJZprPT4|0
8wK7ZyxdELM|1|siyW0GOBtbo|0
i5uB9ERXG3o|1|_mEHfrd43gc|1
i5uB9ERXG3o|1|QlaeirHJpns|1
i5uB9ERXG3o|1|jL8uDJJBjMA|1
i5uB9ERXG3o|1|0nCT8h8gO1g|1
h8ftTlzYev0|1|lpdiA8t8djw|1
h8ftTlzYev0|1|cnpe7d7bBRI|1
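
For anyone who wants to experiment with this, here is a minimal sketch that loads the ten sample rows above into an in-memory table and pulls out the rows for one search. It assumes SQLite via Python's built-in sqlite3 module (the issue does not say which database the project uses), and rows_for_search is a hypothetical helper, not something from the repo.

```python
import sqlite3

# Assumption: SQLite; the real project may use a different database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recommendations (
      video_id text NOT NULL,
      search_id integer NOT NULL,
      recommendation text,
      depth integer
    )
    """
)

# The ten sample rows shown in the issue.
sample = [
    ("8wK7ZyxdELM", 1, "i5uB9ERXG3o", 0),
    ("8wK7ZyxdELM", 1, "h8ftTlzYev0", 0),
    ("8wK7ZyxdELM", 1, "DbypJZprPT4", 0),
    ("8wK7ZyxdELM", 1, "siyW0GOBtbo", 0),
    ("i5uB9ERXG3o", 1, "_mEHfrd43gc", 1),
    ("i5uB9ERXG3o", 1, "QlaeirHJpns", 1),
    ("i5uB9ERXG3o", 1, "jL8uDJJBjMA", 1),
    ("i5uB9ERXG3o", 1, "0nCT8h8gO1g", 1),
    ("h8ftTlzYev0", 1, "lpdiA8t8djw", 1),
    ("h8ftTlzYev0", 1, "cnpe7d7bBRI", 1),
]
conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?, ?)", sample)


def rows_for_search(conn, search_id):
    """Hypothetical helper: (video_id, recommendation, depth) rows for one search."""
    cur = conn.execute(
        "SELECT video_id, recommendation, depth FROM recommendations "
        "WHERE search_id = ? ORDER BY depth, rowid",
        (search_id,),
    )
    return cur.fetchall()


rows = rows_for_search(conn, 1)
```

Working one search_id at a time keeps each crawl's tree separate, which matters because a video may have been crawled in more than one search.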

For efficiency reasons our crawler does not fetch the recommendations for a video if we have already seen it during the crawl. As a result, the "tree" represented in recommendations is truncated: each video_id is expanded only at the first depth where it appears. The implicit assumption is that the recommendations associated with any particular video_id do not change in the course of the crawl. For certain analyses, however, we might like to have access to the full tree. The question is: what is the best way to reconstruct it?

Let's take a small example: suppose we have the following tree:

a
|
|--a
|  |--a
|  |--b 
|
|--b
   |--c
   |--d

This would be stored in recommendations (omitting the search_id column) as:

a|a|0
a|b|0
b|c|1
b|d|1

But the tree that this table represents is truncated:

a
|
|--a
|
|--b
   |--c
   |--d
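
To make the truncation concrete, here is a toy sketch of a breadth-first crawl that skips videos it has already seen. It is not the project's actual crawler: crawl, get_recommendations, and toy_graph are hypothetical names, and the toy graph just encodes the small example above.

```python
from collections import deque


def crawl(root, get_recommendations, max_depth):
    """Toy BFS crawl that expands each video at most once (hypothetical sketch)."""
    seen = {root}
    rows = []
    queue = deque([(root, 0)])
    while queue:
        video_id, depth = queue.popleft()
        for rec in get_recommendations(video_id):
            rows.append((video_id, rec, depth))
            # Only expand a recommendation the first time we see it.
            if rec not in seen and depth + 1 <= max_depth:
                seen.add(rec)
                queue.append((rec, depth + 1))
    return rows


# Toy recommendation graph for the example: a -> a, b and b -> c, d.
toy_graph = {"a": ["a", "b"], "b": ["c", "d"], "c": [], "d": []}
truncated = crawl("a", lambda v: toy_graph.get(v, []), max_depth=1)
```

With max_depth=1 this yields the four stored rows from the example: the second visit to a at depth 1 is never expanded because a is already in seen.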

In this example, the desired output of some script untruncate.(py?R?) would be:

a|a|0
a|b|0
a|a|1
a|b|1
b|c|1
b|d|1

This problem becomes less dumb when the example isn't a self-loop (in most cases we truncate a path when we land at a video we have seen before). Not a very interesting substantive question, but a data wrangling task that I'm not quite sure how to approach.
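
One possible approach, sketched in Python under stated assumptions: since each video_id is expanded only once, the truncated table already gives a lookup from video to its recommendations, and the full tree can be regenerated by replaying the BFS from the depth-0 video without the "seen" check. The function name untruncate and the tuple layout (video_id, recommendation, depth) are assumptions for illustration; rows are assumed to come from a single search_id, and the recommendations for a video are assumed not to change during the crawl (as stated above).

```python
from collections import defaultdict


def untruncate(rows):
    """Expand truncated (video_id, recommendation, depth) rows into the full tree.

    Assumes all rows belong to one search_id and that each video's
    recommendations were recorded exactly once.
    """
    # Lookup: video_id -> recommendations, in stored order.
    children = defaultdict(list)
    for video_id, recommendation, _depth in rows:
        if recommendation is not None:  # the column is nullable in the schema
            children[video_id].append(recommendation)

    # Roots are the videos expanded at depth 0.
    roots = []
    for video_id, _recommendation, depth in rows:
        if depth == 0 and video_id not in roots:
            roots.append(video_id)
    max_depth = max((depth for _v, _r, depth in rows), default=-1)

    # Replay the BFS level by level, keeping repeated videos in the frontier.
    full = []
    frontier = roots
    for depth in range(max_depth + 1):
        next_frontier = []
        for video_id in frontier:
            for recommendation in children.get(video_id, []):
                full.append((video_id, recommendation, depth))
                next_frontier.append(recommendation)
        frontier = next_frontier
    return full


example = [("a", "a", 0), ("a", "b", 0), ("b", "c", 1), ("b", "d", 1)]
expanded = untruncate(example)
```

On the small example this reproduces the six desired rows. One caveat: the full tree grows roughly as (branching factor)^depth, so for deep crawls it may be better to yield rows lazily or write them straight back to the database rather than hold the whole list in memory.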
