
youtube-embeddings

Youtube topics

The goal of this repository is to create embeddings for YouTube channels. These embeddings can be used as-is for content similarity, and can also be used to extract social dimensions.
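As a rough illustration of using embeddings "as-is" for content similarity, the sketch below ranks channels by cosine similarity. The channel names and vectors are made up for the example, and `cosine_similarity` and `most_similar` are hypothetical helpers, not functions from this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, k=3):
    """Return the k channels closest to `query`, excluding `query` itself."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors (assumption: real embeddings have many more dimensions).
embeddings = {
    "cooking_a": [0.9, 0.1, 0.0],
    "cooking_b": [0.8, 0.2, 0.1],
    "gaming_a": [0.0, 0.2, 0.9],
}
print(most_similar("cooking_a", embeddings, k=2))
```

With real data, the toy dictionary would be replaced by vectors loaded from the published embedding files, whose format is not described here.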

Types of Embeddings

We propose three types of embeddings.

  • Social Sharing / Reddit embedding: made from shares of YouTube videos on Reddit (using Pushshift data).
  • Content embedding: made from video titles and descriptions, fed through a Sentence Transformer.
  • Recommendation embedding: made by recording the recommendations YouTube serves to a user with no watch history, then computing a node embedding on the resulting graph.
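To make the social sharing idea more concrete, here is a minimal sketch of one possible first step: counting how often each channel is shared in each subreddit. This is an assumption for illustration only. The actual procedure lives in the notebooks and may differ, and `share_count_vectors` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def share_count_vectors(shares):
    """Build one count vector per channel from (subreddit, channel) pairs.

    Each pair stands for one video of `channel` shared in `subreddit`.
    Returns the sorted subreddit names (the vector columns) and a dict
    mapping each channel to its list of share counts.
    """
    counts = defaultdict(Counter)
    subreddits = set()
    for subreddit, channel in shares:
        counts[channel][subreddit] += 1
        subreddits.add(subreddit)
    columns = sorted(subreddits)
    vectors = {
        channel: [per_sub[s] for s in columns]
        for channel, per_sub in counts.items()
    }
    return columns, vectors


# Made-up share records, for illustration only.
shares = [
    ("r/cooking", "chan_a"),
    ("r/cooking", "chan_a"),
    ("r/food", "chan_a"),
    ("r/gaming", "chan_b"),
]
columns, vectors = share_count_vectors(shares)
print(columns)
print(vectors)
```

Raw counts like these are usually only a starting point. Channels shared in similar communities end up with similar vectors, and a dimensionality reduction or embedding model can then compress them.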

The embeddings for our filtered set of 40K channels are available in the embeds/ folder.

Similarly, the social dimensions are available in the dims/ folder.

Recreating embeddings

Create a conda (or mamba) environment with conda env create -f environment.yml. This creates an environment named ytb containing all the libraries needed to run the code. If you get an error, you may need to upgrade conda first (conda upgrade conda).

The repository uses jupytext for notebook version control, so notebooks are saved in Markdown format, which keeps them readable on GitHub and strips their output.

All of the notebooks for recreating the embeddings are in the generate_embeddings/ folder. Please note that getting everything to run will require some work. Notably, the notebooks assume you have already extracted all links to YouTube from Reddit comments and submissions (the PySpark code for extracting them is not public, at least not yet).
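Since the extraction code is not public, the following is a minimal single-machine sketch of pulling YouTube video IDs out of Reddit text with a regular expression. It is not the author's PySpark code. The pattern `YOUTUBE_URL` and the helper `extract_video_ids` are hypothetical, and the sketch only handles the common `youtube.com/watch?v=` and `youtu.be/` URL forms.

```python
import re

# Assumption: video IDs are 11 characters from [A-Za-z0-9_-].
YOUTUBE_URL = re.compile(
    r"https?://(?:www\.|m\.)?"
    r"(?:youtube\.com/watch\?(?:[^\s#]*&)?v=|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)


def extract_video_ids(text):
    """Return the YouTube video IDs found in `text`, in order of appearance."""
    return [match.group(1) for match in YOUTUBE_URL.finditer(text)]


comment = (
    "Great video: https://www.youtube.com/watch?v=dQw4w9WgXcQ "
    "(short link: https://youtu.be/dQw4w9WgXcQ?t=3)"
)
print(extract_video_ids(comment))
```

This yields video IDs rather than channels. Mapping each video to its channel needs additional metadata, which is outside the scope of this sketch.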


Unfortunately, the Pushshift dumps currently appear to be inaccessible at https://files.pushshift.io/ (although a torrent seems to remain available), and according to this post, Reddit revoked Pushshift's access, so more recent posts cannot be included in datasets.

Notebook & methods diagram

[Diagram: notebooks and methods]

Contributors

boesingerl
