
yt-cleanse's Introduction

Introduction

Crawl a speech corpus from YouTube with yt-dlp.

Step 1. Download

  • mkdir ttv
  • bash download.sh TTV https://www.youtube.com/@TTV_NEWS
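    • Argument 1: prefix
      • Prefix for the audio file names
    • Argument 2: url
      • YouTube URL passed directly to yt-dlp
      • Intended mainly for crawling a channel
      • A search query also works, but it tends to pull in a lot of noisy data
  • Only the audio files are saved
  • Videos are filtered by subtitles; by default, those with Chinese subtitles are targeted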
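The download behaviour described above (audio only, filtered by subtitle language) could be approximated as follows. This is a minimal sketch and not the contents of download.sh: the function names, the `wav` target format, the output naming pattern, and the `zh.*` language regex are all assumptions made for illustration. Only the yt-dlp options themselves (`--extract-audio`, `--audio-format`, `--write-subs`, `--sub-langs`, `--output`) are real flags.

```python
from pathlib import PurePath


def build_ytdlp_command(prefix: str, url: str, out_dir: str = "ttv",
                        sub_langs: str = "zh.*") -> list[str]:
    """Build a yt-dlp argument list that keeps audio plus matching subtitles."""
    return [
        "yt-dlp",
        "--extract-audio",          # keep the audio track only
        "--audio-format", "wav",    # assumed target format
        "--write-subs",             # also fetch uploaded subtitles
        "--sub-langs", sub_langs,   # regex of subtitle languages to fetch
        # Assumed naming scheme: <out_dir>/<prefix>_<video id>.<ext>
        "--output", f"{out_dir}/{prefix}_%(id)s.%(ext)s",
        url,
    ]


def audio_with_subtitles(filenames: list[str],
                         lang_prefix: str = "zh") -> list[str]:
    """Return the .wav files that have a subtitle file in the wanted language."""
    subtitled = set()
    for name in filenames:
        parts = PurePath(name).name.split(".")
        # yt-dlp names subtitle files <stem>.<lang>.<ext>, e.g. TTV_abc.zh-TW.vtt
        if (len(parts) >= 3 and parts[-1] in {"vtt", "srt"}
                and parts[-2].startswith(lang_prefix)):
            subtitled.add(".".join(parts[:-2]))
    return [n for n in filenames
            if n.endswith(".wav") and PurePath(n).name[:-4] in subtitled]
```

The argument list can be handed to `subprocess.run(cmd, check=True)`; `audio_with_subtitles` then takes a directory listing and keeps only the audio files that came with a Chinese subtitle track.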

Step 2. Clean

  • bash run.sh --stage 0 --stop-stage 2 db datadir formatted
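    • Argument 1: db
      • Directory downloaded by download.sh
    • Argument 2: datadir
      • Directory for the cleaned data; a Kaldi data dir is generated here
    • Argument 3: formatted dir
      • Directory for audio converted with ESPnet; deprecated
    • Optional argument: stage
      • 1: Clean the data downloaded by yt-dlp
        • see: main.py
      • 2: Text normalization
  • Stages of main.py
    • 1: Detect the language with Whisper
    • 2: Filter based on the subtitles and the Whisper detection results
      • Drop a number of segments from the beginning/end, as set by parameters
      • Drop videos whose subtitles disagree with the Whisper detection result
      • Drop videos whose audio length and text length differ too much
      • Drop videos in which the proportion of the main language (i.e. zh) is too low
    • 3: Compile the summary table info.txt
    • 4: Build the Kaldi data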
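The filtering rules of main.py stage 2 can be illustrated with a small per-video check. This is a sketch under stated assumptions, not the code in main.py: the `Video` record, the `keep_video` function, the characters-per-second measure of audio/text mismatch, and every threshold value are hypothetical. The per-segment language labels are assumed to come from the Whisper detection in stage 1.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Video:
    video_id: str
    subtitle_lang: str        # language tag of the subtitles, e.g. "zh-TW"
    segment_langs: list[str]  # detected language per segment, e.g. "zh"
    audio_seconds: float      # total audio duration
    text_chars: int           # total subtitle text length in characters


def keep_video(v: Video, main_lang: str = "zh",
               trim_head: int = 0, trim_tail: int = 0,
               min_main_ratio: float = 0.8,
               min_cps: float = 1.0, max_cps: float = 10.0) -> bool:
    """Return True if the video passes all (assumed) filtering rules."""
    # Rule 1: ignore a configurable number of leading/trailing segments.
    end = max(trim_head, len(v.segment_langs) - trim_tail)
    langs = v.segment_langs[trim_head:end]
    if not langs or v.audio_seconds <= 0:
        return False

    # Rule 2: the subtitle language must agree with the dominant detected one.
    dominant = Counter(langs).most_common(1)[0][0]
    if not v.subtitle_lang.lower().startswith(dominant.lower()):
        return False

    # Rule 3: audio length and text length must be roughly proportional,
    # measured here as characters per second (an assumed metric).
    chars_per_second = v.text_chars / v.audio_seconds
    if not (min_cps <= chars_per_second <= max_cps):
        return False

    # Rule 4: the main language must cover enough of the remaining segments.
    main_ratio = sum(lang == main_lang for lang in langs) / len(langs)
    return main_ratio >= min_main_ratio
```

Trimming matters because channel intros and outros are often in a different language or contain only music; removing them before computing the ratios keeps otherwise clean videos from being rejected.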

Contributors

kakushawn
