vocadb-video-downloader-new's Introduction

vocadb-video-downloader-new

An integrated CLI-based media archiving solution for VocaDB. It can:

  1. read a VocaDB favourite list and save it as a folder of JSON files (by calling VocaDB APIs)
  2. download the PV or audio for each song in the favourite list (using the output from 1. as input)
  3. extract the audio track from the PV if necessary, and add the thumbnail and tags to the audio track (using the output from 2. as input)

These three steps are implemented, respectively, as:

  1. vvd-taskproducer
  2. vvd-downloader
  3. vvd-extractor

All 3 modules share a common document. Please read it first: Common document

Then, for each module, please read its own README.md file for more details.

vocadb-video-downloader-new's People

Contributors

cxwudi, dependabot[bot], github-actions[bot], renovate[bot]

vocadb-video-downloader-new's Issues

shall we use Kotlin?

  • getter/setter problem - Kotlin call from Java link
  • configuration properties data class - using init {} and the primary constructor should work
  • Gradle? No, use Maven to speed up the IDE's project build time
  • JSON parsing: tree node approach, data class by map approach, or swagger with our own fixing approach?
    • if tree node, can we have the [] operator working as getter and setter?
    • if data class by map, how do we deal with fields? There is no tool to generate that kind of class. Maybe we could write an issue for a JSON to Kotlin plugin
    • if swagger, how is the maintainability? Can it save time compared with the other two approaches?
    • An: swagger approach. It is good at having enums and a proper time format. By doing our own fix, we help to make a great open-source VocaDB client.
  • Kotlin has special definitions for properties, so do Java getters return Type? or Type? It is Type; the honor project PoC proved it
  • but with -Xjsr305=strict, all @Nullable Java functions (including getters) will return Type? in Kotlin, which is sometimes useful and sometimes annoying
    • An: depending on the situation, either write orThrow() or just !!
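
The two options in the answer above can be sketched as follows. This is only an illustration: `orThrow` here is a hypothetical extension function (it is not part of the Kotlin standard library), and `SongDto` is a made-up stand-in for a generated model whose getter Kotlin sees as `String?` under `-Xjsr305=strict`.

```kotlin
// Hypothetical stand-in for a Java DTO whose getter is annotated @Nullable,
// so Kotlin sees the property as String? when -Xjsr305=strict is enabled.
class SongDto(val name: String?)

// Hypothetical helper: fail with a descriptive message instead of a bare NPE.
fun <T : Any> T?.orThrow(fieldName: String): T =
    this ?: throw IllegalStateException("$fieldName should not be null")

// Descriptive failure, useful when the value comes from a remote API.
fun nameOrThrow(song: SongDto): String = song.name.orThrow("name")

// Terse variant: throws NullPointerException when name is null.
fun nameUnsafe(song: SongDto): String = song.name!!
```

The helper costs one extra call but reports which field was missing, while `!!` is shorter and fits places where a null would be a programming error anyway.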

features about downloading pvs

What features do we support, and why?

  • all features from vvd-downloader-old, including:
    • global retry mechanism
    • a different downloader for each PV service
    • custom command line for launching a downloader
  • a retry setting on each single downloader, instead of a global retry setting
    • this allows us to use a downloader that can continue after a failure, by setting a large value
      • but this will wrongly make other exceptions retry again and again unnecessarily
    • An: if a downloader relies on a retry-and-continue mechanism to successfully download a PV, that downloader must handle the retry-and-continue mechanism internally, instead of relying on the global retry mechanism
  • retry mechanism in #12
  • dynamic downloader loading mechanism in #13
  • if pv-preference doesn't list a PV service, filter out all PVs from that service; there is no need to load downloaders for that PV service
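
The pv-preference rule in the last bullet could look roughly like this. It is a minimal sketch: the `Pv` class and both function names are assumptions made for illustration, and PV services are modelled as plain strings.

```kotlin
// Hypothetical, simplified PV model; the real one comes from the VocaDB DTOs.
data class Pv(val service: String, val url: String)

// Drop every PV whose service is not listed in pv-preference.
fun filterByPreference(pvs: List<Pv>, pvPreference: List<String>): List<Pv> =
    pvs.filter { it.service in pvPreference }

// Only services that appear in pv-preference need their downloaders loaded.
fun servicesToLoad(knownServices: Set<String>, pvPreference: List<String>): Set<String> =
    knownServices intersect pvPreference.toSet()
```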

refactor project to use "-label.json" as a marker

  • instead of scanning for -songInfo.json, we should use a dedicated JSON file holding all the other info the next module needs to proceed. This JSON also includes every filename associated with the song (including the songInfo.json file)
  • this can also fix another issue: to implement an advanced audio extractor choosing strategy, we need information from the previous module
  • don't write too many util functions for labels; instead, each module has its own helpers (reason: better extensibility)
  • also, don't be confused: this label JSON records the necessary info and parameters that the next module needs
    • it is not the aggregated parameter object in #8
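
As a rough illustration of the marker idea, the sketch below models a label and finds label files by suffix. The field names in `Label` are hypothetical; the notes above only say that the label lists the filenames associated with a song plus whatever the next module needs.

```kotlin
// Hypothetical label content; the field names are illustrative assumptions.
data class Label(
    val songInfoFileName: String,
    val pvFileName: String? = null,
    val thumbnailFileName: String? = null,
)

// Suffix used as the marker, taken from the issue title.
val labelSuffix = "-label.json"

// The next module scans for label files instead of -songInfo.json files.
fun labelFilesIn(fileNames: List<String>): List<String> =
    fileNames.filter { it.endsWith(labelSuffix) }

// Recover the shared base name of all files belonging to one song.
fun baseNameOf(labelFileName: String): String =
    labelFileName.removeSuffix(labelSuffix)
```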

documentation

Let's worry about documentation only once we are ready to release this product.

How to handle error path with easy-batch

Problem: easy-batch doesn't support a Listener class on a single intermediate processor, so how do we handle the error path?
Also, for writers, since it already groups records in a Batch, we can't do much about it.

Possible solutions:

  • just write the error path logic inside the normal path business code
    • use an ErrHandler class
    • should we use Listener and rely on throwing exceptions to guide to the error path, or should we just use conditions to guide to the error path? An: long functions will be split, and it will cause a return value headache 😫
    • don't write a delegate RecordProcessor class with exception handling. It probably won't work for all scenarios
  • let the writer class be implemented as a RecordProcessor class, and use Listener to handle exceptions.
    • This is not a good usage of easy-batch, although it can solve the problem easily, because with batch writing we could support writing to a DB or cache if we want
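
The first option, an ErrHandler called from the normal business code, might be sketched like this. Everything here is an assumption made for illustration: the `ErrHandler` shape, the `CollectingErrHandler` implementation, and the `processOrNull` wrapper are not taken from the project or from easy-batch.

```kotlin
// Hypothetical error-path hook: receives the failed record and the cause.
fun interface ErrHandler<T> {
    fun handle(record: T, error: Exception)
}

// Example handler that only remembers failures; a real one might log them
// or write them to a dedicated error folder.
class CollectingErrHandler<T> : ErrHandler<T> {
    val failures = mutableListOf<Pair<T, Exception>>()
    override fun handle(record: T, error: Exception) {
        failures += record to error
    }
}

// Run one processing step; on failure, route the record to the error path
// and return null so the caller can skip the remaining steps for it.
fun <I, O : Any> processOrNull(record: I, errHandler: ErrHandler<I>, step: (I) -> O): O? =
    try {
        step(record)
    } catch (e: Exception) {
        errHandler.handle(record, e)
        null
    }
```

Returning null keeps the error path out of the return type of each business function, which is one way to limit the return value headache mentioned above.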

Deprecate vocadb-openapi-client-java in favour of vocadb-openapi-dto-java

Instead of generating the whole client, generating just the DTOs is better and allows users to use whatever client they want.

However, it turns out that our hack of allowing an Enum class whose name ends with "s" to hold multiple unique enum values is necessary. In an API call such as GET /api/songs with the fields parameter, the value must be a comma-separated list of unique elements, like fields=tags,pvs. Using something else like fields=tags&fields=pvs won't work.

We are also tied to Jackson because the model is problematic with Gson, see https://github.com/CXwudi/vocadb-openapi-java-client-autofixer#before-generate
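
A DTO-only user would then have to build the comma-separated value themselves. The sketch below shows one way to do it; the enum `SongOptionalField`, its constants, and both function names are hypothetical and only mirror the `fields=tags,pvs` example above.

```kotlin
// Hypothetical enum of optional fields; apiName is the text sent to the API.
enum class SongOptionalField(val apiName: String) {
    TAGS("tags"),
    PVS("pvs"),
    ARTISTS("artists"),
}

// A Set guarantees unique elements; joinToString produces the single
// comma-separated value the API expects.
fun fieldsQueryValue(fields: Set<SongOptionalField>): String =
    fields.joinToString(",") { it.apiName }

// Builds only the path and query; the HTTP client is left to the user.
fun songsPath(fields: Set<SongOptionalField>): String =
    "/api/songs?fields=${fieldsQueryValue(fields)}"
```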

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Pending Branch Automerge

These updates await pending status checks before automerging. Click on a checkbox to abort the branch automerge, and create a PR instead.

  • ⬆ upgrade spring boot to v3.2.5 (org.springframework.boot:spring-boot-maven-plugin, org.springframework.boot:spring-boot-dependencies)

Detected dependencies

docker-compose
docker/docker-compose.base.yml
docker/docker-compose.debug-test-all.yml
docker/docker-compose.test-all.yml
dockerfile
docker/env-setup.Dockerfile
  • debian 12-slim
  • eclipse-temurin 21
github-actions
.github/workflows/test.yml
  • actions/checkout v4
  • actions/setup-java v4
  • jpribyl/action-docker-layer-caching v0.1.1
maven
pom.xml
  • org.springframework.boot:spring-boot-dependencies 3.2.4
  • org.jetbrains.kotlin:kotlin-bom 1.9.23
  • org.jetbrains.kotlin:kotlin-reflect 1.9.23
  • org.jetbrains.kotlin:kotlin-stdlib 1.9.23
  • com.github.VocaDB:vocadb-openapi-client-java 1.2.2
  • com.github.CXwudi:kotlin-jvm-inline-logging 1.0.1
  • com.github.CXwudi:kotlin-jvm-idiomatic-exec 1.1.0
  • org.jeasy:easy-batch-core 7.0.2
  • org.apache.tika:tika-core 2.9.2
  • io.kotest:kotest-runner-junit5-jvm 5.8.1
  • io.kotest.extensions:kotest-extensions-spring 1.1.3
  • com.lmax:disruptor 3.4.4
  • com.ninja-squad:springmockk 4.0.2
  • org.jetbrains.kotlin:kotlin-maven-plugin 1.9.23
  • org.jetbrains.kotlin:kotlin-maven-allopen 1.9.23
  • org.springframework.boot:spring-boot-maven-plugin 3.2.4
  • org.apache.maven.plugins:maven-compiler-plugin 3.13.0
  • org.apache.maven.plugins:maven-surefire-plugin 3.2.5
vvd-common/pom.xml
vvd-commonkt/pom.xml
vvd-downloader/pom.xml
vvd-extractor/pom.xml
vvd-taskproducer/pom.xml
maven-wrapper
.mvn/wrapper/maven-wrapper.properties
  • maven 3.9.6
  • maven-wrapper 3.2.0

Use docker to setup test&production env

This will allow us to run tests with various dependencies on the CI server.

We will create a single Dockerfile that adds all dependencies into one Docker image, and a base template docker-compose file for mounting either the project source code or the jar packages for testing and running.

Optionally, we can have another docker-compose file for building the jars.

Remove easy-batch, but observe the batch pattern

Since #3, we only use the reader, processor and writer classes, and we build and run the batch processing manually.

So there is no need to explicitly use easy-batch's reader, processor and writer classes, as they introduce too much overhead from wrapping and unwrapping the Record class.
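
Keeping the batch pattern without the library can be as small as the sketch below. It assumes that plain function types are enough for the reader, processor and writer roles; `runBatch` and its parameters are illustrative names, not the project's actual API.

```kotlin
// Reader, processor and writer are plain Kotlin types, so nothing has to be
// wrapped into or unwrapped from a Record class.
fun <I, O> runBatch(
    reader: Sequence<I>,
    processor: (I) -> O,
    writer: (List<O>) -> Unit,
    batchSize: Int = 2,
) {
    reader
        .map(processor)      // process records lazily, one by one
        .chunked(batchSize)  // group processed records into batches
        .forEach(writer)     // hand each batch to the writer
}
```

Because the sequence is lazy, a batch is written as soon as it is full rather than after all records have been processed.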

Solve conflict on file naming

The current naming is [vocalist]song name[producers], but if a producer self-remixes a song with the same vocalist, the two songs will have exactly the same name, and their -label.json and -songInfo.json files will overwrite each other.

Solution: use [vocalist]song name[producers][vocadb-id] as the name
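
A small sketch of the new naming scheme follows. The function name, the parameter names, and the ", " separator between multiple vocalists or producers are assumptions; only the overall `[vocalist]song name[producers][vocadb-id]` layout comes from the solution above.

```kotlin
// Appending the VocaDB id makes the name unique even when the vocalists,
// song name and producers are all identical.
fun baseFileName(
    vocalists: List<String>,
    songName: String,
    producers: List<String>,
    vocadbId: Int,
): String =
    "[${vocalists.joinToString(", ")}]${songName}[${producers.joinToString(", ")}][${vocadbId}]"
```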

Improve how PV Type is handled in pvTaskDecider

Currently we just use a fixed order of PV types, like this:


  companion object {

    private val pvTypeToIntMap = mapOf(
      PVType.ORIGINAL to 0,
      PVType.OTHER to 1,
      PVType.REPRINT to 2
    )

    /**
     * to make sure that reprinted types goes behind original and others type
     */
    private val pvTypeComparator = ToIntFunction<PVContract> {
      requireNotNull(
        pvTypeToIntMap[requireNotNull(it.pvType) { "the pv type is null for ${it.name}?" }]
      ) { "More PV Type?" }
    }
  }

Let's make it configurable by changing this setting:

[vocadb-video-downloader-new] vvd-downloader/src/main/resources/application.yml (Lines 22-23)


    try-reprinted-pv: true #to improve: can let users decided on pv type, similar to how pv-preference works, make sure add similar validation as pv-preference too
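
One possible shape for a configurable order is a comparator built from a user-supplied list, with validation similar in spirit to pv-preference. This is a sketch under assumptions: `PvType` and `Pv` are simplified stand-ins for the real `PVType` and `PVContract`, and `pvTypeComparator` is a hypothetical function.

```kotlin
// Simplified stand-ins for the generated PVType and PVContract classes.
enum class PvType { ORIGINAL, OTHER, REPRINT }
data class Pv(val name: String, val pvType: PvType?)

// Build the comparator from the configured order instead of a fixed map.
fun pvTypeComparator(configuredOrder: List<PvType>): Comparator<Pv> {
    require(configuredOrder.isNotEmpty()) { "at least one PV type is required" }
    require(configuredOrder.toSet().size == configuredOrder.size) {
        "duplicate PV type in $configuredOrder"
    }
    // Position in the configured list becomes the sort rank.
    val rank = configuredOrder.withIndex().associate { (index, type) -> type to index }
    return compareBy<Pv> { pv ->
        val type = requireNotNull(pv.pvType) { "the pv type is null for ${pv.name}?" }
        requireNotNull(rank[type]) { "PV type $type is not in the configured order" }
    }
}
```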

Use easy-batch's reader, processor, and writer classes

  • no spring-batch; in spring-batch, you need a job repo to store steps
  • with easy-batch, follow #10 on how to handle exceptions in intermediate steps and in the writing step
  • don't use easy-batch's JobBuilder and JobExecutor; instead, we run the reader, all processors, and the writer ourselves, using our favourite framework (currently Kotlin coroutines for parallel processing, and a simple forEach for iterative processing)

An even better downloader enablement mechanism

Currently, there is no way for us to configure two instances of the same downloader with different command lines, even for the same PV service.

So in config.downloader, instead of having each PV service list all its available downloaders, let it be a list of named configs. The enablement will then use these names to give the order of downloaders for that PV service.

[vocadb-video-downloader-new] vvd-downloader/src/main/resources/application.yml (Lines 37-65)


  enablement:
    NicoNicoDouga:  # enable downloaders from NicoNicoDouga's available downloaders. e.g. nndownload
    Youtube:
    Bilibili:


  downloader:
    NicoNicoDouga:
      # settings of youtube-dl/yt-dlp for downloading niconico videos
      # settings must make sure nothing blocks from downloading the video (e.g. don't put --version on yt-dlp)
      # do the same for all other downloaders.
      youtube-dl:
        launch-cmd:
        external-args:

      nndownload:
        launch-cmd:
        external-args:

    Youtube:
      youtube-dl:
        launch-cmd:
        external-args:

    Bilibili:
      youtube-dl:
        launch-cmd:
        external-args:
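
The proposed list of named configs could be modelled as below. This is a sketch only: `NamedDownloaderConfig`, its fields, and `orderedConfigsFor` are hypothetical, and the enablement is assumed to map a PV service to an ordered list of config names.

```kotlin
// Hypothetical named config: two entries may launch the same downloader
// with different command lines, as long as their names differ.
data class NamedDownloaderConfig(
    val name: String,
    val launchCmd: List<String>,
    val externalArgs: List<String> = emptyList(),
)

// Resolve the enablement entry of one PV service into configs, keeping the
// order in which the names were listed.
fun orderedConfigsFor(
    pvService: String,
    enablement: Map<String, List<String>>,
    configs: List<NamedDownloaderConfig>,
): List<NamedDownloaderConfig> {
    val byName = configs.groupBy { it.name }
    require(byName.values.all { it.size == 1 }) { "downloader config names must be unique" }
    return enablement[pvService].orEmpty().map { name ->
        requireNotNull(byName[name]) { "no downloader config named '$name'" }.single()
    }
}
```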

only load downloader if needed

When we have more and more downloader choices in the future, we can't force users to fill in the config of every downloader, so we need:

  • multiple downloaders for one PV service
  • validation that a downloader for one PV service cannot be used for another PV service
  • loading downloaders based on enablement and pv-preference
    • will probably use Spring's @ConditionalOnBean to check if the corresponding downloader config exists
      • BIG NOTICE: based on the Spring Doc, Spring Boot issue 1 and issue 2, any @ConditionalOn annotations that rely on knowledge about Spring beans will NOT work outside of Spring auto-configuration (PS: @ConditionalOnProperty would probably work)
    • the downloader config bean is @ConditionalOnExpression, checking whether its name is in the enablement config map
      • because of the restriction above, SpEL cannot resolve properties (except @Value), so we need another solution to load only enabled downloaders
    • other solutions, see below comments
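
One framework-free alternative is to keep a registry of factories and instantiate only what enablement and pv-preference ask for. The sketch below is an assumption, not the solution the project settled on; `Downloader`, `loadEnabledDownloaders`, and the map shapes are all hypothetical.

```kotlin
// Hypothetical downloader abstraction.
fun interface Downloader {
    fun download(url: String): Boolean
}

// Factories are cheap to register; a downloader is only created (and its
// config only needed) when its name is enabled for a preferred PV service.
fun loadEnabledDownloaders(
    factories: Map<String, () -> Downloader>,
    enablement: Map<String, List<String>>,
    pvPreference: List<String>,
): Map<String, List<Downloader>> =
    pvPreference.associateWith { service ->
        enablement[service].orEmpty().map { name ->
            val factory = requireNotNull(factories[name]) {
                "unknown downloader '$name' for $service"
            }
            factory()
        }
    }
```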

use aggregated parameter object

Instead of thinking about which parameters each method needs, just uniformly make every method receive and return one giant parameter object. Each module will have its own parameter object, which contains everything needed to process one piece of data through the entire batch processing procedure.
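
A minimal sketch of the idea, assuming an immutable data class that each step copies and extends. `DownloadTask`, its fields, and the two step functions are invented for illustration.

```kotlin
// Hypothetical parameter object for one module: it carries everything one
// song needs from the first step to the last.
data class DownloadTask(
    val songId: Int,
    val songName: String,
    val chosenPvUrl: String? = null,
    val downloadedFile: String? = null,
)

// Every step has the same shape: parameter object in, parameter object out.
fun choosePv(task: DownloadTask, url: String): DownloadTask =
    task.copy(chosenPvUrl = url)

fun download(task: DownloadTask): DownloadTask {
    checkNotNull(task.chosenPvUrl) { "choosePv must run before download" }
    // Placeholder for the real download; only records the resulting file name.
    return task.copy(downloadedFile = "${task.songName}.mp4")
}
```

With a uniform signature, steps can be chained or reordered without touching their parameter lists.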

A way better sorting and enablement for PVs

The current way of sorting and filtering PVs is poor and buggy. Here is a much better mechanism:

config:
  preference:
    # a list of <PvService>.<PvType>
    chosen-pvs: NicoNicoDouga.Original, NicoNicoDouga.Reprint, NicoNicoDouga.Other, Youtube.Original, Youtube.Reprint, Youtube.Other, Bilibili.Original, Bilibili.Reprint, Bilibili.Other

This single option covers three old options at once: pv-services, try-reprinted-pv and try-all-original-pvs-before-reprinted-pvs
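
Parsing and applying such an option might look like the sketch below. The enum constants follow the names used in the config above; `PvKey`, `Pv`, `parseChosenPvs` and `sortAndFilter` are hypothetical names, and the real PV model is richer than this.

```kotlin
enum class PvService { NicoNicoDouga, Youtube, Bilibili }
enum class PvType { Original, Reprint, Other }

// One <PvService>.<PvType> entry of chosen-pvs.
data class PvKey(val service: PvService, val type: PvType)
data class Pv(val key: PvKey, val url: String)

// Parse the comma-separated option; unknown names and duplicates are rejected.
fun parseChosenPvs(raw: String): List<PvKey> {
    val keys = raw.split(",").map { it.trim() }.filter { it.isNotEmpty() }.map { entry ->
        val parts = entry.split(".")
        require(parts.size == 2) { "expected <PvService>.<PvType> but got '$entry'" }
        PvKey(PvService.valueOf(parts[0]), PvType.valueOf(parts[1]))
    }
    require(keys.size == keys.toSet().size) { "duplicate entries in chosen-pvs" }
    return keys
}

// PVs not listed are dropped; the rest follow the order of the option.
fun sortAndFilter(pvs: List<Pv>, chosen: List<PvKey>): List<Pv> =
    pvs.filter { it.key in chosen }.sortedBy { chosen.indexOf(it.key) }
```

Leaving an entry out of the list disables that combination, and moving it changes its priority, which is how one option can stand in for the three old ones.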

retry mechanism in downloader

Notes for retry in downloaders: (TODO)

  • if one downloader fails, try the next downloader on the same PV
    • allows users to choose multiple downloaders for one PV service
    • allows us to try the faster, riskier one first and, if it is not successful, use the safer, slower one
    • make sure that the global retry count is independent for each downloader
  • if one PV service fails, try the next PV service
    • the PV availability info on VocaDB is not always up to date, so we need to handle it on our side
  • if all original PVs from all services fail, try downloading reuploaded PVs
    • this could be useful if a Vocaloid producer deleted all their content, but other people uploaded a copy
  • all retry features are configurable
    • how do we resolve PV preference sorting based on the currently enabled retry features?
    • An: sort and filter PVs based on all the settings above
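
The fallback order in these notes can be sketched as two nested loops with a per-downloader retry count. This is a simplification under assumptions: `DownloaderAttempt` and `downloadWithFallback` are hypothetical, the PVs are assumed to be already sorted and filtered, and the same downloader list is tried for every PV rather than being looked up per PV service.

```kotlin
// Hypothetical downloader entry with its own, independent retry count.
class DownloaderAttempt(
    val name: String,
    val maxRetries: Int,
    val download: (String) -> Boolean,
)

// Try each PV in order; for each PV, try each downloader in order, giving
// every downloader maxRetries + 1 attempts before moving on.
// Returns "<downloader>:<url>" for the first success, or null if all failed.
fun downloadWithFallback(pvUrls: List<String>, downloaders: List<DownloaderAttempt>): String? {
    for (url in pvUrls) {
        for (downloader in downloaders) {
            repeat(downloader.maxRetries + 1) {
                if (downloader.download(url)) return "${downloader.name}:$url"
            }
        }
    }
    return null
}
```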
