notjoemartinez / yt-fts
YouTube Full Text Search - Search all of a YouTube channel from the command line
License: The Unlicense
The current way we parse vtt files inserts duplicate quote entries with timestamps that are off by a couple of seconds. This is because the vtt files we get from yt-dlp
contain duplicate entries, except that one of them has a bunch of markup to segment the quote. See line 192. Removing these duplicates would probably speed things up.
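One possible way to drop the duplicates is sketched below. It assumes cues have already been parsed into (start_time, text) tuples, which is an illustrative shape and not necessarily how the project stores them, and that the duplicate differs from its neighbour only by inline markup such as <00:00:01.500> or <c> tags.

```python
import re

# Matches inline markup such as <00:00:01.500>, <c> and </c>.
TAG_RE = re.compile(r"<[^>]+>")


def dedupe_cues(cues):
    """Drop cues whose text, once markup is stripped, repeats the last kept cue.

    `cues` is an iterable of (start_time, text) tuples. This shape is an
    assumption made for illustration, not the project's actual data model.
    """
    kept = []
    last_text = None
    for start, text in cues:
        # Strip markup, then collapse runs of whitespace.
        clean = " ".join(TAG_RE.sub("", text).split())
        if not clean or clean == last_text:
            continue
        kept.append((start, clean))
        last_text = clean
    return kept
```

Only consecutive repeats are removed, so a phrase that is genuinely spoken twice at different points in a video is still kept.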
list how many channels had a match and how many matches were in each channel
.....
Used link: https://youtube.com/@TomScottGo
yt_fts version: 0.1.15
The downloaded subs were all from @tomscottplus. This is not what I expected.
Add a flag to check version of package installed.
The program downloaded many subtitles, which took some time. But then I closed the terminal session, and now it does not appear to recognise the .db file in the current working path. Is there a way to specify the db file to use?
Use chromaDb to store and search embeddings
make the export command a flag. This prevents us from having to write a semantic-export command
Hi, can I update my database without downloading all the subtitles of a YouTube channel again?
Manual subs should be used if available but only auto-subs are downloaded.
This thread yt-dlp/yt-dlp#2262 might be relevant
Hi, what is the license of that code? The LICENCE file is missing.
It seems that the download command only downloads transcripts of the uploaded videos
It would be nice to also support videos which are live streamed
make yt-fts config a command instead of yt-fts list --config
yt-fts export "word" --all returns channel not found error
make project available on homebrew once it reaches a "stable version".
When trying to download an already existing channel, user will have to wait up to 10 seconds to see if they already have the channel downloaded
Hi, first thanks for this useful package!
It would be great if it could support aliases, like:
python3 yt_fts.py alias [NAME] [channel_id]
python3 yt_fts.py search [ALIAS_NAME or ID] [search text]
Even better: when downloading, we could also specify the alias and it would be created automatically.
Make an error message module to prevent cluttering
from pr #17
As suggested on HN, yt-fts is currently using LIKE operator for searches.
The goal here is to leverage the SQLite FTS5 full-text search using sqlite_utils library.
HN suggestion:
It looks like you're running searches using LIKE: https://github.com/NotJoeMartinez/yt-fts/blob/050981c0519a96...
SQLite has a really powerful full-text search mechanism built in - FTS5. It can handle things like stemming and stop words and relevance ranking.
My sqlite-utils Python library includes helper methods for setting that up: https://sqlite-utils.datasette.io/en/stable/python-api.html#...
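As a rough illustration of the difference from LIKE, here is a minimal sketch using only the standard-library sqlite3 module rather than sqlite-utils. The table and column names are made up for the example, and it assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table and load (video_id, text) rows into it.

    Table and column names are hypothetical, not the project's schema.
    """
    con = sqlite3.connect(":memory:")
    # The porter tokenizer adds stemming, so "run" also matches "running".
    con.execute(
        "CREATE VIRTUAL TABLE subtitles_fts "
        "USING fts5(video_id UNINDEXED, text, tokenize='porter')"
    )
    con.executemany(
        "INSERT INTO subtitles_fts (video_id, text) VALUES (?, ?)", rows
    )
    return con


def search(con, query):
    """Return (video_id, text) matches ordered by FTS5's relevance rank."""
    return con.execute(
        "SELECT video_id, text FROM subtitles_fts "
        "WHERE subtitles_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

Unlike a LIKE '%word%' scan, the MATCH query uses the full-text index and can be ordered by relevance.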
Store embeddings in chroma instead of sqlite
This is an alternative to #18 to achieve similar goals.
It would be nice to be able to supply a regex on video titles as well as searching for content.
Using Lex Fridman's channel as an example:
His podcast has 376 videos: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
However his "channel" has 689 videos: https://www.youtube.com/@lexfridman/videos
After downloading the channel content and querying through the episodes, a regex of /(Podcast)(?! Clips)/
will return all his podcast episodes but none of the other content.
This is obviously not as reliable as allowing a playlist URL but it might be a handy feature nonetheless and would seemingly only involve adjusting the search
command with a new flag.
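A title filter along these lines could be as small as the sketch below. The function name and the sample titles in the usage note are invented for illustration; only the regex comes from the example above.

```python
import re


def filter_titles(titles, pattern):
    """Keep only the titles that the regular expression matches somewhere."""
    rx = re.compile(pattern)
    return [title for title in titles if rx.search(title)]
```

For example, filter_titles(["Guest | Lex Fridman Podcast #1", "Highlight | Lex Fridman Podcast Clips"], r"(Podcast)(?! Clips)") keeps only the first title, because the negative lookahead rejects "Podcast" when it is followed by " Clips".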
Things seem to work as expected in testing on this channel: https://www.youtube.com/@CodexCommunity/videos
But trying this channel results in an error: https://www.youtube.com/@PerfectGuyLife/videos
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 212 (char 211)
My guess is that the 2nd channel uses emoji in some of their titles, like "[emoji] Fishtank is LIVE [emoji] 50% OFF Weekend"
Any thoughts on this?
Is the search a traditional exact text match? If so, having semantic searching via embeddings/completion would be great!
The initial run is unacceptably slow because of whatever I did in #48. This might be due to import issues in yt-fts.py. Pulled the version from PyPI.
I tried to run the example python yt_fts.py download "https://www.youtube.com/@TimDillonShow/videos"
UC4woSp8ITBoYDmjkukhEhxg
and consistently end up with an error No such file or directory: 'yt-dlp'
Downloading channel
Saving vtt files to /var/folders/x7/0r36c9sn7yg7tvs5sdm471000000gn/T/tmpbrh06qzz
The Tim Dillon Show
Traceback (most recent call last):
File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 273, in <module>
cli()
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 31, in download
download_channel(channel_id)
File "/Users/saif/WORKSPACE/yt-fts/yt_fts.py", line 84, in download_channel
subprocess.run([
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 503, in run
with Popen(*popenargs, **kwargs) as process:
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 971, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/Users/saif/opt/anaconda3/envs/yt/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'yt-dlp'
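The traceback shows subprocess.run failing because the yt-dlp executable is not on PATH. A friendlier failure could be produced by checking for it up front, as in this sketch; the helper name and the message wording are hypothetical.

```python
import shutil


def require_executable(name="yt-dlp"):
    """Return the full path of `name`, or exit with a readable message."""
    path = shutil.which(name)
    if path is None:
        raise SystemExit(
            f"{name} was not found on PATH. Install it "
            f"(for example with `pip install {name}`) and try again."
        )
    return path
```

Calling this before subprocess.run would replace the long traceback with a single line telling the user what is missing.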
Is it possible to search in an individual video?
Add command to show config path or set specific config params
delete command
export semantic search stuff
Hi,
First, thanks for this tool, really useful.
As reported on HN by European users, there is a YouTube cookie consent page that blocks retrieving the channel_id (first) and, consequently, all other requests.
File ".../yt-fts/yt_fts.py", line 29, in download
channel_id = get_channel_id(channel_url)
File ".../yt-fts/yt_fts.py", line 176, in get_channel_id
channel_id = re.search('channelId":"(.{24})"', html).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
I already faced this issue and adding a cookie indicating that consent has been given to a requests session can "solve" this.
import requests

s = requests.Session()
# Pre-set the consent cookie so the consent page is skipped
s.cookies.set("CONSENT", "YES+1")
[...]
res = s.get(url)
In order to respect the initial goal of this consent page, we can ask the user to give their consent through a CLI argument like so:
python yt_fts.py download "https://www.youtube.com/@ycombinator/videos" --cookies_consent=1
It's just a suggestion; it could also be a question prompted in the CLI during download, but this requires knowing that the user is in Europe (or it could apply to all users, which can be annoying if it's not really needed after all).
I tried to analyse "Reject all" selection behavior but the CONSENT cookie's content is still PENDING+{RANDOM NUMBER} (perhaps not random from Google's POV but I couldn't explain this value) so from my point of view only "Accept all" is "working".
Do you have any thoughts about this?
Kind regards,
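To make the suggested flag concrete, here is a small sketch. It uses argparse from the standard library purely to stay self-contained (the tracebacks in other issues show the project uses click), and the function names are hypothetical. The returned dict could be passed as the cookies argument of a requests call.

```python
import argparse


def build_parser():
    """Parser for a hypothetical download command with an opt-in consent flag."""
    parser = argparse.ArgumentParser(prog="yt_fts.py download")
    parser.add_argument("channel_url")
    parser.add_argument(
        "--cookies_consent",
        type=int,
        choices=[0, 1],
        default=0,
        help="set to 1 to send the CONSENT cookie on the user's behalf",
    )
    return parser


def consent_cookies(accepted):
    """Return the cookie dict to send, empty unless the user opted in."""
    return {"CONSENT": "YES+1"} if accepted else {}
```

Keeping the default at 0 means the cookie is never sent unless the user explicitly asks for it.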
Imagine you need a sound bite. Currently the workflow is as follows:
1. download for all the subs.
2. search and find a quote that fits your needs.
3. yt-dl <link> or yt-dl -x <link>
4. ffmpeg -i <input file> -ss <ts> -t <duration> -acodec copy -vcodec copy <output file>
A streamlined workflow could look like this:
1. download channel subs
2. search key words
3. yt-fts quote-dl --audio <ID> to download sound or video bite. Maybe this needs a duration argument?
4. <video-ID>-<quote-ID><Sanitized Quote>.mp3 (or similar) in your working dir.
Done. yt-fts would download the file as specified (e.g. via --audio or --video) and cut it to bits.
Is this something that is in scope of this project? Do any other users have this use case?
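The cutting step of such a workflow could be assembled roughly as follows. Both helpers are hypothetical and only mirror the ffmpeg invocation and the file naming pattern proposed above; nothing here is an existing yt-fts command.

```python
import re


def clip_filename(video_id, quote_id, quote, ext="mp3"):
    """Build '<video-ID>-<quote-ID><Sanitized Quote>.<ext>' from the proposal."""
    # Replace every run of non-alphanumeric characters with one underscore.
    safe = re.sub(r"[^A-Za-z0-9]+", "_", quote).strip("_")
    return f"{video_id}-{quote_id}{safe}.{ext}"


def ffmpeg_clip_args(input_file, start, duration, output_file):
    """Return the ffmpeg argument list that copies out one clip."""
    return [
        "ffmpeg",
        "-i", input_file,
        "-ss", str(start),
        "-t", str(duration),
        "-acodec", "copy",
        "-vcodec", "copy",
        output_file,
    ]
```

The argument list can be handed to subprocess.run as is, which avoids shell quoting problems with quotes that contain spaces or punctuation.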
Currently using a hacky script to test for behavior, not familiar with how unit testing is done with CLIs.
look into how watson does it:
https://github.com/TailorDev/Watson/blob/master/tests/test_cli.py
It would be nice if it were possible to search across all downloaded channels.
Maybe with an --all flag?
I think yt-dlp fetches all the videos in a channel, then fetches the stats of each video (checking to see if there are captions).
Large channels with only a single-digit number of captioned videos are slow to download (and hit API limits).
The (paid and official) YouTube API allows you to retrieve the video IDs with captions in a specific channel.
curl \
'https://youtube.googleapis.com/youtube/v3/search?channelId=[ChannelID]&part=id&type=video&videoCaption=closedCaption&key=[KEY]' \
--header 'Accept: application/json' \
--compressed
{
  "kind": "youtube#searchListResponse",
  "etag": "995jyKTI3Q_SpXkNvcBCDR77qP0",
  "nextPageToken": "CAUQAA",
  "regionCode": "",
  "pageInfo": {
    "totalResults": 141,
    "resultsPerPage": 5
  },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    },
    {
      "kind": "youtube#searchResult",
      "etag": "",
      "id": {
        "kind": "youtube#video",
        "videoId": ""
      }
    }
  ]
}
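Since the response is paginated (5 of 141 results, plus a nextPageToken), collecting every captioned video ID means following the token. The sketch below only builds the request URL and parses one page; the function names are hypothetical, and the actual HTTP call and API key handling are left out.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://youtube.googleapis.com/youtube/v3/search"


def captioned_search_url(channel_id, api_key, page_token=None):
    """Build the search URL for captioned videos, optionally for a later page."""
    params = {
        "channelId": channel_id,
        "part": "id",
        "type": "video",
        "videoCaption": "closedCaption",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def parse_page(response):
    """Return (video_ids, next_page_token) from one decoded JSON response."""
    ids = [
        item["id"]["videoId"]
        for item in response.get("items", [])
        if item.get("id", {}).get("kind") == "youtube#video"
    ]
    return ids, response.get("nextPageToken")
```

A caller would loop, requesting captioned_search_url(...) with the token returned by parse_page until the token is None.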
Please add playlist support. Many video collections of interest are organized in playlists and not channels. I don't know if the identifier for playlists is in a different namespace. yt-dlp supports playlists.
on macos/linux default config path should be
db_path = f"{os.path.join(os.getenv('HOME'), '.config', 'yt-fts')}/subtitles.db"
on windows
db_path = f"{os.path.join(os.getenv('APPDATA'), 'yt-fts')}/subtitles.db"
for some reason it's defaulting to the current directory
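The two cases above could be folded into one helper, sketched here with pathlib. The function name and its parameters are hypothetical, and it assumes HOME (or APPDATA on Windows) is set in the environment.

```python
import os
import sys
from pathlib import Path


def default_db_path(platform=sys.platform, env=os.environ):
    """Return the default subtitles.db location for the given platform.

    Raises KeyError if the expected environment variable is missing.
    """
    if platform.startswith("win"):
        base = Path(env["APPDATA"]) / "yt-fts"
    else:
        base = Path(env["HOME"]) / ".config" / "yt-fts"
    return base / "subtitles.db"
```

Passing the platform and environment as parameters makes the fallback to the current directory easier to reproduce in a test.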
The script currently saves the database to the current working directory; ideally it should be somewhere like ~/.local/share/yt-fts/subtitles.db. I don't know the best practices for writing software that "invites itself" into a user's config directories.
My general questions are:
Is pip uninstall yt-fts supposed to know where this is?
Hi, this looks like a promising tool. A few points to hopefully help towards Windows support:
The README should be updated with instructions to set up a venv using activate.bat.
What Python version(s) are supported? What versions do we know work with yt-fts?
In its current state, the download command fails to run on Windows. Here is the output from my terminal:
python yt_fts.py download "https://www.youtube.com/@TimDillonShow/videos"
UC4woSp8ITBoYDmjkukhEhxg
Downloading channel
Saving vtt files to C:\Users\FOO\AppData\Local\Temp\tmp6oqtgfyb
The Tim Dillon Show
Traceback (most recent call last):
File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 273, in <module>
cli()
File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\USER\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\FOO\Documents\git\yt-fts\.env\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 31, in download
download_channel(channel_id)
File "C:\Users\FOO\Documents\git\yt-fts\yt_fts.py", line 84, in download_channel
subprocess.run([
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1024, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1008.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 1509, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [WinError 2] The system cannot find the file specified