
forager's Issues

--search --delete flags for the cli

--search will output checksums and media_reference ids (or maybe that's configurable too?)

--delete will delete whatever the search matches. --delete will do nothing if no search params are specified.

low-impact backups

General idea

our database includes both the file content and metadata about those files. I think we can come up with a smart backup system that exports all the data except for file content: e.g. the media references, the tags, the sequences and, most importantly, the checksums. Those backups could happen weekly or so, and we could even build that into the forager client.

Backup resolution

one of two options: either the backed-up db is a starting point, and it just holds empty references until we re-import data with matching sha256s; or we import the backups into a database and they just serve as additional metadata attachments (this is far easier, as it falls in line with one of our planned cli commands: forager update --tag some:tag --search --id=1).

robust duplication checks

We currently use a unique md5 checksum field to check if two pieces of media are equal, and log duplicates to a table. We should add a mechanism to compare the whole buffers to see if they're really equal, and either throw a critical error for non-equal collisions or make checksums non-unique. This behavior should most likely be behind a config flag.

[edit]
we should also probably move from md5 checksums to sha256 or even sha512

cursor based pagination

move from OFFSET & LIMIT to created_at > ? ORDER BY created_at LIMIT ?. In the future we can try doing something with source_created_at, but a compound sort (source_created_at OR created_at) seems slow, at least in the WHERE clause.
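
A minimal sketch of the cursor query, assuming the paginated table is media_reference and the cursor is the created_at value of the last row the client received:

-- first page
SELECT * FROM media_reference
ORDER BY created_at
LIMIT 100;

-- next pages: bind the created_at of the last row from the previous page
SELECT * FROM media_reference
WHERE created_at > ?
ORDER BY created_at
LIMIT 100;

-- if created_at is not unique, (created_at, id) > (?, ?) avoids skipping ties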

timing instrumentation

statements should log to a table with their time to execute, and possibly the size of the database. Timing should be behind a config flag like instrumentation: true. The table might look like this:

CREATE TABLE timing (
  statement TEXT NOT NULL,
  execution_time_ms INTEGER NOT NULL,
  -- TBD if the following counts are too expensive to compute on top of each call
  media_reference_count INTEGER NOT NULL,
  media_chunk_count INTEGER NOT NULL,
  tag_count INTEGER NOT NULL,
  tag_group_count INTEGER NOT NULL,
  media_reference_tag_count INTEGER NOT NULL
);

Alternatively, we could snapshot the counts every x minutes or hours rather than on every timing insert. We could also just forgo counting entirely and use something like sqlite3_analyzer whenever a problem starts to arise.

media_chunk optimizations

  • add index column to media_chunk for faster chunk grabs (see the sketch after this list)
    • we can use OFFSET in the interim, but it's going to walk each row
  • only videos should be chunked, images should always be shoved into a single chunk.
    • I am assuming we never hit a chunk larger than 4GB (which I think is the limit on BLOB access)
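
A sketch of the index column idea from the first bullet, assuming media_chunk rows are keyed by a media_file_id and hold their bytes in a data column (names assumed):

ALTER TABLE media_chunk ADD COLUMN chunk_index INTEGER;
CREATE INDEX media_chunk_order ON media_chunk (media_file_id, chunk_index);

-- grab the nth chunk directly rather than walking rows with OFFSET
SELECT data FROM media_chunk
WHERE media_file_id = ? AND chunk_index = ?;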

add star count to media_reference table [1-5]

we want to filter media based on how highly it's rated. On one hand, this could just be a tag. On the other hand, we could bake this into the data model and reduce search times a tad.
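
If we do bake it into the data model, a minimal sketch might be (column name assumed):

ALTER TABLE media_reference ADD COLUMN stars INTEGER NOT NULL DEFAULT 0 CHECK (stars BETWEEN 0 AND 5);
CREATE INDEX media_reference_stars ON media_reference (stars);

-- e.g. only show highly rated media
SELECT * FROM media_reference WHERE stars >= 4;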

add tag counts, media_reference counts

add a tag_count field to the media_reference table and a media_reference_count field to the tag table. Both need triggers on INSERT/DELETE which will increment/decrement those counters.
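
A sketch of the trigger pair, assuming the join table is named media_reference_tag with tag_id and media_reference_id columns, and that both counter columns default to 0:

CREATE TRIGGER media_reference_tag_insert AFTER INSERT ON media_reference_tag
BEGIN
  UPDATE tag SET media_reference_count = media_reference_count + 1 WHERE id = NEW.tag_id;
  UPDATE media_reference SET tag_count = tag_count + 1 WHERE id = NEW.media_reference_id;
END;

CREATE TRIGGER media_reference_tag_delete AFTER DELETE ON media_reference_tag
BEGIN
  UPDATE tag SET media_reference_count = media_reference_count - 1 WHERE id = OLD.tag_id;
  UPDATE media_reference SET tag_count = tag_count - 1 WHERE id = OLD.media_reference_id;
END;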

give it a cool name

ideally we get both a unique name and an organization. Then we publish like so:

# install the cli
npm i -g forager
# install the core db/file manager
npm i -g @forager/lib
# install the web ui
npm i -g @forager/web
  • npm i -g @forager/lib @forager/web forager
    • name: taken and actively developed
    • org: available
  • npm i -g @medima/lib @medima/web medima (a mashup of "media" and "manager")
    • name: available
    • org: available
  • npm i -g @packrat/lib @packrat/web packrat
    • name: stale, last published 9 years ago
    • org: available

media.export method

add a method to export media to disk. It should take the full search params as input (which should also include checksums & media_reference_ids as params). The second param should be a string template for filenames. Look at youtube-dl for inspiration.

core: store dates as epoch timestamps

this will save some space, and we don't really care about reading dates for debugging (sqlite has helpers if we really do care)

we could also store metadata in msgpack format, but then we lose the ability for sqlite to use json_extract selectors (although we could take advantage of registered functions... we would have to benchmark this)
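
For reference, the sqlite helpers in question, assuming columns store epoch seconds:

-- write: the current time as an epoch integer
SELECT strftime('%s', 'now');
-- read: render an epoch column as a human-readable date while debugging
SELECT datetime(created_at, 'unixepoch') FROM media_reference;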

rejection log table

essentially track which files have been deleted, so that if we try importing them again they get ignored.
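
A minimal sketch of such a table (names assumed); media.create would check it by checksum before importing:

CREATE TABLE rejection_log (
  checksum TEXT NOT NULL UNIQUE,
  rejected_at INTEGER NOT NULL, -- epoch timestamp
  reason TEXT                   -- e.g. 'deleted by user'
);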

video previews should be an array of thumbnails

let's normalize thumbnails into their own table. Images get a single thumbnail, videos get x (like 16).

create table thumbnail (
  id INTEGER PRIMARY KEY NOT NULL,
  media_file_id INTEGER NOT NULL,
  "index" INTEGER NOT NULL, -- quoted because INDEX is a reserved word in sqlite
  image_data BLOB NOT NULL,
  timestamp INTEGER NOT NULL -- for media_type IMAGE, this is just zero.
);

create file from ReadableStream

allow streams to be passed into forager.media.create() rather than a filepath. This lets us stream directly from a scraper into the db, without duplicating data onto the filesystem. Inside a transaction we would push the chunks into the db and calculate the sha256 checksum on the fly. If a file with that checksum already exists, then we ROLLBACK the transaction.

folder view

forager should be encouraged to be used to just import everything in a folder, and use forager-data.json when it's found. A folder view also works for viewing them. This is essentially the only reason I use a file system to organize some things right now.

additionally, this could be the new "sequence" view. It makes scraping data stored with keys easier than trying to nest the data structures. For instance, we might store a comic book like this: sequence_location: /favorites/<title>/<chapter>/<page>

cli: template getters

we often have to import media in this general manner:

for media_folder in folder:
  info = read(media_folder + '/info.json')
  forager.create(media_folder + '/media.mp4', info, info.tags)

we can attempt to handle simple control structures inside of the forager cli, much like how ffmpeg handles this:

#!/bin/bash


# ifunny:
forager \
  --database ~/.local/share/forager.db \
  'ifunny/%04d/media.*' \
  --json 'info=(__media_folder__)/media_info.json' \
  --title '(info.title)' \
  --description '(info.description)' \
  --source-created-at '(info.timestamp * 1000)' \
  --tag 'source:ifunny' \
  --tag 'ifunny:(info.tags)' \
  --tag 'username:(info.username)'

performance improvement ideas

there are lots of ways to speed up sqlite, so I'll just store musings here for now.

  • search/list counts happen on every paginate call. We can avoid this in a few ways:
    • just doing the count at the beginning (susceptible to new data changes, but not really a problem since we don't load stuff when scrolling up)
    • creating a table that essentially tracks the last update/delete to happen to either media_reference_tag or media_reference. On a paginated query, we check if the "cache key" has changed.
    • store a table w/ search query counts. All rows will be removed when either media_reference_tag or media_reference has an update/delete. The search query would be the json of the search params (see the sketch below).
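
A sketch of that last option, with the serialized search params as the cache key (table and column names assumed):

CREATE TABLE search_count_cache (
  search_params_json TEXT NOT NULL UNIQUE,
  media_reference_count INTEGER NOT NULL
);

-- one of the invalidation triggers; similar triggers are needed for
-- DELETE/UPDATE and for media_reference_tag
CREATE TRIGGER search_count_cache_invalidate AFTER INSERT ON media_reference
BEGIN
  DELETE FROM search_count_cache;
END;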

complex queries

This is just a running list of queries that would make the app better, but may be difficult using a traditional RDBMS.

smarter tag searches

tag searching could incorporate the other selected tags, so we only suggest tags that will have more than zero media results.

  • given a list of tags, what other tags are referenced from media matching that first list of searched tags
    • this is useful for our search bar. E.g. if there are two search terms, tag autocomplete should ideally only show tags that will yield results when added (see the sketch after this list).
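
A sketch of that query, assuming the media_reference_tag join table and two already-selected example tags 'cat' and 'video':

-- media that match every selected tag
WITH matching_media AS (
  SELECT media_reference_id FROM media_reference_tag
  INNER JOIN tag ON tag_id = tag.id
  WHERE tag.name IN ('cat', 'video')
  GROUP BY media_reference_id
  HAVING COUNT(DISTINCT tag.id) = 2 -- number of selected tags
)
-- tags that will still yield results if added to the search
SELECT tag.name, COUNT(*) AS media_count
FROM media_reference_tag
INNER JOIN tag ON tag_id = tag.id
WHERE media_reference_id IN (SELECT media_reference_id FROM matching_media)
  AND tag.name NOT IN ('cat', 'video')
GROUP BY tag.id
ORDER BY media_count DESC;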

grouped search results

  • I want to add a search arg to group results by tag group. For example: "search all media from twitter grouped by username"
-- this just gives us the media_reference count per tag, but we could still grab the top media_reference_id from each too

SELECT tag.name AS tag, COUNT(media_reference.id) AS media_count, tag.unread_media_reference_count
FROM media_reference
INNER JOIN media_reference_tag ON media_reference_id = media_reference.id
INNER JOIN tag ON tag_id = tag.id
INNER JOIN tag_group ON tag_group_id = tag_group.id
WHERE media_reference.id IN (
    SELECT media_reference.id FROM media_reference
    INNER JOIN media_reference_tag ON media_reference_id = media_reference.id
    INNER JOIN tag ON tag_id = tag.id
    INNER JOIN tag_group ON tag_group_id = tag_group.id
    WHERE tag_group.name = 'username'
    AND media_reference_id IN (
      SELECT media_reference_id FROM media_reference_tag
      INNER JOIN tag ON tag_id = tag.id
      WHERE tag.name = 'likee'
    )
    GROUP BY media_reference.id
  )
  AND tag_group.name = 'username'
GROUP BY tag_id
ORDER BY media_count;
  • media sequences should also have the option to be grouped. This might involve temporary tables.
