
Comments (6)

tilboerner commented on July 17, 2024

enumerate_files shouldn't do everything up front. once it enters its while loop, it acts as a generator and only produces values as quickly as they are requested. outside the enumerator, whenever it hands out an item from the 1st or 2nd hierarchy level, progress is updated by registering a tick towards a known endpoint, which is the total number of ticks expected.

anyway, the generator: before it gets into its loop, the paths argument is processed in its entirety. it might get sorted (controlled by another argument, currently set to True), and it is turned into a stack. i'm not sure how exactly that happens, just that Python's deque is supposed to be "highly optimized", to quote a professor of mine. :P

so there is some initial time overhead, depending on how many directories media.basedir has. this could be avoided if we never sort, start with an empty stack, and pop from paths whenever the stack is empty. we could also make paths a generator; however, that would require more integration with the compare phase and stepping through there as well. additionally, we would lose the total length information in paths (well, paths and its "immediate subpaths"), which is now taken as the base estimate for progress calculation.
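for illustration, a minimal sketch of such a lazy lister (names and structure are made up, this is not the actual enumerate_files): it starts with an empty stack, never sorts, and only pulls from paths when the stack runs dry:

    import os
    from collections import deque

    def lazy_file_lister(paths):
        # hypothetical sketch: paths is an iterable of directories to scan;
        # nothing is consumed until the caller starts iterating.
        paths = iter(paths)
        stack = deque()
        while True:
            if not stack:
                try:
                    stack.append(next(paths))  # refill lazily, no sorting
                except StopIteration:
                    return
            current = stack.pop()
            yield current
            if os.path.isdir(current):
                for name in os.listdir(current):
                    stack.append(os.path.join(current, name))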

count_immediate_subpaths is trickier; it exists because it's the only way I could think of to get an initial estimate of the total volume to be scanned when progress is also updated for the 2nd hierarchy level. it could be avoided if we don't record a progress tick when a subfolder item gets processed and update progress only for 1st-level folders. subfolders could still get their names printed, but without any change in the progress indicator.

or, yeah, we could start a thread that counts the folders and updates the expected progress total while the actual scanning happens in a separate thread. since both threads would essentially be limited by the same i/o bottleneck, i'd be surprised if the total speed gains were more than moderate, if that. alternatively, we could do the counting beforehand: there is already a phase that checks the db against the filesystem, which we could use. it might be sufficient to stick a progress indicator on that too, if we can find a way to estimate progress this early; that might be handy in this later phase as well.

hmm, or progress could check on each query how much time has passed since the last tick, compare that with the average time between ticks, and calculate a more granular estimate from there without counting the 2nd hierarchy level. ok, i like that. what do you think?
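roughly like this, maybe (a quick sketch of the idea, all names hypothetical): between ticks, the estimate is interpolated by how much of the average tick interval has already elapsed:

    import time

    class Progress:
        # sketch: interpolate between ticks using the average tick interval
        def __init__(self, total):
            self.total = total
            self.ticks = 0
            self.started = self.last_tick = time.time()

        def tick(self):
            self.ticks += 1
            self.last_tick = time.time()

        def estimate(self):
            # completed ticks, plus a partial tick based on time elapsed
            # since the last tick vs. the average time between ticks so far
            if not self.ticks:
                return 0.0
            avg = (self.last_tick - self.started) / self.ticks
            partial = min((time.time() - self.last_tick) / avg, 1.0) if avg else 0.0
            return min((self.ticks + partial) / self.total, 1.0)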

(nice productive write-up, that was. when i started, i hadn't thought of any of that.)


tilboerner commented on July 17, 2024

There can be multiple yields in a generator function, right? We could get rid of that stack entirely. But is there an alternative to os.listdir?
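For instance (a sketch, not project code): a recursive generator with two yield points needs no explicit stack, and os.scandir (available since Python 3.5) is a lazier alternative to os.listdir:

    import os

    def walk(path):
        # recursive generator: Python keeps the traversal state for us,
        # so no explicit stack is needed. os.scandir yields entries lazily.
        for entry in os.scandir(path):
            yield entry.path
            if entry.is_dir(follow_symlinks=False):
                yield from walk(entry.path)  # second yield point, via delegation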


tilboerner commented on July 17, 2024

Draft of Possible Solution in Pseudocode

The process takes place in the background, so it is not strictly necessary to create a db/filesystem diff beforehand and have the user decide whether it should take place. Plus, we can walk the whole structure to find new elements hiding in folders that are already known.

If we build an index of seen files while traversing one tree, we can probably speed up updating the second. Dunno how much difference it might make, though. What's faster: a filesystem lookup, or a lookup in a table behind an open filehandle? (A runnable sketch of the two phases follows after the outline.)

  • Definitions

    db.define.tables('files', 'words', 'search': *as is*)
    db.define.trigger('words', add|delete|update -> db.idx.words.update(word))
    db.define.trigger('search', add|delete|update -> db.idx.search.update(word, file))
    
    db_add(file) = transaction.savepoint {   *as is*   }
    
    db_delete(file) = transaction.savepoint { 
                           words = db.search.delete.where(file)
                           for word in words:                          // optional. 
                               if not db.search.has(word):             // needs up-to-date index?
                                   db.words.delete(word)  
                           db.files.delete(file) 
                        }
    
  • Phase I: add new files to db & build remove index

    db.create.tmp.table('seen', (fileid))
    progress.total = count(items in basedir)
    generator = recursive_lazy_file_lister(basedir)  // or (items in basedir)
    
    for file in generator:
        if file.parent == basedir:
            progress += 1
        if db.files.has(file):
            db.tmp.seen.add(file)
            continue
        db_add(file)
    
  • Phase II: remove stale db entries

    for file in (db.files where file not in db.tmp.seen):
        db_delete(file)
    
  • Phase II.alternate (obviates the need for db.tmp.seen):

     for file in db.files:
         if not file.exists:
             db_delete(file)
    
  • Additional Points:

    • note that the existing updating process does not seem to update indexes. or does that happen automatically?
    • if not: index updates via trigger, or explicit rebuild after each phase (or the whole process) is done
    • does not yet include id3-tags, which should probably be (re-)scanned
    • optional "transaction log" with added/removed filepaths (mindful of possible rollbacks)
    • worst case scenario: the whole content of basedir is moved into a new basedir/subfolder. temporarily doubles the database, then removes half of it
    • how to avoid: recognize files by storing hashes in database?
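As referenced above, a minimal runnable sketch of the two phases, assuming sqlite3 and a files table with a path column (table and column names are made up for illustration):

    import os
    import sqlite3

    def update_index(conn, basedir):
        # phase I: walk the tree, add unknown files, remember seen ones
        conn.execute('CREATE TEMP TABLE seen (path TEXT PRIMARY KEY)')
        for dirpath, dirnames, filenames in os.walk(basedir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                conn.execute('INSERT OR IGNORE INTO seen (path) VALUES (?)',
                             (path,))
                known = conn.execute('SELECT 1 FROM files WHERE path = ?',
                                     (path,)).fetchone()
                if not known:
                    conn.execute('INSERT INTO files (path) VALUES (?)', (path,))
        # phase II: remove db entries that were not seen on disk
        conn.execute('DELETE FROM files WHERE path NOT IN (SELECT path FROM seen)')
        conn.execute('DROP TABLE seen')
        conn.commit()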


devsnd commented on July 17, 2024
  1. Yes, indexes are updated automatically once a table is indexed. This also means that writes to indexed tables are slower. We could deactivate indexing for the very first update, and then enable it for further updates.
  2. I don't care too much about the ETA. We should do automatic indexing in the background and also start the server right away, but then I'd like to have feedback for the user, so the web-interface shows "UPDATING" or something.
  3. I think the implementation for adding files should walk the FS-tree in a DEPTH FIRST manner and only report what is known at the time. So while we iterate through basedir, all basedir folders are known, and the progress is PROCESSED_FOLDERS / BASEDIR_FOLDERS. Once we are inside the first subfolder, the progress becomes PROCESSED_FOLDERS / (BASEDIR_FOLDERS + FIRST_SUBFOLDER) and so on.
  4. I think the deletion of files should be performed in a BREADTH FIRST manner (see the sketch after this list). So if your worst-case scenario becomes reality, all root folders are deleted immediately (and their children as well), and then everything will be added just as before.
  5. Hashing is a bad idea. It will slow everything down even more than deleting and updating everything each time, since every byte has to be processed. We could go with atime, ctime & mtime for lazy evaluation.
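A sketch of what that breadth-first delete might look like (the db helpers children_of and delete_subtree are hypothetical):

    import os
    from collections import deque

    def delete_stale(db, basedir):
        # breadth-first: if a directory is gone from disk, its whole subtree
        # is dropped from the db at once, so the worst case (everything moved)
        # is handled at the root level immediately.
        queue = deque(db.children_of(basedir))      # hypothetical helper
        while queue:
            item = queue.popleft()
            if not os.path.exists(item.path):
                db.delete_subtree(item)             # hypothetical: item + descendants
            elif os.path.isdir(item.path):
                queue.extend(db.children_of(item))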


tilboerner commented on July 17, 2024
  1. Yeah, I concluded the same after looking up how indexing works: no indexes for the first update.

  2. From a user perspective, a progress indicator is not a priority; but it might be for an admin who's in for a lot of wondering during the first update. There should be something they can interpret to get a rough sense of where the updating process stands: time, percentage, mention of directories that are known to them... The thing you mention in (3) should do.

  3. Got me confused. Progress, reporting: are we only talking about how to format feedback? And what is known to whom? But I think we're on the same page. Any notion of progress feedback aside, I wouldn't look ahead at all, but simply march on depth-first and decide about adding a node to the db the moment it is first encountered.

    Wait, I think I get it now. You mean numbers, right? Like this:

    from collections import deque

    todo = deque(children(basedir))    # children() = direct children of a directory
    counter = 0
    total = len(todo)                  # number of direct children of basedir
    progress = 0.0

    while todo:
        item = todo.pop()
        process(item)                  # item gets assigned an id as part of process
        if item.isdir:
            subitems = children(item)
            todo.extend(subitems)
            total += len(subitems)
        counter += 1
        progress = counter / total
    

    That way, total wouldn't get very far ahead of counter, but as we advance, their ratio would approach "true progress".

    process item, wherever it happens, could query the db to see whether the item is known, and otherwise trigger an add. Since we don't have an index on filenames (and probably shouldn't, either), we might want to look up "children of the current directory" and keep an index on the parent-id column, or a children table. If we put that logic in an object, we could, for the very first add sweep, substitute a dummy that doesn't look anything up and just cries: "Add, add!"

  4. Deleting should happen first, then. Suits me fine; I've grown doubtful about my idea of building a "seen" index during the add sweep. Horrible! The breadth-first approach is sound, and it would also benefit from a quick way to look up an item's children.

  5. I like the idea of using the timestamps; that might come in handy to find out whether file contents, like id3-tags, have changed. (Do we even allow such madness?) It might require another index, of course. How do the different file systems handle these? Does os.stat come out the same on different OSes? A.f.a.i.k. atime might not change even on Linux systems, but what would we need it for, anyway? (See the sketch after this list.)

  6. We must be very careful using rowid as a unique identifier, since it can be re-issued to a new item after the original row is deleted. Not just talking about the files db here. So either we trust ourselves to purge all references on any delete, or we introduce a column for dedicated uids. They could be simply UNIQUE or INTEGER PRIMARY KEY AUTOINCREMENT, which would, of course, suggest or even mandate another index. :)
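As mentioned in (5), a small sketch of how mtime could drive lazy re-scanning (the stored mtime lookup is left abstract). os.stat itself is portable, though the meaning of st_ctime differs between platforms: inode change time on Unix, creation time on Windows.

    import os

    def needs_rescan(path, stored_mtime):
        # compare the filesystem's modification time against the value
        # recorded in the db; st_mtime is available on all platforms.
        return os.stat(path).st_mtime > stored_mtime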


tilboerner commented on July 17, 2024

(Looks like on Windows ctime == mtime == atime.)

