Comments (6)
`enumerate_files` shouldn't do everything up front. Once it enters its while loop, it acts as a generator and will iterate only as quickly as new values are being requested. Outside the enumerator, when it hands out an item from the 1st or 2nd hierarchy level, progress is updated by registering a tick towards a known endpoint, which is the total number of ticks expected.
Anyway, the generator: before it gets into its loop, the `paths` argument is processed in its entirety: it might get sorted (which is controlled by another argument, set to `True` at the moment), and it is turned into a stack. I'm not sure how exactly that happens, just that Python's `deque` is supposed to be "highly optimized", to quote a professor of mine. :P

So there is some initial time overhead, depending on how many directories `media.basedir` has. This could be avoided if we never sort, start with an empty stack, and pop from `paths` whenever the stack is empty. We could also make `paths` a generator. However, that would require more integration with the compare phase and stepping through there as well. Additionally, we would lose the total length information in `paths`, which is now taken as the base estimate for progress calculation. Well, `paths` and its "immediate subpaths".
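A minimal sketch of such a lazy, stack-based enumerator (the function name, the sort flag, and the progress-free shape are assumptions for illustration, not the actual cherrymusic code):

```python
import os
from collections import deque

def enumerate_files(paths, sort=True):
    """Lazily walk the given paths depth-first, yielding each path.

    Items are only produced when the consumer asks for them, so the
    caller controls the pace of the filesystem traversal. The only
    up-front work is sorting `paths` and filling the initial stack.
    """
    if sort:
        paths = sorted(paths)
    stack = deque(paths)            # deque: O(1) push/pop at both ends
    while stack:
        current = stack.pop()
        yield current
        if os.path.isdir(current):
            for name in os.listdir(current):
                stack.append(os.path.join(current, name))
```

Consuming it with `next()` or a `for` loop pulls one filesystem item at a time, which is where a progress tick could be registered.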
`count_immediate_subpaths` is trickier, and it exists because it's the only way I could think of to get an initial estimate of the total volume to be scanned when progress is also updated for the 2nd hierarchy level. It could be avoided if we don't record a progress tick when a subfolder item gets processed, but update progress only for 1st-level folders. Subfolders could still get their name printed, but without any change in the progress indicator.
Or, yeah, we could start a thread which counts the folders and updates the expected progress total while the actual scanning happens in a separate thread. Since both threads would essentially be limited by the same I/O bottleneck, I'd be surprised if total speed gains were more than moderate, if even that.

Well, we could do the counting beforehand; there is a phase already that checks the db against the filesystem, which we could use. It might be sufficient to stick a progress indicator on it, too - if we find a way to estimate progress this early. Well, that might be handy here in this later phase as well.
Hmm, or progress could check on each query how much time has passed since the last tick, compare that with the average time between ticks, and calculate a more granular estimate from there without counting the 2nd hierarchy level. OK, I like that. What do you think?
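A sketch of that tick-interpolation idea (the class and its API are hypothetical): between ticks, `estimate()` credits a fraction of the next tick based on the average time per tick so far, so the number keeps moving without counting the 2nd hierarchy level.

```python
import time

class Progress:
    """Progress estimate that interpolates between ticks, assuming
    future ticks take about as long as past ones."""

    def __init__(self, total):
        self.total = total          # expected number of ticks
        self.ticks = 0
        self.start = time.time()
        self.last_tick = self.start

    def tick(self):
        self.ticks += 1
        self.last_tick = time.time()

    def estimate(self):
        if self.ticks == 0:
            return 0.0
        avg = (self.last_tick - self.start) / self.ticks
        since_last = time.time() - self.last_tick
        # never credit a full tick that hasn't actually happened yet
        partial = min(since_last / avg, 0.99) if avg > 0 else 0.0
        return min((self.ticks + partial) / self.total, 1.0)
```

The cap at 0.99 of a tick keeps the estimate from overtaking reality when one item takes unusually long.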
(Nice productive write-up, that was. When I started, I hadn't thought of any of that.)
from cherrymusic.
There can be multiple yields in a generator function, right? We could get rid of that stack entirely. But is there an alternative to `os.listdir`?
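For what it's worth: yes, a recursive generator with `yield from` needs no explicit stack, and `os.scandir()` (in the standard library since Python 3.5) is an alternative to `os.listdir()` that avoids extra `stat()` calls. A sketch (function name is made up):

```python
import os

def walk_paths(path):
    """Recursive generator: each recursive call has its own yields,
    so no explicit stack is needed."""
    yield path
    if os.path.isdir(path):
        with os.scandir(path) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                yield from walk_paths(entry.path)
```

One caveat: recursion depth is bounded by Python's recursion limit, so pathologically deep directory trees would still favor the explicit stack.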
Draft of Possible Solution in Pseudocode
The process takes place in the background. Therefore it is not strictly necessary to create a db/filesystem diff beforehand and have the user decide whether it should take place. Plus, we can walk the whole structure to find new elements hiding in folders that are already known.
If we build an index of seen files while traversing one tree, we can probably speed up updating the second. Dunno how much difference it might make, though. What's faster: a filesystem lookup, or a lookup in a table behind an open file handle?
Definitions

```
db.define.tables('files', 'words', 'search': *as is*)
db.define.trigger('words',  add|delete|update -> db.idx.words.update(word))
db.define.trigger('search', add|delete|update -> db.idx.search.update(word, file))

db_add(file) = transaction.savepoint { *as is* }

db_delete(fileid) = transaction.savepoint {
    words = db.search.delete.where(file)
    for (words):                        // optional.
        if not db.search.has(word):     // needs up-to-date index?
            db.words.delete(word)
    db.files.delete(file)
}
```
Phase I: add new files to db & build remove index

```
db.create.tmp.table('seen', (fileid))
progress.total = count(items in basedir)
generator = recursive_lazy_file_lister(basedir)     // or (items in basedir)
for file in generator:
    if file.parent = basedir:
        progress += 1
    if db.files.has(file):
        db.tmp.seen.add(file)
        continue
    add(file)
```
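A rough Python/sqlite3 translation of Phase I (the `files(id, path)` schema and the function name are made up for illustration; progress handling is omitted):

```python
import os
import sqlite3

def scan_phase_one(conn, basedir):
    """Add unknown files to the db; record already-known ones in a
    temporary 'seen' table for the later removal phase."""
    conn.execute("CREATE TEMP TABLE seen (fileid INTEGER)")
    stack = [os.path.join(basedir, name) for name in os.listdir(basedir)]
    while stack:
        path = stack.pop()
        if os.path.isdir(path):
            stack.extend(os.path.join(path, n) for n in os.listdir(path))
        row = conn.execute(
            "SELECT id FROM files WHERE path = ?", (path,)).fetchone()
        if row:
            conn.execute("INSERT INTO seen VALUES (?)", (row[0],))
        else:
            conn.execute("INSERT INTO files (path) VALUES (?)", (path,))
```

The temp table lives only for the duration of the connection, which fits its role as a throwaway remove index.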
Phase II: remove stale db entries

```
for file in (db.files where file not in db.tmp.seen):
    delete(file)
```
Phase II.alternate (obviates the need for db.tmp.seen):

```
for file in db.files:
    if not file.exists:
        delete(fileid)
```
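Phase II.alternate could look like this in Python/sqlite3 (again assuming a hypothetical `files(id, path)` table):

```python
import os
import sqlite3

def remove_stale_entries(conn):
    """Delete db rows whose file no longer exists on disk.
    Returns the number of removed entries."""
    # collect ids first, so we don't delete while iterating the cursor
    stale = [fileid
             for fileid, path in conn.execute("SELECT id, path FROM files")
             if not os.path.exists(path)]
    conn.executemany("DELETE FROM files WHERE id = ?",
                     [(i,) for i in stale])
    return len(stale)
```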
Additional Points:
- note that the existing updating process does not seem to update indexes. or does that happen automatically?
- if not: index updates via trigger, or explicit rebuild after each phase (or the whole process) is done
- does not yet include id3-tags, which should probably be (re-)scanned
- optional "transaction log" with added/removed filepaths (mindful of possible rollbacks)
- worst case scenario: the whole content of basedir is moved into a new basedir/subfolder. temporarily doubles the database, then removes half of it
- how to avoid: recognize files by storing hashes in database?
- Yes, indexes are updated automatically once a table is indexed. This also means that indexed tables are slower to update. We could deactivate indexing for the very first update, and then enable it for further updates.
- I don't care too much about the ETA. We should do automatic indexing in the background and also start the server right away, but then I'd like to have feedback for the user, so the web interface shows "UPDATING" or something.
- I think the implementation for adding files should walk the FS-tree in a DEPTH FIRST manner and only report what is known at the time. So if we iterate through basedir, all basedir folders are known, so the progress is PROCESSED_FOLDERS / BASEDIR_FOLDERS. Once we are inside the first subfolder, the progress becomes PROCESSED_FOLDERS / (BASEDIR_FOLDERS + FIRST_SUBFOLDER) and so on.
- I think the implementation for deleting files should be performed in a BREADTH FIRST manner. So if your worst case scenario becomes reality, all root folders are deleted immediately (and their children as well), and then everything will be added just as before.
- hashing is a bad idea. It will slow everything down even more than deleting and updating everything each time, since every byte has to be processed. We could go with atime, ctime & mtime for lazy evaluation.
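The lazy-evaluation idea with `mtime` might look like this (helper name is made up; `stored_mtime` is whatever was hypothetically saved in the db at the last scan):

```python
import os

def needs_rescan(path, stored_mtime):
    """Cheap change check via the modification timestamp, instead of
    hashing file contents: a newer mtime means the file (e.g. its
    id3-tags) may have changed since the last scan."""
    return os.stat(path).st_mtime > stored_mtime
```

Unlike hashing, this costs one `stat()` call per file and reads no file contents at all.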
- Yeah, I concluded the same after looking up how indexing works: no indexes for the first update.
- From a user perspective, a progress indicator is not a priority; but it might be for an admin who's in for a lot of wondering on the first update. There should be something he can interpret to get a rough idea of where the updating process stands: time, percentage, mention of directories that are known to him... The thing you mention in (3) should do.
- Got me confused. Progress, reporting - are we only talking about how to format feedback? And what is known to whom? But I think we're on the same page. Any notion of progress feedback aside, I wouldn't look ahead at all, but simply march on depth-first, and decide about adding a node to the db the moment it is first encountered.
Wait, I think I get it now. You mean numbers, right? Like that:

```
todo = @basedir             // @ = direct children in directory
counter = 0
total = #basedir            // # = number of direct children
progress = 0
while todo:
    item = todo.pop
    process item            // item gets assigned an id as part of process
    if item.isdir:
        todo.pushall @item as children
        total += #item
    counter += 1
    progress = counter / total
```
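A runnable Python version of that loop, with a `deque` as the todo stack (function name is mine; `process` stands for whatever handles an item):

```python
import os
from collections import deque

def walk_with_progress(basedir, process):
    """Depth-first walk where the denominator grows as new directories
    are discovered, so counter/total converges on the true progress.
    Yields the current progress ratio after each processed item."""
    todo = deque(os.path.join(basedir, n) for n in os.listdir(basedir))
    total = len(todo)
    counter = 0
    while todo:
        item = todo.pop()
        process(item)
        if os.path.isdir(item):
            children = [os.path.join(item, n) for n in os.listdir(item)]
            todo.extend(children)
            total += len(children)
        counter += 1
        yield counter / total
```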
That way, `total` wouldn't get very far ahead of `counter`, but as we advance, their ratio would approach "true progress". `process item`, wherever it happens, could query the db if the item is known and otherwise trigger an add. Since we don't have an index on filenames (and probably shouldn't, either), we might want to look up "children of the current directory" and keep an index on the parent id column, or a children table. If we put that logic in an object, we could, on the very first add sweep, substitute a dummy that doesn't look up anything and just cries: "Add, add!"
- Deleting should happen first, then. Suits me fine, I've grown doubtful about my idea of building a "seen" index during the add sweep. Horrible! The breadth-first approach is sound, and it would also benefit from a quick way to look up an item's children.
- I like the idea of using the timestamps; that might come in handy to find out if file contents, like id3-tags, have changed. (Do we even allow such madness?) Might require another index, of course. How do the different file systems handle these? Does `os.stat` come out the same on different OSes? A.f.a.i.k. `atime` might not change even on Linux systems, but what would we need it for, anyway?
- We must be very careful using `rowid` as a unique identifier, since it can be re-issued to a new item after deleting the original row. Not just talking about the files db here. So, either we trust we can remember to purge all references on any delete, or we introduce a column for dedicated uids. They could be simply `unique` or `primary key autoincrement`, which would, of course, suggest or even mandate using another index. :)
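A quick sqlite3 demonstration of the reuse problem (table names are made up): after deleting the row with the highest rowid, a plain table hands the same rowid to the next insert, while a `PRIMARY KEY AUTOINCREMENT` column keeps counting up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (name TEXT)")  # implicit rowid only
conn.execute(
    "CREATE TABLE safe (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")

for table in ("plain", "safe"):
    conn.execute(f"INSERT INTO {table} (name) VALUES ('a')")
    conn.execute(f"INSERT INTO {table} (name) VALUES ('b')")
    # delete the newest row, then insert another one
    conn.execute(f"DELETE FROM {table} WHERE name = 'b'")
    conn.execute(f"INSERT INTO {table} (name) VALUES ('c')")

plain_id = conn.execute(
    "SELECT rowid FROM plain WHERE name = 'c'").fetchone()[0]
safe_id = conn.execute(
    "SELECT id FROM safe WHERE name = 'c'").fetchone()[0]
# plain reuses rowid 2 (the deleted row's id); safe moves on to 3
```

Any table still referencing the old id 2 would now silently point at the new row in the plain case.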
(Looks like on Windows `ctime` == `mtime` == `atime`.)