Comments (8)
This is an excellent feature, and one that is already partially implemented.
First off: I find hardlinks to be a better alternative to copy-on-write since they provide the following:
- Any reasonably modern filesystem (except maybe FAT32) supports them.
- If you modify it in one "place" you modify it "everywhere". Depending on your use-case this might be a drawback. I find it to be a feature.
- Copy-on-write is not supported on ext4, which is the main file system used by `isisdl` users.
Considering these points, hard links are a better option than copy-on-write.
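To illustrate the "modify it in one place, modify it everywhere" point, here is a minimal sketch using only the Python standard library. The file names are made up for the example, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile
from typing import Tuple


def hardlink_demo() -> Tuple[str, bool]:
    """Create a file, hard-link it, edit via one name, read via the other."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "lecture.pdf")    # hypothetical name
        alias = os.path.join(tmp, "lecture_copy.pdf")  # hypothetical name

        with open(original, "w") as f:
            f.write("version 1")

        # Second directory entry pointing at the same inode.
        os.link(original, alias)

        # Append in place through the alias.
        with open(alias, "a") as f:
            f.write(" + edit")

        # The change is visible through the original name as well.
        with open(original) as f:
            content = f.read()
        return content, os.path.samefile(original, alias)


if __name__ == "__main__":
    print(hardlink_demo())
```

Note that this only holds for in-place writes. A program that saves by writing a new file and renaming it over the old one replaces just that one name, and the other names keep the old content.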
Current heuristic
The current heuristic is as follows (source, source2):
- If files have the same course id, name and size, they are deemed to be equal.
- If files have the same download url they are deemed to be equal.
Luckily for us, the ISIS server does most of the heavy lifting. It implements de-duplication on its own and provides the same download URL even across different courses. Thus, most of the space used (95% with my courses / files) is already de-duplicated. Sometimes, however, this doesn't really work, or the same videos are compressed slightly differently. But in that case no de-duplication could help us.
Now you might notice that this heuristic currently neglects identical documents across different courses. This is intended, since the only information available for comparison at that point is the file size.
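The two rules above could be sketched roughly like this. It is not the actual isisdl code: the `RemoteFile` class and its field names are assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical description of a file as reported by the server."""
    course_id: int
    name: str
    size: int
    url: str


def deemed_equal(a: RemoteFile, b: RemoteFile) -> bool:
    """Apply the two rules: same download URL, or same (course id, name, size)."""
    if a.url == b.url:
        return True
    return (a.course_id, a.name, a.size) == (b.course_id, b.name, b.size)


if __name__ == "__main__":
    slides = RemoteFile(1, "slides.pdf", 1024, "https://example.invalid/a")
    same_course = RemoteFile(1, "slides.pdf", 1024, "https://example.invalid/b")
    other_course = RemoteFile(2, "slides.pdf", 1024, "https://example.invalid/c")
    print(deemed_equal(slides, same_course))   # same course id, name and size
    print(deemed_equal(slides, other_course))  # different course, different URL
```

The second comparison shows the gap described above: the same document in another course, served under a different URL, is not detected before downloading.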
Why set the size on a per course basis?
The assertion that every size is unique within a single course is usually valid; however, it is not if you consider all courses. Because of that, the detection in advance is limited to files within a single course. There simply isn't enough information at this point in time to further determine whether files are equal.
This, however, can be solved if duplicates are de-duplicated after downloading. This is a bit less efficient, but considering that these files make up about 5% of the total download size, this should be fine. With the complete information about each file and an almost infinite read speed, it is possible to de-duplicate them even further.
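As a rough idea of what de-duplication after downloading looks like, here is a sketch in plain Python. It is an assumption-laden illustration, not how jdupes or isisdl work internally: it groups files by size, compares SHA-256 digests instead of comparing bytes directly, and the function names are made up.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file content, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root: str) -> int:
    """Replace content-identical files under root with hard links.

    Returns the summed size of the replaced files (a simplification:
    it assumes the replaced files had no other hard links).
    """
    by_size: Dict[int, List[str]] = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        first_by_digest: Dict[str, str] = {}
        for path in sorted(paths):
            keeper = first_by_digest.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first occurrence, or already linked
            tmp = path + ".dedup-tmp"  # hypothetical temporary name
            os.link(keeper, tmp)       # new name for the kept content
            os.replace(tmp, path)      # atomically swap it in
            saved += size
    return saved
```

Only files that share a size are hashed, so the extra read cost stays limited to the candidates that could actually be duplicates.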
I plan on using jdupes for the de-duplication after downloading. It will be part of the `isisdl --compress` routine and will probably only be available on Linux / if the binary is in the PATH.
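The "only if the binary is in the PATH" check could look something like the following sketch. The function names are hypothetical and this is not the planned isisdl implementation; the `-r -L` flags are the ones used in the jdupes run further down in this thread.

```python
import shutil
import subprocess
from typing import List, Optional


def build_jdupes_command(download_dir: str) -> Optional[List[str]]:
    """Return the jdupes command line, or None if jdupes is not in the PATH."""
    exe = shutil.which("jdupes")
    if exe is None:
        return None
    # -r: recurse into subdirectories, -L: hard-link duplicates together
    return [exe, "-r", "-L", download_dir]


def run_jdupes(download_dir: str) -> bool:
    """Run jdupes if it is available; return False if it was skipped."""
    cmd = build_jdupes_command(download_dir)
    if cmd is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

Skipping silently when the binary is missing keeps the feature optional, so nothing breaks for users who don't have jdupes installed.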
from isisdl.
I've just re-downloaded my entire contents of ISIS (181GB videos, 6GB Documents) and threw jdupes at it:
du -sb isisdl
>>> 192456576558 isisdl
jdupes -r -L isisdl
du -sb isisdl
>>> 192410524062 isisdl
As you can see, there is a decrease in size of about 44 MiB.
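For anyone who wants to check the arithmetic, the two `du -sb` values above work out as follows:

```python
before = 192_456_576_558  # bytes, first du -sb
after = 192_410_524_062   # bytes, second du -sb (after jdupes)

saved = before - after
saved_mib = round(saved / 2**20, 1)
saved_percent = round(100 * saved / before, 3)

print(saved)          # 46052496
print(saved_mib)      # 43.9
print(saved_percent)  # 0.024
```

So with this set of courses, the savings from jdupes are roughly 0.02% of the total.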
If you want, you can also post your results. It would be interesting to see whether it works better for your use case.
from isisdl.
Great technical write-up! This is definitely a reasonable approach. To be honest, I do not know a whole lot about filesystems (I'm a second-semester CS student anyway) and was just bold enough to request a feature without real in-depth knowledge. :D
I did not even know about hard links; I just discovered this deduplication feature of APFS and thought, hey, maybe this could save some disk space. But indeed, hard links are better and more reliable across the board. I'm looking forward to seeing this feature in isisdl.
I also ran `jdupes -r -L` and my downloads went down from 114GB to 71GB, so an almost 40% decrease. And I haven't even downloaded all the courses I need. Also, isisdl is not confused by hard links. When I used deduplication with `jdupes -B -r`, it messed up some metadata, I guess, and isisdl would not recognize all the data anymore and started downloading again. If this becomes part of the `--compress` option, it would be nice to have a choice between compression via ffmpeg and duplicate checking only, since the former requires quite a lot of CPU muscle and time.
HMU if you need some testing on macOS. So far jdupes is working fine, so I can imagine this does not have to be Linux-only functionality. The only two useful package managers on macOS also bring jdupes to the PATH, so this should not be a problem. A note about this on GitHub or during initial setup should make it clear. Since this is nice-to-have and not a crucial requirement, users will also be fine without it and could reconfigure their setup when they install jdupes afterwards.
from isisdl.
Sorry for responding so late, I've had a lot of work with Uni at the moment. Anyway - would you mind sharing the courses you are currently enrolled in? With this information I should be able to track down which files are not correctly hardlinked and where the space is saved, since this should be possible without `jdupes`.
`isisdl` itself doesn't use much metadata from the files: only the size and, when syncing, the entire content of the file. This is because "interesting" metadata about files is not uniform across different filesystems. When executing `isisdl` normally, the only metadata queried is the size. I would assume that `jdupes` did not mess with this attribute, so `isisdl` should not be confused about what has been downloaded and what has not. You can of course try executing `isisdl --sync` to synchronize the database; however, it should not be necessary. In fact, in my testing I found that `isisdl` was never confused about files after executing `jdupes`. Does `isisdl` get confused consistently, or only sometimes?
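To show why a size-only check should not care about hard links, here is a small sketch. `looks_downloaded` is a hypothetical stand-in for such a check, not the real isisdl function.

```python
import os
import tempfile


def looks_downloaded(path: str, expected_size: int) -> bool:
    """Hypothetical size-only check: the file exists and has the expected size."""
    try:
        return os.stat(path).st_size == expected_size
    except FileNotFoundError:
        return False


def size_survives_hardlinking() -> bool:
    """Replace a duplicate with a hard link and re-run the size check."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.bin")
        b = os.path.join(tmp, "b.bin")
        payload = b"x" * 4096
        for path in (a, b):
            with open(path, "wb") as f:
                f.write(payload)

        # What a hard-linking de-duplicator effectively does to the duplicate.
        os.remove(b)
        os.link(a, b)

        return (
            looks_downloaded(a, len(payload))
            and looks_downloaded(b, len(payload))
            and os.stat(b).st_nlink == 2  # both names share one inode
        )


if __name__ == "__main__":
    print(size_survives_hardlinking())
```

The reported size of each name stays the same after linking; only the link count changes, which a size-only check never looks at.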
I don't know exactly how or when I will implement this feature, but I want to keep the number of questions in the configuration assistant as low as possible. Maybe it will be a first step in the compression process, since checking for duplicates does not require that much CPU power. Afterwards the user could cancel the compression. But I'll think about that in due time.
Thanks anyway for the feature suggestion ^^
from isisdl.
So I don't quite know the status of development, but I hope my feedback still helps. I configured isisdl to download files for AlgoDat 2021 and 2022, Sysprog 2021 and 2022, and ForSA 2022. Also, I often have to run isisdl twice to download everything, because it misses new files in the first run.
from isisdl.
First of all, thanks for the courses. I could find all of them but SysProg. Can you send me the course ID located in https://isis.tu-berlin.de/course/view.php?id={} ?
I'll try jdupes myself on that dataset, and I'm interested in the results. Maybe the savings made by the jdupes algorithm could be natively integrated into `isisdl`'s file size reduction algorithm?
As for the current state of development: I would love to implement a frontend for jdupes. It is a bit tricky to let `isisdl` know which files should be which, and thus it takes a bit of time and effort to make that work. Currently, I don't have the time needed to implement this feature. Maybe I'll get around to it in the Semesterferien.
If you are, however, interested in coding a frontend for `jdupes`, I am gladly accepting pull requests ^^
For the multiple download bug: I could verify it. I don't know what causes it yet, but I think it's a course that is somehow broken.
from isisdl.
The course id for Sysprog 2022 is 28476 and for 2021, 23037.
I don't know if I can deliver satisfactory code quality for the jdupes frontend, but I would give it a try during the semester break (so in about a month). :-)
from isisdl.
Sounds good ^^ If you have any further questions regarding how `isisdl` works internally, feel free to ask ^^
from isisdl.