Comments (8)

Emily3403 commented on August 11, 2024

This is an excellent feature, and one that is already partially implemented.

First off: I find hardlinks to be a better alternative to copy-on-write since they provide the following:

  • Any reasonable modern filesystem (except maybe FAT32) supports them.
  • If you modify it in one "place" you modify it "everywhere". Depending on your use-case this might be a drawback. I find it to be a feature.
  • copy-on-write is not supported on ext4, which is the main file system used by isisdl users.

Considering these points, hard links are a better option than copy-on-write.
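To illustrate the "modify it in one place, modify it everywhere" property, here is a minimal Python sketch. The file names are made up for the example, and it assumes a filesystem that supports hard links:

```python
import os
import tempfile


def demo_hard_link():
    """Create a file, hard-link it, then overwrite it through the second name."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "lecture.pdf")    # hypothetical file name
        alias = os.path.join(tmp, "lecture_copy.pdf")  # hypothetical file name

        with open(original, "w") as f:
            f.write("v1")

        # Both names now point at the same data on disk.
        os.link(original, alias)

        # Writing through one name changes what the other name sees.
        with open(alias, "w") as f:
            f.write("v2")
        with open(original) as f:
            content = f.read()

        link_count = os.stat(original).st_nlink
        same_file = os.path.samefile(original, alias)
        return link_count, same_file, content
```

Calling demo_hard_link() should report a link count of 2, that both paths refer to the same file, and that the original name sees the new content.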

Current heuristic

The current heuristic is as follows (source, source2):

  1. If files have the same course id, name and size, they are deemed to be equal.
  2. If files have the same download url they are deemed to be equal.
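As a rough sketch, the two rules above could be written like this. RemoteFile and deemed_equal are hypothetical names for illustration, not the actual isisdl code (see the linked sources for that):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical stand-in for the information known before downloading."""
    course_id: int
    name: str
    size: int
    url: str


def deemed_equal(a, b):
    """Apply the two rules: same (course id, name, size) or same download url."""
    same_within_course = (a.course_id, a.name, a.size) == (b.course_id, b.name, b.size)
    same_url = a.url == b.url
    return same_within_course or same_url
```

Note that two files with the same name and size in different courses are not matched by rule 1. Only an identical download url links them across courses.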

Luckily for us, the ISIS server does most of the heavy lifting. It implements de-duplication on its own and provides the same download url even across different courses. Thus, most of the space used (95% with my courses / files) is already de-duplicated. Sometimes, however, this doesn't really work, or the same videos are compressed slightly differently. In that case no de-duplication could help us.

Now you might notice that this heuristic currently neglects identical documents across different courses. This is intended, since the only heuristic available before downloading is the file size.

Why set the size on a per course basis?

The assertion that every size is unique in a single course is usually valid, however it is not if you consider all courses. Because of that, the detection in advance is limited to files across a single course. There simply isn't enough information at this point in time to further determine if files are equal.

This, however, can be solved if duplicates are de-duplicated after downloading. This is a bit more inefficient, but considering that these files make up about 5% of the total download size, this should be fine. With the complete information about each file and an almost infinite read speed, it is possible to de-duplicate them even further.
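A minimal sketch of what de-duplicating after download could look like: group files by size, confirm equality with a content hash, then replace duplicates with hard links. This is only an illustration of the idea, not how jdupes or isisdl does it, and all function names and the ".dedup-tmp" suffix are made up:

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of paths below root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def hardlink_duplicates(root):
    """Replace duplicates with hard links; return the number of bytes freed."""
    saved = 0
    for keep, *duplicates in find_duplicates(root):
        for dup in duplicates:
            if os.path.samefile(keep, dup):
                continue  # already linked to the same data
            size = os.path.getsize(dup)
            tmp = dup + ".dedup-tmp"  # hypothetical temporary name
            os.link(keep, tmp)
            os.replace(tmp, dup)  # swap the link in place of the duplicate
            saved += size
    return saved
```

Only files that share a size are hashed, so the expensive read is limited to actual candidates.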

I plan on using jdupes for the de-duplication after downloading. It will be part of the isisdl --compress routine and will probably only be available on Linux, or wherever the binary is in the PATH.
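A sketch of how the "only if the binary is in the PATH" check might look, assuming jdupes is called with the -r -L flags used later in this thread. build_jdupes_command and run_jdupes are hypothetical helpers, not existing isisdl functions:

```python
import shutil
import subprocess


def build_jdupes_command(root):
    """Return the jdupes command line, or None if jdupes is not in the PATH."""
    executable = shutil.which("jdupes")
    if executable is None:
        return None
    # -r: recurse into subdirectories, -L: hard-link duplicate files together
    return [executable, "-r", "-L", root]


def run_jdupes(root):
    """Run jdupes on root if available; return whether it was executed."""
    command = build_jdupes_command(root)
    if command is None:
        return False  # feature silently unavailable on this machine
    subprocess.run(command, check=True)
    return True
```

Because the check only looks at the PATH, this would work on any platform where jdupes is installed, not just Linux.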

from isisdl.

Emily3403 commented on August 11, 2024

I've just re-downloaded my entire contents of ISIS (181GB videos, 6GB Documents) and threw jdupes at it:

du -sb isisdl
>>> 192456576558 isisdl

jdupes -r -L isisdl

du -sb isisdl
>>> 192410524062 isisdl

As you can see, the size decreased by only about 44MiB.
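For reference, the arithmetic behind that figure, using the two du -sb values from above:

```python
before = 192_456_576_558  # bytes reported by du -sb before running jdupes
after = 192_410_524_062   # bytes reported by du -sb after running jdupes

saved = before - after              # 46052496 bytes
saved_mib = saved / 2**20           # convert bytes to MiB
saved_percent = 100 * saved / before

print(f"{saved} bytes = {saved_mib:.1f} MiB ({saved_percent:.3f}% of the total)")
```

So relative to roughly 192GB, the additional saving is a small fraction of a percent.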

If you want, you can also post your results. It would be interesting to see whether it works better for your use-case.

 avatar commented on August 11, 2024

Great technical write-up! This is definitely a reasonable approach. To be honest, I do not know a whole lot about filesystems (I'm a second semester CS student anyway) and was just bold enough to request a feature without real in-depth knowledge. :D

I did not even know about hard links. I just discovered this deduplication feature of APFS and thought, hey, maybe this could save some disk space. But hard links are indeed better and more reliable across the board. I'm looking forward to seeing this feature in isisdl.

I also ran jdupes -r -L and my downloads went down from 114GB to 71GB, so an almost 40% decrease, and I haven't even downloaded all the courses I need. Also, isisdl is not confused by hard links. When I used deduplication with jdupes -B -r, it messed up some metadata, I guess, and isisdl would no longer recognize all the data and started downloading again. If this becomes part of the --compress option, it would be nice to be able to choose between compression via ffmpeg and duplicate checking only, since the former requires quite a lot of CPU muscle and time.

HMU if you need some testing on macOS. So far jdupes is working fine, so I can imagine this does not have to be Linux-only functionality. The only two useful package managers on macOS also bring jdupes to the PATH, so this should not be a problem. A note about this on GitHub or during initial setup should be enough. Since this is a nice-to-have and not a crucial requirement, users will also be fine without it and could reconfigure their setup when they install jdupes afterwards.

Emily3403 commented on August 11, 2024

Sorry for responding so late, I've had a lot of work with Uni at the moment. Anyway, would you mind sharing the courses you are currently enrolled in? With this information I should be able to track down which files are not correctly hard-linked and where the space is saved, since this should be possible without jdupes.

isisdl itself doesn't use much metadata from the files: only the size and, when syncing, the entire content of the file. This is because "interesting" metadata about files is not uniform across different filesystems. When executing isisdl normally, the only metadata queried is the size. I would assume that jdupes did not mess with this attribute, so isisdl should not be confused about what has been downloaded and what has not. You can of course try to execute isisdl --sync in order to synchronize the database, but it should not be necessary. In fact, in my testing I found that isisdl was never confused about files after executing jdupes. Does isisdl get confused consistently, or only sometimes?

I don't know exactly how or when I will implement this feature, but I want to keep the number of questions in the configuration assistant as low as possible. Maybe it will be the first step in the compression process, since checking for duplicates does not require that much CPU power. Afterwards the user could cancel the compression. But I'll think about that in due time.

Thanks anyway for the feature suggestion ^^

 avatar commented on August 11, 2024

So I don't quite know the status of development, but I hope my feedback still helps. I configured isisdl to download files for AlgoDat 2021 and 2022, Sysprog 2021 and 2022, and ForSA 2022. Also, I often have to run isisdl twice to download everything, because it misses new files in the first run.

Emily3403 commented on August 11, 2024

First of all thanks for the courses. I could find all of them but SysProg. Can you send me the course ID located in https://isis.tu-berlin.de/course/view.php?id={}?

I'll try jdupes myself on that dataset, and I'm interested in the results. Maybe the savings made by the jdupes algorithm could be natively integrated into isisdl's file size reduction algorithm?

As for the current state of development: I would love to implement a frontend for jdupes. It is a bit tricky to let isisdl know which files should be which, and thus it takes a bit of time and effort to make that work. Currently, I don't have the time needed for me to implement this feature. Maybe I'll get around to it in the Semesterferien.

If you are, however, interested in coding a frontend for jdupes, I gladly accept pull requests ^^

For the multiple download bug: I could verify it. I don't know what causes it yet, but I think it's a course that is somehow broken.

 avatar commented on August 11, 2024

The course id for Sysprog 2022 is 28476, and for 2021 it is 23037.
I don't know if I can deliver satisfactory code quality for the jdupes frontend, but I would give it a try during the semester break (so in about a month). :-)

Emily3403 commented on August 11, 2024

Sounds good ^^ If you have any further questions about how isisdl works internally, feel free to ask ^^
