
Comments (36)

lxmota commented on August 17, 2024

@bartgol Yup, that is what I'm advocating too, make the tests small.

from albany.

ibaned commented on August 17, 2024

We may also want to consider a setup like the one proposed in trilinos/Trilinos#1026.

lxmota commented on August 17, 2024

Some of the examples are too big, and some others have been abandoned and are stale (at least in the case of LCM). Some clean up is necessary. @jtostie and I did a little bit of that some time ago, but perhaps more is in order.

We might also consider splitting the source and examples, as they do for Sierra. I think it would be a good idea to have an examples repository where people can put really big meshes and other files if they want. But the tests within the source repository should be small and they should run fast.

As for the Git history, I'm not sure. How do really big projects manage that issue? Say, the Linux kernel, or closer to home, Trilinos.

ibaned commented on August 17, 2024

Agreed on your first two paragraphs. As for other repositories:

  • The Linux kernel has much stricter management that simply doesn't allow large files to get in. Since it is not a science app and testing is hard to automate for a kernel, they don't have the "input file" situation.
  • Trilinos did go through a history rewrite in part to save space prior to their transition to GitHub. @bartlettroscoe has the details on that if you're interested, but I get the sense that significant savings were made but more could be achieved in theory.

lxmota commented on August 17, 2024

Ok, but does the Linux kernel have issues with their Git history?

I'll ask Ross what they did for Trilinos.

ibaned commented on August 17, 2024

Here is an in-depth description of Linus Torvalds' policy, which can be summarized as:

  • keep your work private and clean up / rewrite your own history in private as needed
  • once it is public (on GitHub in our case) or involves other people's work, no more changing history

http://www.mail-archive.com/[email protected]/msg39091.html

I myself could do better about this, and will try cleaning things a bit before pushing.

lxmota commented on August 17, 2024

Would this be improved if people added this to their .gitconfig?

[branch]
autosetuprebase = always

[branch "master"]
rebase = true

Then we would not see the myriad of merge commits that we see now and the history would be more compact.
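
For reference, the effect can be checked in a throwaway repo: with branch.autosetuprebase = always, any new tracking branch is created with rebase = true (all paths and names below are made up):

```shell
# Throwaway repo; all names are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email dev@example.com && git config user.name Dev
git commit -q --allow-empty -m "initial"
# With autosetuprebase = always, new tracking branches get rebase = true
git -c branch.autosetuprebase=always branch --track topic "$(git symbolic-ref --short HEAD)"
git config branch.topic.rebase   # prints "true"
```

Pulls on such a branch then rebase instead of creating merge commits.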

mperego commented on August 17, 2024

I always rebase before pushing when I work in master. But I'm not sure I'd like an automatic rebase.

ibaned commented on August 17, 2024

I actually don't like excessive rebasing; merge commits in the history are fine with me. One quick example: Trilinos does a fast-forward pull from develop to master. If they used merge commits, then we could tell exactly when develop was copied into master. Likewise, if a big change gets merged into Albany and it is very problematic, if it is a merge commit then the merge commit can be reverted by itself. If it is many commits that have been rebased, then they all need to be reverted one at a time.
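
To make the revert point concrete, here is a throwaway-repo sketch (names made up): a feature merged with a merge commit is backed out with a single `git revert -m 1`:

```shell
# Toy repo showing a one-step revert of a merged feature (names made up).
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email dev@example.com && git config user.name Dev
echo base > base.txt && git add base.txt && git commit -qm "base"
main=$(git symbolic-ref --short HEAD)
git checkout -qb feature
echo work > feature.txt && git add feature.txt && git commit -qm "feature work"
git checkout -q "$main"
git merge -q --no-ff -m "merge feature" feature
# -m 1 means "keep the first-parent (mainline) side"; one command undoes it all
git revert --no-edit -m 1 HEAD
ls   # feature.txt is gone again
```

If the feature had instead been rebased onto the mainline as several commits, each would need its own revert.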

bartlettroscoe commented on August 17, 2024

I actually don't like excessive rebasing; merge commits in the history are fine with me.

The decision to rebase or merge is based on the chosen workflow. The git workflow building blocks, and when to use a rebase or a merge, are described in detail at:

It should not be a haphazard decision that independent developers make on their own.

Generally, you want to rebase when you are doing the simple centralized workflow:

And note that one does the same workflow on a shared topic/release branch. But with the topic/release branch workflow, that branch is always (explicitly) merged into the mainline branch (be it 'develop' or 'master').

One quick example: Trilinos does a fast-forward pull from develop to master. If they used merge commits, then we could tell exactly when develop was copied into master.

Trilinos should create explicit merge commits from 'develop' to 'master' for many good reasons but does not (you will have to ask Jim W. why not, he is in charge of that). Therefore, Trilinos is not really following the develop/master workflow as documented at:

Likewise, if a big change gets merged into Albany and it is very problematic, if it is a merge commit then the merge commit can be reverted by itself. If it is many commits that have been rebased, then they all need to be reverted one at a time.

Right.

bartlettroscoe commented on August 17, 2024

Trilinos did go through a history rewrite in part to save space prior to their transition to GitHub. @bartlettroscoe has the details on that if you're interested.

You will have to ask Brent P. and Jim W. what they actually did for Trilinos. I pointed them to bfg:

and I think they used it. But you can also use git filter-branch, which is more flexible, if it does not take too long.
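
For what it's worth, a minimal git filter-branch sketch along those lines (toy repo; `big.exo` is a stand-in name) that strips one large file from every commit and then repacks so the space is actually reclaimed:

```shell
# Toy repo; big.exo stands in for a large mesh committed by mistake.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's 10s warning
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email dev@example.com && git config user.name Dev
head -c 1048576 /dev/zero > big.exo
echo "real source" > source.c
git add . && git commit -qm "source plus mesh"
# Rewrite every commit, dropping big.exo from each tree
git filter-branch -f --index-filter \
  'git rm --cached --ignore-unmatch -q big.exo' -- --all
# Drop the backup refs and repack, or the old blobs stay reachable
git for-each-ref --format='%(refname)' refs/original |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc --prune=now --aggressive --quiet
git ls-files   # only source.c remains tracked
```

The same index-filter pattern scales to many paths, which is why filter-branch is more flexible than BFG, just slower on big histories.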

I recommend putting large files (for apps or testing) into separate git repos from the main source code. Managing a moderate number of git repos is easy with a tool like gitdist:

but I get the sense that significant savings were made but more could be achieved in theory

Yes, it was a missed opportunity. We could have easily reduced the size of the Trilinos repo by at least another factor of two, but people objected and that was it.

ibaned commented on August 17, 2024

I guess I'll revise my statement to saying I support the use of topic branches as described in that document. I also support the use of a develop branch, but that is a separate issue.

I think I may have confused @lxmota by not clearly communicating the data at the beginning. I called the .git directory the "history", but that also includes (compressed) copies of every file at every version, including the mesh files. So when I say that .git is large, the vast majority of that space can be attributed to copies of large mesh files at different versions, as opposed to having too many commits. Establishing a formal workflow is a good thing and we should do it, but reduction of commits is not the main way we'll reduce the repository size; dealing with large non-source files is the way to do that (including filtering them out of the history).

bartgol commented on August 17, 2024

What about adding a size limit for all test input data? Something like a few MB. After all, the nightly builds only need to check the integrity of the software.

If we want to keep the large examples, we could create a subrepo that one can clone only if they want that extra stuff.

We could then clean up the git history and remove the large files' history (effectively reducing the .git folder size) by using git filter-branch.

lxmota commented on August 17, 2024

@ibaned Thanks for the clarification. Yes, I agree that cleaning non-source files from .git would improve things.

@bartgol Creating a separate examples repository is exactly what I'm advocating for, as they do in Sierra. See above https://github.com/gahansen/Albany/issues/39#issuecomment-276852711

ibaned commented on August 17, 2024

I also agree with the steps outlined by @bartgol.

I just ran the following in examples/:

~/src/Albany/examples$ find . -type f | xargs du -s | sort -n > files_size.txt

It seems there are a handful of files above 10MB, and a couple dozen above 1MB (out of >3000 files total).
The most embarrassing thing is that our biggest file is a 62MB movie in MP4 format, so there is an easy saving right there.
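
Note that `du` over the checkout only counts the current versions; to see what is actually costing space inside .git, one can rank every blob in the history by size. A sketch on a throwaway repo (the pipeline itself works on any clone; `movie.mp4` is a stand-in):

```shell
# Throwaway repo; movie.mp4 stands in for a big binary.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email dev@example.com && git config user.name Dev
head -c 2097152 /dev/zero > movie.mp4
echo "tiny" > note.txt
git add . && git commit -qm "files"
# Every blob in the history with its uncompressed size in bytes, largest last
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -n
```

Unlike `du` in the working tree, this also surfaces old versions of files that were later shrunk or deleted.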

bartgol commented on August 17, 2024

@ibaned And as you mentioned above, the big cost comes from the commits that modify those big files. If there are large files that have been modified over time, but whose history we are not interested in (meshes, input data, etc.), we could erase them from the history and re-add them.

@lxmota Creating an examples repo sounds good to me. But I would then create a test directory for the standard Albany testing (lightweight and not too problem specific). Sure, one could just run the examples as a test suite, but if someone downloads Albany (without examples) and modifies some file, they should still be able to test its integrity without downloading the (possibly big) examples repo.

djlittl commented on August 17, 2024

I did something similar once when moving the Peridigm repo from a Sandia machine to GitHub. I used the BFG tool, with Brent's help. My notes on exactly what I did (command history) are below. The result was the complete removal of a few large files, which significantly reduced the overall repo size.

Be warned, I believe the changes made by running these commands are permanent and not reversible. Well, I suppose nothing is truly irreversible, but this is pretty close. I rehearsed several times before pushing the changes, which permanently remove the entire history for the given large files (7M+ in this case).

git clone --mirror ssh://software.sandia.gov/git/peridigm peridigm
java -jar ~/Desktop/bfg-1.12.8.jar --strip-blobs-bigger-than 7M peridigm
cd peridigm
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

ibaned commented on August 17, 2024

@djlittl thanks for the commands!
It's true that this can wipe out the local repository, but as long as you don't push, it's contained to the local machine. I may do a few practice runs of this and report the results.

bartgol commented on August 17, 2024

Well, if we agree that some files should be removed from the repo, then the fact that it's irreversible should not be a big concern. Besides, if we are talking about input meshes or data, then the history of the file is probably not very important. If we decide we should reinsert a file in the repo, we can do it, and we will still end up with a smaller repo (due to the erased history).

But yeah, some local practice runs are a good idea.

bartgol commented on August 17, 2024

We could create a branch where we test the approach proposed in that Trilinos PR. Then we evaluate 1) how much the repo shrinks, and 2) how easy it is to run the "extended" test suite.

ibaned commented on August 17, 2024

Okay, I've done two things.

First, I created this repository to show that the BFG repo cleaner can be effectively used to erase the history of the examples/ directory, that the resulting repository is 30MB in size, which is a 20X reduction from the previous 860MB, and that if we push the rewritten history to GitHub, subsequent clones only grab the new 30MB history. I'm fairly certain we'll do this concurrently with the resolution of #36, so that when people have to re-clone due to the new organization name, they also receive the cleaned-up history.

Second, I created this branch to test the ExternalData system on a few SCOREC data files. The good news is that the system works, meaning:

  • The contents of the data file are replaced with its hash (I chose SHA1, so that is 20 bytes), which is negligible, so the space savings are essentially the same as removing the data file.
  • When you compile Albany, it automatically downloads the data files from SourceForge and puts them in the build directory (there is a directory for the downloaded content, and it makes symlinks to that from the expected places):
daibane@rendezvous:~/build/gcc/Albany/examples/LCM/SCOREC/meshes/cube$ ls
CMakeFiles            cube1.smb.sha1-stamp  cube.dmg.sha1-stamp        cube-quad2.smb.sha1-stamp  cube-quad-serial0.smb.sha1-stamp
cmake_install.cmake   cube2.smb             cube-quad0.smb             cube-quad3.smb             cube-serial0.smb
CTestTestfile.cmake   cube2.smb.sha1-stamp  cube-quad0.smb.sha1-stamp  cube-quad3.smb.sha1-stamp  cube-serial0.smb.sha1-stamp
cube0.smb             cube3.smb             cube-quad1.smb             cube-quad.dmg              lcm_scorec_meshes_data_config.cmake
cube0.smb.sha1-stamp  cube3.smb.sha1-stamp  cube-quad1.smb.sha1-stamp  cube-quad.dmg.sha1-stamp   Makefile
cube1.smb             cube.dmg              cube-quad2.smb             cube-quad-serial0.smb
daibane@rendezvous:~/build/gcc/Albany/examples/LCM/SCOREC/meshes/cube$ file cube2.smb
cube2.smb: symbolic link to ../../../../../ExternalData/Objects/SHA1/ea82fcb6d4c07f5f3d6812128a1dd21df48358c5
daibane@rendezvous:~/build/gcc/Albany/examples/LCM/SCOREC/meshes/cube$ cat cube2.smb.sha1-stamp 
ea82fcb6d4c07f5f3d6812128a1dd21df48358c5
  • There is no change to the user workflow (once the files are set up): I just compile and run ctest like before and all the tests that use those files pass.
  • I think only the data files needed for tests that are active are downloaded, so for example if you don't enable LCM then you don't download LCM data files.

The bad news is that (as suggested by @maxrpi even before I did this), it is a bit complex to set up:

  • We need to bring another service (SourceForge) into the picture, developers have to make accounts there as well and we may have to get approval to use this service.
  • To add a new data file, you have to compute its hash, add to Git a file containing only the hash, and upload to SourceForge a file whose name is the hash and whose content is the data.
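
That add-a-file dance is mechanical, though. A hand-rolled sketch of the content-link layout (stand-in file name; a local Objects/ directory standing in for the SourceForge side) showing the moving parts:

```shell
# Hand-rolled sketch of the ExternalData content-link layout.
# cube2.smb is a stand-in name; Objects/ stands in for the remote store.
set -e
cd "$(mktemp -d)"
printf 'mesh bytes\n' > cube2.smb
hash=$(sha1sum cube2.smb | awk '{print $1}')
# The repo tracks only this tiny pointer file
echo "$hash" > cube2.smb.sha1
# The data store keys the real content by its hash
mkdir -p Objects/SHA1
mv cube2.smb "Objects/SHA1/$hash"
# At build time ExternalData recreates the file as a symlink to the store
ln -s "Objects/SHA1/$hash" cube2.smb
test "$(sha1sum cube2.smb | awk '{print $1}')" = "$hash" && echo "pointer matches"
```

The upload step is the same idea in reverse: put the file on the server under its hash, commit only the .sha1 pointer.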

An easier-to-use, although less computationally efficient, approach is to simply make examples/ its own Git repository, under the new organization (#36). People can then simply clone examples into Albany or not, depending on what they want to do. Dealing with examples would still be as slow as dealing with Albany now, but the two would be separate and the source is what really matters.

Although I personally have come to like ExternalData and will try it in personal projects, I can see how concerns over workflow complexity would lead us to choose the easier-to-use option, and that would be acceptable.

bartgol commented on August 17, 2024

Great work Dan!

I was also thinking about the maintainability of having two different services (GitHub and SourceForge). I don't think it's a huge cost, but as you said, we may explore the idea of having examples as a separate git repo to keep everything in the same place. If we do that, we could think about doing it with ExternalProject_Add, so that the download and build (which would not require any compilation, just a setup of a build subtree) would be performed by CMake, without the user having to manually clone the repo. From the user's point of view, this should be roughly similar to the solution proposed in Trilinos, although it would create a bunch of temporary folders for caching, download, source, and build...

ibaned commented on August 17, 2024

Well, I'd like to have an easy-to-use option where the tests/examples are not downloaded, for the use case of people just trying to compile Albany on a cluster to run their big case; they don't intend to run the CTest suite in that workflow.

bartgol commented on August 17, 2024

Uhm, but how do you suggest avoiding the download step? It appears to me that, if you remove the examples from the git repo, an automatic nightly build would have to fetch them somehow. We could add an optional CMake variable specifying an existing "installation" of the examples folder, which Albany would then use as the "source" to configure the examples directory in the build directory. This way one only downloads the examples repo once; in subsequent builds, CMake locates a valid examples "installation" and uses that. If someone does not specify the installation, CMake proceeds to check the directory where it would "install" the examples repo, in case CMake already took care of that in a previous configuration. If found, good; otherwise it will finally proceed to download the repo and "build" it.

ibaned commented on August 17, 2024

An automatic nightly build does need the examples, but not all of our users are automatic nightly builds. If we assume that everyone always needs examples, then there is no point in removing them and the repo will always be huge. There needs to be a build configuration that doesn't need or download examples, where you get the Albany application but no tests.

bartlettroscoe commented on August 17, 2024

Personally, I think it is a bad idea to mix version control and/or network communication into the configure and/or build process (which is what ExternalData is doing). It is much better to do all version control and data fetching up front, and then do your configures and builds 100% locally.

Instead, the way CASL (and other projects) handle this type of large file is to put them into separate git repos. Then you can use git LFS on these data repos to manage that binary data better, or whatever. And you don't need to learn a new process, you just use git.

Just my two cents. I will mute the thread now.

ibaned commented on August 17, 2024

Yeah, I think a separate Git repo is the way we're leaning. I'm not sure if we'll use LFS; it depends how annoying it is for people to install LFS on the relevant machines.

bartlettroscoe commented on August 17, 2024

Git LFS is pretty easy to install. I prototyped this for Drekar. See:

But Git-LFS has some disadvantages you need to be aware of:

But if your data is not huge and is not changing often, I would not bother with git-lfs. I would just use a regular (separate) git repo and then clean it out every now and then (i.e. remove old history, or start over and reclone). It depends how important the VC history of these files is to your project. But if you are using the CMake ExternalData module, you are not really tracking history on those files anyway.
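
For context, tracking a pattern with Git LFS just records a filter line in .gitattributes; here it is written by hand (`*.exo` is a stand-in pattern) so the sketch runs even without git-lfs installed:

```shell
# Throwaway repo; this is the line `git lfs track "*.exo"` would write
# (after `git lfs install`), written by hand here.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
printf '*.exo filter=lfs diff=lfs merge=lfs -text\n' >> .gitattributes
git add .gitattributes
cat .gitattributes
```

From then on, matching files are stored as small pointer files in git and the real content lives in the LFS store, which is the disadvantage/advantage trade-off being discussed.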

bartgol commented on August 17, 2024

Uh, sorry for the confusion. I wasn't planning on having the examples downloaded for everyone, of course. The examples external repo would be fetched only if ALBANY_ENABLE_EXAMPLES is ON.

On one hand, I agree that an automatic network connection and download in the config/build process is sub-optimal. On the other hand, it has the advantage that the user does not need to know that they have to download another repo. It may be confusing (at first, at least) to see an option/folder in Albany that never builds (until you realize you were missing a separate repo). Automatic fetch can hide this double-repo detail from the average user.

ibaned commented on August 17, 2024

I did a few more tests with BFG, and it looks like removing files by size and extension is actually not helping that much. If I just remove from the history all files >200KB (any lower and I would remove Albany_Application.cpp), the repo only goes down to 400MB, about half the size. As for extensions, it seems a lot of our space is used by files besides Exodus. There are ".sms", ".smb", ".out", ".out.4.2", ".nc", ".nb" (Mathematica notebooks??), ".mm", ".pdf", ".mp4", and finally there are FELIX/AsciiMeshes that simply don't have an extension; instead they are just called "xyz0" or something. Anyway, my conclusion is that the only way we'll get significant reductions in the history size is if we remove examples/ entirely.

As for re-integration methods, submodules actually seem like a decent idea, because unless one runs certain extra commands, the submodules are not brought in automatically. In addition, the relation between commits in the different repositories will be tracked, and we can turn it off at any time without needing to rewrite history again.
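
A quick submodule sketch with throwaway repos (all paths made up; recent git needs the protocol.file.allow override for local-path submodules):

```shell
# Toy superproject/submodule pair; "Albany" and "examples" are stand-ins.
set -e
top=$(mktemp -d)
cd "$top"
git init -q examples
git -C examples -c user.email=dev@example.com -c user.name=Dev \
  commit -q --allow-empty -m "examples content"
git init -q Albany && cd Albany
git config user.email dev@example.com && git config user.name Dev
git commit -q --allow-empty -m "source"
# Record examples as a submodule; plain clones do NOT fetch it automatically
git -c protocol.file.allow=always submodule --quiet add "$top/examples" examples
git commit -qm "track examples as a submodule"
cat .gitmodules
```

A fresh clone of the superproject leaves examples/ empty until someone runs `git submodule update --init`, which is exactly the opt-in behavior described above.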

ibaned commented on August 17, 2024

So, I've created a directory called large-tests/, and have moved a lot of files there that used to be in examples/. All of those are files that were never used by CTest in any configuration; if I made a mistake I'll move one back. The examples/ directory is now down to 456MB, but I'd certainly like to keep reducing its size. Here are the subdirectories bigger than 1MB, sorted by size (units are 1KB blocks):

1164	CahnHillElast2D/
4044	MOR/
7084	AMP/
21608	PerformanceTests/
22100	QCAD/
52632	ATO/
56512	Aeras/
119660	FELIX/
174824	LCM/

And here are the remaining files over 5MB (several of which I confirmed are still used by CTest):

5044  ./ATO/RegHeaviside_3D/RegHeaviside_3D.ref.exo
5328  ./FELIX/ExoMeshes/gis20km_upn4_in.exo
5760  ./LCM/Schwarz/NotchedCylinder/hex-tet-large/notched-cylinder-0.g
5808  ./FELIX/FO_GIS/gis20km_out_tpetra.exo.4.2
5840  ./ATO/FixedBlocks/FixedBlocks.ref.exo
5864  ./FELIX/FO_GIS/gis20km_out_tpetra.exo.4.0
5876  ./FELIX/FO_GIS/gis20km_out_tpetra.exo.4.1
5900  ./FELIX/FO_GIS/gis20km_out_tpetra.exo.4.3
6312  ./AMP/PhaseContinuation/movinglaser.sms
6508  ./ATO/MultiPhys_Homogenize_2D/MultiPhys_Homogenize_2D.ref.exo
7248  ./LCM/MechWithHydrogenFastPath/surface_diffusion/surfaceDiffusion.gold.e
7316  ./FELIX/CismAlbany/ncGridSamples/greenland.nc
7392  ./LCM/PeridigmCoupling/WaveInBarFEM/WaveInBar.gold.e
7684  ./QCAD/input_exodus/pointcharge_3D.exo
21648 ./LCM/HMC/Transient/TransientHMC_2DQuad/TransientHMC_2DQuad.ref.exo

My plan is to have large-tests/ be a CMake subdirectory that can execute tests just like examples/ can right now, and move some of the larger tests there. The short-term impact should be small because it's still part of the repository, but in the future it will be separated into its own repository.

I'll also try to rename examples/ to tests/ in the near future.

bartgol commented on August 17, 2024

How about making large-tests a subdirectory of examples? Perhaps have two subdirectories of examples, like examples/small-tests and examples/large-tests (or some other names), which makes clear that they are all tests, separated by size/execution time. This would keep the top-level directory slightly cleaner.

ibaned commented on August 17, 2024

I didn't know that was a concern, but sure, I've done that and a few other things. Now there are just four directories in the root.

bartgol commented on August 17, 2024

Well, that's just the way I like code trees: slim and organized, easy to read for new users. But it's just MY view. ;-)

ibaned commented on August 17, 2024

The tests/ directory is pretty much organized. Waiting on #36.

bartgol commented on August 17, 2024

This has been done. Closing.
