
Comments (19)

ResidentMario commented on June 1, 2024

Yes, there are a few ways of hosting example data of this sort. Investigating that ecosystem is on my to-do list, actually.

from geoplot.

ResidentMario commented on June 1, 2024

@choldgraf OK so I moved the data to a separate repo. What remains is removing these files from the git history.

I don't suppose you know how to do that? It's an awful lot of magic...


choldgraf commented on June 1, 2024

hhmmmm - it's something I've done but that was a long time ago :-)

usually I remind myself with this SO post:

https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository

this tool has always seemed helpful, though I've never used it since in my case it was usually just one file

https://rtyley.github.io/bfg-repo-cleaner/

A challenge here is that this rewrites git history, so I think it might mess up people's forks when they try to commit (double check this though). That said, it's a good reason to nip these things in the bud sooner rather than later....


choldgraf commented on June 1, 2024

See here for some answers: https://twitter.com/huitseeker/status/909094893833695232

It sounds like people will need to rebase onto master if they've already got forks, but other than that I think you're safe to do this. Another person recommended the BFG thing above :-)
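
For anyone with a branch based on the old history, here is a rough sketch of what that rebase could look like. It is only an illustration in a throwaway repository: the file names, the `feature` branch, and the `old_base` variable are all made up, and `git commit --amend` stands in for the real history rewrite.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical names throughout).
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com

# Old history: one commit containing both code and a data file.
echo "print('hi')" > code.py
head -c 1000 /dev/zero > data.bin
git add .
git commit -q -m "Add code and data"
old_base=$(git rev-parse HEAD)

# A contributor branches off the old history.
git checkout -q -b feature
echo "print('more')" >> code.py
git commit -q -am "Feature work"

# The maintainer rewrites master so the data file never existed.
git checkout -q master
git rm -q data.bin
git commit -q --amend -m "Add code"

# Replay only the contributor's own commits onto the rewritten master.
git rebase -q --onto master "$old_base" feature

feature_files=$(git ls-tree -r --name-only feature)
feature_parent=$(git rev-parse feature~1)
new_master=$(git rev-parse master)
echo "files on feature: $feature_files"
```

The `--onto` form matters here: a plain `git rebase master` would try to replay the old data commit as well, since it is no longer part of master's history.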


ResidentMario commented on June 1, 2024

I tried the manual way provided in the StackOverflow thread, but that did not help: after pushing a rebase, the data was still there.

I will try the BFG approach tomorrow.



ResidentMario commented on June 1, 2024
(geoplot) Honorss-MacBook-Air-42:geoplot.git Honors$ java -jar ../bfg-1.12.15.jar --delete-folders "data" .

Using repo : /Users/Honors/Desktop/geoplot.git/.

Found 196 objects to protect
Found 4 tag-pointing refs : refs/tags/0.0.1, refs/tags/0.0.2, refs/tags/0.0.3, refs/tags/0.0.4
Found 4 commit-pointing refs : HEAD, refs/heads/master, refs/pull/34/head, refs/pull/34/merge

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 613080fd (protected by 'HEAD')

Cleaning
--------

Found 257 commits
Cleaning commits:       100% (257/257)
Cleaning commits completed in 635 ms.

Updating 7 Refs
---------------

	Ref                  Before     After   
	----------------------------------------
	refs/heads/master  | 613080fd | ca2eecfd
	refs/pull/34/head  | 56f0e66c | c61b4257
	refs/pull/34/merge | 04f1aefb | 63c3ca9b
	refs/tags/0.0.1    | 5822a3d5 | 58b4d3cc
	refs/tags/0.0.2    | 364a880c | b9dd4bf7
	refs/tags/0.0.3    | 227db476 | 654cbec4
	refs/tags/0.0.4    | c8a23d08 | a54b814a

Updating references:    100% (7/7)
...Ref update completed in 39 ms.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	...............DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After   
	-------------------------------------------
	First modified commit | 9ddd5f1f | e5b19f0b
	Last dirty commit     | 42dc047a | 8e55e249


In total, 291 object ids were changed. Full details are logged here:

	/Users/Honors/Desktop/geoplot.git/..bfg-report/2017-09-17/10-42-56

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive


--
You can rewrite history in Git - don't let Trump do it for real!
Trump's administration has lied consistently, to make people give up on ever
being told the truth. Don't give up: https://github.com/bkeepers/stop-trump
--


(geoplot) Honorss-MacBook-Air-42:geoplot.git Honors$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Counting objects: 2184, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2148/2148), done.
Writing objects: 100% (2184/2184), done.
Total 2184 (delta 1091), reused 890 (delta 0)

(geoplot) Honorss-MacBook-Air-42:geoplot.git Honors$ git push
fatal: remote error: 
  You can't push to git://github.com/ResidentMario/geoplot.git
  Use https://github.com/ResidentMario/geoplot.git
(geoplot) Honorss-MacBook-Air-42:geoplot.git Honors$ git push --set-upstream https://github.com/ResidentMario/geoplot.git master
To https://github.com/ResidentMario/geoplot.git
 ! [rejected]        master -> master (fetch first)
error: failed to push some refs to 'https://github.com/ResidentMario/geoplot.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
(geoplot) Honorss-MacBook-Air-42:geoplot.git Honors$ git push --set-upstream https://github.com/ResidentMario/geoplot.git master --force
Counting objects: 1997, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (1013/1013), done.
Writing objects: 100% (1997/1997), 116.10 MiB | 48.00 KiB/s, done.
Total 1997 (delta 963), reused 1980 (delta 955)
remote: Resolving deltas: 100% (963/963), done.
To https://github.com/ResidentMario/geoplot.git
 + 613080f...ca2eecf master -> master (forced update)
Branch master set up to track remote branch master from https://github.com/ResidentMario/geoplot.git.

It seems to have worked. When I look at the repo commit history, I no longer see a data folder in any commits!

But...

(geoplot) Honorss-MacBook-Air-42:Desktop Honors$ git clone https://github.com/ResidentMario/geoplot.git
Cloning into 'geoplot'...
remote: Counting objects: 2285, done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 2285 (delta 87), reused 176 (delta 85), pack-reused 2105
Receiving objects: 100% (2285/2285), 152.76 MiB | 5.76 MiB/s, done.
Resolving deltas: 100% (1062/1062), done.
Checking connectivity... done.

...it's still 150 MiB.

The files seem to be well and truly gone. But the repo is still the same size as it was before!


choldgraf commented on June 1, 2024

huh, that's strange! and the files are gone from history and everything?


ResidentMario commented on June 1, 2024

Yes. I asked this Q on StackOverflow.


asottile commented on June 1, 2024

you need to force-push all the tags and branches as well (not just master)
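
A minimal sketch of that, assuming a remote named `origin`. To keep it runnable anywhere, a local bare repository stands in for GitHub and `git commit --amend` stands in for the history rewrite; the paths and file names are made up.

```shell
#!/bin/sh
set -eu

# A local bare repository plays the role of the GitHub remote (assumption).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com
git remote add origin "$work/remote.git"

echo one > file.txt
git add file.txt
git commit -q -m "first"
git tag 0.0.1
git push -q origin master 0.0.1

# Simulate a history rewrite: the commit changes and the tag moves with it.
git commit -q --amend -m "first (rewritten)"
git tag -f 0.0.1 > /dev/null

# Force-push every branch, then every tag, not just master.
git push -q --force --all origin
git push -q --force --tags origin

local_tag=$(git rev-parse 0.0.1)
remote_tag=$(git --git-dir="$work/remote.git" rev-parse 0.0.1)
echo "local tag:  $local_tag"
echo "remote tag: $remote_tag"
```

Until the tags are pushed as well, the remote's tags keep pointing at the old commits, so the old objects stay reachable there.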


ResidentMario commented on June 1, 2024

@asottile I ran:

git tag -d 0.0.3
git tag 0.0.3 fb27de2
git tag -d 0.0.4
git tag 0.0.4 9f7e5a9
git push --tags origin --force

Which netted:

Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/ResidentMario/geoplot.git
 + 227db47...fb27de2 0.0.3 -> 0.0.3 (forced update)
 + c8a23d0...9f7e5a9 0.0.4 -> 0.0.4 (forced update)

But downloading and unpacking [email protected] from here opens up 191 MB on disk (are these numbers going...up?).


asottile commented on June 1, 2024

Those aren't the revisions I expect given the output above. Your tags still contain the data history:

$ git tag -l | xargs --replace bash -c 'echo ============ && echo {} && echo ============ && git log --oneline {} -- data'
============
0.0.1
============
ec23e99 Demos.
cac4125 Work on aggplot.
e42eb0c Swap examples.
1114725 Populate examples plage.
2420b0e Another example.
57c20c8 Another example. Implement custom geometry in sankey.
fee0fd9 Another example.
30a4c34 Another example.
51c0d28 Another example.
ccbb393 WSubplotting, first example done.
9ddd5f1 Upload data.
============
0.0.2
============
ec23e99 Demos.
cac4125 Work on aggplot.
e42eb0c Swap examples.
1114725 Populate examples plage.
2420b0e Another example.
57c20c8 Another example. Implement custom geometry in sankey.
fee0fd9 Another example.
30a4c34 Another example.
51c0d28 Another example.
ccbb393 WSubplotting, first example done.
9ddd5f1 Upload data.
============
0.0.3
============
ec23e99 Demos.
cac4125 Work on aggplot.
e42eb0c Swap examples.
1114725 Populate examples plage.
2420b0e Another example.
57c20c8 Another example. Implement custom geometry in sankey.
fee0fd9 Another example.
30a4c34 Another example.
51c0d28 Another example.
ccbb393 WSubplotting, first example done.
9ddd5f1 Upload data.
============
0.0.4
============
ec23e99 Demos.
cac4125 Work on aggplot.
e42eb0c Swap examples.
1114725 Populate examples plage.
2420b0e Another example.
57c20c8 Another example. Implement custom geometry in sankey.
fee0fd9 Another example.
30a4c34 Another example.
51c0d28 Another example.
ccbb393 WSubplotting, first example done.
9ddd5f1 Upload data.


asottile commented on June 1, 2024

For example, I expect the 0.0.4 tag to point to 4e1fbf5 (part of your master history)


ResidentMario commented on June 1, 2024

Yeah, I noticed this too (that they still contain the folder, even). I didn't delete the GitHub tags before pushing the local ones, which appears to have resulted in no change (?).

I ran a bunch of git push --delete origin 0.0.x commands (per this Gist) then recreated the 0.0.4 release using the button on GitHub. That did seem to work---there's now just [email protected], which is 24 MB zipped. The repo is now down to 110 MiB when cloned, but that's still clearly too much?
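
Roughly, the tag cleanup looked like the sketch below. This is a reconstruction rather than a transcript: a local bare repository stands in for GitHub, and the new 0.0.4 tag is created from the command line here instead of through the release button.

```shell
#!/bin/sh
set -eu

# A local bare repository plays the role of the GitHub remote (assumption).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.com
git remote add origin "$work/remote.git"

echo one > file.txt
git add file.txt
git commit -q -m "first"
for v in 0.0.1 0.0.2 0.0.3 0.0.4; do
  git tag "$v"
done
git push -q origin master
git push -q origin --tags

# Delete each old tag on the remote first, then locally.
for v in 0.0.1 0.0.2 0.0.3 0.0.4; do
  git push -q --delete origin "$v"
  git tag -d "$v" > /dev/null
done

# Recreate only the latest release tag from the current state and publish it.
git tag 0.0.4
git push -q origin 0.0.4

remote_tags=$(git ls-remote --tags origin | awk '{ print $2 }')
echo "$remote_tags"
```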


ResidentMario commented on June 1, 2024

To summarize the changes thus far: I followed the BFG sequence above and, on the advice above, deleted all of the old tags (0.0.1 through 0.0.4). Then I created a new 0.0.4 tag based on the present state of the repository.

Overall, this reduced the git clone size from 150-ish MiB to 100-ish MiB, which still doesn't seem correct to me.

After further reflection I'm realizing that there are other large file diffs that I have pushed into history that are causing this excessive size. The figures folder contains a set of images for the website that were updated relatively often; there used to be an html folder with the raw website output. I've now cleaned both out.
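
One way to hunt for leftovers like these is to list the largest blobs still reachable from any ref. The sketch below builds a throwaway repository with a made-up `figures/big.png` so it can run anywhere; in a real repository only the pipeline at the end is needed. Paths containing spaces would be cut short by the `awk` step.

```shell
#!/bin/sh
set -eu

# Throwaway repository with a large file that is later deleted (hypothetical names).
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name demo
git config user.email demo@example.com

mkdir figures
head -c 200000 /dev/zero > figures/big.png
echo "small" > README
git add .
git commit -q -m "Add figure"
git rm -q figures/big.png
git commit -q -m "Remove figure"

# List every blob reachable from any ref, largest first.
# The deleted file still shows up because it lives on in history.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -n -r |
  head -n 5)
echo "$largest"
```

The sizes reported are uncompressed object sizes, so they overstate what each file costs on clone, but the ranking is usually enough to spot the culprits.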

The other large files that remain are the tutorial and API reference generator notebooks, which contain a lot of images as well. Practically speaking, the solution is going to be to fork all of that stuff off into a separate repository, e.g. geoplot.github.io. I can host the website off of that domain (instead of my own) and haul all that cruft over there instead of keeping it here.

I'm still not happy with the size of the repo on clone, but I'd like to prioritize feature work for a bit...


choldgraf commented on June 1, 2024

one option is to use something like sphinx-gallery, which would let you include examples as .py files and it'd collect the image outputs etc and render them notebook-like online. It shouldn't be too hard to get that working if it's a big part of the size.


choldgraf commented on June 1, 2024

(e.g., see http://martinos.org/mne/dev/auto_tutorials/plot_sensors_decoding.html for an example from another package I work on)


ResidentMario commented on June 1, 2024

Down to ~35 MB now, after very aggressive history pruning. This is likely as good as it's going to get!


choldgraf commented on June 1, 2024

woot! that's an order of magnitude improvement...nice!

