
grokmirror's Introduction

GROKMIRROR

Framework to smartly mirror git repositories

Author: [email protected]
Date: 2020-09-18
Copyright: The Linux Foundation and contributors
License: GPLv3+
Version: 2.0.0

DESCRIPTION

Grokmirror was written to make replicating large git repository collections more efficient. Grokmirror uses the manifest file published by the origin server in order to figure out which repositories to clone, and to track which repositories require updating. The process is lightweight and efficient both for the primary and for the replicas.

CONCEPTS

The origin server publishes a json-formatted manifest file containing information about all git repositories that it carries. The format of the manifest file is as follows:

{
  "/path/to/bare/repository.git": {
    "description": "Repository description",
    "head":        "ref: refs/heads/branchname",
    "reference":   "/path/to/reference/repository.git",
    "forkgroup":   "forkgroup-guid",
    "modified":    timestamp,
    "fingerprint": sha1sum(git show-ref),
    "symlinks": [
        "/location/to/symlink",
        ...
    ],
   }
   ...
}

The manifest file is usually gzip-compressed to preserve bandwidth.
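
As an illustration of the format above, here is a minimal sketch of reading and writing a manifest in Python. The helper names `load_manifest` and `save_manifest` are hypothetical and are not part of grokmirror; the sketch assumes that a path ending in `.gz` is gzip-compressed and that anything else is plain JSON.

```python
import gzip
import json


def _opener(path):
    # Assumption: a ".gz" suffix means the manifest is gzip-compressed.
    return gzip.open if path.endswith(".gz") else open


def load_manifest(path):
    """Return the manifest as a dict keyed by repository path."""
    with _opener(path)(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def save_manifest(manifest, path):
    """Write the manifest as JSON, compressing it if the path ends in .gz."""
    with _opener(path)(path, "wt", encoding="utf-8") as fh:
        json.dump(manifest, fh)
```

Each value in the returned dict is one repository entry, so `load_manifest(path)["/path/to/bare/repository.git"]["fingerprint"]` would give that repository's fingerprint.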

Each time a commit is made to one of the git repositories, a git hook automatically updates the manifest file, so the manifest.js file should always contain the most up-to-date information about the state of all repositories.

The mirroring clients will poll the manifest.js file and download the updated manifest if it is newer than the locally stored copy (using Last-Modified and If-Modified-Since http headers). After downloading the updated manifest.js file, the mirrors will parse it to find out which repositories have been updated and which new repositories have been added.
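
The comparison step can be sketched roughly as follows. This is not grok-pull's actual logic: `diff_manifests` is a hypothetical helper, and the sketch assumes a repository needs updating when its "fingerprint" differs, falling back to the "modified" timestamp when either side has no fingerprint.

```python
def diff_manifests(local, remote):
    """Return (to_clone, to_update) as sorted lists of repository paths."""
    to_clone = []
    to_update = []
    for path, entry in remote.items():
        if path not in local:
            # Not seen before: this is a new repository.
            to_clone.append(path)
            continue
        old = local[path]
        if "fingerprint" in entry and "fingerprint" in old:
            # Assumption: a changed fingerprint means the refs changed.
            changed = entry["fingerprint"] != old["fingerprint"]
        else:
            # Fallback assumption: a newer timestamp means an update.
            changed = entry.get("modified", 0) > old.get("modified", 0)
        if changed:
            to_update.append(path)
    return sorted(to_clone), sorted(to_update)
```

Repositories that appear only in the local manifest are deliberately ignored here; deciding whether to purge them is a separate question.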

Object Storage Repositories

Grokmirror 2.0 introduces the concept of "object storage repositories", which aims to optimize how repository forks are stored on disk and served to the cloning clients.

When grok-fsck runs, it will automatically recognize related repositories by analyzing their root commits. If it finds two or more related repositories, it will set up a unified "object storage" repo and fetch all refs from each related repository into it.

For example, you can have two forks of linux.git:

torvalds/linux.git:
    refs/heads/master
    refs/tags/v5.0-rc3
    ...

and its fork:

maintainer/linux.git:
    refs/heads/master
    refs/heads/devbranch
    refs/tags/v5.0-rc3
    ...

Grok-fsck will set up an object storage repository and fetch all refs from both repositories:

objstore/[random-guid-name].git
    refs/virtual/[sha1-of-torvalds/linux.git:12]/heads/master
    refs/virtual/[sha1-of-torvalds/linux.git:12]/tags/v5.0-rc3
    ...
    refs/virtual/[sha1-of-maintainer/linux.git:12]/heads/master
    refs/virtual/[sha1-of-maintainer/linux.git:12]/heads/devbranch
    refs/virtual/[sha1-of-maintainer/linux.git:12]/tags/v5.0-rc3
    ...

Then both torvalds/linux.git and maintainer/linux.git will be configured to use objstore/[random-guid-name].git via objects/info/alternates and repacked to just contain metadata and no objects.

The alternates repository will be repacked with "delta islands" enabled, which should help optimize clone operations for each "sibling" repository.
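
The virtual ref naming shown in the example can be sketched as below. Two things here are assumptions rather than confirmed behaviour: that the `:12` notation means "the first 12 hex characters of the SHA-1", and that the hash is taken over the repository path string encoded as UTF-8. The function name `virtual_ref` is hypothetical.

```python
import hashlib


def virtual_ref(repo_path, ref):
    """Map a ref in a sibling repository to its name in the objstore repo."""
    if not ref.startswith("refs/"):
        raise ValueError("expected a full ref name, got %r" % ref)
    # Assumption: the namespace is the first 12 hex chars of sha1(path).
    namespace = hashlib.sha1(repo_path.encode("utf-8")).hexdigest()[:12]
    return "refs/virtual/%s/%s" % (namespace, ref[len("refs/"):])
```

Under these assumptions, `virtual_ref("torvalds/linux.git", "refs/heads/master")` and `virtual_ref("maintainer/linux.git", "refs/heads/master")` land in different namespaces, so both master branches can coexist in one object storage repository.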

Please see the example grokmirror.conf for more details about configuring objstore repositories.

ORIGIN SETUP

Install grokmirror on the origin server using your preferred way.

IMPORTANT: Only bare git repositories are supported.

You will need to add a hook to each one of your repositories that would update the manifest upon repository modification. This can either be a post-receive hook, or a post-update hook. The hook must call the following command:

/usr/bin/grok-manifest -m /var/www/html/manifest.js.gz \
    -t /var/lib/gitolite3/repositories -n `pwd`

The -m flag is the path to the manifest.js file. The git process must be able to write to it and to the directory the file is in (it creates a manifest.js.randomstring file first, and then moves it in place of the old one for atomicity).
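
The write-then-rename pattern described above can be sketched in Python as follows. This is an illustration of the general technique, not grok-manifest's own code; `write_manifest_atomically` is a hypothetical name, and the 0644 mode is an assumption made so that an httpd running as another user could read the result.

```python
import gzip
import json
import os
import tempfile


def write_manifest_atomically(manifest, path):
    """Write a gzip-compressed manifest so readers never see a partial file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the final rename would not be atomic.
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", dir=dirname)
    try:
        with os.fdopen(fd, "wb") as raw:
            with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                gz.write(json.dumps(manifest).encode("utf-8"))
        # mkstemp creates the file with mode 0600; relax it (assumption).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This is also why the git process needs write access to the directory and not only to the manifest file itself: creating and renaming the temporary file are directory operations.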

The -t flag tells grokmirror which toplevel disk path to trim from the beginning of each repository path, so that it does not appear in the manifest.

The -n flag tells grokmirror to use the current timestamp instead of the exact timestamp of the commit (much faster this way).

Before enabling the hook, you will need to generate the manifest.js of all your git repositories. In order to do that, run the same command, but omit the -n and the `pwd` argument. E.g.:

/usr/bin/grok-manifest -m /var/www/html/manifest.js.gz \
    -t /var/lib/gitolite3/repositories

The last component you need to set up is to automatically purge deleted repositories from the manifest. As this can't be added to a git hook, you can either run the --purge command from cron:

/usr/bin/grok-manifest -m /var/www/html/manifest.js.gz \
    -t /var/lib/gitolite3/repositories -p

Or add it to your gitolite's D command using the --remove flag:

/usr/bin/grok-manifest -m /var/www/html/manifest.js.gz \
    -t /var/lib/gitolite3/repositories -x $repo.git

If you would like grok-manifest to honor the git-daemon-export-ok magic file and only add to the manifest those repositories specifically marked as exportable, pass the --check-export-ok flag. See git-daemon(1) for more info on git-daemon-export-ok file.

You will need to have some kind of httpd server to serve the manifest file.

REPLICA SETUP

Install grokmirror on the replica using your preferred way.

Locate grokmirror.conf and modify it to reflect your needs. The default configuration file is heavily commented to explain what each option does.

Make sure the user "mirror" (or whichever user you specified) is able to write to the toplevel and log locations specified in grokmirror.conf.

You can run grok-pull manually, from cron, or as a systemd-managed daemon (see contrib). If you run it more frequently than once every few hours, you should definitely run it as a daemon in order to improve performance.

GROK-FSCK

Git repositories should be routinely repacked and checked for corruption. This utility will perform the necessary optimizations and report any problems to the email defined via fsck.report_to ('root' by default). It should run weekly from cron or from the systemd timer (see contrib).

Please examine the example grokmirror.conf file for various things you can tweak.

FAQ

Why is it called "grok mirror"?

Because it's developed at kernel.org and "grok" is a mirror of "korg". Also, because it groks git mirroring.

Why not just use rsync?

Rsync is extremely inefficient for the purpose of mirroring git trees that mostly consist of a lot of small files that very rarely change. Since rsync must calculate checksums on each file during each run, it mostly results in a lot of disk thrashing.

Additionally, if several repositories share objects between each other, unless the disk paths are exactly the same on both the remote and local mirror, this will result in broken git repositories.

It is also a bit silly, considering git provides its own extremely efficient mechanism for specifying what changed between revision X and revision Y.

grokmirror's People

Contributors

edmonds, kscherer, mricon, pypingou, qulogic


grokmirror's Issues

Add configurable timeout for git commands

On unreliable connections (cough China cough), git commands can time out and the whole process will hang. Combined with getting rid of the global lock, add a configurable timeout to handle hung git operations.
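
One possible shape for this, sketched with the standard library only: `run_with_timeout` is a hypothetical helper, not an existing grokmirror function, and where the timeout value would come from (for example a config option) is left open.

```python
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd and return (returncode, stdout, stderr), or None on timeout.

    subprocess.run() kills the child process when the timeout expires,
    so a hung command does not linger after this function returns.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode, proc.stdout, proc.stderr
```

A caller could then treat `None` as "skip this repository for now and retry on the next run", for example after `run_with_timeout(["git", "remote", "update"], timeout=300)`.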

Missing tags for v1.2.x

I noticed the tags for v1.2.1, v1.2.2 (and also the releases) are missing. Could you push those missing tags?

Thanks!

Make pretty-printing the manifest configurable

Adding indent (and sort) to writing out the manifest makes it quite a bit slower. Need to make it configurable, but as this is used by both grok-pull and grok-manifest, this will require some finagling.

Use ujson when found

We should make use of ujson when we find it. On very large collections (thousands of repos), parsing and saving manifest.js takes upwards of a second, and it can probably be significantly improved by using a faster json library, like ujson.
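
A minimal sketch of the usual optional-import pattern, assuming ujson's `loads` and `dumps` are close enough to the standard library's for plain manifest data. The wrapper names `parse_manifest` and `dump_manifest` are hypothetical.

```python
try:
    # Optional dependency: used only when it is installed.
    import ujson as json
except ImportError:
    import json


def parse_manifest(text):
    """Parse manifest JSON text with whichever library was found."""
    return json.loads(text)


def dump_manifest(manifest):
    """Serialize a manifest dict to JSON text."""
    return json.dumps(manifest)
```

Keeping all JSON calls behind two small wrappers like these would also give one place to handle options (such as indentation) that the two libraries may treat differently.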

Inconsistent use of slashes results in multiple manifest entries

Grok mirror does not normalize repo paths and the manifest file can end up with multiple entries for the same repo.

mkdir ~/test
cd ~/test
git init repo1
cd repo1
touch a
git add a
git commit -m "Commit #1"
cd ..
mkdir git
cd git
git clone --bare ../repo1 bare1
grok-manifest.py -m ~/test/manifest.js.gz -y -t ~/test/git/
grok-manifest.py -m ~/test/manifest.js.gz -y -t ~/test/git
grok-manifest.py -m ~/test/manifest.js.gz -y -t ~/test/git ~/test/git/bare1/
cat ~/test/manifest.js.gz
{
    "/bare1": {
    "description": "Unnamed repository; edit this file 'description' to name the repository.", 
    "modified": 1378225388, 
    "owner": null, 
    "reference": null
    }, 
    "/bare1/": {
    "description": "Unnamed repository; edit this file 'description' to name the repository.", 
    "modified": 1378225388, 
    "owner": null, 
    "reference": null
    }, 
    "bare1": {
    "description": "Unnamed repository; edit this file 'description' to name the repository.", 
    "modified": 1378225388, 
    "owner": null, 
    "reference": null
    }
}
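
A sketch of the kind of normalization that would collapse the three keys above into one. It assumes manifest keys should be toplevel-relative paths with a single leading slash and no trailing slash, as in the README's format example; `normalize_repo_key` is a hypothetical helper, and it does not try to resolve "." or ".." components.

```python
def normalize_repo_key(path):
    """Return the path with one leading slash and no empty components."""
    parts = [part for part in path.split("/") if part]
    return "/" + "/".join(parts)


# The three variants from the manifest above map to the same key.
keys = {normalize_repo_key(p) for p in ("/bare1", "/bare1/", "bare1")}
print(keys)
```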

Allow for removing path prefixes on the client

To be able to mirror server's foo/bar.git repo into /target/dir/bar.git on the client, an option to strip the initial foo/ part would be useful.

Note though, that this could lead to name clashes on the client, so the client would have to check for duplicates in the set of repositories to be written before actually doing the pull, and bail out.
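
The duplicate check described above could look roughly like this. `strip_prefix` is a hypothetical helper, and the choice to leave paths outside the prefix unchanged is an assumption.

```python
def strip_prefix(repos, prefix):
    """Map each remote repo path to a local path with the prefix removed.

    Raises ValueError if two remote repos would map to the same local
    path, so the caller can bail out before pulling anything.
    """
    prefix = prefix.strip("/") + "/"
    mapping = {}
    seen = {}
    for repo in repos:
        relative = repo.lstrip("/")
        if relative.startswith(prefix):
            local = relative[len(prefix):]
        else:
            # Assumption: repos outside the prefix keep their path.
            local = relative
        if local in seen:
            raise ValueError(
                "name clash: %s and %s both map to %s"
                % (seen[local], repo, local))
        seen[local] = repo
        mapping[repo] = local
    return mapping
```

Because the whole mapping is computed before any repository is touched, a clash aborts the run without leaving a half-updated mirror.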

error: RPC failed Request Entity Too Large

I've got many errors of this kind in my log:

pull[92482] 2021-09-01 18:54:00,908 - WARNING - Stderr (/var/lib/git/mirror/pub/scm/linux/kernel/git/ak/linux-misc.git): error: RPC failed; HTTP 413 curl 22 The requested URL returned error: 413 Request Entity Too Large
fatal: the remote end hung up unexpectedly
error: Could not fetch _grokmirror

Is this a known problem?

Grokmirror and Gitolite

Does grokmirror require git-daemon to be running on a remote Git repository in order to sync? We have several geographically isolated Git repositories that do not run git-daemon but instead rely on Gitolite to provide fine-grained read/write access control to the repos. Can grokmirror still work in that environment?

One thing we're concerned about is maintaining access control on the mirrored repositories. Is there a way to either "copy" the access control list (ACL) from the repository being mirrored or establish a default access control list in gitolite for the newly mirrored repository?

Any recommendations on using grokmirror using gitolite in a secured environment where access to the mirrored repositories is controlled would be appreciated.

Missing dependency on packaging

'packaging' module (e.g. python3-packaging on Ubuntu) is required for grok-fsck:

$ grok-fsck -v --repack-only -c /etc/grokmirror/kernel.conf
Analyzing /var/lib/mirror/manifest.js.gz
Traceback (most recent call last):
  File "/usr/local/bin/grok-fsck", line 11, in <module>
    sys.exit(command())
  File "/usr/local/lib/python3.6/dist-packages/grokmirror/fsck.py", line 1392, in command
    opts.repack_all_quick, opts.repack_all_full)
  File "/usr/local/lib/python3.6/dist-packages/grokmirror/fsck.py", line 1367, in grok_fsck
    fsck_mirror(config, force, repack_only, conn_only, repack_all_quick, repack_all_full)
  File "/usr/local/lib/python3.6/dist-packages/grokmirror/fsck.py", line 657, in fsck_mirror
    if commitgraph and not grokmirror.git_newer_than('2.18.0'):
  File "/usr/local/lib/python3.6/dist-packages/grokmirror/__init__.py", line 101, in git_newer_than
    from packaging import version
ModuleNotFoundError: No module named 'packaging'

This dependency should be added to setup.py.

Get rid of global lock

We added threaded execution as part of the current master, but it's not going far enough. If someone creates a 4GB repository (like webkit.git), cloning it will take hours over a slow uplink and block updates on all other repositories in the process.

We need to make updates non-dependent on the global lock -- by tracking last-updated information not just in the manifest itself, but also inside each repo.

Tag release 1.1.1

As the title says, please tag 1.1.1 so it can be properly packaged downstream.

Thanks.

grok-pull fails in non-English locale

When running grok-pull in a non-English locale, it fails like this:

Traceback (most recent call last):
  File "/env/grokmirror/bin/grok-pull", line 11, in <module>
    sys.exit(command())
  File "/env/grokmirror/local/lib/python2.7/site-packages/grokmirror/pull.py", line 1186, in command
    opts.forcepurge)
  File "/env/grokmirror/local/lib/python2.7/site-packages/grokmirror/pull.py", line 1170, in grok_pull
    noreuse, purge, pretty, forcepurge)
  File "/env/grokmirror/local/lib/python2.7/site-packages/grokmirror/pull.py", line 579, in pull_mirror
    last_modified = time.strptime(last_modified, '%a, %d %b %Y %H:%M:%S %Z')
  File "/usr/lib/python2.7/_strptime.py", line 478, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/usr/lib/python2.7/_strptime.py", line 332, in _strptime
    (data_string, format))
ValueError: time data 'Tue, 24 Apr 2018 18:06:08 GMT' does not match format '%a, %d %b %Y %H:%M:%S %Z'

This is while parsing the Last-Modified header, which has a fixed, locale-independent format.

One option could be to replace

last_modified = ufh.headers.get('Last-Modified')
last_modified = time.strptime(last_modified, '%a, %d %b %Y %H:%M:%S %Z')
last_modified = calendar.timegm(last_modified)

with

last_modified = ufh.headers.getdate_tz('Last-Modified')
last_modified = calendar.timegm(last_modified)

getting rid of the locale-dependent strptime().
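
For reference, the same idea on Python 3 can be sketched with `email.utils`, which parses HTTP-style dates without consulting the locale. `parse_last_modified` is a hypothetical helper name.

```python
import email.utils


def parse_last_modified(value):
    """Convert a Last-Modified header value to a UTC epoch timestamp."""
    parsed = email.utils.parsedate_tz(value)
    if parsed is None:
        raise ValueError("unparseable date header: %r" % value)
    # mktime_tz() applies the parsed timezone offset and returns UTC seconds.
    return email.utils.mktime_tz(parsed)


print(parse_last_modified("Tue, 24 Apr 2018 18:06:08 GMT"))
```

Unlike `time.strptime()` with `%a` and `%b`, this does not depend on the day and month names of the current locale.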

Incompatibility between GitPython <2.1.7 and git >2.15.0

Just a heads up about a problem that I ran into. Git 2.15.0 added a comment to the packed-refs file that breaks the GitPython library.

gitpython-developers/GitPython#687

This was fixed with GitPython 2.1.8. I ran into this because my git servers are using the Ubuntu git-core PPA.

I was able to work around this problem by setting up grokmirror in a virtualenv with GitPython 2.1.8. Because grokmirror is not available on pypi I had to generate the grokmirror python package and install it manually into the virtualenv.

I don't know what the best way to handle this is. Changing requirements.txt to require >=2.1.8 isn't necessary for servers with older git versions. Perhaps add version checks to grokmirror to detect this incompatibility?

Add git-repack support to grok-fsck

Calling "git repack -a -d -l" on shared repos should allow saving a lot of local space. If repack is enabled, call it before calling fsck.

Runtime error: KeyError: 'reference'

$ python --version
Python 2.7.16
$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ /usr/local/bin/grok-pull --verbose --verbose -p -c /etc/grokmirror/repos.conf
Checking [kernel.org]
Fetching remote manifest from http://git.kernel.org/manifest.js.gz
/home/mirror/kernel/manifest.js.gz not found, assuming initial run
Compared new manifest against 1089 repositories in 0.06s
No repositories need updating
Cloning 1089 repos from git://git.kernel.org
Traceback (most recent call last):
  File "/usr/local/bin/grok-pull", line 10, in <module>
    sys.exit(command())
  File "/usr/local/lib/python2.7/dist-packages/grokmirror/pull.py", line 1201, in command
    opts.forcepurge)
  File "/usr/local/lib/python2.7/dist-packages/grokmirror/pull.py", line 1185, in grok_pull
    noreuse, purge, pretty, forcepurge)
  File "/usr/local/lib/python2.7/dist-packages/grokmirror/pull.py", line 875, in pull_mirror
    clone_order(to_clone, manifest, to_clone_sorted, existing)
  File "/usr/local/lib/python2.7/dist-packages/grokmirror/pull.py", line 364, in clone_order
    reference = manifest[gitdir]['reference']
KeyError: 'reference'

Comparing: 100%|##################################################################| 1089/1089 [00:00<00:00, 17861.29 repos/s]

Additional debug shows that gitdir is equal to /pub/scm/linux/kernel/git/holtmann/ktls.git and manifest[gitdir] is

{u'description': u'Kernel TLS testing tree', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1440739766, u'fingerprint': u'4c6108657604c5bb55b238b898c835d309606393', u'owner': u'Marcel Holtmann', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/gerg/uclinux.git
{u'head': u'ref: refs/heads/master', u'description': u'uClinux non-MMU changes', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1362357307, u'fingerprint': u'9c25e554320c6acc6a6e08ca724ba208e7fb9d91', u'owner': u'Greg Ungerer', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git
{u'head': u'ref: refs/heads/master', u'description': u'Darrick J. Wong : XFS Tool Dev.', u'reference': u'/pub/scm/linux/kernel/git/ebiggers/xfsprogs-dev.git', u'modified': 1621611172, u'fingerprint': u'e707d92dafc46f513145d8c285c715de54ee0283', u'owner': u'Darrick J. Wong', u'forkgroup': u'96f91ae4-032d-4ab8-b6e1-6fb0e7199a2a'}
/pub/scm/linux/kernel/git/jmorris/linux-security.git
{u'head': u'ref: refs/heads/master', u'description': u'Linux Kernel Security Subsystem', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1619565889, u'fingerprint': u'ec06d81750a4327e8d76bb374229ad8f3820e9a4', u'owner': u'James Morris', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw.git
{u'description': u'GFS2 next merge window tree', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1423563971, u'fingerprint': u'9c1bb9bc9258d54b3c12bdd042ce7641ab658c78', u'owner': u'Steven Whitehouse', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/krisman/unicode.git
{u'description': u"Gabriel Krisman Bertazi's fork of linux.git", u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1594280073, u'fingerprint': u'545bf1409e7d9c9abcd8b550716fdeed4fec57ec', u'owner': u'Gabriel Krisman Bertazi', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/lee/backlight.git
{u'head': u'ref: refs/heads/master', u'description': u'Backlight Subsystem Tree - Next and Fixes', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1621581517, u'fingerprint': u'e96c9de66e2ce5efec332abb7ffec0fa142f2782', u'owner': u'Lee Jones', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/linux/kernel/git/horms/renesas-bsp.git
{u'description': u'Kernel tree for Renesas R-Car BSP', u'reference': u'/pub/scm/linux/kernel/git/paulg/4.8-rt-patches.git', u'modified': 1562861958, u'fingerprint': u'ffbf02a73c59afc2069ae3034e66fe32c4b6d5db', u'owner': u'Simon Horman', u'forkgroup': u'af9f4487-d538-46e5-b148-e18dfb461f8a'}
/pub/scm/network/tftp/tftp-hpa.git
{u'owner': u'H. Peter Anvin', u'description': u'tftp-hpa official tree', u'modified': 1438973805, u'fingerprint': u'ddf890698962db6e3fb188080bb57e7ab28364f2'}
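
The last entry in this debug output (tftp-hpa.git) has no 'reference' key at all, which may be what triggers the KeyError. A tolerant lookup could be sketched as below; `get_reference` is a hypothetical helper, not the actual fix, and treating a missing key the same as a null reference is an assumption.

```python
def get_reference(manifest, gitdir):
    """Return the reference repo for gitdir, or None if there is none."""
    # dict.get() returns None when the key is absent instead of raising.
    return manifest[gitdir].get("reference")


manifest = {
    "/pub/scm/network/tftp/tftp-hpa.git": {
        "owner": "H. Peter Anvin",
        "description": "tftp-hpa official tree",
        "modified": 1438973805,
        "fingerprint": "ddf890698962db6e3fb188080bb57e7ab28364f2",
    },
}
print(get_reference(manifest, "/pub/scm/network/tftp/tftp-hpa.git"))
```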

Python 3 support

I'm trying to find out if grokmirror works with Python 3 or not, however, there is no information in setup.py metadata, readme etc. Does it? Thanks

Add dependencies to setup.py

Currently, the dependencies are in the requirements.txt only, so the dependencies are not checked when installing via pip in a checkout or from a wheel. Adding install_requires to the setup() call would help.
