scmrepo's Introduction

scmrepo

SCM wrapper and fsspec filesystem for Git for use in DVC.

Features

  • Works with multiple backends: pygit2, dulwich and gitpython.
  • Provides fsspec filesystem over Git: GitFileSystem.

See fsspec docs for full list of available fs methods.

Requirements

Installation

You can install scmrepo via pip from PyPI:

$ pip install scmrepo

Usage

Git File System

scmrepo provides an fsspec-based gitfs that exposes a filesystem-like API for your Git repositories without having to git checkout them first. For example:

from scmrepo.fs import GitFileSystem

fs = GitFileSystem("path/to/my/repo", rev="mybranch")

for root, dnames, fnames in fs.walk("path/in/repo"):
    for dname in dnames:
        print(fs.path.join(root, dname))

    for fname in fnames:
        print(fs.path.join(root, fname))

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the Apache 2.0 license, scmrepo is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

scmrepo's Issues

windows: credential helpers failing

On Windows, I'm not able to get credential helpers to retrieve my credentials.

I installed with the Windows installer and am using git bash.

Here's my git config:

$ git config -l --show-origin
file:C:/Program Files/Git/etc/gitconfig diff.astextplain.textconv=astextplain
file:C:/Program Files/Git/etc/gitconfig filter.lfs.clean=git-lfs clean -- %f
file:C:/Program Files/Git/etc/gitconfig filter.lfs.smudge=git-lfs smudge -- %f
file:C:/Program Files/Git/etc/gitconfig filter.lfs.process=git-lfs filter-process
file:C:/Program Files/Git/etc/gitconfig filter.lfs.required=true
file:C:/Program Files/Git/etc/gitconfig http.sslbackend=openssl
file:C:/Program Files/Git/etc/gitconfig http.sslcainfo=C:/Program Files/Git/mingw64/etc/ssl/certs/ca-bundle.crt
file:C:/Program Files/Git/etc/gitconfig core.autocrlf=true
file:C:/Program Files/Git/etc/gitconfig core.fscache=true
file:C:/Program Files/Git/etc/gitconfig core.symlinks=false
file:C:/Program Files/Git/etc/gitconfig pull.rebase=false
file:C:/Program Files/Git/etc/gitconfig init.defaultbranch=master
file:C:/Users/Administrator/.gitconfig  credential.helper=store
file:.git/config        core.repositoryformatversion=0
file:.git/config        core.filemode=false
file:.git/config        core.bare=false
file:.git/config        core.logallrefupdates=true
file:.git/config        core.symlinks=false
file:.git/config        core.ignorecase=true

My credentials are stored:

$ cat ~/.git-credentials
https://dberenbaum:***@github.com

Git clone uses these credentials as expected:

$ git clone https://www.github.com/dberenbaum/dataset-registry
Cloning into 'dataset-registry'...
warning: redirecting to https://github.com/dberenbaum/dataset-registry.git/
remote: Enumerating objects: 167, done.
remote: Counting objects: 100% (167/167), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 167 (delta 41), reused 167 (delta 41), pack-reused 0
Receiving objects: 100% (167/167), 26.04 KiB | 225.00 KiB/s, done.
Resolving deltas: 100% (41/41), done.

dvc import prompts for my username and password and fails when I don't provide them:

$ dvc import -v https://www.github.com/dberenbaum/dataset-registry use-cases/cats-dogs
2023-06-16 20:41:08,931 DEBUG: v2.57.1 (exe), CPython 3.10.11 on Windows-10-10.0.20348-SP0
2023-06-16 20:41:08,966 DEBUG: command: import -v https://www.github.com/dberenbaum/dataset-registry use-cases/cats-dogs
2023-06-16 20:41:15,310 DEBUG: Removing output 'cats-dogs' of stage: 'cats-dogs.dvc'.
2023-06-16 20:41:15,323 DEBUG: Removing 'C:\Users\Administrator\repo\cats-dogs'
Importing 'use-cases/cats-dogs (https://www.github.com/dberenbaum/dataset-registry)' -> 'cats-dogs'
2023-06-16 20:41:15,417 DEBUG: Computed stage: 'cats-dogs.dvc' md5: '04818bbca125387334761fa24de5759a'
2023-06-16 20:41:15,451 DEBUG: 'md5' of stage: 'cats-dogs.dvc' changed.
2023-06-16 20:41:15,453 DEBUG: Creating external repo https://www.github.com/dberenbaum/dataset-registry@None
2023-06-16 20:41:15,454 DEBUG: erepo: git clone 'https://www.github.com/dberenbaum/dataset-registry' to a temporary dir
2023-06-16 20:41:23,384 ERROR: failed to import 'use-cases/cats-dogs' - SCM error: Failed to clone repo 'https://www.github.com/dberenbaum/dataset-registry' to 'C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp1p9s0y__dvc-clone': No valid credentials provided
Traceback (most recent call last):
  File "scmrepo\git\backend\dulwich\client.py", line 49, in _http_request
  File "dulwich\client.py", line 2218, in _http_request
dulwich.client.HTTPUnauthorized: No valid credentials provided

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scmrepo\git\backend\dulwich\__init__.py", line 220, in clone
  File "dulwich\porcelain.py", line 514, in clone
  File "dulwich\client.py", line 703, in clone
  File "dulwich\client.py", line 781, in fetch
  File "dulwich\client.py", line 2084, in fetch_pack
  File "dulwich\client.py", line 1940, in _discover_references
  File "scmrepo\git\backend\dulwich\client.py", line 60, in _http_request
  File "dulwich\client.py", line 2218, in _http_request
dulwich.client.HTTPUnauthorized: No valid credentials provided

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc\scm.py", line 160, in clone
  File "scmrepo\git\__init__.py", line 142, in clone
  File "scmrepo\git\backend\dulwich\__init__.py", line 225, in clone
scmrepo.exceptions.CloneError: Failed to clone repo 'https://www.github.com/dberenbaum/dataset-registry' to 'C:\Users\ADMINI~1\AppData\Local\Temp\2\tmp1p9s0y__dvc-clone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc\commands\imp.py", line 17, in run
  File "dvc\repo\imp.py", line 6, in imp
  File "dvc\repo\__init__.py", line 65, in wrapper
  File "dvc\repo\scm_context.py", line 151, in run
  File "dvc\repo\imp_url.py", line 93, in imp_url
  File "funcy\decorators.py", line 47, in wrapper
  File "dvc\stage\decorators.py", line 43, in rwlocked
  File "funcy\decorators.py", line 68, in __call__
  File "dvc\stage\__init__.py", line 609, in run
  File "funcy\decorators.py", line 47, in wrapper
  File "dvc\stage\decorators.py", line 43, in rwlocked
  File "funcy\decorators.py", line 68, in __call__
  File "dvc\stage\__init__.py", line 646, in _sync_import
  File "dvc\stage\imports.py", line 56, in sync_import
  File "dvc\stage\__init__.py", line 500, in save_deps
  File "dvc\dependency\repo.py", line 58, in save
  File "dvc\fs\dvc.py", line 412, in repo
  File "functools.py", line 981, in __get__
  File "dvc\fs\dvc.py", line 401, in fs
  File "fsspec\spec.py", line 76, in __call__
  File "dvc\fs\dvc.py", line 120, in __init__
  File "dvc\fs\dvc.py", line 179, in _make_repo
  File "contextlib.py", line 135, in __enter__
  File "dvc\repo\open_repo.py", line 30, in _external_repo
  File "dvc\repo\open_repo.py", line 153, in _cached_clone
  File "funcy\decorators.py", line 47, in wrapper
  File "funcy\flow.py", line 246, in wrap_with
  File "funcy\decorators.py", line 68, in __call__
  File "dvc\repo\open_repo.py", line 221, in _clone_default_branch
  File "dvc\scm.py", line 165, in clone
dvc.scm.CloneError: SCM error

2023-06-16 20:41:23,417 DEBUG: Analytics is enabled.
2023-06-16 20:41:23,429 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\2\\tmp02z6x4vt']'
2023-06-16 20:41:23,554 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\2\\tmp02z6x4vt']'

support .gitattributes normalization

Follow up to #211

dulwich currently lacks support for .gitattributes line ending/normalization settings. We should either contribute this upstream, or detect when text normalization attributes are set and then force the use of pygit2/libgit2 for generating commits.

pull: "No valid credentials" when using an ssh-agent

Bug Report

pull: "no valid credentials" when using ssh-agent

Description

As outlined in #215, I tried setting up SSH keys using the webfactory/ssh-agent action. However, dvc always complained that no valid credentials were provided even though the SSH keys were added as deploy keys to the individual repositories. This is the same action that was mentioned in iterative/dvc#7702 as well, so I see some similarities here (even though the linked issue only mentions these problems on Windows-based machines).

Reproduce

Use the following GitHub action:

on:
  push:
    branches:
      - main
      - beta
env:
  POETRY_VERSION: 1.3.1
  PYTHON_VERSION: 3.9
  DVC_VERSION: 2.43.0
jobs:
  dvc-test:
    runs-on: ubuntu-latest
    steps:
      - uses: webfactory/[email protected]
        with:
          ssh-private-key: |
            ${{ secrets.SSH_ARGUEBASE_PUBLIC }}
            ${{ secrets.SSH_ARGUEBASE_PRIVATE }}
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
        with:
          version: ${{ env.DVC_VERSION }}
      - run: dvc pull --force --verbose

Expected

DVC uses the credentials provided by the ssh-agent and pulls the data. However, dvc always complains that no valid credentials were provided.

Environment information

The problem occurs on GitHub actions using ubuntu-latest and the setup-dvc action.

Additional Information (if any):

I ran the following script provided by @dtrifiro in the same GitHub action:

import asyncio

import asyncssh


async def main():
    async with asyncssh.agent.connect_agent() as agent:
        keys = await agent.get_keys()
        for key in keys:
            print(key.algorithm, key.get_comment())


if __name__ == "__main__":
    asyncio.run(main())

and got this output:

python dvc_test.py
  shell: /usr/bin/bash -e {0}
  env:
    POETRY_VERSION: 1.3.1
    PYTHON_VERSION: 3.9
    DVC_VERSION: 2.43.0
    SSH_AUTH_SOCK: /tmp/ssh-XXXXXXiyBBBE/agent.1658
    SSH_AGENT_PID: 1659
    pythonLocation: /opt/hostedtoolcache/Python/3.9.16/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.16/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.16/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.16/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.16/x64/lib
b'ssh-ed25519' [email protected]:recap-utr/arguebase-public.git
b'ssh-ed25519' [email protected]:recap-utr/arguebase-private.git

meaning that the keys are picked up by asyncssh.

dvc exp pull UnicodeDecodeError when using ssh git backend

When running dvc exp pull with an SSH git remote, I get:

☧hnr γmainᵪ dvc exp pull -vvv -j1 -r litis-remote pianosa 'train_gae.num_layers=37'
2021-12-13 13:50:09,697 TRACE: Namespace(cprofile=False, yappi=False, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=3, version=None, cd='.', cmd='
pull', force=False, pull_cache=True, dvc_remote='litis-remote', jobs=1, run_cache=False, git_remote='pianosa', experiment='train_gae.num_layers=37', func=<class 'dvc.command.experiments.Cmd
ExperimentsPull'>)
2021-12-13 13:50:09,803 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/config.local' to gitignore file.
2021-12-13 13:50:09,805 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/tmp' to gitignore file.
2021-12-13 13:50:10,345 DEBUG: git pull experiment 'pianosa' -> 'refs/exps/80/e4443848e6a61775b67834054b7bbf3f0cb982/train_gae.num_layers=37:refs/exps/80/e4443848e6a61775b67834054b7bbf3f0cb
982/train_gae.num_layers=37'
2021-12-13 13:50:11,016 ERROR: unexpected error - 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/main.py", line 55, in main
    ret = cmd.do_run()
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/command/base.py", line 45, in do_run
    return self.run()
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/command/experiments.py", line 758, in run
    self.repo.experiments.pull(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/repo/experiments/__init__.py", line 1015, in pull
    return pull(self.repo, *args, **kwargs)
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/repo/scm_context.py", line 152, in run
    return method(repo, *args, **kw)
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dvc/repo/experiments/pull.py", line 38, in pull
    repo.scm.fetch_refspecs(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 253, in _backend_func
    return func(*args, **kwargs)
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 526, in fetch_refspecs
    fetch_result = client.fetch(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dulwich/client.py", line 531, in fetch
    result = self.fetch_pack(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dulwich/client.py", line 1055, in fetch_pack
    self._handle_upload_pack_tail(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dulwich/client.py", line 841, in _handle_upload_pack_tail
    self._read_side_band64k_data(
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/dulwich/client.py", line 604, in _read_side_band64k_data
    cb(pkt)
  File "/home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/scmrepo/progress.py", line 53, in __call__
    msg.decode("ascii").strip() if isinstance(msg, bytes) else msg
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
------------------------------------------------------------
2021-12-13 13:50:11,093 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/config.local' to gitignore file.
2021-12-13 13:50:11,094 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/tmp' to gitignore file.
2021-12-13 13:50:11,094 DEBUG: Removing '/home/kchoi/prog/.eq9Lt6MFKbjqcroSTD2bPe.tmp'
2021-12-13 13:50:11,094 DEBUG: Removing '/home/kchoi/prog/.eq9Lt6MFKbjqcroSTD2bPe.tmp'
2021-12-13 13:50:11,094 DEBUG: Removing '/home/kchoi/prog/.eq9Lt6MFKbjqcroSTD2bPe.tmp'
2021-12-13 13:50:11,095 DEBUG: Removing '/home/kchoi/prog/hnr/.dvc/../../litis-dataset/.dvc/cache/.ZCVaiQvNkiz3kGe8dUALzQ.tmp'
2021-12-13 13:50:11,096 DEBUG: Version info for developers:
DVC version: 2.9.2 (pip)
---------------------------------
Platform: Python 3.9.7 on Linux-5.15.6-200.fc35.x86_64-x86_64-with-glibc2.34
Supports:
        webhdfs (fsspec = 2021.11.1),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        ssh (sshfs = 2021.11.2)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/mapper/luks-84945a3e-93c0-48b1-9cae-683b71c03215
Caches: local
Remotes: ssh
Workspace directory: btrfs on /dev/mapper/luks-84945a3e-93c0-48b1-9cae-683b71c03215
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-12-13 13:50:11,097 DEBUG: Analytics is enabled.
2021-12-13 13:50:11,127 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpc7fypzoo']'
2021-12-13 13:50:11,128 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpc7fypzoo']'

When manually printing msg I get:

☧hnr γmainᵪ dvc exp pull -vvv -j1 -r litis-remote pianosa 'train_gae.num_layers=37'
2021-12-13 13:51:21,263 TRACE: Namespace(cprofile=False, yappi=False, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, quiet=0, verbose=3, version=None, cd='.', cmd='
pull', force=False, pull_cache=True, dvc_remote='litis-remote', jobs=1, run_cache=False, git_remote='pianosa', experiment='train_gae.num_layers=37', func=<class 'dvc.command.experiments.Cmd
ExperimentsPull'>)
2021-12-13 13:51:21,348 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/config.local' to gitignore file.
2021-12-13 13:51:21,350 DEBUG: Adding '/home/kchoi/prog/hnr/.dvc/tmp' to gitignore file.
2021-12-13 13:51:21,894 DEBUG: git pull experiment 'pianosa' -> 'refs/exps/80/e4443848e6a61775b67834054b7bbf3f0cb982/train_gae.num_layers=37:refs/exps/80/e4443848e6a61775b67834054b7bbf3f0cb
982/train_gae.num_layers=37'
Fetching git refs                                                                                                                                                  |0.00 [00:00,      ?obj/s]
> /home/kchoi/mambaforge/envs/hnr/lib/python3.9/site-packages/scmrepo/progress.py(53)__call__()
-> self._reporter._parse_progress_line(
(Pdb) msg
b'D\xc3\xa9compte des objets: 10, fait.\n'
(Pdb) msg.decode('ascii')
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
(Pdb) msg.decode('utf-8')
'Décompte des objets: 10, fait.\n'
(Pdb)

I believe this is because my system is localized in French, so the progress messages sent by git contain non-ASCII characters and decoding them as ASCII fails.
I'll make a pull request later to change the decoding from ascii to utf-8.
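As a minimal sketch of the proposed change (the helper name `decode_progress` is hypothetical and this is not the actual scmrepo code), the progress callback could decode bytes as UTF-8 and replace undecodable bytes instead of raising:

```python
def decode_progress(msg):
    """Decode a git progress line without assuming it is pure ASCII.

    Hypothetical helper: mirrors the expression from the traceback, but
    uses UTF-8 and substitutes U+FFFD for bytes that cannot be decoded.
    """
    if isinstance(msg, bytes):
        return msg.decode("utf-8", errors="replace").strip()
    return msg


# The message captured in the pdb session above:
line = decode_progress(b"D\xc3\xa9compte des objets: 10, fait.\n")
print(line)
```

Using `errors="replace"` is a design choice for progress output only: a garbled character in a status line is preferable to aborting the fetch.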

git-credential follow ups

  • support GIT_ASKPASS/core.askPass/SSH_ASKPASS
  • support quit=true/1 to short-circuit additional credential checks
  • drop memory_only flag (after testing, CLI git does send prompted credentials to existing helpers like GCM)

fs: working with remote git repos

In mlem we need to have fsspec implementation for remote git repos. For now we rely on builtin GithubFileSystem for github, but it does not support git credentials, and also we want to support urls like ssh://git@... or git://... that point to git repos.
The easiest way imo will be to just clone repo to temporary dir and then delegate to LocalFileSystem (with some path hacking), but I am no expert on git internals.

codespaces: account for github/codespaces `--system` prefix

Codespaces sets user.name/email for the GitHub user in a non-standard --system level config. If we fail to get a valid signature (user.name/user.email) when generating git commits, we need to check for Codespaces env vars and, if we are in a Codespaces env, manually load the signature from /usr/local/etc/gitconfig.

(libgit2/pygit2 and dulwich will only check /etc/gitconfig which is the standard --system config location)

related: community/community#38070

to be clear this is a workaround for codespaces specific containers that should be removed if/when the codespaces behavior changes
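A rough sketch of that workaround, under stated assumptions: the function names are hypothetical, the environment check assumes Codespaces sets `CODESPACES=true`, and the parser only handles a plain `[user]` section rather than the full git config grammar.

```python
import os

# Path named in the issue; everything else here is illustrative.
CODESPACES_GITCONFIG = "/usr/local/etc/gitconfig"


def read_user_section(path):
    """Return (name, email) from the [user] section of a git config file."""
    name = email = None
    in_user = False
    try:
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.strip()
                if not line or line[0] in "#;":
                    continue  # skip blanks and comments
                if line.startswith("["):
                    in_user = line.lower() == "[user]"
                    continue
                if in_user and "=" in line:
                    key, _, value = line.partition("=")
                    key = key.strip().lower()
                    value = value.strip().strip('"')
                    if key == "name":
                        name = value
                    elif key == "email":
                        email = value
    except OSError:
        return None, None
    return name, email


def codespaces_signature(environ=os.environ, path=CODESPACES_GITCONFIG):
    """Fallback signature lookup, used only inside a Codespaces container."""
    if environ.get("CODESPACES") != "true":  # assumed env var
        return None
    name, email = read_user_section(path)
    return (name, email) if name and email else None
```

This would only be consulted after the normal signature lookup fails, so it stays easy to delete if the Codespaces behavior changes.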

Credentials not parsed when git configuration contains unexpanded file paths

Credentials are not parsed correctly when credential.helper is configured as the store type in a user's git-config and points to a path that contains a ~, e.g.:

...
[credential]
        helper = store --file ~/.git-credentials
....

The git repositories I'm working with are hosted on-prem with GitLab, and must be accessed using credentials over https for internal infrastructure reasons (thus using ssh keys is not an option).

The workaround is to use an absolute path in your git-config; then the error goes away.
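One possible shape of a fix, sketched here with a hypothetical helper name and not taken from the scmrepo source, is to expand `~` in the helper's arguments before the helper is invoked:

```python
import os
import shlex


def expand_helper_args(helper):
    """Split a credential.helper value and expand a leading ~ in each argument.

    Hypothetical helper: it only handles arguments that start with "~",
    such as the path following `--file`.
    """
    return [
        os.path.expanduser(arg) if arg.startswith("~") else arg
        for arg in shlex.split(helper)
    ]


print(expand_helper_args("store --file ~/.git-credentials"))
```

The `--file=~/path` spelling is not covered by this sketch; it would need the value to be split off before expansion.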

I diagnosed by inserting import pdb;pdb.set_trace() here and inspecting the result of the return value of the subprocess.run invocation.

Which operating system and Python version are you using?

Both Python 3.10.6 (Host), Python 3.9.16 (Inside docker container) from mambaforge

Which version of this project are you using?

0.1.7 (pip) (In both python environments described above)

What did you do?

Ran dvc update and dvc pull against .dvc files whose url entries contain git urls, with the git credential helper configuration described in this issue in place.

What did you expect to see?

The commands complete without authentication errors due to missing credentials.

What did you see instead?

This error (identical to the one shown under "actual output", and also reported here) when running various dvc update and dvc pull commands:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/···/site-packages/dulwich/porcelain.py", line 1618, in ls_remote
    return client.get_refs(host_path)
  File "/···/site-packages/dulwich/client.py", line 2089, in get_refs
    refs, _, _ = self._discover_references(b"git-upload-pack", url)
  File "/···/site-packages/dulwich/client.py", line 1906, in _discover_references
    resp, read = self._http_request(url, headers, allow_compression=True)
  File "/···/site-packages/dulwich/client.py", line 1875, in _http_request
    raise HTTPUnauthorized(resp.getheader("WWW-Authenticate"), url)
dulwich.client.HTTPUnauthorized: No valid credentials provided

tests improvements

Currently, we test every backend, but we verify the results with Git, which itself dispatches to multiple backends, so for some backend and API combinations the check runs on the same backend that is under test. This is questionable; we could use gitpython to verify results instead.

scmrepo git driver hard codes hooks directory

I have an issue in iterative/dvc#7967 that appears to actually be an issue with scmrepo.

In scmrepo, there is an assumption that hooks are in .git/hooks, and it is hardcoded as such (my guess is there might be other assumptions like that as well). That's all well and good, unless the repo happens to be cloned/integrated as a submodule. In that case, the '.git' directory is actually the parent repo's .git/modules/. Therefore, the hooks directory actually should be /.git/modules//hooks.

You can actually just use:

git rev-parse --git-path hooks

to find out whatever the hook directory is for a given repo (or any other path, btw). That saves you from having to build all of the logic depending on where things land.
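A minimal sketch of how that lookup could be wrapped in Python (the function name `hooks_dir` and the fallback behavior are assumptions for illustration, not scmrepo's API):

```python
import os
import subprocess


def hooks_dir(repo_path="."):
    """Ask git where the hooks directory is instead of assuming .git/hooks.

    Falls back to the conventional location if git is unavailable or
    repo_path is not inside a repository (an assumption of this sketch).
    """
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--git-path", "hooks"],
            cwd=repo_path,
            check=True,
            capture_output=True,
            text=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return os.path.join(repo_path, ".git", "hooks")
    # git prints the path relative to the working directory it ran in,
    # or an absolute path; os.path.join handles both cases.
    return os.path.normpath(os.path.join(repo_path, out))


print(hooks_dir("."))
```

Because the answer comes from git itself, the same call should also cover submodules and worktrees without any special-casing.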

Happy to propose some code, but this might be something that makes more sense to generalize

flake8-bandit fail

Both in CI and locally, I hit:

Traceback (most recent call last):
  File "/Users/gao/anaconda3/envs/dvc/bin/flake8", line 8, in <module>
    sys.exit(main())
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/main/cli.py", line 22, in main
    app.run(argv)
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/main/application.py", line 363, in run
    self._run(argv)
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/main/application.py", line 351, in _run
    self.run_checks()
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/main/application.py", line 264, in run_checks
    self.file_checker_manager.run()
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/checker.py", line 323, in run
    self.run_serial()
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/checker.py", line 307, in run_serial
    checker.run_checks()
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/checker.py", line 589, in run_checks
    self.run_ast_checks()
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8/checker.py", line 494, in run_ast_checks
    for (line_number, offset, text, _) in runner:
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8_bandit.py", line 85, in run
    for warn in self._check_source():
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/flake8_bandit.py", line 59, in _check_source
    bnv = BanditNodeVisitor(
TypeError: __init__() missing 1 required positional argument: 'metrics'

when running the flake8 command.

This error also appears in another repository (DVC) after I manually pip install flake8-bandit, so I guess it comes from flake8-bandit?

pygit: use anonymous remotes

          This is probably another indicator that we should really be using `git_remote_add_anonymous` (since we technically are breaking the config until we actually hit our `finally:` cleanup block)

Originally posted by @pmrowla in #177 (comment)

A user reports there are still cases where our temp remotes are getting written to git config. We need to just use explicitly anonymous remotes instead.

requires libgit2/pygit2#1229

Question about scmrepo's usage

Hi guys. Long story short, now we have this GTO tool that allows you to build an Artifact Registry on top of your repo (see example here). I would like to get your advice on how to use scmrepo in it.
Basically, artifacts are listed in artifacts.yaml. To make sense of the repo,

  1. GTO needs to traverse all commits (starting from heads and going back) and read the content of this artifacts.yaml in each commit.
  2. GTO needs to get a list of git tags with all information (who created that tag, when, etc).

To solve both, I use GitPython now. Everything works pretty well when I have a repo cloned locally, but doesn't work with remote repos. OFC I can clone a repo to a temporary folder and do the same things, but it looks like scmrepo should solve the same kind of tasks for DVC.

So, I would be grateful if you could help me out with this:

  1. Am I right and I can use scmrepo for my task? Does it look like the intended usage?
  2. What is the right way/right methods/functions to call? Maybe you have some simple examples to start with, that would be awesome.

Version restriction on asyncssh

What is the rationale for the version restriction on asyncssh:

asyncssh>=2.7.1,<2.9

It currently makes dvc uninstallable on Nix due to asyncssh being newer.

I was wondering whether the upper bound could be dropped?

Add multi rev support for function `describe`

Our current describe function accepts only one rev at a time and iterates over every reference in the repo on each call. In exp show, where we look up a ref for every single experiment, this costs O(MN).

def _describe(
    self,
    rev: str,
    base: Optional[str] = None,
    match: Optional[str] = None,
    exclude: Optional[str] = None,
) -> Optional[str]:
    if not base:
        base = "refs/tags"
    for ref in self.iter_refs(base=base):
        if (match and not fnmatch.fnmatch(ref, match)) or (
            exclude and fnmatch.fnmatch(ref, exclude)
        ):
            continue
        if self.get_ref(ref, follow=False) == rev:
            return ref
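One way a multi-rev variant could avoid the repeated scan is to walk the refs once, build a reverse map from commit sha to ref name, and then answer every rev with a dictionary lookup. The sketch below is self-contained and hypothetical: `describe_many` and the plain `refs` dict stand in for the backend's `iter_refs`/`get_ref` calls.

```python
import fnmatch
from typing import Dict, Iterable, Optional


def describe_many(
    refs: Dict[str, str],
    revs: Iterable[str],
    match: Optional[str] = None,
    exclude: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Map each rev to the first matching ref that points at it, or None.

    `refs` maps ref name -> sha and is a stand-in for iterating the repo's
    refs; the filtering mirrors the single-rev function above.
    """
    by_sha: Dict[str, str] = {}
    for ref, sha in refs.items():
        if (match and not fnmatch.fnmatch(ref, match)) or (
            exclude and fnmatch.fnmatch(ref, exclude)
        ):
            continue
        by_sha.setdefault(sha, ref)  # keep the first ref seen per sha
    return {rev: by_sha.get(rev) for rev in revs}


tags = {"refs/tags/v1": "aaa", "refs/tags/v2": "bbb"}
print(describe_many(tags, ["aaa", "ccc"]))
```

With N refs and M revs this does one pass over the refs plus M lookups, so the cost is roughly O(M + N) rather than O(MN).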

ssh: `User` overridden when using SSH host alias

Bug Report

Description

As I have multiple GitHub accounts (work and personal) on a single machine, I use SSH aliases to easily switch between the accounts when using various git commands. Up until now this has worked perfectly, even since using DVC in our repositories.

The main thing I use the aliases for is cloning without needing credentials. For example if I needed to clone a work GitHub repository I would run

git clone work:WorkAccount/repo.git

This works fine, and I can carry on working on the code as normal, with all git commands and DVC commands functioning as expected, all except for dvc exp pull.

If a git repository has been set up with SSH using an alias, then dvc exp pull origin -A (or any experiment name/origin name) will crash with the following output (Where the SSH alias in this example is github):

ERROR: unexpected error - Git failed to fetch ref from 'origin': failed to resolve address for github: nodename nor servname provided, or not known  

This behavior is only applicable to dvc exp pull. dvc exp push and dvc exp list both work as expected when the repository is set up with an ssh alias to the remote location, for example the .git/config has the following:

[remote "origin"]

url = github:GitUser/repo.git

and .ssh/config contains for example:

Host github
  AddKeysToAgent yes
  UseKeychain yes
  HostName GitHub.com
  User git
  IdentityFile ~/.ssh/github

and ~/.ssh/github has been set up correctly for ssh access to GitHub.

If the .git/config file is edited as follows:

[remote "origin"]

url = [email protected]:GitUser/repo.git

then dvc exp pull works as expected

Reproduce

  1. Set up SSH to work with GitHub (GitHub guide here and working with multiple GitHub accounts guide here).
  2. Clone a DVC repo using git clone alias:iterative/example-get-started.git where alias is github in our above description, and should match whatever is set in ~/.ssh/config when you set up SSH with GitHub (Step 1).
  3. cd example-get-started
  4. dvc exp pull origin -A
  5. DVC crashes with the following error
ERROR: unexpected error - Git failed to fetch ref from 'origin': failed to resolve address for github: nodename nor servname provided, or not known  
  6. Edit .git/config and change
[remote "origin"]
        url = alias:iterative/example-get-started

to

[remote "origin"]
        url = [email protected]:iterative/example-get-started
  7. Rerun dvc exp pull origin -A and it will pull all experiments as expected

Expected

DVC to pull all experiments from the remote repository without any error when using an SSH alias

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.2.3 (pip)
------------------------
Platform: Python 3.10.10 on macOS-13.1-arm64-arm-64bit
Subprojects:
	dvc_data = 2.3.1
	dvc_objects = 0.23.0
	dvc_render = 0.3.1
	dvc_task = 0.3.0
	scmrepo = 1.0.4
Supports:
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.6.0, boto3 = 1.26.161)
Config:
	Global: /Users/georged/Library/Application Support/dvc
	System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/32fab68d5a6e2090fffcc1a2bb65b88b

Additional Information (if any):

  • Also tested with homebrew installed DVC
  • Also tested on older versions of DVC (> 3) and latest DVC version
  • As mentioned above dvc exp push and dvc exp list both work as expected with an SSH alias

Workaround

When adding the relevant reproducibility notes to this issue, I noticed that in the gist I linked for setting up SSH for multiple accounts, they clone with:

git clone git@alias:GitAccount/repo.git

Which adds a git@ portion to the URL, and when I tried this, it did in fact fix the issue I was facing. However, while this is a fix, I'm still posting the issue in case there is actually a bug in place, as the git@ was not required for any other Git or DVC commands to function properly.

exp run: can't stash changes (nothing to stash) on Win

Bug Report

Description

dvc exp run fails on Windows (conda) with an error from pygit2: unexpected error - 'cannot stash changes - there is nothing to stash.'

Reproduce

I was using: Win 11, VS Code or miniconda terminal, https://github.com/shcheklein/hackathon repo w/o any substantial changes.

dvc exp run fails with this error from the very start.

Output of dvc doctor:

DVC version: 2.41.1 (pip)
---------------------------------
Platform: Python 3.10.8 on Windows-10-10.0.22621-SP0
Subprojects:
        dvc_data = 0.29.0
        dvc_objects = 0.14.1
        dvc_render = 0.0.17
        dvc_task = 0.1.9
        dvclive = 1.3.2
        scmrepo = 0.1.5
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: NTFS on C:\
Repo: dvc, git

optimize is_dirty in dulwich?

I wonder if we should optimize this further, i.e. use a generator and return as soon as any change is found rather than collecting everything. :)

Something like:

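from itertools import chain, zip_longest

from dulwich.index import get_unstaged_changes
from dulwich.porcelain import (
    get_tree_changes,
    get_untracked_paths,
    open_repo_closing,
)


# Wrapped in a function so that `return` is valid; the name and parameters
# are taken from the free variables used in the sketch.
def is_dirty(repo, untracked_files="all", ignored=False):
    with open_repo_closing(repo) as r:
        # 1. Get status of staged
        tracked_changes = get_tree_changes(r)
        # 2. Get status of unstaged
        index = r.open_index()
        normalizer = r.get_blob_normalizer()
        filter_callback = normalizer.checkin_normalize

        unstaged_changes = get_unstaged_changes(index, r.path, filter_callback)
        untracked_paths = get_untracked_paths(
            r.path,
            r.path,
            index,
            exclude_ignored=not ignored,
            untracked_files=untracked_files,
        )
        # Stop at the first untracked or unstaged entry instead of collecting all
        return any(chain.from_iterable(zip_longest(untracked_paths, unstaged_changes)))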

The only doubt that I have is that it won't reuse IgnoreManager, which I think already happens for status.
No strong opinion though, we could also propose this in dulwich.

Originally posted by @skshetry in #74 (comment)
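
A minimal, library-free sketch of the short-circuit idea discussed above: any() over interleaved lazy generators stops at the first truthy item. The two generators and the consumed list are hypothetical stand-ins used only to show how much work is done; they are not dulwich APIs.

```python
from itertools import chain, zip_longest

consumed = []  # records which items the generators actually produced


def untracked_paths():
    # Hypothetical stand-in for a lazy scan of untracked files.
    for path in ("a.txt", "b.txt", "c.txt"):
        consumed.append(path)
        yield path


def unstaged_changes():
    # Hypothetical stand-in for a lazy scan of modified tracked files.
    for path in (b"x.py", b"y.py"):
        consumed.append(path)
        yield path


def is_dirty():
    # zip_longest interleaves both generators (padding the shorter one with
    # None); any() returns as soon as the first truthy entry appears.
    pairs = zip_longest(untracked_paths(), unstaged_changes())
    return any(chain.from_iterable(pairs))


dirty = is_dirty()
```

Note that zip_longest pulls one item from each generator before any() sees the first pair, so one entry per source is produced even when the first one is already enough.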

is_ignored() broken for .gitignore in subdirectories

is_ignored() is broken when dealing with nested .gitignore files:

import os
import sys

from scmrepo.git import Git

dirname = "ignore_issue_repo"
try:
    os.mkdir(dirname)
except FileExistsError:
    print(f"Delete {dirname} and try again", file=sys.stderr)
    sys.exit(1)


os.chdir(dirname)

repo = Git.init(".")
print("Initialized repo")

subdir = "subdir"
os.mkdir(subdir)

# create a .gitignore in the subdirectory
ignored_files = ("ignoredfile1", "ignoredfile2")
with open(f"{subdir}/.gitignore", "w") as fh:
    for name in ignored_files:
        fh.write(f"{name}\n")

# create dummy (gitignored) files
for file in ignored_files:
    with open(f"{subdir}/{file}", "w") as fh:
        fh.write("dummy")

repo.add(f"{subdir}/.gitignore")
repo.commit("add subdir gitignore")

ignored_files_ignored = [repo.is_ignored(f"{subdir}/{file}") for file in ignored_files]

assert all(ignored_files_ignored)  # fails
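
For comparison, here is a deliberately simplified sketch of the behavior the script above expects: a .gitignore in a subdirectory applies to files below it. This is not scmrepo's implementation. The is_ignored helper below is hypothetical and handles only plain name/glob patterns (no negation, anchoring, or directory-only patterns).

```python
import fnmatch
import os
import tempfile


def is_ignored(root, relpath):
    """Check a file name against the .gitignore of every directory on the
    way from the repo root down to the file's own directory (simplified)."""
    parts = relpath.split("/")
    name = parts[-1]

    # Build the list of directories from the root down to the file's parent.
    directories = [root]
    for part in parts[:-1]:
        directories.append(os.path.join(directories[-1], part))

    for directory in directories:
        gitignore = os.path.join(directory, ".gitignore")
        if not os.path.isfile(gitignore):
            continue
        with open(gitignore) as fh:
            patterns = [
                line.strip()
                for line in fh
                if line.strip() and not line.startswith("#")
            ]
        if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
            return True
    return False


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "subdir"))
    with open(os.path.join(root, "subdir", ".gitignore"), "w") as fh:
        fh.write("ignoredfile1\nignoredfile2\n")
    results = {
        "subdir/ignoredfile1": is_ignored(root, "subdir/ignoredfile1"),
        "subdir/ignoredfile2": is_ignored(root, "subdir/ignoredfile2"),
        "subdir/other": is_ignored(root, "subdir/other"),
        "ignoredfile1": is_ignored(root, "ignoredfile1"),
    }
```

The last entry checks that the nested .gitignore does not leak upward: a file with the same name at the repo root is not ignored.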

clone: use gitpython's CLI git first and fall back to dulwich if CLI git is not available

With all of the auth problems and stuff like iterative/dvc-ssh#20, we are probably better off using CLI git for clone first and falling back to dulwich if it is not available.

clone was the last gitpython-based operation for a very long time, without any problems. The whole point of migrating clone to dulwich was so that dvc get/import/list could work in environments without the git CLI (e.g. during deployment in Docker images). We can get both benefits by falling back to dulwich.

CC @dberenbaum
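
The proposed order could be sketched roughly as follows. This is an illustration, not scmrepo code: cli_clone and dulwich_clone are hypothetical callables standing in for the two backends, and availability is detected with shutil.which.

```python
import shutil


def clone(url, to_path, cli_clone, dulwich_clone, which=shutil.which):
    """Use CLI git when a `git` executable is on PATH, else fall back."""
    if which("git") is not None:
        return cli_clone(url, to_path)
    return dulwich_clone(url, to_path)


calls = []


def fake_cli(url, to_path):
    # Hypothetical stand-in for the gitpython (CLI git) backend.
    calls.append(("cli", url, to_path))


def fake_dulwich(url, to_path):
    # Hypothetical stand-in for the dulwich backend.
    calls.append(("dulwich", url, to_path))


# Simulate an environment without git installed (e.g. a slim Docker image).
clone("https://example.com/repo.git", "dir", fake_cli, fake_dulwich,
      which=lambda name: None)

# Simulate an environment where git is available.
clone("https://example.com/repo.git", "dir", fake_cli, fake_dulwich,
      which=lambda name: "/usr/bin/git")
```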

linting: type checking failures with dulwich 0.20.36

There have been some type checking related changes in the latest dulwich release (0.20.36) which break mypy type checking:

nox > python -m mypy
scmrepo/git/backend/dulwich/__init__.py: note: In member "fetch_refspecs" of class "DulwichBackend":
scmrepo/git/backend/dulwich/__init__.py:630:33: error: Argument
"determine_wants" to "fetch" of "GitClient" has incompatible type
"Callable[[Any], Any]"; expected
"Optional[Callable[[Dict[bytes, bytes], Optional[int]], List[bytes]]]" 
[arg-type]
                    determine_wants=determine_wants,
                                    ^
Found 1 error in 1 file (checked 26 source files)
nox > Command python -m mypy failed with exit code 1
nox > Session lint failed.

add: support `force=True/False`

Currently, scm.add() always implies git add --force and does not check for ignored files. This behavior was good enough before, but we need non-force adding for DVC exps, and in general it would be better to use git's defaults (force=False) in scmrepo calls.
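
A rough sketch of what a force flag could look like, under stated assumptions: is_ignored and stage are hypothetical callables, and silently skipping ignored paths (and reporting them back) is a choice made for this sketch, not the behavior of scmrepo or of git itself.

```python
def add(paths, is_ignored, stage, force=False):
    """Stage paths; with force=False, ignored paths are skipped and returned."""
    skipped = []
    for path in paths:
        if not force and is_ignored(path):
            skipped.append(path)
            continue
        stage(path)
    return skipped


ignored = {"build/out.bin"}  # hypothetical set of ignored paths

staged = []
skipped = add(["a.py", "build/out.bin"], ignored.__contains__, staged.append)

forced = []
add(["a.py", "build/out.bin"], ignored.__contains__, forced.append, force=True)
```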

Update on `push_refspec` and `fetch_refspec`

  1. Unify the API of push_refspec and fetch_refspec.
  2. Allow pushing multiple refspecs in a single call.
  3. Handle the "no repository" error in fetch_refspec.
  4. push_refspec and fetch_refspec now return the status of each refspec.
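
One possible shape for such a per-refspec result, as a sketch only: the RefspecStatus enum, its members, and the push_one callable are hypothetical names chosen for illustration and are not taken from scmrepo.

```python
from enum import Enum


class RefspecStatus(Enum):
    # Hypothetical status values for a single refspec.
    SUCCESS = "success"
    UP_TO_DATE = "up-to-date"
    REJECTED = "rejected"


def push_refspecs(refspecs, push_one):
    """Push one or many refspecs and report a status for each of them."""
    if isinstance(refspecs, str):
        refspecs = [refspecs]
    return {refspec: push_one(refspec) for refspec in refspecs}


# Hypothetical outcomes that a remote might report.
outcomes = {
    "refs/exps/a:refs/exps/a": RefspecStatus.SUCCESS,
    "refs/exps/b:refs/exps/b": RefspecStatus.UP_TO_DATE,
}

single = push_refspecs("refs/exps/a:refs/exps/a", outcomes.__getitem__)
many = push_refspecs(list(outcomes), outcomes.__getitem__)
```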

tests: get rid of `test_clone`

This clones a remote repo.

def test_clone(tmp_dir: TmpDir, matcher: Type[Matcher]):
    progress = MagicMock()
    url = "https://github.com/iterative/dvcyaml-schema"
    Git.clone(url, "dir", progress=progress)
    progress.assert_called_with(matcher.instance_of(GitProgressEvent))
    assert (tmp_dir / "dir").exists()

We should replace this with tests testing:

  • clone with file://
  • clone local directory
  • clone with ssh
  • clone with http(s)
  • push_refspec with progress
  • clone with progress
  • fetch_refspec with progress

pythonic API

The APIs in scmrepo are clunky and not very pythonic. They evolved this way in DVC, and we extracted them as-is.

I think we can do more on the API side, with our experience in object databases implementation: dulwich/libgit2/dvc-objects/dvc-data, etc. :)

clone: support creating mirrors

We've discussed clone caching multiple times before (e.g. in #41), and it always seemed pretty involved because of unshallowing and other potential conflicts.

It seems we could introduce a mirror concept that lets you establish a local cached "mirror" holding multiple repos, keyed by token(url) (or something like that), which would use git clone's --mirror flag.

E.g. in dvc we would create our mirror (say in platformdirs.site_cache_dir to avoid strange nfs/cifs/etc problems) and would use it to clone stuff we need:

import platformdirs

from scmrepo import Mirror  # proposed API, does not exist yet

mirror = Mirror(platformdirs.site_cache_dir(...))

repo = mirror.clone("https://github.com/iterative/example-get-started", ...)
...
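
The "cached by token(url)" part could be sketched like this. It is an assumption-laden illustration: mirror_path is a hypothetical helper, and using a SHA-256 hex digest of the URL as the token is just one possible choice.

```python
import hashlib
import os


def mirror_path(cache_dir, url):
    """Map a repo URL to a stable directory inside the mirror cache."""
    token = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, token)


cache_dir = os.path.join("cache", "mirrors")  # hypothetical cache location
first = mirror_path(cache_dir, "https://github.com/iterative/example-get-started")
again = mirror_path(cache_dir, "https://github.com/iterative/example-get-started")
other = mirror_path(cache_dir, "https://github.com/iterative/dvcyaml-schema")
```

The same URL always maps to the same directory, so a later clone can reuse the mirror that already exists there instead of fetching everything again.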

fs: support version_aware ?

Currently, gitfs works on a single revision, but we could make a version_aware version (similar to s3fs, gcsfs, adlfs) and support revisions as version_id. The implementation is fairly straightforward (just use the Tree for a particular version_id and the rest is the same). This seems to make a lot of sense in the dvc context of unifying get/import with get-url/import-url and getting rid of DependencyRepo.

`pygit2` backend didn't raise a proper exception in `resolve_rev`.

How to reproduce it:

$ mkdir scmtest
$ cd scmtest
# this is correct
$ python3 -c "from scmrepo.git import Git; Git('.').pygit2.resolve_rev('HEAD~')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 271, in resolve_rev
    raise RevError(f"unknown Git revision '{rev}'")
scmrepo.exceptions.RevError: unknown Git revision 'HEAD~'

# add some remote
$ git remote add origin someremote
# the exception from `pygit2` is not wrapped properly
$ python3 -c "from scmrepo.git import Git; Git('.').pygit2.resolve_rev('HEAD~')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 263, in resolve_rev
    shas = {
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 264, in <setcomp>
    self.get_ref(f"refs/remotes/{remote.name}/{rev}")
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/scmrepo/git/backend/pygit2.py", line 322, in get_ref
    ref = self.repo.references.get(name)
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/pygit2/repository.py", line 1440, in get
    return self[key]
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/pygit2/repository.py", line 1436, in __getitem__
    return self._repository.lookup_reference(name)
_pygit2.InvalidSpecError: refs/remotes/origin/HEAD~: the given reference name 'refs/remotes/origin/HEAD~' is not valid

# gitpython works correctly
$ python3 -c "from scmrepo.git import Git; Git('.').gitpython.resolve_rev('HEAD~')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/gao/anaconda3/envs/dvc/lib/python3.8/site-packages/scmrepo/git/backend/gitpython.py", line 352, in resolve_rev
    raise RevError(f"unknown Git revision '{rev}'")
scmrepo.exceptions.RevError: unknown Git revision 'HEAD~'

For now, the default backend for resolve_rev is pygit2, and it causes some problems in iterative/dvc#7204 (comment). To solve this, we have two choices:

  1. Wrap the _pygit2.InvalidSpecError in the pygit2 backend.
  2. Use the gitpython backend instead.
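
The first option could look roughly like the sketch below. It is self-contained on purpose: RevError and InvalidSpecError are local stand-ins for the real exception classes, and lookup_reference is a hypothetical substitute for the backend's reference lookup.

```python
class RevError(Exception):
    """Stand-in for scmrepo.exceptions.RevError."""


class InvalidSpecError(ValueError):
    """Stand-in for the pygit2 error seen in the traceback above."""


def lookup_reference(name):
    # Hypothetical lookup: rejects names that are not valid reference names
    # (here, anything containing "~"); otherwise reports no match.
    if "~" in name:
        raise InvalidSpecError(f"the given reference name '{name}' is not valid")
    return None


def resolve_rev(rev, remotes):
    shas = set()
    for remote in remotes:
        try:
            sha = lookup_reference(f"refs/remotes/{remote}/{rev}")
        except InvalidSpecError:
            # An invalid ref name cannot match any remote ref, so skip it.
            continue
        if sha is not None:
            shas.add(sha)
    if len(shas) == 1:
        return shas.pop()
    raise RevError(f"unknown Git revision '{rev}'")


try:
    resolve_rev("HEAD~", ["origin"])
except RevError as exc:
    message = str(exc)
```

With the low-level error caught inside the loop, callers see the same RevError whether or not a remote is configured.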

credential-store: support `store` for new credentials

Bug Report

Description

When using Git Credential Manager (GCM) to manage login information with https-based remotes in dvc (e.g., dvc import https://github.com/...), the credentials are not properly cached. If I just fetch some repo over https with git, GCM correctly uses the saved credentials. With dvc, the login window appears for every step of the pipeline, meaning I have to sign in three times for a single import. GitHub already warned me that GCM triggered some rate limit for my account because of this. Here is an example terminal output:

Importing '/ (https://github.com/xxx/xxx.git)' -> 'xxx'
Cloning xxx.git|                                                                                                                                                                                                                  |0.00/? [00:00,      ?obj/s]
info: please complete authentication in your browser...
Cloning xxx.git|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| Compressing |876/876 [03:38,    219s/obj]
info: please complete authentication in your browser...
info: please complete authentication in your browser...

The same happens for subsequent runs of dvc pull.

Reproduce

  1. dvc import https://github.com/xxx/xxx.git
  2. sign in with GitHub in the GCM window
  3. sign in again
  4. ...

Expected

Credentials are used without the dialogue popping up at all (just like it does for the regular git commands).
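
The expected behavior amounts to storing freshly entered credentials so that later steps reuse them. A toy sketch of that idea follows; CachingCredentials and the prompt callable are hypothetical, and the in-memory dict only stands in for whatever a real credential helper would persist.

```python
class CachingCredentials:
    """Ask for credentials once per URL and reuse them afterwards."""

    def __init__(self, prompt):
        self._prompt = prompt  # callable that interactively obtains credentials
        self._store = {}

    def get(self, url):
        if url not in self._store:
            # The "store" step: remember newly obtained credentials.
            self._store[url] = self._prompt(url)
        return self._store[url]


prompts = []


def fake_prompt(url):
    # Hypothetical stand-in for the browser sign-in dialogue.
    prompts.append(url)
    return ("user", "token")


creds = CachingCredentials(fake_prompt)
# Three steps of one import, all against the same remote.
results = [creds.get("https://github.com/xxx/xxx.git") for _ in range(3)]
```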

Environment information

Output of dvc doctor:

DVC version: 2.43.1 (brew)
---------------------------------
Platform: Python 3.11.1 on macOS-13.2-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.35.1
	dvc_objects = 0.19.0
	dvc_render = 0.0.17
	dvc_task = 0.1.11
	dvclive = 1.3.3
	scmrepo = 0.1.6
Supports:
	azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
	gdrive (pydrive2 = 1.15.0),
	gs (gcsfs = 2023.1.0),
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	oss (ossfs = 2021.8.0),
	s3 (s3fs = 2023.1.0, boto3 = 1.24.59),
	ssh (sshfs = 2023.1.0),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8),
	webhdfs (fsspec = 2023.1.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

bootstrapping apis

We need some APIs, such as init, add, and commit, implemented in scmrepo for all backends, since they are required for bootstrapping the tests.

dulwich: properly patch credentials

From @dtrifiro

I've had a quick look. We need to fix the clone method in scmrepo.git.backend.dulwich: it doesn't use the GitCredentials client. The same needs to be done in other methods that use get_transport_and_path.

iterative/dvc#7670

Looks like we need to copy-paste the clone (and other) implementations from dulwich and use our client instead of relying on get_transport_and_path.
