Hi, thanks for providing such a large github corpus but I have really problems to

How to get the github repos as zip/tar or clone?

KujaEx commented on August 27, 2024

How to get the github repos as zip/tar or clone?

Comments (8)

smola commented on August 27, 2024

Closing this, since the question seems to have been answered. Please feel free to comment or reopen if there is still a pending conversation here.

vmarkovtsev commented on August 27, 2024

@KujaEx what you've unpacked is the bare .git directory. You can apply any git library to it.

KujaEx commented on August 27, 2024

@vmarkovtsev So every ".siva" file I downloaded with the pga tool (pga get -l java) unpacks to the ".git" directory of a GitHub project?

As I wrote above, I unpacked a siva file and got, as you said, the contents of the .git directory. How can I use that now? It isn't recognized as a git tree when I try to use a git command above this .git directory:

$ git status
fatal: This operation must be run in a work tree

Again, I'm really confused about the dataset with its siva-format, pga, multi and other tools. What is the easiest way to get the file content of the GitHub repositories (in my case the Java projects) from the "PublicGitArchive" corpus?

I could also simply clone/download the GitHub repositories by name (as listed by pga get -l java) from github.com itself, but then I would have the most up-to-date repositories, whose data I can't compare to your corpus.

vmarkovtsev commented on August 27, 2024

@KujaEx

> I could also simply clone/download the GitHub repositories by name (as listed by pga get -l java) from github.com itself, but then I would have the most up-to-date repositories, whose data I can't compare to your corpus.

If you need to clone a few thousand repos then PGA is not your choice anyway. Go ahead with GHTorrent or the GitHub API to get the list of repos and clone them directly from GitHub. PGA contains many more repos and basically solves the problem of cloning so many.

> Again, I'm really confused about the dataset with its siva-format, pga, multi and other tools.

siva is the only way to store forks efficiently. GitHub uses their own "siva" internally btw. I am sorry that it is confusing but there is no other way really because we do not want to store 10 TB instead of 3 TB (according to our humble estimation).

> It isn't recognized as a git tree when I try to use a git command above this .git directory:

This is a standard bare Git repo.

pga get -u https://github.com/src-d/go-git
mkdir -p go-git/.git
cd go-git/.git
siva unpack ../../siva/latest/5d/5d7303c49ac984a9fec60523f2d5297682e16646.siva
cd ..
git show-ref
# checkout any you want, e.g.
git checkout refs/heads/HEAD/01614be7-8992-df66-ee0f-7917c39266e3

vmarkovtsev commented on August 27, 2024

That being said, how about using the engine? It runs directly on PGA and has many nice features.

KujaEx commented on August 27, 2024

(Did not try the engine yet)

The last line with git checkout will not work until you set git config core.bare false, but in any case this procedure is really inconvenient. You have to list all siva file names, create new project folders (without knowing the project names yet, because they are inside the siva files), create a .git folder, unpack the siva files, adjust the config, find the head, and check out the head.
Isn't there an easier way to extract the project files from the siva files?
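
The manual steps listed above can be scripted. The following is only a minimal sketch in Python, assuming the siva and git binaries are on PATH and that siva unpack writes into the current directory, as in the commands earlier in this thread. The helper names checkout_commands and extract are hypothetical and not part of pga. The reference to check out still has to be picked by hand from the output of git show-ref.

```python
import subprocess
from pathlib import Path


def checkout_commands(siva_file, repo_dir, ref):
    """Return (argv, cwd) pairs mirroring the manual steps, in order.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    repo_dir = Path(repo_dir)
    git_dir = repo_dir / ".git"
    # Resolve to an absolute path because each command runs in another cwd.
    siva_file = Path(siva_file).resolve()
    return [
        # Unpack the archive contents into <repo_dir>/.git
        (["siva", "unpack", str(siva_file)], git_dir),
        # The unpacked repository is bare; allow a work tree next to it.
        (["git", "config", "core.bare", "false"], repo_dir),
        # Check out the chosen reference (pick one from `git show-ref`).
        (["git", "checkout", ref], repo_dir),
    ]


def extract(siva_file, repo_dir, ref):
    """Run the steps; needs the siva and git binaries to be installed."""
    (Path(repo_dir) / ".git").mkdir(parents=True, exist_ok=True)
    for argv, cwd in checkout_commands(siva_file, repo_dir, ref):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling extract once per downloaded siva file would automate the loop, although the project name for repo_dir still has to come from somewhere else, for example the index.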

I have another question. When I use pga list -l java I get a list of 24810 Java GitHub projects. When I then use pga get -l java, it downloads 41055 siva files in total. So one siva file doesn't match one GitHub project. Wouldn't the procedure described above fail, then, because there are far more siva files to unpack than projects listed?

smola commented on August 27, 2024

@KujaEx Note that a siva file can contain multiple GitHub projects, since forks are stored together. You can see a description of this grouping here: https://github.com/src-d/borges#key-concepts

vmarkovtsev commented on August 27, 2024

@KujaEx

The siva<->project name associations are stored in the CSV index; that's what pga uses internally. It can be downloaded separately with pga list. We are not serving Git repos directly - PGA is not a GitHub mirror, it is a bulk download dataset with an optimized fork storage.
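
To make the index lookup concrete, here is a rough Python sketch that inverts the index into a siva-file-to-projects mapping. It assumes the index has been saved as CSV with one row per project, a URL column, and a column holding comma-separated siva file names. The column names URL and SIVA_FILENAMES are assumptions, so check the header of your own index and pass the right names. siva_to_repos is a hypothetical helper, not part of pga.

```python
import csv
from collections import defaultdict


def siva_to_repos(index_csv, url_col="URL", siva_col="SIVA_FILENAMES"):
    """Map each siva file name to the project URLs stored in it.

    index_csv is any iterable of CSV lines (e.g. an open file).
    The default column names are assumptions about the index layout.
    """
    mapping = defaultdict(list)
    for row in csv.DictReader(index_csv):
        # One project may be listed with several siva files.
        for name in row[siva_col].split(","):
            name = name.strip()
            if name:
                mapping[name].append(row[url_col])
    return dict(mapping)
```

With such a mapping, the project names for a downloaded siva file are known before unpacking it, and a file shared by several forks shows up with more than one URL.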

The easy way would be to use https://github.com/src-d/go-billy-siva (for an example, see https://github.com/src-d/go-license-detector/blob/master/licensedb/filer/filer.go#L205 ). You are more than welcome to write a tool that automates your needs and PR it to https://github.com/src-d/datasets/tree/master/PublicGitArchive or even to https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga ; @campoy will also be happy to learn more about your requirements, as he is constantly improving pga.

Regarding pga get -l java: a Git repository may contain more than one unrooted reference, so a single project can be spread over several siva files. E.g. github.com/google/angle contains more than 6,000 because of the way Gerrit is integrated with GitHub. Yes, if the goal is to determine the "master" siva file, it may require scanning the whole set, though most often the biggest siva file is the one you are looking for.
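
The "biggest siva file" heuristic can be sketched in a few lines of Python. This assumes the local layout siva/latest/<first two characters of the name>/<name>, which is inferred only from the single example path earlier in this thread. local_siva_path and biggest_siva are hypothetical helpers, and the result is a guess rather than a guaranteed "master" file.

```python
from pathlib import Path


def local_siva_path(name, root="siva/latest"):
    """Build the download path for a siva file name (layout is an assumption)."""
    return Path(root) / name[:2] / name


def biggest_siva(names, root="siva/latest"):
    """Return the largest existing siva file among names, or None if none exist."""
    existing = [p for p in (local_siva_path(n, root) for n in names) if p.is_file()]
    if not existing:
        return None
    # Heuristic from the thread: the biggest file is most often the main one.
    return max(existing, key=lambda p: p.stat().st_size)
```

Given the siva file names listed for one project in the index, biggest_siva picks the candidate to unpack first.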
