Comments (8)

smola commented on August 27, 2024

Closing this, since it was a question that seems to have been answered already. Please feel free to comment or reopen if anything is still pending here.

from datasets.

vmarkovtsev commented on August 27, 2024

@KujaEx what you've unpacked is the bare .git directory. You can apply any git library to it.

KujaEx commented on August 27, 2024

@vmarkovtsev So every ".siva" file I downloaded with the pga tool (pga get -l java) unpacks to the ".git" directory of a GitHub project?

Like I wrote above, I unpacked a siva file and got, like you said, the contents of the .git directory. How can I use that now? It isn't recognized as a git tree when I run a git command in the directory above this .git directory:

$ git status
fatal: This operation must be run in a work tree

Again, I'm really confused about the dataset with its siva format, pga, multi and other tools. What is the easiest way to get the file contents of the GitHub repositories (in my case the Java projects) from the "PublicGitArchive" corpora?

I could also simply clone/download the GitHub repositories by name (which you provide with pga get -l java) from github.com itself, but then I would have the most up-to-date repositories, whose data I can't compare to your corpora.

vmarkovtsev commented on August 27, 2024

@KujaEx

> I could also simply clone/download the GitHub repositories by name (which you provide with pga get -l java) from github.com itself, but then I would have the most up-to-date repositories, whose data I can't compare to your corpora.

If you need to clone a few thousand repos then PGA is not your choice anyway. Go ahead with GHTorrent or the GitHub API to get the list of repos and clone them directly from GitHub. PGA contains many more repos and basically solves the problem of cloning so many.

> Again, I'm really confused about the dataset with its siva format, pga, multi and other tools.

siva is the only way to store forks efficiently. GitHub uses their own "siva" internally btw. I am sorry that it is confusing but there is no other way really because we do not want to store 10 TB instead of 3 TB (according to our humble estimation).

> It isn't recognized as a git tree when I run a git command in the directory above this .git directory:

This is a standard bare Git repo.

pga get -u https://github.com/src-d/go-git
mkdir go-git/.git
cd go-git/.git
siva unpack ../../siva/latest/5d/5d7303c49ac984a9fec60523f2d5297682e16646.siva
cd ..
git show-ref
# checkout any you want, e.g.
git checkout refs/heads/HEAD/01614be7-8992-df66-ee0f-7917c39266e3
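
A note on the last step: if the unpacked config has core.bare set to true, git checkout may stop with "This operation must be run in a work tree". The following is a minimal, self-contained sketch of one way around that. It uses a throwaway repository instead of a real siva file, so the temporary paths and the hello.txt file are purely illustrative assumptions.

```shell
#!/bin/sh
# Demo only: needs git, but neither pga nor siva.
set -eu

demo=$(mktemp -d)

# 1. Build a throwaway repository that stands in for an upstream project.
git init -q "$demo/src"
echo "hello" > "$demo/src/hello.txt"
git -C "$demo/src" add hello.txt
git -C "$demo/src" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# 2. Simulate the state after unpacking: a project/.git directory
#    whose config marks the repository as bare (assumption for this demo).
mkdir "$demo/project"
cp -R "$demo/src/.git" "$demo/project/.git"
git -C "$demo/project" config core.bare true

# 3. Work-tree commands are refused while core.bare is true.
if git -C "$demo/project" status >/dev/null 2>&1; then
    bare_status="ok"
else
    bare_status="failed"
fi
echo "git status while bare: $bare_status"

# 4. Flip the flag, then force a checkout to materialize the files.
git -C "$demo/project" config core.bare false
git -C "$demo/project" checkout -q -f

ls "$demo/project"
# Remove "$demo" by hand once you are done inspecting it.
```

With a real siva file the same two commands from step 4 would be run inside the directory that holds the unpacked .git, followed by a checkout of whichever ref from git show-ref is of interest.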

vmarkovtsev commented on August 27, 2024

That being said, how about using the engine? It runs directly on PGA and has many nice features.

KujaEx commented on August 27, 2024

(Did not try the engine yet)

The last line with git checkout will not work until you set git config core.bare false, but anyway this procedure is really inconvenient. You have to list all siva file names, create new project folders (where you don't know the project name yet because it is inside the siva), create a .git folder, unpack the siva files, adjust the config, find the head, and check out the head.
Isn't there an easy way to extract the project files from the siva files?

I have another question. When I use pga list -l java I get a list of 24810 Java GitHub projects. When I then use pga get -l java it downloads 41055 siva files in total. So one siva file doesn't match one GitHub project. Does that mean the procedure described above wouldn't work, because there are far more siva files to unpack than projects listed?

smola commented on August 27, 2024

@KujaEx Note that a siva file can contain multiple GitHub projects, since forks are stored together. You can see a description of this grouping here: https://github.com/src-d/borges#key-concepts

vmarkovtsev commented on August 27, 2024

@KujaEx

The siva<->project name associations are stored in the CSV index; that's what pga uses internally. It can be downloaded separately with pga list. We are not serving Git repos directly: PGA is not a GitHub mirror, it is a bulk download dataset with optimized fork storage.

The easy way would be to use https://github.com/src-d/go-billy-siva, e.g. like in https://github.com/src-d/go-license-detector/blob/master/licensedb/filer/filer.go#L205. You are more than welcome to write a tool which automates your needs and PR it to https://github.com/src-d/datasets/tree/master/PublicGitArchive or even to https://github.com/src-d/datasets/tree/master/PublicGitArchive/pga. @campoy will also be happy to learn more about your requirements, as he is constantly improving pga.

Regarding pga get -l java, a Git repository may contain more than 1 unrooted reference. E.g. github.com/google/angle contains more than 6,000 because of the way Gerrit is integrated with GitHub. Yep, if the goal is to determine the "master" siva file, it may require scanning the whole set, though most often the biggest siva file is the one you are looking for.
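
The "biggest siva file" rule of thumb can be sketched in shell. This is only an illustration: the three .siva files below are zero-filled placeholders created for the demo, not real PGA data, and their names are hypothetical. The heuristic is the one described above and is not guaranteed to pick the right file.

```shell
#!/bin/sh
# Demo only: pick the largest of several siva files by size.
set -eu

dir=$(mktemp -d)

# Hypothetical stand-ins for the siva files that belong to one repository.
head -c 100  /dev/zero > "$dir/aaa.siva"
head -c 5000 /dev/zero > "$dir/bbb.siva"
head -c 20   /dev/zero > "$dir/ccc.siva"

# ls -S sorts by file size, largest first; keep only the first entry.
biggest=$(ls -S "$dir" | head -n 1)

echo "largest siva file: $biggest"
```

For real data the directory would be replaced by the set of siva files that the index associates with a single repository; how that set is obtained is left open here.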
