igrigorik / gharchive.org Goto Github PK

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

Home Page: https://www.gharchive.org

License: MIT License

JavaScript 14.74% Ruby 81.72% Procfile 0.21% Shell 3.33%

gharchive.org's Introduction

gharchive.org's People

Contributors

Stargazers

Watchers

Forkers

piotrsikora danieljpeter dineshkummarc sheymann happydust danishkhan utkarshkukreti chaconnewu metdos michaelcolenso tsnow lotterblad niumowm adamrneary cwsteinbach rjurney fengabe abelenki streambo pokerg martinyoo n0rmrx adamstac mshamber enjalot imclab zeke web5design kuikui sirithink ysei a-campbell duyvk nikolayvoronchikhin cloudpanda yjrhot lyt23 lgs 796f ssr20 webchick fhoffa markandrewj hnq90 knowno spencerx harishvc ianjennings pombreda neonichu pb-pravin ronaknnathani seglo frnkbaum kalburgimanjunath mcanthony zotherstupidguy mr-justin hut8 githubwangtwo wjywbs alysonla guiman arjunvijayvargiya dongjoon-hyun xiaoyuanxie xyxie benzntech notslang zthomas modulexcite xiajian emijrp rugby110 rantav mlinksva jerry-goodboy andela-anandaa andelaosp harshavardhana sunghopark1 fanyer jerjohn15 microamp soueid zqy123456 billyprice1old madnight ranaivosonherimanitra sudhi10 ikinfly2 kaziplused vrkansagara alexxnica sjoerdapp kryndex systemx fluquid duanyanan paulkimds

gharchive.org's Issues

Top level index file for data.githubarchive.org

I'm wondering if it would be useful if there was a http://data.githubarchive.org/index.json or similar file that would list all the available files and their checksums.

I'm mirroring the data for https://github.com/copiousfreetime/ghent and feel like a bad netizen for probably polling for new data more than I should.

Content-Encoding header missing?

I could be wrong, but I think that the headers from data.githubarchive.org should include a Content-Encoding: gzip or the Content-Type should be application/x-gzip

% curl -I http://data.githubarchive.org/2012-02-12-12.json.gz
HTTP/1.1 200 OK
Server: HTTP Upload Server Built on Feb 13 2013 15:53:33 (1360799613)
Expires: Sun, 24 Feb 2013 10:42:07 GMT
Date: Sun, 24 Feb 2013 09:42:07 GMT
Last-Modified: Mon, 19 Nov 2012 02:43:55 GMT
ETag: "94b0540a8e37e248457b4c9244744547"
x-goog-generation: 1352248816656002
x-goog-metageneration: 1
Content-Type: application/json
Content-Language: en
Accept-Ranges: bytes
Content-Length: 775893
Cache-Control: public, max-age=3600, no-transform
Age: 175

unicode support

Hello Ilya,

I was trying to get repository description via BigQuery UI, and noticed that some unicode symbols are not shown correctly. E.g. this query:

SELECT repository_description FROM [githubarchive:github.timeline]
WHERE repository_name = 'baidupcsapi'

Returns:

Row repository_description   
1   ������������api  
2   ������������api  
3   ������������api  
4   ������������api  
5   ������������api

While the actual description of ly0/baidupcsapi is 百度网盘api

Real JSON data

Hey,

From what I see the .json files aren't valid JSON files because instead of having an array of objects they contain JSON objects which are not delimited, like for example:

{"repository": ...}{"repository": ...}

This is making things hard to parse in Node, like when using: https://www.npmjs.org/package/JSONStream

Shouldn't these files be valid JSON so we can properly use JSON streaming parsers?

I just downloaded all data from the site (Mar and Apr), and I cannot load them because there is no separator between objects in json, right after closing one with "}", the next is opened with "{". I don't know json at all, so I don't know whether this is the right format or not, but while reading it with yajl and ruby works, python's standard json library fails.

How to get one repository stars number changing?

I found that in the schema of github archive repository_watchers stand for the stars number of one repository. I can not find the watch number about one repository. Am I right?
What's more, I use this big query sql to get the stars changing. But it have many recodes, I just want to get them by month. One month one recode, it's enough.

SELECT repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE 
     repository_url = "https://github.com/celery/celery"
ORDER BY created_at DESC

Then I use this sql below:

SELECT repository_watchers, LEFT(created_at, 7) as month
FROM [githubarchive:github.timeline]
WHERE 
     repository_url = "https://github.com/celery/celery"
GROUP BY month
ORDER BY month DESC

But after this, I get the error:

Query Failed
Error: (L1:8): Expression 'repository_watchers' is not present in the GROUP BY list
Job ID: skilful-courage-797:job_iCAwrEDspMW805oai-FM5g-rFTI

So how to make it?

Missing files.

Hey,

It appears that after 2015-01-07-13.json.gz there are no more files.

Not all files have \n separator

Hi.

As I understand the format of the files on githubarchive.org is json lines separated by new line.
But I found that not all files have new line character. Sometimes records are not separated.
It makes parsing them a little bit tricker since I have to look for "}{" string which is not inside valid json.

I can compile list of files with this problem if you need it
But do you plan to fix it or should I fix it only in my local copy?

BTW thanks for such a great data! :-)

Best regards
Krzysztof

Access Denied For Dates 1-9

Using your wget script, we get access denied errors for any of the first 9 days of the month. It should use 01, 02, etc. formatting, but in fact, wget reduces it to 1,2, etc. This is not an issue with the time (in hours). I wonder if it's just my build of wget.

Provide a stream API for the firehose

We want to display commits in real-time on our hackerspace's LED wall: http://www.youtube.com/watch?v=YZwoYrEyhz0

JSON schema for event type 'PushEvent' does not match the schema posted in the README file

JSON schema for event type 'PushEvent' does not match the schema posted in the README file (https://github.com/igrigorik/githubarchive.org/blob/master/crawler/README.md) . Some of the differences include repository is now repo , payload.sha and numerous repository attributes like language, description are missing. I first noticed this change 2 months ago. Any assistance will be much appreciated.

service stopped ?

BigQuery table seems not updated anymore

SELECT actor, created_at FROM [githubarchive:github.timeline] LIMIT 1 ORDER BY created_at desc;

=> 2014-03-16 23:04:44

some event jsons appear in the same line

In http://www.githubarchive.org/ that Ilya Grigorik has provided ,I found that in many gz files , some consecutive events are logged to same file .

for example in 2011-03-15-21.json.gz

To get the above do : wget http://data.githubarchive.org/2011-03-15-21.json.gz

In this gz for example if you search for id 1484832 , you can find that the 2 consecutive events(jsons) are in same line see http://codebeautify.org/jsonviewer/2cb891

the two jsons in same line is a combination of

http://codebeautify.org/jsonviewer/c7e18e

and

http://codebeautify.org/jsonviewer/945d56

What is the impact ? when I was loading each line and loading it with python's(why python ? because I felt python is comfortable in dealing with jsons) json.loads it said it was invalid as it was a combination of two jsons .

Invalid string in JSON text

Hello, I was wondering if you had seen this before:

/Users/admin/.rvm/gems/ruby-1.9.2-p318/gems/yajl-ruby-1.1.0/lib/yajl.rb:36:in parse': lexical error: invalid string in json text. (Yajl::ParseError) [ { name: "repository_url", type: " (right here) ------^ from /Users/admin/.rvm/gems/ruby-1.9.2-p318/gems/yajl-ruby-1.1.0/lib/yajl.rb:36:inparse'
from argtest.rb:51:in `

I get the same error if I'm using 1.8.7

Event ID needs to be in the archived JSON

I've noticed some duplicate events getting recorded and I'm trying to get them from a couple of different places. The event API seems to have an ID on every event. Is it possible to get that included in the archive for the sake of duplicate detection?

the example script to import data into SQLite db doens load the data correctly

I run the script, but it doesn't load the correct dat for pushEvent, it doesn't load the payload commit info.

UTF8 Errors

I'm using the script provided on the web site and I get UTF-8 issues every once in a while. One example is for Jan 1, 2012 09:00:

require 'open-uri'
require 'zlib'
require 'yajl'

gz = open('http://data.githubarchive.org/2012-01-01-9.json.gz')
js = Zlib::GzipReader.new(gz).read

Yajl::Parser.parse(js) do |event|
  puts event
end

I get the following error:

/Users/benbjohnson/.rvm/gems/ruby-1.9.3-p362/gems/yajl-ruby-1.1.0/lib/yajl.rb:36:in `parse': lexical error: invalid bytes in UTF8 string. (Yajl::ParseError)
          put.txt\" in gitignore eingef?gt"}],"head":"2987516b1873ff48
                     (right here) ------^
    from /Users/benbjohnson/.rvm/gems/ruby-1.9.3-p362/gems/yajl-ruby-1.1.0/lib/yajl.rb:36:in `parse'
    from ./gharchive.rb:8:in `<main>'

I ran iconv and it's showing a conversion issue too:

...
{"type":"PushEvent","repo":{"id":3039235,"url":"https://api.github.dev/repos/liro/lang","name":"liro/lang"},"created_at":"2012-01-01T09:58:57Z","payload":{"ref":"refs/heads/master","push_id":55767466,"commits":[{"sha":"2987516b1873ff48d04ebb5859f4bf19bb7fd5e7","author":{"email":"[email protected]","name":"Thomas Heck"},"url":"https://api.github.com/repos/liro/lang/commits/2987516b1873ff48d04ebb5859f4bf19bb7fd5e7","message":"\"testoutput.txt\" in gitignore eingef
iconv: 2012-01-01-9.json:5322:466: cannot convert

missing gzip compression for 2015

When I do

wget http://data.githubarchive.org/2015-01-01-15.json.gz

I get a plain JSON file that is not gzipped.

License Type Scraping

I am doing research on the licensing choices of developers on Github. Can anyone think of a way to scrape for the license type of repos? I am looking for relationships between project characteristics and license.

Any help would be much appreciated!

Same query working for web console, fails with other clients

Google BigQuery Web Console:

SELECT type, count(type) as count
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2014-12-01 00:00:00')
  AND repository_organization="GoogleCloudPlatform" 
  AND repository_name="kubernetes"
GROUP BY type

IssueCommentEvent   495  
2   PullRequestEvent    123  
3   PushEvent   61   
4   WatchEvent  153  
5   PullRequestReviewCommentEvent   200

abronte/BigQuery

irb(main):178:0> @bq_resp = @bq.query("
irb(main):179:1" WHERE PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2014-12-01 00:00:00')
irb(main):180:1"   AND repository_organization="GoogleCloudPlatform" 
irb(main):181:1"   AND repository_name="kubernetes"
irb(main):182:1" GROUP BY type
irb(main):183:1" LIMIT 5
irb(main):184:1" ")
SyntaxError: (irb):180: syntax error, unexpected tCONSTANT, expecting ')'
  AND repository_organization="GoogleCloudPlatform" 
                                                  ^
(irb):181: syntax error, unexpected tIDENTIFIER, expecting end-of-input
  AND repository_name="kubernetes"
                                 ^
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/console.rb:110:in `start'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/console.rb:9:in `start'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/commands_tasks.rb:68:in `console'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands.rb:17:in `'
    from bin/rails:4:in `require'
    from bin/rails:4:in `'
irb(main):185:0>

bq Command-Line Tool

lsoave@basenode:~$ bq query "
> WHERE PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2014-12-01 00:00:00')
>   AND repository_organization="GoogleCloudPlatform" 
>   AND repository_name="kubernetes"
> GROUP BY type
> LIMIT 5
> "
Error in query string: Error processing job 'eng-particle-738:bqjob_r73f7672adf44b8aa_0000014a20355b1b_1': Encountered " "WHERE" "WHERE "" at line 2,
column 1.
Was expecting:

pure Google::APIClient::KeyUtils:

irb(main):186:0> key = Google::APIClient::KeyUtils.load_from_pkcs12('./config/keys/github-zeitgeist-ca9d6017a8a7.p12', 'notasecret')
=> #
irb(main):187:0> client = Signet::OAuth2::Client.new(
irb(main):188:1*         :token_credential_uri => 'https://accounts.google.com/o/oauth2/token',
irb(main):189:1*         :audience => 'https://accounts.google.com/o/oauth2/token',
irb(main):190:1*         :scope => 'https://www.googleapis.com/auth/prediction',
irb(main):191:1*         :issuer => '[email protected]',
irb(main):192:1*         :signing_key => key)


irb(main):193:0> client.execute("
irb(main):194:1" WHERE PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2014-12-01 00:00:00')
irb(main):195:1"   AND repository_organization="GoogleCloudPlatform" 
irb(main):196:1"   AND repository_name="kubernetes"
irb(main):197:1" GROUP BY type
irb(main):198:1" LIMIT 5
irb(main):199:1" ")
SyntaxError: (irb):195: syntax error, unexpected tCONSTANT, expecting ')'
  AND repository_organization="GoogleCloudPlatform" 
                                                  ^
(irb):196: syntax error, unexpected tIDENTIFIER, expecting end-of-input
  AND repository_name="kubernetes"
                                 ^
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/console.rb:110:in `start'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/console.rb:9:in `start'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/commands_tasks.rb:68:in `console'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands/commands_tasks.rb:39:in `run_command!'
    from /home/lsoave/.rbenv/versions/2.1.5/lib/ruby/gems/2.1.0/gems/railties-4.2.0.rc1/lib/rails/commands.rb:17:in `'
    from bin/rails:4:in `require'
    from bin/rails:4:in `'
irb(main):200:0>

Use time of event instead of local time

Currently the crawler uses EventMachine's timer functions to rotate the output files. Because the clock of the machine that runs the crawler might not be 100% synced with the clock at Github, this leads to events ending up in the "wrong" file.

For example, the earliest event in the file "2012-03-20-20.json" happened at "2012/03/20 19:59:55 -0700" and should actually be in the file "2012-03-20-19.json". The latest event in the same file happened at "2012/03/20 20:59:52 -0700", but an event that happened on "2012/03/20 20:59:58 -0700" is wrongly written into the file "2012-03-20-21.json".

Instead of relying on the crawler's clock, the crawler could parse the "created_at" field that is part of each event's dictionary and use thw information to select the correct file to write to.

Invalid UTF-8

Hey there,

I discovered your project a few days ago: it's exactly what I needed! Thank you very much for your work :)

I spotted some UTF-8 encoding issues in the timeline table, any chance you're aware of them? I tried to run the crawler locally and don't think there is issue in the code. Which version of ruby are you using in production? Any chance we could force it to be > 2.1 in the Gemfile? Or add extra # encoding: utf-8 in the ruby script?

For instance, running the following query:

SELECT
  a.actor_attributes_login as login,
  a.actor_attributes_name as name,
  a.actor_attributes_company as company,
  a.actor_attributes_location as location,
  a.actor_attributes_blog AS blog,
  a.actor_attributes_email AS email
FROM [githubarchive:github.timeline] a
WHERE a.actor_attributes_login = 'asurak'
LIMIT 1

will produce

payload_pull_request_additions shows a timestamp

When I query [githubarchive:github.timeline], I see the field payload_pull_request_additions returning a timestamp rather than an integer.

Missing Event Types

I've noticed that in the 2011-06-23..28 period, there are 2+ million events whose type is just "Event":

Timestamp     | COUNT()
-----------------------
2011-06-23T01 |  13,953
2011-06-23T02 | 107,784
2011-06-23T03 | 128,189
2011-06-23T04 |  92,514
2011-06-23T13 |  38,931
2011-06-23T14 | 119,948
2011-06-23T15 |  98,372
2011-06-23T16 | 112,033
2011-06-23T17 |  98,080
2011-06-23T18 | 106,664
2011-06-25T04 |     578
2011-06-25T11 |  25,791
2011-06-25T12 |  61,996
2011-06-25T15 |  47,623
2011-06-25T16 |  36,010
2011-06-25T23 |  85,684
2011-06-26T02 |  55,254
2011-06-26T03 | 147,521
2011-06-26T04 |  32,689
2011-06-26T12 |  48,655
2011-06-26T13 |  48,568
2011-06-27T02 |   2,823
2011-06-27T03 |  43,694
2011-06-27T04 |  48,347
2011-06-27T13 |  63,557
2011-06-27T14 |  90,506
2011-06-27T15 |  46,909
2011-06-27T22 |  28,947
2011-06-27T23 | 133,341
2011-06-28T00 | 120,680
2011-06-28T01 |  90,489

Here are the first and last events where this occurs:

[
    {
        "actor": {
            "avatar_url": "https://secure.gravatar.com/avatar/07426bc321f9f519e7545e650c6cbe3b?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png",
            "gravatar_id": "07426bc321f9f519e7545e650c6cbe3b",
            "id": 13704,
            "login": "fbehrens",
            "url": "https://api.github.dev/users/fbehrens"
        },
        "created_at": "2011-06-23T01:59:59Z",
        "id": "1125860066",
        "payload": {
            "name": "rdomino",
            "object": "tag",
            "object_name": "v0.9.62"
        },
        "public": true,
        "repo": {
            "id": 663119,
            "name": "fbehrens/rdomino",
            "url": "https://api.github.dev/repos/fbehrens/rdomino"
        },
        "type": "Event"
    },
    {
        "actor": {
            "avatar_url": "https://secure.gravatar.com/avatar/959779a0672a83d1679c74a0afbc5ecd?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png",
            "gravatar_id": "959779a0672a83d1679c74a0afbc5ecd",
            "id": 15437,
            "login": "philly-mac",
            "url": "https://api.github.dev/users/philly-mac"
        },
        "created_at": "2011-06-28T01:00:00Z",
        "id": "1007290836",
        "payload": {
            "head": "d79a75cd5751e4a7467473e381e36e3e25973404",
            "push_id": 20201019,
            "ref": "refs/heads/master",
            "shas": [
                [
                    "d79a75cd5751e4a7467473e381e36e3e25973404",
                    "[email protected]",
                    "Added credentials file and changed the database to be more generic",
                    "Philip MacIver"
                ]
            ],
            "size": 1
        },
        "public": true,
        "repo": {
            "id": 842876,
            "name": "philly-mac/tvrss",
            "url": "https://api.github.dev/repos/philly-mac/tvrss"
        },
        "type": "Event"
    }
]

Is this an oversight in the original data? I would guess so, because of the api.github.dev endpoints.

Do all these events belong to the same event type..? If yes, it would be easy to fix this.

Thanks for making all this data available BTW, great stuff! 👍

Missing files.

I have checked out all files since 2012-2-12-00h. This is the list of all missing files:

2011-10-27-7.json.gz

2012-03-10-17.json.gz
2012-03-10-14.json.gz
2012-03-01-8.json.gz
2012-03-10-13.json.gz
2012-03-05-6.json.gz
2012-03-10-11.json.gz
2012-03-05-4.json.gz
2012-03-11-2.json.gz
2012-03-10-16.json.gz
2012-03-10-18.json.gz
2012-03-05-7.json.gz
2012-03-10-21.json.gz
2012-03-10-10.json.gz
2012-03-10-19.json.gz
2012-03-10-20.json.gz
2012-03-05-5.json.gz
2012-03-10-9.json.gz
2012-03-10-15.json.gz
2012-03-10-12.json.gz

2013-11-25-15.json.gz
2013-09-06-0.json.gz
2013-05-31-22.json.gz
2013-08-15-7.json.gz
2013-03-10-2.json.gz
2013-03-05-20.json.gz
2013-09-28-9.json.gz
2013-10-24-14.json.gz
2013-09-11-23.json.gz

2014-01-24-10.json.gz
2014-03-09-2.json.gz
2014-04-29-11.json.gz
2014-04-29-21.json.gz
2014-04-29-12.json.gz

Although the number of missing files is small compared to the total number of files, I would like to figure out exact history of GitHub.com.

Please check out those files again.

Dongsun.

github enterprise

will this ever be configurable for use with github enterprise?

Missing files in 2012

I'm trying to download the full archive for 2013, and saw that these files result in 404.

http://data.githubarchive.org/2012-03-01-8.json.gz
http://data.githubarchive.org/2012-03-05-4.json.gz
http://data.githubarchive.org/2012-03-05-5.json.gz
http://data.githubarchive.org/2012-03-05-6.json.gz
http://data.githubarchive.org/2012-03-05-7.json.gz
http://data.githubarchive.org/2012-03-10-9.json.gz
http://data.githubarchive.org/2012-03-10-10.json.gz
http://data.githubarchive.org/2012-03-10-11.json.gz
http://data.githubarchive.org/2012-03-10-12.json.gz
http://data.githubarchive.org/2012-03-10-13.json.gz
http://data.githubarchive.org/2012-03-10-14.json.gz
http://data.githubarchive.org/2012-03-10-15.json.gz
http://data.githubarchive.org/2012-03-10-16.json.gz
http://data.githubarchive.org/2012-03-10-17.json.gz
http://data.githubarchive.org/2012-03-10-18.json.gz
http://data.githubarchive.org/2012-03-10-19.json.gz
http://data.githubarchive.org/2012-03-10-20.json.gz
http://data.githubarchive.org/2012-03-10-21.json.gz
http://data.githubarchive.org/2012-03-11-2.json.gz

repository_closed_issues

Hi there,

I'd like to use the BigTable query API to create charts showing the number of closed and open issues for a project over time.

An example of the chart I'd like to make is here: http://jira.codehaus.org/browse/JRUBY

githubarchive.org currently stores repository_open_issues, but in order to do this I'd also need repository_closed_issues.

Then I should be able to run a query like this:

SELECT created_at, repository_open_issues, repository_closed_issues
FROM [githubarchive:github.timeline]
WHERE repository_owner="rails"
    AND repository_name="rails"
    AND type="IssuesEvent"
    AND TIMESTAMP(created_at) > TIMESTAMP("2013-03-14 10:13:55")
ORDER BY created_at DESC
LIMIT 100

I poked around the source code to see if I could do this myself and send a pull request. I imagine I'd have to add the new field to bigquery/schema.js, but there must be something else that needs to be done as well to ensure github actually includes this field whenever an issue is created, closed or reopened. I'm not sure where this would be.

Any pointers would be greatly appreciated!

Cheers,
Aslak

31 days in April in README.md

In your readme, you show examples of fetching data from github through HTTP:

Query Command

Activity for April 11, 2012 at 3PM PST wget http://data.githubarchive.org/2012-04-11-15.json.gz

Activity for April 11, 2012 wget http://data.githubarchive.org/2012-04-11-{0..23}.json.gz

Activity for April 2012 wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz

Query	Command
Activity for April 11, 2012 at 3PM PST	`wget http://data.githubarchive.org/2012-04-11-15.json.gz`
Activity for April 11, 2012	`wget http://data.githubarchive.org/2012-04-11-{0..23}.json.gz`
Activity for April 2012	`wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz`

However, there's an error that doesn't let me sleep at nights:

Activity for April 2012 wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz

Activity for April 2012	`wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz`

There are 30 days in april, not 31.

Missing Data 2013-11-25-15

http://data.githubarchive.org/2013-11-25-15.json.gz returns an xml error

JSON files missing events for 1 hour in transition from DST to winter time

It seems that events between 1:00 am PST and 2:00 am PST on the day from DST to winter time is missing. On the other hand, ~~JSON for 1:00 am PST on the day from winter time to DST holds events for two hours while~~ there is no JSON for 2:00 am in transition from winter time to DST.

https://github.com/zunda/emoticommits/blob/master/scratch/check-archive-url.rb shows the folloiwng for githubarchive JSONs:

http://data.githubarchive.org/2012-11-04-0.json.gz
  from 00:00PDT/07:00UTC to 01:00PDT/08:00UTC
http://data.githubarchive.org/2012-11-04-1.json.gz
  from 01:00PDT/08:00UTC to 01:00PST/09:00UTC # Mind the gap in UTC
http://data.githubarchive.org/2012-11-04-2.json.gz
  from 02:00PST/10:00UTC to 03:00PST/11:00UTC
http://data.githubarchive.org/2012-11-04-3.json.gz
  from 03:00PST/11:00UTC to 04:00PST/12:00UTC

http://data.githubarchive.org/2013-03-10-0.json.gz
  from 00:00PST/08:00UTC to 01:00PST/09:00UTC
http://data.githubarchive.org/2013-03-10-1.json.gz
  from 01:00PST/09:00UTC to 03:00PDT/10:00UTC
http://data.githubarchive.org/2013-03-10-2.json.gz
  404 Not Found
http://data.githubarchive.org/2013-03-10-3.json.gz
  from 03:00PDT/10:00UTC to 04:00PDT/11:00UTC

Description of Schema

Hello Ilya!
I am following the great work of
https://github.com/anvaka/ghindex

and I would like to play some experiments for recommendations against github repo.
Where can I find information about the contributors to a repo?
And a more 'semantic' question for a git expert :)
Which is the difference between "watchers" respect to "stargazers", I mean what kind of information may reflect upon a repo ?

I was following the examples to use big query, but i could not found a description of the fields in schema at:
https://bigquery.cloud.google.com/table/publicdata:samples.github_timeline

As example (see below):
the query described in the ghindex git by anvaka, he fetches stargazers from actor_attributes_login
... Where can i find a description that stargazers are contained in actor_attributes_login ?

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent' AND actor_attributes_login IN (
SELECT actor_attributes_login FROM [githubarchive:github.timeline]
WHERE type='WatchEvent'
GROUP BY actor_attributes_login HAVING (count() > 1) AND (count () < 500)
)
GROUP EACH BY repository_url, actor_attributes_login;

Archive from 2012-10 contains events from 2012-11

2012-10-31-23.json
Contains at the end records from 2012-11-01.

Possibly missing activities in BigQuery table

I have a query to return all closed pull requests for a repo. When I compare it with the Github website, I realize there are some pull requests missing.

Here is an example:
component/reactive#15 is a closed pull request

SELECT payload_pull_request_number as pr_id, created_at as created_at, payload_description as payload_description, payload_pull_request_state as payload_pull_request_state, payload_action as payload_action, type as type
FROM [githubarchive:github.timeline]
WHERE repository_url = 'https://github.com/component/reactive' AND payload_pull_request_number = 15
ORDER BY created_at DESC;

returns

Row	pr_id	created_at	payload_description	payload_pull_request_state	payload_action	type
1	15	2013-01-09 01:13:11	null	open	opened	PullRequestEvent

No close action activity =/

======end of issue report======

======begins digging around to figure out why======

The pull request was closed at 2013-01-16 10:29:29. So I went ahead and

wget http://data.githubarchive.org/2013-01-16-10.json.gz

The close action is found there, which makes me think that somehow some data didn't make it into the bigquery table.

So in for a penny in for a pound, I figured, for any hour, the activity count of the .json.gz file should match that returned by bigquery

wc -l 2013-01-16-10.json 
9440 2013-01-16-10.json

SELECT count(1)
FROM [githubarchive:github.timeline]
WHERE PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2013-01-16 18:00:00') AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2013-01-16 19:00:00');

The .json.gz file is for 2013-01-16-10 PST therefore +7 +8 hours to convert to the corresponding UTC hour in the query

Row	f0_
1	~~9851~~ 152

9440 vs ~~9851~~ 152. That's quite off.

Did I do something wrong?(very likely)
Or is there data inconsistency?

hourly archives returning ERROR 403:Forbidden since 2012-11-04

Temporary issue or has something changed?
Coincidental with the date that clocks rolled back?
Up until 2012-11-03-23 responds normally.

wget http://data.githubarchive.org/2012-11-04-01.json.gz
--2012-11-05 14:13:55-- http://data.githubarchive.org/2012-11-04-01.json.gz
Resolving data.githubarchive.org... 205.251.242.164
Connecting to data.githubarchive.org|205.251.242.164|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2012-11-05 14:13:55 ERROR 403: Forbidden.

google bigquery interface is returning only top 100 rows

I am using the following command to retrieve all the Java projects:

bq query 'SELECT repository_name, count(repository_name) as watches, repository_description, repository_url
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent"
AND repository_language="Java"
GROUP BY repository_name, repository_description, repository_url
ORDER BY watches DESC'

However, it is returning only 100 rows.

Improve BigQuery usage instructions.

I'm trying to play with BigQuery but I think the UI may have changed since these instructions were written:

add the project (name: "githubarchive"), or take a look at the 03/11..05/11 snapshot of the data under "publicdata:samples"

I'm not sure how to do this. Here's what I see:

@igrigorik, any advice?

publications as rss

I would like to know if you there is a way to have this info as an rss feed.
Thanks a lot, excellent work.

Files without \n separator from February

I have just downloaded data for February and found problems in the following files:
2012-02-02-21.json.gz
2012-02-02-22.json.gz
2012-02-03-1.json.gz
2012-02-07-0.json.gz
2012-02-07-1.json.gz
2012-02-07-18.json.gz
2012-02-08-16.json.gz
2012-02-08-17.json.gz
2012-02-08-21.json.gz
2012-02-11-3.json.gz
2012-02-13-20.json.gz
2012-02-16-17.json.gz
2012-02-16-18.json.gz
2012-02-16-21.json.gz
2012-02-16-22.json.gz

There is at least one case in each file where 2 records are not separated by new line.

BTW I can send you full report with probematic records if you need

I hope it helps :-)
Krzysztof

CommitCommentEvent issue

I fetched 2011-05-24-12.json.gz, un-gzip and opened in gedit. Object for CommitCommentEvent looked like this:

{
  "id": "1467004355",
  "actor": {
    "login": "davehunt",
    "avatar_url": "https://secure.gravatar.com/avatar/fd74178aadc963ffc6397ad1e22d8ce7?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png",
    "url": "https://api.github.dev/users/davehunt",
    "id": 122800,
    "gravatar_id": "fd74178aadc963ffc6397ad1e22d8ce7"
  },
  "payload": {
    "commit": "9053eaa86e31f35d9c5f3617541815d9c7d6d7e3",
    "actor_gravatar": "fd74178aadc963ffc6397ad1e22d8ce7",
    "comment_id": 396736,
    "actor": "davehunt",
    "repo": "bebef1987/Addon-Tests"
  },
  "created_at": "2011-05-24T12:01:13Z",
  "public": true,
  "type": "CommitCommentEvent",
  "repo": {
    "name": "bebef1987/Addon-Tests",
    "url": "https://api.github.dev/repos/bebef1987/Addon-Tests",
    "id": 1792058
  }
}

But it should be


{
    "type": "CommitCommentEvent",
    "actor": {
        "avatar_url": "https://secure.gravatar.com/avatar/5b45540ae377ec54a071f313b7193a27?d=http://github.dev%2Fimages%2Fgravatars%2Fgravatar-user-420.png",
        "url": "https://api.github.dev/users/dabrahams",
        "login": "dabrahams",
        "id": 44065,
        "gravatar_id": "5b45540ae377ec54a071f313b7193a27"
    },
    "repo": {
        "url": "https://api.github.dev/repos/wahikihiki/boost-modularize",
        "id": 2133453,
        "name": "wahikihiki/boost-modularize"
    },
    "public": true,
    "payload": {
        "comment": {
            "html_url": "https://github.com/wahikihiki/boost-modularize/commit/a339f625e4#commitcomment-830037",
            "commit_id": "a339f625e492d21926c449c17269c4d77e94f78a",
            "url": "https://api.github.com/repos/wahikihiki/boost-modularize/comments/830037",
            "updated_at": "2012-01-01T00:03:11Z",
            "body": "I think you closed the wrong issue here.",
            "user": {
                "avatar_url": "https://secure.gravatar.com/avatar/5b45540ae377ec54a071f313b7193a27?d=https://a248.e.akamai.net/assets.github.com%2Fimages%2Fgravatars%2Fgravatar-140.png",
                "url": "https://api.github.com/users/dabrahams",
                "login": "dabrahams",
                "id": 44065,
                "gravatar_id": "5b45540ae377ec54a071f313b7193a27"
            },
            "position": null,
            "id": 830037,
            "path": null,
            "created_at": "2012-01-01T00:03:11Z",
            "line": null
        }
    },
    "id": "1508512415",
    "created_at": "2012-01-01T00:03:11Z"
}

What is wrong? :/

BigQuery broken links

Hi Ilya,

Is the BigQuery dataset linked from the home page (https://bigquery.cloud.google.com/dataset/githubarchive) no longer available?

Thanks

repo or repository?

Hi,

I'm afraid I found another problem.

There are 2 different formats in file: 2012-03-11-0.json.gz

Sometimes the key for repository info has name repo. Like in this example:
{
"repo": {
"id": 3889255,
"url": "https://api.github.dev/repos/azonwan/rable",
"name": "azonwan/rable"
},
...
}

but sometimes the key name is repository. Like in this example:
{
"repository": {
"url": "https://github.com/hmlovely/localhost",
"has_downloads": true,
"created_at": "2012/02/17 20:46:18 -0800",
"has_issues": true,
"description": "本地demo服务器",
"forks": 1,
"fork": false,
"has_wiki": true,
"homepage": "",
"size": 1072,
"private": false,
"name": "localhost",
"owner": "hmlovely",
"open_issues": 0,
"watchers": 2,
"pushed_at": "2012/03/10 23:59:59 -0800",
"language": "JavaScript"
},
...
}

Is there any reason for this?

Thanks
Krzysztof

Email updates?

Hey, I love your project! I've been using your daily digest emails to stay on top of what's trending, but I haven't gotten any emails since Dec 31st. Are you still running those? Cuz you should. They rock.

Events appear twice

I ran the following query to list all WatchEvent of repository textmate/textmate by date:

SELECT actor_attributes_login, type, repository_name, repository_owner, created_at as date
FROM [githubarchive:github.timeline]
WHERE (type="WatchEvent")
    AND repository_name="textmate"
    AND repository_owner="textmate"
    AND payload_action="started"
ORDER BY date DESC

I think there are 2 possible issues with the results of this query:

The query returns 12,571 results, while in the repository page (https://github.com/textmate/textmate) there are only 1,215 watchers.
The results include many actor_attributes_login duplicates.

Your help is much appreciated in understanding this mismatch.

Thanks :)

Provide a mapping between GitHub API and timeline of BigQuery

I am having a hard time answering the question such as:

Where is the field "forkee" of ForkEvent mapped into the timeline?

Query URL's Not Working For Date Ranges

On the githubarchive.org website, we can see options for querying the server using a timestamp or ranges.

Activity for April 11, 2012, 3PM UTC wget http://data.githubarchive.org/2012-04-11-15.json.gz
Activity for April 11, 2012 wget http://data.githubarchive.org/2012-04-11-{0..23}.json.gz
Activity for April 2012 wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz

However, when I include a range in my query (ex. http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz), I receive an XML document containing an error message.

<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
</Error>

The XML bit only appears in the browser, while the same request through wget would return a 404 (perhaps a URL encoding issue from my shell).

Has anyone experienced this problem and found a solution for it?

What if some fields don't exist in some recordes

I found the following sample BigQuery in the readme file:

/* distribution of different events on GitHub for Ruby */
SELECT type, count(type) as cnt
FROM [githubarchive:github.timeline]
WHERE repository_language="Ruby"
GROUP BY type
ORDER BY cnt DESC

But I also found that there are a lot of records whose repository.language doesn't exist. For example, I import the 2013-09-28-23.json to mongoDB as a collection called test:
I do the following query:

These records will affect the Query set.

Incorrect instructions for grabbing entire month of april

wget http://data.githubarchive.org/2012-04-{01..31}-{0..23}.json.gz

01..31 does not work as expected. It produces 0..31, so 0-9 are not downloaded.

README does not specify that hours should not be zero-padded

Zero-padded

--2012-08-10 03:29:42--  http://data.githubarchive.org/2012-05-11-03.json.gz
Resolving data.githubarchive.org... 207.171.163.151
Connecting to data.githubarchive.org|207.171.163.151|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2012-08-10 03:29:42 ERROR 403: Forbidden.

Not zero padded

--2012-08-10 03:42:21--  http://data.githubarchive.org/2012-05-11-3.json.gz
Resolving data.githubarchive.org... 207.171.163.161
Connecting to data.githubarchive.org|207.171.163.161|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1124617 (1.1M) [application/json]
Saving to: `2012-05-11-3.json.gz'

Inconsistent Schema?

I've been fiddling with the archived data and I noticed something weird (again):

[
    {
        "actor": {
            "avatar_url": "https://secure.gravatar.com/avatar/36c356edfda9c51838bb9013a83e746a?d=%2Fimages%2Fgravatars%2Fgravatar-user-420.png",
            "gravatar_id": "36c356edfda9c51838bb9013a83e746a",
            "id": 1117087,
            "login": "jamband",
            "url": "https://api.github.com/users/jamband"
        },
        "created_at": "2012-03-10T08:00:00Z",
        "id": "1528569590",
        "payload": {
            "description": "",
            "master_branch": "master",
            "ref": "contain_auth",
            "ref_type": "branch"
        },
        "public": true,
        "repo": {
            "id": 3634769,
            "name": "jamband/testProject",
            "url": "https://api.github.com/repos/jamband/testProject"
        },
        "type": "CreateEvent"
    },
    {
        "actor": "moloch--",
        "actor_attributes": {
            "blog": "http://rootthebox.com",
            "company": "[Buffer]Overflow",
            "email": "[email protected]",
            "gravatar_id": "a65909227f4385974f7af051e4ac3e8d",
            "location": "Earth",
            "login": "moloch--",
            "name": "Moloch",
            "type": "User"
        },
        "created_at": "2012/03/10 22:36:55 -0800",
        "payload": {
            "description": "Game of Hackers",
            "master_branch": "master",
            "ref": "master",
            "ref_type": "branch"
        },
        "public": true,
        "repository": {
            "created_at": "2012/03/10 22:30:48 -0800",
            "description": "Game of Hackers",
            "fork": false,
            "forks": 1,
            "has_downloads": true,
            "has_issues": true,
            "has_wiki": true,
            "homepage": "http://rootthebox.com",
            "name": "RootTheBox",
            "open_issues": 0,
            "owner": "moloch--",
            "private": false,
            "pushed_at": "2012/03/10 22:36:55 -0800",
            "size": 0,
            "url": "https://github.com/moloch--/RootTheBox",
            "watchers": 1
        },
        "type": "CreateEvent",
        "url": "https://github.com/moloch--/RootTheBox/compare/master"
    }
]

These events are sequential as returned by githubarchive.org. The date format of the second event does not follow what is defined in the API spec, and the crawler won't be able to parse it. GitHub being GitHub I would image they would take more dilligence following their own specification, but hey...

The second, most important question I wanted to ask is related to the id key of the event: the first event above is the last one where it's ever present. Is the crawler still using the events API endpoint? Because I can still see id there (although the repository is still called repo and not repository).