Github Archive

by Mads Hartmann - 05 Feb 2015

I recently thought it would be cool to show a list, here on the blog, of all the projects that I've contributed to on Github. I googled around a bit and found this Stack Overflow answer.

It turns out that Github has a project named Github Archive which stores all public activity on Github dating back to 2011-02-12. What is even cooler is that this dataset is available on Google Big-query and it's updated by the hour!

I looked through the documentation of the event I was interested in (PullRequestEvent) and modified the query from the SO answer slightly to also return the programming language of the repository and the date of the my latest pull request. I ended up with the following query. Keep in mind that this query only works for data before 2015. See note at the bottom for more information.

  LAST(created_at) AS last_pr,
  LAST(repository_language) as language
FROM [githubarchive:github.timeline]
WHERE payload_pull_request_user_login = 'mads379'
GROUP BY repository_url
ORDER BY last_pr DESC;

The results of the query is shown in the table below

repository_url last_pr language 2014-11-05 OCaml 2014-10-10 Emacs Lisp 2014-10-07 Erlang 2014-08-26 OCaml 2014-05-08 OCaml 2013-08-17 PHP 2013-08-02 Scala 2013-01-26 Scala 2013-01-04 Scala 2012-12-11 JavaScript 2012-08-11 C 2012-08-06 null 2012-08-05 Scala 2012-06-02 Ruby 2012-05-31 JavaScript 2012-04-27 Python

The documentation for the BigQuery query language is here. I'm definitely going to play around with this some more. Let me know if you come up with any other fun queries.

(Added 2015-02-05) Felipe Hoffa (@felipehoffa) mentioned on Twitter that the database layout for the archive was changed in 2015. You can find more information about it in the announcement here. I personally prefer this format as the JSON from the Github Archive documentation is stored directly in the tables which means I don't have to guess the column names for the various nested fields.

  JSON_EXTRACT(payload, '$.pull_request.base.repo.language') as language,
  JSON_EXTRACT(payload, '$.pull_request.updated_at') as updated_at
FROM [githubarchive:month.201501]
  type = 'PullRequestEvent' AND
  JSON_EXTRACT(payload, '$.pull_request.user.login') = 'mads379'