It turns out that Github has a project named Github Archive which stores all public activity on Github dating back to 2011-02-12. What is even cooler is that this dataset is available on Google Big-query and it’s updated by the hour!
I looked through the
of the event I was interested in (
PullRequestEvent) and modified the query
from the SO answer slightly to also return the programming language of the
repository and the date of the my latest pull request. I ended up with the
following query. Keep in mind that this query only works for data before 2015.
See note at the bottom for more information.
SELECT repository_url, LAST(created_at) AS last_pr, LAST(repository_language) as language FROM [githubarchive:github.timeline] WHERE payload_pull_request_user_login = 'mads379' GROUP BY repository_url ORDER BY last_pr DESC;
The results of the query is shown in the table below
The documentation for the BigQuery query language is here. I’m definitely going to play around with this some more. Let me know if you come up with any other fun queries.
(Added 2015-02-05) Felipe Hoffa (@felipehoffa) mentioned on Twitter that the database layout for the archive was changed in 2015. You can find more information about it in the announcement here. I personally prefer this format as the JSON from the Github Archive documentation is stored directly in the tables which means I don’t have to guess the column names for the various nested fields.
SELECT repo_name, JSON_EXTRACT(payload, '$.pull_request.base.repo.language') as language, JSON_EXTRACT(payload, '$.pull_request.updated_at') as updated_at FROM [githubarchive:month.201501] WHERE type = 'PullRequestEvent' AND JSON_EXTRACT(payload, '$.pull_request.user.login') = 'mads379'