I recently thought it would be cool to show a list, here on the blog, of all the projects that I've contributed to on Github. I googled around a bit and found this Stack Overflow answer.
It turns out that Github has a project named Github Archive which stores all public activity on Github dating back to 2011-02-12. What is even cooler is that this dataset is available on Google Big-query and it's updated by the hour!
I looked through the documentation of the event I was interested in
PullRequestEvent) and modified the query from the SO answer
slightly to also return the programming language of the repository and
the date of the my latest pull request. I ended up with the following
query. Keep in mind that this query only works for data
before 2015. See note at the bottom for more information.
SELECT repository_url, LAST(created_at) AS last_pr, LAST(repository_language) as language FROM [githubarchive:github.timeline] WHERE payload_pull_request_user_login = 'mads379' GROUP BY repository_url ORDER BY last_pr DESC;
The results of the query is shown in the table below
The documentation for the BigQuery query language is here. I'm definitely going to play around with this some more. Let me know if you come up with any other fun queries.
(Added 2015-02-05) Felipe Hoffa (@felipehoffa) mentioned on Twitter that the database layout for the archive was changed in 2015. You can find more information about it in the announcement here. I personally prefer this format as the JSON from the Github Archive documentation is stored directly in the tables which means I don't have to guess the column names for the various nested fields.
SELECT repo_name, JSON_EXTRACT(payload, '$.pull_request.base.repo.language') as language, JSON_EXTRACT(payload, '$.pull_request.updated_at') as updated_at FROM [githubarchive:month.201501] WHERE type = 'PullRequestEvent' AND JSON_EXTRACT(payload, '$.pull_request.user.login') = 'mads379'