Query GitHub With SQL
AskGit now supports querying the GitHub GraphQL API with SQL, which unlocks some interesting use cases 📊
The latest AskGit release (v0.4.2) adds support for SQL queries that can access the GitHub API (v4). (Astute followers will note that GitHub API tables were available in AskGit in “preview”, but were removed when we did some re-architecting and are now being re-added 🙂).
In short, you can now run queries using the following SQL functions:
- `github_stargazer_count`
- `github_repo_pull_requests` (coming soon)
You’ll need to supply a GitHub authentication token as an environment variable when using the CLI.
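With a token in place, a first query might look something like the following. This is a sketch only: the argument format for `github_stargazer_count` is an assumption here, so check the project README for the real signature.

```sql
-- Hypothetical argument format; the function may take owner and name separately
SELECT github_stargazer_count('some-org/some-repo') AS stars;
```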
So how is this useful?
The GitHub (GraphQL) API is powerful and expressive on its own - but to do more complex filtering, aggregating, and paginating you’ll likely need to script on top of it. Using AskGit SQL, you can avoid writing a bespoke script (that implements those API calls, pagination, custom aggregations, filters) - and use SQL instead.
In fact, by compiling AskGit to a runtime-loadable SQLite extension, you can write scripts that use AskGit functionality in any language (that has a SQLite driver) - but that’s a post for another time.
What can I query?
We’ll be publishing some use-cases in the coming weeks as this functionality settles, but for now, here are some queries you could try:
- For large, public GitHub orgs, slice and dice repos by `primary_language`, `license_name`, and `created_at`/`updated_at`. Is Google creating more Go repositories over time? What licenses does Microsoft typically use? These orgs have a lot of repos, and looking at some basic stats in aggregate could be interesting.
- What’s the average time between opening an issue and closing it on a particular project? Could this be a proxy for the “liveness” or “responsiveness” of the maintainers? (Do better-maintained projects close issues faster?)
- Similarly, what’s the average time between opening and merging a pull request? Others have called this (or something like it) “cycle time.” Once the `github_repo_pull_requests` table is available, you should be able to roll your own “cycle time” metrics.
- How many people starred my repo this week? This month? In the last hour?
- Among the people who starred a repo, what other repos were most commonly starred?
- Show me the top repos in an org by star growth rate (as opposed to total number of stars).
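A couple of the ideas above, sketched as SQL. The table names, invocation style, and column names here are assumptions for illustration; consult the project README for the real schema:

```sql
-- Hypothetical: a github_org_repos table-valued function is an assumption
SELECT primary_language, COUNT(*) AS repo_count
FROM github_org_repos('google')
GROUP BY primary_language
ORDER BY repo_count DESC;

-- Hypothetical "cycle time": created_at/merged_at column names are assumptions
SELECT AVG(julianday(merged_at) - julianday(created_at)) AS avg_days_to_merge
FROM github_repo_pull_requests('some-org', 'some-repo')
WHERE merged_at IS NOT NULL;
```

Since AskGit is built on SQLite, date arithmetic like `julianday` comes along for free.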
Constraint Push Down
As noted in the README:
Constraints in your SQL query are pushed to the GitHub API as much as possible. For instance, if your query includes an `ORDER BY` clause and if items can be ordered in the GitHub API response (on the specified column), your query can avoid doing a full table scan and rely on the ordering returned by the API.
We try our best to optimize your query with constraints “pushed down” into the API request. Sometimes it’s not possible, and there’s still room for improvement, but the basic cases should be covered for now 🙂.
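For example, a query like the following could, in principle, map onto the API’s own ordering and pagination rather than a full scan. As before, the table name and columns here are assumptions for illustration:

```sql
-- If the API can sort by created_at, this ORDER BY + LIMIT may be pushed
-- down into the API request instead of fetching every repo
SELECT name, created_at
FROM github_org_repos('microsoft')
ORDER BY created_at DESC
LIMIT 10;
```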
Export a Table for Faster Queries
You can use the `askgit export` command to save the results of longer-running queries to a SQLite database file. This makes it easier to work with data that takes some time to fetch (lots of API calls, for instance). If you know ahead of time the set of tables you want to work with (all the repos in a particular org, all the issues in a particular repo, etc.), you can (and should) export those query results to a SQLite database file, then write ad-hoc queries against the db file directly.
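Once exported, the file is a plain SQLite database, so ad-hoc queries don’t touch the API at all. A sketch, assuming you exported an org’s repos into a table called `repos` (the actual table name depends on how you ran the export):

```sql
-- Plain SQLite against the exported file; no API calls, no token needed
SELECT license_name, COUNT(*) AS repo_count
FROM repos
GROUP BY license_name
ORDER BY repo_count DESC;
```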
You could check out sqliteviz as well 🎉. See here for an example.
The demo site will be updated soon with GitHub API table support 🚀
Stay up to date on more AskGit releases and feature announcements by subscribing below 🙂, if you haven’t already.