Data Analytics for GCD

Using SQL to analyze GCD snapshots

The gcd-etl project builds one partition of a denormalized Hive table (stored in Parquet format) for each GCD data dump. For performance, a Presto SQL query engine sits in front of Hive, and a self-hosted Redash instance provides the front end for authoring queries and dashboards.

To get a sense of the types of analysis possible with this dataset, check out these sample queries.

You can learn more about the dataset and see the schema on the about page.
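
As a rough illustration of the partition-per-dump layout described in this section, the sketch below filters a single denormalized table by snapshot and aggregates it. It is not the actual gcd-etl schema or a Presto client: the table name `gcd_issues`, its columns, the sample rows, and the helper `issues_per_publisher` are all hypothetical, and an in-memory SQLite database stands in for Presto and Hive so that the example runs anywhere.

```python
import sqlite3

# Stand-in for the denormalized Hive table. This is SQLite, NOT Presto,
# and every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gcd_issues (
        snapshot_date TEXT,  -- plays the role of the per-dump partition key
        publisher     TEXT,
        series        TEXT,
        issue_number  TEXT
    )
    """
)

# Made-up rows for two data dumps; the later dump has one more issue.
rows = [
    ("2020-01-01", "Publisher A", "Series X", "1"),
    ("2020-01-01", "Publisher A", "Series X", "2"),
    ("2020-01-01", "Publisher B", "Series Y", "1"),
    ("2020-02-01", "Publisher A", "Series X", "1"),
    ("2020-02-01", "Publisher A", "Series X", "2"),
    ("2020-02-01", "Publisher A", "Series X", "3"),
    ("2020-02-01", "Publisher B", "Series Y", "1"),
]
conn.executemany("INSERT INTO gcd_issues VALUES (?, ?, ?, ?)", rows)


def issues_per_publisher(connection, snapshot_date):
    """Count issues per publisher within one snapshot (one partition)."""
    query = """
        SELECT publisher, COUNT(*) AS issue_count
        FROM gcd_issues
        WHERE snapshot_date = ?
        GROUP BY publisher
        ORDER BY issue_count DESC, publisher
    """
    return connection.execute(query, (snapshot_date,)).fetchall()


# Running the same query against two snapshots shows how the data changed.
jan = issues_per_publisher(conn, "2020-01-01")
feb = issues_per_publisher(conn, "2020-02-01")
print(jan)
print(feb)
```

Because the table is denormalized, the query needs no joins, and filtering on the snapshot column restricts it to the rows of a single data dump. In a partitioned Hive table, that kind of filter is what lets the query engine skip the other partitions.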

Using Imhotep and IQL to analyze GCD snapshots

The gcd-etl project also produces a similar dataset in the Flamdex format used by Imhotep, an open-source project from Indeed. The Flamdex format is more efficient than Parquet, but Imhotep supports a narrower range of query types.

You can peruse the equivalent sample queries for Imhotep, written in the SQL-like Imhotep Query Language (IQL).

Note that there is no longer an active Imhotep cluster serving this data.