Reporting Analytics using AWS QuickSight | Lambda CSV -> Parquet
This is the story of building a data lake and a new reporting platform using AWS QuickSight.
There are very good articles from AWS on building a reporting platform, so I won't repeat them here.
Ref: https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html
Situation
Our existing ‘very expensive’ platform had limited flexibility and required subject matter expertise in the platform itself to build reports.
Task
We were in search of a new reporting and analytics platform. It had to be highly flexible and scalable, cost effective, and easy to develop on, with strong data visualization.
Action
I zeroed in on AWS QuickSight, since its cost model (Enterprise edition), flexibility, and SPICE capacity approach mapped well to our requirements.
Challenges:
1. ETL from various data sources
One of the main challenges was building the data lake from on-premise data sources (such as SQL Server, mainframe, Oracle DB, file systems, and application APIs) and external data sources (such as Workday, Azure SQL, and Jira).
I chose Apache Airflow as the ETL orchestration platform, mainly because we were already building ETL jobs in Python and Airflow is easy to maintain.
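To give a flavor of those jobs, here is a minimal sketch of an ingestion DAG. The DAG id, bucket name, and stand-in DataFrame are illustrative, and it assumes the AWS SDK for pandas (awswrangler) is available to the workers:

```python
from datetime import datetime

import awswrangler as wr
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

LAKE_BUCKET = "example-data-lake-csv"  # hypothetical CSV landing bucket


def extract_to_s3():
    # Stand-in for a real extract: our actual jobs pull from SQL Server,
    # Oracle, APIs, etc. and land the result as CSV in the lake bucket.
    df = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]})
    wr.s3.to_csv(df, f"s3://{LAKE_BUCKET}/sales/sales.csv", index=False)


with DAG(
    dag_id="ingest_sales_csv",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
```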
2. Converting CSV to Parquet was the ideal scenario, since we were going to use Athena to query the data lake (we started with direct ingestion of CSVs into QuickSight SPICE, but that quickly became unmanageable).
We separated data lake ingestion from generating the Parquet-formatted files used for Athena queries, mainly because the CSV files in S3 were shared with other applications in our data lake ecosystem.
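For a sense of the query side, this is roughly how the Parquet data can be read through Athena from Python. The database and table names are hypothetical and assume the Parquet files have been cataloged (for example by a Glue crawler):

```python
import awswrangler as wr

# Hypothetical database/table registered over the Parquet reporting
# bucket (e.g. by a Glue crawler); names are illustrative.
df = wr.athena.read_sql_query(
    sql="SELECT status, COUNT(*) AS n FROM sales GROUP BY status",
    database="reporting_lake",
)
print(df.head())
```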
I configured an S3 trigger that invokes a Lambda function to convert each CSV file to Parquet and place it in a separate S3 bucket used for reporting and analytics.
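A minimal sketch of that Lambda handler, assuming the AWS SDK for pandas (awswrangler) is attached as a layer, with placeholder bucket names:

```python
import urllib.parse

import awswrangler as wr  # AWS SDK for pandas, attached as a Lambda layer

DEST_BUCKET = "example-reporting-parquet"  # hypothetical reporting bucket


def lambda_handler(event, context):
    # One record per s3:ObjectCreated:* notification from the CSV bucket
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly landed CSV and rewrite it as a single Parquet
        # file under the same key layout; a plain S3 PUT to the same key
        # cleanly replaces any earlier Parquet for a re-delivered CSV
        df = wr.s3.read_csv(f"s3://{src_bucket}/{key}")
        dest_key = key.rsplit(".", 1)[0] + ".parquet"
        wr.s3.to_parquet(df=df, path=f"s3://{DEST_BUCKET}/{dest_key}")
```

The S3 event notification can be filtered on the .csv suffix so unrelated objects landing in the bucket don't invoke the function.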
Why not Glue Jobs?
- Glue jobs essentially transform a CSV-based catalog table into Parquet, and I didn't want to create CSV tables just for that.
- Glue jobs must be added to a Glue workflow to be triggered, and a workflow limits how many jobs can fire for an event. With Lambda we could create Parquet files as each CSV arrived.
- When converting CSV to Parquet, Glue jobs don't handle replacing existing Parquet files well, which corrupts the dataset on re-runs.