Project - QuickBase Data Pipeline
Overview
This project is a consolidated Extract, Transform, Load (ETL) data pipeline supporting a program that builds maps for the aviation industry. It extracts data from a specific source on a schedule, performs extensive cleaning, feature development, and transformation, then imports the result into a QuickBase environment. This ensures the data is ready and usable for operations, relieving users of manual data handling. The pipeline is also designed for customization: updates to either the data or the associated logic are easy to make and can be deployed immediately.
Features
- Operates in a secure, isolated Amazon Web Services (AWS) account. All identity and access management is customized, and the account is pre-configured with all permissions, policies, and safeguards in place. If the account is transferred to a new owner, the pipeline can be operational immediately.
- A multi-step, modularized process with data integrity checks and stop-gap measures at every step ensures that any issues with the data are addressed immediately; users are notified when the data schema changes or the source is unavailable (see the validation sketch after this list).
- Processing only occurs when the data updates, a major enhancement over previous versions that delivers significant cost savings.
- Easy-to-understand logs that allow management and engineering to track processing and the status of each run.
- Serverless operation with AWS Step Functions and AWS Lambda, providing scalability and flexibility with no infrastructure to manage.
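Below is a minimal sketch of the kind of integrity check and notification described above. The required columns and SNS topic ARN are hypothetical placeholders, not the pipeline's actual schema or alerting setup.

```python
import boto3
import pandas as pd

# Hypothetical values for illustration; the real pipeline's schema and topic differ.
REQUIRED_COLUMNS = {"record_id", "airport_code", "effective_date"}
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"

sns = boto3.client("sns")

def validate_schema(df: pd.DataFrame) -> bool:
    """Return True if the dataset matches the expected schema; notify users otherwise."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="Data pipeline: schema change detected",
            Message=f"Source dataset is missing expected columns: {sorted(missing)}",
        )
        return False
    return True
```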
Technical Overview
There are three stages to the data pipeline, built on a combination of Amazon EventBridge, AWS Step Functions, and AWS Lambda.

The first stage extracts the data. An EventBridge Scheduler call triggers a Lambda function that downloads data from a set of source databases and places it into an S3 bucket with versioning enabled, so a copy of each dataset is retained.

The second stage cleans and transforms the data. We query the new dataset to determine whether it has changed from the previous version. If it has, a sequence of Lambda (Python) scripts coordinated by AWS Step Functions runs the cleaning and transformation modules. If nothing has changed, we log that result and skip those modules, saving significant compute costs.
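As a rough illustration of the change check, the sketch below compares the two most recent versions of an object in the versioned bucket. The bucket and key names are placeholders, and the real pipeline's comparison logic may differ.

```python
import hashlib
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key names for illustration only.
BUCKET = "aviation-pipeline-raw"
KEY = "source/latest.csv"

def dataset_changed(bucket: str = BUCKET, key: str = KEY) -> bool:
    """Compare the two most recent versions of an object in a versioned bucket."""
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", [])
    versions = [v for v in versions if v["Key"] == key]
    if len(versions) < 2:
        return True  # no prior version to compare against, so treat as changed

    def digest(version_id: str) -> str:
        body = s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)["Body"].read()
        return hashlib.sha256(body).hexdigest()

    latest, previous = versions[0], versions[1]  # versions are listed newest first
    return digest(latest["VersionId"]) != digest(previous["VersionId"])
```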
In the third and final stage, we format the dataset for transfer via the QuickBase API. This step breaks the data into its fundamental components, updates the data within QuickBase, and tracks which records are appended, updated, or deleted.
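The sketch below shows one way this transfer could look using the QuickBase JSON API's records endpoint. The realm, token, table ID, and field IDs are placeholders, and the actual formatting and delete-tracking logic is more involved.

```python
import requests

# Placeholder values for illustration; the real realm, token, table, and field IDs differ.
QB_REALM = "example.quickbase.com"
QB_TOKEN = "QB-USER-TOKEN my-token"
TABLE_ID = "bxxxxxxxx"

def upsert_records(rows: list[dict]) -> dict:
    """Upsert rows into a QuickBase table and return the API's change metadata."""
    payload = {
        "to": TABLE_ID,
        # Each row maps QuickBase field IDs to values, e.g. {"6": {"value": "JFK"}}.
        "data": [{fid: {"value": val} for fid, val in row.items()} for row in rows],
    }
    response = requests.post(
        "https://api.quickbase.com/v1/records",
        headers={
            "QB-Realm-Hostname": QB_REALM,
            "Authorization": QB_TOKEN,
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    # The response metadata reports which records were created, updated, or unchanged.
    return response.json().get("metadata", {})
```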
All of these processes are tracked with Amazon CloudWatch, which provides detailed application and process monitoring. Customized logging distills the vast amount of information contained within CloudWatch into views tailored for both engineering and management.
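For illustration, the pipeline's logging could emit structured entries like the following so that CloudWatch Logs Insights can filter by run and stage; the field names here are assumptions, not the exact log format.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_stage(run_id: str, stage: str, status: str, detail: str = "") -> None:
    """Emit a single JSON log line that CloudWatch Logs Insights can query by field."""
    logger.info(json.dumps({
        "run_id": run_id,      # identifier shared across all stages of one run
        "stage": stage,        # e.g. "extract", "transform", "load"
        "status": status,      # e.g. "started", "skipped", "succeeded", "failed"
        "detail": detail,
    }))

# Example: record that the transform stage was skipped because the data was unchanged.
log_stage("2024-05-01T06:00", "transform", "skipped", "no change detected in source data")
```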
Cost Efficiency
Cost efficiency is a core pillar of AWS' Well-Architected Framework, and our data pipeline adheres to best practices. Initially, we tested a custom Virtual Machine (VM), AWS Managed Workflows for Apache Airflow (MWAA), and EventBridge/Lambda. VMs proved difficult to maintain, especially during debugging and refinement. While MWAA worked well initially, it had two significant drawbacks. First, occasional memory bursts required adding more worker nodes, driving costs to nearly $2,000 per month for a single pipeline. Second, MWAA lacked a simple on/off switch (possible, but only with significant effort and custom containers), causing unnecessary costs during off hours. After substantial testing, we transitioned to EventBridge, Step Functions, and Lambda, reducing our monthly costs roughly twentyfold: from nearly $2,000 with MWAA to under $100 with Lambda.
Future Planning
The data pipeline we developed is complete, with no further enhancements planned. However, the core concepts and architecture used in this pipeline are being applied to other ongoing data pipeline projects across different domains. This allows us to leverage the lessons learned and optimize cost efficiency, scalability, and automation in new implementations.