Data Pipeline into QuickBase

This is a data engineering pipeline we built to source data into QuickBase.

Overview

This project is a consolidated Extract, Transform, Load (ETL) data pipeline that supports a program building maps for the aviation industry. On a schedule, it extracts data from a specific source, performs extensive cleaning, feature development, and transformation, then imports the result into a QuickBase environment. This ensures the data is ready and usable for operations, relieving users of manual data preparation. The pipeline is also designed for customization: necessary updates to either the data or the associated logic are easy to make and can be deployed immediately.

Features

  1. Operates in a secure, isolated Amazon Web Services account. Identity and access management is fully customized, and the account is pre-configured with all permissions, policies, and safeguards in place. If the account is transferred to a new owner, the pipeline can be operational immediately.
  2. Multi-step, modularized process with data integrity checks and stopgap measures at every step, ensuring any issues with the data are addressed immediately; users are notified when the data schema changes or the source is unavailable (see the sketch after this list).
  3. Processing only runs when the source data actually changes; this is a major enhancement over previous versions and delivers significant cost savings.
  4. Easy-to-understand logs that allow management and engineering to track processing and the status of each run.
  5. The most up-to-date version of Apache Airflow for scheduling down to the second.
  6. Serverless operations with AWS Step Functions and AWS Lambda for elastic scalability and flexibility in operations.
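
The sketch below illustrates the kind of integrity check described in item 2. It is a minimal example, not the production code: the expected column set, the SNS topic ARN, and the function name are placeholders chosen for illustration.

```python
import json

import boto3

# Illustrative placeholders -- the real pipeline defines its own expected
# schema and alert topic in configuration.
EXPECTED_COLUMNS = {"id", "name", "latitude", "longitude", "updated_at"}
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"

sns = boto3.client("sns")


def validate_schema(records: list[dict]) -> None:
    """Stop the run and notify users if the source schema has drifted."""
    if not records:
        raise ValueError("Source returned no records; halting pipeline run.")

    actual_columns = set(records[0].keys())
    missing = EXPECTED_COLUMNS - actual_columns
    extra = actual_columns - EXPECTED_COLUMNS

    if missing or extra:
        message = {
            "event": "schema_change_detected",
            "missing_columns": sorted(missing),
            "unexpected_columns": sorted(extra),
        }
        # Notify users before any downstream step runs.
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="QuickBase pipeline: source schema changed",
            Message=json.dumps(message, indent=2),
        )
        raise RuntimeError(f"Schema mismatch: {message}")
```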

Technical Overview

There are three stages to the data pipeline:
First, we use Amazon Managed Workflows for Apache Airflow (MWAA) to schedule the data extract. Airflow is an open-source workflow orchestration platform that can schedule operations down to the second. With Airflow, we extract the data and store it in Amazon S3 (object storage) in JSON format.
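
A minimal sketch of what this extraction DAG could look like is shown below, assuming Airflow 2.x-style imports. The source URL, bucket name, and schedule are assumptions for illustration, not the production values.

```python
import json
from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder values -- the real DAG uses the production source endpoint and bucket.
SOURCE_URL = "https://example.com/api/aviation-data"
BUCKET = "quickbase-pipeline-raw"


def extract_to_s3(**context):
    """Pull the source dataset and land it in S3 as JSON."""
    response = requests.get(SOURCE_URL, timeout=60)
    response.raise_for_status()

    key = f"raw/{context['ds']}/extract.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(response.json()).encode("utf-8"),
    )


with DAG(
    dag_id="quickbase_source_extract",
    start_date=datetime(2024, 1, 1),
    schedule="0 */6 * * *",  # example only: every six hours
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
```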
Second, we use a combination of Amazon EventBridge, AWS Step Functions, and AWS Lambda to run the cleaning and transformation modules. Each time new data lands in S3, we check the dataset for changes; if changes are found, we run a sequence of AWS Lambda (Python) functions coordinated by an AWS Step Functions state machine. If the data has not changed, we log that fact and skip the cleaning and transformation modules, saving significant compute costs.
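
The change check could look something like the sketch below: a Lambda function, triggered by an EventBridge rule on new S3 objects, compares a checksum of the latest extract against the previous run and only starts the Step Functions state machine when the data has changed. The bucket layout, checksum key, and state machine ARN are illustrative placeholders.

```python
import hashlib
import json

import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Illustrative names -- the real pipeline reads these from configuration.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:quickbase-etl"
CHECKSUM_KEY = "state/last_extract_sha256.txt"


def handler(event, context):
    """Triggered by EventBridge when a new extract lands in S3."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    new_checksum = hashlib.sha256(body).hexdigest()

    try:
        old_checksum = (
            s3.get_object(Bucket=bucket, Key=CHECKSUM_KEY)["Body"].read().decode()
        )
    except s3.exceptions.NoSuchKey:
        old_checksum = None

    if new_checksum == old_checksum:
        # No change in the source data: log and skip the expensive transform steps.
        print(json.dumps({"event": "no_change_detected", "key": key}))
        return {"changed": False}

    s3.put_object(Bucket=bucket, Key=CHECKSUM_KEY, Body=new_checksum.encode())
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return {"changed": True}
```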
In the third and final stage, we take the transformed dataset and format it for transfer through the QuickBase API. This step breaks the data into its fundamental components, updates the records within QuickBase, and tracks which data is appended, updated, or deleted.
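
Below is a simplified sketch of how records might be upserted through the QuickBase records endpoint. The realm hostname, table ID, merge field ID, and token environment variable are placeholders, and the production module also batches records and reconciles deletions.

```python
import os

import requests

# Placeholder identifiers -- the real module reads these from configuration.
QB_REALM = "example.quickbase.com"
QB_TABLE_ID = "bxxxxxxxx"
KEY_FIELD_ID = 6  # field used to decide append vs. update


def upsert_records(rows: list[dict]) -> dict:
    """Send cleaned rows to QuickBase via the records endpoint.

    `rows` is expected to already be keyed by QuickBase field ID, e.g.
    [{"6": {"value": "KJFK"}, "7": {"value": "John F. Kennedy Intl"}}].
    """
    response = requests.post(
        "https://api.quickbase.com/v1/records",
        headers={
            "QB-Realm-Hostname": QB_REALM,
            "Authorization": f"QB-USER-TOKEN {os.environ['QB_USER_TOKEN']}",
            "Content-Type": "application/json",
        },
        json={
            "to": QB_TABLE_ID,
            "data": rows,
            "mergeFieldId": KEY_FIELD_ID,
        },
        timeout=60,
    )
    response.raise_for_status()
    # The response metadata describes which records were created or updated.
    return response.json()["metadata"]
```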
All of these processes are tracked with Amazon CloudWatch, which provides detailed application and process monitoring. Customized logging, tailored separately to engineering and management, distills the large volume of information CloudWatch collects into an easy-to-follow record of each run.
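
In practice, this custom logging layer amounts to emitting structured, JSON-formatted log lines that CloudWatch Logs captures, plus a small set of custom metrics. The sketch below shows one way to do that; the namespace, metric name, and helper functions are illustrative choices, not the production implementation.

```python
import json
import logging
import time

import boto3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("quickbase_pipeline")

cloudwatch = boto3.client("cloudwatch")

# Illustrative namespace -- not the production value.
METRIC_NAMESPACE = "QuickBasePipeline"


def log_step(step: str, status: str, **details) -> None:
    """Emit one JSON log line per pipeline step so CloudWatch Logs queries
    stay simple for both engineering and management."""
    logger.info(json.dumps({
        "timestamp": time.time(),
        "step": step,
        "status": status,
        **details,
    }))


def record_step_duration(step: str, duration_seconds: float) -> None:
    """Publish a custom metric so dashboards can track run duration per step."""
    cloudwatch.put_metric_data(
        Namespace=METRIC_NAMESPACE,
        MetricData=[{
            "MetricName": "StepDurationSeconds",
            "Dimensions": [{"Name": "Step", "Value": step}],
            "Value": duration_seconds,
            "Unit": "Seconds",
        }],
    )
```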

Cost Efficiency

Cost optimization is one of the main pillars of the AWS Well-Architected Framework, and this data pipeline follows its best practices. From the first iteration of this software to the current version, we have reduced costs by over 35%, primarily by shifting to serverless operations (AWS Step Functions and AWS Lambda) and processing only what is needed.

Future Planning

There are no major enhancements planned, but we continue to make updates and small efficiency tweaks. In addition, there are several logging modules we intend to build in the coming weeks, should users find them useful.