Background
A Cloud Data Platform (CDP) project involved building Ingestor and Exporter tools to manage data across various teams. The Ingestor converts data from a range of source formats into Parquet and loads it into AWS Redshift tables; the Exporter performs the reverse, extracting Redshift data into the formats users request. Both tools use AWS Lambda, Python, Apache Airflow, AWS S3, AWS Redshift, and AWS SNS/SQS to process and manage data efficiently.
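To make the flow concrete, here is a minimal sketch of the Ingestor's core path: convert a source file to Parquet, stage it in S3, and COPY it into Redshift. It assumes pandas with pyarrow for the conversion and the Redshift Data API for the load; the bucket, cluster, database, user, and role names are placeholders, not the project's actual configuration.

```python
import boto3
import pandas as pd  # pandas + pyarrow assumed for Parquet conversion

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

def ingest_csv(local_path: str, bucket: str, key: str,
               table: str, iam_role: str) -> None:
    # Convert the source file to columnar Parquet before staging in S3.
    df = pd.read_csv(local_path)
    staged = "/tmp/staged.parquet"
    df.to_parquet(staged, index=False)
    s3.upload_file(staged, bucket, key)

    # Load the staged file into Redshift with a COPY statement.
    copy_sql = (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )
    redshift_data.execute_statement(
        ClusterIdentifier="cdp-cluster",  # placeholder cluster name
        Database="analytics",             # placeholder database
        DbUser="etl_user",                # placeholder user
        Sql=copy_sql,
    )
```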
Challenges
- Integrating heterogeneous data formats (Excel, CSV, XML, JSON, DAT) into a unified processing pipeline
- Implementing schema validation and type enforcement across diverse data structures
- Designing a scalable metadata management system for multiple file types
- Developing robust ETL processes for high-volume, varied data ingestion
- Ensuring data integrity and consistency during transformation to columnar (Parquet) format
- Architecting a flexible system to accommodate future data type additions
- Implementing efficient error handling and logging in a distributed environment
Solutions
- Developed a Python-based metadata module with a key/value structure for pipeline configuration (see the metadata sketch after this list)
- Implemented CI/CD pipeline using Azure DevOps for metadata version control and deployment
- Created modular AWS Lambda functions in Python for format-specific data processing
- Utilized AWS S3 for data staging and Apache Airflow for ETL orchestration (see the DAG sketch below)
- Engineered a dynamic file processing system using SNS/SQS to trigger Lambda functions (event-routing sketch below)
- Implemented Parquet conversion for optimized columnar storage in AWS Redshift
- Developed custom logging in JSON format for analytics and debugging (logging sketch below)
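The metadata module can be pictured as a registry of per-pipeline key/value entries. The sketch below is an assumption about its shape: the field names (source_format, target_table, column types) and pipeline names are illustrative, and the real definitions were versioned and deployed through the Azure DevOps pipeline.

```python
# Illustrative key/value metadata registry; all names are placeholders.
PIPELINE_METADATA = {
    "sales_daily": {
        "source_format": "csv",
        "delimiter": ",",
        "target_table": "analytics.sales_daily",
        "columns": {
            "order_id": "BIGINT",
            "order_date": "DATE",
            "amount": "DECIMAL(12,2)",
        },
    },
    "inventory_feed": {
        "source_format": "xml",
        "target_table": "analytics.inventory",
        "columns": {
            "sku": "VARCHAR(64)",
            "qty_on_hand": "INTEGER",
        },
    },
}

def get_metadata(pipeline: str) -> dict:
    """Look up a pipeline's configuration, failing loudly if unknown."""
    try:
        return PIPELINE_METADATA[pipeline]
    except KeyError:
        raise ValueError(f"No metadata registered for pipeline '{pipeline}'")
```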
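For the orchestration layer, a minimal Airflow 2.x DAG consistent with the bullets above might look like the following; the DAG id, schedule, and task bodies are hypothetical stand-ins for the real staging and load steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def stage_to_s3(**context):
    # Placeholder: upload converted Parquet files to the staging bucket.
    print("staging Parquet files to S3")

def load_to_redshift(**context):
    # Placeholder: issue COPY statements against Redshift.
    print("loading staged files into Redshift")

with DAG(
    dag_id="cdp_ingestor",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",    # hypothetical schedule
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
    stage >> load
```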
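One event-routing arrangement consistent with the bullets above is S3 publishing object-created notifications to SNS, which fans out to SQS queues that trigger Lambda. The handler below is a hedged sketch of unwrapping that envelope and dispatching to format-specific handlers; the handler functions are hypothetical placeholders.

```python
import json
from urllib.parse import unquote_plus

def handle_csv(bucket: str, key: str) -> None:
    print(f"csv handler would process s3://{bucket}/{key}")

def handle_excel(bucket: str, key: str) -> None:
    print(f"excel handler would process s3://{bucket}/{key}")

def handle_xml(bucket: str, key: str) -> None:
    print(f"xml handler would process s3://{bucket}/{key}")

HANDLERS = {".csv": handle_csv, ".xlsx": handle_excel, ".xml": handle_xml}

def lambda_handler(event, context):
    for record in event["Records"]:
        # Unwrap: SQS body -> SNS envelope -> S3 event notification.
        sns_envelope = json.loads(record["body"])
        s3_event = json.loads(sns_envelope["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            ext = key[key.rfind("."):].lower() if "." in key else ""
            handler = HANDLERS.get(ext)
            if handler is None:
                raise ValueError(f"No handler registered for {key}")
            handler(bucket, key)
```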
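The custom JSON logging can be reproduced with a small logging.Formatter subclass, which keeps each CloudWatch log line machine-parseable. The field names below are illustrative assumptions, not the project's actual log schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit each log record as a single JSON object per line.
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "pipeline": getattr(record, "pipeline", None),  # assumed field
        }
        return json.dumps(payload)

logger = logging.getLogger("cdp.ingestor")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: tag a structured log entry with the pipeline name via `extra`.
logger.info("File converted to Parquet", extra={"pipeline": "sales_daily"})
```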
Results
- Successfully scaled to support 28 distinct data pipelines across the organization
- Achieved near real-time data processing through an event-driven Lambda architecture
- Reduced data inconsistencies through standardized ETL processes and type checking
- Implemented a flexible system supporting multiple file formats, including zip archives and whole-folder ingestion
- Enhanced data observability through CloudWatch integration and custom JSON logging
- Optimized data warehouse performance by utilizing Parquet format in Redshift
- Improved development efficiency through modular code design and infrastructure-as-code (Terraform)