This project demonstrates the design and implementation of a Fully Automated Serverless ETL & Analytics Pipeline using AWS services. The goal of this project is to collect data from external APIs, process and clean it, store it in a structured format, and provide analytical insights via SQL queries and visualizations, all while maintaining observability and monitoring.
The pipeline works end to end as follows:
- Ingestion: Data is collected from the Twelve Data API using a Python script (DataIngestion.py) run from a Databricks Notebook and stored in Amazon S3 (Raw Zone).
- Transformation: The raw data is processed in AWS Glue Studio, where it is cleaned, transformed, and structured by removing duplicates, handling missing values, aggregating, and converting data types.
- Validation: AWS Glue Data Quality jobs check the data for accuracy and consistency, and the transformed data is stored in Amazon S3 (Processed Zone).
- Cataloging: A Glue Crawler updates the Glue Data Catalog with schema and metadata, making the data ready for queries.
- Analytics: Amazon Athena runs SQL queries on the processed data to extract insights such as trading volumes, price changes, and risk metrics.
- Visualization: These insights are visualized in Amazon QuickSight dashboards, showing trends, correlations, and market volatility.
- Monitoring: Amazon CloudWatch monitors the pipeline, collecting logs and metrics to ensure reliability and detect issues.

Together, these stages form a complete, cloud-native approach to analyzing financial data.
Data ingestion is the first critical stage of the pipeline. The goal is to fetch raw data from Twelve Data API and store it in Amazon S3 (Raw Zone) for further processing. The ingestion process is implemented using Databricks Workspace and Notebook, which allows scalable, interactive data processing before storing it in S3.
Implementation Details:
- Data Source: Twelve Data API, providing financial and time-series data (OHLCV).
- Data Retrieval: Data is ingested via a Databricks Notebook, which runs a Python script (DataIngestion.py) to fetch data from the Twelve Data API. Databricks provides an interactive environment to process, validate, and prepare the data before storage.
- Storage: Raw data is stored in the Amazon S3 Raw Zone, organized by source and date for traceability and easy access by downstream ETL processes.
- Automation (Optional): The ingestion process can be scheduled using orchestration tools (e.g., EventBridge Scheduler) to ensure regular updates without manual intervention.
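As a sketch of that optional automation, an EventBridge Scheduler schedule can invoke the ingestion on a fixed cadence. The schedule name, target ARN, and role ARN below are placeholders, not resources defined by this project:

```python
# Sketch: automate the ingestion with an EventBridge Scheduler schedule that
# triggers it once a day. Names and ARNs are placeholders for resources in
# your own account.

SCHEDULE_EXPRESSION = "rate(1 day)"  # run the ingestion daily

def create_ingestion_schedule(target_arn: str, role_arn: str,
                              name: str = "etl-daily-ingestion") -> None:
    """Create the schedule; the target is whatever invokes DataIngestion.py
    (e.g. a Lambda function that triggers the Databricks job)."""
    import boto3  # imported here so the module loads without AWS credentials
    boto3.client("scheduler").create_schedule(
        Name=name,
        ScheduleExpression=SCHEDULE_EXPRESSION,
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={"Arn": target_arn, "RoleArn": role_arn},
    )

# Example (requires AWS credentials and existing resources):
#   create_ingestion_schedule(
#       target_arn="arn:aws:lambda:us-east-1:123456789012:function:ingest",
#       role_arn="arn:aws:iam::123456789012:role/scheduler-invoke",
#   )
```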
Python Script (DataIngestion.py): Access the "DataIngestion.py" file here.
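The core of the ingestion flow can be sketched as follows. The bucket name, key layout, and symbol are illustrative assumptions rather than the exact contents of DataIngestion.py:

```python
# Sketch of the ingestion flow: fetch OHLCV time-series from the Twelve Data
# API and land the raw JSON in the S3 Raw Zone. The bucket name and key
# layout below are illustrative assumptions.
import datetime
import json

API_URL = "https://api.twelvedata.com/time_series"
RAW_BUCKET = "my-etl-raw-zone"  # placeholder bucket name

def raw_zone_key(source: str, symbol: str, run_date: datetime.date) -> str:
    """Build an S3 key organized by source and date, e.g.
    twelvedata/2024/01/15/AAPL.json, for traceability."""
    return f"{source}/{run_date:%Y/%m/%d}/{symbol}.json"

def ingest_symbol(symbol: str, api_key: str) -> None:
    """Fetch one symbol's daily OHLCV bars and store the raw payload in S3."""
    import boto3    # imported here so the helpers above run without AWS
    import requests
    resp = requests.get(API_URL, params={
        "symbol": symbol, "interval": "1day",
        "outputsize": 30, "apikey": api_key,
    }, timeout=30)
    resp.raise_for_status()
    key = raw_zone_key("twelvedata", symbol, datetime.date.today())
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET, Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )

# Example (requires a real Twelve Data API key and AWS credentials):
#   ingest_symbol("AAPL", api_key="YOUR_API_KEY")
```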
After ingesting raw data into S3, the next stage is ETL processing using AWS Glue Studio. The ETL job, implemented as awsdataflowjob.py, performs multiple transformations on the raw data, including duplicate removal, null handling, aggregations, data type conversions, and joining multiple datasets to produce clean, structured data ready for analysis.
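To make the transformation steps concrete, the sketch below reproduces the same cleaning logic (duplicate removal, null handling, type conversion, aggregation) in pandas. The real job runs in AWS Glue on Spark, and the column names here are assumed from the OHLCV schema, not taken from awsdataflowjob.py:

```python
# Stand-alone sketch of the cleaning logic. The actual Glue job uses the
# PySpark / DynamicFrame API; pandas stands in here so the transformation
# steps are easy to run and inspect locally.
import pandas as pd

def clean_ohlcv(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, drop incomplete rows, and normalize types."""
    df = raw.drop_duplicates(subset=["symbol", "datetime"])  # remove duplicates
    df = df.dropna(subset=["close", "volume"])               # handle missing values
    df = df.astype({"open": float, "high": float, "low": float,
                    "close": float, "volume": int})          # convert data types
    df["datetime"] = pd.to_datetime(df["datetime"])
    return df

def daily_aggregates(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total volume and average close per symbol per day."""
    return (df.assign(date=df["datetime"].dt.date)
              .groupby(["symbol", "date"], as_index=False)
              .agg(total_volume=("volume", "sum"),
                   avg_close=("close", "mean")))
```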
Once the data is processed, AWS Glue Data Quality jobs validate it against business rules to ensure completeness, correctness, and consistency across related datasets. This includes checks for mandatory fields, correct data types, and reasonable value ranges.
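A Glue Data Quality ruleset for such checks is written in DQDL. The sketch below uses standard DQDL rule types; the specific columns and thresholds are illustrative assumptions, not the project's actual ruleset:

```
Rules = [
    RowCount > 0,
    IsComplete "datetime",
    IsComplete "close",
    ColumnValues "volume" >= 0,
    ColumnValues "close" > 0,
    Uniqueness "datetime" > 0.95
]
```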
Finally, a Glue Crawler scans the processed S3 zone and updates the AWS Glue Data Catalog with schema and metadata for each dataset. This enables efficient querying of the processed data using Amazon Athena or other AWS services, completing the ETL and validation cycle.
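Crawlers can run on a schedule or be triggered programmatically. A minimal boto3 sketch of triggering the crawl and waiting for the catalog refresh (the crawler name is a placeholder):

```python
# Sketch: trigger the Glue Crawler over the Processed Zone and poll until
# the Data Catalog refresh completes. The crawler name is a placeholder.
import time

def crawl_finished(get_crawler_response: dict) -> bool:
    """A crawler is done once it returns to the READY state."""
    return get_crawler_response["Crawler"]["State"] == "READY"

def run_crawler_and_wait(name: str = "processed-zone-crawler",
                         poll_seconds: int = 30) -> None:
    """Start the crawler, then poll boto3's get_crawler until it is READY."""
    import boto3  # imported here so crawl_finished is usable without AWS
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while not crawl_finished(glue.get_crawler(Name=name)):
        time.sleep(poll_seconds)

# Example (requires AWS credentials and an existing crawler):
#   run_crawler_and_wait("processed-zone-crawler")
```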
After processing and cataloging the data using AWS Glue, Amazon Athena is used as a serverless query engine to analyze the cleaned datasets stored in S3. Athena enables ad-hoc queries, aggregations, and analytics on the processed OHLCV data.
Athena Queries File: Access the "athena_queries.sql" file here.
- Aggregated Metrics: Daily and monthly aggregates of trading volume and prices were computed.
  Insight: Revealed trends in market activity, highlighting days with unusually high or low trading volumes and significant price changes.
- Time-Series Analysis: Daily changes in closing price and trading volume were tracked.
  Insight: Showed market volatility with peaks and drops on specific dates, essential for understanding trading behavior and risk.
- Comparative Analysis: Sector-wise and category-wise aggregations were analyzed.
  Insight: Identified top-performing stocks or sectors and their contribution to overall market activity, aiding investment and portfolio analysis.
- Risk and Performance Metrics: Daily rate of change and percentage returns were calculated.
  Insight: Pinpointed high-risk days with major price spikes or drops, providing insight into market stability and informing risk management strategies.
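As an illustration of the risk metrics above, the daily percent-return arithmetic can be sketched in plain Python. The Athena queries would typically express the same calculation in SQL with a window function such as LAG; this sketch is not the contents of athena_queries.sql:

```python
# Sketch: daily percentage return from a series of closing prices --
# the arithmetic behind the rate-of-change and high-risk-day metrics.

def daily_returns(closes: list[float]) -> list[float]:
    """Percent change of each close versus the previous close.
    The first day has no prior close, so the result is one element shorter."""
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(closes, closes[1:])
    ]

def high_risk_days(returns: list[float], threshold: float = 5.0) -> list[int]:
    """Day offsets where the absolute move exceeds the threshold,
    flagging potential high-risk days."""
    return [i + 1 for i, r in enumerate(returns) if abs(r) > threshold]
```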
In this stage, the processed OHLCV (Open, High, Low, Close, Volume) data is visualized using Amazon QuickSight. The visualizations provide insights into market trends, daily price movements, trading activity, and price-volume correlations.
- Daily Price Change
  Insight: This line chart shows daily price changes for the selected time period. It highlights market volatility, helping to identify days with significant gains or losses and assess overall price stability.
- Daily Volume
  Insight: This bar chart tracks daily trading activity by displaying the total trading volume per day. Peaks in the chart indicate high-activity days, which can correlate with major market events.
- Monthly Volume
  Insight: This bar chart summarizes long-term trading trends by aggregating volume over each month. It helps to understand whether trading activity is increasing, decreasing, or stable over time.
- Price vs Volume
  Insight: This bubble/scatter chart analyzes the correlation between price and trading volume. The X-axis represents price, the Y-axis represents trading volume, and bubble size indicates relative market impact. It helps identify anomalies and relationships between price movements and trading activity.
Monitoring is a critical part of any production-grade ETL pipeline. Amazon CloudWatch provides centralized monitoring, logging, and alerting for the entire serverless ETL and analytics workflow. All AWS Glue ETL job runs, API ingestion events, and pipeline activities are logged, offering detailed information on execution status, errors, warnings, and performance metrics, which supports quick troubleshooting and validation of ETL execution.

Custom dashboards track key metrics such as ETL job duration, success/failure rates, data ingestion volume, API response times, and S3 storage growth in the raw and processed zones. These dashboards provide real-time observability into pipeline health, help ensure SLA compliance, and support resource optimization.
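For example, a custom metric such as rows ingested per run can be published to CloudWatch and graphed on those dashboards. The namespace and dimension names below are illustrative assumptions:

```python
# Sketch: publish a custom pipeline metric (e.g. rows ingested per run) to
# CloudWatch. The namespace and dimension names are placeholders.

def build_metric(name: str, value: float, stage: str) -> dict:
    """Shape one CloudWatch MetricDatum for the pipeline's custom namespace."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Count",
        "Dimensions": [{"Name": "PipelineStage", "Value": stage}],
    }

def publish_metric(datum: dict, namespace: str = "ServerlessETL") -> None:
    """Send the metric to CloudWatch (requires AWS credentials)."""
    import boto3  # imported here so build_metric is usable without AWS
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=[datum]
    )

# Example:
#   publish_metric(build_metric("RowsIngested", 5000, stage="ingestion"))
```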



