Yelp Batch ETL Pipeline
A batch ETL pipeline that processes raw Yelp business data to generate analytics and insights

About this project
Hi everyone! 👋
I’m sharing a project where I built a kind of “data factory” using real Yelp data (collected from kaggle.com).
Yelp provides huge files with information about businesses, users, reviews, tips, etc. The files are very large, messy to work with directly, and not ready for analysis.
What it does
The pipeline transforms raw Yelp data (businesses, reviews, users, check-ins, tips) through a medallion architecture:
1️⃣ Bronze Layer: Raw data ingested from raw JSON files or MongoDB and stored in Delta Lake format
2️⃣ Silver Layer: Cleaned and validated data stored in Delta Lake format
3️⃣ Gold Layer: Analytics-ready tables in PostgreSQL for dashboards and reporting
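To make the first two layers more concrete, here is a minimal sketch of what a Bronze-to-Silver step could look like in Scala with Spark and Delta Lake. It is not the project's actual code. The file paths, the object name `BronzeToSilver`, and the cleaning rules are assumptions chosen for illustration; the column names (`business_id`, `city`, `stars`) come from the public Yelp business file. It also assumes the `spark-sql` and Delta Lake libraries are on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

object BronzeToSilver {

  // Illustrative Silver rules (assumed, not taken from the project):
  // require an id, trim city names, keep only star ratings within
  // Yelp's 1-5 range, and drop duplicate businesses.
  def cleanBusinesses(bronze: DataFrame): DataFrame =
    bronze
      .filter(col("business_id").isNotNull)
      .withColumn("city", trim(col("city")))
      .filter(col("stars").between(1.0, 5.0))
      .dropDuplicates("business_id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yelp-bronze-to-silver")
      .master("local[*]")
      // Standard settings that enable Delta Lake in a Spark session.
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Hypothetical locations; adjust to your own layout.
    val rawPath    = "data/raw/yelp_academic_dataset_business.json"
    val bronzePath = "data/delta/bronze/business"
    val silverPath = "data/delta/silver/business"

    // Bronze: land the raw JSON (one object per line) unchanged, in Delta format.
    spark.read.json(rawPath)
      .write.format("delta").mode("overwrite").save(bronzePath)

    // Silver: read Bronze back, apply the cleaning rules, store as Delta.
    val bronze = spark.read.format("delta").load(bronzePath)
    cleanBusinesses(bronze)
      .write.format("delta").mode("overwrite").save(silverPath)

    spark.stop()
  }
}
```

Keeping the cleaning rules in a function that takes and returns a `DataFrame` makes them easy to check on a handful of rows without touching any storage.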
Tech Stack
➡️ MongoDB: a place where the raw Yelp data (JSON documents) can be stored
➡️ Scala: the main programming language I use to write the data processing logic
➡️ Apache Spark: like a super‑powerful calculator that can process a lot of data quickly
➡️ Delta Lake: a storage layer that keeps the data in different quality layers (Bronze for raw, Silver for cleaned)
➡️ Apache Airflow: a scheduler that knows which steps to run and in which order
➡️ PostgreSQL: a traditional database where the final, easy‑to‑use tables are stored
The pipeline handles large-scale data efficiently and enables downstream analytics like identifying top-rated businesses by city, user behavior patterns, and review sentiment trends.
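As an example of one of these downstream analytics, here is a hedged sketch of how a "top-rated businesses by city" Gold table could be built and written to PostgreSQL. Everything specific in it is an assumption for illustration: the object name `GoldTopBusinesses`, the Silver path, the JDBC URL, the table name, the `PG_USER` and `PG_PASSWORD` environment variables, and the minimum-review threshold. It also assumes the PostgreSQL JDBC driver and Delta Lake are on the classpath.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

object GoldTopBusinesses {

  // Rank businesses inside each city by stars, breaking ties by review count,
  // and keep the top n. Businesses with very few reviews are excluded so a
  // single 5-star review does not dominate the ranking (assumed rule).
  def topRatedByCity(silver: DataFrame, n: Int, minReviews: Int): DataFrame = {
    val byCity = Window
      .partitionBy("city")
      .orderBy(desc("stars"), desc("review_count"))

    silver
      .filter(col("review_count") >= minReviews)
      .withColumn("rank", row_number().over(byCity))
      .filter(col("rank") <= n)
      .select("city", "rank", "business_id", "name", "stars", "review_count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yelp-gold-top-businesses")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Hypothetical Silver location and PostgreSQL connection details.
    val silver = spark.read.format("delta").load("data/delta/silver/business")

    val props = new Properties()
    props.setProperty("user", sys.env.getOrElse("PG_USER", "postgres"))
    props.setProperty("password", sys.env.getOrElse("PG_PASSWORD", ""))
    props.setProperty("driver", "org.postgresql.Driver")

    topRatedByCity(silver, n = 10, minReviews = 50)
      .write
      .mode("overwrite")
      .jdbc("jdbc:postgresql://localhost:5432/yelp",
        "gold_top_businesses_by_city", props)

    spark.stop()
  }
}
```

A dashboard can then read the resulting table directly, filtering on `city` and ordering by `rank`.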
Comments (1)
Thank you for sharing this project! Gonna look into it :)