Yelp Batch ETL Pipeline

A batch ETL pipeline that processes raw Yelp business data to generate analytics and insights

Apache Airflow · Apache Spark · Scala · MongoDB · PostgreSQL



About this project

Yelp Batch ETL Pipeline

Hi everyone! 👋
I’m sharing a project where I built a kind of “data factory” using real Yelp data (collected from kaggle.com).
Yelp provides huge files with information about businesses, users, reviews, tips, etc. The files are very large, messy to work with directly, and not ready for analysis.

What it does

The pipeline transforms raw Yelp data (businesses, reviews, users, check-ins, tips) through a medallion architecture:

1️⃣ Bronze Layer: Raw data ingested from raw JSON files or MongoDB and stored in Delta Lake format
2️⃣ Silver Layer: Cleaned and validated data stored in Delta Lake format
3️⃣ Gold Layer: Analytics-ready tables in PostgreSQL for dashboards and reporting
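
To make the Bronze to Silver step more concrete, here is a minimal sketch of what "cleaned and validated" could mean for business records. It uses plain Scala collections rather than Spark and Delta Lake so that it runs on its own. The case classes, the `SilverSketch` object, and the validation rules (required fields, star rating between 1.0 and 5.0, de-duplication on the business ID) are assumptions for illustration, not the project's actual logic.

```scala
// Hypothetical record shapes. The field names are loosely modeled on the
// public Yelp dataset; the project's real schema is not shown in this post.
final case class RawBusiness(
  businessId: Option[String],
  name: Option[String],
  city: Option[String],
  stars: Option[Double]
)

final case class CleanBusiness(businessId: String, name: String, city: String, stars: Double)

object SilverSketch {
  // Keep a row only if every required field is present and non-blank,
  // and the star rating falls inside the assumed 1.0 to 5.0 range.
  def validate(raw: RawBusiness): Option[CleanBusiness] =
    for {
      id    <- raw.businessId.map(_.trim).filter(_.nonEmpty)
      name  <- raw.name.map(_.trim).filter(_.nonEmpty)
      city  <- raw.city.map(_.trim).filter(_.nonEmpty)
      stars <- raw.stars.filter(s => s >= 1.0 && s <= 5.0)
    } yield CleanBusiness(id, name, city, stars)

  // Bronze -> Silver: drop invalid rows, then de-duplicate on businessId,
  // keeping the first occurrence of each ID.
  def toSilver(bronze: Seq[RawBusiness]): Seq[CleanBusiness] = {
    val start = (Vector.empty[CleanBusiness], Set.empty[String])
    val (kept, _) = bronze.flatMap(validate).foldLeft(start) {
      case ((acc, seen), b) =>
        if (seen(b.businessId)) (acc, seen)
        else (acc :+ b, seen + b.businessId)
    }
    kept
  }
}
```

In a Spark job the same idea would typically be expressed as DataFrame filters and a de-duplication step before writing the result to the Silver location.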

Tech Stack

➡️ MongoDB: a place where the raw Yelp data can be stored (.json format)
➡️ Scala: the main programming language I use to write the data processing logic
➡️ Apache Spark: like a super‑powerful calculator that can process a lot of data quickly
➡️ Delta Lake: a storage system that keeps the cleaned data in different quality layers
➡️ Apache Airflow: a scheduler that knows which steps to run and in which order
➡️ PostgreSQL: a traditional database where the final, easy‑to‑use tables are stored


The pipeline handles large-scale data efficiently and enables downstream analytics like identifying top-rated businesses by city, user behavior patterns, and review sentiment trends.
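
As an example of one such downstream query, "top-rated businesses by city" can be sketched as a group-and-rank step. The snippet below is an illustration in plain Scala, not the project's Gold-layer code: the `BusinessRating` type, the `GoldSketch` object, and the tie-break on review count are all assumptions.

```scala
// Hypothetical shape of an analytics-ready row.
final case class BusinessRating(name: String, city: String, stars: Double, reviewCount: Int)

object GoldSketch {
  // For each city, return the n best businesses: highest stars first,
  // with ties broken by the larger number of reviews.
  def topRatedByCity(rows: Seq[BusinessRating], n: Int): Map[String, Seq[BusinessRating]] =
    rows.groupBy(_.city).map { case (city, businesses) =>
      val ranked = businesses.sortWith { (a, b) =>
        if (a.stars != b.stars) a.stars > b.stars
        else a.reviewCount > b.reviewCount
      }
      city -> ranked.take(n)
    }
}
```

Calling `GoldSketch.topRatedByCity(rows, 10)` returns a map from city name to its ten best-ranked businesses. In the pipeline itself, a result like this would be computed with Spark and written to a PostgreSQL table for dashboards.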


Comments (1)

Marc Lamberti, 1mo ago

Thank you for sharing this project! Gonna look into it :)

Project Info

Published on Dec 15, 2025