Yelp Medallion Batch Pipeline

Building medallion architecture for crowd-sourced reviews using Snowflake native features

Snowflake·
SQL

This batch data pipeline implements a medallion architecture data pipeline for the Yelp Open Dataset using Snowflake Dynamic Tables. The project demonstrates modern data engineering best practices by...

Screenshot 1

About this project

This batch data pipeline implements a medallion architecture data pipeline for the Yelp Open Dataset using Snowflake Dynamic Tables. The project demonstrates modern data engineering best practices by processing raw JSON data through Bronze, Silver, and Gold layers to create analytics-ready datasets.

Business Value

This project provides:

  • 360° Business Intelligence: Complete view of businesses, users, and reviews

  • Scalable Analytics: Dimensional design supports complex analytical workloads

  • Near-Real-time Insights: Dynamic Tables provide fresh data with minimal latency

  • Cost Optimization: Automated warehouse management and incremental processing

  • Enterprise-Ready: Production-grade data governance and quality controls

Key Highlights

  • Medallion Architecture: Bronze (Raw) → Silver (Cleaned) → Gold (Business Logic)

  • Snowflake Dynamic Tables: Automated refresh management and incremental processing

  • Dimensional Modeling: Star schema design optimized for analytics

  • Advanced SQL Transformations: JSON processing, array operations, and business logic

📋 Prerequisites

Snowflake Requirements

  • Snowflake account with SYSADMIN role access

  • Warehouse creation privileges

  • Database creation privileges

  • Dynamic Tables feature enabled

Local Environment

  • Python 3.7+ with standard libraries

  • Access to Yelp Open Dataset (see Licensing)

  • Snowflake connectivity (SnowSQL, Python connector, or Snowsight)

✨ Features

  • JSON Processing: Complex VARIANT data extraction and transformation

  • Array Operations: Category parsing, friend networks, elite year processing

  • Dynamic Tables: Automated incremental refresh management

  • Warehouse Management: Auto-suspend/resume for cost optimization

  • Scalable Architecture: Designed for large dataset processing

  • Audit Trails: Complete data lineage from source to gold

  • Metadata Tracking: File-level processing information

Stack:
SnowflakeSQL
Team

You must be logged in to comment

Sign in to comment

Comments

No comments yet

Be the first to share your thoughts!

Project Info

Published on Dec 10, 2025
View on GitHub