This batch data pipeline implements a medallion architecture data pipeline for the Yelp Open Dataset using Snowflake Dynamic Tables. The project demonstrates modern data engineering best practices by processing raw JSON data through Bronze, Silver, and Gold layers to create analytics-ready datasets.

Business Value

This project provides:

360° Business Intelligence: Complete view of businesses, users, and reviews
Scalable Analytics: Dimensional design supports complex analytical workloads
Near-Real-time Insights: Dynamic Tables provide fresh data with minimal latency
Cost Optimization: Automated warehouse management and incremental processing
Enterprise-Ready: Production-grade data governance and quality controls

Key Highlights

Medallion Architecture: Bronze (Raw) → Silver (Cleaned) → Gold (Business Logic)
Snowflake Dynamic Tables: Automated refresh management and incremental processing
Dimensional Modeling: Star schema design optimized for analytics
Advanced SQL Transformations: JSON processing, array operations, and business logic

📋 Prerequisites

Snowflake Requirements

Snowflake account with SYSADMIN role access
Warehouse creation privileges
Database creation privileges
Dynamic Tables feature enabled

Local Environment

Python 3.7+ with standard libraries
Access to Yelp Open Dataset (see Licensing)
Snowflake connectivity (SnowSQL, Python connector, or Snowsight)

✨ Features

JSON Processing: Complex VARIANT data extraction and transformation
Array Operations: Category parsing, friend networks, elite year processing

Dynamic Tables: Automated incremental refresh management
Warehouse Management: Auto-suspend/resume for cost optimization
Scalable Architecture: Designed for large dataset processing

Audit Trails: Complete data lineage from source to gold
Metadata Tracking: File-level processing information

Yelp Medallion Batch Pipeline

About this project

Business Value

Key Highlights

📋 Prerequisites

✨ Features

Comments

Project Info