A self-healing data pipeline platform built on Airflow that uses AI to classify failures, intelligently rerun tasks, and continuously learn from outcomes via vector embeddings and feedback loops.Overv...

A self-healing data pipeline platform built on Airflow that uses AI to classify failures, intelligently rerun tasks, and continuously learn from outcomes via vector embeddings and feedback loops.
Airflow AI Ops is an intelligent platform that automatically detects, analyzes, and recovers from Airflow task failures. Instead of manually debugging logs and retrying tasks blindly, the system uses LLMs + vector search + feedback loops to make context-aware recovery decisions and continuously improve over time.
π‘ Captures Airflow task failures in real-time via callbacks
π Uses semantic search (pgvector) to find similar historical errors
π€ Applies LLM reasoning + context to classify failures
β‘ Recommends actions: RERUN / NO_RERUN with confidence scoring
π§βπ» Supports human-in-the-loop approvals for edge cases
π Automatically reruns tasks based on confidence + SLA rules
π Learns from successful recoveries and updates its knowledge base
Failure β AI β Decision β Rerun β Learn β Improve
Airflow Layer: Captures failures via callbacks and logs to pipeline_signals
AI Intelligence: Uses embeddings + pgvector to find similar errors, applies LLM reasoning
Decision Engine: Routes to auto-rerun (high confidence) or human review (low confidence)
Human Review: Streamlit dashboard for approvals with error details and logs
Recovery Layer: Clears tasks and tracks rerun outcomes via Airflow API
Learning Loop: Generates YAML knowledge base from successful recoveries
You must be logged in to comment
Sign in to commentNo comments yet
Be the first to share your thoughts!