Auto Loader: End-to-End Schema Evolution
Summary
Databricks Auto Loader, built on Structured Streaming, incrementally processes new files as they arrive in cloud storage such as ADLS, S3, or GCS. It tracks which files it has already processed so that each run loads only new ones, scales to millions of files, and can infer and evolve schemas. Key options include `cloudFiles.format`, `cloudFiles.schemaLocation` (where the schema history is stored), `cloudFiles.inferColumnTypes`, and `cloudFiles.schemaEvolutionMode` (values: `addNewColumns`, `rescue`, `failOnNewColumns`, `none`). End-to-end schema evolution requires configuring both the Auto Loader read and the Delta table write: set the Spark configuration `spark.databricks.delta.schema.autoMerge.enabled` to `true`, or pass the write option `mergeSchema` set to `true`. In `rescue` mode the schema is not evolved; unexpected fields are captured as JSON in the rescued data column (`_rescued_data` by default), so the pipeline keeps running while the schema stays strictly controlled.
Key takeaway
For data engineers building incremental ingestion pipelines on Databricks, understanding how Auto Loader evolves schemas is crucial. Configure both `cloudFiles.schemaEvolutionMode` on the read side and `mergeSchema` on the Delta write side; otherwise the pipeline can fail when a source schema changes. Decide whether `addNewColumns` or `rescue` mode fits your data governance and downstream requirements. For regulated data, explicit schema updates may be safer than automatic evolution.
Key insights
Auto Loader provides scalable, incremental data ingestion with configurable schema evolution for cloud storage.
Principles
- Schema evolution must be configured at two layers: the Auto Loader read and the Delta table write.
- The rescued data column prevents data loss when files contain unexpected fields.
- Rows ingested before a column was added hold nulls for that column.
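The last two principles can be illustrated with a small plain-Python model. This is a simplified sketch, not Auto Loader's implementation: the helper `apply_schema` and the sample records are hypothetical, and a real rescued data column may carry extra metadata alongside the rescued fields.

```python
import json


def apply_schema(record, columns):
    """Model how a record lands in a table that has the given column list.

    Columns missing from the record become None (null). Fields that are not
    in the column list are collected as a JSON string under _rescued_data.
    """
    row = {name: record.get(name) for name in columns}
    extras = {k: v for k, v in record.items() if k not in columns}
    row["_rescued_data"] = json.dumps(extras, sort_keys=True) if extras else None
    return row


# Hypothetical sample records: the newer one carries an extra "email" field.
old_record = {"id": 1, "name": "a"}
new_record = {"id": 2, "name": "b", "email": "b@example.com"}

# Fixed schema (rescue-style): the unexpected field is kept as JSON, not dropped.
rescued = apply_schema(new_record, ["id", "name"])

# Evolved schema (addNewColumns-style): the older record gets null for "email".
backfilled = apply_schema(old_record, ["id", "name", "email"])
```

In the fixed-schema case the `email` value survives inside `_rescued_data`; in the evolved case the older row simply has no value for the new column.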
Method
Configure Auto Loader with `cloudFiles.schemaLocation` and `cloudFiles.schemaEvolutionMode`. For end-to-end evolution, also enable schema merging on the Delta write, through either the Spark configuration `spark.databricks.delta.schema.autoMerge.enabled` or the `mergeSchema` write option.
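A minimal PySpark sketch of this two-layer setup might look as follows. The function names, paths, and table name are hypothetical placeholders, and `start_ingest` assumes it is called with a Databricks `spark` session where the `cloudFiles` source is available.

```python
def auto_loader_options(file_format, schema_location, evolution_mode="addNewColumns"):
    """Build the Auto Loader reader options named in the text."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,  # schema history lives here
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


def start_ingest(spark, source_path, target_table, checkpoint_path, options):
    """Read with Auto Loader and write to a Delta table with schema merging on."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)  # layer 1: Auto Loader schema evolution
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")  # layer 2: let the Delta table accept new columns
        .toTable(target_table)
    )


# Hypothetical usage on a Databricks cluster (paths and names are placeholders):
# opts = auto_loader_options("json", "/mnt/schemas/orders")
# query = start_ingest(spark, "/mnt/landing/orders", "bronze.orders",
#                      "/mnt/checkpoints/orders", opts)
```

Keeping the reader options in a plain dictionary makes the evolution mode easy to switch per source. Note that with `addNewColumns`, the stream typically stops when it first sees a new column and picks up the updated schema on restart, so the job should be set up to restart automatically.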
In practice
- Use `addNewColumns` for flexible, evolving sources.
- Use `rescue` for strict schema control and monitoring.
- Avoid auto-evolution for regulated data or strict governance.
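For the monitoring side of `rescue` mode, one option is to tally which unexpected field names show up in the rescued data column. The sketch below works on plain JSON strings so it can run anywhere; the helper name and sample values are hypothetical, and it assumes the rescued JSON may include a `_file_path` entry that should not be counted as a data field.

```python
import json
from collections import Counter


def count_rescued_fields(rescued_values):
    """Count how often each unexpected field name appears in rescued JSON strings."""
    counts = Counter()
    for value in rescued_values:
        if not value:
            continue  # null or empty: nothing was rescued for this row
        fields = json.loads(value)
        # Skip the source-file entry (assumed key name) so only data fields count.
        counts.update(key for key in fields if key != "_file_path")
    return counts


# Hypothetical values as they might be collected from a _rescued_data column.
sample = [
    None,
    '{"email": "a@example.com", "_file_path": "/landing/f1.json"}',
    '{"email": "b@example.com", "tier": "gold"}',
]
field_counts = count_rescued_fields(sample)
```

A field that keeps appearing in the tally is a candidate for an explicit, reviewed schema update, which fits the guidance to avoid automatic evolution under strict governance.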
Topics
- Databricks Auto Loader
- Structured Streaming
- Schema Evolution
- Delta Lake
- Data Ingestion
Best for: Data Engineer, MLOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.