Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.5.2. Schema Evolution, Data Lineage, and Data Quality

šŸ’” First Principle: Schemas change because business requirements change — new columns are added, types are modified, and old fields are deprecated. Schema evolution is the art of making these changes without breaking downstream consumers. In a pipeline ecosystem where study guides, flashcards, and dashboards all depend on the same table, a careless schema change can cascade failures across the organization.

Backward-compatible changes (safe): adding new columns, adding new partition keys (with Iceberg), and widening column types (INT → BIGINT). Consumers simply ignore columns they don't recognize, and old values read correctly through the wider type.

Breaking changes (dangerous): renaming columns, changing data types (STRING → INT), removing columns, and changing partition strategies. These require coordinated updates across producers and consumers.
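The safe-versus-breaking distinction above can be sketched as a small compatibility check. This is illustrative code only, not part of any AWS service; the `WIDENINGS` table and `classify_change` helper are hypothetical names invented for this example.

```python
# Type widenings that consumers can absorb safely (e.g. INT -> BIGINT).
WIDENINGS = {("INT", "BIGINT"), ("FLOAT", "DOUBLE"), ("INT", "DOUBLE")}

def classify_change(old_schema: dict, new_schema: dict) -> list[str]:
    """Compare two {column: type} schemas and tag each difference SAFE or BREAKING."""
    findings = []
    for col, old_type in old_schema.items():
        if col not in new_schema:
            findings.append(f"BREAKING: column '{col}' removed")
        elif new_schema[col] != old_type:
            if (old_type, new_schema[col]) in WIDENINGS:
                findings.append(f"SAFE: '{col}' widened {old_type} -> {new_schema[col]}")
            else:
                findings.append(f"BREAKING: '{col}' changed {old_type} -> {new_schema[col]}")
    for col in new_schema:
        if col not in old_schema:
            findings.append(f"SAFE: new column '{col}' added")
    return findings

old = {"user_id": "INT", "name": "STRING"}
new = {"user_id": "BIGINT", "email": "STRING"}
for finding in classify_change(old, new):
    print(finding)
```

Note that a rename (`name` removed, `email` added) is indistinguishable from a drop-plus-add without extra metadata, which is exactly why renames are so dangerous on plain S3 tables.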

AWS DMS Schema Conversion (formerly AWS SCT) converts database schemas between engines — for example, from Oracle to Aurora PostgreSQL. It identifies incompatible data types, stored procedures, and functions, and provides conversion recommendations. Use this during database migration projects.

Data lineage tracks the origin and transformation history of data — "this dashboard column came from table X, which was derived from raw file Y via Glue job Z." Lineage is essential for impact analysis (which dashboards break if we change this table?), debugging (where did this incorrect value originate?), and compliance (prove the provenance of this report number).
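Impact analysis over lineage is just a graph walk. The sketch below assumes a hand-built lineage graph (the `LINEAGE` edges and asset names are invented for illustration; in practice you would read these edges from a catalog rather than hard-code them):

```python
from collections import deque

# Each key produces the assets listed in its value (upstream -> downstream).
LINEAGE = {
    "s3://raw/events.json": ["glue:clean_events"],
    "glue:clean_events": ["table:events_clean"],
    "table:events_clean": ["dashboard:daily_active", "table:events_agg"],
    "table:events_agg": ["dashboard:weekly_trends"],
}

def downstream(asset: str) -> set[str]:
    """Breadth-first walk from `asset`: everything that breaks if it changes."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# "Which dashboards break if we change this table?"
print(downstream("table:events_clean"))
```

The same graph walked in the opposite direction answers the debugging question ("where did this value originate?").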

Both Amazon SageMaker ML Lineage Tracking and Amazon SageMaker Catalog capture lineage. SageMaker Catalog (v1.1) provides lineage within the data governance context, while ML Lineage Tracking focuses on tracking machine learning experiments.

Vectorization concepts (v1.1) relate to converting data into vector embeddings for AI/ML workloads. Amazon Bedrock Knowledge Bases automates the vectorization pipeline: extract text from documents, generate embeddings via a Bedrock model, and store vectors in a supported vector store.
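To make that pipeline concrete, here is a toy end-to-end sketch. The `embed` function is a stand-in bag-of-characters vector, not a Bedrock model call, and `chunk` and `search` are simplified illustrations of the steps Knowledge Bases automates:

```python
import math

def chunk(text: str, size: int = 40) -> list[str]:
    """Step 1: split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dims: int = 8) -> list[float]:
    """Step 2 (toy stand-in for a Bedrock embedding model):
    bag-of-characters folded into `dims` buckets, L2-normalized."""
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Similarity between two normalized vectors (dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Step 3: a "vector store" as a list of (chunk, embedding) pairs.
store = [(c, embed(c)) for c in chunk(
    "Schema evolution keeps downstream consumers working while tables change.")]

def search(query: str) -> str:
    """Retrieve the stored chunk most similar to the query embedding."""
    q = embed(query)
    return max(store, key=lambda pair: cosine(q, pair[1]))[0]
```

A real deployment swaps `embed` for a Bedrock model and `store` for a supported vector store; the chunk-embed-store-retrieve shape stays the same.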

āš ļø Exam Trap: Schema evolution on S3 files requires planning. Adding a column to a Parquet file doesn't update existing files — Athena handles this gracefully (returns NULL for the new column in old files). But removing or renaming a column breaks queries that reference the old name. Apache Iceberg handles schema evolution more gracefully than plain S3 tables — another reason the exam tests it.

Reflection Question: A Glue ETL job writes Parquet data to S3 with 20 columns. A new business requirement adds 3 columns. What happens when Athena queries files written before the change? What happens with Apache Iceberg?

Written by Alvin Varughese
Founder • 15 professional certifications