3.3. Data Cataloging Systems
First Principle: Imagine a data lake with 5,000 tables and no catalog: analysts spend more time finding data than analyzing it. A data catalog is the card catalog of your data lake. Without it, data is discoverable only by people who already know it exists, like a library where every book is somewhere on the shelves but finding a specific one means wandering every aisle. A data catalog records what data exists, where it lives, what format it's in, and what the columns mean, making the data lake searchable and governable.
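To make the "what a catalog records" idea concrete, here is a minimal in-memory sketch, not how any AWS service is implemented. The `CatalogEntry` class, the `search` helper, and the sample tables and S3 paths are all hypothetical names invented for illustration; the point is only that each entry captures name, location, format, and column meanings, and that this metadata is what makes data findable.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One table's metadata: what it is, where it lives, its format, its columns."""
    name: str
    location: str
    data_format: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> meaning


def search(catalog: list[CatalogEntry], keyword: str) -> list[str]:
    """Return names of tables whose name or column metadata mentions the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in catalog:
        haystack = [entry.name, *entry.columns.keys(), *entry.columns.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry.name)
    return hits


# Hypothetical entries for illustration only.
catalog = [
    CatalogEntry(
        name="customers",
        location="s3://example-lake/crm/customers/",
        data_format="parquet",
        columns={"customer_id": "Unique customer key", "email": "Primary contact email"},
    ),
    CatalogEntry(
        name="web_clicks",
        location="s3://example-lake/logs/web_clicks/",
        data_format="json",
        columns={"session_id": "Browser session", "url": "Page visited"},
    ),
]

print(search(catalog, "customer"))  # ['customers']
```

Without the entries in `catalog`, the only way to answer "where's the customer data?" would be to open files one by one, which is the library-with-no-card-catalog situation described above.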
Without a catalog, data lakes devolve into "data swamps": terabytes of files that nobody can find, understand, or trust. Teams duplicate data because they can't discover what already exists. Compliance teams can't audit what data is stored where. New analysts spend weeks asking colleagues "where's the customer data?" instead of querying a catalog.
The v1.1 exam update draws a distinction between technical catalogs (schema and partition metadata for query engines) and business catalogs (human-readable descriptions, ownership, glossaries for data governance). The Glue Data Catalog serves the technical role; Amazon SageMaker Catalog serves the business role. How do you decide which one a question is asking about? Technical signals: "crawlers," "schema discovery," "partition sync," "Athena queries." Business signals: "data portal," "business glossary," "data ownership," "data discovery."
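As a study aid, the signal words above can be turned into a small keyword check. This is only a mnemonic sketch: the function name `classify_question` and the simple count-the-matches rule are assumptions made for illustration, and real exam questions need to be read in full rather than pattern-matched.

```python
# Signal phrases taken from the paragraph above, stored as lowercase stems
# so that "crawler" also matches "crawlers" and "athena quer" matches
# both "Athena query" and "Athena queries".
TECHNICAL_SIGNALS = ("crawler", "schema discovery", "partition sync", "athena quer")
BUSINESS_SIGNALS = ("data portal", "business glossary", "data ownership", "data discovery")


def classify_question(text: str) -> str:
    """Guess which catalog a question points to by counting signal phrases."""
    lowered = text.lower()
    technical = sum(signal in lowered for signal in TECHNICAL_SIGNALS)
    business = sum(signal in lowered for signal in BUSINESS_SIGNALS)
    if technical > business:
        return "technical (Glue Data Catalog)"
    if business > technical:
        return "business (Amazon SageMaker Catalog)"
    return "unclear"  # no signals, or an even mix: read the question again


print(classify_question("A crawler handles schema discovery for new S3 data."))
print(classify_question("Analysts need a data portal with a business glossary."))
```

A tie or zero matches returns "unclear" on purpose: a question that mixes both kinds of signals usually hinges on some other detail in the scenario.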