3.3.2. Business Data Catalogs: Amazon SageMaker Catalog
💡 First Principle: Technical catalogs tell machines where data lives; business catalogs tell humans what data means. The Glue Data Catalog knows that table orders_raw has a column cust_id of type string, but it doesn't know that cust_id represents the customer's unique identifier, is owned by the CRM team, is classified as PII, and should not be shared with external partners. Business catalogs bridge this gap between technical metadata and organizational context.
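To make the cust_id example concrete, here is a minimal sketch in plain Python. It is not the Glue or SageMaker Catalog API; the class names TechnicalColumn and BusinessAnnotation and the helper can_share_externally are hypothetical, chosen only to show which facts live in which kind of catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechnicalColumn:
    """What a technical catalog records: location, name, and type."""
    table: str
    name: str
    data_type: str


@dataclass(frozen=True)
class BusinessAnnotation:
    """Hypothetical business context layered on top of a column."""
    description: str
    owner: str
    classification: str
    shareable_externally: bool


# The technical view of the column from the example above.
cust_id = TechnicalColumn(table="orders_raw", name="cust_id", data_type="string")

# The business view, keyed by (table, column). Illustrative values only.
annotations = {
    ("orders_raw", "cust_id"): BusinessAnnotation(
        description="Customer's unique identifier",
        owner="CRM team",
        classification="PII",
        shareable_externally=False,
    )
}


def can_share_externally(column, annotations):
    """Deny by default when a column has no business annotation."""
    note = annotations.get((column.table, column.name))
    return note is not None and note.shareable_externally


print(can_share_externally(cust_id, annotations))  # False
```

The technical record alone cannot answer the sharing question; only the annotation can. Denying by default for unannotated columns is a design choice made for this sketch, not a documented behavior of either service.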
Amazon SageMaker Catalog (added in v1.1, previously part of Amazon DataZone) provides a business data catalog with:
Data portal. A web interface where analysts and data scientists browse, search, and request access to datasets. Non-technical users can find data without knowing S3 paths or table names.
Business glossary. Define standard terms ("customer," "revenue," "churn rate") with agreed-upon definitions. This prevents the perennial problem of different teams using the same word to mean different things.
Data ownership and access requests. Each dataset has an owner who approves or denies access requests. This decentralizes governance: the team that knows the data best controls who can use it.
Integration with Glue Data Catalog. SageMaker Catalog layers business metadata on top of Glue's technical metadata. The technical catalog powers query engines; the business catalog powers human discovery and governance.
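The four features above can be sketched together as a toy in-memory model. This is a sketch under stated assumptions, not the SageMaker Catalog API: the Dataset and BusinessCatalog classes, their methods, and the sample values are all hypothetical. It shows business-term search over datasets that each point back to a technical table name, plus owner-only approval of access requests.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    glue_table: str        # technical name, as a query engine would see it
    business_name: str     # human-friendly name shown in the portal
    owner: str             # team that decides on access requests
    glossary_terms: set    # standard business terms attached to the dataset
    approved_users: set = field(default_factory=set)


class BusinessCatalog:
    """Hypothetical business layer over technical table entries."""

    def __init__(self):
        self.datasets = []

    def publish(self, dataset):
        self.datasets.append(dataset)

    def search(self, term):
        """Find datasets by business name or glossary term, not by S3 path."""
        term = term.lower()
        return [
            d for d in self.datasets
            if term in d.business_name.lower()
            or term in {t.lower() for t in d.glossary_terms}
        ]

    def approve_access(self, dataset, user, approver):
        """Only the dataset's owner can grant access."""
        if approver != dataset.owner:
            return False
        dataset.approved_users.add(user)
        return True


catalog = BusinessCatalog()
orders = Dataset("orders_raw", "Customer Orders", "CRM team", {"customer", "revenue"})
catalog.publish(orders)

hits = catalog.search("revenue")
print([d.glue_table for d in hits])                               # ['orders_raw']
print(catalog.approve_access(orders, "analyst_a", "Finance team"))  # False
print(catalog.approve_access(orders, "analyst_a", "CRM team"))      # True
```

An analyst searches for "revenue" and lands on orders_raw without knowing the table name, and the request succeeds only when the owning team approves it, which mirrors the decentralized governance described above.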
⚠️ Exam Trap: SageMaker Catalog (business catalog) and Glue Data Catalog (technical catalog) are complementary, not competing. If a question asks about "data discovery for business users" or "data governance portal," it's SageMaker Catalog. If it asks about "schema discovery" or "partition management," it's Glue Data Catalog. A modern data architecture uses both.
Reflection Question: An organization's data lake has 500 tables. Data scientists can't find the right datasets and frequently create duplicates. What kind of catalog solves this, and what features would help?