2.3.1. Data Quality Validation for FM Consumption
💡 First Principle: Data validation for FM pipelines must happen at two levels: structural validation (is the data in the expected format and size?) and semantic validation (does the data actually contain the information the FM needs?). Traditional ETL validation catches structural issues; FM-specific validation must also catch semantic quality problems.
The validation pipeline architecture:
AWS Glue Data Quality — declarative data quality rules evaluated at scale:
```python
# Glue Data Quality ruleset example for FM training data (DQDL)
ruleset = """
Rules = [
    IsComplete "content",                          # No null content fields
    ColumnLength "content" > 50,                   # Minimum meaningful content length
    ColumnLength "content" < 4000,                 # Under token-limit threshold
    IsComplete "document_id",
    IsUnique "document_id",                        # No duplicate documents
    ColumnValues "language" in ["en", "es", "fr"]  # Supported languages only
]
"""
```
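Conceptually, each DQDL rule above maps to a per-record predicate. A minimal local mirror of those rules — a hypothetical helper for unit-testing rule logic before promoting it to Glue Data Quality, not the Glue API itself:

```python
# Hypothetical local mirror of the DQDL rules above; useful for testing
# rule logic before deploying a Glue Data Quality ruleset (illustrative only).
SUPPORTED_LANGUAGES = {"en", "es", "fr"}

def check_record(record, seen_ids):
    """Return the names of the rules this record violates."""
    failures = []
    content = record.get("content")
    doc_id = record.get("document_id")

    if not content:
        failures.append("IsComplete content")
    else:
        if len(content) <= 50:
            failures.append("ColumnLength content > 50")
        if len(content) >= 4000:
            failures.append("ColumnLength content < 4000")

    if not doc_id:
        failures.append("IsComplete document_id")
    elif doc_id in seen_ids:
        failures.append("IsUnique document_id")
    else:
        seen_ids.add(doc_id)  # track IDs across the batch for uniqueness

    if record.get("language") not in SUPPORTED_LANGUAGES:
        failures.append("ColumnValues language")
    return failures
```

In practice the same ruleset string would be registered with Glue and evaluated at scale against the data-lake table; the local mirror only documents the intent of each rule.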
SageMaker Data Wrangler handles the transformation layer: normalizing text encoding (critical for multilingual content), standardizing date formats, cleaning HTML/markdown artifacts, and handling missing values with FM-appropriate defaults.
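These transformations can be prototyped in plain Python before building the Data Wrangler flow. A rough sketch — the regexes, the default placeholder, and the function name are illustrative assumptions, not Data Wrangler's implementation:

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper (illustrative)
WS_RE = re.compile(r"\s+")

def clean_content(raw, default="[no description]"):
    """Normalize encoding, strip HTML artifacts, and fill missing values."""
    if raw is None or not raw.strip():
        return default                           # FM-appropriate default for missing text
    text = unicodedata.normalize("NFC", raw)     # canonical Unicode form
    text = html.unescape(text)                   # &amp; -> &, etc.
    text = TAG_RE.sub(" ", text)                 # drop markup tags
    return WS_RE.sub(" ", text).strip()          # collapse whitespace
```

NFC normalization matters for multilingual content because visually identical accented characters can be encoded as different code-point sequences, which changes tokenization.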
Lambda-based FM-specific validation:
```python
import tiktoken

def validate_for_fm(record):
    # Token count estimation (tiktoken's cl100k_base approximates Bedrock token counts)
    enc = tiktoken.get_encoding("cl100k_base")
    token_count = len(enc.encode(record["content"]))

    # Validate against the model context window, leaving headroom for the response.
    # Claude 3 supports 200K tokens; 160K leaves ~20% for system prompt + response.
    if token_count > 160_000:
        return False, f"Token count {token_count} exceeds safe threshold"

    # Check encoding -- a common issue with PDF extraction
    try:
        record["content"].encode("utf-8")
    except UnicodeEncodeError:
        return False, "Invalid UTF-8 encoding"

    return True, None
```
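Downstream, the validator feeds a quarantine step: failing records are routed to a dead-letter location (for example, an S3 quarantine prefix) instead of reaching Bedrock. A minimal partitioning sketch, with the validator passed in as a parameter so the routing logic stays independent of tiktoken:

```python
def partition_records(records, validator):
    """Split records into (valid, quarantined) using a validate_for_fm-style
    validator that returns an (ok, reason) tuple."""
    valid, quarantined = [], []
    for record in records:
        ok, reason = validator(record)
        if ok:
            valid.append(record)
        else:
            # Preserve the failure reason for the quarantine audit trail
            quarantined.append({**record, "quarantine_reason": reason})
    return valid, quarantined
```

In a Lambda, the `quarantined` list would typically be written to an S3 quarantine prefix or an SQS dead-letter queue for later review, while `valid` records continue to Bedrock.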
⚠️ Exam Trap: AWS Glue Data Quality and SageMaker Data Wrangler serve different roles in the pipeline. Glue Data Quality validates existing data in your data lake against quality rules. Data Wrangler transforms and prepares data with a visual interface. Exam scenarios that ask about "enforcing data quality rules at scale" point to Glue; scenarios about "feature engineering and transformation" point to Data Wrangler.
Reflection Question: Your FM pipeline processes 50,000 product descriptions nightly from a legacy database. Three percent of records have encoding issues that cause the FM to produce garbled output, yet no errors are raised anywhere in your pipeline. What two AWS services would you add to detect and quarantine these records before they reach Bedrock?