2.3.2. Multimodal Data Processing
💡 First Principle: Multimodal data cannot be passed directly to an FM — images must be base64-encoded, audio must be transcribed to text, PDFs must have text extracted, and all inputs must be assembled into the model-specific JSON structure before the API call. The FM never touches your raw files.
Processing pipelines by modality:
| Input Type | Extraction Service | FM-Ready Format | Key Consideration |
|---|---|---|---|
| PDF documents | Amazon Textract (for scanned) or direct text extraction | Text blocks in JSON | OCR confidence score — low confidence → poor FM quality |
| Images | None needed for multimodal models (e.g., Claude 3) | base64-encoded in request body | Max request size limit; the model cannot fetch an S3 URI directly |
| Audio/video | Amazon Transcribe | Transcript text with speaker labels | Speaker diarization for meeting notes use cases |
| Tabular data (CSV/Excel) | Lambda/Pandas transformation | Markdown table or structured JSON | FMs understand markdown tables better than raw CSV |
| HTML/web content | Lambda HTML parser | Clean text (strip tags, scripts) | Boilerplate navigation HTML degrades context quality |
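The tabular-data row above can be sketched in code. This is a minimal stdlib-only example (in practice a Lambda might use pandas instead); the sample CSV content is illustrative:

```python
import csv
import io

# Illustrative CSV payload, e.g. the body of an S3 object
csv_data = "sku,description,qty\nA-100,Widget,4\nB-200,Gadget,2"
rows = list(csv.reader(io.StringIO(csv_data)))
header, body = rows[0], rows[1:]

# Build a pipe-delimited markdown table, which FMs parse
# more reliably than raw comma-separated text
lines = ["| " + " | ".join(header) + " |",
         "|" + "---|" * len(header)]
lines += ["| " + " | ".join(r) + " |" for r in body]
markdown_table = "\n".join(lines)

prompt = f"Summarize the order below:\n\n{markdown_table}"
```

The resulting `markdown_table` string is embedded directly in the prompt text sent to the model.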
Bedrock Data Automation — the managed service for multimodal document processing at scale. It handles PDFs, images, audio, and video with standardized output formats, eliminating the need to build and maintain custom extraction pipelines:
```python
# Bedrock Data Automation for PDF batch processing.
# Note: invoke_data_automation_async lives on the runtime client
# ('bedrock-data-automation-runtime'), not the 'bedrock-data-automation'
# control-plane client used to manage projects and blueprints.
import boto3

bda_runtime = boto3.client('bedrock-data-automation-runtime')

response = bda_runtime.invoke_data_automation_async(
    inputConfiguration={
        's3Uri': 's3://my-bucket/invoices/',
        'documentConfiguration': {'parsingStrategy': 'AUTO'}
    },
    outputConfiguration={'s3Uri': 's3://my-bucket/extracted/'},
    dataAutomationConfiguration={
        'dataAutomationProjectArn': 'arn:aws:bedrock:...:data-automation-project/invoice-extraction'
    }
)
# The call is asynchronous: poll the returned invocation ARN for
# completion, then read the standardized JSON output from the output
# S3 location.
```
⚠️ Exam Trap: Images are encoded as base64 and embedded directly in the Bedrock API request body — not referenced by S3 URL. When exam scenarios describe "sending an image to Bedrock for analysis," the correct architecture includes a Lambda function that reads the image from S3, base64-encodes it, and constructs the multimodal API payload. The FM cannot read from S3 independently.
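The Lambda pattern described above can be sketched as follows. The payload shape follows the Anthropic Messages format used by Claude 3 models on Bedrock; the image bytes stand in for an `s3.get_object(...)` read (omitted), and the model ID in the final comment is a placeholder:

```python
import base64
import json

# Pretend bytes read from S3 via s3.get_object(...)["Body"].read()
image_bytes = b"\x89PNG..."

# Step 1: base64-encode the raw bytes — the model never sees the S3 URI
encoded = base64.b64encode(image_bytes).decode("utf-8")

# Step 2: construct the multimodal request body (Anthropic Messages format)
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": encoded}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
})

# Step 3: send the payload (placeholder model ID):
# bedrock_runtime.invoke_model(modelId="anthropic.claude-3-...", body=body)
```

Because the encoded image is embedded inline, the total request size (not just the image file size) must stay under the Bedrock payload limit.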
Reflection Question: You need to build a system that processes 10,000 scanned invoices per day, extracts line items, and generates a structured JSON summary using an FM. What is the complete processing pipeline, naming each AWS service involved?