1.2.3. Multimodal Inputs and Processing
💡 First Principle: Multimodal models process and generate across more than one data type — text, images, audio, documents — within the same context window. This unlocks use cases that pure text models cannot handle, but introduces different data processing requirements upstream.
The key multimodal capabilities relevant to AIF-C01:
| Modality | AWS Services | Typical Use Case |
|---|---|---|
| Text → Text | Bedrock (all text FMs) | Summarization, Q&A, generation |
| Image → Text | Claude 3.x via Bedrock, Titan Multimodal | Document understanding, image analysis |
| Text → Image | Stable Diffusion, Titan Image Generator via Bedrock | Image generation |
| Audio → Text | Amazon Transcribe (preprocessing) | Meeting notes, voice interfaces |
| Document → Text | Amazon Textract + Bedrock, Bedrock Data Automation | PDF extraction, form processing |
Processing pipeline for multimodal data: Raw data (PDFs, images, audio) is not sent to an FM without preprocessing. A typical pipeline:
- Extract text/structure from documents (Textract, Transcribe)
- Normalize data format to FM-expected input (JSON/base64)
- Chunk for context window management
- Embed and index for retrieval
- Invoke the FM with properly formatted multimodal content
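The chunking step above can be sketched as a simple fixed-size splitter with overlap — a minimal illustration only; the chunk size and overlap values are arbitrary assumptions, and production systems often split on semantic boundaries (paragraphs, sections) instead:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split extracted text into overlapping chunks for context-window management.

    Overlap preserves continuity across chunk boundaries so that a sentence
    straddling two chunks is fully visible in at least one of them.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be embedded and indexed (e.g., into a vector store) before retrieval-time use.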
⚠️ Exam Trap: Images are passed to multimodal Bedrock models as base64-encoded bytes within the API request body — not as S3 URLs. The model cannot reach out to S3 directly. Preprocessing and encoding must happen in your Lambda or application layer before the Bedrock API call.
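A minimal sketch of that encoding step, assuming a Claude 3 model invoked through the Bedrock `InvokeModel` API with the Anthropic Messages request format; the model ID, file name, and question in the usage comment are placeholders:

```python
import base64
import json

def build_claude_image_request(image_bytes: bytes, question: str,
                               media_type: str = "image/jpeg") -> str:
    """Build an InvokeModel request body containing a base64-encoded image.

    The image bytes are encoded in the application layer and embedded in the
    request body — the model never fetches from S3 itself.
    """
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text", "text": question},
            ],
        }],
    })

# The actual call happens in your Lambda or application layer, e.g.:
# import boto3
# bedrock = boto3.client("bedrock-runtime")
# response = bedrock.invoke_model(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0",
#     body=build_claude_image_request(open("invoice.jpg", "rb").read(),
#                                     "What is the total amount due?"),
# )
```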
Reflection Question: You need to build a system that answers questions about scanned PDF invoices. What processing steps occur before a foundation model ever sees the data?