Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

1.2.3. Multimodal Inputs and Processing

💡 First Principle: Multimodal models process and generate across more than one data type — text, images, audio, documents — within the same context window. This unlocks use cases that pure text models cannot handle, but introduces different data processing requirements upstream.

The key multimodal capabilities relevant to AIP-C01:

| Modality | AWS Services | Typical Use Case |
|---|---|---|
| Text → Text | Bedrock (all text FMs) | Summarization, Q&A, generation |
| Image → Text | Claude 3.x via Bedrock, Titan Multimodal | Document understanding, image analysis |
| Text → Image | Stable Diffusion via Bedrock | Image generation |
| Audio → Text | Amazon Transcribe (pre-processing) | Meeting notes, voice interfaces |
| Document → Text | Amazon Textract + Bedrock, Bedrock Data Automation | PDF extraction, form processing |
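For the Document → Text row, a common pattern is to call Textract first and flatten its block list into plain text before anything reaches Bedrock. A minimal sketch, assuming the standard `DetectDocumentText` response shape (the sample response and helper name are illustrative, not part of any AWS SDK):

```python
def extract_lines(textract_response: dict) -> str:
    """Flatten a Textract DetectDocumentText response into plain text
    by keeping only LINE blocks, in the order Textract returns them."""
    lines = [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)

# In a real pipeline this would follow a boto3 call (credentials required):
# client = boto3.client("textract")
# response = client.detect_document_text(Document={"Bytes": page_bytes})
# text = extract_lines(response)
```

The actual Textract call is left commented because it needs AWS credentials; the flattening step itself is pure and easy to unit-test.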

Processing pipeline for multimodal data: Raw data (PDFs, images, audio) should never be sent directly to an FM without preprocessing. The typical pipeline:

  1. Extract text/structure from documents (Textract, Transcribe)
  2. Normalize data format to FM-expected input (JSON/base64)
  3. Chunk for context window management
  4. Embed and index for retrieval
  5. Invoke the FM with properly formatted multimodal content
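Steps 2–3 above can be sketched as a simple overlapping chunker: extracted text is split so each piece fits within a context window, with a small overlap so meaning is preserved across chunk boundaries. The sizes here are illustrative, not tied to any specific model:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_chars characters,
    overlapping adjacent chunks by `overlap` characters."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```

Production systems usually chunk by tokens (or by semantic boundaries such as paragraphs) rather than raw characters, but the mechanics are the same.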

⚠️ Exam Trap: Images are passed to multimodal Bedrock models as base64-encoded bytes within the API request body — not as S3 URLs. The model cannot reach out to S3 directly. Preprocessing and encoding must happen in your Lambda or application layer before the Bedrock API call.
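A minimal sketch of that encoding step, assuming a Claude 3 model on Bedrock using the Anthropic Messages request format; the model ID, prompt, and helper name are placeholders, not prescribed values:

```python
import base64
import json

def build_image_request(image_bytes: bytes, prompt: str,
                        media_type: str = "image/png") -> str:
    """Build an InvokeModel request body carrying an inline
    base64-encoded image plus a text prompt (Claude 3 Messages format)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text", "text": prompt},
            ],
        }],
    })

# Your application layer (e.g. Lambda) then sends the encoded body:
# client = boto3.client("bedrock-runtime")
# client.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0",
#                     body=build_image_request(png_bytes, "Summarize this invoice."))
```

Note that the image travels inside the request body itself; no S3 URL appears anywhere in the payload.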

Reflection Question: You need to build a system that answers questions about scanned PDF invoices. What processing steps occur before a foundation model ever sees the data?

Written by Alvin Varughese
Founder · 15 professional certifications