5.1. Text and Multimodal Solutions
💡 First Principle: A generative model is a flexible text engine, so "text analysis" in Foundry is often just a well-crafted prompt asking the model to classify sentiment, pull entities, or summarize — no separate service required. A multimodal model extends the same idea to images: you send a picture plus a text question, and the one model reasons over both.
Why care? The exam wants you to know that modern text and vision understanding tasks can run through a single deployed model with the right prompt, rather than stitching together one service per task. It also tests the distinction between a model that only reads text and a multimodal one that can also interpret an image you pass in.
⚠️ Common Misconception: "A multimodal model needs a separate model for each input type." A multimodal model accepts multiple input types — text and images together — within one model and one call. You don't chain a vision service into a text model; you send both to a model built to handle them.