5.3.2. Building a Vision-Capable Application
💡 First Principle: A "vision-capable" application is just one that includes a vision step in its workflow — either understanding (send an image, get information via a multimodal model) or generation (send a prompt, get an image). The two-client pattern still applies; the vision capability is what you deploy and call.
Putting Phase 5 together: a real app might accept a product photo, use a multimodal model to extract a description (understanding), then use an image-generation model to create marketing variants (generation) — two vision capabilities in one workflow, both reached through the same project and OpenAI-compatible clients. The exam's "build a lightweight application that includes vision capabilities" objective is satisfied by recognizing this: deploy the right vision-capable model, send the appropriate input, handle the output.
⚠️ Exam Trap: "Vision-capable" doesn't automatically mean "image-generating." It can mean the app understands images, generates them, or both. Match the capability (multimodal understanding vs. image generation) to what the scenario actually requires.
Reflection Question: An app lets users photograph a room and then shows them AI-generated redecoration ideas. Which two distinct vision capabilities does it use, and in what order?