5.2.2. Responding to Spoken Prompts
💡 First Principle: A spoken-prompt experience chains four pieces: capture the audio, convert it into a form the model understands, let the model respond, and optionally speak the response back. A multimodal model that accepts audio can simplify this by taking the spoken prompt directly, rather than always running a separate transcription step first.
The classic pipeline is recognition → model → synthesis: hear the user (speech-to-text), reason about the request (model call), and reply in voice (text-to-speech). A multimodal model capable of audio input can collapse the first step, accepting the spoken prompt and responding directly. Either way, the user experience is "talk to the app, the app talks back," and Foundry supplies both the model and the Azure Speech tooling to build it.
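The two designs can be sketched side by side. The code below is a minimal illustration, not Azure SDK code: every name in it (`respond_to_spoken_prompt`, `VoiceReply`, and the `recognize`, `text_model`, `audio_model`, and `synthesize` callables) is a hypothetical placeholder. In a real app, each callable would wrap an Azure Speech call or a call to a model deployed in Foundry.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures. Each one stands in for a real service call
# (speech-to-text, a model call, text-to-speech) that this sketch does not make.
Recognizer = Callable[[bytes], str]    # audio in, transcript out
TextModel = Callable[[str], str]       # transcript in, reply text out
AudioModel = Callable[[bytes], str]    # audio in, reply text out (multimodal)
Synthesizer = Callable[[str], bytes]   # reply text in, spoken audio out


@dataclass
class VoiceReply:
    text: str
    audio: Optional[bytes]  # None when no synthesizer was supplied


def respond_to_spoken_prompt(
    audio: bytes,
    *,
    recognize: Optional[Recognizer] = None,
    text_model: Optional[TextModel] = None,
    audio_model: Optional[AudioModel] = None,
    synthesize: Optional[Synthesizer] = None,
) -> VoiceReply:
    """Run either the classic three-step pipeline or the multimodal shortcut."""
    if audio_model is not None:
        # Multimodal path: the model takes the spoken prompt directly,
        # so there is no separate recognition step.
        reply_text = audio_model(audio)
    elif recognize is not None and text_model is not None:
        # Classic path: recognition, then the model call.
        transcript = recognize(audio)
        reply_text = text_model(transcript)
    else:
        raise ValueError(
            "Provide audio_model, or both recognize and text_model."
        )

    # Synthesis is optional in both designs.
    reply_audio = synthesize(reply_text) if synthesize is not None else None
    return VoiceReply(text=reply_text, audio=reply_audio)
```

Passing `recognize` and `text_model` exercises the classic recognition → model → synthesis chain; passing only `audio_model` skips recognition. Leaving `synthesize` out returns a text-only reply, which matches the "optionally speak the response back" part of the principle.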
⚠️ Exam Trap: Don't assume every voice app needs a separate transcription service before the model. A multimodal model that accepts audio can take the spoken prompt directly. The separate recognition step is one valid design, not a universal requirement.
Reflection Question: Describe the three-step recognition-model-synthesis pipeline for a voice assistant. Where could a multimodal audio model let you skip a step?