5.2. Speech Solutions
💡 First Principle: Speech is about converting between audio and text in two directions. Speech recognition turns spoken audio into text (so a machine can act on what was said); speech synthesis turns text into spoken audio (so a machine can talk back). Azure Speech, available through Foundry Tools, provides both, and modern multimodal models can also handle spoken input directly.
Why care? The exam tests the two directions and which one a scenario needs. "Transcribe this meeting" is recognition; "have the app read the answer aloud" is synthesis. A voice assistant uses both: recognition to hear you, synthesis to reply.
⚠️ Common Misconception: "Speech recognition and speech synthesis are one feature." They're opposite conversions. Recognition is audio-to-text; synthesis is text-to-audio. A scenario usually needs one specific direction, and picking the wrong one is a classic distractor.
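To make the two directions concrete, here is a minimal sketch that models each one as a function type and wires up the three scenarios mentioned above. Everything in it is hypothetical and for illustration only: the `Audio` class, `fake_recognize`, `fake_synthesize`, and the scenario functions are stand-ins, not part of Azure Speech or any SDK. In a real app, the two stand-in functions would be replaced by calls to a speech service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Audio:
    """Hypothetical placeholder for audio; real audio would be waveform bytes."""
    spoken_words: str


# The two directions, expressed as function types.
Recognizer = Callable[[Audio], str]   # speech recognition: audio -> text
Synthesizer = Callable[[str], Audio]  # speech synthesis:   text  -> audio


def fake_recognize(audio: Audio) -> str:
    """Stand-in for a speech-to-text call (assumption, not a real API)."""
    return audio.spoken_words


def fake_synthesize(text: str) -> Audio:
    """Stand-in for a text-to-speech call (assumption, not a real API)."""
    return Audio(spoken_words=text)


def transcribe_meeting(recording: Audio, recognize: Recognizer) -> str:
    # "Transcribe this meeting" needs recognition only.
    return recognize(recording)


def read_answer_aloud(answer: str, synthesize: Synthesizer) -> Audio:
    # "Have the app read the answer aloud" needs synthesis only.
    return synthesize(answer)


def voice_assistant_turn(
    heard: Audio,
    recognize: Recognizer,
    synthesize: Synthesizer,
    answer_fn: Callable[[str], str],
) -> Audio:
    # A voice assistant uses both directions in one turn.
    question = recognize(heard)    # audio -> text
    answer = answer_fn(question)   # text -> text (application logic)
    return synthesize(answer)      # text -> audio


if __name__ == "__main__":
    reply = voice_assistant_turn(
        Audio("what time is it"),
        fake_recognize,
        fake_synthesize,
        lambda q: f"You asked: {q}",
    )
    print(reply.spoken_words)
```

Note how the type signatures alone answer the exam question: if the scenario starts with audio and ends with text, it is recognition; if it starts with text and ends with audio, it is synthesis.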