5.2.1. Speech Recognition and Synthesis
💡 First Principle: Recognition and synthesis are mirror images. Recognition (speech-to-text) is the input side — capturing what a person said as text your app can process. Synthesis (text-to-speech) is the output side — turning your app's text response into natural-sounding audio. Naming the direction names the capability.
Azure Speech in Foundry Tools provides both directions, with options such as choosing a voice for synthesis or choosing between real-time and batch transcription for recognition. At the fundamentals level, the core requirement is knowing the two directions and that Azure Speech covers both.
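To make the two directions concrete, here is a minimal sketch using the Azure Speech SDK for Python (the `azure-cognitiveservices-speech` package). It is an illustration under assumptions, not a reference implementation: the environment variable names `SPEECH_KEY` and `SPEECH_REGION`, the helper names, and the choice of the `en-US-JennyNeural` voice are all placeholders picked for this example, and writing code is beyond what a fundamentals exam asks of you.

```python
import os

# Assumed environment variable names; substitute whatever your setup uses.
KEY_ENV = "SPEECH_KEY"
REGION_ENV = "SPEECH_REGION"

# Each capability is named by what goes in and what comes out.
DIRECTIONS = {
    "recognition": ("audio", "text"),  # speech-to-text
    "synthesis": ("text", "audio"),    # text-to-speech
}


def _speech_config():
    """Build a SpeechConfig from the assumed environment variables."""
    # Imported lazily so this module loads even where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(
        subscription=os.environ[KEY_ENV],
        region=os.environ[REGION_ENV],
    )
    return speechsdk, config


def transcribe_file(wav_path: str) -> str:
    """Recognition: audio in, text out."""
    speechsdk, config = _speech_config()
    audio = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config, audio_config=audio)
    result = recognizer.recognize_once()  # recognizes a single utterance
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""  # no match or cancelled; a real app would inspect result.reason


def speak_to_file(text: str, wav_path: str, voice: str = "en-US-JennyNeural") -> bool:
    """Synthesis: text in, audio out."""
    speechsdk, config = _speech_config()
    config.speech_synthesis_voice_name = voice  # the "choose a voice" option
    audio = speechsdk.audio.AudioOutputConfig(filename=wav_path)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=audio)
    result = synthesizer.speak_text_async(text).get()
    return result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted
```

With a valid key and region set, `transcribe_file("interview.wav")` would return the recognized text and `speak_to_file("Turn left in 200 meters.", "directions.wav")` would write an audio file; both file names are hypothetical. Note that the function signatures mirror the directions: one takes audio and returns text, the other takes text and produces audio.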
⚠️ Exam Trap: "Convert this recorded interview into a written transcript" is recognition (speech-to-text). "Generate an audio version of this article" is synthesis (text-to-speech). Read the direction of the conversion, not just the word "speech."
Reflection Question: A navigation app listens for "find the nearest gas station" and then announces directions aloud. Which speech capability handles the listening, and which handles the announcing?