6.2. Process and Translate Speech
First Principle: Speech is the most natural human interface, but computers process text. Speech services bridge this gap bidirectionally: STT (Speech-to-Text) converts user speech into text your app can process; TTS (Text-to-Speech) converts your app's text responses into audio users can hear. Think of it like a translator at a conference: one direction captures what's said, the other broadcasts the response.
What breaks without proper speech configuration:
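- Without the correct locale setting (for example en-US vs en-GB), the recognizer applies the wrong pronunciation model, so a word like "schedule", which is pronounced differently in US and UK English, can be misrecognized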
- Without SSML markup, synthesized speech sounds robotic and monotone
- Without custom speech models, domain terms like "Azure" are transcribed as "azure" (lowercase) or misrecognized entirely
- Without proper audio format configuration, real-time applications suffer latency or quality issues
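To make the SSML point concrete, here is a minimal sketch that wraps plain text in SSML with a voice and prosody adjustments. It uses only the Python standard library. The helper name build_ssml, the default voice en-US-JennyNeural, and the rate and pitch values are illustrative assumptions, not settings taken from this guide.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               lang: str = "en-US", rate: str = "-10%", pitch: str = "+5%") -> str:
    """Wrap plain text in SSML with a voice and prosody adjustments.

    The defaults are illustrative assumptions; pick values that suit your app.
    """
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}'  # escape &, <, > so the document stays well-formed
        f'</prosody></voice></speak>'
    )


ssml = build_ssml("Your order shipped & will arrive Friday.")
```

With the Azure Speech SDK, a string like this would typically be passed to SpeechSynthesizer.speak_ssml_async instead of speak_text_async.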
Consider a call center application: customers speak naturally, the system transcribes their speech, processes intent, generates a response, and speaks it back, all in under 2 seconds. Each step requires different SDK classes: SpeechRecognizer for input, SpeechSynthesizer for output. The exam tests whether you know which class handles which direction.
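One turn of that call center loop can be sketched with the Azure Speech SDK for Python (the azure-cognitiveservices-speech package). This is a sketch under stated assumptions, not a production design: build_clients, handle_turn, and the respond callback are hypothetical names, the voice and locale defaults are placeholders, and intent processing is reduced to a single function call.

```python
from typing import Callable, Optional


def build_clients(key: str, region: str, locale: str = "en-US",
                  voice: str = "en-US-JennyNeural"):
    """Create one recognizer (input) and one synthesizer (output)."""
    # Imported lazily so handle_turn below can be exercised without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_recognition_language = locale   # STT locale
    config.speech_synthesis_voice_name = voice    # TTS voice (assumed default)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config)    # default microphone
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)  # default speaker
    return recognizer, synthesizer


def handle_turn(recognizer, synthesizer,
                respond: Callable[[str], str]) -> Optional[str]:
    """Run one speech -> text -> reply -> speech turn; return the reply text."""
    result = recognizer.recognize_once()          # STT: blocks until one utterance ends
    # Compare by enum member name so this function does not need the SDK import.
    if result.reason.name != "RecognizedSpeech":  # e.g. NoMatch or Canceled
        return None
    reply = respond(result.text)                  # app logic works on text only
    synthesizer.speak_text_async(reply).get()     # TTS: wait until synthesis completes
    return reply
```

Keeping the recognizer and synthesizer as separate objects mirrors the exam point: SpeechRecognizer handles the speech-to-text direction, SpeechSynthesizer the text-to-speech direction, and the application logic in between only sees text.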