5.2.4. Azure AI Speech Service
Azure AI Speech (also called "Azure Speech in Foundry Tools") handles spoken-language capabilities. Where Azure AI Language handles TEXT, Azure AI Speech handles AUDIO: it converts between speech and text and enables voice-powered applications.
Core principle: The Speech service covers the audio dimension of language. Any scenario involving spoken words, audio files, or voice interfaces likely needs the Speech service.
Key features include:
| Feature | What It Does | Input → Output |
|---|---|---|
| Speech-to-Text | Converts spoken audio to text | Audio → Text |
| Text-to-Speech | Converts text to spoken audio | Text → Audio |
| Speech Translation | Translates spoken language in real time | Audio (Language A) → Text or audio (Language B) |
| Speaker Recognition | Verifies or identifies speakers by their voice | Audio → Speaker ID |
| Language Identification | Identifies spoken language | Audio → Language code |
| Voice Assistants | Powers conversational voice interfaces | Voice commands → Actions |
Speech-to-Text deep dive: Speech-to-Text (STT) transcribes audio into text. Use cases include:
- Live meeting transcription
- Video captioning
- Voice commands for applications
- Call center conversation analysis
Real-time vs. Batch transcription:
| Mode | Response Time | Use Case |
|---|---|---|
| Real-time | Immediate | Live captioning, voice assistants |
| Batch | Minutes to hours | Processing audio archives |
Text-to-Speech deep dive: Text-to-Speech (TTS) generates spoken audio from text. Azure offers:
- Neural voices: High-quality, natural-sounding synthesis
- Custom Neural Voice: Train a voice on YOUR audio samples
- SSML support: Control pronunciation, speed, pitch, pauses
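To make the SSML bullet concrete, here is a minimal sketch of how an SSML document controlling speed, pitch, and pauses might be assembled. It uses only the Python standard library and does not call the Speech service. The helper name `build_ssml` and its parameters are hypothetical, and `en-US-JennyNeural` is used only as an example neural voice name; check which voices are available for your resource.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace used by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate="medium", pitch="medium", pause_ms=0):
    """Wrap plain text in SSML (hypothetical helper, not an SDK function).

    rate and pitch go on <prosody>; pause_ms adds a trailing <break>.
    """
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    ssml = (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}</prosody>{pause}'
        f'</voice></speak>'
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return ssml


# Slower, higher-pitched speech followed by a half-second pause.
sample = build_ssml("Welcome to the Q&A session.",
                    rate="slow", pitch="high", pause_ms=500)
print(sample)
```

The resulting string is what a Text-to-Speech request would receive in place of plain text. Note that special characters such as `&` must be escaped, which the sketch handles with `escape`.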
Speaker Recognition: Speaker recognition analyzes audio patterns to identify WHO is speaking:
- Speaker verification: "Is this the same person?" (1:1 comparison)
- Speaker identification: "Which person is this?" (1:N comparison)
Use cases include voice biometrics for authentication and multi-speaker meeting transcription.
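The 1:1 versus 1:N distinction can be illustrated with a toy sketch. This is not the Azure API and not how the service works internally. It assumes each voice has already been reduced to a small numeric vector (a hypothetical "voiceprint") and compares vectors with cosine similarity. The function names, the sample vectors, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(sample, enrolled, threshold=0.8):
    """1:1 comparison: is this the same person as the enrolled profile?"""
    return cosine(sample, enrolled) >= threshold


def identify(sample, profiles, threshold=0.8):
    """1:N comparison: which enrolled person is this, if any?"""
    if not profiles:
        return None
    best_id, best_score = max(
        ((pid, cosine(sample, vec)) for pid, vec in profiles.items()),
        key=lambda pair: pair[1],
    )
    return best_id if best_score >= threshold else None


# Toy voiceprints; real systems derive these from audio.
profiles = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
incoming = [0.9, 0.1, 0.0]

print(verify(incoming, profiles["alice"]))  # one claimed identity
print(identify(incoming, profiles))         # search all enrolled speakers
```

Verification takes one claimed identity and answers yes or no. Identification searches every enrolled profile and returns the best match, or nothing when no profile is close enough.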
Speech Translation: Unlike Azure Translator (text-to-text), Speech Translation works with SPOKEN language:
- Input: Spoken audio in one language
- Output: Spoken audio OR text in another language
- Real-time translation for conversations
⚠️ Exam Trap: Document translation and text-to-text translation are part of the Translator service, NOT the Speech service. If the question mentions SPOKEN audio, it's Speech. If it mentions TEXT or documents, it's Translator.
⚠️ Exam Tip: "Generate closed captions" = Speech-to-Text. "Read content aloud" = Text-to-Speech. "Translate conversations" = Speech Translation. "Identify who is speaking" = Speaker Recognition.
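As a study aid, the trap and the tip above can be written down as two small lookup functions. This is only a memorization sketch: the cue phrases are copied from the tip, while the function names and the `"audio"`, `"text"`, and `"document"` labels are hypothetical.

```python
# Cue phrases from the exam tip, mapped to the Speech feature they signal.
EXAM_CUES = {
    "generate closed captions": "Speech-to-Text",
    "read content aloud": "Text-to-Speech",
    "translate conversations": "Speech Translation",
    "identify who is speaking": "Speaker Recognition",
}


def pick_feature(question):
    """Return the Speech feature whose cue phrase appears in the question."""
    lowered = question.lower()
    for cue, feature in EXAM_CUES.items():
        if cue in lowered:
            return feature
    return None  # no known cue phrase found


def pick_translation_service(input_kind):
    """Spoken audio goes to Speech; text and documents go to Translator."""
    if input_kind == "audio":
        return "Speech"
    if input_kind in ("text", "document"):
        return "Translator"
    raise ValueError(f"unknown input kind: {input_kind!r}")


print(pick_feature("The app must generate closed captions for webinars."))
print(pick_translation_service("document"))
```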