7.3. Extract Information with Azure AI Content Understanding
💡 First Principle: Real-world content is multimodal—a training video has audio narration, on-screen text, and visual demonstrations. Analyzing each modality separately means building three pipelines and correlating their outputs. Content Understanding provides a unified pipeline that handles documents, images, video, and audio together, extracting summaries, classifications, and entities across all modalities in one pass. Use it when your content spans multiple formats and you need consistent analysis without building separate pipelines.
Building on the input-output framework from Section 1.3, Content Understanding extends Document Intelligence by handling multimodal content. While Document Intelligence focuses on document structure extraction, Content Understanding enables summarization, classification, and entity extraction across diverse content types.
đź”§ Implementation Reference: Azure AI Content Understanding
| Item | Value |
|---|---|
| Package | azure-ai-contentunderstanding |
| Class | ContentUnderstandingClient |
| Methods | analyze(), begin_analyze() |
| Endpoint | POST /contentunderstanding/analyze |
Core Capabilities:
| Capability | Input Types | Output |
|---|---|---|
| OCR Pipeline | Images, PDFs | Extracted text with layout |
| Summarization | Documents, text | Concise summaries |
| Classification | Documents, images | Category labels |
| Entity Extraction | All content types | Structured entities |
| Table Extraction | Documents | Structured table data |
| Attribute Detection | Documents | Key attributes and properties |
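The capabilities above are requested through the `features` list passed at analysis time. A small sketch of building that list; the capability-to-flag mapping here is illustrative (drawn from the feature strings used in this section's examples), not an exhaustive list from the service:

```python
# Illustrative capability-to-feature-flag map; flag names follow the
# examples in this section and are not an official, exhaustive list.
CAPABILITY_FEATURES = {
    "ocr": "ocr",
    "entity_extraction": "entities",
    "table_extraction": "tables",
}

def build_feature_list(*capabilities: str) -> list:
    """Return the feature flags to request for the given capabilities."""
    unknown = [c for c in capabilities if c not in CAPABILITY_FEATURES]
    if unknown:
        raise ValueError(f"Unknown capabilities: {unknown}")
    return [CAPABILITY_FEATURES[c] for c in capabilities]

print(build_feature_list("ocr", "table_extraction"))  # → ['ocr', 'tables']
```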
OCR Pipeline Pattern:
```python
from azure.ai.contentunderstanding import ContentUnderstandingClient
from azure.core.credentials import AzureKeyCredential

client = ContentUnderstandingClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Create OCR pipeline for text extraction
result = client.analyze(
    content=document_bytes,
    features=["ocr", "entities", "tables"]
)

# Access extracted text
for page in result.pages:
    for line in page.lines:
        print(line.text)

# Access extracted entities
for entity in result.entities:
    print(f"{entity.category}: {entity.text}")
```
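Downstream code usually wants entities grouped by category rather than iterated as a flat list. A minimal sketch, assuming each entity exposes `category` and `text` as in the loop above (plain tuples stand in for SDK entity objects here):

```python
from collections import defaultdict

def group_entities(entities):
    """Group (category, text) pairs into a dict of category -> list of texts."""
    grouped = defaultdict(list)
    for category, text in entities:
        grouped[category].append(text)
    return dict(grouped)

# Sample pairs standing in for result.entities
sample = [("Person", "Ada Lovelace"), ("Date", "2024-12-01"), ("Person", "Alan Turing")]
print(group_entities(sample))
# → {'Person': ['Ada Lovelace', 'Alan Turing'], 'Date': ['2024-12-01']}
```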
Error Handling Pattern:
```python
import logging
import time

from azure.ai.contentunderstanding import ContentUnderstandingClient
from azure.core.exceptions import HttpResponseError

try:
    result = client.analyze(
        content=document_bytes,
        features=["ocr", "entities", "tables"]
    )
    # Process results
    for page in result.pages:
        for line in page.lines:
            print(line.text)
except HttpResponseError as e:
    if e.status_code == 400:
        # Invalid content format or unsupported file type
        logging.error("Invalid content format. Supported: PDF, images, Office documents")
    elif e.status_code == 413:
        # Content too large
        logging.error("Content exceeds maximum size limit")
    elif e.status_code == 415:
        # Unsupported media type
        logging.error("Unsupported content type")
    elif e.status_code == 429:
        # Rate limited: wait for the interval the service requests before retrying
        retry_after = int(e.response.headers.get("Retry-After", 60))
        time.sleep(retry_after)
    else:
        raise
```
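The 429 branch above sleeps once; production code typically wraps the call in a bounded retry loop. A minimal sketch of that pattern, where `RateLimitError` and the `call` parameter stand in for `HttpResponseError` and the SDK call:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 error carrying a Retry-After value."""
    def __init__(self, retry_after: int):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after

def call_with_retry(call, max_attempts: int = 3, sleep=time.sleep):
    """Invoke call(), sleeping for the requested interval on rate limiting."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError as e:
            if attempt == max_attempts:
                raise
            sleep(e.retry_after)

# Simulated call that is rate-limited once, then succeeds
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 2:
        raise RateLimitError(retry_after=1)
    return "ok"

print(call_with_retry(flaky, sleep=lambda s: None))  # → ok
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.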
CLI Equivalent (REST):
```bash
# Analyze document
curl -X POST "https://{endpoint}/contentunderstanding/analyze?api-version=2024-12-01-preview" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf

# Analyze with specific features
curl -X POST "https://{endpoint}/contentunderstanding/analyze?api-version=2024-12-01-preview&features=ocr,entities,tables" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf
```
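The REST call returns JSON. The exact schema is version-specific, so the fragment below is an assumed shape for illustration only, mirroring the `pages`/`lines`/`entities` structure used in the SDK example earlier:

```python
import json

# Hypothetical response fragment; the real preview schema may differ.
raw = json.dumps({
    "pages": [
        {"lines": [{"text": "Invoice #123"}, {"text": "Total: $42.00"}]}
    ],
    "entities": [{"category": "InvoiceId", "text": "123"}],
})

result = json.loads(raw)
# Flatten page lines into a single list of text strings
lines = [line["text"] for page in result["pages"] for line in page["lines"]]
print(lines)  # → ['Invoice #123', 'Total: $42.00']
```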
Processing Multimodal Content:
Azure AI Content Understanding can process and ingest content from multiple sources:
| Content Type | Processing Capabilities |
|---|---|
| Documents | Text extraction, summarization, entity recognition |
| Images | OCR, classification, object detection |
| Videos | Transcription, scene analysis, content moderation |
| Audio | Transcription, speaker identification |
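A single ingestion entry point can route files to the right modality before analysis. A minimal sketch, where the extension-to-modality map is illustrative and would be extended to match your actual content sources:

```python
from pathlib import Path

# Illustrative extension-to-modality map, not an official supported-types list.
MODALITY_BY_EXTENSION = {
    ".pdf": "document", ".docx": "document",
    ".png": "image", ".jpg": "image",
    ".mp4": "video",
    ".mp3": "audio", ".wav": "audio",
}

def classify_content(path: str) -> str:
    """Return the modality this file would be processed as."""
    modality = MODALITY_BY_EXTENSION.get(Path(path).suffix.lower())
    if modality is None:
        raise ValueError(f"Unsupported file type: {path}")
    return modality

print(classify_content("training/session1.mp4"))  # → video
```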
⚠️ Exam Trap: Content Understanding provides a unified pipeline for multimodal content—don't confuse it with individual services like Document Intelligence or Vision, which handle specific content types.