3.1.2. Amazon Bedrock Knowledge Bases Architecture
💡 First Principle: Bedrock Knowledge Bases operationalizes the full RAG pipeline as a managed service — it moves the undifferentiated heavy lifting of ingestion, chunking, embedding, indexing, and retrieval out of your application code and into a fully managed AWS service. You define what to index; AWS manages how.
The Bedrock Knowledge Bases architecture:
Sync configuration — the most frequently tested operational detail: Knowledge Bases does not automatically update when source documents change. You must trigger a sync:
- Manual sync: Via console or API call (bedrock-agent.start_ingestion_job())
- Scheduled sync: EventBridge Scheduler triggers a sync job on a cron schedule
- Event-driven sync: S3 event notification → Lambda → start ingestion job on document change
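# Event-driven sync triggered by S3 object creation
import urllib.parse

import boto3

def lambda_handler(event, context):
    bedrock_agent = boto3.client('bedrock-agent')
    # Object keys arrive URL-encoded in S3 event notifications
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId='KBID123456',
        dataSourceId='DSID789012',
        # description is length-limited (200 characters), so truncate long keys
        description=f"Auto-sync triggered by {key}"[:200]
    )
    return response['ingestionJob']['ingestionJobId']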
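Starting a job only queues the sync; the new document is not searchable until the job finishes. A minimal sketch of polling for completion follows, assuming a boto3 'bedrock-agent' client is passed in as client. The helper name wait_for_ingestion, the polling interval, the timeout, and the set of terminal statuses are illustrative assumptions, not part of the original example.

```python
import time

# Statuses treated as terminal in this sketch (assumed from the documented
# ingestion job lifecycle: STARTING, IN_PROGRESS, COMPLETE, FAILED, STOPPED).
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(client, kb_id, ds_id, job_id,
                       poll_seconds=10.0, timeout_seconds=3600.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_ingestion_job until the job reaches a terminal status.

    `client` is expected to behave like boto3.client('bedrock-agent').
    `sleep` and `clock` are injectable so the loop can be exercised offline.
    """
    deadline = clock() + timeout_seconds
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        if job["status"] in TERMINAL_STATUSES:
            return job["status"]
        if clock() >= deadline:
            raise TimeoutError(f"Ingestion job {job_id} still {job['status']}")
        sleep(poll_seconds)
```

Polling like this fits a Step Functions task or a batch script better than the Lambda handler itself, since a large sync can outlast the Lambda timeout.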
Metadata schema for filtered retrieval:
Documents in S3 can have accompanying .metadata.json files that define structured attributes for filtered retrieval:
{
"metadataAttributes": {
"department": "legal",
"document_type": "policy",
"effective_date": "2024-01-01",
"confidentiality": "internal"
}
}
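This enables queries like "retrieve only documents from the legal department effective after 2024", which combine semantic similarity with structured filtering. Note that range operators such as greaterThan are documented for numeric values, so a date stored as a string (as above) supports equality matching but would need a numeric form, for example an effective_year attribute, to be range-filtered.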
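A filtered query of this kind can be sketched as follows, assuming a boto3 'bedrock-agent-runtime' client is passed in as runtime_client. The helper names build_filter and retrieve_filtered are hypothetical, as is the numeric effective_year attribute, which does not appear in the metadata example above and is assumed here so that a range comparison is possible.

```python
def build_filter(department, after_year):
    """Build a retrieval filter: department matches AND effective_year > after_year.

    `effective_year` is a hypothetical numeric metadata attribute.
    """
    return {
        "andAll": [
            {"equals": {"key": "department", "value": department}},
            {"greaterThan": {"key": "effective_year", "value": after_year}},
        ]
    }


def retrieve_filtered(runtime_client, kb_id, query, metadata_filter, top_k=5):
    """Run a semantic query restricted to chunks whose metadata passes the filter.

    `runtime_client` is expected to behave like
    boto3.client('bedrock-agent-runtime').
    """
    response = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": metadata_filter,
            }
        },
    )
    # Return only the chunk text; some results may carry no text content.
    return [r.get("content", {}).get("text", "")
            for r in response["retrievalResults"]]
```

In use, retrieve_filtered(client, 'KBID123456', 'data retention policy', build_filter('legal', 2024)) would restrict the vector search to legal documents with an effective year after 2024 before ranking by similarity.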
⚠️ Exam Trap: Bedrock Knowledge Bases sync jobs are not instantaneous — they can take minutes to hours for large corpora. Architectures requiring real-time document availability (documents must be searchable within seconds of upload) cannot use Bedrock Knowledge Bases alone and need a custom OpenSearch solution with direct document ingestion.
Reflection Question: A compliance team uploads a new regulatory document to S3 and expects the chatbot to be able to answer questions about it "immediately." They're currently using Bedrock Knowledge Bases with a nightly sync job. How would you re-architect the pipeline to minimize the delay between document upload and query availability?