Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.1.3. High-Performance Vector Store Design

💡 First Principle: A vector store that performs well at 10,000 documents will fail at 10,000,000 documents unless it was designed for scale from the start. High-performance vector store architecture is about index strategy, sharding decisions, and query optimization — decisions that are very costly to change after data is loaded.

Sharding strategy for OpenSearch: OpenSearch distributes vectors across shards for parallel search. The key principle: each shard should hold 10–50GB of data for optimal performance. Too few shards → single-shard bottleneck. Too many shards → excessive overhead.

# Calculate shard count: target 20GB per shard
# 1 million documents × 1024 dimensions × 4 bytes/float = 4GB raw vectors
# Add overhead for HNSW graph: ~2-3x raw size = ~8-12GB total
# Recommended: 1-2 shards for 1M documents; scale up as corpus grows

index_settings = {
    "number_of_shards": 3,      # For ~3M document corpus
    "number_of_replicas": 1,    # 1 replica = reads scale with 2 nodes; higher availability
    "refresh_interval": "30s"   # Reduce refresh frequency during bulk indexing
}
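The shard-sizing arithmetic above can be wrapped in a small helper. This is a sketch under the assumptions stated in the comments (float32 vectors, ~2.5x HNSW overhead, 20GB target per shard); the function name and defaults are illustrative, not an OpenSearch API.

```python
import math

def recommended_shards(num_docs: int, dims: int = 1024,
                       hnsw_overhead: float = 2.5,
                       target_shard_gb: float = 20.0) -> int:
    """Estimate shard count: raw vector bytes, times HNSW graph overhead,
    divided by the per-shard size target. Assumes float32 (4 bytes/dim)."""
    raw_gb = num_docs * dims * 4 / 1e9        # raw vector storage
    total_gb = raw_gb * hnsw_overhead         # graph structures on top
    return max(1, math.ceil(total_gb / target_shard_gb))

# 1M docs at 1024 dims: ~4GB raw, ~10GB indexed -> 1 shard
print(recommended_shards(1_000_000))
```

Ceiling division errs toward more, smaller shards, which is the safer failure mode: an undersized shard wastes a little parallelism, while an oversized one degrades every query that touches it.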

Multi-index architecture for specialized domains: Rather than a single monolithic index, partition vectors by domain or document type. Benefits: a smaller index per domain means faster queries; domain-specific embedding models can be used per index; and each domain's index can be rebuilt independently.

production-vector-store/
├── index-legal/          # Legal docs with legal-specific metadata
├── index-technical/      # Technical docs with product/version metadata
├── index-hr/             # HR docs with access-controlled retrieval
└── index-financial/      # Financial docs with date/quarter metadata
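A thin routing layer can direct each query to its domain's index. A minimal sketch, assuming the index names above and a vector field called "embedding" (the routing map and field name are illustrative); the query body follows the OpenSearch k-NN plugin's `knn` query syntax.

```python
# Map each domain to its dedicated index (names from the layout above).
DOMAIN_INDEX = {
    "legal": "index-legal",
    "technical": "index-technical",
    "hr": "index-hr",
    "financial": "index-financial",
}

def build_knn_search(domain: str, query_vector: list, k: int = 10):
    """Route a query to its domain index and build an OpenSearch k-NN body."""
    index = DOMAIN_INDEX[domain]  # KeyError = fail fast on unknown domains
    body = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }
    return index, body
```

The returned `(index, body)` pair would be passed to an OpenSearch client's `search` call; keeping routing as a pure function makes it easy to unit-test without a cluster.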

Hierarchical indexing — parent-child documents: Store both chunk-level vectors (for precise retrieval) and parent-document summaries (for context preservation). At retrieval time, use chunk vectors to find relevant passages, then return the parent document section for richer context.
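The chunk-then-parent flow can be sketched as a post-processing step over chunk search hits. `chunk_hits` and `parent_store` are assumed inputs here (the output of a chunk-level vector search, and a lookup table of parent document sections); the names are hypothetical.

```python
def retrieve_with_parents(chunk_hits, parent_store, max_parents=3):
    """chunk_hits: list of (parent_id, score) pairs from a chunk-level
    vector search. Returns parent sections, deduplicated, ordered by
    each parent's best-scoring chunk."""
    best = {}
    for parent_id, score in chunk_hits:
        # Keep only the highest chunk score seen per parent document
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    ranked = sorted(best, key=best.get, reverse=True)[:max_parents]
    return [parent_store[pid] for pid in ranked]
```

Deduplication matters here: several chunks of the same document often rank highly together, and returning the parent section once preserves context without flooding the prompt with near-duplicates.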

⚠️ Exam Trap: Increasing number_of_replicas improves read throughput and availability, but each additional replica adds a full copy of the index's storage (going from 0 to 1 replica doubles storage costs). Exam scenarios that ask for "improved search performance under high read load" should increase replicas; scenarios that ask for "faster bulk indexing" should decrease replicas during indexing (then restore after).
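The replica trade-off above maps to a common operational pattern: drop replicas and disable refresh during bulk ingest, then restore serving settings afterward. A sketch of the two settings bodies; the index name is hypothetical, and applying them with opensearch-py's `indices.put_settings` is shown in comments since it requires a live cluster.

```python
# Settings for the two phases, following the pattern described above.
BULK_INDEXING = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}
SERVING = {"index": {"number_of_replicas": 1, "refresh_interval": "30s"}}

# With an opensearch-py client (not created here):
# client.indices.put_settings(index="index-technical", body=BULK_INDEXING)
# ... run bulk ingest ...
# client.indices.put_settings(index="index-technical", body=SERVING)
```

Setting refresh_interval to "-1" disables refresh entirely during ingest, which is why the restore step is mandatory: documents only become searchable again once refresh resumes.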

Reflection Question: Your vector store performs well with 500K documents but retrieval latency increases 5x after loading to 5M documents without any code changes. What are the two most likely causes, and what architectural changes would you investigate first?

Written by Alvin Varughese, Founder (15 professional certifications)