2.1.1. Structured, Semi-Structured, and Unstructured Data
💡 First Principle: The level of internal organization in data determines the type of storage system required and the operations that can be performed efficiently. Think of it like filing systems: structured data is a meticulously organized filing cabinet where every document has a designated folder; semi-structured data is a folder of labeled envelopes that can contain varying contents; unstructured data is a box of photos, which is valuable, but you need to look at each one to know what's there.
Scenario: A retail company collects three types of data: (1) sales transactions with customer ID, product ID, quantity, and price; (2) customer feedback submitted as JSON from a mobile app; (3) security camera footage from stores. Each requires a different storage approach.
Structured Data (Relational)
- Concept: Data adheres to a strict, predefined schema organized into tables with rows and columns. Relationships between tables are defined by primary and foreign keys.
- Characteristics:
  - Fixed schema defined before data entry
  - Strong data typing (integers, strings, dates)
  - Supports complex queries with SQL
  - Enforces referential integrity
- Examples: Customer records, financial transactions, inventory data
- Azure Service: Azure SQL Database, Azure Synapse Analytics
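The characteristics above can be sketched with Python's built-in `sqlite3` module standing in for a relational service. This is a minimal illustration, not Azure SQL Database itself; the table names, column names, and sample values are assumptions made up for the retail scenario.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# Fixed schema with typed columns, declared before any data is entered.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    product_id  INTEGER NOT NULL,
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    price       REAL NOT NULL
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO sales VALUES (1, 1, 500, 2, 9.5)")
conn.execute("INSERT INTO sales VALUES (2, 1, 501, 1, 5.0)")

# Referential integrity: a sale for a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO sales VALUES (3, 999, 500, 1, 9.5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A join plus aggregation, the kind of query a fixed schema makes straightforward.
rows = conn.execute("""
    SELECT c.name, SUM(s.quantity * s.price) AS total
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    GROUP BY c.name
""").fetchall()
```

The rejected insert is the point of the sketch: the database, not the application, refuses data that breaks the declared relationships.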
Semi-Structured Data (Non-Relational)
- Concept: Data contains internal markers (tags, keys) to identify fields and hierarchy, but fields can vary between records. No rigid tabular schema.
- Characteristics:
  - Self-describing (metadata embedded in data)
  - Flexible schema (fields can differ per record)
  - Hierarchical or nested structures
  - Human-readable (JSON, XML) or binary (Avro)
- Examples: JSON from web APIs, XML configuration files, sensor data with varying attributes
- Azure Service: Azure Cosmos DB, Azure Blob Storage (for JSON/XML files)
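As a rough illustration of "self-describing" and "flexible schema", the sketch below parses a handful of hypothetical feedback documents with Python's standard `json` module. The field names (`customerId`, `rating`, `device`, and so on) and the `get_path` helper are assumptions invented for this example, not part of any Azure API.

```python
import json

# Hypothetical feedback documents: each record carries its own field names,
# and no two records have exactly the same set of fields.
raw_feedback = """
[
  {"customerId": 101, "rating": 4, "comment": "Fast checkout"},
  {"customerId": 102, "rating": 2, "device": {"os": "Android", "appVersion": "3.1"}},
  {"customerId": 103, "comment": "Love the app", "tags": ["ui", "speed"]}
]
"""
records = json.loads(raw_feedback)


def get_path(record, path, default=None):
    """Follow a list of keys into nested dictionaries; return default if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# The embedded keys tell us which fields each record has.
field_sets = [sorted(r.keys()) for r in records]

# Query by path: only one record has a nested device.os value.
os_values = [get_path(r, ["device", "os"]) for r in records]

# Query by key: records without a rating are simply skipped.
ratings = [r["rating"] for r in records if "rating" in r]
```

No table definition was needed before loading the data, and a missing field is handled by the reader rather than rejected by the store.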
Unstructured Data
- Concept: Data has no predefined schema or internal structure that a database engine can interpret. The storage system treats it as an opaque "blob" of binary data.
- Characteristics:
  - No inherent data model
  - Requires external processing to extract meaning
  - Often large in size
  - Cannot be queried without transformation
- Examples: Images, videos, audio files, PDFs, Word documents
- Azure Service: Azure Blob Storage, Azure Data Lake Storage Gen2
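To make "requires external processing to extract meaning" concrete, the sketch below treats a file as raw bytes and uses format-specific knowledge (the PNG signature and IHDR header layout) to pull out the image dimensions. The `png_dimensions` helper and the synthetic 24-byte sample are assumptions for illustration; a real pipeline would more likely use an imaging library or an AI service.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(blob: bytes) -> Tuple[int, int]:
    """Extract (width, height) from the IHDR chunk at the start of a PNG byte stream."""
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type, 16-23: width and height.
    if blob[:8] != PNG_SIGNATURE or blob[12:16] != b"IHDR":
        raise ValueError("not a PNG byte stream")
    width, height = struct.unpack(">II", blob[16:24])
    return width, height


# Synthetic stand-in for a stored file: only the first 24 bytes of a PNG.
fake_blob = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)          # IHDR data length
    + b"IHDR"
    + struct.pack(">II", 640, 480)   # width, height
)

# To the storage layer this is just a sequence of bytes with a length.
size_in_bytes = len(fake_blob)

# Meaning appears only after code that understands the format processes it.
dims = png_dimensions(fake_blob)
```

The storage system sees only `size_in_bytes`; anything such as dimensions, objects in the image, or spoken words in audio has to be derived by a separate processing step.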
Visual: Data Representation Decision Tree
Comparative Table: Data Types
| Characteristic | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, predefined | Flexible, self-describing | None |
| Format | Tables (rows/columns) | JSON, XML, Key-Value | Binary (images, video) |
| Query Capability | Full SQL support | Query by key, path, or service-specific query language | No direct query of content |
| Examples | Sales transactions | API responses, logs | Images, PDFs, video |
| Azure Service | Azure SQL DB | Cosmos DB | Blob Storage |
⚠️ Exam Trap: Confusing semi-structured with unstructured is a common mistake. A JSON file IS semi-structured because it has internal keys and values that can be parsed. A JPEG image is unstructured because its binary data has no queryable meaning without processing.
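The distinction in this trap can be sketched as a simple heuristic: if the payload parses into keys and values, it is semi-structured; if it is opaque bytes, it is unstructured. The `classify` function below is a hypothetical teaching aid that only recognizes JSON, not a rule any Azure service applies.

```python
import json


def classify(payload: bytes) -> str:
    """Heuristic: payloads that parse as a JSON object or array count as semi-structured."""
    try:
        parsed = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid UTF-8 text, or text with no parseable keys and values.
        return "unstructured"
    return "semi-structured" if isinstance(parsed, (dict, list)) else "unstructured"


json_payload = b'{"customerId": 101, "rating": 5}'
jpeg_like_payload = b"\xff\xd8\xff\xe0\x00\x10JFIF"  # first bytes of a typical JPEG file

json_kind = classify(json_payload)
jpeg_kind = classify(jpeg_like_payload)
```

Both payloads could sit side by side in Blob Storage; what differs is whether a parser can find named fields inside them.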
Key Trade-Offs:
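- Data Integrity vs. Agility: Structured data enforces consistency at write time (schema-on-write); semi-structured data defers interpretation to read time (schema-on-read), so validation becomes the responsibility of the consuming application.
- Storage Cost vs. Query Cost: Blob storage typically has the lowest cost per gigabyte stored, but extracting insights requires separate, often expensive, compute. Structured databases generally cost more per gigabyte, but queries over their indexed, typed data are efficient.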
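The integrity versus agility trade-off can be sketched in a few lines. The example below contrasts a store that validates on write with one that accepts anything and interprets fields on read; the `REQUIRED` schema, function names, and sensor readings are all hypothetical.

```python
# Hypothetical fixed schema for the schema-on-write store.
REQUIRED = {"sensor_id": str, "temperature": float}


def write_structured(table, record):
    """Schema-on-write: reject a record before storing it if it does not match the schema."""
    if set(record) != set(REQUIRED):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in REQUIRED.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def read_semi_structured(store, field):
    """Schema-on-read: everything was stored; decide at read time which records have the field."""
    return [doc[field] for doc in store if field in doc]


readings = [
    {"sensor_id": "a1", "temperature": 21.5},
    {"sensor_id": "b2", "humidity": 40.0},
    {"sensor_id": "c3", "temperature": 19.0, "humidity": 55.0},
]

table, rejected = [], []
for reading in readings:
    try:
        write_structured(table, reading)
    except (ValueError, TypeError):
        rejected.append(reading["sensor_id"])

document_store = list(readings)  # the flexible store accepts every reading as-is
humidity_values = read_semi_structured(document_store, "humidity")
```

The strict table guarantees every stored row has the same shape but turns away two of the three readings; the flexible store keeps all three and leaves it to each reader to cope with missing fields.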
Reflection Question: If you receive data from an IoT sensor that includes varying fields (some sensors report temperature, others report humidity, some report both), why would you choose a semi-structured storage solution over a relational database?