3.3.2. Data and Compute Services
Azure Machine Learning provides dedicated services for managing data and compute. Think of these as the infrastructure layer: you need data to train on and compute power to run the training.
Data services:
| Service | What It Does | Analogy |
|---|---|---|
| Datastores | Connect to external data sources | A bookmark to where data lives |
| Datasets | Registered, versioned references to data | A labeled box of training data |
Datastores explained: A datastore does not copy your data. It holds the connection information for storage where your data already lives, such as:
- Azure Blob Storage (files, images)
- Azure Data Lake Storage (large-scale data)
- Azure SQL Database (structured data)
- Azure Files (shared file storage)
When you create a datastore, you set up the connection details and credentials once, so jobs can reach the data by datastore name instead of embedding keys or passwords in code.
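To make the "bookmark" idea concrete, here is a minimal sketch in plain Python. It is not the Azure SDK: `DatastoreRef`, `datastore_uri`, and the name `training_blob` are hypothetical, and the URI shape follows the `azureml://datastores/<name>/paths/<path>` convention used by the v2 SDK and CLI. The point is that a job only needs the datastore's name and a path, never a secret.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatastoreRef:
    """Hypothetical stand-in for a registered datastore: a name, not a secret."""
    name: str          # name the datastore was registered under
    storage_kind: str  # illustrative label, e.g. "blob", "adls", "sql", "files"


def datastore_uri(store: DatastoreRef, relative_path: str) -> str:
    """Build an azureml:// style URI for a path inside a registered datastore."""
    return f"azureml://datastores/{store.name}/paths/{relative_path.lstrip('/')}"


# Assumed example values; no credentials appear anywhere in job code.
training_blob = DatastoreRef(name="training_blob", storage_kind="blob")
uri = datastore_uri(training_blob, "/images/2024/labels.csv")
print(uri)  # azureml://datastores/training_blob/paths/images/2024/labels.csv
```

If the storage account key is rotated, only the datastore registration changes; code that refers to `training_blob` by name stays the same.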
Datasets explained: Datasets (called data assets in the v2 SDK and CLI) reference specific data through datastores and add useful capabilities:
- Versioning: Track which data version trained which model
- Labeling: Store labels alongside data for supervised learning
- Profiling: Auto-generated statistics about your data
- Sampling: Work with subsets during development
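The versioning capability is the one most worth internalizing, so here is a toy sketch of it in plain Python. It only illustrates the idea of "which data version trained which model"; `DatasetRegistry`, the dataset name `churn-data`, and the model names are all hypothetical and are not part of any Azure API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DatasetRegistry:
    """Toy in-memory registry (hypothetical): dataset name -> list of versioned paths."""
    versions: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, name: str, path: str) -> int:
        """Add a new version of `name`; return its 1-based version number."""
        self.versions.setdefault(name, []).append(path)
        return len(self.versions[name])

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Return the path for `version`, or the latest version if none is given."""
        paths = self.versions[name]
        return paths[-1] if version is None else paths[version - 1]


registry = DatasetRegistry()
v1 = registry.register("churn-data", "churn/2024-01.csv")
v2 = registry.register("churn-data", "churn/2024-02.csv")

# Lineage: record which dataset version each (hypothetical) model was trained on.
trained_on: Dict[str, Tuple[str, int]] = {
    "churn-model:1": ("churn-data", v1),
    "churn-model:2": ("churn-data", v2),
}

name, version = trained_on["churn-model:1"]
print(registry.get(name, version))  # data behind model 1: churn/2024-01.csv
print(registry.get("churn-data"))   # latest version: churn/2024-02.csv
```

Because every registration gets a new version number instead of overwriting the old one, a model trained months ago can still be traced back to the exact data it saw.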
Compute services:
| Compute Type | Purpose | When to Use |
|---|---|---|
| Compute instances | Development VMs | Writing code, exploring data, testing |
| Compute clusters | Scalable training | Training models, running experiments |
| Inference clusters | Production hosting | Serving predictions to applications |
| Attached compute | External resources | Using existing VMs or Databricks |
Compute instances vs. clusters:
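- Instance: Like your personal laptop in the cloud. It is a single VM for interactive work that stays on until you stop it, and it does not scale out.
- Cluster: Like a pool of workers. It adds nodes when you submit jobs and removes them when idle, down to zero nodes if the minimum is set to zero.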
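The scaling behavior of a cluster can be sketched as a simple clamp. This is a deliberately simplified model, assuming one node per queued job and ignoring the idle delay a real cluster waits before scaling down; `target_nodes`, `MIN_NODES`, and `MAX_NODES` are hypothetical names chosen for illustration.

```python
def target_nodes(queued_jobs: int, min_nodes: int, max_nodes: int) -> int:
    """Nodes the cluster aims for: one per queued job, clamped to [min_nodes, max_nodes]."""
    return max(min_nodes, min(queued_jobs, max_nodes))


# Assumed settings: scale to zero when idle, never exceed 4 nodes.
MIN_NODES, MAX_NODES = 0, 4

for queued in [0, 2, 10, 0]:
    print(f"{queued} queued -> {target_nodes(queued, MIN_NODES, MAX_NODES)} nodes")
# 0 queued -> 0 nodes
# 2 queued -> 2 nodes
# 10 queued -> 4 nodes
# 0 queued -> 0 nodes
```

A compute instance, by contrast, behaves like a fixed value of 1 for as long as it is running: it does not grow with demand or shrink when idle.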
Why compute matters for the exam: Questions may contrast "scaling resources for training" (compute clusters) with "hosting models for predictions" (inference clusters or endpoints). Development work happens on compute instances, training at scale runs on compute clusters, and production predictions are served from inference clusters or endpoints.
⚠️ Exam Tip: Datastores are CONNECTION POINTS to external storage. Datasets are REFERENCES to specific data you'll use for training. Compute instances are for DEVELOPMENT; compute clusters are for TRAINING at scale.