2.1.4. Troubleshooting Ingestion Issues
💡 First Principle: Data ingestion failures in ML pipelines are almost always caused by one of three things: capacity limits, format mismatches, or permission errors. Developing a systematic triage approach—check permissions first, then format, then capacity—resolves most issues quickly and is what the exam expects you to demonstrate.
When a training job fails with a vague error like "Unable to read data," resist the urge to immediately increase instance size or add more storage. Instead, work through the diagnostic ladder:
Step 1: Permissions. Does the SageMaker execution role have s3:GetObject on the training data bucket? Are KMS decrypt permissions in place if the data is encrypted? Permission errors account for the majority of "data not found" failures.
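As a concrete reference for Step 1, the sketch below builds the minimal IAM policy the execution role needs; the bucket name, region, account ID, and KMS key ID are placeholders, not values from this scenario.

```python
import json

# Minimal permissions for a SageMaker execution role to read encrypted
# training data. Bucket name and KMS key ARN below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-training-bucket",
                "arn:aws:s3:::my-training-bucket/*",
            ],
        },
        {
            "Sid": "DecryptTrainingData",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": [
                "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
            ],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note that s3:ListBucket attaches to the bucket ARN while s3:GetObject attaches to the object ARN (the `/*` form); mixing these up is itself a common cause of "Access Denied."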
Step 2: Format and schema. Does the data format match what the algorithm expects? SageMaker built-in algorithms often require specific formats (RecordIO, CSV with specific header conventions). A CSV file with a header row can cause training to fail, or to silently train on corrupted values, if the algorithm interprets the header as a data record.
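For Step 2, a quick pre-flight check can catch the header-row problem before the training job ever runs. This is an illustrative heuristic, not a SageMaker API: if no cell in the first row parses as a number, the row is probably a header.

```python
import csv
import io

def looks_like_header(first_row):
    """Heuristic: if no cell in the first row parses as a number,
    it is probably a header rather than a data record."""
    for cell in first_row:
        try:
            float(cell)
            return False  # at least one numeric cell: likely data
        except ValueError:
            continue
    return True

def strip_header_if_present(csv_text):
    """Return the rows of a CSV string, dropping a detected header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if rows and looks_like_header(rows[0]):
        rows = rows[1:]
    return rows

sample = "label,feature1,feature2\n1,0.5,0.7\n0,0.1,0.9\n"
print(strip_header_if_present(sample))
# [['1', '0.5', '0.7'], ['0', '0.1', '0.9']]
```

Running a check like this in the preprocessing step is far cheaper than discovering the problem minutes into a failed training job.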
Step 3: Capacity and throughput. Is the instance running out of disk space during file mode download? Is S3 request rate limiting causing throttling? S3 request rates scale per prefix, so for large datasets with many objects under a single prefix, reads can hit the per-prefix throughput limit; distributing objects across multiple prefixes raises the aggregate limit.
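One common remedy for the prefix-throughput problem in Step 3 is to spread objects across hashed prefixes, since S3 scales request rates per prefix. A minimal sketch of the idea (the key naming scheme here is an assumption for illustration, not an AWS convention):

```python
import hashlib

def prefixed_key(original_key, n_prefixes=16):
    """Derive a deterministic short hash prefix so objects spread across
    multiple S3 prefixes, each of which scales its request rate independently."""
    digest = hashlib.md5(original_key.encode()).hexdigest()
    index = int(digest, 16) % n_prefixes
    return f"{index:02x}/{original_key}"

# Example: four training shards land under varied prefixes.
for i in range(4):
    print(prefixed_key(f"train/part-{i:05d}.csv"))
```

The prefix is derived from the key itself, so readers and writers compute the same location without any lookup table.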
| Symptom | Likely Cause | Fix |
|---|---|---|
| "Access Denied" on training start | IAM role missing S3/KMS permissions | Add s3:GetObject, kms:Decrypt to execution role |
| Training starts but crashes immediately | Data format mismatch (e.g., header in CSV) | Verify format matches algorithm requirements |
| Training starts, then fails with "No space left on device" | Dataset too large for instance disk (file mode) | Switch to pipe mode or Fast File mode, or use larger instance |
| S3 read throttling (503 errors) | Too many requests to same prefix | Distribute data across multiple S3 prefixes |
| Streaming ingestion lag increases | Kinesis shard count too low | Split shards or enable auto-scaling |
| Data appears in S3 but is empty/corrupt | Firehose transform Lambda error | Check Lambda CloudWatch logs for transform failures |
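For the Kinesis lag symptom in the table, shard count must cover both per-shard write limits: 1 MB/s and 1,000 records/s. A rough sizing sketch using those documented limits (the example traffic numbers are hypothetical):

```python
import math

# Standard per-shard write limits for Kinesis Data Streams.
MB_PER_SHARD = 1.0        # 1 MB/s ingest per shard
RECORDS_PER_SHARD = 1000  # 1,000 records/s ingest per shard

def required_shards(mb_per_sec, records_per_sec):
    """Shard count must satisfy both the throughput limit and the
    record-rate limit; the stricter one wins."""
    by_throughput = math.ceil(mb_per_sec / MB_PER_SHARD)
    by_records = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    return max(by_throughput, by_records, 1)

# Example: 5 MB/s of small events at 12,000 records/s needs 12 shards,
# because the record-rate limit dominates the throughput limit.
print(required_shards(5, 12_000))
# 12
```

When lag grows even though throughput looks low in MB/s, check the record rate: many small records can exhaust the 1,000 records/s limit long before the 1 MB/s limit.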
⚠️ Exam Trap: When a question describes a training job that "fails to start" or "cannot access data," the answer is almost always a permissions issue—not a capacity issue. The exam tests whether you check IAM roles and bucket policies before considering infrastructure changes. Don't jump to "use a bigger instance" when the problem is an access denied error.
Reflection Question: A SageMaker training job configured with file mode and a 200 GB dataset on S3 fails after running for 15 minutes with a "No space left on device" error. The ml.m5.xlarge instance has 40 GB of local storage. What are two ways to fix this without changing the dataset size?