7. Glossary
A/B Testing — Comparing two model versions in production by splitting live traffic between them. See §5.1.3.
Amazon Bedrock — Managed service for accessing and customizing foundation models from multiple providers. See §3.1.3. High exam relevance.
Amazon Comprehend — NLP service for sentiment analysis, entity recognition, and topic modeling. See §3.1.4.
Amazon Comprehend Medical — Specialized NLP service for extracting medical entities from clinical text. See §3.1.4.
Amazon EMR — Managed Hadoop/Spark framework for large-scale data processing. See §2.2.4.
Amazon Kinesis — Managed streaming data platform for real-time data ingestion and processing. See §2.1.3. High exam relevance.
Amazon Macie — Service that uses ML to discover and protect sensitive data (PII/PHI) in S3. See §5.3.4.
Amazon Rekognition — Pre-built computer vision service for image and video analysis. See §3.1.4.
Amazon SageMaker — Comprehensive ML platform covering data prep through deployment and monitoring. See §1.3. Highest exam relevance — appears in 60-80% of questions.
Amazon Textract — Service for extracting text, tables, and forms from scanned documents. See §3.1.4.
Asynchronous Endpoint — SageMaker endpoint type for large payloads (up to 1 GB) and long processing times (up to 1 hour), with auto-scale-to-zero capability. See §4.1.1. High exam relevance.
Auto Scaling — Automatically adjusting compute resources based on demand metrics. For SageMaker endpoints, commonly based on InvocationsPerInstance. See §4.2.1.
Automatic Model Tuning (AMT) — SageMaker's hyperparameter optimization service using Bayesian optimization, random search, or grid search. See §3.2.2.
AWS CloudFormation — Infrastructure as code service using declarative JSON/YAML templates. See §4.2.2.
AWS CloudTrail — Service that records API activity across an AWS account for audit and compliance. See §5.2.1, §5.3.4. High exam relevance for security questions.
AWS CodeBuild — Managed build service for compiling code, running tests, and producing artifacts. See §4.3.1.
AWS CodeDeploy — Service for automating application deployments with strategies like blue/green and canary. See §4.3.1.
AWS CodePipeline — CI/CD orchestration service for automating release pipelines. See §4.3.1.
AWS Config — Service that monitors and evaluates resource configurations against compliance rules. See §5.3.4.
AWS Cost Explorer — Tool for visualizing and analyzing AWS spending. See §5.2.2.
AWS Glue — Serverless ETL service for data integration, built on Apache Spark. See §2.2.4. High exam relevance.
AWS Glue DataBrew — Visual data preparation tool with 250+ built-in transformations. See §2.2.4.
AWS Glue Data Quality — Service for defining and monitoring data quality rules. See §2.3.2.
AWS IAM — Identity and Access Management service for controlling access to AWS resources. See §5.3.1. High exam relevance for security questions.
AWS KMS — Key Management Service for creating and controlling encryption keys. See §5.3.3. High exam relevance.
AWS Lambda — Serverless compute service for running code without managing servers. See §4.1.1.
Batch Transform — SageMaker feature for large-scale offline inference without maintaining a persistent endpoint. See §4.1.1.
Bayesian Optimization — Hyperparameter search strategy that uses past results to intelligently choose next parameter combinations. Default strategy for SageMaker AMT. See §3.2.2.
Bias Drift — Change in model fairness metrics over time in production, detected by Model Monitor with Clarify. See §5.1.1.
Blue/Green Deployment — Deployment strategy maintaining two environments; traffic switches from old (blue) to new (green) with instant rollback capability. See §4.3.3. High exam relevance.
Bring Your Own Container (BYOC) — Custom Docker container for SageMaker training or inference when pre-built containers don't support your framework. See §4.1.3.
Canary Deployment — Deployment strategy that routes a small percentage of traffic to the new version, monitoring for errors before increasing. See §4.3.3.
Class Imbalance (CI) — Pre-training bias metric measuring uneven representation between facet (demographic) groups in the training data; label imbalance between groups is measured separately by DPL. See §2.3.1.
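Clarify reports CI on a −1 to +1 scale computed from the two group counts. A minimal sketch of the formula (the function name is my own, not an AWS API):

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """Clarify's CI formula: (n_a - n_d) / (n_a + n_d).

    n_a and n_d are the member counts of the two facet groups.
    Ranges from -1 to +1; 0 means the groups are perfectly balanced.
    """
    return (n_a - n_d) / (n_a + n_d)

print(class_imbalance(900, 100))  # 0.8 -> strong imbalance toward group a
print(class_imbalance(50, 50))   # 0.0 -> balanced
```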
CloudWatch — Monitoring service for AWS resources, providing metrics, logs, and alarms. See §5.2.1. High exam relevance.
Concept Drift — Change in the relationship between input features and target variable over time. See §5.1.1.
Confusion Matrix — Table comparing predicted vs. actual classifications, enabling calculation of precision, recall, and F1 score. See §3.3.1.
Customer-Managed Key (CMK) — KMS key created and managed by the customer, providing full control over key policies. See §5.3.3.
Data Drift — Change in the statistical distribution of input features compared to training data. See §5.1.1. High exam relevance.
Data Wrangler — SageMaker visual tool for data exploration, transformation, and feature engineering with minimal code. See §2.2.4, §1.3.2. High exam relevance.
Difference in Proportions of Labels (DPL) — Pre-training bias metric measuring label imbalance between demographic groups. See §2.3.1.
Distributed Training — Splitting training across multiple GPUs or instances using data parallelism or model parallelism. See §3.2.4.
Dropout — Regularization technique that randomly disables neurons during training to prevent overfitting. See §3.2.3.
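The mechanics of (inverted) dropout can be sketched in a few lines; this is an illustrative stand-in, not framework code:

```python
import random

def dropout(activations, p=0.5, training=True, seed=0):
    """Inverted dropout sketch: zero each unit with probability p during
    training and scale survivors by 1/(1-p) so expected values match.
    At inference time, activations pass through unchanged."""
    if not training or p == 0.0:
        return list(activations)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5))
```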
Early Stopping — Training technique that halts training when validation loss stops improving, preventing overfitting and saving compute. See §3.2.2.
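The patience-based stopping rule most frameworks (and SageMaker AMT) use can be sketched as follows; the loss values here are hypothetical:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop once validation loss hasn't improved for `patience` epochs.

    `val_losses` stands in for per-epoch validation losses; returns the
    (0-based) epoch at which training halts.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch       # new best: reset patience
        elif epoch - best_epoch >= patience:
            return epoch                         # patience exhausted: stop
    return len(val_losses) - 1

print(train_with_early_stopping([0.9, 0.7, 0.6, 0.62, 0.61, 0.63, 0.64]))
# stops at epoch 5: no improvement since the best loss at epoch 2
```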
Epoch — One complete pass through the entire training dataset. See §3.2.1.
F1 Score — Harmonic mean of precision and recall, balancing both metrics. See §3.3.1.
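Precision, recall, and F1 all fall out of the four confusion-matrix cells; a worked sketch with hypothetical counts:

```python
# Binary confusion-matrix cells (hypothetical counts):
# true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)                        # 80/90  ~= 0.889
recall = tp / (tp + fn)                           # 80/100 = 0.8
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~= 0.842

print(round(precision, 3), round(recall, 3), round(f1, 3))
```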
Feature Engineering — Process of creating, transforming, and selecting input variables to improve model performance. See §2.2.2.
Feature Store — SageMaker service for storing, sharing, and reusing features across teams and models. See §1.3.2, §2.2.4.
Foundation Model — Large pre-trained model that can be fine-tuned for specific tasks (e.g., via Bedrock or JumpStart). See §3.1.3.
Ground Truth — SageMaker service for creating labeled training datasets using human annotators. See §2.3.3.
Hyperparameter — Model configuration value set before training (e.g., learning rate, number of trees). See §3.2.2.
Inference Recommender — SageMaker tool that load-tests models across instance types to find optimal cost-performance. See §5.2.2.
JumpStart — SageMaker hub of pre-trained models, solution templates, and example notebooks. See §3.1.3.
L1/L2 Regularization — Techniques that add penalty terms to the loss function to prevent overfitting. L1 promotes sparsity; L2 penalizes large weights. See §3.2.3.
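The two penalty terms are simple functions of the weight vector; a minimal sketch (lambda value and weights are illustrative):

```python
def l1_penalty(weights, lam):
    # L1: lambda * sum of absolute weights (drives weights to zero -> sparsity)
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # L2: lambda * sum of squared weights (penalizes large weights hardest)
    return lam * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0, 1.5]
print(l1_penalty(w, 0.1))  # 0.1 * 4.0  = 0.4
print(l2_penalty(w, 0.1))  # 0.1 * 6.5  = 0.65
```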
Managed Spot Training — SageMaker feature for using Spot Instances for training with automatic checkpointing and resumption. See §5.2.2.
Model Monitor — SageMaker service for continuous monitoring of deployed models, detecting data quality, model quality, bias, and feature attribution drift. See §5.1.2. High exam relevance.
Model Registry — SageMaker service for versioning, cataloging, and managing approval workflows for models. See §3.2.5.
Network Isolation — SageMaker configuration that prevents containers from making any outbound network calls. See §5.3.2.
One-Hot Encoding — Encoding technique that converts categorical values into binary vectors. See §2.2.3.
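A minimal sketch of the transformation (in practice you would use pandas `get_dummies` or scikit-learn's `OneHotEncoder`):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector.

    Categories are sorted for a stable column order.
    """
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

print(one_hot(["red", "green", "red", "blue"]))
# columns are ['blue', 'green', 'red'], so:
# [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```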
Overfitting — Model performs well on training data but poorly on unseen data due to memorizing noise. See §3.2.3.
Parquet — Columnar data format optimized for analytics and ML workloads with efficient compression. See §2.1.1.
Pipe Mode — SageMaker training mode that streams data directly from S3 rather than downloading to local disk. See §3.2.4.
Production Variant — A model version deployed behind a SageMaker endpoint that receives a configurable percentage of traffic. See §5.1.3.
Regularization — Techniques to prevent model overfitting by constraining model complexity. See §3.2.3.
RMSE (Root Mean Square Error) — Regression metric measuring the square root of average squared prediction errors. See §3.3.1.
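The definition translates directly to code; the sample values are illustrative:

```python
import math

def rmse(y_true, y_pred):
    # Square root of the mean squared prediction error
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt((1 + 0 + 4) / 3) ~= 1.291
```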
ROC-AUC — Classification metric measuring the area under the receiver operating characteristic curve, evaluating model discrimination ability. See §3.3.1.
SageMaker Clarify — Service for detecting bias in data and models, and explaining model predictions using SHAP values. See §3.3.2, §5.1.2. High exam relevance.
SageMaker Debugger — Service for debugging training convergence issues by capturing tensor data during training. See §3.3.3.
SageMaker Neo — Service for compiling and optimizing trained models for specific target hardware, from cloud instances to edge devices. See §4.1.4.
SageMaker Pipelines — ML workflow orchestration service for building, automating, and managing end-to-end ML pipelines. See §4.3.2. High exam relevance.
Script Mode — SageMaker training approach where you provide your own training script to a supported framework container (e.g., TensorFlow, PyTorch). See §3.2.1.
Serverless Endpoint — SageMaker endpoint type that auto-scales to zero when idle, suitable for intermittent traffic. See §4.1.1.
Shadow Variant — A model version that receives a copy of production traffic for evaluation but does not return predictions to callers. See §5.1.3.
SHAP Values (SHapley Additive exPlanations) — Game-theoretic method, based on Shapley values, for explaining individual predictions by quantifying each feature's contribution. See §3.3.2.
SMOTE — Synthetic Minority Over-sampling Technique for addressing class imbalance by generating synthetic training examples. See §2.3.1.
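The core SMOTE idea, interpolating between a minority point and one of its nearest minority neighbours, can be sketched in pure Python (illustrative only; real implementations live in libraries such as imbalanced-learn):

```python
import random

def smote_sample(minority, k=2, n_new=3, seed=0):
    """Toy SMOTE sketch: synthesise new minority-class points by
    interpolating between an existing point and one of its k nearest
    minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance (excluding base)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

pts = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
print(smote_sample(pts))
```

Each synthetic point lies on the segment between two real minority points, so it stays inside the minority region rather than being a verbatim duplicate.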
Spot Instances — EC2 instances available at up to 90% discount but subject to 2-minute interruption notice. See §5.2.2.
Underfitting — Model is too simple to capture underlying data patterns, performing poorly on both training and test data. See §3.2.3.
VPC Mode — SageMaker configuration where training jobs and endpoints run inside a customer's VPC for network isolation. See §5.3.2. High exam relevance.