Phase 7: Glossary
- A/B Testing: A deployment strategy where traffic is split between two or more model versions (variants) to compare their performance on live data and determine which is more effective for a given business metric.
- Amazon Athena: A serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It's commonly used for Exploratory Data Analysis (EDA) and ad-hoc querying of data lakes.
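  A minimal boto3 sketch of kicking off an ad-hoc Athena query; the database name, table, and S3 output location below are placeholders:

  ```python
  import boto3

  athena = boto3.client("athena")

  # Athena scans the data in place in S3 and writes results to the output location.
  response = athena.start_query_execution(
      QueryString="SELECT label, COUNT(*) AS n FROM events GROUP BY label",
      QueryExecutionContext={"Database": "ml_data_lake"},                # placeholder database
      ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder bucket
  )
  print(response["QueryExecutionId"])  # poll get_query_execution() with this ID for status
  ```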
- Amazon CloudWatch: A monitoring and observability service that collects metrics, logs, and events from AWS resources. For ML, it's used to monitor the operational health of SageMaker endpoints (CPU/memory, latency) and to receive metrics from Model Monitor.
- Amazon Comprehend: A high-level AI service that uses natural language processing (NLP) to extract insights, relationships, and sentiment from text, without requiring you to build and train your own models.
- Amazon EMR (Elastic MapReduce): A managed cluster platform that simplifies running big data frameworks like Apache Spark and Hadoop. It's used for large-scale data processing, transformation, and feature engineering tasks.
- Amazon Kinesis: A suite of services for collecting, processing, and analyzing real-time streaming data. Data Streams is for custom processing, Firehose is for simple delivery to destinations like S3, and Data Analytics is for real-time SQL/Flink processing.
- Amazon Rekognition: A high-level AI service that provides pre-trained computer vision capabilities, such as object detection, facial analysis, and text detection in images and videos.
- Amazon S3 (Simple Storage Service): A highly durable and scalable object storage service that serves as the foundation for data lakes in AWS, storing raw data, processed data, and model artifacts.
- Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models at scale, covering the entire ML lifecycle.
- Amazon SageMaker Asynchronous Inference: A deployment option for models with large payloads or long processing times. It queues incoming requests and delivers results to an S3 location, making it cost-effective for intermittent traffic.
- Amazon SageMaker Automatic Model Tuning (HPO): A capability that automates the process of finding the best hyperparameters for a model by running many training jobs with different combinations, using strategies like Bayesian optimization.
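  A minimal sketch with the SageMaker Python SDK, assuming `estimator` is an already-configured Estimator; the objective metric and hyperparameter names below are placeholders that depend on your algorithm:

  ```python
  from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

  # Bayesian search launches training jobs with different hyperparameter combinations.
  tuner = HyperparameterTuner(
      estimator=estimator,                     # an existing, configured Estimator
      objective_metric_name="validation:auc",  # placeholder; must match what the job emits
      hyperparameter_ranges={
          "eta": ContinuousParameter(0.01, 0.3),
          "max_depth": IntegerParameter(3, 10),
      },
      strategy="Bayesian",
      max_jobs=20,          # total training jobs to run
      max_parallel_jobs=4,  # jobs to run concurrently
  )
  tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
  ```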
- Amazon SageMaker Batch Transform: A deployment option for generating predictions on an entire dataset offline. It's a high-throughput, cost-effective solution when real-time latency is not required.
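  A minimal sketch, assuming `estimator` is a trained SageMaker Estimator and the S3 paths are placeholders:

  ```python
  # Create a transformer from the trained model and score an entire dataset offline.
  transformer = estimator.transformer(
      instance_count=1,
      instance_type="ml.m5.xlarge",
      output_path="s3://my-bucket/predictions/",  # placeholder
  )
  transformer.transform(
      data="s3://my-bucket/batch-input/",  # placeholder
      content_type="text/csv",
      split_type="Line",  # send one CSV row per inference request
  )
  transformer.wait()  # results land in output_path when the job finishes
  ```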
- Amazon SageMaker Clarify: A feature that helps detect statistical bias in data and models and explain model predictions using techniques like SHAP, promoting fairness and transparency.
- Amazon SageMaker Data Wrangler: A visual data preparation tool that simplifies the process of data cleaning, transformation, and feature engineering with a point-and-click interface, generating code for production pipelines.
- Amazon SageMaker Endpoints: A fully managed, auto-scaling deployment option for hosting ML models to serve low-latency, real-time predictions.
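  A minimal sketch of deploying and invoking a real-time endpoint, again assuming a trained `estimator` and a request `payload` formatted for the serving container:

  ```python
  # Deploy the model behind a managed, auto-scaling HTTPS endpoint.
  predictor = estimator.deploy(
      initial_instance_count=1,
      instance_type="ml.m5.large",
  )
  print(predictor.predict(payload))  # payload format depends on the serving container
  predictor.delete_endpoint()        # endpoints bill while running; delete when done
  ```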
- Amazon SageMaker Feature Store: A centralized repository for storing, updating, retrieving, and sharing ML features for both training and inference, ensuring consistency and preventing training-serving skew.
- Amazon SageMaker Model Monitor: A service that continuously monitors deployed models for data drift, model quality degradation, and bias drift, triggering alerts when performance drops.
- Amazon SageMaker Model Registry: A version-controlled repository for cataloging and managing ML models, their metadata, and their approval status, facilitating governance and MLOps.
- Amazon SageMaker Pipelines: A purpose-built CI/CD service for ML that automates and orchestrates end-to-end ML workflows as a series of interconnected steps (e.g., processing, training, evaluation, deployment).
- Amazon SageMaker Processing Jobs: A managed environment for running large-scale data processing, feature engineering, and model evaluation workloads using frameworks like Spark or Scikit-learn.
- AWS CloudTrail: A service that provides a record of all API calls made in your AWS account, used for security auditing, compliance, and troubleshooting ML resource changes.
- AWS Glue: A serverless data integration service used for ETL (Extract, Transform, Load). Its Data Catalog acts as a central metadata repository for data lakes, and its ETL jobs run Spark/Python scripts for data transformation.
- AWS IAM (Identity and Access Management): The service used to securely control access to AWS resources. For ML, it defines roles and policies that grant least-privilege permissions to users and SageMaker jobs.
- AWS KMS (Key Management Service): A managed service for creating and controlling encryption keys. It's used to encrypt data at rest in S3, EBS, and other services, securing sensitive ML data and models.
- AWS Lake Formation: A service that simplifies building, securing, and managing data lakes by providing a centralized console for defining fine-grained (table, column, row-level) data access policies.
- AWS Step Functions: A serverless workflow orchestrator that can sequence AWS Lambda functions, SageMaker jobs, and other AWS services, often used for more general-purpose or complex ML pipelines.
- Bias (in ML): Systematic errors in a model that create unfair outcomes for specific groups. Sources include historical data bias, selection bias, and algorithmic bias.
- Blue/Green Deployment: A deployment strategy where a new "Green" environment is created alongside the old "Blue" one. Traffic is switched all at once, allowing for zero downtime and instant rollback.
- Canary Deployment: A deployment strategy where a small percentage of traffic is gradually shifted to a new model version to test its performance in production before a full rollout.
- Checkpointing: The practice of saving the state of a model (e.g., weights) at regular intervals during training. This enables fault tolerance, allowing training to resume after an interruption (e.g., in Managed Spot Training).
- CI/CD (Continuous Integration/Continuous Delivery): A set of MLOps practices that automate the process of building, testing, and deploying ML models, often orchestrated by services like AWS CodePipeline and SageMaker Pipelines.
- Confusion Matrix: A table used to evaluate the performance of a classification model by showing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
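  A minimal scikit-learn example on toy labels:

  ```python
  from sklearn.metrics import confusion_matrix

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # For binary labels ordered [0, 1], rows are actual and columns are predicted:
  # [[TN, FP],
  #  [FN, TP]]
  print(confusion_matrix(y_true, y_pred))
  # [[3 1]
  #  [1 3]]
  ```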
- Cross-Validation: A technique for assessing how a model will generalize to an independent dataset by partitioning the data into multiple folds and training/validating on different combinations of these folds.
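  A minimal 5-fold example with scikit-learn:

  ```python
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_iris(return_X_y=True)

  # Train and validate on 5 different train/validation splits, then average.
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print(scores.mean(), scores.std())
  ```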
- Data Drift: A change in the statistical distribution of input data over time compared to the data the model was trained on. Detected by SageMaker Model Monitor.
- Distributed Training: A strategy to accelerate model training by using multiple compute instances or GPUs in parallel. Data Parallelism splits the data across workers, while Model Parallelism splits the model itself.
- Early Stopping: A regularization technique that stops the training process when the model's performance on a validation set stops improving, preventing overfitting and saving compute time.
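  One common way to apply it, sketched here with XGBoost's native API on synthetic data:

  ```python
  import numpy as np
  import xgboost as xgb

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 5))
  y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

  dtrain = xgb.DMatrix(X[:800], label=y[:800])
  dval = xgb.DMatrix(X[800:], label=y[800:])

  # Stop when validation log loss has not improved for 10 consecutive rounds.
  booster = xgb.train(
      params={"objective": "binary:logistic", "eval_metric": "logloss"},
      dtrain=dtrain,
      num_boost_round=500,
      evals=[(dval, "validation")],
      early_stopping_rounds=10,
  )
  print("stopped at round:", booster.best_iteration)
  ```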
- Ensemble Learning: A technique that combines multiple machine learning models (e.g., decision trees in a Random Forest) to produce a more accurate and robust prediction than any individual model.
- Explainability (XAI): The degree to which a human can understand the cause of a decision made by an ML model. Techniques like SHAP and LIME are used to explain individual predictions.
- F1-Score: The harmonic mean of Precision and Recall, providing a single metric that balances both, especially useful for imbalanced datasets.
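  A quick scikit-learn check, reusing the toy labels from the Confusion Matrix example above:

  ```python
  from sklearn.metrics import f1_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # F1 = 2 * (precision * recall) / (precision + recall)
  print(f1_score(y_true, y_pred))  # 0.75, since precision and recall are both 0.75 here
  ```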
- Feature Engineering: The process of using domain knowledge to create new input features from raw data to improve the performance of machine learning algorithms.
- Fine-tuning: A transfer learning technique where the weights of a pre-trained model are further trained (or "fine-tuned") on a new, smaller, and specific dataset.
- Hyperparameter: A configuration parameter that is set before the training process begins (e.g., learning rate, number of trees). Optimal values are found through hyperparameter tuning.
- Imbalanced Data: A common problem in classification where the number of samples in one class is significantly different from the others, which can bias the model.
- Managed Spot Training: A SageMaker feature that uses interruptible EC2 Spot Instances to run training jobs at a significantly lower cost (up to 90% savings), ideal for fault-tolerant workloads.
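  A minimal Estimator sketch with placeholder image, role, and S3 paths; `checkpoint_s3_uri` lets an interrupted job resume from its last checkpoint (see Checkpointing above):

  ```python
  from sagemaker.estimator import Estimator

  estimator = Estimator(
      image_uri="<training-image-uri>",  # placeholder
      role="<execution-role-arn>",       # placeholder
      instance_count=1,
      instance_type="ml.m5.xlarge",
      use_spot_instances=True,  # run on interruptible Spot capacity at reduced cost
      max_run=3600,             # max seconds of actual training
      max_wait=7200,            # max total seconds including waiting for Spot (>= max_run)
      checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # where checkpoints are synced
      output_path="s3://my-bucket/output/",
  )
  estimator.fit({"train": "s3://my-bucket/train/"})
  ```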
- MLOps (Machine Learning Operations): A set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently by combining ML, DevOps, and Data Engineering.
- Model Drift / Concept Drift: A change in the underlying relationship between input features and the target variable, leading to a degradation in model performance. Detected by SageMaker Model Monitor's model quality checks.
- Multi-Model Endpoints: A SageMaker deployment option that allows hosting thousands of models that share the same container on a single endpoint, reducing costs for scenarios with many small models.
- One-Hot Encoding: A common technique for converting categorical features into a numerical format by creating a new binary (0/1) column for each unique category.
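  For example, with pandas:

  ```python
  import pandas as pd

  df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

  # Each unique category becomes its own 0/1 indicator column.
  print(pd.get_dummies(df, columns=["color"], dtype=int))
  #    color_blue  color_green  color_red
  # 0           0            0          1
  # 1           0            1          0
  # 2           1            0          0
  # 3           0            1          0
  ```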
- Overfitting: A modeling error where a model learns the training data too well, including its noise, and fails to generalize to new, unseen data.
- Precision: A classification metric, TP / (TP + FP), that measures the proportion of positive predictions that were actually correct. Important when the cost of a false positive is high.
- Recall (Sensitivity): A classification metric, TP / (TP + FN), that measures the proportion of actual positives that were correctly identified. Important when the cost of a false negative is high.
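  A quick check of both metrics on the same toy labels used above:

  ```python
  from sklearn.metrics import precision_score, recall_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # TP = 3, FP = 1, FN = 1 for these labels (see the Confusion Matrix example).
  print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
  print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
  ```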
- Regression: A type of supervised learning task where the goal is to predict a continuous numerical value (e.g., price, temperature).
- ROC-AUC: The Area Under the Receiver Operating Characteristic Curve. A metric that evaluates a classifier's performance across all classification thresholds, making it less sensitive to class imbalance than accuracy.
- SMOTE (Synthetic Minority Over-sampling Technique): An advanced oversampling technique used to address class imbalance by generating new, synthetic samples for the minority class.
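  A minimal example with the imbalanced-learn library on a synthetic 95/5 dataset:

  ```python
  from collections import Counter

  from imblearn.over_sampling import SMOTE
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
  print(Counter(y))  # heavily skewed, roughly 950 vs. 50

  # SMOTE synthesizes new minority samples by interpolating between nearest neighbors.
  X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
  print(Counter(y_res))  # both classes now have the same count
  ```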
- Transfer Learning: A technique where a model pre-trained on a large, general dataset is used as a starting point for a new, related task, significantly reducing data and compute requirements.
- VPC (Virtual Private Cloud): A logically isolated section of the AWS Cloud where you can launch resources. For ML, it's used to create a secure, private network environment for SageMaker jobs and endpoints.
- XGBoost: A powerful and highly optimized gradient boosting algorithm that is a popular choice for a wide range of classification and regression problems on tabular data.
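  A minimal classifier sketch using XGBoost's scikit-learn API; the hyperparameters shown are common starting points, not recommendations:

  ```python
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from xgboost import XGBClassifier

  X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Gradient-boosted decision trees, a strong default for tabular data.
  model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4)
  model.fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))
  ```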