Phase 7: Glossary
- A/B Testing: A deployment strategy where traffic is split between two or more model versions (variants) to compare their performance on live data and determine which is more effective for a given business metric.
- Amazon Athena: A serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It's commonly used for Exploratory Data Analysis (EDA) and ad-hoc querying of data lakes.
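  A minimal boto3 sketch of kicking off an ad-hoc Athena query; the database name, table, and S3 output location below are placeholders:

  ```python
  import boto3

  athena = boto3.client("athena")

  # Athena scans the data in place in S3 and writes results to the output location.
  response = athena.start_query_execution(
      QueryString="SELECT label, COUNT(*) AS n FROM events GROUP BY label",
      QueryExecutionContext={"Database": "ml_data_lake"},                # placeholder database
      ResultConfiguration={"OutputLocation": "s3://my-query-results/"},  # placeholder bucket
  )
  print(response["QueryExecutionId"])  # poll get_query_execution() with this ID for status
  ```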
- Amazon CloudWatch: A monitoring and observability service that collects metrics, logs, and events from AWS resources. For ML, it's used to monitor the operational health of SageMaker endpoints (CPU/memory, latency) and to receive metrics from Model Monitor.
- Amazon Comprehend: A high-level AI service that uses natural language processing (NLP) to extract insights, relationships, and sentiment from text, without requiring you to build and train your own models.
- Amazon EMR (Elastic MapReduce): A managed cluster platform that simplifies running big data frameworks like Apache Spark and Hadoop. It's used for large-scale data processing, transformation, and feature engineering tasks.
- Amazon Kinesis: A suite of services for collecting, processing, and analyzing real-time streaming data. Data Streams is for custom processing, Firehose is for simple delivery to destinations like S3, and Data Analytics is for real-time SQL/Flink processing.
- Amazon Rekognition: A high-level AI service that provides pre-trained computer vision capabilities, such as object detection, facial analysis, and text detection in images and videos.
- Amazon S3 (Simple Storage Service): A highly durable and scalable object storage service that serves as the foundation for data lakes in AWS, storing raw data, processed data, and model artifacts.
- Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning models at scale, covering the entire ML lifecycle.
- Amazon SageMaker Asynchronous Inference: A deployment option for models with large payloads or long processing times. It queues incoming requests and delivers results to an S3 location, making it cost-effective for intermittent traffic.
- Amazon SageMaker Automatic Model Tuning (HPO): A capability that automates the process of finding the best hyperparameters for a model by running many training jobs with different combinations, using strategies like Bayesian optimization.
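  A minimal sketch with the SageMaker Python SDK, assuming `estimator` is an already-configured Estimator; the objective metric and hyperparameter names below are placeholders that depend on your algorithm:

  ```python
  from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

  # Bayesian search launches training jobs with different hyperparameter combinations.
  tuner = HyperparameterTuner(
      estimator=estimator,                     # an existing, configured Estimator
      objective_metric_name="validation:auc",  # placeholder; must match what the job emits
      hyperparameter_ranges={
          "eta": ContinuousParameter(0.01, 0.3),
          "max_depth": IntegerParameter(3, 10),
      },
      strategy="Bayesian",
      max_jobs=20,          # total training jobs to run
      max_parallel_jobs=4,  # jobs to run concurrently
  )
  tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
  ```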
- Amazon SageMaker Batch Transform: A deployment option for generating predictions on an entire dataset offline. It's a high-throughput, cost-effective solution when real-time latency is not required.
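  A minimal sketch, assuming `estimator` is a trained SageMaker Estimator and the S3 paths are placeholders:

  ```python
  # Create a transformer from the trained model and score an entire dataset offline.
  transformer = estimator.transformer(
      instance_count=1,
      instance_type="ml.m5.xlarge",
      output_path="s3://my-bucket/predictions/",  # placeholder
  )
  transformer.transform(
      data="s3://my-bucket/batch-input/",  # placeholder
      content_type="text/csv",
      split_type="Line",  # send one CSV row per inference request
  )
  transformer.wait()  # results land in output_path when the job finishes
  ```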
- Amazon SageMaker Clarify: A feature that helps detect statistical bias in data and models and explain model predictions using techniques like SHAP, promoting fairness and transparency.
- Amazon SageMaker Data Wrangler: A visual data preparation tool that simplifies the process of data cleaning, transformation, and feature engineering with a point-and-click interface, generating code for production pipelines.
- Amazon SageMaker Endpoints: A fully managed, auto-scaling deployment option for hosting ML models to serve low-latency, real-time predictions.
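  A minimal sketch of deploying and invoking a real-time endpoint, again assuming a trained `estimator` and a request `payload` formatted for the serving container:

  ```python
  # Deploy the model behind a managed, auto-scaling HTTPS endpoint.
  predictor = estimator.deploy(
      initial_instance_count=1,
      instance_type="ml.m5.large",
  )
  print(predictor.predict(payload))  # payload format depends on the serving container
  predictor.delete_endpoint()        # endpoints bill while running; delete when done
  ```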
- Amazon SageMaker Feature Store: A centralized repository for storing, updating, retrieving, and sharing ML features for both training and inference, ensuring consistency and preventing training-serving skew.
- Amazon SageMaker Model Monitor: A service that continuously monitors deployed models for data drift, model quality degradation, and bias drift, triggering alerts when performance drops.
- Amazon SageMaker Model Registry: A version-controlled repository for cataloging and managing ML models, their metadata, and their approval status, facilitating governance and MLOps.
- Amazon SageMaker Pipelines: A purpose-built CI/CD service for ML that automates and orchestrates end-to-end ML workflows as a series of interconnected steps (e.g., processing, training, evaluation, deployment).
- Amazon SageMaker Processing Jobs: A managed environment for running large-scale data processing, feature engineering, and model evaluation workloads using frameworks like Spark or Scikit-learn.
- AWS CloudTrail: A service that provides a record of all API calls made in your AWS account, used for security auditing, compliance, and troubleshooting ML resource changes.
- AWS Glue: A serverless data integration service used for ETL (Extract, Transform, Load). Its Data Catalog acts as a central metadata repository for data lakes, and its ETL jobs run Spark/Python scripts for data transformation.
- AWS IAM (Identity and Access Management): The service used to securely control access to AWS resources. For ML, it defines roles and policies that grant least-privilege permissions to users and SageMaker jobs.
- AWS KMS (Key Management Service): A managed service for creating and controlling encryption keys. It's used to encrypt data at rest in S3, EBS, and other services, securing sensitive ML data and models.
- AWS Lake Formation: A service that simplifies building, securing, and managing data lakes by providing a centralized console for defining fine-grained (table, column, row-level) data access policies.
- AWS Step Functions: A serverless workflow orchestrator that can sequence AWS Lambda functions, SageMaker jobs, and other AWS services, often used for more general-purpose or complex ML pipelines.
- Bias (in ML): Systematic errors in a model that create unfair outcomes for specific groups. Sources include historical data bias, selection bias, and algorithmic bias.
- Blue/Green Deployment: A deployment strategy where a new "Green" environment is created alongside the old "Blue" one. Traffic is switched all at once, allowing for zero downtime and instant rollback.
- Canary Deployment: A deployment strategy where a small percentage of traffic is gradually shifted to a new model version to test its performance in production before a full rollout.
- Checkpointing: The practice of saving the state of a model (e.g., weights) at regular intervals during training. This enables fault tolerance, allowing training to resume after an interruption (e.g., in Managed Spot Training).
- CI/CD (Continuous Integration/Continuous Delivery): A set of MLOps practices that automate the process of building, testing, and deploying ML models, often orchestrated by services like AWS CodePipeline and SageMaker Pipelines.
- Confusion Matrix: A table used to evaluate the performance of a classification model by showing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
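  A minimal scikit-learn example on toy labels:

  ```python
  from sklearn.metrics import confusion_matrix

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # For binary labels ordered [0, 1], rows are actual and columns are predicted:
  # [[TN, FP],
  #  [FN, TP]]
  print(confusion_matrix(y_true, y_pred))
  # [[3 1]
  #  [1 3]]
  ```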
- Cross-Validation: A technique for assessing how a model will generalize to an independent dataset by partitioning the data into multiple folds and training/validating on different combinations of these folds.
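  A minimal 5-fold example with scikit-learn:

  ```python
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_iris(return_X_y=True)

  # Train and validate on 5 different train/validation splits, then average.
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print(scores.mean(), scores.std())
  ```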
- Data Drift: A change in the statistical distribution of input data over time compared to the data the model was trained on. Detected by SageMaker Model Monitor.
- Distributed Training: A strategy to accelerate model training by using multiple compute instances or GPUs in parallel. Data Parallelism splits the data across workers, while Model Parallelism splits the model itself.
- Early Stopping: A regularization technique that stops the training process when the model's performance on a validation set stops improving, preventing overfitting and saving compute time.
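  One common way to apply it, sketched here with XGBoost's native API on synthetic data:

  ```python
  import numpy as np
  import xgboost as xgb

  rng = np.random.default_rng(0)
  X = rng.normal(size=(1000, 5))
  y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

  dtrain = xgb.DMatrix(X[:800], label=y[:800])
  dval = xgb.DMatrix(X[800:], label=y[800:])

  # Stop when validation log loss has not improved for 10 consecutive rounds.
  booster = xgb.train(
      params={"objective": "binary:logistic", "eval_metric": "logloss"},
      dtrain=dtrain,
      num_boost_round=500,
      evals=[(dval, "validation")],
      early_stopping_rounds=10,
  )
  print("stopped at round:", booster.best_iteration)
  ```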
- Ensemble Learning: A technique that combines multiple machine learning models (e.g., decision trees in a Random Forest) to produce a more accurate and robust prediction than any individual model.
- Explainability (XAI): The degree to which a human can understand the cause of a decision made by an ML model. Techniques like SHAP and LIME are used to explain individual predictions.
- F1-Score: The harmonic mean of Precision and Recall, providing a single metric that balances both, especially useful for imbalanced datasets.
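  A quick scikit-learn check, reusing the toy labels from the Confusion Matrix example above:

  ```python
  from sklearn.metrics import f1_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # F1 = 2 * (precision * recall) / (precision + recall)
  print(f1_score(y_true, y_pred))  # 0.75, since precision and recall are both 0.75 here
  ```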
- Feature Engineering: The process of using domain knowledge to create new input features from raw data to improve the performance of machine learning algorithms.
- Fine-tuning: A transfer learning technique where the weights of a pre-trained model are further trained (or "fine-tuned") on a new, smaller, and specific dataset.
- Hyperparameter: A configuration parameter that is set before the training process begins (e.g., learning rate, number of trees). Optimal values are found through hyperparameter tuning.
- Imbalanced Data: A common problem in classification where the number of samples in one class is significantly different from the others, which can bias the model.
- Managed Spot Training: A SageMaker feature that uses interruptible EC2 Spot Instances to run training jobs at a significantly lower cost (up to 90% savings), ideal for fault-tolerant workloads.
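  A minimal Estimator sketch with placeholder image, role, and S3 paths; `checkpoint_s3_uri` lets an interrupted job resume from its last checkpoint (see Checkpointing above):

  ```python
  from sagemaker.estimator import Estimator

  estimator = Estimator(
      image_uri="<training-image-uri>",  # placeholder
      role="<execution-role-arn>",       # placeholder
      instance_count=1,
      instance_type="ml.m5.xlarge",
      use_spot_instances=True,  # run on interruptible Spot capacity at reduced cost
      max_run=3600,             # max seconds of actual training
      max_wait=7200,            # max total seconds including waiting for Spot (>= max_run)
      checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # where checkpoints are synced
      output_path="s3://my-bucket/output/",
  )
  estimator.fit({"train": "s3://my-bucket/train/"})
  ```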
- MLOps (Machine Learning Operations): A set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently by combining ML, DevOps, and Data Engineering.
- Model Drift / Concept Drift: A change in the underlying relationship between input features and the target variable, leading to a degradation in model performance. Detected by SageMaker Model Monitor's model quality checks.
- Multi-Model Endpoints: A SageMaker deployment option that allows hosting thousands of models that share the same container on a single endpoint, reducing costs for scenarios with many small models.
- One-Hot Encoding: A common technique for converting categorical features into a numerical format by creating a new binary (0/1) column for each unique category.
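  For example, with pandas:

  ```python
  import pandas as pd

  df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

  # Each unique category becomes its own 0/1 indicator column.
  print(pd.get_dummies(df, columns=["color"], dtype=int))
  #    color_blue  color_green  color_red
  # 0           0            0          1
  # 1           0            1          0
  # 2           1            0          0
  # 3           0            1          0
  ```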
- Overfitting: A modeling error where a model learns the training data too well, including its noise, and fails to generalize to new, unseen data.
- Precision: A classification metric, TP / (TP + FP), that measures the proportion of positive predictions that were actually correct. Important when the cost of a false positive is high.
- Recall (Sensitivity): A classification metric, TP / (TP + FN), that measures the proportion of actual positives that were correctly identified. Important when the cost of a false negative is high.
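  A quick check of both metrics on the same toy labels used above:

  ```python
  from sklearn.metrics import precision_score, recall_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

  # TP = 3, FP = 1, FN = 1 for these labels (see the Confusion Matrix example).
  print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
  print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
  ```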
- Regression: A type of supervised learning task where the goal is to predict a continuous numerical value (e.g., price, temperature).
- ROC-AUC: The Area Under the Receiver Operating Characteristic Curve. A metric that evaluates a classifier's performance across all classification thresholds, making it less sensitive to class imbalance than accuracy.
- SMOTE (Synthetic Minority Over-sampling Technique): An advanced oversampling technique used to address class imbalance by generating new, synthetic samples for the minority class.
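  A minimal example with the imbalanced-learn library on a synthetic 95/5 dataset:

  ```python
  from collections import Counter

  from imblearn.over_sampling import SMOTE
  from sklearn.datasets import make_classification

  X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
  print(Counter(y))  # heavily skewed, roughly 950 vs. 50

  # SMOTE synthesizes new minority samples by interpolating between nearest neighbors.
  X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
  print(Counter(y_res))  # both classes now have the same count
  ```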
- Transfer Learning: A technique where a model pre-trained on a large, general dataset is used as a starting point for a new, related task, significantly reducing data and compute requirements.
- VPC (Virtual Private Cloud): A logically isolated section of the AWS Cloud where you can launch resources. For ML, it's used to create a secure, private network environment for SageMaker jobs and endpoints.
- XGBoost: A powerful and highly optimized gradient boosting algorithm that is a popular choice for a wide range of classification and regression problems on tabular data.
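  A minimal classifier sketch using XGBoost's scikit-learn API; the hyperparameters shown are common starting points, not recommendations:

  ```python
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from xgboost import XGBClassifier

  X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Gradient-boosted decision trees, a strong default for tabular data.
  model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4)
  model.fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))
  ```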