AWS-MLS-C01 & AWS CERTIFICATION | Unsupervised Learning Algorithms - AWS Certified Machine Learning

4.2. Unsupervised Learning Algorithms

First Principle: Unsupervised learning algorithms fundamentally discover hidden patterns, structures, or relationships within unlabeled data, enabling tasks like clustering, dimensionality reduction, or anomaly detection without prior knowledge of outcomes.

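Unsupervised learning is a type of machine learning where the model learns from an unlabeled dataset (i.e., there is no explicit target variable or output). Because there is no known answer to predict, the model instead captures structure that is already present in the data, such as natural groupings, a more compact representation, or points that do not fit the rest.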

Key Characteristics of Unsupervised Learning:

Unlabeled Data: No predefined target variable.
Pattern Discovery: Aims to find inherent groupings, associations, or representations in the data.
Problem Types:
- Clustering: Grouping similar data points together.
- Dimensionality Reduction: Reducing the number of features while preserving essential information.
- Anomaly Detection: Identifying rare items, events, or observations that deviate significantly from the majority of the data.
Applications: Customer segmentation, recommendation systems, fraud detection, data compression, data exploration.
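
As a rough illustration of dimensionality reduction, the sketch below collapses two correlated features into a single value per row using principal component analysis (PCA, described later in this section). It is a minimal pure-Python sketch for intuition only, not the SageMaker PCA algorithm. The data values and the helper names (pca_first_component, project, variance) are made up for this example, and power iteration is just one simple way to find the leading component.

```python
import math


def pca_first_component(rows, iterations=100):
    """Return (feature means, unit vector) for the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise; the vector converges to the direction of largest variance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, v):
    """Reduce each row to one number: its coordinate along direction v."""
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]


def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


# Hypothetical data: the second feature is roughly twice the first.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]

mean, direction = pca_first_component(data)
scores = project(data, mean, direction)

# Share of the total variance kept after going from 2 features to 1.
total_variance = sum(variance([r[j] for r in data]) for j in range(2))
explained = variance(scores) / total_variance
print(direction, explained)
```

Because the two features move together, one component retains almost all of the variance here. On real data the retained share is usually lower, and it is a common basis for deciding how many components to keep.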

Common Unsupervised Learning Algorithms & AWS Usage:

Clustering:
- K-Means: (SageMaker built-in algorithm.) Partitions data into K clusters, where K is chosen in advance and each data point belongs to the cluster with the nearest mean (centroid).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points that are closely packed and marks points that lie alone in low-density regions as outliers. The number of clusters does not need to be specified in advance.
Dimensionality Reduction:
- Principal Component Analysis (PCA): (SageMaker built-in algorithm.) Transforms data into a new set of orthogonal (uncorrelated) variables called principal components, ordered so that the first components capture the most variance in the data.
- t-SNE (t-Distributed Stochastic Neighbor Embedding): A non-linear dimensionality reduction technique for visualizing high-dimensional data in 2 or 3 dimensions.
Anomaly Detection:
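- Random Cut Forest (RCF): (SageMaker built-in algorithm.) An unsupervised algorithm that assigns an anomaly score to each data point; low scores indicate normal points and high scores indicate likely anomalies.
- Isolation Forest: An ensemble tree-based anomaly detection algorithm that isolates points with random splits; anomalies tend to be isolated in fewer splits than normal points.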
- One-Class SVM: Identifies outliers as points that fall outside a learned boundary.
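
The tree-based anomaly detectors listed above rest on the idea that unusual points are easy to separate from the rest with random splits. The sketch below is a simplified, pure-Python take on the Isolation Forest idea, assuming a tiny made-up dataset of transactions (amount, item count). It is not Random Cut Forest and not a SageMaker API. For brevity, every tree is built on the full dataset and the raw average path length is reported instead of the normalised anomaly score used by standard implementations. All function names are hypothetical.

```python
import random


def build_tree(rows, rng, depth=0, max_depth=10):
    """Recursively split rows on a random feature at a random threshold."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}  # leaf
    # Only features that still vary within this node can be split.
    candidates = [j for j in range(len(rows[0]))
                  if min(r[j] for r in rows) < max(r[j] for r in rows)]
    if not candidates:
        return {"size": len(rows)}  # all remaining rows are identical
    feature = rng.choice(candidates)
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    threshold = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}


def path_length(point, tree):
    """Count the splits a point passes through before reaching a leaf."""
    depth = 0
    while "feature" in tree:
        goes_left = point[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if goes_left else tree["right"]
        depth += 1
    return depth


def average_path_lengths(rows, n_trees=100, seed=0):
    """Average path length per row; shorter means easier to isolate."""
    rng = random.Random(seed)
    trees = [build_tree(rows, rng) for _ in range(n_trees)]
    return [sum(path_length(r, t) for t in trees) / n_trees for r in rows]


# Hypothetical transactions: [amount, number of items]. The last row is
# deliberately unusual in both features.
transactions = [
    [25.0, 2], [40.0, 3], [32.5, 1], [55.0, 4], [28.0, 2], [47.0, 5],
    [36.0, 3], [22.0, 1], [51.0, 4], [44.0, 2], [30.0, 3],
    [5000.0, 40],
]

lengths = average_path_lengths(transactions)
most_anomalous = lengths.index(min(lengths))
print(most_anomalous, lengths[most_anomalous])
```

The row with the shortest average path is the most suspicious one. No labels were needed: the unusual transaction stands out only because it is separated from the others after very few random cuts.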

Scenario: You have a large dataset of unlabeled customer browsing behavior and purchase history. You want to group similar customers together for targeted marketing (segmentation). Additionally, you need to identify any unusual or fraudulent patterns in financial transactions without having labeled examples of fraud.
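
For the segmentation half of this scenario, the assign-and-update loop behind K-Means can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the SageMaker K-Means implementation: the customer features (visits per month, average order value), the data values, and the function names are all made up, and the initial centroids are simply K randomly chosen data points.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def k_means(points, k, iterations=20, seed=0):
    """Alternate assignment and mean-update steps until centroids settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the per-feature mean of the cluster.
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assign(points, centroids), centroids


# Hypothetical customers: [visits per month, average order value].
customers = [
    [2.0, 20.0], [3.0, 25.0], [1.0, 22.0], [2.0, 18.0],         # occasional
    [20.0, 200.0], [22.0, 210.0], [19.0, 190.0], [21.0, 205.0],  # frequent
]

labels, centroids = k_means(customers, k=2)
print(labels, sorted(centroids))
```

Each resulting label is a candidate marketing segment. In practice the features would usually be scaled first, since K-Means relies on distances and a feature with a large numeric range would otherwise dominate the grouping.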

Reflection Question: How do unsupervised learning algorithms (e.g., K-Means for clustering, PCA for dimensionality reduction, Random Cut Forest for anomaly detection) fundamentally discover hidden patterns, structures, or relationships within unlabeled data, enabling tasks like customer segmentation, data compression, or fraud detection?