Copyright (c) 2025 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.2.1. Clustering (K-Means, DBSCAN)

First Principle: Clustering algorithms fundamentally group similar data points into distinct clusters based on their inherent characteristics, uncovering natural groupings within unlabeled datasets.

Clustering is an unsupervised learning task that divides a dataset into groups (clusters) such that data points within the same group are more similar to each other than to those in other groups.

Key Clustering Algorithms:
  • K-Means Clustering:
    • What it is: An iterative algorithm that partitions n observations into k clusters. It aims to minimize the sum of squared distances between data points and their cluster centroids.
    • How it works:
      1. Randomly initialize k cluster centroids.
      2. Assign each data point to the closest centroid.
      3. Recalculate new centroids as the mean of all data points assigned to that cluster.
      4. Repeat steps 2-3 until centroids no longer move significantly or maximum iterations are reached.
    • Strengths: Simple, relatively efficient for large datasets, easy to implement.
    • Limitations: Requires specifying k (number of clusters) in advance, sensitive to initial centroid placement, assumes spherical clusters of similar size, sensitive to outliers.
    • AWS: SageMaker K-Means is a built-in algorithm, optimized for scale.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
    • What it is: A density-based clustering algorithm that groups closely packed points together, marking as outliers (noise) points that lie alone in low-density regions.
    • How it works: Defines clusters as areas of high density separated by areas of low density. Points whose epsilon-neighborhood contains at least minPts points are core points; clusters grow by connecting points density-reachable from core points, and points reachable from no core point are labeled noise. It does not require specifying the number of clusters in advance and can find arbitrarily shaped clusters.
    • Strengths: Can discover arbitrarily shaped clusters, robust to outliers, does not require specifying k.
    • Limitations: Struggles with clusters of varying densities, boundary points can be assigned ambiguously, and results are sensitive to the parameters epsilon (neighborhood radius) and minPts (minimum points required to form a dense region).
  • Use Cases: Customer segmentation, document clustering, anomaly detection (points not belonging to any cluster).
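The four K-Means steps above can be sketched in plain NumPy. This is an illustrative sketch only (not the SageMaker implementation); the synthetic blob data, tolerance, and seed values are invented for demonstration:

```python
# Minimal NumPy sketch of the K-Means steps above (illustrative only;
# production code should use a tested library such as scikit-learn).
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (squared distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    # Final assignment against the converged centroids.
    labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centroids, labels

# Two well-separated synthetic blobs; k=2 should recover them.
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids.round(1))
```

Note how the sketch exhibits the limitations listed above: the result depends on the random initialization (the seed), and the mean-based update is pulled toward outliers.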

Scenario: You have a dataset of unlabeled customer transaction records. You want to identify distinct groups of customers for targeted marketing campaigns, but you're not sure how many groups exist, and some customers might be outliers.
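For a scenario like this, DBSCAN is a natural fit because it needs no predefined cluster count and labels outliers as noise (-1). The sketch below uses scikit-learn; the feature names and synthetic "transaction" data are invented purely for illustration:

```python
# Hypothetical sketch for the scenario above: two customer segments plus
# two unusual customers, clustered with DBSCAN (data invented for illustration).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy "transaction" features: [monthly spend, purchase frequency].
segment_a = rng.normal([50, 4], [5, 1], size=(100, 2))
segment_b = rng.normal([200, 12], [15, 2], size=(100, 2))
outliers = np.array([[500.0, 1.0], [5.0, 30.0]])  # unusual customers
X = np.vstack([segment_a, segment_b, outliers])

# Scale features first: DBSCAN's eps is a distance, so a feature with a
# larger numeric range would otherwise dominate the neighborhood computation.
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Segments found:", n_clusters)
print("Customers flagged as outliers:", int(np.sum(labels == -1)))
```

If you instead needed compact, evenly sized segments and could estimate the number of groups (e.g., via the elbow method), K-Means would be the simpler, more scalable choice.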

Reflection Question: Given the scenario above, how would you choose between K-Means (which requires a predefined cluster count and assumes compact, spherical clusters) and DBSCAN (which discovers arbitrarily shaped clusters and flags outliers as noise), and what would each choice mean for the customers who fit no natural segment?