Introduction
Unsupervised learning is a branch of machine learning where models are trained on unlabeled data, learning patterns autonomously by grouping similar data points. Unlike supervised learning, which relies on ground truths to predict specific target variables like classes or numerical values, unsupervised learning discovers hidden structures and associations without requiring any human intervention. It's primarily used for grouping data, compressing information, and detecting anomalies.
Algorithms achieve this mathematically by searching for regions of high density, correlation, or distance across the feature space. One of the core methodologies is clustering, which groups unlabeled data based on similarity to discover natural groupings without predefined categories. A well-formed cluster exhibits high intra-cluster similarity — points are close together inside the group — and low inter-cluster similarity, where different groups are widely separated.
Supervised vs. unsupervised learning
| Aspect | Supervised learning | Unsupervised learning |
|---|---|---|
| Training data | Labeled data (input data + labels) | Unlabeled data (input data only) |
| Output | Classes, numerical values | Groups, clusters, latent features, associations |
| Goal | Learn to predict a target variable (labels) | Discover hidden patterns or structures within the input data |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction |
Clustering
Clustering algorithms group unlabeled data based on the similarity of data points in the feature space. A valid cluster has high intra-cluster similarity (points are close together inside the group) and low inter-cluster similarity (different groups are widely separated). Algorithms use metrics such as Euclidean distance or cosine similarity to assign points into clusters.
Hard vs. soft clustering
One cluster, exactly
Assigns each data point to exactly one cluster.
Use cases: market segmentation, customer grouping, and document clustering.
Limitation: cannot handle overlapping groups where a point might logically belong to multiple clusters.
Probabilities across clusters
Allows a data point to belong to multiple clusters simultaneously, using probabilities (e.g. 70% in Cluster 1, 30% in Cluster 2).
Use cases: overlapping class boundaries, customer personas, and medical diagnosis.
Benefits: captures ambiguity when boundaries are unclear.
K-Means algorithm
K-Means groups points based on their distance to cluster centers, identifying natural groupings in raw, unorganized data.
The algorithm steps
- Initialization: randomly choose initial centroids from the data points.
- Assignment: assign each data point to the nearest centroid to form clusters.
- Update centroids: recalculate the centroids by finding the mean of the points currently in each cluster.
- Convergence: repeat the assignment and update steps until the centroids stabilize and stop moving.
Fast, highly scalable for large datasets, easy to interpret, and highly efficient for well-separated, spherical data.
Highly sensitive to outliers, requires manually predefining the number of clusters k, and struggles with complex, non-convex shapes.
Getting the optimal k
The elbow method finds the optimal number of clusters k by plotting the Within-Cluster Sum of Squares (WCSS) against different k values. WCSS decreases as k increases, and the "elbow point" shows where the improvement slows down.
- The outer sum adds up the results for all k clusters.
- The inner sum adds up the results for all ni data points inside a specific cluster.
- xij is a single data point j that belongs to cluster i.
- ci is the centroid (center) of cluster i.
Two major metrics are used in the elbow method:
Distortion
Measures the average squared distance between each data point and its assigned cluster center. It's a measure of how well the clusters represent the data — a lower distortion value indicates better clustering.
Inertia
The sum of squared distances of each data point to its closest cluster center — essentially the total squared error of the clustering.
Hierarchical clustering
This technique groups data into a hierarchy of clusters based on similarity, visualizing relationships through a tree-like structure called a dendrogram.
- Dendrogram: each leaf is a sample/observation. Moving up the dendrogram, similar observations begin to fuse into branches, and branches fuse into bigger branches. Observations that fuse later — near the top of the tree, or root — are more different than observations that fuse earlier.
- Clusters: created by making a horizontal cut across the dendrogram. Clusters are the separate trees below the cut.
Types of hierarchical clustering
Starts with every data point as a single cluster and successively merges pairs until all data is in one giant cluster.
Starts with all data in a single cluster and recursively splits it until each point is its own singleton cluster.
Types of linkage
- Single linkage: the distance between the two closest points in the respective clusters.
- Complete linkage: the distance between the two furthest points in the respective clusters.
- Average linkage: the average distance between all possible pairs of points across the two clusters.
- Ward's linkage: merges clusters to minimize the increase in total within-cluster variance (Sum of Squared Errors, or SSE).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups data points closely packed together and marks outliers as noise, based on their density in the feature space. It identifies clusters as dense regions in the data space separated by areas of lower density.
Key parameters
- eps: defines the radius of the neighborhood around a data point. If
the distance between two points is ≤ eps, they're considered neighbors.
- If eps is too small, most points will be classified as noise.
- If eps is too large, clusters may merge and the algorithm may fail to distinguish between them.
- MinPts: the minimum number of points required within the eps radius to form a dense region.
Dimensionality reduction
Dimensionality reduction reduces the total feature count while preserving the underlying variance and structure of the data. It helps eliminate collinearity, accelerates model training, and allows data visualization in 2D or 3D.
Principal Component Analysis (PCA)
PCA is a linear technique that projects data onto new orthogonal axes called Principal Components. The first component captures the absolute maximum variance, and each subsequent component captures the remaining variance while staying perfectly perpendicular to the previous one.
Handles multicollinearity (creates uncorrelated variables), reduces noise, compresses data for faster processing, and aids in outlier detection.
The new components can be difficult to interpret, it's highly sensitive to proper data scaling, some important information may be lost, it assumes linearity in the data, and it's computationally heavy for massive datasets.
Non-Negative Matrix Factorization (NMF)
NMF breaks down a large dataset into smaller, meaningful parts while strictly ensuring that all values remain non-negative. For a matrix A of dimensions m × n where every element is ≥ 0, NMF factorizes it into two non-negative matrices W and H:
Applications: anomaly detection
Anomaly detection is about finding data points that are different from what is considered normal or expected. It relies on historical data or established knowledge to determine what falls within the usual range.