← Back to programme Day 3 · Session 1 · Theory

Unsupervised Learning

View presentation slides ↗

Introduction

Unsupervised learning is a branch of machine learning where models are trained on unlabeled data, learning patterns autonomously by grouping similar data points. Unlike supervised learning, which relies on ground truths to predict specific target variables like classes or numerical values, unsupervised learning discovers hidden structures and associations without requiring any human intervention. It's primarily used for grouping data, compressing information, and detecting anomalies.

Algorithms achieve this mathematically by searching for regions of high density, correlation, or distance across the feature space. One of the core methodologies is clustering, which groups unlabeled data based on similarity to discover natural groupings without predefined categories. A well-formed cluster exhibits high intra-cluster similarity — points are close together inside the group — and low inter-cluster similarity, where different groups are widely separated.

Supervised vs. unsupervised learning

Supervised vs. unsupervised learning
Aspect Supervised learning Unsupervised learning
Training data Labeled data (input data + labels) Unlabeled data (input data only)
Output Classes, numerical values Groups, clusters, latent features, associations
Goal Learn to predict a target variable (labels) Discover hidden patterns or structures within the input data
Typical tasks Classification, regression Clustering, dimensionality reduction

Clustering

Clustering algorithms group unlabeled data based on the similarity of data points in the feature space. A valid cluster has high intra-cluster similarity (points are close together inside the group) and low inter-cluster similarity (different groups are widely separated). Algorithms use metrics such as Euclidean distance or cosine similarity to assign points into clusters.

Hard vs. soft clustering

Hard clustering

One cluster, exactly

Assigns each data point to exactly one cluster.

Use cases: market segmentation, customer grouping, and document clustering.

Limitation: cannot handle overlapping groups where a point might logically belong to multiple clusters.

Soft clustering

Probabilities across clusters

Allows a data point to belong to multiple clusters simultaneously, using probabilities (e.g. 70% in Cluster 1, 30% in Cluster 2).

Use cases: overlapping class boundaries, customer personas, and medical diagnosis.

Benefits: captures ambiguity when boundaries are unclear.

Diagram comparing hard clustering, where each point belongs to one cluster, against soft clustering, where points can belong to multiple clusters with a probability
Figure. Hard clustering assigns each point to a single group; soft clustering allows partial membership across groups.

K-Means algorithm

K-Means groups points based on their distance to cluster centers, identifying natural groupings in raw, unorganized data.

The algorithm steps

  1. Initialization: randomly choose initial centroids from the data points.
  2. Assignment: assign each data point to the nearest centroid to form clusters.
  3. Update centroids: recalculate the centroids by finding the mean of the points currently in each cluster.
  4. Convergence: repeat the assignment and update steps until the centroids stabilize and stop moving.
Diagram showing the K-Means algorithm steps: initializing centroids, assigning points, updating centroids, and converging
Figure. The K-Means loop: assign points to the nearest centroid, then update each centroid to the mean of its assigned points, repeating until stable.
Advantages

Fast, highly scalable for large datasets, easy to interpret, and highly efficient for well-separated, spherical data.

Disadvantages

Highly sensitive to outliers, requires manually predefining the number of clusters k, and struggles with complex, non-convex shapes.

Getting the optimal k

The elbow method finds the optimal number of clusters k by plotting the Within-Cluster Sum of Squares (WCSS) against different k values. WCSS decreases as k increases, and the "elbow point" shows where the improvement slows down.

Formula for the Within-Cluster Sum of Squares (WCSS), summing squared distances between each point and its cluster centroid across all clusters
Figure. The WCSS formula used by the elbow method.
  • The outer sum adds up the results for all k clusters.
  • The inner sum adds up the results for all ni data points inside a specific cluster.
  • xij is a single data point j that belongs to cluster i.
  • ci is the centroid (center) of cluster i.

Two major metrics are used in the elbow method:

Distortion

Measures the average squared distance between each data point and its assigned cluster center. It's a measure of how well the clusters represent the data — a lower distortion value indicates better clustering.

Formula for distortion
Figure. The distortion formula.

Inertia

The sum of squared distances of each data point to its closest cluster center — essentially the total squared error of the clustering.

Formula for inertia
Figure. The inertia formula.

Hierarchical clustering

This technique groups data into a hierarchy of clusters based on similarity, visualizing relationships through a tree-like structure called a dendrogram.

  • Dendrogram: each leaf is a sample/observation. Moving up the dendrogram, similar observations begin to fuse into branches, and branches fuse into bigger branches. Observations that fuse later — near the top of the tree, or root — are more different than observations that fuse earlier.
  • Clusters: created by making a horizontal cut across the dendrogram. Clusters are the separate trees below the cut.
A dendrogram tree diagram showing samples fusing into branches at increasing heights of dissimilarity
Figure. A dendrogram — cutting horizontally at a chosen height determines the resulting clusters.

Types of hierarchical clustering

Agglomerative (bottom-up)

Starts with every data point as a single cluster and successively merges pairs until all data is in one giant cluster.

Divisive (top-down)

Starts with all data in a single cluster and recursively splits it until each point is its own singleton cluster.

Types of linkage

  • Single linkage: the distance between the two closest points in the respective clusters.
  • Complete linkage: the distance between the two furthest points in the respective clusters.
  • Average linkage: the average distance between all possible pairs of points across the two clusters.
  • Ward's linkage: merges clusters to minimize the increase in total within-cluster variance (Sum of Squared Errors, or SSE).

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a density-based clustering algorithm that groups data points closely packed together and marks outliers as noise, based on their density in the feature space. It identifies clusters as dense regions in the data space separated by areas of lower density.

Key parameters

  • eps: defines the radius of the neighborhood around a data point. If the distance between two points is ≤ eps, they're considered neighbors.
    • If eps is too small, most points will be classified as noise.
    • If eps is too large, clusters may merge and the algorithm may fail to distinguish between them.
  • MinPts: the minimum number of points required within the eps radius to form a dense region.
Diagram illustrating DBSCAN's eps radius and MinPts parameters, showing core points, border points, and noise
Figure. DBSCAN groups points by local density, using eps and MinPts to distinguish core points, border points, and noise.

Dimensionality reduction

Dimensionality reduction reduces the total feature count while preserving the underlying variance and structure of the data. It helps eliminate collinearity, accelerates model training, and allows data visualization in 2D or 3D.

Principal Component Analysis (PCA)

PCA is a linear technique that projects data onto new orthogonal axes called Principal Components. The first component captures the absolute maximum variance, and each subsequent component captures the remaining variance while staying perfectly perpendicular to the previous one.

Diagram showing data projected onto principal component axes, with the first component capturing the maximum variance
Figure. Data projected onto its principal components.
Advantages

Handles multicollinearity (creates uncorrelated variables), reduces noise, compresses data for faster processing, and aids in outlier detection.

Disadvantages

The new components can be difficult to interpret, it's highly sensitive to proper data scaling, some important information may be lost, it assumes linearity in the data, and it's computationally heavy for massive datasets.

Non-Negative Matrix Factorization (NMF)

NMF breaks down a large dataset into smaller, meaningful parts while strictly ensuring that all values remain non-negative. For a matrix A of dimensions m × n where every element is ≥ 0, NMF factorizes it into two non-negative matrices W and H:

Formula showing matrix A factorized into non-negative matrices W and H
Figure. NMF factorizes matrix A into non-negative matrices W and H.

Applications: anomaly detection

Anomaly detection is about finding data points that are different from what is considered normal or expected. It relies on historical data or established knowledge to determine what falls within the usual range.

Scatter plot showing a cluster of normal data points with a small number of outlier points marked as anomalies
Figure. Anomalies sit outside the normal range established by the bulk of the data.

References