Unsupervised Machine Learning: Clustering

AI, But Simple Issue #13

There are 3 main categories that make up machine learning as a whole:

  • Supervised learning, where your training examples are labeled,

  • Unsupervised learning, where your training data is unlabeled,

  • Reinforcement learning, where an agent learns in an environment through trial and error, guided by the feedback it receives.

In this issue, we’re going to focus on the middle (and less popular) option: unsupervised learning.

  • In particular, there are 2 main problems that unsupervised learning algorithms solve.

  • These are clustering (grouping unlabeled data) and dimensionality reduction (reducing the number of features/dimensions of the data).

We’ll focus on clustering, as it is the most used and popular form of unsupervised learning.

Clustering deals with finding structure in a collection of unlabeled data.

  • If the data were to be labeled, it would be a supervised learning problem.

A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.

A cluster is therefore a grouping of objects or data points that are more similar to each other than to the points in other clusters.

Some applications of clustering include market segmentation (dividing potential clients into distinct groups), anomaly detection (discovering abnormal data points), and image segmentation (grouping an image’s pixels into regions, which supports tasks like classification and recognition).

Different clustering algorithms have different ways of sorting data into various clusters. Let’s investigate a few:

Clustering Algorithms

Partitional clustering

Partitional clustering divides data objects into non-overlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.

These types of algorithms require the user to specify a hyperparameter that represents the number of clusters, usually denoted by the variable k.

Many partitional clustering algorithms work through an iterative process to assign subsets of data points into k clusters. Two examples of partitional clustering algorithms are k-means and k-medoids.
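To make this concrete, here’s a minimal k-means sketch using scikit-learn. The synthetic blob data, the choice of k = 3, and the random seeds are illustrative assumptions, not tied to any particular dataset.

```python
# Minimal k-means sketch (illustrative data and hyperparameters)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data: 300 points drawn from 3 roughly spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k is specified up front by the user, then cluster assignments
# and centroids are refined iteratively
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the k learned centroids
print(labels[:10])              # cluster assignments of the first 10 points
```

Each point ends up in exactly one of the k clusters, which is the non-overlapping property described above.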

Partitional clustering algorithms have their strengths and weaknesses:

  • Strengths:

    • They work well when clusters have a spherical shape.

    • They’re scalable with respect to algorithm complexity.

  • Weaknesses:

    • They’re not well suited for clusters with complex shapes and different sizes.

Hierarchical clustering

Hierarchical clustering is a clustering approach that groups data using a hierarchy. In this way, clusters contain nested subclusters, and groups at different levels of the hierarchy overlap.

To further show the difference between these 2 approaches, see the image below:

Hierarchical clustering algorithms generate a dendrogram (a tree diagram of the merges or splits) that can be used for further analysis of the data.

There are 2 main types of hierarchical clustering:

Agglomerative clustering is the bottom-up approach. Each point starts as its own cluster, and the two most similar (closest) clusters are merged at each step until all points have been merged into a single cluster.

Divisive clustering is the top-down approach. It starts with all points in one cluster and, at each step, splits a cluster into its least similar subgroups until only single data points remain.

Unlike many partitional clustering techniques, hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run an algorithm twice on the same input data.
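As a rough sketch of the bottom-up (agglomerative) process, here’s how a merge hierarchy and flat cluster labels could be produced with SciPy. The toy data, the ward merge criterion, and the cut into 2 clusters are all illustrative assumptions.

```python
# Agglomerative (bottom-up) hierarchical clustering sketch with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2D data: two well-separated groups of 3 points each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Repeatedly merge the closest clusters until one cluster remains
Z = linkage(X, method="ward")

# Cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the merge tree mentioned above (via matplotlib)
```

Because the merge order depends only on the pairwise distances, running this twice on the same data gives the same result, which is the determinism mentioned above.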

Linkages in Clustering:

Especially in hierarchical clustering, there must be a way to decide how data points are assigned to their respective clusters.

  • We need a measure of distance between the clusters; we need to define what “close” is.

  • We call this measure the type of linkage, and choosing the right type of linkage can noticeably improve clustering quality.

Single Linkage (Minimum Linkage):

Single linkage is a type of linkage that uses the minimum distance between any single point in the first cluster and any single point in the second cluster.

  • Tends to produce long, "chain-like" clusters.

  • Sensitive to noise and outliers.

Complete Linkage (Maximum Linkage):

Complete linkage is a type of linkage that uses the maximum distance between any single point in the first cluster and any single point in the second cluster.

  • Tends to produce more compact and spherical clusters.

Average Linkage:

Average linkage is a type of linkage that uses the average distance between all pairs of points, where one point is from the first cluster and the other point is from the second cluster.

  • Produces clusters that are more balanced in size compared to single and complete linkage.

Centroid Linkage:

Centroid linkage is a type of linkage that uses the distance between the centroids (mean points) of the clusters.

  • Can produce clusters with irregular shapes.

  • Sensitive to the mean values and can be influenced by outliers.
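As a small sketch of how the linkage choice plugs into an actual library call, here’s scikit-learn’s AgglomerativeClustering run with several linkage options; the toy data and n_clusters=2 are assumptions for illustration.

```python
# Comparing linkage types with scikit-learn's AgglomerativeClustering
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2D data: two well-separated groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

for linkage_type in ["single", "complete", "average"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage_type)
    print(linkage_type, model.fit_predict(X))

# Centroid linkage isn't offered by AgglomerativeClustering, but
# scipy.cluster.hierarchy.linkage(X, method="centroid") covers that case.
```

On data this clean, all linkages agree; the differences described above show up on noisier, oddly shaped clusters.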

Hierarchical clustering has its own strengths and weaknesses.

  • The strengths:

    • They often reveal more details about the relationships between data points.

    • They provide an interpretable dendrogram.

  • The weaknesses:

    • They’re computationally expensive with respect to algorithm complexity.

    • They’re sensitive to noise and outliers.

Density-based clustering

Density-based clustering determines cluster assignments based on the density of data points in a region.

Clusters are assigned where there are high densities of data points separated by low-density regions.

Unlike partitional clustering, this approach doesn’t require the user to specify the number of clusters.

Instead, there is a distance-based parameter that acts as a tunable threshold.

This threshold determines how close points must be to each other to be considered members of the same cluster.

Examples of density-based clustering algorithms include DBSCAN and OPTICS. Here’s a quick visualization of how DBSCAN can assign points to weirdly shaped clusters:
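As a complement to that visualization, here’s a hedged DBSCAN sketch using scikit-learn’s two-moons toy dataset; the eps and min_samples values are illustrative assumptions rather than recommended defaults.

```python
# DBSCAN sketch on non-spherical ("two moons") data
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-circles: a shape centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the distance threshold; min_samples sets the density requirement
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))  # cluster ids; -1 marks points treated as noise/outliers
```

Note that the number of clusters is never specified; it falls out of the density threshold.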

  • The strengths:

    • They excel at identifying clusters of non-spherical shapes.

    • They’re resistant to outliers.

  • The weaknesses:

    • They aren’t well suited for clustering in high-dimensional spaces.

    • They have trouble identifying clusters of varying densities.

The different types of clustering algorithms

There are additional types of clustering algorithms, but we won’t go too deep into the other ones as they are more advanced. We might cover them in another issue, so be on the lookout!

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
