DBSCAN easily explained

Shubhendu ghosh
7 min read · Aug 8, 2022

Introduction

Clustering analysis is an unsupervised machine learning method that groups data points into several groups called clusters, such that points in the same cluster are very similar to each other and points in different clusters are dissimilar.

The two most popular clustering algorithms are "k-means clustering" and "hierarchical clustering". In this article, however, we will be discussing a different type of clustering algorithm.

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Proposed by Martin Ester et al. in 1996, DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density.

So basically it measures the density across the whole space and assigns densely grouped data points to a single cluster. The best feature of DBSCAN is that it is robust to outliers, which means it can identify outliers in the data and handle them accordingly. More on this later in the post.

The need for another clustering algorithm

Before going any further, let's revise our two previously mentioned algorithms.

K-means algorithm

The k-means algorithm is a partitioning algorithm. In this approach we first construct a partition of a database D of n objects into a set of k clusters. K is an input parameter for this algorithm; how to choose a value for k is a topic for another article. After an initial partition of D, it uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center (centroid) of the cluster. It uses a two-step procedure: first, determine k representatives minimizing the objective function; second, assign each object to the cluster whose representative is closest to that object.
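To make this two-step procedure concrete, here is a minimal sketch using scikit-learn's KMeans on made-up 2D data (the toy data and the choice of k = 3 are assumptions for illustration only):

```python
# Minimal k-means sketch with scikit-learn; the random data and k = 3
# are placeholder choices, not recommendations.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                  # 100 toy points in 2D
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # cluster index for each point
print(kmeans.cluster_centers_)              # the k "gravity centers"
```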

Hierarchical clustering

Hierarchical clustering creates a hierarchical decomposition of D. The decomposition is represented by a dendrogram: a tree that iteratively splits D into smaller subsets until each subset consists of only one object. It has two approaches, agglomerative (bottom-up) and divisive (top-down). It does not need k as an input; however, a termination condition needs to be defined.
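As a rough illustration, agglomerative hierarchical clustering and its dendrogram can be sketched with SciPy (the toy data, the single-link method, and the cut height are assumptions):

```python
# Minimal agglomerative hierarchical clustering sketch with SciPy;
# the toy data, linkage method and cut height are placeholder choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)                        # 50 toy points in 2D
Z = linkage(X, method='single')                  # bottom-up merge tree (dendrogram)
labels = fcluster(Z, t=0.1, criterion='distance')  # terminal condition: cut at height 0.1
print(labels)
```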

Okay, so now that we know what k-means clustering and hierarchical clustering are, we can proceed to why we need another clustering algorithm.

Both k-means and hierarchical clustering fail to create clusters of arbitrary shapes. They are also not able to construct clusters based on varying densities.

Another problem with these traditional algorithms is that they force every observation to be part of some cluster, even if a point is very far from any centroid. As a result, a slight change in the data points might affect the clustering outcome.

Another advantage of DBSCAN is that you don't have to specify the number of clusters beforehand, so not having domain knowledge about the data is not a problem here.

Both k-means and hierarchical clustering calculate the distances between data points and between cluster centers many times, while DBSCAN does so only once.

So now that we know why we need another clustering algorithm let’s dive deep into the working of the algorithm.

How DBSCAN works

Looking at the sample sets of points in figures 1, 2, and 3, we can easily detect clusters of points and noise points that do not belong to any cluster.

[Figures 1-3: sample point sets with visually apparent clusters and noise]

We can identify the clusters because within each cluster the density of points is typically higher than outside the cluster. Furthermore, the density within the areas of noise is considerably lower than the density within the clusters.

Now let's formalize our initial notions of "cluster" and "noise".

The algorithm can be applied to 2D, 3D or any higher dimensional space.

So, the key idea is that for each point of a cluster, the neighbourhood of a given radius has to contain a minimum number of points. The shape of the neighbourhood is determined by the distance function between two points p and q. For example, using Euclidean distance, the neighbourhood is a circle in 2D space and a sphere in 3D.

ε-neighbourhood: the ε-neighbourhood of a point p is denoted by Nε(p) and is defined as Nε(p) = { q ∈ D | dist(p, q) ≤ ε }.
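As a small sketch (using Euclidean distance, with the toy data and the value of ε as arbitrary assumptions), the ε-neighbourhood of a point can be computed like this:

```python
# Sketch: ε-neighbourhood of a point under Euclidean distance.
# The toy data and the eps value are assumptions for illustration.
import numpy as np

def eps_neighbourhood(D, p_index, eps):
    """Indices of all points of D within distance eps of point p (p itself included)."""
    dists = np.linalg.norm(D - D[p_index], axis=1)
    return np.where(dists <= eps)[0]

D = np.random.rand(20, 2)
print(eps_neighbourhood(D, p_index=0, eps=0.2))
```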

Now, based on the ε-neighbourhood, DBSCAN classifies each point of a database D into one of three categories.

Core points: a point p is called a core point if the ε-neighbourhood of p contains at least a minimum number of points. This minimum number is a hyperparameter that we will explain in a later part of the article.

Border points: a point q is called a border point if the ε-neighbourhood of q contains at least one point but fewer than the minimum number of points.

Noise: a point r is called noise if the ε-neighbourhood of r contains no points. Look at the following figure to understand this visually.

The blue-colored points are noise, the red points are core points, and the green ones are border points. Here the minimum number of points was taken as 3.
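Following the article's simplified definitions, a rough sketch of the three categories (with eps and minpts as assumed, untuned values) could look like this:

```python
# Sketch: label every point of D as core, border or noise, following the
# simplified definitions above; eps and minpts are assumptions.
import numpy as np

def classify_points(D, eps, minpts):
    labels = []
    for i in range(len(D)):
        dists = np.linalg.norm(D - D[i], axis=1)
        neighbours = np.sum(dists <= eps) - 1   # neighbours other than the point itself
        if neighbours >= minpts:
            labels.append("core")
        elif neighbours >= 1:
            labels.append("border")
        else:
            labels.append("noise")
    return labels

D = np.random.rand(30, 2)
print(classify_points(D, eps=0.15, minpts=3))
```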

The algorithm starts with a random data point in the dataset. If there are at least the minimum number of points within its neighbourhood, it considers all of those points to be part of the same cluster. The cluster is then expanded by recursively repeating the neighbourhood calculation for each neighbouring point.
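A simplified from-scratch sketch of this expansion idea is shown below (not a reference implementation; eps, minpts and the toy data are assumed values):

```python
# Simplified DBSCAN sketch illustrating neighbourhood expansion;
# eps, minpts and the random data are assumptions, not tuned values.
import numpy as np

def region_query(D, i, eps):
    """Indices of points within distance eps of point i (its ε-neighbourhood)."""
    dists = np.linalg.norm(D - D[i], axis=1)
    return list(np.where(dists <= eps)[0])

def dbscan(D, eps, minpts):
    n = len(D)
    labels = [None] * n              # None = unvisited, -1 = noise, 0..k = cluster id
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region_query(D, i, eps)
        if len(neighbours) < minpts:
            labels[i] = -1           # provisionally noise
            continue
        cluster_id += 1              # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbours)
        while seeds:                 # expand the cluster through its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id            # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(D, j, eps)
            if len(j_neighbours) >= minpts:       # j is itself a core point, keep expanding
                seeds.extend(j_neighbours)
    return labels

D = np.random.rand(100, 2)
print(dbscan(D, eps=0.1, minpts=4))
```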

Reachability and connectivity

Before moving further you need to understand two important concepts. Reachability states whether a data point can be accessed from another data point directly or indirectly. Connectivity states whether two data points belong to the same cluster or not. In terms of reachability and connectivity, two points in DBSCAN can be referred to as:

1. Directly density-reachable

2. Density-reachable

3. Density-connected

Let’s elaborate more on that.

Directly density-reachable: a point p is directly density-reachable from a point q wrt. ε and minpts if p ∈ Nε(q) and q is a core point, i.e. Nε(q) contains at least minpts points.

Obviously, direct density-reachability is symmetric for pairs of core points, but not when one core point and one border point are involved.

Density-reachable: a point p is density-reachable from a point q wrt. ε and minpts if there is a chain of points p_1, p_2, …, p_n with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i.

This relation is transitive but, in general, not symmetric. For core points, however, density-reachability is symmetric.

Density-connected: a point p is density-connected to a point q wrt. ε and minpts if there is a point o such that both p and q are density-reachable from o wrt. ε and minpts.
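As an informal sketch (reusing the region_query helper from the previous snippet, with eps and minpts again as assumed values), density-reachability and density-connectivity can be checked by walking the chain of core points:

```python
# Sketch: check density-reachability and density-connectivity between points,
# reusing region_query from above; eps and minpts are assumptions.
def density_reachable(D, q, p, eps, minpts):
    """True if p is density-reachable from q, i.e. reachable through a chain of core points."""
    visited, frontier = set(), [q]
    while frontier:
        current = frontier.pop()
        if current in visited:
            continue
        visited.add(current)
        neighbours = region_query(D, current, eps)
        if len(neighbours) < minpts:
            continue                     # only core points extend the chain
        if p in neighbours:
            return True
        frontier.extend(neighbours)
    return False

def density_connected(D, p, q, eps, minpts):
    """True if some point o reaches both p and q; they then belong to the same cluster."""
    return any(density_reachable(D, o, p, eps, minpts) and
               density_reachable(D, o, q, eps, minpts)
               for o in range(len(D)))
```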

With the help of these terms we can now define a cluster.

Def. (cluster): let D be a database of points. A cluster C with respect to ε and minpts is a non-empty subset of D satisfying the following conditions:

1. ∀ p, q: if p ∈ C and q is density-reachable from p wrt. ε and minpts, then q ∈ C.

2. ∀ p, q ∈ C: p is density-connected to q wrt. ε and minpts.

Parameters of the algorithm

DBSCAN has two parameters, ε and minpts. The algorithm is very sensitive to these two parameters; a slight variation in them can change the result of the analysis drastically.

1. ε: the distance that specifies the neighbourhood. Two points are considered neighbours if the distance between them is less than or equal to ε. If the chosen value of ε is too small, a higher number of clusters will be created and more data points will be treated as noise; if it is too big, various small clusters will merge into one big cluster and we will lose detail.

2. minpts: the minimum number of data points required to define a cluster. As a rule of thumb, a minimum minpts can be derived from the number of dimensions D in the data set as minpts ≥ D + 1. The low value minpts = 1 does not make sense, as then every point on its own would already be a cluster. With minpts ≤ 2, the result will be the same as hierarchical clustering with the single-link metric, with the dendrogram cut at height ε. Therefore, minpts must be chosen as at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minpts = 2·dim can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates.
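Putting the two parameters together, here is a minimal usage sketch with scikit-learn's DBSCAN implementation (the eps and min_samples values below are placeholders, not recommendations):

```python
# Minimal scikit-learn DBSCAN usage; eps and min_samples are placeholder values.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)
db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                              # cluster index per point, -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", list(labels).count(-1), "noise points")
```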

This is all for this article. I hope you guys have enjoyed reading this.

Please share your thoughts in the comments section.

Follow me in other places

GitHub, LinkedIn, Twitter

Thank you
