Skip to content

DBSCAN

Problem Statement

Given:

  • A continuously updated set of 2D points \(P = \{p_1, p_2, \dots, p_n\}\) representing the locations (latitude, longitude) of people.
  • A fixed radius \(r = 100\) meters.

Objective:

  • Continuously identify the coordinate \(c\) **such that the circle of radius 100 m centred at \(c\)contains the maximum number of points from P. The solution must be updated in near real time as new points are added (or removed).

DBSCAN

  • Density Based Spatial Clustering of Applications with Noise
  • Clustering algorithm used in ML to partition data into clusters based on their distance to other points.
  • Effective at identifying and removing noise (outliers) in a data set.
  • Complexity depends on how the nearest neighbour search is implemented.
    • Brute Force Search: \(O(n^2)\)
    • Tree (k-dimensional tree, ball tree, r-tree): \(O(n \: logn)\)
    • scikit-learn allows diff searching algorithms
  • Key parameters:
    • eps: radius of the “neighbourhood”
    • MinPts: minimum # of points required within the eps radius to be considered as a core point.
  • If esp is too high ⇒ a lot of points are considered as 1 cluster
  • If esp is too low ⇒ points may be divided in to too many clusters and a lot of them could be considered as outliers.
  • If minPts is too high ⇒ cut off is too high so there might be less clusters and a lot of outliers.
  • If minPts is too low ⇒ outliers can be considered as clusters

Procedure

  1. Parameter: radius and the min. neighbour points

  2. Determine if this point is a core point or not

    • if the radius was 10m and the min neighbour points was 4, a point that has at least 4 other points within the 10m radius circle (can be touching the circle, doesn’t have to be fully inside), it is considered a core point. image.pngimage.png
  3. Randomly pick a core point and put it into cluster 1. And in the neighbour points, if there are other core points, put them into cluster 1 as well. Do not include the non core points. Repeat until there are no more core points that can be extended to.

    image.png

  4. Add the non core points that are neighbours of the core points that are classified as cluster 1.

    image.pngimage.png

  5. Other core points that was not “connected” to the core points classified as cluster 1 becomes a new cluster, ie cluster 2. Repeat the step until there are no more core points left.

    image.png

  6. After grouping with the core points, if there are non core points left, they are classified as outliers.

Pseudocode

DBSCAN(dataset, eps, MinPts) {
    C = 1 # cluster index

    for each unvisited point p in dataset {
        mark p as visited

        # find all the points within a distance eps from point p
        Neighbours N = find the neigbouring points of p

        if |N| >= MinPts:
            N = N U newN # expand the neigbours (=> add the neigbours neighbour if they're core points)
            if newP is not a member of any cluster:
                add newP to cluster C # here the newP is the neighbours)
    }
}
Algorithm Implementation with Python
import numpy as np

def find_neighbours(data, point_idx, eps):
    neighbours = []
    for i in range(len(data)):
        if np.linalg.norm(data[point_idx] - data[i]) <= eps:
            neighbours.append(i)
    return neighbours

def dbscan(data, eps, min_pts):
    clusters = [-1] * len(data)
    cluster_id = 0

    for i in range(len(data)):
        if clusters[i] != -1:
            continue

        neighbours = find_neighbours(data, i, eps)

        if len(neighbors) < min_pts:
            clusters[i] = 0
            continue

        cluster_id += 1
        clusters[i] = cluster_id
        queue = neighbours

        while queue:
            j = queue.pop(0)

            if clusters[j] == 0:  
                clusters[j] = cluster_id

            if clusters[j] != -1:
                continue

            clusters[j] = cluster_id
            new_neighbours = find_neighbours(data, j, eps)

            if len(new_neighbors) >= min_pts:
                queue.extend(new_neighbors)

    return clusters

data = np.array([[1, 2], [2, 2], [3, 2], [8, 7], [8, 8], [25, 80]])
eps = 2
min_pts = 2

clusters = dbscan(data, eps, min_pts)
print(clusters)

Clustering

image.png

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_

print(labels)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolors='k')
plt.title("DBSCAN Clustering")
plt.show()

Find the Densest Cluster

  • Calculates the centre point of each cluster then counts the number of points inside.

image.png

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from scipy.spatial import distance

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_

unique_labels = set(labels) - {-1} 
centers = []

for l in unique_labels:
    mask_bool = (labels == l)
    cluster_points = X[mask_bool]

    x_min = np.min(cluster_points[:, 0])
    x_max = np.max(cluster_points[:, 0])
    y_min = np.min(cluster_points[:, 1])
    y_max = np.max(cluster_points[:, 1])

    center_pt = np.array([(x_min + x_max) / 2, (y_min + y_max) / 2])
    centers.append((l, center_pt))

radius = 1
pt_counts = {}

for l, c in centers:
    count = np.sum(distance.cdist([c], X, 'euclidean')[0] <= radius)
    pt_counts[l] = count
    print(f"cluster {l}: {c} [{count}]")

max_dens_cluster = max(pt_counts, key=pt_counts.get)
max_dens_pt = dict(centers)[max_dens_cluster]

plt.figure(figsize=(8, 6))     
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolors='k')

for l, c in centers:
    markerC = 'red' if l != max_dens_cluster else 'blue'
    plt.scatter(c[0], c[1], c=markerC, marker='X', s=200, label=f"Cluster {l}")

    circle = plt.Circle(c, radius, color=markerC, fill=False, linestyle='dashed')
    plt.gca().add_patch(circle)

plt.axis('equal')
plt.title("DBSCAN Clustering - Max Density Circle")
plt.legend()
plt.show()

References