# DBSCAN Density-Based Clustering

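DBSCAN is a density-based clustering algorithm. Algorithms of this kind assume that the cluster structure can be determined from how tightly the samples are distributed. They judge whether samples are connectable from the point of view of sample density, and keep expanding each cluster along connectable samples until the final clustering is obtained.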

**1) Definitions**

![img](https://img-blog.csdnimg.cn/20190302120515122.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2ppYW5nNDI1Nzc2MDI0,size_16,color_FFFFFF,t_70)

![img](https://images2015.cnblogs.com/blog/1042406/201612/1042406-20161222112847323-1346197243.png)
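
The definitions above are given only as figures. As a rough illustration of the usual textbook notions (assumed here, not transcribed from the figures): the ϵ-neighborhood of a sample is the set of samples within distance ϵ of it, and a sample is a core object when that neighborhood holds at least a minimum number of samples. The function name `core_object_mask` and the toy data are hypothetical; Euclidean distance is assumed, and the neighborhood counts the sample itself, which matches the scikit-learn convention.

```python
import numpy as np

def core_object_mask(X, eps, min_pts):
    """Return a boolean mask marking which rows of X are core objects."""
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Size of each eps-neighborhood; every sample counts itself
    neighbor_counts = (dist <= eps).sum(axis=1)
    return neighbor_counts >= min_pts

# Three tightly packed points and one isolated point
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mask = core_object_mask(X_toy, eps=0.2, min_pts=3)
print(mask.tolist())  # [True, True, True, False]
```

The dense `(n, n)` distance matrix keeps the sketch short but uses quadratic memory, so it is only suitable for small inputs.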

**2) DBSCAN algorithm procedure**

![img](https://img-blog.csdnimg.cn/20190302113838430.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2ppYW5nNDI1Nzc2MDI0,size_16,color_FFFFFF,t_70)

![img](https://img-blog.csdnimg.cn/20190302113901586.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2ppYW5nNDI1Nzc2MDI0,size_16,color_FFFFFF,t_70)
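
The procedure above is shown only as figures, so here is a minimal sketch of the standard DBSCAN loop for orientation. It is not a transcription of the figures and not scikit-learn's implementation. The name `dbscan_sketch` is hypothetical, Euclidean distance is assumed, and a brute-force distance matrix stands in for a spatial index.

```python
from collections import deque

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Label each row of X with a cluster id (0, 1, ...) or -1 for noise."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # eps-neighborhood of every sample (each sample includes itself)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)  # everything starts as noise
    cluster_id = 0
    for i in range(n):
        # A new cluster is seeded only from an unassigned core object
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            j = queue.popleft()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster_id
                    # Only core objects keep expanding the cluster;
                    # border samples are labeled but not expanded
                    if is_core[k]:
                        queue.append(k)
        cluster_id += 1
    return labels

X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [3.0, 3.0], [3.1, 3.0], [3.0, 3.1],
                  [10.0, 10.0]])
toy_labels = dbscan_sketch(X_toy, eps=0.2, min_pts=3)
print(toy_labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

Samples that are never reached from any core object keep the label -1, which is the same noise label `sklearn.cluster.DBSCAN` uses.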

## sklearn

> In scikit-learn, the DBSCAN estimator class is `sklearn.cluster.DBSCAN`.
>
> sklearn API: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN>
>
> Parameter walkthrough (in Chinese): <https://www.cnblogs.com/pinard/p/6217852.html>
>
> Selected parameters:
>
> eps: the distance threshold that defines the ϵ-neighborhood. The default is 0.5.
>
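> min\_samples: the minimum number of samples (counting the sample itself) that must lie in a sample's ϵ-neighborhood for it to be a core object. The default is 5. If min\_samples is too large, there are too few core objects, so samples that really belong to one cluster may be labeled as noise and the number of clusters may grow. Conversely, if min\_samples is too small, a large number of core objects are produced, which may merge clusters and leave too few of them.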
>
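> metric: the distance metric. The default is "euclidean". Common choices are Euclidean distance "euclidean", Manhattan distance "manhattan", Chebyshev distance "chebyshev", Minkowski distance "minkowski", weighted Minkowski distance "wminkowski" (deprecated in recent SciPy and scikit-learn releases, so check your installed version), Mahalanobis distance "mahalanobis", and standardized Euclidean distance "seuclidean", which is the Euclidean distance after each feature dimension has been scaled by its variance.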
>
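> algorithm: the nearest-neighbor search algorithm. 'brute' is a brute-force search, 'kd\_tree' uses a KD tree, and 'ball\_tree' uses a ball tree. 'auto' (the default) tries to pick the most appropriate of the three based on the data passed to `fit`.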
>
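> leaf\_size: a nearest-neighbor search parameter used with the KD tree or ball tree. It is the leaf-size threshold at which the tree stops splitting into subtrees. The smaller the value, the larger and deeper the resulting tree and the longer it takes to build. The larger the value, the smaller and shallower the tree and the faster it is built. The default is 30. Because this value normally affects only running speed and memory use, it can usually be left alone.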
>
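> p: the power parameter of the Minkowski distance. It is only used when the metric is Minkowski (or weighted Minkowski): p=1 gives the Manhattan distance and p=2 gives the Euclidean distance. With the default Euclidean metric this parameter can be ignored.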
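
To make the effect of min\_samples and p concrete, the sketch below sweeps min\_samples at a fixed eps on a three-blob dataset and then checks that Minkowski with p=1 gives the same labels as Manhattan. The particular values swept are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)
X = StandardScaler().fit_transform(X)

# With eps fixed, raising min_samples shrinks the set of core objects,
# so the number of noise points can only stay the same or grow.
noise_counts = []
for min_samples in (3, 5, 10, 20, 40):
    labels = DBSCAN(eps=0.3, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    noise_counts.append(n_noise)
    print(f"min_samples={min_samples:>2}: clusters={n_clusters}, noise={n_noise}")

# Minkowski distance with p=1 is the Manhattan distance.
labels_minkowski = DBSCAN(eps=0.3, min_samples=10,
                          metric="minkowski", p=1).fit(X).labels_
labels_manhattan = DBSCAN(eps=0.3, min_samples=10,
                          metric="manhattan").fit(X).labels_
print(np.array_equal(labels_minkowski, labels_manhattan))
```

The cluster count, unlike the noise count, need not change monotonically: a stricter min\_samples can split one cluster into several before it dissolves clusters entirely.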

```python
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
# Standardize each feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)

# #############################################################################
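# eps=0.3 neighborhood radius, min_samples=10 core-object threshold
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# Start from an all-False mask
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# Set the positions of core samples to True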
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters, ignoring noise (label == -1)
n_clusters_ = np.unique(labels).shape[0] - (1 if -1 in labels else 0)
# Number of noise points
n_noise_ = labels[labels == -1].shape[0]

print('Number of clusters: %d' % n_clusters_)
print('Number of noise points: %d' % n_noise_)

# ######################### Evaluation metrics #########################
# reference：https://blog.csdn.net/sinat_26917383/article/details/70577710

# Homogeneity: each cluster contains only members of a single true class.
# Score in [0.0, 1.0]; 1.0 means perfectly homogeneous labeling.
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
# Completeness: all members of a given true class are assigned to the same cluster.
# Score in [0.0, 1.0]; 1.0 means perfectly complete labeling.
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
# V-measure: harmonic mean of homogeneity and completeness
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

# Adjusted Rand index: at most 1.0, near 0.0 for random labelings; higher means closer to the ground truth
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
# Adjusted mutual information: also measures agreement; at most 1.0, near 0.0 for random labelings
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
# Silhouette coefficient: needs no ground truth; in [-1, 1], higher when samples sit close to their own cluster and far from others
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# ################################ Plotting ################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
# Walk through the labels and colors together so each cluster gets its own color
for k, col in zip(unique_labels, colors):
    # Noise is drawn in black
    if k == -1:
        col = [0, 0, 0, 1]
    # Boolean mask of the samples with label k
    class_member_mask = (labels == k)

    # Samples with label k that are core samples (True in core_samples_mask),
    # drawn with markersize=14, larger than non-core samples
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # ~ negates the boolean mask element-wise (True <-> False).
    # Samples with label k that are not core samples (False in core_samples_mask),
    # drawn with markersize=6, smaller than core samples
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

'''
output (as recorded by the author; exact values can vary with the scikit-learn version):

Number of clusters: 3
Number of noise points: 18
Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
Adjusted Mutual Information: 0.883
Silhouette Coefficient: 0.626
'''
```

![img](https://img-blog.csdnimg.cn/20190303182701896.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2ppYW5nNDI1Nzc2MDI0,size_16,color_FFFFFF,t_70)

