
Data Mining Lab Report


Harbin Institute of Technology, designed by 谢浩哲

    public class KMeans {
        public Cluster[] getClusters(int k, Point[] points) {
            // The comparison operator was lost in this copy; ">=" is assumed.
            if (k >= points.length) {
                return null;
            }

            Cluster[] clusters = getInitialClusters(k, points);
            Cluster[] newClusters = null;
            // Iterate until an update leaves the clusters unchanged.
            do {
                newClusters = getClusters(k, points, clusters);

                if (isClustersTheSame(clusters, newClusters)) {
                    break;
                }
                clusters = newClusters;
            } while (true);
            return clusters;
        }

        private Cluster[] getClusters(int k, Point[] points, Cluster[] cluster) {
            // Assignment step: attach every point to its closest cluster.
            for (int i = 0; i < points.length; ++i) {
                Point currentPoint = points[i];
                Cluster c = getClosestClusters(currentPoint, cluster);
                c.points.add(currentPoint);
            }

            // Update step: build new clusters around the recomputed centroids.
            Cluster[] newClusters = new Cluster[k];
            for (int i = 0; i < k; ++i) {
                Cluster c = cluster[i];
                int numberOfPointsInCluster = c.points.size();

                if (numberOfPointsInCluster == 0) {
                    // If the cluster is empty, reseed it from a random point.
                    int randomIndex = (int) (Math.random() * points.length);
                    newClusters[i] = new Cluster(points[randomIndex]);
                } else {
                    // If the cluster is not empty, use the mean of its points.
                    double newCentroidX = 0;
                    double newCentroidY = 0;
                    for (int j = 0; j < numberOfPointsInCluster; ++j) {
                        Point p = c.points.get(j);
                        newCentroidX += p.x;
                        newCentroidY += p.y;
                    }
                    newCentroidX /= numberOfPointsInCluster;
                    newCentroidY /= numberOfPointsInCluster;

                    Cluster newCluster = new Cluster(new Point(newCentroidX, newCentroidY));
                    newClusters[i] = newCluster;
                }
            }
            return newClusters;
        }
    }
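The K-Means listing above calls getClosestClusters and isClustersTheSame without showing them. The following is only a sketch of what such helpers might look like, not the report's actual code. It assumes Euclidean distance to the centroid and treats two clusterings as identical when corresponding centroids differ by at most a small tolerance. The class KMeansHelpers, its nested Point and Cluster types, the singular method name getClosestCluster and the EPSILON constant are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the report's Point and Cluster classes.
class KMeansHelpers {
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static class Cluster {
        final Point centroid;
        final List<Point> points = new ArrayList<>();
        Cluster(Point centroid) { this.centroid = centroid; }
    }

    // Assumed tolerance for deciding that a centroid has not moved.
    static final double EPSILON = 1e-9;

    // Squared Euclidean distance; sufficient for comparisons and avoids a sqrt.
    static double squaredDistance(Point a, Point b) {
        double dx = a.x - b.x;
        double dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    // Returns the cluster whose centroid is nearest to p.
    static Cluster getClosestCluster(Point p, Cluster[] clusters) {
        Cluster best = clusters[0];
        for (Cluster c : clusters) {
            if (squaredDistance(p, c.centroid) < squaredDistance(p, best.centroid)) {
                best = c;
            }
        }
        return best;
    }

    // True when both arrays have the same length and every pair of
    // corresponding centroids is at most EPSILON apart.
    static boolean isClustersTheSame(Cluster[] a, Cluster[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; ++i) {
            if (squaredDistance(a[i].centroid, b[i].centroid) > EPSILON * EPSILON) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing centroids rather than point memberships keeps the convergence check at O(k) per iteration; a membership-based check would be an equally valid choice.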

2. AGNES (Hierarchical Clustering)

Algorithm idea:

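The algorithm uses Group Average as the merge criterion. In the first iteration, it selects the pair with the smallest Group Average among the n points and merges them, adds the merged cluster to the list, removes the two original clusters, and recomputes the Group Average between the points of the new cluster and the other n - 2 clusters. These steps are repeated until all clusters have been merged into one.

Program flowchart: [figure not preserved in this copy]

Core code: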

    import java.util.List;

    public class Agnes {
        public Cluster getCluster(List<Cluster> clusters) {
            // Merge the two closest clusters until only one remains.
            while (clusters.size() > 1) {
                double minProximity = Double.MAX_VALUE;
                int minProximityIndex1 = 0, minProximityIndex2 = 0;

                // Find the pair (i, j), i < j, with the smallest proximity.
                for (int i = 0; i < clusters.size(); ++i) {
                    for (int j = i + 1; j < clusters.size(); ++j) {
                        double proximity = getProximity(clusters.get(i), clusters.get(j));

                        if (proximity < minProximity) {
                            minProximity = proximity;
                            minProximityIndex1 = i;
                            minProximityIndex2 = j;
                        }
                    }
                }

                Cluster c = new Cluster(clusters.get(minProximityIndex1), clusters.get(minProximityIndex2));
                clusters.add(c);
                // Remove the larger index first so the smaller one stays valid.
                clusters.remove(minProximityIndex2);
                clusters.remove(minProximityIndex1);
            }
            return clusters.size() == 0 ? null : clusters.get(0);
        }
    }
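The AGNES listing relies on getProximity, which is not shown. Under the Group Average criterion named in the algorithm description, the proximity of two clusters is commonly taken to be the mean of all pairwise distances between a point in one cluster and a point in the other. The sketch below illustrates that definition, assuming Euclidean distance; the class GroupAverage, its nested Point type and the method name proximity are hypothetical and are not taken from the report.

```java
import java.util.List;

class GroupAverage {
    // Hypothetical 2-D point type used only for this illustration.
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Mean Euclidean distance over all pairs (p, q) with p in a and q in b.
    static double proximity(List<Point> a, List<Point> b) {
        if (a.isEmpty() || b.isEmpty()) {
            throw new IllegalArgumentException("clusters must be non-empty");
        }
        double sum = 0;
        for (Point p : a) {
            for (Point q : b) {
                sum += Math.hypot(p.x - q.x, p.y - q.y);
            }
        }
        return sum / ((double) a.size() * b.size());
    }
}
```

For example, a cluster holding only (0, 0) and a cluster holding (3, 4) and (6, 8) have pairwise distances 5 and 10, so their group-average proximity is 7.5. Recomputing this from scratch for every pair costs O(|A|·|B|) per call, which is acceptable for the small inputs used in this report.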

3. DBSCAN

Algorithm idea:

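First, identify the CorePoints among all points by counting the number of points in each point's neighborhood. Then identify the BorderPoints among the remaining points, that is, the points that lie within the neighborhood of some CorePoint. Next, if two CorePoints are connected to each other, they belong to the same Cluster, so all CorePoints are merged into a number of Clusters. Finally, check every BorderPoint to see which CorePoint's neighborhood it falls in, and merge it into the cluster that CorePoint belongs to.

Program flowchart: [figure not preserved in this copy]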

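The core code of the algorithm follows (it covers only identifying the CorePoints and grouping the CorePoints into clusters).

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Dbscan {
        public List<Cluster> getClusters(List<Point> points, int minPoints, double eps) {
            List<Point> corePoints = getCorePoints(points, minPoints, eps);
            Map<Point, Cluster> clusters = getClustersOfCorePoints(corePoints, eps);

            List<Point> borderPoints = getBorderPoints(points, corePoints, minPoints, eps);
            getClustersOfBorderPoints(corePoints, borderPoints, clusters, eps);

            return new ArrayList<>(clusters.values());
        }

        private List<Point> getCorePoints(List<Point> points, int minPoints, double eps) {
            List<Point> corePoints = new ArrayList<>();

            for (int i = 0; i < points.size(); ++i) {
                Point currentPoint = points.get(i);
                int numberOfPointsInEps = 0;
                for (int j = 0; j < points.size(); ++j) {
                    // The loop body was lost in this copy; this count is
                    // reconstructed from context and is an assumption.
                    if (currentPoint.isPointsInEpsCircle(points.get(j), eps)) {
                        ++numberOfPointsInEps;
                    }
                }
                if (numberOfPointsInEps >= minPoints) {
                    currentPoint.pointType = PointType.CorePoint;
                    corePoints.add(currentPoint);
                }
            }
            return corePoints;
        }

        private Map<Point, Cluster> getClustersOfCorePoints(List<Point> corePoints, double eps) {
            Map<Point, Cluster> clusters = new HashMap<>();

            for (int i = 0; i < corePoints.size(); ++i) {
                Point currentPoint = corePoints.get(i);
                Point representPoint = null;
                // Join the cluster of the first earlier CorePoint within eps.
                // Clusters bridged only by a later point are not merged here.
                for (int j = 0; j < i; ++j) {
                    Point anotherPoint = corePoints.get(j);
                    if (currentPoint.isPointsInEpsCircle(anotherPoint, eps)) {
                        representPoint = anotherPoint.representPoint;
                        currentPoint.representPoint = representPoint;
                        break;
                    }
                }

                if (representPoint == null) {
                    // No earlier neighbor: this point starts a new cluster.
                    currentPoint.representPoint = currentPoint;
                    clusters.put(currentPoint, new Cluster(currentPoint));
                } else {
                    Cluster cluster = clusters.get(representPoint);
                    cluster.points.add(currentPoint);
                }
            }
            return clusters;
        }
    }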
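The DBSCAN listing stops after the CorePoints have been clustered; getBorderPoints and getClustersOfBorderPoints are called but not shown. The sketch below illustrates the remaining step from the algorithm description: each non-core point joins the cluster of a CorePoint whose neighborhood contains it, and anything left over is noise. It is an illustration under assumptions, not the report's implementation. The class BorderAssignment, the integer clusterId field, the within method and the rule of picking the first matching CorePoint are all hypothetical.

```java
import java.util.List;

class BorderAssignment {
    enum PointType { CORE, BORDER, NOISE }

    // Hypothetical point type; clusterId == -1 means "no cluster".
    static class Point {
        final double x, y;
        PointType type = PointType.NOISE;
        int clusterId = -1;

        Point(double x, double y) { this.x = x; this.y = y; }

        // True when the other point lies within distance eps of this one.
        boolean within(Point other, double eps) {
            return Math.hypot(x - other.x, y - other.y) <= eps;
        }
    }

    // Expects core points to be marked CORE with clusterId already set.
    // Every other point becomes BORDER (inheriting the cluster of the first
    // core point found within eps) or NOISE (no core point within eps).
    static void assignBorderPoints(List<Point> points, double eps) {
        for (Point p : points) {
            if (p.type == PointType.CORE) {
                continue;
            }
            p.type = PointType.NOISE;
            p.clusterId = -1;
            for (Point q : points) {
                if (q.type == PointType.CORE && p.within(q, eps)) {
                    p.type = PointType.BORDER;
                    p.clusterId = q.clusterId;
                    break;
                }
            }
        }
    }
}
```

A border point that lies near CorePoints of two different clusters is assigned to whichever is encountered first, so the result can depend on input order; this is a known property of DBSCAN border handling rather than something specific to this sketch.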

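III. Test Data

1. K-Means

For K-Means, the program randomly generates uniformly distributed two-dimensional coordinates, with both the x and y coordinates in the range [0, 100].

Run parameters:
Number of points: n = 200
Expected number of clusters: k = 5

Data sample (see KMeans/Runtime/input.txt in the attachment for the full data):
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

2. AGNES (Hierarchical Clustering)

For AGNES, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster, the program randomly generates the cluster center and a radius r no greater than 24. The number of points in each cluster is kept within [0, r^2 + 64].

Run parameters:
Number of clusters: k = 2

Data sample (see AGNES/Runtime/input.txt in the attachment for the full data):
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

3. DBSCAN

For DBSCAN, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster, the program randomly generates the cluster center and a radius r no greater than 24. The number of points in each cluster is kept within [0, r^2 + 64].

Run parameters:
Number of generated clusters: k = 3
Number of neighboring points required for a CorePoint: minPt = 5
Neighborhood search radius for a CorePoint: Eps = 4

Data sample (see DBSCAN/Runtime/input.txt in the attachment for the full data):
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

IV. Experimental Results

1. K-Means

Run parameters: n = 200, k = 5
[result figure not preserved in this copy]

2. AGNES (Hierarchical Clustering)

Run parameters: k = 2
[result figure not preserved in this copy]

3. DBSCAN

Run parameters: k = 3, minPt = 5, Eps = 4
[result figure not preserved in this copy]

NOTE: In the figure, green points are CorePoints, blue points are BorderPoints, and red points are NoisePoints.

V. Difficulties Encountered, Solutions, and Reflections

The K-Means algorithm is, on the whole, fairly simple.

AGNES itself is not complicated, but the visualization took a long time. The current solution splits one cluster in the figure into two every second (which feels like Bisecting K-Means) in order to show the "hierarchical" clustering process.

DBSCAN is more complex, and finding a balance between time and space is not easy. One space-hungry approach is to scan the surroundings of every point and add the nearby points to a list, with each point maintaining its own list. As one would expect, this consumes a lot of memory. The current implementation first scans to identify the CorePoints, then finds the surrounding BorderPoints based on the CorePoints; whatever remains is a NoisePoint. This requires two scans, however, so its time cost is higher than that of the first approach.

Overall, these experiments gave me a good understanding of clustering algorithms and a deeper appreciation of the strengths and weaknesses of each.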
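To make the trade-off discussed for DBSCAN more concrete, here is a rough sketch of the space-hungry alternative, in which every point keeps its own neighbor list. It is not the report's code. The class NeighborLists, the use of double[][] for coordinates, and the choice to exclude a point from its own neighbor count are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NeighborLists {
    enum PointType { CORE, BORDER, NOISE }

    // Single pass over all pairs: neighbors.get(i) holds the indices of the
    // points within eps of point i (point i itself is not included).
    static List<List<Integer>> build(double[][] pts, double eps) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int i = 0; i < pts.length; ++i) {
            neighbors.add(new ArrayList<>());
        }
        for (int i = 0; i < pts.length; ++i) {
            for (int j = i + 1; j < pts.length; ++j) {
                double d = Math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1]);
                if (d <= eps) {
                    neighbors.get(i).add(j);
                    neighbors.get(j).add(i);
                }
            }
        }
        return neighbors;
    }

    // Classifies points from the stored lists without recomputing distances.
    static PointType[] classify(List<List<Integer>> neighbors, int minPoints) {
        int n = neighbors.size();
        PointType[] types = new PointType[n];
        // A point with at least minPoints neighbors is a core point.
        for (int i = 0; i < n; ++i) {
            types[i] = neighbors.get(i).size() >= minPoints ? PointType.CORE : PointType.NOISE;
        }
        // A non-core point with a core neighbor is a border point.
        for (int i = 0; i < n; ++i) {
            if (types[i] == PointType.CORE) {
                continue;
            }
            for (int j : neighbors.get(i)) {
                if (types[j] == PointType.CORE) {
                    types[i] = PointType.BORDER;
                    break;
                }
            }
        }
        return types;
    }
}
```

The distance computations happen once, in build, and every later step only reads the lists. The cost is memory: in a dense data set the lists together can hold on the order of n^2 entries, which is the space consumption the section refers to.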