Data Mining Lab Report (数据挖掘实验报告资料下载.pdf)
- Document ID: 5966646
- Upload date: 2023-05-05
- Format: PDF
- Pages: 11
- Size: 909.03 KB
Harbin Institute of Technology. Designed by 谢浩哲.

    public class KMeans {
        // The helpers getInitialClusters, isClustersTheSame and getClosestClusters
        // are not shown in this excerpt.
        public Cluster[] getClusters(int k, Point[] points) {
            // Require more points than clusters
            if (k >= points.length) {
                return null;
            }

            Cluster[] clusters = getInitialClusters(k, points);
            Cluster[] newClusters = null;
            do {
                newClusters = getClusters(k, points, clusters);

                // Stop once an iteration no longer changes the clusters
                if (isClustersTheSame(clusters, newClusters)) {
                    break;
                }
                clusters = newClusters;
            } while (true);
            return clusters;
        }

        private Cluster[] getClusters(int k, Point[] points, Cluster[] cluster) {
            // Assignment step: attach every point to its closest cluster
            for (int i = 0; i < points.length; ++i) {
                Point currentPoint = points[i];
                Cluster c = getClosestClusters(currentPoint, cluster);
                c.points.add(currentPoint);
            }

            // Update step: recompute the centroid of every cluster
            Cluster[] newClusters = new Cluster[k];
            for (int i = 0; i < k; ++i) {
                Cluster c = cluster[i];
                int numberOfPointsInCluster = c.points.size();

                if (numberOfPointsInCluster == 0) {
                    // If the cluster is empty, reseed it with a random point
                    int randomIndex = (int) (Math.random() * points.length);
                    newClusters[i] = new Cluster(points[randomIndex]);
                } else {
                    // If the cluster is not empty, average its points
                    double newCentroidX = 0;
                    double newCentroidY = 0;
                    for (int j = 0; j < numberOfPointsInCluster; ++j) {
                        Point p = c.points.get(j);
                        newCentroidX += p.x;
                        newCentroidY += p.y;
                    }
                    newCentroidX /= numberOfPointsInCluster;
                    newCentroidY /= numberOfPointsInCluster;

                    Cluster newCluster = new Cluster(new Point(newCentroidX, newCentroidY));
                    newClusters[i] = newCluster;
                }
            }
            return newClusters;
        }
    }
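The K-Means listing calls `getClosestClusters` and `isClustersTheSame`, but their bodies are not part of this excerpt. The following is a minimal sketch of what such helpers might look like, not the report's actual implementation. The `Point` and `Cluster` types are hypothetical stand-ins, and comparing centroids within a tolerance is only one possible convergence test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the report's own Point and Cluster classes are not shown.
final class KMeansHelpersSketch {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static final class Cluster {
        final Point centroid;
        final List<Point> points = new ArrayList<>();
        Cluster(Point centroid) { this.centroid = centroid; }
    }

    // Squared Euclidean distance; the square root is not needed for comparisons.
    static double squaredDistance(Point a, Point b) {
        double dx = a.x - b.x;
        double dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    // Returns the cluster whose centroid is nearest to p (ties go to the lower index).
    static Cluster getClosestCluster(Point p, Cluster[] clusters) {
        Cluster best = clusters[0];
        double bestDistance = squaredDistance(p, best.centroid);
        for (int i = 1; i < clusters.length; ++i) {
            double d = squaredDistance(p, clusters[i].centroid);
            if (d < bestDistance) {
                bestDistance = d;
                best = clusters[i];
            }
        }
        return best;
    }

    // Assumed convergence test: every centroid moved by at most `tolerance`.
    static boolean sameCentroids(Cluster[] a, Cluster[] b, double tolerance) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; ++i) {
            if (squaredDistance(a[i].centroid, b[i].centroid) > tolerance * tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing centroids rather than point memberships keeps the check cheap, at the cost of needing a tolerance for floating-point drift.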
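2. AGNES (Hierarchical Clustering)

Algorithm idea:
The algorithm uses Group Average as the merge criterion. In the first iteration, the pair with the smallest Group Average among the n points is merged. The merged cluster is added to the list, the two original clusters are removed, and the Group Average between the new cluster and each of the other n - 2 clusters is recomputed. These steps are repeated until all clusters have been merged into one.

Program flowchart:

Core code: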
    import java.util.List;

    public class Agnes {
        // The helper getProximity is not shown in this excerpt.
        public Cluster getCluster(List<Cluster> clusters) {
            while (clusters.size() > 1) {
                double minProximity = Double.MAX_VALUE;
                int minProximityIndex1 = 0, minProximityIndex2 = 0;

                // Find the pair of clusters with the smallest proximity
                for (int i = 0; i < clusters.size(); ++i) {
                    for (int j = i + 1; j < clusters.size(); ++j) {
                        double proximity = getProximity(clusters.get(i), clusters.get(j));

                        if (proximity < minProximity) {
                            minProximity = proximity;
                            minProximityIndex1 = i;
                            minProximityIndex2 = j;
                        }
                    }
                }

                // Merge the pair; remove the larger index first so the smaller stays valid
                Cluster c = new Cluster(clusters.get(minProximityIndex1), clusters.get(minProximityIndex2));
                clusters.add(c);
                clusters.remove(minProximityIndex2);
                clusters.remove(minProximityIndex1);
            }
            return clusters.size() == 0 ? null : clusters.get(0);
        }
    }
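The AGNES listing relies on `getProximity`, whose body is not part of this excerpt. Since the report names Group Average as the merge criterion, a minimal sketch could compute the mean Euclidean distance over all cross-cluster point pairs. The class name and the use of `double[]{x, y}` for points are assumptions made for illustration only.

```java
import java.util.List;

// Hypothetical sketch of a Group Average proximity; not the report's own code.
final class GroupAverageSketch {
    // Mean Euclidean distance over all pairs (p, q) with p in a and q in b.
    // Each point is assumed to be a double[]{x, y}.
    static double groupAverage(List<double[]> a, List<double[]> b) {
        if (a.isEmpty() || b.isEmpty()) {
            throw new IllegalArgumentException("both clusters must be non-empty");
        }
        double sum = 0;
        for (double[] p : a) {
            for (double[] q : b) {
                sum += Math.hypot(p[0] - q[0], p[1] - q[1]);
            }
        }
        return sum / ((double) a.size() * b.size());
    }
}
```

Each call costs time proportional to the product of the two cluster sizes, which is why the description above recomputes proximities only for the newly merged cluster.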
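3. DBSCAN

Algorithm idea:
First, CorePoints are identified among all points by counting the points within each point's neighborhood. BorderPoints are then identified among the remaining points (a BorderPoint lies within the neighborhood of a CorePoint). Next, if two CorePoints are connected to each other, they belong to the same Cluster, so all CorePoints are merged into a number of Clusters. Finally, each BorderPoint is checked to see which CorePoint's neighborhood it falls in, and it is merged into that CorePoint's cluster.

Program flowchart: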
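The following is the core implementation of the algorithm (covering only the identification of CorePoints and the grouping of CorePoints into clusters):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Dbscan {
        // getBorderPoints and getClustersOfBorderPoints are not shown in this excerpt.
        public List<Cluster> getClusters(List<Point> points, int minPoints, double eps) {
            List<Point> corePoints = getCorePoints(points, minPoints, eps);
            Map<Point, Cluster> clusters = getClustersOfCorePoints(corePoints, eps);

            List<Point> borderPoints = getBorderPoints(points, corePoints, minPoints, eps);
            getClustersOfBorderPoints(corePoints, borderPoints, clusters, eps);

            return new ArrayList<>(clusters.values());
        }

        private List<Point> getCorePoints(List<Point> points, int minPoints, double eps) {
            List<Point> corePoints = new ArrayList<>();

            for (int i = 0; i < points.size(); ++i) {
                Point currentPoint = points.get(i);
                int numberOfPointsInEps = 0;
                // The body of this inner loop was lost in extraction; it is
                // reconstructed here from the algorithm description (assumption).
                for (int j = 0; j < points.size(); ++j) {
                    if (currentPoint.isPointsInEpsCircle(points.get(j), eps)) {
                        ++numberOfPointsInEps;
                    }
                }

                if (numberOfPointsInEps >= minPoints) {
                    currentPoint.pointType = PointType.CorePoint;
                    corePoints.add(currentPoint);
                }
            }
            return corePoints;
        }

        private Map<Point, Cluster> getClustersOfCorePoints(List<Point> corePoints, double eps) {
            Map<Point, Cluster> clusters = new HashMap<>();

            for (int i = 0; i < corePoints.size(); ++i) {
                Point currentPoint = corePoints.get(i);
                Point representPoint = null;
                // Only the first earlier CorePoint within eps is used, so two existing
                // clusters bridged by currentPoint are not merged.
                for (int j = 0; j < i; ++j) {
                    Point anotherPoint = corePoints.get(j);
                    if (currentPoint.isPointsInEpsCircle(anotherPoint, eps)) {
                        representPoint = anotherPoint.representPoint;
                        currentPoint.representPoint = representPoint;
                        break;
                    }
                }

                if (representPoint == null) {
                    // No earlier CorePoint in range: start a new cluster
                    currentPoint.representPoint = currentPoint;
                    clusters.put(currentPoint, new Cluster(currentPoint));
                } else {
                    Cluster cluster = clusters.get(representPoint);
                    cluster.points.add(currentPoint);
                }
            }
            return clusters;
        }
    }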
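The grouping loop in the DBSCAN listing attaches each CorePoint to the cluster of the first earlier CorePoint found within eps and then stops searching. If a CorePoint lies within eps of two clusters that were formed separately, those clusters stay separate. One possible remedy, sketched here as an assumption rather than as the report's method, is to use a union-find (disjoint-set) structure and union every connected pair. Points are represented as `double[]{x, y}`, and treating a distance of exactly eps as connected is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: grouping core points with union-find instead of an early break.
final class CorePointUnionFindSketch {
    // Finds the root representative of i, shortening the path along the way.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }

    // Groups core points so that any two within eps of each other, directly or
    // through a chain of core points, end up in the same cluster.
    static List<List<double[]>> clusterCorePoints(List<double[]> corePoints, double eps) {
        int n = corePoints.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; ++i) {
            parent[i] = i;
        }

        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < i; ++j) {
                double[] p = corePoints.get(i);
                double[] q = corePoints.get(j);
                if (Math.hypot(p[0] - q[0], p[1] - q[1]) <= eps) {
                    // Union the two sets; no break, so every connection is recorded
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }

        // Collect the points of each set under its root
        Map<Integer, List<double[]>> byRoot = new LinkedHashMap<>();
        for (int i = 0; i < n; ++i) {
            byRoot.computeIfAbsent(find(parent, i), r -> new ArrayList<>()).add(corePoints.get(i));
        }
        return new ArrayList<>(byRoot.values());
    }
}
```

For example, with core points (0, 0), (10, 0), (5, 0) in that order and eps = 5, the third point bridges the first two, so this sketch yields a single cluster of three points, whereas a first-match loop would leave (10, 0) in a cluster of its own.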
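III. Test Data

1. K-Means
For K-Means, the program randomly generates uniformly distributed two-dimensional coordinates, with both the x and y coordinates in the range [0, 100].
Run parameters:
Number of points: n = 200
Expected number of clusters: k = 5
Data sample:
See KMeans/Runtime/input.txt in the attachment for the full data.
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

2. AGNES (Hierarchical Clustering)
For AGNES, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster, the program randomly generates the cluster center and a radius r of at most 24. The number of points in each cluster is kept within [0, r² + 64].
Run parameters:
Number of clusters: k = 2
Data sample:
See AGNES/Runtime/input.txt in the attachment for the full data.
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

3. DBSCAN
For DBSCAN, the program randomly generates two-dimensional coordinates forming k circular clusters. For each cluster, the program randomly generates the cluster center and a radius r of at most 24. The number of points in each cluster is kept within [0, r² + 64].
Run parameters:
Number of clusters: k = 3
Number of surrounding points required for a CorePoint: minPt = 5
Search radius around a candidate CorePoint: Eps = 4
See DBSCAN/Runtime/input.txt in the attachment for the full data.
(16.38, 7.41), (39.14, 10.49), (66.43, 38.65), (44.11, 51.71), (66.99, 6.14), ...

IV. Experimental Results

1. K-Means
Run parameters: n = 200, k = 5

2. AGNES (Hierarchical Clustering)
Run parameters: k = 2

3. DBSCAN
Run parameters: k = 3, minPt = 5, Eps = 4
NOTE: In the figure, green points are CorePoints, blue points are BorderPoints, and red points are NoisePoints.

V. Difficulties, Solutions, and Reflections

K-Means is, on the whole, fairly simple.

AGNES itself is not complicated, but its visualization took a long time. The current solution splits one cluster in the figure into two every second (which feels like Bisecting K-Means) in order to show the "hierarchical" clustering process.

DBSCAN is more complex, and finding a balance between time and space is not easy. One approach that uses more space is to scan the points around every point and add them to a list, with each point maintaining its own list. As one would expect, this consumes a lot of space. The current implementation first scans to identify the CorePoints, then finds the BorderPoints around the CorePoints, and treats whatever remains as NoisePoints. This requires two scans, however, so it takes more time than the former approach.

Overall, these experiments gave me a good understanding of clustering algorithms and a deeper appreciation of the strengths and weaknesses of the different algorithms.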
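The space-heavy DBSCAN variant mentioned in the reflections, where every point keeps its own neighbor list, can be sketched as follows. This is an illustration under assumptions, not the report's code: points are `double[]{x, y}`, a point counts itself as a neighbor, and a distance of exactly eps counts as inside the neighborhood. In the worst case the lists together hold on the order of n² indices, which is the space cost the text refers to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the neighbor-list approach to classifying points.
final class NeighborListSketch {
    enum PointType { CORE, BORDER, NOISE }

    // For every point, stores the indices of all points within eps (itself included).
    static List<List<Integer>> buildNeighborLists(double[][] points, double eps) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int i = 0; i < points.length; ++i) {
            List<Integer> list = new ArrayList<>();
            for (int j = 0; j < points.length; ++j) {
                double d = Math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]);
                if (d <= eps) {
                    list.add(j);
                }
            }
            neighbors.add(list);
        }
        return neighbors;
    }

    // Classifies points from the stored lists alone, without recomputing distances.
    static PointType[] classify(List<List<Integer>> neighbors, int minPoints) {
        int n = neighbors.size();
        PointType[] types = new PointType[n];
        // Enough neighbors makes a core point; everything else starts as noise
        for (int i = 0; i < n; ++i) {
            types[i] = neighbors.get(i).size() >= minPoints ? PointType.CORE : PointType.NOISE;
        }
        // A non-core point with at least one core neighbor becomes a border point
        for (int i = 0; i < n; ++i) {
            if (types[i] != PointType.NOISE) {
                continue;
            }
            for (int j : neighbors.get(i)) {
                if (types[j] == PointType.CORE) {
                    types[i] = PointType.BORDER;
                    break;
                }
            }
        }
        return types;
    }
}
```

Once the lists exist, the core, border, and noise labels (green, blue, and red in the result figure) follow from list sizes and lookups only, which is where the time saving comes from.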