A Density-Based Method for Determining the Optimal Number of Clusters.docx
- Document ID: 18194085
- Upload date: 2023-08-13
- Format: DOCX
- Pages: 6
- Size: 17.67 KB
A method for determining the optimal number of clusters based on density
[Abstract] Determining the correct number of clusters in a dataset is a fundamental problem in cluster analysis. Commonly used methods for determining the number of clusters usually depend on a particular clustering algorithm and are not effective when the dataset contains clusters of clusters. This paper proposes a new index for the optimal number of clusters. The index focuses on the geometric structure of the clusters and, from the viewpoint of the distribution density of the data objects, measures the tightness within classes and the degree of separation between classes. The index is not sensitive to noise and can identify sub-manifold groups in a dataset. Experimental results on real and synthetic data show that the new index outperforms other indices in use.
[Keywords] cluster evaluation, number of clusters, clustering validity index
0 Introduction
Clustering is an important analysis method in data mining research. Its purpose is to group the objects of a dataset into classes so that objects in the same class are similar, while objects in different classes are dissimilar. So far, researchers have proposed numerous clustering algorithms, which have been widely used in business intelligence, graphic analysis, and bioinformatics. Because clustering is an unsupervised learning method, the clustering results it produces need to be evaluated. Many clustering algorithms require the number of clusters of a given dataset as input, but in practice this number is usually unknown. Determining the number of clusters of a dataset remains one of the fundamental problems in cluster analysis.
Cluster evaluation is used to assess the quality of clustering results and is considered one of the important factors influencing the success of cluster analysis. Its place in the cluster analysis process is shown in figure 1. Important issues in cluster evaluation include assessing the clustering tendency of the dataset, determining the correct number of clusters, and comparing the results of cluster analysis with known reference results. This paper mainly studies the determination of the optimal number of clusters.
The optimal number of clusters is usually determined by the following process. On a given dataset, a particular clustering algorithm is run with different input parameters (such as the number of clusters) to obtain different partitions of the dataset. A clustering validity index is computed for each partition. Finally, the index values, or the way they change, are compared, and the algorithm parameter whose index value meets a predetermined condition is taken as the optimal number of clusters [4].
To date, various measures have been proposed to evaluate the quality of a dataset partition from different perspectives. These measures are called clustering validity indices. In general, the measures used to evaluate the various aspects of a clustering can be divided into two categories [5]:
1) External index: the evaluation is based on a benchmark for which the number of clusters and the correct class of each data object are known. Representative external indices are entropy, purity, and F-measure.
2) Internal index: when the structure of the dataset is unknown, the evaluation of clustering results depends only on the features and values of the dataset itself. In this case, the evaluation pursues two objectives: intra-class tightness and inter-class separation.
Internal indices are also the main research area of this article. Representative internal indices are DB, CH, XB, and SD.
From other perspectives, clustering validity indices can also be divided into partitional and hierarchical indices, fuzzy and non-fuzzy indices, and statistical and geometric indices.
To evaluate a clustering with internal indices, the process of obtaining the optimal partition or the optimal number of clusters of a dataset is generally divided into four steps [6]:
Step 1: choose a set of clustering algorithms for clustering the dataset;
Step 2: for each clustering algorithm, use different input parameters to obtain different clustering results;
Step 3: for each clustering result obtained in step 2, compute the internal index and obtain the corresponding value;
Step 4: select the best partition or the optimal number of clusters according to the rule required by the internal index.
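Steps 2 to 4 can be sketched as a small selection loop for one clustering algorithm and one internal index. This is a minimal illustration, not the paper's implementation: the names `select_cluster_number`, `cluster_fn`, `index_fn`, and `larger_is_better` are hypothetical, and the caller supplies both the clustering algorithm and the index.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def select_cluster_number(
    data: Sequence,
    candidate_ks: Iterable[int],
    cluster_fn: Callable[[Sequence, int], List[int]],
    index_fn: Callable[[Sequence, List[int]], float],
    larger_is_better: bool,
) -> Tuple[int, Dict[int, float]]:
    """Run steps 2-4 for one clustering algorithm and one internal index.

    cluster_fn(data, k) returns one label per object (hypothetical interface).
    index_fn(data, labels) returns the value of the internal index.
    """
    scores: Dict[int, float] = {}
    for k in candidate_ks:
        labels = cluster_fn(data, k)         # step 2: one partition per parameter value
        scores[k] = index_fn(data, labels)   # step 3: index value of that partition
    pick = max if larger_is_better else min  # step 4: rule required by the index
    best_k = pick(scores, key=scores.get)
    return best_k, scores
```

The `larger_is_better` flag reflects that some indices (such as CH) are maximized while others (such as DB or XB) are minimized.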
1 Common clustering validity indices
1.1 Davies-Bouldin index (DB) [7]
The DB index first computes, for each class, the average distance between the points in the class and the class center. It then computes the similarity between each class and every other class and takes the maximum as the similarity of that class. Finally, the DB index is obtained by averaging the similarities of all classes. It follows that the smaller DB is, the lower the similarity between classes, and thus the better the clustering result.
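The computation described above can be sketched in plain Python. The text does not spell out the similarity formula, so the sketch assumes the standard Davies-Bouldin form, (S_i + S_j) / d_ij, where S_i is the average distance of class i to its center and d_ij is the distance between the two centers. The function names and the list-of-point-lists input format are illustrative choices.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def davies_bouldin(clusters):
    """DB index of a hard partition given as a list of non-empty point lists.

    Assumes at least two classes whose centers are distinct.
    """
    centers = [centroid(pts) for pts in clusters]
    # Average distance from the points of each class to its center.
    scatter = [
        sum(math.dist(p, ctr) for p in pts) / len(pts)
        for pts, ctr in zip(clusters, centers)
    ]
    worst = []
    for i in range(len(clusters)):
        # Similarity of class i to every other class; keep the maximum.
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        ))
    # Average the per-class maxima.
    return sum(worst) / len(worst)
```

For two classes with the same internal spread, moving the centers further apart lowers the value, which matches the "smaller is better" reading.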
1.2 Calinski-Harabasz index (CH) [8]
The CH index measures tightness by the sum of squared distances between the points in each class and the class center, and measures the separation of the dataset by the sum of squared distances between the class centers and the center of the dataset. The CH index is the ratio of separation to tightness. Therefore, the larger CH is, the more compact each class is and the more dispersed the classes are from one another, that is, the better the clustering result.
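A minimal sketch of this ratio follows. Two details are assumptions taken from the standard Calinski-Harabasz definition rather than from the text: each class center's squared distance to the dataset center is weighted by the class size, and the two sums are normalized by (k - 1) and (n - k), with n objects and k classes.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def calinski_harabasz(clusters):
    """CH index of a hard partition given as a list of non-empty point lists.

    Assumes 2 <= k < n and a non-zero within-class sum of squares.
    """
    all_points = [p for pts in clusters for p in pts]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    between = 0.0  # separation: class centers vs. dataset center
    within = 0.0   # tightness: points vs. their own class center
    for pts in clusters:
        ctr = centroid(pts)
        between += len(pts) * math.dist(ctr, overall) ** 2
        within += sum(math.dist(p, ctr) ** 2 for p in pts)
    return (between / (k - 1)) / (within / (n - k))
```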
1.3 Xie-Beni index (XB) [9]
The XB index uses the smallest squared distance between class centers to measure the separation between classes, and the distances between the points in each class and the class center to measure the tightness of the classes. The XB index is the ratio of tightness to separation. As with the CH index, XB seeks a balance between the tightness of the classes and the separation between them; the optimal clustering result is obtained where XB reaches its minimum.
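The ratio can be sketched for a hard partition as follows. This is a crisp simplification and an assumption on my part: the original Xie-Beni index is defined for fuzzy partitions with membership weights, which the text does not describe. The sketch uses squared point-to-center distances for tightness and divides by n times the smallest squared center-to-center distance.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def xie_beni(clusters):
    """Crisp XB index of a hard partition given as a list of non-empty point lists.

    Assumes at least two classes whose centers are distinct.
    """
    centers = [centroid(pts) for pts in clusters]
    n = sum(len(pts) for pts in clusters)
    # Tightness: total squared distance of points to their class center.
    compactness = sum(
        math.dist(p, ctr) ** 2
        for pts, ctr in zip(clusters, centers)
        for p in pts
    )
    # Separation: smallest squared distance between two class centers.
    min_separation = min(
        math.dist(centers[i], centers[j]) ** 2
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return compactness / (n * min_separation)
```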
1.4 SD index [10]
The SD validity index combines a tightness term and a separation term. It measures the tightness of the classes by the standard deviation of the objects in each class, and measures the separation between classes by the distances between the classes. A weighting term balances the relative importance of the tightness of the classes and the separation between them, and the paper assigns this weight a specific value. The standard deviations used are those of the individual classes and of the whole dataset, respectively.
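Since the defining equation is not available in the text, the sketch below follows the commonly cited form of the SD index, SD = alpha * Scat + Dis, and should be read as an assumption rather than as the paper's exact definition. In that form, Scat is the average norm of the per-class variance vectors divided by the norm of the dataset's variance vector, and Dis sums the inverse total center-to-center distances, scaled by the ratio of the largest to the smallest center distance. The weight `alpha` is left as a caller-supplied parameter.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def variance_vector(points):
    """Per-dimension population variance of a list of points."""
    ctr = centroid(points)
    return tuple(
        sum((p[d] - ctr[d]) ** 2 for p in points) / len(points)
        for d in range(len(ctr))
    )


def norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def sd_index(clusters, alpha):
    """SD = alpha * Scat + Dis (assumed form); needs >= 2 classes, distinct centers."""
    all_points = [p for pts in clusters for p in pts]
    centers = [centroid(pts) for pts in clusters]
    k = len(clusters)
    # Scat: average class scatter relative to the scatter of the whole dataset.
    scat = sum(norm(variance_vector(pts)) for pts in clusters) / (
        k * norm(variance_vector(all_points))
    )
    # Dis: separation term built from the distances between class centers.
    gaps = [[math.dist(a, b) for b in centers] for a in centers]
    pairs = [gaps[i][j] for i in range(k) for j in range(k) if i != j]
    dis = (max(pairs) / min(pairs)) * sum(1.0 / sum(row) for row in gaps)
    return alpha * scat + dis
```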
2 Density-based clustering index
For a particular clustering algorithm, different input parameters can produce different partitions. For a particular dataset, only one partition is optimal. The optimal partition here is the one that is closer to the actual structure of the dataset than any other partition.
Therefore, the objective of this research is to define a new clustering validity index for evaluating the quality of the different partitions that a clustering algorithm produces on a dataset.
The aim is to find the partition that is most consistent with the real situation and to obtain the input parameters needed for this partition, such as the number of clusters.
Most existing cluster validity indices are based on the distances between objects and class centers. As a result, they do not capture the distribution of objects within a class well, and they ignore how densely objects are distributed between classes.
The purpose of cluster analysis is to group the objects of a dataset into classes so that objects in the same class are similar, while objects in different classes are dissimilar. From the viewpoint of the geometric distribution of the data, clustering makes the spatial distribution of objects within a class relatively dense and the distribution of objects in the space between classes relatively sparse. In other words, the density of objects within a class is greater than the density of objects between classes. Using density-based methods to evaluate clustering results is therefore consistent with the nature of cluster analysis.
Consider a multi-dimensional dataset made up of data objects, with a known number of objects in the dataset. A hard-partition clustering algorithm divides the dataset into disjoint subsets, and each subset is called a class. In this paper, notation is introduced for the center point of a class, for the number of objects in a class, and for the distance between two objects.
Definition 1: class center density (cluster center density)
The center density of a class is the number of objects that lie within a given radius of the class center.
Definition 2: class radius (cluster radius)
The class radius is the average of the distances from all objects in the class to the class center.
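Definitions 1 and 2 can be sketched as two short functions. The reading of Definition 1 is an assumption: objects are counted over the whole dataset rather than only over the class, and the radius is passed in as a parameter because its symbol and value are not given here. The function names are illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def cluster_radius(points):
    """Definition 2: mean distance from the objects of a class to its center."""
    ctr = centroid(points)
    return sum(math.dist(p, ctr) for p in points) / len(points)


def center_density(dataset, center, radius):
    """Definition 1 (as read here): number of objects within `radius` of `center`."""
    return sum(1 for p in dataset if math.dist(p, center) <= radius)
```

For example, for a class made of the four corners of a 2 x 2 square, the class radius is the square root of 2, and the center density counts how many dataset objects fall inside the chosen radius around the point (1, 1).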
Definition 3: medium point between classes (medium point between clusters)
The medium point between two classes is a point on the line segment joining the two class centers, positioned so that the ratio of its distances to the two centers equals the ratio of the two class radii. Figure 2 illustrates this for a dataset with three classes: for each pair of classes, the distances from their medium point to the two class centers are in the same ratio as the radii of the two classes.
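The medium point can be computed by linear interpolation between the two centers. The sketch below assumes the point lies between the centers (internal division) and that at least one radius is positive; `medium_point` is an illustrative name.

```python
def medium_point(center_a, radius_a, center_b, radius_b):
    """Point on the segment a-b whose distances to a and b are in ratio radius_a : radius_b."""
    # Fraction of the way from center_a to center_b.
    t = radius_a / (radius_a + radius_b)
    return tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
```

With radii 1 and 3 and centers 8 units apart, the point sits 2 units from the first center and 6 units from the second, so a compact class keeps the medium point close to itself.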
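Definition 4: class edge density (cluster boundary density)
The edge density of a class is computed from the medium points between that class and the other classes in the dataset. Taking each medium point as a center and using a given radius, the number of objects whose distance to the medium point is less than or equal to the radius is counted, and the edge density is the mean of these counts.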
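One possible reading of Definition 4 is sketched below: for a class i, count the dataset objects within a given radius of each medium point between class i and every other class, then average the counts. The interpretation, the function names, and the `radius` parameter are assumptions; the radius value used in the paper is not given here.

```python
import math


def medium_point(center_a, radius_a, center_b, radius_b):
    """Point on the segment a-b whose distances to a and b are in ratio radius_a : radius_b."""
    t = radius_a / (radius_a + radius_b)
    return tuple(a + t * (b - a) for a, b in zip(center_a, center_b))


def edge_density(dataset, centers, radii, i, radius):
    """Mean number of objects within `radius` of each medium point of class i.

    centers[j] and radii[j] describe class j; needs at least two classes.
    """
    counts = []
    for j in range(len(centers)):
        if j == i:
            continue
        m = medium_point(centers[i], radii[i], centers[j], radii[j])
        # Objects lying in the neighbourhood of this medium point.
        counts.append(sum(1 for p in dataset if math.dist(p, m) <= radius))
    return sum(counts) / len(counts)
```

A well-separated class should have few objects around its medium points, so a low edge density relative to the center density suggests a clean boundary.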
In this article, the value is