Similar Documents
18 similar documents found (search time: 187 ms)
1.
Grid-based clustering algorithms can efficiently process massive low-dimensional data. For higher-dimensional data sets, however, the number of generated cells becomes excessive and the algorithm's efficiency drops. The CD-Tree is an index structure that stores only non-empty cells. A new grid-based clustering algorithm is designed on top of the CD-Tree, exploiting its advantages to improve the efficiency of traditional grid-based clustering. In addition, since the algorithm only needs to visit dense cells during clustering, an optimization strategy is designed that prunes non-dense cells before clustering, further improving efficiency. Experiments show that, compared with traditional clustering algorithms, the CD-Tree-based clustering algorithm scales better.
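The key idea of materializing only non-empty cells can be illustrated with a plain hash map standing in for the CD-Tree index. This is a minimal sketch under assumed parameters (cell width, toy points), not the paper's actual CD-Tree structure:

```python
from collections import defaultdict

def build_sparse_grid(points, cell_width):
    """Map each point to its grid cell; only non-empty cells are materialized."""
    grid = defaultdict(list)          # cell coordinates -> points in that cell
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        grid[key].append(p)
    return grid

# Example: 2-D points, cell width 1.0; cells never touched by a point
# simply do not exist in the dictionary, mimicking a non-empty-cell index.
points = [(0.2, 0.3), (0.4, 0.1), (5.7, 5.9), (5.8, 6.1)]
grid = build_sparse_grid(points, cell_width=1.0)
print(len(grid), "non-empty cells out of a potentially huge uniform grid")
```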

2.
An Optimized Grid-Based Clustering Algorithm   Cited by 5 (0 self-citations, 5 by others)
Clustering is an important research topic in data mining. Compared with other approaches, grid-based clustering algorithms can efficiently process massive low-dimensional data. However, because the number of partitioned cells grows exponentially with the data dimensionality, the number of cells generated for higher-dimensional data sets becomes excessive and the algorithm's efficiency drops. This paper designs a new grid-based clustering algorithm on top of the CD-Tree whose efficiency is far higher than that of traditional grid-based clustering. In addition, a pruning optimization strategy is designed to further improve efficiency. Experiments show that, compared with traditional clustering algorithms, the CD-Tree-based algorithm scales significantly better with both the size and the dimensionality of the data set.
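Building on the sparse grid above, the pruning step and the grid clustering itself can be sketched as a pre-filter on dense cells followed by a flood fill over adjacent dense cells. The density threshold and the adjacency rule (neighbouring cells differ by at most one index per dimension) are illustrative assumptions, not the paper's exact procedure:

```python
from collections import deque
from itertools import product

def cluster_dense_cells(grid, min_pts):
    """Prune non-dense cells, then label groups of adjacent dense cells as clusters."""
    dense = {key for key, pts in grid.items() if len(pts) >= min_pts}  # pruning step
    labels, next_label = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:                                   # flood fill over dense neighbours
            cell = queue.popleft()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels  # cell -> cluster id; points inherit the label of their cell
```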

3.
A Partition-Based Outlier Detection Algorithm   Cited by 7 (0 self-citations, 7 by others)
Outliers are data objects that do not conform to the general characteristics of the data. Partition-based methods divide the space occupied by the data set into a collection of disjoint hyper-rectangular cells, map data objects into the cells, and then discover outliers from the per-cell statistics. Because most real data sets are highly skewed, the partition produces a large number of empty cells that hurt the algorithm's performance. To address this, a new index structure, the CD-Tree (cell dimension tree), is proposed to index only the non-empty cells. To optimize the CD-Tree structure and guide the partitioning of the data, the partition-based concept of skew of data (SOD) is introduced. A new outlier detection algorithm is designed on top of the CD-Tree and SOD. Experimental results show that, compared with the cell-based algorithm, the proposed algorithm achieves significant improvements in both efficiency and the number of dimensions it can handle effectively.
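A hedged sketch of the cell-based counting idea behind this family of detectors: a point is flagged when its own cell plus the immediately neighbouring cells hold fewer than M points. The one-cell neighbourhood radius and the threshold M are simplifying assumptions; the paper's CD-Tree and SOD machinery is not reproduced here:

```python
from collections import Counter
from itertools import product

def cell_based_outliers(points, cell_width, m_threshold):
    """Flag points whose cell neighbourhood (the cell and its adjacent cells)
    contains fewer than m_threshold points."""
    counts = Counter(tuple(int(x // cell_width) for x in p) for p in points)
    outliers = []
    for p in points:
        cell = tuple(int(x // cell_width) for x in p)
        neighbourhood = 0
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbourhood += counts.get(tuple(c + o for c, o in zip(cell, offset)), 0)
        if neighbourhood < m_threshold:
            outliers.append(p)
    return outliers

print(cell_based_outliers([(0, 0), (0.1, 0.2), (9.0, 9.0)], cell_width=1.0, m_threshold=2))
```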

4.
Analyzing huge and intricate data sets is a highly challenging task, and techniques for detecting anomalous values play a pivotal role in it. Capturing anomalies through clustering is one of the most commonly used families of increasingly popular anomaly detection techniques. This paper proposes an anomaly detection algorithm based on second-order proximity (SOPD), which consists of two stages: clustering and anomaly detection. In the clustering stage, a similarity matrix is obtained via second-order proximity; in the anomaly detection stage, based on the relationship between the points in a cluster and its center, the distance between every point in each generated cluster and that cluster's center is computed to capture anomalous states, and the density of each data point is taken into account to exclude cluster-boundary cases. The use of second-order proximity allows both the local and the global structure of the data to be considered, which reduces the number of clusters produced and increases the precision of anomaly detection. Extensive experiments comparing the algorithm with several classical anomaly detection algorithms show that SOPD performs well overall.
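One common reading of "second-order proximity" is that two points are similar when their first-order neighbourhood vectors are similar. The cosine-of-neighbourhood-rows construction and the distance-to-centroid score below are assumptions chosen for illustration, not SOPD's exact formulation:

```python
import numpy as np

def second_order_similarity(X, k=5):
    """Similarity of two points = cosine similarity of their k-NN indicator vectors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    nn = np.argsort(d, axis=1)[:, 1:k + 1]                       # k nearest neighbours
    A = np.zeros((len(X), len(X)))
    np.put_along_axis(A, nn, 1.0, axis=1)                        # first-order proximity
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return (A @ A.T) / (norms @ norms.T)                         # second-order proximity

def centroid_anomaly_scores(X, labels):
    """Score each point by its distance to the centroid of its assigned cluster."""
    scores = np.empty(len(X))
    for c in np.unique(labels):
        members = labels == c
        centroid = X[members].mean(axis=0)
        scores[members] = np.linalg.norm(X[members] - centroid, axis=1)
    return scores
```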

5.
To address uneven density distributions and noise points in data sets, where noise easily introduces large deviations when clustering samples, an improved multi-density SNN clustering algorithm under a grid framework is proposed. The data space is recursively partitioned into grids of different densities; high-density grid cells are taken as cluster centers, and the relative density difference between grids is used to detect the noise points contained in cluster-boundary grids. An improved SNN (shared nearest neighbor) clustering algorithm then computes the local density of the sample points inside the boundary grids and assigns the noise points to clusters according to the density distribution of the data, improving the robustness of the clustering algorithm. Experimental results on high-dimensional UCI data sets show that, compared with traditional algorithms, the proposed algorithm, which assigns samples to clusters via grid partitioning of the data space and local density peaks, effectively balances clustering quality and running time.
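The SNN ingredient applied to the boundary cells can be sketched independently of the grid machinery: SNN similarity counts how many members two points' k-nearest-neighbour lists share, and a simple SNN density sums those counts over a point's neighbours. The value of k and the brute-force neighbour search are illustrative assumptions:

```python
import numpy as np

def snn_density(X, k=5):
    """Shared-nearest-neighbour density: for each point, sum over its k-NN of the
    number of neighbours the two k-NN lists have in common."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = [set(np.argsort(row)[1:k + 1]) for row in d]            # k-NN index sets
    density = np.zeros(len(X))
    for i, neigh in enumerate(knn):
        density[i] = sum(len(neigh & knn[j]) for j in neigh)      # shared-neighbour counts
    return density
```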

6.
赵娇 《传感技术学报》2022,35(12):1686-1690
Massive high-dimensional sensor data is heavily disturbed by the network environment, which makes outlier detection difficult. An anomaly detection scheme for high-dimensional sensor data based on the BIRCH clustering algorithm is proposed. The first-order difference signal sequence of each node is derived, the signal vectors are transmitted to the gateway node via multi-hop routing, and sensor nodes with strong spatial correlation are grouped into the same cluster so that complete high-dimensional sensor data can be collected. Candidate split points for the feature attributes of the sensor data are obtained from preset split points, and the point with the largest information gain is selected as the best split point. Anomalies in the median of the sensor data sequence serve as the detection criterion; the clustering features (CF) and CF-tree of the BIRCH algorithm are used to compute node feature attributes, the clustering result is modeled as spherical clusters, and the anomalous data in the high-dimensional sensor data sequence is output. Simulation results show that the detection rate for anomalous nodes is above 95%, the false alarm rate is 0.35%, and anomaly detection takes less than 1.5 minutes.
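The clustering feature that BIRCH maintains in its CF-tree is the triple (N, LS, SS): point count, linear sum, and squared sum, from which the centroid and radius follow in closed form. The sketch below shows only that bookkeeping on toy data; the CF-tree insertion logic and the sensor-specific preprocessing of the paper are not reproduced:

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style summary of a set of points: (N, linear sum, squared sum)."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, np.zeros(dim), 0.0

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # sqrt of the mean squared distance of the summarized points to the centroid
        return np.sqrt(max(self.ss / self.n - float(self.centroid() @ self.centroid()), 0.0))

cf = ClusteringFeature(dim=2)
for p in [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]:
    cf.add(p)
print(cf.centroid(), cf.radius())
```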

7.
A Fast Density Clustering Algorithm for Location Big Data   Cited by 1 (0 self-citations, 1 by others)
Targeting the clustering of location big data, this paper proposes CBSCAN, a simple but efficient fast density clustering algorithm that quickly discovers arbitrarily shaped cluster patterns and noise in location big data. First, the concept of a Cell grid is defined and a Cell-based distance analysis theory is proposed; using this analysis, core points and density-connectivity relations in high-density regions can be determined quickly without distance computations. Second, a grid-cluster definition is given that maps density clusters of location points onto density clusters of grid cells; using the density relationship between an exclusive cell and its neighboring cells, the cells contained in a grid cluster can be determined quickly. Third, based on the Cell distance analysis and the grid-cluster concept, a fast density clustering algorithm is implemented that converts DBSCAN's point-based density-expansion clustering into Cell-based density-expansion clustering, greatly reducing distance computations in high-density regions and exploiting the intrinsic characteristics of location data to improve clustering efficiency. Finally, the clustering quality of the proposed algorithm is validated on benchmark data; experimental statistics on location big data show that CBSCAN is on average 525, 30, and 11 times more efficient than DBSCAN, DBSCAN optimized with a PR-Tree index, and DBSCAN optimized with a Grid index, respectively.
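The distance-free shortcut in cell-based DBSCAN variants rests on a simple geometric fact: if the cell side is eps/sqrt(d), any two points sharing a cell are within eps of each other, so a cell holding at least minPts points makes every point in it a core point without a single distance computation. The sketch below shows that one observation under those assumptions; it is not the full CBSCAN algorithm:

```python
import math
from collections import defaultdict

def core_points_without_distances(points, eps, min_pts):
    """Cells of side eps/sqrt(d) guarantee that points sharing a cell are within eps,
    so a cell with >= min_pts points consists entirely of core points."""
    dim = len(points[0])
    side = eps / math.sqrt(dim)
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // side) for x in p)].append(p)
    core = [p for pts in cells.values() if len(pts) >= min_pts for p in pts]
    return core, cells   # the sparse cells can feed a later density-expansion stage

pts = [(0.0, 0.0), (0.1, 0.1), (0.05, 0.02), (3.0, 3.0)]
core, _ = core_points_without_distances(pts, eps=0.5, min_pts=3)
print(core)
```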

8.
Traditional wavelet clustering labels connected cells that satisfy a density threshold as the same cluster, yet cells that do not satisfy the threshold may still contain data objects belonging to a cluster, and when the ranges of different attributes differ greatly, a uniform grid is no longer appropriate. To address this, an improved wavelet clustering algorithm, CWaveCluster, is proposed: it partitions a non-uniform grid, further refines the boundary cells, processes the cells that do not satisfy the density threshold, and finally forms the clusters. Experimental results on a given Quick Access Recorder (QAR) data set show that the improved wavelet clustering algorithm can partition the grid according to the characteristics of the data, distinguish the boundaries between clusters, and effectively solve the QAR outlier detection problem.
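The non-uniform grid idea can be sketched as per-dimension quantile bin edges, so that attributes with very different spreads still yield balanced cells. The choice of quantiles as the splitting rule is an assumption made for illustration and is not necessarily how CWaveCluster partitions its grid:

```python
import numpy as np

def nonuniform_grid_assign(X, bins_per_dim=8):
    """Assign each point to a cell whose edges are per-dimension quantiles,
    so dimensions with very different ranges still produce balanced cells."""
    X = np.asarray(X, dtype=float)
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins_per_dim + 1)[1:-1])
             for j in range(X.shape[1])]                 # interior quantile edges
    cell_ids = np.stack([np.searchsorted(edges[j], X[:, j])
                         for j in range(X.shape[1])], axis=1)
    return [tuple(int(v) for v in row) for row in cell_ids]   # per-point cell coordinates

X = np.array([[0.1, 100.0], [0.2, 5000.0], [0.15, 90.0], [0.9, 9000.0]])
print(nonuniform_grid_assign(X, bins_per_dim=2))
```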

9.
尹娜  张琳 《计算机科学》2017,44(5):116-119, 140
To raise the detection rate and lower the false alarm rate of anomaly detection systems and to address the problems of existing approaches, outlier mining is applied to anomaly detection and a hybrid-clustering-based anomaly detection method (NADHC) is proposed. The method combines distance-based and density-based clustering into a new hybrid clustering algorithm: the k-medoids algorithm finds the cluster centers, which allows the small number of well-hidden attack samples to be removed, and a sample-duplication scheme combined with density-based clustering then computes an anomaly degree from which anomalous behavior is judged. Finally, simulation experiments on the KDD CUP 99 data set verify the feasibility and effectiveness of the proposed algorithm.
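A hedged sketch of the hybrid idea: a distance-based term (distance to the nearest medoid) is combined with a density-based term (mean distance to the k nearest neighbours) into a single anomaly degree. The particular combination, the assumption that medoids are already available, and the toy data are all illustrative; NADHC's sample-duplication step is not reproduced:

```python
import numpy as np

def anomaly_degree(X, medoids, k=5):
    """Combine a distance-based term (distance to nearest medoid) with a
    density-based term (mean distance to the k nearest neighbours)."""
    X, medoids = np.asarray(X, float), np.asarray(medoids, float)
    d_medoid = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=-1).min(axis=1)
    d_all = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn_dist = np.sort(d_all, axis=1)[:, 1:k + 1].mean(axis=1)    # local sparsity
    return d_medoid * knn_dist      # large = far from every centre and locally sparse

X = np.vstack([np.random.default_rng(0).normal(0, 0.1, (50, 2)), [[3.0, 3.0]]])
print(anomaly_degree(X, medoids=[[0.0, 0.0]], k=5).argmax())   # index of the injected outlier
```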

10.
黄虹玮  葛笑天  陈烜松 《计算机应用》2017,37(11):3207-3211
A density clustering method based on the XCS learning classifier system is proposed for clustering two-dimensional data of arbitrary shape containing noise. The method, called DXCSc, consists of three main stages: 1) based on a learning classifier system, generate a rule population from the input data and compress the rules appropriately; 2) treat the generated rules as two-dimensional data points and cluster them following the idea of density clustering; 3) aggregate the density-clustered rule population appropriately to produce the final rule population. In the first stage, the learning classifier system framework generates the rule population and reduces it appropriately. The second stage assumes that the center of each rule cluster has a higher density than its neighboring rules and lies at a larger distance from rules of higher density. In the third stage, a graph partitioning method merges the relevant overlapping clusters. In the experiments, the proposed method is compared with K-means, affinity propagation (AP), Voting-XCSc, and other algorithms; the results show that it outperforms the compared algorithms in accuracy.
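The second stage's criterion (cluster centres have higher density than their neighbours and a large distance to any point of higher density) is the density-peak idea. A minimal sketch of the two quantities, local density rho and separation delta, is given below with an assumed cutoff distance; the XCS rule generation and the graph-partition merging are not shown:

```python
import numpy as np

def density_peaks(points, d_c):
    """For each point compute rho (neighbours within d_c) and delta (distance to the
    nearest point of higher density); cluster centres have both values large."""
    P = np.asarray(points, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    rho = (d < d_c).sum(axis=1) - 1                      # exclude the point itself
    delta = np.empty(len(P))
    for i in range(len(P)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta

rho, delta = density_peaks([(0, 0), (0.1, 0), (0, 0.1), (2, 2), (2.1, 2)], d_c=0.5)
print(rho, delta)
```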

11.
唐成龙  邢长征 《计算机应用》2012,32(8):2193-2197
To address the low mining efficiency and poor adaptability to large data sets of existing grid-based outlier mining algorithms, an outlier mining algorithm based on data partitioning and grids is proposed. The algorithm first partitions the data, filters out non-outliers at the granularity of cells, and caches the intermediate results; it then uses an improved dimension-cell tree structure to maintain the spatial information of the data points, filters non-outliers at the granularity of micro-cells, and applies two optimization strategies for efficient operation; finally, it mines outliers at the granularity of individual data points to obtain the set of outliers. Theoretical analysis and experimental results show that the method is effective and feasible and that it scales better to large and high-dimensional data sets.

12.
A Join Algorithm Based on Tertiary Storage   Cited by 2 (0 self-citations, 2 by others)
Join algorithms for massive relational databases on tertiary storage are studied. Among existing join algorithms for tape-resident data, hash-based algorithms are currently the best, but they do not account for the impact of tape positioning time on performance when data is read from tertiary storage. Random positioning of the tape head is time-consuming and is the key factor in the time complexity of data operations on tertiary storage. To address this, two new join algorithms for massive relational databases on tertiary storage are proposed: Disk-Based-Hash-Join and Tertiary-Only-Hash-Join. Both use disk buffering and clustered storage of hashed data to reduce the time spent on random tape-head positioning, improving the performance of tertiary-storage join algorithms. Theoretical analysis and experimental results show that the proposed algorithms outperform all existing algorithms of this kind and can be applied effectively in massive data management systems.
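The hash-join core that both proposed algorithms build on can be sketched in a few lines: build an in-memory hash table on the smaller relation and probe it with the larger one. The disk-buffering and tape-positioning optimizations that are the paper's actual contribution are only hinted at in comments; the relation layout and key extraction below are assumptions:

```python
from collections import defaultdict

def hash_join(build_rel, probe_rel, build_key, probe_key):
    """Classic build/probe hash join. In the tertiary-storage setting the build
    partitions would be staged on disk to avoid random tape-head positioning."""
    table = defaultdict(list)
    for row in build_rel:                 # build phase (smaller relation)
        table[row[build_key]].append(row)
    result = []
    for row in probe_rel:                 # probe phase (larger, e.g. tape-resident)
        for match in table.get(row[probe_key], ()):
            result.append(match + row)
    return result

orders = [(1, "disk"), (2, "tape")]              # (order_id, medium)
items = [(1, 10), (1, 20), (2, 5)]               # (order_id, qty)
print(hash_join(orders, items, build_key=0, probe_key=0))
```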

13.
In recent years, energy efficiency has become an important topic, especially in the field of ultra-dense networks (UDNs). In this area, cell-association bias adjustment and small cell on/off have been proposed to enhance the energy efficiency of UDNs by changing the cell association relationship and turning off extra small cells that have no users. However, the variety of cell association relationships and the switching on/off of the small cells may deteriorate some users' data rates, leading to nonconformance to the users' data rate requirement. Considering the discreteness and non-convexity of the energy efficiency optimization problem and the coupling between cell association and scheduling during the optimization process, it is difficult to achieve an optimal cell-association bias. In this study, we optimize the network energy efficiency by adjusting the cell-association bias of small cells while satisfying the users' data rate requirement. We propose an energy-efficient centralized Gibbs sampling based cell-association bias adjustment (CGSCA) algorithm. In CGSCA, global information such as channel state information, cell association information, and network load information needs to be collected. Then, considering the message-exchange overhead and the implementation complexity of CGSCA for obtaining this global information in UDNs, we propose an energy-efficient distributed Gibbs sampling based cell-association bias adjustment (DGSCA) algorithm with a lower message-exchange overhead and implementation complexity. Using DGSCA, we derive the update formulas for calculating the number of users in a cell and the users' SINR. We analyze the implementation complexities (e.g., computation complexity and communication complexity) of the two proposed algorithms and other existing algorithms. We perform simulations, and the results show that CGSCA and DGSCA converge faster and achieve higher energy efficiency and throughput gains than other existing algorithms. In addition, we analyze the importance of the users' data rate constraint in optimizing the energy efficiency, and we compare the energy efficiency of the different algorithms for different numbers of small cells. Finally, we present the number of sleeping small cells as the number of small cells increases.
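A hedged sketch of the Gibbs-sampling pattern used here: each small cell in turn resamples its association bias from a Boltzmann distribution over a candidate set, with a (negative) utility acting as the energy. The energy function, candidate biases, and temperature below are placeholders; the paper's actual utility, rate constraints, and message exchange are not modelled:

```python
import math
import random

def gibbs_bias_adjustment(energy, biases, candidates, temperature=1.0, sweeps=50):
    """One-cell-at-a-time Gibbs updates: resample each cell's bias with probability
    proportional to exp(-energy(biases)/T), keeping the other cells' biases fixed."""
    rng = random.Random(0)
    for _ in range(sweeps):
        for cell in range(len(biases)):
            weights = []
            for b in candidates:
                trial = biases[:cell] + [b] + biases[cell + 1:]
                weights.append(math.exp(-energy(trial) / temperature))
            biases[cell] = rng.choices(candidates, weights=weights)[0]
    return biases

# Placeholder energy: pretend the network efficiency peaks when every bias is 6 dB.
energy = lambda bs: sum((b - 6.0) ** 2 for b in bs)
print(gibbs_bias_adjustment(energy, biases=[0.0, 0.0, 0.0], candidates=[0.0, 3.0, 6.0, 9.0]))
```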

14.
Disk scheduling is an operating system process to service disk requests. It plays an important role in guaranteeing QoS in soft real-time environments such as video-on-demand and multimedia servers. To date, several disk scheduling algorithms have been proposed to schedule disk requests in an optimized manner. Most of these methods try to minimize the makespan by reducing the number of disk head seeks, one of the slowest operations in modern computers and crucial for system performance, since a seek usually takes several milliseconds. In this paper, we propose a new disk scheduling method based on a genetic algorithm that considers the makespan and the number of missed tasks simultaneously. In the proposed method, a new encoding scheme is presented that works with simple GA operators such as crossover and mutation, together with a penalty term in the fitness function. To get the best performance from the proposed method, its parameters, such as the number of chromosomes in the initial population and the mutation and crossover probabilities, were tuned by applying the method to some sample problems. The algorithm has been tested on several problems and its results were compared with well-known related methods. Experimental results show that the proposed method works very well and outperforms most related works in terms of miss ratio and average number of seeks.
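A minimal sketch of a GA for this kind of problem, assuming made-up requests, seek timing, and penalty weight rather than the paper's encoding: chromosomes are permutations of the pending requests, the fitness adds a penalty for every missed deadline, and order crossover plus swap mutation keep the permutations valid:

```python
import random

rng = random.Random(42)
requests = [(95, 8), (10, 5), (60, 12), (30, 6), (80, 20)]   # (track, deadline) - assumed data
HEAD, SEEK_TIME, PENALTY = 50, 0.1, 100.0                    # assumed constants

def cost(order):
    """Total seek distance plus a penalty for every request served past its deadline."""
    pos, seeks, missed = HEAD, 0, 0
    for i in order:
        track, deadline = requests[i]
        seeks += abs(track - pos)
        if seeks * SEEK_TIME > deadline:
            missed += 1
        pos = track
    return seeks + PENALTY * missed

def order_crossover(p1, p2):
    """OX crossover: copy a slice of p1, fill the rest in p2's order."""
    a, b = sorted(rng.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b + 1] = p1[a:b + 1]
    fill = [g for g in p2 if g not in child]
    for j in range(len(p1)):
        if child[j] is None:
            child[j] = fill.pop(0)
    return child

def genetic_schedule(pop_size=30, generations=100, mut_prob=0.2):
    pop = [rng.sample(range(len(requests)), len(requests)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        next_pop = pop[:2]                                   # elitism
        while len(next_pop) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)      # truncation selection
            child = order_crossover(p1, p2)
            if rng.random() < mut_prob:                      # swap mutation
                i, j = rng.sample(range(len(child)), 2)
                child[i], child[j] = child[j], child[i]
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=cost)

best = genetic_schedule()
print(best, cost(best))
```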

15.
赵峰  秦锋 《计算机工程》2009,35(19):78-80
The cell-based outlier detection algorithm is studied, and algorithms for partitioning the data space into cells and assigning data objects to them are given. To remedy the shortcomings of how the threshold M is set in that algorithm, an improved version is proposed and applied to the analysis of tax-paying behavior. Comparison with other outlier detection algorithms shows that the improved algorithm not only mines the outliers in tax-paying behavior effectively but also locates them, which aids the analysis of tax-paying behavior.

16.
Disk input/output (I/O) efficient query execution is an important topic with respect to DBMS performance. In this context, we elaborate on the construction of disk access plans for sort order queries in balanced and nested grid files. The key idea is to use the order information contained in the directory of the multiattribute search structure. The presented algorithms are shown to yield a significant decrease in the number of disk I/O operations by appropriate use of the order information. Two algorithms for the construction of appropriate disk access plans are proposed, namely a greedy approach and a heuristic divide-and-conquer approach. Both approaches yield considerable I/O savings compared to straightforward query processing without consideration of any directory order information. The former performs well for small buffer page allocations, i.e., for a small number of buffer pages relative to the number of data buckets processed in the query. The latter is superior to the greedy algorithm with respect to the total number of I/O operations and with respect to the overall maximum of buffer pages needed to achieve the minimal number of disk I/O operations. Both approaches rely on a binary trie as a temporary data structure. This trie is used as an explicit representation of the order information. The storage consumption of the temporary data structure is shown to be negligible in realistic cases. Even for pathological cases with respect to degenerate balanced and nested grid files, reasonable upper bounds can be given.

17.
The cellular manufacturing system (CMS) is considered an efficient production strategy for batch type production. The CMS relies on the principle of grouping machines into machine cells and grouping machined parts into part families on the basis of pertinent similarity measures. The bacteria foraging optimization (BFO) algorithm is a modern evolutionary computation technique derived from the social foraging behavior of Escherichia coli bacteria. Ever since Kevin M. Passino invented BFO, one of the main challenges has been applying the algorithm to problem areas other than those for which it was originally proposed. This paper investigates the first applications of this emerging novel optimization algorithm to the cell formation (CF) problem. For this purpose, a matrix-based bacteria foraging optimization algorithm with traced constraints handling (MBATCH) is developed. In this paper, an attempt is made to solve the cell formation problem while considering cell load variations and the number of exceptional elements. The BFO algorithm is used to create machine cells and part families. The performance of the proposed algorithm is compared with that of a number of algorithms that are most commonly used and reported in the corresponding scientific literature, such as K-means clustering, C-link clustering, and the genetic algorithm, using a well-known performance measure that combines cell load variations and the number of exceptional elements. The results favor the proposed algorithm.
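The chemotaxis core of bacteria foraging optimization (tumble to a random direction, then keep swimming while the cost improves) can be sketched on a generic cost function. The step size, swim length, and the toy objective below are assumptions, and the reproduction and elimination-dispersal phases as well as the cell-formation encoding of the paper are omitted:

```python
import math
import random

def chemotaxis(cost, position, step_size=0.1, swim_length=4, steps=100, seed=0):
    """BFO-style chemotaxis: tumble to a random unit direction, then swim in that
    direction as long as the cost keeps improving (up to swim_length moves)."""
    rng = random.Random(seed)
    best = cost(position)
    for _ in range(steps):
        direction = [rng.uniform(-1, 1) for _ in position]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        direction = [d / norm for d in direction]                      # tumble
        for _ in range(swim_length):                                   # swim while improving
            candidate = [x + step_size * d for x, d in zip(position, direction)]
            if cost(candidate) < best:
                position, best = candidate, cost(candidate)
            else:
                break
    return position, best

sphere = lambda x: sum(v * v for v in x)      # toy cost standing in for the CF objective
print(chemotaxis(sphere, position=[2.0, -3.0]))
```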

18.
A cellular manufacturing system (CMS) is considered an efficient production strategy for batch type production. A CMS relies on the principle of grouping machines into machine cells and grouping parts into part families on the basis of pertinent similarity measures. The bacteria foraging algorithm (BFA) is a newly developed computation technique derived from the social foraging behavior of Escherichia coli (E. coli) bacteria. Ever since Kevin M. Passino invented the BFA, one of the main challenges has been applying the algorithm to problem areas other than those for which it was proposed. This research work studies the first applications of this emerging novel optimization algorithm to the cell formation (CF) problem considering the operation sequence. In addition, a newly developed BFA-based optimization algorithm for CF based on operation sequences is discussed. In this paper, an attempt is made to solve the CF problem while taking into consideration the number of voids in the cells and the number of inter-cell travels based on the operation sequences of the parts visited by the machines. The BFA is used to create machine cells and part families. The performance of the proposed algorithm is compared with that of a number of algorithms that are most commonly used and reported in the corresponding scientific literature, such as the CASE clustering algorithm for sequence data, the ACCORD bicriterion clustering algorithm, and modified ART1, using defined performance measures known as group technology efficiency and bond efficiency. The results show the better performance of the proposed algorithm.
