Similar literature
1.
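Genetic algorithms are introduced in an attempt to solve the clustering problem for massive, high-dimensional samples. The shortcomings of the two existing families of genetic-algorithm-based clustering, sample-based and attribute-value-based, are analyzed, and their algorithmic models are summarized. For fast multidimensional clustering, two genetic-algorithm-based clustering algorithms are proposed: a density method and a grid method. Tests show that the improved genetic-algorithm-based clustering methods can handle massive, high-dimensional samples.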

2.
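Traditional k-means clustering cannot handle massive data sets effectively. To address this, the paper proposes an accelerated k-means method based on parallel computing (K-means clustering based on parallel computing, Pk-means). The method first randomly partitions the massive sample set into several independent, identically distributed working sets and runs traditional k-means on each working set in parallel, obtaining the corresponding cluster centers and radii. It then measures the relationships among the clustering results of the different subsets, merges the sub-clusters found in each working set, and applies a second merging pass to special data points to correct the result. Experimental results show that, on large-scale data sets, Pk-means greatly improves clustering efficiency while preserving clustering quality.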
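The partition, cluster, and merge steps described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy merge rule, and the `merge_radius` parameter are assumptions made for illustration, the working sets are processed sequentially rather than in parallel, and the second correction pass for special data points is omitted.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(group):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(group) for c in zip(*group))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def merge_centers(centers, radius):
    """Greedy merge (assumed rule): join a sub-center to the first group
    whose centroid lies within `radius`, otherwise start a new group."""
    merged = []
    for c in centers:
        for group in merged:
            if dist2(c, mean(group)) <= radius ** 2:
                group.append(c)
                break
        else:
            merged.append([c])
    return [mean(g) for g in merged]


def partitioned_kmeans(points, k, n_sets, merge_radius, seed=0):
    """Shuffle, split into working sets, cluster each, merge sub-centers."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    working_sets = [shuffled[i::n_sets] for i in range(n_sets)]
    sub_centers = []
    for ws in working_sets:  # each call is independent of the others
        sub_centers.extend(kmeans(ws, k, seed=seed))
    return merge_centers(sub_centers, merge_radius)
```

Because each `kmeans` call touches only its own working set, the loop in `partitioned_kmeans` could be dispatched to separate processes or machines without changing the merge step.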

3.
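With the emergence of variant and polymorphic techniques, the amount of malicious code has grown explosively. However, only a small share of newly appearing malicious code is genuinely new; most of it consists of variants of known viruses. To filter variants of known viruses out of massive sample sets and thereby focus on new, unknown viruses, an improved method for determining the family to which a piece of malicious code belongs is proposed. Starting from the behavioral characteristics of malicious code, the method uses a disassembler to extract static features from samples, selects representative functions of the malicious code with a one-class support vector machine, and borrows ideas from clustering algorithms to build a virus family feature library. The family of a piece of malicious code is then determined by computing its similarity to the feature library. A system was designed and implemented, and experimental results show that the improved method can effectively analyze and classify variants of various families.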

4.
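The traditional K-means clustering algorithm has several shortcomings: it depends strongly on the choice of initial centers, it tends to produce locally rather than globally optimal clusterings, and it struggles to meet the demand for processing massive data. To address them, an improved K-means clustering algorithm based on MapReduce is proposed. The algorithm uses systematic sampling to obtain a representative sample set in place of the massive data set; applies the density method and the max-min distance method to obtain optimized initial cluster centers; uses the Canopy algorithm to obtain a coarse clustering that reduces the scale of computation; and finally parallelizes the algorithm by chaining MapReduce jobs sequentially, so that it can make full use of a cluster's computing and storage capacity and suit massive-data scenarios. A comparison with the traditional clustering algorithm shows that the improved algorithm performs better: it depends less on the initial cluster centers, improves clustering accuracy, needs fewer iterations and less clustering time, and shows a clear performance advantage on massive data.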
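As an illustration of the max-min distance step mentioned above, the following minimal sketch picks initial centers so that each new center is as far as possible from those already chosen. The function name is hypothetical, and starting from the first point is an assumption; the abstract indicates that a density method contributes to the choice, which is not modeled here.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers with the max-min distance rule.

    The first center is simply points[0] (an assumption for this sketch).
    Each further center is the point whose distance to its nearest
    already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers
```

For example, `max_min_centers([(0, 0), (1, 0), (10, 0), (10, 1), (0, 10)], 3)` spreads the three centers across the three visually distinct regions instead of drawing them at random.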

5.
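Liu Jinling. Computer Engineering (《计算机工程》), 2011, 37(1): 57-59, 62
A clustering method for massive Chinese short-message (SMS) texts based on semantic concepts is proposed. Starting from the SMS texts, the method uses the hierarchical category subject terms of the Modern Chinese Semantic Classification Dictionary (《现代汉语语义分类词典》) to extract concept tuples from the set of SMS text vectors, forming high-level concepts that represent the clustering result. The samples are then partitioned according to these high-level concepts, which completes the clustering process. Experimental results show that the algorithm produces good clusters and runs efficiently.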

6.
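To address the insufficient clustering accuracy and slow convergence of the K-means algorithm in big data environments, a K-means algorithm based on optimized sampling clustering (OSCK) is proposed. First, the algorithm draws multiple samples from the massive data by probability sampling. Second, based on the Euclidean-distance similarity principle for optimal cluster centers, it builds a model to evaluate the clustering results of the samples and discards suboptimal ones. Finally, it combines the retained clustering results by weighting to obtain the final k cluster centers, which are used as the cluster centers of the full data set. Theoretical analysis and experimental results show that, for massive data analysis, OSCK achieves better clustering accuracy than the algorithms it was compared against and is highly robust and scalable.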

7.
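Application of a combined hierarchical clustering and flat partitioning method to web page classification (total citations: 2, self-citations: 0, citations by others: 2)
This article studies the application of a combined hierarchical clustering and flat partitioning method to web page classification. It describes the characteristics and complexity of the sample feature distribution in web page classification. Hierarchical clustering produces nested, layered classes with high accuracy, but its computational complexity is high, which makes it unsuitable for large sample sets. The K-means algorithm is strongly affected by the choice of initial cluster centers and often clusters irregularly distributed samples poorly. The article therefore uses hierarchical clustering on a small number of samples to determine the initial cluster centers of the sample set, and then applies K-means to the whole set. This avoids the computational cost of hierarchical clustering while exploiting the speed of K-means, and the high accuracy of hierarchical clustering provides a reliable way to choose the initial centers. Experimental results on text classification are given for pure K-means and for the combined method.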

8.
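A kernel-based fast possibilistic clustering algorithm (total citations: 1, self-citations: 1, citations by others: 0)
Most traditional fast clustering algorithms are based on fuzzy C-means (FCM). However, FCM is sensitive to the initial cluster centers and to noisy data, and it easily converges to a local minimum, so its clustering accuracy is limited. Possibilistic C-means clustering largely resolves FCM's sensitivity to noise but tends to produce coincident clusters. Clustering algorithms that combine FCM with possibilistic C-means largely resolve the coincident-cluster problem. To further improve convergence speed and robustness, a kernel-based fast possibilistic clustering algorithm is proposed. The method introduces the idea of kernel clustering and uses the sample variance to optimize the parameter η in the objective function. Experimental results on standard and synthetic data sets show that the algorithm improves clustering accuracy and converges faster.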

9.
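Unsupervised clustering can effectively preserve the characteristics of data features. Based on this, a method is proposed that uses an unsupervised clustering algorithm to reduce the dimensionality of data samples. By encoding the classification results of several consecutive iterations according to the number of classes, conditions are obtained for quickly deciding when dimensionality reduction can begin and when clustering should end. Clustering then continues on the dimension-reduced data, so that feature extraction completes quickly. Experiments show that the method substantially improves both the dimensionality-reduction result and the execution speed of the clustering algorithm.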

10.
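Clustering algorithms take too long on massive, high-dimensional data. To address this, a sampling-based optimized clustering algorithm for multimodal distributions is proposed. The algorithm randomly draws a small number of samples and corrects the result iteratively, which reduces clustering time. The optimal configuration parameters of the algorithm were found through extensive experiments. The results show that the optimized algorithm achieves 88% clustering accuracy using 11.8% of the clustering running time, offering a clustering scheme suited to applications where running time is costly.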

11.
A large number of today’s botnets leverage the HTTP protocol to communicate with their botmasters or perpetrate malicious activities. In this paper, we present a new scalable system for network-level behavioral clustering of HTTP-based malware that aims to efficiently group newly collected malware samples into malware family clusters. The end goal is to obtain malware clusters that can aid the automatic generation of high-quality network signatures, which can in turn be used to detect botnet command-and-control (C&C) and other malware-generated communications at the network perimeter. We achieve scalability in our clustering system by simplifying the multi-step clustering process proposed in [31], and by leveraging incremental clustering algorithms that run efficiently on very large datasets. At the same time, we show that scalability is achieved while retaining a good trade-off between detection rate and false positives for the signatures derived from the obtained malware clusters. We implemented a proof-of-concept version of our new scalable malware clustering system and performed experiments with about 65,000 distinct malware samples. Results from our evaluation confirm the effectiveness of the proposed system and show that, compared to [31], our approach can reduce processing times from several hours to a few minutes, and scales well to large datasets containing tens of thousands of distinct malware samples.

12.
Malware classification based on call graph clustering (total citations: 1, self-citations: 0, citations by others: 1)
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, enabling the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs with one another, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.
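Because k-medoids needs only pairwise distances, it can run directly on a matrix of graph dissimilarities such as the one the abstract above describes. The following is a minimal sketch of the simple alternating variant; the function names are hypothetical, the distance matrix is assumed to be precomputed, and nothing here reproduces the paper's graph matching.

```python
def assign(dist, medoids):
    """Map each medoid to the indices of the items closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def best_center(dist, members):
    """Member with the smallest total distance to the other members."""
    return min(members, key=lambda c: sum(dist[c][j] for j in members))


def k_medoids(dist, initial_medoids, max_iters=20):
    """Alternate assignment and medoid update until the medoids stop changing.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a dict mapping each final medoid index to its member indices.
    """
    medoids = list(initial_medoids)
    for _ in range(max_iters):
        clusters = assign(dist, medoids)
        updated = [
            best_center(dist, members) if members else m
            for m, members in clusters.items()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(dist, medoids)
```

Unlike k-means, no averaging of samples is required, which matters here because there is no natural "mean" of several call graphs.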

13.
Clustering is an important problem in malware research, as the number of malicious samples that appear every day makes manual analysis impractical. Although these samples belong to a limited number of malware families, it is difficult to categorize them automatically as obfuscation is involved. By extracting relevant features we can apply clustering algorithms, then only analyze a couple of representatives from each cluster. However, classic clustering algorithms that compute the similarity between each pair of samples are slow when a large collection is involved. In this paper, the features will be strings of operation codes extracted from the binary code of each sample. With a modified suffix tree data structure we can find long enough substrings that correspond to portions of a program’s code. These substrings must be filtered against a database of known substrings so that common library code will be ignored. The items that have common substrings above a certain threshold will be grouped into the same cluster. Our algorithm was tested with data extracted from real-world malware and constructed quality clusters.
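The grouping rule above (samples that share a long enough run of opcodes land in the same cluster) can be illustrated with a small sketch. It deliberately swaps in a plain dynamic-programming longest-common-substring routine and pairwise comparison in place of the paper's modified suffix tree, so it shows the criterion but not the scalability; the filtering against known library code is also omitted. All names are hypothetical.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous opcode run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def group_by_shared_code(samples, min_len):
    """samples maps a sample name to its list of opcodes.

    Two samples are linked when they share a run of at least min_len
    opcodes; clusters are the connected components (union-find).
    """
    parent = {name: name for name in samples}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    names = list(samples)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if longest_common_run(samples[a], samples[b]) >= min_len:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())
```

The pairwise loop is exactly the quadratic cost the abstract warns about; a suffix-tree index over all samples avoids comparing every pair explicitly.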

14.
The explosive growth of malware variants poses a major threat to information security. Traditional anti-virus systems based on signatures fail to classify unknown malware into their corresponding families and to detect new kinds of malware programs. Therefore, we propose a machine learning based malware analysis system, which is composed of three modules: data processing, decision making, and new malware detection. The data processing module deals with gray-scale images, opcode n-grams, and import functions, which are employed to extract the features of the malware. The decision-making module uses the features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20,000 malware instances, which were collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify the unknown malware with a best accuracy of 98.9%, and successfully detects 86.7% of the new malware.
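One of the features named above, opcode n-grams, is simple to compute once a sample has been disassembled. The sketch below assumes the opcode sequence is already available as a list of mnemonic strings; the function name is hypothetical, and how the authors weight or select n-grams is not stated in the abstract.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Count every contiguous window of n opcodes in the sequence."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )
```

The resulting counts can be turned into a fixed-length feature vector by choosing a vocabulary of n-grams and reading off their frequencies for each sample.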

15.
An abstraction resilient to common malware obfuscation techniques is the call-graph. A call-graph is the representation of an executable file as a directed graph with labeled vertices, where the vertices correspond to functions and the edges to function calls. Unfortunately, most of the interesting graph comparison problems, including full-graph comparison and computing the largest common subgraph, belong to the NP-hard class. This makes the study and use of graphs in large scale systems difficult. Existing work has focused only on offline clustering and has not addressed the issue of clustering streams of graphs. In this paper we present Classy, a scalable distributed system that clusters streams of large call-graphs for purposes including automated malware classification and facilitating malware analysts. Since algorithms aimed at clustering sets are not suitable for clustering streams of objects, we propose the use of a clustering algorithm that relies on the notion of candidate clusters and reference samples therein. We demonstrate via thorough experimentation that this approach yields results very close to the offline optimal. Graph similarity is determined by computing a graph edit distance (GED) of pairs of graphs using an adapted version of simulated annealing. Furthermore, we present a novel lower bound for the GED. We also study the problem of approximating statistics of clusters of graphs when the distances of only a fraction of all possible pairs have been computed. Finally, we present results and statistics from a real production-side system that has clustered and contains more than 0.8 million graphs.
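The idea of clustering a stream by comparing each arriving graph only with one reference sample per cluster can be sketched as below. This is a loose illustration, not Classy's algorithm: the function names, the single-threshold rule, and the toy `edge_set_distance` (size of the symmetric difference of edge sets, standing in for a real graph edit distance) are all assumptions.

```python
def cluster_stream(items, distance, threshold):
    """Assign each arriving item to the closest cluster whose reference
    sample is within `threshold`, or open a new cluster for it.

    Only one distance per existing cluster is computed for each item.
    """
    clusters = []  # each entry: {"reference": item, "members": [items]}
    for item in items:
        best, best_d = None, threshold
        for cluster in clusters:
            d = distance(item, cluster["reference"])
            if d <= best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append({"reference": item, "members": [item]})
        else:
            best["members"].append(item)
    return clusters


def edge_set_distance(g1, g2):
    """Toy stand-in for graph edit distance: number of edges present in
    exactly one of the two graphs (graphs are frozensets of edges)."""
    return len(g1 ^ g2)
```

Since each item is compared with references only, the number of expensive distance computations grows with the number of clusters rather than with the number of graphs already seen.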

16.
Previous work has shown that cluster analysis can be used to effectively classify malware into meaningful families. In this research, we apply cluster analysis to the challenging problem of classifying previously unknown malware. We perform several experiments involving malware clustering. We compare our clustering results to those obtained when a support vector machine (SVM) is trained on the malware family. Using clustering, we are able to classify malware with an accuracy comparable to that of an SVM. An advantage of the clustering approach is that a new malware family can be classified before a model has been trained specifically for the family.

17.
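In current malware family detection, neither the local features nor the global features extracted from malware gray-scale images can describe malicious code comprehensively. To address this problem and improve detection efficiency, a malware detection method based on a perceptual hash algorithm and feature fusion is proposed. First, the perceptual hash algorithm is applied to malware gray-scale image samples to quickly separate samples that belong to a specific malware family from samples whose family is uncertain; experiments show that about 67% of malicious code can be detected by the perceptual hash algorithm. Then, for the samples of uncertain family, the local feature Local Binary Pattern (LBP) and the global feature Gist are extracted, and the fused features are fed to a machine learning algorithm to classify the samples. Finally, experiments on 25 malware families show that the fused LBP and Gist features give higher detection accuracy than either feature alone, and that the proposed method classifies more efficiently than detection based on machine learning alone, with detection speed improved by 93.5%.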
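The fast first stage described above can be illustrated with a small sketch of hash-based family matching. The abstract does not say which perceptual hash is used; the average hash (aHash) shown here is an assumed stand-in, and the function names, the `max_distance` threshold, and the assumption that images arrive already resized to a small fixed grid are all hypothetical.

```python
def average_hash(gray):
    """Average hash of a small gray-scale image given as rows of pixel
    values: one bit per pixel, set when the pixel exceeds the mean."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    return [1 if v > mean else 0 for v in pixels]


def hamming(h1, h2):
    """Number of positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def match_family(gray, family_hashes, max_distance):
    """Return the family whose stored hash is closest to the image's hash,
    or None when even the closest one is farther than max_distance.

    A None result corresponds to the "uncertain family" samples that the
    abstract passes on to the slower feature-fusion classifier.
    """
    h = average_hash(gray)
    name, d = min(
        ((n, hamming(h, fh)) for n, fh in family_hashes.items()),
        key=lambda pair: pair[1],
    )
    return name if d <= max_distance else None
```

Comparing short bit strings is far cheaper than extracting LBP and Gist features, which is consistent with the speed-up the abstract reports for samples resolved in this stage.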

18.
A huge number of botnet malware variants can be downloaded by zombie personal computers as secondary injections and upgrades, as directed by their botmasters, to perform different distributed or coordinated cyber attacks such as phishing, spam e-mail, malicious Web sites, ransomware, and DDoS. In order to respond faster to new threats and to better understand botnet activities, grouping malware by malicious behavior has become extremely important. This paper presents a spatio-temporal malware clustering algorithm based on (weekly-hourly-country) features. The dataset contains more than 32 million malware download logs from 100 honeypots set up by the Malware Investigation Task Force (MITF) of Internet Initiative Japan Inc. (IIJ) from 2011 to 2012. The Top-20 malware clustering results coincidentally correspond to Conficker.B and Conficker.C, with relatively high precision and recall rates of up to 100.0% and 88.9%, and 91.7% and 100.0%, respectively. On the other hand, the resulting two clusters of Top-20 countries are comparable to those with high and low growth rates reported in 2015 by Asghari et al. Therefore, our approach can be validated and evaluated to yield precision and recall of up to 75.0% and 86.7%, respectively.

19.
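A semantics-based method for extracting and detecting malware behavioral features (total citations: 5, self-citations: 0, citations by others: 5)
Wang Rui, Feng Dengguo, Yang Yi, Su Purui. Journal of Software (《软件学报》), 2012, 23(2): 378-393
A semantics-based method for extracting and detecting malware behavioral features is proposed. By combining instruction-level taint propagation analysis with behavior-level semantic analysis, the method extracts the key behaviors of malicious code and the dependencies among them. An anti-obfuscation engine then identifies semantically irrelevant and semantically equivalent behaviors, yielding behavioral features with a degree of resistance to interference. On this basis, a prototype system for feature extraction and detection was implemented and validated experimentally by analyzing and detecting a number of malware samples. The results show that the features extracted by this method resist interference well and that detection based on them identifies malicious code effectively.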

20.
The sheer volume of new malware samples presents some big data challenges for antivirus vendors. Not only does the metadata for tens (or even hundreds) of millions of samples need to be stored, but all this data also needs to be clustered, that is, mined to find groups of related samples. Existing techniques cannot easily scale to the magnitudes of samples already arriving today, let alone those that we expect to receive in the future. This paper proposes the use of a data structure called an aggregation overlay graph to simplify these problems. By exploiting the similarities shared between most malware variants, we can reduce the total volume of metadata by more than an order of magnitude without any loss of information. Furthermore, by including a wide variety of features from each sample, this process of reduction also creates groups of similar samples, a clustering technique that is capable of handling extremely high volumes. The versatility of this approach is demonstrated by applying it not only to large corpora of Windows PE metadata, but also to Android APK files.
