
Big data correlation mining algorithm based on factorial design
Cite this article:TANG Xiaochuan, LUO Liang. Big data correlation mining algorithm based on factorial design[J]. Journal of Computer Applications, 2018, 38(9): 2507-2510.
Authors:TANG Xiaochuan  LUO Liang
Affiliation:School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
Funding:National Natural Science Foundation of China (61602094).
Abstract:To address dimensionality reduction for high-dimensional big data, a feature selection algorithm based on statistical factorial design, named Full Factorial Design (FFD), is proposed. Firstly, the factor effects of a factorial design are used as the measure of correlation between features and the target variable in filter-based feature selection. Secondly, a divide-and-conquer algorithm is proposed to search for the optimal factorial design for a given input dataset. Thirdly, because traditional experimental design requires experiments to be run manually, a data-driven approach is proposed that automatically searches the input dataset for the response values of the factorial design. Finally, the factor effects are calculated from the design matrix and the average response values, and features and interactions are ranked by their factor effects to obtain the significant features and interactions. Experimental results show that the average classification error rate of FFD is 2.95 percentage points lower than that of Mutual Information Maximisation (MIM), 3.33 percentage points lower than that of Joint Mutual Information Maximisation (JMIM), and 6.62 percentage points lower than that of ReliefF. FFD can therefore effectively identify features and interactions that are correlated with the target variable in real-world datasets.
Keywords:big data  correlation  feature selection  interaction  factorial design
Received:2018-03-07
Revised:2018-03-27
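
The last two steps described in the abstract (collecting average response values from the data, then computing and ranking factor effects from the design matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-level full factorial design, features already coded as -1/+1, and a dataset in which every design run occurs at least once. The divide-and-conquer search for a suitable design is not shown. The function names `factor_effects` and `rank_effects` are hypothetical.

```python
from itertools import combinations, product

import numpy as np


def factor_effects(X, y, max_order=2):
    """Estimate main and interaction effects for a 2^k full factorial design.

    Assumptions (not from the paper): X has entries in {-1, +1},
    y is numeric, and every design run appears in X at least once.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]

    # Design matrix: every combination of the two levels for k factors.
    design = np.array(list(product([-1, 1], repeat=k)))

    # Data-driven responses: average y over the samples matching each run.
    responses = np.full(len(design), np.nan)
    for i, run in enumerate(design):
        mask = np.all(X == run, axis=1)
        if mask.any():
            responses[i] = y[mask].mean()
    if np.isnan(responses).any():
        raise ValueError("some design runs have no matching samples")

    effects = {}
    for order in range(1, max_order + 1):
        for cols in combinations(range(k), order):
            # Contrast column: product of the chosen factor columns.
            contrast = design[:, list(cols)].prod(axis=1)
            # Effect: mean response at +1 minus mean response at -1.
            high = responses[contrast == 1].mean()
            low = responses[contrast == -1].mean()
            effects[cols] = high - low
    return effects


def rank_effects(effects):
    """Sort features and interactions by absolute effect, largest first."""
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Keys of the returned dictionary are tuples of column indices, so `(0,)` is the main effect of the first feature and `(0, 1)` is its two-way interaction with the second. Raising `max_order` adds higher-order interactions at a cost that grows combinatorially with the number of features.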