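A Multi-record Webpage Attribute Extraction Method Combining Active Learning (结合主动学习的多记录网页属性抽取方法)
Cite this article: WEI Jingjing, LIAO Xiangwen, CHEN Qiaoling, MA Feixiang, CHEN Guolong. A Multi-record Webpage Attribute Extraction Method Combining Active Learning [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2016, 29(8): 673-681.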
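Authors: WEI Jingjing  LIAO Xiangwen  CHEN Qiaoling  MA Feixiang  CHEN Guolong
Affiliation: 1. College of Physics and Information Engineering, Fuzhou University, Fuzhou 350116
2. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108
3. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116
4. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116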
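Funding: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 61300105), the Joint Project of the Doctoral Program Foundation of the Ministry of Education (No. 2012351410010), the Fujian Provincial Major Science and Technology Project (No. 2013H6012), and the Fuzhou Science and Technology Plan Projects (No. 2013-PT-45, 2012-G-113).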
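Abstract (translated from the Chinese): Attribute extraction can be divided into two processes: alignment and semantic annotation. In existing alignment methods, some attributes that share the same tag but differ in semantics are wrongly assigned to the same group, and improving the accuracy of semantic annotation usually requires a large manually annotated training set. To address this, the paper proposes a multi-record webpage attribute extraction method combining active learning. To tackle the misgrouping of attributes, the shallow semantics of attributes are introduced to reduce the influence of identical tags with inconsistent semantics. In the semantic annotation stage, an active-learning-based SVM classification method, built on the textual, visual, and global features of the webpage, is used to obtain structured data with semantics. For the active learning strategy, overall sample information is introduced to construct a strategy based on an uncertainty measure, which selects for annotation the samples whose semantic classification is predicted unreliably. Experiments on multiple datasets, including forums and microblogs, show that the proposed method achieves better extraction results than existing methods.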

Keywords: Attribute Extraction    Semantic Classification    Active Learning
Received: 2015-02-02

A Multi-record Webpage Attribute Extraction Method Combining Active Learning
WEI Jingjing, LIAO Xiangwen, CHEN Qiaoling, MA Feixiang, CHEN Guolong. A Multi-record Webpage Attribute Extraction Method Combining Active Learning [J]. Pattern Recognition and Artificial Intelligence, 2016, 29(8): 673-681.
Authors:WEI Jingjing  LIAO Xiangwen  CHEN Qiaoling  MA Feixiang  CHEN Guolong
Affiliation:1.College of Physics and Information Engineering, Fuzhou University, Fuzhou 350116
2.College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108
3.College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116
4.Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116
Abstract:The attribute extraction process can be separated into two phases: alignment and semantic annotation. In existing alignment methods, attributes that share the same tag but differ in semantics are mistakenly aligned into the same group. Furthermore, improving the accuracy of semantic annotation often requires time-consuming manual annotation to construct the training set. To solve these problems, a multi-record webpage attribute extraction method combining active learning is presented. To address wrong attribute alignment, shallow semantics are integrated into the alignment approach to reduce the influence of identical tags with different semantics. In the semantic annotation phase, textual, visual and global features are extracted for semantic classification, and an active-learning-based SVM classifier is applied to extract structured data. Moreover, a new sample selection strategy is proposed by introducing global sample information, so that the more informative samples with lower confidence are selected to be labeled. The experimental results on BBS and microblog datasets confirm the superiority of the proposed method.
Keywords:Attribute Extraction  Semantic Classification  Active Learning  
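
The uncertainty-based sample selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain margin-based uncertainty sampling, in which the unlabeled samples whose SVM decision scores lie closest to zero are sent for manual labeling. This is a generic technique, not the paper's strategy: the abstract does not specify how global sample information enters the measure, so that part is not reproduced here. The function name `select_uncertain` and the score values are hypothetical.

```python
from typing import List, Sequence


def select_uncertain(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k samples with the smallest absolute score.

    `scores` are assumed to be signed distances to an SVM decision boundary;
    a value near 0 means the classifier is least confident about that sample.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank sample indices by |score|, most uncertain first (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return ranked[:k]


# Hypothetical decision scores for five unlabeled samples.
scores = [2.1, -0.05, 0.8, -1.7, 0.3]
queried = select_uncertain(scores, k=2)
print(queried)  # indices of the two samples to label next
```

In an active learning loop, the selected samples would be labeled, added to the training set, and the classifier retrained before the next round of selection.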