Similar Documents
20 similar documents found; search took 156 ms.
1.
This paper combines word embedding techniques with traditional statistical measures to propose a new unsupervised method for new word detection. The method uses traditional statistics to obtain candidate new words, then trains word embeddings with several strategies, builds a set of weakly-formed word strings from the embeddings, and uses this set to filter the candidates from two sides: their internal composition and their external context. In addition, the authors manually annotated a word-segmentation corpus of 10,000 microblog posts as a development corpus, used to analyze the traditional statistics and tune thresholds. The training corpus of the NLPCC2015 Chinese word segmentation shared task for microblog text served as the final test corpus. Experiments show that the method improves the F-score of bigram new word detection by 6.75% over the baseline system, and by 4.9% over Overlap Variety, one of the best current new word detection methods. On the test corpus, the final F-score for bigram and trigram new word detection reaches 56.2%.
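The abstract publishes no code; the minimal Python sketch below only illustrates the two stages it describes, under stated assumptions: plain PMI stands in for the paper's "traditional statistics", and `filter_by_weak_strings`, the thresholds, and the `weak_set` of weakly-formed word strings are illustrative names, not the authors' implementation.

```python
# Minimal sketch: PMI candidate extraction, then an embedding-based filter.
import math
from collections import Counter

import numpy as np

def pmi_candidates(corpus, min_count=5, threshold=3.0):
    """Score adjacent-token bigrams by pointwise mutual information."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in corpus:                      # corpus: list of token lists
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)
    out = []
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue
        pmi = math.log((n / total) / ((unigrams[a] / total) * (unigrams[b] / total)))
        if pmi > threshold:
            out.append((a + b, pmi))
    return sorted(out, key=lambda x: -x[1])

def filter_by_weak_strings(cands, vectors, weak_set, sim_cut=0.5):
    """Drop candidates whose embedding is close to a weakly-formed string."""
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
    kept = []
    for word, score in cands:
        if word not in vectors:              # vectors: dict word -> np.ndarray
            kept.append((word, score))
            continue
        sims = [cos(vectors[word], vectors[w]) for w in weak_set if w in vectors]
        if not sims or max(sims) < sim_cut:
            kept.append((word, score))
    return kept
```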

2.
Applying the Co-training Machine Learning Method to Chinese Chunking (cited 6 times: 0 self-citations, 6 by others)
Chinese chunking is implemented with the semi-supervised machine learning method co-training. The paper first gives a clear definition of Chinese chunks and a formal definition of the co-training algorithm. It proposes an agreement-based co-training selection method that combines a transductive hidden Markov model (Transductive HMM) and a transformation-rule-based classifier (fnTBL) into one classification system, and compares it with self-training. Chinese chunking experiments on a small Chinese treebank corpus plus a large unlabeled Chinese corpus outperform using only the small treebank corpus, reaching F-scores of 85.34% and 83.41%, improvements of 2.13% and 7.21% respectively.
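As a rough illustration of agreement-based co-training, the sketch below runs two scikit-learn classifiers over two feature views and adds only confidently agreed-upon unlabeled examples to the training pool; the two classifiers are stand-ins for the paper's Transductive HMM and fnTBL, and the round count and pool size `k` are assumptions.

```python
# Sketch of one agreement-based co-training run over two feature views.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def cotrain(Xa, Xb, y, Xa_u, Xb_u, rounds=5, k=20):
    """Xa/Xb: two views of the labeled data; Xa_u/Xb_u: unlabeled views."""
    clf_a, clf_b = GaussianNB(), LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf_a.fit(Xa, y)
        clf_b.fit(Xb, y)
        if len(Xa_u) == 0:
            break
        pa, pb = clf_a.predict(Xa_u), clf_b.predict(Xb_u)
        conf = clf_a.predict_proba(Xa_u).max(1) * clf_b.predict_proba(Xb_u).max(1)
        agree = np.where(pa == pb)[0]               # agreement-based selection
        pick = agree[np.argsort(-conf[agree])[:k]]  # most confident agreements
        Xa = np.vstack([Xa, Xa_u[pick]])
        Xb = np.vstack([Xb, Xb_u[pick]])
        y = np.concatenate([y, pa[pick]])
        keep = np.setdiff1d(np.arange(len(Xa_u)), pick)
        Xa_u, Xb_u = Xa_u[keep], Xb_u[keep]
    clf_a.fit(Xa, y)                                # final refit on grown pool
    clf_b.fit(Xb, y)
    return clf_a, clf_b
```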

3.
Emotion classification is one of the important research problems in natural language processing. It aims to automatically classify the emotions expressed in text, a basic task of sentiment analysis. However, existing studies assume that the number of samples in each emotion class is balanced, which does not match reality. This paper targets emotion classification on imbalanced data. Specifically, it proposes a multi-channel LSTM neural network approach to the imbalanced emotion classification problem: first, undersampling produces multiple balanced training corpora; next, an LSTM model is trained on each corpus; finally, the LSTM models are fused to obtain the final classification. Experimental results show that this method clearly outperforms traditional imbalanced-classification methods.
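A minimal sketch of the undersample-then-fuse strategy follows, with scikit-learn LogisticRegression standing in for the paper's per-channel LSTM models; the channel count and fusion by probability averaging are assumptions.

```python
# Sketch: build balanced subsamples, train one model per channel, fuse by vote.
import numpy as np
from sklearn.linear_model import LogisticRegression

def undersample_ensemble(X, y, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    models = []
    for _ in range(n_models):
        idx = np.concatenate([rng.choice(np.where(y == c)[0], n_min, replace=False)
                              for c in classes])  # one balanced subsample per model
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def fuse(models, X):
    """Fuse the channels by averaging class probabilities."""
    proba = np.mean([m.predict_proba(X) for m in models], axis=0)
    return models[0].classes_[proba.argmax(axis=1)]
```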

4.
This paper investigates unsupervised Chinese word segmentation, which is valuable for building robust, language-independent segmentation systems. Mutual information and the HDP (Hierarchical Dirichlet Process) are commonly used models in the unsupervised setting; this paper combines the two and improves the sampling algorithm. Ignoring punctuation, the F-scores on two test corpora of different sizes are 0.693 and 0.741, improvements of 5.8% and 3.9% over the HDP baseline. The model is also applied to semi-supervised segmentation, where the results beat the commonly used supervised CRF by 2.6%.
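The HDP sampler is beyond a short example, but the mutual-information half of the model can be sketched directly: score each adjacent character pair and cut where MI is low. The threshold value is an illustrative assumption.

```python
# Sketch: mutual-information boundary scoring for unsupervised segmentation.
import math
from collections import Counter

def boundary_cuts(text, sep_threshold=1.0):
    """Low MI between neighboring characters suggests a word boundary."""
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    n = len(text)
    cuts = []
    for i in range(len(text) - 1):
        a, b = text[i], text[i + 1]
        mi = math.log((bi[(a, b)] / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if mi < sep_threshold:
            cuts.append(i + 1)   # segment between positions i and i+1
    return cuts
```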

5.
This paper proposes a method for extracting aligned sentences from Wikipedia comparable corpora. After downloading and processing the Chinese and English Wikipedia database dumps, a local Wikipedia corpus database is reconstructed. On this basis, lexical statistics are computed, a named-entity dictionary is built, and bilingual comparable texts are obtained through Wikipedia's own alignment mechanism. The authors then analyze the characteristics of the Wikipedia corpus during annotation, design a set of features accordingly, and adopt a three-way classification scheme of "aligned", "partially aligned", and "not aligned". Finally, an SVM classifier performs sentence alignment experiments on the Wikipedia corpus and on third-party parallel corpora. Experiments show that the classifier reaches 82% accuracy on aligned sentences for relatively well-formed comparable corpora and 92% on parallel corpora, demonstrating that the method is feasible and effective.
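A toy sketch of the three-way alignment classifier: two illustrative features (length ratio and dictionary-translation overlap) feed an SVM. The paper's actual feature set is richer; the tiny dictionary and sentence pairs here are invented for illustration.

```python
# Sketch: three-class sentence-alignment SVM on two toy features.
from sklearn.svm import SVC

ZH2EN = {"猫": {"cat"}, "坐": {"sit", "sits"}, "垫子": {"mat"}}  # toy dictionary

def features(zh_words, en_words):
    trans = set().union(*(ZH2EN.get(w, set()) for w in zh_words))
    overlap = len(trans & set(en_words)) / max(len(en_words), 1)
    return [len(zh_words) / max(len(en_words), 1), overlap]

pairs = [(["猫", "坐", "垫子"], ["the", "cat", "sits", "on", "the", "mat"], 0),
         (["猫", "坐"], ["the", "cat", "is", "hungry"], 1),
         (["垫子"], ["it", "rains", "today"], 2)]   # 0 aligned / 1 partial / 2 not
X = [features(z, e) for z, e, _ in pairs]
y = [lab for _, _, lab in pairs]
clf = SVC(kernel="rbf").fit(X, y)
```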

6.
This paper explores tree-kernel-based Chinese semantic role classification, focusing on how to obtain effective structured features. Starting from the minimal syntactic tree structure and drawing on the characteristics of semantic role classification, three further syntactic structures are defined, and a composite kernel combines the tree-kernel-based and feature-based methods. Results on the Chinese PropBank corpus show that the tree-kernel method alone achieves good results on Chinese semantic role classification, with a precision of 91.79%. Combined with the feature-based method, it further improves performance, reaching 94.28% precision and outperforming comparable systems.
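A composite kernel can be sketched as a normalized weighted sum of two Gram matrices passed to an SVM with a precomputed kernel. The production-count "tree kernel" below is a simple stand-in for a convolution tree kernel, and the weight `alpha` and toy data are assumptions.

```python
# Sketch: composite kernel = alpha * tree kernel + (1 - alpha) * feature kernel.
import numpy as np
from sklearn.svm import SVC

def composite_gram(P, F, alpha=0.5):
    """P[i]: production-count vector of tree i (dot products give a simple
    tree-fragment kernel); F[i]: flat feature vector of instance i."""
    Kt, Kf = P @ P.T, F @ F.T
    norm = lambda K: K / (np.sqrt(np.outer(np.diag(K), np.diag(K))) + 1e-9)
    return alpha * norm(Kt) + (1 - alpha) * norm(Kf)

# toy data: 4 instances, production counts P and flat features F
P = np.array([[2, 1, 0], [2, 0, 1], [0, 2, 1], [1, 1, 1]], float)
F = np.array([[1, 0], [1, 1], [0, 1], [1, 0]], float)
y = [0, 0, 1, 1]
clf = SVC(kernel="precomputed").fit(composite_gram(P, F), y)
```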

7.
Term extraction automatically extracts technical terms from unstructured text, and plays an important role in Chinese word segmentation, information extraction, and knowledge-base construction. Current term extraction methods rely heavily on word statistics, but terms in basic-education subjects have a pronounced long-tail distribution, so statistics-based methods struggle to extract the terms at the tail. Drawing on the characteristics of basic-education subjects, this paper proposes DRTE: a term extraction method that mines term definitions and term relations and combines word-formation rules with boundary detection. With junior and senior high school mathematics textbooks as the data source, experiments show that the method reaches an F1-score of 82.7%, a 40.8% improvement over current methods, enabling effective automatic term extraction in the Chinese basic-education domain.
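DRTE's rules are not given in the abstract, so the sketch below only illustrates the definition-pattern idea with a few hypothetical regular expressions over textbook-style definition cues such as "……叫做X".

```python
# Sketch: extract terms from textbook definition patterns (illustrative rules).
import re

DEF_PATTERNS = [
    re.compile(r"叫做([\u4e00-\u9fa5]{2,8})"),      # "... is called X"
    re.compile(r"称为([\u4e00-\u9fa5]{2,8})"),      # "... is referred to as X"
    re.compile(r"([\u4e00-\u9fa5]{2,8})的定义是"),  # "the definition of X is ..."
]

def extract_terms(sentences):
    terms = set()
    for s in sentences:
        for pat in DEF_PATTERNS:
            terms.update(pat.findall(s))
    return terms

print(extract_terms(["两组对边分别平行的四边形叫做平行四边形。"]))
# -> {'平行四边形'}
```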

8.
By defining three metrics, category clustering density, category complexity, and category clarity, this paper studies, from the perspective of corpus information measurement, how several representative Chinese word segmentation methods affect text classification performance under the latent probabilistic topic model LDA, and analyzes quantitatively and qualitatively their suitability for corpora of different text types such as web pages and academic literature, together with the reasons behind the performance differences. The results show that the three metrics effectively indicate a segmentation method's influence on classification: IK Analyzer is affected mainly by category complexity and ICTCLAS mainly by category clustering density, while bigram segmentation is influenced roughly equally by all three metrics and therefore adapts well to different corpora. For academic-literature corpora, bigram segmentation classifies best, with F1-scores all above 80%; web-page corpora tolerate all the segmentation methods well. This paper attempts to select the segmentation method that best improves a corpus's classification performance by measuring the corpus's information properties rather than by pure experimentation, providing a reference for choosing Chinese segmentation methods for different text types in LDA-based classification systems.
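For concreteness, character-bigram segmentation, the simplest of the compared methods, can be written in a few lines:

```python
# Sketch: character-bigram "segmentation", compared in the paper against
# dictionary-based segmenters such as IK Analyzer and ICTCLAS.
def bigram_tokens(text):
    chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']  # keep Han characters
    return [a + b for a, b in zip(chars, chars[1:])]

print(bigram_tokens("主题模型分类"))  # ['主题', '题模', '模型', '型分', '分类']
```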

9.
With the overall development of the Internet, a large amount of Uyghur-language web content has been created, raising an urgent need for sentiment analysis of texts across different domains. Considering that Uyghur lacks sufficient sentiment-labeled training corpora and a complete sentiment lexicon, this paper combines the strengths of machine-learning and lexicon-based methods to build the classifier model LCUSCM (Lexicon-based and Corpus-based Uyghur Text Sentiment Classification Model). A self-built Uyghur sentiment lexicon first classifies the corpus with high quality, with the lexicon expanded recursively during classification; then, based on each sentence's sentiment score, part of the lexicon-classified corpus is selected to train a classifier that refines the first-stage results. This method improves accuracy by 9.13% over using machine learning alone and by 1.82% over the lexicon method.
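A minimal sketch of the lexicon-then-corpus bootstrap: lexicon scores produce pseudo-labels, and confidently scored sentences train a classifier. The toy English lexicon, documents, and confidence cutoff are stand-ins for the paper's Uyghur resources.

```python
# Sketch: lexicon scoring -> confident pseudo-labels -> supervised classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}  # toy stand-in

def lex_score(tokens):
    return sum(LEXICON.get(t, 0) for t in tokens)

docs = ["good great product", "awful bad service", "bad good unclear"]
scores = [lex_score(d.split()) for d in docs]
confident = [(d, 1 if s > 0 else 0)                 # 1 = positive, 0 = negative
             for d, s in zip(docs, scores) if abs(s) >= 2]

vec = TfidfVectorizer()
X = vec.fit_transform([d for d, _ in confident])
clf = LogisticRegression().fit(X, [y for _, y in confident])
```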

10.
"Undefined" utterances are widespread in task-oriented dialogue corpora. Their composition is complex and their boundary with the "defined" classes is blurred, which hurts the overall accuracy of utterance domain classification. Once an "undefined" utterance is misclassified, users start to doubt whether the spoken dialogue system actually works, greatly degrading the user experience. This paper proposes a domain classification scheme built around optimized detection of "undefined" utterances, completing the task in two stages. First, clustering groups the "defined" classes into a few super-classes, simplifying the boundary that the many individual "defined" classes would otherwise share with the "undefined" class; a classification model then separates the clustered "defined" super-classes from the "undefined" class, with "undefined" detection as the optimization target. Finally, the utterances that stage one labels "defined" are classified again, now largely free of "undefined" interference. The classification model is the deep learning model LSTM, with word vectors trained on unlabeled microblog data for utterance representation. Evaluation on the multi-task corpus of the SMP 2017 intent classification shared task shows clear gains in both the F1-score of "undefined" detection and the overall domain classification accuracy.
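The clustering step of stage one might look like the sketch below: cluster the centroids of the "defined" classes into a few super-classes and relabel the data accordingly. The super-class count and the use of KMeans are assumptions, and the LSTM classifier itself is omitted.

```python
# Sketch: collapse many "defined" classes into super-classes via their centroids.
import numpy as np
from sklearn.cluster import KMeans

def stage1_labels(X_defined, y_defined, n_super=3, seed=0):
    """Stage 1 then classifies {super-classes, undefined} instead of the many
    original classes vs. undefined."""
    classes = np.unique(y_defined)
    centroids = np.vstack([X_defined[y_defined == c].mean(axis=0) for c in classes])
    km = KMeans(n_clusters=n_super, n_init=10, random_state=seed).fit(centroids)
    to_super = dict(zip(classes, km.labels_))
    return np.array([to_super[c] for c in y_defined])
```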

11.
Tibetan personal name recognition is one of the difficult problems in Tibetan information processing. Its performance directly affects the accuracy of Tibetan word segmentation and of downstream systems, including Tibetan-Chinese translation, Tibetan information retrieval, and text classification. Based on an analysis of the formation rules and characteristics of Tibetan personal names, this paper proposes a Tibetan name recognition method that fuses a maximum entropy model with conditional random fields. Experiments show that the method achieves good recognition results, reaching an F-measure of 93.08% on our test set.
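The abstract does not specify the fusion mechanism; one common scheme is interpolating the two models' per-token label distributions, sketched here with plain NumPy (the weight `w` is an assumption).

```python
# Sketch: fuse a MaxEnt tagger and a CRF by interpolating label probabilities.
import numpy as np

def fuse(p_maxent, p_crf, w=0.5):
    """p_*: arrays of shape (n_tokens, n_labels) with label probabilities."""
    p = w * p_maxent + (1 - w) * p_crf
    return p.argmax(axis=1)

p_me = np.array([[0.7, 0.3], [0.4, 0.6]])
p_cf = np.array([[0.6, 0.4], [0.2, 0.8]])
print(fuse(p_me, p_cf))  # [0 1]
```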

12.
In this paper, we introduce a new class of fuzzy measures, referred to as q-measures, based on a modification of the definition of the well-known Sugeno λ-measure. Our proposed definition of the q-measures not only includes the λ-measure as a special case, but also preserves all desirable properties and avoids some of the limitations of the conventional λ-measure. The q-measure approach provides a more flexible and powerful method for constructing various fuzzy measures. We provide an iterative algorithm for constructing an interesting sequence of q-measures and analytically prove its convergence as a distinguishing characteristic of the proposed formulation.
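For reference, the standard Sugeno λ-measure that the q-measure modifies is defined as below; the paper's own q-measure formulas are not reproduced here.

```latex
% Sugeno lambda-measure: for disjoint A, B and a fixed \lambda > -1,
g_\lambda(A \cup B) = g_\lambda(A) + g_\lambda(B)
                      + \lambda\, g_\lambda(A)\, g_\lambda(B)
% On a finite set X = \{x_1,\dots,x_n\} with densities g_i = g_\lambda(\{x_i\}),
% the normalization g_\lambda(X) = 1 fixes \lambda as the unique root of
\lambda + 1 = \prod_{i=1}^{n} \bigl(1 + \lambda g_i\bigr)
```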

13.

Since the implementation of the EU General Data Protection Regulation ("GDPR") and similar legislation on personal data protection in Taiwan, enterprises must provide adequate protection for their customers' personal data. Many enterprises use automated personally identifiable information ("PII") scanning systems to process PII and ensure full compliance with the law. However, personal data saved in non-electronic form cannot be detected by these automated scanning systems, so such PII cannot be accurately identified. We propose a random forest ("RF") approach to detect this otherwise unidentified PII and close the loophole. Relevant peripheral information attributes of PII are identified and used in our study for machine learning and modeling, establishing a model that detects PII which automated scanners cannot. Our study shows that the F1-measure of our proposed model reaches at least 90%, a higher accuracy rate than that of automated scanners in detecting PII in an enterprise's inventory of information assets. Finally, our experimental results show that, compared with manual PII detection, the proposed model shortens detection time by a factor of 100 and increases the F1-measure by 2%.
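A minimal sketch of the RF stage on synthetic data follows; the four "peripheral attribute" columns and the label rule are invented placeholders, not the study's actual feature set.

```python
# Sketch: random forest over peripheral attributes to flag likely-PII items.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# columns: [owner_is_hr_dept, file_age_days, name_like_tokens, id_pattern_hits]
X = rng.random((500, 4))
y = (X[:, 2] + X[:, 3] > 1.0).astype(int)  # synthetic stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, rf.predict(X_te)))
```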


14.
文敏  王荣存  姜淑娟 《计算机应用》2022,42(6):1814-1821
The root of software security lies in the source code written by developers, but as software grows in scale and complexity, manual vulnerability detection alone is costly and hard to scale, while existing code-analysis tools suffer high false-positive and false-negative rates. This paper therefore proposes an automated vulnerability detection method based on a relational graph convolutional network (RGCN) to further improve detection precision. First, program source code is converted into a code property graph (CPG) containing syntactic and semantic feature information; then an RGCN learns a representation of the graph structure; finally, a neural network model is trained to predict vulnerabilities in the source code. Experiments on real software vulnerability samples validate the method: its recall and F1-score reach 80.27% and 63.78% respectively. Compared with Flawfinder, VulDeePecker, and a comparable method based on graph convolutional networks (GCN), the F1-score improves by 182%, 12%, and 55% respectively, showing that the method effectively improves vulnerability detection.
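A compact sketch of an RGCN graph classifier over code property graphs, assuming PyTorch Geometric is available; the dimensions, the three edge relations (e.g., AST/CFG/PDG), and mean pooling are assumptions rather than the paper's exact architecture.

```python
# Sketch: two-layer RGCN + global pooling -> vulnerable / not vulnerable.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv, global_mean_pool

class VulnRGCN(torch.nn.Module):
    def __init__(self, in_dim=64, hidden=128, num_relations=3):
        super().__init__()
        # one relation type per CPG edge kind, e.g. AST / CFG / PDG
        self.conv1 = RGCNConv(in_dim, hidden, num_relations)
        self.conv2 = RGCNConv(hidden, hidden, num_relations)
        self.out = torch.nn.Linear(hidden, 2)

    def forward(self, x, edge_index, edge_type, batch):
        h = F.relu(self.conv1(x, edge_index, edge_type))
        h = F.relu(self.conv2(h, edge_index, edge_type))
        return self.out(global_mean_pool(h, batch))  # one logit pair per graph
```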

15.
Automated nucleus/cell detection is usually considered the basis and a critical prerequisite of computer-assisted pathology and microscopy image analysis. However, due to enormous variability (cell types, stains, and different microscopes) and data complexity (cell overlapping, inhomogeneous intensities, background clutter, and image artifacts), robust and accurate nucleus/cell detection is a difficult problem. To address this issue, we propose a novel multi-scale fully convolutional neural network approach that regresses a density map to robustly detect the nuclei of pathology and microscopy images. The procedure can be divided into three main stages. Initially, instead of working on a simple dot-label space, regression is performed on the proposed structured proximity space for patches, so that the centers of image patches are explicitly forced to produce larger values than their adjacent areas. Then, several multi-scale fully convolutional regression networks are developed for this task; these enlarge the receptive field and can detect not only single, small cells but also large and overlapping cells. In this stage, we copy the full feature maps from the contracting path and merge them with the feature maps of the expansive path, making full use of the networks' shallow and deep semantic information. The networks have no fully connected layers, a strategy that allows seamless probability-map prediction for arbitrarily large images. At the same time, data augmentations (e.g., small-range shifts, zooms, and random flips) are carefully used to enhance the robustness of detection. Finally, morphological operations and suitable filters are employed, and some prior information is introduced, to find the centers of the cells more robustly. Our method achieves about 99.25% detection precision with an F1-measure of 0.9924 on fluorescence microscopy cell images, about 85.90% precision with an F1-measure of 0.9020 on lymphocyte cell images, and about 78.41% precision with an F1-measure of 0.8440 on breast histopathology images. This detection performance equals, and sometimes exceeds, recently published leading approaches on the same benchmark datasets.
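The multi-scale, fully convolutional idea can be sketched as parallel branches with different receptive fields whose features are fused by a 1×1 convolution into a density map; the channel sizes and branch kernels below are illustrative, not the paper's architecture.

```python
# Sketch: two parallel conv branches (small vs. large receptive field) whose
# features are fused into a single-channel density map; no FC layers, so any
# input size is accepted.
import torch
import torch.nn as nn

class MultiScaleFCN(nn.Module):
    def __init__(self):
        super().__init__()
        def branch(k):  # padding = k // 2 keeps the spatial size unchanged
            return nn.Sequential(
                nn.Conv2d(3, 16, k, padding=k // 2), nn.ReLU(),
                nn.Conv2d(16, 16, k, padding=k // 2), nn.ReLU())
        self.b3, self.b7 = branch(3), branch(7)
        self.head = nn.Conv2d(32, 1, 1)  # 1x1 conv -> density map

    def forward(self, x):
        return self.head(torch.cat([self.b3(x), self.b7(x)], dim=1))

y = MultiScaleFCN()(torch.randn(1, 3, 96, 96))  # -> shape (1, 1, 96, 96)
```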

16.
To address the imbalanced-classification and "curse of dimensionality" problems in web spam detection, a binary-classifier algorithm based on random forests (RF) and an undersampling ensemble is proposed. First, undersampling splits the majority class of the training set into several subsets, each of which is merged with the minority-class samples to form multiple balanced training subsets; then a random forest classifier is trained on each subset; finally, the random forest classifiers classify the test set, with majority voting determining each test sample's final class. Experiments on the WEBSPAM UK-2006 dataset show that this ensemble outperforms the plain random forest algorithm and its Bagging and AdaBoost ensembles for web spam detection, improving accuracy, F1-measure, and area under the ROC curve (AUC) by at least 14%, 13%, and 11% respectively. Compared with the results of the winning team of the Web Spam Challenge 2007, the ensemble improves the F1-measure by at least 1% and achieves the best AUC.
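A minimal sketch of the undersampling + RF voting ensemble follows; the subset count and forest size are assumptions.

```python
# Sketch: split the majority class into balanced subsets, train one RF per
# subset, and combine predictions by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_undersample_ensemble(X, y, n_subsets=7, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]   # 1 = spam (minority)
    maj, mino = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    forests = []
    for _ in range(n_subsets):
        sub = rng.choice(maj, len(mino), replace=False)
        idx = np.concatenate([sub, mino])                 # balanced subset
        forests.append(RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx]))
    return forests

def majority_vote(forests, X):
    votes = np.stack([f.predict(X) for f in forests])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```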

17.
Wikipedia entity classification matters for natural language processing and machine learning. This paper applies machine learning to classify entries of the Chinese Wikipedia. On top of basic features drawn from the semi-structured information and unstructured text of Wikipedia pages, extended features and semantic features tailored to Chinese are used to improve entity classification. Experiments on a manually annotated corpus show that these additional features effectively improve entity classification under the ACE taxonomy, with an overall F1-score of 96%; good results are also obtained on an extended entity taxonomy, with an overall F1-score of 95%.
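A toy sketch of mixing semi-structured (infobox-like) features with text features before a linear SVM; the feature names, pages, and labels are invented for illustration.

```python
# Sketch: combine infobox-style dict features with TF-IDF text features.
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pages = [{"infobox": "person", "n_categories": 12},
         {"infobox": "settlement", "n_categories": 5}]
texts = ["born 1950 physicist", "city in northern province"]
y = ["PER", "GPE"]

X = hstack([DictVectorizer().fit_transform(pages),
            TfidfVectorizer().fit_transform(texts)])
clf = LinearSVC().fit(X, y)
```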

18.
Semantic data models are increasing in popularity and use, but they are also becoming increasingly complex and difficult to manage. In this paper we extend the definition of a semantic data model to give users the power to specify and manipulate views. Our model supports both high-level object-class views and high-level relationship-set views. To define these views, we extend the traditional method of view definition (through query formulation) to also include view definition by a dominant object class, by an independent object class, and by a relational object class. A set of operators to specify and manipulate views is also defined. These operators allow a user to create and destroy views, implode and explode views, and hide and expose semantic-model elements. We also provide algorithms to extract an instance graph for a high-level object in a semantic-model view and to transform a semantic-model view hierarchy into an equivalent atomic model. Implications and applications of views in our model are also discussed.

19.
Validation of overlapping clustering: A random clustering perspective (cited 1 time: 0 self-citations, 1 by others)
As a widely used clustering validation measure, the F-measure has received increased attention in the field of information retrieval. In this paper, we reveal that the F-measure can lead to biased views of overlapped clustering results when it is used to validate data with different cluster numbers (the incremental effect) or different prior probabilities of relevant documents (the prior-probability effect). We propose a new "IMplication Intensity" (IMI) measure, which is based on the F-measure and developed from a random clustering perspective. In addition, we carefully investigate the properties of IMI. Finally, experimental results on real-world data sets show that IMI significantly alleviates the biased incremental and prior-probability effects inherent to the F-measure.
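For reference, the F-measure under discussion is the harmonic mean of precision and recall (IMI's own definition is not given in the abstract):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
```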

20.
Software developers, testers and customers routinely submit issue reports to software issue trackers to record the problems they face when using software. The issues are then directed to appropriate experts for analysis and fixing. However, submitters often misclassify an improvement request as a bug and vice versa, which costs valuable developer time. Hence, automated classification of the submitted reports would be of great practical utility. In this paper, we analyze how machine learning techniques may be used to perform this task. We separately apply different classification algorithms, namely naive Bayes, linear discriminant analysis, k-nearest neighbors, support vector machines (SVM) with various kernels, decision trees and random forests, to classify the reports from three open-source projects. We evaluate their performance in terms of F-measure, average accuracy and weighted average F-measure. Our experiments show that random forests perform best, while SVMs with certain kernels also achieve high performance.
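A toy version of the comparison loop over scikit-learn classifiers on TF-IDF features follows; linear discriminant analysis is omitted (it needs dense input), and the four-report dataset is invented and evaluated in-sample only for brevity.

```python
# Sketch: compare several classifiers on bug-vs-improvement report text.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

reports = ["app crashes on save", "please add dark mode",
           "null pointer when login", "support export to csv"]
labels = [1, 0, 1, 0]  # 1 = bug, 0 = improvement request

X = TfidfVectorizer().fit_transform(reports)
for clf in [MultinomialNB(), KNeighborsClassifier(n_neighbors=1),
            SVC(kernel="linear"), RandomForestClassifier(random_state=0)]:
    pred = clf.fit(X, labels).predict(X)  # toy in-sample check only
    print(type(clf).__name__, f1_score(labels, pred, average="weighted"))
```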
