Similar Documents
20 similar documents found (search time: 46 ms)
1.
To address the problem that crowdsourced labels still contain noise after label integration, a self-training-based label noise correction algorithm (STLNC) is proposed. STLNC works in three stages. In the first stage, a filter partitions the crowdsourced dataset with integrated labels into a noise set and a clean set. In the second stage, a weighted density-peak clustering algorithm builds the spatial structure of the dataset, in which low-density instances point to high-density instances. In the third stage, a noise-instance selection strategy is first designed based on the discovered spatial structure; an ensemble classifier trained on the clean set then corrects the selected noisy instances according to a designed instance-correction strategy, the corrected instances are added to the clean set, and the ensemble classifier is retrained. This selection-and-correction process repeats until every instance in the noise set has been corrected, after which the ensemble classifier from the final round corrects all instances. Experimental results on simulated standard datasets and real-world crowdsourced datasets show that STLNC outperforms five other state-of-the-art noise correction algorithms on both metrics considered: noise ratio and model quality.
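The iterative correction loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: a nearest-centroid classifier stands in for STLNC's ensemble classifier, and noisy instances are taken in list order rather than by the density-peak-based selection strategy; all function names are ours.

```python
# Minimal sketch of an STLNC-style correction loop (simplified).
# The nearest-centroid "classifier" and the in-order noise traversal
# are stand-ins for the paper's ensemble classifier, filter, and
# density-peak instance-selection strategy -- all assumptions.

def centroid_classifier(data):
    """Train a nearest-centroid classifier on (features, label) pairs."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[y])))
    return predict

def correct_noise(clean, noisy):
    """Iteratively relabel noisy instances and move them to the clean set,
    retraining the classifier after each correction."""
    clean, noisy = list(clean), list(noisy)
    while noisy:
        predict = centroid_classifier(clean)   # retrain on current clean set
        x, _ = noisy.pop(0)                    # paper: density-peak order
        clean.append((x, predict(x)))          # paper: correction strategy
    return clean
```

After the loop, every formerly noisy instance carries the label assigned by a classifier trained on the accumulated clean set, mirroring the grow-and-retrain structure of the third stage.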

2.
Label errors are hard to avoid when annotating real-world network traffic data, so the labels are inevitably contaminated by noise, i.e., a sample's observed label may differ from its true label. To reduce the negative impact of noisy labels on classification accuracy, two sources of noise are considered: correct label types that are mislabeled and label types that are misspelled. A network traffic classification method based on label noise correction is proposed that evaluates and repairs observed samples using clustering and weight partitioning. Experimental results on two network traffic datasets show that, compared with three label noise repair algorithms (STC, CC, and ADE), the proposed algorithm improves the final classification results under different noise ratios. On the NSL-KDD dataset, the average label repair rate is higher by about 23.00%, 7.58%, and 2.05%, respectively; on the MOORE dataset, by about 35.12%, 10.40%, and 4.71%, respectively, and the final classification model is more stable.

3.
Partial label learning is an important weakly supervised learning framework. In partial label learning, each instance is associated with a set of candidate labels; its ground-truth label is hidden in the candidate set and is not accessible during training. To eliminate the influence of false candidate labels on the learning process, a partial label learning method combining instance semantic-difference maximization and manifold learning (PL-SDML) is proposed. It is a two-stage method: in the training stage, labeling confidences are generated for the training instances based on the semantic-difference maximization criterion and manifold learning; in the prediction stage, a nearest-neighbor voting scheme predicts the labels of unseen instances. On four controlled UCI datasets, PL-SDML outperforms the compared algorithms in 70% of the settings on average. On four real-world partial label datasets, it achieves performance gains of 0.3% to 13.8% over the compared algorithms.

4.
We address the problem of predicting category labels for unlabeled videos in a large video dataset by using a ground-truth set of objectively labeled videos that we have created. Large video databases like YouTube require that a user uploading a new video assign to it a category label from a prescribed set of labels. Such category labeling is likely to be corrupted by the subjective biases of the uploader. Despite their noisy nature, these subjective labels are frequently used as gold standard in algorithms for multimedia classification and retrieval. Our goal in this paper is NOT to propose yet another algorithm that predicts labels for unseen videos based on the subjective ground-truth. Rather, our goal is to demonstrate that video classification performance can be improved if, instead of using subjective labels, we first create an objectively labeled ground-truth set of videos and then train a classifier on that ground-truth to predict objective labels for the set of unlabeled videos.

5.
We consider an incremental optimal label placement in a closed-2PM map containing points, each attached with a label. Labels are assumed to be axis-parallel and square-shaped, pairwise disjoint, and of maximum possible length, each attached to its corresponding point on one of its horizontal edges. Such a labeling is denoted as an optimal labeling. Our goal is to efficiently generate a new optimal labeling for all points after each new point is inserted in the map. Inserting a point may require several labels to flip or all labels to shrink. We present an algorithm that generates each new optimal labeling in O(lg n + k) time, where k is the number of required label flips, if there is no need to shrink the label lengths, or in O(n) time when we have to shrink the labels and flip some of them. The algorithm uses O(n) space in both cases. This is a new result on this problem.

6.
Large, unlabeled datasets are abundant nowadays, but getting labels for those datasets can be expensive and time-consuming. Crowd labeling is a crowdsourcing approach for gathering such labels from workers whose suggestions are not always accurate. While a variety of algorithms exist for this purpose, we present crowd labeling latent Dirichlet allocation (CL-LDA), a generalization of latent Dirichlet allocation that can solve a more general set of crowd labeling problems. We show that it performs as well as other methods and at times better on a variety of simulated and actual datasets while treating each label as compositional rather than indicating a discrete class. In addition, prior knowledge of workers’ abilities can be incorporated into the model through a structured Bayesian framework. We then apply CL-LDA to the EEG independent component labeling dataset, using its generalizations to further explore the utility of the algorithm. We discuss prospects for creating classifiers from the generated labels.

7.
Crowdsourcing is an emerging way to collect labels for datasets. Although inexpensive, it cannot guarantee label quality, and when objective factors degrade the crowd workers' performance, the collected labels become even less reliable. We therefore propose a feature-augmentation-based method for improving crowdsourcing quality (FA-method). Its basic idea is as follows: first, experts annotate a small portion of the data; a model trained on the crowd-labeled dataset then predicts on the expert set, and its predictions are appended to the expert set as new features; a model trained on the augmented expert set then predicts, for each instance, the probability that it is noise, which together with an upper bound on the noise count filters out the instances whose labels are potentially noisy; similarly, the feature-augmentation step is applied once more to the filtered high-quality set to further correct noise. Validation on eight UCI datasets shows that, compared with existing crowdsourced-labeling methods that combine noise identification and correction, the proposed method performs well even when repeated labels are few or annotation quality is low.
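The two core operations in that description, appending model predictions as a new feature and flagging a bounded number of suspected-noise instances, can be sketched as below. This is a hedged toy version: `crowd_model` is any callable, and flagging by simple disagreement with the model is our simplification of the paper's noise-probability computation.

```python
# Hypothetical sketch of the FA-method's feature-augmentation step:
# predictions of a crowd-trained model become an extra feature column
# on the expert-labeled set. The disagreement-based flagging rule is an
# assumption, not the paper's exact noise-probability estimate.

def augment_features(expert_X, crowd_model):
    """Append the crowd model's prediction to each expert instance."""
    return [list(x) + [crowd_model(x)] for x in expert_X]

def flag_noisy(expert_X, expert_y, crowd_model, max_noise):
    """Flag up to `max_noise` instances whose labels disagree with the model
    (the cap plays the role of the paper's upper bound on the noise count)."""
    scores = [(i, 0 if crowd_model(x) == y else 1)
              for i, (x, y) in enumerate(zip(expert_X, expert_y))]
    suspects = [i for i, s in sorted(scores, key=lambda t: -t[1]) if s > 0]
    return suspects[:max_noise]
```

In the full method, the filtered high-quality set would then go through the same augmentation step a second time to correct the remaining noise.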

8.
Multi-label answer aggregation estimates the true labels of samples by fusing a large number of non-expert annotations collected through crowdsourcing. Digital cultural heritage data are expensive to annotate and have many, imbalanced classes, which makes multi-label answer aggregation on such datasets particularly challenging. Previous methods have mostly focused on single-label tasks and ignore the label correlations in multi-label tasks; most multi-label aggregation methods account for label correlations to some extent but are highly sensitive to noise and outliers. To address these problems, a multi-label answer aggregation method based on adaptive graph regularization and joint low-rank matrix factorization (AGR-JMF) is proposed. First, the annotation matrix is decomposed into clean and noisy parts; an adaptive graph regularization method then builds the label correlation matrix from the clean annotations; finally, annotation quality, label correlations, and similarities in annotators' behavioral attributes guide the low-rank matrix factorization that aggregates the multi-label answers. Experiments on real-world datasets and a Mogao Grottoes mural dataset show that AGR-JMF clearly outperforms existing algorithms in aggregation accuracy and in identifying spammers.

9.
张志浩  林耀进  卢舜  郭晨  王晨曦 《计算机应用》2021,41(10):2849-2857
Multi-label feature selection has been widely applied in image classification, disease diagnosis, and other fields. In practice, however, the label space often suffers from partially missing labels, which breaks the structure of and correlations among labels and makes it hard for learning algorithms to select important features accurately. To address this problem, a multi-label feature selection algorithm based on label-specific features under missing labels (MFSLML) is proposed. First, sparse learning is used to obtain the label-specific features of each class label; at the same time, a linear regression model maps label-specific features to labels and is used to recover the missing labels; finally, experiments are conducted on seven datasets with four evaluation metrics. The results show that, compared with state-of-the-art multi-label feature selection algorithms such as the max-dependency and min-redundancy multi-label feature selection algorithm (MDMR) and the multi-label feature selection algorithm based on feature interaction (MFML), MFSLML improves average precision by 4.61 to 5.5 percentage points and thus achieves better classification performance.

10.
谭桥宇  余国先  王峻  郭茂祖 《软件学报》2017,28(11):2851-2864
Weak label learning is an important branch of multi-label learning; in recent years it has been widely studied and applied to problems such as completing and predicting the missing labels of multi-label samples. However, for high-dimensional data, which have large feature sets, are more likely to carry multiple semantic labels, and are more prone to missing labels, existing weak label learning methods are easily disturbed by the noisy and redundant features such data contain. To classify high-dimensional multi-label data accurately, an ensemble weak label classification method based on maximizing the dependency between labels and features (EnWL) is proposed. EnWL first applies affinity propagation clustering repeatedly in the feature space of the high-dimensional data, each time taking the cluster centers to form a representative feature subset and thus reducing the interference of noisy and redundant features; it then trains, on each feature subset, a semi-supervised multi-label classifier that maximizes the dependency between labels and features; finally, it combines these classifiers by voting to perform multi-label classification. Experimental results on a variety of high-dimensional datasets show that EnWL outperforms existing related methods on multiple evaluation metrics.

11.
Partial label learning is a weakly supervised learning framework in which each instance is associated with multiple candidate labels, among which only one is the ground-truth label. This paper proposes a unified formulation that employs proper label constraints for training models while simultaneously performing pseudo-labeling. Unlike existing partial label learning approaches that only leverage similarities in the feature space without utilizing label constraints, our pseudo-labeling process leverages similarities and differences in the feature space using the same candidate label constraints and then disambiguates noise labels. Extensive experiments on artificial and real-world partial label datasets show that our approach significantly outperforms state-of-the-art counterparts on classification prediction.
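The key idea shared by the abstracts above, that feature-space similarity must be filtered through each instance's candidate-label constraint, can be made concrete with a toy disambiguation routine. This is our simplification, not the paper's unified formulation: it assigns each instance the label of its nearest labeled neighbor, restricted to labels inside that instance's own candidate set.

```python
# Toy candidate-constrained pseudo-labeling: the nearest-neighbor label
# is accepted only if it lies in the instance's candidate set (a
# simplification of similarity-plus-label-constraint disambiguation;
# all names here are ours, not the paper's).

def disambiguate(X, candidates, labeled):
    """X: instances; candidates: one set of allowed labels per instance;
    labeled: list of (x, y) with known labels. Returns one label each."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    out = []
    for x, cand in zip(X, candidates):
        # rank known examples by distance; take the closest allowed label
        for _, y in sorted(((dist(x, xl), y) for xl, y in labeled)):
            if y in cand:
                out.append(y)
                break
        else:
            out.append(next(iter(cand)))  # fall back to any candidate
    return out
```

The candidate set acts as a hard constraint: even a very close neighbor is ignored when its label is not among the instance's candidates.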

12.
Label distribution learning solves multi-label learning tasks effectively, but constructing the classifier presupposes large-scale annotations carrying stronger supervision, which is hard to obtain in many applications. An alternative is label enhancement, which mines the implicit numerical importance of each label from conventional logical annotations. Most existing label enhancement methods assume that the enhanced labels must preserve the correlations of the original logical labels across all instances, and so they cannot effectively preserve local label correlations. Based on granular computing theory, a granular label enhancement method for label distribution learning is proposed. The method uses k-means clustering to construct information granules with locally correlated semantics and, at the granule level, converts the labels of the instances within each granule on graphs according to the properties of the logical labels and the topological structure of the feature space. Finally, the resulting label distributions are fused at the instance level to obtain numerical labels that describe label importance over the entire dataset. Extensive comparative studies show that the proposed model significantly improves the performance of multi-label learning.

13.
Crowdsourcing provides an effective and low-cost way to collect labels from crowd workers. Due to the lack of professional knowledge, the quality of crowdsourced labels is relatively low. A common approach to addressing this issue is to collect multiple labels for each instance from different crowd workers and then a label integration method is used to infer its true label. However, to our knowledge, almost all existing label integration methods merely make use of the original attribute information and do not pay attention to the quality of the multiple noisy label set of each instance. To solve these issues, this paper proposes a novel three-stage label integration method called attribute augmentation-based label integration (AALI). In the first stage, we design an attribute augmentation method to enrich the original attribute space. In the second stage, we develop a filter to single out reliable instances with high-quality multiple noisy label sets. In the third stage, we use majority voting to initialize integrated labels of reliable instances and then use cross-validation to build multiple component classifiers on reliable instances to predict all instances. Experimental results on simulated and real-world crowdsourced datasets demonstrate that AALI outperforms all the other state-of-the-art competitors.
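The majority-voting initialization used in AALI's third stage is the standard baseline for label integration and is easy to state in code. A minimal sketch follows; the deterministic first-seen tie-breaking rule is our choice, since the abstract does not specify one.

```python
# Majority voting over each instance's multiple noisy label set -- the
# initialization step AALI shares with most label-integration methods.
# Ties are broken by the first worker label seen (an assumption; the
# paper's tie rule may differ).

from collections import Counter

def majority_vote(noisy_label_sets):
    """noisy_label_sets: list of lists, one list of worker labels per instance."""
    integrated = []
    for labels in noisy_label_sets:
        counts = Counter(labels)
        top = max(counts.values())
        # keep worker order for deterministic tie-breaking
        integrated.append(next(l for l in labels if counts[l] == top))
    return integrated
```

AALI then trains component classifiers on the reliable instances initialized this way, rather than trusting the vote as the final answer.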

14.
An optimal labeling where labels are disjoint axis-parallel equal-size squares is called 2PM labeling if the labels have maximum length each attached to its corresponding point on the middle of one of its horizontal edges. In a closed-2PM labeling, no two edges of labels containing points should intersect. Removing one point and its label makes free room for its adjacent labels and may cause a global label expansion. In this paper, we construct several data structures in the preprocessing stage, so that any point removal event is handled efficiently. We present an algorithm which decides in O(lg n) amortized time whether a label removal leads to label expansion, in which case a new optimal labeling for the remaining points is generated in O(n) amortized time.

15.
We present a novel off-line algorithm for target segmentation and tracking in video. In our approach, video data is represented by a multi-label Markov Random Field model, and segmentation is accomplished by finding the minimum energy label assignment. We propose a novel energy formulation which incorporates both segmentation and motion estimation in a single framework. Our energy functions enforce motion coherence both within and across frames. We utilize state-of-the-art methods to efficiently optimize over a large number of discrete labels. In addition, we introduce a new ground-truth dataset, called Georgia Tech Segmentation and Tracking Dataset (GT-SegTrack), for the evaluation of segmentation accuracy in video tracking. We compare our method with several recent on-line tracking algorithms and provide quantitative and qualitative performance comparisons.

16.
Manually annotating large-scale multi-label data almost inevitably produces missing labels. Most existing algorithms handle this problem using global label correlations shared by all instances, i.e., the correlations between labels are assumed to be the same for every instance. In practice, however, label correlations are not identical across instances, and correlations obtained locally are then more accurate. This paper therefore proposes a method based on local label correlations. The method uses local label correlations to recover missing labels and low-rank matrix factorization to build a classifier suited to large-scale data. Furthermore, to speed up training, it fuses these two processes into a unified framework solved by iterative optimization. Extensive experimental results show that the method improves prediction accuracy by at least 2 percentage points and training speed by at least 5 percentage points over existing algorithms.
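The notion of recovering a missing label from *local* rather than global correlations can be illustrated with a deliberately small stand-in: fill each missing entry with the majority value of that label among the instance's k nearest neighbors. This is our toy substitute for the paper's local-correlation and low-rank machinery; the parameter k and the vote rule are assumptions.

```python
# Toy local recovery of missing multi-label entries: a missing entry is
# filled by the majority value of that label among the k nearest
# neighbors that have it observed (a stand-in for the paper's
# local-label-correlation model; k and the rule are our assumptions).

def recover_missing(X, Y, k=2):
    """X: feature tuples; Y: label rows with None for missing entries.
    Returns a filled copy of Y."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    filled = [row[:] for row in Y]
    for i, row in enumerate(Y):
        for j, val in enumerate(row):
            if val is None:
                # neighbors with this label observed, nearest first
                obs = sorted((dist(X[i], X[m]), Y[m][j])
                             for m in range(len(X))
                             if m != i and Y[m][j] is not None)
                votes = [y for _, y in obs[:k]]
                filled[i][j] = (int(sum(votes) >= (len(votes) + 1) / 2)
                                if votes else 0)
    return filled
```

Because only the nearest neighbors vote, two instances with different neighborhoods can recover the same label differently, which is exactly the local-correlation behavior the paper argues for.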

17.
Preference learning is an emerging topic that appears in different guises in the recent literature. This work focuses on a particular learning scenario called label ranking, where the problem is to learn a mapping from instances to rankings over a finite number of labels. Our approach for learning such a mapping, called ranking by pairwise comparison (RPC), first induces a binary preference relation from suitable training data using a natural extension of pairwise classification. A ranking is then derived from the preference relation thus obtained by means of a ranking procedure, whereby different ranking methods can be used for minimizing different loss functions. In particular, we show that a simple (weighted) voting strategy minimizes risk with respect to the well-known Spearman rank correlation. We compare RPC to existing label ranking methods, which are based on scoring individual labels instead of comparing pairs of labels. Both empirically and theoretically, it is shown that RPC is superior in terms of computational efficiency, and at least competitive in terms of accuracy.
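The voting step the abstract highlights, deriving a ranking from pairwise preference scores by summing each label's (weighted) votes, can be sketched directly. The pairwise scores themselves would come from the trained binary classifiers; here `P` is simply given, and the names are ours.

```python
# Sketch of RPC's voting step: P[(i, j)] is the estimated probability
# that label i is preferred to label j. Summing each label's votes and
# sorting yields the ranking; per the paper, this (weighted) voting
# minimizes risk under Spearman rank correlation.

def rank_by_pairwise_votes(labels, P):
    """P maps ordered pairs (i, j), i != j, to preference probabilities."""
    score = {l: sum(P[(l, m)] for m in labels if m != l) for l in labels}
    return sorted(labels, key=lambda l: -score[l])
```

With hard 0/1 preferences this reduces to plain Borda-style counting; the soft probabilities are what make the strategy risk-minimizing for Spearman correlation.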

18.
Construction workplace hazard detection requires engineers to analyze scenes manually against many safety rules, which is time-consuming, labor-intensive, and error-prone. Computer vision algorithms are yet to achieve reliable discrimination of anomalous and benign object relations underpinning safety violation detections. Recently developed deep learning-based computer vision algorithms need tens of thousands of images, including labels of the safety rules violated, in order to train deep-learning networks for acquiring spatiotemporal reasoning capacity in complex workplaces. Such training processes need human experts to label images and indicate whether the relationship between the worker, resource, and equipment in the scenes violates spatiotemporal arrangement rules for safe and productive operations. False alarms in those manual labels (labeling no-violation images as having violations) can significantly mislead the machine learning process and result in computer vision models that produce inaccurate hazard detections. Compared with false alarms, the other type of mislabel, false negatives (labeling images having violations as “no violations”), seems to have less impact on the reliability of the trained computer vision models. This paper examines a new crowdsourcing approach that achieves above 95% accuracy in labeling images of complex construction scenes having safety-rule violations, with a focus on minimizing false alarms while keeping acceptable rates of false negatives. The development and testing of this new crowdsourcing approach examine two fundamental questions: (1) How to characterize the impacts of a short safety-rule training process on the labeling accuracy of non-professional image annotators? And (2) How to properly aggregate the image labels contributed by ordinary people to filter out false alarms while keeping an acceptable false negative rate?
In designing short training sessions for online image annotators, the research team split a large number of safety rules into smaller sets of six. An online image annotator learns six safety rules randomly assigned to him or her, and then labels workplace images as “no violation” or “violation” of certain rules among the six learned. About one hundred and twenty anonymous image annotators participated in the data collection. Finally, a Bayesian-network-based crowd consensus model aggregated these labels from annotators to obtain safety-rule violation labeling results. Experiment results show that the proposed model can achieve close to 0% false alarm rates while keeping the false negative rate below 10%. Such image labeling performance outdoes existing crowdsourcing approaches that use majority votes for aggregating crowdsourced labels. Given these findings, the presented crowdsourcing approach sheds light on effective construction safety surveillance by integrating human risk recognition capabilities into advanced computer vision.
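The asymmetric aggregation goal above, near-zero false alarms at the cost of some false negatives, can be illustrated with a thresholded vote. This is a hedged stand-in for the paper's Bayesian-network consensus model, not that model itself; the threshold value is our assumption.

```python
# Asymmetric aggregation sketch: accept a "violation" label only when
# the fraction of rule-qualified annotators flagging it clears a high
# threshold, trading extra false negatives for a near-zero false-alarm
# rate. The paper instead uses a Bayesian-network consensus model; the
# 0.8 threshold here is our assumption.

def aggregate(flags, threshold=0.8):
    """flags: 0/1 votes from annotators trained on the relevant rule.
    Returns 1 (violation) or 0 (no violation)."""
    if not flags:
        return 0
    return int(sum(flags) / len(flags) >= threshold)
```

Raising the threshold above the simple-majority 0.5 is what suppresses false alarms relative to the majority-vote baselines the paper compares against.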

19.
In multiple-instance learning (MIL), an individual example is called an instance and a bag contains a single or multiple instances. The class labels available in the training set are associated with bags rather than instances. A bag is labeled positive if at least one of its instances is positive; otherwise, the bag is labeled negative. Since a positive bag may contain some negative instances in addition to one or more positive instances, the true labels for the instances in a positive bag may or may not be the same as the corresponding bag label and, consequently, the instance labels are inherently ambiguous. In this paper, we propose a very efficient and robust MIL method, called Multiple-Instance Learning via Disambiguation (MILD), for general MIL problems. First, we propose a novel disambiguation method to identify the true positive instances in the positive bags. Second, we propose two feature representation schemes, one for instance-level classification and the other for bag-level classification, to convert the MIL problem into a standard single-instance learning (SIL) problem that can be solved by well-known SIL algorithms, such as support vector machine. Third, an inductive semi-supervised learning method is proposed for MIL. We evaluate our methods extensively on several challenging MIL applications to demonstrate their promising efficiency, robustness, and accuracy.
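The bag-labeling rule stated in the abstract is simple enough to write down as executable semantics, which also makes the source of instance-label ambiguity explicit.

```python
# The MIL bag-labeling rule from the abstract, as executable semantics:
# a bag is positive iff at least one of its instances is positive.

def bag_label(instance_labels):
    """instance_labels: iterable of 0/1 true instance labels in one bag."""
    return int(any(instance_labels))
```

Note the asymmetry this rule creates: a negative bag certifies every instance in it as negative, while a positive bag only certifies that *some* instance is positive, which is exactly the ambiguity MILD's disambiguation step targets.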

20.
In this paper we study multi-label learning with weakly labeled data, i.e., labels of training examples are incomplete, which commonly occurs in real applications, e.g., image classification, document categorization. This setting includes, e.g., (i) semi-supervised multi-label learning, where completely labeled examples are partially known; (ii) weak label learning, where relevant labels of examples are partially known; (iii) extended weak label learning, where relevant and irrelevant labels of examples are partially known. Previous studies often expect that a learning method using weakly labeled data will improve performance, as more data are employed. This, however, is not always the case in reality, i.e., weakly labeled data may sometimes degrade the learning performance. It is desirable to learn safe multi-label predictions that will not hurt performance when weakly labeled data is involved in the learning procedure. In this work we optimize multi-label evaluation metrics (F1 score and Top-k precision) given that the ground-truth label assignment is realized by a convex combination of base multi-label learners. To cope with the infinite number of possible ground-truth label assignments, a cutting-plane strategy is adopted to iteratively generate the most helpful label assignments. The whole optimization is cast as a series of simple linear programs in an efficient manner. Extensive experiments on three weakly labeled learning tasks, namely, (i) semi-supervised multi-label learning, (ii) weak label learning, and (iii) extended weak label learning, clearly show that our proposal improves the safeness of using weakly labeled data compared with many state-of-the-art methods.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号-23
