Similar Literature
Found 20 similar documents (search time: 15 ms)
1.
Sampling is a fundamental method for generating data subsets. As many data analysis methods are developed based on probability distributions, maintaining distributions when sampling can help to ensure good data analysis performance. However, sampling a minimum subset while maintaining probability distributions remains an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass the distribution test in order to preserve the distribution. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with a size of 60,000 show that when the significance level α is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples based on their Bayesian networks (ASIA, ALARM, HEPAR2, and ANDES, respectively). When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests with an average distribution difference of approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, with an average distribution difference of approximately 0.24.
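The pass/fail criterion can be sketched for a single categorical variable (the paper applies it to each conditional distribution of a Bayesian network; the category names, probabilities, and sizes below are made up for illustration):

```python
import random
from collections import Counter

def chi_square_stat(sample, expected_probs):
    """Pearson chi-square statistic of a categorical sample against
    expected category probabilities."""
    n = len(sample)
    counts = Counter(sample)
    stat = 0.0
    for cat, p in expected_probs.items():
        observed = counts.get(cat, 0)
        expected = n * p
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical population distribution: P(A) = 0.7, P(B) = 0.3.
probs = {"A": 0.7, "B": 0.3}
random.seed(0)
population = random.choices(list(probs), weights=probs.values(), k=60000)

# A candidate subset "passes" when its statistic stays below the
# chi-square critical value (here df = 1, alpha = 0.05 -> 3.841).
subset = random.sample(population, 500)
passes = chi_square_stat(subset, probs) < 3.841
```

A heuristic sampler in the paper's spirit would grow the subset while scoring candidate additions by how much they reduce this statistic.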

2.
To address the limited gains of ensemble learning caused by insufficient diversity among base learners, this paper proposes an ensemble learning method based on probability calibration, together with two methods for reducing the impact of multicollinearity. First, the probabilities output by the original classifiers are calibrated with several different calibration methods; then an ensemble is learned over the calibrated probabilities to predict the final result. The different calibration methods used in the first step provide stronger diversity for the ensemble learning in the second step. To address the multicollinearity between calibrated and original probabilities, two methods are proposed: choose-best and bootstrap. The choose-best method selects, for each base classifier, the best among the original classifier and its calibrated variants for the ensemble; the bootstrap method samples with replacement from the whole set of base classifiers and ensembles the sampled classifiers. Experiments show that naive probability-calibration ensembles yield only limited improvement, whereas applying the choose-best or bootstrap method improves performance considerably. This indicates that probability calibration supplies stronger diversity for ensemble learning, and that the accompanying multicollinearity can be effectively resolved by sampling-based methods.
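A minimal sketch of the choose-best step, assuming the calibrated probability vectors have already been computed for one base classifier (the labels and probabilities below are hypothetical; in practice the calibrated versions would come from, e.g., Platt scaling or isotonic regression):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted
    positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def choose_best(y_valid, candidates):
    """Return the name of the candidate probability vector (raw output
    or one of its calibrated versions) with the lowest validation
    log-loss."""
    return min(candidates, key=lambda name: log_loss(y_valid, candidates[name]))

# Hypothetical validation labels and one base classifier's raw
# probabilities plus two calibrated versions of them.
y_valid = [1, 0, 1, 1, 0]
candidates = {
    "raw":      [0.6, 0.4, 0.7, 0.6, 0.3],
    "platt":    [0.8, 0.2, 0.9, 0.7, 0.1],
    "isotonic": [0.7, 0.3, 0.8, 0.6, 0.2],
}
best = choose_best(y_valid, candidates)
```

Running this per base classifier, and ensembling the winners, is the choose-best variant; the bootstrap variant would instead resample from the pooled set of raw and calibrated classifiers.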

3.
Importance sampling is a technique commonly used to speed up Monte Carlo simulation of rare events. However, little is known about the design of efficient importance sampling algorithms in the context of queueing networks. The standard approach, which simulates the system under an a priori fixed change of measure suggested by large deviations analysis, has been shown to fail in even the simplest network settings. Estimating probabilities associated with rare events has been a topic of great importance in queueing theory, and in applied probability at large. In this article, we analyse the performance of an importance sampling estimator for a rare event probability in a Jackson network. We apply strict deadlines to a two-node Jackson network with feedback whose arrival and service rates are modulated by an exogenous finite-state Markov process. We estimate the probability of network blocking for various sets of parameters, as well as the probability of customers missing their deadlines for different loads and deadlines. Finally, we show how the probability of total population overflow is affected by various deadline values, service rates and arrival rates.
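The flavor of such estimators can be sketched on the simplest case, a single M/M/1 queue, where the classical change of measure swaps the arrival and service rates; this is only an illustration of the technique, not the paper's modulated two-node model:

```python
import random

def is_overflow_prob(lam, mu, level, runs, seed=0):
    """Importance-sampling estimate of the probability that an M/M/1
    queue starting with 1 customer reaches `level` before emptying.
    The change of measure swaps arrival and service rates, the
    standard large-deviations heuristic for a single queue."""
    rng = random.Random(seed)
    p_up = lam / (lam + mu)        # original up-step probability
    q_up = mu / (lam + mu)         # tilted up-step probability (swapped)
    total = 0.0
    for _ in range(runs):
        x, weight = 1, 1.0
        while 0 < x < level:
            if rng.random() < q_up:      # simulate under the tilted measure
                weight *= p_up / q_up    # accumulate the likelihood ratio
                x += 1
            else:
                weight *= (1 - p_up) / (1 - q_up)
                x -= 1
        if x == level:                   # rare event reached
            total += weight
    return total / runs

estimate = is_overflow_prob(lam=0.3, mu=0.7, level=10, runs=20000)
```

For this toy case the exact answer is the gambler's-ruin probability (1 - μ/λ)/(1 - (μ/λ)^level), which the estimate tracks closely; it is precisely in network settings that such a fixed swap can fail, motivating the article's analysis.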

4.
Balanced sampling is a very efficient sampling design when the variable of interest is correlated with the auxiliary variables on which the sample is balanced. A procedure to select balanced samples in a stratified population has previously been proposed. Unfortunately, this procedure becomes very slow as the number of strata increases, and it even fails to select samples when the number of strata is very large. A new algorithm to select balanced samples in a stratified population is proposed. It is much faster than the existing procedure when the number of strata is large and, unlike the existing method, can still select samples when the number of strata is very large. Balanced sampling can then be applied to a highly stratified population when only a few units are selected in each stratum. Finally, the algorithm turns out to be valuable for many applications, for instance the handling of nonresponse.
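The stratified setting can be sketched as follows; note this shows only plain random selection of a few units within each stratum, not the balancing on auxiliary variables, which is the substance of the proposed algorithm (units, strata, and sizes are made up):

```python
import random
from collections import defaultdict

def stratified_sample(units, strata, n_per_stratum, seed=0):
    """Select a fixed number of units uniformly at random within each
    stratum -- the setting in which the balanced-sampling algorithm
    operates."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit, s in zip(units, strata):
        by_stratum[s].append(unit)
    sample = []
    for s, members in by_stratum.items():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

units = list(range(100))
strata = [u % 5 for u in units]          # 5 hypothetical strata of 20 units
sample = stratified_sample(units, strata, n_per_stratum=2)
```

The paper's contribution is to make the balanced version of this selection fast even when the number of strata is in the thousands and each stratum contributes only a few units.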

5.
Active learning reduces the sample complexity of a learning algorithm by actively selecting which examples to label. In contrast to the version-space-bisection strategy commonly adopted by current active learning algorithms, this paper proposes a strategy that shrinks the version space by more than half, avoiding the strong assumptions the bisection strategy requires. Based on this strategy, a heuristic active learning algorithm (CBMPMS) is implemented that selects as training examples those most likely to be misclassified. The algorithm computes the entropy of the difference between the class probabilities predicted for an example by a committee of hypotheses drawn at random from the version space and those predicted by the current learner, and uses it as the criterion for selecting examples. Experiments on UCI datasets show that the algorithm achieves better performance than related work on most of the datasets.

6.
In this paper, a digital filter bank structure is proposed for the reconstruction of uniformly sampled bandlimited signals from their N-th order nonuniform samples. The proposed filter bank structure is arrived at by incorporating polyphase-domain filtering operations and discrete Fourier transform (DFT) modulation into an existing filter bank framework. An approach is also presented for reconstructing uniform samples from N-th order nonuniform samples using structures based on recurrent nonuniform sampling. A comparison of the computational complexity and the signal-to-noise ratio (SNR) performance is also given for various structures in the literature.

7.
In this article, the problem of robust sampled-data H∞ output tracking control is investigated for a class of nonlinear networked systems with stochastic sampling and time-varying norm-bounded uncertainties. For technical simplicity, only two different sampling periods are considered, whose occurrence probabilities are given constants satisfying a Bernoulli distribution; the results can be extended to the case of multiple stochastic sampling periods. By way of an input delay approach, the probabilistic system is transformed into a stochastic continuous time-delay system. A new linear matrix inequality-based procedure is proposed for designing state-feedback controllers that guarantee that the closed-loop networked system with stochastic sampling tracks the output of a given reference model well in the H∞ sense. Conservatism is reduced by taking the sampling probabilities into account. Both network-induced delays and packet dropouts are considered. Finally, an illustrative example shows the usefulness and effectiveness of the proposed H∞ output tracking design.

8.
This paper deals with state estimation for systems under measurement noise whose mean and covariance change according to Markov transition probabilities. The minimum-variance estimate of the state involves consideration of a prohibitively large number of sequences, so the usual computation method becomes impractical. In the algorithm proposed here, the estimate is calculated from a relatively small number of sequences sampled at random from the full set. The average risk of the algorithm is shown to converge to the optimal average risk as the number of sampled sequences increases. An ideal sampling probability yielding very fast convergence is found; it is approximated, in a minimum mean-squared sense, by a probability under which sequences can be sampled sequentially and with great ease. This policy for determining the sampling probability makes it possible to design practical and efficient algorithms. Digital simulation results show the good performance of the proposed algorithm.

9.
Importance sampling is a technique commonly used to speed up Monte Carlo simulation of rare events. The standard approach, which simulates the system under an a priori fixed change of measure, has been shown to fail in even the simplest network settings. Estimating probabilities associated with rare events has been a topic of great importance in queueing theory, and in applied probability at large. In this paper, we estimate the probability of two rare events, total population overflow and individual buffer overflow, in an open Jackson network in which customers must receive the needed service within a fixed deadline. We use parallel computing to implement the estimator. Moreover, we consider the effect of various network parameters on the aforementioned overflow probabilities, and we also show how these parameters affect the probability of missing the deadline.

10.
The sequential probability ratio test is widely used in in-situ monitoring, anomaly detection, and decision making for electronics, structures, and process control. However, because the model parameters of this method, such as the system disturbance magnitudes and the false and missed alarm probabilities, are selected by users primarily based on experience, the actual false and missed alarm probabilities are typically higher than the users' requirements. This paper presents a systematic method to select model parameters for the sequential probability ratio test using a cross-validation technique. The presented method improves the accuracy of the sequential probability ratio test by reducing the false and missed alarm probabilities caused by improper model parameters. A case study on anomaly detection of resettable fuses demonstrates the application of the cross-validation method to selecting the model parameters.
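As a reference point, Wald's SPRT for a Gaussian mean shift looks as follows; α and β are exactly the user-chosen false- and missed-alarm probabilities whose selection the paper systematizes (the observations and parameter values below are made up):

```python
import math

def sprt(observations, mu0, mu1, sigma, alpha, beta):
    """Wald's sequential probability ratio test for a shift in the mean
    of Gaussian observations with known standard deviation. Returns
    'H1' (alarm), 'H0' (no alarm), or 'continue' (need more data)."""
    upper = math.log((1 - beta) / alpha)     # accept H1 above this
    lower = math.log(beta / (1 - alpha))     # accept H0 below this
    llr = 0.0
    for x in observations:
        # log-likelihood ratio increment for one Gaussian observation
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return "continue"

decision = sprt([2.1, 1.8, 2.4, 2.0], mu0=0.0, mu1=2.0, sigma=1.0,
                alpha=0.05, beta=0.05)
```

Because the realized error rates depend on how mu1, alpha, and beta are chosen, the paper's cross-validation step tunes them on held-out data rather than by experience.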

11.
符永铨, 王意洁, 周婧. Journal of Software (软件学报), 2009, 20(3): 630-643
To address the problem of scalable, fast, unbiased sampling in unstructured P2P systems, this paper proposes SMARW, a sampling method based on adaptive random walks over multiple peers. In SMARW, a group of temporary peers selected by proxy random walks carries out the sampling process, producing a tunable number of sampled nodes per round and thereby raising the sampling speed; the nodes sampled in each round then serve as the temporary peers for the next round, a simple scheme that keeps the system load balance near-optimal. SMARW also uses an adaptive distributed random-walk correction procedure to accelerate the convergence of the sampling process. Theoretical analysis and simulations show that SMARW achieves a high degree of unbiased sampling together with near-optimal system load balancing.

12.
Due to the application-specific nature of wireless sensor networks, sensitivity to coverage and data-reporting latency varies with the type of application. In light of this, algorithms and protocols should be application-aware to make optimal use of the highly limited resources in sensors and hence increase overall network performance. This paper proposes a probabilistic constrained random sensor selection (CROSS) scheme for application-aware sensing coverage with the goal of maximizing network lifetime. In each round, the CROSS scheme randomly selects (approximately) k data-reporting sensors, sufficient for a user/application-specified desired sensing coverage (DSC), while maintaining a minimum distance between any pair of the selected k sensors. We exploit the Poisson sampling technique to enforce the minimum distance. Consequently, CROSS improves the spatial regularity of the randomly selected k sensors, and hence the fidelity of satisfying the DSC in each round, and the connectivity among the selected sensors increases. We also introduce an algorithm to compute the desired minimum distance to be enforced between any pair of sensors. Finally, we present a probabilistic analytical model to measure the impact of the Poisson sampling technique on selecting k sensors, along with the optimality of the desired minimum distance computed by the proposed algorithm.
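The minimum-distance constraint can be sketched with a simple dart-throwing loop, a stand-in for the Poisson (hard-core) sampling technique; the sensor positions, k, and d_min below are made up, whereas in the paper d_min would come from the proposed algorithm:

```python
import math
import random

def select_with_min_distance(positions, k, d_min, seed=0, max_tries=10000):
    """Randomly pick (approximately) k sensor positions such that no
    two selected sensors are closer than d_min: draw candidates at
    random and keep one only if it respects the distance constraint."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == k:
            break
        cand = rng.choice(positions)
        if all(math.dist(cand, p) >= d_min for p in chosen):
            chosen.append(cand)
    return chosen

random.seed(1)
positions = [(random.random(), random.random()) for _ in range(500)]
selected = select_with_min_distance(positions, k=20, d_min=0.1)
```

Enforcing the distance in this way spreads the selected sensors more regularly over the field than plain random selection, which is the spatial-regularity effect the CROSS scheme relies on.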

13.
Big-data environments contain a great deal of redundant and noisy data, which inflates storage cost and degrades learning accuracy. To select representative samples effectively while improving accuracy and reducing training time, an incremental SVM learning algorithm based on selective sampling is proposed. The algorithm uses Markov sampling: during sampling, the current decision model is used to compute transition probabilities between samples, and each sample is accepted or rejected as training data according to its transition probability, so that representative samples are selected. The algorithm is compared with other incremental SVM learning algorithms on nine benchmark datasets, with the regularization parameter chosen by ten-fold cross-validation. The numerical results show that the algorithm improves learning accuracy while greatly reducing the total sampling-plus-training time and the total number of support vectors.
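A Metropolis-style sketch of accept/reject via transition probabilities, assuming importance weights derived from the current model's decision scores (the weighting, scores, and temperature are illustrative assumptions, not the paper's exact scheme):

```python
import math
import random

def markov_select(scores, n_keep, temperature=1.0, seed=0):
    """Walk over candidate indices and accept a move with probability
    given by the ratio of importance weights, where low |score| (near
    the decision boundary) means a more informative sample. Returns
    the sequence of visited indices, which concentrates on
    representative samples."""
    rng = random.Random(seed)
    weight = lambda s: math.exp(-abs(s) / temperature)
    current = rng.randrange(len(scores))
    kept = []
    while len(kept) < n_keep:
        proposal = rng.randrange(len(scores))
        accept_p = min(1.0, weight(scores[proposal]) / weight(scores[current]))
        if rng.random() < accept_p:   # accept via the transition probability
            current = proposal
        kept.append(current)
    return kept

scores = [3.2, 0.1, -0.2, 2.8, 0.05, -3.0]  # hypothetical SVM decision values
kept = markov_select(scores, n_keep=100)
```

With a uniform proposal, this chain's stationary distribution is proportional to the weights, so near-boundary samples dominate the selection, which is the general effect the selective-sampling step aims for.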

14.
International Journal of Computer Mathematics, 2012, 89(8): 1565-1572
Recently, the estimation of a population quantile has received considerable attention. Existing quantile estimators generally assume that the values of an auxiliary variable are known for the entire population, and most are defined under simple random sampling without replacement. Assuming two-phase sampling for stratification with arbitrary sampling designs in each of the two phases, a new quantile estimator and its variance estimator are defined. The proposed estimators can be used when population auxiliary information is not available, a common situation in practice. Desirable properties such as unbiasedness are derived. The suggested estimators are compared numerically with an alternative stratification estimator and its variance estimator, with favourable results. Confidence intervals based on the proposed estimators are also defined and compared, via simulation studies, with confidence intervals based on the stratification estimator. The proposed confidence intervals achieve the desired coverage probabilities with the smallest interval lengths.

15.
The performance of m-out-of-n bagging with and without replacement is analyzed in terms of the sampling ratio (m/n). Standard bagging uses resampling with replacement to generate bootstrap samples of the same size as the original training set (m_wr = n). Without-replacement methods typically use half samples (m_wor = n/2). These choices of sampling size are arbitrary and need not be optimal in terms of the classification performance of the ensemble. We propose to use the out-of-bag estimates of the generalization accuracy to select a near-optimal value for the sampling ratio. Ensembles of classifiers trained on independent samples whose size is such that the out-of-bag error of the ensemble is as low as possible generally improve on standard bagging and can be built efficiently.
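The selection criterion can be sketched with a toy bagged 1-nearest-neighbour ensemble; the base learner, data, and candidate ratios are stand-ins, and the only point carried over from the paper is using the out-of-bag error to pick the sampling ratio:

```python
import random
from collections import Counter

def one_nn(train, x):
    """Predict with 1-nearest-neighbour on a list of (point, label)."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

def oob_error(data, ratio, n_models=25, seed=0):
    """Out-of-bag error of a bagged 1-NN ensemble whose members are
    bootstrap samples of size ratio * n: each point is judged by the
    majority vote of the members whose sample did not include it."""
    rng = random.Random(seed)
    n, m = len(data), max(1, int(ratio * len(data)))
    samples = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(m)]   # with replacement
        samples.append((set(idx), [data[i] for i in idx]))
    errors = 0
    for i, (x, y) in enumerate(data):
        votes = [one_nn(train, x) for used, train in samples if i not in used]
        if votes and Counter(votes).most_common(1)[0][0] != y:
            errors += 1
    return errors / n

# Toy 1-D two-class data; scan ratios and keep the lowest OOB error.
random.seed(0)
data = [(random.gauss(c, 1.0), c) for c in (0, 4) for _ in range(30)]
best_ratio = min((0.2, 0.5, 0.8, 1.0), key=lambda r: oob_error(data, r))
```

The same scan works for any base learner: the out-of-bag error is a nearly free estimate of generalization error, so no separate validation set is needed to tune m/n.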

16.
Acceptance sampling is used to decide whether a whole lot should be accepted or rejected, based on inspection of items randomly sampled from that lot. As an alternative to traditional sampling plans, Bayesian approaches can exploit previous knowledge of process variation. This study presents a Bayesian two-sided group chain sampling plan (BTSGChSP) using various combinations of design parameters. In BTSGChSP, inspection is based on preceding as well as succeeding lots. The Poisson function is used to derive the probability of lot acceptance based on defective and non-defective products, and the gamma distribution is taken as a suitable prior for the Poisson distribution. Four quality regions are found, namely: (i) the quality decision region (QDR), (ii) the probabilistic quality region (PQR), (iii) the limiting quality region (LQR), and (iv) the indifference quality region (IQR). The producer's and consumer's risks are considered in estimating the quality regions, where the acceptable quality level (AQL) is associated with the producer's risk and the limiting quality level (LQL) with the consumer's risk. Moreover, AQL and LQL are used in selecting the design parameters for BTSGChSP. Values for all possible combinations of design parameters are presented, and the inflection points' values are found. The findings show that BTSGChSP is a better substitute for the existing plan for industrial practitioners.

17.
We explore in this paper a novel sampling algorithm, referred to as algorithm PAS (proportion approximation sampling), to generate a high-quality online sample at the desired sample rate. Sampling quality refers to the consistency between the population proportion and the sample proportion of each categorical value in the database. Note that the state-of-the-art sampling algorithm for preserving sampling quality has to examine the population proportion of each categorical value in a pilot sample a priori and is thus not applicable to incremental mining applications. To remedy this, algorithm PAS adaptively determines the inclusion probability of each incoming tuple in such a way that the sampling quality is sequentially preserved while the sample rate is kept close to the user-specified one. Importantly, PAS not only guarantees the proportion consistency of each categorical value but also excellently preserves the proportion consistency of multivariate statistics, which is significantly beneficial to various data mining applications. For better execution efficiency, we further devise algorithm EQAS (efficient quality-aware sampling), which integrates PAS and random sampling to provide the flexibility of striking a compromise between sampling quality and sampling efficiency. As validated by experimental results on real and synthetic data, algorithm PAS stably provides high-quality samples with corresponding computational overhead, whereas algorithm EQAS can flexibly generate samples with the desired balance between sampling quality and sampling efficiency.

18.
In this paper, a genetic algorithm (GA) is proposed as a search strategy for mining not only positive but also negative quantitative association rules (ARs) from databases. Contrary to commonly used methods, ARs are mined directly without generating frequent itemsets. The proposed GA follows a database-independent approach that does not rely on the minimum support and minimum confidence thresholds, which are hard to determine for each database. Instead of a randomly generated initial population, a uniform population is used that keeps the initial population close to the solutions and distributes it uniformly over the feasible region. An adaptive mutation probability, a new operator called the uniform operator that ensures genetic diversity, and an efficient adjusted fitness function are used to mine all interesting ARs from the last population in a single run of the GA. The efficiency of the proposed GA is validated on synthetic and real databases.

19.
Motivated by the practical need, in a certain field trial, to collect gas samples synchronously at multiple positions and multiple times inside a moving armored vehicle, a gas sampling system with a multi-point distributed layout was designed. Based on the working principle of solenoid valves (open when energized, closed when de-energized) and the delay-control principle of time relays, the system uses solenoid valves to start and stop sampling into vacuum negative-pressure sampling bottles of a given specification, and time relays to open multiple sampling lines automatically in a preset sequence. In a pilot trial, conducted in two stages (subsystem testing and full-system integration testing), the stability of the system and the reliability of its sampling were tested comprehensively. The results show that the system works stably, the collected samples are credible, the sample volume meets the needs of instrumental analysis, and the experimental error introduced by the sampling system is small, so the results are authentic and reliable. The system fully meets the trial requirement of automatically and synchronously collecting gas samples at multiple positions and times inside a moving armored vehicle, providing the material basis for the trial and effectively solving the difficult problem of synchronous multi-position, multi-time gas collection in a confined space.

20.
A desirable feature of a global sampling design for estimating forest cover change from satellite imagery is the ability to adapt the design to obtain precise regional estimates, where a region may be a country, state, province, or conservation area. A sampling design stratified by an auxiliary variable correlated with forest cover change has this adaptability. A global stratified random sample can be augmented by additional sample units within a region selected by the same stratified protocol, and the resulting sample constitutes a stratified random sample of the region. Stratified sampling allows increasing the sample size in a region by a few to many additional sample units. The additional sample units can be effectively allocated to strata to reduce the standard errors of the regional estimates, even though these strata were not initially constructed for the objective of regional estimation. A complete-coverage map of deforestation within the Brazilian Legal Amazon (BLA) is used as a population to evaluate the precision of regional estimates obtained by augmenting a global stratified random sample. The standard errors of the regional estimates for the BLA and states within the BLA obtained from the augmented stratified design were generally smaller than those attained by simple random sampling and systematic sampling.
