Similar literature
1.
In this paper, we propose a practical disk error recovery scheme tolerating multiple simultaneous disk failures in a typical RAID system, resulting in improvement in availability and reliability. The scheme is composed of the encoding and the decoding processes. The encoding process is defined by making one horizontal parity and a number of vertical parities. The decoding process is defined by a data recovering method for multiple disk failures including the parity disks. The proposed error recovery scheme is proven to correctly recover the original data for multiple simultaneous disk failures regardless of the positions of the failed disks. The proposed error recovery scheme only uses exclusive OR operations and simple arithmetic operations, which can be easily implemented on current RAID systems without hardware changes.
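The horizontal-parity idea mentioned in this abstract can be illustrated with a minimal sketch. It covers only the simplest case, a single XOR parity block protecting one stripe against one lost block. The abstract does not describe how the vertical parities are laid out or how multiple failures are decoded, so none of that is reproduced here; the function names and the sample stripe are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def horizontal_parity(data_blocks):
    """Compute one parity block across all data blocks of a stripe."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe spread over three data disks.
stripe = [b"disk", b"fail", b"test"]
parity = horizontal_parity(stripe)

# Pretend the second disk failed and rebuild its block.
recovered = recover_block([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, XOR-ing the parity with every surviving block cancels them out and leaves the lost block. Tolerating more than one failure needs additional, independent parities, which is the role the abstract assigns to the vertical parities.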

2.
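As software systems become more complex and configurable, failures caused by misconfiguration are becoming a critical problem. Diagnosing and repairing such failures requires analysis that spans both the software itself and its runtime environment, which makes the process very difficult and the repair cost extremely high. To reduce the serious economic losses, security risks, and functional failures that such faults cause, a misconfiguration detection technique based on the correlations among information system configuration items is designed, building on the implicit correlations between configuration items and their runtime environment. A large set of given sample configurations is used for training to derive the correlations between configuration items and the detection rules; by mining the correlations among the configuration items of the system's components and using them to cross-check configuration items, the system's misconfigurations can be detected effectively. Simulation tests show that the proposed method achieves a misconfiguration detection rate above 90%, has broad application prospects in large enterprises, and offers constructive directions for further optimizing misconfiguration detection techniques.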

3.
The most popular image matching algorithm SIFT, introduced by D. Lowe a decade ago, has proven to be sufficiently scale invariant to be used in numerous applications. In practice, however, scale invariance may be weakened by various sources of error inherent to the SIFT implementation affecting the stability and accuracy of keypoint detection. The density of the sampling of the Gaussian scale-space and the level of blur in the input image are two of these sources. This article presents a numerical analysis of their impact on the stability of the extracted keypoints. Such an analysis has both methodological and practical implications, on how to compare feature detectors and on how to improve SIFT. We show that even with a significantly oversampled scale-space, numerical errors prevent perfect stability from being achieved. Usual strategies to filter out unstable detections (e.g., poorly contrasted extrema) are shown to be inefficient. We also prove that the effect of the error in the assumption on the initial blur is asymmetric and that the method is strongly degraded in the presence of aliasing or without a correct assumption on the camera blur. This analysis leads to a series of practical recommendations.

4.
Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, which is often prohibitively expensive for practical use due to its high performance overhead and cost. The adaptive reconfigurable mobile objects of reliability (Armor) middleware architecture offers a scalable low-overhead way to provide high-dependability services to applications. It uses coordinated multithreaded processes to manage redundant resources across interconnected nodes, detect errors in user applications and infrastructural components, and provide failure recovery. The authors describe the experiences and lessons learned in deploying Armor in several diverse fields.

5.
Context-aware applications, as a typical type of self-adaptive software systems, are receiving increasing attention. These applications continually adapt to environmental changes in an autonomic way. However, their adaptation may contain defects when the complexity of modeling all environmental changes is beyond a developer's ability. Such defects can cause the adaptation to fail and result in application crashes or freezes. Relating these failures back to responsible defects is challenging. In this paper we propose a novel approach, called Adam, to assist in identifying defects in the context-aware adaptation. Adam monitors runtime errors for an application, logs relevant error information, and relates them to responsible defects in this application. To make our Adam approach feasible, we investigate the error types that are commonly exhibited by various failures reported in context-aware applications. Adam detects these errors in order to identify responsible defects in context-aware applications. To detect these errors, Adam formally models the adaptation semantics for context-aware applications, and integrates into them a set of assertion checkers with respect to these error types. We experimentally evaluated Adam through three context-aware applications. The experiments reported promising results that Adam can effectively detect errors, identify their responsible defects in applications, and give useful hints on how these defects can be fixed.

6.
Ye Yingjun, Zhang Yongdong, Ye Weicai. 《The Journal of Supercomputing》 2022, 78(12): 14009-14033

It is essential to use fault tolerance techniques on exascale high-performance computing systems, but this faces many challenges such as higher probability of failure, more complex types of faults, and greater difficulty in failure detection. In this paper, we designed the Fail-Lagging model to describe HPC process-level failure. The failure model does not distinguish whether the failed process is crashed or slow, but is compatible with the possible behavior of the process due to various failures, such as crash, slow, recovery. The failure detection in Fail-Lagging model is implemented by local detection and global decision among processes, which depend on a robust and efficient communication topology. Robust means that failed processes do not easily corrupt the connectivity of the topology, and efficient means that the time complexity of the topology used for collective communication is as low as possible. For this purpose, we designed a torus-tree topology for failure detection, which is scalable even at the scale of an extremely large number of processes. The Fail-Lagging model supports common fault tolerance methods such as rollback, replication, redundancy, algorithm-based fault tolerance, etc. and is especially able to better enable the efficient forward recovery mode. We demonstrate with large-scale experiments that the torus-tree failure detection algorithm is robust and efficient, and we apply fault tolerance based on the Fail-Lagging model to iterative computation, enabling applications to react to faults in a timely manner.


7.
Reliability is a serious problem in computer controlled robot systems. Although robots serve successfully in relatively simple applications such as painting and spot welding, their potential in areas such as automated assembly is hampered by the complexity of programming. A program for assembling parts may be logically correct, execute correctly on a simulator, and even execute correctly on a robot most of the time, yet still fail unexpectedly in the face of real world uncertainties. Recovery from such errors is far more complicated than recovery from simple controller errors, since even expected errors can manifest themselves in unexpected ways. In this paper we present a novel approach for improving robot reliability. Instead of anticipating errors, we use knowledge-based programming techniques so that the robot can autonomously exploit knowledge about its task and environment to detect and recover from failures. We describe a system that we have designed and constructed in our robotics laboratory.

8.
ReStore: Symptom-Based Soft Error Detection in Microprocessors   (Total citations: 1; self-citations: 0; citations by others: 1)
Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures has been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor, in particular the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow misspeculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 20x if ReStore is coupled with protection for certain particularly vulnerable pipeline structures.

9.
A field study was performed in a hospital pharmacy aimed at identifying positive and negative influences on the process of detection of and further recovery from initial errors or other failures, thus avoiding negative consequences. Confidential reports and follow-up interviews provided data on 31 near-miss incidents involving such recovery processes. Analysis revealed that organizational culture with regard to following procedures needed reinforcement, that some procedures could be improved, that building in extra checks was worthwhile and that supporting unplanned recovery was essential for problems not covered by procedures. Guidance is given on how performance in recovery could be measured. A case is made for supporting recovery as an addition to prevention-based safety methods.

10.
A Flexible Framework for Fault Tolerance in the Grid   (Total citations: 2; self-citations: 0; citations by others: 2)
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome the challenge by using a notification mechanism which is based on the interpretation of notification messages being delivered from the underlying Grid resources. The Grid-WFS built on top of FDS allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve the flexibility by the use of workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures.

11.
As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease the FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool to automate the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than the performance of the traditional checkpointing approach.

12.
Interest point detection plays a significant role in computer vision applications. The most commonly used interest point detector is the scale invariant feature transform (SIFT). The Gaussian filter used in the SIFT algorithm fails to match interest points on edges and also introduces blur during rescaling. To overcome this failure, the Bilateral-Harris Corner Detector (BHCD) is proposed in this paper. In the proposed BHCD, a bilateral filter smooths the image and removes noise while preserving edges. The accuracy of interest point localization is improved by the proposed dynamic blur metric calculation. The Harris corner detector is added to obtain stable and reliable interest point detection. The proposed BHCD has been evaluated in simulation on criteria such as repeatability and matching score. Extensive experimental results show that the proposed method is more robust to illumination, scaling, rotation, compression and viewpoint changes. The experimental evaluation of BHCD was carried out on the object recognition benchmark datasets COIL-100, ZuBud, and Caltech-101. The proposed BHCD achieves the highest recognition rate compared with the other state-of-the-art methods.

13.
File recovery enhances the reliability and robustness of a network file management system. This capability of error detection and recovery is examined in an FTAM implementation. The issues of docket design, checkpoint insertion, recoverability, as well as interface transparency, are discussed in the paper. The impact of the recovery on the end-to-end performance and the effectiveness of the error recovery protocol in the light of failures are also examined by means of performance measurements. The result shows that the advantage gained by error recovery outweighs the protocol overheads incurred in the process.

14.
Network-based cloud computing has rapidly expanded as an effective way of video processing and transmission. Since packet losses or errors may frequently occur in a cloud computing environment during the transmission of compressed video, error concealment is applied in the decoder to prevent significant degradation of image quality. Motion vector (MV) recovery is a widely used temporal error concealment technique which shows satisfactory performance in practical video transmission. In this paper, a fast and effective temporal error concealment algorithm for H.264/AVC is presented, which efficiently utilizes the MVs of neighboring macroblocks (MB) which are adjacent to the lost MB under different circumstances. To ensure the precision of the MV recovery, a smallest division of \(4\times 4\) sub-blocks is applied, which does not add much complexity to the proposed algorithm. The MV of each sub-block is restored individually, and the recovery information is gathered from the nearby 20 sub-blocks. Simulation results under the virtual cloud environment show that our scheme can substantially improve the quality of reconstructed video and obtain a gain of about 4 dB in PSNR, compared with other temporal error concealment methods under different packet loss rates and quantization parameters. The practical simplicity ensures that the proposed method can be readily applied to real-time video applications running in a cloud computing environment.
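As a rough illustration of motion vector recovery from neighbors, the sketch below uses a component-wise median of the available neighboring MVs. This is a common baseline for temporal error concealment and is an assumption here, not the algorithm of the paper, whose per-sub-block rules and 20-sub-block neighborhood are not specified in the abstract. The function name and the sample vectors are hypothetical.

```python
from statistics import median


def recover_motion_vector(neighbor_mvs):
    """Estimate a lost sub-block's MV as the component-wise median of its neighbors.

    neighbor_mvs holds (dx, dy) tuples; None marks a neighbor that is also lost.
    Falls back to the zero vector when no neighbor is available.
    """
    available = [mv for mv in neighbor_mvs if mv is not None]
    if not available:
        return (0, 0)
    return (
        median(mv[0] for mv in available),
        median(mv[1] for mv in available),
    )


# Hypothetical neighborhood: three usable neighbors and one that was also lost.
estimate = recover_motion_vector([(2, -1), (4, 0), None, (3, 5)])
```

The median is used rather than the mean so that a single outlier neighbor, for example one belonging to a different moving object, does not drag the estimate away from the dominant motion.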

15.
Byte level Forward Error Correction (B-FEC) is efficient for recovery from uniform bit errors, but not suitable to handle recovery from burst bit errors. Conversely, Sub-Packet level Forward Error Correction (SP-FEC) can alleviate the problem of large encoding/decoding delay jitter in Packet level Forward Error Correction (P-FEC) to efficiently handle recovery from burst bit errors, but, like P-FEC, has a large error recovery overhead for recovery from uniform bit errors. This paper proposes a dynamic combination of byte level and Sub-Packet level Forward Error Correction (BSP-FEC) in the Hybrid Automatic Repeat reQuest (HARQ) mechanism to reduce the error recovery overhead. BSP-FEC not only can solve the problems appearing in B-FEC and SP-FEC, but also can get the advantages of B-FEC and SP-FEC in the HARQ mechanism. BSP-FEC replaces the SP-FEC checksum with B-FEC and uses Automatic Repeat reQuest (ARQ) when the network condition deteriorates. BSP-FEC not only utilizes an overhead cost model to dynamically decide the SP-FEC parameter and the B-FEC bit rate according to network conditions, but also utilizes a time constraint model to decide the ARQ retry limit. BSP-FEC dynamically adjusts the FEC redundancy to save bandwidth and improves the Decodable Frame Rate (DFR) and the Peak Signal to Noise Ratio (PSNR) of the delivered video stream. Accordingly, BSP-FEC can improve multimedia communication performance to both avoid network congestion and shorten end-to-end delay by decreasing effective packet loss rate and packet recovery overhead. Furthermore, because of the low packet recovery overhead, BSP-FEC allows applications to transmit more application data in networks with limited bandwidth. For compatibility, BSP-FEC is implemented in the application layer as the end-to-end protection method to protect packets from errors in wired/wireless networks. Numerical and simulation experimental results show that BSP-FEC obtains better recovery efficiency with the minimum error recovery overhead.

16.
OpenMP has focused on performance for numerical applications, but when we try to move this focus to other kinds of applications, such as Web servers, we find one important gap. In these applications, performance is important, but reliability is even more important, and OpenMP does not have any recovery mechanism. In this paper we present a novel proposal to address this gap. In order to add error handling to OpenMP, we propose some extensions to the current OpenMP specification. A directive and a clause are proposed, defining a scope for the error handling (where the error can occur) and specifying a behaviour for handling the specific errors. Some examples of use are presented, and we also present an evaluation showing the impact of this proposal on OpenMP applications. We show that this impact is low enough to consider the proposal worthwhile for OpenMP.

17.
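庄曈, 曾庆化, 刘建业, 董良. 《计算机工程》 (Computer Engineering) 2012, 38(15): 197-200
To address attitude determination for an unmanned aerial vehicle (UAV) during continuous flight, an attitude algorithm for micro UAVs based on monocular vision is proposed. From the image sequence captured by the UAV's camera, feature point information is extracted with the scale-invariant feature transform (SIFT); combined with the epipolar geometry constraint, the random sample consensus (RANSAC) principle is applied to solve for the pose change of the vehicle, yielding its navigation information. Experimental results show that the accuracy of the attitude angle change obtained from the monocular image sequence is better than 0.1°, and the accumulated error over a 180° rotation is less than 1°.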

18.
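Traditional feature-extraction stitching algorithms produce too many mismatches when registering complex images, which causes ghosting and blur in the stitched image and degrades its quality. To address this, an improved SIFT registration algorithm is proposed. After SIFT features are extracted from the target images, the scale and gradient orientation information of the SIFT descriptors is used to build a minimum-neighborhood matching step that rejects mismatched points; the local root-mean-square error (RMSE) is then used to evaluate the mapping matrix and is combined with the RANSAC algorithm to iterate toward an accurate transformation model. After geometric correction of the images, an adaptive hybrid linear algorithm is proposed that transforms the overlapping region into the HIS color space for stitching, finally producing a smooth, seamless, complete color panorama. Experimental results show that the algorithm retains good accuracy and stability when stitching complex scenes with little overlap.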
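The RMSE-based evaluation of a mapping matrix mentioned in this abstract can be sketched as a reprojection error over matched point pairs. The abstract does not say how the local RMSE is defined or how it is combined with RANSAC, so the sketch below assumes a plain 3x3 homography and a global RMSE over all matches; all names and sample points are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )


def reprojection_rmse(H, src_points, dst_points):
    """Root-mean-square distance between mapped source points and their matches."""
    squared_errors = []
    for src, dst in zip(src_points, dst_points):
        px, py = apply_homography(H, src)
        squared_errors.append((px - dst[0]) ** 2 + (py - dst[1]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical candidate model: a pure translation by (2, 3).
H = [[1, 0, 2], [0, 1, 3], [0, 0, 1]]
src = [(0, 0), (1, 1)]
exact_rmse = reprojection_rmse(H, src, [(2, 3), (3, 4)])
noisy_rmse = reprojection_rmse(H, src, [(2, 3), (3, 6)])
```

In a RANSAC-style loop, a score of this kind could be used to rank candidate transformation models, keeping the candidate with the lowest error on its inlier set.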

19.

Simulation is a common technique for the evaluation of new approaches and protocols in networked systems and provides many benefits. However, it is also well known that the relevance of the simulation results for real-world applications depends on the various models which are used within the simulation, e.g., for the characteristics of the radio communication. In this paper, we introduce the Extended Multipath Raytracing Model, an extension to the ray-tracing radio medium available in Cooja, to improve the modelling of wireless links in simulated Wireless Sensor Networks. Our extension allows the simulation of environmental influences onto links on a per node basis, allowing the analysis of various effects observed in experiments in a virtual environment. Furthermore, the packet-based modelling of transmission errors is extended to provide the simulation of bit errors, allowing new usage scenarios, like the simulation of error detection and Forward Error Correction codes in Cooja.


20.
This paper describes an analysis of hardware-related software (HW/SW) errors on an MVS/SP operating system at Stanford University. The analysis procedure demonstrates a methodology for evaluating the interaction between hardware and software as it relates to system reliability. The paper examines the operating system's handling of HW/SW errors and also the effectiveness of recovery management. Nearly 35 percent of all observed software failures were found to be hardware-related. The analysis shows that the operating system is seldom able to diagnose that a software error may be hardware-related. The impact of HW/SW errors on the system is evaluated by measuring the effectiveness of system recovery in containing the propagation of HW/SW errors. The system failure probability for HW/SW errors is close to three times that for software errors in general. The observed HW/SW errors are seen to have a specific pattern, suggesting the possibility of the use of such error patterns for intelligent error prediction and recovery.
