首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
This paper describes an object recognition algorithm both on a sequential machine and on a single instruction multiple data (SIMD) parallel processor such as the MIT connection machine. The problem, in the way it is presently formulated on a sequential machine, is essentially a propagation of constraints through a tree of possibilities in an attempt to prune the tree to a small number of leaves. The tree can become excessively large, however, and so implementations on massively parallel machines are sought in order to speed up the problem. Two fast parallel algorithms are described here, a static algorithm and a dynamic algorithm. The static algorithm reformulates the problem by assigning every leaf in the completely expanded unpruned tree to a separate processor in the connection machine. Then pruning is done in nearly constant time by broadcasting constraints to the entire SIMD array. This parallel version is shown to run three to four orders of magnitude faster than the sequential version. For large recognition problems which would exceed the capacity of the machine, a dynamic algorithm is described which performs a series of loading and pruning steps, dynamically allocating and deallocating processors through the use of the connection machine's global router communications mechanism.  相似文献   

2.
This paper presents count-sort, a parallel algorithm for mesh-connected computers to sort integers where the range of inputs is known. A straightforward counting technique that has not been implemented previously in parallel sorting algorithms is presented. On a mesh-connected computer with N×N processors we are able to sort N integers in the range 1 sN in timewhere cN is very small. For practical values of N, the algorithm is extremely fast. Further, it is possible to expand the range by a factor k to 1 N so that the slowdown is less than k.

We produce two implementations of count-sort on the SIMD MasPar MP-I with 8192 processors. The first sorts 8-bit integers, one per processor, significantly faster than the manufacturer's current library routine for sorting 8-bit integers. The second implementation is a fast version that sorts several elements per processor.  相似文献   

3.
Feature tracking and matching in video using programmable graphics hardware   总被引:2,自引:0,他引:2  
This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and is suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real-time at 30 Hz on 1,024 × 768 resolution video which is a 20 times improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz which is approximately 10 times faster than an optimized CPU implementation.  相似文献   

4.
5.

Efficient collision detection is critical in 3D geometric modeling. In this paper, we first implement three parallel triangle-triangle intersection algorithms on a GPU and then compare the computational efficiency of these three GPU-accelerated parallel triangle-triangle intersection algorithms in an application that detects collisions between triangulated models. The presented GPU-based parallel collision detection method for triangulated models has two stages: first, we propose a straightforward and efficient parallel approach to reduce the number of potentially intersecting triangle pairs based on AABBs, and second, we conduct intersection tests with the remaining triangle pairs in parallel based on three triangle-triangle intersection algorithms, i.e., the Möller’s algorithm, Devillers’ and Guigue’s algorithm, and Shen’s algorithm. To evaluate the performance of the presented GPU-based parallel collision detection method for triangulated models, we conduct four groups of benchmarks. The experimental results show the following: (1) the time required to detect collisions for the triangulated model consisting of approximately 1.5 billion triangle pairs is less than 0.5 s; (2) the GPU-based parallel collision detection method speedup over the corresponding serial version is 50x - 60x, and (3) Devillers’ and Guigue’s algorithm is comparatively and comprehensively the best of the three GPU-based parallel triangle-triangle intersection algorithms. The presented GPU-accelerated method is capable of efficiently detecting the potential collisions of triangulated models. Overall, the GPU-accelerated parallel Devillers’ and Guigue’s triangle-triangle intersection algorithm is recommended when performing practical collision detections between large triangulated models.

  相似文献   

6.
With the rapid development of quantum computers capable of realizing Shor’s algorithm, existing public key-based algorithms face a significant security risk. Crystals-Kyber has been selected as the only key encapsulation mechanism (KEM) algorithm in the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography (PQC) competition. In this study, we present a portable and efficient implementation of a Crystals-Kyber post-quantum KEM based on WebAssembly (Wasm), a recently released portable execution framework for high-performance web applications. Until now, most Kyber implementations have been developed with native programming languages such as C and Assembly. Although there are a few previous Kyber implementations based on JavaScript for portability, their performance is significantly lower than that of implementations based on native programming languages. Therefore, it is necessary to develop a portable and efficient Kyber implementation to secure web applications in the quantum computing era. Our Kyber software is based on JavaScript and Wasm to provide portability and efficiency while ensuring quantum security. Namely, the overall software is written in JavaScript, and the performance core parts (secure hash algorithm-3-based operations and polynomial multiplication) are written in Wasm. Furthermore, we parallelize the number theoretic transform (NTT)-based polynomial multiplication using single instruction multiple data (SIMD) functionality, which is available in Wasm. The three steps in the NTT-based polynomial multiplication have been parallelized with Wasm SIMD intrinsic functions. Our software outperforms the latest reference implementation of Kyber developed in JavaScript by ×4.02 (resp. ×4.32 and ×4.1), ×3.42 (resp. ×3.52 and ×3.44), and ×3.41 (resp. ×3.44 and ×3.38) in terms of key generation, encapsulation, and decapsulation on Google Chrome (resp. Firefox, and Microsoft Edge). As far as we know, this is the first software implementation of Kyber with Wasm technology in the web environment.  相似文献   

7.
This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ∼ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.  相似文献   

8.
In this paper we study various implementations of Cholesky factorization on SIMD architectures. A submatrix algorithm is implemented on the MasPar MP-2 using both block and torus-wrap data mappings. Both LLT and LDLT (square root free) implementations of the algorithm are investigated. The execution times and performance results for the MP-2 are presented. The performance of these algorithms is characterized in terms of the problem size, machine size and other machine dependent communication and computation parameters. Analysis for the communication and computation complexities for these algorithms is also presented, and models to predict the performance are derived. The torus-wrap mapped implementations outperformed the block approach for all problem sizes. The LDLT implementation outperformed LLT for small to medium problem sizes. © 1997 John Wiley & Sons, Ltd.  相似文献   

9.
Graphics processing units (GPUs) have an SIMD architecture and have been widely used recently as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing because the most frequent operation in data cube computation is aggregation, which is an expensive operation well suited for SIMD parallel processors. H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree based data cube operations. This has been made possible by taking effective methods, such as parallel primitives for segmented data and efficient memory access patterns, to achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms can often achieve more than an order of magnitude speedup when compared with their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.  相似文献   

10.
We present a parallel implementation of a histogram-based particle filter for object tracking on smart cameras based on SIMD processors. We specifically focus on parallel computation of the particle weights and parallel construction of the feature histograms since these are the major bottlenecks in standard implementations of histogram-based particle filters. The proposed algorithm can be applied with any histogram-based feature sets—we show in detail how the parallel particle filter can employ simple color histograms as well as more complex histograms of oriented gradients (HOG). The algorithm was successfully implemented on an SIMD processor and performs robust object tracking at up to 30 frames per second—a performance difficult to achieve even on a modern desktop computer.  相似文献   

11.
This paper presents a Graphics Processing Unit (GPU)-based implementation of a Bellman-Ford (BF) routing algorithm using NVIDIA’s Compute Unified Device Architecture (CUDA). In the proposed GPU-based approach, multiple threads run concurrently over numerous streaming processors in the GPU to dynamically update routing information. Instead of computing the individual vertex distances one-by-one, a number of threads concurrently update a larger number of vertex distances, and an individual vertex distance is represented in a single thread. This paper compares the performance of the GPU-based approach to an equivalent CPU implementation while varying the number of vertices. Experimental results show that the proposed GPU-based approach outperforms the equivalent sequential CPU implementation in terms of execution time by exploiting the massive parallelism inherent in the BF routing algorithm. In addition, the reduction in energy consumption (about 99 %) achieved by using the GPU is reflective of the overall merits of deploying GPUs across the entire landscape of IP routing for emerging multimedia communications.  相似文献   

12.
Min—Min is a popular heuristic for scheduling tasks to heterogeneous computational resources, which has been applied either directly or as part of more sophisticated heuristics. However, for large scenarios such as grid computing platforms, the time complexity of a straightforward implementation of Min—Min, which is quadratic in the number of tasks, may be prohibitive. This has motivated the development of high performance computing (HPC) implementations, and the use of simpler heuristics for the sake of acceptable execution times. We propose a simple algorithm that implements Min—Min requiring only O(mn) operations for scheduling n tasks on m machines. Our experiments show, in practice, that a straightforward sequential implementation of this algorithm significantly outperforms other state of the art implementations of Min—Min, even compared to HPC implementations. In addition, the proposed algorithm is at least as suitable for parallelization as a direct implementation of Min—Min.  相似文献   

13.
We introduce a GPU-based parallel vertex substitution (pVS) algorithm for the p-median problem using the CUDA architecture by NVIDIA. pVS is developed based on the best profit search algorithm, an implementation of vertex substitution (VS), that is shown to produce reliable solutions for p-median problems. In our approach, each candidate solution in the entire search space is allocated to a separate thread, rather than dividing the search space into parallel subsets. This strategy maximizes the usage of GPU parallel architecture and results in a significant speedup and robust solution quality. Computationally, pVS reduces the worst case complexity from sequential VS’s O(p · n2) to O(p · (n ? p)) on each thread by parallelizing computational tasks on GPU implementation. We tested the performance of pVS on two sets of numerous test cases (including 40 network instances from OR-lib) and compared the results against a CPU-based sequential VS implementation. Our results show that pVS achieved a speed gain ranging from 10 to 57 times over the traditional VS in all test network instances.  相似文献   

14.
There are two ways, other than the standard fast Fourier transform (FFT) algorithm, of computing Fourier transforms of real data, namely, (1)the real fast Fourier transform (RFFT) algorithm, and (2) the fast Hartley transform (FHT) algorithm. On a sequential computer, it has been shown that both the RFFT and the FHT algorithms are faster than the FFT algorithm. However, it is not obvious that the same is true on a parallel machine. The communication requirements of the RFFT and the FHT algorithms, which are critical to the cost of any parallel implementation, are different from those of the FFT algorithm. In this paper we present efficient implementations of the RFFT and the FHT algorithms on a hypercube machine. Experimental results are given for the implementation of the RFFT and the FHT algorithms on the NCUBE machine.  相似文献   

15.
Sequence segmentation has gained popularity in bioinformatics and particularly in studying DNA sequences. Information theoretic models have been used in providing accurate solutions in the segmentation of DNA sequences. Existing dynamic programming approaches provide optimal solution to the segmentation problem. However, their quadratic time complexity prohibits their applicability to long sequences. In this paper, we propose a parallel approach to improve the performance of a quasilinear sequence segmentation algorithm. The target segmentation technique is a divide-and-conquer recursive algorithm that is based on information theory principles and models. We present three parallel implementations that aim at reducing the segmentation time. The first implementation uses the multithreading capabilities of CPUs. The second one is a hybrid implementation that utilizes the synergy between the CPU and the multithreading power of GPUs. The third implementation is a variation of the hybrid approach where it utilizes the concept of unified memory between the CPU and the GPU instead of the standard memory copy approach. We demonstrate the applicability of the parallel implementations by testing them on real DNA sequences and randomly generated sequences with different lengths and different number of unique elements. The results show that the hybrid CPU-GPU approach outperforms the sequential implementation with a speedup of up to 5.9X while the CPU parallel implementation provides a poor speedup of only 1.7X.  相似文献   

16.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

17.
A GPU-based implementation of the MRF algorithm in ITK package   总被引:1,自引:1,他引:0  
The analysis of medical image, in particular Magnetic Resonance Imaging (MRI), is a very useful tool to help the neurologists on the diagnosis. One of the stages on the analysis of MRI is given by a classification based on the Markov Random Fields (MRF) method. It is possible to find in the literature several packages to carry out this analysis, and of course, the classification tasks. One of them is the Insight Segmentation and Registration Toolkit (ITK). The analysis of MRI is an expensive computational task. In order to reduce the execution time spent on the analysis of MRI, parallelism techniques can be used. Currently, Graphics Processing Units (GPUs) are becoming a good choice to reduce the execution time of several applications at a low cost. In this paper, the authors present a GPU-based classification using MRF from the sequential implementation that appears in the ITK package. The experimental results show a spectacular execution time reduction being the GPU-based implementation up to 118 times faster than the sequential implementation included in the ITK package. Moreover, this result is also observed by reducing the total power consumption in a significant amount.  相似文献   

18.
In this paper, we present faster than real-time implementation of a class of dense stereo vision algorithms on a low-power massively parallel SIMD architecture, the CSX700. With two cores, each with 96 Processing Elements, this SIMD architecture provides a peak computation power of 96 GFLOPS while consuming only 9 Watts, making it an excellent candidate for embedded computing applications. Exploiting full features of this architecture, we have developed schemes for an efficient parallel implementation with minimum of overhead. For the sum of squared differences (SSD) algorithm and for VGA (640 × 480) images with disparity ranges of 16 and 32, we achieve a performance of 179 and 94 frames per second (fps), respectively. For the HDTV (1,280 × 720) images with disparity ranges of 16 and 32, we achieve a performance of 67 and 35 fps, respectively. We have also implemented more accurate, and hence more computationally expensive variants of the SSD, and for most cases, particularly for VGA images, we have achieved faster than real-time performance. Our results clearly demonstrate that, by developing careful parallelization schemes, the CSX architecture can provide excellent performance and flexibility for various embedded vision applications.  相似文献   

19.

Image registration is a commonly task in medical image analysis. Therefore, a significant number of algorithms have been developed to perform rigid and non-rigid image registration. Particularly, the free-form deformation algorithm is frequently used to carry out non-rigid registration task; however, it is a computationally very intensive algorithm. In this work, we describe an approach based on profiling data to identify potential parts of this algorithm for which parallel implementations can be developed. The proposed approach assesses the efficient of the algorithm by applying performance analysis techniques commonly available in traditional computer operating systems. Hence, this article provides guidelines to support researchers working on medical image processing and analysis to achieve real-time non-rigid image registration applications using common computing systems. According to our experimental findings, significant speedups can be accomplished by parallelizing sequential snippets, i.e., code regions that are executed more than once. For the selected costly functions previously identified in the studied free-form deformation algorithm, the developed parallelization decreased the runtime by up to seven times relatively to the related single thread based implementation. The implementations were developed based on the Open Multi-Processing application programming interface. In conclusion, this study confirms that based on the call graph visualization and detected performance bottlenecks, one can easily find and evaluate snippets which are potential optimization targets in addition to throughput in memory accesses.

  相似文献   

20.
Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号