Similar Documents
 20 similar documents found (search time: 609 ms)
1.
Resource-aware Speculative Prefetching in Wireless Networks (total citations: 3; self-citations: 0; citations by others: 3)
Tuah  N.J.  Kumar  M.  Venkatesh  S. 《Wireless Networks》2003,9(1):61-72
Mobile users connected to wireless networks expect performance comparable to that of wired networks for interactive multimedia applications. Satisfying Quality of Service (QoS) requirements for such applications in wireless networks is a challenging problem because of the low bandwidth, high error rates, frequent disconnections, and varying bandwidth of wireless channels. In this paper we investigate object prefetching during periods of connectedness and bandwidth availability to enhance the user's perceived connectedness. Access modelling for predicting future accesses in the context of speculative prefetching has received much attention in the literature; this paper presents an access model suitable for multimedia access in wireless networks, which recognizes that a web page is typically a compound of several files rather than a single file. When it comes to making prefetch decisions, most previous studies in speculative prefetching resort to simple heuristics, such as prefetching an item whose access probability exceeds a manually tuned threshold. This paper takes a different approach: it models the performance of the prefetcher, taking into account access predictions and resource parameters, and develops a prefetch policy based on a theoretical analysis of the model. Since the analysis treats the cache as one of the resource parameters, the resulting policy integrates prefetch and cache-replacement decisions. The paper also investigates the effect of prefetching on network load. To make effective use of available resources and maximize the access improvement, it is beneficial to prefetch all items with access probabilities exceeding a certain threshold.
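The abstract's closing observation lends itself to a compact illustration. Below is a minimal Python sketch, not the paper's actual model, of a policy that prefetches every item whose predicted access probability clears a threshold and folds the cache-replacement decision into the same pass; the function name, the 0.4 threshold, and the probability-valued cache are illustrative assumptions.

```python
# Threshold-based speculative prefetching with integrated cache replacement.
# All names and parameters are illustrative, not the paper's model.

def plan_prefetch(predictions, cache, cache_capacity, threshold=0.4):
    """predictions: dict item -> estimated access probability.
    cache: dict item -> access probability of items already cached.
    Returns the list of items to prefetch, evicting less likely items."""
    candidates = sorted(
        (item for item, p in predictions.items()
         if p >= threshold and item not in cache),
        key=lambda it: predictions[it], reverse=True)
    to_fetch = []
    for item in candidates:
        if len(cache) + len(to_fetch) >= cache_capacity:
            # Integrated replacement: evict the cached item least likely
            # to be accessed, but only for a more promising candidate.
            victim = min(cache, key=cache.get)
            if cache[victim] >= predictions[item]:
                break
            del cache[victim]
        to_fetch.append(item)
    return to_fetch
```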

2.
党向磊  王箫音  佟冬  陆俊林  程旭  王克义 《电子学报》2012,40(11):2145-2151
To improve the memory access performance of in-order processors, this paper proposes a pre-execution directed data prefetching method (PEDP). PEDP uses a stride prefetcher to prefetch regular access patterns and, after an L2 cache miss occurs, pre-executes subsequent instructions to issue accurate prefetches for irregular access patterns, combining the advantages of both techniques to improve prefetch coverage. Meanwhile, PEDP uses the actual memory access information captured early during pre-execution to direct the stride prefetcher. Under this direction, the stride prefetcher can issue prefetch requests earlier for addresses that pre-execution would generate and that match the stride access pattern, improving prefetch timeliness. In addition, to further optimize the direction process, PEDP uses an update filter to remove harmful updates to the stride prefetcher, improving prefetch accuracy. Experimental results show that, on average, PEDP improves the performance of the baseline processor by 33.0%. Compared with stride prefetching and pre-execution used alone, PEDP improves performance by 16.2% and 7.3%, respectively.
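As a rough software analogue of the mechanism described above, the sketch below trains a per-PC stride table on demand accesses, lets addresses observed during pre-execution trigger earlier prefetches once a stride is confirmed, and applies a simple update filter that discards pre-execution updates contradicting a confirmed stride. The class and its policy details are assumptions for illustration, not PEDP's hardware design.

```python
# Toy stride prefetcher in the spirit of PEDP (illustrative only).

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr, pre_exec=False):
        """Record one access; return a prefetch address to issue, or None."""
        last, stride, confirmed = self.table.get(pc, (addr, 0, False))
        new_stride = addr - last
        if pre_exec and confirmed and new_stride != stride:
            return None  # update filter: drop harmful pre-execution update
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride  # pattern confirmed: issue prefetch early
        self.table[pc] = (addr, new_stride, False)
        return None
```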

3.
A multimedia storage system plays a vital role in the performance and scalability of multimedia servers. To handle the server load imposed by increased user access to on-demand multimedia streaming applications, new storage system solutions are needed. Multimedia storage systems store and retrieve data from storage devices and manage related issues including data placement, scheduling, file management, continuous data delivery, memory buffering, and prefetching. For high-data-rate multimedia systems, storage systems have long been viewed as a primary bottleneck for two reasons. First, multimedia applications have a much higher storage system load than previous applications. Second, storage devices have become only marginally faster compared to increased processor and network performance. This increasing speed mismatch has fueled a search for new storage structures and file system storage and retrieval mechanisms.

4.
We have developed a 0.25-μm, 200-MHz embedded RISC processor for multimedia applications. This processor has a dual-issue superscalar datapath that consists of a 32-bit integer unit and a 64-bit single-instruction multiple-data (SIMD) function unit that together have a total of five multiply-adders. An on-chip concurrent Rambus DRAM (C-RDRAM) controller uses interleaved transactions to increase the memory bandwidth of the Rambus channel to 533 Mb/s. The controller also reduces latency by using transaction interleaving and instruction prefetching. A 64-bit, 200-MHz internal bus transfers data among the CPU core, the C-RDRAM, and the peripherals. These high-data-rate channels improve CPU performance because they eliminate a bottleneck in the data supply. The datapath part of this chip was designed using a functional macrocell library that included placement information for leaf cells, resulting in a SIMD function unit with a density of 68,000 transistors per square millimeter.

5.
Web prefetching and caching each help mitigate access latency, but each has its strengths and weaknesses. This work combines prefetching with semantic caching: the access frequency of user queries is monitored in real time, and a polynomial regression algorithm predicts each query's access probability in the next period. The prediction model built on this polynomial-regression-based prefetching supports dynamic online prediction; it avoids the prefetch uncertainty caused by drifting user interests, reduces the amount of historical information that must be stored, and thus addresses the problem of Web access latency in a sound and systematic way.
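A minimal sketch of the core idea, assuming numpy and illustrative degree/threshold values: fit a low-degree polynomial to a query's per-period access frequencies, extrapolate one period ahead, and prefetch the query's result when the prediction clears a threshold.

```python
import numpy as np

def predict_next_period(freq_history, degree=2):
    """Fit a polynomial to past per-period frequencies, extrapolate one
    period ahead, and clamp the result to a valid probability."""
    periods = np.arange(len(freq_history))
    coeffs = np.polyfit(periods, freq_history, degree)
    nxt = np.polyval(coeffs, len(freq_history))
    return max(0.0, min(1.0, float(nxt)))

history = [0.10, 0.15, 0.22, 0.31, 0.38]  # fraction of requests per period
if predict_next_period(history) >= 0.35:  # illustrative threshold
    print("prefetch this query's result into the semantic cache")
```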

6.
This paper presents the design of an embedded automated digital video surveillance system with real-time performance. Hardware accelerators for video segmentation, morphological operations, labeling, and feature extraction are required to achieve real-time performance, while tracking is handled in software on an embedded processor. By implementing a complete embedded system, bottlenecks in computational complexity and memory requirements can be identified and addressed. Accordingly, a memory reduction scheme for the video segmentation unit, reducing bandwidth by more than 70%, and a low-complexity morphology architecture that only requires memory proportional to the input image width, have been developed. On a system level, it is shown that a labeling unit based on a contour-tracing technique does not require unique labels, resulting in more than 50% memory reduction. The hardware accelerators provide the tracking software with image object properties, i.e. features, thereby decoupling the tracking algorithm from the image stream. A prototype of the embedded system runs in real time, at 25 fps, on a field-programmable gate array development board. Furthermore, the system's scalability to higher image resolutions is evaluated.

7.
Hybrid instruction prefetching based on control flow (total citations: 2; self-citations: 0; citations by others: 2)
沈立  王志英  鲁建壮  戴葵 《电子学报》2003,31(8):1141-1144
A processor's instruction-fetch capability strongly affects its performance. Instruction prefetching effectively reduces the instruction cache miss rate, improving the processor's fetch capability and hence its overall performance. This paper proposes a hybrid instruction prefetching mechanism based on program control flow, which combines sequential and non-sequential prefetching to bring instructions into the instruction cache ahead of time. Simulation results show that the method effectively improves the instruction cache hit rate, is simple to implement, and has a low useless-prefetch rate.
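A simplified behavioral model of such a hybrid scheme might look as follows: every fetched line triggers a next-line (sequential) prefetch, and lines that previously ended in a taken branch also trigger a target-line (non-sequential) prefetch recorded in a small control-flow table. The 64-byte line size and the table policy are assumptions, not the paper's design.

```python
LINE = 64  # assumed instruction-cache line size in bytes

class HybridIPrefetcher:
    def __init__(self):
        self.targets = {}  # line address -> last observed branch-target line

    def on_fetch(self, line_addr):
        """Return the list of line addresses to prefetch for this fetch."""
        prefetches = [line_addr + LINE]       # sequential prefetch
        if line_addr in self.targets:         # non-sequential prefetch
            prefetches.append(self.targets[line_addr])
        return prefetches

    def on_taken_branch(self, from_pc, to_pc):
        # Remember the control-flow edge, aligned to cache-line boundaries.
        self.targets[from_pc // LINE * LINE] = to_pc // LINE * LINE
```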

8.
Systolic arrays (SAs) are very efficient architectures for multimedia processing, database management, and scientific computing applications characterized by a high number of data accesses. However, in these data-transfer- and storage-intensive applications, memory access is often the limiting factor for computation speed. Since the memory subsystem dominates the cost (area), performance, and power consumption of the SA, special attention must be paid to how the memory subsystem can benefit from customization. In this paper we consider the memory organization of a linear systolic array with bidirectional links (BLSA) suitable for implementing a broad class of algorithms. We assume that memory is organized into smaller distributed physical memory modules. To provide high bandwidth in data access we have designed special hardware, an address generator unit (AGU), whose function is threefold. First, during initialization, it transforms the host address space into the BLSA address space. Second, it provides efficient memory data access during BLSA operation. Third, it performs fast data transfer between the BLSA and the host at the end of the computation. In this article, we examine the impact on area and performance of memory-access-related circuitry when the computationally intensive offset-address calculations performed in software are eliminated by implementing the needed address transformations with the AGUs. By employing hardware AGUs we achieved a speedup of approximately two compared to the software implementation of address calculation, with a hardware overhead of only 7.6% in the worst case.
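A toy model of the address transformation the AGU offloads, assuming a row-major interleave across the distributed memory modules (the paper's exact mapping may differ):

```python
def agu_translate(row, col, row_len, n_modules):
    """Map a host (row, col) address to (module, offset) in the
    distributed BLSA memories, replacing per-access software arithmetic."""
    linear = row * row_len + col        # host address space
    module = linear % n_modules         # which physical memory module
    offset = linear // n_modules        # word offset within that module
    return module, offset

print(agu_translate(2, 5, row_len=16, n_modules=4))  # -> (1, 9)
```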

9.
The complexity of hardware/software (HW/SW) interfacing and the lack of portability across different platforms restrain the widespread use of reconfigurable accelerators and limit designer productivity. Furthermore, communication between the SW and HW parts of codesigned applications is typically exposed to SW programmers and HW designers. In this work, we introduce a virtualization layer that allows reconfigurable application-specific coprocessors to access user-space virtual memory and share the memory address space with user applications. The layer, consisting of an operating system (OS) extension and a HW component, shifts the burden of moving data between processor and coprocessor from the programmer to the OS, lowers the complexity of interfacing, and hides the physical details of the system. Not only does the virtualization layer enhance programming abstraction and portability, it also performs runtime optimizations: by predicting future memory accesses and speculatively prefetching data, the virtualization layer improves coprocessor execution, so applications achieve better performance without any user intervention. We use two different reconfigurable systems-on-chip (SoCs) running Linux and codesigned applications to prove the viability of our concept. The applications run faster than their SW versions, and the overhead due to virtualization is limited. Dynamic prefetching in the virtualization layer further reduces the abstraction overhead.

10.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient prefetching of decoupled processors, which make use of the parallelism between address computation and application data processing that exists mainly in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system's performance. The application code is split into two parallel programs: the first runs on the Access processor and computes the addresses of the data in the memory hierarchy; the second processes the application data and runs on the Execute processor, a processor with a limited address space (just the register-file addresses). Each transfer of any block in the memory hierarchy up to the Execute processor's register file is controlled by the Access processor and the DMA units. This strongly differentiates the architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies, and with (c) existing decoupled architectures, showing higher normalized performance. The reason for this gain is the efficient data transfer that the scratch-pad memory hierarchy provides, combined with the ability of decoupled processors to hide memory latency by using memory-management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance increases by up to almost 2 times compared to uniprocessor architectures with scratch-pad memories, and by up to 3.7 times compared to those with caches. The proposed architecture achieves this performance without penalties in energy-delay-product costs.
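A toy software analogue of the access/execute split, with a bounded queue standing in for the DMA-fed scratch-pad transfers; everything here is illustrative, not the paper's hardware.

```python
import queue, threading

def access_program(memory, indices, q):
    """Access side: all address computation lives here."""
    for i in indices:
        q.put(memory[i])           # stand-in for DMA into a scratch-pad
    q.put(None)                    # end-of-stream marker

def execute_program(q, out):
    """Execute side: consumes values only, no address arithmetic."""
    total = 0
    while (v := q.get()) is not None:
        total += v
    out.append(total)

mem = list(range(100))
q, out = queue.Queue(maxsize=8), []
threading.Thread(target=access_program, args=(mem, range(0, 100, 3), q)).start()
execute_program(q, out)
print(out[0])  # sum of the streamed elements
```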

11.
The CSI multimedia architecture (total citations: 1; self-citations: 0; citations by others: 1)
An instruction set extension designed to accelerate multimedia applications is presented and evaluated. In the proposed complex streamed instruction (CSI) set, a single instruction can process vector data streams of arbitrary length and stride and combines complex memory accesses (with implicit prefetching), program control for vector sectioning, and complex computations on multiple data in a single operation. In this way, CSI eliminates overhead instructions (such as instructions for data sectioning, alignment, reorganization, and packing/unpacking) often needed in applications utilizing MMX-like extensions and accelerates key multimedia kernels. Simulation results demonstrate that a superscalar processor extended with CSI outperforms the same processor enhanced with Sun's VIS extension by a factor of up to 7.77 on key multimedia kernels and by up to 35% on full applications.

12.
In this paper, we present an energy-aware informed prefetching technique called Eco-Storage that makes use of application-disclosed access patterns to group the informed-prefetching process in a hybrid storage system (e.g., hard disk drives and solid-state disks). Since SSDs are more energy efficient than HDDs, aggressively prefetching data from the HDD level lets the HDD spend as much time as possible in standby in order to save power. In the Eco-Storage system, the application can still serve its on-demand I/O read requests from the hybrid storage system while data blocks are prefetched in groups from the HDD to the SSD; we show that these two steps can proceed in parallel, which decreases the system's power consumption. Our Eco-Storage technique differs from existing energy-aware prefetching schemes in two ways. First, Eco-Storage is implemented in a hybrid storage system whose SSD level is more energy efficient. Second, it groups the informed-prefetching process and quickly prefetches data from the HDD to the SSD to lengthen the HDD's frequent standby periods, so that the application finds most of its on-demand I/O read requests served at the SSD level. Finally, we developed a simulator to evaluate Eco-Storage's performance. Our results show that Eco-Storage reduces power consumption by at least 75% compared with the worst case of the non-Eco-Storage configuration, using a real-world I/O trace.
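A schematic sketch of the two parallel activities, with stand-in device callbacks rather than a real storage stack: a background worker drains application hints in groups from the HDD into the SSD (so HDD activity is batched and standby periods grow), while on-demand reads are served from the SSD whenever a block has already been staged. All names and the group size are assumptions.

```python
import threading, queue

GROUP = 8                      # assumed prefetch group size
hints = queue.Queue()          # blocks the application says it will read
ssd_cache = set()              # blocks already staged on the SSD
lock = threading.Lock()

def prefetch_worker(read_hdd):
    """Background thread: stage hinted blocks from HDD to SSD in groups."""
    group = []
    while True:
        group.append(hints.get())
        if len(group) == GROUP:
            for blk in group:          # batched HDD access; afterwards the
                read_hdd(blk)          # HDD can drop back into standby
                with lock:
                    ssd_cache.add(blk)
            group = []

def read(blk, read_hdd, read_ssd):
    """On-demand path: prefer the SSD copy when the block is staged."""
    with lock:
        staged = blk in ssd_cache
    return read_ssd(blk) if staged else read_hdd(blk)
```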

13.
14.
In this paper, we present a novel memory access reduction scheme (MARS) for the two-dimensional fast cosine transform (2-D FCT). It targets programmable DSPs with high memory-access latency. It reduces the number of memory accesses by 1) reducing the number of weighting factors and 2) combining butterflies in the vector-radix 2-D FCT pruning diagram from two stages into one stage with an efficient structure. A hardware platform based on a general-purpose processor is used to verify the effectiveness of the proposed method for a vector-radix 2-D FCT pruning implementation. Experimental results validate the benefits of the proposed method, with fewer memory accesses, fewer clock cycles, and less memory space than the conventional implementation.

15.
Video prefetching is a technique that has been proposed for the transmission of variable-bit-rate (VBR) videos over packet-switched networks. The objective of these protocols is to prefetch future frames into the customers' set-top boxes (STBs) during light-load periods. Experimental results have shown that video prefetching is very effective, achieving much higher network utilization (and potentially a larger number of simultaneous connections) than traditional video-smoothing schemes. The previously proposed prefetching algorithms, however, can only be efficiently implemented when there is one centralized server; in a distributed environment their performance degrades significantly. In this paper we introduce a new scheme that combines smoothing with prefetching to overcome the problem of distributed prefetching. We show that our scheme performs almost as well as the centralized prefetching protocol even though it is implemented in a distributed environment. In addition, we introduce a call admission control algorithm for a fully interactive video-on-demand (VoD) system that utilizes this concept of distributed video prefetching. Using the theory of effective bandwidths, we develop an admission control algorithm for new requests based on the user's viewing behavior and the required quality of service (QoS).
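The prefetching principle itself can be sketched in a few lines: in each transmission slot, deliver the frame that is due, then use any leftover channel budget to push future frames into the STB buffer. The function and its parameters are illustrative, not the paper's protocol.

```python
def schedule_slot(frames, due, budget, stb_buffer):
    """frames: per-frame sizes; due: index of the frame that must play now;
    budget: channel capacity for this slot; stb_buffer: frames at the STB."""
    for i in range(due, len(frames)):
        if i in stb_buffer:
            continue                  # already prefetched in a past slot
        if i > due and frames[i] > budget:
            break                     # no spare capacity left this slot
        stb_buffer[i] = frames[i]     # the due frame, then prefetched ones
        budget -= frames[i]
    return stb_buffer
```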

16.
This paper describes debug facilities in the Philips TriMedia CPU64, which is an embedded processor core for multimedia applications. Its architecture provides a VLIW pipeline, support for 64-bit vector data, and virtual memory management. The debug hardware in the TriMedia CPU64 supports two complementary debug strategies. One strategy provides a snapshot of the processor state at certain moments in time, which is achieved by single-step execution and various breakpoint types. The other debug strategy provides continuous monitoring of the processor state by using a PC trace buffer. Precise exceptions are used to provide accurate context switching from application software to debugger software.

17.
As technology scales toward deep submicron, the integration of a large number of IP blocks on the same silicon die is becoming technically feasible, enabling large-scale parallel computations such as those required for multimedia workloads. The communication architecture is becoming the bottleneck for these multiprocessor systems-on-chip (SoCs), and efficient contention-resolution schemes for managing simultaneous access requests to the shared communication resources are required to prevent system performance degradation. The contribution of this work is to analyze the impact of different bus arbitration policies on multiprocessor SoC performance under different communication patterns, showing the distinctive features of each policy and the strong correlation between their effectiveness and the communication requirements of the applications. Beyond traditional arbitration schemes such as round robin and TDMA, another policy is considered that periodically allocates a time slot for contention-free bus utilization to a processor that needs fixed, predictable bandwidth for the correct execution of its time-critical task. The results are derived on a complete and scalable multiprocessor SoC simulation platform based on SystemC, whose software support includes a complete embedded multiprocessor OS (RTEMS). The communication architecture is AMBA compliant, and we exploit the flexibility of this multi-master commercial standard, which does not specify the arbitration algorithm, to implement the explored contention-resolution schemes.
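For concreteness, here is a behavioral sketch of two of the compared policies: plain round robin, and a slot-reservation variant that periodically grants one master a contention-free window for its time-critical task. The period, slot length, and master-id scheme are illustrative parameters, not values from the paper.

```python
def round_robin(requests, last):
    """requests: set of requesting master ids; last: previously granted id.
    Grant the next requesting master after `last`, wrapping around."""
    n = max(requests) + 1 if requests else 0
    for i in range(1, n + 1):
        cand = (last + i) % n
        if cand in requests:
            return cand
    return None

def slot_reserved(requests, last, cycle, rt_master=0, period=100, slot=10):
    """Reserve the first `slot` cycles of each period for rt_master."""
    if cycle % period < slot:          # contention-free window
        return rt_master if rt_master in requests else None
    return round_robin(requests, last)
```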

18.
Big-data analytics applications often rely on traversal algorithms over large sparse graphs, whose defining characteristic is irregular, data-intensive memory access. Taking the frequently used betweenness centrality algorithm, a typical large-sparse-graph traversal workload, as an example, this paper proposes a helper-thread-based multi-parameter prefetch control model and a parameter optimization method, with the goal of improving the performance of irregular, data-intensive programs. On the commodity multi-core platforms Q6600 and I7, betweenness centrality achieves average speedups of 1.20 and 1.11, respectively, across inputs of different sizes. The experimental results show that helper-thread prefetching can effectively improve the performance of this class of irregular applications.
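A behavioral illustration of the helper-thread idea, with the caveat that in Python the "touch" merely models what would be a hardware prefetch on a real machine; the run-ahead distance is one of the tunable parameters such a scheme would optimize, and its value here is an assumption.

```python
import threading

DISTANCE = 32  # assumed run-ahead distance, in vertices

def helper(order, adj, cursor):
    """Run DISTANCE vertices ahead of the main thread and touch the
    neighbor lists it is about to visit (modeling a cache-warming prefetch)."""
    for i, v in enumerate(order):
        while i - cursor[0] > DISTANCE:   # stay DISTANCE ahead, no more
            pass
        _ = adj[v]                        # touch: pull data toward the cache

def main_thread(order, adj, cursor, visit):
    for i, v in enumerate(order):
        cursor[0] = i                     # publish progress to the helper
        visit(v, adj[v])

adj = {v: list(range(v % 7)) for v in range(10_000)}   # toy sparse graph
order, cursor = list(adj), [0]
threading.Thread(target=helper, args=(order, adj, cursor), daemon=True).start()
main_thread(order, adj, cursor, lambda v, nbrs: sum(nbrs))
```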

19.
Embedded and portable systems running multimedia applications create a new challenge for hardware architects. A microprocessor for such applications needs to be easy to program like a general-purpose processor and have the performance and power efficiency of a digital signal processor. This paper presents the codevelopment of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner. The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.

20.
An adaptive network prefetch scheme (total citations: 9; self-citations: 0; citations by others: 9)
In this paper, we present an adaptive prefetch scheme for network use, in which we download files that are very likely to be requested in the near future, based on the user's access history and the network conditions. Our prefetch scheme consists of two parts: a prediction module and a threshold module. In the prediction module, we estimate the probability with which each file will be requested in the near future. In the threshold module, we compute the prefetch threshold for each related server; the access probability is then compared against this prefetch threshold. An important contribution of this paper is a formula for the prefetch threshold that determines its value dynamically based on system load, capacity, and the cost of time and system resources to the user. We also show that, by prefetching those files whose access probability is greater than or equal to their server's prefetch threshold, a lower average cost can always be achieved. As an example, we present a prediction algorithm for web browsing. Simulations of this prediction algorithm show that, by using access information from the client, we can achieve high successful-prediction rates, while using information from the server generally results in more hits.
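The two-module structure can be sketched as follows. The threshold expression below, which simply raises the bar as a server's utilization grows, is an illustrative stand-in; the paper derives its own formula from system load, capacity, and user cost.

```python
def prefetch_threshold(load, base=0.2):
    """load: current server utilization in [0, 1). A busier server gets a
    higher threshold, so prefetching backs off when capacity is scarce.
    This expression is an assumption, not the paper's derived formula."""
    return min(1.0, base / (1.0 - load))

def select_prefetches(predictions, server_load):
    """predictions: file -> (server, access probability)."""
    thr = {srv: prefetch_threshold(l) for srv, l in server_load.items()}
    return [f for f, (srv, p) in predictions.items() if p >= thr[srv]]

predictions = {"a.html": ("s1", 0.9), "b.png": ("s1", 0.3)}
print(select_prefetches(predictions, {"s1": 0.5}))  # -> ['a.html']
```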
