1.

Simulation of high-performance memory allocators

José L. Risco-Martín J. Manuel Colmenar David Atienza J. Ignacio Hidalgo 《Microprocessors and Microsystems》2011,35(8):755-765

For the last 30 years, a large variety of memory allocators have been proposed. Since the performance, memory usage and energy consumption of each memory allocator differ, software engineers often face difficult choices in selecting the most suitable approach for their applications. To this end, custom allocators are developed from scratch, which is a difficult and error-prone process. This issue has special impact in the field of portable consumer embedded systems, which must execute a limited number of multimedia applications, demanding high performance and extensive memory usage at a low energy consumption. This paper presents a flexible and efficient simulator to study Dynamic Memory Managers (DMMs), each a composition of one or more memory allocators. This novel approach allows programmers to simulate custom and general DMMs, which can be composed without incurring any additional runtime overhead or additional programming cost. We show that this infrastructure simplifies DMM construction, mainly because the target application does not need to be compiled every time a new DMM must be evaluated and because we propose a structured method to search and build DMMs in an object-oriented fashion. Within a search procedure, the system designer can choose the “best” allocator by simulation for a particular target application and embedded system. In our evaluation, we show that our scheme delivers better performance, less memory usage and less energy consumption than single memory allocators.

2.

Memory power optimization of Java-based embedded systems exploiting garbage collection information

Jose Manuel Velasco David Atienza Katzalin Olcoz 《Journal of Systems Architecture》2012,58(2):61-72

Nowadays, Java is used in all types of embedded devices. For these memory-constrained systems, the automatic dynamic memory manager (Garbage Collector or GC) has always been a key factor in terms of the Java Virtual Machine (JVM) performance. Moreover, in current embedded platforms, power consumption is becoming as important as performance. Thus, in this paper we present an exploration, from an energy viewpoint, of the different possibilities of memory hierarchies for high-performance embedded systems when used by state-of-the-art GCs. This is a starting point for a better understanding of the interactions between the Java applications, the memory hierarchy and the GC. Hence, we subsequently present two techniques to reduce energy consumption on Java-based embedded systems, based on exploiting GC information. The first technique uses GC execution behavior to reduce leakage energy consumption, taking advantage of the low-power mode of current multi-banked SDRAM memories, and it is intended for generational collectors. This technique can achieve a reduction of up to 50% in SDRAM memory leakage. The second technique involves the inclusion of a software-controlled (scratchpad) memory that stores GC instructions under JVM control to reduce the active energy consumption and also improve the performance of the target embedded system, and it is aimed at all kinds of garbage collectors. For this last technique we have experimented with two different approaches for selecting the GC code to be stored in the scratchpad memory: one static and one dynamic.
Our experimental results show that the proposed dynamic scratchpad management approach for GCs enables up to 63% energy consumption reduction and 25% performance improvement during the collector phase, which means, in terms of JVM execution, a global reduction of 29% and 17% for energy and cycles, respectively. Overall, this work outlines that the key to an efficient low-power implementation of Java Virtual Machines for high-performance embedded systems is the synergy between the GC choice, the memory architecture tuning, and the inclusion of power management schemes controlled by the JVM, exploiting knowledge of the GC behavior.

3.

Human action recognition using multi-view depth motion maps

刘婷婷 李玉鹏 张良 《中国图象图形学报》 (Journal of Image and Graphics) 2019,24(3):400-409

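Objective: Human action recognition methods based on the motion history point cloud (MHPC) incur high computational complexity during feature extraction because of the large volume of point cloud data. Methods based on depth motion maps (DMMs) extract features simply, but the action information they capture is incomplete, which caps the achievable recognition accuracy. To address these problems, a human action recognition algorithm based on multi-view depth motion maps is proposed. Method: First, an MHPC is generated from the depth map sequence to represent the action, and the MHPC is then rotated by specific angles to supply action information from additional views. Next, the original and rotated MHPCs are projected onto the Cartesian coordinate planes to generate multi-view depth motion maps, from which histograms of oriented gradients are extracted and concatenated into a feature vector. Finally, a support vector machine classifies the feature vectors, and the algorithm is validated on MSR Action3D and a self-built database. Results: The MSR Action3D database has two experimental settings. Under setting 1, the recognition rate of the algorithm is 96.8%, which is 2.5% higher than the APS_PHOG (axonometric projections and PHOG feature) algorithm, 1.9% higher than the DMM algorithm, and 1.1% higher than the DMM_CRC (depth motion maps and collaborative representation classifier) algorithm. Under setting 2, the recognition rate is 93.82%, which is 5.09% higher than the DMM algorithm and 4.93% higher than the HON4D (histogram of oriented 4D surface normal) algorithm. On the self-built database, the recognition rate reaches 97.98%, which is 3.98% higher than the MHPC algorithm. Conclusion: The experimental results show that multi-view depth motion maps not only resolve the complexity of feature extraction from the MHPC but also let the DMM include action information from more views, effectively improving the accuracy of human action recognition.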

4.

Research progress on software energy consumption optimization techniques (cited by: 4; self-citations: 0; cited by others: 4)

赵霞 郭耀 陈向群 《计算机研究与发展》 (Journal of Computer Research and Development) 2011,48(12)

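To design high-performance, low-energy systems, both hardware design and software design must be considered in order to reach the best trade-off between performance and energy consumption. This paper studies the use of software techniques to reduce system energy consumption, focusing on low-energy software optimization and evaluation techniques for the system development phase. Optimization techniques fall into three classes: instruction-level optimization, algorithm-level optimization, and software architecture optimization. The problems faced in each class and the current research progress are described. Software energy estimation, the key supporting technology for low-energy software optimization, is discussed in depth, and the main problems and research progress of processor-oriented and full-system-oriented software energy estimation are identified and analyzed. Finally, the main open problems and development trends for further research are outlined.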

5.

A parallel evolutionary algorithm to optimize dynamic data types in embedded systems (cited by: 1; self-citations: 1; cited by others: 0)

José L. Risco-Martín David Atienza J. Ignacio Hidalgo Juan Lanchares 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2008,12(12):1157-1167

New multimedia embedded applications are increasingly dynamic, and rely on dynamically-allocated data types (DDTs) to store their data. The optimization of DDTs for each target embedded system is a time-consuming process due to the large search space of possible DDT implementations. That implies the minimization of embedded design variables (memory accesses, power consumption and memory usage). Up to now, some very effective heuristic algorithms have been developed to solve this problem, but it is unknown how good the selected DDTs are, since the problem is NP-complete and cannot be fully explored. In these cases the use of parallel processing can be very useful, because it allows not only exploring more solutions in the same time but also implementing new algorithms. This paper describes several parallel evolutionary algorithms for DDT optimization in embedded systems, where parallelism improves the solutions found by the corresponding sequential algorithm, which is itself quite effective compared with other previously proposed procedures. Experimental results show how a novel parallel multi-objective genetic algorithm, which combines NSGA-II and SPEA2, allows designers to reach a larger number of solutions than previous approximations.

6.

Two-level caches tuning technique for energy consumption in reconfigurable embedded MPSoC

《Journal of Systems Architecture》2013,59(8):656-666

In order to meet the ever-increasing computing requirements of the embedded market, multiprocessor chips were proposed as the best way out. In this work we investigate the energy consumption of these embedded MPSoC systems. One efficient solution for reducing energy consumption is to reconfigure the cache memories. This approach has been applied to architectures with one cache level and one processor, but has not yet been investigated for multiprocessor architectures with two-level caches. The main contribution of this paper is to explore a multiprocessor architecture with two-level caches (L1/L2) by estimating its energy consumption. Using a simulation platform, we first build a multiprocessor architecture, and then we propose a new algorithm that tunes the two-level cache memory hierarchy (L1 and L2). The cache tuning approach is based on three parameters: cache size, line size, and associativity. To find the best cache configuration, the application is divided into several execution intervals, and for each interval we generate the best cache configuration. Finally, the approach is validated using a set of open source benchmarks (SPEC 2006, Splash-2, MediaBench), and we discuss the performance in terms of speedup and energy reduction.

7.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations

《Parallel Computing》2014,40(5-6):86-99

Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.

8.

Virtual duplication and mapping prefetching for emerging storage primitives in NAND flash memory storage systems

《Microprocessors and Microsystems》2017

NAND flash memory has become the mainstream storage medium for both enterprise high performance computers and embedded systems. However, over the past several decades, the storage primitives that access secondary storage have remained unchanged, forcing NAND flash memory to serve merely as a block device like a hard disk drive. Recently, several emerging storage primitives have been presented to explore the potential value of non-volatile memory devices. Although these primitives can significantly boost access performance by providing virtual-to-logical address mappings, they still suffer from a large RAM footprint to maintain the address mapping table and require further support for update operations. This paper presents ESP to optimize Emerging Storage Primitives with virtualization for flash memory storage systems. We propose two optimization strategies, virtual duplication and mapping prefetching, to solve the critical issues in existing emerging storage primitives. The objective is to reduce unnecessary flash memory accesses and keep the RAM footprint of the address mapping table well under control. We have evaluated ESP on an embedded development platform. Experimental results show that ESP can significantly improve the write/read performance and reduce over 30% of garbage collection operations.

9.

An analytical framework for high-speed hardware particle swarm optimization

《Microprocessors and Microsystems》2020

Engineering optimization techniques are computationally intensive and can challenge implementations on tightly-constrained embedded systems. Particle Swarm Optimization (PSO) is a well-known bio-inspired algorithm that is adopted in various applications, such as transportation, robotics, and energy. In this paper, a high-speed PSO hardware processor is developed with a focus on outperforming similar state-of-the-art implementations. In addition, the investigation comprises the development of an analytical framework that captures wide characteristics of optimization algorithm implementations, in hardware and software, using key simple and combined heterogeneous indicators. The framework proposes a combined Optimization Fitness Indicator that can classify the performance of PSO implementations when targeting different evaluation functions. The two targeted processing systems are Field Programmable Gate Arrays for hardware implementations and a high-end multi-core computer for software implementations. The investigation confirms the successful development of a PSO processor with appealing performance characteristics that outperforms recently presented implementations. The proposed hardware implementation attains an improvement ratio of 23,300 in execution time with an elliptic evaluation function. In addition, a speedup of 1777 times is achieved with a Shifted Schwefel's function. The developed framework successfully classifies PSO implementations according to multiple and heterogeneous properties for a variety of benchmark functions.

10.

A Parallel Dynamic Binary Translator for Efficient Multi-Core Simulation

Oscar Almer Igor Böhm Tobias Edler von Koch Björn Franke Stephen Kyle Volker Seeker Christopher Thompson Nigel Topham 《International journal of parallel programming》2013,41(2):212-235

In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry standard Splash-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations.

11.

An online electricity cost budgeting algorithm for maximizing green energy usage across data centers

Hui Dou Yong Qi 《Frontiers of Computer Science》2017,11(4):661-674

With the rapid development of Internet services, the power usage in data centers has been increasing significantly. This ever-increasing energy consumption leads to negative environmental impacts such as global warming. To reduce their carbon footprints, large Internet service operators have begun to use green energy. Since green energy is currently more expensive than traditional brown energy, it is important for the operators to maximize green energy usage subject to their desired long-term (e.g., a month) cost budget constraint. In this paper, we propose an online algorithm, GreenBudget, based on the Lyapunov optimization framework. We prove that our algorithm is able to achieve a delicate tradeoff between green energy usage and the enforcement of the cost budget constraint, and that a control parameter V is the knob to arbitrarily tune such a tradeoff. We evaluate GreenBudget using real-life traces of user requests, cooling efficiency, electricity price and green energy availability. Experimental results demonstrate that under the same cost budget constraint, GreenBudget can increase green energy usage by 11.55% compared with the state-of-the-art work, without incurring any performance violation of user requests.

12.

Online parallel DTW algorithm for embedded speech recognition systems (cited by: 2; self-citations: 0; cited by others: 2)

姜干新 陈伟 《计算机应用研究》 (Application Research of Computers) 2010,27(3):977-980

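To improve the real-time performance of speech recognition systems, an online parallel DTW (dynamic time warping) algorithm suitable for embedded speech recognition systems is proposed, drawing on dynamic programming and parallel computing. By analyzing standard DTW and its main derivative algorithms, the data structure of the DTW algorithm is modified to meet the requirements of an online algorithm: while searching for the optimal path, memory is either allocated and released dynamically and continuously or preallocated with a fixed size, and the DTW computations for multiple keywords are distributed across multiple processing units. Finally, the results of the processing units are aggregated to obtain the recognition result. Experiments show that the algorithm reduces memory usage and recognition time compared with classic DTW and brings the real-time factor of speech recognition to 1.17, giving good real-time performance.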
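
For readers unfamiliar with DTW, the memory saving that an online variant relies on can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it is the textbook dynamic-programming recurrence, written so that only two rows of the cost matrix are alive at any time and each incoming query frame is consumed once. The names `dtw_distance` and `recognize`, the one-dimensional features, and the absolute-difference frame cost are all assumptions made for illustration. The per-keyword loop in `recognize` runs sequentially here; the abstract describes distributing that work over several processing units.

```python
from typing import Mapping, Sequence


def dtw_distance(ref: Sequence[float], query: Sequence[float]) -> float:
    """Cumulative DTW cost between a stored template and a query.

    Only two rows of the cost matrix are kept, so memory grows with the
    template length rather than with template length times query length.
    """
    inf = float("inf")
    prev = [inf] * (len(ref) + 1)
    prev[0] = 0.0  # empty template aligned with empty query
    for q in query:  # frames can be processed as they arrive
        curr = [inf] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            cost = abs(q - r)  # assumed local distance between frames
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(ref)]


def recognize(templates: Mapping[str, Sequence[float]],
              query: Sequence[float]) -> str:
    """Return the keyword whose template has the lowest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(templates[name], query))


if __name__ == "__main__":
    # Hypothetical toy templates, not data from the paper.
    keywords = {"up": [1.0, 2.0, 3.0], "flat": [5.0, 5.0, 5.0]}
    print(recognize(keywords, [1.0, 2.0, 2.0, 3.0]))
```

Because each keyword's cost is computed independently, the calls inside `recognize` could be handed to separate workers and the minimum taken once all of them have reported.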

13.

Comparative study of task duplication static scheduling versus clustering and non-clustering techniques

Behrooz Shirazi Hsing-Bung Chen Jeff Marquis 《Concurrency and Computation》1995,7(5):371-389

One of the major issues that needs to be addressed in distributed memory multiprocessor (DMM) systems is the program task partitioning and scheduling problem, i.e. the mapping of an application program's precedence-related task threads among the processing elements of a DMM system. The optimal task partitioning and scheduling problem, with the goal of minimizing the program execution time and interprocessor communication overhead, is known to be an NP-complete problem. The paper addresses the design, development and performance evaluation of a novel static task partitioning and scheduling method called linear clustering with task duplication (LCTD). LCTD employs the linear (sequential) execution of tasks and task duplication heuristics in achieving minimized computation and interprocessor communication delays in DMMs. The superiority of the proposed LCTD algorithm is demonstrated through simulation studies and comparison against several of the existing static scheduling schemes, such as heavy node first (HNF) and linear clustering. We show that the proposed method can obtain an average of 33% improvement in program execution time and 21% improvement in processor utilization compared to linear clustering and HNF methods.

14.

Loop scheduling and bank type assignment for heterogeneous multi-bank memory

Meikang Qiu Minyi Guo Meiqin Liu Chun Jason Xue Laurence T. Yang Edwin H.-M. Sha 《Journal of Parallel and Distributed Computing》2009

Many high-performance DSP processors employ multi-bank on-chip memory to improve performance and reduce energy consumption. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to be executed in parallel. However, making effective use of multi-bank memory remains difficult, considering the combined effect of performance and energy requirements. This paper studies the scheduling and assignment problem of minimizing the total energy consumption while satisfying the timing constraint with heterogeneous multi-bank memory for applications with loops. An algorithm, TASL (Type Assignment and Scheduling for Loops), is proposed. The algorithm uses bank type assignment with the consideration of variable partition to find the best configuration for both memory and ALU. The experimental results show that the average improvement in energy saving is significant when using TASL.

15.

A spill data aware memory assignment technique for improving power consumption of multimedia memory systems

Youn Jonghee Cho Doosan 《Multimedia Tools and Applications》2019,78(5):5463-5478

As embedded memory technology evolves, the traditional Static Random Access Memory (SRAM) technology has reached the end of its development. As manufacturing process technology scales further, next-generation memory technology is needed because of the exponentially increasing leakage current of SRAM. Non-volatile memories such as STT-MRAM (Spin Torque Transfer Magnetic Random Access Memory) and PCM (Phase Change Memory) are good candidates for replacing SRAM technology in embedded memory systems. They have many advanced characteristics in terms of power consumption, leakage power, size (density) and latency. Nonetheless, nonvolatile memories have two major problems that hinder their use as the next-generation memory. First, the lifetime of a nonvolatile memory cell is limited by the number of write operations. Second, a write operation incurs more latency and power than a read operation of the same size. This study describes a compiler optimization technique to overcome these disadvantages of a nonvolatile memory component in hybrid cache memories. Specifically, to minimize the number of write operations to nonvolatile memory, we present a data replacement technique that considers the locations of the register spill data. A large share of memory accesses is caused by the spill data of the register allocator in an optimizing compiler. Such spill data can be partially removed using a recalculation method. Thus, we implemented an optimization technique that rearranges the data placement with recalculation to minimize the write instructions on the nonvolatile memory. Our experimental results show that the proposed technique can reduce the average number of spill codes by 20%, and improves the energy consumption by 20.2% on average.

16.

Effectiveness Analysis of DVFS and DPM in Mobile Devices

Youngbin Seo Jeongki Kim Euiseong Seo 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2012,27(4):781-790

The demand for high-performance embedded processors in multimedia mobile electronics is growing, and their power consumption thus increasingly threatens battery lifetime. It is usually believed that the dynamic voltage and frequency scaling (DVFS) feature saves significant energy by changing the performance levels of processors to match the performance demands of applications on the fly. However, because the energy efficiency of embedded processors is rapidly improving, the effectiveness of DVFS is expected to change. In this paper, we analyze the benefit of DVFS in state-of-the-art mobile embedded platforms in comparison to those in servers or PCs. To obtain a clearer view of the relationship between power and performance, we develop a measurement methodology that can synchronize time series for power consumption with those for processor utilization. The results show that DVFS hardly improves the energy efficiency of mobile multimedia electronics, and can even significantly worsen energy efficiency and performance in some cases. According to this observation, we suggest that power management for mobile electronics should concentrate on adaptive and intelligent power management for peripheral devices. As a preliminary design, we implement an adaptive network interface card (NIC) speed control that reduces power consumption by 10% when the NIC is not heavily used. Our results provide valuable insights into the design of power management schemes for future mobile embedded systems.

17.

Reliability-conscious energy management for fixed-priority real-time embedded systems with weakly hard QoS-constraint

《Microprocessors and Microsystems》2016

Aggressive scaling in technology size has dramatically increased the power density and degraded the reliability of real-time embedded systems. In this paper, we study the problem of reliability-conscious energy minimization for scheduling fixed-priority real-time embedded systems with weakly hard QoS-constraint. The weakly hard QoS-constraint is modeled with the (m, k)-constraint, which requires that at least m out of any k consecutive jobs of a task meet their deadlines. We first propose a technique that can balance the static and dynamic energy consumption for real-time jobs with better speed determination than the classical strategies during their feasible intervals. Then, based on it, we propose an adaptive fixed-priority scheduling scheme to reduce the energy consumption of the system while preserving its reliability. Through extensive simulations, our experimental results demonstrate that the proposed techniques can significantly outperform the previous research in energy performance while satisfying the weakly hard QoS-constraint under the reliability requirement.
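
The (m, k)-constraint defined in this abstract is easy to state as a check over a job trace. The Python sketch below is an illustration only, not the paper's scheduler: the function name `satisfies_mk` and the representation of a trace as a list of booleans (True when a job met its deadline) are assumptions. It slides a window of k jobs over the trace and verifies that each window contains at least m deadline hits.

```python
from typing import Sequence


def satisfies_mk(met: Sequence[bool], m: int, k: int) -> bool:
    """Check that every window of k consecutive jobs has >= m deadline hits.

    `met[i]` is True when job i met its deadline. A trace shorter than k
    contains no complete window and is treated as satisfying the constraint.
    """
    if k <= 0 or not 0 <= m <= k:
        raise ValueError("require k > 0 and 0 <= m <= k")
    if len(met) < k:
        return True
    hits = sum(met[:k])  # deadline hits in the first window
    if hits < m:
        return False
    for i in range(k, len(met)):
        # Slide the window by one job: add the newest, drop the oldest.
        hits += met[i] - met[i - k]
        if hits < m:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical traces for a (2, 3)-constraint.
    print(satisfies_mk([True, False, True, True, False, True], 2, 3))
    print(satisfies_mk([True, False, False, True], 2, 3))
```

A scheduler that exploits this constraint may skip or slow down some jobs to save energy, as long as a check of this kind would still pass on the resulting trace.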

18.

SoMMA: A software-managed memory architecture for multi-issue processors

《Microprocessors and Microsystems》2020

Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides a better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy when comparing two modified ρVEX processors and their baselines, at full-system level comparisons. SoMMA also shows reduction of up to 41.5% on full-system EDP, maintaining the same processor area as baseline processors.
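
The energy-delay product named in this abstract is energy multiplied by execution time, so a design that saves energy and also runs faster improves EDP on both factors. The short Python sketch below shows the arithmetic under stated assumptions: `relative_edp` is a hypothetical helper, and the inputs in the example are made-up ratios, not measurements from the paper.

```python
def relative_edp(energy_ratio: float, speedup: float) -> float:
    """EDP of a modified design relative to its baseline.

    EDP = energy * delay, and delay scales as 1 / speedup, so the
    relative EDP is energy_ratio / speedup. Values below 1.0 mean the
    modified design has a lower (better) EDP than the baseline.
    """
    if energy_ratio <= 0 or speedup <= 0:
        raise ValueError("ratios must be positive")
    return energy_ratio / speedup


if __name__ == "__main__":
    # Hypothetical inputs, not figures from the paper:
    # 10% less energy combined with a 1.1x speedup.
    reduction = 1 - relative_edp(energy_ratio=0.9, speedup=1.1)
    print(f"EDP reduction: {reduction:.1%}")
```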

19.

Data memory power optimization and performance exploration of embedded systems for implementing motion estimation algorithms

《Real》2003,9(6):371-386

A memory power optimization and performance exploration methodology based on high-level (C language) code transformations is introduced. It allows the system designer to explore various trade-offs among data memory power, data memory area and performance early in the design process of embedded multimedia systems. This exploration strategy is introduced for both single and multiprocessor environments. The latter requires partitioning of the application. After software transformations are applied, the experimental results, obtained using four well-known motion estimation kernels, provide insight into the performance and energy consumption trade-offs, compare memory hierarchies for the ARM programmable core, and prove the validity of the proposed approach.

20.

Performance-directed energy management for storage systems

Xiaodong Li Zhenmin Li Pin Zhou Yuanyuan Zhou Adve S.V. Kumar S. 《Micro, IEEE》2004,24(6):38-49

Energy consumption has become an important issue in the design of battery-operated mobile devices and sophisticated data centers. The storage hierarchy, which includes memory and disks, is a major energy consumer in such systems, especially for high-end servers at data centers. Much work has focused on energy control algorithms for storage systems that transition a device into a low power mode when a certain usage function exceeds a specified threshold. These algorithms are difficult to use in real systems, however, because designers must painstakingly and manually tune threshold values, and even then a performance guarantee is difficult. To address these limitations, we develop three algorithms: 1) a performance guarantee technique that designers can use with any underlying energy-control algorithm; 2) a performance-directed control algorithm that periodically assigns a static configuration to different devices by solving an optimization problem; 3) another performance-directed control algorithm that dynamically self-tunes according to an optimal set of thresholds.