共查询到18条相似文献,搜索用时 125 毫秒
1.
本文提出了一种基于硬件抽象机的动态翻译技术,它可用于实现Java处理器.该技术采用了硬件抽象机的"模糊执行"(HAM)方法,通过分析Java程序之间的相关性,动态地将Java字节码转换成基于标签的类RISC指令.然后,将堆栈折叠与动态翻译相结合进一步优化指令.应用该技术设计了一个Java指令级并行处理器,并且扩展它,支持Java多线程功能. 相似文献
2.
Java虚拟机的设计是基于堆栈的,它的性能由数据相关性而被限制.为了提高JVM的性能,于是sun公司提出了堆栈操作折叠机制并且用于picoJava Ⅰ、Ⅱ处理器,它折叠了42.3%的堆栈操作.通过把连续的字节码与预先定义的类型在指令译码器中对比,那么push、pop操作的数量就能被减少.文中为Java处理器设计了一种简单的指令折叠器,最终在FPGA上加以实现,从而大大地提高了JVM的性能. 相似文献
3.
文中结合PicoJava和JOP等一些经典的Java处理器的优势,设计了一种基于RISC结构的Java处理器.它充分利用了Java指令折叠技术和精简指令集处理器的优势,不仅降低了设计复杂度,而且在很大程度上提高了Java处理器的性能. 相似文献
4.
面向VLIW结构的高性能代码生成技术 总被引:1,自引:1,他引:0
DSP处理器通过采用VLIW结构获得了高性能,同时也增加了编译器为其生成汇编代码的难度.代码生成器作为编译器的代码生成部件,是VLIW结构能够发挥性能的关键.由此提出并实现了一种基于可重定向编译框架的代码生成器.该代码生成器充分利用VLIW的体系结构特点,支持SIMD指令,支持谓词执行,能够生成高度指令级并行的汇编代码,显著提高应用程序的执行性能. 相似文献
5.
6.
设计了一款面向嵌入式控制领域的16位堆栈处理器,该处理器包含两个堆栈:执行数学表达式的数据堆栈和支持子程序调用的返回堆栈,其指令集含35条堆栈指令.详细给出了该堆栈处理器的体系结构及设计方法;不仅采用简单有效的指令编码方式缩小了代码体积,同时给出了单周期操作多个堆栈元素的解决方法.该处理器采用FPGA实现,在XC5VLX110T芯片上的运行时钟频率最高达到146.7MHz.最后给出了设计的软件仿真与硬件综合结果. 相似文献
7.
为了简化不同体系结构间代码迁移工作,提出一种面向具有超长指令字架构的数字信号处理器的汇编级翻译的方法.前端分析将汇编代码中的指令信息同语义映射为机器无关的中间表示.采用路径探测法移除分支指令延迟槽构建指令流图,并重构源程序控制流图.基于各条指令的时间戳分配和指令间的数据依赖关系分析,移动代码和修改时间戳来线性化并行代码.实验证明,该方法能够正确翻译汇编程序. 相似文献
8.
9.
使DSP处理器达到高性能有多种方法。然而,传统的DSP性能几乎总是以MIPS来衡量的。传统DSP通常在每个时钟周期仅完成一次操作,因此MIPS可直接对应于以MHz表示的处理器频率。随着指令级并行(ILP)技术的出现,这些指标变得意义不大。ILP 相似文献
10.
11.
N. Vassiliadis N. Kavvadias G. Theodoridis S. Nikolaidis 《International Journal of Electronics》2013,100(6):421-438
In this paper, the architecture of an embedded processor extended with a tightly-coupled coarse-grain reconfigurable functional unit (RFU) is proposed. The efficient integration of the RFU with the control unit and the datapath of the processor eliminate the communication overhead between them. To speed up execution, the RFU exploits instruction level parallelism (ILP) and spatial computation. Also, the proposed integration of the RFU efficiently exploits the pipeline structure of the processor, leading to further performance improvements. Furthermore, a development framework for the introduced architecture is presented. The framework is fully automated, hiding all reconfigurable hardware related issues from the user. The hardware model of the architecture was synthesized in a 0.13?µm process and all information regarding area and delay were estimated and presented. A set of benchmarks is used to evaluate the architecture and the development framework. Experimental results prove performance improvements in addition to potential energy reduction. 相似文献
12.
This paper proposes a pure software technique "error detection by duplicated instructions" (EDDI), for detecting errors during usual system operation. Compared to other error-detection techniques that use hardware redundancy, EDDI does not require any hardware modifications to add error detection capability to the original system. EDDI duplicates instructions during compilation and uses different registers and variables for the new instructions. Especially for the fault in the code segment of memory, formulas are derived to estimate the error-detection coverage of EDDI using probabilistic methods. These formulas use statistics of the program, which are collected during compilation. EDDI was applied to eight benchmark programs and the error-detection coverage was estimated. Then, the estimates were verified by simulation, in which a fault injector forced a bit-flip in the code segment of executable machine codes. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in eight benchmark programs with EDDI, while on average, 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and actual fault-injection experiments, EDDI can provide over 98% fault-coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware, but they need dependability in the computer system. To reduce the performance overhead, EDDI schedules the instructions that are added for detecting errors such that "instruction-level parallelism" (ILP) is maximized. Performance overhead can be reduced by increasing ILP within a single super-scalar processor. The execution time overhead in a 4-way super-scalar processor is less than the execution time overhead in the processors that can issue two instructions in one cycle 相似文献
13.
A scalable processor architecture for multi-threaded JavaTM applications is presented. The proposed architecture consists of multiple application-specific processing elements, each able to execute a single thread at one time. The architecture is evaluated by implementing a portable and scalable Java machine on an FPGA board for demonstration 相似文献
14.
BIOS的设计与实现 总被引:5,自引:1,他引:4
文章详细阐述了BIOS的基本组成框架,提出了一个适合于检测工控机硬件的BIOS上电自检流程,并就设计中的几个关键性问题:正确性,兼容性和可移植性,以及压缩算法等进行了分析,最后整个BIOS在西北工业大学航空微电子中心自主研发的龙腾S1系统(PC104兼容)平台上进行了严格的验证. 相似文献
15.
JCP(Java Card Processor)是一种直接支持Java卡虚拟机运行的16位RISC微处理器.但Java卡虚拟机的支持面向对象的字节码指令功能较复杂,用硬件直接实现需要消耗大量的资源,不适合智能卡等硬件资源有限的系统.JCP提供一种硬件陷阱机制,在执行此类指令时,切换到相应的陷阱处理程序中,用软件仿真它们的功能.文章讨论了Java卡虚拟机二进制文件特点,软件仿真指令的面向对象的功能及其具体实现.通过仿真基于JCP的Java卡操作系统和应用程序,验证了软件仿真指令实现的正确性. 相似文献
16.
The paper proposes a novel heuristic technique for integrated hardware-software partitioning, hardware design space exploration and scheduling. The technique maps an application specified as a task graph on a heterogeneous architecture with an objective to minimize the latency of the task graph subject to the area constraint on the hardware coprocessor. The technique uses an iterative approach where the partitioner decides the processor mapping and HW design points of some tasks. The scheduler then simultaneously decides the processor mapping, HW design point and schedule time of the remaining tasks. There exists a tight coupling between the two design stages allowing them to produce superior quality designs in fewer iterations. The technique accounts for the time overheads due to inter-processor /intra-processor communication and shared memory access conflicts. It can therefore be used for both communication intensive and computation intensive applications. The technique also considers dynamic reconfiguration capability of the hardware coprocessor. The technique performs tradeoff analysis and maps hardware tasks to mutually exclusive temporal segments if this results in lower latency. The effectiveness of the technique is demonstrated by a case study of the JPEG image compression algorithm, comparison with an optimal ILP based approach and experimentation with synthetic graphs. 相似文献
17.
This paper describes a new architecture for JAVA-based, interactive multimedia applications. A hardware implementation of a Java Virtual Machine (JVM) is proposed, which allows the direct execution of Java bytecode. In a single clock cycle, up to 3 bytecode instructions can be decoded and executed in parallel using a RISC pipeline. A splitable 64-bit ALU implementation addresses demanding processing requirements of typical multimedia signal processing schemes. The on-chip caches are adapted to the specific data structures of the JVM. The proposed architecture supports execution of multiple Java threads in parallel. An implementation of basic building blocks of the processor with a standard-cell library provides an estimate of 150 MHz clock-speed for a 0.35 m 3 metal layer CMOS process. With a size of less than 10 mm2 needed for the core logic, it is possible to integrate multiple JVMs together with larger cache memories on a single chip. Based on these results, we discuss various performance aspects of JAVA for use in future multimedia terminals. 相似文献
18.
Java possesses many advantages for embedded system development, including fast product deployment, portability, security, and a small memory footprint. As Java makes inroads into the market for embedded systems, much effort is being invested in designing real-time garbage collectors. The proposed garbage-collected memory module, a bitmap-based processor with standard DRAM cells is introduced to improve the performance and predictability of dynamic memory management functions that include allocation, reference counting, and garbage collection. As a result, memory allocation can be done in constant time and sweeping can be performed in parallel by multiple modules. Thus, constant time sweeping is also achieved regardless of heap size. This is a major departure from the software counterparts where sweeping time depends largely on the size of the heap. In addition, the proposed design also supports limited-field reference counting, which has the advantage of distributing the processing cost throughout the execution. However, this cost can be quite large and results in higher power consumption due to frequent memory accesses and the complexity of the main processor. By doing reference counting operation in a coprocessor, the processing is done outside of the main processor. Moreover, the hardware cost of the proposed design is very modest (about 8000 gates). Our study has shown that 3-bit reference counting can eliminate the need to invoke the garbage collector in all tested applications. Moreover, it also reduces the amount of memory usage by 77 percent. 相似文献