Similar Documents
20 similar documents found (search time: 31 ms).
1.
The big data stream computing platform Apache Storm schedules tasks with a round-robin policy by default, which ignores both the differing computational cost of the tasks in a topology and the different communication patterns between tasks, leaving considerable room for optimization in load balancing and communication overhead. To address this, a task scheduling algorithm based on weights for Storm (TSAW-Storm) is proposed. The algorithm first derives vertex weights from each task's CPU usage and edge weights from the volume of data flowing between tasks; it then builds up the task set assigned to each worker node by maximizing the gain in edge weight, so that, while keeping the cluster load balanced, inter-node data streams with large edge weights are converted into intra-node streams as far as possible, reducing network transmission overhead. Experimental results show that, in a WordCount benchmark with 8 worker nodes, TSAW-Storm reduces system latency and inter-node traffic by 30.0% and 32.9% respectively compared with Storm's default scheduler, and the standard deviation of the worker nodes' CPU load is only 25.8% of that of the default scheduler. In a further comparison with an online scheduling algorithm, TSAW-Storm reduces system latency, inter-node traffic and the CPU load standard deviation by 7.76%, 11.8% and 5.93% respectively, with markedly lower scheduling overhead, effectively improving the operating efficiency of the Storm system.
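A minimal Python sketch of the greedy placement idea this abstract describes: tasks carry CPU-cost vertex weights, task pairs carry traffic edge weights, and each task goes to the worker whose current assignment yields the largest edge-weight gain, subject to a simple load cap. The task names, weights and capacity rule are illustrative and not taken from the paper.

```python
# Illustrative greedy placement: maximize intra-node edge weight (keep heavy
# traffic local) while capping per-node CPU load. All values are made up.

def schedule(tasks, edges, nodes, cpu_cap):
    """tasks: {task: cpu_weight}; edges: {(t1, t2): traffic}; nodes: node ids."""
    assignment = {}                      # task -> node
    load = {n: 0.0 for n in nodes}       # accumulated CPU weight per node

    # Place heavier tasks first so the load cap is easier to respect.
    for task in sorted(tasks, key=tasks.get, reverse=True):
        best_node, best_gain = None, -1.0
        for node in nodes:
            if load[node] + tasks[task] > cpu_cap:
                continue                 # would violate the balance cap
            # Edge-weight gain: traffic to tasks already placed on this node.
            gain = sum(w for (a, b), w in edges.items()
                       if (a == task and assignment.get(b) == node)
                       or (b == task and assignment.get(a) == node))
            if gain > best_gain or (gain == best_gain
                                    and load[node] < load.get(best_node, float("inf"))):
                best_node, best_gain = node, gain
        if best_node is None:            # cap too tight: fall back to least-loaded node
            best_node = min(nodes, key=load.get)
        assignment[task] = best_node
        load[best_node] += tasks[task]
    return assignment

if __name__ == "__main__":
    tasks = {"spout": 1.0, "split": 2.0, "count": 2.0, "report": 0.5}
    edges = {("spout", "split"): 10.0, ("split", "count"): 8.0, ("count", "report"): 1.0}
    print(schedule(tasks, edges, nodes=["n1", "n2"], cpu_cap=4.0))
```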

2.
Distributed cluster environments make real-time computation more complex, and the correctness of big data stream processing systems is hard to guarantee. Existing big data benchmarking frameworks can measure the performance of stream processing systems, but they generally suffer from overly simple application scenarios and insufficient evaluation metrics. To address this challenge, this paper builds a stream benchmarking framework around a stock trading scenario: it generates high-frequency stock trading data and measures a system's latency, throughput, GC time, CPU usage and other metrics under high input rates, and it verifies the scalability of the stream processing system through horizontal tests. Apache Spark Streaming is used as the system under test. Experimental results show performance degradation, such as increased latency and longer GC time, under high input rates, caused by the higher input rate and the increased parallelism.
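As a self-contained sketch of the kind of workload such a benchmark generates and measures, the snippet below produces synthetic high-frequency trade events stamped at creation time and reports per-run latency and throughput for a stand-in consumer. The event fields, rates and metrics collection are invented for illustration and are not the paper's generator.

```python
import random, time

def gen_trades(rate_per_sec, duration_sec):
    """Yield synthetic high-frequency trade events at roughly the requested rate."""
    interval = 1.0 / rate_per_sec
    end = time.time() + duration_sec
    while time.time() < end:
        yield {"symbol": random.choice(["AAA", "BBB", "CCC"]),
               "price": round(random.uniform(10, 100), 2),
               "volume": random.randint(1, 500),
               "ts": time.time()}          # event creation time, used for latency
        time.sleep(interval)

def run_benchmark(rate_per_sec=2000, duration_sec=2):
    latencies, count, start = [], 0, time.time()
    for event in gen_trades(rate_per_sec, duration_sec):
        # Stand-in for the system under test: we "process" the event immediately.
        processed_at = time.time()
        latencies.append(processed_at - event["ts"])
        count += 1
    elapsed = time.time() - start
    print(f"throughput: {count / elapsed:.0f} events/s, "
          f"mean latency: {1000 * sum(latencies) / len(latencies):.3f} ms")

if __name__ == "__main__":
    run_benchmark()
```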

3.
李梓杨, 于炯, 卞琛, 鲁亮, 蒲勇霖. 《计算机应用》 (Journal of Computer Applications), 2018, 38(9): 2560-2567
To address the rise in processing latency caused by sharp increases in the input data rate on big data stream computing platforms, a dynamic scheduling strategy based on a flow network model is proposed and applied to the Flink data stream computing platform. First, the directed acyclic graph (DAG) is converted into a flow network model by defining a capacity and a flow for each edge, and a capacity detection algorithm determines each edge's capacity. Then, a maximum flow algorithm computes the corresponding augmenting network and optimized paths, raising cluster throughput during the phase of rising input rate; the feasibility of the algorithm is argued by evaluating its time and space costs. Finally, the influence of key parameters on the algorithm's behaviour is discussed, and recommended parameter values for different job types are derived experimentally. Experiments show that, compared with Flink's existing task scheduling strategy, the proposed algorithm improves cluster throughput by more than 16.12% for every tested job type during the rising-input-rate phase. The results indicate that the dynamic scheduling strategy effectively increases cluster throughput while satisfying the tasks' latency constraints.
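To make the max-flow building block named in this abstract concrete, here is a compact Edmonds-Karp implementation over a toy operator DAG whose edge capacities stand in for the per-edge capacities a capacity detection step would measure. The graph and numbers are illustrative only and do not reproduce the paper's scheduling strategy.

```python
from collections import deque, defaultdict

def max_flow(edges, source, sink):
    """Edmonds-Karp maximum flow. edges: {(u, v): capacity}. Returns total flow."""
    # Residual capacities, including reverse edges initialised to 0.
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in edges.items():
        residual[u][v] += c
        residual[v][u] += 0
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck along the path, then push flow and update residuals.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        total += push

if __name__ == "__main__":
    # Toy operator DAG: src -> map1/map2 -> join -> sink, capacities in tuples/s.
    dag = {("src", "map1"): 40, ("src", "map2"): 30,
           ("map1", "join"): 25, ("map2", "join"): 30, ("join", "sink"): 50}
    print(max_flow(dag, "src", "sink"))   # -> 50
```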

4.
Energy efficiency of data analysis systems has become a very important issue in recent times because of the increasing costs of data center operations. Although distributed streaming workloads have become increasingly common in modern data centers, energy-efficient scheduling of such applications remains a significant challenge. In this paper, we conduct an energy consumption analysis of data stream processing systems in order to identify their energy consumption patterns. We follow a stream system benchmarking approach to address this issue. Specifically, we implement the Linear Road benchmark on six stream processing environments (S4, Storm, ActiveMQ, Esper, Kafka, and Spark Streaming) and characterize these systems' performance in a real-world data center. We study the energy consumption characteristics of each system with varying numbers of roads as well as with different types of component layouts. We also use a microbenchmark to capture raw energy consumption characteristics. We observed that the S4, Esper, and Spark Streaming environments had the highest average energy efficiency compared with the other systems. Using a neural network-based technique with the power/performance information gathered from our experiments, we developed a model of the power consumption behavior of a streaming environment. We observed that energy-efficient execution of streaming applications cannot be attributed specifically to system CPU usage. We also observed that communication between compute nodes with moderate tuple sizes, together with scheduling plans with balanced system overhead, produces better power consumption behavior in the context of data stream processing systems. Copyright © 2016 John Wiley & Sons, Ltd.

5.
Because any network fault in a power dispatching network may lead to an extremely serious accident, such networks have very high reliability and security requirements. Traditional network monitoring systems, however, cannot meet the real-time processing and scalability demands posed by large data volumes, so analyzing the large-scale, heterogeneous data generated in real time requires a dedicated real-time data analysis platform. Combining the characteristics of power dispatching information networks with the requirements for monitoring accuracy and timeliness, this paper builds a data processing and analysis platform based on stream computing. It uses Spark Streaming from Apache Spark as the open-source stream computing framework and adds components such as the Kafka distributed message queue and the Redis in-memory database to provide the analysis platform with a stable, efficient data source and data service interfaces, thereby enabling real-time analysis of the various kinds of massive data in power dispatching networks for traffic anomaly monitoring scenarios.
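A hedged sketch of the pipeline shape described here (Kafka source, Spark Streaming aggregation, results pushed to Redis). It assumes a Spark 2.x deployment where the spark-streaming-kafka integration is available and a local Redis instance; the topic name, record format, hosts and threshold are all invented, and this is not the paper's implementation.

```python
# Hypothetical traffic-anomaly monitoring job: Kafka -> Spark Streaming -> Redis.
# Assumes Spark 2.x with the spark-streaming-kafka-0-8 package and a local Redis.
import json
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

THRESHOLD_BYTES = 10_000_000           # illustrative per-host traffic threshold

def save_partition(records):
    r = redis.Redis(host="localhost", port=6379)   # one connection per partition
    for host, total in records:
        r.set("traffic:" + host, total)            # expose totals to the monitoring UI
        if total > THRESHOLD_BYTES:
            r.rpush("alerts", host)                # flag hosts exceeding the threshold

if __name__ == "__main__":
    sc = SparkContext(appName="DispatchNetMonitor")
    ssc = StreamingContext(sc, batchDuration=5)    # 5-second micro-batches
    stream = KafkaUtils.createDirectStream(
        ssc, topics=["netflow"], kafkaParams={"metadata.broker.list": "kafka:9092"})
    totals = (stream.map(lambda kv: json.loads(kv[1]))        # value is a JSON flow record
                    .map(lambda rec: (rec["host"], rec["bytes"]))
                    .reduceByKey(lambda a, b: a + b))         # bytes per host per batch
    totals.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
    ssc.start()
    ssc.awaitTermination()
```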

6.
Mature RDF Stream Processing (RSP) systems, owing to their centralized design, lack parallelism and therefore cannot achieve high throughput and low latency when querying large volumes of incoming RDF stream data. To improve query performance, this paper studies the RSP query process and the Flink streaming architecture, designs four logical operators (data source, filter, multi-way partitioned join, and projection), and proposes a Multi-Stream Join (MSJ) algorithm that generates parallel, DAG-shaped logical query plans. The logical operators and query plans are implemented on the big data stream processing platform Apache Flink. Experiments on the real dataset SRBench and the synthetic LUBM datasets show that, compared with the most mature systems, C-SPARQL and CQELS, single-machine throughput increases by up to 10x and throughput on a 5-machine cluster by up to 28x, while latency stays at the millisecond level, demonstrating higher throughput and lower latency when processing large amounts of RDF stream data.

7.
易佳, 薛晨, 王树鹏. 《计算机科学》 (Computer Science), 2017, 44(5): 172-177
Distributed stream querying is a real-time query computation method over data streams that has attracted wide attention and developed rapidly in recent years. This paper surveys the research progress of distributed stream processing frameworks on real-time relational queries, and analyzes and compares products covering distributed data loading, distributed stream computing frameworks, and distributed stream querying. A distributed stream query model built on Spark Streaming and Apache Kafka is proposed: by loading multiple file sources concurrently and designing an in-memory file system for fast data loading, it more than doubles loading speed compared with an Apache Flume-based approach. On top of Spark Streaming, a distributed stream query interface based on Spark SQL is implemented, and a method of parsing SQL statements with custom code is proposed to realize distributed queries. Test results show that, for complex query statements, the custom SQL-parsing approach has a clear advantage in query efficiency.
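As an illustration of the general pattern of running SQL over Spark Streaming micro-batches, each RDD in a DStream can be converted to a DataFrame, registered as a temporary view, and queried with Spark SQL. The socket source, schema and query below are placeholders; the paper's Kafka-based loading and its custom SQL parser are not reproduced here.

```python
# Sketch: SQL over Spark Streaming micro-batches. Source and schema are
# placeholders for illustration only.
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("StreamQuerySketch").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 2)          # 2-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)        # e.g. "user42,item7,19.9"
events = lines.map(lambda l: l.split(",")).map(
    lambda f: Row(user=f[0], item=f[1], amount=float(f[2])))

def query_batch(time, rdd):
    if rdd.isEmpty():
        return
    df = spark.createDataFrame(rdd)
    df.createOrReplaceTempView("events")               # visible to SQL for this batch
    top = spark.sql(
        "SELECT user, SUM(amount) AS total FROM events "
        "GROUP BY user ORDER BY total DESC LIMIT 5")
    print("batch", time)
    top.show()

events.foreachRDD(query_batch)
ssc.start()
ssc.awaitTermination()
```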

8.
Real-time processing of space science satellite data based on stream computing (total citations: 1; self-citations: 0; citations by others: 1)
To meet the increasingly demanding real-time processing requirements of space science satellite observation data, a real-time processing method based on a stream computing framework is proposed. First, the data streams are abstracted and analyzed according to the characteristics of space science satellite data processing; then the input and output data structures of each processing unit are redefined; finally, a parallel data stream processing structure is designed on the stream computing framework Storm to meet the requirements of large-scale parallel processing and distributed computation. Tests of a space science satellite data processing system developed with this method show that, under the same conditions, processing time is halved compared with the original system, and a data-locality strategy achieves higher throughput than round-robin scheduling, raising tuple throughput by 29% on average. A stream computing framework can therefore substantially reduce processing latency and improve the real-time performance of space science satellite data processing systems.

9.
Existing big data classification methods struggle to meet the time and storage constraints of big data applications. A design for a parallel multi-label K-nearest-neighbor classifier on the Apache Spark framework is therefore proposed. To reduce the cost of existing MapReduce schemes by exploiting in-memory operations, the training set is first split into partitions using Spark's parallel mechanism; in the map stage the K nearest neighbors of the sample to be predicted are found within each partition, and in the reduce stage the final K nearest neighbors are determined from the map results; finally, the neighbors' label sets are aggregated in parallel and the target label set of the sample is output by maximizing the posterior probability. Experiments on four large classification datasets, including PokerHand, show that the proposed method achieves low Hamming loss, demonstrating its effectiveness.
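A small PySpark sketch of the two-stage neighbour search described above: per-partition top-k in the map stage and a global merge in the reduce stage. A simplified vote over the neighbours' label sets stands in for the full maximum-a-posteriori estimation, and the toy data, distance and threshold are illustrative only.

```python
# Two-stage neighbour search plus a simplified multi-label vote (toy data).
import heapq
from pyspark import SparkContext

K = 3

def local_topk(partition, query):
    """Map stage: the K nearest (distance, label set) pairs within one partition."""
    best = []
    for i, (features, labels) in enumerate(partition):
        d = sum((a - b) ** 2 for a, b in zip(features, query))
        heapq.heappush(best, (-d, i, labels))    # max-heap on distance; i breaks ties
        if len(best) > K:
            heapq.heappop(best)
    return [(-nd, labels) for nd, _, labels in best]

if __name__ == "__main__":
    sc = SparkContext(appName="ParallelMLKNN")
    # Toy multi-label training data: (feature vector, set of labels).
    train = sc.parallelize([
        ([0.1, 0.2], {"a"}), ([0.2, 0.1], {"a", "b"}), ([0.9, 0.8], {"c"}),
        ([0.8, 0.9], {"b", "c"}), ([0.15, 0.25], {"a"}), ([0.85, 0.95], {"c"}),
    ], numSlices=2)
    query = [0.12, 0.22]

    # Reduce stage: merge per-partition candidates and keep the global top-K.
    candidates = train.mapPartitions(lambda p: local_topk(p, query)).collect()
    neighbours = heapq.nsmallest(K, candidates, key=lambda t: t[0])

    # Simplified decision: keep labels held by at least half of the K neighbours.
    votes = {}
    for _, labels in neighbours:
        for l in labels:
            votes[l] = votes.get(l, 0) + 1
    print({l for l, c in votes.items() if c >= K / 2})   # expected: {'a'}
```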

10.
As one of the mainstream big data stream computing platforms, Storm was designed with performance as its primary goal while neglecting the problem of high energy consumption, which has begun to constrain the platform's development. To address this, a task allocation model, a topology information monitoring model, a data recovery model and an energy consumption model are established, and an energy-efficient strategy based on data recovery in Storm (DR-Storm) is proposed, comprising a throughput detection algorithm and a data recovery algorithm. The throughput detection algorithm computes cluster throughput from the topology information fed back by the monitoring model and uses this feedback to decide whether to terminate the tasks of the topology in the cluster. The data recovery algorithm selects backup nodes for data storage according to the data recovery model and uses the monitoring feedback to decide whether the cluster topology should perform data recovery. In addition, DR-Storm recovers the data in the cluster topology from backup-node memory, and it is evaluated in terms of the system latency and energy efficiency of big data stream computing. Experimental results show that, compared with existing work, DR-Storm reduces system computation latency and cluster power while effectively saving energy.

11.
王春凯, 庄福振, 史忠植. 《智能系统学报》 (CAAI Transactions on Intelligent Systems), 2019, 14(6): 1278-1285
Large-scale data stream management systems usually consist of an upper-layer relational query system and a lower-layer stream processing system. When users submit query requests, system parameters often have to be configured dynamically according to the rate and distribution of the data stream. However, because data streams are volatile, frequent reconfiguration degrades system performance. To address this, the OrientStream+ framework is proposed. It adopts a micro-batch stream transfer mechanism whose batch interval is a user-defined query latency threshold, uses multi-level pipeline caches to batch-process data streams that share the same configuration, and then computes exact query results according to the data streams' timestamps. An incremental learning model based on anomaly detection is introduced to improve OrientStream+'s prediction accuracy. The resource configuration framework is implemented on Storm and evaluated with extensive experiments, which show that OrientStream+ further reduces processing latency and increases system throughput.

12.
Real-time clustering of massive data based on Storm (total citations: 1; self-citations: 0; citations by others: 1)
To address the generally poor real-time responsiveness of existing platforms when processing massive data, the Storm distributed real-time computing platform is adopted for large-scale cluster analysis, and a DBSCAN algorithm based on the Storm framework is designed. The algorithm divides the whole process into stages such as data ingestion, clustering analysis and result output, each implemented in the framework's predefined components; the components are connected by data streams to form a task entity (topology) that is submitted to the cluster for execution. Comparative analysis and performance monitoring verify that the proposed scheme offers low latency and high throughput, with the cluster running stably and well load-balanced. The experimental results show that Storm processes massive data with good real-time performance and can handle data mining tasks in big data settings.
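A toy illustration of the clustering stage: buffering a tumbling window of incoming points and running DBSCAN on each full window with scikit-learn as a stand-in for the bolt logic. The windowing, parameters and simulated data are invented; the paper implements the algorithm inside Storm components rather than with scikit-learn.

```python
# Window-by-window DBSCAN over a simulated point stream.
# Assumes numpy and scikit-learn are installed; eps/min_samples are arbitrary.
import numpy as np
from sklearn.cluster import DBSCAN

WINDOW = 100   # points per clustering run

def point_stream(n=300, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n):
        centre = rng.choice([0.0, 5.0])                  # two well-separated blobs
        yield centre + rng.normal(scale=0.3, size=2)

buffer = []
for p in point_stream():
    buffer.append(p)
    if len(buffer) == WINDOW:
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(np.array(buffer))
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"window clustered: {n_clusters} clusters, "
              f"{np.sum(labels == -1)} noise points")
        buffer.clear()                                   # tumbling window
```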

13.
A data stream is a continuous, rapid, time-varying sequence of data elements that should be processed in an online manner. These matters are studied in Data Stream Management Systems (DSMSs). Single-processor DSMSs cannot properly satisfy the requirements of data stream applications; the main shortcomings are tuple latency, tuple loss, and throughput. In our previous publications, we introduced parallel execution of continuous queries to overcome these problems through performance improvement, especially in terms of tuple latency. We scheduled operators in an event-driven manner, which caused system performance to drop in the periods between consecutive scheduling instances. In this paper, a continuous scheduling method (dispatching) is presented that is more compatible with the continuous nature of data streams and queries and improves system adaptivity and performance. In a multiprocessing environment, the dispatching method makes processing nodes (logical machines) send partially processed tuples to the next machine with minimum workload to execute the next operator on them, so operator scheduling is done continuously and dynamically for each tuple processed by each operator. The dispatching method is described, formally presented, and its correctness is proved. It is also modeled in Petri nets and evaluated via simulation. Results show that the dispatching method significantly improves system performance in terms of tuple latency, throughput, and tuple loss. Furthermore, the fluctuation of the system performance parameters (against variation of system and stream characteristics) diminishes considerably, leading to high adaptivity to the underlying system.
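The core dispatch rule, sending each partially processed tuple to whichever downstream machine currently reports the smallest workload, can be sketched in a few lines. The machines, workload proxy and tuples below are purely illustrative and do not reproduce the paper's formal model.

```python
# Per-tuple dispatching: the next operator runs on whichever downstream machine
# currently has the smallest workload (queue length is used as the proxy here).

class Machine:
    def __init__(self, name):
        self.name = name
        self.queue = []            # pending partially-processed tuples

    def workload(self):
        return len(self.queue)

def dispatch(tuple_, downstream):
    target = min(downstream, key=Machine.workload)   # continuous, per-tuple decision
    target.queue.append(tuple_)
    return target.name

if __name__ == "__main__":
    machines = [Machine("m1"), Machine("m2"), Machine("m3")]
    machines[0].queue.extend(["t0", "t1"])           # m1 already busy
    for t in ["t2", "t3", "t4"]:
        print(t, "->", dispatch(t, machines))        # t2 -> m2, t3 -> m3, t4 -> m2
```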

14.
While virtualization enables multiple virtual machines (VMs), each with its own operating system and applications, to run within a physical server, it also complicates resource allocation when trying to guarantee the Quality of Service (QoS) requirements of the diverse applications running within these VMs. As QoS is crucial in the cloud, considerable research effort has been directed towards CPU, memory and network allocation to provide effective QoS to VMs, but little attention has been devoted to disk resource allocation. This paper presents the design and implementation of Flubber, a two-level scheduling framework that decouples throughput and latency allocation to provide QoS guarantees to VMs while maintaining high disk utilization. The high-level throughput control regulates the pending requests from the VMs with an adaptive credit-rate controller, in order to meet the throughput requirements of different VMs and ensure performance isolation. Meanwhile, the low-level latency control, by virtue of the batch and delay earliest deadline first (BD-EDF) mechanism, re-orders all pending requests from VMs based on their deadlines and batches them to disk devices, taking into account the locality of accesses across VMs. We have implemented Flubber and made extensive evaluations on a Xen-based host. The results show that Flubber can simultaneously meet the different service requirements of VMs while improving the efficiency of the physical disk. The results also show an improvement of up to 25% in VM performance over state-of-the-art approaches: for example, in contrast to the default Xen disk I/O scheduler, Completely Fair Queueing (CFQ), besides achieving the desired QoS of each VM, Flubber speeds up sequential and random reads by 17% and 25%, respectively, due to efficient physical disk utilization.
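A simplified sketch of the deadline-then-locality ordering idea: order pending requests by deadline (EDF), then batch adjacent requests from the same VM so the disk sees more sequential access. The request fields and values are invented, and this does not reproduce Flubber's credit-rate controller or its kernel-level implementation.

```python
# EDF ordering of pending I/O requests, then batching runs from the same VM.
from itertools import groupby

requests = [
    {"vm": "vm1", "deadline": 12, "block": 100},
    {"vm": "vm2", "deadline": 5,  "block": 900},
    {"vm": "vm1", "deadline": 6,  "block": 101},
    {"vm": "vm2", "deadline": 5,  "block": 901},
]

# Earliest-deadline-first ordering of everything that is pending.
edf = sorted(requests, key=lambda r: r["deadline"])

# Batch adjacent requests that come from the same VM to preserve locality.
batches = [list(group) for _, group in groupby(edf, key=lambda r: r["vm"])]
for batch in batches:
    blocks = [r["block"] for r in batch]
    print(batch[0]["vm"], "->", blocks)   # vm2 -> [900, 901], then vm1 -> [101, 100]
```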

15.
Artificial intelligence and machine learning have been used by many research groups for processing large-scale data, known as big data. Machine learning techniques for handling large-scale, complex datasets are computationally expensive. The Apache Spark machine learning library, Spark MLlib, is becoming a popular platform for big data analysis and is used for many machine learning problems such as classification, regression and clustering. In this work, Apache Spark and the advanced machine learning architecture of a deep multilayer perceptron (MLP) are proposed for audio scene classification. Log mel band features are used to represent the characteristics of the input audio scenes. The parameters of the DNN are set according to the DNN baseline of the DCASE 2017 challenge. The system is evaluated with the TUT dataset (2017) and the result is compared with the provided baseline.
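A minimal PySpark pipeline using MLlib's MultilayerPerceptronClassifier, shown only to make the abstract's setup concrete. The tiny feature vectors stand in for log-mel band features, and the layer sizes and data are illustrative rather than the DCASE 2017 baseline configuration.

```python
# Minimal Spark MLP classifier sketch. The small vectors below stand in for
# log-mel band features; layer sizes and data are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("AudioSceneMLP").getOrCreate()

# (label, features): label = scene class index, features = pretend log-mel bands.
data = spark.createDataFrame([
    (0.0, Vectors.dense([0.1, 0.2, 0.1, 0.0])),
    (1.0, Vectors.dense([0.9, 0.8, 0.7, 0.9])),
    (0.0, Vectors.dense([0.2, 0.1, 0.2, 0.1])),
    (1.0, Vectors.dense([0.8, 0.9, 0.8, 0.7])),
], ["label", "features"])

# layers: input size, one hidden layer, output = number of scene classes.
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 2], maxIter=100, seed=42)
model = mlp.fit(data)
model.transform(data).select("label", "prediction").show()

spark.stop()
```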

16.
Computer network access requests arrive in real time and change dynamically. To detect network intrusions in real time while adapting to the dynamics of network access data, a real-time network intrusion detection framework based on data streams is proposed. First, misuse detection and anomaly detection are combined: an initial clustering builds a knowledge base consisting of normal and abnormal patterns. Second, the dissimilarity between a data point and a data cluster is used to measure how similar network access data is to the normal and abnormal patterns, and thereby judge the legitimacy of the access. Finally, when the network access data stream evolves, the knowledge base is updated by re-clustering to reflect the most recent state of network access. Experiments on the KDD Cup 99 intrusion detection dataset show that, with 10,000 samples for initial clustering, 10,000 samples for buffer clustering and an adjustment coefficient of 0.9, recall reaches 91.92% and the false alarm rate 0.58%, close to the results of traditional non-real-time detection; however, the whole learning and detection process scans the network access data only once and incorporates a knowledge-base update mechanism, giving it an advantage in the real-time performance and adaptability of intrusion detection.
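A toy sketch of the decision step: compare an incoming record's dissimilarity to the nearest normal-pattern and abnormal-pattern cluster centres in a small knowledge base, with Euclidean distance purely as a stand-in measure. The centres and feature space are invented, and the clustering, update mechanism and KDD Cup 99 features are not reproduced.

```python
# Classify a connection record by whichever knowledge-base cluster it is least
# dissimilar to. Centres and the feature space are invented for illustration.
import math

knowledge_base = {
    "normal":   [(0.1, 0.1), (0.2, 0.3)],   # cluster centres of normal patterns
    "abnormal": [(0.9, 0.8), (0.7, 0.9)],   # cluster centres of known attacks
}

def dissimilarity(x, centre):
    return math.dist(x, centre)              # Euclidean distance as the stand-in

def classify(record):
    nearest = {kind: min(dissimilarity(record, c) for c in centres)
               for kind, centres in knowledge_base.items()}
    return min(nearest, key=nearest.get)     # least-dissimilar pattern wins

print(classify((0.15, 0.2)))   # -> normal
print(classify((0.8, 0.85)))   # -> abnormal
```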

17.

One of the most challenging issues in the big data research area is the inability to process a large volume of information in a reasonable time. Hadoop and Spark are two frameworks for distributed data processing. Hadoop is a very popular and general platform for big data processing. Because of its in-memory programming model, Spark, as an open-source framework, is suitable for processing iterative algorithms. In this paper, the Hadoop and Spark frameworks, two big data processing platforms, are evaluated and compared in terms of runtime, memory and network usage, and CPU efficiency. To this end, the K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks. The results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop. The evaluations also show that Hadoop uses more resources, including CPU and network, and that the CPU is used more efficiently in Spark than in Hadoop. On the other hand, memory usage in Hadoop is lower than in Spark.


18.
Cloud computing, as the development direction of next-generation Internet technology, has become a hot research area in both industry and academia. In the informatization of universities, building one's own cloud platform plays an extremely important role. Using Apache VCL or VMware vCloud Director to build an online resource-request cloud platform represents two mature options among the many cloud computing technologies. This paper introduces in detail the architecture and applications of the Apache VCL open-source cloud platform and the VMware vCloud Director commercial cloud platform, discusses their working principles, network architectures and security measures in depth, and offers some suggestions for building university cloud platforms.

19.
何珍祥, 贾筱景. 《现代计算机》 (Modern Computer), 2010, (1): 139-141, 175
In computer networking courses, the DNS and Apache server configuration experiments can only be carried out effectively by means of virtual machine technology. This paper describes how to create Linux virtual systems with VMware Workstation and presents the DNS and Apache configuration files in the Linux virtual machines together with their test implementation, which offers important guidance for students configuring network servers and provides a useful reference for developing other server experiments on virtual machines.

20.
In smart cities, travel time is a typical business calculation used to monitor and control traffic congestion, but computing it over real-time streams remains challenging because of latency and accuracy constraints. In this paper, we propose a collaborative approach to travel-time calculation on streams of vehicle recognition data. Compared with other types of sensory data from urban roads, vehicle recognition data has wider coverage, finer sampling intervals and more exact locality. Our approach continuously produces both factual and predictive values, and consists of two-step spatio-temporal parallelism on real-time data and Bayes prior rule mining on historical data. It can be shown theoretically to achieve low latency with high accuracy, and it has been implemented on Apache Storm in combination with Hadoop MapReduce. In exhaustive experiments on simulated and real data, our approach steadily maintains millisecond-level latencies on high-speed streams with nearly linear scalability, and keeps prediction accuracy above 80%.
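A small sketch of the factual part of the computation: travel time for one road segment derived from consecutive recognitions of the same plate at two cameras, blended with a historical prior in the spirit of the Bayes-prior step. The records, prior and 0.7/0.3 weighting are invented, and the Storm/MapReduce implementation is not shown.

```python
# Segment travel time from plate recognitions at two cameras, blended with a
# historical prior. Records, prior and the weighting are illustrative only.
from statistics import mean

# (plate, camera, unix_ts) recognitions for one road segment: camera A -> camera B.
recognitions = [
    ("AB123", "A", 1000), ("AB123", "B", 1085),
    ("CD456", "A", 1010), ("CD456", "B", 1102),
    ("EF789", "A", 1020), ("EF789", "B", 1098),
]

def measured_travel_times(records):
    seen_at_a = {}
    times = []
    for plate, cam, ts in sorted(records, key=lambda r: r[2]):
        if cam == "A":
            seen_at_a[plate] = ts
        elif cam == "B" and plate in seen_at_a:
            times.append(ts - seen_at_a.pop(plate))   # per-vehicle segment time
    return times

HISTORICAL_PRIOR = 95.0       # e.g. mean travel time for this segment and hour
factual = mean(measured_travel_times(recognitions))
estimate = 0.7 * factual + 0.3 * HISTORICAL_PRIOR     # prior-smoothed estimate
print(f"factual {factual:.1f}s, smoothed {estimate:.1f}s")
```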
