Similar literature
 20 similar articles found (search time: 203 ms)
1.
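E-commerce websites have become one of the largest sources of data in business intelligence. By bringing data warehouse technology into e-commerce applications, taking users' click streams and Web log files from e-commerce sites as the data source, and applying an efficient improved association rule algorithm, the knowledge contained in the data, such as user behavior patterns, can be extracted effectively. With this knowledge, business staff can expand their markets, improve customer relationships, reduce costs, streamline operations, and refine their business strategies.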

2.
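User behavior analysis based on Web usage mining   Total citations: 9 (self-citations: 0, citations by others: 9)
张波, 巫莉莉, 周敏. 《计算机科学》 (Computer Science), 2006, 33(8): 213-214
Web services generate large volumes of log data that record user behavior. This paper studies how to extract the knowledge hidden in such massive logs automatically and intelligently. Based on Web usage mining, click-stream data sources are collected and preprocessed, and an FP-tree-based association rule mining algorithm is used to analyze user behavior and discover new patterns, providing valuable data for improving website construction.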

3.
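To address the low execution efficiency and redundant rules of classic multidimensional association rule mining algorithms, a constraint-based improved multidimensional Apriori algorithm is proposed. Building on the multidimensional Apriori algorithm, it introduces user constraints into the mining process: constraints on predicates are used to generate the frequent predicate sets the user is interested in, and the transaction set is pruned accordingly. The user constraints greatly reduce the number of candidate predicate sets generated, and the pruned transaction set lowers the cost of scanning the database, so mining efficiency improves and redundant rules are reduced. The algorithm was compared with multidimensional Apriori on a transaction set of FPGA code defects; the results show improvements in both search efficiency and the accuracy of the mining results, effectively improving the accuracy of FPGA code defect analysis.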

4.
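Click-stream data is key to analyzing the psychological tendencies of Internet users, and Internet users can be clustered by analyzing it. This paper proposes a vector-based similarity computation method that converts click-stream data into vector data; the clustering result is obtained by computing on these vectors. The algorithm overcomes some shortcomings of traditional clustering algorithms and better meets researchers' requirements for personalized clustering when studying Web click-stream data.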
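
The idea of turning click streams into vectors and comparing them, as described in the abstract above, can be sketched as follows. This is only an illustration: the abstract does not say which similarity measure the authors use, so cosine similarity over per-page click counts is an assumption here, and the names `click_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def click_vector(clicks, pages):
    """Map a click stream (list of page names) to a count vector.

    `pages` fixes the vector's dimensions; clicks on pages not listed
    there are ignored. This encoding is an assumption for illustration.
    """
    counts = Counter(clicks)
    return [counts[p] for p in pages]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical page names and sessions.
pages = ["home", "search", "cart"]
alice = click_vector(["home", "search", "home"], pages)
bob = click_vector(["cart", "cart"], pages)
print(cosine_similarity(alice, bob))
```

A pairwise similarity like this could feed any standard clustering procedure; which one the paper uses is not stated in the abstract.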

5.
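This paper studies a data analysis system that mines Web logs with association rules in a Web environment. The concept of association rules is introduced into Web log mining, and users' access paths are expressed as association rules, with the aim of discovering users' access patterns from their behavior when visiting a hypertext system. Then, building on the ideas of the Apriori mining algorithm, a modified Apriori-like algorithm suited to mining users' frequent access paths is given. Finally, a Web log data analysis system is designed and developed. The system has three main functional modules: a data preprocessing module, an intelligent analysis module, and a basic analysis module.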

6.
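This paper studies a data analysis system that mines Web logs with association rules in a Web environment. The concept of association rules is introduced into Web log mining, and users' access paths are expressed as association rules, with the aim of discovering users' access patterns from their behavior when visiting a hypertext system. Then, building on the ideas of the Apriori mining algorithm, a modified Apriori-like algorithm suited to mining users' frequent access paths is given. Finally, a Web log data analysis system is designed and developed. The system has three main functional modules: a data preprocessing module, an intelligent analysis module, and a basic analysis module.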

7.
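DNS access records reflect the access intent of network users. Processing these records with the Apriori association rule mining algorithm generates association rules that reveal users' access behavior patterns, meeting application needs such as user identification and user analysis. This paper analyzes the Apriori association rule mining algorithm in detail, improves on its shortcomings, applies it to mining DNS access records, and analyzes the resulting association rules to obtain some access behavior patterns of the users concerned.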
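
For readers unfamiliar with Apriori, the following is a minimal sketch of the classic algorithm applied to sets of domain names. It is the textbook version, not the improved variant the abstract refers to (whose details are not given), and the domain names and thresholds are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-item candidates, and prune
        # any candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One pass over the transactions to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support count is available in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical data: the set of domains each user resolved in one session.
sessions = [
    {"mail.example", "news.example", "cdn.example"},
    {"mail.example", "news.example"},
    {"news.example", "cdn.example"},
    {"mail.example", "news.example", "cdn.example"},
    {"mail.example"},
]
for rule in association_rules(apriori(sessions, min_support=3), min_confidence=0.8):
    print(rule)
```

The candidate pruning step is what distinguishes Apriori from brute-force counting: an itemset is only counted if all of its smaller subsets were already found frequent.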

8.
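Click-stream clustering based on closed gapped frequent subsequences   Total citations: 2 (self-citations: 0, citations by others: 2)
马超, 沈微. 《计算机工程》 (Computer Engineering), 2010, 36(23): 72-75
Clustering the click-stream sequences recorded in website log files reveals user usage patterns and thereby allows users to be classified. Traditional clustering methods, however, struggle to extract representative feature vectors from click streams, and both the click streams and their feature vectors suffer from data sparsity. To address this, a click-stream clustering method based on mining closed gapped frequent subsequence patterns is proposed. The method extracts the frequent support of subsequence patterns from click streams to build feature vectors, and judges the similarity between vectors with a fuzzy distance measure based on bidirectional-mapping Euclidean distance, improving the clustering performance of the BIRCH clustering algorithm on click-stream data.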

9.
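Web mining uses data mining techniques to automatically discover and extract information from Web documents and services, and the most useful data for Web mining is the click stream. Current methods for modeling and analyzing click-stream data either overemphasize the mining algorithm or overemphasize practicality. In response, the author presents a statistical modeling method that combines mining algorithms with business intelligence (BI).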

10.
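Design and implementation of a transaction data model for click streams   Total citations: 1 (self-citations: 0, citations by others: 1)
Put simply, click-stream data is an ordered series of log records on a Web server. With the rapid growth of WWW applications and e-commerce, the Web servers of e-commerce sites automatically collect large numbers of user access records, known as Web logs. Web logs contain a great deal of useful information, such as customer origin, customer visit trends, customer interests, and site traffic, so recording and analyzing Web log data has become a major activity for e-businesses. A click-stream data warehouse filters, cleans, and integrates the raw Web log data so that online analytical processing (OLAP) and data mining techniques can be applied for further analysis, creating substantial informational value for the enterprise.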

11.
Search engine query log mining has evolved toward data stream mining because queries arrive as an endless, continuous sequence known as a query stream. In this paper, we propose an online frequent sequence discovery (OFSD) algorithm to extract frequent phrases from query streams, based on a new frequency rate metric suited to query stream mining. OFSD is an online, single-pass, real-time frequent sequence miner appropriate for data streams. The frequent phrases extracted by OFSD are used to help novice Web search engine users complete their search queries more efficiently. YourEye, our online phrase recommender, is then introduced, and its advantages over Google Suggest, Google's phrase suggestion service, are described. Various characteristics of two specific Web search engine query logs are analyzed, and the query logs are then used to evaluate YourEye. The experimental results confirm the significant benefit of monitoring frequent phrases within queries instead of whole queries treated as non-separable items: the number of monitored elements decreases substantially, which results in lower memory consumption as well as better performance. Re-ranking the retrieved pages based on past users' clicks for each frequent phrase extracted by OFSD is also introduced. The preliminary results show the advantages of the proposed method compared to the similar work reported in Smyth et al.
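
To make the idea of monitoring phrases inside queries concrete, here is a rough single-pass sketch that counts word n-grams as queries arrive and suggests completions for a prefix. It is not the OFSD algorithm: the abstract does not define its frequency rate metric, so plain occurrence counts are used instead, and the function names and the `max_len` parameter are assumptions.

```python
from collections import Counter


def update_phrase_counts(counts, query, max_len=3):
    """Add every word n-gram (1 <= n <= max_len) of one query to `counts`.

    Each query is processed once, as it arrives; nothing is re-read later.
    """
    words = query.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1


def suggest(counts, prefix, k=3):
    """Return up to k phrases that extend `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [
        (phrase, c) for phrase, c in counts.items()
        if phrase.startswith(prefix) and phrase != prefix
    ]
    # Sort by descending count, then alphabetically for a stable order.
    matches.sort(key=lambda pc: (-pc[1], pc[0]))
    return [phrase for phrase, _ in matches[:k]]


# Hypothetical query stream.
counts = Counter()
for q in ["data stream mining", "data stream processing", "stream mining"]:
    update_phrase_counts(counts, q)
print(suggest(counts, "data", k=1))
```

Note that this sketch keeps every phrase it has ever seen, so its memory grows with the stream; a real stream miner would need some eviction or decay policy.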

12.
In smart cities, travel time is a typical business calculation used to monitor and control traffic congestion, but computing it on real-time streams remains challenging because of latency and accuracy constraints. In this paper, we propose a collaborative approach for travel-time calculation on a stream of recognized vehicle data. Compared with other types of sensory data from urban roads, recognized vehicle data has wider coverage, finer intervals, and more exact locality. Our approach continuously produces both factual and predictive values, and consists of two-step spatio-temporal parallelism on real-time data and Bayes prior rule mining on historical data. Theoretical analysis indicates low latency with high accuracy, and the approach has been implemented on Apache Storm together with Hadoop MapReduce. In extensive experiments on simulated and real data, our approach steadily holds millisecond-level latencies on high-speed streams with nearly linear scalability, and keeps prediction accuracy above 80%.

13.
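In today's applications such as network monitoring, telecommunications data management, and sensor data monitoring, data takes the form of multidimensional, continuous, fast, time-varying streams; access to the data is likewise repeated and continuous, and immediate responses are required. The distinctive characteristics of data streams pose major challenges to traditional data processing methods. The emergence of data stream applications has driven research on related technologies, including data stream mining. This paper introduces the basic concepts of data streams and discusses the state of research and related techniques in data stream mining, including an introduction to data streams, popular data stream processing techniques, and relevant data mining algorithms.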

14.
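Click fraud in online advertising refers to any click made by fraudulent means or with fraudulent intent that is nevertheless accepted by the search engine. Traditional click fraud detection focuses mainly on judging the legitimacy of clicks by individual users. At present, however, many publishers hire large numbers of network users to commit click fraud as a group. To address this problem, a method for detecting click fraud groups is proposed. First, a frequent itemset mining algorithm is used to find individual users who have clicked on a large number of the same advertisements; these form suspected fraud groups. Then, based on an analysis of the click behavior attributes of the users in a group, an outlier detection method is used to find suspected fraudulent users who differ markedly from other users in the group. Finally, a Bayesian classification method classifies all detected suspected fraud members, yielding the actual fraud groups and fraudulent users. Experimental results on a real data set demonstrate the feasibility and effectiveness of the method.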

15.
Today's world of increasingly dynamic environments naturally results in more and more data being available as fast streams. Applications such as stock market analysis, environmental sensing, Web clicks, and intrusion detection are just a few of the examples where valuable data is streamed. Often, streaming information is offered on the basis of a nonexclusive, single-use customer license. One major concern, especially given the digital nature of the valuable stream, is the ability to easily record and potentially "replay" parts of it in the future. If there is value associated with such future replays, it could constitute enough incentive for a malicious customer (Mallory) to record and duplicate data segments, subsequently reselling them for profit. Being able to protect against such infringements becomes a necessity. In this work, we introduce the issue of rights protection for discrete streaming data through watermarking. This is a novel problem with many associated challenges including: operating in a finite window, single-pass, (possibly) high-speed streaming model, and surviving natural domain specific transforms and attacks (e.g., extreme sparse sampling and summarizations), while at the same time keeping data alterations within allowable bounds. We propose a solution and analyze its resilience to various types of attacks as well as some of the important expected domain-specific transforms, such as sampling and summarization. We implement a proof of concept software (wms.*) and perform experiments on real sensor data from the NASA Infrared Telescope Facility at the University of Hawaii, to assess encoding resilience levels in practice. Our solution proves to be well suited for this new domain. For example, we can recover an over 97 percent confidence watermark from a highly down-sampled (e.g., less than 8 percent) stream or survive stream summarization (e.g., 20 percent) and random alteration attacks with very high confidence levels, often above 99 percent.  

16.
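Data stream management technology   Total citations: 2 (self-citations: 1, citations by others: 1)
It has recently become widely recognized that in some new application domains it is more appropriate to treat data as transient data streams than as persistent relations. This paper first analyzes the limitations of traditional database management systems in processing data streams, then analyzes the basic implementation techniques of three typical data stream management systems and discusses the current state of research and future research directions in data stream management. Finally, it presents the architecture of a prototype data stream management system.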

17.
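Research on stream data management systems has become a point of consensus in the database field. This paper describes in detail the basic concepts of stream data management systems, stream data models and query semantics, and stream data query algorithms, and proposes future research directions for many important problems in stream data management system research.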

18.
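Based on behavior logs from large-scale real Web users, this paper studies the interaction between users and Web search engine systems and the user decision process. By comparing the distribution of user clicks that carry relevance information with that of ordinary clicks, three types of contextual features of user clicks are analyzed, enabling the reliability of user clicks to be assessed. Experimental results show that analyzing the contextual features of user clicks can reveal the decision process in users' search behavior and in turn allows the reliability of user clicks to be assessed effectively.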

19.
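With the arrival of the big data era, the growing volume of backup data poses new challenges for storage space. Data deduplication is becoming increasingly popular in backup storage systems, but the large number of data accesses places a heavy burden on disks. To address the disk bottleneck of chunk index lookups in deduplication, this paper proposes combining file similarity with data stream locality to improve disk I/O performance. The method exploits the strengths of both: similarity optimizes index lookups and can detect duplicate data that identical-data detection techniques cannot identify, while data locality preserves the order of the data stream, raising the cache hit rate and reducing the number of disk accesses. Storing the chunk index in a Bloom filter saves substantial query time and space overhead. Key parameters of the proposed solution, such as chunk size and segment size, and their effect on the false positive rate are analyzed in depth. The experimental evaluation and performance analysis provide an important data basis for further system performance optimization.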
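
The role a Bloom filter plays in front of a chunk index can be sketched as below. This is a generic illustration rather than the paper's design: the filter size, the number of hash functions, the use of SHA-1 chunk fingerprints, and the in-memory `dict` standing in for the on-disk index are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing (h1 + i * h2).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def store_chunks(chunks, bloom, index):
    """Store only chunks not seen before; return how many were written."""
    written = 0
    for chunk in chunks:
        fingerprint = hashlib.sha1(chunk).digest()
        # A negative Bloom answer is definitive, so the (slow) index lookup
        # is needed only when the filter says "possibly present".
        if fingerprint in bloom and fingerprint in index:
            continue  # confirmed duplicate
        bloom.add(fingerprint)
        index[fingerprint] = len(index)  # stand-in for a storage location
        written += 1
    return written


bloom = BloomFilter(num_bits=1024, num_hashes=3)
index = {}
print(store_chunks([b"aaa", b"bbb", b"aaa"], bloom, index))
```

Because a Bloom filter can report false positives, a positive answer is always confirmed against the real index; a larger filter or better-tuned hash count lowers how often that confirmation is wasted.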

20.
It is widely recognized that the integration of information retrieval (IR) and database (DB) techniques provides users with a broad range of high-quality services. Along this direction, IR-style m-keyword query processing over a relational database in an RDBMS framework has been well studied. It finds all hidden interconnected tuple structures, for example connected trees that contain keywords and are interconnected by sequences of primary/foreign key relationships among tuples. A new challenge is how to monitor events that are implicitly interrelated over an open-ended relational data stream for a user-given m-keyword query. Such a relational data stream is a sequence of tuple insertion/deletion operations. The difficulty of the problem is related to the number of costly joins to be processed over time as tuples are inserted and/or deleted. This cost is mainly affected by three parameters: the number of keywords, the maximum size of interconnected tuple structures, and the complexity of the database schema when viewed as a schema graph. In this paper, we propose new approaches. First, we propose a novel algorithm to efficiently determine all the joins that need to be processed for answering an m-keyword query. Second, we propose a new demand-driven approach to process such a query over a high-speed relational data stream. We show that we can achieve high efficiency by significantly reducing the number of intermediate results when processing joins over a relational data stream. The proposed techniques allow us to achieve high scalability in terms of both query plan generation and query plan execution. We conducted extensive experimental studies using synthetic data and real data to simulate a relational data stream. Our approach significantly outperforms existing algorithms.
