Similar Documents
20 similar documents found.
1.
A text mining approach for automatic construction of hypertexts
Research on automatic hypertext construction has emerged rapidly in the last decade because there is an urgent need to translate the gigantic amount of legacy documents into web pages. Unlike traditional ‘flat’ texts, a hypertext contains a number of navigational hyperlinks that point to related hypertexts or to locations within the same hypertext. Traditionally, these hyperlinks were constructed by the creators of the web pages, with or without the help of authoring tools. However, the gigantic number of documents produced each day makes such manual construction infeasible. An automatic hypertext construction method is therefore necessary for content providers to efficiently produce adequate information for web surfers. Although most web pages contain non-textual data such as images, sounds, and video clips, text still contributes the major part of the information on a page. It is therefore not surprising that most automatic hypertext construction methods inherit from traditional information retrieval research. In this work, we propose a new automatic hypertext construction method based on a text mining approach. Our method applies the self-organizing map algorithm to cluster the flat text documents in a training corpus and generate two maps. We then use these maps to identify the sources and destinations of important hyperlinks within the training documents. The constructed hyperlinks are inserted into the training documents to translate them into hypertext form, and the translated documents form the new corpus. Incoming documents can also be translated into hypertext form and added to the corpus through the same approach. Our method has been tested on a set of flat text documents collected from a newswire site. Although we used only Chinese text documents, our approach can be applied to any documents that can be transformed into a set of index terms.
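As an illustration of the clustering step described above, here is a minimal, self-contained sketch of a self-organizing map trained on TF-IDF document vectors. The corpus, map size, and training schedule are invented for the example; the paper's two-map hyperlink identification step is not reproduced.

```python
# A minimal self-organizing map (SOM) sketch in the spirit of the method
# above: cluster TF-IDF document vectors on a small 2-D map so that
# documents landing on nearby map units become hyperlink candidates.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock market rises", "market falls on rate fears",
        "new phone model released", "phone sales beat forecast"]
X = TfidfVectorizer().fit_transform(docs).toarray()

rows, cols, dim = 3, 3, X.shape[1]
rng = np.random.default_rng(0)
weights = rng.random((rows, cols, dim))           # one prototype per map unit
coords = np.array([[i, j] for i in range(rows) for j in range(cols)])

for t in range(200):                              # online SOM training
    x = X[rng.integers(len(X))]
    bmu = np.unravel_index(
        np.argmin(((weights - x) ** 2).sum(axis=2)), (rows, cols))
    lr = 0.5 * (1 - t / 200)                      # decaying learning rate
    sigma = 1.5 * (1 - t / 200) + 0.1             # decaying neighborhood
    d2 = ((coords - bmu) ** 2).sum(axis=1).reshape(rows, cols)
    h = np.exp(-d2 / (2 * sigma ** 2))[..., None] # neighborhood function
    weights += lr * h * (x - weights)

# Documents mapped to the same (or adjacent) unit are hyperlink candidates.
for k, x in enumerate(X):
    unit = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)),
                            (rows, cols))
    print(docs[k], "->", unit)
```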

2.
Textual databases are useful sources of information and knowledge, and if they are well utilised then issues related to future project management and product or service quality improvement may be resolved. A large part of corporate information, approximately 80%, is available in textual data formats. Text classification techniques are well known for managing on-line sources of digital documents. The identification of key issues discussed within textual data, and their classification into two different classes, could help decision makers or knowledge workers manage their future activities better. This research is relevant for most text-based documents and is demonstrated on Post Project Reviews (PPRs), which are a valuable source of information and knowledge. The application of textual data mining techniques for discovering useful knowledge and classifying textual data into different classes is a relatively new area of research. The work presented in this paper focuses on hybrid applications of text mining techniques to classify textual data into two different classes. The research applies clustering techniques in the first stage and Apriori association rule mining in the second stage. Apriori association rule mining is applied to generate Multiple Key Term Phrasal Knowledge Sequences (MKTPKS), which are later used for classification. Additionally, studies were made to improve the classification accuracies of the classifiers, i.e. C4.5, k-NN, Naïve Bayes and Support Vector Machines (SVMs). The classification accuracies were measured and the results compared with those of a single-term-based classification model. The proposed methodology could be used to analyse any free-formatted textual data; in the current research it is demonstrated on an industrial dataset consisting of PPRs collected from the construction industry. The data and information available in these reviews are codified in multiple different formats, but in the current research only free-formatted text documents are examined. Experiments showed that the performance of the classifiers improved through adopting the proposed methodology.
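The pattern-mining stage can be pictured with a toy Apriori run: frequent multi-term sets are mined from term-set documents and turned into binary features, roughly in the spirit of the MKTPKS construction. The documents and support threshold below are illustrative assumptions, not the paper's data.

```python
# Toy two-stage sketch: mine frequent multi-term patterns (Apriori-style),
# then use the patterns as binary features for a downstream classifier.
from collections import Counter

docs = [{"delay", "weather", "crane"}, {"delay", "weather"},
        {"budget", "overrun"}, {"budget", "overrun", "delay"}]
min_support = 2

# Level 1: frequent single terms.
freq = {frozenset([t]) for t, c in
        Counter(t for d in docs for t in d).items() if c >= min_support}
patterns = set(freq)
# Level k: join (k-1)-sets and keep those meeting min_support.
while freq:
    k = len(next(iter(freq))) + 1
    candidates = {a | b for a in freq for b in freq if len(a | b) == k}
    freq = {c for c in candidates
            if sum(c <= d for d in docs) >= min_support}
    patterns |= freq

# Binary pattern-presence features, one row per document.
features = [[int(p <= d) for p in sorted(patterns, key=sorted)] for d in docs]
print(sorted(map(sorted, patterns)), features, sep="\n")
```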

3.
We present a method for the classification of multi-labeled text documents explicitly designed for data stream applications that must process a virtually infinite sequence of data using constant memory and constant processing time. Our method is composed of an online procedure used to efficiently map text into a low-dimensional feature space and a partition of this space into a set of regions for which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labeled instances colliding in the same region. We refer to this approach as clashing. We illustrate the method on real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis of the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labeled streams.
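A heavily simplified sketch of the streaming "clashing" idea follows: each document is hashed into one of a fixed number of regions, and per-region label counts drive prediction, so memory and per-document time stay constant. The region count and hashing scheme are illustrative assumptions, not the paper's embedding.

```python
# Constant-memory streaming sketch: hash documents into fixed regions and
# predict labels from the counts of labeled documents that "clash" there.
from collections import defaultdict
import hashlib

N_REGIONS = 64
region_stats = defaultdict(lambda: defaultdict(int))  # region -> label counts

def region_of(words):
    # Map a word sequence to one of N_REGIONS buckets via a stable hash.
    h = hashlib.md5(" ".join(sorted(words)).encode()).digest()
    return int.from_bytes(h[:4], "big") % N_REGIONS

def learn(words, labels):
    r = region_of(words)
    for lab in labels:
        region_stats[r][lab] += 1

def predict(words, top=2):
    stats = region_stats[region_of(words)]
    return [lab for lab, _ in
            sorted(stats.items(), key=lambda kv: -kv[1])[:top]]

learn(["stocks", "fall"], {"finance"})
learn(["stocks", "rally"], {"finance", "markets"})
print(predict(["stocks", "fall"]))
```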

4.
The World Wide Web (WWW) has been recognized as the ultimate and unique source of information for the information retrieval and knowledge discovery communities. Tremendous amounts of knowledge are recorded using various types of media, producing an enormous number of web pages. Retrieving required information from the WWW is thus an arduous task. Different schemes for retrieving web pages have been used by the WWW community. One of the most widely used is to traverse predefined web directories to reach a user's goal. These web directories are compiled or classified folders of web pages and are usually organized into hierarchical structures. The classification of web pages into proper directories and the organization of directory hierarchies are generally performed by human experts. In this work, we provide a corpus-based method that applies text mining techniques to a corpus of web pages to automatically create web directories and organize them into hierarchies. The method is based on the self-organizing map learning algorithm and requires no human intervention during the construction of web directories and hierarchies. Experiments show that our method can produce comprehensible and reasonable web directories and hierarchies.
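A hedged sketch of automatic directory construction: pages are clustered into leaf directories, the leaf centroids are clustered into parent directories, and each node is labelled by its top TF-IDF terms. KMeans is used here only as a simple stand-in for the paper's self-organizing map; pages and cluster counts are invented.

```python
# Two-level directory hierarchy sketch: cluster pages, then cluster the
# cluster centroids, labelling each directory by its strongest terms.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

pages = ["python tutorial code", "java code examples",
         "football scores league", "tennis open results",
         "rust compiler release", "basketball playoffs"]
vec = TfidfVectorizer()
X = vec.fit_transform(pages).toarray()
terms = np.array(vec.get_feature_names_out())

leaves = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
parents = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    leaves.cluster_centers_)

def label(center, k=2):                       # name a directory by top terms
    return "/".join(terms[np.argsort(center)[::-1][:k]])

for leaf_id, parent_id in enumerate(parents.labels_):
    print(label(parents.cluster_centers_[parent_id]), ">",
          label(leaves.cluster_centers_[leaf_id]))
```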

5.
Using text classification and multiple concepts to answer e-mails
In text mining, the application domain of text classification techniques is very broad, including text filtering, word identification, web page classification, and more. Through text classification, documents can be placed into previously defined classes, saving time compared with manual document search. This research applies text classification techniques to e-mail reply template suggestion in order to lower the burden on customer service personnel when responding to e-mails. Suggested templates allow customer service personnel, using a pre-determined number of templates, to find the needed reply template without wasting time searching for relevant answers in an excess of available information. Current text classification techniques are still single-concept based. This research uses a multiple-concept method to integrate the relationships between concepts and classes, thereby enabling straightforward text classification. Through the integration of different concepts and classes, a dynamically unified e-mail concept can recommend different appropriate reply templates. In this way, the differences between e-mails can be clearly determined, effectively improving the accuracy of the suggested template. In addition, for e-mails containing two or more questions, this research tries to come up with an appropriate reply template. Experimental verification shows that the proposed method effectively proposes templates for e-mails with multiple questions. Using multiple concepts to represent a document's topics is thus a clearer way of extracting the information a document intends to convey than relying on the vector of similar documents alone.
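A minimal sketch of the multiple-concept idea: an incoming e-mail is scored against several concept centroids and the top concepts each contribute a template, instead of committing to a single class. The concepts, templates, and corpus below are illustrative placeholders.

```python
# Score an e-mail against multiple concept centroids and suggest the
# templates of the top-scoring concepts, so a two-question e-mail can
# activate two concepts at once.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concept_docs = {
    "billing":  ["invoice charge refund payment"],
    "shipping": ["delivery tracking package late"],
    "account":  ["password login reset account"],
}
templates = {"billing": "T-billing", "shipping": "T-shipping",
             "account": "T-account"}

vec = TfidfVectorizer().fit(sum(concept_docs.values(), []))
centroids = {c: vec.transform(d).mean(axis=0) for c, d in concept_docs.items()}

def suggest(email, top=2):
    x = vec.transform([email])
    scores = {c: float(cosine_similarity(x, np.asarray(m))[0, 0])
              for c, m in centroids.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:top]
    return [templates[c] for c in best]

print(suggest("my package is late and I was charged twice"))
```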

6.
In the field of multimedia retrieval from video, text frame classification is essential for text detection, event detection, event boundary detection, and related tasks. We propose a new text frame classification method that combines wavelet and median-moment features with k-means clustering to select probable text blocks among 16 equally sized blocks of a video frame. The same feature combination is used with a new Max-Min clustering at the pixel level to choose probable dominant text pixels within the selected blocks. For these pixels, a mutual-nearest-neighbor-based symmetry is explored with a four-quadrant formation centered at the centroid of the probable dominant text pixels to decide whether a block is a true text block. If a frame produces at least one true text block it is considered a text frame; otherwise it is a non-text frame. Experimental results on different text and non-text datasets, including two public datasets and our own data, show that the proposed method gives promising results in terms of recall and precision at both the block and frame levels. Further, we also show how existing text detection methods tend to misclassify non-text frames as text frames in terms of recall and precision at both levels.
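The block-selection stage can be sketched simply: split a frame into 16 equal blocks, compute a crude texture feature per block, and run 2-means to separate probable text blocks from the rest. Block variance here merely stands in for the paper's wavelet and median-moment features, and the frame is synthetic.

```python
# Simplified block-selection sketch: 16 blocks, texture proxies, 2-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
frame = rng.integers(0, 256, (128, 128)).astype(float)  # stand-in frame

blocks = [frame[i:i + 32, j:j + 32]                     # 4 x 4 = 16 blocks
          for i in range(0, 128, 32) for j in range(0, 128, 32)]
feats = np.array([[b.var(), np.abs(np.diff(b, axis=1)).mean()]
                  for b in blocks])                      # texture proxies

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feats)
text_cluster = km.cluster_centers_[:, 0].argmax()        # high-variance side
probable_text = [k for k, lab in enumerate(km.labels_) if lab == text_cluster]
print("probable text blocks:", probable_text)
```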

7.
An important issue in text mining is how to make use of multiple pieces of discovered knowledge to improve future decisions. In this paper, we propose a new approach to combining multiple sets of rules for text categorization using Dempster’s rule of combination. We develop a boosting-like technique for generating multiple sets of rules based on rough set theory, and model the classification decisions from these rule sets as pieces of evidence that can be combined by Dempster’s rule. We apply these methods to 10 of the 20-newsgroups, a benchmark data collection (Baker and McCallum 1998), individually and in combination. Our experimental results show that the performance of the best combination of the multiple rule sets on the 10 groups of the benchmark data is statistically significantly better than that of the best single set of rules. A comparative analysis between the Dempster–Shafer and majority voting (MV) methods, along with an overfitting study, confirms the advantage and robustness of our approach.
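Dempster's rule of combination, the evidence-pooling step named above, admits a compact worked example: two mass functions over the class hypotheses are combined, renormalizing by the conflict K. The masses and hypotheses below are illustrative.

```python
# Worked example of Dempster's rule: m12(A) = sum over B∩C=A of
# m1(B)*m2(C), renormalized by 1-K where K is the conflicting mass.
from itertools import product

def dempster(m1, m2):
    combined, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb                 # mass assigned to empty set
    return {h: v / (1 - conflict) for h, v in combined.items()}

A, B = frozenset({"sports"}), frozenset({"politics"})
theta = A | B                                    # frame of discernment
m1 = {A: 0.6, theta: 0.4}                        # evidence from rule set 1
m2 = {A: 0.3, B: 0.5, theta: 0.2}                # evidence from rule set 2
print(dempster(m1, m2))
```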

8.
Apple is a leading company in technological evolution and innovation. The company was founded in 1976, when it produced the Apple I computer. Since then, building on its innovative technologies, Apple has launched creative and innovative products and services such as the iPod, iTunes, the iPhone, the Apple App Store, and the iPad. Across many fields of academia and business, diverse studies of Apple’s technological innovation strategy have been performed. In this paper, we analyze Apple’s patents to better understand its technological innovation. We collected all patents filed by Apple to date and applied statistics and text mining for patent analysis. Using a graphical causal inference method, we derived the causal relations among Apple keywords preprocessed by text mining, and then fitted a semiparametric Gaussian copula regression model to see how the target response keyword and the predictor keywords relate to each other. Furthermore, Gaussian copula partial correlation was applied to the keywords to uncover the detailed dependence structure. Through these methods, this paper shows the technological trends and the relations between Apple’s technologies. This research could contribute to identifying vacant technology areas and central technologies for Apple’s R&D planning.
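The dependence-structure step can be illustrated in a simplified form: given a keyword frequency matrix (patents by keywords), partial correlations are read off the inverse covariance matrix. The paper uses a semiparametric Gaussian copula model; this plain-Gaussian version only shows the idea, and the data are synthetic.

```python
# Partial correlations from the precision matrix:
# rho_ij.(rest) = -P_ij / sqrt(P_ii * P_jj).
import numpy as np

rng = np.random.default_rng(0)
kw = ["touch", "display", "battery", "sensor"]
X = rng.poisson(3, (200, 4)).astype(float)      # stand-in keyword counts

P = np.linalg.inv(np.cov(X, rowvar=False))      # precision matrix
D = np.sqrt(np.outer(np.diag(P), np.diag(P)))
partial = -P / D                                 # partial correlations
np.fill_diagonal(partial, 1.0)

for i in range(4):
    for j in range(i + 1, 4):
        print(f"{kw[i]}-{kw[j]}: {partial[i, j]:+.2f}")
```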

9.
The quality of information provision considerably influences knowledge construction driven by individual users’ needs. In the design of information systems for e-learning, personal information requirements should be incorporated to determine a selection of suitable learning content, an instructive sequencing of that content, and its effective presentation. This is an important part of instructional design for a personalised information package. Current research reveals a lack of means by which individual users’ information requirements can be effectively incorporated to support personal knowledge construction. This paper presents a method that enables the articulation of users’ requirements based on grounded learning theories and requirements engineering paradigms. A user’s information requirements can be systematically encapsulated in a user profile (i.e. the user requirements space) and further transformed into instructional design specifications (i.e. the information space). These two spaces allow the discovery of information-requirement patterns for self-maintaining and self-adapting personalisation that enhances the experience of the knowledge construction process.

10.
Learning from past accidents is pivotal for improving safety in construction. However, hazard records are typically documented and stored as unstructured or semi-structured free text, making such data difficult to analyse. This study presents a novel and robust framework that combines deep learning and text mining technologies to analyse hazard records automatically. The framework comprises a four-step modelling approach: (1) identification of hazard topics using a Latent Dirichlet Allocation (LDA) model; (2) automatic classification of hazards using a Convolutional Neural Network (CNN); (3) production of a Word Co-occurrence Network (WCN) to determine the interrelations between hazards; and (4) quantitative keyword analysis using Word Cloud (WC) technology to provide a visual overview of hazard records. The proposed framework is validated by analysing hazard records collected from a large-scale transport infrastructure project. It is envisaged that the framework can provide managers with new insights and knowledge to better ensure positive safety outcomes in projects. The contributions of this research are threefold: (1) it demonstrates that the analysis of hazard records can be automated by combining deep learning and text mining; (2) hazards can be visualized using a systematic and data-driven process; and (3) hazard topics can be generated automatically and classified over specific time periods, enabling managers to understand their patterns of manifestation and put strategies in place to prevent them from recurring.
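Steps (1) and (3) of the framework can be sketched briefly: an LDA topic model is fitted over hazard records and a word co-occurrence network is built from them. The records, topic count, and stop-word handling are illustrative; steps (2) and (4), the CNN classifier and the word cloud, are omitted.

```python
# LDA hazard topics plus a simple word co-occurrence network (WCN).
from itertools import combinations
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

records = ["worker fell from scaffold edge", "crane load swung near workers",
           "unguarded edge on level two", "scaffold tie missing at edge"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(records)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for t, comp in enumerate(lda.components_):      # top words per hazard topic
    print("topic", t, [terms[i] for i in comp.argsort()[::-1][:3]])

cooc = Counter()                                 # word co-occurrence network
for r in records:
    words = set(r.split()) - {"from", "on", "at", "near"}
    cooc.update(frozenset(p) for p in combinations(sorted(words), 2))
print(cooc.most_common(3))                       # strongest WCN edges
```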

11.
Addressing the current state of, and the problems in, how construction enterprises manage and utilise data on completed building projects, this paper proposes data warehouse management as a new model for managing completed-building-project data. A three-tier architecture framework for a completed-building-project data warehouse is constructed, and the system's development method and process are explained. The application areas of the data warehouse are analysed from the perspectives of using OLAP to support project management decision making and of data mining based on the data warehouse, and project risk evaluation is used as an example to illustrate how the warehouse is applied. The aim is to provide construction enterprises with new ideas and methods for strengthening informatisation and improving their project management and risk analysis capabilities.
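The OLAP-style analysis proposed over such a warehouse can be pictured with a small roll-up: cost records for completed projects are aggregated by project type and year in a pivot table. The schema and figures below are invented for the example.

```python
# OLAP-style roll-up over a toy completed-projects fact table.
import pandas as pd

facts = pd.DataFrame({
    "project_type": ["residential", "residential", "bridge", "bridge"],
    "year":          [2019, 2020, 2019, 2020],
    "cost_overrun":  [0.05, 0.12, 0.20, 0.08],
    "duration_days": [320, 410, 560, 480],
})

cube = facts.pivot_table(index="project_type", columns="year",
                         values=["cost_overrun", "duration_days"],
                         aggfunc="mean")
print(cube)   # a roll-up a risk analyst could drill back down from
```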

12.
卢玲, 杨武, 杨有俊, 陈梦晗. 《计算机应用》 (Journal of Computer Applications), 2017, 37(12): 3498-3503
A Chinese news headline typically contains from one to a few dozen words; because of the small number of characters and the resulting feature sparsity, classification accuracy is hard to improve. To address this, a text semantic expansion method based on word embeddings is proposed. First, a headline is expanded into a triple (title, subtitle, topic words): the subtitle is constructed from synonyms of the title combined with part-of-speech filtering, and topic words are extracted by semantically composing the words within multi-scale sliding windows. Then, a convolutional neural network (CNN) classification model is built for the expanded text; the model performs feature filtering and prevents overfitting through max pooling and random dropout. Finally, the title and subtitle are concatenated into a two-word representation which, together with the multi-topic-word set, serves as the model's input. Experiments were conducted on the news headline classification dataset of the 2017 Natural Language Processing and Chinese Computing evaluation (NLP&CC2017). The results show that the triple expansion combined with the corresponding CNN model achieves a classification accuracy of 79.42% on headlines across 18 categories, 9.5% higher than the CNN model without expansion, and that the topic-word expansion speeds up model convergence, verifying the effectiveness of both the triple expansion method and the constructed CNN classification model.
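The kind of CNN classifier described above can be sketched in a few lines: an embedding layer over the expanded token sequence, one convolution, max pooling, and dropout. The 18 output classes follow the abstract; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal CNN text classifier sketch for expanded headlines.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab, seq_len, n_classes = 5000, 30, 18
model = keras.Sequential([
    layers.Embedding(vocab, 64),
    layers.Conv1D(128, 3, activation="relu"),   # n-gram feature maps
    layers.GlobalMaxPooling1D(),                # max pooling over positions
    layers.Dropout(0.5),                        # random dropout, as in the paper
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

x = np.random.randint(0, vocab, (8, seq_len))   # stand-in encoded headlines
y = np.random.randint(0, n_classes, (8,))
model.fit(x, y, epochs=1, verbose=0)            # smoke test on fake data
```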

13.
Stemming is a basic operation in natural language processing (NLP) that removes derivational and inflectional affixes without performing a full morphological analysis. It is essential for extracting the root or stem of a word. In NLP, stemmers are used to improve information retrieval (IR), text classification (TC), text mining (TM), and related applications. Existing Urdu stemmers utilize only unigram words from the input text, ignoring bigram, trigram, and other n-gram words. To improve the process and efficiency of stemming, bigram and trigram words must be included. Moreover, only a few Urdu stemming methods have been developed in past studies. In this paper, we therefore propose an improved Urdu stemmer that uses a hybrid approach, divided into a multi-step operation, to handle unigram, bigram, and trigram features. To evaluate the proposed method, we used two corpora: a word corpus and a text corpus. Two different evaluation metrics were applied to measure the performance of the proposed algorithm, which achieved an accuracy of 92.97% and a compression rate of 55%. These experimental results indicate that the proposed system can increase the effectiveness and efficiency of Urdu stemming for better information retrieval and text mining applications.
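The multi-step hybrid idea can be sketched as: look a token up in a stem lexicon first, strip affixes otherwise, and treat known bigrams as a unit before falling back to word-by-word stemming. The affix lists and lexicon below are transliterated placeholders, not real Urdu morphology.

```python
# Toy multi-step stemmer: lexicon lookup, affix stripping, bigram units.
PREFIXES = ["bad", "na"]
SUFFIXES = ["on", "en", "i"]
LEXICON = {"kitabon": "kitab"}          # known word -> stem
BIGRAM_STEMS = {("taleem", "yafta"): "taleem-yafta"}  # fixed expressions

def stem_word(w):
    if w in LEXICON:
        return LEXICON[w]               # step 1: lexicon lookup
    for p in PREFIXES:                  # step 2: strip one prefix
        if w.startswith(p) and len(w) - len(p) >= 3:
            w = w[len(p):]
            break
    for s in SUFFIXES:                  # step 3: strip one suffix
        if w.endswith(s) and len(w) - len(s) >= 3:
            return w[:-len(s)]
    return w

def stem_text(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAM_STEMS:        # bigrams handled before unigrams
            out.append(BIGRAM_STEMS[pair]); i += 2
        else:
            out.append(stem_word(tokens[i])); i += 1
    return out

print(stem_text(["kitabon", "taleem", "yafta", "larkiyen"]))
```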

14.
An Overview of Data Mining and Knowledge Discovery
With massive amounts of data stored in databases, mining information and knowledge from databases has become an important issue in recent research. Researchers in many different fields have shown great interest in data mining and knowledge discovery in databases. Several emerging applications in information-providing services, such as data warehousing and on-line services over the Internet, also call for various data mining and knowledge discovery techniques to understand user behavior better, improve the service provided, and increase business opportunities. In response to this demand, this article provides a comprehensive survey of recently developed data mining and knowledge discovery techniques and introduces some real application systems. In conclusion, the article also lists some problems and challenges for further research.

15.
In this paper, we propose a novel integrated framework combining association rule mining, case-based reasoning, and text mining that can be used to continuously improve service and repair in the automotive domain. The framework enables the identification of field anomalies that cause customer dissatisfaction and performs root-cause investigation of those anomalies. It also facilitates the identification of best practices in the field and learning from these best practices to achieve lean and effective service. Association rule mining is used for anomaly detection and root-cause investigation, while case-based reasoning in conjunction with text mining is used to learn from the best practices. The integrated system is implemented in a web-based distributed architecture and has been tested on real-life data.
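The case-based-reasoning leg of such a framework can be sketched as nearest-neighbor retrieval: past repair cases are indexed as TF-IDF vectors and the closest case is retrieved for a new complaint. The cases and fixes below are invented for the example.

```python
# CBR retrieval sketch: return the fix of the most similar past case.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cases = [("engine stalls at idle", "replace idle air control valve"),
         ("brakes squeal when cold", "resurface rotors, new pads"),
         ("battery drains overnight", "trace parasitic draw, replace relay")]

vec = TfidfVectorizer().fit([c[0] for c in cases])
case_matrix = vec.transform([c[0] for c in cases])

def retrieve(complaint):
    sims = cosine_similarity(vec.transform([complaint]), case_matrix)[0]
    best = sims.argmax()
    return cases[best][1], float(sims[best])

print(retrieve("car stalls when idling at lights"))
```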

16.
Automatic text classification is one of the most important tools in information retrieval. This paper presents a novel text classifier that uses positive and unlabeled examples. The primary challenge of this problem, compared with classical text classification, is that no labeled negative documents are available in the training set. First, we identify many more reliable negative documents using an improved 1-DNF algorithm with a very low error rate. Second, we build a set of classifiers by iteratively applying the SVM algorithm on a training data set that is augmented during each iteration. Third, unlike previous PU-oriented text classification work, we construct the final classifier from the weighted vote of all classifiers generated in the iteration steps, instead of choosing one of them. Finally, we discuss an approach based on Particle Swarm Optimization (PSO) for weighting this vote, which can discover the best combination of the weights. In addition, we built a focused crawler based on link contexts guided by different classifiers to evaluate our method. Several comprehensive experiments have been conducted using the Reuters data set and thousands of web pages. Experimental results show that our method improves performance (F1-measure) compared with PEBL, and that a focused web crawler guided by our PSO-based classifier outperforms several other classifiers in both harvest rate and target recall.
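A condensed sketch of the PU pipeline: build a positive feature set (a 1-DNF-style step), take unlabeled documents containing none of those features as reliable negatives, and train an SVM that can then grow the negative set in later rounds. The data and thresholds are illustrative, and the PSO-weighted vote over all iteration classifiers is omitted.

```python
# PU learning sketch: reliable negatives from positive features, then SVM.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

pos = ["great camera sharp lens", "lens zoom camera review"]
unlabeled = ["camera lens bundle deal", "parliament passes budget bill",
             "election results announced", "new lens for old cameras"]

vec = CountVectorizer().fit(pos + unlabeled)
P, U = vec.transform(pos).toarray(), vec.transform(unlabeled).toarray()

pos_feats = (P.sum(axis=0) > 0)                     # features seen in positives
reliable_neg = [i for i, row in enumerate(U)
                if not (row[pos_feats] > 0).any()]  # no positive feature at all

X = np.vstack([P, U[reliable_neg]])
y = np.array([1] * len(P) + [0] * len(reliable_neg))
clf = LinearSVC().fit(X, y)                         # first-round classifier

preds = clf.predict(U)                              # labels to grow the negatives
print("reliable negatives:", reliable_neg, "| round-2 labels:", preds)
```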

17.
杨世刚, 刘勇国. 《计算机应用》 (Journal of Computer Applications), 2022, 42(5): 1324-1329
Short text classification is an important research problem in natural language processing (NLP), widely applied in news classification, sentiment analysis, comment analysis, and other areas. To address the data sparsity problem in short text classification, NE-GAT, a graph attention network that fuses node and edge weight features, is proposed on the basis of the graph attention network (GAT) by introducing node and edge weight features of the corpus. First, a heterogeneous graph is constructed for each corpus, the importance of word nodes is evaluated with a gravity model (GM), and edge weights are obtained from the pointwise mutual information (PMI) between nodes. Second, a text-level graph is built for each sentence, and the node importance and edge weights are incorporated into the node update process. Experimental results show that the proposed model achieves an average accuracy of 75.48% on the test sets, outperforming models such as the graph convolutional network for text classification (Text-GCN), TL-GNN, and Text-ING; compared with the original GAT, the proposed model improves average accuracy by 2.32 percentage points, verifying its effectiveness.
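The PMI edge-weight construction used for the heterogeneous graph admits a short sketch: count word co-occurrences within a sliding window and compute PMI(i, j) = log(p(i, j) / (p(i) p(j))). The corpus and window size are illustrative; the gravity-model node scores and the GAT itself are not shown.

```python
# PMI edge weights from sliding-window co-occurrence counts.
import math
from collections import Counter
from itertools import combinations

sents = ["stocks fall on rate fears", "rate hike hits stocks",
         "team wins final match", "final match draws record crowd"]
window = 4

word_n, pair_n, total = Counter(), Counter(), 0
for s in sents:
    toks = s.split()
    for i in range(len(toks) - window + 1):
        win = set(toks[i:i + window])
        total += 1
        word_n.update(win)
        pair_n.update(frozenset(p) for p in combinations(sorted(win), 2))

def pmi(a, b):
    p_ab = pair_n[frozenset((a, b))] / total
    if p_ab == 0:
        return 0.0
    return math.log(p_ab / ((word_n[a] / total) * (word_n[b] / total)))

print(f"PMI(stocks, rate)  = {pmi('stocks', 'rate'):.2f}")
print(f"PMI(stocks, match) = {pmi('stocks', 'match'):.2f}")
```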

18.
A technology roadmap (TRM), an approach applied to the development of an emerging technology to meet business goals, is one of the most frequently adopted tools to support the process of technology innovation. Although many studies have dealt with TRMs designed primarily for a market-driven technology planning process, the technology-driven TRM is far less researched than its market-driven counterpart. Furthermore, approaches to technology-driven roadmapping using quantitative technological information have rarely been studied. The aim of this research is therefore to propose a new methodological framework to identify both profitable markets and promising product concepts based on technology information. This study suggests two quality function deployment (QFD) matrices for drawing up the TRM in order to find new business opportunities. A case study illustrates the proposed approach using patents on solar-lighting devices, which are catching on as a high-tech way to prevent environmental pollution and reduce fuel costs.
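The chaining of two QFD matrices can be illustrated numerically: technology-to-market relationship scores multiplied by market-to-product-concept scores rank product concepts for a given technology portfolio. All matrices, names, and weights below are invented for illustration.

```python
# Chained QFD matrices: technology -> market -> product concept.
import numpy as np

techs = ["LED optics", "solar cell", "battery mgmt"]
markets = ["garden lights", "road signage"]
concepts = ["solar path light", "off-grid sign panel"]

tech_market = np.array([[9, 3],        # QFD matrix 1: tech x market
                        [9, 9],
                        [3, 9]])
market_concept = np.array([[9, 1],     # QFD matrix 2: market x concept
                           [1, 9]])

tech_strength = np.array([0.5, 0.3, 0.2])   # assumed portfolio weights
market_scores = tech_strength @ tech_market
concept_scores = market_scores @ market_concept

for c, s in sorted(zip(concepts, concept_scores), key=lambda t: -t[1]):
    print(f"{c}: {s:.1f}")
```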

19.
This paper presents the combination of objective measurements and human perceptions using hidden Markov models, with particular reference to sequential data mining and knowledge discovery. Both human preferences and statistical analysis are utilized for the verification and identification of hypotheses as well as the detection of hidden patterns. From another theoretical viewpoint, this work attempts to formalize the complementarity of the computational theories of hidden Markov models and of perceptions, in order to provide solutions associated with the manipulation of the internet.
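The HMM machinery referenced above has a compact worked example: the forward algorithm, which computes the likelihood of an observation sequence under given model parameters. The two hidden states and the probabilities below are illustrative.

```python
# Forward algorithm: alpha_t = (alpha_{t-1} @ A) * B[:, o_t].
import numpy as np

pi = np.array([0.6, 0.4])             # initial state distribution
A = np.array([[0.7, 0.3],             # state transition matrix
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],             # emission probs: state x symbol
              [0.2, 0.8]])
obs = [0, 1, 1]                       # observed symbol sequence

alpha = pi * B[:, obs[0]]             # forward initialization
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]     # forward recursion
print("P(observations) =", alpha.sum())
```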

20.
This paper deals with an approach to the automatic construction and optimization of the knowledge mesh (KM) based on the user’s function requirements. Once a KM multiple set operation expression is obtained, a new KM can be inferred from the expression by the developed KM-based inference engine and transformed into its corresponding KMS (knowledgeable manufacturing system) software automatically by the developed automatic program construction software, so as to realize the self-reconfiguration of the KMS. Thus, the automatic construction and optimization of a KM multiple set operation expression is equivalent to the automatic construction and optimization of its corresponding KM and KMS software. To explore the automatic construction and optimization of a new KM from the user’s function requirements, an automatic construction procedure aiming at the user’s maximum function satisfaction is proposed. First, the fuzzy function-satisfaction degree relationships of the users’ requirements for the KM functions are defined, as are the multiple fuzzy function-satisfaction degrees of those relationships. Second, operations (union, intersection and minus) on both fuzzy and multiple fuzzy function-satisfaction degrees are proposed and clarified, along with a proof that there exists a one-to-one mapping between the KM multiple set operation expression and the KM function-satisfaction degree expression. Then, the optimization model of the KM multiple set operation expression is constructed and proven to be NP-hard. Finally, the expression is optimized by a hybrid genetic-tabu algorithm, with the steps of the KM’s automatic construction presented in detail. The approach is illustrated by an actual KM example corresponding to the management information system (MIS) software used in a vehicle body plant, and proves to be very effective.
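Because the optimization model is NP-hard, a metaheuristic search is used; a toy version of the genetic side follows, searching over binary "include this KM" vectors to maximize an assumed satisfaction score. The satisfaction function, penalty, and GA settings are invented, and the paper's tabu component and richer multiset expressions are not reproduced.

```python
# Toy genetic algorithm over KM-selection bit vectors.
import numpy as np

rng = np.random.default_rng(0)
n_kms = 8
value = rng.random(n_kms)              # satisfaction each KM contributes
cost = rng.random(n_kms)               # integration cost of each KM

def fitness(ind):
    return value @ ind - 0.5 * (cost @ ind)   # satisfaction minus penalty

pop = rng.integers(0, 2, (20, n_kms))
for _ in range(50):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]    # truncation selection
    cut = rng.integers(1, n_kms)               # one-point crossover
    kids = np.vstack([np.concatenate([a[:cut], b[cut:]])
                      for a, b in zip(parents, parents[::-1])])
    flip = rng.random(kids.shape) < 0.05       # mutation
    pop = np.vstack([parents, np.where(flip, 1 - kids, kids)])

best = pop[np.argmax([fitness(i) for i in pop])]
print("selected KMs:", np.flatnonzero(best), "score:", fitness(best))
```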
