Similar documents
 Found 20 similar documents (search time: 31 ms)
1.

Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to propose a new methodology for hyperparameter selection which is specifically oriented to optimizing the similarity metrics emanating from the topic model. To do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another obtained from metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of these techniques in STI policy analysis and design.
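As a minimal sketch of the first metric, the snippet below builds cosine-similarity graphs from document-topic distributions and scores run-to-run variability as the mean absolute difference of edge weights. The Dirichlet-sampled matrices stand in for the outputs of repeated LDA runs; the function names and the exact variability definition are illustrative assumptions, not the paper's own formulation.

```python
import numpy as np

def similarity_graph(doc_topic):
    # cosine similarity between every pair of document-topic distributions
    unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
    return unit @ unit.T

def run_variability(runs):
    # mean absolute edge-weight difference over all pairs of runs (assumed metric)
    graphs = [similarity_graph(r) for r in runs]
    diffs = [np.abs(graphs[i] - graphs[j]).mean()
             for i in range(len(graphs)) for j in range(i + 1, len(graphs))]
    return float(np.mean(diffs))

# stand-ins for three LDA runs: 100 documents, 10 topics
rng = np.random.default_rng(0)
runs = [rng.dirichlet(np.ones(10), size=100) for _ in range(3)]
variability = run_variability(runs)
print(variability)
```

A low score for a given number of topics would suggest that the similarity graph is stable across random initializations.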


2.
A knowledge organization system (KOS) can help easily reveal the deep knowledge structure of a patent document set. Compared to classification code systems, a personalized KOS made up of topics can represent technology information in a more agile, detailed manner. This paper presents an approach to automatically construct a KOS of patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering, and Principal Components Analysis (PCA). Term clumping is adopted to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and topic matrices, we generate new upper-level topics and compute the relationships between topics to construct the KOS. Finally, documents are mapped to the KOS. The nodes of the KOS are topics, represented by terms and their weights, and the leaves are patent documents. We evaluated the approach on a set of Large Aperture Optical Elements (LAOE) patent documents as an empirical study and constructed the LAOE KOS. The method discovered deep semantic relationships between the topics and helped better describe the technology themes of LAOE. Based on the KOS, two types of applications were implemented: automatic classification of patent documents and categorical refinement of search results.
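The clustering step of such a pipeline can be sketched as follows: raw LDA topics (rows of a topic-term matrix) are compressed with PCA and grouped with K-Means to form upper-level KOS nodes. The random matrix stands in for a real LDA output, and the component and cluster counts are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
topic_term = rng.random((20, 50))  # stand-in: 20 raw LDA topics over 50 terms

reduced = PCA(n_components=5).fit_transform(topic_term)            # compress topics
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)

# each cluster of raw topics becomes one upper-level node of the KOS
upper_topics = {c: np.where(labels == c)[0].tolist() for c in range(4)}
print(upper_topics)
```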

3.
The number of citations of journal papers is an important measure of the impact of research. Thus, the modeling of citation behavior needs attention. Burrell, Egghe, Rousseau and others pioneered this type of modeling. Several models have been proposed for the citation distribution. In this note, we derive the most comprehensive collection of formulas for the citation distribution, covering some 17 flexible families. The corresponding estimation procedures are also derived by the method of moments. We feel that this work could serve as a useful reference for the modeling of citation behavior.
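As an illustration of the method of moments mentioned above, the sketch below fits a negative binomial distribution (one family often used for citation counts) by matching the sample mean and variance. The parameterization shown is one common convention and is an assumption here, not a formula taken from the note.

```python
def fit_negbin_moments(citations):
    """Method-of-moments fit of a negative binomial to citation counts."""
    n = len(citations)
    m = sum(citations) / n                                # sample mean
    v = sum((c - m) ** 2 for c in citations) / (n - 1)    # sample variance
    if v <= m:
        raise ValueError("variance must exceed mean for a negative binomial")
    p = m / v               # success probability
    r = m * m / (v - m)     # dispersion (number-of-successes) parameter
    return r, p

r, p = fit_negbin_moments([0, 1, 1, 2, 3, 5, 8, 13])
print(r, p)
```

Under this parameterization the fitted mean r(1 - p)/p reproduces the sample mean exactly.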

4.
In this paper, we investigate whether a semantic representation of patent documents provides added value for a multi-dimensional visual exploration of a patent landscape compared to traditional approaches that use tf–idf (term frequency–inverse document frequency). Word embeddings from a pre-trained word2vec model created from patent text are used to calculate pairwise similarities in order to represent each document in the semantic space. Then, a hierarchical clustering method is applied to create several semantic aggregation levels for a collection of patent documents. For visual exploration, we have seamlessly integrated multiple interaction metaphors that combine semantics and additional metadata for improving hierarchical exploration of large document collections.

5.
An extended latent Dirichlet allocation (LDA) model is presented in this paper for patent competitive intelligence analysis. After part-of-speech tagging and defining noun phrase extraction rules, technological words are extracted from patent titles and abstracts. This allows us to go one step further and perform patent analysis at the content level. The LDA model is then used to identify underlying topic structures based on latent relationships among the extracted technological words. This helps us review research hot spots and directions in subclasses of patented technology in a given field. To extend the traditional LDA model, an additional institution-topic probability level is added to the original model. Directly competing enterprises' distribution probabilities and their technological positions are identified in each topic. A case study is then carried out on one of the core patented technologies in next-generation telecommunications: LTE. This empirical study reveals emerging hot spots of LTE technology and finds that major companies in this field have focused on different technological fields with different competitive positions.
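The added institution-topic level can be approximated, in spirit, by aggregating LDA document-topic probabilities over each institution's patents. This is a hedged numpy sketch with invented values, not the paper's extended inference procedure.

```python
import numpy as np

# stand-in LDA document-topic probabilities for four patents, two topics
doc_topic = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]])
institution = np.array([0, 0, 1, 1])  # owning enterprise of each patent

# average (then renormalize) each institution's document-topic rows
inst_topic = np.vstack([doc_topic[institution == i].mean(axis=0) for i in range(2)])
inst_topic /= inst_topic.sum(axis=1, keepdims=True)
print(inst_topic.round(2))
```

Comparing rows then shows which topics each enterprise concentrates on, i.e., its technological position relative to competitors.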

6.
In this short communication, we provide an overview of a relatively new source of altmetrics data which could possibly be used for societal impact measurement in scientometrics. Recently, Altmetric (a start-up providing publication-level metrics) started to make available data for publications that have been mentioned in policy-related documents. Using data from Altmetric, we study how many papers indexed in the Web of Science (WoS) are mentioned in policy-related documents. We find that less than 0.5% of the papers published in different subject categories are mentioned at least once in policy-related documents. Based on our results, we recommend that the analysis of (WoS) publications with at least one policy-related mention be repeated regularly (annually) in order to check the usefulness of the data. Mentions in policy-related documents should not be used for impact measurement until new policy-related sites are tracked.

7.
Ping, Qing; Chen, Chaomei. Scientometrics 2018, 116(3): 1887-1944

The continuing growth of scientific publications has posed a double challenge to researchers: not only to grasp the overall research trends in a scientific domain, but also to get down to the research details embedded in a collection of core papers. Existing work on science mapping provides multiple tools to visualize research trends in a domain at the macro level, and work from the digital humanities has proposed text visualization of documents, topics, sentences, and words at the micro level. However, existing micro-level text visualizations are not tailored to scientific paper corpora and cannot support meso-level scientific reading, which aligns a set of core papers based on their research progress before drilling down to individual papers. To bridge this gap, the present paper proposes LitStoryTeller+, an interactive system under a unified framework that supports both meso-level and micro-level visual storytelling for scientific papers. More specifically, we use entities (concepts and terminologies) as basic visual elements and visualize entity storylines across papers and within a paper, borrowing metaphors from screenplays. To identify entities and entity communities, named entity recognition and community detection are performed. We also employ a variety of text mining methods, such as extractive text summarization and comparative sentence classification, to provide rich textual information supplementary to our visualizations. We also propose a top-down story-reading strategy that best takes advantage of our system. Two comprehensive hypothetical walkthroughs exploring documents from the computer science and history domains demonstrate the effectiveness of our story-reading strategy and the usefulness of LitStoryTeller+.


8.
The collaborative coefficient (CC) is a measure of collaboration in research that reflects both the mean number of authors per paper and the proportion of multi-authored papers. Although it lies between 0 and 1, and is 0 for a collection of purely single-authored papers, it is not 1 when all papers are maximally authored, i.e., when every publication in the collection has all authors in the collection as co-authors. We propose a simple modification of CC, which we call the modified collaboration coefficient (MCC for short), which improves its performance in this respect.
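A hedged sketch of both coefficients: the CC formula follows the standard definition from the collaboration literature, while the MCC scaling factor A/(A-1), where A is the number of distinct authors in the collection, is our reading of the proposed modification and may differ from the paper's exact form.

```python
from collections import Counter

def collaborative_coefficient(authors_per_paper):
    # CC = 1 - (1/N) * sum over j of f_j / j, where f_j papers have j authors
    n = len(authors_per_paper)
    freq = Counter(authors_per_paper)
    return 1.0 - sum(f / j for j, f in freq.items()) / n

def modified_cc(authors_per_paper, total_authors):
    # rescale so that maximal co-authorship yields exactly 1 (assumed form)
    a = total_authors
    return (a / (a - 1)) * collaborative_coefficient(authors_per_paper)

papers = [1, 2, 3, 3]  # number of authors on each paper
print(collaborative_coefficient(papers))
```

For the example collection, CC = 1 - (1/4)(1/1 + 1/2 + 1/3 + 1/3) = 11/24.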

9.
This paper offers an overview of the bibliometric study of the domain of library and information science (LIS), with the aim of giving a multidisciplinary perspective on the topical boundaries and the main areas and research tendencies. Based on a retrospective and selective search, we obtained the bibliographical references (title and abstract) of academic production on LIS in the LISA database for the period 1978–2014, comprising 92,705 documents. Using the statistical technique of topic modeling, we apply latent Dirichlet allocation in order to identify the main topics and categories in the corpus of documents analyzed. The quantitative results reveal the existence of 19 important topics, which can be grouped into four main areas: processes, information technology, libraries, and specific areas of information application.

10.
The growing collection of scientific data in various web repositories is referred to as Scientific Big Data, as it fulfills the four "V's" of Big Data: volume, variety, velocity, and veracity. This phenomenon has created new opportunities for startups; for instance, the extraction of pertinent research papers from enormous knowledge repositories using innovative methods has become an important task for researchers and entrepreneurs. Traditionally, the content of the papers is compared to list the relevant papers from a repository. The conventional method results in a long list of papers that is often impossible to interpret productively. Therefore, the need for a novel approach that intelligently utilizes the available data is evident. Moreover, the primary element of the scientific knowledge base is the research article, which consists of logical sections such as the Abstract, Introduction, Related Work, Methodology, Results, and Conclusion. This study therefore utilizes these logical sections of research articles, because they hold significant potential for finding relevant papers. Comprehensive experiments were performed to determine the role of a logical-sections-based term indexing method in improving the quality of results (i.e., retrieving relevant papers). We proposed, implemented, and evaluated the logical-sections-based content comparison method against a standard term indexing method. The section-based approach outperformed the standard content-based approach in identifying relevant documents across all classified topics of computer science, extracting 14% more relevant results from the entire dataset. As the experimental results suggest that a finer content similarity technique improves the quality of results, the proposed approach has laid the foundation for knowledge-based startups.
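A hedged sketch of section-aware matching: score a candidate paper against a query section by section with tf-idf cosine similarity, then combine the scores with per-section weights. The section names, texts, and weights are invented for illustration; the study's actual indexing scheme may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = {"abstract": "topic models for scientific text",
         "methodology": "gibbs sampling inference for topic models"}
candidate = {"abstract": "topic models for patent documents",
             "methodology": "variational inference for topic models"}
weights = {"abstract": 0.4, "methodology": 0.6}  # illustrative section weights

score = 0.0
for section, w in weights.items():
    # fit a vectorizer per section so terms are weighted within that section
    vec = TfidfVectorizer().fit([query[section], candidate[section]])
    tfidf = vec.transform([query[section], candidate[section]])
    score += w * float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
print(round(score, 3))
```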

11.
This study evaluates trends in the quality of nanotechnology and nanoscience papers produced by South Korean authors. The metric used to gauge quality is the ratio of highly cited nanotechnology papers to total nanotechnology papers produced in sequential time frames. In the first part of this paper, citations (and publications) for nanotechnology documents published by major producing nations and major producing global institutions in four uneven time frames are examined. All nanotechnology documents in the Science Citation Index [SCI, 2006] for 1998, 1999–2000, 2001–2002, and 2003 were retrieved and analyzed in March 2007. In the second part, all nanotechnology documents produced by South Korean institutions (each document having at least one author with a South Korean address) in each of the above time frames were retrieved and examined. The South Korean institutions were extracted, and their fraction of total highly cited documents was compared to their fraction of total published documents. Non-Korean institutions that co-authored papers were included as well, to offer some perspective on the value of collaboration.

12.
13.
An array of 20 compositionally different carbon black–polymer composite chemiresistor vapor detectors was challenged under laboratory conditions to discriminate between a pair of extremely similar pure analytes (H2O and D2O), compositionally similar mixtures of pairs of compounds, and low concentrations of vapors of similar chemicals. Several discriminant algorithms were utilized, including k nearest neighbors (kNN, with k = 1), linear discriminant analysis (LDA, or Fisher's linear discriminant), quadratic discriminant analysis (QDA), regularized discriminant analysis (RDA, a hybrid of LDA and QDA), partial least squares, and soft independent modeling of class analogy (SIMCA). H2O and D2O were perfectly classified by most of the discriminants when separate training and test sets were used. As expected, discrimination performance decreased as the analyte concentration decreased and as the compositions of the analyte mixtures became more similar. RDA was the overall best-performing discriminant, and LDA was the best-performing discriminant that did not require several cross-validations for optimization.
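The comparison of discriminants can be reproduced in miniature with scikit-learn. The Gaussian clusters below stand in for sensor-array responses to two analytes; the accuracies apply only to this toy data, not to the chemiresistor experiments.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# two simulated 4-sensor response classes (e.g., H2O-like vs D2O-like patterns)
X = np.vstack([rng.normal(0.0, 0.3, (40, 4)), rng.normal(1.0, 0.3, (40, 4))])
y = np.array([0] * 40 + [1] * 40)
train, test = np.r_[0:30, 40:70], np.r_[30:40, 70:80]  # separate train/test sets

accuracies = {}
for clf in (KNeighborsClassifier(n_neighbors=1),   # kNN with k = 1
            LinearDiscriminantAnalysis(),          # Fisher's LDA
            QuadraticDiscriminantAnalysis()):      # QDA
    accuracies[type(clf).__name__] = clf.fit(X[train], y[train]).score(X[test], y[test])
print(accuracies)
```

With well-separated classes like these, all three discriminants classify the held-out samples essentially perfectly.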

14.

The document relational network has been effective in retrieving and evaluating papers. Despite their effectiveness, relational measures, including co-citation, are far from ideal and need improvement. The assumption underlying the co-citation relation is the content relevance and opinion relatedness of cited and citing papers. This may imply the existence of some kind of co-opinionatedness between co-cited papers, which may be effective in improving the measure. Therefore, the present study tests the existence of this phenomenon and its role in improving information retrieval. To do so, based on CITREC, a medical test collection was developed consisting of 30 queries (seed documents) and 4823 of their co-cited papers. Using NLP techniques, the co-citances of the queries and their co-cited papers were analyzed and their similarities computed with a 4-gram similarity measure. Opinion scores were extracted from the co-citances using SentiWordNet. Also, nDCG values were calculated and then compared in terms of the citation proximity index (CPI) and co-citedness measures before and after being normalized by the co-opinionatedness measure. The reliability of the test collection was measured by generalizability theory. The findings suggest that a majority of the co-citations exhibited a high level of co-opinionatedness, in that they were mostly similar either in their opinion strengths or in their polarities. Although anti-polar co-citations were not trivial in number, a significantly higher number of the co-citations were co-polar, with a majority being positive. The evaluation of normalizing the CPI and co-citedness by the co-opinionatedness indicated a generally significant improvement in retrieval effectiveness. While anti-polar similarity reduced the effectiveness of the measure, co-polar similarity proved effective in improving the co-citedness. Consequently, co-opinionatedness can be presented as a new document relation and used as a normalization factor to improve retrieval performance and research evaluation.
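For concreteness, one simple reading of a 4-gram similarity on co-citance text is the Jaccard overlap of character 4-grams; this implementation is an assumption for illustration and not necessarily the study's exact measure.

```python
def char_ngrams(text, n=4):
    # the set of character n-grams of a lowercased string
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def four_gram_similarity(a, b):
    # Jaccard overlap of character 4-grams
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

sim = four_gram_similarity("co-citation analysis of papers",
                           "co-citation context of papers")
print(sim)
```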


15.
In the literature there are only a few papers concerned with classification methods for multi-way arrays. The most common procedure, by far, is to unfold the multi-way data array into an ordinary matrix and then apply traditional multivariate tools for classification. As opposed to unfolding the data, several possibilities exist for building classification models based more directly on the multi-way structure of the data. As an example, multi-way partial least squares discriminant analysis has been used as a supervised classification method; another alternative that has been investigated is to perform classification using Fisher's LDA or SIMCA on the score matrix from e.g. a PARAFAC or a Tucker model. Despite a few attempts at applying such multi-way classification approaches, no one has looked into how such models are best built and implemented.

In this work, the SIMCA method is extended to three-way arrays. Also included is actual code that will work on general multi-way arrays rather than just three-way arrays. In analogy with two-way SIMCA, a decomposition model is built separately for the multi-way data of each class, using a multi-way decomposition method such as PARAFAC or Tucker3. In the choice of the best class dimensionality, i.e. the number of latent factors, the results of cross-validation but mainly the sensitivity/specificity values are evaluated. To estimate the class limits for each class model, orthogonal and score distances are considered, and different statistics are implemented and tested to set confidence limits for these two parameters. Classification performance using different definitions of class boundaries and classification rules, including the use of cross-validated residuals and scores, is compared.

The proposed N-SIMCA methodology and code have been tested on simulated data sets of varying dimensionality and on two case studies concerning food authentication tasks for typical food products.
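The orthogonal-distance idea carries over directly from two-way SIMCA. A minimal two-way sketch (the multi-way case would replace the SVD with a PARAFAC or Tucker3 decomposition, which is beyond this snippet) builds a per-class subspace and measures the residual of a sample:

```python
import numpy as np

def fit_class_model(X, k):
    # per-class model: class mean plus the first k principal loadings
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k]

def orthogonal_distance(x, model):
    # norm of the residual after projecting x onto the class subspace
    mu, load = model
    centered = x - mu
    residual = centered - load.T @ (load @ centered)
    return float(np.linalg.norm(residual))

rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, (30, 5))   # training samples of one class
model = fit_class_model(class_a, k=2)
print(orthogonal_distance(class_a[0], model))
```

A sample is accepted into the class when its orthogonal (and score) distance falls under a confidence limit estimated from the training distances.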

16.
Traditional topic models have been widely used for analyzing semantic topics in electronic documents. However, the topic words they produce are often poor in readability and consistency; frequently, only domain experts can guess their meaning. In practice, phrases are the main unit by which people express semantics. This paper presents a Distributed Representation-Phrase Latent Dirichlet Allocation (DRPhrase LDA) model, which is a phrase topic model. Specifically, we enhance the semantic information of phrases via distributed representations in this model. The experimental results show that the topics acquired by our model are more readable and consistent than those of other similar topic models.

17.
Paul Erdős was a world-famous Hungarian mathematician, who passed away in September 1996. Documents on the World Wide Web mentioning Paul Erdős's name were systematically collected. These documents were categorized using the method of content analysis. This work enables us to draw some conclusions about the ways authors of Internet documents picture Paul Erdős. This is the first work we know of that thoroughly examines the content of a large collection of documents on a specific topic on the Internet.

18.
We present a new approach to study the structure of the impact factor of academic journals. The method is based on calculating the fraction of citations contributing to the impact factor of a given journal that come from citing documents in which at least one of the authors is a member of the cited journal's editorial board. We studied the structure of three annual impact factors of 54 journals included in the groups "Education and Educational Research" and "Psychology, Educational" of the Social Sciences Citation Index. The percentage of citations from papers authored by editorial board members ranged from 0% to 61%. In 12 journals, for at least one of the years analysed, 50% or more of the citations that contributed to the impact factor were from documents published in the journal itself. Given that editorial board members are considered to be among the most prestigious scientists, we suggest that citations from papers authored by editorial board members should be given particular consideration.

19.
Scientific collaboration in China as reflected in co-authorship (cited 5 times: 0 self-citations, 5 citations by others)
A chronically weak area in research papers, reports, and reviews is the complete identification of the background documents that formed the building blocks for these papers. A method for systematically determining these seminal references is presented. Citation-Assisted Background (CAB) is based on the assumption that seminal documents tend to be highly cited. CAB is presently being applied to three application studies, and the results so far are much superior to those obtained by the first author for background development in any other study. An example of the application of CAB to the field of Nonlinear Dynamics is outlined. While CAB is a highly systematic approach for identifying seminal references, it is not a substitute for the judgement of the researchers, and serves as a supplement.

20.
We present a case study of how scientometric tools can reveal the structure of scientific theory in a discipline. Specifically, we analyze the patterns of word use in the discipline of cognitive science using latent semantic analysis, a well-known semantic model, on the abstracts of over a thousand academic papers relevant to these theories. Our results show that it is possible to link these theories with specific statistical distributions of words in the abstracts of papers that espouse them. We show that theories have different patterns of word use, and that their similarity relationships with each other are intuitive and informative. Moreover, we show that it is possible to predict fairly accurately the theory of a paper by constructing a model of the theories based on their distributions of word use. These results may open new avenues for the application of scientometric tools to theoretical divides.
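As a minimal illustration of this kind of analysis, the sketch below applies latent semantic analysis (tf-idf followed by truncated SVD) to three toy abstracts and compares them in the latent space. The texts and the number of latent dimensions are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "connectionist network learns distributed representations",
    "neural network learns distributed representations",
    "symbolic rules and logic govern cognition",
]
tfidf = TfidfVectorizer().fit_transform(abstracts)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
sims = cosine_similarity(lsa)  # theory-to-theory similarity in the latent space
print(sims.round(2))
```

Abstracts espousing the same "theory" (the first two) land close together in the latent space, while the third remains distant.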


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号-23
