Realization of Extraction of Academic Papers Information Based on PDFBox
Cite this article: NIU Yong-jie, XUE Su-qin. Realization of Extraction of Academic Papers Information Based on PDFBox[J]. Microcomputer Development, 2014(12): 61-63.
Authors: NIU Yong-jie  XUE Su-qin
Affiliation: College of Mathematics and Computer Science, Yan'an University, Yan'an 716000, Shaanxi, China
Funding: Natural Science Basic Research Plan of Shaanxi Province (2013JM8042)
Abstract: Studying academic trends, hot topics, and the direction of academic development requires data mining over research papers, and the first step is to extract the information of interest from a massive collection of papers. Because most academic papers are currently distributed in PDF format, this work examines the PDF file format and the available techniques for operating on PDF files, and uses the open-source library PDFBox to extract information from PDF papers according to rules. The extracted information mainly includes the title, authors, affiliation, keywords, publication date, and abstract. Finally, the accuracy of the extracted information is measured, which supports big-data research on academic work.

Keywords: data mining  information extraction  PDF format  academic papers
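
The rule-based extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the paper's plain text has already been obtained, for example with PDFBox's `PDFTextStripper.getText`, and it uses only the Java standard library. The class name `PaperFieldExtractor` and the three rules (first non-blank line as title, an "Abstract:" label, a "Keywords:" label) are hypothetical illustrations, not the rules used by the authors.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PaperFieldExtractor {

    // Hypothetical rule: the abstract runs from an "Abstract" label (ASCII or
    // full-width colon) up to the line that starts the keywords, or to the end.
    private static final Pattern ABSTRACT = Pattern.compile(
        "(?is)Abstract\\s*[:\\uFF1A]\\s*(.+?)(?=\\R\\s*Key\\s*words\\s*[:\\uFF1A]|\\z)");

    // Hypothetical rule: keywords are the rest of the line after a "Keywords" label.
    private static final Pattern KEYWORDS = Pattern.compile(
        "(?im)^\\s*Key\\s*words\\s*[:\\uFF1A]\\s*(.+)$");

    /** Applies the rules to already-extracted text and returns the fields found. */
    public static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();

        // Hypothetical rule: the first non-blank line is taken as the title.
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                fields.put("title", line.trim());
                break;
            }
        }

        Matcher a = ABSTRACT.matcher(text);
        if (a.find()) {
            // Collapse line breaks left over from the PDF page layout.
            fields.put("abstract", a.group(1).replaceAll("\\s+", " ").trim());
        }

        Matcher k = KEYWORDS.matcher(text);
        if (k.find()) {
            fields.put("keywords", k.group(1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "A Sample Title\nAuthor One\n"
            + "Abstract: First line\nsecond line.\n"
            + "Keywords: pdf; mining\n";
        extract(sample).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Fields whose rule does not match are simply absent from the returned map, which makes it straightforward to count misses when estimating extraction accuracy over a corpus.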
This article is indexed in VIP (维普) and other databases.