Thesis Information

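Title (translated from Chinese):	A Method for Converting Unstructured Data into Structured Data in the Science and Technology Cloud
Name:	Ma Xiaorong (马晓荣)
Student ID:	1403121763
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	081203
Discipline name:	Computer Application Technology
Student type:	Master's
Degree:	Master of Engineering
University:	Xidian University (西安电子科技大学)
School:	School of Computer Science
Major:	Computer Technology
First supervisor:	Yu Bin (鱼滨)
First supervisor's affiliation:	Xidian University (西安电子科技大学)
Second supervisor:	Zhang Xiaohong (张晓红)
Completion date:	2017-06-14
Defense date:	2017-05-25
Title (English):	A method for converting unstructured data into structured data in scientific and technological cloud
Keywords (translated from Chinese):	Unstructured data ; Named entity recognition ; Relation extraction ; Science and technology cloud platform
Keywords (English):	Unstructured Data ; Named Entity Recognition ; Relationship Extraction ; Scientific and Technological Cloud Platform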
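Abstract (translated from Chinese):	In the era of big data, data has increasingly become an important productive force and strategic resource driving economic growth and social progress, and accelerating the open sharing of these data resources is both an inherent need and a strong driver of government transformation. To better share and link science and technology data and to make service management efficient and convenient, the Shaanxi provincial government, drawing on the rich science and technology resources accumulated in its work, set the goal of building an "Integrated Cloud Platform for Science and Technology Service Management". However, the vast majority of the raw data collected while building the science and technology cloud is unstructured text, so many resources cannot be used directly, and relying solely on manual work to extract useful information from massive data and convert it into structured data consumes too much time and labor to meet business needs. Motivated by this urgent need, this thesis proposes methods for converting unstructured data into structured data. The thesis first analyzes the requirements and characteristics of unstructured data processing in the science and technology cloud and compares the main conversion methods. Based on the actual situation, it adopts machine-learning-based entity relationship extraction to structure the unstructured data, and decomposes the conversion process into three key problems: word segmentation and part-of-speech tagging, named entity recognition, and entity relationship extraction. The thesis concentrates on algorithm research and implementation for the two core tasks, named entity recognition and entity relationship extraction. For unstructured text that has been classified and parsed, the thesis first uses the NLPIR (Natural Language Processing and Information Retrieval) automatic word segmentation toolkit from the Chinese Academy of Sciences for preprocessing such as word segmentation and part-of-speech tagging. It then performs named entity recognition by combining rules with CRFs (Conditional Random Fields): simple entities such as numeric and time expressions are recognized with external resource tables combined with rules, while other, more complex entity types such as person names, place names and organization names are recognized by combining CRFs with rules. The thesis also designs specific feature templates for the different entity types, obtains the best window size through experimental tuning, estimates the model parameters with the L-BFGS algorithm, and implements the algorithm with the CRF++ toolkit. Finally, entity relationship extraction is carried out with the unsupervised K-means clustering algorithm, in which the choice of K, the selection of initial cluster centers, and the handling of isolated points are optimized according to practical needs, basically achieving the goal of obtaining structured data from unstructured data. The thesis also tests the implemented functions and analyzes the results. The test results show that the proposed named entity recognition and relationship extraction algorithms can effectively extract the required data from science and technology cloud text and integrate it into structured data. The research provides technical support for data processing in the construction of the science and technology cloud, reduces the workload of manual data processing, improves processing speed and efficiency, and has some practical value.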
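The abstract above states that simple entities (numeric and time expressions) are recognized with external resource tables combined with rules, but the record does not give the rules themselves. The following is only a minimal sketch of that idea in Python. The unit table `UNIT_WORDS`, both regular expressions, the labels `TIME` and `NUM`, and the function name `find_simple_entities` are all illustrative assumptions, not the thesis's actual resources or implementation.

```python
import re

# Hypothetical resource table of unit words (an assumption for illustration;
# the thesis's actual external resource tables are not given in this record).
UNIT_WORDS = ["万元", "亿元", "元", "项", "人", "%"]

# Assumed rule for time expressions: 2017年6月14日, 2017年6月, 2017年, 2017-06-14.
TIME_PATTERN = re.compile(
    r"\d{4}年(?:\d{1,2}月(?:\d{1,2}日)?)?|\d{4}-\d{1,2}-\d{1,2}"
)

# Assumed rule for numeric expressions: a number followed by a unit from the table.
# Longer units are tried first so that "万元" is preferred over "元".
NUMBER_PATTERN = re.compile(
    r"\d+(?:\.\d+)?(?:"
    + "|".join(re.escape(u) for u in sorted(UNIT_WORDS, key=len, reverse=True))
    + ")"
)


def find_simple_entities(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    entities = []
    covered = set()
    for match in TIME_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), "TIME", match.group()))
        covered.update(range(match.start(), match.end()))
    for match in NUMBER_PATTERN.finditer(text):
        # Skip numbers that overlap an already recognized time expression.
        if covered.isdisjoint(range(match.start(), match.end())):
            entities.append((match.start(), match.end(), "NUM", match.group()))
    return sorted(entities)


if __name__ == "__main__":
    sample = "该项目于2017年6月14日立项，经费50万元，参与12人。"
    for entity in find_simple_entities(sample):
        print(entity)
```

Time expressions are matched first so that digits inside a date are not also reported as quantities. Complex entities such as person or organization names would not be handled this way; per the abstract, those go through the CRF-based recognizer.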
Abstract (English):	In the era of big data, data has become an important productive force and strategic resource driving economic growth and social progress, and speeding up the opening and sharing of data is both an inherent demand and a strong driving force of government transformation. To share science and technology data more effectively and to make management and services more efficient, the government of Shaanxi Province draws on the rich scientific and technological resources accumulated in its daily work and has set the goal of constructing "the Integrated Cloud Platform of the Technology Management and Service". Most of the data collected for the platform, however, is unstructured text that cannot be used directly, and extracting useful information from the massive data by manual means alone takes too much time to meet business needs. Based on this urgent need for unstructured data conversion, this thesis proposes a method for converting unstructured data into structured data with information extraction technology. The thesis analyzes the relevant requirements for unstructured data in the cloud platform and compares the main methods of unstructured data conversion. According to the actual situation, it adopts a machine-learning-based entity relationship extraction method, and decomposes the conversion into three key issues: word segmentation and POS (Part of Speech) tagging, named entity recognition, and entity relationship extraction. The thesis focuses on the research and implementation of the algorithms for named entity recognition and entity relationship extraction. The NLPIR (Natural Language Processing and Information Retrieval) automatic word segmentation kit of the Chinese Academy of Sciences is used for preprocessing such as word segmentation and POS tagging. The thesis then combines rules with CRFs (Conditional Random Fields) to recognize named entities. Specifically, external resource tables and rules are used to recognize simple named entities such as numeric and time expressions, and the CRFs model is introduced to recognize complex named entities such as organization names and person names. In addition, specific feature templates are designed for the various entity types, and the optimal window size is obtained by experimental tuning. The algorithm is implemented on the basis of the CRF++ toolkit. Finally, an improved K-means clustering algorithm is presented to extract the relationships between entities; the improvements concern the value of K, the initial cluster centers and the isolated points, and the algorithm basically achieves the goal. The named entity recognition and relationship extraction algorithms for unstructured data conversion are studied, implemented and tested in this thesis. Experimental results show that the proposed named entity recognition and relationship extraction algorithms can extract useful information from unstructured data. The work provides technical support for data processing in the construction of the cloud platform, reduces the workload of manual data processing, improves processing speed and efficiency, and has a certain practical application value.
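The abstract describes K-means clustering for relationship extraction with adjustments to the initial cluster centers and to isolated points, but it does not say how those adjustments were made. The sketch below, in Python, is therefore only one plausible reading. It swaps in farthest-point initialization and a nearest-neighbor radius filter, both of which are assumptions of this example and not the thesis's method. The feature vectors are hypothetical 2-D points standing in for vectorized entity-pair contexts, and all function names are illustrative.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drop_isolated_points(points, radius):
    """Assumed outlier rule: keep points with at least one neighbor within radius."""
    return [
        p for i, p in enumerate(points)
        if any(i != j and distance(p, q) <= radius for j, q in enumerate(points))
    ]


def farthest_point_init(points, k):
    """Assumed init rule: start from the first point, then repeatedly add the
    point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=100):
    """Plain Lloyd iterations; returns (labels, centers)."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


if __name__ == "__main__":
    # Hypothetical feature vectors; (50, 50) plays the role of an isolated point.
    vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (50.0, 50.0)]
    kept = drop_isolated_points(vectors, radius=2.0)
    labels, centers = kmeans(kept, k=2)
    print(labels, centers)
```

Deterministic initialization makes repeated runs give the same clusters, which is one common reason to replace random center selection. Choosing K itself is left to the caller here, since the record does not state how the thesis selects it.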
CLC (Chinese Library Classification) number:	11
Collection number:	11-35228
Release date:	2017-12-15
