
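Mineral resource entity recognition considering multiple features of Chinese characters
Cite this article: Liu Zhihao, Jin Xiangguo, Qiu Qinjun, Tao Liufeng, Huang Zhen, Xie Zhong. 2023. Mineral resource entity recognition considering multiple features of Chinese characters. Chinese Journal of Geology (地质科学), 58(4): 1535-1553. doi: 10.12017/dzkx.2023.084
Authors: Liu Zhihao  Jin Xiangguo  Qiu Qinjun  Tao Liufeng  Huang Zhen  Xie Zhong
Affiliations: 1. National Engineering Research Center of Geographic Information System, Wuhan 430074; 2. School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430074; 3. Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen, Guangdong 518034; 4. National and Local Joint Engineering Laboratory of Geographic Information System, Wuhan 430074
Funding: Supported by the National Key Research and Development Program of China (No. 2022YFF0711601), the Natural Science Foundation of Hubei Province (No. 2022CFB640), the China Postdoctoral Science Foundation (No. 2021M702991), the Director's Fund of the Key Laboratory of Geological Survey and Evaluation, Ministry of Education (No. GLAB2023ZR01), and the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (No. KF-2022-07-014)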
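Abstract:

Geological reports on mineral resources contain a wealth of expert experience and basic geological knowledge. Extracting structured knowledge quickly and accurately from massive volumes of mineral resource text has become a research hotspot, and named entity recognition (NER) is a key step in information extraction and knowledge mining. Mineral resource geological texts feature long entities, many technical terms and nested entities, so existing deep learning-based NER methods perform poorly when applied directly to this domain. This paper proposes a deep learning model for mineral resource NER: ALBERT (A Lite Bidirectional Encoder Representations from Transformers)-BiLSTM (Bi-directional Long Short-Term Memory)-CRF (Conditional Random Field). The ALBERT pre-trained language model captures rich semantic features of geological text, and these are combined with the pinyin, glyph and word boundary features of Chinese characters in a joint embedding layer to improve the recognition of complex entities. Experiments were carried out on the People's Daily dataset, a resume dataset and a purpose-built mineral resources dataset; the proposed method achieved a precision of 70.97%, a recall of 64.33% and an F1 score of 67.49%.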
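The abstract does not say how the pinyin, glyph and word boundary features are encoded, so the following is only a minimal sketch of one common way to prepare such per-character features before they are embedded. It assumes the B/M/E/S (begin, middle, end, single) scheme for word boundaries, a pre-segmented input sentence, and two small hypothetical lookup tables (`TOY_PINYIN`, `TOY_RADICAL`) standing in for real pinyin and glyph resources. None of these names come from the paper.

```python
# Toy lookup tables (hypothetical, for illustration only).
TOY_PINYIN = {"花": "hua", "岗": "gang", "岩": "yan", "铜": "tong", "矿": "kuang"}
TOY_RADICAL = {"花": "艹", "岗": "山", "岩": "山", "铜": "钅", "矿": "石"}


def bmes_tags(words):
    """Return one B/M/E/S word-boundary tag per character of a segmented sentence."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(words, pinyin_of, glyph_of):
    """Pair every character with its pinyin, glyph and boundary feature."""
    chars = [ch for word in words for ch in word]
    tags = bmes_tags(words)
    return [
        {
            "char": ch,
            "pinyin": pinyin_of.get(ch, "<unk>"),  # unknown characters fall back to <unk>
            "glyph": glyph_of.get(ch, "<unk>"),
            "boundary": tag,
        }
        for ch, tag in zip(chars, tags)
    ]


# Example: "花岗岩 / 铜矿" (granite / copper ore), already segmented into words.
segmented = ["花岗岩", "铜矿"]
features = char_features(segmented, TOY_PINYIN, TOY_RADICAL)
for row in features:
    print(row)
```

In a full model, each of these symbolic features would typically be mapped to its own embedding vector and concatenated with the character's ALBERT representation before the BiLSTM layer. That wiring is an assumption about the general design, not a description of the authors' implementation.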



Keywords: Mineral resources report   Named entity recognition   Pre-trained model   Multi-feature fusion
Received: 2023-02-21
Revised: 2023-04-03

Mineral resource entity recognition considering multiple features of Chinese characters
Liu Zhihao, Jin Xiangguo, Qiu Qinjun, Tao Liufeng, Huang Zhen, Xie Zhong. 2023. Mineral resource entity recognition considering multiple features of Chinese characters. Chinese Journal of Geology, 58(4): 1535-1553. doi: 10.12017/dzkx.2023.084
Authors:Liu Zhihao  Jin Xiangguo  Qiu Qinjun  Tao Liufeng  Huang Zhen  Xie Zhong
Affiliation:1. National Engineering Research Center of Geographic Information System, Wuhan 430074; 2. School of Computer Science, China University of Geosciences(Wuhan), Wuhan 430074; 3. Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen, Guangdong 518034; 4. National and Local Joint Engineering Laboratory of Geographic Information System, Wuhan 430074
Abstract: Geological reports on mineral resources contain a large amount of expert experience and basic geological knowledge. Rapid and accurate extraction of structured knowledge from massive mineral resource texts has become a hot research topic, and named entity recognition (NER) is an important step in information extraction and knowledge mining. Mineral resource geological texts contain long entities, many technical terms and nested entities, so existing deep learning-based NER methods perform poorly when applied directly to this field. To address this, a deep learning model for mineral resource NER is proposed: ALBERT-BiLSTM-CRF. The ALBERT pre-trained language model is used to obtain rich semantic features of geological text, and these are combined with the pinyin, glyph and word boundary features of Chinese characters to form a joint embedding layer, improving the recognition of complex entities. The method was evaluated on the People's Daily dataset, the Resume dataset and a purpose-built mineral resources dataset, and the results show that it achieves a precision of 70.97%, a recall of 64.33% and an F1 score of 67.49%.
Keywords:Mineral resources report  Named entity recognition  Pre-training model  Multi-feature fusion
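
As a quick consistency check on the three reported figures (70.97%, 64.33%, 67.49%), the sketch below recomputes F1 as the harmonic mean of the first two values. It assumes the standard definition F1 = 2PR / (P + R). The third figure is reproduced only if the first is read as precision rather than overall accuracy. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 70.97  # percent, as reported in the abstract
reported_recall = 64.33     # percent, as reported in the abstract

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # prints 67.49
```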