A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF

Cite this article: | ZHANG Hongwei, DU Qingyun, CHEN Zhangjian, ZHANG Chen. A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF[J]. Geomatics and Information Science of Wuhan University, 2022, 47(5): 665-672. DOI: 10.13203/j.whugis20210112

Authors: | ZHANG Hongwei, DU Qingyun, CHEN Zhangjian, ZHANG Chen

Affiliation: | 1. School of Electronic Information, Wuhan University, Wuhan 430072, Hubei, China

Funding: | National Key Research and Development Program of China (2016YFC0803106)

Abstract: | To address the problems that current address matching methods rely heavily on word segmentation dictionaries and cannot effectively recognize the address elements in an address or their types, a Chinese address parsing method using deep learning is proposed; it standardizes the parsed addresses and analyzes their composition to improve address matching results. A comparative evaluation of different word vector representations of addresses and different sequence labeling models shows that bidirectional gated recurrent units and bidirectional long short-term memory networks differ little in parsing Chinese addresses, and that the sparse attention mechanism helps improve the F1-score of address parsing. The proposed method achieves an F1-score of 0.940 on the generalization ability test dataset and 0.968 on the ordinary test dataset.

Keywords: | address parsing; Chinese address segmentation; attention mechanism; long short-term memory network; RoBERTa; BiLSTM; CRF

Received: | 2021-06-08
A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF |
Affiliation: | 1. School of Electronic Information, Wuhan University, Wuhan 430072, China; 2. School of Resources and Environmental Sciences, Wuhan University, Wuhan 430079, China; 3. Zhejiang Academy of Surveying and Mapping, Hangzhou 311100, China

Abstract: | Objectives Current address matching methods rely heavily on word segmentation dictionaries and cannot effectively recognize the address elements in an address or their types; to address these problems, a Chinese address parsing method based on deep learning is proposed. Methods A model combining the robustly optimized bidirectional encoder representations from transformers (BERT) pretraining approach (RoBERTa), a bidirectional long short-term memory (BiLSTM) network, and a conditional random field (CRF) is used to parse Chinese addresses. First, the RoBERTa model is used to obtain the word vector representation of the address. Second, BiLSTM is used to learn address model features and contextual information. Finally, CRF is used to construct the constraint relations between the labels. Results Through the comparison and evaluation of different word vector representations of addresses and different sequence labeling models, the proposed method achieves a maximum F1-score of 0.940 on the generalization ability test dataset, while the precision, recall, and F1-score on the corresponding ordinary test dataset reach 0.962, 0.974, and 0.968, respectively. Conclusions The proposed method needs neither hand-crafted address model features nor a word segmentation dictionary for address segmentation; address elements are recognized by learning address context information and address model features. The generalization ability test dataset used in this study can effectively test whether the model is overfitted. The bidirectional gated recurrent unit (BiGRU) and BiLSTM differ little for Chinese address parsing, and the sparse attention mechanism helps improve the accuracy of address parsing.

Keywords: | address parsing; Chinese address segmentation; attention mechanism; long short-term memory network; RoBERTa; BiLSTM; CRF
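The CRF layer in the pipeline described above scores whole label sequences rather than individual tokens, so that transitions which are invalid in BIO-style tagging (e.g. an inside tag directly after an outside tag) are ruled out at decoding time. A minimal sketch of this idea, as a pure-Python Viterbi decoder with hypothetical labels, emission scores, and transition scores (not the paper's actual model or data):

```python
def viterbi(emissions, transitions, labels):
    """Find the highest-scoring label sequence under a linear-chain CRF.

    emissions: list of {label: score} dicts, one per token (e.g. from BiLSTM).
    transitions: {(prev_label, cur_label): score}; missing pairs are forbidden.
    Returns the best label sequence as a list of labels.
    """
    # DP table: best path score ending in each label at the current token.
    best = {lab: emissions[0][lab] for lab in labels}
    back = []  # backpointers, one dict per token after the first
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in labels:
            # Best predecessor: unlisted transitions score -inf (disallowed).
            prev, score = max(
                ((p, best[p] + transitions.get((p, cur), float("-inf")))
                 for p in labels),
                key=lambda x: x[1],
            )
            new_best[cur] = score + em[cur]
            ptr[cur] = prev
        back.append(ptr)
        best = new_best
    # Trace the highest-scoring path backwards through the backpointers.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Because the transition table omits pairs like ("O", "I-city"), the decoder will prefer a valid "B-city" before an "I-city" even when the per-token emission scores alone would favor the invalid sequence; this is the constraint-between-labels role the abstract assigns to the CRF.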