
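Matching Areal Entities with CatBoost Ensemble Method
Cite this article:LIU He, GUO Li, LI Hao, ZHANG Wanchen, BAI Xiangtian. Matching Areal Entities with CatBoost Ensemble Method[J]. Geo-information Science, 2022, 24(11): 2198-2211.
Authors:LIU He  GUO Li  LI Hao  ZHANG Wanchen  BAI Xiangtian
Institution:1. 61363 Troops, Xi'an 710054; 2. Information Engineering University, Zhengzhou 450001; 3. 32021 Troops, Beijing 100094
Funding:Science and Technology Basic Resources Investigation Program (2019FY202501); Key Project of Higher Education Teaching Reform Research and Practice of Henan Province (2021SJGLX299)
Abstract:Existing multi-index geometric matching methods for areal entities struggle to quantify index weights and thresholds on a principled basis when computing the overall similarity and selecting the final matching entity. Ensemble learning algorithms complete a learning task by building and combining multiple machine learners, and they show a clear performance advantage on classification problems. This paper therefore proposes an areal entity matching method based on the ensemble learning algorithm CatBoost, which recasts the matching problem as a classification problem. Four geometric features (shape, area, direction, and position) are selected as the model's classification features; a hybrid resampling technique that combines oversampling and undersampling reduces the class imbalance of the original training samples; Bayesian optimization determines the optimal hyperparameters of the CatBoost model; and the SHAP framework from explainable artificial intelligence is introduced to explain, from both global and local perspectives, how each input feature affects the matching result. The method was validated on areal lake data from the Qinghai-Tibet Plateau. The experiments show that position is the feature with the greatest influence on model prediction, followed by area and then shape, and that direction has the least influence. On the experimental dataset, the CatBoost matching method reaches a precision, recall, and F1-score of 0.9937, 0.9753, and 0.9844, respectively, which is about 5.8%, 0.6%, and 3.3% higher than training the model directly on the original class-imbalanced samples. Compared with the traditional multi-index bidirectional matching method for areal entities and with conventional machine learning classifiers such as logistic regression, K-nearest neighbors, decision trees, and neural networks, CatBoost performs better and achieves good matching results while avoiding the difficulty of setting index weights and thresholds.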
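The hybrid resampling step can be pictured with a minimal sketch. The abstract does not say which oversampling and undersampling algorithms the authors use, so the code below assumes the simplest variants (random oversampling with replacement and random undersampling without replacement). The function name `hybrid_resample`, the `target_per_class` parameter, and the toy feature values are all hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def hybrid_resample(samples, labels, target_per_class, seed=0):
    """Bring every class to target_per_class samples.

    Classes larger than the target are randomly undersampled without
    replacement; classes smaller than the target keep all their samples
    and are padded with random duplicates (oversampling with replacement).
    This is an assumed, simplified stand-in for the paper's resampling.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            chosen = rng.sample(xs, target_per_class)  # undersample
        else:
            extra = [rng.choice(xs) for _ in range(target_per_class - len(xs))]
            chosen = xs + extra  # oversample
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y


# Toy candidate pairs, each described by four similarity values
# (shape, area, direction, position). The numbers are made up.
toy_rng = random.Random(1)
matched = [[toy_rng.uniform(0.8, 1.0) for _ in range(4)] for _ in range(10)]
unmatched = [[toy_rng.uniform(0.0, 0.6) for _ in range(4)] for _ in range(90)]
X = matched + unmatched
y = [1] * 10 + [0] * 90  # 1 = match, 0 = non-match (a 1:9 imbalance)

X_res, y_res = hybrid_resample(X, y, target_per_class=40)
counts = Counter(y_res)
print(counts)
```

Choosing a target between the two class sizes means the minority class is duplicated less and the majority class is discarded less than with pure oversampling or pure undersampling, which is the usual motivation for combining the two.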

Keywords:areal entities  similarity  matching  ensemble learning  CatBoost  class imbalance  Bayesian optimization  SHAP  
Received:2022-01-26

Matching Areal Entities with CatBoost Ensemble Method
LIU He,GUO Li,LI Hao,ZHANG Wanchen,BAI Xiangtian.Matching Areal Entities with CatBoost Ensemble Method[J].Geo-information Science,2022,24(11):2198-2211.
Authors:LIU He  GUO Li  LI Hao  ZHANG Wanchen  BAI Xiangtian
Institution:1. 61363 Troops, Xi'an 710054, China; 2. Information Engineering University, Zhengzhou 450001, China; 3. 32021 Troops, Beijing 100094, China
Abstract:Existing multi-index geometric matching methods for areal entities find it difficult to quantify index weights and thresholds on a principled basis when calculating the comprehensive similarity and determining the final matching entity. Ensemble methods in machine learning train multiple base models and combine their predictions into the final output, and they have shown excellent performance on classification problems. This paper therefore proposes an areal entity matching method based on CatBoost that transforms the matching problem into a classification problem. First, we select four geometric features (shape, area, direction, and position) as the classification features of the model. Second, to reduce the impact of sample imbalance on model training, we use hybrid resampling, which combines oversampling and undersampling, to alleviate the class imbalance of the original training samples. Third, Bayesian optimization is used to determine the optimal hyperparameters of the CatBoost model. To improve the transparency of the ensemble model, the SHAP framework from the field of explainable artificial intelligence is introduced to explain the influence of each input feature on the prediction results from both global and local perspectives. Finally, we use areal lake data from the Qinghai-Tibet Plateau as experimental data to assess the performance of the proposed method. The results show that position is the feature with the greatest influence on model prediction, followed by area and shape, and that direction has the least influence. The Precision, Recall, and F1-score of the method on the experimental data are 0.9937, 0.9753, and 0.9844, respectively. Hybrid resampling effectively reduces the impact of unbalanced samples on model training: compared with training the model on the original unbalanced samples, it increases the Precision, Recall, and F1-score by about 5.8%, 0.6%, and 3.3%, respectively. Compared with the traditional multi-index bidirectional matching method for areal entities and with conventional machine learning classification algorithms such as logistic regression, K-nearest neighbors, decision trees, and neural networks, CatBoost performs better and achieves good matching results while avoiding the difficulty of setting index weights and thresholds.
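As a quick consistency check on the reported metrics, the sketch below recomputes the F1-score from the stated Precision and Recall using the standard definition (harmonic mean of the two). The helper names `precision_recall` and `f1_score` and the toy confusion counts are illustrative assumptions, not taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and
    false-negative counts (0.0 when a denominator is zero)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract.
reported_f1 = f1_score(0.9937, 0.9753)
print(round(reported_f1, 4))

# Toy counts (made up): 8 correct matches, 2 false matches, 2 missed matches.
toy_p, toy_r = precision_recall(tp=8, fp=2, fn=2)
print(toy_p, toy_r, f1_score(toy_p, toy_r))
```

The recomputed value agrees with the reported F1-score of 0.9844 to four decimal places, so the three figures in the abstract are mutually consistent.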
Keywords:areal entities  similarity  matching  ensemble methods  CatBoost  class imbalance  Bayesian Optimization  SHAP  