
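Research on a Topic-Relevance-Based Crawler for Geographic Information Web Services
Cite this article: Wu Hao, Liao Anping, He Chaoying, Hou Dongyang. Research on a Topic-Relevance-Based Crawler for Geographic Information Web Services[J]. Geography and Geo-Information Science, 2012, 28(2): 27-30
Authors: Wu Hao, Liao Anping, He Chaoying, Hou Dongyang
Affiliations: 1. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei 430079; National Geomatics Center of China, Beijing 100830
2. National Geomatics Center of China, Beijing 100830
3. National Geomatics Center of China, Beijing 100830; School of Environment Science and Spatial Informatics, China University of Mining and Technology, Xuzhou, Jiangsu 221008
Funding: National Natural Science Foundation of China (41001216)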
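Abstract: To address the shortcomings of general-purpose search engines in retrieving geographic information Web services, this paper proposes a service crawler method based on topic relevance. The vector space model is used to represent topic features. A feature-weight calculation is introduced to analyze the relevance between page content and the topic and to filter out pages unrelated to the topic. An improved PageRank algorithm then analyzes link importance from both the URL and the anchor text in order to optimize the crawl queue. Experiments show that the method achieves good results in both service retrieval efficiency and crawling capability.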

Keywords: geographic information Web services; service retrieval; crawler; topic relevance

Topic-Relevance Based Crawler for Geographic Information Web Services
WU Hao, LIAO An-ping, HE Chao-ying, HOU Dong-yang. Topic-Relevance Based Crawler for Geographic Information Web Services[J]. Geography and Geo-Information Science, 2012, 28(2): 27-30
Authors: WU Hao, LIAO An-ping, HE Chao-ying, HOU Dong-yang
Affiliation: 1. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079; 2. National Geomatics Center of China, Beijing 100830; 3. School of Environment Science and Spatial Informatics, China University of Mining and Technology (CUMT), Xuzhou 221008, China
Abstract: To address the shortcomings of general-purpose search engines in retrieving Geographic Information Web Services (GIServices), this paper proposes a web service crawler based on topic relevance. First, the topic features of GIServices are defined and represented with the Vector Space Model (VSM), which simplifies their representation and computation. Second, a feature-weight calculation is introduced to measure the similarity between a web page and the topic feature vector, so that pages unrelated to the topic can be filtered out. Third, an improved PageRank algorithm analyzes the importance of each hyperlink from its URL and its anchor text in order to optimize the crawl queue. Experimental results indicate that the method performs well in both search efficiency and crawling capability compared with general-purpose search engines.
Keywords: geographic information Web services; service retrieval; crawler; topic relevance
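
The abstract gives no formulas, so the following Python sketch only illustrates the general shape of the two steps it names: scoring page content against a topic feature vector in a vector space model, and ordering the crawl queue by a link score drawn from the URL and the anchor text. Everything specific here is an assumption and does not come from the paper: the topic terms and weights in `TOPIC_WEIGHTS`, the cosine-similarity measure, the `url_weight` blend, and the `Frontier` class are hypothetical. In particular, the link score is a simple stand-in for the paper's improved PageRank, not a reconstruction of it.

```python
import heapq
import math
import re
from collections import Counter

# Hypothetical topic feature vector (term -> weight); values are illustrative only.
TOPIC_WEIGHTS = {
    "wms": 1.0, "wfs": 1.0, "getcapabilities": 0.9,
    "ogc": 0.8, "map": 0.4, "service": 0.3,
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text, topic=TOPIC_WEIGHTS):
    """Cosine similarity between the text's term-frequency vector and the topic vector."""
    tf = Counter(tokenize(text))
    dot = sum(tf[term] * weight for term, weight in topic.items())
    norm_text = math.sqrt(sum(count * count for count in tf.values()))
    norm_topic = math.sqrt(sum(weight * weight for weight in topic.values()))
    if norm_text == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_text * norm_topic)


def link_priority(url, anchor_text, url_weight=0.5):
    """Assumed link score: a weighted blend of URL and anchor-text relevance."""
    return url_weight * relevance(url) + (1 - url_weight) * relevance(anchor_text)


class Frontier:
    """Crawl queue that returns the highest-scoring unseen link first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, anchor_text):
        if url in self._seen:
            return  # skip URLs that were already queued
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best link first.
        heapq.heappush(self._heap, (-link_priority(url, anchor_text), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

A crawler built on this sketch would discard any fetched page whose `relevance` score falls below a chosen threshold and push the outgoing links of the remaining pages into the `Frontier`. The threshold itself would need to be tuned experimentally.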
This article is indexed in CNKI, Wanfang Data, and other databases.