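A Topic Crawler for Actively Discovering Geospatial Web Services
Cite this article: SHEN Ping, GUI Zhipeng, YOU Lan, HU Kai, WU Huayi. A Topic Crawler for Actively Discovering Geospatial Web Services[J]. Geo-information Science, 2015, 17(2): 185-190. DOI: 10.3724/SP.J.1047.2015.00185
Authors: SHEN Ping  GUI Zhipeng  YOU Lan  HU Kai  WU Huayi
Affiliation: 1. State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079; 2. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079; 3. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062
Funding: National Natural Science Foundation of China, General Program (41371372); Exploratory Research and Development Fund of the School of Remote Sensing and Information Engineering, Wuhan University, "Research on Optimization Methods for Spatial Information Cloud Computing Based on Spatiotemporal Computing Feature Mining".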
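Abstract: Geospatial web services have become an important source of geographic data in distributed environments, and finding them among massive web resources is the basis for sharing and interoperating geographic data. At present, active search for geospatial web services mainly relies on the interfaces of general-purpose search engines or on crawling with generic crawlers, but both approaches suffer from low search efficiency and poor usability of the search results. To address this problem, this paper designs a topic crawler for searching geospatial web services. The algorithm improves on best-first search: it determines link priority by jointly considering the topic relevance of the webpage content and the topic relevance of the link text, crawls links related to geospatial web services first, and discards irrelevant links found in irrelevant pages, which reduces wasted crawling and thus improves search efficiency. In addition, the paper identifies geospatial web services by combining keyword matching with capabilities document detection, which effectively selects the usable services and raises the usability of the search results. Finally, taking OGC WMS as an example, a prototype system of the crawler algorithm is implemented and tested; the experiments show that the algorithm is effective and feasible.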

Keywords: topic crawler  geospatial web services  best-first search  capabilities document detection
Received: 2014-11-14
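
The link-prioritization idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the relevance measure, the weighting, or the threshold value, so the keyword set `TOPIC_TERMS`, the term-overlap score in `relevance`, the weight `w_page`, and the `threshold` default are all assumptions made for illustration. The sketch only shows the general shape of a best-first frontier whose priority mixes page-content relevance with link-text relevance and which drops low-scoring links.

```python
import heapq

# Hypothetical topic vocabulary; the abstract does not list the actual keywords.
TOPIC_TERMS = {"wms", "ogc", "getcapabilities", "map", "service", "geospatial"}


def relevance(text: str) -> float:
    """Fraction of topic terms present in the text (an assumed, simplistic measure)."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def link_priority(page_text: str, link_text: str, w_page: float = 0.5) -> float:
    """Weighted mix of page-content relevance and link-text relevance."""
    return w_page * relevance(page_text) + (1.0 - w_page) * relevance(link_text)


class Frontier:
    """Best-first frontier: the highest-priority unvisited URL is popped first."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold  # assumed cut-off below which links are dropped
        self._heap = []
        self._seen = set()

    def push(self, url: str, page_text: str, link_text: str) -> bool:
        """Queue the URL unless it was seen before or scores below the threshold."""
        score = link_priority(page_text, link_text)
        if url in self._seen or score < self.threshold:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best link first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        """Return (url, score) for the best remaining link."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)
```

In a crawler loop, each fetched page would call `push` for every outgoing link and `pop` to choose the next URL to fetch. A real system would likely replace the term-overlap score with a stronger text-similarity measure.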

A Topic Crawler for Discovering Geospatial Web Services
SHEN Ping,GUI Zhipeng,YOU Lan,HU Kai,WU Huayi. A Topic Crawler for Discovering Geospatial Web Services[J]. Geo-information Science, 2015, 17(2): 185-190. DOI: 10.3724/SP.J.1047.2015.00185
Authors:SHEN Ping  GUI Zhipeng  YOU Lan  HU Kai  WU Huayi
Affiliation: 1. State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China; 2. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; 3. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
Abstract: In the Internet era, geospatial web services (GWSs) are a primary means of sharing and interoperating geographic data. After more than ten years of development and the wide adoption of service specifications, a growing number of GWSs have been published and are available for public online access. To obtain the geographic data they expose, an effective approach is needed to locate and discover GWSs among massive web resources. Currently, the most widely used methods for GWS discovery in practice rely either on the Google Search API or on a generic web crawler. These approaches have shortcomings, however, including relatively inefficient search, irrelevant results, and low precision in GWS identification. To partially address these issues, this paper develops a topic crawler that harvests GWSs using a modified Best First Search strategy. The core of the proposed algorithm is to predict the crawling priority of each unvisited URL by combining the topic relevance of the link text with the topic relevance of the webpage text. Priority thresholds are then used to filter out irrelevant URLs and thereby narrow the search range. Moreover, a capabilities document detection step is added to the GWS recognition process to improve search precision. Finally, the Web Map Service (WMS), the most widely adopted GWS specification, proposed by the Open Geospatial Consortium (OGC), is used as a case study. Two groups of experiments were conducted to compare the proposed method with a generic web crawler. The experimental results verified the feasibility of the proposed algorithm.
Keywords:topic crawler  Geospatial Web Services  Best First Search  capability detection
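
The capabilities document detection step mentioned in the abstract might look roughly like the Python sketch below. This is an illustration under stated assumptions, not the paper's code: the function names `build_getcapabilities_url` and `is_wms_capabilities` are hypothetical, and the abstract does not describe how the authors build the request or parse the response. The sketch relies only on two facts from the WMS specification: a GetCapabilities request carries `SERVICE=WMS` and `REQUEST=GetCapabilities`, and the response root element is `WMS_Capabilities` (WMS 1.3.0) or `WMT_MS_Capabilities` (WMS 1.1.x).

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Root element names of WMS capabilities documents (1.3.0 and 1.1.x respectively).
WMS_ROOT_TAGS = {"WMS_Capabilities", "WMT_MS_Capabilities"}


def build_getcapabilities_url(candidate_url: str) -> str:
    """Turn a candidate endpoint URL into a WMS GetCapabilities request URL."""
    parts = urlsplit(candidate_url)
    # Keep unrelated query parameters, but replace any existing service/request pair.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in {"service", "request"}]
    query += [("SERVICE", "WMS"), ("REQUEST", "GetCapabilities")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def is_wms_capabilities(xml_text: str) -> bool:
    """Return True if the response body parses as a WMS capabilities document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # HTML error pages and other non-XML bodies end up here
    # Strip the "{namespace}" prefix that ElementTree prepends to qualified tags.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in WMS_ROOT_TAGS
```

A crawler would fetch the URL returned by `build_getcapabilities_url` and pass the response body to `is_wms_capabilities`. Only endpoints that pass the check would be recorded as usable services, which is the sense in which detection improves precision over keyword matching alone.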
This article is indexed in CNKI, Wanfang Data, and other databases.