
A Framework of Distributed Spatial Data Analysis Based on Shark/Spark

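Cite this article:	WEN Xin, LUO Kan, CHEN Rongguo. A Framework of Distributed Spatial Data Analysis Based on Shark/Spark[J]. Geo-information Science (地球信息科学学报), 2015, 17(4): 401-407. DOI: 10.3724/SP.J.1047.2015.00401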

Authors:	WEN Xin, LUO Kan, CHEN Rongguo

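Affiliation:	1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101; 2. University of Chinese Academy of Sciences, Beijing 100049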

Funding:	National High Technology Research and Development Program of China ("863" Program) (2013AA12A204, 2013AA122302)

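Abstract:	As spatial data volumes grow daily, traditional single-node approaches to spatial data management can no longer meet the demands of massive data and high concurrency. The rise of cloud computing brings both opportunities and challenges, and the complementary strengths of distributed technology and database technology make efficient data management in the cloud possible. This paper proposes integrating a set of key techniques (spatial data mapping, spatial data loading, data backup, and spatial query) into a distributed computing engine (Shark/Spark). The approach combines the efficient storage, indexing, and querying of spatial data offered by a spatial database with the distributed engine's strength in complex computation, yielding a distributed spatial data analysis framework based on Shark/Spark. In the implementation, spatial queries are realized in two ways: spatial user-defined functions and spatial function pushdown. The results show that spatial queries that affect the size of the returned result set are better pushed down to the spatial database, whereas spatial queries that do not affect the result size are better computed directly by the distributed engine. In addition, a comparison with an existing distributed GIS solution (ArcGIS on Hadoop) shows that the spatial database's spatial index effectively improves query efficiency and that spatial data management is more independent.
Keywords:	Shark; Spark; Hadoop; spatial database; spatial query
Received:	2014-10-11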
A Framework of Distributed Spatial Data Analysis Based on Shark/Spark

WEN Xin;LUO Kan;CHEN Rongguo. A Framework of Distributed Spatial Data Analysis Based on Shark/Spark[J]. Geo-information Science, 2015, 17(4): 401-407. DOI: 10.3724/SP.J.1047.2015.00401

Authors:	WEN Xin, LUO Kan, CHEN Rongguo

Affiliation:	1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:	With the development of technology, spatial datasets continue to grow at a remarkable rate. Traditional data management based on a single-node DBMS can hardly meet the demands of high concurrency over massive data. The rise of cloud computing brings new opportunities and challenges. Some researchers adopt hybrid solutions that combine fault tolerance, support for heterogeneous clusters, and a distributed computing framework to achieve efficient performance. Derived from the Spark computing framework, Shark is a computing engine for fast data analysis. When a query is submitted, Shark compiles it into an operator tree represented by RDDs, which Spark then translates into a graph of tasks for execution. Shark does not currently support spatial queries; we therefore introduce an approach that enables Shark/Spark to support them. Using the APIs and UDFs provided by Shark, Shark/Spark can process spatial data fetched from spatial databases and perform spatial queries on demand. By integrating Shark/Spark with components for the mapping, loading, backup, and querying of spatial data, and by combining the efficient spatial data management of spatial databases with Spark's high-performance computing for large-scale data processing, we have implemented a framework for distributed spatial data analysis based on Shark/Spark. During implementation and testing, we found that, for better performance, queries that affect the size of the returned dataset should be pushed down entirely to the database layer, while other queries should be executed in Spark. In addition, the system outperformed ArcGIS on Hadoop on some queries because the spatial index of the spatial database improves query efficiency. Moreover, data management with a spatial database is more independent and convenient.

Keywords:	Shark; Spark; Hadoop; Spatial database; Spatial query
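
The distinction the abstract draws between pushing a spatial query down to the database and evaluating it in the engine can be illustrated with a small, self-contained sketch. This is not the authors' implementation and does not use any Shark or Spark API; `ToySpatialDB`, `filter_in_engine`, `filter_pushed_down`, and `distance_udf` are hypothetical names, and the uniform grid is only a stand-in for whatever spatial index a real spatial database provides. The sketch counts how many rows leave the "database" under each strategy.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass(frozen=True)
class Feature:
    fid: int
    location: Point


def within(box: BBox, p: Point) -> bool:
    """True if point p lies inside the closed bounding box."""
    min_x, min_y, max_x, max_y = box
    return min_x <= p[0] <= max_x and min_y <= p[1] <= max_y


class ToySpatialDB:
    """Hypothetical stand-in for a spatial database with a grid index."""

    def __init__(self, features: Iterable[Feature], cell_size: float = 10.0):
        self.cell_size = cell_size
        self.features: List[Feature] = list(features)
        self.grid: Dict[Tuple[int, int], List[Feature]] = {}
        for f in self.features:
            self.grid.setdefault(self._cell(f.location), []).append(f)

    def _cell(self, p: Point) -> Tuple[int, int]:
        return (int(p[0] // self.cell_size), int(p[1] // self.cell_size))

    def scan_all(self) -> List[Feature]:
        """Full table scan: every row is handed to the caller."""
        return list(self.features)

    def query_within(self, box: BBox) -> List[Feature]:
        """Index-assisted filter: only grid cells overlapping the box are read."""
        cx0, cy0 = self._cell((box[0], box[1]))
        cx1, cy1 = self._cell((box[2], box[3]))
        hits: List[Feature] = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for f in self.grid.get((cx, cy), []):
                    if within(box, f.location):
                        hits.append(f)
        return hits


def filter_in_engine(db: ToySpatialDB, box: BBox) -> Tuple[List[Feature], int]:
    """UDF style: fetch every row, then filter on the engine side."""
    rows = db.scan_all()
    return [f for f in rows if within(box, f.location)], len(rows)


def filter_pushed_down(db: ToySpatialDB, box: BBox) -> Tuple[List[Feature], int]:
    """Pushdown style: the database filters, so only matching rows are moved."""
    rows = db.query_within(box)
    return rows, len(rows)


def distance_udf(db: ToySpatialDB, origin: Point) -> List[Tuple[int, float]]:
    """Per-row computation that keeps the row count unchanged; run in the engine."""
    return [
        (f.fid, math.hypot(f.location[0] - origin[0], f.location[1] - origin[1]))
        for f in db.scan_all()
    ]


if __name__ == "__main__":
    db = ToySpatialDB(
        [Feature(i * 20 + j, (float(i), float(j))) for i in range(20) for j in range(20)],
        cell_size=5.0,
    )
    box = (0.0, 0.0, 4.0, 4.0)
    _, moved_engine = filter_in_engine(db, box)
    _, moved_pushdown = filter_pushed_down(db, box)
    print("rows moved, engine-side filter:", moved_engine)
    print("rows moved, pushed-down filter:", moved_pushdown)
```

In this toy setting both filters return the same features, but the pushed-down version moves far fewer rows out of the database, which mirrors the abstract's claim about queries that change the result size. A function such as `distance_udf` returns one value per input row, so pushing it down would not reduce data movement, and computing it in the engine is the natural choice.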
This article is indexed in CNKI and other databases.