Abstract: The existing CMISS (China Integrated Meteorological Information Sharing System) cannot adequately support large-volume queries spanning long time series, multiple stations, and multiple meteorological elements. In this study, the monthly reports of historical surface meteorological records collected in Guangxi since each station's establishment, together with the existing Hadoop cluster's physical resources, are used to redesign the ETL process, construct a Parquet-format dataset, and complete its conversion and storage on HDFS. In addition, Spark Broadcast variables are embedded and the Spark cluster's execution parameters are optimized, which improves the cluster's processing parallelism and the join-query efficiency of Spark SQL. The results show that the maximum compression ratio of the Parquet dataset exceeded 95%, and that single large-volume queries ran 1 to 5 times faster than before while supporting highly concurrent access, providing effective technical support for the development of related forecasting services.