Thesis Information

Chinese Title:

 Spark计算框架性能建模与优化技术的研究与实现

Name:

 温艳琪 (Wen Yanqi)

Student ID:

 1403121597

Confidentiality Level:

 Public

Thesis Language:

 chi (Chinese)

Discipline Code:

 081202

Discipline Name:

 Computer Software and Theory

Student Type:

 Master's student

Degree:

 Master of Engineering

University:

 西安电子科技大学 (Xidian University)

School/Department:

 School of Computer Science (计算机学院)

Major:

 Computer Software and Theory

Primary Supervisor:

 陈平 (Chen Ping)

Primary Supervisor's Institution:

 西安电子科技大学 (Xidian University)

Completion Date:

 2017-06-14

English Title:

 Research and Implementation of Performance Modeling and Optimization Technology of Spark Computing Framework

Chinese Keywords:

 性能模型 (performance model); 配置参数 (configuration parameters); 性能调优 (performance tuning); 任务调度 (task scheduling); Spark

English Keywords:

 Performance Model; Configuration Parameters; Performance Optimization; Task Scheduling; Spark

Chinese Abstract:

      近年来,随着互联网的高速发展,对于企业来说,需要处理的数据量呈爆发式增长,使得原有处理数据的方式已经不能满足现有的需求。MapReduce分布式计算框架的出现解决了很多企业的数据处理需求,但是数据规模的逐渐扩大使得MapReduce提供的计算能力逐渐减弱,MapReduce计算速度的缓慢也逐渐不能满足企业的需求。在这样一种情况下,基于内存计算的Spark被很多企业看中,越来越多的企业选择用Spark进行大数据量的处理任务,同时开始关注Spark的性能提升问题。

      为了提升Spark框架的计算性能,国内外许多组织和个人从不同方面做出了努力,也提出了很多优化方法。但是通过自动化寻找较优配置参数,来实现性能优化的研究很少。为了弥补这一方向的研究不足,本文提出了一个优化方法,该方法可以针对不同的Spark应用程序给出不同的优化后的参数配置,使得应用程序的运行时间得到优化,从而达到Spark性能优化的效果。

      本文提出的优化方法的主要思路是:通过建立Spark框架的性能模型,运行小数据集的应用程序,通过修改后的Spark源码收集得到运行信息,再根据提前建立好的性能模型,从而预测出大数据集下应用程序的运行时间,再通过基于代价的优化算法多次迭代调用预测模型,最终给出优化后的参数配置集,使得在该参数配置下应用程序的运行时间近似最优。

      本论文的主要研究工作包括:

      (1)植入监控代码,收集应用程序运行数据。在Spark的1.4.0版代码中加入监控代码,收集Task执行过程的数据流和执行时间等信息,收集Job的执行信息以及Stage之间的DAG信息,收集到的信息以XML格式存储在本地,便于下一步的Task预测模型的建立和模拟调度模型的实现。

      (2)构建预测模型。通过阅读Spark 1.4.0源码,分析应用程序的执行调度过程以及Task的执行过程,根据上一步收集到的数据流、执行时间信息,以及筛选出来的配置参数信息,建立Task的执行时间数学模型。根据收集到的Stage之间的DAG信息实现对Spark调度过程的模拟,从而得到整个应用程序的预测模型。

      (3)设计并实现优化算法。通过实现随机网格算法、递归随机搜索、遗传算法以及粒子群优化算法,多次迭代给出参数配置集,调用预测模型计算在该参数配置集下的预测运行时间。最终给出优化后的参数配置集,使得应用程序的执行时间得到优化,从而达到Spark应用程序性能调优的目的。

      本文的实验部分采用Intel提供的HiBench基准测试平台,使用WordCount、Sort、TeraSort、PageRank、Kmeans以及Bayes六个工作负载对本文所研究的Spark性能优化方法进行验证,实验从预测模型的准确性和优化算法的优化效果两方面进行验证,对比了CBO中四种优化算法和RBO的优化效果,最终实验结果表明本文所设计并实现的Spark性能优化方法的优化效果明显高于基于RBO的优化效果。

English Abstract:

With the rapid development of the Internet in recent years, the amount of data to be processed has grown explosively, so that the original data-processing approaches can no longer meet current needs, especially for enterprises. MapReduce, a distributed computing framework, has helped many enterprises meet their data-processing needs. However, the growing scale of data weakens the computing capability that MapReduce provides, and its increasing latency makes it less and less able to satisfy enterprise requirements. In this situation, the memory-based Spark framework has become popular: more and more enterprises choose Spark to process large volumes of data and are paying attention to improving Spark's performance.

 

In order to improve the performance of Spark, many organizations and individuals have made efforts from different angles and developed many optimization methods. However, few studies achieve performance optimization by automatically searching for near-optimal configuration parameters. To fill this gap, this paper presents an optimization method that produces a different optimized parameter configuration for each Spark application and reduces the application's runtime, thereby achieving Spark performance optimization.

 

The main idea of the optimization method proposed in this paper is as follows: establish a performance model of the Spark framework, run the application on a small data set, and collect the relevant runtime information through modified Spark source code. Based on the performance model, the method predicts the runtime of the application on a large data set. A cost-based optimization algorithm then iteratively calls the prediction model and searches for a parameter configuration set under which the predicted runtime of the application is approximately minimal.

 

The main research work of this paper is as follows.

 

(1) Collecting the application's running data. This paper adds monitoring code to the Spark 1.4.0 source code to collect the data flow and execution time of each Task, the execution information of each Job, and the DAG information between Stages. The collected information is stored locally in XML files and is used in the next steps to build the Task prediction model and to implement the simulated scheduling model.
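The abstract does not specify the layout of the collected XML; a minimal sketch of what loading such per-Stage, per-Task records could look like is shown below. All element and attribute names here are hypothetical, invented for illustration only.

```python
# Illustrative only: the XML schema below is a guess at the kind of per-Task
# metrics the modified Spark 1.4.0 code might record; field names are made up.
import xml.etree.ElementTree as ET

SAMPLE = """
<application name="WordCount">
  <stage id="0" parents="">
    <task id="0" inputBytes="1048576" shuffleWriteBytes="262144" runTimeMs="812"/>
    <task id="1" inputBytes="1048576" shuffleWriteBytes="262144" runTimeMs="790"/>
  </stage>
  <stage id="1" parents="0">
    <task id="2" inputBytes="524288" shuffleReadBytes="524288" runTimeMs="433"/>
  </stage>
</application>
"""

def load_metrics(xml_text):
    """Parse collected run data into {stage_id: (parent_ids, [task dicts])}."""
    root = ET.fromstring(xml_text)
    stages = {}
    for stage in root.findall("stage"):
        parents = [int(p) for p in stage.get("parents").split(",") if p]
        tasks = [dict(t.attrib) for t in stage.findall("task")]
        stages[int(stage.get("id"))] = (parents, tasks)
    return stages

stages = load_metrics(SAMPLE)
print(stages[1][0])  # parents of stage 1 -> [0]
```

The per-Task records feed the Task time model, while the `parents` links reconstruct the Stage DAG for the simulated scheduler.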

 

(2) Building the prediction model. By reading the Spark 1.4.0 source code, this paper analyzes the execution scheduling of applications and the execution process of Tasks, and establishes a mathematical model of Task execution time based on the collected data flow, the execution times, and the selected configuration parameters. Based on the collected DAG information between Stages, this paper simulates Spark's scheduling process and thereby obtains a prediction model for the whole application.
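The simulated scheduling step can be sketched roughly as follows, under assumptions the abstract does not spell out: per-Task times come from the fitted Task model, Tasks within a Stage run in waves over a fixed number of executor slots, and a Stage starts only after all of its parent Stages finish.

```python
# A minimal sketch of DAG-based runtime prediction; the wave model and the
# slot count are simplifying assumptions, not the thesis's exact formulation.
import math

def stage_time(task_times, slots):
    """Approximate a stage's duration: waves of tasks over available slots."""
    waves = math.ceil(len(task_times) / slots)
    return waves * max(task_times)   # coarse: each wave bounded by its slowest task

def predict_app_time(dag, predicted_tasks, slots):
    """dag: {stage: [parent stages]}; predicted_tasks: {stage: [task times]}."""
    finish = {}
    for stage in sorted(dag):        # assumes stage ids are topologically ordered
        start = max((finish[p] for p in dag[stage]), default=0.0)
        finish[stage] = start + stage_time(predicted_tasks[stage], slots)
    return max(finish.values())

dag = {0: [], 1: [], 2: [0, 1]}
tasks = {0: [10, 12, 11, 9], 1: [20, 18], 2: [5, 6, 5]}
print(predict_app_time(dag, tasks, slots=2))  # -> 36.0
```

Stage 2 cannot start until both parents finish (at time 24), which is why a whole-application model needs the DAG, not just per-Task estimates.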

 

(3) Implementing cost-based optimization algorithms. This paper implements a random grid algorithm, recursive random search, a genetic algorithm, and particle swarm optimization. Each algorithm iteratively proposes candidate configuration parameter sets, and the prediction model is called to calculate the predicted running time under each candidate. Finally, the algorithms output an optimized configuration parameter set that approximately minimizes the application's execution time, achieving the goal of Spark application performance tuning.
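As one example of the four search strategies, a much-simplified recursive random search over two configuration parameters might look like the sketch below. The cost function is a toy stand-in for the thesis's prediction model, and the parameter ranges are illustrative, not Spark defaults.

```python
# Simplified recursive random search: explore randomly, then repeatedly
# re-search a shrinking box around the best point found so far.
import random

PARAM_SPACE = {                      # illustrative ranges, not Spark defaults
    "spark.executor.memory_gb": (1, 16),
    "spark.default.parallelism": (8, 256),
}

def predicted_runtime(cfg):
    """Stub standing in for the thesis's fitted prediction model."""
    mem, par = cfg["spark.executor.memory_gb"], cfg["spark.default.parallelism"]
    return 1000.0 / mem + abs(par - 96) * 0.5   # toy cost surface

def sample(space):
    return {k: random.uniform(lo, hi) for k, (lo, hi) in space.items()}

def recursive_random_search(space, explore=30, exploit_rounds=5, shrink=0.5):
    random.seed(0)                   # fixed seed for reproducibility
    best = min((sample(space) for _ in range(explore)), key=predicted_runtime)
    for _ in range(exploit_rounds):  # shrink the search box around the best point
        space = {k: (max(lo, best[k] - (hi - lo) * shrink / 2),
                     min(hi, best[k] + (hi - lo) * shrink / 2))
                 for k, (lo, hi) in space.items()}
        cand = min((sample(space) for _ in range(explore)), key=predicted_runtime)
        best = min(best, cand, key=predicted_runtime)
    return best

best = recursive_random_search(PARAM_SPACE)
print(round(predicted_runtime(best), 1))
```

The same explore/evaluate loop structure applies to the other three algorithms; only the rule for proposing the next candidate set differs.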

 

The experimental part of this paper uses the HiBench benchmark suite provided by Intel to verify the Spark performance optimization method, with WordCount, Sort, TeraSort, PageRank, Kmeans, and Bayes as the test workloads. The experiments validate both the accuracy of the prediction model and the effect of the optimization algorithms, and compare the optimization effects of the four cost-based optimization (CBO) algorithms against rule-based optimization (RBO). The results show that the Spark performance optimization method designed and implemented in this paper clearly outperforms the RBO-based approach.

Chinese Library Classification (CLC) Number:

 11

Collection Number:

 11-34820

Open Access Date:

 2017-12-16

