
Spark Breaks MapReduce's Sorting World Record

  Yesterday Databricks published a post announcing that Spark, running on just 206 nodes, has broken the previous MapReduce world records for sorting 100 TB and 1 PB of data. The previous record was set by Yahoo, which sorted 102.5 TB with MapReduce on 2100 Hadoop nodes in 72 minutes. This time Spark was deployed on 206 EC2 nodes and sorted 100 TB in 23 minutes, with all of the sorting done on disk. In other words, Spark sorted the same scale of data 3X faster while using 10X fewer machines.
  The team also compared sorting 1 PB: Spark used 190 machines and finished in about 4 hours (234 minutes), whereas the earlier Hadoop run used 3800 machines and took 16 hours. The detailed comparison is as follows:

|                              | Hadoop World Record   | Spark 100 TB     | Spark 1 PB       |
|------------------------------|-----------------------|------------------|------------------|
| Data Size                    | 102.5 TB              | 100 TB           | 1000 TB          |
| Elapsed Time                 | 72 mins               | 23 mins          | 234 mins         |
| # Nodes                      | 2100                  | 206              | 190              |
| # Cores                      | 50400                 | 6592             | 6080             |
| # Reducers                   | 10,000                | 29,000           | 250,000          |
| Rate                         | 1.42 TB/min           | 4.27 TB/min      | 4.27 TB/min      |
| Rate/node                    | 0.67 GB/min           | 20.7 GB/min      | 22.5 GB/min      |
| Sort Benchmark Daytona Rules | Yes                   | Yes              | No               |
| Environment                  | dedicated data center | EC2 (i2.8xlarge) | EC2 (i2.8xlarge) |
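The Rate and Rate/node rows follow directly from the data sizes, elapsed times, and node counts above. A minimal sanity-check sketch in Scala (assuming decimal units, 1 TB = 1000 GB; small deviations from the published figures come from rounding of the elapsed times):

```scala
// Recompute the throughput rows of the table above from the raw figures.
object SortRates extends App {
  case class Run(name: String, dataTB: Double, minutes: Double, nodes: Int)

  val runs = Seq(
    Run("Hadoop World Record", 102.5, 72, 2100),
    Run("Spark 100 TB", 100.0, 23, 206),
    Run("Spark 1 PB", 1000.0, 234, 190)
  )

  for (r <- runs) {
    val rateTBmin   = r.dataTB / r.minutes        // cluster-wide rate
    val ratePerNode = rateTBmin * 1000 / r.nodes  // GB/min per node
    println(f"${r.name}%-20s ${rateTBmin}%5.2f TB/min  ${ratePerNode}%5.2f GB/min per node")
  }
}
```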

  Three technical improvements contributed most to this result:
  1. Sort-based shuffle: introduced in Spark 1.1. The earlier shuffle was hash-based, and the two differ greatly in performance; see the earlier article "Spark shuffle: hash vs. sort performance comparison" (《Spark shuffle:hash和sort性能对比》). The module still had quite a few bugs in 1.1, but it may become the default shuffle in Spark 1.2. Opting into it is a one-line configuration change, as sketched below.
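A minimal sketch of selecting the sort-based shuffle in a Spark 1.1 job via the spark.shuffle.manager setting (the application name and the toy job are placeholders for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in pre-1.3 Spark

// Opt into the sort-based shuffle introduced in Spark 1.1;
// "hash" selects the older hash-based implementation.
val conf = new SparkConf()
  .setAppName("sort-shuffle-demo") // placeholder app name
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)

// Any shuffle-heavy operation (sortByKey, reduceByKey, ...) now runs
// through the sort-based shuffle path.
val sorted = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b")).sortByKey()
sorted.collect().foreach(println)
```

The same key can also be passed at submit time, e.g. spark-submit --conf spark.shuffle.manager=sort.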
  2. Netty-based network module: Spark is reimplementing its old network transport module on top of Netty (Netty's Epoll native socket transport via JNI), which significantly reduces data copying between the OS kernel and user space and cuts down JVM garbage collection. The module is still being implemented and is disabled by default. Details from the issue tracker:

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC
Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/
One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
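To make the zero-copy point concrete, here is a minimal Scala sketch (not Spark's actual implementation; host, port, and function name are placeholders) contrasting the copy-heavy path described above with FileChannel.transferTo, which lets the kernel move file bytes to the socket without surfacing them in user space:

```scala
import java.io.{File, FileInputStream}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Minimal sketch of kernel-space zero-copy with FileChannel.transferTo.
def sendFileZeroCopy(file: File, host: String, port: Int): Unit = {
  val socket  = SocketChannel.open(new InetSocketAddress(host, port))
  val channel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    val size = channel.size()
    // transferTo may send fewer bytes than requested, so loop until done.
    // Data flows disk -> kernel buffer -> NIC without ever entering the
    // JVM heap, avoiding the extra copies and GC pressure quoted above.
    while (position < size) {
      position += channel.transferTo(position, size - position, socket)
    }
  } finally {
    channel.close()
    socket.close()
  }
}
```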

  3. A brand-new external shuffle service: this module is decoupled from Spark's executors and built on top of the Netty network module, so that shuffle files can still be served even while executors are stalled in GC. It is also under development, planned as a two-step effort, described as follows:

This task will be broken up into two parts – the first, being to refactor our internal shuffle service to use a BlockTransferService which we can easily extract out into its own service, and then the second is to actually do the extraction.
Here is the design document for the low-level service, nicknamed "Sluice", on top of which will be Spark's BlockTransferService API:
https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0
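For context, when the external shuffle service later shipped it was turned on through configuration. A minimal sketch, assuming the spark.shuffle.service.enabled key from later Spark releases (not the in-progress code described above) and a shuffle service process already running on each worker:

```scala
import org.apache.spark.SparkConf

// Fetch shuffle blocks from the external shuffle service (Spark 1.2+)
// instead of from executors directly, so blocks remain reachable even
// while an executor is paused in a GC cycle.
val conf = new SparkConf()
  .setAppName("external-shuffle-demo") // placeholder app name
  .set("spark.shuffle.service.enabled", "true")
```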

Unless otherwise stated, all articles on this blog are original.
Copyright belongs to 过往记忆大数据 (过往记忆); reproduction without permission is prohibited.
Permalink: Spark Breaks MapReduce's Sorting World Record (https://www.iteblog.com/archives/1141.html)
2 comments
  1. Does MapReduce have any advantage left against Spark? Will Spark completely replace MapReduce within the next 3-4 years?

     农药泡饭, 2014-10-14 12:56
  2. 😮 Is the performance really this good??

     41255411, 2014-10-11 12:12