
Spark Breaks MapReduce's Sorting World Record

  Yesterday Databricks published a post announcing that Spark, running on just 206 nodes, has broken the previous MapReduce world records for sorting 100 TB and 1 PB of data. The previous record was set by Yahoo, which sorted 102.5 TB with MapReduce on 2100 Hadoop nodes in 72 minutes. This time Spark was deployed on 206 EC2 nodes and sorted 100 TB in 23 minutes, with all of the sorting done on disk. In other words, Spark sorted the same scale of data 3X faster while using 10X fewer machines.
  The team also compared sorting 1 PB: Spark used 190 machines and finished in about 4 hours (234 minutes), whereas the earlier Hadoop run used 3800 machines and took 16 hours. The detailed comparison is as follows:

|                              | Hadoop World Record   | Spark 100 TB     | Spark 1 PB       |
|------------------------------|-----------------------|------------------|------------------|
| Data Size                    | 102.5 TB              | 100 TB           | 1000 TB          |
| Elapsed Time                 | 72 mins               | 23 mins          | 234 mins         |
| # Nodes                      | 2100                  | 206              | 190              |
| # Cores                      | 50400                 | 6592             | 6080             |
| # Reducers                   | 10,000                | 29,000           | 250,000          |
| Rate                         | 1.42 TB/min           | 4.27 TB/min      | 4.27 TB/min      |
| Rate/node                    | 0.67 GB/min           | 20.7 GB/min      | 22.5 GB/min      |
| Sort Benchmark Daytona Rules | Yes                   | Yes              | No               |
| Environment                  | dedicated data center | EC2 (i2.8xlarge) | EC2 (i2.8xlarge) |
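The Rate and Rate/node rows follow directly from the data sizes, elapsed times, and node counts above. A minimal sanity-check sketch in Scala (assuming decimal units, 1 TB = 1000 GB; small deviations from the published figures come from rounding of the elapsed times):

```scala
// Recompute the throughput rows of the table above from the raw figures.
object SortRates extends App {
  case class Run(name: String, dataTB: Double, minutes: Double, nodes: Int)

  val runs = Seq(
    Run("Hadoop World Record", 102.5, 72, 2100),
    Run("Spark 100 TB", 100.0, 23, 206),
    Run("Spark 1 PB", 1000.0, 234, 190)
  )

  for (r <- runs) {
    val rateTBmin   = r.dataTB / r.minutes        // cluster-wide rate
    val ratePerNode = rateTBmin * 1000 / r.nodes  // GB/min per node
    println(f"${r.name}%-20s ${rateTBmin}%5.2f TB/min  ${ratePerNode}%5.2f GB/min per node")
  }
}
```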

  Three technical improvements contributed most to this result:
  1. Sort-based shuffle: introduced in Spark 1.1. The earlier shuffle was hash-based, and the two differ greatly in performance; see the earlier article "Spark shuffle: hash vs. sort performance comparison" (《Spark shuffle:hash和sort性能对比》). The module still had quite a few bugs in 1.1, but it may become the default shuffle in Spark 1.2. Opting into it is a one-line configuration change, as sketched below.
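A minimal sketch of selecting the sort-based shuffle in a Spark 1.1 job via the spark.shuffle.manager setting (the application name and the toy job are placeholders for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in pre-1.3 Spark

// Opt into the sort-based shuffle introduced in Spark 1.1;
// "hash" selects the older hash-based implementation.
val conf = new SparkConf()
  .setAppName("sort-shuffle-demo") // placeholder app name
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)

// Any shuffle-heavy operation (sortByKey, reduceByKey, ...) now runs
// through the sort-based shuffle path.
val sorted = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b")).sortByKey()
sorted.collect().foreach(println)
```

The same key can also be passed at submit time, e.g. spark-submit --conf spark.shuffle.manager=sort.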
  2. Netty-based network module: Spark is reimplementing its old network transport module on top of Netty (Netty's Epoll native socket transport via JNI), which significantly reduces data copying between the OS kernel and user space and cuts down JVM garbage collection. The module is still being implemented and is disabled by default. Details from the issue tracker:

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC
Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/
One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
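To make the zero-copy point concrete, here is a minimal Scala sketch (not Spark's actual implementation; host, port, and function name are placeholders) contrasting the copy-heavy path described above with FileChannel.transferTo, which lets the kernel move file bytes to the socket without surfacing them in user space:

```scala
import java.io.{File, FileInputStream}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Minimal sketch of kernel-space zero-copy with FileChannel.transferTo.
def sendFileZeroCopy(file: File, host: String, port: Int): Unit = {
  val socket  = SocketChannel.open(new InetSocketAddress(host, port))
  val channel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    val size = channel.size()
    // transferTo may send fewer bytes than requested, so loop until done.
    // Data flows disk -> kernel buffer -> NIC without ever entering the
    // JVM heap, avoiding the extra copies and GC pressure quoted above.
    while (position < size) {
      position += channel.transferTo(position, size - position, socket)
    }
  } finally {
    channel.close()
    socket.close()
  }
}
```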

  3. A brand-new external shuffle service: this module is decoupled from Spark's executors and built on top of the Netty network module, so that shuffle files can still be served even while executors are stalled in GC. It is also under development, planned as a two-step effort, described as follows:

This task will be broken up into two parts – the first, being to refactor our internal shuffle service to use a BlockTransferService which we can easily extract out into its own service, and then the second is to actually do the extraction.
Here is the design document for the low-level service, nicknamed "Sluice", on top of which will be Spark's BlockTransferService API:
https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0
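For context, when the external shuffle service later shipped it was turned on through configuration. A minimal sketch, assuming the spark.shuffle.service.enabled key from later Spark releases (not the in-progress code described above) and a shuffle service process already running on each worker:

```scala
import org.apache.spark.SparkConf

// Fetch shuffle blocks from the external shuffle service (Spark 1.2+)
// instead of from executors directly, so blocks remain reachable even
// while an executor is paused in a GC cycle.
val conf = new SparkConf()
  .setAppName("external-shuffle-demo") // placeholder app name
  .set("spark.shuffle.service.enabled", "true")
```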

Unless otherwise stated, all articles on this blog are original.
Copyright belongs to 过往记忆大数据 (过往记忆); reproduction without permission is prohibited.
Permalink: Spark Breaks MapReduce's Sorting World Record (https://www.iteblog.com/archives/1141.html)
2 comments
  1. Does MapReduce have any advantage left against Spark? Will Spark completely replace MapReduce within the next 3-4 years?

     农药泡饭, 2014-10-14 12:56
  2. 😮 Is the performance really this good??

     41255411, 2014-10-11 12:12