
Apache Spark 1.4.0 Officially Released

  I am short on time this morning; I will cover the Spark 1.4 updates in detail later, so please keep an eye on this blog.

  For the new features in Apache Spark 1.4.0, see "Apache Spark 1.4.0 New Features in Detail".

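  Apache Spark 1.4.0 was officially released on June 11, 2015 (US time). Highlights include Python 3 support, an R API, window functions, ORC support, statistical analysis functions for DataFrames, and a better UI for visualizing execution. In addition, the machine learning pipelines API graduates from alpha to a stable API.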

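  Apache Spark 1.4.0 is the fifth release on the 1.x line. It officially adds the R API to Spark, improves the usability of the core engine, and extends MLlib and Spark Streaming. More than 210 contributors from 70 organizations took part in Spark 1.4, submitting more than 1,000 patches.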


Hi All,

I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!

A huge thanks go to all of the individuals and organizations involved
in development and testing of this release.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-4-0.html
[2] http://spark.apache.org/downloads.html


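  Spark 1.4 is the first release to include SparkR, an R binding for Spark built on Spark's new DataFrame API. SparkR lets R users analyze large datasets on a Spark cluster and use Spark SQL directly. See the SparkR (R on Spark) programming guide for details.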

Spark Core

  In Spark Core, this release mainly brings operational, performance, and compatibility improvements. The main updates are:

SPARK-6942: Visualization for Spark DAGs and operational monitoring
SPARK-4897: Python 3 support
SPARK-3644: A REST API for application information
SPARK-4550: Serialized shuffle outputs for improved performance
SPARK-7081: Initial performance improvements in project Tungsten
SPARK-3074: External spilling for Python groupByKey operations
SPARK-3674: YARN support for Spark EC2 and SPARK-5342: Security for long running YARN applications
SPARK-2691: Docker support in Mesos and SPARK-6338: Cluster mode in Mesos
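
  As an illustration of SPARK-3644 in the list above, here is a minimal Python 3 sketch of reading the application list over the REST API. It assumes the endpoints are mounted under /api/v1, with a running application's UI on port 4040 and a history server on port 18080; check the monitoring guide for your deployment. The helper names applications_url and list_applications are hypothetical and not part of Spark.

```python
import json
from urllib.request import urlopen

# Assumption: Spark mounts its REST endpoints under this path.
API_ROOT = "api/v1"


def applications_url(host="localhost", port=4040):
    """Build the URL of the application-list endpoint (hypothetical helper)."""
    return "http://{}:{}/{}/applications".format(host, port, API_ROOT)


def list_applications(host="localhost", port=4040, timeout=5):
    """Fetch and decode the JSON list of applications.

    Needs a reachable Spark UI or history server; raises URLError otherwise.
    """
    with urlopen(applications_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

  With a driver running locally, list_applications() should return a list of JSON objects describing applications; for a history server, pass its host name and port 18080.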

DataFrame API and Spark SQL

  The DataFrame API sees major extensions in Spark 1.4 (see this link for a full list) with a focus on analytic and mathematical functions. Spark SQL introduces new operational utilities along with support for ORCFile.

SPARK-2883: Support for ORCFile format
SPARK-2213: Sort-merge joins to optimize very large joins
SPARK-5100: Dedicated UI for the SQL JDBC server
SPARK-6829: Mathematical functions in DataFrames
SPARK-8299: Improved error message reporting for DataFrame and SQL
SPARK-1442: Window functions in Spark SQL and DataFrames
SPARK-6231 / SPARK-7059: Improved API support for self joins
SPARK-5947: Partitioning support in Spark’s data source API
SPARK-7320: Rollup and cube functions
SPARK-6117: Summary and descriptive statistics
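
  To make the window functions item (SPARK-1442) above more concrete, the sketch below reproduces in plain Python what a SQL-style RANK() OVER (PARTITION BY ... ORDER BY ... DESC) computes. It is not Spark code and does not use the Spark API; the function name rank_within_partition and the sample column names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def rank_within_partition(rows, partition_key, order_key, descending=True):
    """Attach a SQL-style RANK to each row, computed per partition.

    Ties share a rank and the following rank is skipped (1, 1, 3, ...).
    """
    ranked = []
    # groupby needs its input sorted by the grouping key.
    rows_sorted = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(rows_sorted, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=descending)
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous:
                rank = position
                previous = row[order_key]
            ranked.append(dict(row, rank=rank))
    return ranked


rows = [
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 20},
    {"dept": "a", "salary": 20},
    {"dept": "b", "salary": 5},
]
result = rank_within_partition(rows, "dept", "salary")
```

  Unlike a GROUP BY aggregate, every input row is kept here; each row simply gains a value computed over the other rows in its partition, which is the defining property of a window function.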

Spark ML/MLlib

  Spark’s ML pipelines API graduates from alpha in this release, with new transformers and improved Python coverage. MLlib also adds several new algorithms.

SPARK-5884: A variety of feature transformers for ML pipelines
SPARK-7381: Python API for ML pipelines
SPARK-5854: Personalized PageRank for GraphX
SPARK-6113: Stabilize DecisionTree and ensembles APIs
SPARK-7262: Binary LogisticRegression with L1/L2 (elastic net)
SPARK-7015: OneVsRest multiclass to binary reduction
SPARK-4588: Add API for feature attributes
SPARK-1406: PMML model evaluation support via MLlib
SPARK-5995: Make ML Prediction Developer APIs public
SPARK-3066: Support recommendAll in matrix factorization model
SPARK-4894: Bernoulli naive Bayes

Spark Streaming

  Spark Streaming adds visual instrumentation graphs and significantly improved debugging information in the UI. It also enhances support for both Kafka and Kinesis.

SPARK-7602: Visualization and monitoring in the streaming UI including batch drill down (SPARK-6796, SPARK-6862)
SPARK-7621: Better error reporting for Kafka
SPARK-2808: Support for Kafka 0.8.2.1 and Kafka with Scala 2.11
SPARK-5946: Python API for Kafka direct mode
SPARK-7111: Input rate tracking for Kafka
SPARK-5960: Support for transferring AWS credentials to Kinesis
SPARK-7056: A pluggable interface for write ahead logs

Known Issues

  This release has a few known issues, which will be addressed in Spark 1.4.1:

Python sortBy()/sortByKey() can hang if a single partition is larger than worker memory SPARK-8202
Unintended behavior change of JSON schema inference SPARK-8093
Some ML pipeline components do not correctly implement copy SPARK-8151
Spark-ec2 branch pointer is wrong SPARK-8310
Permalink: Apache Spark 1.4.0 Officially Released (https://www.iteblog.com/archives/1390.html)
