Apache Spark 1.4.0 Officially Released

Contents

1 SparkR
2 Spark Core
3 DataFrame API and Spark SQL
4 Spark ML/MLlib
5 Spark Streaming
6 Known Issues

　　For a detailed look at the new features in Apache Spark 1.4.0, see the post "Apache Spark 1.4.0 New Features Explained" (《Apache Spark 1.4.0新特性详解》).

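　　Apache Spark 1.4.0 was officially released on June 11, 2015 (US time). Highlights include Python 3 support, an R API, window functions, ORC support, statistical and analytic functions for DataFrames, a better UI for inspecting execution, and the graduation of the machine learning pipelines API from alpha to a stable API.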

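　　Apache Spark 1.4.0 is the fifth release on the 1.x line. It officially adds the R API to Spark, improves the usability of the core engine, and extends MLlib and Spark Streaming. More than 210 contributors from 70 institutions took part in Spark 1.4, with more than 1,000 patches.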

The announcement email reads as follows:

Hi All,

I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!

A huge thanks go to all of the individuals and organizations involved
in development and testing of this release.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone who helped work on this release!

[1] http://spark.apache.org/releases/spark-release-1-4-0.html
[2] http://spark.apache.org/downloads.html

SparkR

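　　Spark 1.4 is the first release to include SparkR, an R binding for Spark built on Spark's new DataFrame API. SparkR lets R users analyze large datasets on a Spark cluster, and it can use Spark SQL directly. See the SparkR (R on Spark) programming guide for details.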

Spark Core

　　Spark Core mainly gains improvements in operability, performance, and compatibility. The main updates are:

SPARK-6942: Visualization for Spark DAGs and operational monitoring
SPARK-4897: Python 3 support
SPARK-3644: A REST API for application information
SPARK-4550: Serialized shuffle outputs for improved performance
SPARK-7081: Initial performance improvements in project Tungsten
SPARK-3074: External spilling for Python groupByKey operations
SPARK-3674: YARN support for Spark EC2 and SPARK-5342: Security for long running YARN applications
SPARK-2691: Docker support in Mesos and SPARK-6338: Cluster mode in Mesos
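
SPARK-3644 in the list above exposes application information as JSON over HTTP. The following is a minimal sketch of querying it with Python's standard library only. It assumes a driver UI reachable at a base URL such as http://localhost:4040 and the `/api/v1` path prefix described in Spark's monitoring guide. The helper names `api_url` and `fetch_json` are hypothetical and not part of Spark.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def api_url(base, *parts):
    """Build a REST URL such as <base>/api/v1/applications/<app-id>/jobs."""
    # Percent-encode each path segment so unusual application ids stay valid.
    path = "/".join(quote(str(p), safe="") for p in parts)
    return base.rstrip("/") + "/api/v1/" + path


def fetch_json(url, timeout=5.0):
    """GET a URL and decode its JSON body (needs a running Spark UI)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With an application running, `fetch_json(api_url("http://localhost:4040", "applications"))` should return a list of application records. The host and port here are assumptions and depend on the deployment.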

DataFrame API and Spark SQL

　　The DataFrame API sees major extensions in Spark 1.4 (see this link for a full list) with a focus on analytic and mathematical functions. Spark SQL introduces new operational utilities along with support for ORCFile.

SPARK-2883: Support for ORCFile format
SPARK-2213: Sort-merge joins to optimize very large joins
SPARK-5100: Dedicated UI for the SQL JDBC server
SPARK-6829: Mathematical functions in DataFrames
SPARK-8299: Improved error message reporting for DataFrame and SQL
SPARK-1442: Window functions in Spark SQL and DataFrames
SPARK-6231 / SPARK-7059: Improved API support for self joins
SPARK-5947: Partitioning support in Spark’s data source API
SPARK-7320: Rollup and cube functions
SPARK-6117: Summary and descriptive statistics
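
To make the rollup and cube entry (SPARK-7320) above concrete, here is a plain-Python sketch of what the two aggregations compute. It does not use Spark. The function names and the sample `sales` rows are made up for illustration, and `None` stands in for the null that marks a subtotal column.

```python
from collections import defaultdict
from itertools import combinations


def grouping_sets_sum(rows, keys, value, sets):
    """Sum `value` per grouping set; keys left out of a set become None."""
    totals = defaultdict(int)
    for row in rows:
        for kept in sets:
            group = tuple(row[k] if k in kept else None for k in keys)
            totals[group] += row[value]
    return dict(totals)


def rollup_sum(rows, keys, value):
    """Rollup: hierarchical prefixes of the keys, e.g. (k1, k2), (k1,), ()."""
    sets = [tuple(keys[:i]) for i in range(len(keys), -1, -1)]
    return grouping_sets_sum(rows, keys, value, sets)


def cube_sum(rows, keys, value):
    """Cube: every subset of the keys, so it also yields per-k2 subtotals."""
    sets = [c for n in range(len(keys), -1, -1) for c in combinations(keys, n)]
    return grouping_sets_sum(rows, keys, value, sets)


# Hypothetical sample data.
sales = [
    {"dept": "a", "year": 2014, "amount": 10},
    {"dept": "a", "year": 2015, "amount": 20},
    {"dept": "b", "year": 2015, "amount": 5},
]

rollup = rollup_sum(sales, ["dept", "year"], "amount")
cube = cube_sum(sales, ["dept", "year"], "amount")
```

Here `rollup` holds per-(dept, year) sums, per-dept subtotals, and a grand total, while `cube` additionally holds per-year subtotals. In PySpark the corresponding DataFrame methods are named `rollup` and `cube`; this sketch only mirrors their semantics, not their API.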

Spark ML/MLlib

　　Spark’s ML pipelines API graduates from alpha in this release, with new transformers and improved Python coverage. MLlib also adds several new algorithms.

SPARK-5884: A variety of feature transformers for ML pipelines
SPARK-7381: Python API for ML pipelines
SPARK-5854: Personalized PageRank for GraphX
SPARK-6113: Stabilize DecisionTree and ensembles APIs
SPARK-7262: Binary LogisticRegression with L1/L2 (elastic net)
SPARK-7015: OneVsRest multiclass to binary reduction
SPARK-4588: Add API for feature attributes
SPARK-1406: PMML model evaluation support via MLlib
SPARK-5995: Make ML Prediction Developer APIs public
SPARK-3066: Support recommendAll in matrix factorization model
SPARK-4894: Bernoulli naive Bayes
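
The elastic net entry (SPARK-7262) above mixes L1 and L2 regularization. As a rough sketch, assuming the convention described in the Spark ML documentation (a regularization strength `reg_param` and a mixing weight `elastic_net_param` between 0 and 1), the penalty added to the loss can be computed as below. The function itself is illustrative and not Spark code.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return reg_param * (a * ||w||_1 + (1 - a) / 2 * ||w||_2^2), a = elastic_net_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param
    return reg_param * (mix * l1 + (1.0 - mix) * 0.5 * l2_squared)
```

Setting `elastic_net_param` to 1 gives a pure L1 (lasso) penalty and setting it to 0 gives a pure L2 (ridge) penalty; values in between blend the two.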

Spark Streaming

　　Spark Streaming adds visual instrumentation graphs and significantly improved debugging information in the UI. It also enhances support for both Kafka and Kinesis.

SPARK-7602: Visualization and monitoring in the streaming UI including batch drill down (SPARK-6796, SPARK-6862)
SPARK-7621: Better error reporting for Kafka
SPARK-2808: Support for Kafka 0.8.2.1 and Kafka with Scala 2.11
SPARK-5946: Python API for Kafka direct mode
SPARK-7111: Input rate tracking for Kafka
SPARK-5960: Support for transferring AWS credentials to Kinesis
SPARK-7056: A pluggable interface for write ahead logs

Known Issues

　　This release has a few known issues, which will be addressed in Spark 1.4.1:

Python sortBy()/sortByKey() can hang if a single partition is larger than worker memory SPARK-8202
Unintended behavior change of JSON schema inference SPARK-8093
Some ML pipeline components do not correctly implement copy SPARK-8151
Spark-ec2 branch pointer is wrong SPARK-8310

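Unless otherwise stated, all posts on this blog are original.
Copyright of original articles belongs to 过往记忆大数据 (过往记忆); reproduction without permission is prohibited.
Link to this post: "Apache Spark 1.4.0 Officially Released" (https://www.iteblog.com/archives/1390.html)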