Apache Spark 2.4.0 正式发布

文章目录

1 Core and Spark SQL
2 Structured Streaming
3 MLlib
4 SparkR
5 GraphX
6 Deprecations
7 Changes of behavior
8 Known Issues

Apache Spark 2.4 与昨天正式发布，Apache Spark 2.4 版本是 2.x 系列的第五个版本。

如果想及时了解Spark、Hadoop或者Hbase相关的文章，欢迎关注微信公共帐号：iteblog_hadoop

Apache Spark 2.4 为我们带来了众多的主要功能和增强功能，主要如下：

新的调度模型（Barrier Scheduling），使用户能够将分布式深度学习训练恰当地嵌入到 Spark 的 stage 中，以简化分布式训练工作流程。
添加了35个高阶函数，用于在 Spark SQL 中操作数组/map。
新增一个新的基于 Databricks 的 spark-avro 模块的原生 AVRO 数据源。
PySpark 还为教学和可调试性的所有操作引入了热切的评估模式（eager evaluation mode）。
Spark on K8S 支持 PySpark 和 R ，支持客户端模式（client-mode）。
Structured Streaming 的各种增强功能。例如，连续处理（continuous processing）中的有状态操作符。
内置数据源的各种性能改进。例如，Parquet 嵌套模式修剪（schema pruning）。
支持 Scala 2.12。

具体的介绍可以参见《即将发布的 Apache Spark 2.4 都有哪些新功能》文章的介绍。以下是各个模块解决的 ISSUE 列表。

Core and Spark SQL

主要功能
- Barrier Execution Mode: [SPARK-24374] 调度器中支持 Barrier Execution Mode，使得 Spark 可以更好地和机器学习框架进行整合；
- Scala 2.12 Support: [SPARK-14220] 添加对 Scala 2.12 的支持。现在，您可以使用 Scala 2.12 构建 Spark ，并使用 Scala 2.12 中编写 Spark 应用程序。
- Higher-order functions: [SPARK-23899] 添加许多新的内置函数，包括高阶函数，可以更轻松地处理复杂数据类型。
- Built-in Avro data source: [SPARK-24768] Inline Spark-Avro package with logical type support, better performance and usability.
API
- [SPARK-24035] SQL syntax for Pivot
- [SPARK-24940] Coalesce and Repartition Hint for SQL Queries
- [SPARK-19602] Support column resolution of fully qualified column name
- [SPARK-21274] Implement EXCEPT ALL and INTERSECT ALL
Performance and stability
- [SPARK-16406] Reference resolution for large number of columns should be faster
- [SPARK-23486] Cache the function name from the external catalog for lookupFunctions
- [SPARK-23803] Support Bucket Pruning
- [SPARK-24802] Optimization Rule Exclusion
- [SPARK-4502] Nested schema pruning for Parquet tables
- [SPARK-24296] Support replicating blocks larger than 2 GB
- [SPARK-24307] Support sending messages over 2GB from memory
- [SPARK-23243] Shuffle+Repartition on an RDD could lead to incorrect answers
- [SPARK-25181] Limited the size of BlockManager master and slave thread pools, lowering memory overhead when networking is slow
Connectors
- [SPARK-23972] Update Parquet from 1.8.2 to 1.10.0
- [SPARK-25419] Parquet predicate pushdown improvement
- [SPARK-23456] Native ORC reader is on by default
- [SPARK-22279] Use native ORC reader to read Hive serde tables by default
- [SPARK-21783] Turn on ORC filter push-down by default
- [SPARK-24959] Speed up count() for JSON and CSV
- [SPARK-24244] Parsing only required columns to the CSV parser
- [SPARK-23786] CSV schema validation - column names are not checked
- [SPARK-24423] Option query for specifying the query to read from JDBC
- [SPARK-22814] Support Date/Timestamp in JDBC partition column
- [SPARK-24771] Update Avro from 1.7.7 to 1.8
Kubernetes Scheduler Backend
- [SPARK-23984] PySpark bindings for K8S
- [SPARK-24433] R bindings for K8S
- [SPARK-23146] Support client mode for Kubernetes cluster backend
- [SPARK-23529] Support for mounting K8S volumes
PySpark
- [SPARK-24215] Implement eager evaluation for DataFrame APIs
- [SPARK-22274] User-defined aggregation functions with Pandas UDF
- [SPARK-22239] User-defined window functions with Pandas UDF
- [SPARK-24396] Add Structured Streaming ForeachWriter for Python
- [SPARK-23874] Upgrade Apache Arrow to 0.10.0
- [SPARK-25004] Add spark.executor.pyspark.memory limit
- [SPARK-23030] Use Arrow stream format for creating from and collecting Pandas DataFrames
- [SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF
Other notable changes
- [SPARK-24596] Non-cascading Cache Invalidation
- [SPARK-23880] Do not trigger any job for caching data
- [SPARK-23510] Support Hive 2.2 and Hive 2.3 metastore
- [SPARK-23711] Add fallback generator for UnsafeProjection
- [SPARK-24626] Parallelize location size calculation in Analyze Table command

Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming

Major features
- [SPARK-24565] Exposed the output rows of each microbatch as a DataFrame using foreachBatch (Python, Scala, and Java)
- [SPARK-24396] Added Python API for foreach and ForeachWriter
- [SPARK-25005] Support “kafka.isolation.level” to read only committed records from Kafka topics that are written using a transactional producer.
Other notable changes
- [SPARK-24662] Support the LIMIT operator for streams in Append or Complete mode
- [SPARK-24763] Remove redundant key data from value in streaming aggregation
- [SPARK-24156] Faster generation of output results and/or state cleanup with stateful operations (mapGroupsWithState, stream-stream join, streaming aggregation, streaming dropDuplicates) when there is no data in the input stream.
- [SPARK-24730] Support for choosing either the min or max watermark when there are multiple input streams in a query
- [SPARK-25399] Fixed a bug where reusing execution threads from continuous processing for microbatch streaming can result in a correctness issue
- [SPARK-18057] Upgraded Kafka client version from 0.10.0.1 to 2.0.0

Programming guide: Structured Streaming Programming Guide.

MLlib

Major features
- [SPARK-22666] Spark datasource for image format
Other notable changes
- [SPARK-22119] Add cosine distance measure to KMeans/BisectingKMeans/Clustering evaluator
- [SPARK-10697] Lift Calculation in Association Rule mining
- [SPARK-14682] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
- [SPARK-7132] Add fit with validation set to spark.ml GBT
- [SPARK-15784] Add Power Iteration Clustering to spark.ml
- [SPARK-15064] Locale support in StopWordsRemover
- [SPARK-21741] Python API for DataFrame-based multivariate summarizer
- [SPARK-21898] Feature parity for KolmogorovSmirnovTest in MLlib
- [SPARK-10884] Support prediction on single instance for regression and classification related models
- [SPARK-23783] Add new generic export trait for ML pipelines
- [SPARK-11239] PMML export for ML linear regression

Programming guide: Machine Learning Library (MLlib) Guide.

SparkR

Major features
- [SPARK-25393] Adding new function from_csv()
- [SPARK-21291] add R partitionBy API in DataFrame
- [SPARK-25007] Add array_intersect/array_except/array_union/shuffle to SparkR
- [SPARK-25234] avoid integer overflow in parallelize
- [SPARK-25117] Add EXCEPT ALL and INTERSECT ALL support in R
- [SPARK-24537] Add array_remove / array_zip / map_from_arrays / array_distinct
- [SPARK-24187] Add array_join function to SparkR
- [SPARK-24331] Adding arrays_overlap, array_repeat, map_entries to SparkR
- [SPARK-24198] Adding slice function to SparkR
- [SPARK-24197] Adding array_sort function to SparkR
- [SPARK-24185] add flatten function to SparkR
- [SPARK-24069] Add array_min / array_max functions
- [SPARK-24054] Add array_position function / element_at functions
- [SPARK-23770] Add repartitionByRange API in SparkR

Programming guide: SparkR (R on Spark).

GraphX

Major features
- [SPARK-25268] run Parallel Personalized PageRank throws serialization Exception

Programming guide: GraphX Programming Guide.

Deprecations

MLlib
- [SPARK-23451] Deprecate KMeans computeCost
- [SPARK-25345] Deprecate readImages APIs from ImageSchema

Changes of behavior

Spark Core
- [SPARK-25088] Rest Server default & doc updates
Spark SQL
- [SPARK-23549] Cast to timestamp when comparing timestamp with date
- [SPARK-24324] Pandas Grouped Map UDF should assign result columns by name
- [SPARK-23425] load data for hdfs file path with wildcard usage is not working properly
- [SPARK-23173] from_json can produce nulls for fields which are marked as non-nullable
- [SPARK-24966] Implement precedence rules for set operations
- [SPARK-25708] HAVING without GROUP BY should be global aggregate
- [SPARK-24341] Correctly handle multi-value IN subquery
- [SPARK-19724] Create a managed table with an existed default location should throw an exception

Please read the Migration Guide for all the behavior changes

Known Issues

[SPARK-25271] CTAS with Hive parquet tables should leverage native parquet source
[SPARK-24935] Problem with Executing Hive UDAF’s from Spark 2.2 Onwards
[SPARK-25879] Schema pruning fails when a nested field and top level field are selected
[SPARK-25906] spark-shell cannot handle -i option correctly
[SPARK-25921] Python worker reuse causes Barrier tasks to run without BarrierTaskContext
[SPARK-25918] LOAD DATA LOCAL INPATH should handle a relative path