It’s been a while since we last checked in with Apache Spark, but that just means there are all kinds of exciting new goodies and features to enjoy! The 2.4 release is full of improvements and upgrades to create a faster, easier, and smarter experience for big data developers.
Apache Spark is a unified analytics engine for large-scale data processing. Designed for speed and general-purpose use, Apache Spark is one of the most popular projects under the Apache software umbrella and one of the most active open source big data projects.
Apache Spark 2.4
The latest offering from Apache sees the Spark project continue to pursue its longstanding goals of speed, efficiency, and ease of use. Other focuses include stability and refinement: over 1,000 tickets were resolved for this release!
To start, Project Hydrogen is beginning to bear fruit. Meant to bring big data and AI together, its new barrier execution mode supports better integration with distributed deep learning frameworks. Spark's existing scheduler launches tasks independently of one another, which is a poor fit for complex communication patterns like All-Reduce that need every task running at the same time.
With the new barrier execution mode, Spark launches all the tasks of a training stage together, much like an MPI job, and restarts them together if any of them fail. This also gives barrier tasks a new fault-tolerance model: when any barrier task fails, Spark aborts all the tasks in the stage and restarts the whole stage.
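As a rough illustration, here is a minimal Scala sketch of the new RDD.barrier() API; the app name, partition count, and the training hook inside the partition function are placeholders rather than anything prescribed by the release.

```scala
import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Mark the stage as a barrier stage: all four tasks are scheduled together,
    // and a failure in any one of them restarts the whole stage.
    val results = sc.parallelize(1 to 100, numSlices = 4)
      .barrier()
      .mapPartitions { rows =>
        val ctx = BarrierTaskContext.get()
        // Placeholder: launch a distributed training worker (e.g. an MPI rank) here.
        ctx.barrier() // wait until every task in the stage reaches this point
        rows
      }
      .collect()

    println(results.length)
    spark.stop()
  }
}
```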
Apache Spark 2.4 also offers built-in higher-order functions for complex types such as arrays and maps. Developers can now manipulate these complex values directly, passing an anonymous lambda function to functions like transform and filter instead of exploding the column or writing a custom UDF.
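For example, here is a small Scala sketch using two of the new higher-order functions through selectExpr; the column name and sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hof-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// One array column per row.
val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("values")

// transform applies the lambda to every element of the array.
df.selectExpr("transform(values, x -> x * 2) AS doubled").show()

// filter keeps only the elements that satisfy the predicate.
df.selectExpr("filter(values, x -> x % 2 = 0) AS evens").show()
```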
SEE ALSO: Deep learning anomalies with TensorFlow and Apache Spark
Apache Spark 2.4 provides experimental support for Scala 2.12. Developers can now build complete Spark applications against Scala 2.12 simply by picking the 2.12 dependency. Scala 2.12 also brings better interoperability with Java 8, which means improved serialization of lambda functions.
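In practice, picking that dependency might look like the sbt sketch below; the exact Scala patch version and the provided scoping are assumptions for illustration.

```scala
// build.sbt (sketch)
// With %% sbt appends the Scala binary version, so these resolve to the
// experimental spark-core_2.12 / spark-sql_2.12 artifacts of Spark 2.4.
scalaVersion := "2.12.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)
```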
This release also has built-in support for Apache Avro, the popular data serialization format. Developers can now read and write their Avro data right in Apache Spark. The module started life as a Databricks project and adds a few new functions and logical type support.
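Reading and writing Avro then looks roughly like this sketch; the paths and the country column are placeholders, and since spark-avro ships as a separate module you still need to add it to your application (for example via --packages).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-sketch").getOrCreate()

// Read an Avro file into a DataFrame.
val events = spark.read.format("avro").load("/data/events.avro")

// Write a filtered subset back out as Avro.
events
  .filter("country = 'DE'")
  .write
  .format("avro")
  .save("/data/events_de")
```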
Additionally, Apache Spark 2.4 comes with improved Kubernetes integration in three specific ways:
- Support for running containerized PySpark and SparkR applications on Kubernetes
- Client mode, so users can run interactive tools such as shells and notebooks in a pod on a Kubernetes cluster (a sketch follows this list)
- Support for mounting more Kubernetes volume types, including emptyDir, hostPath, and persistentVolumeClaim
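A client-mode session on Kubernetes could be configured roughly as below; the API server address, image, namespace, and volume names are placeholders, and a real client-mode driver also needs to be reachable from the executors (for example through a headless service).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("k8s-client-mode-sketch")
  // Client mode: the driver runs in this process (e.g. inside a pod on the cluster)
  // and talks to the Kubernetes API server directly.
  .master("k8s://https://kubernetes.default.svc:443")
  .config("spark.kubernetes.namespace", "analytics")
  .config("spark.kubernetes.container.image", "my-registry/spark:2.4.0")
  // Mount a persistentVolumeClaim into every executor pod.
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoints.mount.path", "/checkpoints")
  .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.checkpoints.options.claimName", "spark-checkpoints")
  .getOrCreate()

spark.range(1000).count()
spark.stop()
```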
Other improvements include eager evaluation of DataFrames in notebooks for easier debugging and troubleshooting, improvements to Pandas UDFs, and the removal of the 2GB block size limitation. Apache Spark 2.4 is also available as part of Databricks Runtime 5.0, the latest version of Databricks' platform for Apache Spark.
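Eager evaluation is controlled by a session configuration; the sketch below shows where it lives, though in 2.4 the richer rendering mainly benefits PySpark notebooks such as Jupyter. The session name and row limit are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eager-eval-sketch").getOrCreate()

// Ask the REPL/notebook frontend to render DataFrame contents immediately
// instead of just the schema, capped at 20 rows.
spark.conf.set("spark.sql.repl.eagerEval.enabled", true)
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 20L)

spark.range(5).toDF("id") // in a supporting notebook this cell now prints the rows
```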
SEE ALSO: How well do you know your Apache Spark trivia?
Getting Apache Spark 2.4
For more information about the 2.4 release, check out the Spark update post here and the full release notes here. Apache Spark 2.4 is available for free here. However, if you’d like to try it with Databricks Runtime 5.0, there’s a free trial available here.