How to choose the Scala version for my Spark program?

I am building my first Spark application, developing with IDEA.
In my cluster, the version of Spark is 2.1.0, and the version of Scala is 2.11.8.
http://spark.apache.org/downloads.html tells me: "Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support".
So here is my question: what is the meaning of "Scala 2.10 users should download the Spark source package and build with Scala 2.10 support"? Why not use the version of Scala 2.1.1?
Another question: which version of Scala can I choose?

First, a word about the "why".
The reason this subject even exists is that Scala versions are not (generally speaking) binary compatible, although most of the time source code is compatible.
So you can take Scala 2.10 source and compile it into 2.11.x or 2.10.x versions. But 2.10.x compiled binaries (JARs) cannot be run in a 2.11.x environment.
You can read more on the subject.
Spark Distributions
So, the Spark package, as you mention, is built for Scala 2.11.x runtimes.
That means you cannot run a JAR of yours compiled for Scala 2.10.x on a cluster / Spark instance that runs the spark.apache.org-built distribution of Spark.
What would work is:
You compile your JAR for Scala 2.11.x and keep the same Spark
You recompile Spark for Scala 2.10 and keep your JAR as is
What are your options
Compiling your own JAR for Scala 2.11 instead of 2.10 is usually far easier than recompiling Spark itself (lots of dependencies to get right).
Usually, your Scala code is built with sbt, and sbt can target a specific Scala version; see, for example, this thread on SO. It is a matter of specifying:
scalaVersion in ThisBuild := "2.10.0"
You can also use sbt to "cross build", that is, build different JARs for different Scala versions:
crossScalaVersions := Seq("2.11.11", "2.12.2")
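Putting those two settings together, a minimal build.sbt sketch for a Spark application might look like the following (the Spark and Scala version numbers here are only illustrative; match them to your cluster). Running sbt +package then builds one JAR per version listed in crossScalaVersions.
scalaVersion := "2.11.11"
crossScalaVersions := Seq("2.10.6", "2.11.11")
// %% appends the Scala binary suffix (_2.10 or _2.11) to the artifact name,
// so each cross build pulls the spark-core artifact matching that Scala version
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"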
How to choose a Scala version
Well, this is "sort of" opinion based. My recommendation would be: choose the Scala version that matches your production Spark cluster.
If your production Spark is 2.3, downloaded from https://spark.apache.org/downloads.html, then, as they say, it uses Scala 2.11 and that is what you should use too. Using anything else, in my view, just leaves the door open for various incompatibilities down the road.
Stick with what your production needs.

Related

Apache Spark 3 and backward compatibility?

We have several Spark applications running in production, developed using Spark 2.4.1 (Scala 2.11.12).
For a couple of our new Spark jobs, we are considering using the features of Delta Lake. For this we need to use Spark 2.4.2 (or higher).
My questions are:
If we upgrade our Spark cluster to 3.0.0, can our 2.4.1 applications still run on the new cluster (without recompile)?
If we need to recompile our previous Spark jobs with Spark 3, are they source compatible or do they need any migration?
There are some breaking changes in Spark 3.0.0, including source incompatible changes and binary incompatible changes. See https://spark.apache.org/releases/spark-release-3-0-0.html. There are also some source and binary incompatible changes between Scala 2.11 and 2.12, so you may need to update your code because of the Scala version change as well.
However, only Delta Lake 0.7.0 and above require Spark 3.0.0. If upgrading to Spark 3.0.0 requires a lot of work, you can use Delta Lake 0.6.x or below; you just need to upgrade Spark to 2.4.2 or above in the 2.4.x line. These releases should be source and binary compatible.
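As a rough sbt sketch of the two options (the exact version numbers are illustrative and worth double-checking against the Delta Lake release notes):
// Option 1: stay on the Spark 2.4.x line (Scala 2.11) with Delta Lake 0.6.x
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.2" % "provided",
  "io.delta" %% "delta-core" % "0.6.1"
)
// Option 2: migrate to Spark 3 (Scala 2.12) and Delta Lake 0.7.x
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided",
  "io.delta" %% "delta-core" % "0.7.0"
)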
You can cross compile Spark 2.4 projects with Scala 2.11 and Scala 2.12. The Scala 2.12 JARs should generally work for Spark 3 applications, but there are edge cases where a Spark 2.4 / Scala 2.12 JAR won't work properly on a Spark 3 cluster.
It's best to make a clean migration to Spark 3/Scala 2.12 and cut the cord with Spark 2/Scala 2.11.
Upgrading can be a big pain, especially if your project has a lot of dependencies. For example, suppose your project depends on spark-google-spreadsheets, a project that's not built with Scala 2.12. With this dependency, you won't be able to easily upgrade your project to Scala 2.12. You'll need to either compile spark-google-spreadsheets with Scala 2.12 yourself or drop the dependency. See here for more details on how to migrate to Spark 3.

How to enable Partial Unification in Spark REPL with Scala 2.11.8?

I have Scala code written in Scala 2.11.12 using the partial-unification compiler option, which I would like to run in a Spark 2.2.2 REPL.
With a Spark version compiled against Scala 2.11.12 (i.e. 2.3+), this is possible in the Spark REPL via :settings -Ypartial-unification, and the code executes.
I want to run this on Spark 2.2.2, which is compiled against Scala 2.11.8.
To do this, I have downloaded the JAR with the partial-unification compiler plugin (source: https://github.com/milessabin/si2712fix-plugin), which backports this setting.
I've played around with a Scala 2.11.8 REPL (adding the jar to the classpath seems too rudimentary) and haven't managed to get it working there (before trying to add it to Spark). I am asking if anyone knows how to do this, or if adding a compiler setting to a REPL via a JAR is not possible.
Any other advice appreciated!
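For reference, the usual way to attach that plugin is through the build rather than the REPL; in an sbt project it is declared as a compiler plugin. The coordinates below are my recollection of the plugin's README, so treat them as an assumption to verify:
addCompilerPlugin("com.milessabin" % "si2712fix-plugin" % "1.2.0" cross CrossVersion.full)
For a plain Scala 2.11.8 REPL, compiler plugins are normally passed at startup rather than loaded afterwards, e.g. scala -Xplugin:/path/to/si2712fix-plugin.jar; whether the Spark 2.2 REPL exposes an equivalent hook is a separate question.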

Scala dependency in Spark/Pyspark

I want to install Spark on a machine and I don't know if I need Scala as a dependency.
I have installed Java (1.8) as the documentation says, but I don't know if I need any other dependencies.
Scala is included in Spark's distribution, so you don't need to install Scala to use the Spark REPL. Here are the Scala jars included.
If you are setting up an IDE for a Spark project, then you will need Scala as a dependency.
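For example, you can confirm which Scala version your Spark distribution bundles directly from its REPL (the version printed below is just an example; yours depends on the distribution):
scala> scala.util.Properties.versionString
res0: String = version 2.11.8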

com.databricks.spark.csv version requirement

Which version of com.databricks.spark.csv is compatible with Spark 1.6.1 and Scala 2.10.5?
I can see
com.databricks_spark-csv_2.10-1.5.0.jar
com.databricks_spark-csv_2.11-1.3.0.jar
already available on my machine, and as far as my understanding goes, if I have Scala version 2.10, then the first option is the one I have to use. I just wanted to re-confirm.
You should use the first jar, as you have Scala 2.10 on your machine:
com.databricks_spark-csv_2.10-1.5.0.jar
The _2.10 suffix means it is built for Scala 2.10.
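If you load the package from a repository instead of the local jar, the same suffix appears in the Maven coordinates, for example (a sketch of the usual spark-shell invocation):
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0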

Using Scala 2.12 with Spark 2.x

At the Spark 2.1 docs it's mentioned that
Spark runs on Java 7+, Python 2.6+/3.4+ and R 3.1+. For the Scala API, Spark 2.1.0 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
at the Scala 2.12 release news it's also mentioned that:
Although Scala 2.11 and 2.12 are mostly source compatible to facilitate cross-building, they are not binary compatible. This allows us to keep improving the Scala compiler and standard library.
But when I build an uber jar (using Scala 2.12) and run it on Spark 2.1, everything works just fine.
I know it's not an official source, but at the 47 Degrees blog they mention that Spark 2.1 does support Scala 2.12.
How can one explain those (conflicting?) pieces of information?
Spark does not support Scala 2.12. You can follow SPARK-14220 (Build and test Spark against Scala 2.12) to get up to date status.
update:
Spark 2.4 added an experimental Scala 2.12 support.
Scala 2.12 is officially supported (and required) as of Spark 3. Summary:
Spark 2.0 - 2.3: Required Scala 2.11
Spark 2.4: Supported both Scala 2.11 and Scala 2.12, but not really, because almost all runtimes only supported Scala 2.11.
Spark 3: Only Scala 2.12 is supported
Using a Spark runtime that's compiled with one Scala version and a JAR file that's compiled with another Scala version is dangerous and causes strange bugs. For example, as noted here, using a Scala 2.11 compiled JAR on a Spark 3 cluster will cause this error: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps.
Look at all the poor Spark users running into this very error.
Make sure to look into Scala cross compilation and understand the %% operator in SBT to limit your suffering. Maintaining Scala projects is hard and minimizing your dependencies is recommended.
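To illustrate the %% operator mentioned above, a minimal sbt sketch (versions illustrative):
scalaVersion := "2.12.10"
// %% appends the Scala binary suffix derived from scalaVersion,
// so this resolves to the spark-sql_2.12 artifact
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
// With a single %, you would have to hard-code the suffix yourself and keep it
// in sync with scalaVersion by hand:
// libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.0.0" % "provided"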
To add to the answer, I believe it is a typo: https://spark.apache.org/releases/spark-release-2-0-0.html has no mention of Scala 2.12.
Also, if we look at the timing, Scala 2.12 was not released until November 2016, whereas Spark 2.0.0 was released in July 2016.
References:
https://spark.apache.org/news/index.html
www.scala-lang.org/news/2.12.0/