Scala dependency in Spark/PySpark

I want to install Spark on a machine and I don't know whether I need Scala as a dependency.
I have installed Java (1.8) as the documentation says, but I don't know if I need any other dependencies.

Scala is included in Spark's distribution, so you don't need to install Scala to use the Spark REPL; the Scala jars ship inside the distribution (under its jars/ directory).
If you are setting up an IDE for a Spark project, then you will need Scala as a project dependency.
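For example, a minimal build.sbt for such a project could look like the sketch below (the project name and version numbers are illustrative assumptions; match them to the Spark distribution you actually installed):

// build.sbt -- minimal sketch, not a definitive setup; adjust versions to your Spark distribution
name := "my-spark-app"        // hypothetical project name
version := "0.1.0"
// Spark 2.4.3 is built against Scala 2.11 by default, so pin a matching Scala version
scalaVersion := "2.11.12"
// "provided" keeps Spark (and its bundled Scala) out of your packaged jar,
// since the cluster installation already ships them
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided"
)

With this in place, the IDE can import the sbt project and pick up the Scala compiler version from scalaVersion instead of whatever Scala it bundles.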

Related

Why does a Maven build work fine, but adding Spark jars as external jars gives the compile error "object apache is not a member of package org"?

In Eclipse, while setting up Spark, even after adding the external jars from spark-2.4.3-bin-hadoop2.7/jars/<_all.jar> to the build path,
the compiler complains: "object apache is not a member of package org".
Yes, building the dependencies via Maven or SBT would fix it. A similar question has been asked:
scalac compile yields "object apache is not a member of package org"
But the question here is: WHY does the traditional way fail like this?
If we refer to Scala/Spark version compatibility, we can see a similar issue. The problem is that Scala is NOT backward compatible across its minor versions, so each Spark module is compiled against a specific Scala library. When we run from Eclipse, the Eclipse Scala environment may not be compatible with the particular Scala version against which the Spark libraries were built.
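As a quick way to spot such a mismatch (a small sketch, not part of the original answer), you can print the Scala version of whatever environment is actually compiling and running your code, both from the Spark REPL and from a scratch file in Eclipse, and compare the 2.x lines:

// VersionCheck.scala -- sketch: report the Scala version of the current environment.
// If Eclipse reports, say, 2.12.x while the Spark 2.4.3 jars are _2.11 artifacts,
// that mismatch is the likely cause of "object apache is not a member of package org".
object VersionCheck {
  def main(args: Array[String]): Unit = {
    println(scala.util.Properties.versionString) // e.g. "version 2.11.12"
  }
}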

How to choose the scala version for my spark program?

I am building my first Spark application, developing with IDEA.
In my cluster, the version of Spark is 2.1.0, and the version of Scala is 2.11.8.
http://spark.apache.org/downloads.html tells me: "Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support".
So here is my question: what is the meaning of "Scala 2.10 users should download the Spark source package and build with Scala 2.10 support"? Why not just use Scala 2.11?
Another question: which version of Scala should I choose?
First a word about the "why".
The reason this subject even exists is that Scala versions are not (generally speaking) binary compatible, although most of the time the source code is compatible.
So you can take Scala 2.10 source and compile it into 2.11.x or 2.10.x binaries, but 2.10.x compiled binaries (JARs) cannot be run in a 2.11.x environment.
You can read more on the subject.
Spark Distributions
So, the Spark package, as you mention, is built for Scala 2.11.x runtimes.
That means you cannot run a Scala 2.10.x JAR of yours on a cluster / Spark instance that runs the spark.apache.org-built distribution of Spark.
What would work is:
You compile your JAR for Scala 2.11.x and keep the same Spark
You recompile Spark for Scala 2.10 and keep your JAR as is
What are your options
Compiling your own JAR for Scala 2.11 instead of 2.10 is usually far easier than compiling Spark in and of itself (lots of dependencies to get right).
Usually, your Scala code is built with sbt, and sbt can target a specific Scala version; see, for example, this thread on SO. It is a matter of specifying:
scalaVersion in ThisBuild := "2.10.0"
You can also use sbt to "cross build", that is, build different JARs for different Scala versions.
crossScalaVersions := Seq("2.11.11", "2.12.2")
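Putting those two settings together, a minimal cross-building build.sbt could look like the following sketch (the project name and the choice of Spark version are assumptions for illustration; this only resolves if the Spark version you depend on publishes artifacts for every Scala line you list):

// build.sbt -- cross-building sketch
name := "my-spark-job"                      // hypothetical project name
version := "0.1.0"
scalaVersion := "2.11.11"                   // version used by a plain `sbt package`
crossScalaVersions := Seq("2.11.11", "2.12.2")
// %% appends the Scala binary suffix, so sbt resolves spark-core_2.11 or spark-core_2.12;
// Spark 2.4.x publishes both, while earlier 2.x releases only publish _2.11 (or _2.10)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" % "provided"

Running sbt +package then builds one jar per Scala version listed in crossScalaVersions, each carrying the matching _2.11 or _2.12 suffix in its file name.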
How to choose a Scala version
Well, this is "sort of" opinion based. My recommendation would be: choose the Scala version that matches your production Spark cluster.
If your production Spark is 2.3, downloaded from https://spark.apache.org/downloads.html, then, as they say, it uses Scala 2.11, and that is what you should use too. Using anything else, in my view, just leaves the door open for various incompatibilities down the road.
Stick with what your production needs.

spark-submit on a standalone cluster complains that scala-2.10 jars do not exist

I'm new to Spark and downloaded the pre-compiled Spark binaries from Apache (spark-2.1.0-bin-hadoop2.7).
When submitting my Scala (2.11.8) uber jar, the cluster throws an error:
java.lang.IllegalStateException: Library directory '/root/spark/assembly/target/scala-2.10/jars' does not exist; make sure Spark is built
I'm not running Scala 2.10, and Spark isn't compiled (as far as I know) with Scala 2.10.
Could it be that one of my dependencies is based on Scala 2.10?
Any suggestions as to what could be wrong?
Not sure what is wrong with the pre-built spark-2.1.0, but I've just downloaded Spark 2.2.0 and it is working great.
Try setting SPARK_HOME="location of your Spark installation" on your system or in your IDE.
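If you want to rule out the uber jar itself pulling in a mismatched Scala or Spark build (a hedged sketch, assuming an sbt build similar to the question's setup), mark the Spark artifacts as provided so they are compiled against but never bundled; the only Spark/Scala runtime in play is then the installation that SPARK_HOME points at:

// build.sbt fragment -- sketch only; versions mirror the question's setup
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  // "provided": available at compile time, excluded from the assembled uber jar
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)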

Changing the Scala version in Zeppelin

I'm using an external Jar that doesn't work properly with any version of Scala above 2.10.4. However, by looking at the Scala Jar files, I have seen that the version of Scala I'm using is 2.11. Is it possible to downgrade the Scala version to 2.10.4 without replacing the Scala Jar files?
Yes it is: from the Zeppelin directory, run ./dev/change_scala_version.sh 2.10. You might need to rebuild, though, using mvn clean package -DskipTests -Pspark-2.1 -Phadoop-2.7 -Pyarn -Ppyspark -Psparkr -Pr -Pscala-2.10

Cross build scala using gradle

I've got a Scala project that is built with Gradle. The Scala code is source compatible with Scala 2.9 and 2.10, and I'd like to cross-build it for both major Scala versions. Does Gradle support this?
For example, my gradle project will have a single module:
build.gradle
src/main/scala/foo.scala
and I'd like the resulting published jars to be:
org-foo_2.9-0.1.jar (with dependency on scala-library 2.9)
org-foo_2.10-0.1.jar (with dependency on scala-library 2.10)
Gradle's Scala plugin doesn't currently support cross-building. It's possible to implement it yourself, though. In my Polyglot Gradle talk, I presented a proof-of-concept.
I am searching for a good example of this. The Gradle manual doesn't mention how to specify the Scala version, but looking at the source code for the Scala plugin, it seems to infer it from the Scala library jar that you specify.
The best example I could find is the Apache Kafka build system. It specifies the Scala version and then uses some additional logic to resolve the correct version of the Scala libraries. It also uses some logic to attach the correct label to the jars it builds, corresponding to the correct Scala version.
This feels like a lot of work, and something that the build system should do for you, as SBT does (see the sketch below).
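For comparison, this is roughly what the SBT side looks like (a sketch using the coordinates from the question; the exact jar names may differ slightly, since sbt uses the full Scala version as the suffix for pre-2.10 releases):

// build.sbt -- sbt cross-build sketch for the layout described above
organization := "org"
name := "foo"
version := "0.1"
scalaVersion := "2.10.4"
crossScalaVersions := Seq("2.9.3", "2.10.4")
// `sbt +package` builds one jar per listed Scala version, e.g. foo_2.9.3-0.1.jar
// and foo_2.10-0.1.jar, each depending on the matching scala-library.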