Version Compatibility issues with Scala, Spark, Spark NLP - scala

I am new to 'Spark NLP' and I got stuck in version compatibility issues only. That may seems to be silly but still I request you to help me in this:
‘Spark NLP’ is built on top of Apache Spark 2.4.0 and such is the only supported release (mentioned on https://nlp.johnsnowlabs.com/docs/en/quickstart)
Spark 2.4.3 uses Scala 2.12 (mentioned on https://spark.apache.org/docs/latest/)
But maven repo repository just have spark_nlp libararies for scala 2.11 only not after that (see https://repo1.maven.org/maven2/com/johnsnowlabs/nlp/, only spark-nlp_2.11 is present and not spark-nlp_2.12)
Please help

Related

What version of Scala does Apache Flink use?

I'm getting started with Apache Flink. I noticed on the Apache Flink download page, it says "Apache Flink 1.14.3 for Scala 2.11 (asc, sha512)" as the name of the installation file.
Can you confirm there is no Apache Flink for Scala 3.x? I want to make sure I download the right version of Scala
Flink up to 1.14 is available in both a Scala 2.11 and Scala 2.12 binary, which can be downloaded from https://flink.apache.org/downloads.html. This means that you can use Flink's Scala API with either Scala 2.11 or Scala 2.12.
In Flink 1.15 (the next version), support for Scala 2.11 is dropped. Next to that, all Java APIs are independent from Scala. You can find more information at https://issues.apache.org/jira/browse/FLINK-23986
This means that if you would like to use Scala 3.0 in combination with Flink 1.15, that's possible. You can find some examples at https://github.com/sjwiesman/flink-scala-3

Apache Spark 3 and backward compatibility?

We have several Spark applications running on production developed using Spark 2.4.1 (Scala 2.11.12).
For couple of our new Spark jobs,we are considering utilizing features of DeltaLake.For this we need to use Spark 2.4.2 (or higher).
My questions are:
If we upgrade our Spark cluster to 3.0.0, can our 2.4.1 applications still run on the new cluster (without recompile)?
If we need to recompile our previous Spark jobs with Spark 3, are they source compatible or do they need any migration?
There are some breaking changes in Spark 3.0.0, including source incompatible change and binary incompatible changes. See https://spark.apache.org/releases/spark-release-3-0-0.html. And there are also some source and binary incompatible changes between Scala 2.11 and 2.12, so you may also need to update codes because of Scala version change.
However, only do Delta Lake 0.7.0 and above require Spark 3.0.0. If upgrading to Spark 3.0.0 requires a lot of work, you can use Delta Lake 0.6.x or below. You just need to upgrade Spark to 2.4.2 or above in 2.4.x line. They should be source and binary compatible.
You can cross compile projects Spark 2.4 projects with Scala 2.11 and Scala 2.12. The Scala 2.12 JARs should generally work for Spark 3 applications. There are edge cases when using a Spark 2.4/Scala 2.12 JAR won't work properly on a Spark 3 cluster.
It's best to make a clean migration to Spark 3/Scala 2.12 and cut the cord with Spark 2/Scala 2.11.
Upgrading can be a big pain, especially if your project has a lot of dependencies. For example, suppose your project depends on spark-google-spreadsheets, a project that's not built with Scala 2.12. With this dependency, you won't be able to easily upgrade your project to Scala 2.12. You'll need to either compile spark-google-spreadsheets with Scala 2.12 yourself or drop the dependency. See here for more details on how to migrate to Spark 3.

Should I use Scala 2.11.0 or 2.11.8 with Spark 2.3.0?

I'm trying to upgrade the Spark version from 1.6.2 to 2.3.0. I'm currently using Scala version 2.10, but I need to upgrade to Scala version 2.11.x since Scala 2.10 is no longer supported. My question is which sub-version should I upgrade to? I'm struggling to determine which sub-version to use.
I can't find anything that compares the different sub versions of Scala, but I have encountered some entries that recommend using Scala version 2.11.8 and that there was a bug with Scala 2.11.0 with using Spark (not sure how true that is).
What are your experiences and which sub versions do you recommend? Thanks!
As the documentation says, you can use a compatible version (2.11.x).
So just use the latest version (2.11.12 as of today).

Using Scala 2.12 with Spark 2.x

At the Spark 2.1 docs it's mentioned that
Spark runs on Java 7+, Python 2.6+/3.4+ and R 3.1+. For the Scala API, Spark 2.1.0 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
at the Scala 2.12 release news it's also mentioned that:
Although Scala 2.11 and 2.12 are mostly source compatible to facilitate cross-building, they are not binary compatible. This allows us to keep improving the Scala compiler and standard library.
But when I build an uber jar (using Scala 2.12) and run it on Spark 2.1. every thing work just fine.
and I know its not any official source but at the 47 degree blog they mentioned that Spark 2.1 does support Scala 2.12.
How can one explain those (conflicts?) pieces of information ?
Spark does not support Scala 2.12. You can follow SPARK-14220 (Build and test Spark against Scala 2.12) to get up to date status.
update:
Spark 2.4 added an experimental Scala 2.12 support.
Scala 2.12 is officially supported (and required) as of Spark 3. Summary:
Spark 2.0 - 2.3: Required Scala 2.11
Spark 2.4: Supported Scala 2.11 and Scala 2.12, but not really cause almost all runtimes only supported Scala 2.11.
Spark 3: Only Scala 2.12 is supported
Using a Spark runtime that's compiled with one Scala version and a JAR file that's compiled with another Scala version is dangerous and causes strange bugs. For example, as noted here, using a Scala 2.11 compiled JAR on a Spark 3 cluster will cause this error: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps.
Look at all the poor Spark users running into this very error.
Make sure to look into Scala cross compilation and understand the %% operator in SBT to limit your suffering. Maintaining Scala projects is hard and minimizing your dependencies is recommended.
To add to the answer, I believe it is a typo https://spark.apache.org/releases/spark-release-2-0-0.html has no mention of scala 2.12.
Also, if we look at timings Scala 2.12 was not released untill November 2016 and Spark 2.0.0 was released on July 2016.
References:
https://spark.apache.org/news/index.html
www.scala-lang.org/news/2.12.0/

Spark Kafka - Issue while running from Eclipse IDE

I am experimenting with Spark Kafka integration. And I want to test the code from my eclipse IDE. However, I got below error:
java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at kafka.utils.Pool.<init>(Pool.scala:28)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<init>(FetchRequestAndResponseStats.scala:60)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<clinit>(FetchRequestAndResponseStats.scala)
at kafka.consumer.SimpleConsumer.<init>(SimpleConsumer.scala:39)
at org.apache.spark.streaming.kafka.KafkaCluster.connect(KafkaCluster.scala:52)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:345)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:403)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:532)
at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
at com.capiot.platform.spark.SparkTelemetryReceiverFromKafkaStream.executeStreamingCalculations(SparkTelemetryReceiverFromKafkaStream.java:248)
at com.capiot.platform.spark.SparkTelemetryReceiverFromKafkaStream.main(SparkTelemetryReceiverFromKafkaStream.java:84)
UPDATE:
The versions that I am using are:
scala - 2.11
spark-streaming-kafka- 1.4.1
spark - 1.4.1
Can any one resolve the issue? Thanks in advance.
You have the wrong version of Scala. You need 2.10.x per
https://spark.apache.org/docs/1.4.1/
"For the Scala API, Spark 1.4.1 uses Scala 2.10."
Might be late to help OP, but when using kafka streaming with spark, you need to make sure that you use the right jar file.
For example, in my case, I have scala 2.11 (the minimum required for spark 2.0 which im using), and given that kafka spark requires the version 2.0.0 I have to use the artifact spark-streaming-kafka-0-8-assembly_2.11-2.0.0-preview.jar
Notice my scala version and the artifact version can be seen at 2.11-2.0.0
Hope this helps (someone)
Hope that helps.