Spark 2.1 on Dataproc Image - google-cloud-dataproc

I am trying to spin up a Dataproc cluster using Spark 2.1. Is there an image version that has Spark 2.1? I see Spark 2.0 and 2.2, but not 2.1.
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions

No, there is no Dataproc image with Spark 2.1. Dataproc image 1.1 has Spark 2.0.2 and image 1.2 has Spark 2.2.3. Note that both are deprecated.

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark Standalone Cluster Setup (1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add Iceberg support to the Spark shell, we add the Iceberg dependency when launching it, as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we cannot use it at all. Every command (even non-Iceberg ones) fails with the same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Even the simple command below throws the same exception:
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency when launching the Spark shell, as below, but we still get the same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointers in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg artifact: we chose the Iceberg runtime jar built for Spark 3.2 but are running Spark 3.1, so the runtime references a class that Spark 3.1 does not provide. After switching to the runtime built for Spark 3.1, we are able to launch the Spark shell with Iceberg. There is also no need to add org.apache.spark jars via --packages, since they are already on the classpath.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
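To avoid picking the wrong runtime again, the coordinate can be derived from the Spark version instead of typed by hand. This is only a minimal sketch: iceberg_coordinate is a hypothetical helper name, and it assumes the artifact naming pattern iceberg-spark-runtime-<Spark major.minor>_<Scala version> seen in the two commands above, which may not hold for every Iceberg release.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark or Iceberg): build the Iceberg
# runtime coordinate that matches a given Spark version.
# Assumes the naming pattern iceberg-spark-runtime-<major.minor>_<scala>.
iceberg_coordinate() {
  spark_version="$1"    # e.g. 3.1.2
  scala_version="$2"    # e.g. 2.12
  iceberg_version="$3"  # e.g. 0.13.1
  # Keep only major.minor of the Spark version: 3.1.2 -> 3.1
  spark_minor=$(printf '%s\n' "$spark_version" | cut -d. -f1,2)
  printf 'org.apache.iceberg:iceberg-spark-runtime-%s_%s:%s\n' \
    "$spark_minor" "$scala_version" "$iceberg_version"
}

# Print the coordinate for the environment described in the question.
iceberg_coordinate 3.1.2 2.12 0.13.1

# To launch the shell with it (not executed here):
# spark-shell --packages "$(iceberg_coordinate 3.1.2 2.12 0.13.1)"
```

The Spark version itself can be read from the banner printed by spark-shell --version.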

Query pushdown is not supported because you are using Spark 3.2.1 with a connector designed to support Spark 3.1

I get this warning when I run my PySpark code, which writes data from S3 to Snowflake.
My Snowflake PySpark packages are:
net.snowflake:snowflake-jdbc:3.13.10,
net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1
My local PySpark version is:
Spark version 3.2.1
Hadoop version 3.3.1
warning:
WARN SnowflakeConnectorUtils$: Query pushdown is not supported because you are using Spark 3.2.1 with a connector designed to support Spark 3.1. Either use the version of Spark supported by the connector or install a version of the connector that supports your version of Spark.
Is this the right package, or is there another one I should use?
My program works as expected, reading from S3 and storing the results in Snowflake. How do I remove this warning?
For Spark 3.2 you need to use Snowflake Spark connector 2.10:
For Scala 2.12:
https://search.maven.org/search?q=a:spark-snowflake_2.12
For Scala 2.13:
https://search.maven.org/search?q=a:spark-snowflake_2.13
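The mismatch behind the warning can be caught before the job is submitted. The sketch below is an illustration only: connector_matches_spark is a hypothetical helper, and it assumes the connector version ends in a "-spark_<major.minor>" suffix, as in the coordinate from the question.

```shell
#!/bin/sh
# Hypothetical check (not part of the Snowflake connector): does the
# "-spark_X.Y" suffix of a connector coordinate match the major.minor
# of the Spark version in use?
connector_matches_spark() {
  coordinate="$1"      # e.g. net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1
  spark_version="$2"   # e.g. 3.2.1
  # Strip everything up to and including the last "-spark_".
  connector_spark="${coordinate##*-spark_}"
  # Keep only major.minor of the Spark version: 3.2.1 -> 3.2
  spark_minor=$(printf '%s\n' "$spark_version" | cut -d. -f1,2)
  [ "$connector_spark" = "$spark_minor" ]
}

# The combination from the question: connector built for 3.1, Spark 3.2.1.
if connector_matches_spark "net.snowflake:spark-snowflake_2.12:2.9.2-spark_3.1" "3.2.1"; then
  echo "match"
else
  echo "mismatch"
fi
```

For the packages in the question this reports a mismatch, which is the same condition the connector warns about.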

Apache Spark 3 and backward compatibility?

We have several Spark applications running in production, developed using Spark 2.4.1 (Scala 2.11.12).
For a couple of our new Spark jobs, we are considering using features of Delta Lake. For this we need to use Spark 2.4.2 (or higher).
My questions are:
If we upgrade our Spark cluster to 3.0.0, can our 2.4.1 applications still run on the new cluster (without recompile)?
If we need to recompile our previous Spark jobs with Spark 3, are they source compatible or do they need any migration?
There are some breaking changes in Spark 3.0.0, including source-incompatible and binary-incompatible changes. See https://spark.apache.org/releases/spark-release-3-0-0.html. There are also some source and binary incompatibilities between Scala 2.11 and 2.12, so you may also need to update your code because of the Scala version change.
However, only Delta Lake 0.7.0 and above require Spark 3.0.0. If upgrading to Spark 3.0.0 requires a lot of work, you can use Delta Lake 0.6.x or below. You then just need to upgrade Spark to 2.4.2 or above within the 2.4.x line. Those releases should be source and binary compatible with 2.4.1.
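The rule in this answer can be written down as a small lookup. This is a sketch of the answer's statement only, not an official compatibility matrix: delta_line_for_spark is a hypothetical helper, and newer Delta Lake releases may have stricter Spark requirements than shown here.

```shell
#!/bin/sh
# Hypothetical helper: map a Spark version to the Delta Lake release line
# named in the answer above. It encodes only that answer's rule.
delta_line_for_spark() {
  major=$(printf '%s\n' "$1" | cut -d. -f1)
  minor=$(printf '%s\n' "$1" | cut -d. -f2)
  patch=$(printf '%s\n' "$1" | cut -d. -f3)
  if [ "$major" -ge 3 ]; then
    echo "0.7.0 or above"          # Spark 3.0.0 and later
  elif [ "$major" -eq 2 ] && [ "$minor" -eq 4 ] && [ "$patch" -ge 2 ]; then
    echo "0.6.x or below"          # Spark 2.4.2 and later in the 2.4.x line
  else
    echo "upgrade Spark first"     # e.g. 2.4.1, the version in the question
  fi
}

delta_line_for_spark 2.4.1
delta_line_for_spark 2.4.5
delta_line_for_spark 3.0.0
```

For the 2.4.1 cluster in the question, the lookup reports that Spark must be upgraded before either Delta Lake line can be used.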
You can cross-compile Spark 2.4 projects with Scala 2.11 and Scala 2.12. The Scala 2.12 JARs should generally work for Spark 3 applications, but there are edge cases in which a Spark 2.4/Scala 2.12 JAR won't work properly on a Spark 3 cluster.
It's best to make a clean migration to Spark 3/Scala 2.12 and cut the cord with Spark 2/Scala 2.11.
Upgrading can be a big pain, especially if your project has a lot of dependencies. For example, suppose your project depends on spark-google-spreadsheets, a project that's not built with Scala 2.12. With this dependency, you won't be able to easily upgrade your project to Scala 2.12. You'll need to either compile spark-google-spreadsheets with Scala 2.12 yourself or drop the dependency. See here for more details on how to migrate to Spark 3.

Spark and Scala upgrade on CDH

I currently have a CDH with Spark 1.6.0 and Scala 2.10.5. I would like to upgrade the Spark version to 2.0.0 and Scala version to 2.11.x and make these as defaults.
I am currently trying this on a CDH Quickstart VM but would like to extend this to a Spark cluster with CDH distribution.
Could someone help on how to go about these two upgrades?
Thank you.

Connector Version for Spark - Cassandra 3.x

Can someone please suggest a version of Cassandra 3.x that actually works with Spark 1.6 and Scala 2.10.5?
I am looking for the versions of the jar files below:
Cassandra Core
Cassandra Spark Connector
Thanks,
Sai
See the link below to check version compatibility.
The correct connector version is 1.6 for Cassandra 3.x, Spark 1.6 and Scala 2.10.5.
Check the versions against the compatibility table in the project README.
https://github.com/datastax/spark-cassandra-connector