Spark Application Using Scala IDE - scala

I'm trying to run a Spark classifier using Scala on my machine, but I am getting the following error:
Only one SparkContext may be running in this JVM (see SPARK-2243). To
ignore this error, set spark.driver.allowMultipleContexts = true.
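
A common trigger for this error is constructing a second SparkContext (for example with new SparkContext(...)) while one is already running in the same JVM. Assuming Spark 2.x or later, a minimal sketch of the usual fix, with a hypothetical application name, is to go through SparkSession.builder.getOrCreate(), which reuses the existing context instead of creating another:

import org.apache.spark.sql.SparkSession

// getOrCreate() returns the SparkSession (and underlying SparkContext) already
// running in this JVM if there is one, instead of constructing a second context.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkClassifier") // hypothetical name, for illustration only
  .getOrCreate()

// Reuse this everywhere instead of calling new SparkContext(...) again.
val sc = spark.sparkContext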

Related

Spark Shell not working after adding support for Iceberg

We are doing a POC on Iceberg and evaluating it for the first time.
Spark Environment:
Spark standalone cluster setup (1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg: 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in the Spark shell we add the Iceberg dependency while launching the shell as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use it at all. For all commands (even non-Iceberg ones) we get the same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Even the simple commands below throw the same exception:
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency while launching the Spark shell as below, but we still get the same exception:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointers in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg version: we had chosen the Spark 3.2 Iceberg jar while running Spark 3.1. After switching to the dependency built for 3.1, we were able to launch the Spark shell with Iceberg. There is also no need to specify the org.apache.spark jars via --packages, since all of those are on the classpath anyway.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
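For completeness, Iceberg's documentation also shows configuring a catalog when launching the shell; a sketch along those lines (the catalog name local and the warehouse path are placeholders, not required values) looks like:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=/tmp/iceberg-warehouse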

ClassNotFoundException while creating Spark Session

I am trying to create a SparkSession in a unit test case using the code below:
val spark = SparkSession.builder.appName("local").master("local").getOrCreate()
but while running the tests, I am getting the error below:
java.lang.ClassNotFoundException: org.apache.hadoop.fs.GlobalStorageStatistics$StorageStatisticsProvider
I have tried to add the dependency but to no avail. Can someone point out the cause and the solution to this issue?
It can happen for two reasons.
1. You may have incompatible versions of the Spark and Hadoop stacks. For example, HBase 0.9 is incompatible with Spark 2.0; such mismatches result in class/method-not-found exceptions.
2. You may have multiple versions of the same library because of dependency hell. You may need to inspect the dependency tree to make sure this is not the case (see the build sketch below).
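As an illustrative sketch only (the artifact versions below are placeholders, not recommendations), the idea is to keep the Spark and Hadoop artifacts in the build aligned to one stack and then check what actually gets resolved:

// build.sbt (hypothetical): pin Spark and Hadoop to versions from the same stack,
// so a stray older hadoop-* jar does not shadow the classes Spark expects at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "2.4.8",
  "org.apache.spark" %% "spark-sql"     % "2.4.8",
  "org.apache.hadoop" % "hadoop-client" % "2.7.7"
)
// With sbt 1.4+, the resolved tree can be printed with: sbt dependencyTree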

Neo4j Spark connector error: import org.neo4j.spark._ object neo4j is not found in package org

I have my Scala code running in Spark, connecting to Neo4j, on my Mac. I wanted to test it on my Windows machine but cannot seem to get it to run; I keep getting the error:
Spark context Web UI available at http://192.168.43.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1508360735468).
Spark session available as 'spark'.
Loading neo4jspark.scala...
<console>:23: error: object neo4j is not a member of package org
import org.neo4j.spark._
^
This gives subsequent errors of:
changeScoreList: java.util.List[Double] = []
<console>:87: error: not found: value neo
val initialDf2 = neo.cypher(noBbox).partitions(5).batch(10000).loadDataFrame
^
<console>:120: error: not found: value neo
I'm not sure what I am doing wrong; I am executing it like this:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jspark.scala
It says it finds all the dependencies, yet the code throws the error when using neo. I'm not sure what else to try, or why this doesn't work on my Windows box but does on my Mac. Spark 2.2 is the same on both, Neo4j is up and running with the same version, Scala too, even Java (save for a few minor revision differences).
This is a known issue (with a related one here), the fix for which is part of the Spark 2.2.1 release.
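For reference, once the import resolves, the neo value used above is typically constructed from the SparkContext. This is a sketch based on the connector's 2.x API and the calls in the question (the Cypher query is a placeholder):

import org.neo4j.spark._

// Build the connector's entry point from the SparkContext provided by spark-shell.
val neo = Neo4j(sc)

// Same call pattern as in the question; the query string is a placeholder.
val df = neo.cypher("MATCH (n) RETURN n LIMIT 10").partitions(5).batch(10000).loadDataFrame
df.show()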

Breeze pinv (Moore-Penrose) pseudo-inverse function gives OutOfMemory error using Spark

I am developing an extreme learning machine, a type of neural network, in Spark, which requires the Moore-Penrose pseudo-inverse function. This is not available in any of the Spark libraries, so I'm using Breeze, which has required converting the Spark data structures to Breeze matrices. When I get as far as beta = pinv(H) * T, everything fails with an OOM exception (which I didn't think was possible in Spark). Any ideas why?
pinv is the Moore-Penrose pseudo-inverse in Breeze. H is a matrix of 35,000 rows and 10 columns. The SVD should be able to cope with this. It's not a particularly large dataset, taking up only about 30 MB. I'm running everything locally on my laptop, nothing in the cloud. I have 8 GB of memory on my laptop (MacBook Air).
I read that you can increase the driver memory using a spark-shell command, but I don't know how to do this or how it would tie in with the code in my IDE, which sets up the SparkSession:
val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("ELMPipelineTest")
  .getOrCreate()
I fixed this as follows. Firstly, the application must be run from the command line, not through the IDE, because the Spark driver memory has to be set before the JVM is started. I'm using sbt as my build tool, so from the top-level directory of my project, via the Linux shell, I ran:
sbt compile
sbt package // this creates a jar file
$SPARK_HOME/bin/spark-submit --class dev.elm.ELMPipeline --master local[4] --driver-memory 8G path/to/jar.jar
I set the $SPARK_HOME environment variable to the Spark home directory first.
That avoids the Java OOM error. Thanks to #RafalKwasny for pointing this out.
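For context, the Breeze step that was failing looks roughly like the sketch below, with random data standing in for the real H and T (sizes taken from the question):

import breeze.linalg.{DenseMatrix, pinv}

// H is 35,000 x 10 and T is 35,000 x 1 in the question; random data is a stand-in here.
// pinv computes the Moore-Penrose pseudo-inverse via an SVD.
val H = DenseMatrix.rand(35000, 10)
val T = DenseMatrix.rand(35000, 1)

// This is the multiplication that ran out of memory until the driver heap was
// raised with --driver-memory on spark-submit.
val beta: DenseMatrix[Double] = pinv(H) * T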

JSON Mapping with custom case class doesn't work in spark shell

I want to do JSON mapping with a Scala case class, as it is done here: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/BasicParseJsonWithJackson.scala
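For reference, the pattern from that example boils down to registering Jackson's Scala module and deserializing straight into the case class; a minimal sketch (MyCaseClassName and its fields are placeholders):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Placeholder case class standing in for the real one.
case class MyCaseClassName(name: String, value: Int)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

// Deserialize a JSON string directly into the case class.
val parsed = mapper.readValue("""{"name": "a", "value": 1}""", classOf[MyCaseClassName])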
That works perfectly in "normal" Spark jobs that run on my cluster, but if I try the same thing in a Zeppelin notebook or in the Spark shell I get the following error:
com.fasterxml.jackson.databind.JsonMappingException: No suitable constructor found for type [simple type, class MyCaseClassName]: can not instantiate from JSON object (missing default constructor or creator, or perhaps need to add/enable type information?)
Do you have any idea what the problem is and how I can fix it?
EDIT: I use the following versions:
Spark 2.0.2
Zeppelin 0.6.2
Scala 2.11
The Spark cluster and Zeppelin run on Google Container Engine (Kubernetes), but as I mentioned before, this problem also appears in a local Spark shell, so I think it is independent of the Zeppelin version and the Spark runtime environment.