Install a .jar in Spark - Scala

I am pretty new to both Spark and Scala, so some things need clarification. I searched the web for a definitive answer to my question, but did not really find one.
At the moment I am running the spark-shell in order to write some basic Scala and work through my tutorials. Now the tutorial wants me to add a library to Spark so that I can import it and use it in the examples. I have downloaded the .jar file of the library. Should I put it in the /spark/jars/ folder? Is that enough to import it, or do I also need to declare it somewhere else? Do I need to add an option when running ./spark-shell?
Also, when I create a standalone program (using sbt and declaring the library in build.sbt), will Spark find the .jar in the /spark/jars/ folder, or do I need to put it elsewhere?

Any jar can be added to spark-shell with the --jars option:
evan@vbox:~> cat MyClass.java
public class MyClass
{
    public static int add(int x, int y)
    {
        return x + y;
    }
}
evan@vbox:~> javac MyClass.java
evan@vbox:~> jar cvf MyJar.jar MyClass.class
added manifest
adding: MyClass.class(in = 244) (out= 192)(deflated 21%)
evan@vbox:~> spark-shell --jars ./MyJar.jar
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala> MyClass.add(2,3)
res0: Int = 5
If you are going to be making a project using sbt which has dependencies, I would recommend making an "uber jar" with sbt assembly. This will create a single JAR file which includes all of your dependencies, allowing you to just add a single jar using the above command.
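For reference, here is a minimal sketch of such an sbt setup; the plugin version, project name, and the extra library are illustrative placeholders, so adjust them to your own project:
// project/plugins.sbt -- adds the sbt-assembly plugin (version is an example)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- Spark itself is marked "provided" so it is not bundled into the uber jar,
// since spark-shell / spark-submit already supply it at runtime
name := "my-spark-app"
scalaVersion := "2.11.8"  // should match the Scala version of your Spark build
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.1" % "provided",
  "org.scalaj"       %% "scalaj-http" % "2.3.0"  // example third-party dependency
)
Running sbt assembly then produces a single jar under target/scala-2.11/ that you can pass to spark-shell --jars (or to spark-submit) exactly as shown above.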

Related

Windows Spark Error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils

I downloaded Apache Spark 3.2.0, the latest one, as well as the Hadoop file.
Java SE Development Kit 17.0.1 is installed too.
I am not even able to initialize a Spark session.
Input:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.sql('''select 'spark' as hello ''')
df.show()
Output:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
As you can read at https://spark.apache.org/docs/3.2.0/:
Spark 3.2.0 only supports Java versions 8-11. I had the same issue on Linux, and switching to Java 11 instead of 17 helped in my case.
BTW Spark 3.3.0 supports Java 17.
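If you are unsure which JVM the Spark driver actually picked up, a quick check from spark-shell (Scala) is shown below; the printed values are only placeholders for whatever your machine reports:
scala> System.getProperty("java.version")   // must be in the 8-11 range for Spark 3.2.x
res0: String = 11.0.15

scala> spark.version
res1: String = 3.2.0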
I faced the same issue today, but fixed it by changing the JDK from 17 to 8 (only for starting Spark), as below.
spark-3.2.1
hadoop3.2
python 3.10
File "D:\sw.1\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.: > java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
Env variable was having %JAVA_HOME% to jdk17
Quick fix (incase you want to keep env. variable same but use jdk8 for spark only):
(1) Create a batch file (start-pyspark.bat) in D:\
(2) Add the lines below:
set JAVA_HOME=D:\sw.1\jdk1.8.0_332
set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin;
pyspark
(3) In cmd, type start-pyspark.bat and press Enter.
d:\>start-pyspark.bat
d:\>set JAVA_HOME=D:\sw.1\jdk1.8.0_332
d:\>set PATH=D:\sw.1\py.3.10\Scripts\;D:\sw.1\py.3.10\;C:\Program Files\Zulu\zulu-17\bin;C:\Program Files\Zulu\zulu-17-jre\bin;C:\windows\system32;....;D:\sw.1\jdk1.8.0_332\bin;D:\sw.1\spark-3.2.1-bin-hadoop3.2\bin;D:\sw.1\hadoop\bin;
d:\>pyspark
Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/27 18:29:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/
(4) If you close this Spark prompt and the cmd window and restart, everything is back in its original state, with JDK 17 set as JAVA_HOME from the environment.

spark submit giving "main" java.lang.NoSuchMethodError: scala.Some.value()Ljava/lang/Object

I am trying to do a spark-submit to check compatibility with some simple Scala code:
println("Hi there")
val p = Some("pop")
p match {
  case Some(a) => println("Matched " + a)
  case _       => println("00000009")
}
Scala version: 2.12.5
Spark version: 2.4.6
Currently, after building the jar and running it through spark-submit 2.4.7,
it gives:
Hi there
Exception in thread "main" java.lang.NoSuchMethodError: scala.Some.value()Ljava/lang/Object;
at MangoPop$.main(MangoPop.scala:9)
at MangoPop.main(MangoPop.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
From Maven, it seems Spark 2.4.6 supports Scala 2.12:
https://mvnrepository.com/artifact/org.apache.spark/spark-core
But when running with spark-submit 3.0.2, it runs fine.
What am I missing with Spark 2.4.6?
(I also tried Spark 2.4.7; there are no actual Spark dependencies in the code, only Scala.)
I am running spark-submit as:
~/Downloads/spark-2.4.7-bin-hadoop2.7/bin$ ./spark-submit --class=Test myprojectLocation..../target/scala-2.12/compatibility-check_2.12-0.1.jar
/spark-2.4.7-bin-hadoop2.7/bin$ ./spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_282
Branch HEAD
Compiled by user prashant on 2020-09-08T05:22:44Z
Revision 14211a19f53bd0f413396582c8970e3e0a74281d
Url https://prashant:Sharma1988%235031#gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
I also tried downloading 2.4.6 from
https://archive.apache.org/dist/spark/spark-2.4.6/
but could not find a build for Scala 2.12.
Can we also explicitly specify which Scala version to use when doing spark-submit or spark-shell? The configuration seems to support both, but it picked the lower one, i.e. 2.11.
This is the load-spark-env.cmd file:
rem Setting SPARK_SCALA_VERSION if not already set.
set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
The issue is that the Spark runtime reports "Using Scala version 2.11.12", while your code (MangoPop$.main(MangoPop.scala:9)) was built with Scala 2.12.5.
Make sure the Spark used at build time and the Spark used at runtime are on the same Scala version.
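In practice that means pinning scalaVersion in build.sbt to the Scala line your Spark distribution ships with. A minimal sketch for a 2.4.x download built with Scala 2.11 (project name and versions are illustrative):
// build.sbt -- keep scalaVersion aligned with what spark-submit --version reports
name := "compatibility-check"
scalaVersion := "2.11.12"  // matches "Using Scala version 2.11.12" above
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.7" % "provided"
The %% operator appends the Scala binary suffix (_2.11 here), so the artifact you compile against matches what spark-submit loads at runtime. Alternatively, keep Scala 2.12 in the project and run against a Spark distribution built for 2.12 (such as Spark 3.x, which is why spark-submit 3.0.2 works).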
I had the same issue and was able to fix it by creating a new conda environment from scratch.
In my Spark project I had had issues with Scala 2.11 before and switched to Scala 2.12. Presumably, the dependencies and libraries got messed up somewhere in that process.

How to install a specific version of Spark using a specific version of Scala

I'm running Spark 2.4.5 on my Mac. When I execute spark-submit --version I get:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_242
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
Note it's using Scala version 2.11.12. However, my app is using 2.12.8, and this is throwing the well-known java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V error.
My question is how to make my Spark 2.4.5 use Scala 2.12, as indicated on their official website under the Download section: Spark 2.4.5 uses Scala 2.12.
I tried brew search apache-spark and got
==> Formulae
apache-spark ✔
and brew info apache-spark returned me
apache-spark: stable 2.4.5, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/2.4.4 (1,188 files, 250.7MB) *
Built from source on 2020-02-03 at 14:57:17
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb
==> Dependencies
Required: openjdk ✔
==> Options
--HEAD
Install HEAD version
==> Analytics
install: 7,150 (30 days), 15,180 (90 days), 64,459 (365 days)
install-on-request: 6,900 (30 days), 14,807 (90 days), 62,407 (365 days)
build-error: 0 (30 days)
I'd appreciate any advice!
The Spark community provides older versions of Spark on this website; you can choose any version according to your OS. For Windows you can use the .tgz file:
https://archive.apache.org/dist/spark/
You can build any custom version of Spark locally.
Clone https://github.com/apache/spark locally
Update pom file, focusing on scala.version, hadoop.version, scala.binary.version, and artifactId in https://github.com/apache/spark/blob/master/pom.xml
mvn -DskipTests clean package (from their README)
After a successful build, collect all jars in assembly/target/scala-2.11/jars, external/../target, and any other external jars you need, which may be in the provided scope of the jars you submit.
Create a new directory and export SPARK_HOME="/path/to/directory_name" so that https://github.com/apache/spark/blob/master/bin/spark-submit will detect it (see the source to see why)
Copy the jars into $SPARK_HOME/jars and make sure there are no conflicting jars
The bin/ scripts should be the same, but if needed, specifically reference those and possibly even unlink the brew ones if you no longer need them
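Once your custom build is on the path, you can confirm from spark-shell which Scala line it actually ships with; a quick check (the printed values are placeholders):
scala> scala.util.Properties.versionString   // Scala library bundled with this Spark build
res0: String = version 2.12.10

scala> sc.version
res1: String = 2.4.5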

Multiple versions of Spark but can't set to Spark 2

I am working with our company's computing cluster. I know it has Spark 2 since I am able to call it from a Jupyter notebook using PySpark. However, I would like to begin exploring the use of Spark with Scala through a command line interface (CLI). My question is, how do I change to Spark 2?
When running:
spark-submit --version
I got a message saying
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark1 will be picked by default
So I ran:
export SPARK_MAJOR_VERSION=2
Then ran:
spark-submit --version
SPARK_MAJOR_VERSION is set to 2, using Spark2
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Type --help for more information.
1) The first message implies there are multiple Spark versions, so I tried to switch from 1 to 2. It seems that even after setting the variable to version 2, I am still using 1, which suggests that I either have multiple versions of Spark 1 or need additional configuration to select Spark 2. Is this the correct interpretation, or is there something else I can do or try?
It looks like it depends on your environment.
Cloudera says there are two different sets of scripts:
spark-shell and spark2-shell
spark-submit and spark2-submit
while Hortonworks documents the environment variable (as you tried to set).
So you might want to check the documentation for your environment if you are not using either of these distributions.

Issue after Spark Installation on Windows 10

This is the cmd log that I see after running the spark-shell command (C:\Spark>spark-shell). As I understand it, it's mainly an issue with Hadoop. I use Windows 10. Can someone please help with the issue below?
C:\Users\mac>cd c:\
c:\>winutils\bin\winutils.exe chmod 777 \tmp\hive
c:\>cd c:\spark
c:\Spark>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/14 13:21:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/14 13:21:34 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/c:/Spark/bin/../jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/jars/datanucleus-rdbms-3.2.9.jar."
17/05/14 13:21:34 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/c:/Spark/bin/../jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/jars/datanucleus-core-3.2.10.jar."
17/05/14 13:21:34 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/c:/Spark/bin/../jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/jars/datanucleus-api-jdo-3.2.6.jar."
17/05/14 13:21:48 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.9:4040
Spark context available as 'sc' (master = local[*], app id = local-1494764489031).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
There's no issue in your output. These WARN messages can simply be ignored.
In other words, it looks like you've installed Spark 2.1.1 on Windows 10 properly.
To make sure you installed it properly (so I can remove "looks" from the sentence above), do the following:
spark.range(1).show
By default that triggers loading Hive classes, which may or may not end up with exceptions on Windows due to Hadoop's requirements (hence the need for winutils.exe to work around them).
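If everything is wired up correctly, that one-liner should come back with a tiny single-column DataFrame rather than a Hive/Hadoop exception, along the lines of:
scala> spark.range(1).show
+---+
| id|
+---+
|  0|
+---+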