How to add Delta Lake support to Zeppelin's spark interpreter? - scala

I'm trying to add the Delta Lake support to Zeppelin.
So far I've tried adding the io.delta:delta-core_2.12:0.7.0 dependency to the spark interpreter, as well as a couple other related actions within the interpreters view... but nothing has worked.
When I add the io.delta:delta-core_2.12:0.7.0 dependency, I get errors within my notebooks such as:
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:76)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:668)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:577)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:130)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:39)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at org.apache.spark.util.Utils$.stringToSeq(Utils.scala:2664)
at org.apache.spark.internal.config.ConfigHelpers$.stringToSeq(ConfigBuilder.scala:49)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:125)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:125)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:143)
at org.apache.spark.internal.config.package$.<init>(package.scala:172)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:716)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:95)
at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:77)
at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:76)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:76)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:71)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:58)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:80)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
... 8 more
My goal is to read/write from/to Delta Lake tables using Scala + Spark.
Thanks!

The most probable reason is that you're running Delta Lake on Spark 2.x: the package you're using is built for Spark 3.0+ (compiled with Scala 2.12). A NoSuchMethodError on scala.Predef$.refArrayOps is the typical symptom of code compiled for one Scala binary version running against another version's scala-library. The latest Delta release that supports Spark 2.4 (minimum 2.4.2) is 0.6.1 (see this answer).
So either upgrade Spark if you want to use this specific package, or use an older version of Delta if you want to keep your Spark installation.
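To make the version pairing concrete, here is a minimal sketch that maps a Spark version to the Delta artifact mentioned above. DeltaCoordinates and forVersions are hypothetical names made up for this example; the table covers only the two cases named in the answer (Spark 3.0.x with 0.7.0, Spark 2.4.2+ with 0.6.1), so check the Delta release notes for anything else.

```scala
// Sketch only: maps a Spark version to the Delta Lake artifact named in the answer.
// DeltaCoordinates and forVersions are hypothetical names, not part of Delta or Zeppelin.
object DeltaCoordinates {

  // Parse "2.4.5" into (2, 4, 5); missing or non-numeric parts become 0.
  private def parse(version: String): (Int, Int, Int) = {
    val nums = version.split('.').map { part =>
      scala.util.Try(part.takeWhile(_.isDigit).toInt).getOrElse(0)
    }
    val padded = nums.padTo(3, 0)
    (padded(0), padded(1), padded(2))
  }

  // Returns Maven coordinates, or None for combinations this sketch does not cover.
  def forVersions(sparkVersion: String, scalaBinary: String): Option[String] =
    parse(sparkVersion) match {
      case (3, 0, _) if scalaBinary == "2.12" =>
        Some("io.delta:delta-core_2.12:0.7.0")
      case (2, 4, patch) if patch >= 2 && Set("2.11", "2.12").contains(scalaBinary) =>
        Some(s"io.delta:delta-core_${scalaBinary}:0.6.1")
      case _ => None
    }
}
```

In a notebook paragraph that still opens (that is, before adding the dependency), sc.version and scala.util.Properties.versionNumberString give the two inputs.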

Related

spark & zeppelin problems with integration

I want to connect my locally installed Zeppelin 0.10.0 to a locally installed Spark 3.2.0 (I tried the same procedure with Spark 2.3.0 and it worked). But it looks like Zeppelin has its own internal Spark, and it uses that one every time I try. I have gone through the settings for the Spark interpreter to no avail.
I just want to know if there is any way to replace the default internal Spark that Zeppelin uses with the Spark 3.2.0 I want to use.
I set SPARK_HOME as documented and spark.master to local[*], and received the following error:
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:76)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:833)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:741)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.spark.SparkScala212Interpreter.open(SparkScala212Interpreter.scala:66)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:121)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
... 8 more
I've run into the same issue myself: Spark 3.2.0 won't run on Zeppelin 0.10.0. Spark 3.1.2 works without any issues, and Zeppelin ships with Spark 2.4.5 embedded. This is a problem with the tool itself.
According to the ticket ZEPPELIN-5565, version 0.10.0 does not support Spark 3.2.0. This should be fixed in 0.10.1 and 0.11.0 (according to the ticket; I've also checked the GitHub repo).
The pull request that fixes this issue is much longer, but Zeppelin 0.10.0 contains this decisive line:
public static final SparkVersion UNSUPPORTED_FUTURE_VERSION = SPARK_3_2_0;
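As an illustration of what such a constant implies, here is a simplified model of a version gate. This is not Zeppelin's actual code: VersionGate, Version and isSupported are hypothetical names, and the numeric encoding is just one convenient way to compare versions.

```scala
// Illustration only: a simplified model of an "unsupported future version" gate.
// VersionGate and its members are hypothetical and are not Zeppelin's actual code.
object VersionGate {
  final case class Version(major: Int, minor: Int, patch: Int) {
    // Encode as a single number so versions compare with plain integer ordering.
    def toNumber: Int = major * 10000 + minor * 100 + patch
  }

  // Mirrors the constant quoted above: everything from 3.2.0 upward is rejected.
  val UnsupportedFutureVersion: Version = Version(3, 2, 0)

  def isSupported(v: Version): Boolean =
    v.toNumber < UnsupportedFutureVersion.toNumber
}
```

Under this model Spark 3.1.2 passes the check while 3.2.0 is rejected, which matches the behaviour described above.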

Native snappy library not available

I'm trying to do many joins on some data frames using Spark in Scala. When I try to get the count of the final data frame, I get the exception below. I'm running the code using spark-shell.
I've tried some configuration parameters like the following when starting spark-shell, but none of them worked. Is there anything I'm missing here?
--conf "spark.driver.extraLibraryPath=/usr/hdp/2.6.3.0-235/hadoop/lib/native/"
--jars /usr/hdp/current/hadoop-client/lib/snappy-java-1.0.4.1.jar
Caused by: java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
Try updating the Hadoop jar from 2.6.3 to 2.8.0 or 3.0.0. Earlier versions of Hadoop had a bug where the native snappy library was not available.
After replacing the Hadoop core jar, you should be able to perform snappy compression and decompression.

NoSuchMethodError for Scala Seq line in Spark

I am having an error when trying to run plain Scala code in Spark similar to these posts: this and this
Their problem was that they were using the wrong Scala version to compile their Spark project. However, mine is the correct version.
I have Spark 1.6.0 installed on an AWS EMR cluster to run the program. The project is compiled on my local machine with Scala 2.11 installed and 2.11 listed in all dependencies and build files without any references to 2.10.
This is the exact line that throws the error:
var fieldsSeq: Seq[StructField] = Seq()
And this is the exact error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at com.myproject.MyJob$.main(MyJob.scala:39)
at com.myproject.MyJob.main(MyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark 1.6 on EMR is still built with Scala 2.10, so yes, you are having the same issue as in the posts you linked. In order to use Spark on EMR, you currently must compile your application with Scala 2.10.
Spark has upgraded their default Scala version to 2.11 as of Spark 2.0 (to be released within the next several months), so once EMR supports Spark 2.0, we will likely follow this new default and compile Spark with Scala 2.11.
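To confirm which Scala version the cluster actually runs, a small check like the following can help. It is only a sketch, and ScalaRuntimeCheck is a made-up name; scala.util.Properties.versionNumberString is part of the Scala standard library.

```scala
// Prints the Scala library version found on the runtime classpath.
// ScalaRuntimeCheck is a hypothetical name used for this sketch.
object ScalaRuntimeCheck {
  // "2.10.5" -> "2.10"; this prefix must match the _2.xx suffix of your artifacts.
  def binaryVersion(full: String): String = full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    val full = scala.util.Properties.versionNumberString
    println(s"Scala library at runtime: $full (binary version ${binaryVersion(full)})")
  }
}
```

Evaluating scala.util.Properties.versionNumberString directly in spark-shell on the cluster gives the same information without compiling anything locally.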

ClassNotFoundException: org.apache.spark.repl.SparkCommandLine

I am new to Apache Zeppelin and am trying to run it locally. I tried to run a simple sanity check to see that sc exists, and got the error below.
I compiled it for PySpark and Spark 1.5 (I use Spark 1.5). I increased the memory to 5 GB and changed the port to 8091.
I'm not sure what I did wrong to get the following error, or how to solve it.
Thanks in advance
java.lang.ClassNotFoundException: org.apache.spark.repl.SparkCommandLine
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:401)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:485)
at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:174)
at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:152)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:302)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Update
The solution for me was to downgrade my Scala version from 2.11.* to 2.10.*, build Apache Spark again, and run Zeppelin.
I am making certain assumptions based on what you have answered in the comments. It sounds like the Zeppelin setup itself is fine; the missing class, SparkCommandLine, belongs to Spark itself (the org.apache.spark.repl package).
Zeppelin has its own minimal embedded Spark classes, which are activated if you don't set SPARK_HOME. So first, per this GitHub page, try not setting SPARK_HOME (which you are setting) and HADOOP_HOME (which I don't think you are setting), to see if eliminating your underlying Spark install "fixes" it:
Without SPARK_HOME and HADOOP_HOME, Zeppelin uses embedded Spark and
Hadoop binaries that you have specified with mvn build option. If you
want to use system provided Spark and Hadoop, export SPARK_HOME and
HADOOP_HOME in zeppelin-env.sh. You can use any supported version of
Spark without rebuilding Zeppelin.
If that works, then we are looking at a Java classpath issue. To try to fix this, there's one more setting that goes in the zeppelin-env.sh file:
ZEPPELIN_JAVA_OPTS
It is mentioned here on the Zeppelin mailing list. Make sure you set it to point to the actual Spark jars so the JVM picks them up with a -classpath argument.
Here's what my Zeppelin process looks like for comparison. I think the important part is the -cp argument. Run ps on your system and look through your JVM options to see whether they similarly point to your Spark jars:
/usr/lib/jvm/java-8-oracle/bin/java -cp /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar:/usr/local/spark/conf/:/usr/local/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar
-Xms1g -Xmx1g -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log org.apache.spark.deploy.SparkSubmit --conf spark.driver.extraClassPath=:/usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar
--conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8 -Xmx1024m -XX:MaxPermSize=512m -Dzeppelin.log.file=/usr/local/zeppelin/logs/zeppelin-interpreter-spark-jim-jim.log
--class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/local/zeppelin/interpreter/spark/zeppelin-spark-0.5.5-incubating.jar 50309
Hope that helps. If that doesn't work, please edit your question to show your existing classpath.
Zeppelin recently released version 0.6.1 which supports Scala 2.11 and Spark 2.0. I too was puzzled by this error message, since I could clearly see my Spark home directory in the classpath. The new version of Zeppelin works great; I'm currently running it with Spark 2.0/Scala 2.11.

Spark Kafka - Issue while running from Eclipse IDE

I am experimenting with Spark Kafka integration, and I want to test the code from my Eclipse IDE. However, I got the error below:
java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
at kafka.utils.Pool.<init>(Pool.scala:28)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<init>(FetchRequestAndResponseStats.scala:60)
at kafka.consumer.FetchRequestAndResponseStatsRegistry$.<clinit>(FetchRequestAndResponseStats.scala)
at kafka.consumer.SimpleConsumer.<init>(SimpleConsumer.scala:39)
at org.apache.spark.streaming.kafka.KafkaCluster.connect(KafkaCluster.scala:52)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:345)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:403)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:532)
at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
at com.capiot.platform.spark.SparkTelemetryReceiverFromKafkaStream.executeStreamingCalculations(SparkTelemetryReceiverFromKafkaStream.java:248)
at com.capiot.platform.spark.SparkTelemetryReceiverFromKafkaStream.main(SparkTelemetryReceiverFromKafkaStream.java:84)
UPDATE:
The versions that I am using are:
scala - 2.11
spark-streaming-kafka- 1.4.1
spark - 1.4.1
Can anyone resolve the issue? Thanks in advance.
You have the wrong version of Scala. You need 2.10.x per
https://spark.apache.org/docs/1.4.1/
"For the Scala API, Spark 1.4.1 uses Scala 2.10."
Might be late to help OP, but when using Kafka streaming with Spark, you need to make sure that you use the right jar file.
For example, in my case I have Scala 2.11 (the default for Spark 2.0, which I'm using), and since the Kafka integration has to match Spark 2.0.0, I have to use the artifact spark-streaming-kafka-0-8-assembly_2.11-2.0.0-preview.jar
Notice that both my Scala version and the Spark version appear in the artifact name, as 2.11-2.0.0
Hope this helps (someone)
Hope that helps.
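For the Spark 1.4.1 setup in the question, the version alignment could look like the following build.sbt fragment. This is a sketch that assumes the project is built with sbt (the question only mentions Eclipse, so the actual build tool may differ); with Maven the same idea applies by using the _2.10 artifact IDs.

```scala
// build.sbt (sketch): keep the Scala version and every Spark artifact on the same binary version.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.10 here) to the artifact name.
  "org.apache.spark" %% "spark-streaming"       % "1.4.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1"
)
```

Using %% rather than hard-coding the suffix means the suffix follows scalaVersion, so a mismatch such as a _2.11 jar on a Scala 2.10 Spark cannot creep in through these dependencies.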