Zeppelin 6.5 + Apache Kafka connector for Structured Streaming 2.0.2 - streaming

I'm trying to run a zeppelin notebook that contains spark's Structured Streaming example with Kafka connector.
>kafka is up and running on localhost port 9092
>from zeppelin notebook, sc.version returns String = 2.0.2
Here is my environment:
kafka: kafka_2.10-0.10.1.0
zeppelin: zeppelin-0.6.2-bin-all
spark: spark-2.0.2-bin-hadoop2.7
Here is the code in my zeppelin notebook:
import org.apache.enter code herespark.sql.functions.{explode, split}
// Setup connection to Kafka val kafka = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
// comma separated list of broker:host
.option("subscribe", "twitter")
// comma separated list of topics
.option("startingOffsets", "latest")
// read data from the end of the stream .load()
Here is the error I'm getting when I run the notebook:
import org.apache.spark.sql.functions.{explode, split}
java.lang.ClassNotFoundException: Failed to find data source: kafka.
Please find packages at
https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:218)
at
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:80)
at
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:80)
at
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 86 elided Caused by: java.lang.ClassNotFoundException:
kafka.DefaultSource at
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at
java.lang.ClassLoader.loadClass(ClassLoader.java:357) at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at scala.util.Try$.apply(Try.scala:192)
Any help advice would be greatly appreciated.
Thnx

You probably have figured this out already but putting in the answer for others, you have to add the following to zeppelin-env.sh.j2
SPARK_SUBMIT_OPTIONS=--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0
along with potentially other dependencies if you are using the kafka client:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0,org.apache.spark:spark-sql_2.11:2.1.0,org.apache.kafka:kafka_2.11:0.10.0.1,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.1.0,org.apache.kafka:kafka-clients:0.10.0.1

This solution has been tested in zeppelin version 0.10.1.
You need to add dependencies of your code. It can be done with zeppelin UI. Go to Interpreter panel (http://localhost:8080/#/interpreter) and in spark section, under Dependencies you can add artifact of each dependency. If by adding spark-sql-kafka you ran into other dependency issues, add all packages the spark-sql-kafka needs. You can find them in Compile Dependencies section of it's maven repository.
I'm working with spark version 3.0.0 and scala version 2.12 and I was trying to integrate spark with kafka. I managed to get passed this issue by adding all the bellow artifacts:
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0
com.google.code.findbugs:jsr305:3.0.0
org.apache.spark:spark-tags_2.12:3.3.0
org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.3.0
org.apache.kafka:kafka-clients:3.0.0

Related

Spark Shell not working after adding support for Iceberg

We are doing POC on Iceberg and evaluating it first time.
Spark Environment:
Spark Standalone Cluster Setup ( 1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add support for Iceberg in Spark shell, we are adding Iceberg dependency while launching the Spark shell as below,
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use the Spark shell at all. For all the commands (even non Iceberg) we are getting the same exception as below,
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
Below simple command also throwing same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In Spark source code, BinaryCommand class belongs to Spark SQL module, so tried explicitly adding Spark SQL dependency while launching Spark shell as below, but still getting same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally i.e. without Iceberg dependency, then it is working properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We are using the wrong Iceberg version, choose the spark 3.2 iceberg jar but running Spark 3.1. After using the correct dependency version (i.e. 3.1), we are able to launch the Spark shell with Iceberg. Also no need to specify org.apache.spark Spark jars using packages since all of that will be on the classpath anyway.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1

spark & zeppelin problems with integration

I want to connect my locally installed zeppelin 0.10.0 to an also locally installed spark 3.2.0 (I tried the same procedure with spark2.3.0 and it worked.). But it looks like zeppelin itself has an internal spark which uses the internal one every time I try. I have gone through the setting for spark interpreters with no use.
I just want to know if there is anyway I can change the default internal spark that zeppelin uses and change it to a spark 3.2.0 I want to use.
I put the parameters of SPARK_HOME what it is said to be and spark.master local[*] receiving the following error:
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:76)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:833)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:741)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.spark.SparkScala212Interpreter.open(SparkScala212Interpreter.scala:66)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:121)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
... 8 more
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:76)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:833)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:741)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.FIFOScheduler.lambda$runJobInScheduler$0(FIFOScheduler.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
at org.apache.zeppelin.spark.SparkScala212Interpreter.open(SparkScala212Interpreter.scala:66)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:121)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
... 8 more
I've run into the same issue myself - you won't run Spark 3.2.0 on Zeppelin 0.10.0. Spark 3.1.2 works without any issues and Zeppelin has Spark 2.4.5 included - this is a problem with a tool itself.
According to the ticket ZEPPELIN-5565 version 0.10.0 DOES NOT support Spark 3.2.0. This should be fixed in 0.10.1 and 0.11.0 (info from mentioned ticket and I've also checked the Github repo).
Pull request that fixes this issue is much longer, but in Zeppelin 0.10.0 there is this strategic line:
public static final SparkVersion UNSUPPORTED_FUTURE_VERSION = SPARK_3_2_0;

Exception while running StreamingContext.start()

Exception while running python code in Windows 10. I am using Apache Kafka and PySpark.
Python code snippet to read data from Kafka
ssc=StreamingContext(sc,60)
zkQuorum, topic = sys.argv[1:]
kvs=KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: [x[0],x[1]])
lines.pprint()
lines.foreachRDD(SaveRecord)
ssc.start()
ssc.awaitTermination()
Exception while running the code
Exception in thread "streaming-start" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
at org.apache.spark.streaming.kafka.KafkaReceiver.<init>(KafkaInputDStream.scala:69)
at org.apache.spark.streaming.kafka.KafkaInputDStream.getReceiver(KafkaInputDStream.scala:60)
at org.apache.spark.streaming.scheduler.ReceiverTracker.$anonfun$launchReceivers$1(ReceiverTracker.scala:441)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:237)
at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at org.apache.spark.streaming.scheduler.ReceiverTracker.launchReceivers(ReceiverTracker.scala:440)
at org.apache.spark.streaming.scheduler.ReceiverTracker.start(ReceiverTracker.scala:160)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:102)
at org.apache.spark.streaming.StreamingContext.$anonfun$start$1(StreamingContext.scala:583)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.ThreadUtils$$anon$1.run(ThreadUtils.scala:145)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
This may be due to incompatible version of Scala with Spark. Make sure your Scala Version in Project configuration matches with the Version your Spark Version supports.
Spark requires Scala 2.12; support for Scala 2.11 was removed in Spark 3.0.0
It is also possible that the third party jar (like dstream-twitter for twitter streaming application or your Kafka streaming jar) is built for unsupported version of Scala in your application.
For me dstream-twitter_2.11-2.3.0-SNAPSHOT For Instance didn't work with Spark 3.0, It gave Exception in thread "streaming-start" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class). But when I updated the dtream-twitter jar with scala 2.12 version it solved the issue.
Make sure you get all the Scala Versions correct.

Netezza connection with Spark / Scala JDBC

I've set up Spark 2.2.0 on my Windows machine using Scala 2.11.8 on IntelliJ IDE. I'm trying to make Spark connect to Netezza using JDBC drivers.
I've read through this link and added the com.ibm.spark.netezzajars to my project through Maven. I attempt to run the Scala script below just to test the connection:
package jdbc
object SimpleScalaSpark {
def main(args: Array[String]) {
import org.apache.spark.sql.{SparkSession, SQLContext}
import com.ibm.spark.netezza
val spark = SparkSession.builder
.master("local")
.appName("SimpleScalaSpark")
.getOrCreate()
val sqlContext = SparkSession.builder()
.appName("SimpleScalaSpark")
.master("local")
.getOrCreate()
val nzoptions = Map("url" -> "jdbc:netezza://SERVER:5480/DATABASE",
"user" -> "USER",
"password" -> "PASSWORD",
"dbtable" -> "ADMIN.TABLENAME")
val df = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
}
}
However I get the following error:
17/07/27 16:28:17 ERROR NetezzaJdbcUtils$: Couldn't find class org.netezza.Driver
java.lang.ClassNotFoundException: org.netezza.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:49)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:netezza://SERVER:5480/DATABASE
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:54)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
I have two ideas:
1) I don't don't believe I actually installed any Netezza JDBC driver, though I thought the jars I brought into my project from the link above was sufficient. Am I just missing a driver or am I missing something in my Scala script?
2) In the same link, the author makes mention of starting the Netezza Spark package:
For example, to use the Spark Netezza package with Spark’s interactive
shell, start it as shown below:
$SPARK_HOME/bin/spark-shell –packages
com.ibm.SparkTC:spark-netezza_2.10:0.1.1
–driver-class-path~/nzjdbc.jar
I don't believe I'm invoking any package apart from jdbc in my script. Do I have to add that to my script?
Thanks!
Your 1st idea is right, I think. You almost certainly need to install the Netezza JDBC driver if you have not done this already.
From the link you posted:
This package can be deployed as part of an application program or from
Spark tools such as spark-shell, spark-sql. To use the package in the
application, you have to specify it in your application’s build
dependency. When using from Spark tools, add the package using
–packages command line option. Netezza JDBC driver also should be
added to the application dependencies.
The Netezza driver is something you have to download yourself, and you need support entitlement to get access to it (via IBM's Fix Central or Passport Advantage). It is included in either the Windows driver/client support package, or the linux driver package.

Unable to connect to redshift from spark

I am trying to read data from redshift to spark 1.5 using scala 2.10
I have built the spark-redshift package and added the amazon JDBC connector to the project, but I keep getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials
I have authenticated in the following way:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "ACCESSKEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey","SECRETACCESSKEY")
val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift")
.option("url","jdbc:redshift://AWS_SERVER:5439/warehouseuser=USER&password=PWD")
.option("dbtable", "fact_time")
.option("tempdir", "s3n://bucket/path")
.load()
df.show()
Concerning your first error java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials I repeat what I said in the comment :
You have forgot to ship your AWS dependency jar in your spark app jar
And about the second error, I'm not sure of the package but it's more likely to be the org.apache.httpcomponents library you need. (I don't know for what you are using it thought!)
You can add the following to your maven dependencies :
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.3</version>
</dependency>
and you'll need to assembly the whole.
PS: You'll always need to provide the libraries when they are not installed. You must also becareful with the size of the jar you are submitting because it can harm performance.