How to use PySpark Structured Streaming with Kafka - pyspark

I'm trying to use Spark Structured Streaming with Kafka, and I have a problem when I use spark-submit: the consumer still receives data from the producer, but the Structured Streaming job fails with an error. Please help me find the issue in my code.
Here is my code in test.py:
from kafka import KafkaProducer
from kafka import KafkaConsumer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('stream_test').getOrCreate()

import random

producer = KafkaProducer(bootstrap_servers=["localhost:9092"])
for i in range(0, 100):
    lg_value = str(random.uniform(5000, 10000))
    producer.send(topic='test', value=bytes(lg_value, encoding='utf-8'))
    producer.flush()

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test").load()
df_to_string = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print("done")
When I run:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 test.py
terminal output:
> 20/07/12 19:39:09 INFO Executor: Starting executor ID driver on host
> 192.168.31.129 20/07/12 19:39:09 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on
> port 38885. 20/07/12 19:39:09 INFO NettyBlockTransferService: Server
> created on 192.168.31.129:38885 20/07/12 19:39:09 INFO BlockManager:
> Using org.apache.spark.storage.RandomBlockReplicationPolicy for block
> replication policy 20/07/12 19:39:09 INFO BlockManagerMaster:
> Registering BlockManager BlockManagerId(driver, 192.168.31.129, 38885,
> None) 20/07/12 19:39:09 INFO BlockManagerMasterEndpoint: Registering
> block manager 192.168.31.129:38885 with 413.9 MiB RAM,
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 192.168.31.129, 38885, None) 20/07/12 19:39:09
> INFO BlockManager: Initialized BlockManager: BlockManagerId(driver,
> 192.168.31.129, 38885, None) 20/07/12 19:39:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of
> spark.sql.warehouse.dir ('file:/home/thoaint2/spark-warehouse').
> 20/07/12 19:39:11 INFO SharedState: Warehouse path is
> 'file:/home/thoaint2/spark-warehouse'. Traceback (most recent call
> last): File "/home/thoaint2/test.py", line 15, in <module>
> df = spark.readStream.format("kafka").option('kafka.bootstrap.servers','localhost:9092')
> \ File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 420, in load File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1304, in __call__ File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 131, in deco File
> "/home/thoaint2/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value py4j.protocol.Py4JJavaError: An error
> occurred while calling o31.load. : java.lang.NoClassDefFoundError:
> org/apache/kafka/common/serialization/ByteArraySerializer at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:557)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:325)
> at
> org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:70)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:220)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:112)
> at
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:112)
> at
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
> at
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:205)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
> py4j.Gateway.invoke(Gateway.java:282) at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79) at
> py4j.GatewayConnection.run(GatewayConnection.java:238) at
> java.lang.Thread.run(Thread.java:748) Caused by:
> java.lang.ClassNotFoundException:
> org.apache.kafka.common.serialization.ByteArraySerializer at
> java.net.URLClassLoader.findClass(URLClassLoader.java:382)

NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
This class is part of the kafka-clients JAR, which you'll want to add to your --packages, e.g. spark-submit ... --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0,org.apache.kafka:kafka-clients:<<version>>
Also note that Spark works as a producer as well, so you don't need a different Python Kafka library.
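For illustration, here is a minimal sketch of producing the same random values with Spark itself instead of kafka-python (reusing the localhost:9092 broker and the test topic from your code, and assuming the Kafka packages above are on the classpath):
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('stream_test').getOrCreate()

# The Kafka sink expects a string or binary column named 'value'
rows = [(str(random.uniform(5000, 10000)),) for _ in range(100)]
df = spark.createDataFrame(rows, ['value'])

# Batch-write the rows to the 'test' topic; Spark acts as the Kafka producer here
df.write \
    .format('kafka') \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .option('topic', 'test') \
    .save()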
If you simply want to process Kafka Streams without using a JVM then look into Faust
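As a rough sketch of that route (assuming Faust is installed; the app name and module name below are placeholders, and the broker and topic are taken from the question):
import faust

# 'uv-demo' is an arbitrary app id chosen for this example
app = faust.App('uv-demo', broker='kafka://localhost:9092')
test_topic = app.topic('test', value_type=bytes)

@app.agent(test_topic)
async def process(stream):
    # Print each raw message value as it arrives
    async for value in stream:
        print(value)

# Run with: faust -A <your_module_name> worker -l info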

Related

pyspark issue while connecting

I am new to PySpark.
I was trying to initialize a PySpark session, but I am getting the error below. I am running the pyspark2 command on my local machine.
When I first tried it with Scala, the Spark session invocation worked correctly. Then I tried to invoke PySpark and got this error. Please let me know how I can get past it.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/08 22:55:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/08 22:55:41 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
py4j.ClientServerConnection.run(ClientServerConnection.java:106)
java.base/java.lang.Thread.run(Thread.java:833)
C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin\..\python\pyspark\shell.py:42: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin\..\python\pyspark\shell.py", line 38, in <module>
spark = SparkSession._create_shell_session() # type: ignore
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\session.py", line 553, in _create_shell_session
return SparkSession.builder.getOrCreate()
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\sql\session.py", line 228, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 392, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 146, in __init__
self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 209, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\pyspark\context.py", line 329, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\java_gateway.py", line 1585, in __call__
return_value = get_return_value(
File "C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\python\lib\py4j-0.10.9.3-src.zip\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
C:\Spark\spark-3.2.1-bin-hadoop3.2\spark-3.2.1-bin-hadoop3.2\bin>SUCCESS: The process with PID 21928 (child process of PID 14900) has been terminated.
SUCCESS: The process with PID 14900 (child process of PID 31720) has been terminated.
SUCCESS: The process with PID 31720 (child process of PID 10468) has been terminated.

Livy spark interactive session

I'm trying to create a Spark interactive session with Livy, and I need to add a library, a JAR that I put in HDFS (see my code). But the session dies, and the log is below.
code :
client = LivyClient('http://sandbox.c4e.kyomei.fr:10500')
session = client.create_session(SessionKind.SPARK , jars = ['hdfs://sandbox-hdp.hortonworks.com:8020/tmp/tsa2-assembly-0.1.jar'] )
log :
> 20/05/09 01:43:48 INFO LineBufferedStream: Exception in thread "main"
> scala.reflect.internal.FatalError: object Predef does not have a
> member classOf 20/05/09 01:43:48 WARN RSCClient:
> Error stopping RPC.
> io.netty.util.concurrent.BlockingOperationException:
> DefaultChannelPromise#786d2cd8(uncancellable)
>
> at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:394)
>
> at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
>
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:230)
> ....
> Exception in thread "Thread-32" java.io.IOException: Stream closed
>
> at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
>
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:283)
>
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> 20/05/09 01:43:48 WARN ContextLauncher: Child process exited with code
20/05/09 01:43:48 ERROR SparkProcApp: job was killed by user

What causes java.lang.NoClassDefFoundError exception during EMR execution?

I have a Scala jar file called SGA.jar. Within it, there is a class called org/SGA/MainTest, which uses the underlying SGA.jar logic to perform some graph operations and looks like:
package org.SGA
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import java.io._
import scala.util._
object MainTest {
  def initialize() : Unit = {
    println("Initializing")
  }
  def perform(collection : Iterable[String]) : Unit = {
    val conf = new SparkConf().setAppName("maintest")
    val sparkContext = new SparkContext(conf)
    sparkContext.setLogLevel("ERROR")
    val edges = sparkContext.parallelize(collection.toList).map(_.split(" ")).map { edgeCoordinates => new Edge(edgeCoordinates(0).toLong, edgeCoordinates(1).toLong, edgeCoordinates(2).toDouble) }
    println("Creating graph")
    val graph : Graph[Any, Double] = Graph.fromEdges(edges, 0)
    println("Graph created")
    // ...
  }
}
SGA.jar is embedded into scalaWrapper.jar, which is a Java wrapper around the Scala SGA.jar and the necessary datasets. Its folder structure looks like this:
scalaWrapper.jar
| META-INF
| | MANIFEST.MF
| scalawrapper
| | datasets
| | | data1.txt
| | jars
| | | SGA.jar
| | FileParser.java
| | FileParser.class
| | WrapperClass.java
| | WrapperClass.class
| .classpath
| .project
The FileParser class basically converts the data available in the text files into usable structures and is of no further interest here. The main class, however, is WrapperClass:
package scalawrapper;
import scala.collection.*;
import scala.collection.Iterable;
import java.util.List;
import org.SGA.*;
public class WrapperClass {
    public static void main(String[] args) {
        FileParser fileparser = new FileParser();
        String filepath = "/scalawrapper/datasets/data1.txt";
        MainTest.initialize();
        List<String> list = fileparser.Parse(filepath);
        Iterable<String> scalaIterable = JavaConversions.collectionAsScalaIterable(list);
        MainTest.perform(scalaIterable);
    }
}
SGA.jar is built via SBT and the java jar is developed and exported from Eclipse. When running locally (in which case the SparkConf has appended .setMaster("local[*]").set("spark.executor.memory","7g") to facilitate a local execution), there are no issues and the code behaves as expected.
The problem arises when the scalaWrapper.jar is expected to run on an EMR cluster. The cluster is defined as a 1 master + 4 worker nodes, with an additional spark application step:
Main class : None
Arguments : spark-submit --deploy-mode cluster --class scalawrapper.WrapperClass --executor-memory 17g --executor-cores 16 --driver-memory 17g s3://scalaWrapperCluster/scalaWrapper.jar
The execution fails with :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__1619195545177535823.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for TERM
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for HUP
19/04/22 16:56:43 INFO SignalUtils: Registered signal handler for INT
19/04/22 16:56:43 INFO SecurityManager: Changing view acls to: yarn,hadoop
19/04/22 16:56:43 INFO SecurityManager: Changing modify acls to: yarn,hadoop
19/04/22 16:56:43 INFO SecurityManager: Changing view acls groups to:
19/04/22 16:56:43 INFO SecurityManager: Changing modify acls groups to:
19/04/22 16:56:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
19/04/22 16:56:44 INFO ApplicationMaster: Preparing Local resources
19/04/22 16:56:44 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1555952041027_0001_000001
19/04/22 16:56:44 INFO ApplicationMaster: Starting the user application in a separate Thread
19/04/22 16:56:44 INFO ApplicationMaster: Waiting for spark context initialization...
19/04/22 16:56:44 ERROR ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/SGA/MainTest
java.lang.NoClassDefFoundError: org/SGA/MainTest
at scalawrapper.WrapperClass.main(WrapperClass.java:20)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: java.lang.ClassNotFoundException: org.SGA.MainTest
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more
Note that WrapperClass.java:20 corresponds to MainTest.initialize();.
This exception seems to be quite popular, as I came upon quite a few attempts to solve it (example), yet none solved my problem. I tried including the scala-library that was used to build SGA.jar in scalaWrapper.jar as well, eliminating static fields, and searching for mistakes in the project definitions, but had no luck.
I resolved the issue by uploading SGA.jar to S3 separately and adding it to spark-submit via the --jars parameter:
spark-submit --deploy-mode cluster --jars s3://scalaWrapperCluster/SGA.jar --class scalawrapper.WrapperClass --executor-memory 17g --executor-cores 16 --driver-memory 17g s3://scalaWrapperCluster/scalaWrapper.jar
Note that the original functionality within scalaWrapper.jar (including the already built-in SGA.jar) didn't change, and the separately uploaded SGA.jar is the one being executed.

Unable to Send Spark Data Frame to Kafka (java.lang.ClassNotFoundException: Failed to find data source: kafka.)

I am facing an issue while pushing data to Kafka from a Spark DataFrame.
Let me explain my scenario in detail with a sample example. I want to load the data into Spark and send the Spark output to Kafka. I am using Gradle 3.5, Spark 2.3.1, and Kafka 1.0.1.
Here is my build.gradle:
buildscript {
    ext {
        springBootVersion = '1.5.15.RELEASE'
    }
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
    }
}

apply plugin: 'scala'
apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'

group = 'com.sample'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter')
    compile('org.apache.spark:spark-core_2.11:2.3.1')
    compile('org.apache.spark:spark-sql_2.11:2.3.1')
    compile('org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.1')
    compile('org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1')
    testCompile('org.springframework.boot:spring-boot-starter-test')
}
And here is my code:
package com.sample

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object SparkConnection {
  case class emp(empid: Integer, empname: String, empsal: Float)
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark Connection").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val dataRdd = sc.textFile("/home/sample/data/sample.txt")
    val mapRdd = dataRdd.map(row => row.split(","))
    val empRdd = mapRdd.map(row => emp(row(0).toInt, row(1), row(2).toFloat))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val empDF = empRdd.toDF()
    empDF
      .select(to_json(struct(empDF.columns.map(column): _*)).alias("value"))
      .write.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "my-kafka-topic").save()
  }
}
Please ignore the Spring Boot framework API in build.gradle.
After building my package with Gradle, I can see all the dependent classes mentioned in the .gradle file.
But when I run the code with spark-submit like this:
spark-submit --class com.sample.SparkConnection spark_kafka_integration.jar
I am getting the following error
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
at com.iniste.SparkConnection$.main(SparkConnection.scala:29)
at com.iniste.SparkConnection.main(SparkConnection.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
2018-09-05 17:41:17 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-09-05 17:41:17 INFO AbstractConnector:318 - Stopped Spark#51684e4a{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-09-05 17:41:17 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-09-05 17:41:17 INFO MemoryStore:54 - MemoryStore cleared
2018-09-05 17:41:17 INFO BlockManager:54 - BlockManager stopped
2018-09-05 17:41:17 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-09-05 17:41:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-05 17:41:17 INFO SparkContext:54 - Successfully stopped SparkContext
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Shutdown hook called
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd4cb4ef-3883-4c26-a93f-f355b13ef306
2018-09-05 17:41:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-156dfdbd-cff4-4c70-943f-35ef403a01ed
Please help me get past this error. Some blogs suggested using the --packages option with spark-submit, but there is a proxy limitation on my side that prevents downloading the mentioned packages. I also don't understand why spark-submit is unable to find the jars that are already available. Please correct me where I am going wrong.
As with any Spark application, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.11 and its dependencies can be added directly to spark-submit using --packages, as below in your case:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 --class com.sample.SparkConnection spark_kafka_integration.jar
This can be found here
However, as per cricket_007's suggestion, I have added the Shadow (shadowJar) plugin to your build.gradle, so the new one may look similar to this:
buildscript {
    ext {
        springBootVersion = '1.5.15.RELEASE'
    }
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
    }
}

plugins {
    id "com.github.johnrengelman.shadow" version "2.0.4"
}

apply plugin: 'scala'
apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'
apply plugin: "com.github.johnrengelman.shadow"

group = 'com.sample'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    compile('org.springframework.boot:spring-boot-starter')
    compile('org.apache.spark:spark-core_2.11:2.3.1')
    compile('org.apache.spark:spark-sql_2.11:2.3.1')
    compile('org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.1')
    compile('org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1')
    compile 'org.scala-lang:scala-library:2.11.12'
    // https://mvnrepository.com/artifact/org.apache.kafka/kafka
    //compile group: 'org.apache.kafka', name: 'kafka_2.10', version: '0.8.0'
    testCompile('org.springframework.boot:spring-boot-starter-test')
}

shadowJar {
    baseName = "spark_kafka_integration"
    zip64 true
    classifier = null
    version = null
}
So to create your jar, the command is just the :shadowJar task from Gradle, as sketched below.
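For example (a sketch, assuming the Gradle wrapper is present and the shaded jar lands in the default build/libs directory):
./gradlew shadowJar
spark-submit --class com.sample.SparkConnection build/libs/spark_kafka_integration.jar
Since the Kafka data source is now shaded into the jar, the --packages option should no longer be needed.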

mapWithState assertion failed: Block rdd_45_0 is not locked for reading

I'm using the function mapWithState() to count UV in my Spark Streaming application. After mapWithState I get a DStream and call foreachRDD on it. Inside foreachRDD there is an rdd.foreachPartition to iterate over the partition's Iterator, and I then apply foreach on the Iterator inside a Future, but I get an error in the Future.
Error log here:
> 17/07/27 10:19:54.0447 INFO Executor: Finished task 1.0 in stage 52.0 (TID 422). 1878 bytes result sent to driver
> 17/07/27 10:19:54.0454 DEBUG BlockManagerSlaveEndpoint: removing RDD 47
> 17/07/27 10:19:54.0454 INFO BlockManager: Removing RDD 47
> 17/07/27 10:19:54.0455 DEBUG BlockManagerSlaveEndpoint: Done removing RDD 47, response is 0
> 17/07/27 10:19:54.0455 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.1.30:43968
> 17/07/27 10:19:54.0456 DEBUG BlockManagerSlaveEndpoint: removing RDD 46
> 17/07/27 10:19:54.0456 INFO BlockManager: Removing RDD 46
> 17/07/27 10:19:54.0456 DEBUG BlockManagerSlaveEndpoint: Done removing RDD 46, response is 0
> 17/07/27 10:19:54.0456 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.1.30:43968
> 17/07/27 10:19:54.0461 WARN BoneCP: Thread close connection monitoring has been enabled. This will negatively impact on your
> performance. Only enable this option for debugging purposes!
> 17/07/27 10:19:54.0873 WARN ClickAnalysis$: before parpair data with threadName=ForkJoinPool-1-worker-5 and threadId=46
> 17/07/27 10:19:54.0873 WARN ClickAnalysis$: before parpair data with threadName=ForkJoinPool-1-worker-3 and threadId=50
> 17/07/27 10:19:54.0875 WARN ClickAnalysis$: come into foreach data with threadName=ForkJoinPool-1-worker-5 and threadId=46
> 17/07/27 10:19:54.0875 WARN ClickAnalysis$: come into foreach data with threadName=ForkJoinPool-1-worker-3 and threadId=50
> Exception: java.util.concurrent.ExecutionException: Boxed Error
> at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
> at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
> at scala.concurrent.Promise$class.complete(Promise.scala:55)
> at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23)
> at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.AssertionError: assertion failed: Block rdd_45_0 is not locked for reading
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
> at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
> at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at ClickAnalysis$.doPrepairCamAndGmtUvPs(ClickAnalysis.scala:383)
> at ClickAnalysis$$anonfun$8.apply(ClickAnalysis.scala:353)
> at ClickAnalysis$$anonfun$8.apply(ClickAnalysis.scala:345)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> ... 5 more
and here is my code:
val mapState3 = pairs.mapWithState(StateSpec.function(mappingFunction).timeout(Duration(uvExpireTime.toLong))).map(x => (x._1, x._2.estimatedSize.toLong))
mapState3.foreachRDD({ rdd => {
  rdd.foreachPartition(uvRecord => {
    if (!uvRecord.isEmpty) {
      doUpdateUV(uvRecord)
    }
  })

  def doUpdateUV(data: Iterator[(String, Long)]): Unit = {
    if (data != null) {
      val f = Future {
        var connection: Connection = null
        try {
          connection = ConnectionPool.getConnection.getOrElse(null)
          connection.setAutoCommit(false)
          val camPs: PreparedStatement = connection.prepareStatement(updateUvCamCnt_sql)
          val gmtPs: PreparedStatement = connection.prepareStatement(updateUvGmtCnt_sql)
          logger.warn("before parpair data with threadName=" + Thread.currentThread().getName + " and threadId=" + Thread.currentThread().getId)
          for (uvRecord <- data) {
            logger.warn("come into foreach data with threadName=" + Thread.currentThread().getName + " and threadId=" + Thread.currentThread().getId)
          }
          logger.warn("come into batch update with threadName=" + Thread.currentThread().getName + " and threadId=" + Thread.currentThread().getId)
          camPs.executeBatch()
          gmtPs.executeBatch()
          connection.commit()
          camPs.close()
          gmtPs.close()
        } catch {
          case exception: Exception =>
            logger.error("Error in batchUpdate " + exception.getMessage + "-----------------------" + ExceptionUtils.getStackTrace(exception) + "-----------------------------")
            throw exception
        } finally {
          ConnectionPool.closeConnection(connection)
        }
        "success"
      }
      f onSuccess {
        case result => println(s"Success: $result")
      }
      f onFailure {
        case t => println(s"Exception: ${ExceptionUtils.getStackTrace(t)}")
      }
    }
  }
}})
I am looking forward to any useful solution to this problem.
I had the same issue:
java.lang.AssertionError: assertion failed: Block rdd_xx_xx is not locked for reading
I fixed it by adding more nodes to the cluster; it seems to have been a memory issue.
Based on what I have read in the different JIRAs, this is a race condition. Multiple attempts at fixing it have been checked in. I am experiencing this issue in 2.4.4, and it looks like 3.0.0 might have fixed it.
For me it happens during a call to df.rdd.isEmpty().
If you want more information, here are the resources I found:
First JIRA on the issue
Later JIRA on the same issue (a duplicate, but for a later Spark version)
More details on why this is a race condition
Very old JIRA where this seems to have first appeared