I need to connect to Hive from PySpark. I'm trying to run pyspark from the CLI.
export SPARK_WORKER_DIR=/app/spark/tmp
I get the exception below when running pyspark:
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/28 15:29:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/28 15:29:51 WARN util.Utils: Your hostname, PC_NAME resolves to a loopback address:; using MY_IP instead (on interface eth1)
18/03/28 15:29:51 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/03/28 15:29:52 WARN client.StandaloneAppClient$ClientEndpoint: Failed to connect to master
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$


How to stream data from Kafka to MongoDB by Kafka Connector

I want to stream data from Kafka to MongoDB using a Kafka connector.
I found this one, but it comes with no setup steps.
After googling, everything seems to lead to Confluent Platform, which I don't want to use.
Could anyone share documentation or a guide on how to use kafka-connect-mongodb without Confluent Platform, or suggest another Kafka connector for streaming data from Kafka to MongoDB?
Thank you in advance.
What I tried:
Step 1: I downloaded mongo-kafka-connect-0.1-all.jar from Maven Central
Step 2: I copied the jar file to a new folder, plugins, inside the Kafka directory (I use Kafka on Windows, so the directory is D:\git\1.libraries\kafka_2.12-2.2.0\plugins)
Step 3: Edit file by adding a new line
Step 4: I added a new config file for the MongoDB sink
# Specific global MongoDB Sink Connector configuration
Step 5: I ran the command bin\windows\connect-standalone.bat config\ config\
But I get this error:
[2019-07-09 10:19:09,466] WARN The configuration '' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,467] WARN The configuration 'key.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,467] WARN The configuration '' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,468] WARN The configuration 'value.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,469] WARN The configuration 'plugin.path' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,469] WARN The configuration 'value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,470] WARN The configuration 'key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider will be ignored.
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider will be ignored.
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider will be ignored.
Jul 09, 2019 10:19:11 AM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in contains empty path annotation.
WARNING: The (sub)resource method createConnector in contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in contains empty path annotation.
WARNING: The (sub)resource method serverInfo in contains empty path annotation.
[2019-07-09 10:19:12,302] ERROR WorkerSinkTask{id=mongo-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(
at org.apache.kafka.connect.runtime.WorkerTask.doRun(
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error:
at org.apache.kafka.connect.json.JsonConverter.toConnectData(
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'this': was expecting 'null', 'true', 'false' or NaN
at [Source: (byte[])"this is a message"; line: 1, column: 6]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'this': was expecting 'null', 'true', 'false' or NaN
at [Source: (byte[])"this is a message"; line: 1, column: 6]
at com.fasterxml.jackson.core.JsonParser._constructError(
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._matchToken2(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._matchTrue(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(
at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(
at com.fasterxml.jackson.databind.ObjectMapper.readTree(
at org.apache.kafka.connect.json.JsonDeserializer.deserialize(
at org.apache.kafka.connect.json.JsonConverter.toConnectData(
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(
at org.apache.kafka.connect.runtime.WorkerTask.doRun(
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
[2019-07-09 10:19:12,305] ERROR WorkerSinkTask{id=mongo-sink-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
Which configuration did I set wrong, or did I miss anything?
I fixed it. Now I can stream data from Kafka to MongoDB successfully.
My fix was:
move my Kafka installation to C:\kafka_2.12-2.2.0
update plugin.path to match the new path
update the config file
There is an official source and sink connector from MongoDB themselves. It is available on Confluent Hub:
If you don't want to use Confluent Platform you can deploy Apache Kafka yourself - it includes Kafka Connect already. Which plugins (connectors) you use with it is up to you. In this case you would be using Kafka Connect (part of Apache Kafka) plus kafka-connect-mongodb (provided by MongoDB).
Documentation on how to use it is here:
Even though this question is a little old, here is how I connected kafka_2.12-2.6.0 to MongoDB (version 4.4) on an Ubuntu system:
a. Download the MongoDB connector '*-all.jar' from here. The mongodb-kafka connector jar with 'all' at the end also contains all of the connector's dependencies.
b. Drop this jar file into your Kafka lib folder
c. Configure '' as:
d. Configure '' as:
Place both 'properties' files here: $HOME/Documents/kafka/config
e. Start connector-process, as:
export folder_path="$HOME/Documents/kafka/config" $folder_path/ $folder_path/
f. In Kafka, start the ZooKeeper server and the Kafka server. Create the topic 'test'. In the mongod server, create the database 'test_kafka' and, under it, a collection named 'transaction'.
g. Start the Kafka producer: --broker-list localhost:9092 --topic test
And make an entry: {"abc" : "def" }
You should be able to see it in MongoDB (db.transaction.find()).
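The stack trace in the question shows JsonConverter failing on the plain-text record "this is a message", whereas the JSON entry above goes through. Here is a minimal Python sketch of that difference, assuming the worker's converter needs every record value to parse as JSON. The helper name is_valid_json_record is hypothetical and is not part of Kafka or the MongoDB connector; it only illustrates a check you could run on a payload before producing it.

```python
import json


def is_valid_json_record(value: bytes) -> bool:
    """Return True if the raw record value parses as JSON (hypothetical helper)."""
    try:
        json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return True


# The record from the stack trace is plain text, so the check rejects it.
plain_text_ok = is_valid_json_record(b"this is a message")

# The entry from the steps above is a JSON object, so the check accepts it.
json_object_ok = is_valid_json_record(b'{"abc" : "def" }')
```

If a topic already holds non-JSON records, a sink configured with a JSON converter will likely hit the same error again when it reads them.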

Can't run spark-shell: java.lang.NoSuchMethodError: org.apache.spark.repl.SparkILoop.mumly

hadoop@youngv-VirtualBox:/usr/local/spark$ ./bin/spark-shell
18/11/30 23:32:38 WARN Utils: Your hostname, youngv-VirtualBox resolves to a loopback address:; using instead (on interface enp0s3)
18/11/30 23:32:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/11/30 23:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" java.lang.NoSuchMethodError:
at org.apache.spark.repl.SparkILoop$$anonfun$process$$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1(SparkILoop.scala:199)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:267)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:247)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.withSuppressedSettings$1(SparkILoop.scala:235)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.startup$1(SparkILoop.scala:247)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:282)
at org.apache.spark.repl.SparkILoop.runClosure(SparkILoop.scala:159)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:182)
at org.apache.spark.repl.Main$.doMain(Main.scala:78)
at org.apache.spark.repl.Main$.main(Main.scala:58)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
This error appears when I try to run spark-shell.
Versions: spark-2.4.0, scala-2.11.12, jdk-1.8
Could anyone tell me how to solve this problem? I would be very grateful.
There might be a different version of a jar on the assembly classpath. Remove it and try building again.
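One way to look for such a conflict is to list the jar files and flag any artifact that appears with more than one version. The Python sketch below does that under two assumptions: jar names follow the common <artifact>-<version>.jar pattern, and all jars sit directly in one directory. The function names and the file names in the usage note are made up for illustration and do not come from the question.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <artifact>-<version>.jar, where the version starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_conflicting_jars(jar_names):
    """Group jar file names by artifact and keep artifacts seen with several versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:  # names that do not fit the assumed pattern are ignored
            versions[match.group("artifact")].add(match.group("version"))
    return {
        artifact: sorted(found)
        for artifact, found in versions.items()
        if len(found) > 1
    }


def scan_directory(jars_dir):
    """Run find_conflicting_jars over every *.jar file directly inside jars_dir."""
    return find_conflicting_jars(path.name for path in Path(jars_dir).glob("*.jar"))
```

For example, passing the hypothetical names example-lib-1.0.jar and example-lib-1.1.jar to find_conflicting_jars reports example-lib with both versions. To check a real installation, call scan_directory on the directory that holds the jars on your classpath.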

PySpark warning messages and could not connect to the SparkContext

I ran bin/pyspark for some practice, but the console shows the error below.
[dst@localhost bin]$ ./pyspark
Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/07 01:45:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 01:45:41 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to '').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
17/02/07 01:45:41 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '' as a work-around.
17/02/07 01:45:41 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '' as a work-around.
17/02/07 01:45:41 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address:; using instead (on interface eth1)
17/02/07 01:45:41 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/usr/local/spark/latest/python/pyspark/ UserWarning: Support for Python 2.6 is deprecated as of Spark 2.0.0
warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
Traceback (most recent call last):
File "/usr/local/spark/latest/python/pyspark/", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/latest/python/pyspark/sql/", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/latest/python/lib/", line 1133, in __call__
File "/usr/local/spark/latest/python/pyspark/sql/", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
As a result, I cannot use the SparkContext (the sc variable) to run RDD operations. I tried googling it but could not find a suitable solution. Could you help me get pyspark working normally?
(My Spark version is 2.1.0.)
You need to create your SparkSession with .enableHiveSupport().
This error means the Hive session could not be started.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Application name").enableHiveSupport().getOrCreate()

What do WARN messages mean when starting spark-shell?

When I start spark-shell, I get a bunch of WARN messages that I cannot understand. Are there any important problems I should take care of? Is there a configuration I missed? Or are these WARN messages normal?
cliu@cliu-ubuntu:Apache-Spark$ spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
Using Spark's repl log4j profile: org/apache/spark/
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address:; using (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/30 11:43:55 WARN MetricsSystem: Using default name DAGScheduler for source because is not set.
Spark context available as sc.
15/11/30 11:43:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:43:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:11 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/30 11:44:11 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/30 11:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/30 11:44:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:27 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/30 11:44:27 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
This one:
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address:; using (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
means that the hostname the driver worked out for itself is not routable, so no remote connections are possible. In a local environment this is not an issue, but in a multi-machine configuration Spark won't work properly. Hence the WARN message: it may or may not be a problem. Just a heads-up.
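To see for yourself whether a hostname resolves to a loopback address, you can run a small check outside Spark. The Python sketch below is only an approximation of the idea behind the warning, not Spark's actual implementation. Both function names are hypothetical.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the IP address string is a loopback address (e.g. 127.0.0.0/8)."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback(hostname=None) -> bool:
    """Resolve the hostname (default: this machine's) and test the result for loopback."""
    hostname = hostname or socket.gethostname()
    return is_loopback(socket.gethostbyname(hostname))
```

Calling hostname_resolves_to_loopback() with no argument checks the current machine. A True result matches the situation the warning describes, which is typically caused by the hostname being mapped to a 127.x.x.x address in /etc/hosts.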
These log messages are absolutely normal. Here BoneCP tries to bind to a JDBC connection, which is why you receive these warnings. In any case, if you would like to manage the log records, you can set the logging level by copying the <spark-path>/conf/ file to <spark-path>/conf/ and making your changes there.
Lastly, a similar answer for logging level can be found here:
How to stop messages displaying on spark console?
Adding to @Jacek Laskowski's answer, with respect to the SPARK_LOCAL_IP warning:
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address:; using (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
I encountered the same warning when running spark-shell against a standalone Spark cluster on an Ubuntu 20.04 server. As expected, setting the SPARK_LOCAL_IP environment variable to $(hostname) made the warning go away, but although the application ran without issues, the worker GUI was not reachable on port 4040.
To fix this, we had to set SPARK_LOCAL_HOSTNAME instead of SPARK_LOCAL_IP. With that change the warning was gone and the worker GUI became accessible through port 4040.
I couldn't find information about this variable in Spark documentation, but according to Spark's source code it is used for setting a custom local machine URI:

Spark worker throwing ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId

I am trying to run a simple example app with Spark, submitting the job with spark-submit:
spark-submit --class "SimpleJob" --master spark://:7077 target/scala-2.10/simple-project_2.10-1.0.jar
15/03/08 23:21:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/08 23:21:53 WARN LoadSnappy: Snappy native library not loaded
15/03/08 23:22:09 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Lines with a: 21, Lines with b: 21
The job gives correct results but prints the following errors after them:
15/03/08 23:22:28 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(<>,53628)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
15/03/08 23:22:28 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(<>,53628) not found
15/03/08 23:22:28 WARN ConnectionManager: All connections not cleaned up
Following is the spark-defaults.conf
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.master spark://<master-ip>:7077
spark.eventLog.enabled true
spark.executor.extraClassPath $SPARK-HOME/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
spark.cassandra.connection.conf.factory com.datastax.spark.connector.cql.DefaultConnectionFactory
spark.cassandra.auth.conf.factory com.datastax.spark.connector.cql.DefaultAuthConfFactory
spark.cassandra.query.retry.count 10
Following is the
SPARK_LOCAL_IP=<master-ip in master worker-ip in workers>
I found the answer to this.
Even though I was adding the Cassandra connector to the classpath on the command line, I was not sending the same path to all nodes of the cluster.
Now I use the command sequence below to do it properly:
spark-shell --driver-class-path ~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.datastax.spark.connector._
After these commands I am able to run all reads and writes against my Cassandra cluster properly using Spark RDDs.