PySpark CLI Error

I need to connect to Hive from PySpark, and I'm trying to run pyspark from the CLI. My spark-env.sh:
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
export SPARK_MASTER_IP=MY_IP
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
I'm facing the exception below when running pyspark:
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/28 15:29:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/28 15:29:51 WARN util.Utils: Your hostname, PC_NAME resolves to a loopback address: 127.0.1.1; using MY_IP instead (on interface eth1)
18/03/28 15:29:51 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/03/28 15:29:52 WARN client.StandaloneAppClient$ClientEndpoint: Failed to connect to master 127.0.0.1:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
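For context, this is roughly the session I am trying to end up with once the connection works (a minimal sketch; the app name is arbitrary and spark://MY_IP:7077 stands in for the real master URL from spark-env.sh):
from pyspark.sql import SparkSession

# Placeholder master URL matching SPARK_MASTER_IP / SPARK_MASTER_PORT above.
spark = SparkSession.builder \
    .appName("hive-from-pyspark") \
    .master("spark://MY_IP:7077") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()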

Related

How to stream data from Kafka to MongoDB with a Kafka connector

I want to stream data from Kafka to MongoDB using a Kafka connector.
I found this one: https://github.com/hpgrahsl/kafka-connect-mongodb, but it doesn't include step-by-step instructions.
After googling, everything seems to lead to Confluent Platform, which I don't want to use.
Could anyone share a document or guideline on how to use kafka-connect-mongodb without Confluent Platform, or another Kafka connector to stream data from Kafka to MongoDB?
Thank you in advance.
What I tried
Step 1: I downloaded mongo-kafka-connect-0.1-all.jar from Maven Central.
Step 2: Copied the jar file to a new plugins folder inside Kafka (I use Kafka on Windows, so the directory is D:\git\1.libraries\kafka_2.12-2.2.0\plugins).
Step 3: Edited connect-standalone.properties by adding a new line:
plugin.path=/git/1.libraries/kafka_2.12-2.2.0/plugins
Step 4: Added a new config file for the MongoDB sink, MongoSinkConnector.properties:
name=mongo-sink
topics=test
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
key.ignore=true
# Specific global MongoDB Sink Connector configuration
connection.uri=mongodb://localhost:27017,mongo1:27017,mongo2:27017,mongo3:27017
database=test_kafka
collection=transaction
max.num.retries=3
retries.defer.timeout=5000
type.name=kafka-connect
Step 5: Ran the command bin\windows\connect-standalone.bat config\connect-standalone.properties config\MongoSinkConnector.properties
But I get this error:
[2019-07-09 10:19:09,466] WARN The configuration 'offset.flush.interval.ms' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,467] WARN The configuration 'key.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,467] WARN The configuration 'offset.storage.file.filename' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,468] WARN The configuration 'value.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,469] WARN The configuration 'plugin.path' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,469] WARN The configuration 'value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
[2019-07-09 10:19:09,470] WARN The configuration 'key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.admin.AdminClientConfig)
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will be ignored.
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.RootResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.RootResource will be ignored.
Jul 09, 2019 10:19:10 AM org.glassfish.jersey.internal.inject.Providers checkProviderRuntime
WARNING: A provider org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be ignored.
Jul 09, 2019 10:19:11 AM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.
[2019-07-09 10:19:12,302] ERROR WorkerSinkTask{id=mongo-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error:
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:344)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'this': was expecting 'null', 'true', 'false' or NaN
at [Source: (byte[])"this is a message"; line: 1, column: 6]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'this': was expecting 'null', 'true', 'false' or NaN
at [Source: (byte[])"this is a message"; line: 1, column: 6]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:703)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3532)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3508)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._matchToken2(UTF8StreamJsonParser.java:2843)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._matchTrue(UTF8StreamJsonParser.java:2777)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:807)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:729)
at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4042)
at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2571)
at org.apache.kafka.connect.json.JsonDeserializer.deserialize(JsonDeserializer.java:50)
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:342)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2019-07-09 10:19:12,305] ERROR WorkerSinkTask{id=mongo-sink-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
What configuration did I set wrong, or did I miss anything?
I fixed it. Now I can stream data from Kafka to MongoDB successfully.
My fix was to:
move my Kafka installation to C:\kafka_2.12-2.2.0
update plugin.path to match the new path
update the config file connect-standalone.properties
There is an official source and sink connector from MongoDB themselves. It is available on Confluent Hub: https://www.confluent.io/hub/mongodb/kafka-connect-mongodb
If you don't want to use Confluent Platform, you can deploy Apache Kafka yourself - it already includes Kafka Connect. Which plugins (connectors) you use with it is up to you. In this case you would be using Kafka Connect (part of Apache Kafka) plus kafka-connect-mongodb (provided by MongoDB).
Documentation on how to use it is here: https://docs.mongodb.com/kafka-connector/current/
Even though this question is a little old, here is how I connected kafka_2.12-2.6.0 to MongoDB (version 4.4) on an Ubuntu system:
a. Download the MongoDB connector '*-all.jar' from here. The mongodb-kafka connector jar ending in 'all' also contains all of the connector's dependencies.
b. Drop this jar file into your Kafka lib folder.
c. Configure 'connect-standalone_bare.properties' as:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
d. Configure 'MongoSinkConnector.properties' as:
name=mongo-sink
topics=test
connector.class=com.mongodb.kafka.connect.MongoSinkConnector
tasks.max=1
key.ignore=true
connection.uri=mongodb://localhost:27017
database=test_kafka
collection=transaction
max.num.retries=3
retries.defer.timeout=5000
type.name=kafka-connect
schemas.enable=false
Place both 'properties' files here: $HOME/Documents/kafka/config
e. Start the connector process:
export folder_path="$HOME/Documents/kafka/config"
connect-standalone.sh $folder_path/connect-standalone_bare.properties $folder_path/MongoSinkConnector.properties
f. In Kafka, start the zookeeper server and the kafka server, and create the topic 'test'. On the mongod server, create the database 'test_kafka' and, under it, a collection 'transaction'.
g. Start the Kafka producer:
kafka-console-producer.sh --broker-list localhost:9092 --topic test
And make an entry: {"abc" : "def" }
You should be able to see it in MongoDB (db.transaction.find()).
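As an optional end-to-end check, here is a small Python sketch (assuming the kafka-python and pymongo packages, which are not part of the setup above) that publishes a JSON record to the 'test' topic and then looks it up in the test_kafka.transaction collection:
import json
from kafka import KafkaProducer   # pip install kafka-python (assumed)
from pymongo import MongoClient   # pip install pymongo (assumed)

# The sink uses JsonConverter with schemas.enable=false, so the message body
# must be plain JSON; a bare string like "this is a message" will not parse.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("test", {"abc": "def"})
producer.flush()

# Give the connector a moment to deliver, then check MongoDB.
client = MongoClient("mongodb://localhost:27017")
print(client["test_kafka"]["transaction"].find_one({"abc": "def"}))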

Can't run spark-shell! java.lang.NoSuchMethodError: org.apache.spark.repl.SparkILoop.mumly

hadoop#youngv-VirtualBox:/usr/local/spark$ ./bin/spark-shell
18/11/30 23:32:38 WARN Utils: Your hostname, youngv-VirtualBox resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface enp0s3)
18/11/30 23:32:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/11/30 23:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.repl.SparkILoop.mumly(Lscala/Function0;)Ljava/lang/Object;
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1(SparkILoop.scala:199)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:267)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:247)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.withSuppressedSettings$1(SparkILoop.scala:235)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.startup$1(SparkILoop.scala:247)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:282)
at org.apache.spark.repl.SparkILoop.runClosure(SparkILoop.scala:159)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:182)
at org.apache.spark.repl.Main$.doMain(Main.scala:78)
at org.apache.spark.repl.Main$.main(Main.scala:58)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
This error appears when I try to run spark-shell,
with: spark-2.4.0, scala-2.11.12, jdk-1.8.
Could anyone tell me how to solve this problem? I would be very grateful.
There might be a conflicting jar version on the assembly classpath; remove it and try building it again.

PySpark warning messages and can't connect the SparkContext

I ran ./bin/pyspark to do some practice, but the console throws an error as shown below.
[dst#localhost bin]$ ./pyspark
Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/07 01:45:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/07 01:45:41 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to '').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
17/02/07 01:45:41 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '' as a work-around.
17/02/07 01:45:41 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to '' as a work-around.
17/02/07 01:45:41 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth1)
17/02/07 01:45:41 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/usr/local/spark/latest/python/pyspark/context.py:194: UserWarning: Support for Python 2.6 is deprecated as of Spark 2.0.0
warnings.warn("Support for Python 2.6 is deprecated as of Spark 2.0.0")
Traceback (most recent call last):
File "/usr/local/spark/latest/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/latest/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/latest/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
Therefore, I cannot connect to the SparkContext (the sc variable) to run RDD operations. I tried to google it but failed to find an appropriate solution. Could you help me use pyspark normally?
(My Spark version is 2.1.0)
You need to launch your SparkSession with .enableHiveSupport().
This error relates to not being able to start the Hive session.
spark = SparkSession.builder.appName("Application name").enableHiveSupport().getOrCreate()
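A slightly fuller sketch of the same idea as a standalone script (the app name and the sample RDD operation are arbitrary):
from pyspark.sql import SparkSession

# enableHiveSupport() requests the Hive-backed catalog (the HiveSessionState
# mentioned in the traceback).
spark = SparkSession.builder \
    .appName("Application name") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext   # the sc handle the shell would normally give you
print(sc.parallelize(range(10)).sum())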

What do WARN messages mean when starting spark-shell?

When starting my spark-shell, I get a bunch of WARN messages that I cannot understand. Are there any important problems that I should take care of, or any configuration that I missed? Or are these WARN messages normal?
cliu#cliu-ubuntu:Apache-Spark$ spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.2
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address: 127.0.1.1; using xxx.xxx.xxx.xx (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/11/30 11:43:55 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/30 11:43:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:43:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:11 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/30 11:44:11 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/30 11:44:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/30 11:44:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/30 11:44:27 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/30 11:44:27 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
scala>
This one:
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address: 127.0.1.1; using xxx.xxx.xxx.xx (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
means that the hostname the driver managed to figure out for itself is not routable and hence no remote connections are allowed. In your local environment this is not an issue, but if you go for a multi-machine configuration, Spark won't work properly. Hence the WARN message: it may or may not be an issue. Just a heads-up.
The logging info is absolutely normal. Here BoneCP is trying to bind to a JDBC connection, which is why you receive these warnings. In any case, if you would like to manage the log records, you can set the logging level by copying the <spark-path>/conf/log4j.properties.template file to <spark-path>/conf/log4j.properties and making your changes there.
Lastly, a similar answer about the logging level can be found here:
How to stop messages displaying on spark console?
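If you only want to change the level at runtime, the hint printed in the shell banner also works (a one-liner; "ERROR" is just an example level):
sc.setLogLevel("ERROR")   # suppresses everything below ERROR for this session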
Adding to @Jacek Laskowski's answer, with respect to the SPARK_LOCAL_IP warning:
15/11/30 11:43:54 WARN Utils: Your hostname, cliu-ubuntu resolves to a loopback address: 127.0.1.1; using xxx.xxx.xxx.xx (`here I hide my IP`) instead (on interface wlan0)
15/11/30 11:43:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
I encountered the same thing running spark-shell on a standalone Spark cluster on an Ubuntu 20.04 server. As expected, setting the SPARK_LOCAL_IP environment variable to $(hostname) made the warning go away, but while the application ran without issues, the worker GUI was not reachable on port 4040.
To fix this, we had to set SPARK_LOCAL_HOSTNAME instead of SPARK_LOCAL_IP. With this, the warning was gone and the worker GUI became accessible through port 4040.
I couldn't find information about this variable in the Spark documentation, but according to Spark's source code it is used for setting a custom local machine URI: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1058

Spark worker throwing ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId

I am trying to execute a simple example app with Spark, submitting the job with spark-submit:
spark-submit --class "SimpleJob" --master spark://:7077 target/scala-2.10/simple-project_2.10-1.0.jar
15/03/08 23:21:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/08 23:21:53 WARN LoadSnappy: Snappy native library not loaded
15/03/08 23:22:09 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Lines with a: 21, Lines with b: 21
The job gives the correct results, but produces the following errors after them:
15/03/08 23:22:28 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(<worker-host.domain.com>,53628)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:205)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/08 23:22:28 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(<worker-host.domain.com>,53628) not found
15/03/08 23:22:28 WARN ConnectionManager: All connections not cleaned up
The following is my spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.master spark://<master-ip>:7077
spark.eventLog.enabled true
spark.executor.extraClassPath $SPARK-HOME/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
spark.cassandra.connection.conf.factory com.datastax.spark.connector.cql.DefaultConnectionFactory
spark.cassandra.auth.conf.factory com.datastax.spark.connector.cql.DefaultAuthConfFactory
spark.cassandra.query.retry.count 10
The following is my spark-env.sh:
SPARK_LOCAL_IP=<master-ip in master worker-ip in workers>
SPARK_MASTER_HOST='<master-hostname>'
SPARK_MASTER_IP=<master-ip>
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=4
I got an answer to this.
Even though I was adding the Cassandra connector to the classpath with the command, I was not sending the same jar to all nodes of the cluster.
Now I am using the command sequence below to do it properly:
spark-shell --driver-class-path ~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.datastax.spark.connector._
sc.addJar("~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar")
After these commands, I am able to run all reads and writes against my Cassandra cluster properly using Spark RDDs.
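For PySpark jobs, a rough equivalent is sketched below (the app name is arbitrary, <master-ip> is the same placeholder used in spark-defaults.conf above, and the connector path is the same one used in the commands above); spark.jars plays a similar role to passing --jars and ships the jar to every node:
from pyspark import SparkConf, SparkContext

# spark.jars distributes the listed jar to the driver and all executors, so
# the Cassandra connector classes are on the classpath cluster-wide.
conf = SparkConf() \
    .setAppName("simple-cassandra-job") \
    .setMaster("spark://<master-ip>:7077") \
    .set("spark.jars",
         "~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar")
sc = SparkContext(conf=conf)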