I get this error when using sparkDataset to write to S3:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter not found
you need to get the matching spark-hadoop-cloud jar from your spark release into the spark classpath, that's where the class lives
I had installed the Hadoop + Spark cluster on the servers.
It is working fine writing scala code in the spark-shell on the master server.
I put the Spark library (the jar files) in my project and I'm writing my first Scala code on my computer through Intellij.
When I run a simple code that just creates a SparkContext object for reading a file from the HDFS through the hdfs protocol, it outputs error messages.
The test function:
import org.apache.spark.SparkContext
class SpcDemoProgram {
def demoPrint(): Unit ={
println("class spe demoPrint")
test()
}
def test(){
var spark = new SparkContext();
}
}
The messages is:
20/11/02 12:36:26 INFO SparkContext: Running Spark version 3.0.0
20/11/02 12:36:26 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException:
HADOOP_HOME and hadoop.home.dir are unset. -see
https://wiki.apache.org/hadoop/WindowsProblems at
org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548) at
org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569) at
org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592) at
org.apache.hadoop.util.Shell.(Shell.java:689) at
org.apache.hadoop.util.StringUtils.(StringUtils.java:78) at
org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1664)
at
org.apache.hadoop.security.SecurityUtil.setConfigurationInternal(SecurityUtil.java:104)
at
org.apache.hadoop.security.SecurityUtil.(SecurityUtil.java:88)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:316)
at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:304)
at
org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1828)
at
org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
at
org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412)
at scala.Option.getOrElse(Option.scala:189) at
org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412) at
org.apache.spark.SparkContext.(SparkContext.scala:303) at
org.apache.spark.SparkContext.(SparkContext.scala:120) at
scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14) at
scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9) at
scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50) at
scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala) Caused by:
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are
unset. at
org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468) at
org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439) at
org.apache.hadoop.util.Shell.(Shell.java:516) ... 19 more
20/11/02 12:36:26 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable 20/11/02 12:36:27 ERROR SparkContext: Error initializing
SparkContext. org.apache.spark.SparkException: A master URL must be
set in your configuration at
org.apache.spark.SparkContext.(SparkContext.scala:380) at
org.apache.spark.SparkContext.(SparkContext.scala:120) at
scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14) at
scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9) at
scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50) at
scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala) 20/11/02
12:36:27 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master
URL must be set in your configuration at
org.apache.spark.SparkContext.(SparkContext.scala:380) at
org.apache.spark.SparkContext.(SparkContext.scala:120) at
scala.spc.demo.SpcDemoProgram.test(SpcDemoProgram.scala:14) at
scala.spc.demo.SpcDemoProgram.demoPrint(SpcDemoProgram.scala:9) at
scala.spc.demo.SpcDemoProgram$.main(SpcDemoProgram.scala:50) at
scala.spc.demo.SpcDemoProgram.main(SpcDemoProgram.scala)
Does that error message imply that Hadoop and Spark must be installed on my computer?
What configuration do I need to do?
I assume, you are trying to read the file with the path as hdfs://<FILE_PATH> then yes you need to have Hadoop installed else if its just a local directory you could try without "hdfs://" in the file path.
I am trying to run Python Spark Structured Streaming + Kafka, when I run the command
Master#MacBook-Pro spark-3.0.0-preview2-bin-hadoop2.7 % bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 \
examples/src/main/python/sql/streaming/structured_kafka_wordcount.py \
/Users/Master/Projects/bank_kafka_spark/spark_job1.py localhost:9092 transaction
receiving next
20/04/22 13:06:04 WARN Utils: Your hostname, MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.103 instead (on interface en0)
20/04/22 13:06:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/jars/spark-unsafe_2.12-3.0.0-preview2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /Users/Master/.ivy2/cache
The jars for the packages stored in: /Users/Master/.ivy2/jars
:: loading settings :: url = jar:file:/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cd5905ea-5f80-4b14-995d-6ba03a353bb0;1.0
confs: [default]
found org.apache.spark#spark-sql-kafka-0-10_2.12;2.4.5 in central
found org.apache.kafka#kafka-clients;2.0.0 in central
found org.lz4#lz4-java;1.4.0 in central
found org.xerial.snappy#snappy-java;1.1.7.3 in central
found org.slf4j#slf4j-api;1.7.16 in central
found org.spark-project.spark#unused;1.0.0 in local-m2-cache
:: resolution report :: resolve 315ms :: artifacts dl 6ms
:: modules in use:
org.apache.kafka#kafka-clients;2.0.0 from central in [default]
org.apache.spark#spark-sql-kafka-0-10_2.12;2.4.5 from central in [default]
org.lz4#lz4-java;1.4.0 from central in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
org.xerial.snappy#snappy-java;1.1.7.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 6 | 0 | 0 | 0 || 6 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-cd5905ea-5f80-4b14-995d-6ba03a353bb0
confs: [default]
0 artifacts copied, 6 already retrieved (0kB/6ms)
20/04/22 13:06:04 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
20/04/22 13:06:04 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path: [/Users/Master/Library/Java/Extensions, /Library/Java/Extensions, /Network/Library/Java/Extensions, /System/Library/Java/Extensions, /usr/lib/java, .]
20/04/22 13:06:04 DEBUG NativeCodeLoader: java.library.path=/Users/Master/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
20/04/22 13:06:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/examples/src/main/python/sql/streaming/structured_kafka_wordcount.py", line 68, in <module>
.option(subscribeType, topics)\
File "/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 406, in load
File "/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
File "/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/Users/Master/Projects/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:821)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:719)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:642)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:600)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:416)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1210)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1221)
at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1265)
at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1300)
at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1385)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:170)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 43 more
I use example from PySpark examples/src/main/python/sql/streaming/structured_kafka_wordcount.py.
structured_kafka_wordcount.py.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
Consumes messages from one or more topics in Kafka and does wordcount.
Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
<bootstrap-servers> The Kafka "bootstrap.servers" configuration. A
comma-separated list of host:port.
<subscribe-type> There are three kinds of type, i.e. 'assign', 'subscribe',
'subscribePattern'.
|- <assign> Specific TopicPartitions to consume. Json string
| {"topicA":[0,1],"topicB":[2,4]}.
|- <subscribe> The topic list to subscribe. A comma-separated list of
| topics.
|- <subscribePattern> The pattern used to subscribe to topic(s).
| Java regex string.
|- Only one of "assign, "subscribe" or "subscribePattern" options can be
| specified for Kafka source.
<topics> Different value format depends on the value of 'subscribe-type'.
Run the example
`$ bin/spark-submit examples/src/main/python/sql/streaming/structured_kafka_wordcount.py \
host1:port1,host2:port2 subscribe topic1,topic2`
"""
from __future__ import print_function
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
if len(sys.argv) != 4:
print("""
Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
""", file=sys.stderr)
sys.exit(-1)
bootstrapServers = sys.argv[1]
subscribeType = sys.argv[2]
topics = sys.argv[3]
spark = SparkSession\
.builder\
.appName("StructuredKafkaWordCount")\
.getOrCreate()
# Create DataSet representing the stream of input lines from kafka
lines = spark\
.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", bootstrapServers)\
.option(subscribeType, topics)\ # HERE IT STOPS AND RETURNS ERROR
.load()\
.selectExpr("CAST(value AS STRING)")
# Split the lines into words
words = lines.select(
# explode turns each item in an array into a separate row
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts\
.writeStream\
.outputMode('complete')\
.format('console')\
.start()
query.awaitTermination()
Kafka server is runing, topic was created.
Java version 13.0.2
Scala 2.13.1
Kafka 2.12-2.4.1
Spark spark-3.0.0-preview2-bin-hadoop2.7
What is the problem?
I was having the exact same issue too until I realized I was adding the wrong dependency!
Instead of:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5
Use:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2
org.apache.spark.sql.sources.v2.StreamWriteSupport class is no longer part for Spark-Sql version 3.
But some pyspark libraries are still trying to load the class which causes above exception.
Should be a Spark:3.0.0 bug
THere https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying states that :
spark-sql-kafka-0-10_2.12 and its dependencies can be directly added
to spark-submit using --packages
You gotta make sure that your PySpark version is compatible with the Kafka version you're setting as a dependency. For me it was: Spark 3.3.0 -> org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1.
a newbie in apache spark here! I am using Spark 2.4.0 and Scala version 2.11.12, and I'm trying to run the following code in my spark shell -
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error -
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.
hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java
:62)
I also saw this Issues trying out example in Spark-shell and as per the accepted answer, I have tried to start my spark shell like so,
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse, however, it did not help and the issue persists.
I used this code
My error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 20:39:24 INFO SparkContext: Running Spark version 2.1.0
17/02/03 20:39:25 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/02/03 20:39:25 WARN SparkConf: Detected deprecated memory fraction
settings: [spark.storage.memoryFraction]. As of Spark 1.6, execution and
storage memory management are unified. All memory fractions used in the old
model are now deprecated and no longer read. If you wish to use the old
memory management, you may explicitly enable `spark.memory.useLegacyMode`
(not recommended).
17/02/03 20:39:25 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
17/02/03 20:39:25 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at PCA$.main(PCA.scala:26)
at PCA.main(PCA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
If you are running spark stand alone then
val conf = new SparkConf().setMaster("spark://master") //missing
and you can pass parameter while submit job
spark-submit --master spark://master
If you are running spark local then
val conf = new SparkConf().setMaster("local[2]") //missing
you can pass parameter while submit job
spark-submit --master local
if you are running spark on yarn then
spark-submit --master yarn
Error message is pretty clear, you have to provide the address of the Spark Master node, either via the SparkContext or via spark-submit:
val conf =
new SparkConf()
.setAppName("ClusterScore")
.setMaster("spark://172.1.1.1:7077") // <--- This is what's missing
.set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)
SparkConf configuration = new SparkConf()
.setAppName("Your Application Name")
.setMaster("local");
val sc = new SparkContext(conf);
It will work...
Most probably you are using Spark 2.x API in Java.
Use code snippet like this to avoid this error. This is true when you are running Spark standalone on your computer using Shade plug-in which will import all the runtime libraries on your computer.
SparkSession spark = SparkSession.builder()
.appName("Spark-Demo")//assign a name to the spark application
.master("local[*]") //utilize all the available cores on local
.getOrCreate();