java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame - scala

I am running my Spark code to save data to HBase on Amazon EMR 5.8.0, which has Spark 2.2.0 installed.
It runs fine in IntelliJ, but on the EMR cluster it throws this error:
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
Code
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val zookeeperQuorum = args(0)
val tableName = args(1)
val inputPath = args(2)

val spark = SparkSession.builder
  .appName("PhoenixSpark")
  .getOrCreate

val df = spark.read
  .option("delimiter", "\001")
  .csv(inputPath)

val hBaseDf = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", tableName)
  .option("zkUrl", zookeeperQuorum)
  .load()

val tableSchema = hBaseDf.schema

val rowKeyDf = df.withColumn("row_key", concat(col("_c3"), lit("_"), col("_c5"), lit("_"), col("_c0")))
rowKeyDf.createOrReplaceTempView("mytable")

val correctedDf = spark.sql("Select row_key, _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7," +
  "_c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19 from mytable")

val rdd = correctedDf.rdd
val finalDf = spark.createDataFrame(rdd, tableSchema)

finalDf.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", tableName)
  .option("zkUrl", zookeeperQuorum)
  .save()

spark.stop()
My pom.xml, which correctly specifies the Spark version as 2.2.0:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.myntra.analytics</groupId>
<artifactId>com.myntra.analytics</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- "package" command plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>4.11.0-HBase-1.3</version>
<scope>provided</scope>
</dependency>
</dependencies>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
</project>
Here is the stack trace from the EMR logs showing the error.
17/09/28 23:20:18 ERROR ApplicationMaster: User class threw exception:
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1475)
at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:498)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:131)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:77)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:415)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at com.mynra.analytics.chronicles.PhoenixSpark$.main(PhoenixSpark.scala:29)
at com.mynra.analytics.chronicles.PhoenixSpark.main(PhoenixSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more

We encountered the same problem on Hortonworks HDP 2.6.3. The cause appears to be a classpath conflict for the org.apache.phoenix.spark classes. In HDP, the Spark 1.6 version of this package is included in phoenix-client.jar; you need to override it by placing the Spark2-specific plugin, phoenix-spark2.jar, in front of it on the classpath:
/usr/hdp/current/spark2-client/bin/spark-submit --master yarn-client \
  --num-executors 2 --executor-cores 2 --driver-memory 3g --executor-memory 3g \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf" \
  --class com.example.test phoenix_test.jar
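You can confirm which jar actually wins from spark-shell. As a rough check (this assumes the provider class org.apache.phoenix.spark.DefaultSource, which is what Spark resolves when you call format("org.apache.phoenix.spark")), ask the JVM where the class was loaded from:

val src = Class.forName("org.apache.phoenix.spark.DefaultSource")
println(src.getProtectionDomain.getCodeSource.getLocation)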

I ran into this same issue and was even seeing it with spark-shell, without any of my custom code. After some wrangling I think it is an issue with the Phoenix jars that are included with EMR 5.8 (and 5.9). I have no idea why their Phoenix client jar still has a class reference to org.apache.spark.sql.DataFrame, since in Spark 2.0.0 that class was changed to a type alias for Dataset[Row]. (Especially since their Phoenix jars claim to be 4.11, which should have this issue fixed.)
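For context: in Spark 2.x, DataFrame is only a type alias defined in the org.apache.spark.sql package object, so no org/apache/spark/sql/DataFrame class file exists on disk anymore. A minimal illustration, independent of Phoenix:

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder.appName("alias-demo").master("local[*]").getOrCreate()
// In Spark 2.x: type DataFrame = Dataset[Row]. Jars compiled against Spark 1.x
// still reference the old DataFrame class and fail at load time with NoClassDefFoundError.
val df: DataFrame = spark.range(3).toDF("n")
val ds: Dataset[Row] = df // same type, so this assignment needs no conversion
spark.stop()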
Below is what fixed it for me. I suspect you could also monkey with your build to not use the provided Phoenix but to bundle your local version into your assembly jar instead.
What I did to get around it:
I. I copied my local Phoenix client jar to S3 (I had a 4.10 version lying around).
II. I wrote a simple install shell script and also put it on S3:
#!/bin/bash
aws s3 cp s3://<YOUR_BUCKET_GOES_HERE>/phoenix-4.10.0-HBase-1.2-client.jar /home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar
III. I created a bootstrap action that simply runs the shell script from step II.
IV. I created a JSON file that puts the jar on the driver and executor classpaths via the spark-defaults classification, and put it in S3 as well:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.extraClassPath": "/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
When I went to create my cluster I referenced the full S3 path to my JSON file in the "Edit software settings (optional)" -> "Load JSON from S3" section of the AWS console.
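Once the cluster is up, one quick way to confirm the classification actually took effect is to print the two properties from spark-shell (sc is the SparkContext the shell provides; the property names are the ones from the JSON above):

println(sc.getConf.get("spark.driver.extraClassPath"))
println(sc.getConf.get("spark.executor.extraClassPath"))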
With the cluster up, I fired up a spark-shell. Below you can see the output from it, including the verbose classpath info showing my jar being used, and a DataFrame being successfully loaded.
[hadoop@ip-10-128-7-183 ~]$ spark-shell -v
Using properties file: /usr/lib/spark/conf/spark-defaults.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Adding default property: spark.sql.warehouse.dir=hdfs:///user/spark/warehouse
Adding default property: spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
Adding default property: spark.history.fs.logDirectory=hdfs:///var/log/spark/apps
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
Adding default property: spark.yarn.historyServer.address=ip-10-128-7-183.columbuschildrens.net:18080
Adding default property: spark.stage.attempt.ignoreOnDecommissionFetchFailure=true
Adding default property: spark.resourceManager.cleanupExpiredHost=true
Adding default property: spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS=$(hostname -f)
Adding default property: spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
Adding default property: spark.master=yarn
Adding default property: spark.blacklist.decommissioning.timeout=1h
Adding default property: spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
Adding default property: spark.sql.hive.metastore.sharedPrefixes=com.amazonaws.services.dynamodbv2
Adding default property: spark.executor.memory=6144M
Adding default property: spark.driver.extraClassPath=/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Adding default property: spark.eventLog.dir=hdfs:///var/log/spark/apps
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.executor.extraClassPath=/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Adding default property: spark.executor.cores=1
Adding default property: spark.history.ui.port=18080
Adding default property: spark.blacklist.decommissioning.enabled=true
Adding default property: spark.hadoop.yarn.timeline-service.enabled=false
Parsed arguments:
master yarn
deployMode null
executorMemory 6144M
executorCores 1
totalExecutorCores null
propertiesFile /usr/lib/spark/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath /home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
driverExtraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
driverExtraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass org.apache.spark.repl.Main
primaryResource spark-shell
name Spark shell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf:
(spark.blacklist.decommissioning.timeout,1h)
(spark.blacklist.decommissioning.enabled,true)
(spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.executor.memory,6144M)
(spark.sql.warehouse.dir,hdfs:///user/spark/warehouse)
(spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.yarn.historyServer.address,ip-10-128-7-183.columbuschildrens.net:18080)
(spark.eventLog.enabled,true)
(spark.history.ui.port,18080)
(spark.stage.attempt.ignoreOnDecommissionFetchFailure,true)
(spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f))
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.resourceManager.cleanupExpiredHost,true)
(spark.shuffle.service.enabled,true)
(spark.history.fs.logDirectory,hdfs:///var/log/spark/apps)
(spark.driver.extraJavaOptions,-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2)
(spark.eventLog.dir,hdfs:///var/log/spark/apps)
(spark.executor.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,1)
(spark.driver.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
Main class:
org.apache.spark.repl.Main
Arguments:
System properties:
(spark.blacklist.decommissioning.timeout,1h)
(spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.blacklist.decommissioning.enabled,true)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.executor.memory,6144M)
(spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.sql.warehouse.dir,hdfs:///user/spark/warehouse)
(spark.yarn.historyServer.address,ip-10-128-7-183.columbuschildrens.net:18080)
(spark.eventLog.enabled,true)
(spark.history.ui.port,18080)
(spark.stage.attempt.ignoreOnDecommissionFetchFailure,true)
(spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f))
(SPARK_SUBMIT,true)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.app.name,Spark shell)
(spark.resourceManager.cleanupExpiredHost,true)
(spark.shuffle.service.enabled,true)
(spark.history.fs.logDirectory,hdfs:///var/log/spark/apps)
(spark.driver.extraJavaOptions,-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.jars,)
(spark.submit.deployMode,client)
(spark.executor.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
(spark.eventLog.dir,hdfs:///var/log/spark/apps)
(spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2)
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,1)
(spark.driver.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
Classpath elements:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/11 13:36:05 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/10/11 13:36:23 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/10/11 13:36:23 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
17/10/11 13:36:23 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://ip-10-128-7-183.columbuschildrens.net:4040
Spark context available as 'sc' (master = yarn, app id = application_1507728658269_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import org.apache.spark.sql.DataFrame
var sqlContext = new SQLContext(sc);
val phoenixHost = "10.128.7.183:2181"
// Exiting paste mode, now interpreting.
warning: there was one deprecation warning; re-run with -deprecation for details
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import org.apache.spark.sql.DataFrame
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@258ff54a
phoenixHost: String = 10.128.7.183:2181
scala> val variant_hg19_df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "VARIANT_ANNOTATION_HG19", "zkUrl" -> phoenixHost))
warning: there was one deprecation warning; re-run with -deprecation for details
variant_hg19_df: org.apache.spark.sql.DataFrame = [CHROMOSOME_ID: int, POSITION: int ... 36 more fields]
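As a side note, sqlContext.load is what triggers the deprecation warnings above; the equivalent non-deprecated DataFrameReader form (same table and zkUrl options, using the shell's spark session) would be roughly:

val variantHg19Df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "VARIANT_ANNOTATION_HG19")
  .option("zkUrl", phoenixHost)
  .load()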

Related

NoSuchMethodError: org.apache.spark.sql.kafka010.consumer

I am using Spark Structured Streaming to read messages from multiple topics in Kafka.
I am getting the error below:
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.setMinEvictableIdleTime(Ljava/time/Duration;)V
Below are the Maven dependencies I am using:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>untitled</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>A Camel Scala Route</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<dependencyManagement>
<dependencies>
<!-- Camel BOM -->
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-parent</artifactId>
<version>2.25.4</version>
<scope>import</scope>
<type>pom</type>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-scala</artifactId>
</dependency>
<!-- scala -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.13.8</version>
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-xml_2.13</artifactId>
<version>2.1.0</version>
</dependency>
<!-- logging -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<scope>runtime</scope>
</dependency>
<!--spark-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.13</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.13</artifactId>
<version>3.3.0</version>
</dependency>
<!--spark Streaming kafka-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.13</artifactId>
<version>3.3.0</version>
</dependency>
<!--kafka-->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.13</artifactId>
<version>3.2.0</version>
</dependency>
<!--jackson-->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>2.13.3</version>
</dependency>
<!-- testing -->
<dependency>
<groupId>org.apache.camel</groupId>
<artifactId>camel-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<defaultGoal>install</defaultGoal>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<!-- the Maven compiler plugin will compile Java source files -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- the Maven Scala plugin will compile Scala source files -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- configure the eclipse plugin to generate eclipse project descriptors for a Scala project -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.10</version>
<configuration>
<projectnatures>
<projectnature>org.scala-ide.sdt.core.scalanature</projectnature>
<projectnature>org.eclipse.jdt.core.javanature</projectnature>
</projectnatures>
<buildcommands>
<buildcommand>org.scala-ide.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<classpathContainers>
<classpathContainer>org.scala-ide.sdt.launching.SCALA_CONTAINER</classpathContainer>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
</classpathContainers>
<excludes>
<exclude>org.scala-lang:scala-library</exclude>
<exclude>org.scala-lang:scala-compiler</exclude>
</excludes>
<sourceIncludes>
<sourceInclude>**/*.scala</sourceInclude>
<sourceInclude>**/*.java</sourceInclude>
</sourceIncludes>
</configuration>
</plugin>
<!-- allows the route to be run via 'mvn exec:java' -->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.6.0</version>
<configuration>
<mainClass>org.example.MyRouteMain</mainClass>
</configuration>
</plugin>
</plugins>
</build>
</project>
Scala version: 2.13.8
Spark version: 3.3.0
This is my code snippet to read from the Kafka topics:
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object consumerMain {
  val log: Logger = Logger.getLogger(controller.driver.getClass)

  val config: Map[String, String] = Map[String, String](
    "kafka.bootstrap.servers" -> bootstrapServer,
    "startingOffsets" -> "earliest",
    "kafka.security.protocol" -> security_protocol,
    "kafka.ssl.truststore.location" -> truststore_location,
    "kafka.ssl.truststore.password" -> password,
    "kafka.ssl.keystore.location" -> keystore_location,
    "kafka.ssl.keystore.password" -> password,
    "kafka.ssl.key.password" -> password,
    "kafka.ssl.endpoint.identification.algorithm" -> ""
  )

  def main(args: Array[String]): Unit = {
    log.info("SPARKSESSION CREATED!!!")
    val spark = SparkSession.builder()
      .appName("kafka-sample-consumer")
      .master("local")
      .getOrCreate()

    log.info("READING MESSAGES FROM KAFKA!!!")
    val kafkaMsg = spark
      .readStream
      .format("Kafka")
      .options(config)
      .option("kafka.group.id", group_id)
      .option("subscribe", "sample_topic_T")
      .load()

    kafkaMsg.printSchema()

    kafkaMsg.writeStream
      .format("console")
      //.outputMode("append")
      .start()
      .awaitTermination()
  }
}
Below are the Kafka properties I have set, as shown in the logs printed to the console:
[ main] StateStoreCoordinatorRef INFO Registered StateStoreCoordinator endpoint
[ main] ContextHandler INFO Started o.s.j.s.ServletContextHandler@6e00837f{/StreamingQuery,null,AVAILABLE,@Spark}
[ main] ContextHandler INFO Started o.s.j.s.ServletContextHandler@6a5dd083{/StreamingQuery/json,null,AVAILABLE,@Spark}
[ main] ContextHandler INFO Started o.s.j.s.ServletContextHandler@1e6bd263{/StreamingQuery/statistics,null,AVAILABLE,@Spark}
[ main] ContextHandler INFO Started o.s.j.s.ServletContextHandler@635ff2a5{/StreamingQuery/statistics/json,null,AVAILABLE,@Spark}
[ main] ContextHandler INFO Started o.s.j.s.ServletContextHandler@62735b13{/static/sql,null,AVAILABLE,@Spark}
[ main] ResolveWriteToStream WARN Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\xyz\AppData\Local\Temp\temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
[ main] ResolveWriteToStream INFO Checkpoint root C:\Users\xyz\AppData\Local\Temp\temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb resolved to file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb.
[ main] ResolveWriteToStream WARN spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
[ main] CheckpointFileManager INFO Writing atomically to file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb/metadata using temp file file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb/.metadata.c2b5aa2a-2a86-4931-a4f0-bbdaae8c3d5f.tmp
[ main] CheckpointFileManager INFO Renamed temp file file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb/.metadata.c2b5aa2a-2a86-4931-a4f0-bbdaae8c3d5f.tmp to file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb/metadata
[ main] MicroBatchExecution INFO Starting [id = 54eadb58-a957-4f8d-b67e-24ef6717482c, runId = ceb06ba5-1ce6-4ccd-bfe9-b4e24fd497a6]. Use file:/C:/Users/xyz/AppData/Local/Temp/temporary-c2ca1d2c-2c8d-4961-a1bd-1881bc00e0bb to store the query checkpoint.
[5-1ce6-4ccd-bfe9-b4e24fd497a6]] MicroBatchExecution INFO Reading table [org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@5efc8880] from DataSourceV2 named 'Kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider@2703aebd]
[5-1ce6-4ccd-bfe9-b4e24fd497a6]] KafkaSourceProvider WARN Kafka option 'kafka.group.id' has been set on this query, it is
not recommended to set this option. This option is unsafe to use since multiple concurrent
queries or sources using the same group id will interfere with each other as they are part
of the same consumer group. Restarted queries may also suffer interference from the
previous run having the same group id. The user should have only one query per group id,
and/or set the option 'kafka.session.timeout.ms' to be very small so that the Kafka
consumers from the previous query are marked dead by the Kafka group coordinator before the
restarted query starts running.
[5-1ce6-4ccd-bfe9-b4e24fd497a6]] MicroBatchExecution INFO Starting new streaming query.
[5-1ce6-4ccd-bfe9-b4e24fd497a6]] MicroBatchExecution INFO Stream started from {}
[5-1ce6-4ccd-bfe9-b4e24fd497a6]] ConsumerConfig INFO ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [localhost:9092, localhost: 9093]
check.crcs = true
client.dns.lookup = default
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = kafka-message-test-group
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 1
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm =
ssl.key.password = [hidden]
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = src/main/resources/consumer_inlet/keystore.jks
ssl.keystore.password = [hidden]
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = src/main/resources/consumer_inlet/truststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
This is the error I get while running consumerMain:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted
=== Streaming Query ===
Identifier: [id = 54eadb58-a957-4f8d-b67e-24ef6717482c, runId = ceb06ba5-1ce6-4ccd-bfe9-b4e24fd497a6]
Current Committed Offsets: {}
Current Available Offsets: {KafkaV2[Subscribe[sample_topic_T]]: {"clinical_sample_T":{"0":155283144,"1":155233229}}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
WriteToMicroBatchDataSource org.apache.spark.sql.execution.streaming.ConsoleTable$@4f9c824, 54eadb58-a957-4f8d-b67e-24ef6717482c, Append
+- StreamingDataSourceV2Relation [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaScan@135a05da, KafkaV2[Subscribe[sample_topic_T]]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:330)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: org.apache.spark.SparkException: Writing job aborted
at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:749)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:409)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:353)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:302)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:313)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3120)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:3120)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:663)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:658)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:658)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:255)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:218)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:212)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:307)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:285)
... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LHTU05CG050CC8Q.ms.ds.uhc.com executor driver): java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.setMinEvictableIdleTime(Ljava/time/Duration;)V
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.init(InternalKafkaConsumerPool.scala:186)
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.<init>(InternalKafkaConsumerPool.scala:163)
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool.<init>(InternalKafkaConsumerPool.scala:54)
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala:637)
at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:53)
at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:41)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
at scala.Option.foreach(Option.scala:437)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:377)
... 42 more
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.setMinEvictableIdleTime(Ljava/time/Duration;)V
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.init(InternalKafkaConsumerPool.scala:186)
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool$PoolConfig.<init>(InternalKafkaConsumerPool.scala:163)
at org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool.<init>(InternalKafkaConsumerPool.scala:54)
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala:637)
at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:53)
at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:41)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
I am running this in IntelliJ.
I cannot reproduce the error (using the latest IntelliJ Ultimate), but here are the POM and code I used:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cricket.jomoore</groupId>
<artifactId>scala</artifactId>
<version>0.1-SNAPSHOT</version>
<name>SOReady4Spark</name>
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.compat.version>2.13</scala.compat.version>
<scala.version>${scala.compat.version}.8</scala.version>
<spark.version>3.3.0</spark.version>
<spec2.version>4.2.0</spec2.version>
</properties>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-bom</artifactId>
<version>2.18.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Test -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter</artifactId>
<version>5.8.2</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>4.7.1</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.0.0-M7</version>
</plugin>
</plugins>
</build>
</project>
The <scope>provided</scope> tags are needed for when you actually deploy Spark code to a real Spark cluster. For running locally, you also need an IntelliJ run configuration that includes provided-scope dependencies on the classpath.
package cricketeer.one

import org.apache.kafka.clients.consumer.OffsetResetStrategy
import org.apache.spark.sql
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{DataType, DataTypes}
import org.slf4j.LoggerFactory

object KafkaTest extends App {

  val logger = LoggerFactory.getLogger(getClass)

  /**
   * For testing output to a console.
   *
   * @param df A Streaming DataFrame
   * @return A DataStreamWriter
   */
  private def streamToConsole(df: sql.DataFrame) = {
    df.writeStream.outputMode(OutputMode.Append()).format("console")
  }

  private def getKafkaDf(spark: SparkSession, bootstrap: String, topicPattern: String, offsetResetStrategy: OffsetResetStrategy = OffsetResetStrategy.EARLIEST) = {
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrap)
      .option("subscribe", topicPattern)
      .option("startingOffsets", offsetResetStrategy.toString.toLowerCase())
      .load()
  }

  val spark = SparkSession.builder()
    .appName("Kafka Test")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val kafkaBootstrap = "localhost:9092"
  val df = getKafkaDf(spark, kafkaBootstrap, "input-topic")

  streamToConsole(
    df.select($"value".cast(DataTypes.StringType))
  ).start().awaitTermination()
}
src/main/resources/log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
<Appenders>
<Console name="STDOUT" target="SYSTEM_OUT">
<PatternLayout pattern="%d %-5p [%t] %C{2} (%F:%L) - %m%n"/>
</Console>
</Appenders>
<Loggers>
<Logger name="org.apache.kafka.clients.consumer.internals.Fetcher" level="warn">
<AppenderRef ref="STDOUT"/>
</Logger>
<Root level="info">
<AppenderRef ref="STDOUT"/>
</Root>
</Loggers>
</Configuration>
I downgraded Spark from 3.3.0 to 3.2.2, keeping the Scala version at 2.13.8. For me, it seems Scala 2.13 was not compatible with Spark 3.3.0. For now I am able to write the Avro data to a file.
Thanks to @OneCricketeer for your help and support so far!
I hit the same issue with Spark 3.3.0. I did some deep diving and finally found the root cause: the Spark 3.3.0 distribution has a built-in dependency on commons-pool v1.5.4 (commons-pool-1.5.4.jar), while the structured streaming Kafka library relies on commons-pool2 v2.11.1, which replaced setMinEvictableIdleTimeMillis with setMinEvictableIdleTime. The resolution is therefore to add commons-pool2-2.11.1.jar to your Spark jars folder. The jar can be downloaded from Maven Central: https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar
I've documented the details on this page if you want to learn more:
https://kontext.tech/article/1178/javalangnosuchmethoderror-poolconfigsetminevictableidletime
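If you want to verify which commons-pool2 ends up on your classpath, a small probe along these lines (a sketch; it assumes commons-pool2 is the pool library in play) only compiles and runs when the Duration-based setter is present, and prints the jar the config class was loaded from:

import java.time.Duration
import org.apache.commons.pool2.impl.GenericObjectPoolConfig

// setMinEvictableIdleTime(Duration) exists in commons-pool2 2.11.x; an older
// commons-pool2 jar on the classpath is missing it, which is exactly what
// produces the NoSuchMethodError above.
val cfg = new GenericObjectPoolConfig[AnyRef]()
cfg.setMinEvictableIdleTime(Duration.ofMinutes(1))
println(classOf[GenericObjectPoolConfig[_]].getProtectionDomain.getCodeSource.getLocation)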

LazyBoolean on pyspark when inserting on MongoDB

I really need help.
This is a repeat of ( py4j.protocol.Py4JJavaError: An error occurred while calling o59.save. : java.lang.NoClassDefFoundError: scala/runtime/LazyBoolean ), which has no answer.
I'm trying to insert a DataFrame into MongoDB (version v3.6.17 or v4.2.3) but it fails both times. I have tried my own data and the official documentation example ( https://docs.mongodb.com/spark-connector/master/python/write-to-mongodb/ ), but it returns
java.lang.NoClassDefFoundError: scala/runtime/LazyBoolean
I have had other issues trying to do this simple task, for example ( Can't connect to Mongo DB via Spark, Spark and MongoDB application in Scala 2.10 maven built error ), and read some possible solutions ( https://github.com/mongodb/mongo-spark, https://github.com/mongodb/mongo-spark/blob/master/examples/src/test/python/introduction.py ), but with no success.
My Spark version is 2.4.4, Python is 3.7.6, and I'm using IntelliJ IDEA 2019.3 as my IDE.
My code is as follows:
from pyspark.sql import Row
from pyspark.sql.types import *
from select import select
# import org.mongodb.spark.sql.DefaultSource  # ModuleNotFoundError: No module named 'org'
# import org.mongodb.spark  # ModuleNotFoundError: No module named 'org'
# import org.bson.Document  # ModuleNotFoundError: No module named 'org'
from linecache import cache
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("spark_python") \
        .master("local") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    print("http://desktop-f5ffrvd:4040/jobs/")

    mongo_spark = SparkSession \
        .builder \
        .appName("myApp") \
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.contacts") \
        .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:2.4.1') \
        .getOrCreate()
    # .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.contacts")

    people = mongo_spark \
        .createDataFrame([("Bilbo Baggins", "50"), ("Gandalf", "1000"), ("Thorin", "195"), ("Balin", "178"),
                          ("Kili", "77"), ("Dwalin", "169"), ("Oin", "167"), ("Gloin", "158"),
                          ("Fili", "82"), ("Bombur", "None")], ["name", "age"])
    people.show()
    people.printSchema()

    people \
        .write \
        .format("mongo") \
        .mode("append") \
        .option("database", "test") \
        .option("collection", "contacts") \
        .save()

    spark.stop()
    mongo_spark.stop()
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>airbnb_python</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector -->
<dependency>
<groupId>org.mongodb.spark</groupId>
<artifactId>mongo-spark-connector_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.8.0</version>
</dependency>
</dependencies>
</project>
Trying to follow the documentation ( https://docs.mongodb.com/spark-connector/master/python-api/#python-basics ), when running
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred"
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection"
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
in the Windows console, I get the same error with the code above. In C:\Users\israel\.ivy2\jars I have two jars: org.mongodb.spark_mongo-spark-connector_2.11-2.3.0 and org.mongodb_mongo-java-driver-3.8.0.
I'm posting this question because I have run out of ideas; I have already consulted four data engineers with no luck, and I really need help, please. Thanks in advance.

Scala on eclipse : reading csv as dataframe throw a java.lang.ArrayIndexOutOfBoundsException

Trying to read a simple CSV file and load it into a DataFrame throws a java.lang.ArrayIndexOutOfBoundsException.
As I am new to Scala I may have missed something trivial, but a thorough search on both Google and Stack Overflow turned up nothing.
The code is the following:
import org.apache.spark.sql.SparkSession

object TransformInitial {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder.master("local").appName("test").getOrCreate()
    val df = session.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ",")
      .load("data_sets/small_test.csv")
    df.show()
  }
}
small_test.csv is as simple as possible:
v1,v2,v3
0,1,2
3,4,5
Here is the actual pom of this Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Scala_tests</groupId>
<artifactId>Scala_tests</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<resources>
<resource>
<directory>src</directory>
<excludes>
<exclude>**/*.java</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.0</version>
</dependency>
</dependencies>
</project>
Execution of the code throws the following java.lang.ArrayIndexOutOfBoundsException:
18/11/09 12:03:31 INFO FileSourceStrategy: Pruning directories with:
18/11/09 12:03:31 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
18/11/09 12:03:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
18/11/09 12:03:31 INFO FileSourceScanExec: Pushed Filters:
18/11/09 12:03:31 INFO CodeGenerator: Code generated in 413.859722 ms
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.IterableLike.foreach(IterableLike.scala:71)
at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:176)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:191)
at scala.collection.TraversableLike.map(TraversableLike.scala:234)
at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:191)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:170)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:169)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.immutable.List.flatMap(List.scala:352)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:169)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:22)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:30)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:78)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:467)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:351)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:283)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:169)
at com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:223)
at com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:348)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:210)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:153)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1203)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1157)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:481)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:679)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:107)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3559)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2927)
at org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:142)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:232)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:68)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at TransformInitial$.main(TransformInitial.scala:9)
at TransformInitial.main(TransformInitial.scala)
For the record, the Eclipse version is 2018-09 (4.9.0).
I've hunted for special characters in the csv with cat -A. It yielded nothing.
I'm out of options; something trivial must be missing, but I can't put my finger on it.
I'm not sure exactly what is causing your error, since the code works for me. It could be related to the version of the Scala compiler you are using, since there's no information about that in your Maven file.
I have posted my complete solution (using SBT) to GitHub. To execute the code, you'll need to install SBT, cd to the checked-out source's root folder, and then run the following command:
$ sbt run
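For reference, a minimal build.sbt along these lines would make the compiler version explicit (a sketch; the actual file in the linked repository may differ):
// build.sbt: minimal sketch, not necessarily identical to the GitHub solution
name := "transform-initial"
version := "0.1.0"
scalaVersion := "2.12.8" // declares the Scala compiler version, which the Maven build leaves unstated

libraryDependencies ++= Seq(
  // %% appends the matching _2.12 suffix to each artifact automatically
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0"
)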
BTW, I changed your code to take advantage of more idiomatic Scala conventions, and also used the csv function to load your file. The new Scala code looks like this:
import org.apache.spark.sql.SparkSession
// Extending App is more idiomatic than writing a "main" function.
object TransformInitial
extends App {
val session = SparkSession.builder.master("local").appName("test").getOrCreate()
// As of Spark 2.0, it's easier to read CSV files.
val df = session.read.option("header", "true").option("inferSchema", "true").csv("data_sets/small_test.csv")
df.show()
// Shutdown gracefully.
session.stop()
}
Note that I also removed the redundant delimiter option.
Downgrading the Scala version to 2.11 fixed it for me, most likely because Spark 2.4.0's Scala 2.12 support was still experimental and the paranamer frames at the top of the stack trace point to a known incompatibility with classes compiled by Scala 2.12.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>

jackson/guava jar conflict when running Spark on YARN

My Spark environment is Scala 2.10.5, Spark 1.6.0, Hadoop 2.6.0.
The application uses Jackson to do some serialization/deserialization work.
When submitting to Spark (YARN client mode):
spark-submit --class noce.train.Train_Grid --master yarn-client --num-executors 10 --executor-cores 2 --driver-memory 10g --executor-memory 12g --conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.executor.extraClassPath=./guava-15.0.jar:./jackson-annotations-2.4.4.jar:./jackson-core-2.4.4.jar:./jackson-databind-2.4.4.jar:./jackson-module-scala_2.10-2.4.4.jar \
--conf spark.driver.extraClassPath=/home/ck/lib/guava-15.0.jar:/home/ck/lib/jackson-annotations-2.4.4.jar:/home/ck/lib/jackson-core-2.4.4.jar:/home/ck/lib/jackson-databind-2.4.4.jar:/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar \
--jars /home/ck/lib/guava-15.0.jar,/home/ck/lib/jackson-annotations-2.4.4.jar,/home/ck/lib/jackson-core-2.4.4.jar,/home/ck/lib/jackson-databind-2.4.4.jar,/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar \
/home/ck/gnoce_scala.jar
I got errors:
18/09/12 09:46:47 WARN scheduler.TaskSetManager: Lost task 39.0 in stage 7.0 (TID 893, host-9-138): java.lang.NoClassDefFoundError: Could not initialize class noce.grid.Grid$
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:80)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:79)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
18/09/12 09:46:47 INFO scheduler.TaskSetManager: Lost task 198.0 in stage 7.0 (TID 897) on executor host-9-136: java.lang.NoClassDefFoundError (Could not initialize class noce.grid.Grid$) [duplicate 1]
18/09/12 09:46:47 WARN scheduler.TaskSetManager: Lost task 58.0 in stage 7.0 (TID 890, host-9-136): java.lang.AbstractMethodError: noce.grid.Grid$$anon$1.com$fasterxml$jackson$module$scala$experimental$ScalaObjectMapper$_setter_$com$fasterxml$jackson$module$scala$experimental$ScalaObjectMapper$$typeCache_$eq(Lorg/spark-project/guava/cache/LoadingCache;)V
at com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper$class.$init$(ScalaObjectMapper.scala:50)
at noce.grid.Grid$$anon$1.<init>(Grid.scala:75)
at noce.grid.Grid$.<init>(Grid.scala:75)
at noce.grid.Grid$.<clinit>(Grid.scala)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:80)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:79)
... ...
The code is as follows:
//Train_Grid.scala
val newGridData: RDD[(Long, Grid)] = data.map(nr => { //line 79
val grid = Grid(nr) //line 80
(grid.id, grid)
}).reduceByKey(_.merge(_))
//Grid.scala
// (imports elided in the original: com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper},
// com.fasterxml.jackson.module.scala.DefaultScalaModule,
// com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper)
object Grid {
val mapper = new ObjectMapper() with ScalaObjectMapper //line 75
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
// ... rest of the object elided ...
}
I printed the class paths in the driver:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.take(20).foreach(println)
file:/home/ck/lib/guava-15.0.jar
file:/home/ck/lib/jackson-annotations-2.4.4.jar
file:/home/ck/lib/jackson-core-2.4.4.jar
file:/home/ck/lib/jackson-databind-2.4.4.jar
file:/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar
file:/etc/spark/conf.cloudera.spark_on_yarn/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/spark-assembly-1.6.0-cdh5.7.2-hadoop2.6.0-cdh5.7.2.jar
file:/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf/
file:/etc/hive/conf.cloudera.hive/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ST4-4.0.4.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-core-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-fate-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-start-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-trace-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/activation-1.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ant-1.9.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ant-launcher-1.9.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antisamy-1.4.3.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antlr-2.7.7.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antlr-runtime-3.4.jar
and in the executors:
val x = sc.parallelize(0 to 1, 2)
val p = x.flatMap { i =>
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.take(20).map(_.toString)
}
p.collect().foreach(println)
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/guava-15.0.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-annotations-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-core-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-databind-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-module-scala_2.10-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/
file:/DATA7/yarn/nm/usercache/ck/filecache/745/__spark_conf__2134162299477543917.zip/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/spark-assembly-1.6.0-cdh5.7.2-hadoop2.6.0-cdh5.7.2.jar
file:/etc/hadoop/conf.cloudera.yarn/
file:/var/run/cloudera-scm-agent/process/2147-yarn-NODEMANAGER/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-column-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-format-2.1.0-cdh5.7.2-sources.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-jackson-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-scala_2.10-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-hadoop-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-common-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-avro-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-auth-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-aws-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-common-2.6.0-cdh5.7.2-tests.jar
... ...
But obviously it still uses the incorrect Guava version (org.spark-project.guava.cache.LoadingCache).
And if I set spark.{driver,executor}.userClassPathFirst to true, I get:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
So, any suggestions?
A few things I would suggest.
First, you should create a fat jar for your project. If you are using Maven, follow this question: Building a fat jar using maven
Or if you are using SBT, you can use sbt-assembly: https://github.com/sbt/sbt-assembly
That saves you from listing all the jars on the spark-submit command line, and it lets you use shading in your code.
Shading allows your code to use a different version of a library than the one the framework ships, without any conflict. To set it up, follow the instructions for:
Maven - https://maven.apache.org/plugins/maven-shade-plugin/
SBT - https://github.com/sbt/sbt-assembly#shading
So, in your case, you should shade your Guava classes. I had this problem with the protobuf classes in my Spark project, and I use the Maven shade plugin like this (a Guava adaptation is sketched after the build section):
<build>
<outputDirectory>target/classes</outputDirectory>
<testOutputDirectory>target/test-classes</testOutputDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>shaded.com.google.protobuf</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<outputDirectory>target</outputDirectory>
</configuration>
</plugin>
</plugins>
</build>
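Adapted to the Guava conflict in this question, the relocation would presumably target Guava's com.google.common package instead (a sketch, not tested against this cluster):
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>shaded.com.google.common</shadedPattern>
</relocation>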

TypeSafe Config: reference.conf values not overridden by application.conf

I am using a Typesafe application.conf to provide all external config in my Scala project, but when I create a shaded jar with the maven-shade-plugin and run it, the conf file somehow gets packaged into the jar itself, and its values can no longer be overridden by editing the external application.conf.
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${maven.shade.plugin.version}</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>test.myproj.DQJobTrigger</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
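One addition that often accompanies this plugin when Typesafe Config files are bundled (a sketch; it is not in the configuration above) is an AppendingTransformer, which concatenates the reference.conf files from all dependencies instead of keeping only one of them:
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>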
I am not sure of the behaviour here, given that I load all the configs from the application.conf itself using ConfigFactory:
import java.io.File
import com.typesafe.config.ConfigFactory

trait Config {
  private val dqConfig = System.getenv("CONF_PATH")
  private val config = ConfigFactory.parseFile(new File(dqConfig))
  private val sparkConfig = config.getConfig("sparkConf")
  private val mysqlConfig = config.getConfig("mysql") // renamed: the original redeclared sparkConfig, which would not compile
}
where CONF_PATH is set to the path of the application.conf used while running the jar.
application.conf
sparkConf = {
master = "yarn-client"
}
mysql = {
url = "jdbc:mysql://127.0.0.1:3306/test"
user = "root"
password = "MyCustom98"
driver = "com.mysql.jdbc.Driver"
connectionPool = disabled
keepAliveConnection = true
}
So now, even if I change the properties in the external application.conf, it still picks up the configs that were present when the jar was packaged.
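A minimal sketch of one way to make the external file always win, assuming CONF_PATH points at the runtime application.conf (an illustration, not a confirmed fix for this build):
import java.io.File
import com.typesafe.config.ConfigFactory

// Parse the external file first, then fall back to whatever is packaged
// in the jar (reference.conf plus any bundled application.conf).
val external = ConfigFactory.parseFile(new File(System.getenv("CONF_PATH")))
val config = external.withFallback(ConfigFactory.load()).resolve()
// Keys present in the external file now take precedence over packaged values.
val mysqlUrl = config.getString("mysql.url")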