jackson/guava jar conflict when running Spark on YARN - scala

My Spark environment is Scala 2.10.5, Spark 1.6.0, Hadoop 2.6.0.
The application uses Jackson for some serialization/deserialization work.
When I submit it to Spark (yarn-client mode):
spark-submit --class noce.train.Train_Grid --master yarn-client --num-executors 10 --executor-cores 2 --driver-memory 10g --executor-memory 12g --conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.executor.extraClassPath=./guava-15.0.jar:./jackson-annotations-2.4.4.jar:./jackson-core-2.4.4.jar:./jackson-databind-2.4.4.jar:./jackson-module-scala_2.10-2.4.4.jar \
--conf spark.driver.extraClassPath=/home/ck/lib/guava-15.0.jar:/home/ck/lib/jackson-annotations-2.4.4.jar:/home/ck/lib/jackson-core-2.4.4.jar:/home/ck/lib/jackson-databind-2.4.4.jar:/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar \
--jars /home/ck/lib/guava-15.0.jar,/home/ck/lib/jackson-annotations-2.4.4.jar,/home/ck/lib/jackson-core-2.4.4.jar,/home/ck/lib/jackson-databind-2.4.4.jar,/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar \
/home/ck/gnoce_scala.jar
I got these errors:
18/09/12 09:46:47 WARN scheduler.TaskSetManager: Lost task 39.0 in stage 7.0 (TID 893, host-9-138): java.lang.NoClassDefFoundError: Could not initialize class noce.grid.Grid$
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:80)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:79)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
18/09/12 09:46:47 INFO scheduler.TaskSetManager: Lost task 198.0 in stage 7.0 (TID 897) on executor host-9-136: java.lang.NoClassDefFoundError (Could not initialize class noce.grid.Grid$) [duplicate 1]
18/09/12 09:46:47 WARN scheduler.TaskSetManager: Lost task 58.0 in stage 7.0 (TID 890, host-9-136): java.lang.AbstractMethodError: noce.grid.Grid$$anon$1.com$fasterxml$jackson$module$scala$experimental$ScalaObjectMapper$_setter_$com$fasterxml$jackson$module$scala$experimental$ScalaObjectMapper$$typeCache_$eq(Lorg/spark-project/guava/cache/LoadingCache;)V
at com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper$class.$init$(ScalaObjectMapper.scala:50)
at noce.grid.Grid$$anon$1.<init>(Grid.scala:75)
at noce.grid.Grid$.<init>(Grid.scala:75)
at noce.grid.Grid$.<clinit>(Grid.scala)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:80)
at noce.train.Train_Grid$$anonfun$3.apply(Train_Grid.scala:79)
... ...
The code is as follows:
//Train_Grid.scala
val newGridData: RDD[(Long, Grid)] = data.map(nr => { //line 79
  val grid = Grid(nr) //line 80
  (grid.id, grid)
}).reduceByKey(_.merge(_))
//Grid.scala
object Grid {
  val mapper = new ObjectMapper() with ScalaObjectMapper //line 75
  mapper.registerModule(DefaultScalaModule)
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
I printed the class paths on the driver:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.take(20).foreach(println)
file:/home/ck/lib/guava-15.0.jar
file:/home/ck/lib/jackson-annotations-2.4.4.jar
file:/home/ck/lib/jackson-core-2.4.4.jar
file:/home/ck/lib/jackson-databind-2.4.4.jar
file:/home/ck/lib/jackson-module-scala_2.10-2.4.4.jar
file:/etc/spark/conf.cloudera.spark_on_yarn/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/spark-assembly-1.6.0-cdh5.7.2-hadoop2.6.0-cdh5.7.2.jar
file:/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf/
file:/etc/hive/conf.cloudera.hive/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ST4-4.0.4.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-core-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-fate-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-start-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/accumulo-trace-1.6.0.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/activation-1.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ant-1.9.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/ant-launcher-1.9.1.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antisamy-1.4.3.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antlr-2.7.7.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/antlr-runtime-3.4.jar
and on the executors:
val x = sc.parallelize(0 to 1, 2)
val p = x.flatMap { i =>
  val cl = ClassLoader.getSystemClassLoader
  cl.asInstanceOf[java.net.URLClassLoader].getURLs.take(20).map(_.toString)
}
p.collect().foreach(println)
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/guava-15.0.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-annotations-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-core-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-databind-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/jackson-module-scala_2.10-2.4.4.jar
file:/DATA2/yarn/nm/usercache/ck/appcache/application_1533542623806_5351/container_1533542623806_5351_01_000007/
file:/DATA7/yarn/nm/usercache/ck/filecache/745/__spark_conf__2134162299477543917.zip/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/spark-assembly-1.6.0-cdh5.7.2-hadoop2.6.0-cdh5.7.2.jar
file:/etc/hadoop/conf.cloudera.yarn/
file:/var/run/cloudera-scm-agent/process/2147-yarn-NODEMANAGER/
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-column-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-format-2.1.0-cdh5.7.2-sources.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-jackson-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-scala_2.10-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-hadoop-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-common-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/parquet-avro-1.5.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-auth-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-aws-2.6.0-cdh5.7.2.jar
file:/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jars/hadoop-common-2.6.0-cdh5.7.2-tests.jar
... ...
But obviously it still uses the wrong Guava version (org.spark-project.guava.cache.LoadingCache): the AbstractMethodError is thrown while initializing Grid$, which is why the other tasks then fail with "Could not initialize class noce.grid.Grid$".
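For what it's worth, a quick way to check which jar the ScalaObjectMapper trait is actually loaded from on the executors is something like this (just a diagnostic sketch, not part of the job):
// diagnostic sketch: print the jar each executor loads ScalaObjectMapper from
val origins = sc.parallelize(0 until 2, 2).map { _ =>
  classOf[com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper]
    .getProtectionDomain.getCodeSource.getLocation.toString
}.collect().distinct
origins.foreach(println)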
And if I set spark.{driver,executor}.userClassPathFirst to true, I get:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
So, any suggestions?

A few things I would suggest.
First, you should create a fat jar for your project. If you are using Maven, just follow this question: Building a fat jar using maven
Or if you are using SBT, you can use sbt-assembly: https://github.com/sbt/sbt-assembly
That saves you from listing all the jars on the spark-submit command line, and it lets you use shading in your build.
Shading allows your code to use a different version of a library than the one the framework ships, without any conflict. To use it, follow the instructions for:
Maven - https://maven.apache.org/plugins/maven-shade-plugin/
SBT - https://github.com/sbt/sbt-assembly#shading
So in your case, you should shade your Guava classes. I had this problem with the protobuf classes in my Spark project, and I use the Maven shade plugin like this:
<build>
  <outputDirectory>target/classes</outputDirectory>
  <testOutputDirectory>target/test-classes</testOutputDirectory>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.1.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <pattern>com.google.protobuf</pattern>
                <shadedPattern>shaded.com.google.protobuf</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <outputDirectory>target</outputDirectory>
      </configuration>
    </plugin>
  </plugins>
</build>
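If you are on SBT instead, the equivalent relocation with sbt-assembly looks roughly like this (just a sketch; the shaded package names are arbitrary, and for your Guava conflict you would relocate the Guava packages the same way):
// build.sbt -- rough sbt-assembly equivalent of the Maven relocations above
// (requires the sbt-assembly plugin in project/plugins.sbt)
assemblyShadeRules in assembly := Seq(
  // rewrite the conflicting packages inside the fat jar only; your sources keep the original names
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)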

Related

an openapi client jar, generated for scala-akka by openapi-generator-maven-plugin, causes canBuildFromIterableViewMapLike() not found, under Zeppelin

I'm generating a Scala client jar with org.openapitools:openapi-generator-maven-plugin so that my Zeppelin notebook can execute the sample of code below.
openapi-generator-maven-plugin offers a few generators for Scala and recommends choosing scala-akka.
It requires producing a rather involved piece of client code:
%spark
import java.util.Map;
import fr.ecoemploi.application.etude.swagger.invoker._
import fr.ecoemploi.application.etude.swagger.model.Commune;
import fr.ecoemploi.application.etude.swagger.api.CogControllerApi;
import fr.ecoemploi.application.etude.swagger.invoker.CollectionFormats._
import fr.ecoemploi.application.etude.swagger.invoker.ApiKeyLocations._
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}
implicit val system = ActorSystem("system")
implicit val executionContext = system.dispatcher
implicit val materializer: ActorMaterializer = ActorMaterializer()
val apiInvoker = ApiInvoker()
val cogService = CogControllerApi("http://localhost:9090")
val request = cogService.obtenirCommunes(2022);
val response = apiInvoker.execute(request)
response.onComplete {
case Success(ApiResponse(code, content, headers)) =>
System.out.println(s"Status code: $code}")
System.out.println(s"Response headers: ${headers.mkString(", ")}")
System.out.println(s"Response body: $content")
case Failure(error @ ApiError(code, message, responseContent, cause, headers)) =>
System.err.println("Exception when calling CogControllerApi#obtenirCommunes")
System.err.println(s"Status code: $code}")
System.err.println(s"Reason: $responseContent")
System.err.println(s"Response headers: ${headers.mkString(", ")}")
error.printStackTrace();
case Failure(exception) =>
System.err.println("Exception when calling CogControllerApi#obtenirCommunes")
exception.printStackTrace();
}
Which eventually fails at run time under Zeppelin:
java.lang.NoSuchMethodError: 'scala.collection.generic.CanBuildFrom scala.collection.compat.package$.canBuildFromIterableViewMapLike()'
at fr.ecoemploi.application.etude.swagger.invoker.ApiInvoker.makeUri(ApiInvoker.scala:213)
at fr.ecoemploi.application.etude.swagger.invoker.ApiInvoker.execute(ApiInvoker.scala:226)
... 44 elided
Here is how I've gone about it.
The generated openapi client jar has this content:
jar tf openapi-client-1.0.0.jar
META-INF/
META-INF/MANIFEST.MF
fr/
fr/ecoemploi/
fr/ecoemploi/application/
fr/ecoemploi/application/etude/
fr/ecoemploi/application/etude/swagger/
fr/ecoemploi/application/etude/swagger/model/
fr/ecoemploi/application/etude/swagger/api/
fr/ecoemploi/application/etude/swagger/invoker/
fr/ecoemploi/application/etude/swagger/model/Commune$.class
fr/ecoemploi/application/etude/swagger/model/Commune.class
fr/ecoemploi/application/etude/swagger/api/CogControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/ComptesCollectivitesControllerApi.class
fr/ecoemploi/application/etude/swagger/api/ComptesCollectivitesControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/EnumsSerializers$.class
fr/ecoemploi/application/etude/swagger/api/EnumsSerializers.class
fr/ecoemploi/application/etude/swagger/api/CovidControllerApi.class
fr/ecoemploi/application/etude/swagger/api/EquipementControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/CovidControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/EnumsSerializers$EnumNameSerializer.class
fr/ecoemploi/application/etude/swagger/api/IntercoControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/ActivitesControllerApi.class
fr/ecoemploi/application/etude/swagger/api/EnumsSerializers$EnumNameSerializer$$anonfun$deserialize$1.class
fr/ecoemploi/application/etude/swagger/api/ActivitesControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/AssociationsControllerApi$.class
fr/ecoemploi/application/etude/swagger/api/IntercoControllerApi.class
fr/ecoemploi/application/etude/swagger/api/EnumsSerializers$EnumNameSerializer$$anonfun$serialize$1.class
fr/ecoemploi/application/etude/swagger/api/EquipementControllerApi.class
fr/ecoemploi/application/etude/swagger/api/CogControllerApi.class
fr/ecoemploi/application/etude/swagger/api/AssociationsControllerApi.class
fr/ecoemploi/application/etude/swagger/invoker/ApiRequest$.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormat.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$LocalDateSerializer$.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$DateTimeSerializer$$anonfun$$lessinit$greater$1.class
fr/ecoemploi/application/etude/swagger/invoker/NumericValue.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyCredentials.class
fr/ecoemploi/application/etude/swagger/invoker/CustomContentTypes.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyValue.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$DateTimeSerializer$$anonfun$$lessinit$greater$1$$anonfun$apply$1.class
fr/ecoemploi/application/etude/swagger/invoker/ApiError$$anonfun$$lessinit$greater$1.class
fr/ecoemploi/application/etude/swagger/invoker/Credentials.class
fr/ecoemploi/application/etude/swagger/invoker/ParametersMap$.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$CSV$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyCredentials$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiModel.class
fr/ecoemploi/application/etude/swagger/invoker/ApiMethods.class
fr/ecoemploi/application/etude/swagger/invoker/ParametersMap.class
fr/ecoemploi/application/etude/swagger/invoker/BasicCredentials.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$DateTimeSerializer$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiMethod.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocations$HEADER$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyValue$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiInvoker$$anonfun$unmarshallApiResponse$4.class
fr/ecoemploi/application/etude/swagger/invoker/ApiSettings.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocations$QUERY$.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$SSV$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiInvoker$.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$LocalDateSerializer$$anonfun$$lessinit$greater$2.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$DateTimeSerializer$$anonfun$$lessinit$greater$1$$anonfun$apply$2.class
fr/ecoemploi/application/etude/swagger/invoker/ApiResponse.class
fr/ecoemploi/application/etude/swagger/invoker/ApiRequest.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocation.class
fr/ecoemploi/application/etude/swagger/invoker/ApiInvoker$ApiRequestImprovements.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$PIPES$.class
fr/ecoemploi/application/etude/swagger/invoker/UnitJSONSupport.class
fr/ecoemploi/application/etude/swagger/invoker/BearerToken$.class
fr/ecoemploi/application/etude/swagger/invoker/BasicCredentials$.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$.class
fr/ecoemploi/application/etude/swagger/invoker/NumericValue$.class
fr/ecoemploi/application/etude/swagger/invoker/MergedArrayFormat.class
fr/ecoemploi/application/etude/swagger/invoker/ApiResponse$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocations$COOKIE$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiMethod$.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$LocalDateSerializer$$anonfun$$lessinit$greater$2$$anonfun$apply$4.class
fr/ecoemploi/application/etude/swagger/invoker/BearerToken.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocations.class
fr/ecoemploi/application/etude/swagger/invoker/ResponseState$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiInvoker.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$TSV$.class
fr/ecoemploi/application/etude/swagger/invoker/ArrayValues.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$.class
fr/ecoemploi/application/etude/swagger/invoker/ResponseState$Error$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiError$.class
fr/ecoemploi/application/etude/swagger/invoker/ParametersMap$ParametersMapImprovements.class
fr/ecoemploi/application/etude/swagger/invoker/ApiMethods$.class
fr/ecoemploi/application/etude/swagger/invoker/CollectionFormats$MULTI$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiInvoker$ApiMethodExtensions.class
fr/ecoemploi/application/etude/swagger/invoker/ApiKeyLocations$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiReturnWithHeaders.class
fr/ecoemploi/application/etude/swagger/invoker/ApiSettings$.class
fr/ecoemploi/application/etude/swagger/invoker/ResponseState.class
fr/ecoemploi/application/etude/swagger/invoker/ApiError.class
fr/ecoemploi/application/etude/swagger/invoker/ResponseState$Success$.class
fr/ecoemploi/application/etude/swagger/invoker/ApiError$$anonfun$$lessinit$greater$2.class
fr/ecoemploi/application/etude/swagger/invoker/ArrayValues$.class
fr/ecoemploi/application/etude/swagger/invoker/Serializers$LocalDateSerializer$$anonfun$$lessinit$greater$2$$anonfun$apply$3.class
reference.conf
META-INF/maven/
META-INF/maven/org.openapitools/
META-INF/maven/org.openapitools/openapi-client/
META-INF/maven/org.openapitools/openapi-client/pom.xml
META-INF/maven/org.openapitools/openapi-client/pom.properties
The lib folder that accompanies the generated sources once they are compiled has this content:
akka-actor_2.12-2.6.12.jar json4s-core_2.12-3.6.7.jar scalatest-core_2.12-3.2.3.jar
akka-http_2.12-10.2.3.jar json4s-ext_2.12-3.6.7.jar scalatest-diagrams_2.12-3.2.3.jar
akka-http-core_2.12-10.2.3.jar json4s-jackson_2.12-3.6.7.jar scalatest-featurespec_2.12-3.2.3.jar
akka-http-json4s_2.12-1.27.0.jar json4s-scalap_2.12-3.6.7.jar scalatest-flatspec_2.12-3.2.3.jar
akka-parsing_2.12-10.2.3.jar junit-4-13_2.12-3.2.3.0.jar scalatest-freespec_2.12-3.2.3.jar
akka-protobuf-v3_2.12-2.6.12.jar junit-4.13.jar scalatest-funspec_2.12-3.2.3.jar
akka-stream_2.12-2.6.12.jar paranamer-2.8.jar scalatest-funsuite_2.12-3.2.3.jar
config-1.4.1.jar reactive-streams-1.0.3.jar scalatest-matchers-core_2.12-3.2.3.jar
hamcrest-core-1.3.jar scala-collection-compat_2.12-2.4.1.jar scalatest-mustmatchers_2.12-3.2.3.jar
hpack-1.0.2.jar scalactic_2.12-3.2.3.jar scalatest-propspec_2.12-3.2.3.jar
jackson-annotations-2.9.0.jar scala-java8-compat_2.12-0.8.0.jar scalatest-refspec_2.12-3.2.3.jar
jackson-core-2.9.8.jar scala-library-2.12.13.jar scalatest-shouldmatchers_2.12-3.2.3.jar
jackson-databind-2.9.8.jar scala-parser-combinators_2.12-1.1.2.jar scalatest-wordspec_2.12-3.2.3.jar
joda-convert-2.2.0.jar scala-reflect-2.12.12.jar scala-xml_2.12-1.2.0.jar
joda-time-2.10.1.jar scalatest_2.12-3.2.3.jar ssl-config-core_2.12-0.4.2.jar
json4s-ast_2.12-3.6.7.jar scalatest-compatible-3.2.3.jar
It appears that in the spark.jars attribute of the Zeppelin Scala interpreter I have to add not only
openapi-client-1.0.0.jar, but also some of the jars listed above.
But not all of them, of course...
I have added these to the spark.jars attribute of Zeppelin:
spark.jars /home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/openapi-client-1.0.0.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-actor_2.12-2.6.12.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-http_2.12-10.2.3.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-http-core_2.12-10.2.3.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-http-json4s_2.12-1.27.0.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-parsing_2.12-10.2.3.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-protobuf-v3_2.12-2.6.12.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/akka-stream_2.12-2.6.12.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/config-1.4.1.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/hamcrest-core-1.3.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/hpack-1.0.2.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/paranamer-2.8.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/reactive-streams-1.0.3.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/jackson-annotations-2.9.0.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/jackson-core-2.9.8.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/jackson-databind-2.9.8.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/joda-convert-2.2.0.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/joda-time-2.10.1.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/json4s-ast_2.12-3.6.7.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/json4s-core_2.12-3.6.7.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/json4s-ext_2.12-3.6.7.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/json4s-jackson_2.12-3.6.7.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/json4s-scalap_2.12-3.6.7.jar,
/home/lebihan/dev/Java/comptes-france/metier-et-gestion/dev/GenerationOpenAPI/target/generated-sources/swagger/scala/target/lib/ssl-config-core_2.12-0.4.2.jar
Maybe I shouldn't use scala-akka as the client generator?
Or am I going about the generation the wrong way, and does Apache Zeppelin require a generator other than scala-akka for the openapi-client jar to be usable with it?
Epilogue
Scala is a hell of incompatible jars, even within its 2.12.x or 2.13.x version lines.
It turned out that Spark (used by Apache Zeppelin) was using incompatible Scala jars for the scala.collection.compat package, some coming in through com.google.protobuf, as far as I could tell.
So the workaround is to create a fat jar that relocates these classes elsewhere by renaming their packages.
Then the generated openapi-client jar is accepted.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass></mainClass>
          </transformer>
          <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>reference.conf</resource>
          </transformer>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/maven/**</exclude>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <relocations>
          <relocation>
            <pattern>com</pattern>
            <shadedPattern>repackaged.com.google.common</shadedPattern>
            <includes>
              <include>com.google.common.**</include>
            </includes>
          </relocation>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>com.shaded.protobuf</shadedPattern>
            <includes>
              <include>com.google.protobuf.**</include>
            </includes>
          </relocation>
          <relocation>
            <pattern>scala.collection.compat</pattern>
            <shadedPattern>scala.shaded.compat</shadedPattern>
            <includes>
              <include>scala.collection.compat.**</include>
            </includes>
          </relocation>
          <relocation>
            <pattern>shapeless</pattern>
            <shadedPattern>shapelessshaded</shadedPattern>
            <includes>
              <include>shapeless.**</include>
            </includes>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>

Scala on Eclipse: reading a csv as a dataframe throws a java.lang.ArrayIndexOutOfBoundsException

Trying to read a simple csv file and load it into a dataframe throws a java.lang.ArrayIndexOutOfBoundsException.
As I am new to Scala I may have missed something trivial, but a thorough search of both Google and Stack Overflow turned up nothing.
The code is the following:
import org.apache.spark.sql.SparkSession

object TransformInitial {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder.master("local").appName("test").getOrCreate()
    val df = session.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("data_sets/small_test.csv")
    df.show()
  }
}
small_test.csv is as simple as possible:
v1,v2,v3
0,1,2
3,4,5
Here is the actual pom of this Maven project:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>Scala_tests</groupId>
<artifactId>Scala_tests</artifactId>
<version>0.0.1-SNAPSHOT</version>
<build>
<sourceDirectory>src</sourceDirectory>
<resources>
<resource>
<directory>src</directory>
<excludes>
<exclude>**/*.java</exclude>
</excludes>
</resource>
</resources>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.0</version>
</dependency>
</dependencies>
</project>
Execution of the code throws the following java.lang.ArrayIndexOutOfBoundsException:
18/11/09 12:03:31 INFO FileSourceStrategy: Pruning directories with:
18/11/09 12:03:31 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
18/11/09 12:03:31 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
18/11/09 12:03:31 INFO FileSourceScanExec: Pushed Filters:
18/11/09 12:03:31 INFO CodeGenerator: Code generated in 413.859722 ms
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
at scala.collection.IterableLike.foreach(IterableLike.scala:71)
at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:58)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:176)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:29)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:191)
at scala.collection.TraversableLike.map(TraversableLike.scala:234)
at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:191)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:170)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:169)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
at scala.collection.immutable.List.flatMap(List.scala:352)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:169)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:22)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:30)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:78)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:467)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:351)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:283)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueMethod(POJOPropertiesCollector.java:169)
at com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueMethod(BasicBeanDescription.java:223)
at com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:348)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:210)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:153)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1203)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1157)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:481)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:679)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:107)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3559)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2927)
at org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:142)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3384)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:232)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:68)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at TransformInitial$.main(TransformInitial.scala:9)
at TransformInitial.main(TransformInitial.scala)
For the record, the Eclipse version is 2018-09 (4.9.0).
I've hunted for special characters in the csv with cat -A. It yielded nothing.
I'm out of options; something trivial must be missing, but I can't put my finger on it.
I'm not sure exactly what is causing your error, since the code works for me. It could be related to the version of the Scala compiler that you are using, since there's no information about that in your Maven file.
I have posted my complete solution, using SBT, to GitHub. To execute the code, you'll need to install SBT, cd to the checked-out source's root folder, and then run the following command:
$ sbt run
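For reference, the build definition needs nothing special; a minimal build.sbt along these lines should work (a sketch, not the exact file from the repo, and the Scala version here is an assumption):
// build.sbt -- minimal sketch matching the Spark version in the question (Scala version is an assumption)
name := "transform-initial"
version := "0.1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0"
)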
BTW, I changed your code to take advantage of more idiomatic Scala conventions, and also used the csv function to load your file. The new Scala code looks like this:
import org.apache.spark.sql.SparkSession

// Extending App is more idiomatic than writing a "main" function.
object TransformInitial
  extends App {

  val session = SparkSession.builder.master("local").appName("test").getOrCreate()

  // As of Spark 2.0, it's easier to read CSV files.
  val df = session.read.option("header", "true").option("inferSchema", "true").csv("data_sets/small_test.csv")
  df.show()

  // Shutdown gracefully.
  session.stop()
}
Note that I also removed the redundant delimiter option.
Downgrading the Scala version to 2.11 fixed it for me.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.4.0</version>
</dependency>

How does one use JMH from a Scala Maven project in a JUnit test?

I created a complete test project (version in question linked). I have based this project on several things:
The original java repository of the same demo
Generating an archetype, e.g.:
mvn archetype:generate \
-DinteractiveMode=false \
-DarchetypeGroupId=org.openjdk.jmh \
-DarchetypeArtifactId=jmh-scala-benchmark-archetype \
-DgroupId=org.sample \
-DartifactId=test \
-Dversion=1.0
One way to get the test runner working with JMH (otherwise it doesn't even try to run): JMH Unable to find the resource: /META-INF/BenchmarkList
Further confirmation of the methodology: How to run JMH from inside JUnit tests?
Gradle/IDEA has similar issues, maybe.
Maybe others ... I've been staring at this for many hours, but I think that is it.
Here is the test:
package com.szatmary.peter
import org.junit.Assert
import org.junit.Test
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Mode
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State
import org.openjdk.jmh.results.BenchmarkResult
import org.openjdk.jmh.results.RunResult
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.RunnerException
import org.openjdk.jmh.runner.options.Options
import org.openjdk.jmh.runner.options.OptionsBuilder
import org.openjdk.jmh.runner.options.TimeValue
import org.openjdk.jmh.runner.options.VerboseMode
import java.util
import java.util.concurrent.TimeUnit
import com.szatmary.peter.SampleBenchmarkTest.St
import com.szatmary.peter.SampleBenchmarkTest.St.AVERAGE_EXPECTED_TIME
/**
* It is recommended to run JMH not from tests but directly from a separate main method.
* Unit tests and the IDE interfere with the measurements.
*
* If your measurements are in seconds / minutes or longer, then running benchmarks from tests
* will not affect your benchmark results.
*
* If your measurements are in milliseconds, microseconds, nanoseconds ... run your
* benchmarks not from tests but from main code to get better measurements.
*/
object SampleBenchmarkTest {
@State(Scope.Benchmark) object St {
private[peter] val AVERAGE_EXPECTED_TIME = 100 // expected max 100 milliseconds
val app = new App
}
}
class SampleBenchmarkTest {
/**
* Benchmark run with Junit
*
* @throws Exception
*/
@Test
@throws[Exception]
def runTest(): Unit = {
val opt = initBench
val results = runBench(opt)
assertOutputs(results)
}
/**
* JMH benchmark
*
*/
@Benchmark
def oldWay(st: St.type): Unit = st.app.oldWay
@Benchmark
def newWay(st: St.type): Unit = st.app.newWay
/**
* Runner options that runs all benchmarks in this test class
* namely benchmark oldWay and newWay.
*
* @return
*/
private def initBench: Options = {
System.out.println("running " + classOf[SampleBenchmarkTest].getSimpleName + ".*")
return new OptionsBuilder()
.include(classOf[SampleBenchmarkTest].getSimpleName + ".*")
.mode(Mode.AverageTime)
.verbosity(VerboseMode.EXTRA)
.timeUnit(TimeUnit.MILLISECONDS)
.warmupTime(TimeValue.seconds(1))
.measurementTime(TimeValue.milliseconds(1))
.measurementIterations(2).
threads(4)
.warmupIterations(2)
.shouldFailOnError(true)
.shouldDoGC(true)
.forks(1)
.build
}
/**
*
* @param opt
* @return
* @throws RunnerException
*/
@throws[RunnerException]
private def runBench(opt: Options) = new Runner(opt).run
/**
* Assert benchmark results that are interesting for us
* Asserting test mode and average test time
*
* @param results
*/
private def assertOutputs(results: util.Collection[RunResult]) = {
import scala.collection.JavaConversions._
for (r <- results) {
import scala.collection.JavaConversions._
for (rr <- r.getBenchmarkResults) {
val mode = rr.getParams.getMode
val score = rr.getPrimaryResult.getScore
val methodName = rr.getPrimaryResult.getLabel
Assert.assertEquals("Test mode is not average mode. Method = " + methodName, Mode.AverageTime, mode)
Assert.assertTrue("Benchmark score = " + score + " is higher than " + AVERAGE_EXPECTED_TIME + " " + rr.getScoreUnit + ". Too slow performance !", score < AVERAGE_EXPECTED_TIME)
}
}
}
}
The error from running mvn clean install is:
[INFO] --- exec-maven-plugin:1.6.0:exec (run-benchmarks) @ jmh-benchmark-demo ---
Exception in thread "main" java.lang.RuntimeException: ERROR: Unable to find the resource: /META-INF/BenchmarkList
at org.openjdk.jmh.runner.AbstractResourceReader.getReaders(AbstractResourceReader.java:98)
at org.openjdk.jmh.runner.BenchmarkList.find(BenchmarkList.java:122)
at org.openjdk.jmh.runner.Runner.internalRun(Runner.java:263)
at org.openjdk.jmh.runner.Runner.run(Runner.java:209)
at org.openjdk.jmh.Main.main(Main.java:71)
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
Unless requested, I won't bother including the other source file in this post (App.scala - it just has two ways of summing over an array), and it is found in the github repo.
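For context, a stand-in for App.scala could be as simple as this (a hypothetical sketch, not the actual file from the repo; it only needs the two methods the benchmark calls):
// Hypothetical stand-in for App.scala: two ways of summing over an array,
// just enough for the oldWay/newWay benchmark methods above to compile.
package com.szatmary.peter

class App {
  private val data: Array[Int] = Array.fill(10000)(scala.util.Random.nextInt(100))

  // "old" way: manual while loop over indices
  def oldWay: Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) { sum += data(i); i += 1 }
    sum
  }

  // "new" way: collection combinators
  def newWay: Long = data.map(_.toLong).sum
}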
For completeness here is my current pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.szatmary.peter</groupId>
<artifactId>jmh-benchmark-demo</artifactId>
<version>1.0</version>
<name>JMH benchmark demo</name>
<prerequisites>
<maven>3.0</maven>
</prerequisites>
<properties>
<javac.target>1.8</javac.target>
<scala.version.major>2.11</scala.version.major>
<scala.version.minor>11</scala.version.minor>
<scala.version>
${scala.version.major}.${scala.version.minor}
</scala.version>
<!--
Select a JMH benchmark generator to use. Available options:
default: whatever JMH chooses by default;
asm: parse bytecode with ASM;
reflection: load classes and use Reflection over them;
-->
<jmh.generator>default</jmh.generator>
<!--
Name of the benchmark Uber-JAR to generate.
-->
<uberjar.name>benchmarks</uberjar.name>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<jmh.version>1.20</jmh.version>
<java.version>1.8</java.version>
<junit.version>4.12</junit.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>${jmh.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>${jmh.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!--
1. Add source directories for both scalac and javac.
-->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>1.8</version>
<executions>
<execution>
<id>add-source</id>
<phase>generate-sources</phase>
<goals>
<goal>add-source</goal>
</goals>
<configuration>
<sources>
<source>${project.basedir}/src/main/scala</source>
<source>${project.basedir}/target/generated-sources/jmh</source>
</sources>
</configuration>
</execution>
<execution>
<id>add-test-source</id>
<phase>generate-test-sources</phase>
<goals>
<goal>add-test-source</goal>
</goals>
<configuration>
<sources>
<source>${project.basedir}/src/test/scala</source>
<source>${project.basedir}/target/generated-test-sources/jmh</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
<!--
2. Compile Scala sources
-->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.3.1</version>
<configuration>
<recompileMode>incremental</recompileMode>
</configuration>
<executions>
<execution>
<phase>process-sources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!--
4. Compile JMH generated code.
-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<compilerVersion>${javac.target}</compilerVersion>
<source>${javac.target}</source>
<target>${javac.target}</target>
<compilerArgument>-proc:none</compilerArgument>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<executions>
<execution>
<id>run-benchmarks</id>
<phase>integration-test</phase>
<goals>
<goal>exec</goal>
</goals>
<configuration>
<classpathScope>test</classpathScope>
<executable>java</executable>
<arguments>
<argument>-classpath</argument>
<classpath />
<argument>org.openjdk.jmh.Main</argument>
<argument>.*</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame

I am running my Spark code to save data to HBase on Amazon EMR 5.8.0, which has Spark 2.2.0 installed.
Running in IntelliJ it works fine, but on the EMR cluster it throws this error:
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
Code
val zookeeperQuorum = args(0)
val tableName = args(1)
val inputPath = args(2)
val spark = SparkSession.builder
  .appName("PhoenixSpark")
  .getOrCreate
val df = spark.read
  .option("delimiter", "\001")
  .csv(inputPath)
val hBaseDf = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", tableName)
  .option("zkUrl", zookeeperQuorum)
  .load()
val tableSchema = hBaseDf.schema
val rowKeyDf = df.withColumn("row_key", concat(col("_c3"), lit("_"), col("_c5"), lit("_"), col("_c0")))
rowKeyDf.createOrReplaceTempView("mytable")
val correctedDf = spark.sql("Select row_key, _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7," +
  "_c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19 from mytable")
val rdd = correctedDf.rdd
val finalDf = spark.createDataFrame(rdd, tableSchema)
finalDf.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")
  .option("table", tableName)
  .option("zkUrl", zookeeperQuorum)
  .save()
spark.stop()
My pom.xml, which correctly specifies the Spark version as 2.2.0:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.myntra.analytics</groupId>
<artifactId>com.myntra.analytics</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- "package" command plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>4.11.0-HBase-1.3</version>
<scope>provided</scope>
</dependency>
</dependencies>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
Here is the stack trace from the EMR logs that shows this error:
17/09/28 23:20:18 ERROR ApplicationMaster: User class threw exception:
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1475)
at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:498)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:131)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:77)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:415)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at com.mynra.analytics.chronicles.PhoenixSpark$.main(PhoenixSpark.scala:29)
at com.mynra.analytics.chronicles.PhoenixSpark.main(PhoenixSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more
We encountered the same problem on Hortonworks HDP 2.6.3. The cause appears to be a classpath conflict for the org.apache.phoenix.spark classes. In HDP, the Spark 1.6 version of this package is included in phoenix-client.jar. You need to override it by placing the Spark2-specific plugin phoenix-spark2.jar in front:
/usr/hdp/current/spark2-client/bin/spark-submit --master yarn-client --num-executors 2 --executor-cores 2 --driver-memory 3g --executor-memory 3g \
  --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf" \
  --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-spark2.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/etc/hbase/conf" \
  --class com.example.test phoenix_test.jar
I ran into this same issue and was even seeing it with spark-shell, without any of my custom code. After some wrangling I think it is an issue with the Phoenix jars that are included with EMR 5.8 (and 5.9). I have no idea why their Phoenix client jar seems to have a class reference to org.apache.spark.sql.DataFrame, since in Spark 2.0.0 it was changed to be an alias for Dataset[Row]. (Especially since their Phoenix jars claim to be 4.11, which should have this issue fixed.)
Below is what I did to fix it for me. I suspect you could also monkey with your build so that it doesn't use the provided Phoenix but bundles your local version instead.
What I did to get around it:
I. I copied my local Phoenix client jar to S3 (I had a 4.10 version lying around).
II. I wrote a simple install shell script and also put that on S3:
aws s3 cp s3://<YOUR_BUCKET_GOES_HERE>/phoenix-4.10.0-HBase-1.2-client.jar /home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar
III. I created a bootstrap action that simply ran the shell script from step II.
IV. I created a JSON file that puts this jar onto the executor and driver classpaths for spark-defaults, and put it in S3 as well:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.executor.extraClassPath": "/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
"spark.driver.extraClassPath": "/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
}
}
]
When I went to create my cluster I referenced the full S3 path to my JSON file in the "Edit software settings (optional)" -> "Load JSON from S3" section of the AWS console.
After that I booted up my cluster and fired up a spark-shell. Below you can see the output from it, including the verbose classpath info showing my jar being used, and a DataFrame being successfully loaded.
[hadoop@ip-10-128-7-183 ~]$ spark-shell -v
Using properties file: /usr/lib/spark/conf/spark-defaults.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Adding default property: spark.sql.warehouse.dir=hdfs:///user/spark/warehouse
Adding default property: spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
Adding default property: spark.history.fs.logDirectory=hdfs:///var/log/spark/apps
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
Adding default property: spark.yarn.historyServer.address=ip-10-128-7-183.columbuschildrens.net:18080
Adding default property: spark.stage.attempt.ignoreOnDecommissionFetchFailure=true
Adding default property: spark.resourceManager.cleanupExpiredHost=true
Adding default property: spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS=$(hostname -f)
Adding default property: spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
Adding default property: spark.master=yarn
Adding default property: spark.blacklist.decommissioning.timeout=1h
Adding default property: spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
Adding default property: spark.sql.hive.metastore.sharedPrefixes=com.amazonaws.services.dynamodbv2
Adding default property: spark.executor.memory=6144M
Adding default property: spark.driver.extraClassPath=/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Adding default property: spark.eventLog.dir=hdfs:///var/log/spark/apps
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.executor.extraClassPath=/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Adding default property: spark.executor.cores=1
Adding default property: spark.history.ui.port=18080
Adding default property: spark.blacklist.decommissioning.enabled=true
Adding default property: spark.hadoop.yarn.timeline-service.enabled=false
Parsed arguments:
master yarn
deployMode null
executorMemory 6144M
executorCores 1
totalExecutorCores null
propertiesFile /usr/lib/spark/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath /home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
driverExtraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
driverExtraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass org.apache.spark.repl.Main
primaryResource spark-shell
name Spark shell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf:
(spark.blacklist.decommissioning.timeout,1h)
(spark.blacklist.decommissioning.enabled,true)
(spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.executor.memory,6144M)
(spark.sql.warehouse.dir,hdfs:///user/spark/warehouse)
(spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.yarn.historyServer.address,ip-10-128-7-183.columbuschildrens.net:18080)
(spark.eventLog.enabled,true)
(spark.history.ui.port,18080)
(spark.stage.attempt.ignoreOnDecommissionFetchFailure,true)
(spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f))
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.resourceManager.cleanupExpiredHost,true)
(spark.shuffle.service.enabled,true)
(spark.history.fs.logDirectory,hdfs:///var/log/spark/apps)
(spark.driver.extraJavaOptions,-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2)
(spark.eventLog.dir,hdfs:///var/log/spark/apps)
(spark.executor.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,1)
(spark.driver.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
Main class:
org.apache.spark.repl.Main
Arguments:
System properties:
(spark.blacklist.decommissioning.timeout,1h)
(spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.blacklist.decommissioning.enabled,true)
(spark.hadoop.yarn.timeline-service.enabled,false)
(spark.executor.memory,6144M)
(spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native)
(spark.sql.warehouse.dir,hdfs:///user/spark/warehouse)
(spark.yarn.historyServer.address,ip-10-128-7-183.columbuschildrens.net:18080)
(spark.eventLog.enabled,true)
(spark.history.ui.port,18080)
(spark.stage.attempt.ignoreOnDecommissionFetchFailure,true)
(spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f))
(SPARK_SUBMIT,true)
(spark.executor.extraJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.app.name,Spark shell)
(spark.resourceManager.cleanupExpiredHost,true)
(spark.shuffle.service.enabled,true)
(spark.history.fs.logDirectory,hdfs:///var/log/spark/apps)
(spark.driver.extraJavaOptions,-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p')
(spark.jars,)
(spark.submit.deployMode,client)
(spark.executor.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
(spark.eventLog.dir,hdfs:///var/log/spark/apps)
(spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2)
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,1)
(spark.driver.extraClassPath,/home/hadoop/phoenix-4.10.0-HBase-1.2-client.jar:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*)
Classpath elements:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/11 13:36:05 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/10/11 13:36:23 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/10/11 13:36:23 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
17/10/11 13:36:23 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://ip-10-128-7-183.columbuschildrens.net:4040
Spark context available as 'sc' (master = yarn, app id = application_1507728658269_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import org.apache.spark.sql.DataFrame
var sqlContext = new SQLContext(sc);
val phoenixHost = "10.128.7.183:2181"
// Exiting paste mode, now interpreting.
warning: there was one deprecation warning; re-run with -deprecation for details
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import org.apache.spark.sql.DataFrame
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@258ff54a
phoenixHost: String = 10.128.7.183:2181
scala> val variant_hg19_df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "VARIANT_ANNOTATION_HG19", "zkUrl" -> phoenixHost))
warning: there was one deprecation warning; re-run with -deprecation for details
variant_hg19_df: org.apache.spark.sql.DataFrame = [CHROMOSOME_ID: int, POSITION: int ... 36 more fields]

Typesafe Config: reference.conf values not overridden by application.conf

I am using a Typesafe application.conf to provide all external config in my Scala project. But when I create a shaded jar with maven-shade-plugin and run it, the conf somehow gets packaged into the jar itself and can no longer be overridden by changing the values in the external application.conf.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven.shade.plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <createDependencyReducedPom>false</createDependencyReducedPom>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>test.myproj.DQJobTrigger</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
I am not sure of the behaviour here, since I load all configs from the application.conf itself using ConfigFactory:
import java.io.File
import com.typesafe.config.ConfigFactory

trait Config {
  // CONF_PATH points at the external application.conf supplied at runtime
  private val dqConfig = System.getenv("CONF_PATH")
  private val config = ConfigFactory.parseFile(new File(dqConfig))
  private val sparkConfig = config.getConfig("sparkConf")
  private val mysqlConfig = config.getConfig("mysql")
}
CONF_PATH is set to the path where application.conf exists when running the jar.
application.conf
sparkConf = {
  master = "yarn-client"
}
mysql = {
  url = "jdbc:mysql://127.0.0.1:3306/test"
  user = "root"
  password = "MyCustom98"
  driver = "com.mysql.jdbc.Driver"
  connectionPool = disabled
  keepAliveConnection = true
}
So now, even if I change the properties in the external application.conf, it still picks up the configs that were present when the jar was packaged.
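To see which file Typesafe Config actually reads, I would add a check along these lines (just a sketch, not part of my project; it relies on the fact that ConfigFactory.parseFile only reads the external file, while ConfigFactory.load() resolves application.conf from the classpath, i.e. the copy shaded into the jar, plus reference.conf defaults):
import java.io.File
import com.typesafe.config.ConfigFactory

object ConfigCheck {
  def main(args: Array[String]): Unit = {
    // external file pointed to by CONF_PATH
    val external = ConfigFactory.parseFile(new File(sys.env("CONF_PATH")))
    println(external.origin().description())             // which file the values came from
    println(external.getConfig("mysql").getString("url"))

    // classpath version, i.e. whatever application.conf got packaged into the shaded jar
    val fromClasspath = ConfigFactory.load()
    println(fromClasspath.origin().description())
  }
}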