Cant find class in uber jar - scala

I am on Hortonworks Distribution 2.4 (effectively hadoop 2.7.1 and spark 1.6.1)
I am packaging my own version of spark in the uber jar (2.1.0) while cluster is on 1.6.1. In the process, i am sending all required libraries through a fat jar (built using maven - uber jar concept).
However, spark submit (through spark 2.1.0 client) fails citing NoClassFound Error on jersey client. Upon listing my uber jar contents, i can see the exact class file in the jar, still spark/yarn cant find it.
here goes -
The error message -
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:151)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
And here is my attempt to find the class in jar file -
jar -tf uber-xxxxx-something.jar | grep jersey | grep ClientCon
com/sun/jersey/api/client/ComponentsClientConfig.class
com/sun/jersey/api/client/config/ClientConfig.class
... Other files
what could be going on here ? Suggestions ? ideas please..
EDIT
the jersey client section of the pom goes here -
<dependency>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-client</artifactId>
<version>1.19.3</version>
</dependency>
EDIT
I also wanted to indicate this, that my code is compiled with Scala 2.12 with compatibility level set to 2.11. However, the cluster is perhaps on 2.10. I am saying perhaps since I believe that cluster nodes dont necessarily have to have Scala binaries installed; YARN just launches the components' jar/class files without using Scala binaries. wonder if thats playing a role here !!!

Related

Flink Scala Missing Import

In my Flink project I cannot find certain libraries for connectors (specifically I need to ingest a CSV once and read several TBs of parquet data in either batch or streaming mode). I think I have all the required packages, but I am still getting:
[ERROR] import org.apache.flink.connector.file.src.FileSource
[ERROR] ^
[ERROR] C:\Users\alias\project\...\MyFlinkJob.scala:46: error: not found: type FileSource
My POM.xml is rather large, but I think I have the relevant imports:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-parquet</artifactId>
<version>1.15.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_${scala.binary.version}</artifactId>
<version>1.11.6</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-hadoop-bulk_2.12</artifactId>
<version>1.14.6</version>
</dependency>
I am using the following versions:
<scala.version>2.12.16</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<log4j.version>2.17.1</log4j.version>
<flink.version>1.15.1</flink.version>
Do I need a different import path for Scala than Java?
I wish the Flink documentation had the imports in example code snippets as I spend a long time trying to figure out the imports. What are recommended ._ imports?
I've looked through the symbols in the package but didn't find FileSystem. I looked for different tutorials and example projects showing how to read/listen-to parquet and CSV files with Flink. I made some progress this way, but of the few examples I found in Scala (not Java) for using Parquet files as a source the imports still didn't work even after adding their dependencies and running mvn clean install.
I tried using GitHub's advance search to find a public Scala project using FileSource and eventually found one with the following dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-files</artifactId>
<version>${project.version}</version>
</dependency>
This package was missing on index.scala-lang.org where I thought I should be looking for dependencies (this is my first Scala project so I thought that was the place to find packages like PyPi in Python). It seems that MVN Repository may be a better place to look.
Flink 1.15 has a Scala-free classpath, which has resulted in a number of Flink artifacts no longer having a Scala suffix. You can read all about it in the dedicated Flink blog on this topic: https://flink.apache.org/2022/02/22/scala-free.html
You can also see in that blog how you can use any Scala version with Flink instead of being limited to Scala 2.12.6.
TL;DR: you should use the Java APIs in your application. The Scala APIs will also be deprecated as of Flink 1.17.
Last but not least: don't mix & match Flink version. That won't work.

Error through remote Spark Job: java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem

Problem
I am trying to run a remote Spark Job through IntelliJ with a Spark HDInsight cluster (HDI 4.0). In my Spark application I am trying to read an input stream from a folder of parquet files from Azure blob storage using Spark's Structured Streaming built in readStream function.
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4 prebuilt for Hadoop 2.7.
Fix
To fix this, I ssh into each head and worker node of the cluster and manually downgrade the Hadoop dependencies to 2.7.3 from 3.1.x to match the version in my local spark/jars folder. After doing this , I am then able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option as it is the only cluster that can support Spark 2.4.
Summary
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict instead of manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
After troubleshooting some previous methods I had attempted before, I've come across the following fix:
In my pom.xml I excluded the hadoop-client dependency automatically imported by the spark-core jar. This dependency was version 2.6.5 which conflicted with the cluster's version of Hadoop. Instead, I import the version I require.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version.major}</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependency>
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I modified the winutils.exe version I had under C://winutils/bin to be the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to successfully read data from blob storage as expected.
TLDR
Issue was the auto imported hadoop-client dependency which was fixed by excluding it & adding the new winutils.exe and hadoop.dll under C://winutils/bin.
This no longer required downgrading the Hadoop versions within the HDInsight cluster or changing my downloaded Spark version.
Problem:
I was facing same issue while running fat jar with hadoop 2.7 and spark 2.4 on cluster with hadoop 3.x ,
I was using maven shade plugin.
Observation:
While building fat jar it was including jar org.apache.hadoop:hadoop-hdfs:jar:2.6.5 which has class class org.apache.hadoop.hdfs.web.HftpFileSystem.
Which was causing problem in hadoop 3
Solution:
I have excluded this jar while building fat jar as below.Issue got resolved.

azure-eventhub jar upgrade required inside storm-eventhub jar

I use storm-eventhub jar for my project using maven artifact as follows
<groupId>org.apache.storm</groupId>
<artifactId>storm-eventhubs</artifactId>
<version>2.0.0</version>
storm-eventhub internally uses azure-eventhub version 0.13.1 which is old one.
Hence we are forced to use the same version of azure-eventhub jar in our project as well.
Now the requirement is that we have to upgrade to azure-eventhub version 2.3.2 but storm-eventhub classes fail with NoClassDef errors since many classes refer to 0.13.1 version of azure-eventhub.
Should I customize the classes myself OR can I raise a request to apache community to upgrade the azure-eventhub version inside storm-eventhub library. If so, what would be the ETA approximately.
There is a PR open to upgrade the client version for storm-eventhubs at https://github.com/apache/storm/pull/3004. Unfortunately it looks like the author didn't have time to finish it. You are welcome to pick it up.

How to build spark application using Scala IDE and Maven?

I'm new to Scala, Spark and Maven and would like to build spark application described here. It uses the Mahout library.
I have Scala IDE install and would like to use Maven to build the dependencies (which are the Mahout library as well as Spark lib). I couldn't find a good tutorial to start. Could someone help me figure it out?
First try compiling simple application with Maven in Scala IDE. The key of Maven project is directory structure and pom.xml. Although I don't use Scala IDE, this document seems helpful.
http://scala-ide.org/docs/tutorials/m2eclipse/
Next step is to add dependency on Spark in pom.xml you can follow this document.
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
For latest version of Spark and Mahout artifacts you can check them here:
http://mvnrepository.com/artifact/org.apache.spark
http://mvnrepository.com/artifact/org.apache.mahout
Hope this helps.
You need following tools to get started ( based on recent availability) -
Scala IDE for Eclipse – Download latest version of Scala IDE from
here.
Scala Version – 2.11 ( make sure scala compiler is set to
this version as well)
Spark Version 2.2 ( provided in maven
dependency)
winutils.exe
For running in Windows environment , you need hadoop binaries in
windows format. winutils provides that and we need to set
hadoop.home.dir system property to bin path inside which winutils.exe
is present. You can download winutils.exe here and place at path
like this – c:/hadoop/bin/winutils.exe
And, you can define Spark Core Dependency in your Maven POM.XML for your project, to get started with.
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
And in your Java/Scala class define this property, to run on your local environmet on Windows -
System.setProperty("hadoop.home.dir", "c://hadoop//");
More details and full setup details can be found here.

NoSuchMethod exception in Flink when using dataset with custom object array

I have a problem with Flink
java.lang.NoSuchMethodError: org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo.getInfoFor(Lorg/apache/flink/api/common/typeinfo/TypeInformation;)Lorg/apache/flink/api/java/typeutils/ObjectArrayTypeInfo;
at LowLevel.FlinkImplementation.FlinkImplementation$$anon$6.<init>(FlinkImplementation.scala:28)
at LowLevel.FlinkImplementation.FlinkImplementation.<init>(FlinkImplementation.scala:28)
at IRLogic.GmqlServer.<init>(GmqlServer.scala:15)
at it.polimi.App$.main(App.scala:20)
at it.polimi.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
the line with the problem is this one
implicit val regionTypeInformation =
api.scala.createTypeInformation[FlinkDataTypes.FlinkRegionType]
in the FlinkRegionType I have an Array of custom object
I developed the app with the maven plugin in the IDE and everything is working good, but when I move to the version I downloaded from the website I get the error above
I am using Flink 0.9
I was thinking that some library may be missing but I am using maven for handling everything. Moreover running through the code of ObjectArrayTypeInfo.java it doesn't seem to be the problem
A NoSuchMethodError commonly indicates a version mismatch between the libraries a Flink program was compiled with and the system the program is executed on. Especially if the same code works in an IDE setup where compile and execution libraries are the same.
In such case, you should check the version of the Flink dependencies, for example in the Maven POM file.