I am trying to connect to a file on another cluster through SFTP, and nothing has worked.
Current Spark version: 2.2.0.2.6.4.0-91; Scala version: 2.11.8.
Below is how I create the DataFrame:
val df_file_feed = spark.read
  .format("com.springml.spark.sftp")
  .option("host", "1-1111")
  .option("username", "user")
  .option("password", "pasword")
  .option("fileType", "csv")
  .load("/home/folder/Path_02.csv")
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: com.springml.spark.sftp.
I also tried these jars: spark-sftp_2.10-1.0.2.jar, spark-sftp_2.11-1.1.0.jar, spark-sftp_2.11-1.1.4.jar.
If you are using spark-shell, you can try launching it like this:
bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.3
Have a look at the Spark SFTP Connector Library, which states:
Linking
You can link against this library in your program in the following ways:
Maven Dependency
<dependency>
    <groupId>com.springml</groupId>
    <artifactId>spark-sftp_2.11</artifactId>
    <version>1.1.3</version>
</dependency>
SBT Dependency
libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.3"
Problem
I am trying to run a remote Spark job through IntelliJ with a Spark HDInsight cluster (HDI 4.0). In my Spark application I am trying to read an input stream from a folder of parquet files in Azure blob storage using Spark Structured Streaming's built-in readStream function.
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4 prebuilt for Hadoop 2.7.
Fix
To fix this, I ssh'd into each head and worker node of the cluster and manually downgraded the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I was able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster type that supports Spark 2.4.
Summary
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict instead of manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
After troubleshooting the methods I had attempted before, I came across the following fix:
In my pom.xml I excluded the hadoop-client dependency automatically pulled in by the spark-core jar. This dependency was version 2.6.5, which conflicted with the cluster's version of Hadoop. Instead, I import the version I require.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version.major}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I modified the winutils.exe version I had under C://winutils/bin to be the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to successfully read data from blob storage as expected.
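As an aside, a minimal sketch (an assumption about the local setup, not something from the original post) of pointing Hadoop at that winutils directory from application code, which has the same effect as setting the HADOOP_HOME environment variable; hadoop.dll may additionally need to be reachable via PATH/java.library.path:
// Must run before any Hadoop/Spark classes initialise; Hadoop's Shell utility
// looks up the "hadoop.home.dir" system property (or HADOOP_HOME) to locate
// %HADOOP_HOME%\bin\winutils.exe on Windows.
System.setProperty("hadoop.home.dir", "C:\\winutils")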
TLDR
The issue was the auto-imported hadoop-client dependency, which was fixed by excluding it and adding the new winutils.exe and hadoop.dll under C://winutils/bin.
This meant I no longer needed to downgrade the Hadoop versions within the HDInsight cluster or change the Spark version I had downloaded.
Problem:
I was facing the same issue while running a fat jar built with Hadoop 2.7 and Spark 2.4 on a cluster with Hadoop 3.x.
I was using the Maven Shade plugin.
Observation:
While building the fat jar, it was including org.apache.hadoop:hadoop-hdfs:jar:2.6.5, which contains the class org.apache.hadoop.hdfs.web.HftpFileSystem.
This was causing the problem on Hadoop 3.
Solution:
I excluded this jar while building the fat jar (the original snippet is not reproduced here), and the issue got resolved.
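A sketch of one way to do such an exclusion with the Maven Shade plugin: the artifactSet/excludes element keeps an artifact out of the shaded jar. The plugin version and the surrounding build section are assumptions, not the answerer's exact configuration.
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.4</version>
    <configuration>
        <artifactSet>
            <!-- keep the conflicting hadoop-hdfs 2.6.5 out of the fat jar -->
            <excludes>
                <exclude>org.apache.hadoop:hadoop-hdfs</exclude>
            </excludes>
        </artifactSet>
    </configuration>
</plugin>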
I have a program, testApp.scala, that connects to a Neo4j database and runs on Spark. I package it into a.jar with its dependencies using sbt package, according to this_contribution (I already have the neo4j-spark-connector-2.0.0-M2.jar):
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
However, when I try spark-submit --class "testApp" a.jar, it fails with a NoClassDefFoundError:
Exception in thread "main" java.lang.NoClassDefFoundError: org/neo4j/spark/Neo4j$
at the line val n = Neo4j(sc).
There are 2 more things I have to mention:
1) I used jar vtf to check the contents of a.jar; it only has testApp.class, and no Neo4j class is in it, but the packaging process succeeded (does this mean neo4j-spark-connector-2.0.0-M2.jar was not packaged in?)
2) I can use spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2 and type in the code from testApp.scala, and there is no problem (e.g. the failing line above is val n = Neo4j(sc), but it works in spark-shell)
You may try using the --jars option with spark-submit. For example:
./bin/spark-submit --class "fully-qualified-class-name" --master "master-url" --jars "path-of-your-dependency-jar"
Or you can use spark.driver.extraClassPath="jars-class-path" to solve the issue. Hope this helps.
Since the .jar does not contain the Neo4j classes, it is a packaging problem.
What we should change is the sbt command: instead of sbt package, we should use sbt clean assembly. This creates a .jar that contains all the dependencies.
If you use only sbt package, compilation succeeds, but it will not pack neo4j-*.jar into your .jar, so at runtime it throws a NoClassDefFoundError.
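For reference, a minimal build.sbt sketch for that approach (the Spark version and the "provided" scoping are assumptions; sbt-assembly itself must also be enabled via an addSbtPlugin line in project/plugins.sbt):
// build.sbt -- a sketch, not the asker's exact build
name := "testApp"
scalaVersion := "2.11.8"

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster at runtime, so it is not bundled
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  // the connector is bundled into the assembly jar
  "neo4j-contrib" % "neo4j-spark-connector" % "2.0.0-M2"
)
Running sbt clean assembly then produces an assembly jar that can be passed to spark-submit in place of a.jar.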
I'm loading Spark from an IntelliJ project without an installed Spark distribution.
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
I've added com.databricks.spark.xml to Spark by using
sparkConf.set("spark.driver.extraClassPath", "C:/.../spark-xml_2.11-0.4.1.jar")
sparkConf.setExecutorEnv("spark.driver.extraClassPath", "C:/.../spark-xml_2.11-0.4.1.jar")
sparkConf.set("spark.executor.extraClassPath", "C:/.../spark-xml_2.11-0.4.1.jar")
sparkConf.setExecutorEnv("spark.executor.extraClassPath", "C:/.../spark-xml_2.11-0.4.1.jar")
sparkConf.setJars(Array("C:/.../spark-xml_2.11-0.4.1.jar" ))
and with
spark.sparkContext.addJar("C:/.../spark-xml_2.10-0.2.0.jar")
but when trying to use spark.read.format("com.databricks.spark.xml") I get the exception "Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html".
How do I fix this? I know it is recommended to add jars using spark-shell, but I do not have a spark-shell, as I don't have Spark installed...
If you have a Maven/SBT project, you can add the spark-xml dependency as shown below:
<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.11</artifactId>
    <version>0.4.1</version>
</dependency>
Ref: https://mvnrepository.com/artifact/com.databricks/spark-xml_2.11/0.4.1
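For an SBT-based IntelliJ project, a minimal sketch of the equivalent setup and usage (the rowTag value and input path are placeholders, not from the question):
// build.sbt -- SBT equivalent of the Maven dependency above
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"

// application code -- once the dependency is on the build classpath,
// the data source resolves without any extraClassPath/setJars workarounds
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")   // placeholder: the XML element that wraps one record
  .load("C:/.../input.xml")     // placeholder path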
I am on Hortonworks Distribution 2.4 (effectively Hadoop 2.7.1 and Spark 1.6.1).
I am packaging my own version of Spark (2.1.0) in the uber jar while the cluster is on 1.6.1. In the process, I am sending all required libraries through a fat jar (built using Maven, following the uber jar concept).
However, spark-submit (through the Spark 2.1.0 client) fails, citing a NoClassDefFoundError on the Jersey client. Upon listing my uber jar contents, I can see the exact class file in the jar, yet Spark/YARN can't find it.
Here is the error message:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:151)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
And here is my attempt to find the class in the jar file:
jar -tf uber-xxxxx-something.jar | grep jersey | grep ClientCon
com/sun/jersey/api/client/ComponentsClientConfig.class
com/sun/jersey/api/client/config/ClientConfig.class
... Other files
What could be going on here? Suggestions or ideas, please.
EDIT
The jersey-client section of the pom is as follows:
<dependency>
    <groupId>com.sun.jersey</groupId>
    <artifactId>jersey-client</artifactId>
    <version>1.19.3</version>
</dependency>
EDIT
I also want to point out that my code is compiled with Scala 2.12 with the compatibility level set to 2.11. However, the cluster is perhaps on 2.10. I say perhaps because I believe cluster nodes don't necessarily need Scala binaries installed; YARN just launches the components' jar/class files without using Scala binaries. I wonder if that is playing a role here.
I am trying to read data from Redshift into Spark 1.5 using Scala 2.10.
I have built the spark-redshift package and added the Amazon JDBC connector to the project, but I keep getting this error:
Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials
I have authenticated in the following way:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "ACCESSKEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey","SECRETACCESSKEY")
val df: DataFrame = sqlContext.read.format("com.databricks.spark.redshift")
.option("url","jdbc:redshift://AWS_SERVER:5439/warehouseuser=USER&password=PWD")
.option("dbtable", "fact_time")
.option("tempdir", "s3n://bucket/path")
.load()
df.show()
Concerning your first error, java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentials, I repeat what I said in the comment:
You have forgotten to ship your AWS dependency jar in your Spark app jar.
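For example, something along these lines in the pom (a sketch; the bundled aws-java-sdk artifact and version are assumptions, so pick whatever matches your spark-redshift build):
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <!-- provides com.amazonaws.auth.AWSCredentials; version is an assumption -->
    <version>1.7.4</version>
</dependency>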
And about the second error: I'm not sure of the exact package, but it is most likely the org.apache.httpcomponents library you need. (I don't know what you are using it for, though!)
You can add the following to your Maven dependencies:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.3</version>
</dependency>
And you'll need to assemble the whole thing into your application jar.
PS: You'll always need to provide libraries that are not installed on the cluster. You must also be careful with the size of the jar you are submitting, because an oversized jar can hurt performance.