Spark write to S3 bucket giving java.lang.NoClassDefFoundError - scala

I'm trying to integrate Spark 2.3.0 running on my Mac with S3. I can read/write to S3 without any problem using spark-shell. But when I try to do the same using a little Scala program that I run via sbt, I get
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider.
I have installed hadoop-aws 3.0.0-beta1.
I have also set s3 access information in spark-2.3.0/conf/spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key XXXX
spark.hadoop.fs.s3a.secret.key YYYY
spark.hadoop.com.amazonaws.services.s3.enableV4 true
spark.hadoop.fs.s3a.endpoint s3.us-east-2.amazonaws.com
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.fs.s3a.encryption.enabled true
spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256
The program compiles fine using sbt version 0.13.
name := "S3Test"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.0.0-beta1"
The scala code is:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.amazonaws._
import com.amazonaws.auth._
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import java.io._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.s3a.S3AFileSystem
object S3Test {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder().master("local").appName("Spark AWS S3 example").getOrCreate()
    import spark.implicits._
    val df = spark.read.text("test.txt")
    df.take(5)
    df.write.save(<s3 bucket>)
  }
}
I have set environment variables for JAVA_HOME, HADOOP_HOME, SPARK_HOME, CLASSPATH, SPARK_DIST_CLASSPATH, etc. But nothing lets me get past this error message.

You can't mix hadoop-* JARs; they all need to be in perfect sync. Which means: cut all the Hadoop 2.7 artifacts and replace them.
FWIW, there isn't a significant difference between Hadoop 2.8 and Hadoop 3.0-beta-1 in terms of AWS support, other than the S3Guard DynamoDB directory service (performance and listing through DynamoDB), so unless you need that feature, Hadoop 2.8 is going to be adequate.
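For illustration, a minimal build.sbt sketch of one consistent combination, keeping every hadoop-* artifact on the Hadoop line the stock Spark 2.3.0 distribution already bundles (assumed here to be 2.7.x; adjust the hadoop-aws version to whatever Hadoop version your Spark was actually built with):
name := "S3Test"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core" % "2.3.0",
  "org.apache.spark"  %% "spark-sql"  % "2.3.0",
  // hadoop-aws must match the Hadoop version bundled with Spark (2.7.3 here is an assumption)
  "org.apache.hadoop" %  "hadoop-aws" % "2.7.3"
)
hadoop-aws 2.7.x should pull in the matching aws-java-sdk transitively, so the SDK does not need to be declared separately.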

Related

How to resolve "Failed to load class" with Spark 3 on EMR for Scala object

I'm trying to build a simple Scala-based Spark application and run it in EMR, but when I run it, I get Error: Failed to load class: com.myorganization.MyScalaObj. My Scala file is:
package com.myorganization
import org.apache.spark.sql.SparkSession
object MyScalaObj extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("myTestApp")
    .getOrCreate()
  val df = spark.read.csv("s3://my_bucket/foo.csv")
  df.write.parquet("s3://my_bucket/foo.parquet")
}
To the stock build.sbt file, I added a few lines including the Scala version, Spark library dependencies, and mainClass (which I found from this question).
name := "sbtproj"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql" % "3.0.0"
)
mainClass in (Compile, run) := Some("com.myorganization.MyScalaObj")
I build this and get a MyScalaObj.class which I am manually packaging into a jar with jar cf MyScalaObj.jar MyScalaObj.class. I copied this to my EMR cluster running Spark 3.0.0 and Scala 2.12.10.
I then tried to run my application with spark-submit --class com.myorganization.MyScalaObj MyScalaObj.jar --deploy-mode cluster --master spark://x.x.x.x, but it fails with Error: Failed to load class com.myorganization.MyScalaObj.
As this whole process is quite new to me, I'm not sure whether the error is in my sbt config (I don't know sbt at all), in the Scala object itself, something missing (e.g., a manifest?), or in how I'm invoking Spark. What's the likely cause of my error here?
It turns out my problem was in how I was building my jar file. Having not done Java for many years, I forgot that the qualified class name -- in this case, com.myorganization.MyScalaObj -- needs to be reflected in the jar's directory structure. I was running jar cf MyScalaObj.jar MyScalaObj.class, but I should have been two directories up, running jar cf MyScalaObj.jar com/.
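For illustration, a sketch of that packaging step; the target/scala-2.12/classes path is just the standard sbt output location and is an assumption about this particular build:
cd target/scala-2.12/classes   # the directory that contains the com/ package root
jar cf MyScalaObj.jar com/     # packages the full com/myorganization/... tree
Alternatively, sbt package builds a jar with the correct directory structure for you.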

Trying to integrate mongoDB and spark, keep having errors related to "could not find or load class"

So I've been trying to integrate MongoDB and Spark. After reading about all the dependencies, I'm running:
JDK 1.8
scala-sdk-2.11.7
Spark version 2.0.2
This is how my build.sbt file looks:
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.mongodb.spark" %% "mongo-spark-connector" % "2.0.0",
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.0"
)
Every time I run this simple script:
import com.mongodb.spark._
import org.apache.log4j._
object myMain {
  def main(args: Array[String]): Unit = {
    println("lol")
  }
}
It says "could not find or load main class myMain".
All I'm trying to see is whether the dependencies are fine and whether the script will import the libraries and work. Please help; I've been reading about this for at least a day and cannot find any concrete tips except "compatibility problems".
newProject ->
  src ->
    project (src-build),
    build.sbt,
    myMain$
This is how my project tree in IntelliJ looks, in external libraries I have all libs downloaded by build.sbt file and the scala and java dependencies like I wrote above.
Thank you!
The answer was to change from the IntelliJ IDE to Eclipse.
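Independently of which IDE is used, whether the dependencies resolve and the main class is found can be sanity-checked from the command line with sbt alone. A minimal sketch, assuming the standard sbt layout with the source under src/main/scala:
sbt clean compile     # verifies that the mongo-spark-connector and Spark dependencies resolve
sbt "runMain myMain"  # should print "lol" if the class is found on the classpath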

How to install spark on Windows 7

I'm in an internship position and have limited rights to download or install anything on the system. My problem is how to install Apache Spark on Windows 7 (I have always used it via Hortonworks in a VM, but in my internship I don't have the right to install a VM or Hortonworks). I have searched the forum and think I can use Eclipse, import Spark, and install the Scala IDE (Scala is my preferred language with Spark), but I couldn't arrive at a solution.
Can you please give me any suggestion or idea?
I used this guide and it works just fine.
www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf
If you want to launch a Spark job in local mode from your IDE (Eclipse or IntelliJ), just:
download the IDE
download the corresponding Scala plugin
download the SBT plugin
create an SBT project
in the build.sbt, add the Spark dependencies:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
Then you can write a Scala main class named Main.scala:
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("HbaseWriteTest")
      .getOrCreate()
    // ... your Spark code here ...
  }
}
Execute the Main class and it will run the Spark job in local mode.
Afterwards, if you want to run a Spark cluster on your local machine, you can follow the official documentation here: https://spark.apache.org/docs/latest/spark-standalone.html
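For instance, the class can be executed straight from sbt, with no spark-submit involved (a sketch; it assumes Main lives in the default package, as written above):
sbt "runMain Main"   # or simply: sbt run, since the project has only one main class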

Why does import ...RandomForest give "object RandomForest is not a member of package org.apache.spark.mllib.tree"?

I worked with the spark implementation of Random Forest in the shell, and this import runs fine:
import org.apache.spark.mllib.tree.RandomForest
However, when I try to compile it as a standalone file, it fails. The exact error is:
5: object RandomForest is not a member of package org.apache.spark.mllib.tree
I have included mllib in my sbt file too, so can someone please tell me where this error arises? My code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.tree.RandomForest
My sbt file:
name := "churn"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.5.2" % "provided",
  "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
)
Edit:
My-MBP:Churn admin$ sbt 'show libraryDependencies'
[info] Set current project to churn (in build file:/Users/admin/Desktop/Churn/)
[info] List(org.scala-lang:scala-library:2.10.4, org.apache.spark:spark-core_2.10:1.1.0, org.apache.spark:spark-mllib_2.10:1.1.0)
My-MBP:Churn admin$ sbt scalaVersion
[info] Set current project to churn (in build file:/Users/admin/Desktop/Churn/)
[info] 2.10.4
tl;dr Use Spark 1.2.0 or later.
According to the history of org/apache/spark/mllib/tree/RandomForest.scala on GitHub, the first version that supports Random Forest is 1.2.0 (see the tags the file was tagged with).
Even though you've shown that your build.sbt declares 1.5.2, the output of sbt 'show libraryDependencies' doesn't confirm it; it says:
org.apache.spark:spark-mllib_2.10:1.1.0
1.1.0 is the effective version of Spark MLlib you use in your project. That version has no support for Random Forest.
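As a concrete check (a sketch; it assumes no other .sbt file or project/*.scala build definition also declares Spark dependencies), the declared and effective versions should line up:
name := "churn"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10"  % "1.5.2" % "provided",
  "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
)
After an sbt reload clean update, sbt 'show libraryDependencies' should report spark-mllib_2.10:1.5.2; if it still reports 1.1.0, some other build file is declaring the old version.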

Is it possible to use json4s 3.2.11 with Spark 1.3.0?

Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR, it provides me with 3.2.10.
build.sbt
import sbt.Keys._
name := "sparkapp"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"`
plugins.sbt
logLevel := Level.Warn
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
App1.scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}
sbt 0.13.7 + sbt-assembly 0.13.0
Scala 2.10.4
Is there a way to force the 3.2.11 version to be used?
We ran into a problem similar to the one Necro describes, but downgrading from 3.2.11 to 3.2.10 when building the assembly jar did not resolve it. We ended up solving it (using Spark 1.3.1) by shading the 3.2.11 version in the job assembly jar:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.#1").inAll
)
I asked the same question on the Spark User mailing list and got two answers on how to make it work:
Use the spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true options (see the spark-submit sketch below), but this works only in Spark 1.3 and will probably require some other modifications, like excluding Scala classes from your build.
Rebuild Spark with json4s version 3.2.11 (you can change it in core/pom.xml).
Both work fine; I preferred the second one.
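For the first option, a sketch of the submit command; the assembly jar path and main class are assumptions based on the build shown above (name sparkapp, version 1.0, Scala 2.10, sbt-assembly defaults):
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class App1 \
  target/scala-2.10/sparkapp-assembly-1.0.jar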
This is not an answer to your question, but it came up when I was searching for my problem. I was getting a NoSuchMethod exception in formats.emptyValueStrategy.replaceEmpty(value) in json4s's render. The reason was that I was building with 3.2.11 but Spark was linking 3.2.10. I downgraded to 3.2.10 and my problem went away. Your question helped me understand what was going on (that Spark was linking a conflicting version of json4s) and I was able to resolve the problem, so thanks.
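In build.sbt terms, that downgrade is just a one-line change (a sketch):
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.10"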