How to install Spark on Windows 7 - Eclipse

I'm in an internship and have only limited rights to download and install software on the system. My problem is how to install Apache Spark on Windows 7 (I have always used it through Hortonworks in a VM, but in my internship I'm not allowed to install a VM or Hortonworks). From searching the forum, I think I can use Eclipse, import Spark and install the Scala IDE (Scala is my preferred language with Spark), but I haven't managed to reach a working solution.
Can you please give me any suggestions or ideas?

I used this guide and it works just fine.
www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf

If you want to launch a Spark job in local mode from your IDE (Eclipse or IntelliJ), just:
download the IDE
download the corresponding Scala plugin
download the SBT plugin
create a SBT project
in the build.sbt, add the Spark dependencies (a fuller build.sbt sketch follows after these two lines):
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
Then you can write a Scala main class named Main.scala:
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // "local" runs the job inside this JVM, so no cluster is needed
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("HbaseWriteTest")
      .getOrCreate()
    ...
  }
}
Run the Main class and it will execute the Spark job in local mode.
Later, if you want to run a Spark cluster on your local machine, you can follow the official documentation here: https://spark.apache.org/docs/latest/spark-standalone.html
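Once a standalone master is up, the only change in the code is the master URL. A hedged sketch (spark://localhost:7077 is the default standalone master address; adjust host and port to your setup):
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("spark://localhost:7077") // placeholder master URL instead of "local"
  .appName("HbaseWriteTest")
  .getOrCreate()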

Related

Spark ignoring package jars included in the configuration of my Spark Session

I keep running into a java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html error.
I am trying to include the org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0 package as part of my Spark code. The reason is that I want to write unit tests locally. I have tried several things:
Include the package as part of my SparkSession builder:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0")

val sparkSession: SparkSession =
  SparkSession
    .builder()
    .appName(getClass.getSimpleName)
    .config(conf = conf)
    // ... the rest of my config
    .master("local[*]")
    .getOrCreate()
This does not work; I get the same error. I also tried putting the configuration string directly in the SparkSession builder, and that didn't work either.
Downloading the jar myself. I really don't want to do this; I want it to be automated. But even then, when I set "spark.jars" to point to the downloaded jar, it cannot find it for some reason.
Can anybody help me figure this out?
You can create an uber/fat jar and put all your dependencies in that jar.
Let's say you want to use Iceberg in your Spark application.
Create a pom.xml file and add the dependency to the dependencies section.
<dependencies>
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-runtime-3.2_2.12</artifactId>
    <version>1.1.0</version>
  </dependency>
</dependencies>
Building the project with a fat-jar plugin (for example maven-shade or maven-assembly) will then create a fat jar with that dependency baked into it.
You can deploy that jar via spark-submit and the dependent libraries will be picked up automatically.
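If the project is built with sbt rather than Maven, the sbt-assembly plugin (also used in an answer further down) produces the same kind of fat jar. A minimal sketch, assuming Spark 3.2.x to match the Iceberg runtime from the question; the plugin and Spark versions here are assumptions:
// project/plugins.sbt -- plugin version is an assumption; any recent release works
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt -- Spark is "provided" at runtime, the Iceberg runtime is baked into the jar
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql" % "3.2.1" % "provided",
  "org.apache.iceberg"  % "iceberg-spark-runtime-3.2_2.12" % "1.1.0"
)
Running sbt assembly then writes the fat jar under target/scala-2.12/, ready for spark-submit.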
It seems spark.jars.packages is only read when spark-shell starts up. That means you can still change it inside the spark-shell session via SparkSession or SparkConf, but the new value will not be processed or loaded.
For a self-contained Scala application, you can add the following dependencies to the build.sbt instead:
libraryDependencies ++= Seq(
  "org.mongodb.spark" %% "mongo-spark-connector" % "10.0.5",
  "org.apache.spark"  %% "spark-core" % "3.0.2",
  "org.apache.spark"  %% "spark-sql"  % "3.0.2"
)
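Applying the same idea to the Iceberg case from the question: once iceberg-spark-runtime-3.2_2.12 is on the compile/test classpath as a regular dependency, a local SparkSession only needs the catalog configuration. A hedged sketch; the catalog name "local" and the warehouse path are placeholders:
import org.apache.spark.sql.SparkSession

// Assumes the Iceberg runtime jar is a build dependency, not a spark.jars.packages entry
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("IcebergLocalTest")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()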

How to resolve "Failed to load class" with Spark 3 on EMR for Scala object

I'm trying to build a simple Scala-based Spark application and run it in EMR, but when I run it, I get Error: Failed to load class: com.myorganization.MyScalaObj. My Scala file is:
package com.myorganization
import org.apache.spark.sql.SparkSession
object MyScalaObj extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("myTestApp")
    .getOrCreate()

  val df = spark.read.csv("s3://my_bucket/foo.csv")
  df.write.parquet("s3://my_bucket/foo.parquet")
}
To the stock build.sbt file, I added a few lines including the Scala version, Spark library dependencies, and mainClass (which I found from this question).
name := "sbtproj"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0",
  "org.apache.spark" %% "spark-sql"  % "3.0.0"
)
mainClass in (Compile, run) := Some("com.myorganization.MyScalaObj")
I build this and get a MyScalaObj.class which I am manually packaging into a jar with jar cf MyScalaObj.jar MyScalaObj.class. I copied this to my EMR cluster running Spark 3.0.0 and Scala 2.12.10.
I then tried to run my application with spark-submit --class com.myorganization.MyScalaObj MyScalaObj.jar --deploy-mode cluster --master spark://x.x.x.x, but it fails with Error: Failed to load class com.myorganization.MyScalaObj.
As this whole process is quite new to me, I'm not sure whether the error is in my sbt config (I don't know sbt at all), with the Scala object itself, something missing (eg, a manifest?), or in how I'm invoking Spark. What's the likely cause of my error here?
It turns out my problem was in how I'm building my jar file. Having not done Java for many years, I forgot that the qualified class name -- in this case, com.myorganization.MyScalaObj -- needs to be reflected in the directory structure. I was running jar cf MyScalaObj.jar MyScalaObj.class, but I should have been up two directories, running jar cf MyScalaObj.jar com/.
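An alternative that sidesteps the hand-built directory layout is to let sbt package the jar. A hedged sketch (the setting is standard sbt; the class and project names come from the question):
// build.sbt -- record the main class in the manifest of the jar built by `sbt package`
mainClass in (Compile, packageBin) := Some("com.myorganization.MyScalaObj")
// `sbt package` then emits target/scala-2.12/sbtproj_2.12-0.1.jar with the
// com/myorganization/... directory structure already in place.
The manifest entry is optional when you pass --class to spark-submit, but letting sbt lay out the package directories avoids this class of error entirely.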

How do you properly set up Scala Spark libraryDependencies with the correct version of Scala?

I'm new to Scala Spark and I'm trying to create an example project using IntelliJ. During project creation I chose Scala and sbt with Scala version 2.12, but when I tried adding spark-streaming version 2.3.2 it kept erroring out, so I Googled around and found the sbt config shown below on Apache's website, and I'm still getting the same error.
Error: Could not find or load main class SparkStreamingExample
Caused by: java.lang.ClassNotFoundException: SparkStreamingExample
How can I determine which version of Scala works with which version of the Spark dependencies?
name := "SparkStreamExample"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.11" % "2.3.2"
)
My object class is very basic and doesn't have much to it...
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext

object SparkStreamingExample extends App {
  println("SPARK Streaming Example")
}
You can see the version of Scala that is supported by Spark in the Spark documentation.
As of this writing, the documentation says:
Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.2 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
Notice that only Scala 2.11.x is supported.
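To make the pairing concrete, a minimal build.sbt sketch (2.11.12 is just one of the 2.11.x patch releases; any of them should do). Using %% lets sbt append the right Scala suffix automatically, so the suffix can never drift out of sync with scalaVersion:
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.2"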

Spark write to S3 bucket giving java.lang.NoClassDefFoundError

I'm trying to integrate Spark 2.3.0 running on my Mac with S3. I can read/write to S3 without any problem using spark-shell. But when I try to do the same using a little Scala program that I run via sbt, I get
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider.
I have installed hadoop-aws 3.0.0-beta1.
I have also set s3 access information in spark-2.3.0/conf/spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key XXXX
spark.hadoop.fs.s3a.secret.key YYYY
spark.hadoop.com.amazonaws.services.s3.enableV4 true
spark.hadoop.fs.s3a.endpoint s3.us-east-2.amazonaws.com
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.fs.s3a.encryption.enabled true
spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256
The program compiles fine using sbt version 0.13.
name := "S3Test"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.0.0-beta1"
The Scala code is:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.amazonaws._
import com.amazonaws.auth._
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model._
import java.io._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.s3a.S3AFileSystem

object S3Test {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder().master("local").appName("Spark AWS S3 example").getOrCreate()
    import spark.implicits._
    val df = spark.read.text("test.txt")
    df.take(5)
    df.write.save(<s3 bucket>)
  }
}
I have set environment variables for JAVA_HOME, HADOOP_HOME, SPARK_HOME, CLASSPATH, SPARK_DIST_CLASSPATH, etc. But nothing lets me get past this error message.
You can't mix hadoop-* JARs; they all need to be in perfect sync. Which means: cut all the Hadoop 2.7 artifacts and replace them.
FWIW, there isn't a significant difference between Hadoop 2.8 and Hadoop 3.0-beta-1 in terms of AWS support, other than the S3Guard DDB directory service (performance and listing through DynamoDB), so unless you need that feature, Hadoop 2.8 is going to be adequate.
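A sketch of what "in sync" looks like in build.sbt, assuming the hadoop-* version that spark-core 2.2.0 pulls in is 2.7.x; that version is an assumption, so check which hadoop-client actually lands on your classpath and match hadoop-aws to it:
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core" % "2.2.0",
  "org.apache.spark"  %% "spark-sql"  % "2.2.0",
  // hadoop-aws must match the hadoop-* version already on the classpath; 2.7.3 is an assumption
  "org.apache.hadoop"  % "hadoop-aws" % "2.7.3"
)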

Is it possible to use json4s 3.2.11 with Spark 1.3.0?

Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR, it provides me with 3.2.10.
build.sbt
import sbt.Keys._
name := "sparkapp"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"`
plugins.sbt
logLevel := Level.Warn
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
App1.scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}
sbt 0.13.7 + sbt-assembly 0.13.0
Scala 2.10.4
Is there a way to force 3.2.11 version usage?
We ran into a problem similar to the one Necro describes, but downgrading from 3.2.11 to 3.2.10 when building the assembly jar did not resolve it. We ended up solving it (using Spark 1.3.1) by shading the 3.2.11 version in the job assembly jar:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)
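With this rule in place, the assembly relocates the job's json4s 3.2.11 classes under shaded.json4s, so they no longer clash with the 3.2.10 copy that Spark itself puts on the classpath.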
I asked the same question on the Spark User mailing list and got two answers on how to make it work:
Use the spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true options, but this only works in Spark 1.3 and will probably require some other modifications, such as excluding Scala classes from your build.
Rebuild Spark with json4s version 3.2.11 (you can change it in core/pom.xml).
Both work fine; I preferred the second one.
This is not an answer to your question, but it came up when I was searching for my problem. I was getting a NoSuchMethod exception in formats.emptyValueStrategy.replaceEmpty(value) in json4s's render. The reason was that I was building with 3.2.11 while Spark was linking against 3.2.10. I downgraded to 3.2.10 and my problem went away. Your question helped me understand what was going on (that Spark was linking a conflicting version of json4s) and I was able to resolve the problem, so thanks.
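For reference, the downgrade that worked for me is a one-line change in build.sbt (a sketch; json4s-native is the module the question already uses):
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.10"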