Using s3n:// with spark-submit - scala

I've written a Spark app that is to run on a cluster using spark-submit. Here's part of my build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided" exclude("asm", "asm")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
asm is excluded because I'm using another library that depends on a different version of it. The asm dependency in Spark seems to come in through one of Hadoop's dependencies, and I'm not using that functionality.
The problem now is that with this setup, saveAsTextFile("s3n://my-bucket/dir/file") throws java.io.IOException: No FileSystem for scheme: s3n.
Why is this happening? Shouldn't spark-submit provide the Hadoop dependencies?
I've tried a few things:
Leaving out "provided"
Putting hadoop-aws on the classpath via a jar plus spark.executor.extraClassPath and spark.driver.extraClassPath. This requires doing the same for all of its transitive dependencies, though, which can be painful.
Neither really works. Is there a better approach?
I'm using the pre-built spark-1.6.1-bin-hadoop2.6.
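For concreteness, what I've been aiming for is something like this; the version number, config keys, and credential source below are my guesses matched to the bundled Hadoop 2.6, not a verified setup:

// build.sbt: pull in hadoop-aws at the version matching the bundled Hadoop 2.6,
// excluding asm again so it doesn't clash with the other library
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("asm", "asm")

// In the application, map the s3n scheme to Hadoop's NativeS3FileSystem and
// supply credentials (here read from environment variables as an example)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))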

Related

IntelliSense in IntelliJ with Spark libraries

I've created a Spark project in IntelliJ and IntelliSense (code completion) doesn't work. It works for standard Scala libraries, but not for Spark. Any ideas what could be wrong? (I've already done Invalidate Caches and restarted.)
build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0"
)
Your build.sbt needs to be in the root folder of the project.
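For reference, a minimal layout that IntelliJ's sbt import usually resolves without trouble looks like this; the project name is made up, and 2.11.8 is just one Scala version that Spark 2.1.0 is published for:

// <project root>/build.sbt  (next to the src/ directory, not inside it)
name := "spark-intellisense-check"   // hypothetical project name
scalaVersion := "2.11.8"             // Spark 2.1.0 artifacts exist for Scala 2.10 and 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0"
)

// After editing, refresh the project from the sbt tool window so IntelliJ
// re-imports the dependencies and code completion picks them up.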

Finding Spark Scala packages

I'm sure this is simpler than it looks, but I'm willing to look dumb.
I'm working my way through some Scala/Spark examples, which occasionally call for adding library dependencies, e.g.,
libraryDependencies ++= Seq(
scalaTest % Test,
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-mllib" % "2.2.0"
)
The question is, how do you find the appropriate names and versions for the libraries? It seems the texts all give import statements; there has to be some kind of registry or something. But where?
The correct version of a library can always be found by searching mvnrepository. If you are trying to use a version from a proprietary distribution, you need to add that distribution's repository, for example:
Cloudera repository
MapR repository
hdp_maven_artifacts
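In sbt terms that means copying the coordinates mvnrepository lists for the artifact, and adding a resolver only when the artifact comes from a vendor build; the Cloudera URL below is the commonly cited one and should be checked against the distribution's documentation:

// coordinates exactly as listed on mvnrepository; %% appends the Scala suffix for you
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0"

// only needed for vendor builds (e.g. a CDH-flavoured Spark)
resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"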

Check Spark packages version [duplicate]

This question already has an answer here:
sbt unresolved dependency for spark-cassandra-connector 2.0.2
I am trying to set up my first Scala project with IntelliJ IDEA on Ubuntu 16.04. I need the Spark library, and I think I have installed it correctly on my machine; however, I am not able to reference it in the project dependencies. In particular, I have added the following code to my build.sbt:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core" % "2.1.1",
"org.apache.spark" % "spark-sql" % "2.1.1")
However, sbt complains that it cannot find the packages (Unresolved Dependencies error: org.apache.spark#spark-core;2.1.1: not found and org.apache.spark#spark-sql;2.1.1: not found).
I think that the versions of the packages are incorrect (I copied the previous code from the web, just to try).
How can I determine the correct packages versions?
If you use %, you have to spell out the full artifact name, including the Scala version suffix:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "2.1.1",
"org.apache.spark" % "spark-sql_2.10" % "2.1.1")
And if you don't want to write the Scala version suffix yourself and would rather let sbt append the correct one, use %%:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.1",
"org.apache.spark" %% "spark-sql" % "2.1.1")
You can check the installed Spark version by running
spark-submit --version
and then look up the matching artifact on the Maven repository.
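To make the difference concrete, here is a minimal build.sbt where the suffix chosen by %% lines up with the declared Scala version; 2.11.8 is just an example, and any Scala 2.11.x release works with Spark 2.1.1:

scalaVersion := "2.11.8"   // the suffix sbt appends below is derived from this

libraryDependencies ++= Seq(
  // %% expands to "spark-core_2.11" because scalaVersion is 2.11.x
  "org.apache.spark" %% "spark-core" % "2.1.1",
  // equivalent, written out with a single % and an explicit suffix
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1"
)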

Spark with IntelliJ or Eclipse

I am trying to set up IntelliJ for Spark with Scala 2.11, but it is very daunting, and after days I have not been able to compile even a simple call such as spark.read.format, which is not found in the core and SQL Spark libraries.
I have seen a few posts on the subject, but none of them resolved it. Does anyone have some experience, or perhaps a working sample program I can start with?
Could it be that it would be easier with Eclipse?
Many thanks in advance for your answers,
EZ
Build the project in IntelliJ using Scala 2.11 and sbt 0.13, then ensure that your project/plugins.sbt contains the following:
logLevel := Level.Warn
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
Then your build.sbt should contain:
scalaVersion := "2.11.8"
val sparkVersion = "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion %"provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion %"provided"
Then write your code, open the Terminal in IntelliJ, and run sbt assembly. You can ship the resulting jar to a remote cluster, or run it locally from IntelliJ. Let me know how it goes.
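For what it's worth, spark.read.format comes from SparkSession in the spark-sql module, so with the build above a minimal program that should compile looks roughly like this; the object name, app name, and file path are placeholders:

import org.apache.spark.sql.SparkSession

object ReadFormatCheck {   // hypothetical object name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-format-check")
      .master("local[*]")   // for running inside IntelliJ; drop this when submitting to a cluster
      .getOrCreate()

    // spark.read.format is provided by spark-sql, not spark-core
    val df = spark.read.format("csv")
      .option("header", "true")
      .load("/path/to/some.csv")   // placeholder path

    df.show()
    spark.stop()
  }
}

Note that with the dependencies marked "provided" they won't be on the run classpath by default when launching locally; temporarily dropping "provided", or enabling IntelliJ's option to include provided-scope dependencies in the run configuration, works around that.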

Spark MLLib exception LDA.class compiled against incompatible version

When I try to run a Main class with sbt, I get this error. What am I missing?
Error:scalac: missing or invalid dependency detected while loading class file 'LDA.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'LDA.class' was compiled against an incompatible version of org.apache.spark.
My build.sbt looks like this:
"org.apache.spark" %% "spark-core" % "2.0.1" % Provided,
"org.apache.spark" % "spark-mllib_2.11" % "1.3.0"
You are trying to run an old spark-mllib against a new spark-core, i.e. your mllib and core versions are completely different.
Try using this:
"org.apache.spark" %% "spark-core_2.11" % "2.0.1",
"org.apache.spark" %% "spark-mllib_2.11" % "2.0.1"
That might solve your problem!
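A simple way to keep the two from drifting apart again is to pin every Spark module to a single version value, e.g. a sketch using the versions above:

val sparkVersion = "2.0.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)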