Finding Spark Scala packages

I'm sure this is simpler than it looks, but I'm willing to look dumb.
I'm working my way through some Scala/Spark examples, which occasionally call for adding library dependencies, e.g.,
libraryDependencies ++= Seq(
  scalaTest % Test,
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.2.0"
)
The question is, how do you find the appropriate names and versions for the libraries? It seems the texts all give import statements; there has to be some kind of registry or something. But where?

The correct version of a library can always be found by searching mvnrepository. If you are trying to use a version from a proprietary distribution, you need to add that distribution's repository, for example:
Cloudera repository
MapR repository
hdp_maven_artifacts
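As a minimal sketch, adding such a repository in build.sbt looks like this (the URL is Cloudera's public repository; the version string is only illustrative of the vendor-specific pattern, so check the repository for the exact value):
// Add the vendor repository so sbt can resolve distribution-specific builds.
resolvers += "Cloudera repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

// Distribution builds use vendor-specific version strings; this one is illustrative.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0.cloudera1" % "provided"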

Related

Spark Cassandra Join ClassCastException

I am trying to join two Cassandra tables with:
t1.join(t2, Seq("some column"), "left")
I am getting the below error message:
Exception in thread "main" java.lang.ClassCastException: scala.Tuple8 cannot be cast to scala.Tuple7 at org.apache.spark.sql.cassandra.execution.CassandraDirectJoinStrategy.apply(CassandraDirectJoinStrategy.scala:27)
I am using Cassandra v3.11.13 and Spark 3.3.0. The project dependencies:
libraryDependencies ++= Seq(
  "org.scalatest" %% "scalatest" % "3.2.11" % Test,
  "com.github.mrpowers" %% "spark-fast-tests" % "1.0.0" % Test,
  "graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12" % Provided,
  "org.rogach" %% "scallop" % "4.1.0" % Provided,
  "org.apache.spark" %% "spark-sql" % "3.1.2" % Provided,
  "org.apache.spark" %% "spark-graphx" % "3.1.2" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided
)
Your help is greatly appreciated
The Spark Cassandra connector does not support Apache Spark 3.3.0 yet, and I suspect that is the reason it's not working, though I haven't done any verification myself.
Support for Spark 3.3.0 has been requested in SPARKC-686 but the amount of work required is significant so stay tuned.
The latest supported Spark version is 3.2 using spark-cassandra-connector 3.2. Cheers!
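In the meantime, a minimal sketch of a build that stays on the supported combination (the Spark version below is an assumption; any 3.2.x release should do, and the cluster itself must also run Spark 3.2.x):
// Keep the Spark modules and the connector on the same supported line
// (Spark 3.2.x with spark-cassandra-connector 3.2.0). Versions are examples.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                 % "3.2.1" % Provided,
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.2.0" % Provided
)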
This commit adds initial support for Spark 3.3.x. It is still awaiting release candidates and publishing at the time of this comment, so for now you would need to build and package the jars yourself to resolve the above error when using Spark 3.3. This could be a good opportunity to provide feedback on any subsequent RCs as an active user.
I will update this comment when RCs/stable releases are available, which should resolve the issue for others hitting it. Unfortunately, I don't have enough reputation to add this as a comment to the thread above.
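If you do want to try the unreleased code, a rough sketch of what depending on a local build looks like (the snapshot version string is illustrative; use whatever version an `sbt publishLocal` of the connector checkout actually produces):
// After checking out the connector at the commit above and running `sbt publishLocal`,
// depend on the locally published artifact. The version string below is illustrative;
// match it to whatever the local build publishes.
libraryDependencies +=
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.3.0-SNAPSHOT" % Provided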

Check Spark packages version [duplicate]

This question already has an answer here:
sbt unresolved dependency for spark-cassandra-connector 2.0.2
(1 answer)
Closed 5 years ago.
I am trying to set up my first Scala project with IntelliJ IDEA on Ubuntu 16.04. I need the Spark library, and I think I have installed it correctly on my computer; however, I am not able to reference it in the project dependencies. In particular, I have added the following code to my build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core" % "2.1.1",
  "org.apache.spark" % "spark-sql" % "2.1.1"
)
However, sbt complains about not finding the packages (Unresolved Dependencies error: org.apache.spark#spark-core;2.1.1: not found and org.apache.spark#spark-sql;2.1.1: not found).
I think that the versions of the packages are incorrect (I copied the previous code from the web, just to try).
How can I determine the correct packages versions?
If you use %, you have to spell out the full artifact name yourself, including the Scala binary version:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.1",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.1"
)
If you don't want to write the Scala version by hand and would rather let sbt pick the correct artifact, use %% instead:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.1",
  "org.apache.spark" %% "spark-sql" % "2.1.1"
)
You can check the installed Spark version with
spark-submit --version
and find the matching artifact versions by browsing the Maven repository.
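To tie it together, a short sketch of a build.sbt where %% appends the project's Scala binary version automatically (versions here are illustrative), so the lines below resolve spark-core_2.11 and spark-sql_2.11:
// With %%, sbt appends "_2.11" to the artifact names for you.
scalaVersion := "2.11.8"

val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)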

Spark with IntelliJ or Eclipse

I am trying to set up IntelliJ for Spark with Scala 2.11, but it is very daunting, and after days I have not been able to compile a simple instruction such as spark.read.format, which is not found in the main Spark core and SQL libraries.
I have seen a few posts on the subject but with none resolved. Does anyone have some experience with perhaps a working sample program I can start with?
Could it be that it would be easier with Eclipse?
Many thanks in advance for your answers,
EZ
Build the project in IntelliJ with Scala 2.11 and sbt 0.13, then make sure your plugins.sbt contains the following:
logLevel := Level.Warn
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
Then your build.sbt should contain:
scalaVersion := "2.11.8"
val sparkVersion = "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion %"provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion %"provided"
Then write your code, open the Terminal in IntelliJ, and run sbt assembly. You can ship the resulting jar to a remote cluster, or run it locally from IntelliJ. Let me know how it goes.
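Since the question asks for a working sample to start from, here is a minimal sketch (the object name, app name, and input path are placeholders); note that spark.read.format comes from the spark-sql module, not spark-core:
import org.apache.spark.sql.SparkSession

object Sample {
  def main(args: Array[String]): Unit = {
    // Local master for running inside the IDE; drop .master(...) when submitting to a cluster.
    val spark = SparkSession.builder()
      .appName("sample")
      .master("local[*]")
      .getOrCreate()

    // spark.read.format is provided by spark-sql.
    val df = spark.read.format("csv")
      .option("header", "true")
      .load("data.csv") // placeholder input path

    df.show()
    spark.stop()
  }
}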

Using s3n:// with spark-submit

I've written a Spark app that is to run on a cluster using spark-submit. Here's part of my build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided" exclude("asm", "asm")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
asm is excluded because I'm using another library which depends on a different version of it. The asm dependency in Spark seems to come from one of Hadoop's dependencies, and I'm not using that functionality.
The problem now is that with this setup, saveToTextFile("s3n://my-bucket/dir/file") throws java.io.IOException: No FileSystem for scheme: s3n.
Why is this happening? Shouldn't spark-submit provide the Hadoop dependencies?
I've tried a few things:
Leaving out "provided"
Putting hadoop-aws on the classpath, via a jar and spark.executor.extraClassPath and spark.driver.extraClassPath. This requires doing the same for all of its transitive dependencies though, which can be painful.
Neither really works. Is there a better approach?
I'm using the pre-built spark-1.6.1-bin-hadoop2.6.
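For reference, a sketch of the second approach above, declaring hadoop-aws in build.sbt instead of hand-managing jars via extraClassPath; the version is an assumption chosen to match the pre-built spark-1.6.1-bin-hadoop2.6 distribution:
// Declaring hadoop-aws as a regular dependency lets sbt pull its transitive
// dependencies (e.g. the AWS SDK) instead of listing them manually on
// spark.executor.extraClassPath. In Hadoop 2.6 the s3n filesystem lives in
// the hadoop-aws module, which is why the scheme is otherwise missing.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"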

Is it possible for a Scala project to list its own dependencies?

In sbt, we define dependencies for a project:
libraryDependencies ++= Seq(
  "com.beachape" %% "enumeratum" % "1.3.2",
  "org.scalatest" %% "scalatest" % "2.2.4" % "test"
)
Is it possible for the Scala application thus compiled to get access to this data, somehow?
I am making a modular system of Play 2.4 APIs and would like the "umbrella" to be able to list which APIs it's carrying.
I will probably get this done using sbt-buildinfo that I found via this question.
Other suggestions are of course welcome.
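For reference, a minimal sketch of the sbt-buildinfo route mentioned above (the plugin version, package name, and string-mapped custom key are assumptions; check the plugin's documentation for the exact syntax of your version):
// project/plugins.sbt (plugin version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.9.0")

// build.sbt
lazy val root = (project in file("."))
  .enablePlugins(BuildInfoPlugin)
  .settings(
    buildInfoKeys := Seq[BuildInfoKey](
      name,
      version,
      scalaVersion,
      // expose the dependency coordinates as plain strings
      "libraryDependencies" -> libraryDependencies.value.map(_.toString)
    ),
    buildInfoPackage := "myapp.info" // hypothetical package name
  )

// At runtime the generated object can then list the modules, e.g.:
// myapp.info.BuildInfo.libraryDependencies.foreach(println)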
A simple solution is checking Maven repositories. For example, the link below shows all libraries that "com.beachape" %% "enumeratum" % "1.3.2" depends on.
http://mvnrepository.com/artifact/com.beachape/enumeratum_2.11/1.3.2