How to build a Spark application using Scala IDE and Maven?

I'm new to Scala, Spark and Maven and would like to build the Spark application described here. It uses the Mahout library.
I have Scala IDE installed and would like to use Maven to build the dependencies (the Mahout library as well as the Spark libraries). I couldn't find a good tutorial to get started. Could someone help me figure it out?

First, try compiling a simple application with Maven in Scala IDE. The key to a Maven project is its directory structure and pom.xml. Although I don't use Scala IDE myself, this document seems helpful.
http://scala-ide.org/docs/tutorials/m2eclipse/
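Before touching Spark or Mahout, a trivial object under src/main/scala is enough to verify that the Maven build itself works (the package and object names below are just placeholders):
// Minimal sanity check for the Maven + Scala IDE setup; no Spark involved yet.
// Lives under src/main/scala, following the standard Maven layout.
package com.example

object Hello extends App {
  println("Maven + Scala build works")
}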
The next step is to add a dependency on Spark to your pom.xml; you can follow this document.
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/
For the latest versions of the Spark and Mahout artifacts, you can check here:
http://mvnrepository.com/artifact/org.apache.spark
http://mvnrepository.com/artifact/org.apache.mahout
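Once the spark-core (and, if needed, Mahout) dependencies resolve, a small driver like the sketch below should compile and run from the IDE; the app name and local[*] master are just illustrative:
// Tiny word count, only to confirm the Spark classes are on the build path.
import org.apache.spark.{SparkConf, SparkContext}

object SparkSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSmokeTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}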
Hope this helps.

You need the following tools to get started (based on recent availability):
Scala IDE for Eclipse – download the latest version of Scala IDE from here.
Scala version – 2.11 (make sure the Scala compiler is set to this version as well).
Spark version – 2.2 (provided via the Maven dependency).
winutils.exe
For running in a Windows environment, you need the Hadoop binaries in Windows format. winutils provides that, and we need to set the hadoop.home.dir system property to the bin path inside which winutils.exe is present. You can download winutils.exe here and place it at a path like this – c:/hadoop/bin/winutils.exe
And you can define the Spark Core dependency in your project's Maven pom.xml to get started:
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
<scope>provided</scope>
</dependency>
And in your Java/Scala class, set this property to run in your local environment on Windows:
System.setProperty("hadoop.home.dir", "c://hadoop//");
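Putting it together, a minimal local run on Windows could look like the sketch below (it assumes the c:/hadoop layout described above; the property must be set before the SparkContext is created):
import org.apache.spark.{SparkConf, SparkContext}

object LocalWindowsApp {
  def main(args: Array[String]): Unit = {
    // Point at the directory that contains bin/winutils.exe, before Spark starts.
    System.setProperty("hadoop.home.dir", "c:/hadoop/")
    val sc = new SparkContext(new SparkConf().setAppName("LocalWindowsApp").setMaster("local[*]"))
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}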
More details and the full setup can be found here.

Related

Why does a Maven build work fine, but adding the Spark jar as an external jar gives a compile error "object apache is not a member of package org"?

In Eclipse, while setting up Spark, even after adding all the external jars from spark-2.4.3-bin-hadoop2.7/jars/ to the build path,
the compiler complains that "object apache is not a member of package org".
Yes, building the dependencies via Maven or SBT would fix it. A similar question has been asked:
scalac compile yields "object apache is not a member of package org"
But the question here is: WHY does the traditional way fail like this?
If we refer to Scala/Spark version compatibility, we can see a similar issue. The problem is that Scala is NOT binary backward compatible, so each Spark module is compiled against a specific Scala library. When we run from Eclipse, the Eclipse Scala environment may not be compatible with the particular Scala version that the Spark libraries were built against.
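A quick way to check for such a mismatch is to print the Scala version the project actually runs against and compare it with the suffix of the Spark artifact (e.g. spark-core_2.11); this sketch assumes Spark is already on the build path:
object VersionCheck {
  def main(args: Array[String]): Unit = {
    // The Scala library that the IDE/build actually put on the classpath.
    println("Scala: " + scala.util.Properties.versionString)
    // The Spark build; each Spark artifact is compiled for one Scala suffix.
    println("Spark: " + org.apache.spark.SPARK_VERSION)
  }
}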

Can't find class in uber jar

I am on Hortonworks Distribution 2.4 (effectively Hadoop 2.7.1 and Spark 1.6.1).
I am packaging my own version of Spark (2.1.0) in the uber jar while the cluster is on 1.6.1. In the process, I am shipping all required libraries through a fat jar (built using Maven, the uber jar concept).
However, spark-submit (through the Spark 2.1.0 client) fails citing a NoClassDefFoundError on the Jersey client. Upon listing my uber jar contents, I can see the exact class file in the jar, yet Spark/YARN can't find it.
Here goes.
The error message:
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:151)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
And here is my attempt to find the class in the jar file:
jar -tf uber-xxxxx-something.jar | grep jersey | grep ClientCon
com/sun/jersey/api/client/ComponentsClientConfig.class
com/sun/jersey/api/client/config/ClientConfig.class
... Other files
What could be going on here? Suggestions? Ideas, please.
EDIT
The Jersey client section of the pom goes here:
<dependency>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-client</artifactId>
<version>1.19.3</version>
</dependency>
EDIT
I also wanted to point out that my code is compiled with Scala 2.12, with the compatibility level set to 2.11. However, the cluster is perhaps on 2.10. I say perhaps because I believe cluster nodes don't necessarily need Scala binaries installed; YARN just launches the components' jar/class files without using the Scala binaries. I wonder if that's playing a role here!
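For what it's worth, one way to narrow this down is to ask the running JVM which jar actually served the class, instead of only listing the uber jar; the class name is taken from the stack trace above, everything else is a placeholder:
object WhereIsClientConfig {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("com.sun.jersey.api.client.config.ClientConfig")
    // Prints the jar (or directory) the classloader resolved the class from.
    val source = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
    println(source.getOrElse("loaded by the bootstrap classloader"))
  }
}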

Dependency error while setting up spark with java in eclipse

I am new to Spark and Java. I was trying to set up the Spark environment in Eclipse using a Maven dependency. I am using Java 1.8 with Scala 2.11.7. I have created a Scala project in Eclipse and added the Maven dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
Now I am getting the error "Failure to transfer org.spark-project.spark:unused:jar:1.0.0 from https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced."
I am new too. These are my steps for working with Eclipse, Scala 2.11, Maven and Spark:
1. Create a Maven project.
2. Use this POM as a basic orientation: https://github.com/entradajuan/Spark0/blob/master/Spark0/pom.xml
3. Convert the project to Scala: right button -> Configure -> Convert to Scala.
4. Then create a package in src/main/java.
5. Create a Scala object with its def main in the new package (a minimal sketch is shown below).
So far I always start the same way. I don't use Java in my projects.
It also works fine when running the Maven build "clean install" to get the jar.
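A minimal sketch of step 5 above, a Scala object with a def main in the new package (package and object names are placeholders):
package com.example.spark0

object Main {
  def main(args: Array[String]): Unit = {
    // If this compiles and runs, the Eclipse/Scala/Maven wiring is working.
    println("Project set up correctly")
  }
}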

Scala and persistence framework version incompatible

I am trying to use the Slick and Squeryl frameworks for data persistence with Scala. I don't want to use the Play framework, just the persistence framework, but when I import the Slick (or Squeryl) jar file, I encounter the issue below:
slick_2.10.1-2.0.0-M1.jar of <project_name> build path is cross-compiled with an incompatible version of Scala (2.10.1). In case this report is mistaken, this check can be disabled in the compiler preference page.
I use the Scala jar (2.11.6) under the Scala plugin in Eclipse, and I can run a simple Scala application. I can also access a MySQL DBMS with JDBC. This problem appears when I import the Slick (or Squeryl) jar files. Is it because the framework does not support Scala 2.11? Is downgrading the Scala version the solution? If so, can anyone point me in the direction of how to downgrade the Scala version under the Eclipse Scala plugin? Thank you very much.
If you are using Scala 2.11, you need to use this dependency for Slick:
<dependency>
<groupId>com.typesafe.slick</groupId>
<artifactId>slick_2.11</artifactId>
<version>3.0.0</version>
</dependency>
The previous answer should resolve your issue with Slick. If you'd like to use Squeryl, the dependency should be:
<dependency>
<groupId>org.squeryl</groupId>
<artifactId>squeryl_2.11</artifactId>
<version>0.9.6-RC3</version>
</dependency>
Or, if you want to use 0.9.5:
<dependency>
<groupId>org.squeryl</groupId>
<artifactId>squeryl_2.11</artifactId>
<version>0.9.5-7</version>
</dependency>
Libraries in Scala are only binary compatible with the minor version of Scala they were compiled against. You'll see that in these examples the correct Scala version is appended to the artifact ID with an underscore.
If you have the ability to use SBT instead of Maven, I would recommend it. With the %% operator, SBT appends the proper Scala suffix for you when you reference a dependency like the following:
libraryDependencies += "org.squeryl" %% "squeryl" % "0.9.6-RC3"
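For example, a minimal build.sbt along those lines (the Scala and library versions here are just the ones discussed above) would be:
// With %%, SBT appends the matching _2.11 suffix itself.
scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  "com.typesafe.slick" %% "slick"   % "3.0.0",
  "org.squeryl"        %% "squeryl" % "0.9.6-RC3"
)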

Running ScalaTest and Maven with two Scala libraries - one for Maven, the other for the Scala Eclipse plugin

The Scala Eclipse plugin requires Scala 2.10.0 to run.
To run the 'test' goal in Maven I require the dependency:
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.9.0-1</artifactId>
<version>2.0.M5</version>
</dependency>
As part of this dependency, 'scala-library-2.9.0-1.jar' is also added to the build path.
This causes an error to be displayed in the Problems tab in Eclipse:
More than one scala library found in the build path. At least one has
an incompatible version. Please update the project build path so it
contains only compatible scala libraries.
How can I fix this error? I need both Scala libraries: one is for the Scala Eclipse plugin and the other for the ScalaTest Maven plugin. I don't want to just delete the error from the Problems tab.
The Scala Eclipse plugin requires Scala 2.10.0 to run:
It has versions for both 2.10 and 2.9; install the one for 2.9.2 (and use ScalaTest for version 2.9.2 as well). Or use a version of ScalaTest for 2.10, but it seems you'll need to build and install it locally; there isn't one for 2.10.0-RC2 listed at http://mvnrepository.com/artifact/org.scalatest.
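Once the versions line up, a tiny ScalaTest suite like the sketch below (FunSuite is the plain API available in that ScalaTest line; the class name is a placeholder) is a quick way to confirm the test classpath is consistent:
import org.scalatest.FunSuite

class SanitySuite extends FunSuite {
  test("scala library is on the test classpath") {
    // If this runs, scalac, the Eclipse plugin and ScalaTest agree on a Scala version.
    assert(scala.util.Properties.versionString.nonEmpty)
  }
}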