submit task to Spark - scala

I installed Spark on Ubuntu 14.04 following this tutorial: http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/
I am able to run the examples provided with Spark, and it seems to work.
The problem is that I am not able to write a Scala file and execute it with Spark. This is what I have done, following the guidelines at https://spark.apache.org/docs/latest/quick-start.html
My standalone app is:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.commons.math3.random.RandomDataGenerator
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/donbeo/Applications/spark/spark-1.1.0/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

    println("A random number")
    val randomData = new RandomDataGenerator()
    println(randomData.nextLong(0, 100))
  }
}
and my sbt file is:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"
my project structure is:
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Documents/scala_code/simpleApp$ find .
.
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala~
./src/main/scala/SimpleApp.scala
./simple.sbt
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Documents/scala_code/simpleApp$
and then I run
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Documents/scala_code/simpleApp$ sbt package
[info] Set current project to Simple Project (in build file:/home/donbeo/Documents/scala_code/simpleApp/)
[info] Updating {file:/home/donbeo/Documents/scala_code/simpleApp/}simpleapp...
[info] Resolving org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645 ...
[info] Resolving org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v20100508202 ...
[info] Resolving org.eclipse.jetty.orbit#javax.activation;1.1.0.v201105071233 ...
[info] Resolving org.spark-project.akka#akka-remote_2.10;2.2.3-shaded-protobuf ...
[info] Resolving org.spark-project.akka#akka-actor_2.10;2.2.3-shaded-protobuf ...
[info] Resolving org.spark-project.akka#akka-slf4j_2.10;2.2.3-shaded-protobuf ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/classes...
[info] Packaging /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed 04-Feb-2015 15:20:09
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Documents/scala_code/simpleApp$
and at the final step I get an error
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$ ./bin/spark-submit \ --class "SimpleApp" \ --master local[4] \ /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 0: --class
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parseHierarchical(URI.java:3086)
at java.net.URI$Parser.parse(URI.java:3044)
at java.net.URI.<init>(URI.java:595)
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1343)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:338)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:225)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:60)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$
Am I doing something wrong? How can I solve this?

You need to remove all the \ characters from the command line; in the documentation's examples they are only there because of the line breaks:
./bin/spark-submit --class "SimpleApp" --master local[4] /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
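If you do want to keep the command on multiple lines, the backslashes themselves are fine, but each one has to be the very last character on its line (no trailing space) so the shell treats it as a line continuation. For example, this should also work:
./bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar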

Related

sbt - object apache is not a member of package org

I want to deploy and submit a Spark program using sbt, but it's throwing an error.
Code:
package in.goai.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("First Spark")
    val sc = new SparkContext(conf)
    val fileName = args(0)
    val lines = sc.textFile(fileName).cache
    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}
build.sbt
name := "First Spark"
version := "1.0"
organization := "in.goai"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
resolvers += Resolver.mavenLocal
Under the first/project directory:
build.properties
sbt.version=0.13.9
When I try to run sbt package, it throws the error given below.
[root@hadoop first]# sbt package
[info] Loading project definition from /home/training/workspace_spark/first/project
[info] Set current project to First Spark (in build file:/home/training/workspace_spark/first/)
[info] Compiling 1 Scala source to /home/training/workspace_spark/first/target/scala-2.11/classes...
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:3: object apache is not a member of package org
[error] import org.apache.spark.{SparkContext, SparkConf}
[error] ^
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:9: not found: type SparkConf
[error] val conf = new SparkConf().setAppName("First Spark")
[error] ^
[error] /home/training/workspace_spark/first/src/main/scala/LineCount.scala:11: not found: type SparkContext
[error] val sc = new SparkContext(conf)
[error] ^
[error] three errors found
[error] (compile:compile) Compilation failed
[error] Total time: 4 s, completed May 10, 2018 4:05:10 PM
I have also tried extending App, but there was no change.
Please remove resolvers += Resolver.mavenLocal from build.sbt. Since spark-core is available on Maven Central, we don't need to use a local resolver.
After that, you can try sbt clean package.
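For reference, a minimal build.sbt along those lines (keeping the same versions you already use) would look something like this:
name := "First Spark"

version := "1.0"

organization := "in.goai"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"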

Spark-cassandra-connector: toArray does not work

I am using the spark-cassandra-connector with Scala, and I want to read data from Cassandra and display it via the toArray method.
However, I get an error message saying that it is not a member of the class, even though it is indicated in the API. Could somebody help me find my error?
Here are my files:
build.sbt:
name := "Simple_Project"
version := "1.0"
scalaVersion := "2.11.8"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-preview"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0-preview"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "2.0.0-M2-s_2.11"
SimpleScala.scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SQLContext
import com.datastax.spark.connector.cql.CassandraConnector._
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    conf.set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    val rdd_2 = sc.cassandraTable("test_2", "words")
    rdd_2.toArray.foreach(println)
  }
}
CQL statements for cqlsh:
CREATE KEYSPACE test_2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE test_2.words (word text PRIMARY KEY, count int);
INSERT INTO test_2.words (word, count) VALUES ('foo', 20);
INSERT INTO test_2.words (word, count) VALUES ('bar', 20);
Error message:
[info] Loading global plugins from /home/andi/.sbt/0.13/plugins
[info] Resolving org.scala-sbt.ivy#ivy;2.3.0-sbt-2cc8d2761242b072cedb0a04cb39435 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Loading project definition from /home/andi/test_spark/project
[info] Updating {file:/home/andi/test_spark/project/}test_spark-build...
[info] Resolving org.scala-sbt.ivy#ivy;2.3.0-sbt-2cc8d2761242b072cedb0a04cb39435 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to Simple_Project (in build file:/home/andi/test_spark/)
[info] Compiling 1 Scala source to /home/andi/test_spark/target/scala-2.11/classes...
[error] /home/andi/test_spark/src/main/scala/SimpleApp.scala:50: value toArray is not a member of com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
[error] rdd_2.toArray.foreach(println)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
Many thanks in advance,
Andi
The CassandraTableScanRDD.toArray method has been deprecated and removed as of the 2.0.0 release of the Spark Cassandra Connector; it was available up to the 1.6.0 release. You can use the collect method instead.
Unfortunately, the Spark Cassandra Connector documentation still uses toArray. Anyway, here is what will work:
rdd_2.collect.foreach(println)
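Note that collect brings the entire table back to the driver, which can be a lot of data for a real table. If you only want to inspect a few rows, take is a safer choice, e.g.:
rdd_2.take(10).foreach(println)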

Scala - spark-corenlp - java.lang.ClassNotFoundException

I want to run the spark-corenlp example, but I get a java.lang.ClassNotFoundException error when running spark-submit.
Here is the Scala code, from the GitHub example, which I put into an object and for which I defined a SparkContext.
analyzer.Sentiment.scala:
package analyzer
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._
object Sentiment {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sentiment")
    val sc = new SparkContext(conf)

    val input = Seq(
      (1, "<xml>Stanford University is located in California. It is a great university.</xml>")
    ).toDF("id", "text")

    val output = input
      .select(cleanxml('text).as('doc))
      .select(explode(ssplit('doc)).as('sen))
      .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

    output.show(truncate = false)
  }
}
I am using the build.sbt provided by spark-corenlp - I only modified the scalaVersion and sparkVersion to my own.
version := "1.0"
scalaVersion := "2.11.8"
initialize := {
  val _ = initialize.value
  val required = VersionNumber("1.8")
  val current = VersionNumber(sys.props("java.specification.version"))
  assert(VersionNumber.Strict.isCompatible(current, required), s"Java $required required.")
}
sparkVersion := "1.5.2"
// change the value below to change the directory where your zip artifact will be created
spDistDirectory := target.value
sparkComponents += "mllib"
spName := "databricks/spark-corenlp"
licenses := Seq("GPL-3.0" -> url("http://opensource.org/licenses/GPL-3.0"))
resolvers += Resolver.mavenLocal
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",
  "com.google.protobuf" % "protobuf-java" % "2.6.1"
)
Then, I created my jar without issues by running:
sbt package
Finally, I submitted my job to Spark:
spark-submit --class "analyzer.Sentiment" --master local[4] target/scala-2.11/sentimentanalizer_2.11-0.1-SNAPSHOT.jar
But I get the following error:
java.lang.ClassNotFoundException: analyzer.Sentiment
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:641)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My file Sentiment.scala is correctly located in a package named "analyzer".
$ find .
./src
./src/analyzer
./src/analyzer/Sentiment.scala
./src/com
./src/com/databricks
./src/com/databricks/spark
./src/com/databricks/spark/corenlp
./src/com/databricks/spark/corenlp/CoreNLP.scala
./src/com/databricks/spark/corenlp/functions.scala
./src/com/databricks/spark/corenlp/StanfordCoreNLPWrapper.scala
When I ran the SimpleApp example from the Spark Quick Start, I noticed that MySimpleProject/bin/ contained a SimpleApp.class, while MySentimentProject/bin is empty. So I have tried to clean my project (I am using Eclipse for Scala).
I think it is because I need to generate Sentiment.class, but I don't know how to do it - it was done automatically with SimpleApp.scala, and when I try to run/build with Eclipse Scala, it crashes.
Maybe you should try adding
scalaSource in Compile := baseDirectory.value / "src"
to your build.sbt, because the sbt documentation says that "the directory that contains the main Scala sources is by default src/main/scala".
Or just arrange your source code in this structure:
$ find .
./src
./src/main
./src/main/scala
./src/main/scala/analyzer
./src/main/scala/analyzer/Sentiment.scala
./src/main/scala/com
./src/main/scala/com/databricks
./src/main/scala/com/databricks/spark
./src/main/scala/com/databricks/spark/corenlp
./src/main/scala/com/databricks/spark/corenlp/CoreNLP.scala
./src/main/scala/com/databricks/spark/corenlp/functions.scala
./src/main/scala/com/databricks/spark/corenlp/StanfordCoreNLPWrapper.scala
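After moving the files, rebuild and resubmit; something like the following should work (adjust the jar name to whatever sbt actually produces under target/scala-2.11/):
sbt clean package
spark-submit --class "analyzer.Sentiment" --master local[4] target/scala-2.11/sentimentanalizer_2.11-0.1-SNAPSHOT.jar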

Scala program isn't seeing dependencies downloaded via SBT

I'm writing a script to try to get Cassandra and Spark working together, but I can't even get the program to compile. I am using SBT as the build tool, and I have declared all the dependencies the program requires. The first time I ran sbt run it downloaded the dependencies, but I got an error when it started compiling the Scala code, as shown below:
[info] Compiling 1 Scala source to /home/vagrant/ScalaTest/target/scala-2.10/classes...
[error] /home/vagrant/ScalaTest/src/main/scala/ScalaTest.scala:6: not found: type SparkConf
[error] val conf = new SparkConf(true)
[error] ^
[error] /home/vagrant/ScalaTest/src/main/scala/ScalaTest.scala:9: not found: type SparkContext
[error] val sc = new SparkContext("spark://192.168.10.11:7077", "test", conf)
[error] ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 3 s, completed Jun 5, 2015 2:40:09 PM
This is the SBT build file
lazy val root = (project in file(".")).
  settings(
    name := "ScalaTest",
    version := "1.0"
  )
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0-M1"
and this is the actual Scala program
import com.datastax.spark.connector._
object ScalaTest {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://192.168.10.11:7077", "test", conf)
  }
}
Here is my directory structure
- ScalaTest
  - build.sbt
  - project
  - src
    - main
      - scala
        - ScalaTest.scala
  - target
I don't know if this is the problem, but you're not importing the SparkConf and SparkContext class definitions. Try adding the following to your Scala file:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
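For completeness, here is a minimal sketch of ScalaTest.scala with those imports added (otherwise the same logic as your version):
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.datastax.spark.connector._

object ScalaTest {
  def main(args: Array[String]) {
    // Cassandra host for the connector, then a context pointing at the standalone master
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://192.168.10.11:7077", "test", conf)
  }
}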

Spark Scala Error: Exception in thread "main" java.lang.ClassNotFoundException

I tried to run a Spark job written in Scala on a YARN cluster, and ran into this error:
[!##$% spark-1.0.0-bin-hadoop2]$ export HADOOP_CONF_DIR="/etc/hadoop/conf"
[!##$% spark-1.0.0-bin-hadoop2]$ ./bin/spark-submit --class "SimpleAPP" \
> --master yarn-client \
> test_proj/target/scala-2.10/simple-project_2.10-0.1.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.ClassNotFoundException: SimpleAPP
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:289)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
And this is my sbt file:
[!##$% test_proj]$ cat simple.sbt
name := "Simple Project"
version := "0.1"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
// We need to be able to write Avro in Parquet
// libraryDependencies += "com.twitter" % "parquet-avro" % "1.3.2"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
This is my SimpleApp.scala program; it is the canonical one:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/myname/spark-1.0.0-bin-hadoop2/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
The output of sbt package is as follows:
[!##$% test_proj]$ sbt package
[info] Set current project to Simple Project (in build file:/home/myname/spark-1.0.0-bin-hadoop2/test_proj/)
[info] Compiling 1 Scala source to /home/myname/spark-1.0.0-bin-hadoop2/test_proj/target/scala-2.10/classes...
[info] Packaging /home/myname/spark-1.0.0-bin-hadoop2/test_proj/target/scala-2.10/simple-project_2.10-0.1.jar ...
[info] Done packaging.
[success] Total time: 12 s, completed Mar 3, 2015 10:57:12 PM
As suggested, I did the following:
jar tf simple-project_2.10-0.1.jar | grep .class
Something like the following shows up:
SimpleApp$$anonfun$1.class
SimpleApp$.class
SimpleApp$$anonfun$2.class
SimpleApp.class
Verify whether the class name inside the jar is really SimpleAPP. Do this:
jar tf simple-project_2.10-0.1.jar | grep .class
and check whether the name of the class is right.
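Class names are case-sensitive, and the listing above shows SimpleApp.class, while the submit command uses --class "SimpleAPP". Submitting with the exact name should fix it, e.g.:
./bin/spark-submit --class "SimpleApp" \
    --master yarn-client \
    test_proj/target/scala-2.10/simple-project_2.10-0.1.jar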