sbt file doesn't recognize the spark input - scala

I am trying to execute Scala code in Spark. An example of the code and the build.sbt file can be found here.
There is one difference to this example: I already use version 2.0.0 of Spark (I have downloaded this version locally and defined the path in my .bashrc file). Accordingly, I have also modified my build.sbt file and set the version to 2.0.0.
After that I get error messages.
Case 1:
I just executed the code of SparkMeApp as given in the link. I got an error message saying that I have to set the master via the setMaster function:
16/09/05 19:37:01 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
Case 2:
I call setMaster with different arguments and get the following error messages:
Input: setMaster("spark://<username>:7077") or setMaster("local[2]")
Error:
[error] (run-main-0) java.lang.ArrayIndexOutOfBoundsException: 0
java.lang.ArrayIndexOutOfBoundsException: 0
(as far as I understand, this error means that the argument array is empty)
In other cases I just get this error:
16/09/05 19:44:29 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <...>
org.apache.spark.SparkException: Exception thrown in awaitResult
In addition, I have only a little experience with Scala and sbt, so my sbt is probably configured incorrectly... Can somebody please tell me the right way?

This is how your minimal build.sbt will look:
name := "SparkMe Project"
version := "1.0"
scalaVersion := "2.11.7"
organization := "pl.japila"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
And here is your SparkMeApp object:
import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SparkMe Application")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    val fileName = args(0)
    val lines = sc.textFile(fileName).cache

    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}
Execute it like:
$ sbt "run [your file path]"

@Abhi, thank you very much for your answer. In general it works. However, I still get some error messages after the correct execution of the code.
I have created a test .txt file with 4 lines:
test file
test file
test file
test file
In SparkMeApp I have changed the code line to:
val fileName = "/home/usr/test.txt"
After I execute run SparkMeApp.scala I get the following output:
16/09/06 09:15:34 INFO DAGScheduler: Job 0 finished: count at SparkMeApp.scala:11, took 0.348171 s
There are 4 lines in /home/usr/test.txt
16/09/06 09:15:34 ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:175)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1229)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:172)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67)
16/09/06 09:15:34 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:67)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1229)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
16/09/06 09:15:34 INFO SparkUI: Stopped Spark web UI at http://<myip>:4040
[success] Total time: 7 s, completed Sep 6, 2016 9:15:34 AM
I can see the correct output of my code (the second line), but afterwards I get the interrupt error. How can I fix it? In any case, I hope the code is working correctly.
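This InterruptedException typically appears because sbt stops the JVM while Spark's background threads (ContextCleaner, listener bus) are still running. Stopping the context explicitly at the end of main usually gives a clean shutdown. A minimal sketch, assuming the SparkMeApp from the answer above; the only addition is the sc.stop() call:

import org.apache.spark.{SparkConf, SparkContext}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("SparkMe Application")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    val fileName = args(0)
    val c = sc.textFile(fileName).cache.count
    println(s"There are $c lines in $fileName")

    // Stop the context so the cleaner and listener-bus threads shut down
    // cleanly instead of being interrupted when sbt tears the JVM down.
    sc.stop()
  }
}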

Related

Spark failing to write to HDFS because of a field with AVG

I'm running a Spark script in Scala from a .sh file. When running the same code in a Zeppelin notebook I had no problem, but running it from the script returns the following:
ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 2032, Column 28: Redefinition of parameter "agg_expr_51"
The cause of this is a column for which an average is calculated. Why is this happening, and is there a solution?
Thanks.

scalastyle during sbt compilation reports error twice for the same line

I have the following in my build.sbt for setting up scalastyle to run on compilation. It works except that it produces duplicate errors on the same line.
lazy val scalaStyleOnCompileTask = taskKey[Unit]("scalaStyleOnCompileTask")
scalaStyleOnCompileTask := scalastyle.in(Compile).toTask("").value
(Compile / compile) := (Compile / compile).dependsOn(scalaStyleOnCompileTask).value
The scalastyle check rules do not really matter, but take the built-in ones as an example:
<check level="error" class="org.scalastyle.scalariform.NotImplementedErrorUsage" enabled="true"/>
<check level="error" class="org.scalastyle.scalariform.NullChecker" enabled="true"/>
So I get two errors for the same line.
[info] scalastyle using config .../scalastyle-config.xml
[error] .../BusinessLogic.scala:15:14: Usage of ??? operator
[error] .../BusinessLogic.scala:15:14: Usage of ??? operator
When I run the check on the command line using scalastyle -c scalastyle-config.xml ., the error is reported only once.
Why is this happening? Any help appreciated.

Cannot Create SparkSession for Scala Without an Error in IntelliJ

I am trying to create a SparkSession so I can use implicits._, but I get errors when running a simple app.
My build.sbt file looks like this:
name := "Reddit-Data-Analyser"
version := "0.1"
scalaVersion := "2.11.12"
fork := true
libraryDependencies += "org.mongodb.scala" %% "mongo-scala-driver" % "2.4.0"
resolvers += "MavenRepository" at "http://central.maven.org/maven2"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" %% "spark-sql" % "2.3.0"
)
I get unresolved dependency errors on spark-sql, but it appears that the SparkSession class can still load.
My Main.scala looks like this:
import org.apache.spark.sql.SparkSession
object main extends App {
  val spark = SparkSession
    .builder()
    .config("spark.master", "local")
    //.config("spark.network.timeout", "10000s") // not relevant
    //.config("spark.executor.heartbeatInterval", "5000s") // not relevant
    .getOrCreate()

  println("Hello World")
  spark.stop()
}
*Edit: I was actually able to get the SparkSession to run by invalidating caches and restarting (though I had already done this many times, so I am not sure what changed). Now, when I do ~run in the sbt console, I get the [error] messages and have posted a separate question about it: SparkSession logging to console with [error] logs.
Below are my old error messages.
The println does not execute; instead I first get the following ERROR output:
[error] (run-main-7) java.lang.AbstractMethodError
java.lang.AbstractMethodError
at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
at org.apache.spark.sql.internal.SharedState.initializeLogIfNecessary(SharedState.scala:42)
at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
at org.apache.spark.sql.internal.SharedState.log(SharedState.scala:42)
at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
at org.apache.spark.sql.internal.SharedState.logInfo(SharedState.scala:42)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:71)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:112)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:111)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:284)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
at controller.main$.delayedEndpoint$controller$main$1(Main.scala:20)
at controller.main$delayedInit$body.apply(Main.scala:11)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at controller.main$.main(Main.scala:11)
at controller.main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 9 s, completed Mar 14, 2019 9:43:29 PM
8. Waiting for source changes... (press enter to interrupt)
19/03/14 21:43:29 INFO AsyncEventQueue: Stopping listener queue executorManagement.
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
19/03/14 21:43:29 INFO AsyncEventQueue: Stopping listener queue appStatus.
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
19/03/14 21:43:29 ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:181)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
Have you tried something like:
import org.apache.spark.sql.SparkSession

object main extends App {
  val spark = SparkSession
    .builder()
    .appName("myApp")
    .master("local[*]")
    .getOrCreate()

  println("Hello World")
  println(spark.version)
  spark.stop()
}
So I am not sure exactly what fixed the problem, because after a number of sbt commands and changes I was eventually able to run my app.
Here is a list of things I did (the resulting build.sbt is sketched after this list), but I think the sbt command in step 4 may have been the missing piece:
1. Changed scalaVersion := "2.12.5" to scalaVersion := "2.11.12" in build.sbt. I believe that Apache Spark has support for Scala 2.12, but IntelliJ or sbt apparently has difficulties retrieving the packages.
2. Created the file project/build.properties and added the line sbt.version = 0.13.17, since sbt 1.0 apparently isn't great at working with the spark-core repository.
3. Ran the following sbt commands in this order: reload plugins, update, reload.
4. Ran the sbt command package, which creates a jar file containing the files in src/main/resources and the classes compiled from src/main/scala and src/main/java. After doing this (and maybe a full Rebuild/Cache Invalidation) I noticed the missing Scala packages appeared in my External Libraries.
5. Did Rebuild and Invalidate Caches/Restart several times.
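For reference, here is a sketch of the Spark-related parts of the adjusted build.sbt after steps 1 and 2 (the versions come from the list above, the rest is taken from the question's build.sbt; project/build.properties would contain the single line sbt.version = 0.13.17):

name := "Reddit-Data-Analyser"
version := "0.1"
scalaVersion := "2.11.12"  // a Scala version the Spark 2.3.x artifacts are published for
fork := true

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" %% "spark-sql" % "2.3.0"
)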

How to reduce the verbosity of Spark's runtime output?

How can I reduce the amount of trace info the Spark runtime produces? The default is too verbose.
How can I turn it off, and turn it back on when I need it?
Thanks.
Verbose mode
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
15/01/28 09:57:24 INFO SparkContext: Starting job: collect at <console>:15
15/01/28 09:57:24 INFO DAGScheduler: Got job 3 (collect at <console>:15) with 1 output
...
15/01/28 09:57:24 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/01/28 09:57:24 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 626 bytes result sent to driver
15/01/28 09:57:24 INFO DAGScheduler: Stage 3 (collect at <console>:15) finished in 0.002 s
15/01/28 09:57:24 INFO DAGScheduler: Job 3 finished: collect at <console>:15, took 0.020061 s
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Silent mode (expected)
scala> val la = sc.parallelize(List(12,4,5,3,4,4,6,781))
scala> la.collect
res5: Array[Int] = Array(12, 4, 5, 3, 4, 4, 6, 781)
Spark 1.4.1
sc.setLogLevel("WARN")
From comments in source code:
Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
Spark 2.x (Java API)
sparkSession.sparkContext().setLogLevel("WARN")
Spark 2.x (Scala API)
sparkSession.sparkContext.setLogLevel("WARN")
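For example, in a standalone application the call goes right after the session is created. A minimal sketch (the object and app names are made up for illustration):

import org.apache.spark.sql.SparkSession

object QuietApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("QuietApp")
      .master("local[*]")
      .getOrCreate()

    // Suppress INFO/DEBUG output from here on; any of the levels listed above works.
    spark.sparkContext.setLogLevel("WARN")

    val la = spark.sparkContext.parallelize(List(12, 4, 5, 3, 4, 4, 6, 781))
    println(la.collect().mkString(", "))

    spark.stop()
  }
}

Note that output emitted while the context itself is starting up is not affected by this call; the log4j.properties approach quoted below also covers the startup logs.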
Quoting from the 'Learning Spark' book:
You may find the logging statements that get printed in the shell distracting. You can control the verbosity of the logging. To do this, you can create a file in the conf directory called log4j.properties. The Spark developers already include a template for this file called log4j.properties.template. To make the logging less verbose, make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:
log4j.rootCategory=INFO, console
Then lower the log level so that we only show WARN message and above by changing it to the following:
log4j.rootCategory=WARN, console
When you re-open the shell, you should see less output.
Logging configuration at the Spark application level
With this approach there is no need for a code change in the Spark application running on the cluster.
Create a new file log4j.properties from log4j.properties.template, then change the verbosity with the log4j.rootCategory property. Say we only need to see ERRORs for a given jar; then set log4j.rootCategory=ERROR, console.
The spark-submit command would be:
spark-submit \
... #Other spark props goes here
--files prop/file/location \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
--conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=prop/file/location' \
jar/location \
[application arguments]
Now you will see only the logs that are categorised as ERROR.
Plain log4j way without Spark (but needs a code change)
Set the logging level for the org and akka packages:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
If you are invoking a command from a shell, there is a lot you can do without changing any configurations. That is by design.
Below are a couple of Unix examples using pipes, but you could do similar filters in other environments.
To completely silence the log (at your own risk)
Pipe stderr to /dev/null, i.e.:
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 2> /dev/null
To ignore INFO messages
run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999 | awk '{if ($3 != "INFO") print $0}'

When I try to run "exec" in SBT, I get " Error running exec: java.lang.ArrayIndexOutOfBoundsException: 0". How to fix?

If I create an sbt project, even a simple "hello world", compile it (successfully) and then run exec, the following error is thrown. What may the reason be, and how do I fix it?
java.lang.ArrayIndexOutOfBoundsException: 0
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at sbt.SimpleProcessBuilder.run(ProcessImpl.scala:381)
at sbt.AbstractProcessBuilder.run(ProcessImpl.scala:130)
at sbt.AbstractProcessBuilder.$bang(ProcessImpl.scala:158)
at sbt.ExecProject$$anonfun$execOut$1.apply(ScalaProject.scala:436)
at sbt.ExecProject$$anonfun$execOut$1.apply(ScalaProject.scala:435)
at sbt.TaskManager$Task.invoke(TaskManager.scala:62)
at sbt.impl.RunTask.doRun$1(RunTask.scala:77)
at sbt.impl.RunTask.runTask(RunTask.scala:85)
at sbt.impl.RunTask.run(RunTask.scala:32)
at sbt.impl.RunTask$.apply(RunTask.scala:17)
at sbt.impl.RunTask$.apply(RunTask.scala:16)
at sbt.Project$class.run(Project.scala:98)
at sbt.Project$class.call(Project.scala:93)
at sbt.BasicScalaProject.call(DefaultProject.scala:21)
at sbt.xMain$$anonfun$7.apply(Main.scala:512)
at sbt.xMain$$anonfun$7.apply(Main.scala:512)
at sbt.xMain.withAction(Main.scala:541)
at sbt.xMain.sbt$xMain$$handleAction(Main.scala:512)
at sbt.xMain.handleCommand(Main.scala:502)
at sbt.xMain.processAction(Main.scala:441)
at sbt.xMain.process$1(Main.scala:257)
at sbt.xMain$Continue$1.apply(Main.scala:132)
at sbt.xMain.run$1(Main.scala:136)
at sbt.xMain.processArguments(Main.scala:266)
at sbt.xMain.startProject(Main.scala:107)
at sbt.xMain.run(Main.scala:84)
at sbt.xMain.run0$1(Main.scala:35)
at sbt.xMain.run(Main.scala:42)
at xsbt.boot.Launch$.run(Launch.scala:53)
at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:42)
at xsbt.boot.Launch$$anonfun$explicit$1.apply(Launch.scala:42)
at xsbt.boot.Launch$.launch(Launch.scala:57)
at xsbt.boot.Launch$.explicit(Launch.scala:42)
at xsbt.boot.Launch$.initialized(Launch.scala:38)
at xsbt.boot.Launch$.parsed(Launch.scala:31)
at xsbt.boot.Launch$.configured(Launch.scala:21)
at xsbt.boot.Launch$.apply(Launch.scala:16)
at xsbt.boot.Launch$.apply(Launch.scala:13)
at xsbt.boot.Boot$.runImpl(Boot.scala:24)
at xsbt.boot.Boot$.run(Boot.scala:19)
at xsbt.boot.Boot$.main(Boot.scala:15)
at xsbt.boot.Boot.main(Boot.scala)
[info] == exec ==
[error] Error running exec: java.lang.ArrayIndexOutOfBoundsException: 0
The purpose of the build action exec is to execute a command in the underlying shell. As such it needs to be followed by a command, e.g.:
exec killall firefox
Under the covers, sbt calls java.lang.ProcessBuilder, which throws this exception if the caller tries to start it without providing any command. From the Javadoc of ProcessBuilder.start():
IndexOutOfBoundsException - If the command is an empty list (has size 0)
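You can reproduce the underlying failure directly, outside of sbt (a minimal sketch):

// An empty command list makes ProcessBuilder.start() throw before any
// process is launched, which is exactly what a bare `exec` runs into.
object EmptyCommandDemo {
  def main(args: Array[String]): Unit = {
    try {
      new ProcessBuilder().start() // no command supplied
    } catch {
      case e: IndexOutOfBoundsException =>
        println(s"start() failed as documented: $e")
    }
  }
}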
I reckon SBT should not be propagating this exception and that this is a bug. You should get an error message instead.
Perhaps you were looking for the build action run, which will invoke your main class.