I am very new to Scala/Spark. I have a word-count program where I pass the input file as an argument instead of hardcoding it. But when I run the program I get an error: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
I think it's because I have not supplied the argument that main expects, but I don't know how to do so.
I tried running the program as is and also tried changing the run configurations. I do not know how to pass the filename (in code) as an argument to my main class.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object First {
  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
    val sc = new SparkContext(cf)
    val input = sc.textFile(filename)
    val w = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    w.collect.foreach(println)
    w.saveAsTextFile(args(1))
  }
}
I wish to run this program by passing the right arguments (the input file and the output file) to my main class. I am using the Scala Eclipse IDE. I do not know what changes to make in my program; please help me out here as I am new.
In the run configuration for the project, there is an option right next to main called '(x)=Arguments' where you can pass in arguments to main in the 'Program Arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
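For instance, a minimal guard at the top of main makes a missing argument obvious (a sketch; the usage-message wording is my own):

if (args.length < 2) {
  System.err.println("Usage: First <inputFile> <outputDir>")
  sys.exit(1)
}
val filename = args(0)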
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like any normal command line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, if the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a Spark submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. I believe args(0) should then be path/to/inputfile.
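Since your program also writes its result via args(1), you would pass both paths after the JAR in the same way (the paths below are placeholders):

spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile path/to/outputdir

args(0) is then the input path and args(1) the output directory.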
However, like any command-line program, it's generally better to use POSIX-style (or at least named) arguments, and there are several good argument-parsing libraries out there. Personally, I love using Scallop as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing.
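For reference, a minimal Scallop sketch (the option names here are illustrative, not from the question):

import org.rogach.scallop._

class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  // both options are examples; rename them to whatever your job needs
  val input  = opt[String](required = true, descr = "path to the input file")
  val output = opt[String](required = true, descr = "path to the output directory")
  verify()
}

// usage inside main:
// val conf = new Conf(args)
// val inputPath  = conf.input()
// val outputPath = conf.output()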
Hopefully this fixes your issue!
When debugging Spark/Scala code with IntelliJ, using e.g. df.select($"mycol") does not work in the Evaluate Expression window, while df.select(col("mycol")) works fine (but requires a code change):
It says:
Error during generated code invocation:
com.intellij.debugger.engine.evaluation.EvaluateException: Error evaluating method : 'invoke': Method threw 'java.lang.NoSuchFieldError' exception.: Error evaluating method : 'invoke': Method threw 'java.lang.NoSuchFieldError' exception.
Strangely, it seems to work sometimes, especially if the $ is already part of an existing expression in the code that I mark and evaluate. If I write arbitrary expressions (code fragments), it fails consistently.
EDIT: even repeating import spark.implicits._ in the code-fragment window does not help.
Try this workaround:
import spark.implicits._
$""
df.select($"diff").show()
It seems that putting import spark.implicits.StringToColumn at the top of the code fragment works.
I think the reason is that IntelliJ does not realize that the import of the implicits is used in the first place (it is rendered gray), therefore it's not available in the Evaluate Expression context.
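In other words, the code fragment typed into the Evaluate Expression window would look roughly like this (df and mycol are placeholders from the question above):

import spark.implicits.StringToColumn   // brings the $"..." string interpolator into scope
df.select($"mycol").show()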
I've had a similar issue with NoSuchFieldError because of a missing spark instance within the "Evaluate Expression" dialog. See the example below.
import org.apache.spark.sql.SparkSession

class MyClass(implicit spark: SparkSession) {
  import spark.implicits._

  def myFunc1() = {
    // breakpoint here
  }
}
My workaround was to modify the spark instance declaration in the constructor: I changed "implicit spark" to "implicit val spark".
...and it works :)
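For illustration, the change amounts to this one-line difference in the class header (a sketch of the class above):

// before: spark is only a constructor parameter
class MyClass(implicit spark: SparkSession) { /* ... */ }

// after: val makes spark a field of MyClass, so it is available in Evaluate Expression
class MyClass(implicit val spark: SparkSession) { /* ... */ }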
With the purpose of saving a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class, used for the creation of my DataFrame, is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program in Eclipse, it throws the following error:
Is something missing inside the Scala class that I'm using and running with Eclipse? Or what could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I execute my program in Eclipse, it throws the following error:
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable and so prevents your Spark computation from being serialized and sent across the wire to executors.
As #T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse you added some additional lines that you don't show, thinking they are not necessary, which happens to be quite the contrary. Find any places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affected method is LeerInsumosAValidar, which does the map transformation, and that's where you should seek the answer.
Your case class must have public scope: you can't have ArchivoProcesar nested inside a class.
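For illustration, a minimal sketch of that layout (the class and method names follow the question; the SparkSession wiring and the path parameter are my assumptions):

import org.apache.spark.sql.SparkSession

// top-level case class: no hidden reference to an enclosing, non-serializable instance
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

class ValidacionInsumos(val spark: SparkSession) {
  import spark.implicits._

  // hypothetical signature, based on the method name mentioned above
  def LeerInsumosAValidar(path: String) =
    spark.sparkContext
      .textFile(path)
      .map(_.split("\\|"))
      .map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
      .toDF()
}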
I am trying to build a git command parser in sbt.
The goal of the parser is not so much to validate the actual git command but rather to provide auto-completion within the sbt console.
The parser relies on bash completion scripts, so it's fair to say that generating the completions is fairly expensive, as a process has to be spawned every time. That's why I'd like to minimize the number of calls made to the bash-completion process.
I have a working solution, that looks like this:
def autoCompleteParser(state: State) = {
  val extracted = Project.extract(state)
  import extracted._
  val dir = extracted.get(baseDirectory)

  def suggestions(args: Seq[String]): Seq[String] = {
    // .. calling Process and collecting the completions into a Seq[String]
  }

  val gitArgsParser: Parser[Seq[String]] = {
    def loop(previous: Seq[String]): Parser[Seq[String]] =
      token(Space) ~> NotSpace.examples(suggestions(previous): _*).flatMap(res => loop(previous :+ res))
    loop(Vector())
  }

  gitArgsParser
}

val test = Command("git-auto-complete")(autoCompleteParser _)(autoCompleteAction)
However I have two problems:
the completion process is called for every character, which is more than I'd like
the potential completions seem to be passed as a parameter to another round of completions, which means even more calls to the external process.
My question is the following: how do I tell sbt to reuse/cache the completions it has already obtained for the rest of an argument, without calling the process for each character? For example:
completions for 'git a' are:
dd m nnotate pply rchive
Then completions for 'git ad' are:
d
without the need to call the suggestions method again. I have tried to implement an ExampleSource, but I could not obtain the behavior I was looking for from it.
Any pointer would be welcome. And if someone understands why the potential completions seem to be passed into another completion round, that would help me a lot too.
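For what it's worth, the kind of reuse I'm after is roughly a per-prefix cache around suggestions, something like this hypothetical sketch (CachedSuggestions is a name I'm making up here, not part of my build):

import scala.collection.mutable

// hypothetical helper: memoize the external-process lookup, keyed by the arguments seen so far
class CachedSuggestions(lookup: Seq[String] => Seq[String]) {
  private val cache = mutable.Map.empty[Seq[String], Seq[String]]
  def apply(previous: Seq[String]): Seq[String] =
    cache.getOrElseUpdate(previous, lookup(previous))
}

// inside autoCompleteParser this would be: val suggest = new CachedSuggestions(suggestions)
// and then NotSpace.examples(suggest(previous): _*) instead of calling suggestions directly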
Is it possible to write a Spark script that has arguments that can be referred to by name rather than by index in the args() array? I have a script that has 4 required arguments and, depending on their values, may require up to 3 additional arguments. For example, in one case args(5) might be a date I need to enter. In another, that date may end up in args(6) because of another argument I need.
Scalding has this implemented but I don't see where Spark does.
I actually overcame this pretty simply. You just need to preface each argument with a name and a delimiter, say "--", when you call your application:
spark-submit --class com.my.application --master yarn-client ./spark-myjar-assembly-1.0.jar input--hdfs:/path/to/myData output--hdfs:/write/to/yourData
Then include this line at the beginning of your code:
val namedArgs = args.map(x => x.split("--")).map(y => (y(0), y(1))).toMap
This converts the default args array into a Map called namedArgs (or whatever you want to call it). From there on, just refer to the Map and call all of your arguments by name.
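Continuing the example above, the lookups then look like this (the key names come from the spark-submit line shown earlier):

val inputPath  = namedArgs("input")    // "hdfs:/path/to/myData"
val outputPath = namedArgs("output")   // "hdfs:/write/to/yourData"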
Spark does not provide such functionality.
You can use Args from Scalding (if you don't mind the dependency for such a small class):
val args = Args(argsArr.toIterable)
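Continuing from the args value above, lookups are then by name (the key names are illustrative; check the accessors against the Scalding version you use):

// assumes the app was invoked with e.g. --input /some/path --date 2017-01-01
val input = args.required("input")   // String; fails with an error if the key is missing
val date  = args.optional("date")    // Option[String] for a conditionally required argument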
You can also use any CLI library that provides the parsing features you may want.
This is a simple program. I expected main to run in interpreted mode, but the presence of another object caused it to do nothing. If QSort were not present, the program would have executed.
Why is main not called when I run this in the REPL?
object MainObject {
  def main(args: Array[String]) = {
    val unsorted = List(8, 3, 1, 0, 4, 6, 4, 6, 5)
    print("hello" + unsorted toString)
    //val sorted = QSort(unsorted)
    //sorted foreach println
  }
}

//this must not be present
object QSort {
  def apply(array: List[Int]): List[Int] = {
    array
  }
}
EDIT: Sorry for causing confusion, I am running the script as scala filename.scala.
What's happening
If the parameter to scala is an existing .scala file, it will be compiled in memory and run. When there is a single top-level object, a main method will be searched for and, if found, executed. If that's not the case, the top-level statements are wrapped in a synthetic main method, which gets executed instead.
This is why removing the top-level QSort object allows your main method to run.
If you're going to expand this into a full program, I advise compiling it (with a build tool like sbt) and running the resulting .class files:
scalac main.scala && scala MainObject
If you're writing a single file script, just drop the main method (and its object) and write the statements you want executed in the outer scope, like:
// qsort.scala
object QSort {
  def apply(array: List[Int]): List[Int] = {
    array
  }
}

val unsorted = List(8, 3, 1, 0, 4, 6, 4, 6, 5)
print("hello" + unsorted toString)
val sorted = QSort(unsorted)
sorted foreach println
and run with: scala qsort.scala
A little context
The scala command is meant for executing both scala "scripts" (single file programs) and complex java-like programs (with a main object and a bunch of classes in the classpath).
From man scala:
The scala utility runs Scala code using a Java runtime environment.
The Scala code to run is specified in one of three ways:
1. With no arguments specified, a Scala shell starts and reads commands interactively.
2. With -howtorun:object specified, the fully qualified name of a top-level Scala object may be specified. The object should previously have been compiled using scalac(1).
3. With -howtorun:script specified, a file containing Scala code may be specified.
If not explicitly specified, the howtorun mode is guessed from the arguments passed to the script.
When given a fully qualified name of an object, scala will guess -howtorun:object and expect a compiled object with that name on the path.
Otherwise, if the parameter to scala is an existing .scala file, -howtorun:script is guessed and the entry point is selected as described above.
Any method of an object (module) can be run in the REPL by explicitly specifying it and giving it the arguments it requires, if any. For example:
scala> object MainObject{
| def main(args: Array[String])={
| val unsorted = List(9,3,1,0,7,5,9,3,11)
| print("sorted: " + unsorted.sorted)
| }
| def fun = println("fun here")
| }
defined module MainObject
scala> MainObject.main(Array(""))
sorted: List(0, 1, 3, 3, 5, 7, 9, 9, 11)
scala> MainObject.fun
fun here
In some cases this can be useful for quick testing and troubleshooting.