Is it possible to write a Spark script that has arguments that can be referred to by name rather than by index in the args() array? I have a script with 4 required arguments and, depending on their values, up to 3 additional arguments may be required. For example, in one case args(5) might be a date I need to enter. In another, that date may end up in args(6) because of another argument I need.
Scalding has this implemented, but I don't see where Spark does.
I actually overcame this pretty simply. You just need to preface each argument with a name and a delimiter, say "--", when you call your application:
spark-submit --class com.my.application --master yarn-client ./spark-myjar-assembly-1.0.jar input--hdfs:/path/to/myData output--hdfs:/write/to/yourData
Then include this line at the beginning of your code:
val namedArgs = args.map(x => x.split("--")).map(y => (y(0), y(1))).toMap
This converts the default args array into a Map called namedArgs (or whatever you want to call it). From there on, just refer to the Map and call all of your arguments by name.
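For example, lookups against the spark-submit call above could look like this (the key names "input", "output", and "date" are just the ones used in that example):

val inputPath  = namedArgs("input")   // "hdfs:/path/to/myData"
val outputPath = namedArgs("output")  // "hdfs:/write/to/yourData"
// optional arguments can be looked up safely as an Option[String]
val maybeDate  = namedArgs.get("date")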
Spark does not provide such functionality.
You can use Args from Scalding (if you don't mind the dependency for such a small class):
val args = Args(argsArr.toIterable)
You can also use any CLI library that provides the parsing features you may want.
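For instance, a minimal sketch using scopt as the CLI library (the Config case class and option names here are purely illustrative, and the exact API depends on the scopt version you use):

import scopt.OptionParser

case class Config(input: String = "", output: String = "", date: Option[String] = None)

val parser = new OptionParser[Config]("my-spark-app") {
  opt[String]("input").required().action((x, c) => c.copy(input = x))
  opt[String]("output").required().action((x, c) => c.copy(output = x))
  opt[String]("date").action((x, c) => c.copy(date = Some(x)))  // optional by default
}

parser.parse(args, Config()) match {
  case Some(config) => // run the job using config.input, config.output, config.date
  case None         => sys.exit(1) // invalid arguments; scopt has already printed usage
}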
Related
I am creating unit tests for a Scala application using ScalaTest. I have the actual and expected results as Datasets. When I verified manually, both the data and the schema match between the actual and expected datasets.
Actual Dataset = actual_ds
Expected Dataset = expected_ds
When I execute the command below, it returns false.
assert(actual_ds.equals(expected_ds))
Could anyone suggest what the reason could be? And is there any other built-in function to compare datasets in Scala?
Use one of the libraries designed for Spark tests: spark-fast-tests, spark-testing-base, spark-test.
They are quite easy to use, and with their help it's easy to compare two datasets with a formatted message in the output.
You may start with spark-fast-tests (you can find usage examples in its README) and check the others if it does not suit your needs (for example, if you need different output formatting).
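A minimal sketch with spark-fast-tests, based on its README (the exact class and method names may vary with the version you use):

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.scalatest.funsuite.AnyFunSuite

class DatasetEqualitySpec extends AnyFunSuite with DatasetComparer {
  test("actual dataset matches expected dataset") {
    // actual_ds and expected_ds are the datasets from the question
    assertSmallDatasetEquality(actual_ds, expected_ds)
  }
}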
That .equals() is inherited from Java's Object.equals, so it's expected that the assert fails.
I would start testing two datasets with:
assert(actual_ds.schema == expected_ds.schema)
assert(actual_ds.count() == expected_ds.count())
And then checking this question: DataFrame equality in Apache Spark
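If you prefer to stay with built-in operations, a rough content check (ignoring row order, assuming the schemas already match, and assuming Spark 2.4+ for Dataset.isEmpty) could look like this:

// note: except is set-based, so differing duplicate counts are not detected
val onlyInActual   = actual_ds.except(expected_ds)
val onlyInExpected = expected_ds.except(actual_ds)
assert(onlyInActual.isEmpty && onlyInExpected.isEmpty)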
My understanding of how Spark distributes code to the nodes running it is merely cursory, and I cannot get my code to run successfully with Spark's mapPartitions API when I wish to instantiate a class once per partition, with an argument.
The code below worked perfectly, up until I evolved the class MyWorkerClass to require an argument:
val result: DataFrame =
  inputDF.as[Foo].mapPartitions(sparkIterator => {
    // (1) initialize heavy class instance once per partition
    val workerClassInstance = MyWorkerClass(bar)
    // (2) provide an iterator using a function from that class instance
    new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
  })
The code above worked perfectly well up to the point when I had (or chose) to add a constructor argument to my class MyWorkerClass. The passed argument value turns out to be null in the worker, instead of the real value of bar. Somehow the serialization of the argument fails to work as intended.
How would you go about this?
Additional Thoughts/Comments
I'll avoid adding the bulky code of CloseableIteratorForSparkMapPartitions; it merely provides a Spark-friendly iterator and might not even be the most elegant implementation at that.
As I understand it, the constructor argument is not being correctly passed to the Spark worker because of how Spark captures state when serializing code to send for execution on the worker. However, instantiating the class does seamlessly make the heavy-to-load assets included in that class available to the function provided on the last line of my code above, and the class did seem to be instantiated once per partition, which is actually a valid, if not key, use case for using mapPartitions instead of map.
It's the passing of an argument to its instantiation that I am having trouble figuring out how to enable or work around. In my case this argument is a value that only becomes known after the program has started running (even though it stays invariant throughout a single execution of my job; it's actually a program argument). I do need it passed along for the initialization of the class.
I tried to work around this by providing a function which instantiates MyWorkerClass with its input argument, rather than instantiating it directly as above, but this did not solve matters.
The root symptom of the problem is not an exception, but simply that the value of bar when MyWorkerClass is instantiated is just null, instead of the actual value of bar, which is known in the scope of the code enclosing the snippet I included above!
* one related old Spark issue discussion here
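(Not part of the original post, but for illustration: a commonly suggested workaround when a closure captures a field of an enclosing object is to copy the value into a local val first, so the closure captures only the plain value rather than the enclosing instance, whose state may not survive serialization. A sketch under that assumption:)

val localBar = bar // plain local value; the closure below captures only this val

val result =
  inputDF.as[Foo].mapPartitions { sparkIterator =>
    val workerClassInstance = MyWorkerClass(localBar) // now receives the real value
    new CloseableIteratorForSparkMapPartitions[Post, Post](sparkIterator, workerClassInstance.recordProcessFunc)
  }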
I have a line of code in a Scala app that takes a DataFrame with one column and two rows, and assigns them to the variables start and end:
val Array(start, end) = datesInt.map(_.getInt(0)).collect()
This code works fine when run in a REPL, but when I try to put the same line in a Scala object in IntelliJ, it inserts a grey (?: Encoder[Int]) before the .collect() statement and shows an inline error: No implicits found for parameter evidence$6: Encoder[Int]
I'm pretty new to Scala and I'm not sure how to resolve this.
Spark needs to know how to serialize the JVM types in your Dataset into its internal format so it can move and process the data across the driver and executors. In some cases the encoders can be generated automatically, and for common types there are explicit implementations written by the Spark developers; either way, they are passed implicitly. If your SparkSession is named spark, then you are missing the following line:
import spark.implicits._
As you are new to Scala: implicits are parameters that you don't have to pass explicitly. In your example, the map function requires an Encoder[Int]. By adding this import, it is brought into scope and thus passed automatically to the map function.
Check Scala documentation to learn more.
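Put together, a minimal sketch (the DataFrame below is just a stand-in for the single-column, two-row datesInt from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._ // brings Encoder[Int] (among other implicits) into scope

// stand-in for the single-column, two-row DataFrame from the question
val datesInt = Seq(20200101, 20200131).toDF("date")

// the original line now compiles without an explicit encoder
val Array(start, end) = datesInt.map(_.getInt(0)).collect()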
I am very new to Scala and Spark. Here I have a wordcount program wherein I pass the input file as an argument instead of hardcoding it. But when I run the program I get an error: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
I think it's because I have not supplied the argument that my main method expects, but I don't know how to do so.
I tried running the program as-is and also tried changing the run configurations. I do not know how to pass the filename (used in the code) as an argument to my main method.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object First {
  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
    val sc = new SparkContext(cf)
    val input = sc.textFile(filename)
    val w = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    w.collect.foreach(println)
    w.saveAsTextFile(args(1))
  }
}
I wish to run this program by passing the right arguments (the input file and the output path) to my main method. I am using the Scala Eclipse IDE. I do not know what changes to make in my program; please help me out here as I am new.
In the run configuration for the project, there is an option right next to main called '(x)=Arguments' where you can pass in arguments to main in the 'Program Arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
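For example, a small guard at the top of main (the usage message is just illustrative) makes a missing argument fail fast with a clearer error than ArrayIndexOutOfBoundsException:

if (args.length < 2) {
  System.err.println(s"Expected 2 arguments but got ${args.length}. Usage: First <inputFile> <outputDir>")
  sys.exit(1)
}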
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like any normal command line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, if the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a Spark submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. I believe args(0) should then be path/to/inputfile.
However, like any command-line program, it's generally better to use POSIX-style arguments (or at least named arguments), and there are several good ones out there. Personally, I love using Scallop as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing library.
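A minimal sketch with Scallop (the option names are just for illustration; check the Scallop documentation for the exact API of your version):

import org.rogach.scallop._

class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val input  = opt[String](required = true, descr = "path to the input file")
  val output = opt[String](required = true, descr = "directory to write the results to")
  verify()
}

// inside main:
val conf = new Conf(args.toIndexedSeq)
val inputPath  = conf.input()
val outputPath = conf.output()
// then build the SparkContext and run the word count as before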
Hopefully this fixes your issue!
It is easy in Hadoop to use .replace(), for example:
String[] valArray = value.toString().replace("\N", "")
But it doesn't work in Spark. I write Scala in the Spark shell like below:
val outFile=inFile.map(x=>x.replace("\N",""))
So, how do I deal with it?
For some reason your x is an Array[String]. How did you get it like that? You can .toString.replace it if you like, but that will probably not get you what you want (and would give the wrong output in Java anyway); you probably want to do another layer of map: inFile.map(x => x.map(_.replace("\N","")))
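For illustration, a sketch assuming inFile was built by splitting tab-delimited lines (a common way to end up with an RDD of Array[String], e.g. a Hive export where NULL is written as \N):

// the backslash itself must be escaped in a Scala string literal, hence "\\N" for Hive's \N marker
val inFile  = sc.textFile("/path/to/data").map(_.split("\t"))
val outFile = inFile.map(fields => fields.map(_.replace("\\N", "")))
outFile.take(5).foreach(fields => println(fields.mkString("\t")))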