Executing all objects in a package sequentially with spark-submit - Scala

I'm looking for a way to execute all Scala objects in a package sequentially using spark-submit. I'm working on an ETL job. I have four Scala objects (let's say Object_1, Object_2, Object_3 and Object_4), all in one package, let's say etl. This package is then exported to a .jar file (say all_four.jar):
EXTRACTION - Object_1 and Object_2
TRANSFORMATION - Object_3
LOAD - Object_4
I know I can execute each object with the following spark-submit command:
./spark-submit --class etl.x --jars path/to/jar/if/any.jar path/to/exported/jar/all_four.jar arg(0)......arg(n)
where x represents each Scala object in the package.
However, I'm looking for a way to call the package only once and have all objects executed in the following sequence:
Step 1 - Object_1 and Object_2 (Extraction) can be executed concurrently or even simultaneously; they just both have to complete
Step 2 - Object_3 (Transformation) is executed
Step 3 - Object_4 (Load) is executed
Is there a way to do this with Spark Submit? Or are there better and more efficient ways to pull this off?

One way is to write a wrapper object (Execute) that contains the step 1, step 2 and step 3 logic and invokes them in sequence. You can include this wrapper object along with the four objects if you have source code access.
A sample wrapper looks like the one below; you may need to modify it according to your needs.
import etl.{Object_1, Object_2, Object_3, Object_4}

object Execute {
  def extract(): Unit = {
    // Make use of Object_1 & Object_2 logic here.
  }

  def transform(): Unit = {
    // Make use of Object_3 logic here.
  }

  def load(): Unit = {
    // Make use of Object_4 logic here.
  }

  def main(args: Array[String]): Unit = {
    extract()
    transform()
    load()
  }
}
./spark-submit \
  --class Execute \
  --jars path/to/jar/if/any.jar \
  path/to/exported/jar/all_four.jar arg(0)......arg(n)
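To satisfy the step 1 requirement that Object_1 and Object_2 run concurrently, the wrapper's extract step can use Scala Futures. The sketch below is a minimal illustration under assumed names: ExecuteConcurrent, extract1 and extract2 are hypothetical stand-ins for the wrapper and the Object_1/Object_2 entry points.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ExecuteConcurrent {
  // Hypothetical stand-ins for the Object_1 and Object_2 logic
  def extract1(): String = "extract-1 done"
  def extract2(): String = "extract-2 done"

  // Step 1: launch both extractions concurrently, then block until both complete
  def runExtraction(): (String, String) = {
    val f1 = Future(extract1())
    val f2 = Future(extract2())
    Await.result(f1.zip(f2), Duration.Inf)
  }
}
```

Once runExtraction() returns, both extractions are guaranteed complete, so main can safely call transform() and load() afterwards in order.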

Related

Global Variable in Spark

How do I set a global variable in Spark?
I have a few files in Spark to which I need to pass a parameter, but I want to set that variable in one file (file name - kafkaCom).
We have a topic named "test" in one Kafka environment, and I have two more instances. If we submit a job from the first environment, let's suppose TEST1, I want it to use test_one. If the job is submitted from TEST2, it should use test_two; if from TEST3, test_three, and so on.
So I want the keyword appended after test_ to depend on the environment the job is submitted from.
You can pass the environment as a command-line argument to the job. In the first environment,
spark-submit ... --class your.code.SparkApp code.jar one
In the second environment,
spark-submit ... --class your.code.SparkApp code.jar two
And so on.
Then access the argument in your main function and use string interpolation:
import org.apache.spark.sql.SparkSession

// Note: don't extend App and define main at the same time; App already supplies main
object SparkApp {
  def main(args: Array[String]): Unit = {
    val environment = args(0)
    val spark = SparkSession.builder.appName(s"test_${environment}").getOrCreate()
  }
}
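If the goal is to derive the Kafka topic name rather than just the Spark app name, the same argument can drive a small lookup. This is a sketch with an assumed mapping from the environment keyword to the topic name; TopicResolver and topicFor are names introduced here for illustration.

```scala
object TopicResolver {
  // Assumed mapping from the submission-environment argument to the topic name
  private val topics = Map(
    "one"   -> "test_one",
    "two"   -> "test_two",
    "three" -> "test_three"
  )

  // Fall back to the base topic "test" for an unknown environment
  def topicFor(env: String): String = topics.getOrElse(env, "test")
}
```

In main you would then call TopicResolver.topicFor(args(0)) wherever the topic name is needed.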

How to pass variable arguments to my scala program?

I am very new to Scala and Spark. I have a word-count program in which I pass the input file as an argument instead of hardcoding it. But when I run the program I get the error Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0.
I think it's because I have not supplied the argument that the main method expects, but I don't know how to do so.
I tried running the program as-is and also tried changing the run configurations. I do not know how to pass the filename as an argument to my main class.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object First {
  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
    val sc = new SparkContext(cf)
    val input = sc.textFile(filename)
    val w = input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    w.collect.foreach(println)
    w.saveAsTextFile(args(1))
  }
}
I wish to run this program by passing the right arguments (the input file and the output location) to my main class. I am using the Scala Eclipse IDE. I do not know what changes to make in my program; please help me out here as I am new.
In the run configuration for the project, there is a tab next to 'Main' called '(x)= Arguments' where you can pass arguments to main in the 'Program arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
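Beyond printing args.length, you can validate the argument count up front and fail with a usage message instead of an opaque ArrayIndexOutOfBoundsException. A minimal sketch, where ArgsCheck and validateArgs are names introduced here and not part of the original code:

```scala
object ArgsCheck {
  // Returns the (input, output) pair, or a usage message if arguments are missing
  def validateArgs(args: Array[String]): Either[String, (String, String)] =
    if (args.length < 2) Left("Usage: First <inputFile> <outputDir>")
    else Right((args(0), args(1)))
}
```

In main, pattern-match on the result: on Left, print the message and exit; on Right, proceed with the input and output paths.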
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like in any normal command-line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, if the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a Spark submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. args(0) should then be path/to/inputfile.
However, like any command-line program, it's generally better to use POSIX-style arguments (or at least named arguments), and there are several good ones out there. Personally, I love using Scallop as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing library.
Hopefully this fixes your issue!

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:
We have a module where we define some functions that return a pyspark.sql.DataFrame (DF). To get those DFs we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write the job for pyspark to execute, we only import functions from the modules (we provide a .zip file via --py-files) and then just save the dataframe to HDFS.
The issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define the udf functions inside the job and pass them to the imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf

class Udf(object):
    def __init__(s, func, spark_type):
        s.func, s.spark_type = func, spark_type

    def __call__(s, *args):
        return udf(s.func, s.spark_type)(*args)
Then I use this class to define my udfs in the module. This works!
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0. Deploying job on yarn in cluster mode.
Thanks!
The accepted answer to the question you linked says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." It looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions, and define the udf inside the public functions that you import elsewhere:
from pyspark.sql.functions import udf

class Helperclass(object):
    @staticmethod
    def my_udf_todo(...):
        ...

def public_function_that_is_imported_elsewhere(...):
    todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
    ...

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

With the purpose of saving a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using the spark-shell, and my code works fine, generating the DataFrame correctly. But when I executed my program in Eclipse, it throws org.apache.spark.SparkException: Task not serializable.
Is something missing inside the Scala class that I'm using and running with Eclipse? What could be the reason that my function works correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you, and then makes sure that references to the SparkContext do not come from "sensitive places".
But when I executed my program in Eclipse, it throws the error:
Somewhere in your Spark application you hold a reference to the org.apache.spark.SparkContext that is not serializable, and that holds your Spark computation back from being serialized and sent over the wire to the executors.
As @T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class, and as a nested class it has a reference to the outer class, which has a property of type SparkContext.
So while copying the code from the spark-shell to Eclipse you added some additional lines that you don't show, thinking they are not necessary, which happens to be quite the contrary. Find any places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affecting method is LeerInsumosAValidar, which does the map transformation, and that's where you should seek the answer.
Your case class must have public scope: you can't have ArchivoProcesar nested inside a class.
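To illustrate the fix: keep the case class at the top level of the file, so it captures no hidden reference to an enclosing class (and hence no SparkContext). The split logic itself can then be exercised on plain collections without Spark; ArchivoParser and parseLines are helper names introduced here for illustration.

```scala
// Top level: no outer-class reference is captured, so instances serialize cleanly
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

object ArchivoParser {
  // Same pipe-splitting logic as the question, applied to an ordinary Seq
  def parseLines(lines: Seq[String]): Seq[ArchivoProcesar] =
    lines.map { a =>
      val x = a.split("\\|")
      ArchivoProcesar(x(0), x(1), x(2), x(3))
    }
}
```

With the case class at the top level, the same .map over an RDD no longer drags an outer class (and its SparkContext) into the serialized closure.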

Spark: Using named arguments to submit application

Is it possible to write a Spark script that has arguments that can be referred to by name rather than by index in the args() array? I have a script that has 4 required arguments and, depending on their values, may require up to 3 additional ones. For example, in one case args(5) might be a date I need to enter; in another, that date may end up in args(6) because of another argument I need.
Scalding has this implemented, but I don't see where Spark does.
I actually overcame this pretty simply. You just need to preface each argument with a name and a delimiter, say "--", when you call your application:
spark-submit --class com.my.application --master yarn-client ./spark-myjar-assembly-1.0.jar input--hdfs:/path/to/myData output--hdfs:/write/to/yourData
Then include this line at the beginning of your code:
val namedArgs = args.map(x=>x.split("--")).map(y=>(y(0),y(1))).toMap
This converts the default args array into a Map called namedArgs (or whatever you want to call it). From there on, just refer to the Map and look up all of your arguments by name.
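As a small refinement of that one-liner, splitting with a limit of 2 keeps any "--" that appears inside a value intact. A sketch, with NamedArgs.parse as a name chosen here for illustration:

```scala
object NamedArgs {
  // Split each "name--value" argument on the first "--" only,
  // so values containing "--" are preserved
  def parse(args: Array[String]): Map[String, String] =
    args.map(_.split("--", 2)).map(y => (y(0), y(1))).toMap
}
```

Calling parse(args) at the top of main gives you the same named lookup as the one-liner above, with slightly safer splitting.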
Spark does not provide such functionality.
You can use Args from Scalding (if you don't mind the dependency for such a small class):
val args = Args(argsArr.toIterable)
You can also use any CLI library that provides the parsing features you may want.