Global Variable in Spark - scala

How do I set a global variable in Spark (Scala)?
I have a few files in Spark to which I need to pass a parameter, but I want to set that variable in a single file (file name: kafkaCom).
We have a Kafka topic named "test" in one environment, and I have two more instances. If a job is submitted from the first environment (say TEST1), I want to use test_one; if it is submitted from TEST2, test_two; if from TEST3, test_three, and so on.
In other words, I want the suffix appended to test_ to depend on the environment the job is submitted from.

To get test_one when the job is submitted from TEST1, test_two from TEST2, test_three from TEST3, and so on, pass the environment suffix to spark-submit as a program argument.
In the first environment:
spark-submit ... --class your.code.SparkApp code.jar one
In the second environment:
spark-submit ... --class your.code.SparkApp code.jar two
And so on.
Then access the argument in your main method and use string interpolation:
import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    val environment = args(0)
    val spark = SparkSession.builder.appName(s"test_${environment}").getOrCreate()
  }
}
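If the interpolated name also needs to be the Kafka topic (which is what the question is really after), the same environment argument can be reused. A minimal sketch, assuming Structured Streaming with the spark-sql-kafka package on the classpath and a hypothetical broker address, continuing inside main() above:
// Continuing inside main(), after the SparkSession has been created:
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-host:9092") // hypothetical broker address
  .option("subscribe", s"test_${environment}")           // test_one, test_two, ...
  .load()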

Related

Executing All Objects in A Package Sequentially With Spark Submit

I'm looking for a way to execute all Scala objects in a package sequentially using spark-submit. I'm working on an ETL job. I have 4 Scala objects (say Object_1, Object_2, Object_3 and Object_4), all in one package, let's say etl. This package is then exported to a .jar file (say all_four.jar):
EXTRACTION - Object_1 and Object_2
TRANSFORMATION - Object_3
LOAD - Object_4
I know I can execute each object with the following spark-submit command:
./spark-submit --class etl.x --jars path/to/jar/if/any.jar path/to/exported/jar/all_four.jar arg(0)......arg(n)
where x represents each Scala object in the package.
However, I'm looking for a way to call the package only once and have all objects executed in the following sequence:
Step 1 - Object_1 and Object_2 (Extraction) can be executed concurrently; they just both have to be completed before the next step
Step 2 - Object_3 (Transformation) is executed
Step 3 - Object_4 (Load) is executed
Is there a way to do this with Spark Submit? Or are there better and more efficient ways to pull this off?
One way is to write a wrapper object (Execute) that contains the Step 1, Step 2 and Step 3 logic and invokes them in sequence. You can include this wrapper object alongside those four objects if you have access to the source code.
A sample wrapper is shown below; you may need to adapt it to your needs.
import etl.{Object_1, Object_2, Object_3, Object_4}

object Execute {
  def extract(): Unit = {
    // Make use of Object_1 & Object_2 logic here.
  }

  def transform(): Unit = {
    // Make use of Object_3 logic here.
  }

  def load(): Unit = {
    // Make use of Object_4 logic here.
  }

  def main(args: Array[String]): Unit = {
    extract()
    transform()
    load()
  }
}
./spark-submit \
--class Execute \
--jars path/to/jar/if/any.jar path/to/exported/jar/all_four.jar arg(0)......arg(n)
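Since the question asks for Object_1 and Object_2 to run concurrently in Step 1, the extract() method above could be filled in with Scala Futures. A rough sketch, assuming each object exposes an entry point such as a hypothetical run() method:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

def extract(): Unit = {
  // Hypothetical run() entry points; adapt to however Object_1 and Object_2 expose their logic.
  val extraction1 = Future(etl.Object_1.run())
  val extraction2 = Future(etl.Object_2.run())
  // Block until both extraction tasks have completed before transform() starts.
  Await.result(Future.sequence(Seq(extraction1, extraction2)), Duration.Inf)
}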

How to pass variable arguments to my scala program?

I am very new to Scala and Spark. I have a word count program in which I pass the input file as an argument instead of hardcoding it. But when I run the program I get Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0.
I think it's because I have not supplied the argument expected by the main class, but I don't know how to do so.
I tried running the program as-is and also tried changing the run configurations. I do not know how to pass the filename as an argument to my main class.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

object First {
  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
    val sc = new SparkContext(cf)
    val input = sc.textFile(filename)
    val w = input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    w.collect.foreach(println)
    w.saveAsTextFile(args(1))
  }
}
I wish to run this program by passing the right arguments (the input file and the output path) to my main class. I am using the Scala Eclipse IDE. I do not know what changes to make in my program; please help me out here as I am new.
In the run configuration for the project, there is an option right next to Main called '(x)= Arguments' where you can pass arguments to main in the 'Program Arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like any normal command-line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, if the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a Spark submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. args(0) should then be path/to/inputfile.
However, like any command-line program, it's generally better to use POSIX-style arguments (or at least named arguments), and there are several good ones out there. Personally, I love using Scallop as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing library.
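As a rough sketch of what that might look like for the word-count program above (the exact API depends on the Scallop version; the argument names here are just illustrative):
import org.rogach.scallop._

// Declares two positional (trailing) arguments: the input file and the output path.
class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val input = trailArg[String](descr = "input file to count words in")
  val output = trailArg[String](descr = "directory to save the counts to")
  verify()
}

object First {
  def main(args: Array[String]): Unit = {
    val conf = new Conf(args.toIndexedSeq)
    val filename = conf.input()
    val outputPath = conf.output()
    // ... build the SparkContext and run the same word count as above,
    // reading from filename and writing to outputPath
  }
}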
Hopefully this fixes your issue!

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

In order to load a file (delimited by |) into a DataFrame, I have written the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program in Eclipse, it throws org.apache.spark.SparkException: Task not serializable.
Is something missing from the Scala class that I'm running with Eclipse? What could be the reason that my code works correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I execute my program in Eclipse, it throws org.apache.spark.SparkException: Task not serializable.
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and that holds your Spark computation back from being serialized and sent across the wire to the executors.
As @T. Gawęda mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse, you added some additional lines that you don't show here, thinking they were not necessary, when in fact quite the opposite is true. Find any place where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the method responsible is LeerInsumosAValidar, which does the map transformation, and that's where you should look for the answer.
Your case class must have public scope: you can't have ArchivoProcesar nested inside a class.
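A minimal sketch of that fix, assuming ArchivoProcesar was previously nested inside the class holding the SparkContext (the driver object name here is hypothetical): define the case class at the top level so the closures no longer capture the enclosing, non-serializable class.
import org.apache.spark.sql.SparkSession

// Top-level case class: it carries no reference to any enclosing class.
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

// Hypothetical driver object standing in for ValidacionInsumos.
object ValidacionInsumosApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ValidacionInsumos").getOrCreate()
    import spark.implicits._

    val dfInsumos = spark.sparkContext
      .textFile("path/file/")
      .map(_.split("\\|"))
      .map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
      .toDF()

    dfInsumos.show()
  }
}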

Using Spark From Within My Application

I am relatively new to Spark. I have created scripts that run for offline, batch-processing tasks, but now I am attempting to use Spark from within a live application (i.e. not using spark-submit).
My application looks like the following:
An HTTP request will be received to initiate a data-processing task. The payload of the request will contain the data needed for the Spark job (the time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
Processing begins, which will eventually produce a result. The result is expected to fit into memory.
The result is sent to an endpoint (Written to a single file, sent to a client through Web Socket or SSE, etc)
The fact that the job is initiated by an HTTP request, and that the Spark job is dependent on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))
    val sc = new SparkContext(sparkConf)

    val localData = Seq.fill(10000)(1)
    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))

    println(s"My Result ${data.count()}")
    sc.stop()
  }
}
class Test(num: Int) extends Serializable
This is then run with:
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes"), but that would be very awkward to work with and deploy. I've also considered not using any classes within the Spark logic and only using tuples, which would be just as awkward to use.
Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?
Currently spark-submit is the only way to submit Spark jobs to a cluster, so you have to create a fat jar consisting of all your classes and dependencies and use it with spark-submit. You could use command-line arguments to pass either data or paths to configuration files.
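For that last point, a minimal sketch of passing a configuration path as a command-line argument (the file path and fallback behaviour are illustrative), keeping the rest of the main method as in the question:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

object MyApplication {
  def main(args: Array[String]): Unit = {
    // Hypothetical: the first argument, if present, points at a job-specific config file.
    val config: Config =
      if (args.nonEmpty) ConfigFactory.parseFile(new File(args(0))).resolve()
      else ConfigFactory.load()

    // ... build SparkConf / SparkContext from config as in the question's main method
  }
}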

Submitting a streaming job to a spark cluster via a Spark Context

I have an operation, RDD[T] => Unit, that I would like to submit as a Spark job using a Spark StreamingContext. The Spark job should stream values from myStream, passing each instance of RDD[T] to operation.
Originally, I accomplished this by creating a Spark job with a main() function that makes use of myStream.foreachRDD() and supplying the class name to spark-submit on the command line. However, I would rather avoid making a call out to the shell and instead submit the job using the StreamingContext. This would be more elegant and would allow me to terminate the job at will simply by calling streamingContext.stop().
I suspect that the solution is to make use of streamingContext.sparkContext.runJob(), but this would require supplying additional arguments that I did not have to provide when using spark-submit: namely a single RDD[T] instance and partition information. Is there a sensible way to provide "default" values for these parameters (to mirror the utility of spark-submit), or is there another approach that I could be missing?
Code Snippet
val streamingContext: StreamingContext = ...
val eventStream: DStream[T] = ...

eventStream.foreachRDD { rdd =>
  rdd.toLocalIterator.toSeq.take(200).foreach { message =>
    message.foreach { content =>
      // process message content
    }
  }
}

streamingContext.start()
streamingContext.awaitTermination()
Note:
It is also acceptable (and possibly required) that only a specific time duration of the stream be used as input for the submitted job.
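Regarding that last note, a hedged sketch of restricting processing to a fixed time window of the stream (the durations are illustrative), building on the snippet above:
import org.apache.spark.streaming.Seconds

// Only look at the last 5 minutes of events, re-evaluated every 5 minutes.
val windowedStream = eventStream.window(Seconds(300), Seconds(300))

windowedStream.foreachRDD { rdd =>
  // same per-RDD processing as in the snippet above
}

// The job can later be terminated at will:
// streamingContext.stop(stopSparkContext = false, stopGracefully = true)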