var countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")
sc.parallelize(names).flatMap(countryMap.get).collect.foreach(println)
//output
Canada
US
I'm running this Spark job in YARN mode, and I'm sure that the driver and executors are not in the same node/JVM (see the attached pic). Since countryMap is not a broadcast variable, the executor should not be able to see it, and this code shouldn't print anything. However, it printed Canada and US.
My question is: does Spark populate local variables to executors automatically if they are serializable? If not, how does the executor see the driver's local variables?
Hey Edwards,
when you invoke collect, the result set is brought back to the driver, which then tries to perform the mapping. That is why you can see the mappings being generated.
Cheers,
Local variables: exist in the driver and in each executor; no serialization is required, and they are shared within that executor/driver.
Member variables: the driver and each task get their own (isolated) copy; serialization is required.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
Reference: https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#broadcast-variables
Basically, local variables in the driver are automatically broadcast to the executors within each stage. However, you need to create broadcast variables explicitly when you need the same data across different stages, as in the sketch below.
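A minimal sketch of the explicit variant, assuming the same sc, countryMap, and names as above:

// Explicitly broadcast the map once; each executor caches a single deserialized
// copy instead of receiving the map inside every task's closure.
val broadcastMap = sc.broadcast(countryMap)

sc.parallelize(names)
  .flatMap(name => broadcastMap.value.get(name))   // look up against the cached copy
  .collect()
  .foreach(println)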
To have the countryMap.get function running on a cluster, Spark needs to serialize countryMap and send it to every executor, so you have a function with its data already attached to it in the form of an object instance. If you made countryMap's class unserializable, you would not be able to run this code at all.
So, Spark doesn't populate local variables to executors; rather, you explicitly tell it to serialize the countryMap object and run a method of that object in a distributed way.
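A minimal sketch of the failure case described above; the wrapper class Lookup is hypothetical:

// A holder class that does NOT implement java.io.Serializable.
class Lookup(val map: Map[String, String])

val lookup = new Lookup(Map("Amy" -> "Canada", "Sam" -> "US"))

// The closure captures `lookup`, which cannot be serialized, so Spark fails with
// org.apache.spark.SparkException: Task not serializable before any task runs.
sc.parallelize(List("Amy", "Sam", "Eric")).flatMap(name => lookup.map.get(name)).collect()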
I'm working on a project for which I need to collect all the Spark config.
The problem is that if a parameter is not explicitly set, I would need the default value. Is there a way to get all the config, including all defaults?
I tried with:
sc.getConf.getAll, but in this way, I do not get the defaults.
SparkListener, in order to get some specific information (for example, the number of executors), but I couldn't find a way to get other needed information such as the number of cores per executor, memory per executor, etc.
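For reference, a minimal sketch of the first attempt, assuming a live SparkContext sc; getAll only returns explicitly set entries, not the built-in defaults:

// Prints only the properties that were explicitly set (spark-defaults.conf,
// --conf flags, or programmatic set calls); unset defaults do not show up here.
sc.getConf.getAll.sortBy(_._1).foreach { case (key, value) => println(s"$key=$value") }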
Is there a way to capture the console output of a Spark job in Oozie? I want to use a specific printed value in the next action node after the Spark job.
I was thinking I could maybe use ${wf:actionData("action-id")["Variable"]}, but it seems that Oozie cannot capture output from a Spark action node, unlike the Shell action, where you can just echo "var=12345" and then invoke wf:actionData to use it as an Oozie variable across the workflow.
I want to achieve this because I want to print the number of records processed, store that as an Oozie variable, and use it in the next action nodes in the workflow, without having to store the data outside of the workflow (for example, saving it to a table or writing it as a system variable from inside the Spark Scala program).
Any help would be thoroughly appreciated since I'm still a novice spark programmer. Thank you very much.
As the Spark action does not support capture-output, you'll have to write the data to a file on HDFS.
This post explains how to do that from Spark.
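A minimal sketch of that approach, assuming a hypothetical processedRdd and output path; a later Oozie action can then read the file back:

import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

val recordCount = processedRdd.count()                                // hypothetical RDD produced by the job
val fs  = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("/user/oozie/output/record_count.txt"))  // hypothetical HDFS path
out.write(s"recordCount=$recordCount".getBytes(StandardCharsets.UTF_8))
out.close()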
How to run a talend job in multiple instances at the same time with different context group?
You can't run the same Talend Job in multiple context groups as a group is a collection of context variables that are assigned to a Job.
You can run multiple instances of the same Job in different contexts by passing the context at runtime.
e.g. --context=Prod
This assumes that you have considered all other conflicts that may occur, for example, directories and files that the job may use.
I would suggest, if you have not already done so, externalising your context values so that, when you pass your context at runtime, the values are loaded dynamically and you can have different values for different contexts.
Once your job is built as a jar, you can run multiple instances at the same time.
Can both RAM and JDBC job stores be used together? I wasn't able to find an answer to this in the official documentation.
No, they cannot be used together.
Usually a scheduler is obtained from the StdSchedulerFactory.
This factory uses a java.util.Properties object to determine the settings of the scheduler.
I.e., you could pass RAMJobStore as well as JDBCJobStore as the value of the key org.quartz.jobStore.class, but in that case the key would simply be overwritten and only the last value would be visible.
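A tiny sketch of that overwrite behaviour with plain java.util.Properties (no scheduler is actually started; the class names are the standard Quartz job store implementations):

import java.util.Properties

// The job store is configured through a single property key, so setting it
// twice simply overwrites the first value; the two stores cannot coexist.
val props = new Properties()
props.setProperty("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore")
props.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX")
println(props.getProperty("org.quartz.jobStore.class"))  // prints org.quartz.impl.jdbcjobstore.JobStoreTX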
Say I have 5 jobs that want to access a single method that reads this big file and puts it into an RDD. Instead of reading the file multiple times (because all 5 jobs will call the same method), there is a "mother" class that checks whether a job has already called the method.
Assuming these 5 jobs are executed in sequence, you can read the file and cache it with <RDD>.cache() in the first job; the rest of the jobs can then check whether the file already exists in the cache and just use it, otherwise read it again.
For more info, refer to the RDD API.
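A minimal sketch of that idea, assuming a shared SparkContext sc and a hypothetical file path:

// Read the big file once and mark it for caching; the data is materialised
// in memory the first time an action runs on it.
val bigFileRdd = sc.textFile("hdfs:///data/big_file.txt")   // hypothetical path
bigFileRdd.cache()

// The first job triggers the read and populates the cache.
val total = bigFileRdd.count()

// Later jobs reuse the cached partitions instead of re-reading the file,
// as long as they go through the same RDD reference.
val errorLines = bigFileRdd.filter(_.contains("ERROR")).count()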