Why the local variable value is not visible after iterating RDD? [duplicate] - scala

This question already has an answer here:
Spark : Difference between accumulator and local variable
(1 answer)
Closed 3 years ago.
Hi, I am writing Scala code for Apache Spark.
The value of my local variable "country" is not reflected after the RDD iteration is done.
I assign a value to the country variable after checking a condition inside the RDD iteration. While the RDD is iterating the value is available in country, but once control comes out of the loop the value is lost.
import org.apache.spark.sql.SparkSession
import java.lang.Long

object KPI1 {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\shivam docs\\hadoop-2.6.5.tar\\hadoop-2.6.5");
    val spark = SparkSession.builder().appName("KPI1").master("local").getOrCreate();
    val textFile = spark.read.textFile("C:\\shivam docs\\HADOOP\\sample data\\wbi.txt").rdd;

    val splitData = textFile.map { line =>
      val token = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
      (token(0), token(10).replace("\"", "").replace(",", ""));
    };

    // splitData.max()._2;
    var maxele = 0L;
    var index = 0;
    var country = "";

    splitData.foreach(println);

    for (ele <- splitData) {
      val data = Long.parseLong(ele._2);
      if (maxele < data) {
        maxele = data;
        println(maxele);
        country = ele._1;
        println(country);
      }
    };

    println("***************************** " + country + maxele);
    spark.close()
  }
}
After the loop, the country variable should not still hold its default value.

Both the for comprehension and foreach are distributed operations: the loop body runs on the executors, potentially on more than one of them, not on the driver, and that's why you are left with the default value. I'm running my sample code on a single-node cluster with 4 executors, and you can see that the execution happened on two different executors (the thread ids make this evident).
Sample
val baseRdd = spark.sparkContext.parallelize(Seq((1, 2), (3, 4)))
for (h <- baseRdd) {
  println("Thread id " + Thread.currentThread().getId)
  println("Value " + h)
}
Output
Thread id 48
Value (1,2)
Thread id 50
Value (3,4)
If you still want your expected result, follow either of the options below.
1. Change your Spark configuration to master("local[1]"). This will run your job with a single executor.
2. collect() your splitData before you perform for (ele <- splitData) {...} (a sketch of this follows the note below).
Note: both options are strictly for testing or experimental purposes and will not work against large datasets.
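Option 2 sketched (my illustration, reusing the question's splitData, maxele and country; test data only, since collect() pulls the whole RDD onto the driver):
// Iterate over a driver-local copy, so the driver-side vars really are updated.
for (ele <- splitData.collect()) {
  val data = java.lang.Long.parseLong(ele._2)
  if (maxele < data) {
    maxele = data
    country = ele._1
  }
}
println("***************************** " + country + maxele)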

When you're using variables within Executors, Spark (on YARN/Mesos etc.) creates a new instance of them per Executor. This is why you don't see any update to your variable (the updates occur only on the Executors; none is sent back to the Driver). If you want to accomplish this, you should use Accumulators:
Both 'maxele' & 'country' should be Accumulators.
You can read about accumulators in the Spark documentation: https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
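A rough sketch of that suggestion (my illustration, not the answerer's code): the built-in accumulators only add, so tracking the country with the maximum value needs a small custom AccumulatorV2. The class name MaxByValueAcc is made up for the example; splitData is the (country, value) RDD from the question.
import org.apache.spark.util.AccumulatorV2

// Accumulator that keeps the (country, value) pair with the largest value.
class MaxByValueAcc extends AccumulatorV2[(String, Long), (String, Long)] {
  private var current: (String, Long) = ("", Long.MinValue)
  override def isZero: Boolean = current._2 == Long.MinValue
  override def copy(): MaxByValueAcc = { val acc = new MaxByValueAcc; acc.current = current; acc }
  override def reset(): Unit = { current = ("", Long.MinValue) }
  override def add(v: (String, Long)): Unit = if (v._2 > current._2) current = v
  override def merge(other: AccumulatorV2[(String, Long), (String, Long)]): Unit = add(other.value)
  override def value: (String, Long) = current
}

val maxAcc = new MaxByValueAcc
spark.sparkContext.register(maxAcc, "maxByValue")
// Updates happen on the executors; the merged value is read on the driver afterwards.
splitData.foreach { case (country, raw) => maxAcc.add((country, java.lang.Long.parseLong(raw))) }
println("***************************** " + maxAcc.value)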

Related

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and ideally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint to Spark that a function shouldn't be re-executed), that won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go; Spark doesn't provide global mutable state.
Overall, your code:
Modifies some local copy of some object.
Makes a decision based on such an object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
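As an illustration of the "pure function" point only (a sketch under my own assumptions, not code from the question; pureUDF and funcAPure are names made up for it): drop the external counter so the UDF is pure, and cache the limited result if repeated actions must see the same rows.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Pure UDF: no external mutable state, same input always gives the same output.
val pureUDF = udf { sval: java.math.BigDecimal =>
  Option(sval).map(_.multiply(new java.math.BigDecimal("1000")))
}

def funcAPure(inputDataframe: DataFrame): DataFrame = {
  inputDataframe
    .withColumn("colA", pureUDF(col("colA")).cast("decimal(38,10)"))
    .limit(10)
    .cache()  // materialize once so count(), show() and downstream consumers agree
}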
Your DataFrame is a distributed dataset, and trying to do a count() here returns unpredictable results, since the count() can come out differently on each node. Read the documentation about RDDs below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

Cannot change global variables within Spark foreach function body [duplicate]

This question already has answers here:
Understand closure in spark
(1 answer)
Why I cannot update an array in cluster mode but could in pseudo-distributed
(2 answers)
Closed 4 years ago.
I need to filter through a very large text file (500 GB) and export the results with Spark.
The problem here is that it is too big to fit everything in memory, so I have to use a global buffered writer, which gets flushed about every 500000 records.
Intuitively, I used the RDD's foreach function and passed a function to it, like this (in pyspark):
# global definitions
def fn(x):
    global counter, writer
    # some logic to generate data
    if counter == 500000:
        writer.write(data)
        writer.flush_buf()
        counter = 1
    else:
        counter = counter + 1

txt_file.foreach(fn)
# write final data
However, here somehow the variable counter is not being updated.
I initially thought it might have something to do with the Python/Scala bridge, so I rewrote some test code in Scala:
var value = 10
def fn(x: Int) = {
  value = 20
}
sc.parallelize(Seq(1, 2, 3, 4, 5), 5).foreach(fn)
println(value)
To my surprise, I still get value=10. I wonder why I'm getting this strange behavior here and what's the correct way of doing it?
Thanks!
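For reference, a brief Scala sketch of the pattern the duplicate targets describe (my own illustration, not an accepted answer; the temp-file sink and the names are made up): driver-side variables are only copied into the closure, so use an accumulator when you just need a counter, and a per-partition writer via foreachPartition instead of one global buffered writer.
import java.io.{File, PrintWriter}

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 5)

// Counter case: accumulator updates on the executors are merged back to the driver.
val updates = sc.longAccumulator("updates")
rdd.foreach(_ => updates.add(1))
println(updates.value)  // 5

// Writer case: open, use and close one writer per partition, on the executor.
rdd.foreachPartition { iter =>
  val writer = new PrintWriter(File.createTempFile("part-", ".txt"))
  try iter.foreach(x => writer.println(x))
  finally writer.close()
}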

spark streaming - use previous calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs an SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to take the previous calculation (from an earlier batch) into account in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
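As a rough illustration of holding on to the previous batch (my sketch, not part of either answer; it reuses the question's spark, dstream and S3 path): the body of foreachRDD runs on the driver, so a driver-side reference to a cached DataFrame survives across batches.
import org.apache.spark.sql.DataFrame

var prevDF: Option[DataFrame] = None  // driver-side handle to the previous batch's data

dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val result = prevDF match {
    case Some(prev) => df.join(prev, "id").select("col").filter(prev("value").equalTo(1))
    case None       => df.select("col").filter("value = 1")
  }
  result.write.csv("s3://..")
  prevDF.foreach(_.unpersist())  // release the old batch
  prevDF = Some(df.cache())      // keep the current batch for the next iteration
}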
If you are looking only for one or two previously calculated dataframes, you should look into Spark Streaming windows.
The snippet below is from the Spark documentation.
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to generate a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

How to use existing trained model using LinearRegressionModel to work with SparkStreaming and predict data with it [duplicate]

I have the following code:
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
for (value <- data.getValues()) {
if (record.getEnum() == DataEnum.BLUE) {
blueCount += 1
println("Enum = BLUE : " + value.toString()
}
}
data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
Then the blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!
This is a conceptual question...
Imagine you have a big cluster composed of many workers, say n of them, each storing a partition of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:
Where will that data be printed out?
What node has priority and what partition?
If all nodes are running in parallel, who will be printed first?
How will be this print queue created?
Those are too many questions, so the designers/maintainers of Apache Spark logically decided to drop any support for print statements inside any map/reduce operation (this includes accumulators and even broadcast variables).
This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
To prove this, you can run the following Scala code, for example:
// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// Nothing appears on the driver console: map is lazy, and any println inside it
// runs on the executors once an action triggers the computation.
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
I was able to work around it by making a utility function:
object PrintUtility {
  def print(data: String) = {
    println(data)
  }
}
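A small follow-up sketch (my illustration, reusing the question's input, data.getValues() and DataEnum, with the newer longAccumulator API): executor-side println goes to the executors' stdout, visible in the executor logs / Spark UI, while the accumulator value is only reliable on the driver once an action has completed.
val blueCount = sc.longAccumulator("blueCount")
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount.add(1)
      // Goes to the executor's stdout, not the driver console.
      println("Enum = BLUE : " + value.toString)
    }
  }
  data
}
output.saveAsTextFile("myOutput")            // the action that actually runs the map
println("BLUE records: " + blueCount.value)  // read on the driver after the action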

Understanding closures and parallelism in Spark

I am trying to understand how certain things work in Spark. In the example shown at http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
it says that the code will sum the values within the RDD and store the result in counter, which is not the case here because it doesn't work. It only works if you remove parallelize.
Can someone explain to me how this works? Or is the example wrong?
Thanks
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
Nick, the example and its explanation provided above are absolutely correct; let me explain it in more depth.
Suppose we are working on a single node, with one worker and one executor, and we use foreach over an RDD to count the number of elements in the RDD. Since we are on a single node, the data is not distributed and remains a single identity, so the count variable (a closure variable; these kinds of variables are known as closures) is updated for every element, and that update is sent to the executor every time an increment occurs; the executor then submits the closure back to the driver node.
Driver node -> both the executor and the driver reside on the same node, so the driver's count variable is in the scope of the executor, and therefore the executor updates the driver's count variable.
And we are given the resulting count value from the driver node, not from the executor node.
Executor -> closure -> data
Now suppose we are working in a clustered environment, say 2 nodes with 2 workers and executors. Now the data is split into several parts, hence:
Data -> Data_1, Data_2
Driver node -> resides on a different node and has its own count variable, which is not visible to Executor 1 and Executor 2 because they reside on different nodes; hence executor1 and executor2 can't update the count variable at the driver node.
Executor1 -> processing(Data_1) with closure_1
Executor2 -> processing(Data_2) with closure_2
Closure 1 will update executor 1 because it is serialized to executor 1, and similarly closure 2 will update executor 2.
To tackle such a situation we use an Accumulator, like this:
val counter=sc.accumulator(0)
The variable counter is serialized along with the task and then sent to all the partitions. For each partition it becomes a new local variable, and updates happen within that partition alone. See the example below to understand this better.
scala> var counter = 0
scala> var rdd = sc.parallelize(Range(1,10))
scala> import org.apache.spark.{SparkEnv, TaskContext}
scala> rdd.foreach(x => {val randNum = TaskContext.getPartitionId();
println("partitionID:"+randNum);counter += x; println(x,counter)})
partitionID:2
(5,5)
partitionID:2
(6,11)
partitionID:3
(7,7)
partitionID:3
(8,15)
partitionID:3
(9,24)
partitionID:0
(1,1)
partitionID:0
(2,3)
partitionID:1
(3,3)
partitionID:1
(4,7)
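For completeness, a minimal sketch (mine, using the newer longAccumulator API rather than sc.accumulator) of the accumulator-based fix for the original example: updates made on the executors are merged and read back on the driver once the action finishes.
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Accumulator instead of a driver-side var.
val counter = sc.longAccumulator("counter")
rdd.foreach(x => counter.add(x))
println("Counter value: " + counter.value)  // 15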