Cannot change global variables within Spark foreach function body [duplicate] - scala

This question already has answers here:
Understand closure in spark
(1 answer)
Why I cannot update an array in cluster mode but could in pseudo-distributed
(2 answers)
Closed 4 years ago.
I need to filter through a very large text file (500 GB) and export the results with Spark.
The problem here is that it will be too big to fit everything in memory, so I have to use a global buffered writer, which gets flushed about every 500000 records.
Intuitively, I used RDD's foreach function and passed a function to it, like (in pyspark):
# global definitions
def fn(x):
    global counter, writer
    # some logic to generate data
    if counter == 500000:
        writer.write(data)
        writer.flush_buf()
        counter = 1
    else:
        counter = counter + 1
txt_file.foreach(fn)
# write final data
However, here somehow the variable counter is not being updated.
I initially thought it might have something to do with the Python-Scala bridge, so I rewrote some test code in Scala:
var value = 10
def fn(x: Int) = {
  value = 20
}
sc.parallelize(Seq(1, 2, 3, 4, 5), 5).foreach(fn)
println(value)
To my surprise, I still get value = 10. Why am I getting this strange behavior, and what is the correct way of doing this?
Thanks!
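(Context from the linked duplicates: Spark serializes the closure and ships a separate copy of counter / value to each executor, so writes made there never reach the driver. Below is a minimal Scala sketch of the supported mechanism for this kind of driver-visible counter, a LongAccumulator; it is not code from the question.)
// Accumulator updates made on the executors are merged back into the
// driver-side value, unlike writes to a closed-over var.
val counter = sc.longAccumulator("records seen")
sc.parallelize(Seq(1, 2, 3, 4, 5), 5).foreach { _ =>
  counter.add(1) // runs on an executor
}
println(counter.value) // 5, read on the driver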

Related

Scala - divide a dataset into a dataset of arrays of a fixed size

I have a function whose purpose is to divide a dataset into arrays of a given size.
For example, given a dataset with 123 objects of type Foo and arraysSize = 10, the result should be a Dataset[Array[Foo]] with 12 arrays of 10 Foos and 1 array of 3 Foos.
Right now the function works on collected data. I would like to change it to work on the Dataset directly for performance reasons, but I don't know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
                           arraysSize: Int): Dataset[Array[Foo]] = {
  data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for this transformation is that the data will be sent as events. Instead of sending 1 million events with information about 1 object each, I prefer to send, for example, 10 thousand events with information about 100 objects each.
IMO, this is a weird use case. I cannot think of any efficient way to do this, as it is going to require a lot of shuffling no matter how we do it.
But, the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind:
- What is the value of data.count()?
- What is the size of a single Foo?
- What is the value of arraysSize?
- What is your executor configuration?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50

private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] = {
  val size = data.count()
  val numArrays = (size.toDouble / arraysSize).ceil
  val numPartitions = (numArrays / desiredArraysPerPartition).ceil.toInt

  data
    .repartition(numPartitions)
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
}
After reading the edited part, I think the exact batch size of 100 in "10 thousand events with information about 100 objects" is not really important, since it is described as about 100. There can be more than one event with fewer than 100 Foos.
If we are not very strict about that size of 100, then there is no need to reshuffle.
We can locally group the Foos present in each partition. As this grouping is done locally and not globally, it might result in more than one array (potentially one per partition) with fewer than 100 Foos.
private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] =
  data
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
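For completeness, a minimal usage sketch of the functions above; the Foo case class and the sample data are assumptions for illustration, not part of the question:
// Assumes spark.implicits._ is in scope so encoders for Foo and Array[Foo] resolve.
case class Foo(id: Int, payload: String)

import spark.implicits._
val foos: Dataset[Foo] = (1 to 123).map(i => Foo(i, s"payload-$i")).toDS()

val batched: Dataset[Array[Foo]] = mapToFooArrays(foos, arraysSize = 10)
batched.map(_.length).show() // batch sizes; edge batches may hold fewer than 10 Foos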

Randomising number of repeats for different users in Gatling

I'm currently trying to write a scenario in Gatling where I would like an action to be repeated between 1 and 8 times. The randomness should be on a per user basis, so for example one user may get 3 repeats and another gets 7.
I'm wanting the scenario to work like this to simulate the fact that I don't know for certain how many times a user will repeat an action.
I tried the following:
class MySimulation extends Simulation {
  private val myScenario = scenario("Scenario")
    .repeat(Random.nextInt(8) + 1) {
      // some stuff
    }

  setUp(myScenario.inject(rampUsers(100) during (60 seconds)))
}
However, what this ends up doing is evaluating to one random number when the simulation is built, and then using that for every single user. So if the random number generator returns 5, each user ends up repeating 5 times, which is not what I want.
Is there a way in Gatling so that each user gets a different random number for the repeat function? Or will it only work with constant numbers?
The approach you attempted doesn't work because the scenario you define is a builder that is executed once at startup, so Random.nextInt is only called once.
But there are a few ways you could achieve what you want.
The easiest (since you just want a random number) would be to use the Gatling EL to take a random element of a sequence.
First, define a Scala val with the range of numbers you want:
private val times = 1 to 8
Then put your range into the session and use the EL to get a random value from the collection:
.exec(_.set("times", times))
.repeat("${times.random()}") {
// some stuff
}
Alternatively, you could define a custom feeder. This approach also lets you do things like random strings.
private val times = Iterator.continually(Map("times" -> (Random.nextInt(8) + 1)))
Then just feed and use the "times" value:
.feed(times)
.repeat("${times}") {
// some stuff
}
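To show how the feeder-based version slots back into the original simulation, here is a minimal end-to-end sketch; the HTTP protocol, base URL and request are placeholders I have assumed, not part of the question:
import scala.concurrent.duration._
import scala.util.Random

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class MySimulation extends Simulation {

  // one fresh random repeat count per virtual user
  private val times = Iterator.continually(Map("times" -> (Random.nextInt(8) + 1)))

  private val httpProtocol = http.baseUrl("http://localhost:8080") // assumed endpoint

  private val myScenario = scenario("Scenario")
    .feed(times)
    .repeat("${times}") {
      exec(http("some action").get("/action")) // placeholder request
    }

  setUp(myScenario.inject(rampUsers(100) during (60.seconds)))
    .protocols(httpProtocol)
}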

Why is the local variable value not visible after iterating an RDD? [duplicate]

This question already has an answer here:
Spark : Difference between accumulator and local variable
(1 answer)
Closed 3 years ago.
Hi, I am writing Scala code for Apache Spark.
The value of my local variable "country" is not reflected after the RDD iteration is done.
I assign a value to the country variable after checking a condition inside the RDD iteration. While the RDD is being iterated the value is available in the country variable, but after control comes out of the loop the value is lost.
import org.apache.spark.sql.SparkSession
import java.lang.Long

object KPI1 {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\shivam docs\\hadoop-2.6.5.tar\\hadoop-2.6.5");
    val spark = SparkSession.builder().appName("KPI1").master("local").getOrCreate();
    val textFile = spark.read.textFile("C:\\shivam docs\\HADOOP\\sample data\\wbi.txt").rdd;

    val splitData = textFile.map {
      line => {
        val token = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        (token(0), token(10).replace("\"", "").replace(",", ""));
      }
    };

    // splitData.max()._2;
    var maxele = 0L;
    var index = 0;
    var country = "";

    splitData.foreach(println);

    for (ele <- splitData) {
      val data = Long.parseLong(ele._2);
      if (maxele < data) {
        maxele = data;
        println(maxele);
        country = ele._1;
        println(country);
      }
    };

    println("***************************** " + country + maxele);
    spark.close()
  }
}
The country variable should not still be holding its default (empty) value.
Both for and foreach here execute on the executors, not on the driver. That means the execution happens on more than one executor, each with its own copy of the variables, which is why you end up seeing the default values. I'm running the sample code below on a single-node cluster with 4 executors, and you can see that the execution happened on two different threads (the thread ids make this evident).
Sample
val baseRdd = spark.sparkContext.parallelize(Seq((1, 2), (3, 4)))

for (h <- baseRdd) {
  println("Thread id " + Thread.currentThread().getId)
  println("Value " + h)
}
Output
Thread id 48
Value (1,2)
Thread id 50
Value (3,4)
If you still want to get your expected result, follow either of the options below.
1. Change your Spark configuration to master("local[1]"). This will run your job with a single executor.
2. collect() your splitData before you perform for (ele <- splitData) {...} (see the sketch below).
Note: both options are strictly for testing or experimental purposes only and will not work against large datasets.
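A quick sketch of option 2, reusing the variable names from the question:
// collect() pulls splitData to the driver, so the loop below mutates
// driver-local variables; only viable when the data fits in driver memory.
for (ele <- splitData.collect()) {
  val data = java.lang.Long.parseLong(ele._2)
  if (maxele < data) {
    maxele = data
    country = ele._1
  }
}
println("***************************** " + country + maxele)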
When you use variables within executors, Spark (on YARN, Mesos, etc.) creates a new copy of them per executor. This is why you don't see any update to your variables: the updates occur only on the executors, and none of them is sent back to the driver. If you want to accomplish this, you should use accumulators:
Both 'maxele' and 'country' should be accumulators.
You can read about accumulators in the Spark programming guide.
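As a side note, and not what either answer proposes, this particular "max by value" aggregation can also be computed without any driver-side mutable state, for example with a distributed reduce. A minimal sketch reusing splitData from the question:
// Compute (country, maxele) in one reduce over the (country, numericString) pairs.
val (country, maxele) = splitData
  .map { case (name, value) => (name, java.lang.Long.parseLong(value)) }
  .reduce((a, b) => if (a._2 >= b._2) a else b)
println("***************************** " + country + maxele)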

Spark flushing Dataframe on show / count

I am trying to print the count of a DataFrame, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)
def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in recent versions (it is possible to hint to Spark that a function shouldn't be re-executed), it won't help you here.
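For reference, the escape hatch mentioned above is, as far as I know, asNondeterministic() on the UserDefinedFunction (available since Spark 2.3); it only stops the optimizer from re-executing the UDF across duplicated plan branches and does nothing to make the shared change_cnt counter safe:
// Sketch only; assumes Spark 2.3 or later.
val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF).asNondeterministic()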
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go: Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on that object.
Unfortunately both components are simply not salvageable. You will have to go back to the planning phase and rethink your design.
Your DataFrame is a distributed dataset, and trying to do a count() returns unpredictable results since the count() can be evaluated differently on each node. Read the documentation about RDDs linked below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd

How do I sort a list of Doubles in Scala? [duplicate]

This question already has answers here:
How do I sort an array in Scala?
(7 answers)
Closed 8 years ago.
How do I sort a simple list of Doubles in Scala?
var dubs = List(1.3,4.5,2.3,3.2)
I think my question may not have accurately reflected my specific problem, since I realize now that dubs.sorted will work just fine for the above. My problem is as follows: I have a string of doubles, "2.3 32.4 54.2 1.33", that I'm parsing and adding to a list:
var numsAsStrings = l.split("\\s");
var x = List(Double);
var i = 0;
for (i <- 0 until numsAsStrings.length) {
  x :+ numsAsStrings(i).toDouble;
}
So, I would think that I could just call x.sorted on the above, but that doesn't work... I've been looking over the sortBy, sorted, and sortWith documentation and various posts, but I thought the solution should be simpler. I think I'm missing something basic, regardless.
Use the sorted method
dubs.sorted // List(1.3, 2.3, 3.2, 4.5)
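For the edited version of the question, a sketch of parsing and sorting in one pipeline; note that the loop in the question builds nothing because x :+ ... returns a new list that is immediately discarded:
val l = "2.3 32.4 54.2 1.33" // sample input from the question
val sortedDoubles: List[Double] = l.split("\\s+").map(_.toDouble).toList.sorted
// List(1.33, 2.3, 32.4, 54.2)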