Understanding closures and parallelism in Spark - scala

I am trying to understand how certain things work in Spark. The example shown at http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
says that the code will sum the values within the RDD and store the result in counter, but that is not what happens here: the sum never ends up in counter. It only works if you remove parallelize.
Can someone explain to me how this works? Or is the example wrong?
Thanks
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)

Nick, the example and its explanation in the docs are correct. Let me explain in more depth ->
Suppose we are working on a single node, with a single worker and executor, and we use foreach over an RDD to sum its elements. On a single node the data is not distributed, and in some circumstances the foreach function runs in the same JVM as the driver. The closure (the function passed to foreach together with the variables it references, here counter) then refers to the driver's original counter, so each increment may update the driver's variable directly.
Driver node -> The executor and the driver reside on the same node, so the driver's counter can be in scope for the executor's tasks, which may then update the driver's counter value.
The value that gets printed always comes from the driver, not from the executor.
Executor -> closure -> data
Now suppose we are working in a clustered environment, say 2 nodes with 2 workers and executors. The data will be split into several parts, and hence ->
Data -> Data_1, Data_2
Driver node -> runs on a different node and has its own counter variable, which is not visible to Executor 1 or Executor 2 because they reside on different nodes. Executor 1 and Executor 2 therefore can't update the counter variable on the driver node.
Executor1 -> processing(Data_1) with closure_1
Executor2 -> processing(Data_2) with closure_2
closure_1 is serialized and sent to Executor 1, so it updates only the copy of counter on Executor 1; similarly, closure_2 updates only the copy on Executor 2. The driver's counter is never touched.
To handle this situation we use an Accumulator, like this:
val counter=sc.accumulator(0)
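A minimal sketch of the same sum written with an accumulator, assuming Spark 2.0 or later, where sc.longAccumulator is available. The app name and the local[2] master are placeholders chosen for illustration, not part of the original example:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session for illustration; in spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().appName("AccumulatorSketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Tasks can only add to the accumulator; Spark merges the per-task updates on the driver
val counter = sc.longAccumulator("counter")
rdd.foreach(x => counter.add(x))

// The merged value is read on the driver, after the action has finished
println("Counter value: " + counter.value)
```

Because foreach is an action, each task's update is applied once, and the driver sees the full sum regardless of how the data was partitioned.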

The variable counter is serialized along with each task and sent to the executors. Every partition gets its own new local copy, and updates are applied within that partition alone. See the example below to understand more.
scala> var counter = 0
scala> var rdd = sc.parallelize(Range(1,10))
scala> import org.apache.spark.{SparkEnv, TaskContext}
scala> rdd.foreach(x => {val randNum = TaskContext.getPartitionId();
println("partitionID:"+randNum);counter += x; println(x,counter)})
partitionID:2
(5,5)
partitionID:2
(6,11)
partitionID:3
(7,7)
partitionID:3
(8,15)
partitionID:3
(9,24)
partitionID:0
(1,1)
partitionID:0
(2,3)
partitionID:1
(3,3)
partitionID:1
(4,7)

What kind of variable to select for incrementing node labels in a community detection algorithm

I am working on a community detection algorithm that propagates labels to nodes. I have a problem choosing the right type for the Label_counter variable.
There is an algorithm called LPA (label propagation algorithm) that propagates labels to nodes over a number of iterations. Think of a label as a node property. In LPA the initial label of each node is its node id, and in each iteration a node updates its label to the most frequent label among its neighbors. The algorithm I am working on is similar to LPA. At first every node has the initial label 0, and then nodes get new labels. As nodes update and get new labels, Label_counter should be incremented by one under certain conditions, and its value is then used as the label for other nodes (label = 1, label = 2, and so on). For example, the Zachary karate club dataset has 34 nodes and 2 communities.
The initial state looks like this:
(1,0)
(2,0)
.
.
.
(34,0)
The first number is the node id and the second one is the label.
As nodes get new labels, Label_counter increments; in the next iterations other nodes get new labels and Label_counter increments again.
(1,1)
(2,1)
(3,1)
.
.
.
(33,3)
(34,3)
Nodes with the same label belong to the same community.
The problem I have is this:
The nodes in the RDD and the variables are distributed across machines (each machine has its own copy of the variables), and executors have no connection with each other. If one executor updates Label_counter, the other executors are not informed of the new value, and nodes may get wrong labels. Is it correct to use an Accumulator as the label counter in this case, since Accumulators are shared variables across machines, or are there other ways of handling this problem?
In Spark it is always complicated to compute index-like values, because they depend on things that are not available in every partition. I can propose the following idea:
1. Compute the number of times the condition is met per partition.
2. Compute the cumulative increment per partition, so that we know the initial increment of each partition.
3. Increment the values of each partition based on that initial increment.
Here is what the code could look like. Let me start by setting up a few things.
// Let's define some condition
def condition(node : Long) = node % 10 == 1
// step 0, generate the data
val rdd = spark.range(34)
  .select('id+1).repartition(10).rdd
  .map(r => (r.getAs[Long](0), 0))
  .sortBy(_._1).cache()
rdd.collect
Array[(Long, Int)] = Array((1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0),
(9,0), (10,0), (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0),
(19,0), (20,0), (21,0), (22,0), (23,0), (24,0), (25,0), (26,0), (27,0), (28,0),
(29,0), (30,0), (31,0), (32,0), (33,0), (34,0))
Then the core of the solution:
// step 1 and 2
val partIncrInit = rdd
  // to each partition, we associate the number of times we need to increment
  .mapPartitionsWithIndex{ case (i,p) =>
    Iterator(i -> p.map(_._1).count(condition))
  }
  .collect.sorted // sort by partition index
  .map(_._2) // we don't need the index anymore
  .scanLeft(0)(_+_) // cumulated sum
// step 3, we increment each partition based on this initial increment.
val result = rdd
  .mapPartitionsWithIndex{ case (i, p) =>
    var incr = 0
    p.map{ case (node, value) =>
      if(condition(node))
        incr+=1
      (node, partIncrInit(i) + value + incr)
    }
  }
result.collect
Array[(Long, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1),
(9,1), (10,1), (11,2), (12,2), (13,2), (14,2), (15,2), (16,2), (17,2), (18,2),
(19,2), (20,2), (21,3), (22,3), (23,3), (24,3), (25,3), (26,3), (27,3), (28,3),
(29,3), (30,3), (31,4), (32,4), (33,4), (34,4))

Why is the local variable value not visible after iterating an RDD? [duplicate]

This question already has an answer here:
Spark : Difference between accumulator and local variable
Hi, I am writing code in Scala for Apache Spark.
The value of my local variable "country" is not retained after the RDD iteration is done.
I assign a value to country inside the RDD iteration after checking a condition. While the RDD is being iterated the value is available in country, but once control leaves the loop the value is lost.
import org.apache.spark.sql.SparkSession
import java.lang.Long
object KPI1 {
  def main(args:Array[String]){
    System.setProperty("hadoop.home.dir","C:\\shivam docs\\hadoop-2.6.5.tar\\hadoop-2.6.5");
    val spark=SparkSession.builder().appName("KPI1").master("local").getOrCreate();
    val textFile=spark.read.textFile("C:\\shivam docs\\HADOOP\\sample data\\wbi.txt").rdd;
    val splitData=textFile.map{
      line=>{
        val token=line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        (token(0),token(10).replace("\"","").replace(",", ""));
      }
    };
    // splitData.max()._2;
    var maxele=0l;
    var index=0;
    var country="";
    splitData.foreach(println);
    for(ele<-splitData){
      val data=Long.parseLong(ele._2);
      if(maxele<data){
        maxele=data;
        println(maxele);
        country=ele._1;
        println(country);
      }
    };
    println("***************************** "+country+maxele);
    spark.close()
  }
}
The country variable should not still hold its default value at the end.
Both for and foreach run as a distributed action: a for loop over an RDD is just syntactic sugar for foreach. That means the body executes in tasks on the executors, possibly more than one, each working on its own copy of the variables, and that's why you still get the default value. I ran my sample code on a single-node cluster with 4 executors, and you can see that the execution happened in two different executor threads (the thread ids make this evident).
Sample
val baseRdd = spark.sparkContext.parallelize(Seq((1, 2), (3, 4)))
for (h <- baseRdd) {
println( "Thread id " + Thread.currentThread().getId)
println("Value "+ h)
}
Output
Thread id 48
Value (1,2)
Thread id 50
Value (3,4)
If you still want your expected result, use either of the options below:
1. Change your Spark configuration to master("local[1]"). This runs your job with a single executor thread. Whether the driver's variable is then updated still depends on the closure not being copied, so treat this as an experiment only.
2. collect() your splitData before you perform for(ele<-splitData){...}
Note: both options are strictly for testing or experimental purposes and will not work against large datasets.
When you use variables inside code that runs on Executors, Spark creates a new copy of them for each Executor, whatever the cluster manager (YARN, Mesos, etc.). This is why you don't see any update to your variable: the updates happen only on the Executors, and none of them are sent back to the Driver. If you want to accomplish this, you should use Accumulators:
Both 'maxele' and 'country' should be Accumulators.
You can read more about them in the accumulators section of the Spark programming guide.
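As an alternative to accumulators, the maximum can also be computed as a value that an action returns to the driver. The sketch below uses reduce for that. The small in-memory splitData is a hypothetical stand-in for the (country, value) pairs parsed from the file in the question, and the app name and local[2] master are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session for illustration; in spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().appName("MaxByReduceSketch").master("local[2]").getOrCreate()

// Hypothetical stand-in for splitData: (country, value-as-string) pairs
val splitData = spark.sparkContext.parallelize(Seq(
  ("A", "100"), ("B", "250"), ("C", "175")
))

// reduce is an action: every partition computes its local maximum, the partial
// results are merged, and the winning pair is returned to the driver
val (country, maxele) = splitData
  .map { case (name, value) => (name, value.toLong) }
  .reduce((a, b) => if (a._2 >= b._2) a else b)

println("***************************** " + country + maxele)
```

Nothing is mutated inside the closure here, so the result does not depend on where the tasks run or on how many partitions there are.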

Reading Mongo data from Spark

I am reading data from mongodb on a spark job, using the com.mongodb.spark.sql connector (v 2.0.0).
It works fine for most databases, but for one specific database the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which worked fine in the testing that we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are applied on the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/)
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)
val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()
println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions
val filteredDF = df.coalesce(128).filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)
println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce
filteredDF.select(
  $"iid",
  $"instance_id".as("instanceId"),
  $"_global_visitor_key".as("globalVisitorKey"),
  $"_timestamp".as("timestamp"),
  $"_timestamp".cast(DataTypes.DateType).as("date")
)
Data is not very big (Shuffle Write is 20MB for this stage) and even if I filter only 1 document, the run time is the same (only the Shuffle Write is much smaller).
How can I solve this?
Thanks
Nir

Scala + Spark collections interactions

I'm working on a small project that uses a graph as its main structure. The graph consists of vertices that have this structure:
class SWVertex[T: ClassTag](
  val id: Long,
  val data: T,
  var neighbors: Vector[Long] = Vector.empty[Long],
  val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
) extends Serializable {
  def addNeighbor(neighbor: Long): Unit = {
    if (neighbor >= 0) { neighbors = neighbors :+ neighbor }
  }
}
Notes:
There will be a lot of vertices, possibly more than MAX_INT, I think.
Each vertex has a mutable array of neighbors (which are just the IDs of other vertices).
There is a special function for adding a vertex to the graph. It uses a BFS algorithm to choose the best vertex in the graph to connect the new vertex to, modifying the neighbor arrays of both the existing and the added vertices.
I've decided to use Apache Spark and Scala for processing and navigating through my graph, but I'm stuck on some misunderstandings. I know that an RDD is a parallel dataset, which I create from the main collection using the parallelize() method, and I've discovered that modifying the source collection takes effect on the created RDD as well. I used this piece of code to find this out:
val newVertex1 = new SWVertex[String](1, "test1")
val newVertex2 = new SWVertex[String](2, "test2")
var vertexData = Seq(newVertex1, newVertex2)
val testRDD1 = sc.parallelize(vertexData, vertexData.length)
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// The result is:
// | ID: 1, data: test1, neighbors:
// | ID: 2, data: test2, neighbors:
// Calling simple procedure, that uses `addNeighbor` on both parameters
makeFriends(vertexData(0), vertexData(1))
testRDD1.collect().foreach(
f => println("| ID: " + f.id + ", data: " + f.data + ", neighbors: "
+ f.neighbors.mkString(", "))
)
// Now the result is:
// | ID: 1, data: test1, neighbors: 2
// | ID: 2, data: test2, neighbors: 1
However, I haven't found a way to do the same thing using RDD methods (and honestly I'm not sure that this is even possible, given RDD immutability). So the question is:
Is there any way to deal with such a large amount of data while keeping the ability to access random vertices to modify their neighbor lists, and to continuously append new vertices?
I believe that the solution must involve some kind of Vector data structure, and in that case I have another question:
Is it possible to store Scala structures in cluster memory?
P.S. I'm planning to use Spark at least for BFS search, but I will be really happy to hear any other suggestions.
P.P.S. I've read about the .view method for creating "lazy" collection transformations, but I still have no clue how it could be used...
Update 1: From what I've read in the Scala Cookbook, I think that Vector will be the best choice, because working with the graph in my case means a lot of random access to the vertices (the elements of the graph) and appending of new vertices. Still, I'm not sure that using Vector for such a large number of vertices won't cause an OutOfMemoryError.
Update 2: I've found several interesting things going on with the memory in the test above. Here's the deal (keep in mind, I'm using a single-node Spark cluster):
// Tests were performed using these lines of code:
val runtime = Runtime.getRuntime
var usedMemory = runtime.totalMemory - runtime.freeMemory
// In the beginning of my work, before creating vertices and collection:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating collection with two vertices:
usedMemory = 191066456 bytes // ~182 MB, 1st run
usedMemory = 173991072 bytes // ~166 MB, 2nd run
// After creating testRDD1
usedMemory = 191066552 bytes // ~182 MB, 1st run
usedMemory = 173991168 bytes // ~166 MB, 2nd run
// After performing first testRDD1.collect() function
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling makeFriends on source collection
usedMemory = 212618296 bytes // ~203 MB, 1st run
usedMemory = 200733808 bytes // ~191 MB, 2nd run
// After calling testRDD1.collect() for modified collection
usedMemory = 216645128 bytes // ~207 MB, 1st run
usedMemory = 203955264 bytes // ~195 MB, 2nd run
I know that this is too few tests to be sure of my conclusions, but I noticed that:
Nothing happens when you create the collection.
After creating the RDD on this sample, 96 bytes are allocated, perhaps for storing partition data or something similar.
Most of the memory was allocated when I called the .collect() method, because I basically collect all the data to one node and, probably because of the single-node Spark installation, I get a second copy of the data (not sure here), which took about 23 MB of memory.
An interesting thing happens after modifying the neighbors' arrays: the next collect requires an additional 4 MB of memory to store them.
Let me try to address the different questions here:
RDD is a parallel dataset, which I'm making from main collection using
parallelize() method and I've discovered, that modifying source
collection will take affect on created RDD as well.
RDDs are parallel, distributed datasets. parallelize lets you take a local collection and distribute it over a cluster. The behavior you are observing, where mutating the underlying objects also mutates the RDD representation, occurs only because the program is currently running on a single node. In a cluster that behavior would not be possible.
Immutability is key to distributing a computation, either 'vertically' (over several cores of the same processor) or 'horizontally' (over several machines in a cluster).
I didn't found the way to update the graph structure using RDD
methods
To achieve that you will need to rethink the graph structure in terms of a distributed collection. In the current OO model, each vertex contains its own list of adjacent vertices and requires mutation of the object in order to build up the graph.
We would need to make the vertices immutable, by creating them with only their properties and externalizing the relationships as a list of edges. In a nutshell, this is what GraphX does. Your Vertex would look like:
case class Vertex[T: ClassTag](
val id: Long,
val data: T,
val timestamp: Timestamp = new Timestamp(System.currentTimeMillis())
)
and then we can build a collection of Edges:
val Edges:RDD[(Long, Long)] // (Source Vertex Id, Dest Vertex Id)
Then, given:
val usr1 = Vertex(1, "SuppieRK")
val usr2 = Vertex(2, "maasg")
val usr3 = Vertex(3, "graphy")
val usr4 = Vertex(4, "spark")
And some initial relationship:
val edgeSeq = Seq((1,2), (2,3))
and the RDD of such relationship:
val relations = sparkContext.parallelize(edgeSeq)
then adding new relationships will mean creating new edges:
val newRelations = sparkContext.parallelize(Seq((1,4),(2,4),(3,4)))
and union-ing those collections together.
val allRel = relations.union(newRelations)
This is how "addFriend" would be implemented, but we probably will be reading that data from somewhere. This method is not to be used to do a one-by-one addition to the Edges collection. You are using Spark because the dataset to consider is very large and you need the possibility to distribute the computation across several machines.
If the collection fits in one node, I would stick to "standard" Scala representations and algorithms.
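To connect the edge-list model back to the per-vertex neighbor lists from the question, here is a minimal sketch that derives the neighbors of each vertex from the edge RDD. It assumes the relationships are undirected (as makeFriends in the question suggests) and uses Long ids; the session setup, the app name and the neighborMap name are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session for illustration; in spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().appName("EdgeListSketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// (Source Vertex Id, Dest Vertex Id)
val relations = sc.parallelize(Seq((1L, 2L), (2L, 3L)))
val newRelations = sc.parallelize(Seq((1L, 4L), (2L, 4L), (3L, 4L)))
val allRel = relations.union(newRelations)

// Assumption: edges are undirected, so emit each edge in both directions,
// then group by vertex id to obtain that vertex's neighbor list
val neighbors = allRel
  .flatMap { case (src, dst) => Seq(src -> dst, dst -> src) }
  .groupByKey()
  .mapValues(_.toSeq.sorted)

// Collecting is only reasonable here because the sample is tiny
val neighborMap = neighbors.collect().toMap
neighborMap.toSeq.sortBy(_._1).foreach { case (id, ns) =>
  println("| ID: " + id + ", neighbors: " + ns.mkString(", "))
}
```

The neighbor lists are recomputed from the edges rather than mutated in place, so adding relationships stays a matter of union-ing in more edges.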

Unexpected spark caching behavior

I've got a spark program that essentially does this:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case. When I look in the "storage" tab of the running job, very few of the partitions of c are cached.
Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.
Am I misunderstanding how to get Spark to cache RDDs?
Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:
a b 1000.0
a c 1000.0
b c 1000.0
d e 1000.0
d f 1000.0
e f 1000.0
g h 1000.0
h i 1000.0
g i 1000.0
d g 400.0
Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would only expect 1.
The problem here is that calling cache is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag in the RDD to indicate that it should be cached when evaluated.
Unpersist however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins a purge of data from the cache. Since you only have a single action at the end of your application, this means that by the time any of the RDDs are evaluated, Spark does not see that any of them should be persisted!
I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:
def foo(a: RDD[...], b: RDD[...]) = {
  val c = a.map(...)
  c.persist(StorageLevel.MEMORY_ONLY_SER)
  var current = b
  for (_ <- 1 to 10) {
    val next = some_other_rdd_ops(c, current)
    next.persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(x => {}) // materialize before unpersisting
    current.unpersist()
    current = next
  }
  current.saveAsTextFile(...)
}
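The flag behaviour described above can be observed directly through getStorageLevel. This is a minimal sketch on a tiny throwaway RDD, not the program from the question; the session setup and the names doubled, levelAfterPersist and levelAfterUnpersist are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Placeholder session for illustration; in spark-shell, getOrCreate() reuses the existing one
val spark = SparkSession.builder().appName("PersistFlagSketch").master("local[2]").getOrCreate()
val sc = spark.sparkContext

val doubled = sc.parallelize(1 to 10).map(_ * 2)

// persist only records the requested storage level; no job runs here
doubled.persist(StorageLevel.MEMORY_ONLY)
val levelAfterPersist = doubled.getStorageLevel

// unpersist resets the level right away, even though no action has run yet
doubled.unpersist()
val levelAfterUnpersist = doubled.getStorageLevel

println("after persist: " + levelAfterPersist + ", after unpersist: " + levelAfterUnpersist)
```

If an action only runs after the unpersist call, the RDD is evaluated with no storage level set, which is why materializing each RDD before unpersisting its predecessor matters.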
Caching doesn't reduce stages, it just won't recompute the stage every time.
In the first iteration, the stage's "Input Size" shows that the data is coming from Hadoop and that the stage reads shuffle input. In subsequent iterations, the data comes from memory and there is no more shuffle input. Execution time is also vastly reduced.
New map stages are created whenever shuffles have to be written, for example when there's a change in partitioning, which in your case is caused by adding a key to the RDD.