The standard pyspark progress bar below is supposed to indicate the number of completed and running tasks (A and B respectively) out of all C tasks scheduled for stage X:
[Stage X:==========> (A + B) / C]
What am I to make of this progress bar then?
[Stage 0:=====> (54288 + -6736) / 65150]
More specifically, what is the meaning of a negative number of running tasks?
I am new to pyspark. I noticed that when I run my code in Jupyter it shows something like this: (0 + n) / n.
When n equals 1, the code runs much faster. So my question is: what can I do to make n small?
For example, right now I am running some code where n equals 8, and it takes forever to run.
Where can I find the proof of the pumping lemma for linear context-free languages?
I am looking for the proof that is specific to linear context-free languages.
I also looked for a formal proof and could not find one. I am not sure whether the following is a formal proof, but it may give you some idea.
The lemma: For every linear context-free language L there is an n > 0 such that every w in L with |w| > n can be written as w = uvxyz with |vy| > 0, |uvyz| <= n, and uv^i x y^i z in L for every i >= 0.
"Proof":
Imagine a parse tree for some long string w in L with start symbol S, and assume the tree contains no useless nodes. If w is long enough, at least one non-terminal will repeat. Call the first repeating non-terminal going down the tree X, its first occurrence (from the top) X[1], and its second occurrence X[2]. Let x be the substring of w generated by X[2], vxy the substring generated by X[1], and uvxyz the full string w generated by S.
Since the derivation from X[1] to X[2] generates v and y, we can build a new tree in which this step is replicated any number of times before descending from X[1]. This proves that uv^i x y^i z is in L for every i >= 0. Since our tree contains no useless nodes, the step from X[1] to X[2] must generate some terminals, and this proves |vy| > 0.
L is linear, which means that on every level of the tree there is a single non-terminal symbol, so each node covers a substring of w whose length is bounded by a linear function of the node's height. The derivation from S to X[2] covers uv and yz of w, and the number of tree levels traversed is bounded by (2 * the number of non-terminal symbols + 1). Since the number of levels traversed is bounded and the tree is linear, the yield of the derivation from S to X[2] is also bounded, which gives |uvyz| <= n for some n >= 0.
Note: Keep in mind that we pick X[1] and X[2] top-down, in contrast to how the "regular" pumping lemma for context-free grammars is usually proved. In the regular pumping lemma there is a bound on the height of X[1] and therefore a bound on |vxy|. In our case there is no bound on the height of X[1]; it can be as high as the length of w requires. There is, however, a bound on the number of tree levels between S and X[2]. This would not mean much if the grammar were not linear, since the yield of the derivation from S to X[2] would still be bounded only by the height of S (which is unbounded). But in the linear case this yield is bounded, and therefore |uvyz| <= n.
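To make the lemma concrete, here is a small Python sketch (an illustrative check, not part of the proof) that pumps a word of L = { a^n b^n : n >= 0 }, a classic linear context-free language generated by S -> aSb | epsilon. The particular decomposition u, v, x, y, z below is an assumption chosen by hand for this language.

```python
def in_L(s: str) -> bool:
    """Membership test for L = { a^n b^n }."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def pump(u, v, x, y, z, i):
    """Build u v^i x y^i z."""
    return u + v * i + x + y * i + z

w = "aaaabbbb"                      # a^4 b^4, a long enough word of L
u, v, x, y, z = "", "a", "aaabbb", "b", ""
assert u + v + x + y + z == w       # the pieces really decompose w
assert len(v + y) > 0               # |vy| > 0

for i in range(6):                  # every pumped variant stays in L
    assert in_L(pump(u, v, x, y, z, i))
```

Here u and z are empty and |uvyz| = 2, matching the lemma's bound on the outer pieces.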
I am having a problem working with BehaviorSpace. I have three parameters: percentage A, percentage B and percentage C. I want to vary the values of these three in a BehaviorSpace experiment, but their sum must always be 100. For example: percentage A 30%, percentage B 30%, percentage C 40%.
["percentage A" 50]
["percentage B" 25]
["percentage C" 25]
One way to skip invalid parameter settings is to use a stop condition. In the variables section of BehaviorSpace you can vary your parameters automatically with a range definition like:
["percentageA" [0 10 100]]
["percentageB" [0 10 100]]
["percentageC" [0 10 100]]
This would of course generate combinations which do not have a sum of 100.
Next, in the reporters section, you can add a reporter that helps you filter your results later on:
(percentageA + percentageB + percentageC)
In the bottom section of the BehaviorSpace menu you can then simply add a stop condition like:
(percentageA + percentageB + percentageC != 100)
This condition will skip all invalid variations. You will still have entries in the output file for runs with invalid combinations, but you can easily filter them out: use the reporter defined above and select only those entries with a value of 100 in that column.
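For intuition about how much gets filtered, here is a small Python sketch (illustrative only, outside NetLogo; the names are my own) that enumerates the same [0 10 100] ranges and counts how many of the generated combinations actually sum to 100:

```python
# Enumerate the 0-to-100-by-10 ranges that BehaviorSpace would generate
# for the three percentage parameters, and keep only combinations
# summing to 100 -- the same filtering the stop condition performs.
from itertools import product

steps = range(0, 101, 10)                 # [0 10 100] in BehaviorSpace notation
all_combos = list(product(steps, steps, steps))
valid = [c for c in all_combos if sum(c) == 100]

print(len(all_combos))   # 1331 combinations in total (11 * 11 * 11)
print(len(valid))        # 66 combinations pass the sum-to-100 check
```

So only 66 of the 1331 generated runs are meaningful; the rest are stopped immediately and can be dropped from the output file.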
I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
import scala.collection.mutable.ArrayBuffer

val cand_br = sc.broadcast(cand)
transactions.flatMap(trans => freq(trans, cand_br.value))
  .reduceByKey(_ + _)

def freq(trans: Set[String], cand: Array[Set[String]]): Array[(Set[String], Int)] = {
  val res = ArrayBuffer[(Set[String], Int)]()
  for (c <- cand) {
    if (c.subsetOf(trans)) {
      res += ((c, 1))
    }
  }
  res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V)], with K each element of cand and V the number of occurrences of that element in the transaction list.
Watching performance in the UI, the flatMap stage takes about 3 min to finish, whereas the rest takes < 1 ms.
transactions.count() ~= 88000 and cand.length ~= 24000 for an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm pretty positive that it's an algorithmic problem I am faced with.
Is there a more optimal solution to solve this subproblem?
PS: I'm fairly new to the Scala / Spark framework, so there might be some strange constructions in this code.
Probably, the right question to ask in this case would be: "what is the time complexity of this algorithm". I think it is very much unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given two collections of sets of sizes m and n, this algorithm counts how many elements of one collection are subsets of elements of the other, so the complexity looks like m x n. Looking one level deeper, we also see that subsetOf is linear in the number of elements of the subset (x.subsetOf(y) is essentially "every element of x is in y"), so the actual complexity is m x n x s, where s is the cardinality of the subsets being checked.
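To make the m x n x s structure explicit, here is a plain-Python mirror of the freq/flatMap counting (an illustrative sketch with made-up data; it is not the question's Scala code):

```python
# For every transaction (m of them) we test every candidate (n of them),
# and each subset test walks the candidate's elements (s of them):
# m * n * s work in total.

def count_candidates(transactions, candidates):
    counts = {}
    for trans in transactions:            # m iterations
        for cand in candidates:           # n iterations
            if cand <= trans:             # subset test: O(|cand|) = O(s)
                counts[cand] = counts.get(cand, 0) + 1
    return counts

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
candidates = [frozenset({"a"}), frozenset({"a", "c"}), frozenset({"b", "c"})]
counts = count_candidates(transactions, candidates)
```

With the question's sizes (m ~ 88000, n ~ 24000), the inner body runs on the order of two billion times, which is why the flatMap stage dominates.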
In other words, this flatMap operation has a lot of work to do.
Going Parallel
Now, going back to Spark, we can also observe that this algorithm is embarrassingly parallel, so we can take advantage of Spark's capabilities.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algorithm on val cand = transactions.filter(_.size < 4).collect. The data size is close to the question's:
transactions.count = 88162
cand.size = 15451
Some comparative runs on local mode:
Vanilla: 1.2 minutes
Increase transactions partitions up to # of cores (8): 33 secs
I also tried an alternative implementation, using cartesian instead of flatMap:
transactions
.cartesian(candRDD)
.map{case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0)}
.reduceByKey(_ + _)
.collect
But that resulted in much longer runs, as seen in the top two lines of the Spark UI (cartesian, and cartesian with a higher number of partitions): 2.5 min.
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, since it involves serializing closures and unpacking collections, but it is negligible in comparison with the function being executed.
Let's see if we can do a better job: I implemented the same algo using plain scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
Where the reduceByKey operation is a naive implementation taken from [2]
Execution time: 3.67 seconds.
Spark gives you parallelism out of the box. This implementation is totally sequential and therefore takes longer to complete as the data grows.
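For reference, a naive single-threaded reduceByKey like the one referenced from [2] can be sketched in Python as follows (my own illustrative version, not the code from [2]):

```python
# Naive reduceByKey: group (key, value) pairs, then fold the values per key.
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, f):
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(f, vs) for k, vs in grouped.items()}

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 3, 'b': 1}
```

Unlike Spark's reduceByKey, this runs in a single thread and holds everything in memory, which is fine for a local sanity check.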
Last sanity check: A trivial flatmap operation:
transactions
.flatMap(trans => Seq((trans, 1)))
.reduceByKey( _ + _)
.collect
Execution time: 0.88 secs
Conclusions:
Spark buys you parallelism and clustering, and this algorithm can take advantage of both. Use more cores and partition the input data accordingly.
There's nothing wrong with flatMap. The time-complexity prize goes to the function inside it.
I recently found out about Parallel Collection in Scala 2.9 and was excited to see that the degree of parallelism can be set using collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism.
However, when I tried an experiment of adding two vectors of size one million each, I found:
Using a parallel collection with parallelism set to 64 is as fast as sequential (shown in the results below).
Increasing setParallelism seems to change performance in a non-linear way. I would have at least expected monotonic behaviour (that is, performance should not degrade if I increase parallelism).
Can someone explain why this is happening?
object examplePar extends App {
  val Rnd = new Random()
  val numSims = 1
  val x = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val y = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val parInt = List(1, 2, 4, 8, 16, 32, 64, 128, 256)
  var avg: Double = 0.0
  var currTime: Long = 0

  for (j <- parInt) {
    collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(j)
    avg = 0.0
    for (k <- 1 to numSims) {
      currTime = System.currentTimeMillis()
      (x zip y).par.map(x => x._1 + x._2)
      avg += (System.currentTimeMillis() - currTime)
    }
    println("Average Time to execute with Parallelism set to " + j.toString + " = " + (avg / numSims).toString + "ms")
  }

  currTime = System.currentTimeMillis()
  (x zip y).map(x => x._1 + x._2)
  println("Time to execute using Sequential = " + (System.currentTimeMillis() - currTime).toString + "ms")
}
The results of running the example using Scala 2.9.1 on a four-core processor are:
Average Time to execute with Parallelism set to 1 = 1047.0ms
Average Time to execute with Parallelism set to 2 = 594.0ms
Average Time to execute with Parallelism set to 4 = 672.0ms
Average Time to execute with Parallelism set to 8 = 343.0ms
Average Time to execute with Parallelism set to 16 = 375.0ms
Average Time to execute with Parallelism set to 32 = 391.0ms
Average Time to execute with Parallelism set to 64 = 406.0ms
Average Time to execute with Parallelism set to 128 = 813.0ms
Average Time to execute with Parallelism set to 256 = 469.0ms
Time to execute using Sequential = 406ms
Though these results are for one run, they are consistent when averaged over more runs.
Parallelism does not come free. It requires extra cycles to split the problem into smaller chunks, organize everything, and synchronize the result.
You can picture this as calling all your friends to help you move, waiting for them to get there, helping you load the truck, then taking them out to lunch, and finally, getting on with your task.
In your test case you are adding two doubles, which is a trivial operation and takes so little time that the overhead from parallelization is greater than simply doing the task in one thread.
Again, the analogy would be to call all your friends to help you move 3 suitcases. It would take you half a day to get rid of them, while you could finish by yourself in minutes.
To get any benefit from parallelization your task has to be complicated enough to warrant the extra overhead. Try doing some expensive calculations, for example a formula involving a mix of 5-10 trigonometric and logarithmic functions.
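The same overhead effect is easy to reproduce in Python (an illustrative sketch, not the original Scala benchmark): per-item work this trivial is dominated by thread-pool coordination (and, in CPython, by the GIL), so the "parallel" route is typically no faster. The sketch only asserts that both routes agree; the timings themselves depend on the machine.

```python
# Compare summing pairs of doubles sequentially vs. through a thread pool.
# The per-item work is so cheap that pool coordination dominates.
import time
from concurrent.futures import ThreadPoolExecutor

def add_pair(pair):
    return pair[0] + pair[1]

pairs = [(float(i), float(i)) for i in range(100_000)]

t0 = time.perf_counter()
seq = [a + b for a, b in pairs]
t_seq = time.perf_counter() - t0

with ThreadPoolExecutor(max_workers=4) as ex:
    t0 = time.perf_counter()
    par = list(ex.map(add_pair, pairs))
    t_par = time.perf_counter() - t0

assert seq == par          # same result either way
print(f"sequential: {t_seq:.4f}s, 4 threads: {t_par:.4f}s")
```

Swap the trivial addition for an expensive per-item computation and the balance shifts; that is the point of the advice above.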
I would suggest studying and using the scala.testing.Benchmark trait to benchmark snippets of code. You have to take JIT, GC and other things into account when benchmarking on the JVM; see this paper. In short, you have to do each of the runs in a separate JVM after doing a couple of warm-up runs.
Also, note that the (x zip y) part does not occur in parallel, because x and y are not yet parallel - the zip is done sequentially. Next, I would suggest turning x and y into arrays (toArray) and then calling par - this will ensure that the benchmark uses parallel arrays, rather than parallel vectors (which are slower for transformer methods such as zip and map).