I am running the following snippet in Spark, hoping to see a speed-up:
val howManyPrimes = sc.parallelize(1 to 1000000).filter(isPrime).count()
where isPrime is implemented using trial division.
I run spark-shell:
spark-shell --master=local[8]
I wait for it to start up, and the snippet runs in ~50 seconds.
Then I quit and open spark-shell again, but now with 1 core:
spark-shell --master=local[1]
Again the snippet runs in ~50 seconds. Why?
isPrime is implemented as follows:
def isPrime(x: Int): Boolean = {
  var d = 2
  while (d < x) {
    if (x % d == 0)
      return false
    else d += 1
  }
  true
}
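One quick way to see how much parallelism the job can actually get is to check the RDD's partition count; a minimal check, assuming a stock spark-shell session:
val rdd = sc.parallelize(1 to 1000000)
// The number of partitions bounds how many tasks can run concurrently.
println(rdd.getNumPartitions)  // defaults to sc.defaultParallelism
println(sc.defaultParallelism) // 8 under local[8], 1 under local[1]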
Hello StackOverflowers.
I have a PySpark dataframe that consists of a time column (snapshot) and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a Kolmogorov-Smirnov (KS) test comparing the values of each snapshot with those of the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F

def KS_for_one_snapshot(temp_df, snapshots_list, j, var="values"):
    sample1 = temp_df.filter(F.col("snapshot") == snapshots_list[j])
    sample2 = temp_df.filter(F.col("snapshot") == snapshots_list[j - 1])  # pick the previous snapshot as the one to compare with
    if sample1.count() == 0 or sample2.count() == 0:
        ks_value = -1  # previously "0 observations", which gave a type error
    else:
        ks_value, p_value = ks_2samp(np.array(sample1.select(var).collect()).reshape(-1),
                                     np.array(sample2.select(var).collect()).reshape(-1),
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value

results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1):
    results.append(KS_for_one_snapshot(df, snapshots_list, j + 1))
results
But in reality the data is huge, so this takes forever. I am using Databricks and PySpark, so I wonder what a more efficient way to run it would be, avoiding the for loop and utilizing the available workers.
I tried to do it using a UDF, but in vain.
Any ideas?
PS: you can generate the data with the following code.
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = (spark.createDataFrame(range(1, 1000), T.IntegerType())
      .withColumn('snapshot', F.array(F.lit("2005-01-31"), F.lit("2005-02-28"), F.lit("2005-03-30")).getItem((F.rand() * 3).cast("int")))
      .withColumn('values', F.rand())
      .drop('value'))  # drop the integer column created by createDataFrame
Update:
I tried the following using a UDF.
from pyspark.sql.window import Window

var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list_2'))

windowSpec = Window.orderBy('snapshot')
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag('snapshot', 1).over(windowSpec)).filter('snapshot_2 is not NULL')
data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        ks_value = -1  # previously "0 observations", which gave a type error
    else:
        ks_value, p_value = ks_2samp(sample_in_list_1,
                                     sample_in_list_2,
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value
import pyspark.sql.types as T
from pyspark.sql.functions import udf

KS_one_snapshot_general_udf = udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select(KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
This works fine if the dataset (per snapshot) is small, but if I increase the number of rows I end up with an error:
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
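A likely culprit, assuming the exception originates in the UDF's return value: ks_2samp returns numpy scalars, and Spark's pickler cannot convert a numpy.float64 coming out of a FloatType UDF. Casting to a plain Python float before returning is the usual fix; a minimal sketch of the adjusted function:
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        return -1.0
    ks_value, p_value = ks_2samp(sample_in_list_1, sample_in_list_2,
                                 alternative="two-sided", mode="auto")
    # ks_2samp returns numpy.float64; a FloatType UDF needs a plain float.
    return float(ks_value)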
(Spark beginner) I wrote the code below to iterate over the rows and columns of a data frame (Spark 2.4.0 + Scala 2.12).
I have computed the row and cell counts as a sanity check.
I was surprised to find that the method returns 0, even though the counters are incremented during the iteration.
To be precise: while the code is running, it prints messages showing that it has found rows 10, 20, ..., 610 (as expected) and cells 100, 200, ..., 15800 (as expected).
After the iteration is done, it prints "Found 0 cells", and returns 0.
I understand that Spark is a distributed processing engine, and that code is not executed exactly as written - but how should I think about this code?
The row/cell counts were just a sanity check; in reality I need to loop over the data and accumulate some results, but how do I prevent Spark from zeroing out my results as soon as the iteration is done?
def processDataFrame(df: sql.DataFrame): Int = {
  var numRows = 0
  var numCells = 0
  df.foreach { row =>
    numRows += 1
    if (numRows % 10 == 0) println(s"Found row $numRows") // prints 10, 20, ..., 610
    row.toSeq.foreach { c =>
      if (numCells % 100 == 0) println(s"Found cell $numCells") // prints 100, 200, ..., 15800
      numCells += 1
    }
  }
  println(s"Found $numCells cells") // prints 0
  numCells
}
Spark has accumulator variables, which give you functionality like counting in a distributed environment. There are simple built-in long and double accumulators, and custom accumulator types can also be implemented quite easily (see the sketch after the code).
In your code, changing the counting variables to accumulator variables as below will give you the correct result.
val numRows = sc.longAccumulator("numRows Accumulator") // the string name is only for debugging / the web UI
val numCells = sc.longAccumulator("numCells Accumulator")
df.foreach { row =>
  numRows.add(1)
  if (numRows.value % 10 == 0) println(s"Found row ${numRows.value}") // prints 10, 20, ..., 610
  row.toSeq.foreach { c =>
    if (numCells.value % 100 == 0) println(s"Found cell ${numCells.value}") // prints 100, 200, ..., 15800
    numCells.add(1)
  }
}
println(s"Found ${numCells.value} cells") // now prints the actual total instead of 0
numCells.value
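As for custom accumulators, here is a rough sketch, assuming the Spark 2.x AccumulatorV2 API (the SetAccumulator name and use case are made up for illustration):
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Collects the distinct strings seen across all tasks.
class SetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val items = mutable.Set.empty[String]
  def isZero: Boolean = items.isEmpty
  def copy(): SetAccumulator = { val acc = new SetAccumulator; acc.items ++= items; acc }
  def reset(): Unit = items.clear()
  def add(v: String): Unit = items += v
  def merge(other: AccumulatorV2[String, Set[String]]): Unit = items ++= other.value
  def value: Set[String] = items.toSet
}

// Register it on the driver before using it in a job:
// val seen = new SetAccumulator
// sc.register(seen, "distinct values seen")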
I am new to ScalaCheck and I want to test the following piece of my application. I want to generate 30 random events and 20 random instances and check whether my application code correctly computes a result.
// generate 30 random events
val eventGenerator: Gen[Event] = for {
  d <- Gen.oneOf[String](Seq("es1", "es2", "es3"))
  t <- Gen.choose[Long](minEvent.getTime, maxEvent.getTime)
  s <- Gen.oneOf[String](Seq("s1", "s2", "s3", "s4", "s5", "s6", "s7"))
} yield Event(d, t, s)
val eventsGenerator: Gen[List[Event]] = Gen.containerOfN[List, Event](30, eventGenerator)

// generate 20 random instances
val instanceGenerator: Gen[Instance] = for {
  d <- Gen.oneOf[String](Seq("es1", "es2", "es3"))
  t <- Gen.choose[Long](minInstance.getTime, maxInstance.getTime)
} yield Instance(d, new Timestamp(t))
val instancesGenerator: Gen[List[Instance]] = Gen.containerOfN[List, Instance](20, instanceGenerator)

val p: Prop = forAll(instancesGenerator, eventsGenerator) { (i, e) =>
  println(i.size)
  println(e.size)
  println()
  val instancesWithFeature = computeExpected(i)
  isEqual(transform(i), instancesWithFeature)
}
For some reason I see this on stdout:
20
15
20
7
20
3
20
1
20
0
starting to compute expected:
Basically it looks like forAll generates a couple of inputs of a certain size and then skips them. For some reason, it starts to compute things only when one of the inputs has size 0, and then it starts the proper check. My questions are:
Why, if I use containerOfN or listOfN, don't I get inputs of exactly that size? How can I generate inputs like that?
Is it normal that forAll explores the space of possible inputs and skips some of them? Am I missing something here? This behaviour is quite counterintuitive to me.
You may need to use forAllNoShrink to avoid the known defect in ScalaCheck where shrinking fails to respect generators:
val thirtyInts: Gen[List[Int]] =
  Gen.listOfN[Int](30, Gen.const(99))

val twentyLongs: Gen[List[Long]] =
  Gen.listOfN[Long](20, Gen.const(44L))

property("listOfN") = {
  Prop.forAllNoShrink(thirtyInts, twentyLongs) { (ii, ll) =>
    ii.size == 30 && ll.size == 20
  }
}
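This also explains the sizes in the question's output: with plain forAll, once the property fails, ScalaCheck shrinks the offending list, roughly halving it each time, which is where the 15, 7, 3, 1, 0 come from; they are not freshly generated inputs. A small demo that forces a failure on purpose (illustrative sketch):
property("shrinking demo") = {
  Prop.forAll(Gen.listOfN(30, Gen.choose(0, 99))) { xs =>
    println(xs.size) // typically prints 30 first, then shrunk sizes like 15, 7, 3, 1, 0
    false            // fail unconditionally so the shrinker kicks in
  }
}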
I'm trying to recursively define an observable that either emits items from a subject or, if a certain amount of time passes, a default value; in this case I'm using the timer's default value of zero. I'm using RxScala and have begun with the following code:
val s = PublishSubject[Int]()

def o: Observable[Unit] = {
  val timeout = Observable.timer(1 second)
  Observable.amb(s, timeout)
    .first
    .concatMap((v) => {
      println(v)
      o
    })
}

ComputationScheduler().createWorker.schedule {
  var value = 0
  def loop(): Unit = {
    Thread.sleep(5000)
    s.onNext(value + 1)
    value += 1
    loop()
  }
  loop()
}

o.toBlocking.last
This seems like it should work, but the output is confusing. Every other sequence of zeros contains two instead of the expected four: two zeros are emitted, and the remaining three seconds elapse without any output.
0
0
0
0
1
0
0
2
0
0
0
0
3
0
0
4
0
0
0
0
5
0
0
6
0
0
0
0
7
0
0
8
This one is puzzling indeed! So here's a theory:
In truth, your code is producing 4 ticks every 5 seconds rather than 5.
There is a race condition on the 4th one, won first by the timeout, then the worker, then the timeout, and so on.
So, instead of the sequence being 00001 002 00003... look at it as 0000 1002 0000...
So you might have two separate problems here, and I can't do much without fiddling with it, but here are things you can try:
Also add a serial number to o(), so you can see which timeouts are not winning the race (see the sketch after this list).
Change the values from 1 and 5 seconds to ones that are not multiples of each other, like 1.5 and 5. This might help you take one problem out and focus on the other.
Start an external, unrelated worker that prints "----" every second, beginning after 0.3 seconds or so. It might give you a better idea of where the divide is.
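A minimal sketch of the serial-number idea (the round parameter is an addition for illustration):
def o(round: Int): Observable[Unit] = {
  val timeout = Observable.timer(1 second)
  Observable.amb(s, timeout)
    .first
    .concatMap((v) => {
      // Tag each value with the round that produced it, so real events
      // and timeout ticks can be told apart in the output.
      println(s"round $round: $v")
      o(round + 1)
    })
}

// start it with: o(0).toBlocking.last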
Refactoring your code to the following generates the expected result (on my machine):
object Test {
  def main(args: Array[String]) {
    val s = PublishSubject[Int]()
    val timeout = Observable.timer(1 second)

    def o: Observable[Unit] = {
      Observable.amb(s, timeout).first
        .concatMap((v) => {
          println(v)
          o
        })
    }

    var value = 0
    NewThreadScheduler().createWorker.scheduleRec {
      Thread.sleep(5000)
      value += 1
      s.onNext(value)
    }

    o.toBlocking.last
  }
}
Notice the switch to the NewThreadScheduler and the use of the scheduleRec method as opposed to manual recursive scheduling.
Here is a Scala code segment. I set the timeout to 100 milliseconds. Out of 10000 loops, 106 of them run for more than 100 milliseconds without throwing an exception; the largest even takes 135 milliseconds. Any reason why this is happening?
for (j <- 0 to 10000) {
  total += 1
  val executor = Executors.newSingleThreadExecutor
  val result = executor.submit[Int](new Callable[Int] {
    def call = try {
      Thread.sleep(95)
      for (i <- 0 to 1000000) {}
      4
    } catch {
      case e: Exception =>
        exception1 += 1
        5
    }
  })
  try {
    val t1 = Calendar.getInstance.getTimeInMillis
    result.get(100, TimeUnit.MILLISECONDS)
    val t2 = Calendar.getInstance.getTimeInMillis
    println("timediff = " + (t2 - t1).toString)
  } catch {
    case e: Exception => exception2 += 1
  }
}
Firstly, if you're running on Windows you should be aware that the timer resolution is around 15.6 milliseconds.
Secondly, your empty loop of 1M iterations is quite likely to be removed by the JIT compiler, and, more importantly, it can't be interrupted by any timeout (see the sketch below).
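For the loop to be stoppable at all, it would have to check the interrupt flag cooperatively; a sketch, not the original code:
// Cooperative version of the busy loop: checking the interrupt flag lets
// result.cancel(true) actually stop it. The original
// `for (i <- 0 to 1000000) {}` ignores interruption entirely, and note that
// result.get(timeout) only stops the waiting, never the task itself.
var i = 0
while (i <= 1000000 && !Thread.currentThread().isInterrupted) {
  i += 1
}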
The way a thread sleep works is that a thread asks the OS to wake it after the given time; that is also how the timeout in the result.get call works. You are relying on the thread that does this to be running at the exact moment your timeout expires, which of course it may not be. Then there is the fact that you create 10000 threads for the scheduler to wake, which it cannot do for all of them at exactly the right time.
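To quantify the slack on a particular machine, you can measure how late Thread.sleep actually wakes up; a small self-contained sketch (numbers will vary by OS and load):
// Request a 100 ms sleep repeatedly and record how many extra
// milliseconds each wake-up actually took.
val overshoots = (1 to 100).map { _ =>
  val start = System.nanoTime()
  Thread.sleep(100)
  (System.nanoTime() - start) / 1e6 - 100.0
}
println(f"mean overshoot: ${overshoots.sum / overshoots.size}%.2f ms, max: ${overshoots.max}%.2f ms")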