How to efficiently perform nested-loop in Spark/Scala?

So I have this main dataframe, called main_DF, which contains all measurement values:
main_DF
group    index     width    height
----------------------------------
1        1         21.3     15.2
1        2         11.3     45.1
2        3         23.2     25.2
2        4         26.1     85.3
...
23       986453    26.1     85.3
And another table, called selected_DF, derived from main_DF, which contains the start and end index of important rows in main_DF, along with the length (end_index - start_index). The fields start_index and end_index correspond to the field index in main_DF.
selected_DF
group    start_index    end_index    length
--------------------------------------------
1        1              154          153
2        236            312          76
3        487            624          137
...
238      17487          18624        1137
Now, for each row in selected_DF, I need to filter all measurement values between start_index and end_index. For example, say row 1 covers index = 1 to 154. After some filtering, the dataframe derived from this row is:
peak_DF
peak_start    peak_end
----------------------
1             12
15            21
27            54
86            91
...
143           150
peak_start and peak_end indicate the areas where width exceeds the threshold. They were obtained by selecting all rows with width > threshold and then checking the positions of their indices (sorry, it's kind of hard to explain, even with the code).
Then I need to take the measurement values (width) for each range in peak_DF and calculate their average, giving something like:
peak_DF_summary
peak_start    peak_end    avg_width
-----------------------------------
1             12          25.6
15            21          35.7
27            54          24.2
86            91          76.6
...
143           150         13.1
And, lastly, calculate the average of avg_width, and save the result.
After that, the curtain moves to the next row in selected_DF, and so on.
So far I somehow managed to obtain what I want with this code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lead, lag, mean}
import spark.implicits._

val main_DF = spark.read.parquet("hdfs_path_here")
main_DF.createOrReplaceTempView("main_DF")

val selected_DF = spark.read.parquet("hdfs_path_here").collect.par //parallelized array of Rows
val final_result_array = scala.collection.mutable.ArrayBuffer.empty[Double] //for final result

// no partitionBy: each filtered range is small enough to sort in a single window partition
val window_spec = Window.orderBy("index")

selected_DF.foreach { x =>
  val start_index = x(1).toString
  val end_index = x(2).toString

  //obtain peak_start and peak_end (START)
  val temp_df_1 = spark.sql("SELECT index, width, height FROM main_DF WHERE width > 25 AND index BETWEEN " + start_index + " AND " + end_index)
  val temp_df_2 = temp_df_1.withColumn("next_index", lead(temp_df_1("index"), 1).over(window_spec)).withColumn("previous_index", lag(temp_df_1("index"), 1).over(window_spec))
  val temp_df_3 = temp_df_2.withColumn("rear_gap", temp_df_2.col("index") - temp_df_2.col("previous_index")).withColumn("front_gap", temp_df_2.col("next_index") - temp_df_2.col("index"))
  val temp_df_4 = temp_df_3.filter("front_gap > 9 or rear_gap > 9")
  val temp_df_5 = temp_df_4.withColumn("next_front_gap", lead(temp_df_4("front_gap"), 1).over(window_spec)).withColumn("next_front_gap_index", lead(temp_df_4("index"), 1).over(window_spec))
  val temp_df_6 = temp_df_5.filter("rear_gap > 9 and next_front_gap > 9").sort("index")
  //obtain peak_start and peak_end (END)

  val peak_DF = temp_df_6.select("index", "next_front_gap_index").toDF("peak_start", "peak_end").collect
  val peak_DF_temp = peak_DF.map { y =>
    spark.sql("SELECT avg(width) as avg_width FROM main_DF WHERE index BETWEEN " + y(0) + " AND " + y(1))
  }
  val peak_DF_summary = peak_DF_temp.reduceLeft((dfa, dfb) => dfa.union(dfb))
  val avg_width = peak_DF_summary.agg(mean("avg_width")).as[Double].first
  final_result_array += avg_width //note: appending from a .par foreach is not thread-safe
}
spark.catalog.dropTempView("main_DF")
(reference)
The problem is, the code only runs until around halfway (after 20-30 iterations) before it crashes with java.lang.OutOfMemoryError: Java heap space. It runs okay when I run the iterations one by one, though.
So my questions are:
1. How can there be insufficient memory? I thought the cause would be accumulated memory usage, so I added .unpersist() for every dataframe inside the foreach loop (even though I never call .persist()), to no avail. But then, shouldn't all memory consumption be reset when the variables are re-initialized in each new iteration of the foreach loop?
2. Is there an efficient way to do this kind of calculation? I am doing nested loops in Spark, and I feel this is a very inefficient way to do it, but so far it's the only way I can get the result.
I'm using CDH 5.7 with Spark 2.1.0. My cluster has 6 nodes with 32GB memory (each) and 40 cores (total). main_DF is based on a 30GB parquet file.
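For question 2, a rough sketch (not a drop-in replacement; the second parquet path and the selection_id column are made up for illustration) of how the per-row filtering could be expressed as a single range join, so Spark plans one distributed job instead of one driver-side query per row of selected_DF:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Tag each selected range with an id so later windows can be partitioned per range
val sel = spark.read.parquet("selected_parquet_path_here") // hypothetical path
  .withColumn("selection_id", monotonically_increasing_id())

val main = spark.read.parquet("hdfs_path_here")

// One range join replaces the per-row "index BETWEEN start AND end" queries
val joined = main.join(
  sel,
  main("index").between(sel("start_index"), sel("end_index"))
)

// lead/lag partitioned by selection_id stay inside one selected range,
// so the gap-based peak detection can be expressed once for all ranges
val w = Window.partitionBy("selection_id").orderBy("index")
val gaps = joined
  .filter($"width" > 25)
  .withColumn("front_gap", lead($"index", 1).over(w) - $"index")
  .withColumn("rear_gap", $"index" - lag($"index", 1).over(w))
// ...the remaining peak/average steps would follow the same pattern,
// ending with groupBy("selection_id") aggregations instead of a driver-side loop.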

Related

Apply groupby in udf from a increase function Pyspark

I have the following function:
import copy

rn = 0

def check_vals(x, y):
    global rn
    if (y != None) & (int(x)+1) == int(y):
        return rn + 1
    else:
        # Using copy to deepcopy and not forming a shallow one.
        res = copy.copy(rn)
        # Increment so that the next value will start from +1
        rn += 1
        # Return the same value as we want to group using this
        return res + 1
    return 0

#pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
    return lambda x, y: check_vals(x, y)
I need to apply this function to the following df:
index  initial_range  final_range
1      1              299
1      300            499
1      500            699
1      800            1000
2      10             99
2      100            199
So I need the following output:
index  min_val  max_val
1      1        699
1      800      1000
2      10       199
Note that the grouping creates new ranges: for each index, take min(initial_range) and max(final_range) over consecutive rows, starting a new group whenever the sequence is broken.
I tried:
w = Window.partitionBy('index').orderBy(sf.col('initial_range'))
df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
        .fillna(0, subset=['nextRange'])
        .groupBy('index')
        .agg(check_final("final_range", "nextRange").alias('check_1'))
        .withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
        .withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
     )
But it didn't work.
Can anyone help me?
I think the pure Spark SQL API can solve your question without any UDF, which might hurt your Spark performance. Also, I think two window functions are enough to solve this question:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df.withColumn(
    'next_row_initial_diff',
    func.col('initial_range') - func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
    'group',
    func.sum(
        func.when(func.col('next_row_initial_diff').isNull() | (func.col('next_row_initial_diff') == 1), func.lit(0))
            .otherwise(func.lit(1))
    ).over(
        Window.partitionBy('index').orderBy('initial_range')
    )
).groupBy(
    'group', 'index'
).agg(
    func.min('initial_range').alias('min_val'),
    func.max('final_range').alias('max_val')
).drop(
    'group'
).show(100, False)
Column next_row_initial_diff: just like the lead you used, it shifts/lags the row so we can check whether the ranges are in sequence.
Column group: groups the consecutive sequences within each index partition.
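Since the surrounding thread is Scala, roughly the same idea in the Scala DataFrame API would look like this (a sketch only, assuming a DataFrame df with the columns shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("index").orderBy("initial_range")

df.withColumn("next_row_initial_diff",
    col("initial_range") - lag("final_range", 1).over(w))
  .withColumn("group",
    sum(when(col("next_row_initial_diff").isNull || col("next_row_initial_diff") === 1, lit(0))
      .otherwise(lit(1))).over(w))
  .groupBy("group", "index")
  .agg(min("initial_range").as("min_val"), max("final_range").as("max_val"))
  .drop("group")
  .show(100, false)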

Creating an optimal selection of overlapping time intervals

A car dealer rents out the rare 1956 Aston Martin DBR1 (of which Aston Martin only ever made 5).
Since there are so many rental requests, the dealer decides to place bookings for an entire year in advance.
He collects the requests and now needs to figure out which requests to take.
Make a script that selects the rental requests such that the greatest number of individual customers
can drive in the rare Aston Martin.
The input of the script is a matrix of days of the year, each row representing the starting and ending
days of the request. The output should be the indices of the customers and their day ranges.
It is encouraged to plan your code first and write your own functions.
At the top of the script, add a comment block with a description of how your code works.
Example of a list with these time intervals:
list = [10 20; 9 15; 16 17; 21 100;];
(It should also work for a list with 100 time intervals)
We could select customers 1 and 4, but then 2 and 3 are impossible, resulting in two happy customers.
Alternatively we could select requests 2, 3 and 4. Hence three happy customers is the optimum here.
The output would be:
customers = [2, 3, 4],
days = [9, 15; 16, 17; 21, 100]
All I can think of is checking if intervals intersect, but I have no clue how to make an overall optimal selection.
My idea:
1) Sort them by start date
2) Make an array of intersections for each one
3) Start rejecting the ones with the biggest intersection arrays, removing each rejected item from the intersection arrays of the items it intersected
4) Repeat step 3 until only items with empty arrays remain
In your example we will get data
10 20 [9 15, 16 17]
9 15 [10 20]
16 17 [10 20]
21 100 []
so we reject 10 20 as it has 2 intersections, leaving only items with empty arrays:
9 15 []
16 17 []
21 100 []
so the search is finished
Code in JavaScript:
const inputData = ' 50 74; 6 34; 147 162; 120 127; 98 127; 120 136; 53 68; 145 166; 95 106; 242 243; 222 250; 204 207; 69 79; 183 187; 198 201; 184 199; 223 245; 264 291; 100 121; 61 61; 232 247'
// convert string to array of objects
const orders = inputData.split(';')
  .map((v, index) => (
    {
      id: index,
      start: Number(v.split(' ')[1]),
      end: Number(v.split(' ')[2]),
      intersections: []
    }
  ))
// sort them by start value
orders.sort((a, b) => a.start - b.start)
// find intersection for each one and add them to intersection array
orders.forEach((item, index) => {
  for (let i = index + 1; i < orders.length; i++) {
    if (orders[i].start <= item.end) {
      item.intersections.push(orders[i])
      orders[i].intersections.push(item)
    } else {
      break
    }
  }
})
// sort by intersections count
orders.sort((a, b) => a.intersections.length - b.intersections.length)
// loop while at least one item still has intersections
while (orders[orders.length - 1].intersections.length > 0) {
  const rejected = orders.pop()
  // remove rejected item from other's intersections
  rejected.intersections.forEach(item => {
    item.intersections = item.intersections.filter(
      other => other.id !== rejected.id
    )
  })
  // sort by intersections count
  orders.sort((a, b) => a.intersections.length - b.intersections.length)
}
// sort by start value
orders.sort((a, b) => a.start - b.start)
// show result
orders.forEach(item => { console.log(item.start + ' - ' + item.end) })
Wanted to expand/correct a little bit on the accepted answer.
You should start by sorting by the start date.
Then accept the very last customer.
Go through the list descending from there and accept all requests that do not overlap with the already accepted ones.
That's the optimal solution.
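For reference, a compact sketch of that greedy strategy in Scala (assuming inclusive (start, end) day pairs as in the example list; indices here are 0-based, unlike the 1-based customer numbers above):

// Greedy selection as described: sort by start, walk from the latest request backwards,
// and accept every request that ends before the start of the most recently accepted one.
def selectRequests(requests: Seq[(Int, Int)]): List[(Int, (Int, Int))] = {
  val sortedByStart = requests.zipWithIndex
    .map { case (interval, i) => (i, interval) }
    .sortBy { case (_, (start, _)) => start }

  sortedByStart.reverseIterator.foldLeft(List.empty[(Int, (Int, Int))]) {
    case (accepted, candidate @ (_, (_, end))) =>
      accepted match {
        case (_, (accStart, _)) :: _ if end >= accStart => accepted      // overlaps, reject
        case _                                          => candidate :: accepted // accept
      }
  }
}

// The example from the question: requests 2, 3 and 4 are kept (0-based indices 1, 2, 3)
val example = Seq((10, 20), (9, 15), (16, 17), (21, 100))
println(selectRequests(example)) // List((1,(9,15)), (2,(16,17)), (3,(21,100)))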

Scala, iterating a collection, working out 10% points

While iterating an arbitrarily-sized List, I'd like to print some output at ~10% intervals to show that the iteration is progressing. For any list of 10 or more elements, I want 10 outputs printed.
I've played around with % and Math functions, but am not always getting 10 outputs printed unless the list sizes are multiples of 10. Would appreciate your help.
One possibility is to calculate 10% of the size based on your input, and then use IterableLike.grouped to group based on that percent:
import scala.util.Random

object Test {
  def main(args: Array[String]): Unit = {
    val range = 0 to Math.abs(Random.nextInt(100))
    val length = range.length
    val percent = Math.ceil((10.0 * length) / 100.0).toInt
    println(s"Printing by $percent percent")
    range.grouped(percent).foreach { listByPercent =>
      println(s"Printing $percent elements: ")
      listByPercent.foreach(println)
    }
  }
}
Unless the length of your list is divisible by 10, you are not going to get exactly 10 print statements. Here I am rounding the interval up (ceil), so you will get fewer print statements. You could use Math.floor instead, which rounds down and gives you more print statements.
// Some list
val list = List.range(0, 27)
// Find the interval that is roughly 10 percent
val interval = Math.ceil(list.length / 10.0)
// Zip the list with the index, so that we can look at the indexes
list.zipWithIndex.foreach {
  case (value, index) =>
    // If an index is divisible by our interval, do your logging
    if (index % interval == 0) {
      println(s"$index / ${list.length}")
    }
    // Do something with the value here
}
Output:
0 / 27
3 / 27
6 / 27
9 / 27
12 / 27
15 / 27
18 / 27
21 / 27
24 / 27
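If exactly ten progress lines are required for any list of ten or more elements, as the question asks, one further option (a sketch along the same zipWithIndex lines, not taken from the answers above) is to print whenever the 1-based position crosses a new decile boundary:

val list = List.range(0, 27)
val n = list.length
list.zipWithIndex.foreach { case (value, index) =>
  val prevDecile = (index * 10) / n        // deciles completed before this element
  val currDecile = ((index + 1) * 10) / n  // deciles completed after this element
  if (currDecile > prevDecile) println(s"${index + 1} / $n (${currDecile * 10}%)")
  // process value here
}

Because each of the ten decile boundaries is crossed exactly once when n >= 10, this prints exactly ten lines regardless of the list size.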

How to use GroupByKey in Spark to calculate nonlinear-groupBy task

I have a table that looks like
Time  ID  Value1  Value2
1     a   1       4
2     a   2       3
3     a   5       9
1     b   6       2
2     b   4       2
3     b   9       1
4     b   2       5
1     c   4       7
2     c   2       0
Here are the tasks and requirements:
I want to set the column ID as the key, not the column Time, but I don't want to delete the column Time. Is there a way in Spark to set Primary Key?
The aggregation function is non-linear, which means you cannot use "reduceByKey". All the data must be shuffled to one single node before calculation. For example, the aggregation function may look like the Nth root of the summed values, where N is the number of records (count) for each ID:
output = root(sum(value1), count(*)) + root(sum(value2), count(*))
To make it clear, for ID="a", the aggregated output value should be
output = root(1 + 2 + 5, 3) + root(4 + 3 + 9, 3)
the latter 3 is because we have 3 records for a. For ID='b', it is:
output = root(6 + 4 + 9 + 2, 4) + root(2 + 2 + 1 + 5, 4)
The combination is non-linear. Therefore, in order to get correct results, all the data with the same "ID" must be in one executor.
I checked UDFs and Aggregators in Spark 2.0. Based on my understanding, they all assume a "linear combination".
Is there a way to handle such a nonlinear combination calculation, especially taking advantage of parallel computing with Spark?
The function you use doesn't require any special treatment. You can use plain SQL with a join:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, lit, sum, pow}
import spark.implicits._ // for the $"colname" syntax

def root(l: Column, r: Column) = pow(l, lit(1) / r)

val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))

df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id"))
or window functions:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("id")
val outw = root(sum($"value1").over(w), count("*").over(w)) +
root(sum($"value2").over(w), count("*").over(w))
df.withColumn("outcome", outw)

T-SQL Decimal Division Accuracy

Does anyone know why, using SQLServer 2005
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)
gives me 11.74438969709659,
but when I increase the decimal places on the denominator to 15, I get a less accurate answer:
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
gives me 11.74438969
For multiplication we simply add the number of decimal places in each argument together (using pen and paper) to work out output dec places.
But division just blows your head apart. I'm off to lie down now.
In SQL terms though, it's exactly as expected.
--Precision = p1 - s1 + s2 + max(6, s1 + p2 + 1)
--Scale = max(6, s1 + p2 + 1)

--Scale = max(6, 15 + 38 + 1) = 54
--Precision = 30 - 15 + 9 + 54 = 78
--Max P = 38; P & S are linked, so (78,54) -> (38,14)
--So, we have a (38,14) output = 11.74438969709659
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)

--Scale = max(6, 15 + 38 + 1) = 54
--Precision = 30 - 15 + 15 + 54 = 84
--Max P = 38; P & S are linked, so (84,54) -> (38,8)
--So, we have a (38,8) output = 11.74438969
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
You can do the same math by hand following this rule, if you treat each number pair as
146804871.212533000000000 and 12499999.999900000 (first query)
146804871.212533000000000 and 12499999.999900000000000 (second query)
To put it shortly, use DECIMAL(25,13) and you'll be fine with all calculations: you'll get the precision as declared, 12 digits before the decimal point and 13 digits after.
The rule is: p + s must equal 38 and you will be on the safe side!
Why is this?
Because of very bad implementation of arithmetic in SQL Server!
Until they fix it, follow that rule.
I've noticed that if you cast the dividing value to float, it gives you the correct answer, i.e.:
select 49/30 (result = 1)
would become:
select 49/cast(30 as float) (result = 1.63333333333333)
We were puzzling over the magic transition,
P & S are linked, so:
(78,54) -> (38,14)
(84,54) -> (38,8)
The following is the math:
i. 78 - 38 = 40,
ii. 54 - 40 = 14
i. 84 - 38 = 46,
ii. 54 - 46 = 8
And this is the reasoning:
i. The output precision less the max precision is the number of digits we're going to throw away.
ii. The output scale less what we're going to throw away gives us the remaining digits in the output scale.
Hope this helps anyone else trying to make sense of this.
Convert the expression, not the arguments:
select CONVERT(DECIMAL(38,36),146804871.212533 / 12499999.9999)
Using the following may help:
SELECT COL1 * 1.0 / COL2