Scala Spark : Difference in the results returned by df.stat.sampleBy() - scala

I saw many questions posted on stratifiedSampling, but none of them answered my question, so asking it as “new post” , hoping to get some update.
I have noticed that there is a difference in results returned by spark API:sampleBy(), this is not much significant for small sized dataframe but is more noticeable for large sized data frame (>1000 rows)
sample code:
val inputRDD:RDD[(Any,Row)] =df.rdd.keyBy(x=> x.get(0))
val keyCount = inputRDD.countByKey()
val sampleFractions = keyCount.map(x => (x._1,{(

 x._2.toDouble*sampleSize)/(totalCount*100)})).toMap
val sampleDF = df.stat.sampleBy(cols(0),fractions = sampleFractions,seed = 11L)
total dataframe count:200
Keys count:
A:16
B:91
C:54
D:39
fractions : Map(A -> 0.08, B -> 0.455, C -> 0.27, D -> 0.195)
I get only 69 rows as output from df.stat.sampleBy() though I have specified that sample size expected is 100, of course this is specified as fraction to spark API.
Thanks

sampleBy doesn't guarantee you'll get the exact fractions of rows. It takes a sample with probability for each record being included equal to fractions. Depending on a run this value will vary and there is nothing unusual about it.

The result is combined from A -> 16 * 0.08, B -> 91 * 0.455, C -> 54 * 0.27, D -> 39 * 0.195 = ( 1.28 rows + 41.405 rows + 14.58 rows + 7.605 rows) which will make around 67 rows

Related

Apply groupby in udf from a increase function Pyspark

I have the follow function:
import copy
rn = 0
def check_vals(x, y):
global rn
if (y != None) & (int(x)+1) == int(y):
return rn + 1
else:
# Using copy to deepcopy and not forming a shallow one.
res = copy.copy(rn)
# Increment so that the next value with start form +1
rn += 1
# Return the same value as we want to group using this
return res + 1
return 0
#pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
return lambda x, y: check_vals(x, y)
I need apply this function in a follow df:
index initial_range final_range
1 1 299
1 300 499
1 500 699
1 800 1000
2 10 99
2 100 199
So I need that follow output:
index min_val max_val
1 1 699
1 800 1000
2 10 199
See, that the grouping field there are a news abrangencies, that are the values min(initial) and max(final), until the sequence is broken, applying the groupBy.
I tried:
w = Window.partitionBy('index').orderBy(sf.col('initial_range'))
df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
.fillna(0,subset=['nextRange'])
.groupBy('index')
.agg(check_final("final_range", "nextRange").alias('check_1'))
.withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
.withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
)
But, don't worked.
Anyone can help me?
I think pure Spark SQL API can solve your question and it doesn't need to use any UDF, which might be an impact of your Spark performance. Also, I think two window function is enough to solve this question:
df.withColumn(
'next_row_initial_diff', func.col('initial_range')-func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
'group', func.sum(
func.when(func.col('next_row_initial_diff').isNull()|(func.col('next_row_initial_diff')==1), func.lit(0))
.otherwise(func.lit(1))
).over(
Window.partitionBy('index').orderBy('initial_range')
)
).groupBy(
'group', 'index'
).agg(
func.min('initial_range').alias('min_val'),
func.max('final_range').alias('max_val')
).drop(
'group'
).show(100, False)
Column next_row_initial_diff: Just like the lead you use to shift/lag the row and check if it's in sequence.
Column group: To group the sequence in index partition.

How to efficiently perform nested-loop in Spark/Scala?

So I have this main dataframe, called main_DF which contain all measurement values:
main_DF
group index width height
--------------------------------
1 1 21.3 15.2
1 2 11.3 45.1
2 3 23.2 25.2
2 4 26.1 85.3
...
23 986453 26.1 85.3
And another table called selected_DF, derived from main_DF, which contain the start & end index of important rows in main_DF, along with the length (end_index - start_index). The fields start_index and end_index correspond with field index in main_DF.
selected_DF
group start_index end_index length
--------------------------------
1 1 154 153
2 236 312 76
3 487 624 137
...
238 17487 18624 1137
Now, for each row in selected_DF, I need to perform filtering for all measurement values between the start_index and end_index. For example, let's say row 1 is for index = 1 until 154. After some filtering, dataframe derived from this row is:
peak_DF
peak_start peak_end
--------------------------------
1 12
15 21
27 54
86 91
...
143 150
peak_start and peak_end indicate the area where width exceeds the threshold. It was obtained by selecting all width > threshold, and then check the position of its index (sorry but it's kind of hard to explain, even with the code)
Then I need to take the measurement value (width) based on peak_DF and calculate the average, making it something like:
peak_DF_summary
peak_start peak_end avg_width
--------------------------------
1 12 25.6
15 21 35.7
27 54 24.2
86 91 76.6
...
143 150 13.1
And, lastly, calculate the average of avg_width, and save the result.
After that, the curtain moves to the next row in selected_DF, and so on.
So far I somehow managed to obtain what I want with this code:
val main_DF = spark.read.parquet("hdfs_path_here")
df.createOrReplaceTempView("main_DF")
val selected_DF = spark.read.parquet("hdfs_path_here").collect.par //parallelized array
val final_result_array = scala.collection.mutable.ArrayBuffer.empty[Array[Double]] //for final result
selected_DF.foreach{x =>
val temp = x.split(',')
val start_index = temp(1)
val end_index = temp(2)
//obtain peak_start and peak_end (START)
val temp_df_1 = spark.sql( " (SELECT index, width, height FROM main_DF WHERE width > 25 index BETWEEN " + start_index + " AND " + end_index + ")")
val temp_df_2 = temp_df_1.withColumn("next_index", lead(temp_df("index"), 1).over(window_spec) ).withColumn("previous_index", lag(temp_df("index"), 1).over(window_spec) )
val temp_df_3 = temp_df_2.withColumn("rear_gap", temp_df_2.col("index") - temp_df_2.col("previous_index") ).withColumn("front_gap", temp_df_2.col("next_index") - temp_df_2.col("index") )
val temp_df_4 = temp_df_3.filter("front_gap > 9 or rear_gap > 9")
val temp_df_5 = temp_df_4.withColumn("next_front_gap", lead(temp_df_4("front_gap"), 1).over(window_spec) ).withColumn("next_front_gap_index", lead(temp_df_4("index"), 1).over(window_spec) )
val temp_df_6 = temp_df_5.filter("rear_gap > 9 and next_front_gap > 9").sort("index")
//obtain peak_start and peak_end (END)
val peak_DF = temp_df_6.select("index" , "next_front_gap_index").toDF("peak_start", "peak_end").collect
val peak_DF_temp = peak_DF.map { y =>
spark.sql( " (SELECT avg(width) as avg_width FROM main_DF WHERE index BETWEEN " + y(0) + " AND " + y(1) + ")")
}
val peak_DF_summary = peak_DF_temp.reduceLeft( (dfa, dfb) => dfa.unionAll(dfb) )
val avg_width = peak_DF_summary.agg(mean("avg_width")).as[(Double)].first
final_result_array += avg_width._1
}
spark.catalog.dropTempView("main_DF")
(reference)
The problem is, the code can only run until around halfway (after 20-30 iterations) until it crashed and give out java.lang.OutOfMemoryError: Java heap space. It runs okay when I ran the iterations 1-by-1, though.
So my questions are:
How can there be insufficient memory? I thought the reason should be
accumulated usage of memory, so I add .unpersist() for every
dataframe inside foreach loop (even though I do no .persist()) to no avail.
But then, every memory consumption should be reset along with
re-initiation of variables when we enter new iteration in foreach
loop, no?
Is there any efficient way to do this kind of calculation? I am
doing nested-loop in Spark and I feel like this is a very
inefficient way to do this, but so far it's the only way I can get
result.
I'm using CDH 5.7 with Spark 2.1.0. My cluster has 6 nodes with 32GB memory (each) & 40 cores (total). main_DF is based on 30GB parquet file.

RankingMetrics in Spark (Scala)

I am trying to use spark RankingMetrics.meanAveragePrecision.
However it seems like its not working as expected.
val t2 = (Array(0,0,0,0,1), Array(1,1,1,1,1))
val r = sc.parallelize(Seq(t2))
val rm = new RankingMetrics[Int](r)
rm.meanAveragePrecision // Double = 0.2
rm.precisionAt(5) // Double = 0.2
t2 is a tuple where the left array indicates the real values and the right array the predicted values (1 - relevant document, 0- non relevant)
If we calculate the average precision for t2 we get :
(0/1 + 0/2 + 0/3 + 0/4 + 1/5 )/5 = 1/25
But the RankingMetric returns 0.2 for MeanAveragePrecision which should be 1/25.
Thanks.
I think that the problem is your input data. Since your predicted/actual data contains relevance scores, I think you should be looking at binary classification metrics rather than ranking metrics if you want to evaluate using the 0/1 scores.
RankingMetrics is expecting two lists/arrays of ranked items instead, so if you replace the scores with the document ids it should work as expected. Here is an example in PySpark, with two lists that only match on the 5th item:
from pyspark.mllib.evaluation import RankingMetrics
rdd = sc.parallelize([(['a','b','c','d','z'], ['e','f','g','h','z'])])
metrics = RankingMetrics(rdd)
for i in range(1, 6):
print i, metrics.precisionAt(i)
print 'meanAveragePrecision', metrics.meanAveragePrecision
print 'Mean precisionAt', sum([0, 0, 0, 0, 0.2]) / 5
Which produced:
1 0.0
2 0.0
3 0.0
4 0.0
5 0.2
meanAveragePrecision 0.04
Mean precisionAt 0.04
Basically how the RankingMetrics function works is with two lists on each row,
First list is the items being recommended order matters here
Second list is the relevant items
For example in PySpark (But should be equivalent for Scala or Java),
recs_rdd = sc.parallelize([
(
['item1', 'item2', 'item3'], # Recommendations in order
['item3', 'item2'] # Relevant items - Unordered
),
(
['item3', 'item1', 'item2'], # Recommendations in order
['item3', 'item2'] # Relevant items - Unordered
),
])
from pyspark.mllib.evaluation import RankingMetrics
rankingMetrics = RankingMetrics(recs_rdd)
print("MAP: ", rankingMetrics.meanAveragePrecision)
This prints the MAP value of 0.7083333333333333 and is calculated by
(
(1/2 + 2/3) / 2
+ (1/1 + 2/3) / 2
) / 2
Which equals 0.708333
With
row 1 as (1/2 + 2/3) / 2
1/2 : 1 item in positions 2 or less are relevant
2/3 : 2 items in positions 3 or less are relevant
2 : Row 1 has 2 relevant items
row 2 as (1/1 + 2/3) / 2
1/1 : 1 item in position 1 or less is relevant
2/3 : 2 items in positions 3 or less are relevant
2 : Row 2 has 2 relevant items
And / 2 as there are 2 rows

PySpark : how to split data without randomnize

there are function that can randomize spilt data
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0L)
I'm curious if there a way that we generate data the same partition ( train 60 / valid 20 / test 20 ) but without randommize ( let's just say use the current data to split first 60 = train, next 20 =valid and last 20 are for test data)
is there a possible way to split data similar way to split but not randomize?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD, it's just an unordered set. If you have an integer index column you could do something like this:
train = RDD.filter(lambda r: r['index'] % 5 <= 3)
validation = RDD.filter(lambda r: r['index'] % 5 == 4)
test = RDD.filter(lambda r: r['index'] % 5 == 5)

T-SQL Decimal Division Accuracy

Does anyone know why, using SQLServer 2005
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)
gives me 11.74438969709659,
but when I increase the decimal places on the denominator to 15, I get a less accurate answer:
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
give me 11.74438969
For multiplication we simply add the number of decimal places in each argument together (using pen and paper) to work out output dec places.
But division just blows your head apart. I'm off to lie down now.
In SQL terms though, it's exactly as expected.
--Precision = p1 - s1 + s2 + max(6, s1 + p2 + 1)
--Scale = max(6, s1 + p2 + 1)
--Scale = 15 + 38 + 1 = 54
--Precision = 30 - 15 + 9 + 54 = 72
--Max P = 38, P & S are linked, so (72,54) -> (38,20)
--So, we have 38,20 output (but we don use 20 d.p. for this sum) = 11.74438969709659
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,9),12499999.9999)
--Scale = 15 + 38 + 1 = 54
--Precision = 30 - 15 + 15 + 54 = 84
--Max P = 38, P & S are linked, so (84,54) -> (38,8)
--So, we have 38,8 output = 11.74438969
SELECT CONVERT(DECIMAL(30,15),146804871.212533)/CONVERT(DECIMAL (38,15),12499999.9999)
You can do the same math if follow this rule too, if you treat each number pair as
146804871.212533000000000 and 12499999.999900000
146804871.212533000000000 and 12499999.999900000000000
To put it shortly, use DECIMAL(25,13) and you'll be fine with all calculations - you'll get precision right as declared: 12 digits before decimal dot, and 13 decimal digits after.
Rule is: p+s must equal 38 and you will be on safe side!
Why is this?
Because of very bad implementation of arithmetic in SQL Server!
Until they fix it, follow that rule.
I've noticed that if you cast the dividing value to float, it gives you the correct answer, i.e.:
select 49/30 (result = 1)
would become:
select 49/cast(30 as float) (result = 1.63333333333333)
We were puzzling over the magic transition,
P & S are linked, so:
(72,54) -> (38,29)
(84,54) -> (38,8)
Assuming (38,29) is a typo and should be (38,20), the following is the math:
i. 72 - 38 = 34,
ii. 54 - 34 = 20
i. 84 - 38 = 46,
ii. 54 - 46 = 8
And this is the reasoning:
i. Output precision less max precision is the digits we're going to throw away.
ii. Then output scale less what we're going to throw away gives us... remaining digits in the output scale.
Hope this helps anyone else trying to make sense of this.
Convert the expression not the arguments.
select CONVERT(DECIMAL(38,36),146804871.212533 / 12499999.9999)
Using the following may help:
SELECT COL1 * 1.0 / COL2