RankingMetrics in Spark (Scala)

I am trying to use Spark's RankingMetrics.meanAveragePrecision.
However, it does not seem to work as expected.
import org.apache.spark.mllib.evaluation.RankingMetrics

val t2 = (Array(0,0,0,0,1), Array(1,1,1,1,1))
val r = sc.parallelize(Seq(t2))
val rm = new RankingMetrics[Int](r)
rm.meanAveragePrecision // Double = 0.2
rm.precisionAt(5)       // Double = 0.2
t2 is a tuple where the left array holds the actual values and the right array the predicted values (1 = relevant document, 0 = non-relevant document).
If we calculate the average precision for t2 we get:
(0/1 + 0/2 + 0/3 + 0/4 + 1/5) / 5 = 1/25
But RankingMetrics returns 0.2 for meanAveragePrecision, which I expected to be 1/25.
Thanks.

I think the problem is your input data. Since your predicted/actual arrays contain relevance scores, you should be looking at binary classification metrics rather than ranking metrics if you want to evaluate using the 0/1 scores.
RankingMetrics expects two lists/arrays of ranked items instead, so if you replace the scores with the document IDs it should work as expected. Here is an example in PySpark, with two lists that only match on the 5th item:
from pyspark.mllib.evaluation import RankingMetrics

# Each row is (predicted ranking, ground-truth items); the lists only share 'z'.
rdd = sc.parallelize([(['a', 'b', 'c', 'd', 'z'], ['e', 'f', 'g', 'h', 'z'])])
metrics = RankingMetrics(rdd)

for i in range(1, 6):
    print(i, metrics.precisionAt(i))
print('meanAveragePrecision', metrics.meanAveragePrecision)
print('Mean precisionAt', sum([0, 0, 0, 0, 0.2]) / 5)
Which produced:
1 0.0
2 0.0
3 0.0
4 0.0
5 0.2
meanAveragePrecision 0.04
Mean precisionAt 0.04
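For what it's worth, this also explains the 0.2 from the original Scala snippet: if I read RankingMetrics correctly, it treats the first array as the predicted ranking of item IDs and the second as the set of relevant item IDs, so the relevant set is {1} and item 1 shows up at rank 5, giving (1/5)/1 = 0.2. A minimal PySpark sketch of that interpretation (the doc IDs are made up for illustration, and sc is assumed to be an existing SparkContext):
from pyspark.mllib.evaluation import RankingMetrics

# One relevant item ('doc5'), predicted at position 5, which is the same shape
# as the original (Array(0,0,0,0,1), Array(1,1,1,1,1)) input once read as item IDs.
rdd2 = sc.parallelize([(['doc1', 'doc2', 'doc3', 'doc4', 'doc5'], ['doc5'])])
metrics2 = RankingMetrics(rdd2)
print(metrics2.meanAveragePrecision)  # 0.2, i.e. (1/5) / 1
print(metrics2.precisionAt(5))        # 0.2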

Basically, RankingMetrics works with two lists on each row:
The first list is the recommended items; order matters here.
The second list is the relevant items; order does not matter.
For example, in PySpark (it should be equivalent for Scala or Java):
from pyspark.mllib.evaluation import RankingMetrics

recs_rdd = sc.parallelize([
    (
        ['item1', 'item2', 'item3'],  # Recommendations, in order
        ['item3', 'item2'],           # Relevant items, unordered
    ),
    (
        ['item3', 'item1', 'item2'],  # Recommendations, in order
        ['item3', 'item2'],           # Relevant items, unordered
    ),
])
rankingMetrics = RankingMetrics(recs_rdd)
print("MAP: ", rankingMetrics.meanAveragePrecision)
This prints a MAP value of 0.7083333333333333, which is calculated as
((1/2 + 2/3) / 2 + (1/1 + 2/3) / 2) / 2
which equals roughly 0.708333, with:
Row 1 contributing (1/2 + 2/3) / 2
1/2: 1 of the first 2 recommended items is relevant
2/3: 2 of the first 3 recommended items are relevant
/ 2: row 1 has 2 relevant items
Row 2 contributing (1/1 + 2/3) / 2
1/1: 1 of the first 1 recommended items is relevant
2/3: 2 of the first 3 recommended items are relevant
/ 2: row 2 has 2 relevant items
and the final / 2 because there are 2 rows.
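As a quick sanity check, the same arithmetic in plain Python (nothing Spark-specific) reproduces the value printed above:
# Average precision per row, then the mean over the two rows.
row1_ap = (1/2 + 2/3) / 2   # relevant items found at ranks 2 and 3
row2_ap = (1/1 + 2/3) / 2   # relevant items found at ranks 1 and 3
print((row1_ap + row2_ap) / 2)  # ~0.708333, matching the MAP above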

Related

Apply groupBy in a UDF from an increasing function in PySpark

I have the following function:
import copy

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType

rn = 0

def check_vals(x, y):
    global rn
    if (y != None) & (int(x)+1) == int(y):
        return rn + 1
    else:
        # Using copy to deepcopy and not forming a shallow one.
        res = copy.copy(rn)
        # Increment so that the next value will start from +1
        rn += 1
        # Return the same value as we want to group using this
        return res + 1
    return 0

@pandas_udf(IntegerType(), functionType=PandasUDFType.GROUPED_AGG)
def check_final(x, y):
    return lambda x, y: check_vals(x, y)
I need to apply this function to the following df:
index initial_range final_range
1 1 299
1 300 499
1 500 699
1 800 1000
2 10 99
2 100 199
So I need the following output:
index min_val max_val
1 1 699
1 800 1000
2 10 199
Note that, within each index, there are new ranges given by min(initial_range) and max(final_range), extending until the sequence is broken; that is what the grouping should produce.
I tried:
from pyspark.sql import functions as sf
from pyspark.sql.window import Window

w = Window.partitionBy('index').orderBy(sf.col('initial_range'))
df = (df.withColumn('nextRange', sf.lead('initial_range').over(w))
        .fillna(0, subset=['nextRange'])
        .groupBy('index')
        .agg(check_final("final_range", "nextRange").alias('check_1'))
        .withColumn('min_val', sf.min("initial_range").over(Window.partitionBy("check_1")))
        .withColumn('max_val', sf.max("final_range").over(Window.partitionBy("check_1")))
)
But it didn't work.
Can anyone help me?
I think the pure Spark SQL API can solve your question without any UDF, which might hurt your Spark performance. Also, I think two window functions are enough to solve this:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df.withColumn(
    'next_row_initial_diff',
    func.col('initial_range') - func.lag('final_range', 1).over(Window.partitionBy('index').orderBy('initial_range'))
).withColumn(
    'group',
    func.sum(
        func.when(func.col('next_row_initial_diff').isNull() | (func.col('next_row_initial_diff') == 1), func.lit(0))
            .otherwise(func.lit(1))
    ).over(
        Window.partitionBy('index').orderBy('initial_range')
    )
).groupBy(
    'group', 'index'
).agg(
    func.min('initial_range').alias('min_val'),
    func.max('final_range').alias('max_val')
).drop(
    'group'
).show(100, False)
Column next_row_initial_diff: like the lead you used, it shifts/lags the row so we can check whether the current range is in sequence with the previous one.
Column group: a running sum that labels each consecutive sequence within an index partition.
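For completeness, here is one way to build the sample DataFrame from the question so the query above can be run end to end (the spark and df variable names are my assumptions):
from pyspark.sql import SparkSession

# Sample data taken directly from the question.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 299), (1, 300, 499), (1, 500, 699), (1, 800, 1000),
     (2, 10, 99), (2, 100, 199)],
    ['index', 'initial_range', 'final_range'],
)
With this input, the aggregation above returns (1, 1, 699), (1, 800, 1000) and (2, 10, 199), which matches the output requested in the question.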

Table sort by month

I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is to convert your month names to numeric values from 1 (January) to 12 (December), as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
This also gives you access to a standard sorting criterion for ColumnC (in this example, ascending):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason unknown to us, you are forced to keep your months as literals, you can use a workaround that consists of sorting a clone of the table using the approach described above and then applying the resulting indices to the original table:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);

Split the rowcount of a table 3 ways in Perl

I am getting the row count of a Sybase table in Perl. For example, the table has 100 rows, so n=100.
I want to split this value into 3 parts:
1-33 | 34-66 | 67-99 or 100
Please advise how to do this in Perl.
Reason for this split: I need to pass the values 1 and 33 as input parameters to a stored proc to select rows whose identity column value is between 1 and 33.
The same goes for 34-66 and 67-99.
The interesting part is deciding where each range starts. From there it's easy to decide that each range ends at one less than the start of the next range.
This partition() function will determine the start points for a given number of partitions within a given number of elements, starting at a given offset.
sub partition {
    my ($offset, $n_elements, $n_partitions) = @_;
    die "Cannot create $n_partitions partitions from $n_elements elements.\n"
        if $n_partitions > $n_elements;
    my $step = int($n_elements / $n_partitions);
    return map {$step * $_ + $offset} 0 .. $n_partitions - 1;
}
Here's how it works:
First, determine what the step should be by dividing the number of elements by the number of partitions, truncating any fractional part to keep an integer.
Next, walk through the steps by starting at zero and multiplying by the step number (or partition number). So if the step is 5, then 5*0=0, 5*1=5, 5*2=10, and so on. We do not take the last step, because it makes more sense to let the last partition be "off by one" in size than to start a new partition with only one element.
Finally, we allow for an offset to be applied, so that partition(0, 100, 5) means to find the starting element positions for five partitions starting at zero and continuing for 100 elements (so a range of 0 to 99). And partition(1, 100, 5) would mean start at 1 and continue for 100 elements, partitioned into five segments, so a range of 1 to 100.
Here's an example of putting the function to use to find the partition points in a set of several ranges:
use strict;
use warnings;
use Test::More;

sub partition {
    my ($offset, $n_elements, $n_partitions) = @_;
    die "Cannot create $n_partitions partitions from $n_elements elements.\n"
        if $n_partitions > $n_elements;
    my $step = int($n_elements / $n_partitions);
    return map {$step * $_ + $offset} 0 .. $n_partitions - 1;
}

while (<DATA>) {
    chomp;
    next unless length;
    my ($off, $n_elems, $n_parts, @starts) = split /,\s*/;
    local $" = ',';
    is_deeply
        [partition($off, $n_elems, $n_parts)],
        [@starts],
        "Partitioning $n_elems elements starting at $off by $n_parts yields start positions of [@starts]";
}

done_testing();
__DATA__
0,10,2,0,5
1,11,2,1,6
0,3,2,0,1
0,7,3,0,2,4
0,21,3,0,7,14
0,21,7,0,3,6,9,12,15,18
0,20,3,0,6,12
0,100,4,0,25,50,75
1,100,4,1,26,51,76
1,100,3,1,34,67
0,10,1,0
1,10,10,1,2,3,4,5,6,7,8,9,10
This yields the following output:
ok 1 - Partitioning 10 elements starting at 0 by 2 yields start positions of [0,5]
ok 2 - Partitioning 11 elements starting at 1 by 2 yields start positions of [1,6]
ok 3 - Partitioning 3 elements starting at 0 by 2 yields start positions of [0,1]
ok 4 - Partitioning 7 elements starting at 0 by 3 yields start positions of [0,2,4]
ok 5 - Partitioning 21 elements starting at 0 by 3 yields start positions of [0,7,14]
ok 6 - Partitioning 21 elements starting at 0 by 7 yields start positions of [0,3,6,9,12,15,18]
ok 7 - Partitioning 20 elements starting at 0 by 3 yields start positions of [0,6,12]
ok 8 - Partitioning 100 elements starting at 0 by 4 yields start positions of [0,25,50,75]
ok 9 - Partitioning 100 elements starting at 1 by 4 yields start positions of [1,26,51,76]
ok 10 - Partitioning 100 elements starting at 1 by 3 yields start positions of [1,34,67]
ok 11 - Partitioning 10 elements starting at 0 by 1 yields start positions of [0]
ok 12 - Partitioning 10 elements starting at 1 by 10 yields start positions of [1,2,3,4,5,6,7,8,9,10]
1..12
For additional examples, look at "Split range 0 to M into N non-overlapping (roughly equal) ranges" on PerlMonks.
Your question is looking for complete range start and end points. This method makes it rather trivial:
sub partition {
    my ($offset, $n_elements, $n_partitions) = @_;
    my $step = int($n_elements / $n_partitions);
    return map {$step * $_ + $offset} 0 .. $n_partitions - 1;
}

my $n_elems = 100;
my $offset  = 1;
my $n_parts = 3;

my @starts = partition($offset, $n_elems, $n_parts);
my @ranges = map {
    [
        $starts[$_],
        ($starts[$_+1] // $n_elems + $offset) - 1,
    ]
} 0 .. $#starts;

print "($_->[0], $_->[1])\n" foreach @ranges;
The output:
(1, 33)
(34, 66)
(67, 100)
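If it helps to see the same arithmetic outside Perl, here is a rough Python translation of the start/end computation (my own sketch, not part of the original answer):
def partition(offset, n_elements, n_partitions):
    # Same idea as the Perl sub: truncate the step, then walk it n_partitions times.
    step = n_elements // n_partitions
    return [step * i + offset for i in range(n_partitions)]

offset, n_elements, n_parts = 1, 100, 3
starts = partition(offset, n_elements, n_parts)                # [1, 34, 67]
ends = [s - 1 for s in starts[1:]] + [offset + n_elements - 1]
print(list(zip(starts, ends)))                                 # [(1, 33), (34, 66), (67, 100)]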
Even more implementation examples appear in Algorithm for dividing a range into ranges and then finding which range a number belongs to on the StackExchange Software Engineering forum.

How to use groupByKey in Spark to calculate a nonlinear groupBy task

I have a table that looks like:
Time ID Value1 Value2
1 a 1 4
2 a 2 3
3 a 5 9
1 b 6 2
2 b 4 2
3 b 9 1
4 b 2 5
1 c 4 7
2 c 2 0
Here are the tasks and requirements:
I want to set the column ID as the key, not the column Time, but I don't want to delete the column Time. Is there a way in Spark to set a primary key?
The aggregation function is non-linear, which means you cannot use "reduceByKey". All the data must be shuffled to one single node before the calculation. For example, the aggregation function may look like the Nth root of the summed values, where N is the number of records (count) for each ID:
output = root(sum(value1), count(*)) + root(sum(value2), count(*))
To make it clear, for ID="a", the aggregated output value should be
output = root(1 + 2 + 5, 3) + root(4 + 3 + 9, 3)
The latter 3 is because we have 3 records for a. For ID='b', it is:
output = root(6 + 4 + 9 + 2, 4) + root(2 + 2 + 1 + 5, 4)
The combination is non-linear. Therefore, in order to get correct results, all the data with the same "ID" must be in one executor.
I checked UDF and Aggregator in Spark 2.0. Based on my understanding, they all assume a "linear combination".
Is there a way to handle such a nonlinear combination calculation, especially taking advantage of parallel computing with Spark?
The function you use doesn't require any special treatment. You can use plain SQL with a join:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, lit, sum, pow}
def root(l: Column, r: Column) = pow(l, lit(1) / r)
val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))
df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id"))
or window functions:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("id")
val outw = root(sum($"value1").over(w), count("*").over(w)) +
           root(sum($"value2").over(w), count("*").over(w))
df.withColumn("outcome", outw)

Crystal Reports Cross-tab with mix of Sum, Percentages and Computed values

Being new to Crystal, I am unable to figure out how to compute rows 3 and 4 below.
Rows 1 and 2 are simple percentages of the sum of the data.
Row 3 is a computed value (see below).
Row 4 is a sum of the data points (NOT a percentage, as in rows 1 and 2).
Can someone give me some pointers on how to generate the display below?
My data:
2010/01/01 A 10
2010/01/01 B 20
2010/01/01 C 30
2010/02/01 A 40
2010/02/01 B 50
2010/02/01 C 60
2010/03/01 A 70
2010/03/01 B 80
2010/03/01 C 90
I want to display
2010/01/01 2010/02/01 2010/03/01
========== ========== ==========
[ B/(A + B + C) ] 20/60 50/150 80/240 <=== percentage of sum
[ C/(A + B + C) ] 30/60 60/150 90/240 <=== percentage of sum
[ 1 - A/(A + B + C) ] 1 - 10/60 1 - 40/150 1 - 70/240 <=== computed
[ (A + B + C) ] 60 150 240 <=== sum
Assuming you are using a SQL data source, I suggest deriving each of the output rows' values (i.e. [B/(A + B + C)], [C/(A + B + C)], [1 - A/(A + B + C)] and [(A + B + C)]) per date in the SQL query, then using Crystal's cross-tab feature to pivot them into the desired output format.
Crystal's crosstabs aren't particularly suited to deriving different calculations on different rows of output.