Generate percent using group_by and mutate

I am working on a dataset that contains the predicted label (predicted) and the true label (label) for each id, plus a column indicating whether the predicted label equals the true label (match). I want to show, for each label, the percentage of correct predictions out of the total number of observations belonging to that label.
As an example, given the following data:
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
label <- c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1)
predicted <- c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4)
match <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
dt <- data.frame(id, label, predicted, match)
head(dt)
id label predicted match
1 1 6 6 1
2 2 5 5 1
3 3 1 1 1
4 4 5 3 0
5 5 4 2 0
6 6 2 2 1
I expected the following code, which uses group_by(label), count(label, predicted) and mutate(percent = sum(match == 1)/sum(n)), to give me a grouped data frame like the one shown below it:
library(plyr)
library(dplyr)
dt %>% group_by(label) %>% dplyr::count(label, predicted) %>% mutate(percent = sum(match == 1)/sum(n))
dt
id label predicted match percent
1 3 1 1 1 0.67
2 8 1 1 1 0.67
3 10 1 4 0 0.67
4 6 2 2 1 1.00
5 7 3 3 1 1.00
6 5 4 2 0 0.00
7 4 5 3 0 0.50
8 2 5 5 1 0.50
9 9 6 4 0 0.50
10 1 6 6 1 0.50
However, my code gives me the following output instead:
dt
# A tibble: 6 x 4
# Groups: label [5]
label predicted n percent
<dbl> <dbl> <int> <dbl>
1 1.00 1.00 2 0.600
2 1.00 4.00 1 0.600
3 2.00 2.00 1 0.600
4 3.00 3.00 1 0.600
5 4.00 2.00 1 0.600
6 5.00 3.00 1 0.600
It calculated the percentage of correct predictions across all labels (hence every row equals 0.600) instead of doing so for each label separately. How should I modify my code to achieve my desired output?

I wasn't able to reproduce your output with the code that you shared. I think the following will accomplish what you are seeking, though (I used total as the variable name rather than n):
dt %>%
  arrange(label) %>%
  group_by(label) %>%
  mutate(total = n(),
         percent = sum(match == 1) / total)
# A tibble: 10 x 6
# Groups: label [6]
id label predicted match total percent
<dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 3 1 1 1 3 0.667
2 8 1 1 1 3 0.667
3 10 1 4 0 3 0.667
4 6 2 2 1 1 1
5 7 3 3 1 1 1
6 5 4 2 0 1 0
7 2 5 5 1 2 0.5
8 4 5 3 0 2 0.5
9 1 6 6 1 2 0.5
10 9 6 4 0 2 0.5
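The grouping logic in the answer above can also be sketched outside R. The following is a minimal illustration in plain Python, assuming the same ten rows as the question; the names totals, correct, percent_by_label and percent are made up for this sketch and do not come from the original post.

```python
from collections import defaultdict

# Same example data as in the question.
labels = [6, 5, 1, 5, 4, 2, 3, 1, 6, 1]
matches = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]

totals = defaultdict(int)   # number of rows per label
correct = defaultdict(int)  # number of rows with match == 1 per label
for lab, m in zip(labels, matches):
    totals[lab] += 1
    if m == 1:
        correct[lab] += 1

# One ratio per label, then broadcast back to every row (like mutate).
percent_by_label = {lab: correct[lab] / totals[lab] for lab in totals}
percent = [percent_by_label[lab] for lab in labels]

for lab in sorted(percent_by_label):
    print(lab, totals[lab], round(percent_by_label[lab], 3))
```

The key point is the same as in the dplyr version: the numerator and the denominator are both computed within each label group, and the per-group value is then repeated on every row of that group.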

Related

Implementation of FIFO pnl in kdb/q

Consider the table below:
Id  Verb  Qty  Price
1   Buy   6    10.0
2   Sell  5    11.0
3   Buy   4    10.0
4   Sell  3    12.0
5   Sell  8    9.0
6   Buy   7    8.0
I would like to compute the PnL in a FIFO way. For example, for Id=1 the PnL is -6*(10.0) + 5*(11.0) + 1*(12.0) = +$7.00. Id=5 is a bit different: our position is +2 at that point, so the sell first closes that position (which does not count towards the PnL of Id=5) and then sells the remaining 6 assets, leaving a position of -6. At Id=6 the -6 position is closed, and the PnL of Id=5 is +6*(9.0) - 6*(8.0) = +$6.00. Hence this is the table with PnL that I want to have:
Id  Verb  Qty  Price  PnL
1   Buy   6    10.0   7.0
2   Sell  5    11.0   0.0
3   Buy   4    10.0   2.0
4   Sell  3    12.0   0.0
5   Sell  8    9.0    6.0
6   Buy   7    8.0    0.0 (with 1 asset remaining)
I have read this post, "KDB: pnl in FIFO manner", and https://code.kx.com/q4m3/1_Q_Shock_and_Awe/#114-example-fifo-allocation. However, those approaches ignore the ordering between buy orders and sell orders, which matters in my case.
My idea is to first produce the FIFO allocation matrix, whose dimension is the number of trades:
Id  1   2   3   4   5   6
1   6   0   0   0   0   0
2   1   0   0   0   0   0
3   1   0   4   0   0   0
4   0   0   2   0   0   0
5   0   0   0   0   -6  0
6   0   0   0   0   0   1
Then I compute diff(price). The inner product of each column with diff(price) is the PnL of each trade.
I am having trouble implementing this allocation matrix. Alternatively, is there any advice on solving this problem more directly?
Here's one approach. It's more convoluted than I'd like but it covers a lot of the intermediary steps and generates a type of allocation matrix as you suggested. There are likely edge-cases and tweaks needed but this should give you some ideas at least.
t:([]id:1+til 6;side:`b`s`b`s`s`b;qty:6 5 4 3 8 7;px:10 11 10 12 9 8f);
t:update pos:sums delta from update delta:qty*(1;-1)side=`s from t;
f:{signum[x]*x,{#[(-). z;x;:;abs[y]-sum z 1]}[y;x y]{(x;deltas y&sums x)}[abs where[signum[x]<>signum x y]#x;abs x y]};
t:update fifo:deltas[id!delta;f\[id!delta;id]] from t;
q)update pnl:sum each(id!px)*/:fifo from t
id side qty px delta pos fifo                     pnl
-----------------------------------------------------
1  b    6   10 6     6   1 2 3 4 5 6!-6 5 0 1 0 0 7
2  s    5   11 -5    1   1 2 3 4 5 6!0 0 0 0 0 0  0
3  b    4   10 4     5   1 2 3 4 5 6!0 0 -4 2 2 0 2
4  s    3   12 -3    2   1 2 3 4 5 6!0 0 0 0 0 0  0
5  s    8   9  -8    -6  1 2 3 4 5 6!0 0 0 0 6 -6 6
6  b    7   8  7     1   1 2 3 4 5 6!0 0 0 0 0 0  0
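For readers who want to check the expected PnL column independently of q, here is a rough sketch in Python. It does not reproduce the q code above: instead of building an allocation matrix it keeps an explicit queue of open lots and attributes realised PnL to the trade that opened each lot, which is how the question defines it. The function name fifo_pnl and the lot layout are assumptions made for this illustration.

```python
from collections import deque

def fifo_pnl(trades):
    """trades: list of (id, side, qty, price) with side 'Buy' or 'Sell'.

    Returns (pnl per opening trade id, list of lots still open).
    """
    open_lots = deque()  # each lot: [id, signed remaining qty, price]
    pnl = {tid: 0.0 for tid, _, _, _ in trades}
    for tid, side, qty, price in trades:
        signed = qty if side == "Buy" else -qty
        # Close the oldest lots of the opposite sign first (FIFO).
        while signed != 0 and open_lots and open_lots[0][1] * signed < 0:
            lot = open_lots[0]
            matched = min(abs(signed), abs(lot[1]))
            direction = 1 if lot[1] > 0 else -1  # +1 long lot, -1 short lot
            pnl[lot[0]] += direction * matched * (price - lot[2])
            lot[1] -= direction * matched
            signed += direction * matched
            if lot[1] == 0:
                open_lots.popleft()
        # Whatever is left opens a new lot in this trade's direction.
        if signed != 0:
            open_lots.append([tid, signed, price])
    return pnl, list(open_lots)

trades = [
    (1, "Buy", 6, 10.0),
    (2, "Sell", 5, 11.0),
    (3, "Buy", 4, 10.0),
    (4, "Sell", 3, 12.0),
    (5, "Sell", 8, 9.0),
    (6, "Buy", 7, 8.0),
]
pnl, remaining = fifo_pnl(trades)
print(pnl)        # PnL keyed by the id of the opening trade
print(remaining)  # lots that are still open at the end
```

On the six example trades this yields 7, 0, 2, 0, 6, 0 for ids 1 to 6, with one unit of trade 6 left open, which agrees with the pnl column above. Because all open lots always share the same sign, checking only the head of the queue is enough.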

Find rows of a matrix whose certain columns all match a condition

Suppose I have a matrix with many rows and columns, for example a small subset would be:
1 2 3 4 5 6
1 1 5 6 0 0
1 2 2 3 2 1
1 2 0 3 4 5
1 9 5 7 3 0
I want to find the rows whose columns #4, #5 and #6 all contain elements greater than zero, so in this case I would get a vector like this:
1
3
4
I have tried using the find() function this way:
idx = find(y(:, 4:6) > 0)
but I get this:
1
2
3
4
5
6
8
9
10
11
13
14
You can use a combination of find and all like this:
idx = find(all(y(:,4:6) > 0, 2))
This gives:
>> y = [1 2 3 4 5 6; 1 1 5 6 0 0; 1 2 2 3 2 1; 1 2 0 3 4 5; 1 9 5 7 3 0]
y =
1 2 3 4 5 6
1 1 5 6 0 0
1 2 2 3 2 1
1 2 0 3 4 5
1 9 5 7 3 0
>> idx = find(all(y(:,4:6) > 0, 2))
idx =
1
3
4
The idea is that we extract columns 4 to 6, check which values are greater than 0, operate along the 2nd dimension with an AND condition (all), and then extract which indices (rows) are 1/true in the resulting column vector.
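To make the difference between the two calls concrete, here is a small sketch in plain Python (not MATLAB) on the same matrix. It assumes 1-based row numbers to match the MATLAB output; the variable names mask, rows_all_positive and linear are invented for the illustration. The second computation mimics the column-major linear indices that find returns when it is applied directly to the logical matrix, which is why the original attempt produced a long list.

```python
y = [
    [1, 2, 3, 4, 5, 6],
    [1, 1, 5, 6, 0, 0],
    [1, 2, 2, 3, 2, 1],
    [1, 2, 0, 3, 4, 5],
    [1, 9, 5, 7, 3, 0],
]

# Logical matrix for columns 4 to 6 (Python slice 3:6), one row per row of y.
mask = [[v > 0 for v in row[3:6]] for row in y]

# Equivalent of find(all(mask, 2)): rows where every flag is true (1-based).
rows_all_positive = [i for i, flags in enumerate(mask, start=1) if all(flags)]

# Equivalent of find(mask): column-major linear indices of true entries.
n_rows = len(mask)
linear = [c * n_rows + r + 1
          for c in range(3)
          for r in range(n_rows)
          if mask[r][c]]

print(rows_all_positive)
print(linear)
```

The first list contains rows 1, 3 and 4, and the second reproduces the twelve linear indices shown in the question.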

Is there a way I can compute all these possible paths and store them?

I have a matrix and each of its columns represents a sequence of points, to be more specific:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
6 6 6 6 5 5 5 5 4 4 4 4 3 3 3 3 2 2 2 2
5 4 3 2 6 4 3 2 6 5 3 2 6 5 4 2 6 5 4 3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 3 2 2 2 3 2 2 2 3 2 2 2 4 3 3 3 4
3 3 4 4 3 3 4 4 3 3 5 5 4 4 5 5 4 4 5 5
4 5 5 5 4 6 6 6 5 6 6 6 5 6 6 6 5 6 6 6
1 stands for point number one, 2 stands for point number two, and so on.
So, as said above, every column represents a different configuration of a set of points (x and y coordinates).
If the set of points is:
(1,9)
(2,5)
(3,7)
(4,2)
(2,1)
(2,3)
then one possible path, according to the first column is:
(1,9)
(2,3)
(2,1)
(1,9)
(2,5)
(3,7)
(4,2)
Is there a way I can compute all these possible configurations and store them?
When I first approached this problem I didn't know about graph theory, that's why so far I am not using it.
I don't understand the logic behind the fourth row of your sequence matrix. It's filled with 1s, but they seem to be completely ignored by your example. Following your example, given the points:
(1,9) (2,5) (3,7) (4,2) (2,1) (2,3)
and the first column sequence:
1 6 5 1 2 3 4
the output should be:
(1,9) (2,3) (2,1) (1,9) (2,5) (3,7) (4,2)
and not:
(1,9) (2,3) (2,1) (2,5) (3,7) (4,2)
Since I don't know how your script should work or how I should deal with the fourth row, I implemented code that ignores that logic and produces the result that seems most obvious to me:
seq = [
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
6 6 6 6 5 5 5 5 4 4 4 4 3 3 3 3 2 2 2 2
5 4 3 2 6 4 3 2 6 5 3 2 6 5 4 2 6 5 4 3
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 3 2 2 2 3 2 2 2 3 2 2 2 4 3 3 3 4
3 3 4 4 3 3 4 4 3 3 5 5 4 4 5 5 4 4 5 5
4 5 5 5 4 6 6 6 5 6 6 6 5 6 6 6 5 6 6 6
];
pts = {
[1 9]
[2 5]
[3 7]
[4 2]
[2 1]
[2 3]
};
paths = pts(seq);
Then, in order to access the paths you can do, for example:
for i = 1:size(paths,2)
    disp(cell2mat(paths(:,i)))
end
or:
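paths = cell2mat(paths);
% After cell2mat, each path occupies two adjacent columns (x, then y),
% so step through all columns two at a time.
for i = 1:2:size(paths,2)
    x = paths(:,i);
    y = paths(:,i+1);
    disp([x y]);
end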
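The same lookup can be sketched in plain Python, assuming (as the answer does) that every entry of the sequence matrix is simply a 1-based index into the list of points. The names seq_rows, points and paths are chosen for this sketch only.

```python
# Sequence matrix from the question, stored row by row.
seq_rows = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [6, 6, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2],
    [5, 4, 3, 2, 6, 4, 3, 2, 6, 5, 3, 2, 6, 5, 4, 2, 6, 5, 4, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 4, 3, 3, 3, 4],
    [3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5],
    [4, 5, 5, 5, 4, 6, 6, 6, 5, 6, 6, 6, 5, 6, 6, 6, 5, 6, 6, 6],
]
points = [(1, 9), (2, 5), (3, 7), (4, 2), (2, 1), (2, 3)]

# zip(*seq_rows) walks the matrix column by column; each column is one path.
# Point numbers are 1-based, Python lists are 0-based, hence k - 1.
paths = [[points[k - 1] for k in column] for column in zip(*seq_rows)]

print(len(paths))
print(paths[0])
```

Each element of paths is a list of seven (x, y) tuples, so all twenty configurations are stored and can be iterated over directly.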

Match patterns in a matrix with a variable number of lines and count them in Matlab

I have a matrix like this one:
8
8
8
2
2
2
6
6
7
7
7
1
1
6
6
6
6
8
8
0
6
8
8
1
6
6
There are fixed patterns that always repeat. I would like to detect them. They repeat according to these rules:
Lines with 7 followed by lines with a number which can be (0, 1 or 2), followed by a 6
Lines with 8 followed by lines with a number which can be (0, 1 or 2), followed by a 6
For each of the values in a detected pattern (regardless of how many lines the pattern spans), write a rank number in a second column, starting from 1 and incrementing each time a new pattern is detected in column one. This would be the result:
8 1
8 1
8 1
2 1
2 1
2 1
6 1
6 1
7 2
7 2
7 2
1 2
1 2
6 2
6 2
6 2
6 2
8 3
8 3
0 3
6 3
8 4
8 4
1 4
6 4
6 4
Column 2 encodes, on each line, which pattern the line belongs to: a run of 1s means that those lines contain data related to pattern 1, a run of 2s marks the second pattern, and so on.
How can I do that?
Here's a solution that only uses the "closing tags" to split the matrix into parts:
function b = replaceValues(a)
    closingTag = 6;
    % Find all closing tag positions
    clTagPos = a(:, 1) == closingTag;
    % Keep only the "last" tags and add matrix start/end positions
    splitPoints = [0; find(diff(clTagPos) == -1); size(a, 1)];
    % Split matrix into cell array
    acell = mat2cell(a, diff(splitPoints));
    % Replace the second column of each part with the corresponding non-zero value
    bcell = cellfun(@(c)[c(:, 1) ones(size(c, 1), 1)*c(find(c(:, 2), 1), 2)], acell, 'UniformOutput', 0);
    % Convert back to matrix
    b = cell2mat(bcell);
end
Example input-output in Matlab:
a =
8 0
8 0
8 0
2 1
2 1
2 1
6 0
6 0
7 0
7 0
7 0
1 2
1 2
6 0
6 0
6 0
6 0
8 0
8 0
0 3
6 0
8 0
8 0
1 4
6 0
6 0
>> b = replaceValues(a)
b =
8 1
8 1
8 1
2 1
2 1
2 1
6 1
6 1
7 2
7 2
7 2
1 2
1 2
6 2
6 2
6 2
6 2
8 3
8 3
0 3
6 3
8 4
8 4
1 4
6 4
6 4
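The "closing tag" idea can also be illustrated without a pre-filled second column. The sketch below is plain Python rather than MATLAB and rests on one assumption taken from the rules in the question: a new pattern starts whenever a run of 6s ends. The function name rank_patterns is hypothetical.

```python
def rank_patterns(values, closing=6):
    """Return one rank per value; the rank increases after each run of `closing`."""
    ranks = []
    rank = 1
    prev = None
    for v in values:
        # A non-closing value right after a closing value starts a new pattern.
        if prev == closing and v != closing:
            rank += 1
        ranks.append(rank)
        prev = v
    return ranks

column = [8, 8, 8, 2, 2, 2, 6, 6,
          7, 7, 7, 1, 1, 6, 6, 6, 6,
          8, 8, 0, 6,
          8, 8, 1, 6, 6]
ranks = rank_patterns(column)
for value, rank in zip(column, ranks):
    print(value, rank)
```

On the example column this assigns rank 1 to the first eight lines, 2 to the next nine, 3 to the next four and 4 to the last five, matching the desired output in the question.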

Replace duplicate elements from vector with 0 (Matlab/Octave)

I want to replace duplicate elements from a vector with 0, and keep only the first occurrence.
If I have a vector like
[ 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 5 6 6 6 ]
how could I transform it into
[ 1 0 2 0 0 3 0 0 4 0 0 0 5 0 0 0 6 0 0 ] ?
Thanks.
a = [ 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 5 6 6 6 ];
[c, ia] = unique(a, 'first');
t = a;
t(ia) = 0;
filtered_vect = a - t;
Edit: the same thing in a more concise way, overwriting the original vector:
a = [ 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 5 6 6 6 ];
[c, ia] = unique(a, 'first');
a(~ismember(1:length(a),ia)) = 0;
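For comparison, the same "keep the first occurrence, zero the rest" rule can be written as a single pass in plain Python. This is only a sketch of the idea, not a translation of the unique-based code above, and the name zero_repeats is made up here.

```python
def zero_repeats(values):
    """Keep the first occurrence of each value and replace later ones with 0."""
    seen = set()
    out = []
    for v in values:
        if v in seen:
            out.append(0)   # already encountered: blank it out
        else:
            seen.add(v)
            out.append(v)   # first occurrence: keep it
    return out

a = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6]
filtered = zero_repeats(a)
print(filtered)
```

Unlike the subtraction trick, this version does not depend on the input being sorted or on duplicates being adjacent.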