Counts and proportions within nested dataframes - group-by

I would like to count and calculate proportions within several sub-data frames. I used to do this with some ugly code, using lapply and creating and looping over many objects. With the purrr package this should be more straightforward, but I have not managed to do it so far.
Illustration with the standard R dataset "mtcars":
The following code is a standard dplyr solution (which works):
mtcars %>% group_by(vs, am) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
Thus, I get the counts and proportions within the levels of "vs" (0 and 1). However, I want to calculate the grouped counts and proportions within several subgroups of "carb". With standard dplyr this would look like this (which also works):
mtcars %>% filter(carb == 1) %>% group_by(vs, am) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
mtcars %>% filter(carb == 2) %>% group_by(vs, am) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
etc.
This works but is cumbersome.
With purrr it should look something like this:
mtcars %>% group_by(carb) %>% nest() %>% mutate(n = map(data, count))
Here, however, the grouping group_by(vs, am) is lost. How can I introduce the grouping here?
Many thanks in advance!

Is this the desired output?
library(tidyverse)
unique(mtcars$carb) %>%
  map_dfr(~ mtcars %>%
            group_by(vs, am) %>%
            filter(carb == .x) %>%
            summarise(carb = .x, n = n(), .groups = 'drop') %>%
            group_by(vs) %>%
            mutate(freq = prop.table(n)))
#> # A tibble: 12 × 5
#> # Groups:   vs [2]
#>       vs    am  carb     n  freq
#>    <dbl> <dbl> <dbl> <int> <dbl>
#>  1     0     0     4     5 0.625
#>  2     0     1     4     3 0.375
#>  3     1     0     4     2     1
#>  4     1     0     1     3 0.429
#>  5     1     1     1     4 0.571
#>  6     0     0     2     4   0.8
#>  7     0     1     2     1   0.2
#>  8     1     0     2     2   0.4
#>  9     1     1     2     3   0.6
#> 10     0     0     3     3     1
#> 11     0     1     6     1     1
#> 12     0     1     8     1     1
Created on 2022-01-06 by the reprex package (v2.0.1)
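If nesting is not strictly required, here is a sketch of two alternatives (not part of the original answer): the grouping can be done directly with count(), or it can be reintroduced inside map() so that the nested workflow from the question keeps working.
library(dplyr)
library(tidyr)
library(purrr)

# Direct route: count by carb/vs/am, then proportions within each carb/vs pair
mtcars %>%
  count(carb, vs, am) %>%
  group_by(carb, vs) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

# Nested route from the question, with the vs/am grouping moved inside map()
mtcars %>%
  group_by(carb) %>%
  nest() %>%
  mutate(res = map(data, ~ .x %>%
                     count(vs, am) %>%
                     group_by(vs) %>%
                     mutate(freq = n / sum(n)))) %>%
  select(carb, res) %>%
  unnest(res)
Both should reproduce the counts and proportions shown above, up to row order.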

Related

How to repeat the simulation 1000 times to have 1000 different datasets

I have one binary covariate x and want to use logistic regression to get a binary outcome y via the logistic formula. After that, I want to simulate data on the basis of the probability from the logistic equation.
I have tried, and I am able to simulate the first dataset, but repeating the process and adding a subject id and a simulation number is giving me problems.
simulation id y x
         1  1 1 0
         1  2 0 1
         1  3 0 0
         1  4 1 1
         1  5 0 1
         2  1 1 1
         2  2 0 0
         2  3 1 1
         2  4 1 1
         2  5 1 0
         3  1 1 1
         3  2 1 0
         3  3 1 0
         3  4 0 0
         3  5 0 1
The code is as follows:
nsample <- 100
set.seed(1234)
id <- 1:nsample
p <- 0.01
b0 <- log(p / (1 - p))   # intercept chosen so the baseline probability is p
b1 <- 0.5
b2 <- 0.1
b5 <- -4
x1 <- rbinom(nsample, 1, 0.4)
x2 <- rbinom(nsample, 1, 0.6)
z2 <- b0 + b1 * x1 + b2 * x2
p_vector <- 1 / (1 + exp(-z2))                        # inverse logit
y <- rbinom(n = nsample, size = 1, prob = p_vector)   # n = nsample, not length(nsample)
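One way to repeat this 1000 times while keeping track of the subject id and the simulation number is to wrap the single-dataset code in a function and map over the simulation index. A minimal sketch, using a hypothetical helper simulate_once() that mirrors the code above:
library(dplyr)
library(purrr)

simulate_once <- function(sim, nsample = 100,
                          b0 = log(0.01 / 0.99), b1 = 0.5, b2 = 0.1) {
  x1 <- rbinom(nsample, 1, 0.4)
  x2 <- rbinom(nsample, 1, 0.6)
  p_vector <- 1 / (1 + exp(-(b0 + b1 * x1 + b2 * x2)))
  tibble(simulation = sim,              # simulation number
         id         = 1:nsample,        # subject id within each simulation
         y          = rbinom(nsample, 1, p_vector),
         x1         = x1,
         x2         = x2)
}

set.seed(1234)
all_sims <- map_dfr(1:1000, simulate_once)  # 1000 stacked datasets, one block per simulation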

Generate percent using group_by and mutate

I am working on a dataset that contains the predicted label (predicted) versus the true label (label) for each id, plus a column indicating whether the predicted label equals the true label (match). I want to show the percentage of correct predictions for each label relative to the total number of observations belonging to that label.
As an example, given the following data:
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
label <- c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1)
predicted <- c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4)
match <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
dt <- data.frame(id, label, predicted, match)
head(dt)
  id label predicted match
1  1     6         6     1
2  2     5         5     1
3  3     1         1     1
4  4     5         3     0
5  5     4         2     0
6  6     2         2     1
If I group_by(label), count(label, predicted) and then mutate(percent = sum(match == 1)/sum(n)), I expect to obtain a new grouped data frame like this:
library(plyr)
library(dplyr)
dt %>% group_by(label) %>% dplyr::count(label, predicted) %>% mutate(percent = sum(match == 1)/sum(n))
dt
   id label predicted match percent
 1  3     1         1     1    0.67
 2  8     1         1     1    0.67
 3 10     1         4     0    0.67
 4  6     2         2     1    1.00
 5  7     3         3     1    1.00
 6  5     4         2     0    0.00
 7  4     5         3     0    0.50
 8  2     5         5     1    0.50
 9  9     6         4     0    0.50
10  1     6         6     1    0.50
However, my code gives me the following output instead:
dt
# A tibble: 6 x 4
# Groups:   label [5]
  label predicted     n percent
  <dbl>     <dbl> <int>   <dbl>
1  1.00      1.00     2   0.600
2  1.00      4.00     1   0.600
3  2.00      2.00     1   0.600
4  3.00      3.00     1   0.600
5  4.00      2.00     1   0.600
6  5.00      3.00     1   0.600
It calculated the percentage of correct predictions across all labels (hence every value is 0.600) instead of within each label. How should I modify my code to achieve my desired output?
I wasn't able to reproduce your output with the code that you shared. I think the following will accomplish what you are seeking, though (I used total as the variable name rather than n):
dt %>%
  arrange(label) %>%
  group_by(label) %>%
  mutate(total = n(),
         percent = sum(match == 1) / total)
# A tibble: 10 x 6
# Groups:   label [6]
      id label predicted match total percent
   <dbl> <dbl>     <dbl> <dbl> <int>   <dbl>
 1     3     1         1     1     3   0.667
 2     8     1         1     1     3   0.667
 3    10     1         4     0     3   0.667
 4     6     2         2     1     1       1
 5     7     3         3     1     1       1
 6     5     4         2     0     1       0
 7     2     5         5     1     2     0.5
 8     4     5         3     0     2     0.5
 9     1     6         6     1     2     0.5
10     9     6         4     0     2     0.5
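If a one-row-per-label accuracy table is preferred over keeping every observation, a summarise() variant along these lines should also work (a sketch, not part of the original answer):
library(dplyr)

dt %>%
  group_by(label) %>%
  summarise(total = n(),
            percent = mean(match == 1))  # proportion of correct predictions per label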

ANOSIM: Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...)

I am trying to perform ANOSIM ({vegan}) on my ecological data and I keep getting the same error message. I don't think this is a duplicate of a question already posted, and I would like to fully show what is happening.
I have a numeric dataframe ("sps") consisting of 17 rows (sites) and 313 columns (species), and a second dataframe ("env.data") containing a factor column with 17 levels. I therefore want to test whether there are any significant differences between my 17 groups.
Here is a sample of my data:
> sps[,2:5]
                          A. faranauti A. tecta A. lyra A. arbuscula
Sargasso Sea                         0        0       2            0
Equatorial Brazil                    0        0       0            0
Canarias Sea                         0        0       0            0
Corner Seamounts                     0        0       0            2
Gulf of Mexico                       0        0       0            0
Labrador Sea                         0        0       0            0
Equatorial Africa                    0        0       0            0
Tropic Seamount                      0        0       0          107
NewEngland Seamount Chain            0        0       0            0
Norwegian Basin                      0        0       0            0
Eastern North Atlantic               0        0       3            0
Logachev and BritishIsles            0        0       0            4
Reykjanes Ridge                      0        0       0            0
MAR North                            0        0       0           14
Flemish Cap                          0        0       0          217
MAR South                            1        1       0            0
Azores Seamount Chain                0        0       0           12
> class(sps)
[1] "data.frame"
> head(env.data)
  idcell          geo_area
1      1      Sargasso Sea
2      2 Equatorial Brazil
3      3      Canarias Sea
4      4  Corner Seamounts
5      5    Gulf of Mexico
6      6      Labrador Sea
> str(env.data)
'data.frame': 17 obs. of 2 variables:
$ idcell : Factor w/ 17 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ geo_area: Factor w/ 17 levels "Canarias Sea",..: 15 5 1 2 7 8 4 17 12 13..
Following {vegan}, I have first calculated a dissimilarity matrix with Sorensen as the distance method. I then use this dissimilarity matrix as my input for anosim:
dist.sorensen <- vegdist(sps, method = "bray", binary = TRUE, na.rm = TRUE,
                         diag = TRUE)
sorensen.anosim <- anosim(dat = dist.sorensen, env.data$geo_area,
                          permutations = 999)
> summary(sorensen.anosim)

Call:
anosim(dat = dist.sorensen, grouping = env.data$geo_area, permutations = 999)
Dissimilarity: binary bray

ANOSIM statistic R:
     Significance: 0.001

Permutation: free
Number of permutations: 999

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
  'x' must be atomic
I have also tried anosim with the raw species data and I get the same error:
raw.anosim <- anosim(sps, env.data$geo_area, permutations = 999, distance = "bray")
Any ideas? My "sps" dataframe (x) is numeric. My "env.data" dataset (groupings) has a factor column with 17 levels. I can't see where the error comes from, unless it's intrinsic to my data. Many of the 313 species listed in my original dataframe have been recorded only once across my 17 sites (very probably due to sampling bias). However, I get clusters after performing "vegdist (Sorensen index)" and "hclust".
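For comparison, here is a minimal sketch of the same Sorensen/ANOSIM workflow on vegan's built-in dune data, where every group contains several sites; it only illustrates the intended call pattern, not a fix for the error above:
library(vegan)

data(dune)      # 20 sites x 30 species
data(dune.env)  # Management: a factor with 4 levels, several sites per level

dist.sorensen <- vegdist(dune, method = "bray", binary = TRUE)  # binary Bray-Curtis = Sorensen
dune.anosim <- anosim(dist.sorensen, dune.env$Management, permutations = 999)
summary(dune.anosim)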

Generate all possible column vectors in matlab

I am essentially trying to figure out how to generate code for basis vectors of different configurations of M objects into N different states (for example, if I had 2 snacks to split between 2 kids, I could have (2,0), (0,2) or (1,1); a terrible example, but that's the idea).
I am struggling to figure out how to do this without writing many different loops (I want this to be automatic). The idea would be to create a matrix where each row is a vector of length M. I would start with vec(1) = N, then use an if statement: if sum(vec) == N, Matrix(1,:) = vec. Then I could take vec(1) = N-i and do the same.
My only issue is that I do not see how to set this up once and let it run, so that if I had, say, 2 objects in 5 locations, I would also get configurations like (1 0 0 0 1).
I am not seeing how to do this.
You could use a recursive function:
function out = combos(M,N)
if N == 1
    out = M;
else
    out = [];
    for i = 0:M
        subout = combos(M-i, N-1);
        subout(:,end+1) = i;
        out = [out; subout];
    end
end
I think this does what you want.
The key idea is to generate not the number of elements in each group, but the split points between groups. This can be done via combinations with repetition. Matlab's nchoosek generates combinations without repetition, but these are easily converted into what we need.
M = 5; % number of objects
N = 3; % number of groups
t = nchoosek(1:M+N-1, N-1);   % combinations without repetition...
t = bsxfun(@minus, t, 1:N-1); % ...convert into combinations with repetition
t = diff([zeros(size(t,1), 1) t repmat(M, size(t,1), 1)], [], 2); % the size of each
% group is the distance between split points
In this example, the result is
t =
0 0 5
0 1 4
0 2 3
0 3 2
0 4 1
0 5 0
1 0 4
1 1 3
1 2 2
1 3 1
1 4 0
2 0 3
2 1 2
2 2 1
2 3 0
3 0 2
3 1 1
3 2 0
4 0 1
4 1 0
5 0 0
This is a similar approach to Luis' without bsxfun. Because we don't like fun.
n = 5;
k = 3;
c = nchoosek(n+k-1, k-1);
result = diff([zeros(c, 1) nchoosek(1:(n+k-1), k-1) ones(c, 1)*(n+k)], [], 2) - 1;
This creates the partitions of the integer n with length k. Given an array of length n + (k-1), we find all combinations of (k-1) places to place partitions between the (unary) integers. For 5 items and 3 locations, we have 7 choices of where to put the partitions:
[ 0 0 0 0 0 0 0 ]
If our chosen combination is [2 4], we replace positions 2 and 4 with partitions to look like this:
[ 0 | 0 | 0 0 0 ]
The 0's give the value in unary, so this combination is 1 1 3. To recover the values easily, we just augment the combinations with imaginary partitions at the next values to the left and right of the array (0 and n+k), take the difference, and subtract 1 (because the partitions themselves don't contribute to the value):
diff([0 2 4 8]) - 1
ans =
1 1 3
By sliding the partitions in to each possible combination of positions, we get all of the partitions of n.
Output:
result =
0 0 5
0 1 4
0 2 3
0 3 2
0 4 1
0 5 0
1 0 4
1 1 3
1 2 2
1 3 1
1 4 0
2 0 3
2 1 2
2 2 1
2 3 0
3 0 2
3 1 1
3 2 0
4 0 1
4 1 0
5 0 0

How to flatten a data frame in apache spark | Scala

I have the following data frame:
df1
uid text frequency
  1    a         1
  1    b         0
  1    c         2
  2    a         0
  2    b         0
  2    c         1
I need to flatten it on the basis of uid to:
df2
uid a b c
  1 1 0 2
  2 0 0 1
I've worked along similar lines in R but haven't been able to translate it into SQL or Scala.
Any suggestions on how to approach this?
You can group by uid, use text as a pivot column and sum frequencies:
df1
  .groupBy("uid")
  .pivot("text")
  .sum("frequency")
  .show()