How to stop dplyr mutate from calculating a value for a missing group - group-by

I have data looking like this:
df=data.frame(a=1:6,b=rep(c("one","two"),each=3))
df[2,2]<-NA
I want to calculate the mean of each group for each row, like this:
df %>% group_by(b) %>% mutate(mean=mean(a))
The problem is that R views the NA as a group. Desired output would be
mean=c(2,NA,2,5,5,5).
Attempt:
df %>% group_by(b) %>% mutate(mean=if_else(b==NA,NA,mean(a)))
but this throws an error

Try
df %>% group_by(b) %>% mutate(mean=mean(a)) %>% mutate(mean = if_else(is.na(b), NA_real_, mean))

If you want to avoid error messages:
library(hablar)
df %>%
  convert(chr(b)) %>%
  group_by(b) %>%
  mutate(mean = if_else_(!is.na(b), mean(a), NA))

Omit categories from table

The following code
library(tidyverse)
library(gtsummary)
df <- tibble(category = c("a", "a", "a",
                          "b", "b", "b", "b"))
output_table <- df %>%
  tbl_summary()
output_table
produces this table.
Is it possible to remove the "a" category from this table without changing the associated frequencies? So in this case, the final table should look like this (but without the extra whitespace).
Using the tbl_summary(value=) argument, you can select a single level of a categorical variable to display in the table.
library(gtsummary)
trial %>%
  select(grade) %>%
  # show only one level for grade
  tbl_summary(value = grade ~ "I",
              label = grade ~ "Grade I")
You can also just delete a single row from the output. But you'll need to install the development version of the package to use the new function modify_table_body().
remotes::install_github("ddsjoberg/gtsummary")
trial %>%
  select(grade) %>%
  tbl_summary() %>%
  # remove grade I row
  modify_table_body(filter, !(variable == "grade" & label == "I"))

How to create a dataframe from one column in pyspark?

I have sliced out one column of type Column in pyspark.
x = game_reviews.groupBy("product_id_index").agg(F.count('star_rating').alias('num'))
x.num
gives
Column<b'num'>
But this
new_df = spark.createDataFrame(x.num)
new_df.show()
gives error.
What you want to achieve is a simple one-liner. Good luck!
new_df = game_reviews.groupBy("product_id_index").agg(F.count('star_rating').alias('num')).select("num")
new_df.show()

Drop a list of columns from a single dataframe in Spark

I have a dataframe df3 resulting from a join of two dataframes, df1 and df2. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove from the join all the df1 columns whose names are in df2.columns. Is there a way to do this without using a var?
Currently I've done this
var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))
but what I really want is just a shortcut for
df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2)))....
without using a var.
Passing a list of columns is not an option; I don't know if that's because I'm using Spark 2.2.
EDIT:
Important note: I don't know in advance the columns of df1 and df2
This is possible to achieve while you are performing the join itself. Please try the code below:
val resultDf = df1.alias("frstdf")
  .join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer")
  .selectExpr("scndf.col1", "scndf.col2"...) //.selectExpr("scndf.*")
This would only contain the columns from the second data frame. Hope this helps
A shortcut would be:
val ret = df2.columns.foldLeft(df3)((acc,coln) => acc.drop(df2(coln)))
I would suggest removing the columns before the join (see the sketch below). Alternatively, select only the columns of df3 which come from df2:
val ret = df3.select(df2.columns.map(col): _*)
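For the "remove the columns before the join" idea, here is a minimal sketch. The join key "id" and the variable names are illustrative assumptions, not taken from the question:
// Illustrative sketch: "id" is an assumed join key present in both dataframes.
// Drop from df1 every column that also exists in df2 (except the key),
// so the joined result keeps df2's version of the shared columns.
val shared = df2.columns.filterNot(_ == "id")
val df1Pruned = df1.drop(shared: _*)       // varargs drop by column name (Spark 2.x+)
val df3 = df1Pruned.join(df2, Seq("id"))   // no ambiguous columns left to resolve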

Spark Dataframe: Calculate variance between groups

With R's dplyr I would calculate variance between groups like so:
df %>% group_by(group) %>% summarise(total = sum(value)) %>% summarise(variance_between_groups = var(total))
Trying to perform the same action with Sparks DataFrame API:
df.groupBy(group).agg(sum(value).alias("total")).agg(var_samp(total).alias("variance_between_groups"))
I receive an error in the second agg saying that it can't find total. I am clearly misunderstanding something so any help would be appreciated.
var_samp() takes a String-type column name, hence you need to provide a String as follows:
import org.apache.spark.sql.functions._
val df = Seq(
  ("a", 1.0),
  ("a", 2.5),
  ("a", 1.5),
  ("b", 2.0),
  ("b", 1.6)
).toDF("group", "value")
df.groupBy("group").
  agg(sum("value").alias("total")).
  agg(var_samp("total").alias("variance_between_groups")).
  show
// +-----------------------+
// |variance_between_groups|
// +-----------------------+
// | 0.9799999999999999|
// +-----------------------+
It can also take a column (of Column type), e.g. var_samp($"total"). See Spark's API doc for more details.
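For completeness, a small sketch of the Column-based form, assuming the same df as above (outside the shell you also need import spark.implicits._ for the $ syntax):
import spark.implicits._   // for $"..." column references
df.groupBy($"group").
  agg(sum($"value").alias("total")).
  agg(var_samp($"total").alias("variance_between_groups")).
  show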

Check a condition for two columns in two different dataframes in Spark

Suppose there is one column in a dataframe and a column with a similar schema in another dataframe. How can I check whether the values contained in the two columns are the same or not without joining them, as there is no common attribute?
DF1
serial_nm
abc
mnc
pqr
DF2
ser_nm
hgf
mnc
uio
pqr
lok
And I want a third dataframe, DF3, as output
DF3
mnc
pqr
I tried this
val DF3 = DF1.filter(DF1("serial_nm") === DF2("ser_nm"))
But it's not working
Please Help
Thanks..!!
I believe you can use a join. Consider using it like this:
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"))
or
val DF3 = DF1.join(DF2).where(DF1("serial_nm") === DF2("ser_nm"))
Both approaches are equivalent.
Note: To avoid problems with ambiguous columns when the two dataframes share a column name, one option is to rename the column before the join (here the names already differ, so this is shown only as a pattern):
val df2_renamed = DF2
  .withColumnRenamed("ser_nm", "df2_ser_nm")