R extract most common word(s) / ngrams in a column by group - tm

I wish to extract the main keywords from the column 'title' for each group (the first column).
The desired result would be a new column, 'desired title', holding the main keyword(s) for each group.
Reproducible data:
myData <-
structure(list(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), title = c("mentoring aug 8th 2018",
"mentoring aug 9th 2017", "mentoring aug 9th 2018", "mentoring august 31",
"mentoring blue care", "mentoring cara casual", "mentoring CDP",
"mentoring cell douglas", "mentoring centurion", "mentoring CESO",
"mentoring charlotte", "medication safety focus", "medication safety focus month",
"medication safety for nurses 2017", "medication safety formulations errors",
"medication safety foundations care", "medication safety general",
"communication surgical safety", "communication tips", "communication tips for nurses",
"communication under fire", "communication webinar", "communication welling",
"communication wellness")), row.names = c(NA, -24L), class = c("tbl_df",
"tbl", "data.frame"))
I've looked into record linkage solutions, but that's mainly for grouping the full titles.
Any suggestions would be great.

I concatenated all titles by group, and tokenized them:
library(dplyr)
library(tidytext)

myData <-
  myData %>%
  group_by(group) %>%
  mutate(titles = paste0(title, collapse = " ")) %>%
  select(group, titles) %>%
  distinct()

myTokens <- myData %>%
  unnest_tokens(word, titles) %>%
  anti_join(stop_words, by = "word")

myTokens
myTokens is the resulting dataframe, with one token (word) per row for each group.
# finding top ngrams
library(textrank)
stats <- textrank_keywords(myTokens$word, ngram_max = 3, sep = " ")
stats <- subset(stats$keywords, ngram > 0 & freq >= 3)
head(stats, 5)
I'm happy with the result.
To apply the algorithm to my real data of about 100,000 rows, I wrote a function to tackle the problem group by group:
# FUNCTION: TOP NGRAMS ----
find_top_ngrams <- function(titles_concatenated) {
  myTest <-
    titles_concatenated %>%
    as_tibble() %>%
    unnest_tokens(word, value) %>%
    anti_join(stop_words, by = "word")

  stats <- textrank_keywords(myTest$word, ngram_max = 4, sep = " ")
  stats <- subset(stats$keywords, ngram > 1 & freq >= 5)

  top_ngrams <- head(stats, 5)
  top_ngrams <- tibble(top_ngrams)
  return(top_ngrams)
}

for (i in 1:5) {
  # print() is needed for the result to be displayed inside a for loop
  print(find_top_ngrams(myData$titles[i]))
}
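To avoid the explicit loop and keep the results labelled by group, one option is the sketch below (assuming the myData, find_top_ngrams and library calls defined above); dplyr::group_modify() applies the function once per group and row-binds the results:
# A sketch, not from the original post: run find_top_ngrams() once per group.
# Note the freq/ngram thresholds inside the function may yield zero rows on
# the small example data.
library(dplyr)

top_ngrams_by_group <- myData %>%
  group_by(group) %>%
  group_modify(~ find_top_ngrams(.x$titles)) %>%
  ungroup()

top_ngrams_by_group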


Is it bad to use `GroupBy` multiple times in pyspark?

This is an educational question.
I have a text file containing several records of the power consumption of factories, each identified by a unique id. The file contains the following columns:
factory_id, city, country, date, consumption
where date is in the format mm/YYYY. I want to compute which countries have fewer than 20 cities (including those with 0) that experienced a decrease in factories' consumption over two consecutive years, where a city's consumption is simply the total yearly consumption of the factories located in that city.
To do this, I used groupBy + agg multiple times, as follows:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = df.withColumn("year", F.split("Date", "/")[1])

# compute the yearly consumption of each city
df_consump = df.groupBy("Country", "City", "year").agg(
    F.sum("consumption").alias("consumption")
)

@F.udf(returnType=T.IntegerType())
def had_a_decrease(structs):
    structs = sorted(structs, key=lambda s: s.year)
    # return 0 if the list is monotonically growing, 1 otherwise
    cur_cons = structs[0].consumption
    for struct in structs[1:]:
        cons = struct.consumption
        if cons <= cur_cons:
            return 1
        cur_cons = cons
    return 0

df_cons_decrease = df_consump.groupBy("Country", "City").agg(
    # collect a list of (year, consumption) structs; the struct is needed
    # because collect_list doesn't guarantee the order is respected, so the
    # year is kept to sort this (small) list inside the udf "had_a_decrease"
    # defined above. This yields a column with 1 if the city had a decrease,
    # 0 otherwise, which is summed afterwards.
    had_a_decrease(F.collect_list(F.struct("year", "consumption"))).alias("had_decrease")
)

df_cons_decrease.groupBy("Country").agg(
    F.sum("had_decrease").alias("num_cities_with_decrease")
).filter("num_cities_with_decrease < 20") \
    .write.csv(outputFolder)
However, I was wondering:
is this bad practice (e.g. inefficient)?
are DataFrames better suited than RDDs for this?
would you recommend a better approach than grouping this many times?
Compare the consumption with the consumption 1 year and 2 years ago by using Window and lag functions without a udf, and then group by.
from pyspark.sql import Window
import pyspark.sql.functions as f

data = [
    [1, 1, 1, '01/2022', 100],
    [1, 1, 1, '01/2021', 90],
    [1, 1, 1, '01/2020', 80],
    [1, 1, 2, '01/2022', 100],
    [1, 1, 2, '01/2021', 110],
    [1, 1, 2, '01/2020', 120]
]
cols = ['factory_id', 'city', 'country', 'date', 'consumption']

df = spark.createDataFrame(data, cols) \
    .withColumn('year', f.split('date', '/')[1])

w = Window.partitionBy('country', 'city').orderBy('year')

df.groupBy('country', 'city', 'year') \
    .agg(f.sum('consumption').alias('consumption')) \
    .withColumn('consumption-1', f.lag('consumption', 1).over(w)) \
    .withColumn('consumption-2', f.lag('consumption', 2).over(w)) \
    .withColumn('is_decreased', f.expr('if(`consumption` < `consumption-1` and `consumption-1` < `consumption-2`, true, false)')) \
    .filter('is_decreased = true') \
    .select('country', 'city').distinct() \
    .groupBy('country').count() \
    .filter('count < 20') \
    .select('country') \
    .show()
+-------+
|country|
+-------+
| 2|
+-------+

summary row with gtsummary

I am trying to create a table of events with gtsummary and would like a final row counting the events of the previous rows. add_overall() and add_n() do add a total, but as a column counting the same event across groups, not the overall number of events.
I created this example.
library(gtsummary)
library(forcats)  # for as_factor()

x1 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.85, 0.15))
x2 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.9, 0.1))
x3 <- sample(c("No", "Yes"), 30, replace = TRUE, prob = c(0.75, 0.25))
y  <- sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))

df <- data.frame(as_factor(x1), as_factor(x2), as_factor(x3), as_factor(y))
colnames(df) <- c("event_1", "event_2", "event_3", "group")

tbl_summary(df, by = group, statistic = all_categorical() ~ "{n}")
I tried using the summary_rows() function from the gt package after converting the table to a gt object, but there is an error when summarising because these variables are factors.
Any other ideas?
You can do this by adding a new variable to your data frame that is the row sum of each of the events. Then you can display that variable's sum in the summary table. Example below!
library(gtsummary)
library(tidyverse)

df <-
  data.frame(
    event_1 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.85, 0.15)),
    event_2 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.9, 0.1)),
    event_3 = sample(c(FALSE, TRUE), 30, replace = TRUE, prob = c(0.75, 0.25)),
    group = sample(c("A", "B"), 30, replace = TRUE, prob = c(0.5, 0.5))
  ) |>
  rowwise() |>
  mutate(Total = sum(event_1, event_2, event_3))

tbl_summary(
  df,
  by = group,
  type = Total ~ "continuous",
  statistic =
    list(all_categorical() ~ "{n}",
         all_continuous() ~ "{sum}")
) |>
  as_kable() # convert to kable to display on Stack Overflow
Characteristic   A, N = 16   B, N = 14
event_1          4           4
event_2          1           2
event_3          7           6
Total            12          12
Created on 2023-01-12 with reprex v2.0.2
Thank you so much (great package, gtsummary), that works! I had some trouble summing over factors. If the variables are factors, the code
mutate(Total = sum(event_1 == "Yes", event_2 == "Yes", event_3 == "Yes"))
does it, as sketched below.
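For completeness, a minimal sketch of that factor-based variant (assuming df is the "No"/"Yes" factor data frame from the question, with the column names used there):
library(gtsummary)
library(dplyr)

# Hypothetical illustration: same summary table as above, but the events are
# "No"/"Yes" factors, so Total counts the "Yes" values per row.
df_factors <- df |>
  rowwise() |>
  mutate(Total = sum(event_1 == "Yes", event_2 == "Yes", event_3 == "Yes")) |>
  ungroup()

tbl_summary(
  df_factors,
  by = group,
  type = Total ~ "continuous",
  statistic = list(all_categorical() ~ "{n}",
                   all_continuous() ~ "{sum}")
)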

Polars Dataframe: Apply MinMaxScaler to a column with condition

I am trying to perform the following operation in Polars.
Values in column B below 80 should be scaled between 1 and 4, whereas anything of 80 or above should be set to 5.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_pandas = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "B": [50, 300, 80, 12, 105, 78, 66, 42, 61.5, 35],
    }
)

test_scaler = MinMaxScaler(feature_range=(1, 4))  # from sklearn.preprocessing
df_pandas.loc[df_pandas['B'] < 80, 'Test'] = test_scaler.fit_transform(
    df_pandas.loc[df_pandas['B'] < 80, "B"].values.reshape(-1, 1)
)
df_pandas = df_pandas.fillna(5)
This is what I did with Polars:
import numpy as np
import polars as pl

# dt is a dictionary
dt = df.filter(
    pl.col('B') < 80
).to_dict(as_series=False)

below_80 = list(dt.keys())

dt_scale = list(
    test_scaler.fit_transform(
        np.array(dt['B']).reshape(-1, 1)
    ).reshape(-1)  # reshape back to one dimensional
)

# reassign to dictionary dt
dt['B'] = dt_scale
dt_scale_df = pl.DataFrame(dt)
dt_scale_df

dummy = df.join(
    dt_scale_df, how="left", on="A"
).fill_null(5)
dummy = dummy.rename({"B_right": "Test"})
Result:
A     B      Test
1     50.0   2.727273
2     300.0  5.000000
3     80.0   5.000000
4     12.0   1.000000
5     105.0  5.000000
6     78.0   4.000000
7     66.0   3.454545
8     42.0   2.363636
9     61.5   3.250000
10    35.0   2.045455
Is there a better approach for this?
Alright, I have three examples for you that should help, of which the last should be preferred.
Because you only want to apply your scaler to a part of a column, we should ensure we only send that part of the data to the scaler. This can be done by:
window function over a partition
partition_by
when -> then -> otherwise + min_max expression
Window function over a partition
This requires a Python function that is applied over the partitions. Inside the function we then check which partition we are in and handle it accordingly.
df = pl.from_pandas(df_pandas)
min_max_sc = MinMaxScaler((1, 4))

def my_scaler(s: pl.Series) -> pl.Series:
    if s.len() > 0 and s[0] > 80:
        out = (s * 0 + 5)
    else:
        out = pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    # ensure all types are the same
    return out.cast(pl.Float64)

df.with_column(
    pl.col("B").apply(my_scaler).over(pl.col("B") < 80).alias("Test")
)
partition_by
This partitions the original dataframe into a dictionary holding the different partitions. We then only modify the partitions as needed.
parts = (df
    .with_column((pl.col("B") < 80).alias("part"))
    .partition_by("part", as_dict=True)
)

parts[True] = parts[True].with_column(
    pl.col("B").map(
        lambda s: pl.Series(min_max_sc.fit_transform(s.to_numpy().reshape(-1, 1)).flatten())
    ).alias("Test")
)

parts[False] = parts[False].with_column(
    pl.lit(5.0).alias("Test")
)

pl.concat([df for df in parts.values()]).select(pl.all().exclude("part"))
when -> then -> otherwise + min_max expression
This one I like best. We can write a function that creates a polars expression implementing the min-max scaling you need. This will have the best performance.
def min_max_scaler(col: str, predicate: pl.Expr) -> pl.Expr:
    x = pl.col(col)
    x_min = x.filter(predicate).min()
    x_max = x.filter(predicate).max()
    # * 3 + 1 to set scale between 1 - 4
    return (x - x_min) / (x_max - x_min) * 3 + 1

predicate = pl.col("B") < 80

df.with_column(
    pl.when(predicate)
    .then(min_max_scaler("B", predicate))
    .otherwise(5)
    .alias("Test")
)

For loop to fill NA with dates

I have a DF of consecutive dates with some NAs following; worth mentioning is that position [1] will always have a date.
I wish to fill the following NAs with consecutive dates.
I have : "2016-01-01", "2016-01-02", "2016-01-03", NA, NA
Expected outcome within the DF will be:
"2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"
I attempted to use a for loop, but I am fairly new at this:
Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)

i <- which.max(is.na(DF$Date))
for (i in which.max(is.na(DF$Date)):max(length(DF$Date))) {
  if (is.na(DF$Date)) {
    DF$Date[i] = as.Date(max(DF$Date, na.rm = TRUE) + 1)
  }
}
It returns this:
Warning messages:
1: In if (is.na(DF$Date)) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(DF$Date)) { :
the condition has length > 1 and only the first element will be used
Found it out myself: the if condition wasn't a single value, it was a vector of TRUE/FALSE over the whole DF testing for NA.
Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)

i <- which.max(is.na(DF$Date))
for (i in which.max(is.na(DF$Date)):max(length(DF$Date))) {
  if (which.max(is.na(DF$Date))) {
    DF$Date[i] = as.Date(max(DF$Date, na.rm = TRUE) + 1)
  }
}
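A vectorized alternative (a sketch, not part of the original answer): because the NAs are trailing and the filled dates are consecutive, you can add one day per missing row to the last known date, without a loop:
Date <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", NA, NA))
DF <- as.data.frame(Date)

na_idx <- which(is.na(DF$Date))            # positions of the trailing NAs
if (length(na_idx) > 0) {
  last_date <- max(DF$Date, na.rm = TRUE)  # last non-NA date
  DF$Date[na_idx] <- last_date + seq_along(na_idx)
}
DF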

How to fix the order of scale in rCharts nPlot?

Currently I've got a data set that looks like this:
df <- data.frame(Time = c("2013-07", "2013-07", "2013-07", "2013-10", "2014-01", "2014-05", "2014-05", "2014-05"),
                 local = "ABC",
                 Point = c("Point1", "Point2", "Point3", "Point3", "Point3", "Point1", "Point2", "Point3"),
                 Part1 = c(NaN, NaN, NaN, NaN, NaN, 1, 1, NaN),
                 Part2 = c(NaN, 2, 11, 4, 2, NaN, 1, 1),
                 Part3 = c(4, NaN, NaN, NaN, NaN, 1, 1, NaN))
I'm trying to make a bar plot using rCharts in RStudio.
n1 <- nPlot(Part2 ~ Time, group = "Point", data = df, type = "multiBarChart")
n1
The output looks like what I want except for one thing.
Ideally the order of the x axis should be 2013-07, 2013-10, 2014-01, 2014-05.
But the one I got is 2013-07, 2014-05, 2013-10, 2014-01.
I have also tried converting the "Time" variable into a Date or POSIXct format; things turn out the same.
So can anybody help me with this?
Is there any help file for rCharts with all possible functions, arguments and customization explanations?
Thanks in advance
I think the primary issue is missing data, which nvd3 does not like. I changed the structure of the data slightly with expand.grid to make sure that there is a point for each date, and in that case nvd3 sorts as expected whether we hand it a numeric or character date.
Here is the code:
library(rCharts)

df <- data.frame(Time = c("2013-07", "2013-07", "2013-07", "2013-10", "2014-01", "2014-05", "2014-05", "2014-05"),
                 local = "ABC",
                 Point = c("Point1", "Point2", "Point3", "Point3", "Point3", "Point1", "Point2", "Point3"),
                 Part1 = c(NaN, NaN, NaN, NaN, NaN, 1, 1, NaN),
                 Part2 = c(NaN, 2, 11, 4, 2, NaN, 1, 1),
                 Part3 = c(4, NaN, NaN, NaN, NaN, 1, 1, NaN))

#df$Time <- as.Date( paste0(as.character(df$Time), "-01" ) )

df2 <- merge(
  structure(expand.grid(unique(df$Time), unique(df$Point)), names = c("Time", "Point"))
  , df
  , all = T
)
#df2[, 4:6] <- lapply(df2[, 4:6], function(x){ ifelse(is.na(x), 0, x) })

n1 <- nPlot(Part2 ~ Time, group = "Point", data = df2, type = "multiBarChart")
n1$xAxis(
  #"#! function(d){ return d3.time.format('%Y-%m')(new Date( d*60*60*24*1000 ) ) } !#"
  "#! function(d){ return d3.time.format('%Y-%m')(function(d){ return d3.time.format('%Y-%m')(d3.time.format('%Y-%m').parse(d) ) }) } !#"
)
n1