R summarise by group sum giving NA - group-by

I have a data frame like this:
Observations: 2,190,835
Variables: 13
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489…
$ preparationid <dbl> 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1…
$ doseday <int> 90, 90, 91, 91, 92, 92, 92, 92, 93, 93, 93, 93, 94, 94, 94, 94, 95, 95, 95, 95, 99, 99, 100, 100, 10…
$ route <fct> enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., …
$ enteral <fct> t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t…
$ energy_kcal_kg <dbl> 0.00, 13.56, 0.00, 13.56, 0.00, 13.49, 0.00, 13.49, 0.00, 13.35, 0.00, 13.35, 0.00, 12.95, 0.00, 12.…
$ prot_g_kg <dbl> 0.000, 0.366, 0.000, 0.366, 0.000, 0.365, 0.000, 0.365, 0.000, 0.361, 0.000, 0.361, 0.000, 0.350, 0.…
$ lipids_g_kg <dbl> 0.000, 0.495, 0.000, 0.495, 0.000, 0.492, 0.000, 0.492, 0.000, 0.487, 0.000, 0.487, 0.000, 0.472, 0.…
$ K_mmol_kg <dbl> 0.000, 0.385, 0.000, 0.385, 0.000, 0.383, 0.000, 0.383, 0.000, 0.379, 0.000, 0.379, 0.000, 0.368, 0.…
$ Na_mmol_kg <dbl> 0.0000, 0.1832, 0.0000, 0.1832, 0.0000, 0.1823, 0.0000, 0.1823, 0.0000, 0.1804, 0.0000, 0.1804, 0.00…
$ Ca_mg_kg <dbl> 0.00, 10.99, 0.00, 10.99, 0.00, 10.94, 0.00, 10.94, 0.00, 10.82, 0.00, 10.82, 0.00, 10.50, 0.00, 10.…
$ P_mg_kg <dbl> 0.00, 8.25, 0.00, 8.25, 0.00, 8.20, 0.00, 8.20, 0.00, 8.12, 0.00, 8.12, 0.00, 7.88, 0.00, 7.88, 0.00…
$ Pi_mmol_kg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
I need to calculate the daily sum of nutrient intakes for each patient, and I've been using the code below.
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(energy_kcal_kg_d = sum(energy_kcal_kg),
            protein_g_kg_d = sum(prot_g_kg),
            lipids_g_kg_d = sum(lipids_g_kg),
            na_total_mmol_kg_d = sum(Na_mmol_kg),
            K_total_mmol_kg_d = sum(K_mmol_kg),
            Ca_mg_total_kg_d = sum(Ca_mg_kg),
            P_mg_kg_d = sum(P_mg_kg),
            Pi_mmol_kg_d = sum(Pi_mmol_kg))
The code seems to be working in some way, since the grouping looks fine, but the daily sums are all missing (NA), as shown below. What is wrong here?
Variables: 11
Groups: patientid, doseday [30,991]
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, …
$ doseday <int> 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, …
$ enteral <fct> f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, …
$ energy_kcal_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ protein_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ lipids_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ na_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ K_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Ca_mg_total_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ P_mg_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Pi_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

By default, sum() returns NA whenever any of its inputs is NA, so a single missing value in a group makes that whole group's sum NA. Add na.rm = TRUE to drop missing values before summing.
Try this:
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(
    energy_kcal_kg_d = sum(energy_kcal_kg, na.rm = TRUE),
    protein_g_kg_d = sum(prot_g_kg, na.rm = TRUE),
    lipids_g_kg_d = sum(lipids_g_kg, na.rm = TRUE),
    na_total_mmol_kg_d = sum(Na_mmol_kg, na.rm = TRUE),
    K_total_mmol_kg_d = sum(K_mmol_kg, na.rm = TRUE),
    Ca_mg_total_kg_d = sum(Ca_mg_kg, na.rm = TRUE),
    P_mg_kg_d = sum(P_mg_kg, na.rm = TRUE),
    Pi_mmol_kg_d = sum(Pi_mmol_kg, na.rm = TRUE)
  )
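For example, sum(c(1, 2, NA)) returns NA, while sum(c(1, 2, NA), na.rm = TRUE) returns 3. If you are on dplyr 1.0 or later, the same fix can be written more compactly with across(); this is only an optional sketch, and note that the .names spec below produces suffixed versions of the original column names (e.g. prot_g_kg_d) rather than the renamed ones above:

nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(
    across(
      c(energy_kcal_kg, prot_g_kg, lipids_g_kg, Na_mmol_kg,
        K_mmol_kg, Ca_mg_kg, P_mg_kg, Pi_mmol_kg),
      ~ sum(.x, na.rm = TRUE),
      .names = "{.col}_d"  # append _d to each summarised column
    ),
    .groups = "drop"
  )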

Related

Element-wise addition of lists in Pyspark Dataframe

I have a dataframe:
data = [{"category": 'A', "bigram": 'delicious spaghetti', "vector": [0.01, -0.02, 0.03], 'all_vector' : 2},
{"category": 'A', "bigram": 'delicious dinner', "vector": [0.04, 0.05, 0.06], 'all_vector' : 2},
{"category": 'B', "bigram": 'new blog', "vector": [-0.14, -0.15, -0.16], 'all_vector' : 2},
{"category": 'B', "bigram": 'bright sun', "vector": [0.071, -0.09, 0.063], 'all_vector' : 2}
]
sdf = spark.createDataFrame(data)
+----------+-------------------+--------+---------------------+
|all_vector|bigram |category|vector |
+----------+-------------------+--------+---------------------+
|2 |delicious spaghetti|A |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[0.071, -0.09, 0.063]|
+----------+-------------------+--------+---------------------+
I need to element-wise add the lists in the vector column and divide by the all_vector column (I need to normalize the vector), then group by the category column. I wrote some example code, but unfortunately it doesn't work:
#udf_annotator(returnType=ArrayType(FloatType()))
def result_vector(vector, all_vector):
    lst = [sum(x) for x in zip(*vector)] / all_vector
    return lst

sdf_new = sdf\
    .withColumn('norm_vector', result_vector(F.col('vector'), F.col('all_vector')))\
    .withColumn('rank', F.row_number().over(Window.partitionBy('category')))\
    .where(F.col('rank') == 1)
I want it this way:
+----------+-------------------+--------+-----------------------+---------------------+
|all_vector|bigram |category|norm_vector |vector |
+----------+-------------------+--------+-----------------------+---------------------+
|2 |delicious spaghetti|A |[0.05, 0.03, 0.09] |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.05, 0.03, 0.09] |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.069, -0.24, -0.097]|[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[-0.069, -0.24, -0.097]|[0.071, -0.09, 0.063]|
+----------+-------------------+--------+-----------------------+---------------------+
The zip_with function zips two arrays and applies a function element-wise. To use it here, we can collect the arrays from the vector column into an array of arrays per category, then fold over that collection with the aggregate function. There might be simpler ways to do this, though.
from pyspark.sql import functions as func
from pyspark.sql import Window as wd

data_sdf. \
    withColumn('vector_collection',
               func.collect_list('vector').over(wd.partitionBy('cat'))). \
    withColumn('ele_wise_sum',
               func.expr('''
                   aggregate(vector_collection,
                             cast(array() as array<double>),
                             (x, y) -> zip_with(x, y, (a, b) -> coalesce(a, 0) + coalesce(b, 0))
                   )
               ''')). \
    show(truncate=False)
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |cat|vector |vector_collection |ele_wise_sum |
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |B |[-0.14, -0.15, -0.16]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |B |[0.071, -0.09, 0.063]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |A |[0.01, -0.02, 0.03] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# |A |[0.04, 0.05, 0.06] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# +---+---------------------+----------------------------------------------+-------------------------------------+
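The snippet above only computes the element-wise sum; the question also asks for that sum to be divided by all_vector. Assuming the dataframe keeps both the ele_wise_sum and all_vector columns, and that you are on Spark 2.4+ for higher-order functions, a rough (untested) follow-up sketch:

# divide every element of ele_wise_sum by the row's all_vector value
data_sdf = data_sdf. \
    withColumn('norm_vector',
               func.expr('transform(ele_wise_sum, x -> x / all_vector)'))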

Partial Replication of DataFrame rows

I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021
M, N, 05/01/2021
I want to transform it to the following: for each source row, the first two columns are replicated and the date is incremented until a fixed date (07/01/2021 in this example).
To Result:
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021
A, B, 03/01/2021
A, B, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021
Any idea how this can be achieved in Scala Spark?
I found this link, Replicate Spark Row N-times, but there is no hint on how a particular column can be incremented during the replication.
We can use the sequence function to generate the list of dates in the required range, then explode the resulting array to get the dataframe in the required format.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Sample dataframe
val df = List(("1", "2", "01/01/2021"),
  ("A", "B", "02/01/2021"),
  ("M", "N", "05/01/2021"))
  .toDF("Column1(String)", "Column2(String)", "Date")

df
  .withColumn("Date", explode_outer(sequence(to_date('Date, "dd/MM/yyyy"),
    to_date(lit("07/01/2021"), "dd/MM/yyyy"))))
  .withColumn("Date", date_format('Date, "dd/MM/yyyy"))
  .show(false)
/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date |
+---------------+---------------+----------+
|1 |2 |01/01/2021|
|1 |2 |02/01/2021|
|1 |2 |03/01/2021|
|1 |2 |04/01/2021|
|1 |2 |05/01/2021|
|1 |2 |06/01/2021|
|1 |2 |07/01/2021|
|A |B |02/01/2021|
|A |B |03/01/2021|
|A |B |04/01/2021|
|A |B |05/01/2021|
|A |B |06/01/2021|
|A |B |07/01/2021|
|M |N |05/01/2021|
|M |N |06/01/2021|
|M |N |07/01/2021|
+---------------+---------------+----------+ */

Filter for calculated date

I am having trouble filtering my data for a calculated date.
Here's my data:
> dput(df)
structure(list(date = structure(c(1490652000, 1490738400, 1490824800,
1490911200, 1490997600, 1491084000, 1491170400, 1491256800, 1491343200,
1491429600, 1491516000, 1491602400, 1491688800, 1491775200, 1491861600,
1491948000, 1492034400, 1492120800, 1492207200, 1492293600, 1492380000,
1492466400, 1492552800, 1492639200, 1492725600, 1492812000, 1492898400,
1492984800, 1493071200), class = c("POSIXct", "POSIXt"), tzone = ""),
date2 = structure(c(NA, NA, NA, NA, NA, NA, NA, 1491256800,
NA, NA, NA, NA, NA, 1491775200, NA, NA, NA, NA, NA, NA, NA,
1492466400, NA, NA, NA, NA, NA, NA, 1493071200), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = 87:115, class = "data.frame")
Now I want to filter the date column for all dates that are 7 days before date2, but I always get a dataset with 0 observations:
library(lubridate)
library(dplyr)
df2 <- df %>%
filter(date == date2 -days(7))
However, the following works fine:
df2 <- df %>%
filter(date == date2)
I don't understand why!?!?
The second filter works because it only keeps the rows where date equals date2. The desired filter compares date with date2 - days(7) (using lubridate's days()), but in your data date2 is NA on every row except the one where it equals date, so the comparison evaluates to NA and those rows are dropped.
First fill the date2 column upwards so every row carries the next non-missing date2, then do the filter:
df %>%
  tidyr::fill(date2, .direction = "up") %>%
  filter(date == (date2 - lubridate::days(7)))
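If you would rather not rely on fill() and the row order it assumes, an alternative sketch (untested on your data) is to compute the target dates explicitly and filter with %in%; days() here is the same lubridate helper loaded above:

target_dates <- (df %>% filter(!is.na(date2)) %>% pull(date2)) - days(7)

df2 <- df %>%
  filter(date %in% target_dates)  # keep rows exactly 7 days before any date2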

How to replicate an element in Spark dataframe in Scala?

Suppose I have a DataFrame:
val testDf = sc.parallelize(Seq(
(1,2,"x", Array(1,2,3,4)))).toDF("one", "two", "X", "Array")
+---+---+---+------------+
|one|two| X| Array|
+---+---+---+------------+
| 1| 2| x|[1, 2, 3, 4]|
+---+---+---+------------+
I want to replicate the single elements, let's say 4 times, in order to achieve a single row DataFrame with each field as an array of four elements. The desired output would be:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
You can use the built-in array function to replicate a column of your choice n times.
Below is proof-of-concept code.
import org.apache.spark.sql.functions._

val replicate = (n: Int, colName: String) => array((1 to n).map(s => col(colName)): _*)
val replicatedCol = Seq("one", "two", "X").map(s => replicate(4, s).as(s))
val cols = col("Array") +: replicatedCol

val testDf = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4)))).toDF("one", "two", "X", "Array").select(cols: _*)
testDf.show(false)
+------------+------------+------------+------------+
|Array |one |two |X |
+------------+------------+------------+------------+
|[1, 2, 3, 4]|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|
+------------+------------+------------+------------+
In case you want a different n for each column:
val testDf = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4)))).toDF("one", "two", "X", "Array")
  .select(replicate(2, "one").as("one"), replicate(3, "X").as("X"),
    replicate(4, "two").as("two"), $"Array")
testDf.show(false)
+------+---------+------------+------------+
|one |X |two |Array |
+------+---------+------------+------------+
|[1, 1]|[x, x, x]|[2, 2, 2, 2]|[1, 2, 3, 4]|
+------+---------+------------+------------+
Well, here is my solution:
First declare the columns you want to replicate:
val columnsToReplicate = List("one", "two", "X")
Then define the replication factor and the udf to perform it:
val replicationFactor = 4
val replicate = (s: String) => {
  for {
    i <- 1 to replicationFactor
  } yield s
}
val replicateudf = functions.udf(replicate)
Then perform a foldLeft over the DataFrame's columns, replacing a column whenever its name belongs to your list of desired column names:
testDf.columns.foldLeft(testDf)((acc, colname) =>
  if (columnsToReplicate.contains(colname))
    acc.withColumn(colname, replicateudf(acc.col(colname)))
  else acc)
Output:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
Note: you need this import:
import org.apache.spark.sql.functions
EDIT:
Variable replicationFactor as suggested in comments:
val mapColumnsToReplicate = Map("one" -> 4, "two" -> 5, "X" -> 6)
val replicateudf2 = functions.udf((s: String, replicationFactor: Int) =>
  for {
    i <- 1 to replicationFactor
  } yield s
)
testDf.columns.foldLeft(testDf)((acc, colname) =>
  if (mapColumnsToReplicate.keys.toList.contains(colname))
    acc.withColumn(colname, replicateudf2($"$colname", functions.lit(mapColumnsToReplicate(colname))))
  else acc)
Output with those values above:
+------------+---------------+------------------+------------+
| one| two| X| Array|
+------------+---------------+------------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2, 2]|[x, x, x, x, x, x]|[1, 2, 3, 4]|
+------------+---------------+------------------+------------+
You can use explode and groupBy/collect_list:
val testDf = sc.parallelize(
  Seq((1, 2, "x", Array(1, 2, 3, 4)),
    (3, 4, "y", Array(1, 2, 3)),
    (5, 6, "z", Array(1)))
).toDF("one", "two", "X", "Array")

testDf
  .withColumn("id", monotonically_increasing_id())
  .withColumn("tmp", explode($"Array"))
  .groupBy($"id")
  .agg(
    collect_list($"one").as("cl_one"),
    collect_list($"two").as("cl_two"),
    collect_list($"X").as("cl_X"),
    first($"Array").as("Array")
  )
  .select(
    $"cl_one".as("one"),
    $"cl_two".as("two"),
    $"cl_X".as("X"),
    $"Array"
  )
  .show()
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
| [5]| [6]| [z]| [1]|
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
| [3, 3, 3]| [4, 4, 4]| [y, y, y]| [1, 2, 3]|
+------------+------------+------------+------------+
This solution has the advantage that it does not rely on constant array sizes.

Scala DataTable to List of Maps

Could you please suggest how I can implement the following:
I have a DataTable in a Cucumber feature file such as:
|A |B |C |
|1 |2 |3 |
|11 |22 |33 |
|111|222|333|
I am trying to get a List of Maps like this:
A:1,11,111; B:2,22,222; C:3,33,333
If I do it like this:
val maps: List[Map[String, Any]] =
  data.asMaps(classOf[String], classOf[Any]).asScala.map(_.asScala.toMap).toList
I get something different instead: A:1, B:2, C:3, A:11, ...
Transpose, and then map to Maps.
val source = List(
List("A", "B", "C"),
List(1, 2, 3),
List(11, 22, 33),
List(111, 222, 333)
)
val transposed = source.transpose
println(transposed) // List(List(A, 1, 11, 111), List(B, 2, 22, 222), List(C, 3, 33, 333))
val mapped = transposed.map {
case l: List[Any] => Map(l.head -> l.tail)
}
println(mapped) // List(Map(A -> List(1, 11, 111)), Map(B -> List(2, 22, 222)), Map(C -> List(3, 33, 333)))
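Applied to the actual Cucumber DataTable from the question, a possible sketch looks like this; it assumes data is the io.cucumber.datatable.DataTable from your step definition, that the cells can be read as String via asLists, and that the asScala converters from your snippet are in scope:

// hypothetical glue code, not tested against a real feature file
val rows: List[List[String]] =
  data.asLists(classOf[String]).asScala.map(_.asScala.toList).toList

// the first row holds the headers; transpose turns each column into a row
val columnMaps: Map[String, List[String]] =
  rows.transpose.map(col => col.head -> col.tail).toMap
// Map(A -> List(1, 11, 111), B -> List(2, 22, 222), C -> List(3, 33, 333))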