Filter a date column for a calculated date

I am having trouble filtering my date column for a calculated date.
Here's my data:
> dput(df)
structure(list(date = structure(c(1490652000, 1490738400, 1490824800,
1490911200, 1490997600, 1491084000, 1491170400, 1491256800, 1491343200,
1491429600, 1491516000, 1491602400, 1491688800, 1491775200, 1491861600,
1491948000, 1492034400, 1492120800, 1492207200, 1492293600, 1492380000,
1492466400, 1492552800, 1492639200, 1492725600, 1492812000, 1492898400,
1492984800, 1493071200), class = c("POSIXct", "POSIXt"), tzone = ""),
date2 = structure(c(NA, NA, NA, NA, NA, NA, NA, 1491256800,
NA, NA, NA, NA, NA, 1491775200, NA, NA, NA, NA, NA, NA, NA,
1492466400, NA, NA, NA, NA, NA, NA, 1493071200), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = 87:115, class = "data.frame")
Now I want to filter the date column for all dates that are 7 days before date2, but I always get a dataset with 0 observations:
library(lubridate)
library(dplyr)
df2 <- df %>%
  filter(date == date2 - days(7))
However, the following works fine:
df2 <- df %>%
  filter(date == date2)
I don't understand why!

The second filter works because on the rows where date == date2, date2 is not missing. For the desired filter, the row you want to keep is the one 7 days before date2, but on that row date2 is NA, so date == date2 - days(7) evaluates to NA and filter() drops it. Besides the lubridate function days(), you therefore need every row of the date2 column to hold a valid date. First fill the date2 column upwards, then do the filter:
df %>%
  tidyr::fill(date2, .direction = "up") %>%
  filter(date == (date2 - lubridate::days(7)))

scala: get column name corresponding to max column value from variable columns list

I have the following working solution in a Databricks notebook as a test.
var maxcol = udf((col1: Long, col2: Long, col3: Long) => {
  var res = ""
  if (col1 > col2 && col1 > col3) res = "col1"
  else if (col2 > col1 && col2 > col3) res = "col2"
  else res = "col3"
  res
})
val someDF = Seq(
  (8, 10, 12, "bat"),
  (64, 61, 59, "mouse"),
  (-27, -30, -15, "horse")
).toDF("number1", "number2", "number3", "word")
  .withColumn("maxColVal", greatest("number1", "number2", "number3"))
  .withColumn("maxColVal_Name", maxcol(col("number1"), col("number2"), col("number3")))

display(someDF)
Is there any way to make this generic? I have a use case where a variable list of columns is passed to this UDF, and I still need the name of the column holding the max value as output, unlike above where I have hard-coded the column names 'col1', 'col2' and 'col3' in the UDF.
Use below:
import org.apache.spark.sql.functions._

val df = List((1,2,3,5,"a"), (4,2,3,1,"a"), (1,20,3,1,"a"), (1,22,22,2,"a"))
  .toDF("mycol1", "mycol2", "mycol3", "mycol4", "mycol5")

// list all the columns among which you want to find the max value
val colGroup = List(df("mycol1"), df("mycol2"), df("mycol3"), df("mycol4"))

// list value -> name pairs for those columns, to build a map from column value to column name
val colGroupMap = List(df("mycol1"), lit("mycol1"),
  df("mycol2"), lit("mycol2"),
  df("mycol3"), lit("mycol3"),
  df("mycol4"), lit("mycol4"))

// the largest key of the value -> name map is the max value, so its entry gives the column name
val maxcol = udf((colVal: Map[Int, String]) => colVal.max._2)

df.withColumn("maxColValue", greatest(colGroup: _*))
  .withColumn("maxColVal_Name", maxcol(map(colGroupMap: _*)))
  .show(false)
+------+------+------+------+------+-----------+--------------+
|mycol1|mycol2|mycol3|mycol4|mycol5|maxColValue|maxColVal_Name|
+------+------+------+------+------+-----------+--------------+
|1 |2 |3 |5 |a |5 |mycol4 |
|4 |2 |3 |1 |a |4 |mycol1 |
|1 |20 |3 |1 |a |20 |mycol2 |
|1 |22 |22 |2 |a |22 |mycol3 |
+------+------+------+------+------+-----------+--------------+
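To make this fully generic, both lists can be derived from a sequence of column names instead of being written out by hand. A minimal sketch (the helper name maxColNameOf is mine, and it assumes integer-valued columns, as in the answer above):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Adds maxColValue / maxColVal_Name for an arbitrary list of integer columns.
def maxColNameOf(df: DataFrame, cols: Seq[String]): DataFrame = {
  val values: Seq[Column] = cols.map(df(_))
  // interleave value -> name pairs for the map() function
  val valueToName: Seq[Column] = cols.flatMap(c => Seq(df(c), lit(c)))
  val maxcol = udf((colVal: Map[Int, String]) => colVal.max._2)
  df.withColumn("maxColValue", greatest(values: _*))
    .withColumn("maxColVal_Name", maxcol(map(valueToName: _*)))
}

// usage: maxColNameOf(df, List("mycol1", "mycol2", "mycol3", "mycol4")).show(false)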

R summarise by group sum giving NA

I have a data frame like this:
Observations: 2,190,835
Variables: 13
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489…
$ preparationid <dbl> 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1…
$ doseday <int> 90, 90, 91, 91, 92, 92, 92, 92, 93, 93, 93, 93, 94, 94, 94, 94, 95, 95, 95, 95, 99, 99, 100, 100, 10…
$ route <fct> enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., …
$ enteral <fct> t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t…
$ energy_kcal_kg <dbl> 0.00, 13.56, 0.00, 13.56, 0.00, 13.49, 0.00, 13.49, 0.00, 13.35, 0.00, 13.35, 0.00, 12.95, 0.00, 12.…
$ prot_g_kg <dbl> 0.000, 0.366, 0.000, 0.366, 0.000, 0.365, 0.000, 0.365, 0.000, 0.361, 0.000, 0.361, 0.000, 0.350, 0.…
$ lipids_g_kg <dbl> 0.000, 0.495, 0.000, 0.495, 0.000, 0.492, 0.000, 0.492, 0.000, 0.487, 0.000, 0.487, 0.000, 0.472, 0.…
$ K_mmol_kg <dbl> 0.000, 0.385, 0.000, 0.385, 0.000, 0.383, 0.000, 0.383, 0.000, 0.379, 0.000, 0.379, 0.000, 0.368, 0.…
$ Na_mmol_kg <dbl> 0.0000, 0.1832, 0.0000, 0.1832, 0.0000, 0.1823, 0.0000, 0.1823, 0.0000, 0.1804, 0.0000, 0.1804, 0.00…
$ Ca_mg_kg <dbl> 0.00, 10.99, 0.00, 10.99, 0.00, 10.94, 0.00, 10.94, 0.00, 10.82, 0.00, 10.82, 0.00, 10.50, 0.00, 10.…
$ P_mg_kg <dbl> 0.00, 8.25, 0.00, 8.25, 0.00, 8.20, 0.00, 8.20, 0.00, 8.12, 0.00, 8.12, 0.00, 7.88, 0.00, 7.88, 0.00…
$ Pi_mmol_kg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
I need to calculate, for each patient, the daily sum of nutrient intakes, and I've been using the code below.
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(energy_kcal_kg_d = sum(energy_kcal_kg),
            protein_g_kg_d = sum(prot_g_kg),
            lipids_g_kg_d = sum(lipids_g_kg),
            na_total_mmol_kg_d = sum(Na_mmol_kg),
            K_total_mmol_kg_d = sum(K_mmol_kg),
            Ca_mg_total_kg_d = sum(Ca_mg_kg),
            P_mg_kg_d = sum(P_mg_kg),
            Pi_mmol_kg_d = sum(Pi_mmol_kg))
The code seems to work in some way, since the grouping looks fine; however, the daily sums are missing and the result is NA. What is wrong here?
Variables: 11
Groups: patientid, doseday [30,991]
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, …
$ doseday <int> 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, …
$ enteral <fct> f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, …
$ energy_kcal_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ protein_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ lipids_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ na_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ K_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Ca_mg_total_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ P_mg_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Pi_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
By default, sum() returns NA if any of the values being summed is NA; use na.rm = TRUE to ignore the missing values.
Try this:
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(
    energy_kcal_kg_d = sum(energy_kcal_kg, na.rm = TRUE),
    protein_g_kg_d = sum(prot_g_kg, na.rm = TRUE),
    lipids_g_kg_d = sum(lipids_g_kg, na.rm = TRUE),
    na_total_mmol_kg_d = sum(Na_mmol_kg, na.rm = TRUE),
    K_total_mmol_kg_d = sum(K_mmol_kg, na.rm = TRUE),
    Ca_mg_total_kg_d = sum(Ca_mg_kg, na.rm = TRUE),
    P_mg_kg_d = sum(P_mg_kg, na.rm = TRUE),
    Pi_mmol_kg_d = sum(Pi_mmol_kg, na.rm = TRUE)
  )

improve if condition with scala

I wrote this:
if (fork == "0" || fork == "1" || fork == "3" || fork == "null" ) {
list2 :: List(
Wrapper(
Location.PL_TYPES,
subType,
daFuncId,
NA,
name,
code)
)
}
else list2 :: List(
Wrapper(
Location.PL_TYPES,
subType,
NA,
NA,
name,
code
)
)
}
I want to improve this by replacing the if/else with another pattern.
Best regards
It seems only the ID is different between the two cases. You could use pattern matching to choose the id, and build the list only afterwards, so you don't repeat the Wrapper construction:
val id = fork match {
  case "0" | "1" | "3" | "null" => daFuncId
  case _ => NA
}

list2 :: List(
  Wrapper(
    Location.PL_TYPES,
    subType,
    id,
    NA,
    name,
    code)
)
You can write the same if-else condition using pattern matching in Scala.
fork match {
  case "0" | "1" | "3" | "null" =>
    list2 :: List(
      Wrapper(
        Location.PL_TYPES,
        subType,
        daFuncId,
        NA,
        name,
        code)
    )
  case _ =>
    list2 :: List(
      Wrapper(
        Location.PL_TYPES,
        subType,
        NA,
        NA,
        name,
        code)
    )
}
You can also wrap fork in a single-element list, map it to the right id, and then map that to the Wrapper. Please let me know if this works out for you:
list2 :: List(fork)
  .map {
    case "0" | "1" | "3" | "null" => daFuncId
    case _ => NA
  }
  .map { id =>
    Wrapper(Location.PL_TYPES, subType, id, NA, name, code)
  }
Not really Scala specific, but I'd suggest something like this:
if (List("0", "1", "3", "null").contains(fork)) {
} else {
}
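Combining this with the first answer's idea of isolating the only field that differs, a minimal sketch (assuming the same list2, Wrapper, daFuncId and NA as in the question):

// Only the id differs between the two branches, so compute it once
// and build the Wrapper a single time.
val id = if (List("0", "1", "3", "null").contains(fork)) daFuncId else NA

list2 :: List(
  Wrapper(Location.PL_TYPES, subType, id, NA, name, code)
)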

I want to calculate using three columns and produce a single column showing all three values

I am loading a file into a DataFrame in Spark on Databricks:
spark.sql("""select A,X,Y,Z from fruits""")
A X Y Z
1E5 1.000 0.000 0.000
1U2 2.000 5.000 0.000
5G6 3.000 0.000 10.000
I need output as
A D
1E5 X 1
1U2 X 2, Y 5
5G6 X 3, Z 10
I was able to find a solution. Each column name can be joined with its value, and then all the values can be joined into one column, separated by commas:
// data
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val df = Seq(
  ("1E5", 1.000, 0.000, 0.000),
  ("1U2", 2.000, 5.000, 0.000),
  ("5G6", 3.000, 0.000, 10.000))
  .toDF("A", "X", "Y", "Z")

// action
val columnsToConcat = List("X", "Y", "Z")

// "X 1", "Y 5", ... for non-zero values, empty string otherwise
val columnNameValueList = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
    .otherwise("")
)

// join the non-empty pieces with ", "
val valuesJoinedByComaColumn = columnNameValueList.reduce((a, b) =>
  when(length(a) =!= 0 && length(b) =!= 0, concat(a, lit(", "), b))
    .otherwise(concat(a, b))
)

val result = df.withColumn("D", valuesJoinedByComaColumn)
  .drop(columnsToConcat: _*)
Output:
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
This solution is similar to the one proposed by stack0114106, but looks more explicit.
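As a possible simplification (not from either answer): concat_ws skips NULL values, so the empty-string handling in the reduce step can be avoided by leaving the when without an otherwise. A sketch assuming the same df and columnsToConcat as above:

// when(...) with no otherwise() yields NULL for zero values,
// and concat_ws(", ", ...) silently drops NULL inputs.
val entries = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
)

val resultWs = df
  .withColumn("D", concat_ws(", ", entries: _*))
  .drop(columnsToConcat: _*)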
Check this out:
scala> val df = Seq(("1E5",1.000,0.000,0.000),("1U2",2.000,5.000,0.000),("5G6",3.000,0.000,10.000)).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: double ... 2 more fields]
scala> df.show()
+---+---+---+----+
| A| X| Y| Z|
+---+---+---+----+
|1E5|1.0|0.0| 0.0|
|1U2|2.0|5.0| 0.0|
|5G6|3.0|0.0|10.0|
+---+---+---+----+
scala> val newcol = df.columns.drop(1).map( x=> when(col(x)===0,lit("")).otherwise(concat(lit(x),lit(" "),col(x).cast("int").cast("string"))) ).reduce( (x,y) => concat(x,lit(", "),y) )
newcol: org.apache.spark.sql.Column = concat(concat(CASE WHEN (X = 0) THEN ELSE concat(X, , CAST(CAST(X AS INT) AS STRING)) END, , , CASE WHEN (Y = 0) THEN ELSE concat(Y, , CAST(CAST(Y AS INT) AS STRING)) END), , , CASE WHEN (Z = 0) THEN ELSE concat(Z, , CAST(CAST(Z AS INT) AS STRING)) END)
scala> df.withColumn("D",newcol).withColumn("D",regexp_replace(regexp_replace('D,", ,",","),", $", "")).drop("X","Y","Z").show(false)
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+

Collapsing column values in spark dataframes

I have 2 DataFrames
case class UserTransactions(id: Long, transactionDate: java.sql.Date, currencyUsed: String, value: Long)
ID, TransactionDate, CurrencyUsed, value
1, 2016-01-05, USD, 100
1, 2016-01-09, GBP, 150
1, 2016-02-01, USD, 50
1, 2016-02-10, JPN, 10
2, 2016-01-10, EURO, 50
2, 2016-01-10, GBP, 100
case class ReportingTime(userId: Long, reportDate: java.sql.Date)
userId, reportDate
1, 2016-01-05
1, 2016-01-31
1, 2016-02-15
2, 2016-01-10
2, 2016-02-01
Now I want to get a summary by userId and reportDate, combining all previously used currencies and summing their values. The results should look like:
userId, reportDate, transactionSummary
1, 2016-01-05, None
1, 2016-01-31, (USD -> 100)(GBP-> 150) // combined above 2 transactions less than 2016-01-31
1, 2016-02-15, (USD -> 150)(GBP-> 150)(JPN->10) // combined transactions less than 2016-02-15
2, 2016-01-10, None
2, 2016-02-01, (EURO-> 50) (GBP-> 100)
What is the best way to do this? We have over 300 million transactions, and each user can have up to 10,000 transactions.
The snippet below would achieve your requirement. The initial joining and aggregation is done via the DataFrame API of pyspark. Then the grouping of the data (using reduceByKey) and the final dataset preparation are done via the RDD API, since it is more suitable for such operations.
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1,'2016-01-05','USD',100),
                             (1,'2016-01-09','GBP',150),
                             (1,'2016-02-01','USD',50),
                             (1,'2016-02-10','JPN',10),
                             (2,'2016-01-10','EURO',50),
                             (2,'2016-01-10','GBP',100)], ['id', 'tdate', 'currency', 'value'])

df2 = spark.createDataFrame([(1,'2016-01-05'),
                             (1,'2016-01-31'),
                             (1,'2016-02-15'),
                             (2,'2016-01-10'),
                             (2,'2016-02-01')], ['user_id', 'report_date'])

# function to convert string data type to date data type
func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df2 = df2.withColumn('tdate', func(df2.report_date))
df1 = df1.withColumn('tdate', func(df1.tdate))

# join each report with the user's earlier transactions and sum the values per currency
result = (df2.join(df1, (df1.id == df2.user_id) & (df1.tdate < df2.report_date), 'left_outer')
          .select('user_id', 'report_date', 'currency', 'value')
          .groupBy('user_id', 'report_date', 'currency')
          .agg(F.sum('value').alias('value')))

# collapse the per-currency rows of each report into a single list of (currency, value) pairs
data = (result.rdd
        .map(lambda x: (x.user_id, x.report_date, x.currency, x.value))
        .keyBy(lambda x: (x[0], x[1]))
        .mapValues(lambda x: filter(lambda x: bool(x), [(x[2], x[3]) if x[2] else None]))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[0][0], x[0][1], x[1])))
The final result generated is as shown below.
>>> spark.createDataFrame([ (x[0],x[1],str(x[2])) for x in data.collect()], ['id', 'date', 'values']).orderBy('id', 'date').show(20, False)
+---+----------+--------------------------------------------+
|id |date |values |
+---+----------+--------------------------------------------+
|1 |2016-01-05|[] |
|1 |2016-01-31|[(u'USD', 100), (u'GBP', 150)] |
|1 |2016-02-15|[(u'USD', 150), (u'GBP', 150), (u'JPN', 10)]|
|2 |2016-01-10|[] |
|2 |2016-02-01|[(u'EURO', 50), (u'GBP', 100)] |
+---+----------+--------------------------------------------+
In case someone needs it in Scala:
import java.text.SimpleDateFormat
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import spark.implicits._

case class Transaction(id: String, date: java.sql.Date, currency: Option[String], value: Option[Long])
case class Report(id: String, date: java.sql.Date)

def toDate(date: String): java.sql.Date = {
  val sf = new SimpleDateFormat("yyyy-MM-dd")
  new java.sql.Date(sf.parse(date).getTime)
}

val allTransactions = Seq(
  Transaction("1", toDate("2016-01-05"), Some("USD"), Some(100L)),
  Transaction("1", toDate("2016-01-09"), Some("GBP"), Some(150L)),
  Transaction("1", toDate("2016-02-01"), Some("USD"), Some(50L)),
  Transaction("1", toDate("2016-02-10"), Some("JPN"), Some(10L)),
  Transaction("2", toDate("2016-01-10"), Some("EURO"), Some(50L)),
  Transaction("2", toDate("2016-01-10"), Some("GBP"), Some(100L))
)

val allReports = Seq(
  Report("1", toDate("2016-01-05")),
  Report("1", toDate("2016-01-31")),
  Report("1", toDate("2016-02-15")),
  Report("2", toDate("2016-01-10")),
  Report("2", toDate("2016-02-01"))
)

val transections: Dataset[Transaction] = spark.createDataFrame(allTransactions).as[Transaction]
val reports: Dataset[Report] = spark.createDataFrame(allReports).as[Report]

// left join each report date with the user's earlier transactions and sum the values per currency
val result = reports.alias("rp")
  .join(transections.alias("tx"), (col("tx.id") === col("rp.id")) && (col("tx.date") < col("rp.date")), "left_outer")
  .select("rp.id", "rp.date", "currency", "value")
  .groupBy("rp.id", "rp.date", "currency").agg(sum("value"))
  .toDF("id", "date", "currency", "value")
  .as[Transaction]

// collapse the per-currency sums of each report into a single map
val data = result.rdd.keyBy(x => (x.id, x.date))
  .mapValues(x => if (x.currency.isDefined) collection.Map[String, Long](x.currency.get -> x.value.get) else collection.Map[String, Long]())
  .reduceByKey((x, y) => x ++ y).map(x => (x._1._1, x._1._2, x._2))
  .toDF("id", "date", "map")
  .orderBy("id", "date")
Console output
+---+----------+--------------------------------------+
|id |date |map |
+---+----------+--------------------------------------+
|1 |2016-01-05|Map() |
|1 |2016-01-31|Map(GBP -> 150, USD -> 100) |
|1 |2016-02-15|Map(USD -> 150, GBP -> 150, JPN -> 10)|
|2 |2016-01-10|Map() |
|2 |2016-02-01|Map(GBP -> 100, EURO -> 50) |
+---+----------+--------------------------------------+
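Not part of either answer, but worth noting: on Spark 2.4+ the final collapse can also stay in the DataFrame API instead of going through reduceByKey. A sketch assuming the result dataset built above:

// Collect the non-null (currency, value) pairs of each report and turn them into a map.
// collect_list skips NULL entries, so reports with no prior transactions get an empty map.
val collapsed = result
  .groupBy("id", "date")
  .agg(map_from_entries(
    collect_list(when(col("currency").isNotNull, struct(col("currency"), col("value"))))
  ).as("map"))
  .orderBy("id", "date")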