I have a data frame like this:
Observations: 2,190,835
Variables: 13
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489…
$ preparationid <dbl> 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1000307, 1…
$ doseday <int> 90, 90, 91, 91, 92, 92, 92, 92, 93, 93, 93, 93, 94, 94, 94, 94, 95, 95, 95, 95, 99, 99, 100, 100, 10…
$ route <fct> enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., enteral., …
$ enteral <fct> t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t, t…
$ energy_kcal_kg <dbl> 0.00, 13.56, 0.00, 13.56, 0.00, 13.49, 0.00, 13.49, 0.00, 13.35, 0.00, 13.35, 0.00, 12.95, 0.00, 12.…
$ prot_g_kg <dbl> 0.000, 0.366, 0.000, 0.366, 0.000, 0.365, 0.000, 0.365, 0.000, 0.361, 0.000, 0.361, 0.000, 0.350, 0.…
$ lipids_g_kg <dbl> 0.000, 0.495, 0.000, 0.495, 0.000, 0.492, 0.000, 0.492, 0.000, 0.487, 0.000, 0.487, 0.000, 0.472, 0.…
$ K_mmol_kg <dbl> 0.000, 0.385, 0.000, 0.385, 0.000, 0.383, 0.000, 0.383, 0.000, 0.379, 0.000, 0.379, 0.000, 0.368, 0.…
$ Na_mmol_kg <dbl> 0.0000, 0.1832, 0.0000, 0.1832, 0.0000, 0.1823, 0.0000, 0.1823, 0.0000, 0.1804, 0.0000, 0.1804, 0.00…
$ Ca_mg_kg <dbl> 0.00, 10.99, 0.00, 10.99, 0.00, 10.94, 0.00, 10.94, 0.00, 10.82, 0.00, 10.82, 0.00, 10.50, 0.00, 10.…
$ P_mg_kg <dbl> 0.00, 8.25, 0.00, 8.25, 0.00, 8.20, 0.00, 8.20, 0.00, 8.12, 0.00, 8.12, 0.00, 7.88, 0.00, 7.88, 0.00…
$ Pi_mmol_kg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
I need to calculate, for each patient, the daily sum of nutrient intakes, and I've been using the code below:
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(energy_kcal_kg_d = sum(energy_kcal_kg),
            protein_g_kg_d = sum(prot_g_kg),
            lipids_g_kg_d = sum(lipids_g_kg),
            na_total_mmol_kg_d = sum(Na_mmol_kg),
            K_total_mmol_kg_d = sum(K_mmol_kg),
            Ca_mg_total_kg_d = sum(Ca_mg_kg),
            P_mg_kg_d = sum(P_mg_kg),
            Pi_mmol_kg_d = sum(Pi_mmol_kg))
The code seems to work to some extent, since the grouping looks fine; however, the daily sums are missing and the result is NA. What is wrong here?
Variables: 11
Groups: patientid, doseday [30,991]
$ patientid <int> 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, 4489, …
$ doseday <int> 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, …
$ enteral <fct> f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, f, t, …
$ energy_kcal_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ protein_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ lipids_g_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ na_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ K_total_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Ca_mg_total_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ P_mg_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Pi_mmol_kg_d <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
By default, sum() does not ignore missing values: if any value in a group is NA, the whole sum is NA, which is why your daily sums come out as NA. Pass na.rm = TRUE to drop the NAs before summing.
Try this:
nutrient_intake <- nutrient_data %>%
  group_by(patientid, doseday, enteral) %>%
  summarise(
    energy_kcal_kg_d = sum(energy_kcal_kg, na.rm = TRUE),
    protein_g_kg_d = sum(prot_g_kg, na.rm = TRUE),
    lipids_g_kg_d = sum(lipids_g_kg, na.rm = TRUE),
    na_total_mmol_kg_d = sum(Na_mmol_kg, na.rm = TRUE),
    K_total_mmol_kg_d = sum(K_mmol_kg, na.rm = TRUE),
    Ca_mg_total_kg_d = sum(Ca_mg_kg, na.rm = TRUE),
    P_mg_kg_d = sum(P_mg_kg, na.rm = TRUE),
    Pi_mmol_kg_d = sum(Pi_mmol_kg, na.rm = TRUE)
  )
I am loading a file into a DataFrame in Spark on Databricks:
spark.sql("""select A,X,Y,Z from fruits""")
A X Y Z
1E5 1.000 0.000 0.000
1U2 2.000 5.000 0.000
5G6 3.000 0.000 10.000
I need the output as:
A D
1E5 X 1
1U2 X 2, Y 5
5G6 X 3, Z 10
I was able to find a solution.
Each column name can be joined with its value, and then all the values can be joined into one column, separated by commas:
// imports (assuming a Spark shell or Databricks notebook where `spark` is in scope)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// data
val df = Seq(
  ("1E5", 1.000, 0.000, 0.000),
  ("1U2", 2.000, 5.000, 0.000),
  ("5G6", 3.000, 0.000, 10.000))
  .toDF("A", "X", "Y", "Z")

// build a "name value" piece per column, or an empty string when the value is 0
val columnsToConcat = List("X", "Y", "Z")
val columnNameValueList = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
    .otherwise("")
)

// join the non-empty pieces with ", "
val valuesJoinedByCommaColumn = columnNameValueList.reduce((a, b) =>
  when(length(a) =!= 0 && length(b) =!= 0, concat(a, lit(", "), b))
    .otherwise(concat(a, b))
)

val result = df.withColumn("D", valuesJoinedByCommaColumn)
  .drop(columnsToConcat: _*)
Output:
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
This solution is similar to the one proposed by stack0114106, but it is more explicit.
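As a side note, the length checks in the reduce can probably be avoided: Spark's concat_ws skips null values, so leaving the zero entries as null gives the same result. A rough sketch reusing the df and columnsToConcat defined above (the parts/resultWs names are only illustrative, and this is untested beyond the sample data):
// leave zeros as null (no .otherwise), then let concat_ws drop them
val parts = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
)
val resultWs = df.withColumn("D", concat_ws(", ", parts: _*))
  .drop(columnsToConcat: _*)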
Check this out:
scala> val df = Seq(("1E5",1.000,0.000,0.000),("1U2",2.000,5.000,0.000),("5G6",3.000,0.000,10.000)).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: double ... 2 more fields]
scala> df.show()
+---+---+---+----+
| A| X| Y| Z|
+---+---+---+----+
|1E5|1.0|0.0| 0.0|
|1U2|2.0|5.0| 0.0|
|5G6|3.0|0.0|10.0|
+---+---+---+----+
scala> val newcol = df.columns.drop(1).map( x=> when(col(x)===0,lit("")).otherwise(concat(lit(x),lit(" "),col(x).cast("int").cast("string"))) ).reduce( (x,y) => concat(x,lit(", "),y) )
newcol: org.apache.spark.sql.Column = concat(concat(CASE WHEN (X = 0) THEN ELSE concat(X, , CAST(CAST(X AS INT) AS STRING)) END, , , CASE WHEN (Y = 0) THEN ELSE concat(Y, , CAST(CAST(Y AS INT) AS STRING)) END), , , CASE WHEN (Z = 0) THEN ELSE concat(Z, , CAST(CAST(Z AS INT) AS STRING)) END)
scala> df.withColumn("D",newcol).withColumn("D",regexp_replace(regexp_replace('D,", ,",","),", $", "")).drop("X","Y","Z").show(false)
+---+---------+
|A |D |
+---+---------+
|1E5|X 1 |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
I have 2 DataFrames
case class UserTransactions(id: Long, transactionDate: java.sql.Date, currencyUsed: String, value: Long)
ID, TransactionDate, CurrencyUsed, value
1, 2016-01-05, USD, 100
1, 2016-01-09, GBP, 150
1, 2016-02-01, USD, 50
1, 2016-02-10, JPN, 10
2, 2016-01-10, EURO, 50
2, 2016-01-10, GBP, 100
case class ReportingTime(userId: Long, reportDate: java.sql.Date)
userId, reportDate
1, 2016-01-05
1, 2016-01-31
1, 2016-02-15
2, 2016-01-10
2, 2016-02-01
Now I want to get a summary per userId and reportDate by combining all previously used currencies and summing their values. The results should look like:
userId, reportDate, transactionSummary
1, 2016-01-05, None
1, 2016-01-31, (USD -> 100)(GBP-> 150) // combines the 2 transactions above dated before 2016-01-31
1, 2016-02-15, (USD -> 150)(GBP-> 150)(JPN->10) // combines the transactions dated before 2016-02-15
2, 2016-01-10, None
2, 2016-02-01, (EURO-> 50) (GBP-> 100)
What is the best way to do this? We have over 300 million transactions, and each user can have up to 10,000 transactions.
The snippet below should achieve your requirement. The initial join and aggregation are done via the PySpark DataFrame API; the grouping of the data (using reduceByKey) and the final dataset preparation are then done via the RDD API, since it is more suitable for such operations.
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(1,'2016-01-05','USD',100),
(1,'2016-01-09','GBP',150),
(1,'2016-02-01','USD',50),
(1,'2016-02-10','JPN',10),
(2,'2016-01-10','EURO',50),
(2,'2016-01-10','GBP',100)],['id', 'tdate', 'currency', 'value'])
df2 = spark.createDataFrame([(1,'2016-01-05'),
(1,'2016-01-31'),
(1,'2016-02-15'),
(2,'2016-01-10'),
(2,'2016-02-01')],['user_id', 'report_date'])
func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())  # converts the string date columns to DateType
df2 = df2.withColumn('tdate', func(df2.report_date))
df1 = df1.withColumn('tdate', func(df1.tdate))
# join each report date with the user's earlier transactions (using the converted
# date columns on both sides) and sum the value per currency
result = (df2.join(df1, (df1.id == df2.user_id) & (df1.tdate < df2.tdate), 'left_outer')
             .select('user_id', 'report_date', 'currency', 'value')
             .groupBy('user_id', 'report_date', 'currency')
             .agg(F.sum('value').alias('value')))

# collect the (currency, value) pairs per (user_id, report_date) key;
# list(...) keeps this working on Python 3, where filter() returns an iterator
data = (result.rdd
        .map(lambda x: (x.user_id, x.report_date, x.currency, x.value))
        .keyBy(lambda x: (x[0], x[1]))
        .mapValues(lambda x: list(filter(bool, [(x[2], x[3]) if x[2] else None])))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[0][0], x[0][1], x[1])))
The final result generated is as shown below.
>>> spark.createDataFrame([ (x[0],x[1],str(x[2])) for x in data.collect()], ['id', 'date', 'values']).orderBy('id', 'date').show(20, False)
+---+----------+--------------------------------------------+
|id |date |values |
+---+----------+--------------------------------------------+
|1 |2016-01-05|[] |
|1 |2016-01-31|[(u'USD', 100), (u'GBP', 150)] |
|1 |2016-02-15|[(u'USD', 150), (u'GBP', 150), (u'JPN', 10)]|
|2 |2016-01-10|[] |
|2 |2016-02-01|[(u'EURO', 50), (u'GBP', 100)] |
+---+----------+--------------------------------------------+
In case someone needs it in Scala:
import java.text.SimpleDateFormat
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions._
import spark.implicits._

case class Transaction(id: String, date: java.sql.Date, currency: Option[String], value: Option[Long])
case class Report(id: String, date: java.sql.Date)
def toDate(date: String): java.sql.Date = {
val sf = new SimpleDateFormat("yyyy-MM-dd")
new java.sql.Date(sf.parse(date).getTime)
}
val allTransactions = Seq(
Transaction("1", toDate("2016-01-05"),Some("USD"),Some(100L)),
Transaction("1", toDate("2016-01-09"),Some("GBP"),Some(150L)),
Transaction("1",toDate("2016-02-01"),Some("USD"),Some(50L)),
Transaction("1",toDate("2016-02-10"),Some("JPN"),Some(10L)),
Transaction("2",toDate("2016-01-10"),Some("EURO"),Some(50L)),
Transaction("2",toDate("2016-01-10"),Some("GBP"),Some(100L))
)
val allReports = Seq(
Report("1",toDate("2016-01-05")),
Report("1",toDate("2016-01-31")),
Report("1",toDate("2016-02-15")),
Report("2",toDate("2016-01-10")),
Report("2",toDate("2016-02-01"))
)
val transactions: Dataset[Transaction] = spark.createDataFrame(allTransactions).as[Transaction]
val reports: Dataset[Report] = spark.createDataFrame(allReports).as[Report]

val result = reports.alias("rp")
  .join(transactions.alias("tx"), (col("tx.id") === col("rp.id")) && (col("tx.date") < col("rp.date")), "left_outer")
  .select("rp.id", "rp.date", "currency", "value")
  .groupBy("rp.id", "rp.date", "currency").agg(sum("value"))
  .toDF("id", "date", "currency", "value")
  .as[Transaction]
val data = result.rdd.keyBy(x => (x.id , x.date))
.mapValues(x => if (x.currency.isDefined) collection.Map[String, Long](x.currency.get -> x.value.get) else collection.Map[String, Long]())
.reduceByKey((x,y) => x ++ y).map(x => (x._1._1, x._1._2, x._2))
.toDF("id", "date", "map")
.orderBy("id", "date")
Console output
+---+----------+--------------------------------------+
|id |date |map |
+---+----------+--------------------------------------+
|1 |2016-01-05|Map() |
|1 |2016-01-31|Map(GBP -> 150, USD -> 100) |
|1 |2016-02-15|Map(USD -> 150, GBP -> 150, JPN -> 10)|
|2 |2016-01-10|Map() |
|2 |2016-02-01|Map(GBP -> 100, EURO -> 50) |
+---+----------+--------------------------------------+
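As a follow-up, on Spark 2.4 or later the RDD step can probably be dropped altogether by building the map with DataFrame functions (map_from_entries plus collect_list, which ignores nulls, so report dates without earlier transactions end up with an empty map). A sketch reusing the reports and transactions Datasets above; the mapped name is only illustrative, and this has not been benchmarked at the 300-million-row scale:
val mapped = reports.alias("rp")
  .join(transactions.alias("tx"), (col("tx.id") === col("rp.id")) && (col("tx.date") < col("rp.date")), "left_outer")
  .select("rp.id", "rp.date", "currency", "value")
  .groupBy("rp.id", "rp.date", "currency").agg(sum("value"))
  .toDF("id", "date", "currency", "value")
  // one row per (id, date): collect the (currency, value) structs and turn them into a map
  .groupBy("id", "date")
  .agg(map_from_entries(collect_list(
    when(col("currency").isNotNull, struct(col("currency"), col("value"))))).as("map"))
  .orderBy("id", "date")
Keeping everything in the DataFrame API leaves the plan to the optimizer, which may matter at that volume.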