Assign patient to nearest facility in Spark/Scala

I have a DataFrame as below:
+-----------+----------+-----------+--------+-------------+
|LOCATION_ID|PATIENT_ID|FACILITY_ID|DISTANCE|rank_distance|
+-----------+----------+-----------+--------+-------------+
|LOC0001 |P1 |FAC003 |54 |2 |
|LOC0001 |P1 |FAC002 |45 |1 |
|LOC0001 |P2 |FAC003 |54 |2 |
|LOC0001 |P2 |FAC002 |45 |1 |
|LOC0010 |P3 |FAC006 |12 |1 |
|LOC0010 |P3 |FAC003 |54 |4 |
|LOC0010 |P3 |FAC005 |23 |2 |
|LOC0010 |P3 |FAC002 |45 |3 |
|LOC0010 |P4 |FAC002 |45 |3 |
|LOC0010 |P4 |FAC005 |23 |2 |
|LOC0010 |P4 |FAC003 |54 |4 |
|LOC0010 |P4 |FAC006 |12 |1 |
|LOC0010 |P5 |FAC006 |12 |1 |
|LOC0010 |P5 |FAC002 |45 |3 |
|LOC0010 |P5 |FAC005 |23 |2 |
|LOC0010 |P5 |FAC003 |54 |4 |
|LOC0010 |P6 |FAC006 |12 |1 |
|LOC0010 |P6 |FAC005 |23 |2 |
|LOC0010 |P6 |FAC002 |45 |3 |
|LOC0010 |P6 |FAC003 |54 |4 |
|LOC0043 |P7 |FAC004 |42 |1 |
|LOC0054 |P8 |FAC002 |24 |2 |
|LOC0054 |P8 |FAC006 |12 |1 |
|LOC0054 |P8 |FAC005 |76 |3 |
|LOC0054 |P8 |FAC007 |100 |4 |
|LOC0065 |P9 |FAC006 |32 |1 |
|LOC0065 |P9 |FAC005 |54 |2 |
|LOC0065 |P10 |FAC006 |32 |1 |
|LOC0065 |P10 |FAC005 |54 |2 |
+-----------+----------+-----------+--------+-------------+
For each patient I have to assign the facility with the lowest rank. My output map should be as below:
P1 ----> FAC002 (because its rank is least)
P2 ----> FAC002 (because its rank is least)
Note: each facility has a capacity of just 2, except FAC003, which has a capacity of 3.
So for P3, P4, P5 and P6 the output should be
P3 ----> FAC006 (because its rank is 1)
P4 ----> FAC006 (because its rank is 1)
P5 ----> FAC005 (because FAC006 has filled its capacity of 2, and the least-rank facility still available is FAC005)
P6 ----> FAC005 (because FAC005 has one capacity slot left)
P7 ----> FAC004
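To make the rule concrete, here is a minimal driver-side sketch of the greedy, capacity-aware assignment I am describing (df is the frame shown above; the capacity map simply restates the limits stated here, and collecting to the driver is an assumption that only works because there are few ranked rows per patient):

import spark.implicits._
import org.apache.spark.sql.functions._

// Capacity map as stated above (every facility holds 2 patients, FAC003 holds 3).
val capacity = scala.collection.mutable.Map(
  "FAC002" -> 2, "FAC003" -> 3, "FAC004" -> 2,
  "FAC005" -> 2, "FAC006" -> 2, "FAC007" -> 2)

// Collect each patient's facilities ordered by rank_distance, then assign greedily,
// processing patients in numeric order (P1, P2, ..., P10).
val ranked = df.select($"patient_id", $"facility_id", $"rank_distance".cast("int"))
  .collect()
  .groupBy(_.getString(0))
  .mapValues(_.sortBy(_.getInt(2)).map(_.getString(1)).toSeq)

val assignment = ranked.toSeq.sortBy(_._1.drop(1).toInt).flatMap { case (patient, facs) =>
  facs.find(f => capacity.getOrElse(f, 0) > 0).map { f =>
    capacity(f) -= 1              // consume one slot of the chosen facility
    patient -> f
  }
}.toMap
// e.g. assignment("P1") == "FAC002", assignment("P5") == "FAC005"

Patients whose every candidate facility is already full get no entry in this sketch; how that case should be handled is still open.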

Please check below code.
The idea: extract each patient's records from the DataFrame and check the capacity condition against the previously extracted records; if the condition is false, skip that facility and try the next rank, otherwise union that patient's records with the previously extracted records.
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  import spark.implicits._
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.expressions._

  val df = Seq(
      ("LOC0001","P1","FAC003","54","2"), ("LOC0001","P1","FAC002","45","1"), ("LOC0001","P2","FAC003","54","2"),
      ("LOC0001","P2","FAC002","45","1"), ("LOC0010","P3","FAC006","12","1"), ("LOC0010","P3","FAC003","54","4"),
      ("LOC0010","P3","FAC005","23","2"), ("LOC0010","P3","FAC002","45","3"), ("LOC0010","P4","FAC002","45","3"),
      ("LOC0010","P4","FAC005","23","2"), ("LOC0010","P4","FAC003","54","4"), ("LOC0010","P4","FAC006","12","1"),
      ("LOC0010","P5","FAC006","12","1"), ("LOC0010","P5","FAC002","45","3"), ("LOC0010","P5","FAC005","23","2"),
      ("LOC0010","P5","FAC003","54","4"), ("LOC0010","P6","FAC006","12","1"), ("LOC0010","P6","FAC005","23","2"),
      ("LOC0010","P6","FAC002","45","3"), ("LOC0010","P6","FAC003","54","4"), ("LOC0043","P7","FAC004","42","1"),
      ("LOC0054","P8","FAC002","24","2"), ("LOC0054","P8","FAC006","12","1"), ("LOC0054","P8","FAC005","76","3"),
      ("LOC0054","P8","FAC007","100","4"), ("LOC0065","P9","FAC006","32","1"), ("LOC0065","P9","FAC005","54","2"),
      ("LOC0065","P10","FAC006","32","1"), ("LOC0065","P10","FAC005","54","2"))
    .toDF("location_id","patient_id","facility_id","distance","rank_distance")
    .withColumn("facility_new", first($"facility_id").over(Window.partitionBy($"patient_id").orderBy($"rank_distance".asc)))
    .orderBy(substring($"patient_id",2,3).cast("int").asc)

  // Taking all patients into a collection, in numeric order of the patient id.
  val patients = df.select("patient_id").distinct
    .orderBy(substring($"patient_id",2,3).cast("int").asc)
    .collect.map(_.getAs[String](0))

  // Facility ids and their maximum allowed capacity, used for the capacity check.
  val facilities = Map("FAC004" -> 2, "FAC003" -> 3, "FAC007" -> 2, "FAC002" -> 2, "FAC006" -> 2, "FAC005" -> 2)

  // newDF accumulates the already assigned records, oldDF keeps the original data.
  case class Config(newDF: DataFrame, oldDF: DataFrame, facilities: Map[String,Int])

  // Getting a facility id that has not been assigned yet.
  def fetchFacilityId(config: Config) = {
    config.newDF.select("facility_id").distinct.orderBy(substring($"facility_id",5,6).cast("int").asc)
      .except(config.oldDF.select("facility_new").distinct.orderBy(substring($"facility_new",5,6).cast("int").asc))
      .head.getAs[String](0)
  }

  // Finding the facility id for the given patient and rank, together with a flag
  // telling whether that facility still has capacity left.
  def findFacilityId(config: Config, patientId: String, index: Int): (Boolean, String, Int) = {
    val max_distance = config.oldDF.filter($"patient_id" === patientId)
      .select("rank_distance").orderBy($"rank_distance".desc).head.getAs[String](0).toInt
    (index < max_distance) match {
      case true =>
        val fac = config.oldDF.filter($"patient_id" === patientId && $"rank_distance" === index)
          .select("facility_id").distinct.map(_.getAs[String](0)).collect.head
        val bool = config.newDF.filter($"facility_new" === lit(fac) && $"rank_distance" === index)
          .select("patient_id").distinct.count < config.facilities(fac)
        (bool, fac, index)
      case false => (true, fetchFacilityId(config), 0)
    }
  }

  // Assigning the facility for a patient; if the facility at this rank is already
  // at capacity, recurse with the next rank.
  def process(config: Config, patientId: String, index: Int): DataFrame = findFacilityId(config, patientId, index) match {
    case (true, fac, _)   => config.newDF.union(config.oldDF.filter($"patient_id" === patientId).withColumn("facility_new", lit(fac)))
    case (false, _, rank) => process(config, patientId, index + 1)
  }

  val config = Config(newDF = df.limit(0), oldDF = df, facilities = facilities)
  val updatedDF = patients.foldLeft(config) { case (cfg, patientId) =>
    cfg.newDF.count match {
      case 0L => cfg.copy(newDF = cfg.newDF.union(cfg.oldDF.filter($"patient_id" === patientId)))
      case _  => cfg.copy(newDF = process(cfg, patientId, 1))
    }
  }.newDF.drop("facility_id").withColumnRenamed("facility_new", "facility_id")

  updatedDF.show(31, false)
}
// Exiting paste mode, now interpreting.
+-----------+----------+--------+-------------+-----------+
|location_id|patient_id|distance|rank_distance|facility_id|
+-----------+----------+--------+-------------+-----------+
|LOC0001 |P1 |45 |1 |FAC002 |
|LOC0001 |P1 |54 |2 |FAC002 |
|LOC0001 |P2 |54 |2 |FAC002 |
|LOC0001 |P2 |45 |1 |FAC002 |
|LOC0010 |P3 |12 |1 |FAC006 |
|LOC0010 |P3 |54 |4 |FAC006 |
|LOC0010 |P3 |23 |2 |FAC006 |
|LOC0010 |P3 |45 |3 |FAC006 |
|LOC0010 |P4 |45 |3 |FAC006 |
|LOC0010 |P4 |23 |2 |FAC006 |
|LOC0010 |P4 |54 |4 |FAC006 |
|LOC0010 |P4 |12 |1 |FAC006 |
|LOC0010 |P5 |12 |1 |FAC005 |
|LOC0010 |P5 |45 |3 |FAC005 |
|LOC0010 |P5 |23 |2 |FAC005 |
|LOC0010 |P5 |54 |4 |FAC005 |
|LOC0010 |P6 |12 |1 |FAC005 |
|LOC0010 |P6 |23 |2 |FAC005 |
|LOC0010 |P6 |45 |3 |FAC005 |
|LOC0010 |P6 |54 |4 |FAC005 |
|LOC0043 |P7 |42 |1 |FAC003 |
|LOC0054 |P8 |24 |2 |FAC003 |
|LOC0054 |P8 |12 |1 |FAC003 |
|LOC0054 |P8 |76 |3 |FAC003 |
|LOC0054 |P8 |100 |4 |FAC003 |
|LOC0065 |P9 |32 |1 |FAC003 |
|LOC0065 |P9 |54 |2 |FAC003 |
|LOC0065 |P10 |32 |1 |FAC003 |
|LOC0065 |P10 |54 |2 |FAC003 |
+-----------+----------+--------+-------------+-----------+
Time taken: 31009 ms
scala>

Related

How can I make a unique match when joining two Spark DataFrames with different columns?

I have two Spark DataFrames (Scala):
First:
+-------------------+------------------+-----------------+----------+-----------------+
|id |zone |zone_father |father_id |country |
+-------------------+------------------+-----------------+----------+-----------------+
|2 |1 |123 |1 |0 |
|2 |2 |123 |1 |0 |
|3 |3 |1 |2 |0 |
|2 |4 |123 |1 |0 |
|3 |5 |2 |2 |0 |
|3 |6 |4 |2 |0 |
|3 |7 |19 |2 |0 |
+-------------------+------------------+-----------------+----------+-----------------+
Second:
+-------------------+------------------+-----------------+-----------------+
|country |id |zone |zone_value |
+-------------------+------------------+-----------------+-----------------+
|0 |2 |1 |7 |
|0 |2 |2 |7 |
|0 |2 |4 |8 |
|0 |0 |0 |2 |
+-------------------+------------------+-----------------+-----------------+
Then I need the following logic:
1 -> If => first.id = second.id && first.zone = second.zone
2 -> Else if => first.father_id = second.id && first.zone_father = second.zone
3 -> If neither of the above is true, fall back to => first.country = second.zone
And the expected result would be:
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
|id |zone |zone_father |father_id |country |zone_value |
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
|2 |1 |123 |1 |0 |7 |
|2 |2 |123 |1 |0 |7 |
|3 |3 |1 |2 |0 |7 |
|2 |4 |123 |1 |0 |8 |
|3 |5 |2 |2 |0 |7 |
|3 |6 |4 |2 |0 |8 |
|3 |7 |19 |2 |0 |2 |
+-------------------+------------------+-----------------+----------+-----------------+-----------------+
I tried to join both dataframes, but because of the "or" operation two results are returned for each row, since the last condition is true regardless of the result of the other two.
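One possible way to express this priority, sketched here only (firstDF, secondDF, _row_id, _rn and priority are placeholder names I am introducing, not columns from the question): join once on the union of the three conditions, record which rule matched, and keep just the best-priority match per original row.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Tag every row of the first frame so we can later pick one match per original row.
val left = firstDF.withColumn("_row_id", monotonically_increasing_id())

val joined = left.join(secondDF,
    (left("id") === secondDF("id") && left("zone") === secondDF("zone")) ||
    (left("father_id") === secondDF("id") && left("zone_father") === secondDF("zone")) ||
    (left("country") === secondDF("zone")),
    "left")
  .withColumn("priority",
    when(left("id") === secondDF("id") && left("zone") === secondDF("zone"), 1)
      .when(left("father_id") === secondDF("id") && left("zone_father") === secondDF("zone"), 2)
      .otherwise(3))

// Keep only the highest-priority (lowest number) match for each original row.
val w = Window.partitionBy(col("_row_id")).orderBy(col("priority"))
val result = joined
  .withColumn("_rn", row_number().over(w))
  .filter(col("_rn") === 1)
  .select(left("id"), left("zone"), left("zone_father"), left("father_id"),
    left("country"), secondDF("zone_value"))

If two matches could ever share the same priority for one row, the orderBy would also need a tie-break column.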

Group by per day in PySpark

I have a PySpark DataFrame:
+-------+-----+-----+----------+
|From id|To id|Price|Date      |
+-------+-----+-----+----------+
|a      |b    |20   |30/05/2019|
|b      |c    |5    |30/05/2019|
|c      |a    |20   |30/05/2019|
|a      |d    |10   |02/06/2019|
|d      |c    |5    |02/06/2019|
+-------+-----+-----+----------+
and a second one with the names:
+---+--------+
|id |Name    |
+---+--------+
|a  |Claudia |
|b  |Manuella|
|c  |remy    |
|d  |Paul    |
+---+--------+
The output that I want is:
+----------+--------+---------------+
|Date      |Name    |current balance|
+----------+--------+---------------+
|30/05/2019|Claudia |0              |
|30/05/2019|Manuella|15             |
|30/05/2019|Remy    |-15            |
|30/05/2019|Paul    |0              |
|02/06/2019|Claudia |-10            |
|02/06/2019|Manuella|15             |
|02/06/2019|Remy    |-10            |
|02/06/2019|Paul    |5              |
+----------+--------+---------------+
I want to get the current balance for each day for all users.
My idea is to group by user and calculate the sum of the To column minus the From column. But how do I do it per day, especially since the balance is cumulative and not per day?
Thank You
It took a bit of effort to get the requirements right. Here's my version of the solution.
import sys  # sys.maxsize is used below for the unbounded cumulative window
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("a","b",20,"30/05/2019"),
("b","c",5 ,"30/05/2019"),
("c","a",20,"30/05/2019"),
("a","d",10,"02/06/2019"),
("d","c",5 ,"02/06/2019"),
]
df1Columns = ["From_Id", "To_Id", "Price", "Date"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("Date",F.to_date(F.to_timestamp("Date", 'dd/MM/yyyy')).alias('Date'))
print("Actual initial data")
df1.show(truncate=False)
data2 = [
("a","Claudia"),
("b","Manuella"),
("c","Remy"),
("d","Paul"),
]
df2Columns = ["id","Name"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print("Actual initial data")
df2.show(truncate=False)
alldays_df = df1.select("Date").distinct().repartition(20)
allusers_df = df2.select("id").distinct().repartition(10)
crossjoin_df = alldays_df.crossJoin(allusers_df)
crossjoin_df = crossjoin_df.withColumn("initial", F.lit(0))
crossjoin_df = crossjoin_df.withColumnRenamed("id", "common_id").cache()
crossjoin_df.show(n=40, truncate=False)
from_sum_df = df1.groupby("Date", "From_Id").agg(F.sum("Price").alias("from_sum"))
from_sum_df = from_sum_df.withColumnRenamed("From_Id", "common_id")
from_sum_df.show(truncate=False)
from_sum_df = crossjoin_df.alias('cross').join(
from_sum_df.alias('from'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('from.from_sum', 'cross.initial').alias('from_amount') ).cache()
from_sum_df.show(truncate=False)
to_sum_df = df1.groupby("Date", "To_Id").agg(F.sum("Price").alias("to_sum"))
to_sum_df = to_sum_df.withColumnRenamed("To_Id", "common_id")
to_sum_df.show(truncate=False)
to_sum_df = crossjoin_df.alias('cross').join(
to_sum_df.alias('to'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('to.to_sum', 'cross.initial').alias('to_amount') ).cache()
to_sum_df.show(truncate=False)
joined_df = to_sum_df.join(from_sum_df, ["Date", "common_id"], how='inner')
joined_df.show(truncate=False)
balance_df = joined_df.withColumn("balance", F.col("to_amount") - F.col("from_amount"))
balance_df.show(truncate=False)
final_df = balance_df.join(df2, F.col("id") == F.col("common_id"))
final_df.show(truncate=False)
final_cum_sum = final_df.withColumn('cumsum_balance', F.sum('balance').over(Window.partitionBy('common_id').orderBy('Date').rowsBetween(-sys.maxsize, 0)))
final_cum_sum.show()
The following are all the outputs, shown step by step for your understanding; I am not explaining each one, but you can follow them through.
Actual initial data
+-------+-----+-----+----------+
|From_Id|To_Id|Price|Date |
+-------+-----+-----+----------+
|a |b |20 |2019-05-30|
|b |c |5 |2019-05-30|
|c |a |20 |2019-05-30|
|a |d |10 |2019-06-02|
|d |c |5 |2019-06-02|
+-------+-----+-----+----------+
Actual initial data
+---+--------+
|id |Name |
+---+--------+
|a |Claudia |
|b |Manuella|
|c |Remy |
|d |Paul |
+---+--------+
+----------+---------+-------+
|Date |common_id|initial|
+----------+---------+-------+
|2019-05-30|a |0 |
|2019-05-30|d |0 |
|2019-05-30|b |0 |
|2019-05-30|c |0 |
|2019-06-02|a |0 |
|2019-06-02|d |0 |
|2019-06-02|b |0 |
|2019-06-02|c |0 |
+----------+---------+-------+
+----------+---------+--------+
|Date |common_id|from_sum|
+----------+---------+--------+
|2019-06-02|a |10 |
|2019-05-30|a |20 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+--------+
+----------+---------+-----------+
|Date |common_id|from_amount|
+----------+---------+-----------+
|2019-06-02|a |10 |
|2019-06-02|c |0 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+-----------+
+----------+---------+------+
|Date |common_id|to_sum|
+----------+---------+------+
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+------+
+----------+---------+---------+
|Date |common_id|to_amount|
+----------+---------+---------+
|2019-06-02|a |0 |
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+---------+
+----------+---------+---------+-----------+
|Date |common_id|to_amount|from_amount|
+----------+---------+---------+-----------+
|2019-06-02|a |0 |10 |
|2019-06-02|c |5 |0 |
|2019-05-30|a |20 |20 |
|2019-05-30|d |0 |0 |
|2019-06-02|b |0 |0 |
|2019-06-02|d |10 |5 |
|2019-05-30|c |5 |20 |
|2019-05-30|b |20 |5 |
+----------+---------+---------+-----------+
+----------+---------+---------+-----------+-------+
|Date |common_id|to_amount|from_amount|balance|
+----------+---------+---------+-----------+-------+
|2019-06-02|a |0 |10 |-10 |
|2019-06-02|c |5 |0 |5 |
|2019-05-30|a |20 |20 |0 |
|2019-05-30|d |0 |0 |0 |
|2019-06-02|b |0 |0 |0 |
|2019-06-02|d |10 |5 |5 |
|2019-05-30|c |5 |20 |-15 |
|2019-05-30|b |20 |5 |15 |
+----------+---------+---------+-----------+-------+
+----------+---------+---------+-----------+-------+---+--------+
|Date |common_id|to_amount|from_amount|balance|id |Name |
+----------+---------+---------+-----------+-------+---+--------+
|2019-05-30|a |20 |20 |0 |a |Claudia |
|2019-06-02|a |0 |10 |-10 |a |Claudia |
|2019-05-30|b |20 |5 |15 |b |Manuella|
|2019-06-02|b |0 |0 |0 |b |Manuella|
|2019-05-30|c |5 |20 |-15 |c |Remy |
|2019-06-02|c |5 |0 |5 |c |Remy |
|2019-06-02|d |10 |5 |5 |d |Paul |
|2019-05-30|d |0 |0 |0 |d |Paul |
+----------+---------+---------+-----------+-------+---+--------+
+----------+---------+---------+-----------+-------+---+--------+--------------+
| Date|common_id|to_amount|from_amount|balance| id| Name|cumsum_balance|
+----------+---------+---------+-----------+-------+---+--------+--------------+
|2019-05-30| d| 0| 0| 0| d| Paul| 0|
|2019-06-02| d| 10| 5| 5| d| Paul| 5|
|2019-05-30| c| 5| 20| -15| c| Remy| -15|
|2019-06-02| c| 5| 0| 5| c| Remy| -10|
|2019-05-30| b| 20| 5| 15| b|Manuella| 15|
|2019-06-02| b| 0| 0| 0| b|Manuella| 15|
|2019-05-30| a| 20| 20| 0| a| Claudia| 0|
|2019-06-02| a| 0| 10| -10| a| Claudia| -10|
+----------+---------+---------+-----------+-------+---+--------+--------------+

PySpark: Creating a column with number of timesteps to an event

I have a dataframe that looks as follows:
|id |val1|val2|
+---+----+----+
|1 |1 |0 |
|1 |2 |0 |
|1 |3 |0 |
|1 |4 |0 |
|1 |5 |5 |
|1 |6 |0 |
|1 |7 |0 |
|1 |8 |0 |
|1 |9 |9 |
|1 |10 |0 |
|1 |11 |0 |
|2 |1 |0 |
|2 |2 |0 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |0 |
|2 |6 |6 |
|2 |7 |0 |
|2 |8 |8 |
|2 |9 |0 |
+---+----+----+
only showing top 20 rows
I want to create a new column with the number of rows until a non-zero value appears in val2; this should be done with a groupBy/partitionBy on 'id'. If the event never happens, I need to put -1 in the steps field.
|id |val1|val2|steps|
+---+----+----+----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 | event
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 | event
|1 |10 |0 |-1 | no further events for this id
|1 |11 |0 |-1 | no further events for this id
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 | event
|2 |7 |0 |1 |
|2 |8 |8 |0 | event
|2 |9 |0 |-1 | no further events for this id
+---+----+----+----+
only showing top 20 rows
Your requirement seems easy, but implementing it in Spark while preserving immutability is a difficult task. I am suggesting that you would need a recursive function to generate the steps column. Below I have tried to suggest a recursive way using a udf function.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

//udf function to populate the steps column
def stepsUdf = udf((values: Seq[Row]) => {
  //sorting the collected structs in descending order of the val1 column
  val val12 = values.sortWith(_.getAs[Int]("val1") > _.getAs[Int]("val1"))
  //selecting the first of the sorted list
  val val12Head = val12.head
  //generating the steps value for the first element of the sorted list
  val prevStep = if(val12Head.getAs("val2") != 0) 0 else -1
  //generating the first output struct
  val listSteps = List(steps(val12Head.getAs("val1"), val12Head.getAs("val2"), prevStep))
  //recursive function for generating the steps column
  def recursiveSteps(vals: List[Row], previousStep: Int, listStep: List[steps]): List[steps] = vals match {
    case x :: y =>
      //an event occurred, so the steps column should be 0
      if(x.getAs("val2") != 0) {
        recursiveSteps(y, 0, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), 0))
      }
      //no further event after the last event change
      else if(x.getAs("val2") == 0 && previousStep == -1) {
        recursiveSteps(y, previousStep, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), previousStep))
      }
      //val2 is 0 after an event change, so increment the steps column
      else {
        recursiveSteps(y, previousStep+1, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), previousStep+1))
      }
    case Nil => listStep
  }
  //calling the recursive function
  recursiveSteps(val12.tail.toList, prevStep, listSteps)
})
df
  .groupBy("id")                                                     // grouping by the id column
  .agg(stepsUdf(collect_list(struct("val1", "val2"))).as("stepped")) // calling the udf on the collected structs of val1 and val2
  .withColumn("stepped", explode(col("stepped")))                    // generating rows from the list returned by the udf
  .select(col("id"), col("stepped.*"))                               // final desired output
  .sort("id", "val1")                                                // optional step, just for viewing
  .show(false)
where steps is a case class
case class steps(val1: Int, val2: Int, steps: Int)
which should give you
+---+----+----+-----+
|id |val1|val2|steps|
+---+----+----+-----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 |
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 |
|1 |10 |0 |-1 |
|1 |11 |0 |-1 |
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 |
|2 |7 |0 |1 |
|2 |8 |8 |0 |
|2 |9 |0 |-1 |
+---+----+----+-----+
I hope the answer is helpful

PySpark - Populate missing values with values other than the last known non-null value

I have data as such:
PeopleCountTestSchema = StructType([
    StructField("building", StringType(), True),
    StructField("date_created", StringType(), True),
    StructField("hour", StringType(), True),
    StructField("wirelesscount", StringType(), True),
    StructField("rundate", StringType(), True)])

df = spark.read.csv("wasb://reftest#refdev.blob.core.windows.net/Praneeth/HVAC/PeopleCount_test/",
                    schema=PeopleCountTestSchema, sep=",")
df.createOrReplaceTempView('Test')
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |null |
|36 |2017-01-10 |0 |null |
|36 |2017-01-11 |0 |null |
|36 |2017-01-12 |0 |null |
|36 |2017-01-13 |0 |null |
This needs to be transformed into:
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |35 |
|36 |2017-01-10 |0 |46 |
|36 |2017-01-11 |0 |32 |
|36 |2017-01-12 |0 |90 |
|36 |2017-01-13 |0 |33 |
The current null values need to be replaced by the 7th previous value.
I tried using:
Test2 = df.withColumn("wirelesscount2", last('wirelesscount', True).over(Window.partitionBy('building','hour').orderBy('hour').rowsBetween(-sys.maxsize, -7)))
the resulting output is
|building|date_created|hour|wirelesscount|rundate |wirelesscount2|
+--------+------------+----+-------------+----------+--------------+
|36 |2017-01-02 |0 |35 |2017-04-01|null |
|36 |2017-01-03 |0 |46 |2017-04-01|null |
|36 |2017-01-04 |0 |32 |2017-04-01|null |
|36 |2017-01-05 |0 |90 |2017-04-01|null |
|36 |2017-01-06 |0 |33 |2017-04-01|null |
|36 |2017-01-07 |0 |22 |2017-04-01|null |
|36 |2017-01-08 |0 |11 |2017-04-01|null |
|36 |2017-01-09 |0 |null |2017-04-01|35 |
|36 |2017-01-10 |0 |null |2017-04-01|46 |
|36 |2017-01-11 |0 |null |2017-04-01|32 |
|36 |2017-01-12 |0 |null |2017-04-01|90 |
|36 |2017-01-13 |0 |null |2017-04-01|33 |
The null values are being populated with the 7th previous value, but the first 7 rows of wirelesscount2 are now null.
Please let me know how this can be handled.
Thanks in advance!
You can do it with coalesce: cast both columns to integer, then keep the original wirelesscount where it is not null and fall back to wirelesscount2 otherwise.
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
Test2 = Test2.withColumn('wirelesscount', Test2.wirelesscount.cast('integer'))
Test2 = Test2.withColumn('wirelesscount2', Test2.wirelesscount2.cast('integer'))
test3 = Test2.withColumn('wirelesscount3', coalesce(Test2.wirelesscount, Test2.wirelesscount2))
test3.show()

How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
Given that you have a dataframe as
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
You can use Window functions by doing the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
The result is similar for collect_set as well, but the elements in the resulting set will not be in order, unlike with collect_list.
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
If you remove the orderBy, as below,
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
the result would be:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
I hope the answer is helpful
The existing answer is valid; just adding here a different style of writing window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, collect_set}
import spark.implicits._ // for the $"..." column syntax
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)
df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user)
  .show(false)