I have a piece of Scala code that takes the count of signals at 3 different stages with respect to an id_no and an identifier.
The output of the code is as shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |25 |23 |20 |45 |43 |40 |66 |60 |55 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |78 |70 |68 |15 |14 |14 |10 |10 |9 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |88 |88 |86 |75 |73 |70 |16 |13 |13 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
There will be more than 100 signals, so the number of columns will be more than 300.
This DataFrame is written to the Delta table location as shown below.
statisticsDf.write.format("delta").option("mergeSchema", "true").mode("append").partitionBy("id_no").save(statsDestFolderPath)
For the next week's data I executed this code again and got the data shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |10 |8 |7 |15 |15 |14 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |10 |10 |9 |16 |15 |15 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |26 |24 |24 |24 |23 |23 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
The output I expect is: if the id_no, identifier and signal name are already present in the table, the new counts should be added to the existing counts; if the id_no, identifier and signal name are new, the row should be added to the final table.
The output I currently receive is shown below, where the data simply gets appended on each run.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |25 |23 |20 |45 |43 |40 |66 |60 |55 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |78 |70 |68 |15 |14 |14 |10 |10 |9 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |88 |88 |86 |75 |73 |70 |16 |13 |13 |2021-08-10T16:58:30.054+0000|
|050 |ident01 |10 |8 |7 |15 |15 |14 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |10 |10 |9 |16 |15 |15 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |26 |24 |24 |24 |23 |23 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
But I am expecting the output as shown below.
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|id_no|identifier|signal01_total|signal01_without_NaN|signal01_total_valid|signal02_total|signal02_without_NaN|signal02_total_valid|signal03_total|signal03_without_NaN|signal03_total_valid|load_timestamp |
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
|050 |ident01 |35 |31 |27 |60 |58 |54 |38 |38 |37 |2021-08-10T16:58:30.054+0000|
|051 |ident01 |88 |80 |77 |31 |29 |19 |30 |30 |30 |2021-08-10T16:58:30.054+0000|
|052 |ident01 |114 |102 |110 |99 |96 |93 |40 |38 |36 |2021-08-10T16:58:30.054+0000|
|053 |ident01 |25 |24 |23 |20 |19 |19 |25 |25 |24 |2021-08-10T16:58:30.054+0000|
+-----+----------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------+--------------------+--------------------+----------------------------+
I got a hint to use the merge (upsert) command, as below.
val updatesDF = ...  // define the updates DataFrame[id_no, identifier, sig01_total, sig01_NaN, sig01_final, sig02_total, .......]

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  .whenMatched
  .updateExpr(
    Map(
      "sig01_total" -> "updates.sig01_total",
      ........))
  .whenNotMatched
  .insertExpr(
    Map(
      "id_no" -> "updates.id_no",
      "identifier" -> "updates.identifier",
      "sig01_total" -> "updates.sig01_total",
      ........))
  .execute()
But in my case the number of columns may vary each time: if a new signal is added for an id, it has to be added as well. And if one of the signals for an existing id is not available in the current week's run, that signal's value alone should stay the same while the rest are updated.
Is there any option to achieve this requirement using Delta table merge, by updating the above code, or in any other way?
Any leads appreciated!
The use case mentioned in the question needs an upsert operation.
You can follow the Databricks documentation for the upsert (merge) operation and write your own logic for it.
You can control when to insert and when to update based on an expression.
Reference link:
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
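To handle the varying set of signal columns, one possible approach is to build the merge expressions dynamically from the column lists, so matched rows get the weekly counts added to the stored counts and new id_no/identifier combinations are inserted as-is. The following is only a rough sketch, reusing spark, updatesDF and statsDestFolderPath from the question; the helper names (updateMap, insertMap, nonSignalCols) are for illustration and this is not a tested solution.

import io.delta.tables.DeltaTable

// Existing Delta table written by the statistics job (path reused from the question).
val deltaTable = DeltaTable.forPath(spark, statsDestFolderPath)
val existingCols = deltaTable.toDF.columns.toSet

// Columns that are keys or metadata, not signal counters.
val nonSignalCols = Set("id_no", "identifier", "load_timestamp")

// Signal columns present in this week's run that already exist in the table:
// add the new counts to the stored counts (coalesce guards against NULLs).
// Signals missing from the current week are simply not in this map, so their
// existing values stay unchanged.
val updateMap = updatesDF.columns
  .filterNot(nonSignalCols.contains)
  .filter(existingCols.contains)
  .map(c => c -> s"coalesce(events.`$c`, 0) + coalesce(updates.`$c`, 0)")
  .toMap ++ Map("load_timestamp" -> "updates.load_timestamp")

// For brand-new id_no/identifier combinations, insert every incoming column.
val insertMap = updatesDF.columns.map(c => c -> s"updates.`$c`").toMap

deltaTable.as("events")
  .merge(
    updatesDF.as("updates"),
    "events.id_no = updates.id_no AND events.identifier = updates.identifier")
  .whenMatched
  .updateExpr(updateMap)
  .whenNotMatched
  .insertExpr(insertMap)
  .execute()

Signal columns that are completely new (present in updatesDF but not yet in the table) would additionally need schema evolution for merge, e.g. setting spark.databricks.delta.schema.autoMerge.enabled to true, or a separate step that adds the columns before merging; whether that fits your Delta version is worth verifying.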
I have a PySpark DataFrame:
+-------+-----+-----+----------+
|From id|To id|Price|Date      |
+-------+-----+-----+----------+
|a      |b    |20   |30/05/2019|
|b      |c    |5    |30/05/2019|
|c      |a    |20   |30/05/2019|
|a      |d    |10   |02/06/2019|
|d      |c    |5    |02/06/2019|
+-------+-----+-----+----------+
and a second DataFrame mapping ids to names:
+---+--------+
|id |Name    |
+---+--------+
|a  |Claudia |
|b  |Manuella|
|c  |remy    |
|d  |Paul    |
+---+--------+
The output that I want is:
+----------+--------+---------------+
|Date      |Name    |current balance|
+----------+--------+---------------+
|30/05/2019|Claudia |0              |
|30/05/2019|Manuella|15             |
|30/05/2019|Remy    |-15            |
|30/05/2019|Paul    |0              |
|02/06/2019|Claudia |-10            |
|02/06/2019|Manuella|15             |
|02/06/2019|Remy    |-10            |
|02/06/2019|Paul    |5              |
+----------+--------+---------------+
I want to get the current balance for each day for all users.
My idea is to group by user and calculate the sum of the To column minus the sum of the From column. But how do I do that per day, especially since it is cumulative and not per day?
Thank you
It took a bit of effort to get the requirements right. Here's my version of the solution.
import sys

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("a","b",20,"30/05/2019"),
("b","c",5 ,"30/05/2019"),
("c","a",20,"30/05/2019"),
("a","d",10,"02/06/2019"),
("d","c",5 ,"02/06/2019"),
]
df1Columns = ["From_Id", "To_Id", "Price", "Date"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("Date",F.to_date(F.to_timestamp("Date", 'dd/MM/yyyy')).alias('Date'))
print("Actual initial data")
df1.show(truncate=False)
data2 = [
("a","Claudia"),
("b","Manuella"),
("c","Remy"),
("d","Paul"),
]
df2Columns = ["id","Name"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print("Actual initial data")
df2.show(truncate=False)
alldays_df = df1.select("Date").distinct().repartition(20)
allusers_df = df2.select("id").distinct().repartition(10)
crossjoin_df = alldays_df.crossJoin(allusers_df)
crossjoin_df = crossjoin_df.withColumn("initial", F.lit(0))
crossjoin_df = crossjoin_df.withColumnRenamed("id", "common_id").cache()
crossjoin_df.show(n=40, truncate=False)
from_sum_df = df1.groupby("Date", "From_Id").agg(F.sum("Price").alias("from_sum"))
from_sum_df = from_sum_df.withColumnRenamed("From_Id", "common_id")
from_sum_df.show(truncate=False)
from_sum_df = crossjoin_df.alias('cross').join(
from_sum_df.alias('from'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('from.from_sum', 'cross.initial').alias('from_amount') ).cache()
from_sum_df.show(truncate=False)
to_sum_df = df1.groupby("Date", "To_Id").agg(F.sum("Price").alias("to_sum"))
to_sum_df = to_sum_df.withColumnRenamed("To_Id", "common_id")
to_sum_df.show(truncate=False)
to_sum_df = crossjoin_df.alias('cross').join(
to_sum_df.alias('to'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('to.to_sum', 'cross.initial').alias('to_amount') ).cache()
to_sum_df.show(truncate=False)
joined_df = to_sum_df.join(from_sum_df, ["Date", "common_id"], how='inner')
joined_df.show(truncate=False)
balance_df = joined_df.withColumn("balance", F.col("to_amount") - F.col("from_amount"))
balance_df.show(truncate=False)
final_df = balance_df.join(df2, F.col("id") == F.col("common_id"))
final_df.show(truncate=False)
final_cum_sum = final_df.withColumn('cumsum_balance', F.sum('balance').over(Window.partitionBy('common_id').orderBy('Date').rowsBetween(-sys.maxsize, 0)))
final_cum_sum.show()
Following are all the outputs for your progressive understanding. I am not explaining the steps. You can figure them out.
Actual initial data
+-------+-----+-----+----------+
|From_Id|To_Id|Price|Date |
+-------+-----+-----+----------+
|a |b |20 |2019-05-30|
|b |c |5 |2019-05-30|
|c |a |20 |2019-05-30|
|a |d |10 |2019-06-02|
|d |c |5 |2019-06-02|
+-------+-----+-----+----------+
Actual initial data
+---+--------+
|id |Name |
+---+--------+
|a |Claudia |
|b |Manuella|
|c |Remy |
|d |Paul |
+---+--------+
+----------+---------+-------+
|Date |common_id|initial|
+----------+---------+-------+
|2019-05-30|a |0 |
|2019-05-30|d |0 |
|2019-05-30|b |0 |
|2019-05-30|c |0 |
|2019-06-02|a |0 |
|2019-06-02|d |0 |
|2019-06-02|b |0 |
|2019-06-02|c |0 |
+----------+---------+-------+
+----------+---------+--------+
|Date |common_id|from_sum|
+----------+---------+--------+
|2019-06-02|a |10 |
|2019-05-30|a |20 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+--------+
+----------+---------+-----------+
|Date |common_id|from_amount|
+----------+---------+-----------+
|2019-06-02|a |10 |
|2019-06-02|c |0 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+-----------+
+----------+---------+------+
|Date |common_id|to_sum|
+----------+---------+------+
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+------+
+----------+---------+---------+
|Date |common_id|to_amount|
+----------+---------+---------+
|2019-06-02|a |0 |
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+---------+
+----------+---------+---------+-----------+
|Date |common_id|to_amount|from_amount|
+----------+---------+---------+-----------+
|2019-06-02|a |0 |10 |
|2019-06-02|c |5 |0 |
|2019-05-30|a |20 |20 |
|2019-05-30|d |0 |0 |
|2019-06-02|b |0 |0 |
|2019-06-02|d |10 |5 |
|2019-05-30|c |5 |20 |
|2019-05-30|b |20 |5 |
+----------+---------+---------+-----------+
+----------+---------+---------+-----------+-------+
|Date |common_id|to_amount|from_amount|balance|
+----------+---------+---------+-----------+-------+
|2019-06-02|a |0 |10 |-10 |
|2019-06-02|c |5 |0 |5 |
|2019-05-30|a |20 |20 |0 |
|2019-05-30|d |0 |0 |0 |
|2019-06-02|b |0 |0 |0 |
|2019-06-02|d |10 |5 |5 |
|2019-05-30|c |5 |20 |-15 |
|2019-05-30|b |20 |5 |15 |
+----------+---------+---------+-----------+-------+
+----------+---------+---------+-----------+-------+---+--------+
|Date |common_id|to_amount|from_amount|balance|id |Name |
+----------+---------+---------+-----------+-------+---+--------+
|2019-05-30|a |20 |20 |0 |a |Claudia |
|2019-06-02|a |0 |10 |-10 |a |Claudia |
|2019-05-30|b |20 |5 |15 |b |Manuella|
|2019-06-02|b |0 |0 |0 |b |Manuella|
|2019-05-30|c |5 |20 |-15 |c |Remy |
|2019-06-02|c |5 |0 |5 |c |Remy |
|2019-06-02|d |10 |5 |5 |d |Paul |
|2019-05-30|d |0 |0 |0 |d |Paul |
+----------+---------+---------+-----------+-------+---+--------+
+----------+---------+---------+-----------+-------+---+--------+--------------+
| Date|common_id|to_amount|from_amount|balance| id| Name|cumsum_balance|
+----------+---------+---------+-----------+-------+---+--------+--------------+
|2019-05-30| d| 0| 0| 0| d| Paul| 0|
|2019-06-02| d| 10| 5| 5| d| Paul| 5|
|2019-05-30| c| 5| 20| -15| c| Remy| -15|
|2019-06-02| c| 5| 0| 5| c| Remy| -10|
|2019-05-30| b| 20| 5| 15| b|Manuella| 15|
|2019-06-02| b| 0| 0| 0| b|Manuella| 15|
|2019-05-30| a| 20| 20| 0| a| Claudia| 0|
|2019-06-02| a| 0| 10| -10| a| Claudia| -10|
+----------+---------+---------+-----------+-------+---+--------+--------------+
I am trying to add column values into a narrative text, but I am only able to add one value (the last row's) for every row.
var hashColDf = rowmaxDF.select("min", "max", "Total")
val peopleArray = hashColDf.collect.map(r => Map(hashColDf.columns.zip(r.toSeq): _*))
val comstr = "shyam has max and min not Total"
var mapArrayStr = List[String]()
for (eachrow <- peopleArray) {
  mapArrayStr = mapArrayStr :+ eachrow.foldLeft(comstr)((a, b) => a.replaceAllLiterally(b._1, b._2.toString()))
}
for (eachCol <- mapArrayStr) {
  rowmaxDF = rowmaxDF.withColumn("compCols", lit(eachCol))
}
Source DataFrame:
|max|min|TOTAL|
|3 |1 |4 |
|5 |2 |7 |
|7 |3 |10 |
|8 |4 |12 |
|10 |5 |15 |
|10 |5 |15 |
Actual Result:
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 10 and 5 not 15|
|5 |2 |7 |shyam has 10 and 5 not 15|
|7 |3 |10 |shyam has 10 and 5 not 15|
|8 |4 |12 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
Expected Result:
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 3 and 1 not 4 |
|5 |2 |7 |shyam has 5 and 2 not 7 |
|7 |3 |10 |shyam has 7 and 3 not 10 |
|8 |4 |12 |shyam has 8 and 4 not 12 |
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
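A minimal sketch of one way to get the expected per-row result, assuming rowmaxDF is the source DataFrame shown above: build the sentence from each row's own columns with format_string instead of collecting the values to the driver, so every row uses its own min, max and TOTAL. The column names here follow the printed table header (max, min, TOTAL); adjust the casing to your actual schema.

import org.apache.spark.sql.functions.{col, format_string}

// Each row formats its own values into the narrative text.
// Column names ("max", "min", "TOTAL") follow the table shown above; adjust if needed.
val withNarrative = rowmaxDF.withColumn(
  "compCols",
  format_string("shyam has %s and %s not %s", col("max"), col("min"), col("TOTAL")))

withNarrative.show(false)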
I have data as such:
PeopleCountTestSchema=StructType([StructField("building",StringType(), True),
StructField("date_created",StringType(), True),
StructField("hour",StringType(), True),
StructField("wirelesscount",StringType(), True),
StructField("rundate",StringType(), True)])
df=spark.read.csv("wasb://reftest#refdev.blob.core.windows.net/Praneeth/HVAC/PeopleCount_test/",schema=PeopleCountTestSchema,sep=",")
df.createOrReplaceTempView('Test')
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |null |
|36 |2017-01-10 |0 |null |
|36 |2017-01-11 |0 |null |
|36 |2017-01-12 |0 |null |
|36 |2017-01-13 |0 |null |
This needs to be transformed into:
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |35 |
|36 |2017-01-10 |0 |46 |
|36 |2017-01-11 |0 |32 |
|36 |2017-01-12 |0 |90 |
|36 |2017-01-13 |0 |33 |
The current null values need to be replaced by the value from 7 rows earlier.
I tried using:
Test2 = df.withColumn("wirelesscount2", last('wirelesscount', True).over(Window.partitionBy('building','hour').orderBy('hour').rowsBetween(-sys.maxsize, -7)))
The resulting output is:
|building|date_created|hour|wirelesscount|rundate |wirelesscount2|
+--------+------------+----+-------------+----------+--------------+
|36 |2017-01-02 |0 |35 |2017-04-01|null |
|36 |2017-01-03 |0 |46 |2017-04-01|null |
|36 |2017-01-04 |0 |32 |2017-04-01|null |
|36 |2017-01-05 |0 |90 |2017-04-01|null |
|36 |2017-01-06 |0 |33 |2017-04-01|null |
|36 |2017-01-07 |0 |22 |2017-04-01|null |
|36 |2017-01-08 |0 |11 |2017-04-01|null |
|36 |2017-01-09 |0 |null |2017-04-01|35 |
|36 |2017-01-10 |0 |null |2017-04-01|46 |
|36 |2017-01-11 |0 |null |2017-04-01|32 |
|36 |2017-01-12 |0 |null |2017-04-01|90 |
|36 |2017-01-13 |0 |null |2017-04-01|33 |
The null values are being populated with the value from 7 rows earlier, but the first 7 rows of the new column are becoming null.
Please let me know how this can be handled.
Thanks in advance!
You can do it with coalesce.
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
Test2 = Test2.withColumn('wirelesscount', Test2.wirelesscount.cast('integer'))
Test2 = Test2.withColumn('wirelesscount2', Test2.wirelesscount2.cast('integer'))
test3 = Test2.withColumn('wirelesscount3', coalesce(Test2.wirelesscount, Test2.wirelesscount2))
test3.show()
In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
Given that you have a DataFrame as
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
You can use Window functions by doing the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
The result is similar for collect_set as well, but the elements in the final set will not be in the same order as with collect_list.
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
If you remove the orderBy as below
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
the result would be:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
I hope the answer is helpful
The existing answer is valid; just adding a different style of writing window functions here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, collect_set}

val wind_user = Window.partitionBy("colA", "colA2").orderBy(col("colB"), col("colB2").desc)

df.withColumn("colD_distinct", collect_set(col("colC")) over wind_user)
  .withColumn("colD_historical", collect_list(col("colC")) over wind_user)
  .show(false)
I have a Crystal Report that, when run, looks like the below. The fields are placed in the detail section:
Code|Jan|Feb|Mar|Apr|May|Jun|Jul|
405 |70 |30 |10 |45 |5 |76 |90 |
406 |10 |23 |30 |7 |1 |26 |10 |
488 |20 |30 |60 |7 |5 |44 |10 |
501 |40 |15 |90 |10 |8 |75 |40 |
502 |30 |30 |10 |7 |5 |12 |30 |
600 |60 |16 |50 |7 |9 |75 |20 |
I need to create a formula or a parameter that checks if Code = 501 and then returns the Jun column value of "75" in the report footer section.
I wrote this formula:
whileprintingrecords;
NumberVar COSValue;
If {ds_RevSBU.Code}=501
Then COSValue := {ds_RevSBU.JUN}
Else 0;
If I place this formula within the detail section it works; it gives me the value of 75. How can I get this value in the report footer section?
Please help.
Thank you.
I finally figured out a way, but I'm not sure if it is the correct one. I created the below formula and suppressed it in the detail section:
Global NumberVar COSValue;
If {ds_RevSBU.Code}=501
Then COSValue := {ds_RevSBU.JUN}
Else 0;
Then in the footer section, I created the below formula:
WhileReadingRecords;
Global NumberVar COSValue;
(COSValue * 4.5)/100