I have the below data frame -
+----+-----+---+
| val|count| id|
+----+-----+---+
| a| 10| m1|
| b| 20| m1|
|null| 30| m1|
| b| 30| m2|
| c| 40| m2|
|null| 50| m2|
+----+-----+---+
created by -
val df1=Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
I am trying to create a rank with the help of row_number() and a window function as below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| 3|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| 3|
+----+-----+---+---------+
But I have to leave the records with null values in column val out of the ranking (they should get a NULL rank instead).
Expected output --
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| NULL|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| NULL|
+----+-----+---+---------+
Wondering if this is possible with minimal change. Also, there can be an arbitrary number of values for the columns val and count.
Filter out the rows with a null val, rank the remaining rows, assign the null-val rows a null row number, and union the two back together.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val df1 = Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
df1.filter("val is not null").withColumn(
"rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
df1.filter("val is null").withColumn("rannk_num", lit(null))
).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 30| m1| null|
|null| 50| m2| null|
+----+-----+---+---------+
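An alternative sketch that keeps it to a single pass and preserves the original row order (an assumption on my part: adding the null flag to the partition keys so the null-val rows never influence the ranking):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Rank only within the non-null partition; rows with a null val fall into their
// own partition, and when() without otherwise() leaves their rank as null.
df1.withColumn(
  "rannk_num",
  when(col("val").isNotNull,
    row_number().over(Window.partitionBy(col("id"), col("val").isNull).orderBy("count"))
  )
).show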
I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following CSV datasets:
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
Which I've managed to load as DataFrames (PatientDF and coverTypeDF):
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
How do I generate a bar chart visualization to show the distribution of the different Disease_Type values in my dataset?
How do I generate a bar chart visualization to show the average Post_Code for each cover type, with string labels for the cover type?
How do I extract the year (YYYY) from Infected_Date (represented as Unix seconds since 1/1/1970 UTC), ordering the result in descending order of the year and average age?
To display charts natively with Databricks you need to use the display function on a dataframe. For number one, we can accomplish what you'd like by aggregating the dataframe on disease type.
display(PatientDF.groupBy("Disease_Type").count())
Then you can use the charting options to build a bar chart. You can do the same for your 2nd question, but instead of .count() use .avg("Post_Code"), joining to coverTypeDF to get the string labels (see the sketch below).
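A minimal Scala sketch of that second aggregation, assuming Health_Cover_Type in Patient.csv joins to cover_type_key in coverType.csv (as the sample data suggests):

import org.apache.spark.sql.functions.avg

// Average Post_Code per cover type label; display() then offers the bar chart option.
display(
  PatientDF
    .join(coverTypeDF, PatientDF("Health_Cover_Type") === coverTypeDF("cover_type_key"))
    .groupBy("cover_type_label")
    .agg(avg("Post_Code").alias("avg_post_code"))
)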
For the third question you need to use the year function after casting the epoch seconds to a timestamp, plus an orderBy.
from pyspark.sql.functions import *
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))
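For reference, a sketch of the same idea in Scala that also averages Age per year and sorts descending, assuming Infected_Date holds Unix seconds (casting an integer column to timestamp interprets it as seconds since the epoch):

import org.apache.spark.sql.functions.{avg, col, year}

display(
  PatientDF
    .withColumn("year", year(col("Infected_Date").cast("timestamp")))
    .groupBy("year")
    .agg(avg("Age").alias("avg_age"))
    .orderBy(col("year").desc)
)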
I will explain my problem using the initial dataframe and the one I want to achieve:
val df_997 = Seq[(Int, Int, Int, Int)](
  (1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
  (2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, within each window, when the value of aux is 4, I want to replace FECMVTO with that row's IND_DEF value, for that row and for every following row until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions, as I am very stuck here...
Here's one approach: first left-join the DataFrame with its aux == 4 filtered version, then apply first() over a Window to backfill the nulls with the wanted IND_DEF value per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
(2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
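A possible alternative sketch that avoids the self-join (reusing the imports and df from above): mark the IND_DEF value on the aux == 4 rows with when, then carry it forward with first(..., ignoreNulls = true) over the same running window:

val win2 = Window.partitionBy("policyId").orderBy("FECMVTO").
  rowsBetween(Window.unboundedPreceding, 0)

// def4 stays null until an aux == 4 row appears in the partition
df.
  withColumn("def4", when($"aux" === 4, $"IND_DEF")).
  withColumn("FECMVTO", coalesce(first($"def4", ignoreNulls=true).over(win2), $"FECMVTO")).
  drop("def4").
  show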
I believe the below can be a solution for your issue.

Considering input_df is your input dataframe:

// Step 1 - filter the rows with aux = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"aux" === 4)

// Step 2 - overwrite FECMVTO with the IND_DEF value for those rows,
// keeping the original column order for the union below
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1
  .withColumn("FECMVTO", $"IND_DEF")
  .select("policyId", "FECMVTO", "aux", "IND_DEF")

// Step 3 - remove all the records from Step 1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)

// Step 4 - combine the Step 2 output with the Step 3 output
val final_df = input_df_without_FECMVTO_4.union(only_FECMVTO_4_df2)
I have a tall table which contains up to 10 values per group. How can I transform this table into a wide format, i.e. add 2 columns where each resembles the largest value smaller than or equal to a threshold?
I want to find the maximum per group, but it needs to be smaller than a specified value like:
min(max('value1), lit(5)).over(Window.partitionBy('grouping))
However, min() only works on a column and not on the Scala value which is returned from the inner function.
The problem can be described as:
Seq(Seq(1,2,3,4).max,5).min
Where Seq(1,2,3,4) is returned by the window.
How can I formulate this in Spark SQL?
edit
E.g.
+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
| 1| 1| first|
| 1| 2| second|
| 1| 3| third|
| 1| 4| fourth|
| 1| 7| 7|
| 1| 10| 10|
| 21| 1| first|
| 21| 2| second|
| 21| 3| third|
+--------+-----+---------+
created by
case class MyThing(grouping: Int, value:Int, something:String)
val df = Seq(MyThing(1,1, "first"), MyThing(1,2, "second"), MyThing(1,3, "third"),MyThing(1,4, "fourth"),MyThing(1,7, "7"), MyThing(1,10, "10"),
MyThing(21,1, "first"), MyThing(21,2, "second"), MyThing(21,3, "third")).toDS
Where
df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show
returns
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5| somethingAtLeast6OupToThereshold2 |
+--------+-----+---------+----------------------------+-------------------------+
| 1| 1| first| 10| 10|
| 1| 2| second| 10| 10|
| 1| 3| third| 10| 10|
| 1| 4| fourth| 10| 10|
| 1| 7| 7| 10| 10|
| 1| 10| 10| 10| 10|
| 21| 1| first| 3| 3|
| 21| 2| second| 3| 3|
| 21| 3| third| 3| 3|
+--------+-----+---------+----------------------------+-------------------------+
Instead, I rather would want to formulate:
lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))
But that does not work as max('value) is not a scalar value.
Expected output should look like
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+-------------------------+
| 1| 4| fourth| 4| 7|
| 21| 1| first| 3| NULL|
+--------+-----+---------+----------------------------+-------------------------+
edit2
When trying a pivot
df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping| 1| 2| 3| 4| 7| 10|
+--------+-----+------+-----+------+----+----+
| 1|first|second|third|fourth| 7| 10|
| 21|first|second|third| null|null|null|
+--------+-----+------+-----+------+----+----+
The second part of the problem remains that some columns might not exist or be null.
When aggregating to arrays:
df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping| value| something|
+--------+-------------------+--------------------+
| 1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
| 21| [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+
The values are already next to each other, but the right values need to be selected. This is probably still more efficient than a join or window function.
It would be easier to do this in two separate steps: calculate the max over a Window, and then use when...otherwise on the result to produce min(x, 5):
df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))
EDIT: some example data to clarify this:
val df = Seq((1, 1),(1, 2),(1, 3),(1, 4),(2, 7),(2, 8))
.toDF("grouping", "value1")
df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('result > lit(5), 5).otherwise('result))
.show()
// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// | 1| 1| 4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// | 1| 2| 4|
// | 1| 3| 4|
// | 1| 4| 4|
// | 2| 7| 5| // 5, because Seq(Seq(7,8).max,5).min = 5
// | 2| 8| 5|
// +--------+------+------+
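If you only need one row per group, as in your expected output, a minimal sketch of the same capping done in a plain aggregation with least() (assuming the threshold is still 5):

import org.apache.spark.sql.functions.{least, lit, max}

// One row per grouping; the per-group max is capped at 5.
df.groupBy("grouping")
  .agg(least(max('value1), lit(5)).as("result"))
  .show()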
I have a dataframe as given below, where the last column represents the number of times the user has searched for the location and stay:
| Hanks| Rotterdam| airbnb7| 1|
|Sanders| Rotterdam| airbnb2| 1|
| Hanks| Rotterdam| airbnb2| 3|
| Hanks| Tokyo| airbnb8| 2|
| Larry| Hanoi| | 2|
| Mango| Seoul| airbnb5| 1|
| Larry| Hanoi| airbnb1| 2|
which I want to transform as follows:
| Hanks| Rotterdam| airbnb7| 1| 1|
|Sanders| Rotterdam| airbnb2| 1| 1|
| Hanks| Rotterdam| airbnb2| 3| 2|
| Hanks| Tokyo| airbnb8| 2| 3|
| Larry| Hanoi| | 2| 0|
| Mango| Seoul| airbnb5| 1| 1|
| Larry| Hanoi| airbnb1| 2| 1|
Notice that column 5 represents the index of the unique combination of options (location + stay) that the user selected, e.g.
| Hanks| Rotterdam| airbnb7| 1| 1|
| Hanks| Rotterdam| airbnb2| 3| 2|
| Hanks| Tokyo| airbnb8| 2| 3|
I tried doing this with groupBy/agg, applying a UDF in the agg function as follows.
val df2 = df1.groupBy("User", "clickedDestination", "clickedAirbnb")
.agg(indexUserDetailsUDF(col("clickedAirbnb")) as ("clickedAirbnbIndex"))
And the UDF is as follows:
var cnt = 0
val airbnbClickIndex: (String) => String = (airbnb) => {
  if (airbnb == "") "null"          // return "null" for airbnbClickIndex when airbnb is empty
  else { cnt += 1; cnt.toString() } // otherwise return the incremented value
}
val indexUserDetailsUDF = udf(airbnbClickIndex)
But this is not working. Any input is much appreciated.
Thanks.
Update 1: Daniel's suggestion of dense_rank does the following for a user:
|Meera| Amsterdam| airbnb12| 1| 1|
|Meera| Amsterdam| airbnb2| 1| 2|
|Meera| Amsterdam| airbnb7| 1| 3|
|Meera| Amsterdam| airbnb8| 1| 4|
|Meera| Bangalore| | 1| 5|
|Meera| Bangalore| airbnb11| 1| 6|
|Meera| Bangalore| airbnb8| 1| 7|
|Meera| Hanoi| airbnb1| 2| 8|
|Meera| Hanoi| airbnb2| 1| 9|
|Meera| Hanoi| airbnb7| 1| 10|
|Meera| Mumbai| | 1| 11|
|Meera| Oslo| | 2| 12|
|Meera| Oslo| airbnb8| 1| 13|
|Meera| Paris| | 1| 14|
|Meera| Paris| airbnb11| 1| 15|
|Meera| Paris| airbnb6| 1| 16|
|Meera| Paris| airbnb7| 1| 17|
|Meera| Paris| airbnb8| 2| 18|
|Meera| Rotterdam| airbnb2| 1| 19|
I assumed dense_rank would push the records with empty field values (in this case, an empty 3rd field) to the end. Is this correct?
If I got it right, you probably want a windowed rank. You could try the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("User").orderBy("User", "clickedDestination", "clickedAirbnb")
val result = df.withColumn("clickedAirbnbIndex", dense_rank().over(window))
If needed, you can find some good reading about window functions in Spark here.
Also, the functions package API documentation is very useful.
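Regarding Update 1: dense_rank simply ranks by the orderBy columns, so empty strings sort first rather than last. A sketch of one way to return 0 for the empty clickedAirbnb rows (as in your expected output) while ranking only the non-empty ones, under the assumption that an empty string marks a missing stay:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, lit, when}

// Empty rows fall into their own partition, so they never shift the ranks of real rows.
val win = Window
  .partitionBy(col("User"), col("clickedAirbnb") === "")
  .orderBy("clickedDestination", "clickedAirbnb")

val indexed = df.withColumn(
  "clickedAirbnbIndex",
  when(col("clickedAirbnb") === "", lit(0))
    .otherwise(dense_rank().over(win))
)

Note that dense_rank assigns indexes in (clickedDestination, clickedAirbnb) sort order, not in order of appearance; the sample data has no column from which the original insertion order could be recovered.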