Delta table update if column exists otherwise add as a new column - scala

I have a Delta table as shown below. There is an id column and columns for counts of signals. There will be around 300 signals, so the final Delta table will have approximately 601 columns.
The requirement is to have a single record for each id_no in the final table.
+-----+--------------+-----------------+--------------+-----------------+--------------+-----------------+--------------------+..............+------------------+--------------------+----------------------------+
|id_no|sig01_total |sig01_valid_total|sig02_total |sig02_valid_total|sig03_total |sig03_valid_total|sig03_total_valid |..............|sig300_valid_total|sig300_total_valid |load_timestamp |
+-----+--------------+-----------------+--------------+-----------------+--------------+-----------------+--------------------+..............|------------------+--------------------+----------------------------+
|050 |25 |23 |45 |43 |66 |60 |55 |..............|60 |55 |2021-08-10T16:58:30.054+0000|
|051 |78 |70 |15 |14 |10 |10 |9 |..............|10 |9 |2021-08-10T16:58:30.054+0000|
|052 |88 |88 |75 |73 |16 |13 |13 |..............|13 |13 |2021-08-10T16:58:30.054+0000|
+-----+--------------+-----------------+--------------+-----------------+--------------+-----------------+--------------------+..............+------------------+--------------------+----------------------------+
I could perform the upsert based on id_no using the Delta merge option, as shown below.
targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.id_no = s.id_no")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
This code will run weekly, and the final table has to be updated weekly. Not every signal will be available in every week's data.
But the issue with the merge is that, if the id_no already exists, I also have to check whether each incoming signal column is available in the existing table.
If the signal column exists, I have to add the new count to the existing count.
If the signal column does not exist for that particular id_no, I have to add that signal as a new column to the existing row.
I tried the Delta table upsert command shown below, but since the set of columns is not static for every data load, and considering the huge number of columns, it did not succeed.
DeltaTable.forPath(spark, "/data/events/target/")
  .as("t")
  .merge(
    updatesDF.as("s"),
    "t.id_no = s.id_no")
  .whenMatched
  .updateExpr( // some condition to be applied to check whether the column exists or not?
    Map(
      "sig01_total" -> "t.sig01_total + s.sig01_total"
      // ... one entry per signal column ...
    ))
  .whenNotMatched
  .insertAll()
  .execute()
How can I achieve this requirement? Any leads appreciated!
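One possible lead, as a rough and untested sketch rather than a drop-in solution: since the set of signal columns is only known at load time, the updateExpr map can be built programmatically from updatesDF.columns, and any signal column the target table has never seen can be added up front with ALTER TABLE ... ADD COLUMNS. The path, id_no and load_timestamp come from the question above; the BIGINT type and everything else are assumptions.

import io.delta.tables.DeltaTable

val targetPath = "/data/events/target/"
val target     = DeltaTable.forPath(spark, targetPath)
val targetCols = target.toDF.columns.toSet

// every column of this week's load except the key and the audit column is a signal count
val signalCols = updatesDF.columns.filterNot(Set("id_no", "load_timestamp")).toSeq

// 1. add any signal column the target has never seen (existing rows get NULL there)
val newCols = signalCols.filterNot(targetCols.contains)
if (newCols.nonEmpty) {
  val ddl = newCols.map(c => s"`$c` BIGINT").mkString(", ")
  spark.sql(s"ALTER TABLE delta.`$targetPath` ADD COLUMNS ($ddl)")
}

// 2. build the merge expressions from whatever arrived this week:
//    add the incoming count to the existing one, treating a missing value as 0
val updateMap = signalCols.map(c => c -> s"coalesce(t.`$c`, 0) + coalesce(s.`$c`, 0)").toMap +
  ("load_timestamp" -> "s.load_timestamp")
val insertMap = (signalCols ++ Seq("id_no", "load_timestamp")).map(c => c -> s"s.`$c`").toMap

target.as("t")
  .merge(updatesDF.as("s"), "t.id_no = s.id_no")
  .whenMatched.updateExpr(updateMap)
  .whenNotMatched.insertExpr(insertMap)
  .execute()

Depending on the Delta Lake version, enabling automatic schema evolution (spark.databricks.delta.schema.autoMerge.enabled) may remove the need for the explicit ALTER TABLE step.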

Related

How to get the two nearest values in spark scala DataFrame

Hi everyone, I'm new to Spark Scala. I want to find the nearest values by partition using Spark Scala; for example, in the first group below, value1 (3) lies between 2 and 7 in the value2 column. My input is something like this:
+---+------+------+
|id |value1|value2|
+---+------+------+
|1  |3     |1     |
|1  |3     |2     |
|1  |3     |7     |
|2  |4     |2     |
|2  |4     |3     |
|2  |4     |8     |
|3  |5     |3     |
|3  |5     |6     |
|3  |5     |7     |
|3  |5     |8     |
+---+------+------+
My output should look like this:
+---+------+------+
|id |value1|value2|
+---+------+------+
|1  |3     |2     |
|1  |3     |7     |
|2  |4     |3     |
|2  |4     |8     |
|3  |5     |3     |
|3  |5     |6     |
+---+------+------+
Can someone guide me on how to resolve this, please?
Since you appear to want to learn, instead of providing a code answer I've provided pseudocode and references to allow you to find the answer for yourself.
Group the elements (select id, value1) and aggregate on value2 with collect_list, so that all the value2 values are collected into an array.
Select id and value1, add (concat) value1 to the collect_list array, and sort the array.
Find (array_position) value1 in the array.
Splice the array, retrieving the value before and the value after the result of array_position.
If the array has fewer than 3 elements, do error handling.
Now the last value and the first value of the spliced array are your 'closest numbers'.
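In case it helps to see those steps in one place, here is a rough, untested sketch of that pseudocode. It assumes Spark 2.4+ for the built-in array functions; df and the column names come from the question, everything else is an assumption.

import org.apache.spark.sql.functions._

val result = df
  .groupBy("id", "value1")
  .agg(collect_list("value2").as("values"))                                      // collect all value2 into an array
  .withColumn("values", array_sort(concat(col("values"), array(col("value1"))))) // add value1 and sort
  .withColumn("pos", array_position(col("values"), col("value1")).cast("int"))   // 1-based position of value1
  // the guards below play the role of the error-handling step: a missing neighbour simply becomes null
  .withColumn("before", when(col("pos") > 1, element_at(col("values"), col("pos") - 1)))
  .withColumn("after", when(col("pos") < size(col("values")), element_at(col("values"), col("pos") + 1)))
  .select(col("id"), col("value1"), explode(array(col("before"), col("after"))).as("value2"))
  .where(col("value2").isNotNull)

For the first group in the sample (value1 = 3, value2 = 1, 2, 7) this should yield 2 and 7, matching the expected output.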
You will need window functions for this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, lag, lead, min}

val window = Window
  .partitionBy("id", "value1")
  .orderBy(asc("value2"))
// same grouping but without ordering, so the minimum is taken over the whole group
val groupWindow = Window.partitionBy("id", "value1")
val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2").minus(col("prev")))
  .withColumn("dist_next", col("next").minus(col("value2")))
  .withColumn("min", min(col("dist_prev")).over(groupWindow))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

How can I apply QuantileDiscretizer on a groupBy object in PySpark?

I want to compute Quantiles on a dataframe after grouping it.
This is my sample dataframe:
|id |shop|amount|
|:--|:--:|-----:|
|1  |A   |100   |
|2  |B   |200   |
|3  |A   |125   |
|1  |A   |25    |
|2  |B   |220   |
|3  |A   |110   |
I want to bin the amount into low, medium and high, based on each shop.
So, I would group my dataframe like this:
shop_groups = df.groupBy('shop')
The mistake I made originally was that I applied the QuantileDiscretizer to the whole amount column as is, without grouping it by shop.
How can I do this on the shop_groups?

How to set a dynamic where clause using pyspark

I have a dataset within which there are multiple groups. I have a rank column which incrementally counts each entry per group. An example of this structure is shown below:
+---------+---------+---------+
|equipment|   run_id|run_order|
+---------+---------+---------+
|        1|430032589|        1|
|        1|430332632|        2|
|        1|430563033|        3|
|        1|430785715|        4|
|        1|431368577|        5|
|        1|431672148|        6|
|        2|435497596|        1|
|        1|435522469|        7|
+---------+---------+---------+
Each group (equipment) has a different number of runs. As shown above, equipment 1 has 7 runs whilst equipment 2 has 1 run. I would like to select the first and last n runs per equipment. Selecting the first n runs is straightforward:
df.select("equipment", "run_id").distinct().where(df.run_order <= n).orderBy("equipment").show()
The distinct is in the query because each row is equivalent to a timestep, and each row logs the sensor readings associated with that timestep. There will therefore be many rows with the same equipment, run_id and run_order, which should be preserved in the end result and not aggregated.
As the number of runs is unique to each equipment I can't do an equivalent select query with a where clause (I think) to get the last n runs:
df.select("equipment", "run_id").distinct().where(df.rank >= total_runs - n).orderBy("equipment").show()
I can run a groupBy to get the highest run_order for each equipment
+---------+--------------+
|equipment|max(run_order)|
+---------+--------------+
|        1|             7|
|        2|             1|
+---------+--------------+
But I am unsure whether there is a way to construct a dynamic where clause that works like this, so that I get the last n runs (including all timestep data for each run).
You can add a column of the max rank for each equipment and do a filter based on that column:
from pyspark.sql import functions as F, Window

n = 3
df2 = df.withColumn(
    'max_run',
    F.max('run_order').over(Window.partitionBy('equipment'))
).where(F.col('run_order') >= F.col('max_run') - n)

Casting the Dataframe columns with validation in spark

I need to cast the columns of a data frame, which currently contains all values as strings, to the data types of a defined schema.
While doing the casting, we need to put the corrupt records (values that do not match the target data type) into a separate column.
Example of Dataframe
+---+----------+-----+
|id |name |class|
+---+----------+-----+
|1 |abc |21 |
|2 |bca |32 |
|3 |abab | 4 |
|4 |baba |5a |
|5 |cccca | |
+---+----------+-----+
Json Schema of the file:
{"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}
In this example, row 4 is a corrupt record, because the class column is of type integer and "5a" cannot be cast to it.
So only this record has to be treated as corrupt, not the 5th row (the empty value is allowed because the schema permits null).
Just check whether the value is NOT NULL before casting and NULL after casting:
import org.apache.spark.sql.functions.when

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    when($"class".isNotNull and $"class_integer".isNull, $"class"))
Repeat for each column / cast you need.
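If the target types are known up front (for example, derived from the JSON schema in the question), the per-column repetition can be written as a loop. This is a rough, untested sketch; df and the column names come from the question, while the targetTypes map is an assumption standing in for however you load the schema.

import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// hypothetical map of target types, mirroring the JSON schema above
val targetTypes: Map[String, DataType] =
  Map("id" -> IntegerType, "name" -> StringType, "class" -> IntegerType)

val validated = targetTypes.foldLeft(df) { case (acc, (name, dt)) =>
  acc
    .withColumn(s"${name}_casted", col(name).cast(dt))                    // typed value, null if the cast fails
    .withColumn(s"${name}_corrupted",
      when(col(name).isNotNull and col(name).cast(dt).isNull, col(name))) // raw value kept only when the cast failed
}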

Spark Structured Streaming operations on rows of a single dataframe

In my problem, there is a data stream of information about package delivery coming in. The data consists of "NumberOfPackages", "Action" (which can be either "Loaded", "Delivered" or "In Transit"), and "Driver".
val streamingData = <filtered data frame based on "Loaded" and "Delivered" Action types only>
The goal is to look at number of packages at the moment of loading and at the moment of delivery, and if they are not the same - execute a function that would call a REST service with the parameter of "TrackingId".
The data looks like this:
+----------------+---------+----------+------+
|NumberOfPackages|Action   |TrackingId|Driver|
+----------------+---------+----------+------+
|5               |Loaded   |a         |Alex  |
|5               |Delivered|a         |Alex  |
|8               |Loaded   |b         |James |
|8               |Delivered|b         |James |
|7               |Loaded   |c         |Mark  |
|3               |Delivered|c         |Mark  |
<...more rows in this streaming data frame...>
+----------------+---------+----------+------+
In this case, we see that for the "TrackingId" equal to "c", the number of packages loaded and delivered isn't the same, so this is where we'd need to call the REST API with that "TrackingId".
I would like to combine rows based on "TrackingId", which will always be unique for each trip. If we get the rows combined based on this tracking id, we could have two columns for number of packages, something like "PackagesAtLoadTime" and "PackagesAtDeliveryTime". Then we could compare these two values for each row and filter the dataframe by those which are not equal.
So far I have tried the groupByKey method with the "TrackingId", but I couldn't find a similar example and my experimental attempts weren't successful.
After I figure out how to "merge" the two rows with the same tracking id together and have a column for each corresponding count of packages, I could define a UDF:
def notEqualPackages = udf((packagesLoaded: Int, packagesDelivered: Int) => packagesLoaded!=packagesDelivered)
And use it to filter the rows of the dataframe so that only those with non-matching numbers remain:
streamingData.where(notEqualPackages(streamingData("packagesLoaded"), streamingData("packagesDelivered")))
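For the "merge the rows by TrackingId" part, one rough, untested sketch is a conditional aggregation: grouping by TrackingId and picking the count for each Action produces the two package-count columns in a single row, which can then be compared directly (so the UDF above is no longer needed). streamingData and the column names come from the question; note that a streaming aggregation needs either a watermark or the complete output mode, which is not shown here.

import org.apache.spark.sql.functions.{col, max, when}

val perTrip = streamingData
  .groupBy(col("TrackingId"))
  .agg(
    max(when(col("Action") === "Loaded", col("NumberOfPackages"))).as("PackagesAtLoadTime"),
    max(when(col("Action") === "Delivered", col("NumberOfPackages"))).as("PackagesAtDeliveryTime"))

// trips whose loaded and delivered counts differ; a foreachBatch sink could then
// call the REST service once per TrackingId in this result
val mismatched = perTrip
  .where(col("PackagesAtLoadTime") =!= col("PackagesAtDeliveryTime"))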