Spark lag with default value as another column - scala

Let's say we have
|bin | min  | end  | start |
|1   | 5    | 10   |       |
|2   | 12   | 24   |       |
|3   | 28   | 36   |       |
|4   | 40   | 50   |       |
|5   | null | null |       |
I want to populate start with the previous row's end so that the bin values are continuous. For the first row, which has no previous end, I would like to fill in the current row's min instead. The null row I am considering treating separately.
What lag gives us would be
df.withColumn("start", F.lag(col("end"), 1, ***default_value***).over(orderBy(col("bin"))
|bin | min  | end  | start |
|1   | 5    | 10   | (5 wanted)
|2   | 12   | 24   | 10
|3   | 28   | 36   | 24
|4   | 40   | 50   | 36
|5   | null | null | null
My questions:
1/ What do we put in default_value so that lag takes another column of the current row, in this case min?
2/ Is there a way to treat the null row at the same time, without separating it out? My current plan is to filter the non-null rows, perform the lag, then union back with the null rows. How would the answer differ if the null row were the first (bin 1) or the last (bin 5)?

Use coalesce to get a column value for the first row in a group.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn("start", F.coalesce(F.lag(F.col("end"), 1).over(Window.orderBy("bin")), F.col("min")))
lag doesn't currently support an ignorenulls option, so you might have to separate out the null rows, compute the start column for the non-null rows, and union the data frames back together.
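For the second question, a minimal Scala sketch of that split-lag-union approach could look like the following (it assumes a DataFrame df with the bin, min and end columns shown above; the variable names are illustrative, not from the original answer):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("bin")

// Lag only over the rows that actually have an end value.
val nonNull = df.filter(col("end").isNotNull)
  .withColumn("start", coalesce(lag(col("end"), 1).over(w), col("min")))

// Keep the null rows as they are, adding a null start so the schemas line up for the union.
val nullRows = df.filter(col("end").isNull)
  .withColumn("start", lit(null).cast(df.schema("end").dataType))

val result = nonNull.unionByName(nullRows).orderBy("bin")
With this split it does not matter whether the null row is bin 1 or bin 5: it is filtered out before the lag, so it never shifts the other rows' end values, and the first remaining row falls back to its own min via the coalesce.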

Related

Fixing hierarchy data with table transformation (Hive, scala, spark)

I have a task working with hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for re-establishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1   | 1  | 2        | 1     |
| B1   | 2  | 3        | 2     |
| C1   | 18 | 4        | 3     |
| C2   | 3  | 5        | 3     |
| D1   | 4  | NULL     | 4     |
| D2   | 5  | NULL     | 4     |
| D3   | 10 | 11       | 4     |
| E1   | 11 | NULL     | 5     |
+------+----+----------+-------+
Schematically it forms a tree (the diagram from the original post is omitted here). As you can see, the connections to C1 and D3 are lost.
In order to restore connections, I need to apply the following algorithm for this table:
if for some NAME the ID is not in the PARENTID column (like ID = 18, 10), then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), and take ID and NAME such that the current ID < ID of the node from the LEVEL above.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1   | 1  | 2        | 1     |
| B1   | 2  | 3        | 2     |
| B1   | 2  | 18       | 2     | #
| C1   | 18 | 4        | 3     |
| C2   | 3  | 5        | 3     |
| C2   | 3  | 10       | 3     | #
| D1   | 4  | NULL     | 4     |
| D2   | 5  | NULL     | 4     |
| D3   | 10 | 11       | 4     |
| E1   | 11 | NULL     | 5     |
+------+----+----------+-------+
Rows marked with # are the newly created rows (the updated diagram from the original post is omitted here).
Are there any ideas on how to do this algorithm in spark/scala? Thanks!
You can build a createdRows dataframe from your current dataframe and union it with the current dataframe to obtain your final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVEL) that are not in the PARENTID column. You can use a left anti self-join to do that.
Then you rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then you take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, you apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe containing your initial data:
import org.apache.spark.sql.functions.col

val createdRows = dataframe
  // if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
  .select("LEVEL", "ID")
  .filter(col("LEVEL") > 1) // Remove root node from created rows
  .join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
  // then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
  .withColumnRenamed("ID", "PARENTID")
  .withColumn("LEVEL", col("LEVEL") - 1)
  // and take ID and NAME
  .join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
  // such that the current ID < ID of the node from the LEVEL above.
  .filter(col("ID") < col("PARENTID"))

val result = dataframe
  .unionByName(createdRows)
  .orderBy("NAME", "PARENTID") // Optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
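For reference, here is a quick sketch of how the sample dataframe could be built to test the code above, using the rows from the question (it assumes a SparkSession named spark; the Option[Int] values are just one way to get a nullable PARENTID column):
import spark.implicits._

val dataframe = Seq(
  ("A1", 1, Some(2), 1),
  ("B1", 2, Some(3), 2),
  ("C1", 18, Some(4), 3),
  ("C2", 3, Some(5), 3),
  ("D1", 4, None, 4),
  ("D2", 5, None, 4),
  ("D3", 10, Some(11), 4),
  ("E1", 11, None, 5)
).toDF("NAME", "ID", "PARENTID", "LEVEL")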

Create another col using value of other col

I have a dataframe in which I need to add another col based on the grouping logic.
Dataframe
|id | x_id | y_id | val_id |
|1  | 2    | 3    | 4      |
|10 | 2    | 3    | 40     |
|1  | 12   | 13   | 14     |
I need to add another col parent_id based on this rule:
over the group (x_id, y_id), select the max value in col val_id and use its corresponding id value.
The final frame will look like this:
|id | x_id | y_id | val_id | parent_id
|1  | 2    | 3    | 4      | 10  (coming from row 2)
|10 | 2    | 3    | 40     | 10  (coming from row 2)
|1  | 12   | 13   | 14     | 1
I have tried using withColumn, but I could only set the value on the one row in each group whose id becomes the parent.
Explanation: here parent_id is 10 because it comes from col id. Row 2 was chosen because it has the max value of val_id over the group (x_id, y_id).
I am using scala.
Use a Window partitioned by the ids, sort each partition by val_id in descending order, and take the first id over the window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first
import spark.implicits._ // for the 'colName symbol syntax

val w = Window.partitionBy('x_id, 'y_id).orderBy('val_id.desc)
df.withColumn("parent_id", first('id).over(w))
  .show(false)
The result is:
+---+----+----+------+---------+
|id |x_id|y_id|val_id|parent_id|
+---+----+----+------+---------+
|10 |2 |3 |40 |10 |
|1 |2 |3 |4 |10 |
|1 |12 |13 |14 |1 |
+---+----+----+------+---------+

How to pivot a Table in POSTGRESQL 8.4

I have the following table:
|Type | Year | amount |
________________________
|t1   | 2001 | 40     |
|t1   | 2000 | 50     |
|t2   | 2003 | 30     |
|t2   | 2003 | 20     |
|t3   | 2004 | 10     |
and I would like to show it as:
| type | 2001 | 2000 | 2003 | 2004 |
|___________________________________
| t1   | 40   | 50   | 0    | 0    |
| t2   | 0    | 0    | 50   | 0    |
| t3   | 0    | 0    | 0    | 10   |
I don't want to hard-code the years, and I need to do this in PostgreSQL 8.4, which doesn't support:
CREATE EXTENSION IF NOT EXISTS tablefunc;
I have pivoted a table before, using the following code:
sum(CASE
      WHEN year = 2000 THEN total
      ELSE 0
    END)
Here total = sum(amount) per year, which I had calculated in another CTE. At that time the years were already known, but for the above table I need to loop through the years, read each one, and calculate sum(amount); the years may also change in the main table.

Convert a single row into multiple rows by the columns in Postgresql

I have a table cash_drawer which stores the quantity of each denomination of currency for each day at day end:
cash_drawer(
  date DATE,
  "100" SMALLINT,
  "50" SMALLINT,
  "20" SMALLINT,
  "10" SMALLINT,
  "5" SMALLINT,
  "1" SMALLINT
)
Now, for any given day, I wish to get each denomination as a row.
Let's say that for day 2016-11-25 we have the following row:
+------------+-------+------+------+------+-----+-----+
| date | 100 | 50 | 20 | 10 | 5 | 1 |
+------------+-------+------+------+------+-----+-----+
| 2016-11-25 | 5 | 12 | 27 | 43 | 147 | 129 |
+------------+-------+------+------+------+-----+-----+
Now I wish to get the output of the query as:
+------------+--------+
|denomination|quantity|
+------------+--------+
|100 |5 |
+------------+--------+
|50 |12 |
+------------+--------+
|20 |27 |
+------------+--------+
|10 |43 |
+------------+--------+
|5 |147 |
+------------+--------+
|1 |129 |
+------------+--------+
Is there a method by which this is possible? If you have any other suggestion, please feel free to share it.
Use json functions:
select key as denomination, value as quantity
from cash_drawer c,
     lateral json_each(row_to_json(c))
where key <> 'date'
  and date = '2016-11-25';
denomination | quantity
--------------+----------
100 | 5
50 | 12
20 | 27
10 | 43
5 | 147
1 | 129
(6 rows)

Postgresql + Select field name as upper and lowercase mixed

I am using this PostgreSQL code:
SELECT id as DT_RowId, title
FROM table_name
ORDER BY title asc
LIMIT 25 OFFSET 0
The results are returned like this:
+--------+-----+
|dt_rowid|title|
+--------+-----+
|   1    |A    |
|   2    |B    |
|   3    |C    |
|   4    |D    |
|   5    |E    |
|   6    |F    |
+--------+-----+
But I want the results to be returned like this:
+--------+-----+
|DT_RowId|title|
+--------+-----+
|   1    |A    |
|   2    |B    |
|   3    |C    |
|   4    |D    |
|   5    |E    |
|   6    |F    |
+--------+-----+
Note: I want the DT_RowId field exactly like this (upper and lower case mixed).
As explained in the manual, unquoted identifiers are folded to lowercase (which violates the SQL standard, where unquoted identifiers should be folded to uppercase).
You need to use a quoted identifier in order to preserve the case:
SELECT id as "DT_RowId",
title
FROM table_name
ORDER BY title asc
LIMIT 25 OFFSET 0