How can I group consecutive rows with the same status in a Spark Scala dataframe?

My data looks like this, where status is 0 or 1 and uid is the user id.
uid |timestamp |status
1   |1         |0
2   |3         |1
1   |2         |1
2   |1         |0
1   |3         |1
2   |2         |0
2   |4         |0
I want the data partitioned by uid and ordered by timestamp ascending:
uid |timestamp |status
1   |1         |0
1   |2         |1
1   |3         |1
2   |1         |0
2   |2         |0
2   |3         |1
2   |4         |0
Then I want to group all consecutive rows with the same status and combine them to do other things.
The result should look like this:
uid |status |timestamps-asc-order
1   |(0)    |(1)
1   |(1,1)  |(2,3)
2   |(0,0)  |(1,2)
2   |(1)    |(3)
2   |(0)    |(4)
I can do the partitioning and ordering with a window function.
But then, how do I group the consecutive rows with the same status?
val window = Window.partitionBy("uid").orderBy($"timestamp".asc)

Welcome to Stack Overflow.
You are looking for the collect_list function.
You should be able to achieve what you ask with:
df.withColumn("timestamps-asc-order", collect_list("timestamp").over(Window.partitionBy("uid").orderBy("timestamp")))
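Since you also need to combine only consecutive rows with the same status, collect_list can be paired with a lag-based run id (a cumulative sum over a "status changed" flag). Below is a rough sketch of that idea; df is your input dataframe, and the intermediate column names isNewRun and runId are only illustrative:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, lag, lit, sort_array, sum, when}

val byTime = Window.partitionBy("uid").orderBy(col("timestamp").asc)

val runs = df
  // flag rows whose status differs from the previous row (the first row of each uid starts a new run)
  .withColumn("isNewRun",
    when(lag("status", 1).over(byTime) === col("status"), lit(0)).otherwise(lit(1)))
  // a running sum of the flags assigns one id per run of identical statuses
  .withColumn("runId", sum("isNewRun").over(byTime))
  // collapse each run into one row; sort_array keeps the timestamps in ascending order
  .groupBy("uid", "runId", "status")
  .agg(sort_array(collect_list("timestamp")).as("timestamps-asc-order"))
  .orderBy("uid", "runId")
  .drop("runId")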

Related

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now; suppose there are 4 days in total: {1, 2, 3, 4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
Is there some way that can help me get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
import org.apache.spark.sql.functions.{col, explode, lit, sequence}

val data = df1
  // we first group by key and aggregate to a sequence from 1 to 4 (your number of days)
  .groupBy("key")
  .agg(sequence(lit(1), lit(4)).as("Time"))
  // we explode the sequence, thus creating all 'Time' values per 'key'
  .withColumn("Time", explode(col("Time")))
  // finally, we left join with our main table on 'key' and 'Time'
  .join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!
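Note that sequence was added in Spark 2.4; if you are on an older version, you can build the same (key, Time) grid explicitly. A rough sketch of that alternative, assuming spark is the active SparkSession:
import spark.implicits._ // assumes `spark` is the active SparkSession

// Build the full (key, Time) grid, then left join the original data onto it.
val times = (1 to 4).toDF("Time")
val data = df1.select("key").distinct()
  .crossJoin(times)
  .join(df1, Seq("key", "Time"), "left")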

Fixing hierarchy data with table transformation (Hive, scala, spark)

I have a task involving hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for re-establishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically it looks like a tree (the diagram is omitted here).
As you can see, the connections to C1 and D3 are lost.
In order to restore the connections, I need to apply the following algorithm to this table:
if for some NAME the ID is not present in the PARENTID column (like ID = 18, 10), then create a 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), taking the ID and NAME such that the current ID < the ID of the node from the LEVEL above.
The result must look like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
The rows marked with # are the newly created rows, and the new schema (diagram omitted) shows the restored connections.
Are there any ideas on how to do this algorithm in spark/scala? Thanks!
You can build a createdRows dataframe from your current dataframe and union it with your current dataframe to obtain the final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVELs) that are not present in the PARENTID column. You can use a self left anti join to do that.
Then, you rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, you take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, you apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col

val createdRows = dataframe
  // if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
  .select("LEVEL", "ID")
  .filter(col("LEVEL") > 1) // remove the root node from the created rows
  .join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
  // then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
  .withColumnRenamed("ID", "PARENTID")
  .withColumn("LEVEL", col("LEVEL") - 1)
  // and take ID and NAME
  .join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
  // such that the current ID < ID of the node from the LEVEL above
  .filter(col("ID") < col("PARENTID"))

val result = dataframe
  .unionByName(createdRows)
  .orderBy("NAME", "PARENTID") // optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
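For completeness, the example dataframe can be built like this for testing; a minimal sketch, assuming spark is the active SparkSession and modelling PARENTID as Option[Int] so that the missing parents become nulls:
import spark.implicits._ // assumes `spark` is the active SparkSession

val dataframe = Seq(
  ("A1", 1, Some(2), 1),
  ("B1", 2, Some(3), 2),
  ("C1", 18, Some(4), 3),
  ("C2", 3, Some(5), 3),
  ("D1", 4, None, 4),
  ("D2", 5, None, 4),
  ("D3", 10, Some(11), 4),
  ("E1", 11, None, 5)
).toDF("NAME", "ID", "PARENTID", "LEVEL")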

Spark Scala: fill empty values according to the result of a self-joined dataframe query

I am struggling to write Spark Scala code to fill the rows for which the coverage is empty, using a self join with conditions.
This is the data :
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | | |
| 1 | 2020-09-06 | | |
| 1 | 2020-09-19 | | |
| 1 | 2020-09-12 | | |
| 1 | 2020-09-18 | | |
| 1 | 2020-09-11 | | |
| 1 | 2020-09-16 | | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | | |
| 1 | 2020-09-30 | | |
+----+--------------+----------+--------+
Expected result:
+----+--------------+----------+--------+
| ID | date_in_days | coverage | values |
+----+--------------+----------+--------+
| 1 | 2020-09-01 | -1 | 0.128 |
| 1 | 2020-09-03 | 0 | 0.358 |
| 1 | 2020-09-04 | 0 | 0.035 |
| 1 | 2020-09-05 | 0 | |
| 1 | 2020-09-06 | 0 | |
| 1 | 2020-09-19 | 0 | |
| 1 | 2020-09-12 | 0 | |
| 1 | 2020-09-18 | 0 | |
| 1 | 2020-09-11 | 0 | |
| 1 | 2020-09-16 | 0 | |
| 1 | 2020-09-21 | 13 | 0.554 |
| 1 | 2020-09-23 | -1 | |
| 1 | 2020-09-30 | -1 | |
+----+--------------+----------+--------+
What I am trying to do:
For each distinct ID (dataframe partitioned by ID), sorted by date:
When a row's coverage column is null (call it rowEmptyCoverage), find the first row in the DF with date_in_days > rowEmptyCoverage.date_in_days and with coverage >= 0 (call it rowFirstDateGreater).
Then, if rowFirstDateGreater.values > 500, set rowEmptyCoverage.coverage to 0; otherwise set it to -1.
I am kind of lost mixing when, join, and where...
I am assuming that you mean values > 0.500 and not values > 500. Also, the logic remains a bit unclear; here I am assuming that you are searching in the order of the date_in_days column and not in the order of the dataframe.
In any case, we can refine the solution to match your exact need. The overall idea is to use a Window to fetch the next date for which the coverage is not null, check whether values meets the desired criterion, and update coverage.
It goes as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first, lit, struct, when}
import spark.implicits._ // for the 'symbol and $"..." column syntax

val win = Window.partitionBy("ID").orderBy("date_in_days")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)

df
  // create a struct binding coverage and values
  .withColumn("cov_str", when('coverage.isNull, lit(null))
    .otherwise(struct('coverage, 'values)))
  // find the first row (starting from the current date, in order of
  // date_in_days) for which the coverage is not null
  .withColumn("next_cov_str", first('cov_str, ignoreNulls = true).over(win))
  // update coverage: keep the original value if not null, put 0 if values
  // meets the criterion (which you can change) and -1 otherwise
  .withColumn("coverage", coalesce(
    'coverage,
    when($"next_cov_str.values" > 0.500, lit(0)),
    lit(-1)
  ))
  .show(false)
+---+-------------------+--------+------+-----------+------------+
|ID |date_in_days |coverage|values|cov_str |next_cov_str|
+---+-------------------+--------+------+-----------+------------+
|1 |2020-09-01 00:00:00|-1 |0.128 |null |[0, 0.358] |
|1 |2020-09-03 00:00:00|0 |0.358 |[0, 0.358] |[0, 0.358] |
|1 |2020-09-04 00:00:00|0 |0.035 |[0, 0.035] |[0, 0.035] |
|1 |2020-09-05 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-06 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-11 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-12 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-16 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-18 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-19 00:00:00|0 |null |null |[13, 0.554] |
|1 |2020-09-21 00:00:00|13 |0.554 |[13, 0.554]|[13, 0.554] |
|1 |2020-09-23 00:00:00|-1 |null |null |null |
|1 |2020-09-30 00:00:00|-1 |null |null |null |
+---+-------------------+--------+------+-----------+------------+
You may then use drop("cov_str", "next_cov_str") but I leave them here for clarity.

Create another column using the value of another column

I have a dataframe to which I need to add another column based on grouping logic.
Dataframe
id|x_id|y_id|val_id|
1| 2 | 3 | 4 |
10| 2 | 3 | 40 |
1| 12 | 13 | 14 |
I need to add another column, parent_id, based on this rule:
over the group (x_id, y_id), select the max value in the val_id column and use its corresponding id value.
Final frame will look like this
id|x_id|y_id|val_id| parent_id
91| 2 | 3 | 4 | 10 (coming from row 2)
10| 2 | 3 | 40 | 10 (coming from row 2)
1| 12 | 13 | 14 | 14
I have tried using withColumn, but I can only set the value on the row within the group that is itself the parent.
Explanation: here parent_id is 10 because it comes from the id column. Row 2 was chosen because it has the max value of val_id over the group (x_id, y_id).
I am using Scala.
Use a Window to partition by the ids and pick the id of the maximum row by sorting each partition by val_id in descending order.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first
import spark.implicits._ // for the 'symbol column syntax

val w = Window.partitionBy('x_id, 'y_id).orderBy('val_id.desc)

df.withColumn("parent_id", first('id).over(w))
  .show(false)
The result is:
+---+----+----+------+---------+
|id |x_id|y_id|val_id|parent_id|
+---+----+----+------+---------+
|10 |2 |3 |40 |10 |
|1 |2 |3 |4 |10 |
|1 |12 |13 |14 |1 |
+---+----+----+------+---------+
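As an alternative that does not depend on the ordered window's default frame, you could take the maximum of a (val_id, id) struct over an unordered window and extract the id from it; a sketch, assuming the same df and column names as above (grp is just an illustrative name):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, struct}

// Struct ordering compares val_id first, so the max struct carries the id
// of the row with the largest val_id in each (x_id, y_id) group.
val grp = Window.partitionBy("x_id", "y_id")

df.withColumn("parent_id", max(struct(col("val_id"), col("id"))).over(grp).getField("id"))
  .show(false)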

Converting a dataframe column into one-hot-encoder-like columns

I am trying to find a way to convert a specific column into one-hot-encoded columns. For example:
-------------
Content|type|
-------------
alpha | A |
beta | B |
gamma | C |
theta | A |
zeta | C |
neta | B |
-------------
And what I am trying to do is the following:
----------------------------
Content|type_A|type_B|type_C|
----------------------------
alpha | 1 | 0 | 0 |
beta | 0 | 1 | 0 |
gamma | 0 | 0 | 1 |
theta | 1 | 0 | 0 |
zeta | 0 | 0 | 1 |
neta | 0 | 1 | 0 |
-----------------------------
I think pivot is what you are looking for:
import org.apache.spark.sql.functions.count
import spark.implicits._ // for toDF

val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")

val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
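If you want the pivoted columns named type_A, type_B, and type_C exactly as in the question, one option is to rename every column except Content afterwards; a small sketch building on the result dataframe above:
// Prefix each pivoted column with "type_", leaving the "Content" column as-is.
val renamed = result.columns.foldLeft(result) { (acc, c) =>
  if (c == "Content") acc else acc.withColumnRenamed(c, s"type_$c")
}
renamed.show(false)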