Spark: remove all duplicated lines - scala

I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to remove all duplicated lines per key (id1, id2, id3). Note that this is not the same as distinct(): I do not want to keep one distinct line, I want to remove both lines. The expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
Here I should remove lines 1 and 2, because we have two different values for that key group.
I tried to achieve this using:
ds.groupBy(id1,id2,id3).distinct()
But it's not working.

You can use a window function and filter on the count, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
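As a side note, the sample data in the question also contains two identical (1, 3, 2, tom) rows; if your real input can hold such exact duplicates, a variant that counts distinct values per key group (and then drops the remaining identical duplicates) may match the expected output more directly. This is only a sketch of that alternative, not the answer above, and it assumes the same df and column names:
import org.apache.spark.sql.functions.countDistinct

// keep only the key groups that have exactly one distinct value
val singleValueKeys = df
  .groupBy("id1", "id2", "id3")
  .agg(countDistinct("value").as("n_values"))
  .filter($"n_values" === 1)
  .select("id1", "id2", "id3")

// join back to the data and drop the remaining identical duplicates
val result = df
  .join(singleValueKeys, Seq("id1", "id2", "id3"))
  .distinct()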

Related

Fixing hierarchy data with table transformation (Hive, scala, spark)

I have a task involving hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for restoring such links, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically it looks like this (diagram omitted). As you can see, the connections to C1 and D3 are lost here.
In order to restore the connections, I need to apply the following algorithm to this table:
If, for some NAME, the ID does not appear in the PARENTID column (like ID = 18, 10), then create a 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), taking the ID and NAME of a node from the LEVEL above whose ID is less than the current ID.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Rows marked with # are the newly created rows. The new schema looks like this (diagram omitted).
Are there any ideas on how to do this algorithm in spark/scala? Thanks!
You can build a createdRows dataframe from your current dataframe, then union it with your current dataframe to obtain your final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVEL) that are not in the PARENTID column. You can use a self left anti join to do that.
Then, you rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, you take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, you apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col

val createdRows = dataframe
  // if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
  .select("LEVEL", "ID")
  .filter(col("LEVEL") > 1) // remove the root node from created rows
  .join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
  // then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
  .withColumnRenamed("ID", "PARENTID")
  .withColumn("LEVEL", col("LEVEL") - 1)
  // and take ID and NAME
  .join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
  // such that the current ID < ID of the node from the LEVEL above
  .filter(col("ID") < col("PARENTID"))

val result = dataframe
  .unionByName(createdRows)
  .orderBy("NAME", "PARENTID") // optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
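For reference, here is a sketch of how the dataframe used above could be built from the sample data (this construction is my own, assuming a SparkSession with spark.implicits._ in scope and nullable integer PARENTID values):
import spark.implicits._

val dataframe = Seq(
  ("A1", 1, Some(2), 1),
  ("B1", 2, Some(3), 2),
  ("C1", 18, Some(4), 3),
  ("C2", 3, Some(5), 3),
  ("D1", 4, None, 4),
  ("D2", 5, None, 4),
  ("D3", 10, Some(11), 4),
  ("E1", 11, None, 5)
).toDF("NAME", "ID", "PARENTID", "LEVEL")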

Spark: Replicate each row but with change in one column value

How do I perform the following operation in Spark?
Initially:
+-----------+-----+------+
|date |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2 | abc |
|2020-08-17 | 3 | def |
|2020-08-18 | 4 | ghi |
|2020-08-19 | 5 | jkl |
|2020-08-20 | 6 | mno |
+-----------+-----+------+
Final result:
+-----------+-----+------+
|date |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2 | abc |
|2020-08-15 | 2 | abc |
|2020-08-17 | 3 | def |
|2020-08-16 | 3 | def |
|2020-08-18 | 4 | ghi |
|2020-08-17 | 4 | ghi |
|2020-08-19 | 5 | jkl |
|2020-08-18 | 5 | jkl |
|2020-08-20 | 6 | mno |
|2020-08-19 | 6 | mno |
+-----------+-----+------+
So in essence I need to duplicate each row with a change in one of the column values, i.e. for each row, add a duplicate whose date column is one day before the current value.
Try the date_add function: create an array with the date column and the date minus 1 day, then explode that array.
Example:
df.show()
/*
+----------+----+----+
| date|col1|col2|
+----------+----+----+
|2020-08-16| 2| abc|
|2020-08-17| 3| def|
+----------+----+----+
*/
import org.apache.spark.sql.functions._
df.withColumn("new_date",array(col("date"),date_add(col("date"),-1))).
drop("date").
selectExpr("explode(new_date) as date","*").
drop("new_date").
show(10,false)
/*
+----------+----+----+
|date |col1|col2|
+----------+----+----+
|2020-08-16|2 |abc |
|2020-08-15|2 |abc |
|2020-08-17|3 |def |
|2020-08-16|3 |def |
+----------+----+----+
*/
I was thinking union would be quite elegant for this solution, e.g.:
// Union the two dataframes together, take 1 day away from the date
df.union(df.select(date_add($"date", -1), $"col1", $"col2"))
Full sample script where I create the test data:
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $-notation
val dfOriginal = Seq(("2020-08-16", 2, "abc"), ("2020-08-17", 3, "def"), ("2020-08-18", 4, "ghi"), ("2020-08-19", 5, "jkl"), ("2020-08-20", 6, "mno"))
.toDF("date", "col1", "col2")
val df = dfOriginal
.select (to_date($"date", "yyyy-MM-dd").as("date"), $"col1", $"col2")
// Union the two dataframes together, take 1 day away from the date
df.union(df.select(date_add($"date", -1), $"col1", $"col2"))
.orderBy("date", "col1", "col2")
.show
My results: the same ten rows as the expected output above, ordered by date.
Maybe a bit late for this, but answering in Python so others might find it useful.
from pyspark.sql.functions import *
Initial DF looks like this:
+-----------+-----+------+
|date |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2 | abc |
|2020-08-17 | 3 | def |
|2020-08-18 | 4 | ghi |
|2020-08-19 | 5 | jkl |
|2020-08-20 | 6 | mno |
+-----------+-----+------+
df.withColumn("dates_array",array(col("date"),date_add(col("date"),-1))))
.drop("date")
.withColumn("date",explode("dates_array"))
.drop("dates_array")
.show()
Then you'll get what you want:
+-----------+-----+------+
|date |col1 | col2 |
+-----------+-----+------+
|2020-08-16 | 2 | abc |
|2020-08-15 | 2 | abc |
|2020-08-17 | 3 | def |
|2020-08-16 | 3 | def |
|2020-08-18 | 4 | ghi |
|2020-08-17 | 4 | ghi |
|2020-08-19 | 5 | jkl |
|2020-08-18 | 5 | jkl |
|2020-08-20 | 6 | mno |
|2020-08-19 | 6 | mno |
+-----------+-----+------+

How can I get consecutively the same dataframe in spark

My data is like this, status is 0 or 1, uid is user id.
uid |timestamp |status
1 |1 | 0
2 |3 | 1
1 |2 | 1
2 |1 | 0
1 |3 | 1
2 |2 | 0
2 |4 | 0
I want the data partitioned by uid and ordered by timestamp ascending:
uid |timestamp |status
1 |1 | 0
1 |2 | 1
1 |3 | 1
2 |1 | 0
2 |2 | 0
2 |3 | 1
2 |4 | 0
Then I want to get all consecutive rows with the same status and combine them to do other things.
Sorry, my English is not great.
The result should be like below:
uid |status |timestamps-asc-order
1 |(0) | (1)
1 |(1,1) | (2,3)
2 |(0,0) | (1,2)
2 |(1) | (3)
2 |(0) | (4)
I can do partition and order with window function.
But then, how to get consecutively same status ?
val window = Window.partitionBy("uid").orderBy($"timestamp".asc)
Welcome to StackOverflow.
You are looking for the collect_list function.
You should be able to achieve what you ask with:
df.withColumn("timestamps-asc-order", collect_list("timestamp").over(Window.partitionBy("uid").orderBy("timestamp")))

SPARK-SCALA: Update End date for a ID with the new start_date for the updated respective ID

I want to create a new end_date column for an id, holding the start_date value of the updated record for the same id, using Spark Scala.
Consider the following Data frame:
+---+-----+----------+
| id|Value|start_date|
+---+---- +----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here, initially the start_date of id=1 is 1/1/2018 and the value is a, while on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new end_date column, setting it to 10/1/2018 for the initial id=1 record and to NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+---- +----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function lead:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._

val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")

val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")

val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))

result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
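One caveat worth noting: start_date is a string in M/d/yyyy format here, and ordering strings lexicographically does not in general match chronological order (it happens to work for this particular sample). A safer variant, as a sketch (the "M/d/yyyy" format pattern is my assumption about your data), is to parse the dates before ordering the window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lead, to_date}

val idWindow = Window.partitionBy($"id")
  .orderBy(to_date($"start_date", "M/d/yyyy"))

val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))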

Spark SQL Map only one column of DataFrame

Sorry for the noob question, I have a dataframe in SparkSQL like this:
id | name | data
----------------
1 | Mary | ABCD
2 | Joey | DOGE
3 | Lane | POOP
4 | Jack | MEGA
5 | Lynn | ARGH
I want to know how to do two things:
1) use a scala function on one or more columns to produce another column
2) use a scala function on one or more columns to replace a column
Examples:
1) Create a new boolean column that tells whether the data starts with A:
id | name | data | startsWithA
------------------------------
1 | Mary | ABCD | true
2 | Joey | DOGE | false
3 | Lane | POOP | false
4 | Jack | MEGA | false
5 | Lynn | ARGH | true
2) Replace the data column with its lowercase counterpart:
id | name | data
----------------
1 | Mary | abcd
2 | Joey | doge
3 | Lane | poop
4 | Jack | mega
5 | Lynn | argh
What is the best way to do this in SparkSQL? I've seen many examples of how to return a single transformed column, but I don't know how to get back a new DataFrame with all the original columns as well.
You can use withColumn to add a new column or to replace an existing column, as follows:
import org.apache.spark.sql.functions.lower
import spark.implicits._

val df = Seq(
  (1, "Mary", "ABCD"),
  (2, "Joey", "DOGE"),
  (3, "Lane", "POOP"),
  (4, "Jack", "MEGA"),
  (5, "Lynn", "ARGH")
).toDF("id", "name", "data")

val resultDF = df.withColumn("startsWithA", $"data".startsWith("A"))
  .withColumn("data", lower($"data"))
If you want separate dataframes, then:
val resultDF1 = df.withColumn("startsWithA", $"data".startsWith("A"))
val resultDF2 = df.withColumn("data", lower($"data"))
withColumn replaces the old column if an existing column name is provided, and creates a new column if a new column name is provided.
Output:
+---+----+----+-----------+
|id |name|data|startsWithA|
+---+----+----+-----------+
|1 |Mary|abcd|true |
|2 |Joey|doge|false |
|3 |Lane|poop|false |
|4 |Jack|mega|false |
|5 |Lynn|argh|true |
+---+----+----+-----------+
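If you specifically want to apply an arbitrary Scala function to one or more columns (rather than built-ins like lower or startsWith), you can wrap it in a UDF and use it with withColumn in the same way. A minimal sketch under that assumption, reusing the df from above (the function and UDF names here are my own):
import org.apache.spark.sql.functions.udf

// plain Scala functions...
val startsWithA = (s: String) => s != null && s.startsWith("A")
val toLower = (s: String) => if (s == null) null else s.toLowerCase

// ...wrapped as UDFs so they can be applied column-wise
val startsWithAUdf = udf(startsWithA)
val toLowerUdf = udf(toLower)

val resultDF = df
  .withColumn("startsWithA", startsWithAUdf($"data")) // new column
  .withColumn("data", toLowerUdf($"data"))            // replace existing column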