How do I modify dataframe rows based on another dataframe in Spark? - scala

I have a dataframe, df2, such as:
ID | data
--------
1 | New
3 | New
5 | New
and a main dataframe, df1:
ID | data | more
----------------
1 | OLD | a
2 | OLD | b
3 | OLD | c
4 | OLD | d
5 | OLD | e
I want to achieve something of the sort:
ID | data | more
----------------
1 | NEW | a
2 | OLD | b
3 | NEW | c
4 | OLD | d
5 | NEW | e
I want to update df1 based on df2, keeping the original values of df1 when they don't exist in df2.
Is there a faster way to do this than using isin? isin is very slow when df1 and df2 are both very large.

With a left join and coalesce:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq(
  (1, "OLD", "a"),
  (2, "OLD", "b"),
  (3, "OLD", "c"),
  (4, "OLD", "d"),
  (5, "OLD", "e")).toDF("ID", "data", "more")

val df2 = Seq(
  (1, "New"),
  (3, "New"),
  (5, "New")).toDF("ID", "data")

// left join keeps every df1 row; coalesce prefers df2.data where a match exists
val result = df1.alias("df1")
  .join(df2.alias("df2"), $"df2.ID" === $"df1.ID", "left")
  .select(
    $"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")
Output:
+---+----+----+
|ID |data|more|
+---+----+----+
|1 |New |a |
|2 |OLD |b |
|3 |New |c |
|4 |OLD |d |
|5 |New |e |
+---+----+----+
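If df2 is small enough to fit on each executor, a broadcast hint avoids shuffling the large df1 and is usually much cheaper than isin; a minimal sketch of the same join as above:
import org.apache.spark.sql.functions.broadcast

// broadcasting df2 turns the join into a map-side lookup, so df1 is not shuffled
val result2 = df1.alias("df1")
  .join(broadcast(df2.alias("df2")), $"df2.ID" === $"df1.ID", "left")
  .select(
    $"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")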

Create new columns from values of other columns in Scala Spark

I have an input dataframe:
inputDF=
+--------------------------+-----------------------------+
| info (String) | chars (Seq[String]) |
+--------------------------+-----------------------------+
|weight=100,height=70 | [weight,height] |
+--------------------------+-----------------------------+
|weight=92,skinCol=white | [weight,skinCol] |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |
+--------------------------+-----------------------------+
How do I get this dataframe as output? I do not know in advance what strings are contained in the chars column.
outputDF=
+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String) | chars (Seq[String]) | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70 | [weight,height] | 100 | 70 | null |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white | [weight,skinCol] | 92 |null |white |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |null |null |white |gray |
+--------------------------+-----------------------------+-------+-------+-------+-------+
I would also like to save the following Seq[String] as a variable, but without using the .collect() function on the dataframes.
val aVariable: Seq[String] = [weight, height, skinCol, hairCol]
You can create another dataframe by pivoting on the keys of the info column, then join it back to the original using an id column:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)

val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)

// split each "key=value" pair, explode, then pivot the keys into columns
val pivotDf = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1))
  .select("id", "val1", "val2")
  .groupBy("id").pivot("val1").agg(first(col("val2")))

df.join(pivotDf, Seq("id"), "left").drop("id").show(false)
+--------------------------+------------------+-------+------+-------+------+
|info |chars |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70 |[weight, height] |null |70 |null |100 |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray |null |white |null |
|weight=92,skinCol=white |[weight, skinCol] |null |null |white |92 |
+--------------------------+------------------+-------+------+-------+------+
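Depending on your Spark version, the built-in str_to_map SQL function (used here through expr, on the assumption that it is available in your version) can replace the manual split of the key=value pairs; a rough sketch of the same pivot:
import org.apache.spark.sql.functions.{col, explode, expr, first}

// str_to_map parses "k1=v1,k2=v2" into a map column; exploding it gives key/value columns
val pivotDf2 = df
  .select(col("id"), explode(expr("str_to_map(info, ',', '=')")))
  .groupBy("id").pivot("key").agg(first("value"))

df.join(pivotDf2, Seq("id"), "left").drop("id").show(false)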
For your second question, you can get those values in a dataframe like this:
df.withColumn("tmp", explode(split(col("info"), ",")))
.withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()
+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+
but you cannot get them into a Seq variable without using collect: the values live distributed on the executors, while a local Seq has to live on the driver.
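If you do end up needing the key names on the driver, the collect is unavoidable, but it only touches the small distinct set of keys; a minimal sketch reusing the df defined above:
// only the handful of distinct key names is collected, not the data itself
val aVariable: Seq[String] = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .select(split(col("tmp"), "=")(0).as("key"))
  .distinct()
  .as[String]
  .collect()
  .toSeq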

Spark DF: create Seq column in withColumn

I have a df:
col1 | col2
----------------
1    | abcdefghi
2    | qwertyuio
and I want to repeat each row, splitting col2 into 3 substrings of length 3:
col1 | col2
----------------
1    | abcdefghi
1    | abc
1    | def
1    | ghi
2    | qwertyuio
2    | qwe
2    | rty
2    | uio
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3))...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column and reduce it to one final Seq, then turn that into a DataFrame and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead (note that substring positions are 1-based):
import org.apache.spark.sql.functions.{array, col, explode, substring}
import spark.implicits._

val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      substring(col("col2"), 1, 3),
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| def|
| 1| ghi|
| 2| qwe|
| 2| rty|
| 2| uio|
+----+---------+
You can also use a udf:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq((1L, "abcdefghi"), (2L, "qwertyuio"))).toDF("col1", "col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf: the regex "(?<=\G...)" splits the string after every 3 characters
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+

Create New Column with range of integer by using existing Integer Column in Spark Scala Dataframe

Suppose I have a Spark Scala DataFrame object like:
+--------+
|col1 |
+--------+
|1 |
|3 |
+--------+
And I want a DataFrame like:
+-----------------+
|col1 |col2 |
+-----------------+
|1 |[0,1] |
|3 |[0,1,2,3] |
+-----------------+
Spark offers plenty of APIs and functions to play with; most of the time the default functions come in handy, but for a specific task a UserDefinedFunction (UDF) can be written.
Reference https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import spark.implicits._

// udf that turns an integer n into the sequence 0..n
def indexToRange: UserDefinedFunction = udf((index: Integer) => for (i <- 0 to index) yield i)

val df = spark.sparkContext.parallelize(Seq(1, 3)).toDF("index")
val rangeDF = df.withColumn("range", indexToRange(col("index")))
rangeDF.show(10)
You can achieve it with the approach below:
import spark.implicits._

val input_df = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5)).toDF("col1")
input_df.show(false)
Input:
+----+
|col1|
+----+
|1 |
|2 |
|3 |
|4 |
|5 |
+----+
val output_df = input_df.rdd.map(x => x(0).toString()).map(x => (x, Range(0, x.toInt + 1).mkString(","))).toDF("col1", "col2")
output_df.withColumn("col2", split($"col2", ",")).show(false)
Output:
+----+------------------+
|col1|col2 |
+----+------------------+
|1 |[0, 1] |
|2 |[0, 1, 2] |
|3 |[0, 1, 2, 3] |
|4 |[0, 1, 2, 3, 4] |
|5 |[0, 1, 2, 3, 4, 5]|
+----+------------------+
Hope this helps!
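As a side note, if you are on Spark 2.4 or later (an assumption, since the question does not state a version), the built-in sequence function produces the same array column without a UDF or an RDD round-trip:
import org.apache.spark.sql.functions.{col, lit, sequence}

// sequence(start, stop) builds an array of consecutive integers, inclusive of both ends
val output_df2 = input_df.withColumn("col2", sequence(lit(0), col("col1")))
output_df2.show(false)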

SPARK-SCALA: Update end_date for an ID with the new start_date of the updated record for the same ID

Using Spark Scala, I want to create a new column end_date for an id, holding the start_date of the updated record for the same id.
Consider the following DataFrame:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here the start_date of id=1 is initially 1/1/2018 with value a; on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new end_date column with 10/1/2018 for the original id=1 record and NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+-----+----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function lead:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._

val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")

// within each id, end_date is the start_date of the next record (null if there is none)
val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")

val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
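One caveat: ordering the window by the raw start_date string happens to work on this sample but is fragile in general; parsing it into a proper date makes the ordering robust. A sketch assuming the M/d/yyyy format seen in the sample data:
import org.apache.spark.sql.functions.to_date

// order each id's records by the actual date, not its string representation
val idWindowByDate = Window.partitionBy($"id")
  .orderBy(to_date($"start_date", "M/d/yyyy"))

val result2 = df.withColumn("end_date", lead($"start_date", 1).over(idWindowByDate))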

How to select the most distinct value or How to perform a Inner/Nested groupBy in Spark?

Original Dataframe
+-------+---------------+
| col_a | col_b |
+-------+---------------+
| 1 | aaa |
| 1 | bbb |
| 1 | ccc |
| 1 | aaa |
| 1 | aaa |
| 1 | aaa |
| 2 | eee |
| 2 | eee |
| 2 | ggg |
| 2 | hhh |
| 2 | iii |
| 3 | 222 |
| 3 | 333 |
| 3 | 222 |
+-------+---------------+
Result Dataframe I need:
+----------------+---------------------+-----------+
| group_by_col_a | most_distinct_value | col_a cnt |
+----------------+---------------------+-----------+
| 1 | aaa | 6 |
| 2 | eee | 5 |
| 3 | 222 | 3 |
+----------------+---------------------+-----------+
Here is what I have tried so far
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    max(countDistinct("col_b")),
    count("col_a").as("col_a_cnt"))
and the error message:
org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
What is the problem?
Is there an efficient method to select the most distinct value?
You need two groupBys and a join to get the result, as below:
import spark.implicits._
import org.apache.spark.sql.functions.{count, max, struct}

val data = spark.sparkContext.parallelize(Seq(
  (1, "aaa"), (1, "bbb"),
  (1, "ccc"), (1, "aaa"),
  (1, "aaa"), (1, "aaa"),
  (2, "eee"), (2, "eee"),
  (2, "ggg"), (2, "hhh"),
  (2, "iii"), (3, "222"),
  (3, "333"), (3, "222")
)).toDF("a", "b")

// calculating the count for column a
val countDF = data.groupBy($"a").agg(count("a").as("col_a cnt"))

val distinctDF = data.groupBy($"a", $"b").count()
  // max(struct(count, b)) picks, per group, the b with the highest count
  .groupBy("a").agg(max(struct("count", "b")).as("max"))
  // selecting the most frequent ("most distinct") value
  .select($"a", $"max.b".as("most_distinct_value"))
  // joining both dataframes to get the final result
  .join(countDF, Seq("a"))
distinctDF.show()
Output:
+---+-------------------+---------+
| a|most_distinct_value|col_a cnt|
+---+-------------------+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+---+-------------------+---------+
Hope this was helpful!
Another approach is to do the conversion at the RDD level, which avoids the extra groupBy and join:
val input = Seq((1, "aaa"), (1, "bbb"), (1, "ccc"), (1, "aaa"), (1, "aaa"),
(1, "aaa"), (2, "eee"), (2, "eee"), (2, "ggg"), (2, "hhh"), (2, "iii"),
(3, "222"), (3, "333"), (3, "222"))
import sparkSession.implicits._
import org.apache.spark.rdd.RDD

val inputRDD: RDD[(Int, String)] = sc.parallelize(input)
Conversion:
val outputRDD: RDD[(Int, String, Int)] =
  inputRDD.groupBy(_._1)
    .map(row =>
      (row._1,
        // most frequent value within the group
        row._2.map(_._2)
          .groupBy(identity)
          .maxBy(_._2.size)._1,
        // total number of rows in the group
        row._2.size))
Now you can create a DataFrame and display it:
val outputDf: DataFrame = outputRDD.toDF("col_a", "col_b", "col_a cnt")
outputDf.show()
Output:
+-----+-----+---------+
|col_a|col_b|col_a cnt|
+-----+-----+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+-----+-----+---------+
You can achieve your requirement by simply defining a udf function and using the collect_list and count functions (which you've already done).
In the udf function, you pass in the collected list of col_b values and return the most frequently occurring string in the group:
import org.apache.spark.sql.functions._
import scala.collection.mutable

def maxCountdinstinct = udf((list: mutable.WrappedArray[String]) => {
  list.groupBy(identity) // grouping the strings
    .mapValues(_.size)   // counting each group
    .maxBy(_._2)._1      // returning the string with the max count
})
And you can call the udf function as
val DF = originalDF
.groupBy($"col_a")
.agg(maxCountdinstinct(collect_list("col_b")).as("most_distinct_value"), count("col_a").as("col_a_cnt"))
which should give you
+-----+-------------------+---------+
|col_a|most_distinct_value|col_a_cnt|
+-----+-------------------+---------+
|3 |222 |3 |
|1 |aaa |6 |
|2 |eee |5 |
+-----+-------------------+---------+
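For completeness, the same result can also be obtained without a UDF and without a second join, using window functions over the per-pair counts; a sketch against the original column names from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, row_number, sum}
import spark.implicits._

// count each (col_a, col_b) pair, keep the most frequent value per col_a,
// and sum the pair counts back up to the per-group total
val pairCounts = originalDF.groupBy($"col_a", $"col_b").agg(count("col_b").as("cnt"))

val byCount = Window.partitionBy($"col_a").orderBy($"cnt".desc)
val byGroup = Window.partitionBy($"col_a")

val resultDF = pairCounts
  .withColumn("rn", row_number().over(byCount))
  .withColumn("col_a_cnt", sum($"cnt").over(byGroup))
  .filter($"rn" === 1)
  .select($"col_a", $"col_b".as("most_distinct_value"), $"col_a_cnt")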