How do I modify dataframe rows based on another dataframe in Spark? - scala

I have a dataframe, df2, such as:
ID | data
--------
1 | New
3 | New
5 | New
and a main dataframe, df1:
ID | data | more
----------------
1 | OLD | a
2 | OLD | b
3 | OLD | c
4 | OLD | d
5 | OLD | e
I want to achieve something of the sort:
ID | data | more
----------------
1 | NEW | a
2 | OLD | b
3 | NEW | c
4 | OLD | d
5 | NEW | e
I want to update df1 based on df2, keeping the original values of df1 when they don't exist in df2.
Is there a faster way to do this than using isin? isin is very slow when df1 and df2 are both very large.

With a left join and coalesce:
import org.apache.spark.sql.functions.coalesce
import spark.implicits._

val df1 = Seq(
  (1, "OLD", "a"),
  (2, "OLD", "b"),
  (3, "OLD", "c"),
  (4, "OLD", "d"),
  (5, "OLD", "e")).toDF("ID", "data", "more")

val df2 = Seq(
  (1, "New"),
  (3, "New"),
  (5, "New")).toDF("ID", "data")

// left join keeps every df1 row; coalesce prefers df2.data where a match exists
val result = df1.alias("df1")
  .join(df2.alias("df2"), $"df2.ID" === $"df1.ID", "left")
  .select(
    $"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")
Output:
+---+----+----+
|ID |data|more|
+---+----+----+
|1 |New |a |
|2 |OLD |b |
|3 |New |c |
|4 |OLD |d |
|5 |New |e |
+---+----+----+
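If df2 is small enough to fit on each executor, a broadcast hint avoids shuffling the large df1 and is usually much cheaper than isin; a minimal sketch of the same join as above:
import org.apache.spark.sql.functions.broadcast

// broadcasting df2 turns the join into a map-side lookup, so df1 is not shuffled
val result2 = df1.alias("df1")
  .join(broadcast(df2.alias("df2")), $"df2.ID" === $"df1.ID", "left")
  .select(
    $"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")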

Create new columns from values of other columns in Scala Spark

I have an input dataframe:
inputDF=
+--------------------------+-----------------------------+
| info (String) | chars (Seq[String]) |
+--------------------------+-----------------------------+
|weight=100,height=70 | [weight,height] |
+--------------------------+-----------------------------+
|weight=92,skinCol=white | [weight,skinCol] |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |
+--------------------------+-----------------------------+
How do I get this dataframe as output? I do not know in advance what strings are contained in the chars column.
outputDF=
+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String) | chars (Seq[String]) | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70 | [weight,height] | 100 | 70 | null |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white | [weight,skinCol] | 92 |null |white |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |null |null |white |gray |
+--------------------------+-----------------------------+-------+-------+-------+-------+
I would also like to save the following Seq[String] as a variable, but without using the .collect() function on the dataframes.
val aVariable: Seq[String] = [weight, height, skinCol, hairCol]
You can create another dataframe by pivoting on the keys of the info column, then join it back to the original using an id column:
import spark.implicits._
import org.apache.spark.sql.functions._

val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)

val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)

// split each "key=value" pair, explode, then pivot the keys into columns
val pivotDf = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1))
  .select("id", "val1", "val2")
  .groupBy("id").pivot("val1").agg(first(col("val2")))

df.join(pivotDf, Seq("id"), "left").drop("id").show(false)
+--------------------------+------------------+-------+------+-------+------+
|info |chars |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70 |[weight, height] |null |70 |null |100 |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray |null |white |null |
|weight=92,skinCol=white |[weight, skinCol] |null |null |white |92 |
+--------------------------+------------------+-------+------+-------+------+
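Depending on your Spark version, the built-in str_to_map SQL function (used here through expr, on the assumption that it is available in your version) can replace the manual split of the key=value pairs; a rough sketch of the same pivot:
import org.apache.spark.sql.functions.{col, explode, expr, first}

// str_to_map parses "k1=v1,k2=v2" into a map column; exploding it gives key/value columns
val pivotDf2 = df
  .select(col("id"), explode(expr("str_to_map(info, ',', '=')")))
  .groupBy("id").pivot("key").agg(first("value"))

df.join(pivotDf2, Seq("id"), "left").drop("id").show(false)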
For your second question, you can get those values in a dataframe like this:
df.withColumn("tmp", explode(split(col("info"), ",")))
.withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()
+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+
but you cannot get them into a Seq variable without using collect: the values live distributed on the executors, while a local Seq has to live on the driver.
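If you do end up needing the key names on the driver, the collect is unavoidable, but it only touches the small distinct set of keys; a minimal sketch reusing the df defined above:
// only the handful of distinct key names is collected, not the data itself
val aVariable: Seq[String] = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .select(split(col("tmp"), "=")(0).as("key"))
  .distinct()
  .as[String]
  .collect()
  .toSeq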

Spark DF: create Seq column in withColumn

I have a df:
col1 | col2
----------------
1    | abcdefghi
2    | qwertyuio
and I want to repeat each row, splitting col2 into 3 substrings of length 3:
col1 | col2
----------------
1    | abcdefghi
1    | abc
1    | def
1    | ghi
2    | qwertyuio
2    | qwe
2    | rty
2    | uio
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3))...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column and reduce it to one final Seq, then turn that into a DataFrame and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead (note that substring positions are 1-based):
import org.apache.spark.sql.functions.{array, col, explode, substring}
import spark.implicits._

val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      substring(col("col2"), 1, 3),
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1| col2|
+----+---------+
| 1|abcdefghi|
| 2|qwertyuio|
| 1| abc|
| 1| def|
| 1| ghi|
| 2| qwe|
| 2| rty|
| 2| uio|
+----+---------+
You can also use a udf:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq((1L, "abcdefghi"), (2L, "qwertyuio"))).toDF("col1", "col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// udf: the regex "(?<=\G...)" splits the string after every 3 characters
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+

Create New Column with range of integer by using existing Integer Column in Spark Scala Dataframe

Suppose I have a Spark Scala DataFrame object like:
+--------+
|col1 |
+--------+
|1 |
|3 |
+--------+
And I want a DataFrame like:
+-----------------+
|col1 |col2 |
+-----------------+
|1 |[0,1] |
|3 |[0,1,2,3] |
+-----------------+
Spark offers plenty of APIs and functions to play with; most of the time the default functions come in handy, but for a specific task a UserDefinedFunction (UDF) can be written.
Reference https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import spark.implicits._

// udf that turns an integer n into the sequence 0..n
def indexToRange: UserDefinedFunction = udf((index: Integer) => for (i <- 0 to index) yield i)

val df = spark.sparkContext.parallelize(Seq(1, 3)).toDF("index")
val rangeDF = df.withColumn("range", indexToRange(col("index")))
rangeDF.show(10)
You can achieve it with the approach below:
import spark.implicits._

val input_df = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5)).toDF("col1")
input_df.show(false)
Input:
+----+
|col1|
+----+
|1 |
|2 |
|3 |
|4 |
|5 |
+----+
val output_df = input_df.rdd.map(x => x(0).toString()).map(x => (x, Range(0, x.toInt + 1).mkString(","))).toDF("col1", "col2")
output_df.withColumn("col2", split($"col2", ",")).show(false)
Output:
+----+------------------+
|col1|col2 |
+----+------------------+
|1 |[0, 1] |
|2 |[0, 1, 2] |
|3 |[0, 1, 2, 3] |
|4 |[0, 1, 2, 3, 4] |
|5 |[0, 1, 2, 3, 4, 5]|
+----+------------------+
Hope this helps!
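As a side note, if you are on Spark 2.4 or later (an assumption, since the question does not state a version), the built-in sequence function produces the same array column without a UDF or an RDD round-trip:
import org.apache.spark.sql.functions.{col, lit, sequence}

// sequence(start, stop) builds an array of consecutive integers, inclusive of both ends
val output_df2 = input_df.withColumn("col2", sequence(lit(0), col("col1")))
output_df2.show(false)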

SPARK-SCALA: Update end_date for an ID with the new start_date of the updated record for the same ID

Using Spark Scala, I want to create a new column end_date for an id, holding the start_date of the updated record for the same id.
Consider the following DataFrame:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here the start_date of id=1 is initially 1/1/2018 with value a; on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new end_date column with 10/1/2018 for the original id=1 record and NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+-----+----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function lead:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._

val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")

// within each id, end_date is the start_date of the next record (null if there is none)
val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")

val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
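One caveat: ordering the window by the raw start_date string happens to work on this sample but is fragile in general; parsing it into a proper date makes the ordering robust. A sketch assuming the M/d/yyyy format seen in the sample data:
import org.apache.spark.sql.functions.to_date

// order each id's records by the actual date, not its string representation
val idWindowByDate = Window.partitionBy($"id")
  .orderBy(to_date($"start_date", "M/d/yyyy"))

val result2 = df.withColumn("end_date", lead($"start_date", 1).over(idWindowByDate))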

How to select the most distinct value or How to perform a Inner/Nested groupBy in Spark?

Original Dataframe
+-------+---------------+
| col_a | col_b |
+-------+---------------+
| 1 | aaa |
| 1 | bbb |
| 1 | ccc |
| 1 | aaa |
| 1 | aaa |
| 1 | aaa |
| 2 | eee |
| 2 | eee |
| 2 | ggg |
| 2 | hhh |
| 2 | iii |
| 3 | 222 |
| 3 | 333 |
| 3 | 222 |
+-------+---------------+
Result Dataframe I need:
+----------------+---------------------+-----------+
| group_by_col_a | most_distinct_value | col_a cnt |
+----------------+---------------------+-----------+
| 1 | aaa | 6 |
| 2 | eee | 5 |
| 3 | 222 | 3 |
+----------------+---------------------+-----------+
Here is what I have tried so far
val DF = originalDF
  .groupBy($"col_a")
  .agg(
    max(countDistinct("col_b")),
    count("col_a").as("col_a_cnt"))
and the error message:
org.apache.spark.sql.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
What is the problem?
Is there an efficient method to select the most distinct value?
You need two groupBys and a join to get the result, as below:
import spark.implicits._
import org.apache.spark.sql.functions.{count, max, struct}

val data = spark.sparkContext.parallelize(Seq(
  (1, "aaa"), (1, "bbb"),
  (1, "ccc"), (1, "aaa"),
  (1, "aaa"), (1, "aaa"),
  (2, "eee"), (2, "eee"),
  (2, "ggg"), (2, "hhh"),
  (2, "iii"), (3, "222"),
  (3, "333"), (3, "222")
)).toDF("a", "b")

// calculating the count for column a
val countDF = data.groupBy($"a").agg(count("a").as("col_a cnt"))

val distinctDF = data.groupBy($"a", $"b").count()
  // max(struct(count, b)) picks, per group, the b with the highest count
  .groupBy("a").agg(max(struct("count", "b")).as("max"))
  // selecting the most frequent ("most distinct") value
  .select($"a", $"max.b".as("most_distinct_value"))
  // joining both dataframes to get the final result
  .join(countDF, Seq("a"))
distinctDF.show()
Output:
+---+-------------------+---------+
| a|most_distinct_value|col_a cnt|
+---+-------------------+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+---+-------------------+---------+
Hope this was helpful!
Another approach is to do the conversion at the RDD level, which avoids the extra groupBy and join:
val input = Seq((1, "aaa"), (1, "bbb"), (1, "ccc"), (1, "aaa"), (1, "aaa"),
(1, "aaa"), (2, "eee"), (2, "eee"), (2, "ggg"), (2, "hhh"), (2, "iii"),
(3, "222"), (3, "333"), (3, "222"))
import sparkSession.implicits._
import org.apache.spark.rdd.RDD

val inputRDD: RDD[(Int, String)] = sc.parallelize(input)
Conversion:
val outputRDD: RDD[(Int, String, Int)] =
  inputRDD.groupBy(_._1)
    .map(row =>
      (row._1,
        // most frequent value within the group
        row._2.map(_._2)
          .groupBy(identity)
          .maxBy(_._2.size)._1,
        // total number of rows in the group
        row._2.size))
Now you can create a DataFrame and display it:
val outputDf: DataFrame = outputRDD.toDF("col_a", "col_b", "col_a cnt")
outputDf.show()
Output:
+-----+-----+---------+
|col_a|col_b|col_a cnt|
+-----+-----+---------+
| 1| aaa| 6|
| 3| 222| 3|
| 2| eee| 5|
+-----+-----+---------+
You can achieve your requirement by simply defining a udf function and using the collect_list and count functions (which you've already done).
In the udf function, you pass in the collected list of col_b values and return the most frequently occurring string in the group:
import org.apache.spark.sql.functions._
import scala.collection.mutable

def maxCountdinstinct = udf((list: mutable.WrappedArray[String]) => {
  list.groupBy(identity) // grouping the strings
    .mapValues(_.size)   // counting each group
    .maxBy(_._2)._1      // returning the string with the max count
})
And you can call the udf function as
val DF = originalDF
.groupBy($"col_a")
.agg(maxCountdinstinct(collect_list("col_b")).as("most_distinct_value"), count("col_a").as("col_a_cnt"))
which should give you
+-----+-------------------+---------+
|col_a|most_distinct_value|col_a_cnt|
+-----+-------------------+---------+
|3 |222 |3 |
|1 |aaa |6 |
|2 |eee |5 |
+-----+-------------------+---------+
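For completeness, the same result can also be obtained without a UDF and without a second join, using window functions over the per-pair counts; a sketch against the original column names from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, row_number, sum}
import spark.implicits._

// count each (col_a, col_b) pair, keep the most frequent value per col_a,
// and sum the pair counts back up to the per-group total
val pairCounts = originalDF.groupBy($"col_a", $"col_b").agg(count("col_b").as("cnt"))

val byCount = Window.partitionBy($"col_a").orderBy($"cnt".desc)
val byGroup = Window.partitionBy($"col_a")

val resultDF = pairCounts
  .withColumn("rn", row_number().over(byCount))
  .withColumn("col_a_cnt", sum($"cnt").over(byGroup))
  .filter($"rn" === 1)
  .select($"col_a", $"col_b".as("most_distinct_value"), $"col_a_cnt")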