Update data from two Data Frames Scala-Spark - scala

I have two Data Frames:
DF1:
ID | Col1 | Col2
1 a aa
2 b bb
3 c cc
DF2:
ID | Col1 | Col2
1 ab aa
2 b bba
4 d dd
How can I join these two DFs and the result should be:
Result:
1 ab aa
2 b bba
3 c cc
4 d dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
.select($"ID",
when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
.orderBy("ID")
And it works, but I don't want to specify each column, because I have large files.
So, is there any way to update the dataframe (and to add some recors if in the second DF are new one) without specifying each column?

A simple leftanti join of df1 with df2 and merging of the result into df2 should get your desired output as
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1 |ab |aa |
|2 |b |bba |
|3 |c |cc |
|4 |d |dd |
+---+----+----+
The solution doesn't match the logic you have in your code but generates the expected result

Related

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How to get total counts of each words, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or VocabSize limitations). In these cases you can use Summarizer to directly sum each Vector. Note: this requires Spark 2.3+ for Summarizer.
import org.apache.spark.ml.stat.Summarizer.metrics
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
You can simply explode and groupBy to get the count of each word
cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) -1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a |3 |1 |
|b |3 |2 |
|c |2 |3 |
+-----+------+---+

Row manipulation for Dataframe in spark [duplicate]

This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have a dataframe in spark which is like :
column_A | column_B
--------- --------
1 1,12,21
2 6,9
both column_A and column_B is of String type.
how can I convert the above dataframe to a new dataframe which is like :
colum_new_A | column_new_B
----------- ------------
1 1
1 12
1 21
2 6
2 9
both column_new_A and column_new_B should be of String type.
You need to split the Column_B with comma and use the explode function as
val df = Seq(
("1", "1,12,21"),
("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create new column.
df.withColumn("column_B", explode(split( $"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split( $"column_B", ",")).as("column_new_B"))
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+

How to add a new column to data frame based on two columns of other data frames

I have two DataFrames df_data and df_node_labels:
df_data =
nodeId field1
1 abc
2 def
3 fed
4 kfl
df_node_labels =
srcId srcLabel dstId dstLabel
1 AAA 2 BBB
2 BBB 4 FFF
4 FFF 3 CCC
I want to add a column label to df_data. The values of label should be taken from srcLabel and dstLabel:
This is how I tried to grab label information:
var df = df_data.join(df_node_labels.select("srcId","srcLabel"),col("nodeId")===col("srcId"),"left")
df = df.join(df_node_labels.select("dstId","dstLabel"),col("nodeId")===col("dstId"),"left")
However, this creates two columns srcLabel and dstLabel in df, while I want to get just one column label.
This is the expected result:
df =
nodeId field1 label
1 abc AAA
2 def BBB
3 fed CCC
4 kfl FFF
Update:
I can do it this way, but in my opinion it is a long way to do a simple thing:
df = df.withColumn("label", when(col("srcLabel") =!= "", col("srcLabel")).otherwise(col("dstLabel"))).drop("srcLabel").drop("dstLabel")
You can create a unique data from df_node_labels as a finalDF in below and perform a join operation which will give you an expected result.
val finalDF = df_node_labels.select($"srcId".as("nodeId"), $"srcLabel".as("label"))
.union(
df_node_labels.select($"dstId".as("nodeId"), $"dstLabel".as("label"))
).dropDuplicates()
df_data.join(finalDF, Seq("nodeId"), "left")
.show(false)
Output:
+------+------+-----+
|nodeId|field1|label|
+------+------+-----+
|1 |abc |AAA |
|2 |def |BBB |
|3 |fed |CCC |
|4 |kfl |FFF |
+------+------+-----+
I hope this helped you!

Scala how to match two dfs if mathes then update the key in first df

I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF=personDF.select((col("ID"),col("key"))).show()
val selectedDetailsDF=detailsDF.select(col("first"),col("second"),col("third"),col("key")).show()
I have to match the selectedPersonDF id column with selectedDetailsDF all the columns(First, Second, Third) if any of the column data matches with persons id then we have to take the key value from selectedDetailsDF and have to update in selectedPersonDF key column.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
and after removing the first row from persons'df since its matched with detailsdf remaining data should be stored in another df.
You can use join and use || condition checking and left join as
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
.select($"ID", $"key2".as("key"))
.show(false)
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on above dataframe (key column has to be StringType()) to get
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that you can use filter to separate the dataframe into matching and non matching using key column with value and null repectively
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
Updated for if the column names of selectedDetailsDF is unknown except the key column
If the column names of the second dataframe is unknown then you will have to form an array column of the unknown columns as
val columnsToCheck = selectedDetailsDF.columns.toSet - "key" toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now tempSelectedDetailsDF dataframe has two columns: combined column of all the unknown columns as an array column and the key column renamed as key2.
After that you would need a udf function for checking the condition while joining
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
And then you join the dataframes using the call to the defined udf function as
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
.select($"ID", $"key2".as("key"))
.na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.

Remove duplicates in Pair RDD based on Values

I have an RDD with multiple rows which looks like below.
val row = [(String, String), (String, String, String)]
The value is a sequence of Tuples. In the tuple, the last String is a timestamp and the second one is category. I want to filter this sequence based on maximum timestamp for each category.
(A,B) Id Category Timestamp
-------------------------------------------------------
(123,abc) 1 A 2016-07-22 21:22:59+0000
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
(234,bcd) 2 B 2015-07-20 21:21:20+0000
I want one row for each of the categories.I was looking for some help on getting the row with the max timestamp for each category. The result I am looking to get is
(A,B) Id Category Timestamp
-------------------------------------------------------
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
Given input dataframe as
+---------+---+--------+------------------------+
|(A,B) |Id |Category|Timestamp |
+---------+---+--------+------------------------+
|[123,abc]|1 |A |2016-07-22 21:22:59+0000|
|[234,bcd]|2 |B |2016-07-20 21:21:20+0000|
|[123,abc]|1 |A |2017-07-09 21:22:59+0000|
|[345,cde]|4 |C |2016-07-05 09:22:30+0000|
|[456,def]|5 |D |2016-07-21 07:32:06+0000|
|[234,bcd]|2 |B |2015-07-20 21:21:20+0000|
+---------+---+--------+------------------------+
You can do the following to get the result dataframe you require
import org.apache.spark.sql.functions._
val requiredDataframe = df.orderBy($"Timestamp".desc).groupBy("Category").agg(first("(A,B)").as("(A,B)"), first("Id").as("Id"), first("Timestamp").as("Timestamp"))
You should have the requiredDataframe as
+--------+---------+---+------------------------+
|Category|(A,B) |Id |Timestamp |
+--------+---------+---+------------------------+
|B |[234,bcd]|2 |2016-07-20 21:21:20+0000|
|D |[456,def]|5 |2016-07-21 07:32:06+0000|
|C |[345,cde]|4 |2016-07-05 09:22:30+0000|
|A |[123,abc]|1 |2017-07-09 21:22:59+0000|
+--------+---------+---+------------------------+
You can do the same by using Window function as below
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("Category").orderBy($"Timestamp".desc)
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === lit(1)).drop("rank")