Spark SQL Dataframe API -build filter condition dynamically - scala

I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non matching records from df1, based on a number of columns which is specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The idea is how to build a where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.

You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists are correct, the columns on the same location in the list will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+

If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.

Related

Using rlike with list to create new df scala

just started with scala 2 days ago.
Here's the thing, I have a df and a list. The df contains two columns: paragraphs and authors, the list contains words (strings). I need to get the count of all the paragraphs where every word on list appears by author.
So far my idea was to create a for loop on the list to query the df using rlike and create a new df, but even if this does work, I wouldn't know how to do it. Any help is appreciated!
Edit: Adding example data and expected output
// Example df and list
val df = Seq(("auth1", "some text word1"), ("auth2","some text word2"),("auth3", "more text word1").toDF("a","t")
df.show
+-------+---------------+
| a| t|
+-------+---------------+
|auth1 |some text word1|
|auth2 |some text word2|
|auth1 |more text word1|
+-------+---------------+
val list = List("word1", "word2")
// Expected output
newDF.show
+-------+-----+----------+
| word| a|text count|
+-------+-----+----------+
|word1 |auth1| 2|
|word2 |auth2| 1|
+-------+-----+----------+
You can do a filter and aggregation for each word in the list, and combine all the resulting dataframes using unionAll:
val result = list.map(word =>
df.filter(df("t").rlike(s"\\b${word}\\b"))
.groupBy("a")
.agg(lit(word).as("word"), count(lit(1)).as("text count"))
).reduce(_ unionAll _)
result.show
+-----+-----+----------+
| a| word|text count|
+-----+-----+----------+
|auth3|word1| 1|
|auth1|word1| 1|
|auth2|word2| 1|
+-----+-----+----------+

Comparing two Identically structured Dataframes in Spark

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
So the above two dataframes has the same table structure and I want to find out the id's for which the values have changed in the other dataframe(changedDF). I tried with the except() function in spark but its giving me two rows. Id is the common column between these two dataframes.
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 4|Joshua|cochin| 612| 85000|
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Whereas I only want the common ids for which there has been any changes.Like this ->
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Is there any way to find out the only the common ids for which the data have changed.
Can anybody tell me any approach I can follow to achieve this.
You can do the inner join of the dataframes, that will give you the result with common ids.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
.select("a.*")
.except(changedDF)
.show
Then, your expected result will be out:
+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
| 2|sunil|noida| 600| 80000|
+---+-----+-----+------------+------------+

Spark dataframe replace values of specific columns in a row with Nulls

I am facing a problem when trying to replace the values of specific columns of a Spark dataframe with nulls.
I have a dataframe with more than fifty columns of which two are key columns. I want to create a new dataframe with same schema and the new dataframe should have values from the key columns and null values in non-key columns.
I tried the following ways but facing issues:
//old_df is the existing Dataframe
val key_cols = List("id", "key_number")
val non_key_cols = old_df.columns.toList.filterNot(key_cols.contains(_))
val key_col_df = old_df.select(key_cols.head, key_cols.tail:_*)
val non_key_cols_df = old_df.select(non_key_cols.head, non_key_cols.tail:_*)
val list_cols = List.fill(non_key_cols_df.columns.size)("NULL")
val rdd_list_cols = spark.sparkContext.parallelize(Seq(list_cols)).map(l => Row(l:_*))
val list_df = spark.createDataFrame(rdd_list_cols, non_key_cols_df.schema)
val new_df = key_col_df.crossJoin(list_df)
This approach was good when I only have string type columns in the old_df. But I have some columns of double type and int type which is throwing error because the rdd is a list of null strings.
To avoid this I tried the list_df as an empty dataframe with schema as the non_key_cols_df but the result of crossJoin is an empty dataframe which I believe is because one dataframe is empty.
My requirement is to have the non_key_cols as a single row dataframe with Nulls so that I can do crossJoin with key_col_df and form the required new_df.
Also any other easier way to update all columns except key columns of a dataframe to nulls will resolve my issue. Thanks in advance
crossJoin is an expensive operation so you want to avoid it if possible.
An easier solution would be to iterate over all non-key columns and insert null with lit(null). Using foldLeft this can be done as follows:
val keyCols = List("id", "key_number")
val nonKeyCols = df.columns.filterNot(keyCols.contains(_))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c, lit(null)))
Input example:
+---+----------+---+----+
| id|key_number| c| d|
+---+----------+---+----+
| 1| 2| 3| 4.0|
| 5| 6| 7| 8.0|
| 9| 10| 11|12.0|
+---+----------+---+----+
will give:
+---+----------+----+----+
| id|key_number| c| d|
+---+----------+----+----+
| 1| 2|null|null|
| 5| 6|null|null|
| 9| 10|null|null|
+---+----------+----+----+
Shaido answer has small drawback - column type will be lost.
Can be fixed with schema usage, like this:
val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c.name, lit(null).cast(c.dataType)))

Remove all records which are duplicate in spark dataframe

I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name) but it will only drop duplicate entries but still keep one record in the dataframe. What I need is to remove all entries which were initially containing duplicate entries.
I am using Spark 1.6 and Scala 2.10.
I would use window-functions for this. Lets say you want to remove duplicate id rows :
import org.apache.spark.sql.expressions.Window
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()
This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternativly, it can be done by using a join. It will be slower, but if there is a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses #Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
object DataFrameExt {
implicit class DataFrameMethods(df: DataFrame) {
def killDuplicates(cols: Column*): DataFrame = {
df
.withColumn(
"my_super_secret_count",
count("*").over(Window.partitionBy(cols: _*))
)
.where(col("my_super_secret_count") === 1)
.drop(col("my_super_secret_count"))
}
}
}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge 2 data frames by removing duplicates by comparing columns.
I have two dataframes with same column names
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merging of 2 dataframes to display only unique rows by applying two conditions
1.For same name duration will be sum of durations.
2.For same name,the final date will be latest date.
Final output will be
final.show()
+-------+----------+--------+
| name | date|duration|
+----- +----------+--------+
| bob |2015-01-13| 7|
|alice |2015-04-23| 10|
|alice2 |2015-04-13| 10|
+-------+----------+--------+
I followed the following method.
//Take union of 2 dataframe
val df =a.unionAll(b)
//group and take sum
val grouped =df.groupBy("name").agg($"name",sum("duration"))
//join
val j=df.join(grouped,"name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
+------+----------+--------+
How can I now remove duplicates by comparing dates.
Will it be possible by running sql queries after registering it as table.
I am a beginner in SparkSQL and I feel like my way of approaching this problem is weird. Is there any better way to do this kind of data processing.
you can do max(date) in groupBy(). No need to do join the grouped with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))