Spark (Scala) SQL DataFrame joins not working as expected

I thought I knew SQL joins, but now I'm not so sure.
I have a DataFrame with movie ratings and another DataFrame with user IDs and their indexes. I want to join the two so that every movie rating carries the corresponding user index. However, after joining the tables I get more records than I had before the join, which makes no sense to me. I expect the same number of records, with an extra u_number column.
My first idea was a left join with ratingsDf on the left and userDataFrame on the right, but I get undesired results with every join type I tried.
The command I use for the join:
val ratingsUsers = ratingsDf.join(userDataFrame, ratingsDf("uid") === userDataFrame("uid"),"left" )
These are the tables:
scala> ratingsDf.show(5)
+--------------+----------+------+
| uid| mid|rating|
+--------------+----------+------+
|A1V0C9SDO4DKLA|B0002IQNAG| 4.0|
|A38WAOQVVWOVEY|B0002IQNAG| 4.0|
|A2JP0URFHXP6DO|B0002IQNAG| 5.0|
|A2X4HJ26YWTGJU|B0002IQNAG| 5.0|
|A3A98961GZKIGD|B0002IQNAG| 5.0|
+--------------+----------+------+
scala> userDataFrame.show(5)
+--------------+--------+
| uid|u_number|
+--------------+--------+
|A10049L7AJW9M7| 0|
|A1007G0226CSWC| 1|
|A100FQCUCZO2WG| 2|
|A100JCBNALJFAW| 3|
|A100K3KEMSVSCM| 4|
+--------------+--------+

So the issue was indeed duplicate keys in userDataFrame.
I had called .distinct() on the users RDD, which holds (k, v) tuples, thinking that distinct() only looked at the keys. It actually compares the whole tuple, which left duplicate keys in the DataFrame created from that RDD. With a left join, every rating row is repeated once per matching user row, so the duplicate keys produced the extra records.
Thanks for the help.
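The diagnosis above can be reproduced with a small sketch. The sample rows are made up for illustration, and the local SparkSession is an assumption (in spark-shell, getOrCreate() simply returns the existing session). reduceByKey is only one possible way to keep a single index per uid.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dup-keys").getOrCreate()
import spark.implicits._

// Hypothetical sample data: the same uid appears with two different indexes.
val usersRdd = spark.sparkContext.parallelize(Seq(
  ("A1V0C9SDO4DKLA", 0L),
  ("A1V0C9SDO4DKLA", 1L),
  ("A38WAOQVVWOVEY", 2L)
))

// distinct() compares whole tuples, so both rows for the first uid survive.
val stillDuplicated = usersRdd.distinct()

// Keeping one value per key (here: the smallest index) removes duplicate keys.
val onePerKey = usersRdd.reduceByKey((a, b) => math.min(a, b))

val ratingsDf = Seq(
  ("A1V0C9SDO4DKLA", "B0002IQNAG", 4.0),
  ("A38WAOQVVWOVEY", "B0002IQNAG", 4.0)
).toDF("uid", "mid", "rating")

// Joining on a column-name sequence also avoids a duplicated uid column.
val inflated = ratingsDf.join(stillDuplicated.toDF("uid", "u_number"), Seq("uid"), "left")
val expected = ratingsDf.join(onePerKey.toDF("uid", "u_number"), Seq("uid"), "left")
```

A quick way to check an existing DataFrame for this problem is userDataFrame.groupBy("uid").count().filter($"count" > 1), which lists every uid that occurs more than once.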

Related

Spark scala selecting multiple columns from a list and single columns

I'm attempting to do a select on a dataframe but I'm having a little bit of trouble.
I have this initial dataframe
+---+-------+-------+-------+-------+
| id|value_a|value_b|value_c|value_d|
+---+-------+-------+-------+-------+
What I have to do is sum value_a with value_b and keep the others the same. So I have this list:
val select_list = List("id", "value_c", "value_d")
and after this I do the select:
df.select(select_list.map(col):_*, (col("value_a") + col("value_b")).as("value_b"))
And I'm expecting to get this:
+---+-------+-------+-------+
| id|value_c|value_d|value_b|
+---+-------+-------+-------+
(here value_b is the sum of the original value_a and value_b)
But I'm getting a "no _* annotation allowed here" error. Keep in mind that in reality I have a lot of columns, so I need to use a list; I can't simply select each column. I'm running into this trouble because the new column that results from the sum has the same name as an existing column, so I can't just select(column("*"), sum....).drop(value_b), or I'd be dropping both the old column and the new one with the sum.
What is the correct syntax to add multiple and single columns in a single select, or how else can I solve this?
For now I decided to do this:
df.select(col("*"), (col("value_a") + col("value_b")).as("value_b_tmp")).
drop("value_a", "value_b").withColumnRenamed("value_b_tmp", "value_b")
This works fine, but I understand that withColumn and withColumnRenamed are expensive because each one creates pretty much a new DataFrame with a new or renamed column, and I'm looking for the least expensive operation possible.
Thanks in advance!
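Simply use the .withColumn function; it replaces the column if it already exists. Then select the listed columns plus the replaced value_b (selecting select_list alone would drop it):
df
.withColumn("value_b", col("value_a") + col("value_b"))
.select((select_list :+ "value_b").map(col): _*)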
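The "no _* annotation allowed here" error in the question comes from Scala itself: `: _*` may only be applied to the last (varargs) argument of a call, so it cannot be followed by another column. One way around it, sketched here under the assumption of a one-row sample frame and a local SparkSession, is to append the extra column to the list first and expand the combined sequence.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select-list").getOrCreate()
import spark.implicits._

// Hypothetical one-row frame using the column names from the question.
val df = Seq((1, 10, 20, 30, 40)).toDF("id", "value_a", "value_b", "value_c", "value_d")

val selectList = List("id", "value_c", "value_d")

// Build a single Seq[Column], then expand it once as the varargs argument.
val columns = selectList.map(col) :+ (col("value_a") + col("value_b")).as("value_b")
val result = df.select(columns: _*)
```

Because everything happens in one select, no temporary column or rename is needed.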
You can also create a new sum column and accumulate the sum of the n columns with a fold:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val df: DataFrame =
  spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(1, 2, 3))),
    StructType(List(
      StructField("field1", IntegerType),
      StructField("field2", IntegerType),
      StructField("field3", IntegerType))))
val columnsToSum = df.schema.fieldNames
// Start from a zero-valued "sum" column and add every column except field1 to it.
columnsToSum.filter(name => name != "field1")
  .foldLeft(df.withColumn("sum", lit(0)))((df, column) =>
    df.withColumn("sum", col("sum") + col(column)))
  .show()
Gives:
+------+------+------+---+
|field1|field2|field3|sum|
+------+------+------+---+
| 1| 2| 3| 5|
| 1| 2| 3| 5|
+------+------+------+---+
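As a possible variation on the fold above, the columns can be combined into one expression with reduce and added in a single withColumn call. This is only a sketch: the sample frame mirrors the one above, and the local SparkSession is an assumption.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("sum-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3), (1, 2, 3)).toDF("field1", "field2", "field3")

// Combine every column except field1 into one expression: field2 + field3.
val total = df.columns.filter(_ != "field1").map(col).reduce(_ + _)

val withSum = df.withColumn("sum", total)
```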

Comparing two Identically structured Dataframes in Spark

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
The two DataFrames above have the same table structure, and I want to find the ids whose values have changed in the other DataFrame (changedDF). I tried the except() function in Spark, but it gives me two rows. id is the common column between the two DataFrames.
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 4|Joshua|cochin| 612| 85000|
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Whereas I only want the common ids for which there have been any changes, like this:
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Is there any way to find only the common ids for which the data has changed?
Can anybody suggest an approach I can follow to achieve this?
You can do an inner join of the DataFrames, which gives you the rows with common ids.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
.select("a.*")
.except(changedDF)
.show
This returns the rows for the common ids whose data changed. Because of select("a.*"), the values shown are the ones from originalDF:
+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
| 2|sunil|noida| 600| 80000|
+---+-----+-----+------------+------------+
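If the changed values from changedDF are wanted instead (the 650 / 90000 row shown in the question), one option is to keep the except() result and restrict it to ids that already exist in originalDF with a left_semi join. This is a sketch of that alternative and not the approach of the answer above; the local SparkSession is an assumption.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("changed-rows").getOrCreate()
import spark.implicits._

val originalDF = Seq(
  (1, "gaurav", "jaipur", 550, 70000),
  (2, "sunil", "noida", 600, 80000),
  (3, "rishi", "ahmedabad", 510, 65000)
).toDF("id", "name", "city", "credit_score", "credit_limit")

val changedDF = Seq(
  (1, "gaurav", "jaipur", 550, 70000),
  (2, "sunil", "noida", 650, 90000),
  (4, "Joshua", "cochin", 612, 85000)
).toDF("id", "name", "city", "creditscore", "credit_limit")

// Rows of changedDF that do not appear verbatim in originalDF (new or modified).
val newOrModified = changedDF.except(originalDF)

// left_semi keeps only the rows whose id also exists in originalDF.
val modifiedOnly = newOrModified.join(originalDF.select("id"), Seq("id"), "left_semi")
```

except() compares columns by position, so the differing column names (credit_score and creditscore) do not matter here.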

Spark: Flatten simple multi-column DataFrame

How to flatten a simple (i.e. no nested structures) dataframe into a list?
My problem set is detecting all the node pairs that have been changed/added/removed from a table of node pairs.
This means I have a "before" and "after" table to compare. Combining the before and after dataframe yields rows that describe where a pair appears in one dataframe but not the other.
Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1 |after.id2 |
+-----------+-----------+-----------+-----------+
| null| null| E2| E3|
| B3| B1| null| null|
| I1| I2| null| null|
| A2| A3| null| null|
| null| null| G3| G4|
+-----------+-----------+-----------+-----------+
The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:
{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}
Potential approaches:
Union all the columns separately and distinct
flatMap and distinct
map and flatten
Since the structure is well known and simple it seems like there should be an equally straightforward solution. Which approach, or others, would be the simplest approach?
Other notes
Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
distinct in the resulting list is nice to have, but not required; I assume it is trivial with the distinct operation
Try the following: convert each row to a Seq, collect all rows, flatten the data, and remove the null value:
val df = Seq(("A","B"),(null,"A")).toDF
val result = df.rdd.map(_.toSeq.toList)
.collect().toList.flatten.toSet - null
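The same result can probably be reached without leaving the DataFrame API until the final collect, by packing all columns into an array and exploding it. The sketch below assumes a local SparkSession and uses underscore column names (before_id1 and so on) in place of the dotted names from the question, to avoid backtick quoting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode}

val spark = SparkSession.builder().master("local[*]").appName("flatten-nodes").getOrCreate()
import spark.implicits._

// Two sample rows from the question; column names are adapted (hypothetical).
val pairs = Seq[(String, String, String, String)](
  (null, null, "E2", "E3"),
  ("B3", "B1", null, null)
).toDF("before_id1", "before_id2", "after_id1", "after_id2")

val nodes: Set[String] = pairs
  // One array per row holding all four ids, then one output row per element.
  .select(explode(array(pairs.columns.map(col): _*)).as("node"))
  // Null elements are emitted by explode, so drop them explicitly.
  .where(col("node").isNotNull)
  .distinct()
  .as[String]
  .collect()
  .toSet
```

Here distinct() runs on the executors, so only the unique node ids are collected to the driver.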

Spark SQL Dataframe API -build filter condition dynamically

I have two Spark DataFrames, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo,eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for this scenario, since the lookup file is configurable and may list anywhere from 1 to n fields.
You can use the except DataFrame method. For simplicity, I'm assuming that the columns to use are in two lists. The order of both lists must match: columns at the same position in each list are compared, regardless of column name. After except, use join to get the missing columns from the first DataFrame.
import org.apache.spark.sql.functions.broadcast

val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
// Key columns of df1 that have no positional match in df2.
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
  .except(df2.select(df2Cols.head, df2Cols.tail: _*))
// Join back to df1 to recover the remaining columns (e.g. age).
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
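Another way to read the question literally is to build the condition itself from the configured column pairs and use it in a left_anti join, which keeps the df1 rows that have no match in df2. The sketch below hard-codes the pairs that would come from the lookup file; the columnPairs name and the local SparkSession are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dynamic-condition").getOrCreate()
import spark.implicits._

val df1 = Seq(("shankar", "12121", 28), ("ramesh", "1212", 29), ("suresh", "1111", 30), ("aarush", "0707", 15))
  .toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"), ("ramesh", "1212", 29, "malmo"))
  .toDF("eName", "eNo", "age", "city")

// Column pairs as they would be read from the lookup file (hard-coded here).
val columnPairs = Seq("name" -> "eName", "empNo" -> "eNo")

// One equality test per configured pair, combined with AND.
val condition = columnPairs
  .map { case (left, right) => df1(left) === df2(right) }
  .reduce(_ && _)

// left_anti returns only df1's columns, for rows without any match in df2.
val nonMatching = df1.join(df2, condition, "left_anti")
```

Since the condition is folded from a sequence, it works unchanged for any number of configured pairs (at least one).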
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, with something like the approach in "Changing a SQL column title via query". You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like the approach in "How to obtain the difference between two DataFrames?".
If you need more columns that wouldn't be used in the diff (e.g. age), you can reselect the data based on your diff results. This may not be the optimal way of doing it, but it would probably work.

Scala / DataFrame / Spark: How do I express multiple conditional aggregates?

Let's say I have a table like:
id,date,value
1,2017-02-12,3
2,2017-03-18,2
1,2017-03-20,5
1,2017-04-01,1
3,2017-04-01,3
2,2017-04-10,2
I already have this as a DataFrame (it comes from a Hive table).
Now, I want an output that looks like (logically):
id, count($"date">"2017-03"), sum($"value" where $"date">"2017-03"), count($"date">"2017-02"), sum($"value" where $"date">"2017-02")
I've tried to express this in a single agg(), but I just can't figure out how to do the inner conditionals. I know how to filter ahead of the aggregation, but that doesn't do what I need with the two different sub-ranges.
// doesn't do the right thing
myDF.where($"date">"2017-03")
.groupBy("id")
.agg(sum("value") as "value_03", count("value") as "count_03")
.where($"date">"2017-04")
.agg(sum("value") as "value_04", count("value") as "value_04")
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses. How do I do something similar with DataFrames in Spark with Scala?
The closest I can think of is calculating membership for each tuple in each of the windows before the groupBy(), and summing over that membership times value (and straight sum for count.) It seems like there should be a better way to express this with conditionals inside the agg(), but I can't find it.
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses.
You can do exactly the same thing here:
import org.apache.spark.sql.functions.{sum, when}
myDF
.groupBy($"id")
.agg(
sum(when($"date" > "2017-03", $"value")).alias("value3"),
sum(when($"date" > "2017-04", $"value")).alias("value4")
)
+---+------+------+
| id|value3|value4|
+---+------+------+
| 1| 6| 1|
| 3| 3| 3|
| 2| 4| 2|
+---+------+------+
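The conditional counts the question also asks for can be written the same way, since count only counts non-null values and when without otherwise yields null for non-matching rows. A sketch under the assumptions that the dates are stored as strings (as in the sample data) and that a local SparkSession is available:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum, when}

val spark = SparkSession.builder().master("local[*]").appName("conditional-agg").getOrCreate()
import spark.implicits._

val myDF = Seq(
  (1, "2017-02-12", 3), (2, "2017-03-18", 2), (1, "2017-03-20", 5),
  (1, "2017-04-01", 1), (3, "2017-04-01", 3), (2, "2017-04-10", 2)
).toDF("id", "date", "value")

val result = myDF
  .groupBy($"id")
  .agg(
    // when(...) is null where the condition fails, and count skips nulls.
    count(when($"date" > "2017-03", 1)).alias("count3"),
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    count(when($"date" > "2017-02", 1)).alias("count2"),
    sum(when($"date" > "2017-02", $"value")).alias("value2")
  )
```

The comparisons are plain string comparisons, so "2017-03-18" > "2017-03" holds while "2017-02-12" > "2017-03" does not.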