Scala / DataFrame / Spark: How do I express multiple conditional aggregates?

Let's say I have a table like:
id,date,value
1,2017-02-12,3
2,2017-03-18,2
1,2017-03-20,5
1,2017-04-01,1
3,2017-04-01,3
2,2017-04-10,2
I already have this as a dataframe (it comes from a Hive table)
Now, I want an output that looks like (logically):
id, count($"date">"2017-03"), sum($"value" where $"date">"2017-03"), count($"date">"2017-02"), sum($"value" where $"date">"2017-02")
I've tried to express this in a single agg(), but I just can't figure out how to do the inner conditionals. I know how to filter ahead of the aggregation, but that doesn't do what I need with the two different sub-ranges.
// doesn't do the right thing
myDF.where($"date">"2017-03")
.groupBy("id")
.agg(sum("value") as "value_03", count("value") as "count_03")
.where($"date">"2017-04")
.agg(sum("value") as "value_04", count("value") as "value_04")
In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses. How do I do something similar with DataFrames in Spark with Scala?
The closest I can think of is calculating membership for each tuple in each of the windows before the groupBy(), and summing over that membership times value (and straight sum for count.) It seems like there should be a better way to express this with conditionals inside the agg(), but I can't find it.

In SQL I would have put all the aggregation into a single SELECT statement with conditionals inside the count/sum clauses.
You can do exactly the same thing here:
import org.apache.spark.sql.functions.{sum, when}
myDF
  .groupBy($"id")
  .agg(
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    sum(when($"date" > "2017-04", $"value")).alias("value4")
  )
+---+------+------+
| id|value3|value4|
+---+------+------+
| 1| 6| 1|
| 3| 3| 3|
| 2| 4| 2|
+---+------+------+
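The counts from the question can go into the same agg, because count ignores the nulls that when produces when the condition is false. A minimal sketch following the same pattern (the alias names are mine):
import org.apache.spark.sql.functions.{count, sum, when}
myDF
  .groupBy($"id")
  .agg(
    count(when($"date" > "2017-03", $"value")).alias("count3"),
    sum(when($"date" > "2017-03", $"value")).alias("value3"),
    count(when($"date" > "2017-04", $"value")).alias("count4"),
    sum(when($"date" > "2017-04", $"value")).alias("value4")
  )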

Related

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column", when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise like:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to pick a column by a name computed from the data, as the execution plan would then depend on the data.
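For more than two candidate columns the same idea generalizes by building the when chain programmatically over the candidate names. A minimal sketch of that pattern in Scala (the candidates list and the *_processed naming are assumptions):
import org.apache.spark.sql.functions.{coalesce, col, when}

// Hypothetical candidate values; each maps to a "<name>_processed" column.
val candidates = Seq("one", "two")
val picked = coalesce(candidates.map { name =>
  when(col("special_column") === name, col(s"${name}_processed"))
}: _*)
df.withColumn("my_new_column", picked).show()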

Spark: Flatten simple multi-column DataFrame

How to flatten a simple (i.e. no nested structures) dataframe into a list?
My problem set is detecting all the node pairs that have been changed/added/removed from a table of node pairs.
This means I have a "before" and "after" table to compare. Combining the before and after dataframe yields rows that describe where a pair appears in one dataframe but not the other.
Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1 |after.id2 |
+-----------+-----------+-----------+-----------+
| null| null| E2| E3|
| B3| B1| null| null|
| I1| I2| null| null|
| A2| A3| null| null|
| null| null| G3| G4|
The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:
{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}
Potential approaches:
Union all the columns separately and distinct
flatMap and distinct
map and flatten
Since the structure is well known and simple it seems like there should be an equally straightforward solution. Which approach, or others, would be the simplest approach?
Other notes
Order of the id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
distinct in the resulting list is nice to have, but not required; assuming it is trivial with the distinct operation
Try the following: convert each row to a Seq, collect all the rows, flatten the data, and remove the null values:
val df = Seq(("A","B"),(null,"A")).toDF
val result = df.rdd.map(_.toSeq.toList)
.collect().toList.flatten.toSet - null
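If you would rather stay on the DataFrame side until the final collect, here is a sketch of the "union all the columns separately and distinct" approach from the question (the flat column names are assumptions, since the real ones carry before./after. prefixes):
import org.apache.spark.sql.functions.col

// Assumed flat column names for illustration.
val idCols = Seq("before_id1", "before_id2", "after_id1", "after_id2")
val nodes = idCols
  .map(c => df.select(col(c).as("node")))
  .reduce(_ union _)   // Spark 2.x; use unionAll on 1.x
  .na.drop()           // discard nulls
  .distinct()
  .collect()
  .map(_.getString(0))
  .toSet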

Remove all records that are duplicated in a Spark dataframe

I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all rows that contained duplicate values in that column in the first place.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove duplicate id rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

df
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1)
  .drop("cnt")
  .show()
This can be done by grouping by the column (or columns) to look for duplicates in, then aggregating and filtering the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done by using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable, so it might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
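If the column pairs really do come from the lookup file, the two lists can be derived from it rather than hard-coded. A sketch under the assumption that the file is the two-column CSV shown above with a header row (the path is made up):
import scala.io.Source

// Skip the "df1col,df2col" header, split each line, and unzip into the two lists.
val lines = Source.fromFile("/path/to/column_lookup.csv").getLines().drop(1)
val (df1Cols, df2Cols) = lines
  .map(_.split(",").map(_.trim))
  .collect { case Array(c1, c2) => (c1, c2) }
  .toList
  .unzip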
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.

Spark (Scala) sqlDataFrame Joins not working as expected

I thought I knew SQL joins, but now I'm not so sure about that.
I have a dataframe with movie ratings and another dataframe with userIds and their indexes. I want to join both dataframes so that I will have the corresponding user index for every movie rating. However, after joining the tables I get more records than I had before the join, which makes no sense to me. I expect to get the same number of records but with an extra column, u_number:
My first idea was to use a left join with ratingsDf as the left and userDataFrame as the right, but I get undesired results with any of the joins I tried.
The command I use for the join:
val ratingsUsers = ratingsDf.join(userDataFrame, ratingsDf("uid") === userDataFrame("uid"),"left" )
These are the tables :
scala> ratingsDf.show(5)
+--------------+----------+------+
| uid| mid|rating|
+--------------+----------+------+
|A1V0C9SDO4DKLA|B0002IQNAG| 4.0|
|A38WAOQVVWOVEY|B0002IQNAG| 4.0|
|A2JP0URFHXP6DO|B0002IQNAG| 5.0|
|A2X4HJ26YWTGJU|B0002IQNAG| 5.0|
|A3A98961GZKIGD|B0002IQNAG| 5.0|
+--------------+----------+------+
scala> userDataFrame.show(5)
+--------------+--------+
| uid|u_number|
+--------------+--------+
|A10049L7AJW9M7| 0|
|A1007G0226CSWC| 1|
|A100FQCUCZO2WG| 2|
|A100JCBNALJFAW| 3|
|A100K3KEMSVSCM| 4|
+--------------+--------+
So the issue was indeed a problem with duplicate keys in userDataFrame.
The issue was that I used .distinct() on the users RDD, which held (k, v) tuples; I thought distinct() worked on keys only, but it takes the whole tuple into consideration, which left me with duplicate keys in the dataframe created from that RDD.
Thanks for the help.
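A quick way to confirm and fix that kind of duplicate-key problem before the join is to deduplicate on the key column alone rather than on the whole (uid, u_number) tuple; a sketch:
// Check for uids that appear more than once in the lookup table.
userDataFrame.groupBy("uid").count().filter($"count" > 1).show()

// Deduplicate on the key only, then join; Seq("uid") also avoids a duplicated uid column.
val dedupedUsers = userDataFrame.dropDuplicates("uid")
val ratingsUsers = ratingsDf.join(dedupedUsers, Seq("uid"), "left")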