Comparing the values of columns in two dataframes - scala

I have two dataframes: one has a single row per id, and the other can have multiple rows for the same id.
This is dataframe df1:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:48:23 4 [8,-1,0,22,1,1,1]
df2:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:54:23 4 [9,0,0,22,1,1,1]
358899055773504 2018-07-31 18:58:34 0 [9,0,-1,22,0,1,0]
358899055773504 2018-07-31 18:28:34 0 [9,0,-1,22,0,1,0]
358899055773505 2018-07-31 18:38:23 4 [8,-1,0,22,1,1,1]
I aim to compare the second dataframe with the first and update the values in the first dataframe only if the dt of a particular id in df2 is greater than that in df1. If a row satisfies this greater-than condition, I then want to compare the other fields as well.

You need to join the two dataframes together to make any comparison of their columns.
What you can do is first join the dataframes and then perform all the filtering to get a new dataframe with all rows that should be updated:
val diffDf = df1.as("a").join(df2.as("b"), Seq("id"))
.filter($"b.dt" > $"a.dt")
.filter(...) // Any other filter required
.select($"id", $"b.dt", $"b.speed", $"b.stats")
Note: In some situations you will need a groupBy on id or a window function, since there should be only one final row per id in the diffDf dataframe. This can be done as follows (the example here selects the row with the maximum speed, but the right ordering depends on the actual requirements):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy($"id").orderBy($"speed".desc)
val diffDf2 = diffDf.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn")
More in-depth information about different approaches can be seen here: How to max value and keep all columns (for max records per group)?.
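For the use case in the question (keep the newest matching df2 row per id), ordering the window by dt rather than speed is probably what is wanted. A minimal sketch, assuming dt is a timestamp or a yyyy-MM-dd HH:mm:ss string so that sorting gives chronological order; latestPerId is a hypothetical helper name, not part of Spark:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper: keep only the most recent row per id.
// Assumes columns named "id" and "dt", with dt sortable in chronological order.
def latestPerId(df: DataFrame): DataFrame = {
  val w = Window.partitionBy(col("id")).orderBy(col("dt").desc)
  df.withColumn("rn", row_number().over(w)) // rank rows within each id, newest first
    .where(col("rn") === 1)                 // keep the newest row only
    .drop("rn")
}
```

Applied to diffDf, this leaves at most one row per id before the outer join with df1.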
To replace the old rows with the same id in the df1 dataframe, combine the dataframes with an outer join and coalesce:
val df = df1.as("a").join(diffDf.as("b"), Seq("id"), "outer")
.select(
$"id",
coalesce($"b.dt", $"a.dt").as("dt"),
coalesce($"b.speed", $"a.speed").as("speed"),
coalesce($"b.stats", $"a.stats").as("stats")
)
coalesce works by first trying to take the value from the diffDf (b) dataframe. If that value is null it will take the value from df1 (a).
Result when only using the time filter with the provided example input dataframes:
+---------------+-------------------+-----+-----------------+
| id| dt|speed| stats|
+---------------+-------------------+-----+-----------------+
|358899055773504|2018-07-31 18:58:34| 0|[9,0,-1,22,0,1,0]|
|358899055773505|2018-07-31 18:54:23| 4| [9,0,0,22,1,1,1]|
+---------------+-------------------+-----+-----------------+

Related

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
For every Place in DF1, I want to determine which City in DF2 is at the shortest distance. The distance can be calculated from the indexes, so for every row in DF1 I have to iterate over every row in DF2 and look for the shortest distance. A function is defined for the distance calculation:
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error when it runs:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I do something wrong with iterating over each row in DF2 when taking one row from DF1 but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using exists only in DF2, not in DF1, where you are attempting to perform the aggregation.
To make this field accessible and determine the distance from all points, you would need to:
1. Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
2. Determine the distance using your udf
3. Find the min of the distances on this new cross-joined dataframe
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
val resultDF = DF1.crossJoin(DF2)
.withColumn("distance", distance(col("IndexA"), col("IndexB")))
// Instead of using a groupBy and then matching the min distance of the aggregation
// against the initial df, this uses a window function min to determine the
// min_distance of each group (determined by Place) and keeps the city with the
// min distance to each place
.withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
.where(col("distance") === col("min_distance"))
.drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional column distance.
NB. Your current approach, which compares every item in one df to every item in another df, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is closer to a city), this is recommended.
Let me know if this works for you.
If you have only a few cities (around 1000 or fewer), you can avoid the crossJoin and the Window shuffle by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}
val citiesIndexes = DF2.select("City", "IndexB")
.collect()
.map(row => (row.getString(0), row.getLong(1)))
val result = DF1.withColumn(
"City",
array_min(
transform(
typedLit(citiesIndexes),
x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
)
).getItem("col2")
)
This piece of code works for Spark 3.0 and later. If you are on a Spark version older than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function.
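For the pre-3.0 case, the user-defined function only has to scan the collected cities and keep the closest one. A minimal sketch of that core logic in plain Scala; nearestCity is a hypothetical helper, and dist stands in for h3.instance.h3Distance (assumed here to return an Int):

```scala
// Hypothetical helper: return the name of the city whose index is closest to indexA.
// `cities` is the collected (City, IndexB) list; `dist` stands in for the H3 distance call.
def nearestCity(cities: Seq[(String, Long)], dist: (Long, Long) => Int)(indexA: Long): Option[String] =
  if (cities.isEmpty) None // no cities collected, nothing to return
  else Some(cities.minBy { case (_, indexB) => dist(indexA, indexB) }._1)

// Possible use in Spark (an assumption, not tested against H3):
//   val nearest = udf(nearestCity(citiesIndexes.toSeq, (a, b) => h3.instance.h3Distance(a, b)) _)
//   DF1.withColumn("City", nearest(col("IndexA")))
```

Returning an Option keeps the resulting column nullable when the city list is empty.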

Spark dataframe random sampling based on frequency occurrence in dataframe

Input description
I have a spark job with input dataframe with a column queryId. This queryId is not unique with respect to the dataframe. For example, there are roughly 3M rows in the spark dataframe with 450k distinct query ids.
Problem
I am trying to implement sampling logic and create a new column sampledQueryId which contains randomly sampled query id for each dataframe row by looking up query ids from the aggregate spark dataframe query id set.
Sampling goal
The restriction is that sampled query id shouldn't be equal to input query id.
Sampling should correspond to frequency of occurrence of query id in the incoming spark dataframe - ie given two query id q1 and q2, if the ratio of occurrence is 10:1(q1:q2), then q1 should appear approximately 10 times more in the sample id column.
Solution tried so far
I have tried to implement this by collecting the query ids into a list and looking up a random element of that list. However, empirical evidence makes me suspect that the logic doesn't work as expected: for example, I see a specific query id getting sampled 200 times, while a query id with similar frequency never gets sampled.
Any suggestions on whether this spark code is expected to work as intended?
val random = new scala.util.Random
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long) => {
val sampledId = queryIds(random.nextInt(queryIds.length))
if (sampledId != queryId) sampledId else null
})
val dataWithSampledIds = data.withColumn("sampledQueryId",sampleQueryId($"queryId"))
I received a response on a different forum and am documenting it here for posterity's sake. The issue is that one Random instance is captured by the udf and shipped to all executors, so every task starts from a copy with the same state. The n-th row of every partition therefore gets the same output.
scala> val random = new scala.util.Random
scala> val getRandom = udf((data: Long) => random.nextInt(10000))
scala> spark.range(0, 12, 1, 4).withColumn("rnd", getRandom($"id")).orderBy($"id").show
+---+----+
| id| rnd|
+---+----+
| 0|6720|
| 1|7667|
| 2|3344|
| 3|6720|
| 4|7667|
| 5|3344|
| 6|6720|
| 7|7667|
| 8|3344|
| 9|6720|
| 10|7667|
| 11|3344|
+---+----+
This df had 4 partitions. The value of rnd for the n-th row of every partition is the same (e.g. id = 1, 4, 7, 10 all get 7667). The solution is to use Spark's built-in rand() function, like below.
import org.apache.spark.sql.functions.{rand, udf}
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long, rand: Double) => {
// rand is in [0, 1), so the index is always within bounds
val sampledId = queryIds(scala.math.floor(rand * queryIds.length).toInt)
// Option maps to a nullable column: None becomes null when the sample equals the input id
if (sampledId != queryId) Some(sampledId) else None
})
val dataWithSampledIds = data.withColumn("sampledQueryId", sampleQueryId($"queryId", rand()))
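The snippet above yields null whenever the draw happens to equal the row's own query id. If a non-null sample is preferred, one option is rejection sampling: keep drawing from the full id list until a different id comes up. A minimal sketch in plain Scala; sampleOther, seed and maxTries are hypothetical names, and the per-row seed is assumed to be derived from Spark's rand() column so that rows do not share generator state:

```scala
import scala.util.Random

// Hypothetical helper: draw uniformly from the full (duplicate-containing) id list,
// rejecting draws equal to `own`. Uniform draws over the raw list keep the sample
// proportional to each remaining id's frequency.
def sampleOther(ids: Array[Long], own: Long, seed: Long, maxTries: Int = 100): Option[Long] =
  if (ids.isEmpty) None
  else {
    val rng = new Random(seed) // one generator per call, seeded per row
    Iterator.continually(ids(rng.nextInt(ids.length)))
      .take(maxTries)  // bounded, so a list containing only `own` terminates
      .find(_ != own)  // None if every attempt hit the row's own id
  }
```

With roughly 450k distinct ids the first draw almost always succeeds, so the retry loop adds little cost in practice.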

spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2

There are two dataframes: df1, and df2 with the same schema. ID is the primary key.
I need to merge df1 and df2. This can be done with a union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join and use coalesce, but I would prefer not to, because there are in fact about 40 columns instead of the 3 in the above example.
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 & df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
One way to do this is to union the dataframes with an identifier column that specifies the source dataframe, and then use it to prioritize rows from df1 with a function like row_number.
A PySpark solution is shown here.
from pyspark.sql.functions import lit,row_number,when
from pyspark.sql import Window
df1_with_identifier = df1.withColumn('identifier',lit('df1'))
df2_with_identifier = df2.withColumn('identifier',lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
#Define the Window with the desired ordering
w = Window.partitionBy(merged_df.ID).orderBy(when(merged_df.identifier == 'df1',1).otherwise(2))
result = merged_df.withColumn('rownum',row_number().over(w))
result.filter(result.rownum == 1).drop('rownum','identifier').show()
A solution with a full outer join of df1 and df2 could be a lot simpler, except that you have to write a coalesce for every column.
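The join-and-coalesce route does not have to mean writing 40 coalesce calls by hand, because the expressions can be generated from the column list. A minimal Scala sketch, assuming both DataFrames share the same schema; mergePreferLeft is a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical helper: full outer join on `key`, taking df1's value first
// and falling back to df2's value for every non-key column.
def mergePreferLeft(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
  val merged = df1.columns.filter(_ != key)
    .map(c => coalesce(col(s"a.$c"), col(s"b.$c")).as(c))
  df1.as("a").join(df2.as("b"), Seq(key), "full_outer")
    .select(col(key) +: merged: _*)
}
```

One caveat of this design: coalesce works per column, so if df1 holds a null in some column for an ID that also exists in df2, the df2 value shows through. The left_anti union shown earlier keeps the df1 row intact in that case.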

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one df2 has also 1 column called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
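Now that you've made clear that you just want two columns: with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value. Note that the generated ids depend on how each DataFrame is partitioned, so the two sides only line up when both DataFrames have the same number of partitions and the same number of rows in each partition: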
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column to each using RDD.zipWithIndex and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
// Assumes Int columns as in the example; needs import spark.implicits._
val df3 = (df1.collect().map(_.getInt(0)) zip df2.collect().map(_.getInt(0))).toSeq.toDF("col1", "col2")
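For DataFrames too large to collect, the zipWithIndex approach mentioned above could look roughly like this. It is a sketch under the assumption that the current row order of each DataFrame is the order to pair on; withRowIndex and zipColumns are hypothetical helper names:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append a consecutive 0-based row index as a new column.
def withRowIndex(df: DataFrame, name: String = "idx"): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
  val indexed = df.rdd.zipWithIndex().map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  df.sparkSession.createDataFrame(indexed, schema)
}

// Pair the rows of two DataFrames by position: join on the generated index, then drop it.
def zipColumns(df1: DataFrame, df2: DataFrame): DataFrame =
  withRowIndex(df1).join(withRowIndex(df2), Seq("idx")).orderBy("idx").drop("idx")
```

Unlike monotonically_increasing_id(), zipWithIndex produces consecutive indexes regardless of partitioning, at the cost of an extra Spark job when the RDD has more than one partition.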
It depends on what you want to do.
If you want to merge two DataFrames, you should use a join. The same join types are available as in relational algebra (or any DBMS).
You are saying that your DataFrames have just one column each.
In that case you might want to do a cross join (Cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all elements.
I think the uses of all other join types can be covered by specialized operations in Spark (in this case, two DataFrames with one column each).

Compare dates in dataframes

I have two dataframes in Scala:
df1 =
ID start_date_time
1 2016-10-12 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
and
df2 =
PK start_date
1 2016-10-12
2 2016-10-14
I need to add a new column to df1 that will have value 0 if the following condition fails, otherwise -> 1:
If ID == PK and start_date_time refers to the same year, month and day as start_date.
The result should be this one:
df1 =
ID start_date_time check
1 2016-10-12-11-55-23 1
2 2016-10-12-12-25-00 0
3 2016-10-12-16-20-00 0
How can I do it?
I assume that the logic should be something like this:
df1 = df.withColumn("check", define(df("ID"),df("start_date")))
val define = udf {(id: String,dateString:String) =>
val formatter = new SimpleDateFormat("yyyy-MM-dd")
val date = formatter.format(dateString)
val checks = df2.filter(df2("PK")===ID).filter(df2("start_date_time")===date)
if(checks.collect().length>0) "1" else "0"
}
However, I have doubts regarding how to compare dates, because df1 and df2 have differently formatted dates. How to better implement it?
You can use Spark's datetime functions to create a date column on both df1 and df2, and then left join df1 with df2. An extra constant column check on df2 indicates whether there is a match in the result:
import org.apache.spark.sql.functions.{lit, to_date}
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
+---+--------------------+-----+
| ID| start_date_time|check|
+---+--------------------+-----+
| 1|2016-10-12 11:55:...| 1|
| 2|2016-10-12 12:25:...| 0|
| 3|2016-10-12 16:20:...| 0|
+---+--------------------+-----+
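I don't have the exact logic, but I would do something like this:
val df3 = df2.
join(df1, df1("ID") === df2("PK")).
filter($"start_date_time".cast("timestamp") < $"start_date".cast("timestamp"))
Spark columns have no isBefore method (that is a Joda-Time method), so the columns are compared directly after casting. You only need to convert the two timestamps to Joda-Time if you do the comparison in your own code, using this: Converting a date string to a DateTime object using Joda Time library
Good luck!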