I have the following two dataframes
df1
+---+------+-----+
|id |amount|fee  |
+---+------+-----+
|1  |10.00 |5.0  |
|3  |90    |130.0|
+---+------+-----+
df2
+----+--------+-----+
|exId|exAmount|exFee|
+----+--------+-----+
|1   |10.00   |5.0  |
|1   |10.0    |5.0  |
|3   |90.0    |130.0|
+----+--------+-----+
I am joining them on all three columns and trying to identify the rows that are common between the two dataframes and the ones that are not.
I'm looking for this output:
+----+------+-----+----+--------+-----+
|id  |amount|fee  |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|1   |10.00 |5.0  |1   |10.0    |5.0  |
|null|null  |null |1   |10.0    |5.0  |
|3   |90    |130.0|3   |90.0    |130.0|
+----+------+-----+----+--------+-----+
Basically, I want the duplicate row in df2 with exId 1 to be listed separately.
Any thoughts?
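For anyone who wants to reproduce this, here is a minimal sketch of the two example dataframes (the column types are an assumption; Double is used for the amounts and fees, so 10.00 and 10.0 are genuinely duplicate values):
// Assumed setup in spark-shell (spark.implicits._ gives us toDF)
import spark.implicits._

val df1 = Seq((1, 10.00, 5.0), (3, 90.0, 130.0)).toDF("id", "amount", "fee")
val df2 = Seq((1, 10.00, 5.0), (1, 10.0, 5.0), (3, 90.0, 130.0)).toDF("exId", "exAmount", "exFee")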
One possible way is to generate row numbers within each group of the three columns in each dataframe, and then use that extra column together with the original three while joining. That should give you what you want.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// Row numbers within each (id, amount, fee) group, so duplicate rows get distinct keys
def windowSpec1 = Window.partitionBy("id", "amount", "fee").orderBy("fee")
def windowSpec2 = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")

df1.withColumn("sno", row_number().over(windowSpec1))
  .join(
    df2.withColumn("exSno", row_number().over(windowSpec2)),
    col("id") === col("exId") && col("amount") === col("exAmount") &&
      col("fee") === col("exFee") && col("sno") === col("exSno"),
    "outer")
  .drop("sno", "exSno")
  .show(false)
and you should get
+----+------+-----+----+--------+-----+
|id  |amount|fee  |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|null|null  |null |1   |10.0    |5.0  |
|3   |90    |130.0|3   |90.0    |130.0|
|1   |10.00 |5.0  |1   |10.00   |5.0  |
+----+------+-----+----+--------+-----+
I hope the answer is helpful
Related
I have a data frame as shown below. There are more than 100 signals, so there will be more than 100 columns in the data frame.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |3 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |2 |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |1 |3 |1 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |3 |3 |null |
|052|2021-03-06 |2 |null |2 |
|052|2021-03-16 |3 |5 |5 |.......
+---+------------+--------+--------+--------+
I have to find the cummax of each signal, compare it with the respective signal column, and discard (null out) the signal values that are lower than the cummax, along with the null values.
Step 1: find the cumulative max for each signal with respect to the id column.
Step 2: delete the records whose value is lower than the cummax for each signal.
Step 3: take the count of records whose signal value is less than the cummax (nulls excluded), for each signal with respect to id.
After the count, the final output should be as shown below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|.....
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 | 3 | 1 |
|050|2021-01-15 |null | null | 2 |
|050|2021-02-02 |2 | 3 | 3 |
|
|051|2021-01-14 |1 | 3 | 0 |
|051|2021-01-15 |2 | null | null |
|051|2021-02-02 |3 | 3 | 2 |
|051|2021-02-03 |null | 3 | null |
|
|052|2021-03-03 |1 | 3 | 0 |
|052|2021-03-05 |3 | 3 | null |
|052|2021-03-06 |null | null | 2 |
|052|2021-03-16 |3 | 5 | 5 | ......
+---+------------+--------+--------+--------+
I have tried using a window function as below, and it worked for almost all records.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer

val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column](col("id"), col("date"))
for (column <- signalColumns) {
  // keep the original signal column and add its running max per id (max ignores nulls)
  signalList01 += col(column)
  signalList01 += max(column).over(w).alias(column + "_cummax")
}
val cumMaxDf = df.select(signalList01: _*)
But I am getting wrong values in the cummax columns for a few records.
Does anyone have an idea why these error records show up in the cummax columns? Any leads appreciated!
Just giving out hints here (as you suggested) to help you unblock the situation, but -- warning -- I haven't tested the code!
The code you provided in the comments looks good; it will get you your cummax column:
val nw_df = original_df.withColumn("signal01_cummax", max(col("signal01")).over(windowCodedSO))
Now, you need to be able to compare the two values in "signal01" and "signal01_cummax". A function like this, maybe:
// java.lang.Integer is used instead of Int so that nulls can be represented
def takeOutRecordsLessThanCummax(signal: java.lang.Integer, signal_cummax: java.lang.Integer): java.lang.Integer =
  if (signal == null || signal < signal_cummax) null
  else signal_cummax
Since we'll be applying it to columns, we'll wrap it up in a UDF:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val takeOutRecordsLessThanCummaxUDF: UserDefinedFunction = udf {
  (i: java.lang.Integer, j: java.lang.Integer) => takeOutRecordsLessThanCummax(i, j)
}
And then you can combine everything above so it applies to your original dataframe. Something like this could work:
val signal_cummax_suffix = "_cummax"
// fold over the signal columns only (id and date should not get a cummax)
val result = signalColumns.foldLeft(original_df)(
  (dfac, colname) => dfac
    .withColumn(colname.concat(signal_cummax_suffix),
      max(col(colname)).over(windowCodedSO))
    .withColumn(colname.concat("_output"),
      takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)
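As a side note (equally untested): the same null-or-lower check can also be done with the built-in when/otherwise instead of a UDF. A minimal sketch, reusing signalColumns from the question and original_df / windowCodedSO from above, and overwriting each signal column in place as in your expected output:
import org.apache.spark.sql.functions.{col, max, when}

// Add the running max, null out values below it, then drop the helper column again.
// when() without otherwise() yields null for nulls and for values below the cummax.
val resultNoUdf = signalColumns.foldLeft(original_df) { (dfac, colname) =>
  dfac
    .withColumn(colname + "_cummax", max(col(colname)).over(windowCodedSO))
    .withColumn(colname, when(col(colname) >= col(colname + "_cummax"), col(colname)))
    .drop(colname + "_cummax")
}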
I have different dataframes and I want to select the max common Date across these DFs. For example, I have the following dataframes:
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2017-11-19 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
|Date | value |
+--------------+-------+
|2015-12-14 |5 |
|2012-12-21 |1 |
|2016-09-02 |1 |
|2015-12-14 |3 |
|2015-12-14 |1 |
+--------------+-------+
The selected date would be 2016-09-02 because it is the max date that exists in all three DFs (the date 2017-11-19 is not in the third DF).
I am trying to do it with agg(max), but this way I only get the highest date of a single DataFrame:
df1.select("Date").groupBy("Date").agg(max("Date"))
Thanks in advance!
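For reference, a minimal sketch to reproduce the three dataframes used in the answer below (df1, df2 and df3 are assumed names; Date is kept as a string here, which compares correctly for ISO-formatted dates):
import spark.implicits._

// Assumed setup in spark-shell; Date kept as strings for brevity
val df1 = Seq(("2015-12-14", 5), ("2017-11-19", 1), ("2016-09-02", 1), ("2015-12-14", 3), ("2015-12-14", 1)).toDF("Date", "value")
val df2 = Seq(("2015-12-14", 5), ("2017-11-19", 1), ("2016-09-02", 1), ("2015-12-14", 3), ("2015-12-14", 1)).toDF("Date", "value")
val df3 = Seq(("2015-12-14", 5), ("2012-12-21", 1), ("2016-09-02", 1), ("2015-12-14", 3), ("2015-12-14", 1)).toDF("Date", "value")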
You can do semi joins to get the common dates, and aggregate the maximum date. No need to group by date because you want to get its maximum.
val result = df1.join(df2, Seq("Date"), "left_semi").join(df3, Seq("Date"), "left_semi").agg(max("Date"))
You can also use intersect:
val result = df1.select("Date").intersect(df2.select("Date")).intersect(df3.select("Date")).agg(max("Date"))
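If you need the date as a plain value rather than a one-row dataframe, you can collect the single aggregated row (the exact type depends on whether Date is a string or a date column):
// result has exactly one row and one column, max(Date)
val maxCommonDate = result.first().get(0)   // e.g. 2016-09-02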
Having the following DataFrame:
+--------+----------+------------+
|user_id |level     |new_columns |
+--------+----------+------------+
|4       |B         |null        |
|6       |B         |null        |
|5       |A         |col1        |
|3       |B         |col2        |
|5       |A         |col2        |
|2       |A         |null        |
|1       |A         |col3        |
+--------+----------+------------+
I need to convert each non-null value of the new_columns column into a new column; this should be done based on an aggregation on the user_id column. The desired output would be
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+
As you can see, the values of the new columns come from the level column in the base DF. I know how to use the withColumn method to add new columns to a DF, but here the critical part is how to add new columns to the aggregated DF (for the case of user_id = 5).
Every hint based on the DataFrame API would be appreciated.
You can do a pivot:
val df2 = df.groupBy("user_id")
  .pivot("new_columns")
  .agg(first("level"))
  .drop("null")
df2.show
+--------+------+------+------+
|user_id | col1 | col2 | col3 |
+--------+------+------+------+
|4       | null | null | null |
|6       | null | null | null |
|5       | A    | A    | null |
|3       | null | B    | null |
|2       | null | null | null |
|1       | null | null | A    |
+--------+------+------+------+
You can collect the non-null values from new_columns first before doing the pivot:
val nonNull = df.select("new_columns").filter("new_columns is not null").distinct().as[String].collect
val df1 = df.groupBy("user_id")
.pivot("new_columns", nonNull)
.agg(first("level"))
df1.show
//+-------+----+----+----+
//|user_id|col3|col1|col2|
//+-------+----+----+----+
//| 1| A|null|null|
//| 6|null|null|null|
//| 3|null|null| B|
//| 5|null| A| A|
//| 4|null|null|null|
//| 2|null|null|null|
//+-------+----+----+----+
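A note on this approach: passing the value list to pivot also means Spark does not have to run an extra pass over the data to discover the distinct pivot values, and since nulls are excluded from the list there is no "null" column to drop afterwards. If the column names are known up front, they can even be hard-coded; a small illustrative sketch (df1Explicit is just a made-up name):
import org.apache.spark.sql.functions.first

// Same pivot with hard-coded values, skipping the distinct() collection step
val df1Explicit = df.groupBy("user_id")
  .pivot("new_columns", Seq("col1", "col2", "col3"))
  .agg(first("level"))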
I have a DataFrame that has a list of countries and the corresponding data. However, the countries are given as either iso3 or iso2 codes.
dfJSON
  // filter before the select so that value.country is still resolvable
  .filter(size($"value.country") > 0)
  .select($"value.country".as("country"))
  .groupBy($"country")
  .agg(count("*").as("cnt"))
Now this country field can have USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 and iso_3 columns.
Join your actual DataFrame with this lookup DataFrame and apply your logic to that data.
Check the code below for a sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
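One optional tweak, since the country lookup table is tiny: broadcasting it lets Spark use a broadcast join instead of shuffling both sides. A sketch on top of the join above:
import org.apache.spark.sql.functions.broadcast

df.join(broadcast(countryDF),
    df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3"),
    "left")
  .select(df("country"), countryDF("country_name"))
  .show(false)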
I have a very specific requirement for outlier treatment in a Spark DataFrame (Scala).
I want to treat just the first outlier and make it equal to the second group.
Input:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | ds |
|A |r1 | s |
|A |r1 | f |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
Now, per market and responseVariable, I want to treat just the first outlier.
Group per market and responseVariable:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 5 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to treat the outlier for the group market=A and responseVariable=r1 in the actual dataset. I want to randomly remove records from group 1 so that its count becomes equal to group 2.
Expected output:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | s |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
Group per market and responseVariable after treatment:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 3 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to repeat this for multiple markets.
You will have to know the names and counts of the first and second groups, which can be done as below:
import org.apache.spark.sql.functions._

val first_two_values = df.groupBy("market", "responseVariable").agg(count("blabla").as("count"))
  .orderBy(col("count").desc).take(2)
  .map(row => (row.getAs[String]("responseVariable"), row.getAs[Long]("count"))).toList

val rowsToFilter = first_two_values(0)._1   // responseVariable of the largest group
val countsToFilter = first_two_values(1)._2 // count of the second-largest group
After you know the first two groups, you need to filter out the extra rows from the first group, which can be done by generating a row_number and filtering out the extra rows as below:
import org.apache.spark.sql.expressions._

def windowSpec = Window.partitionBy("market", "responseVariable").orderBy("blabla")

df.withColumn("rank", row_number().over(windowSpec))
  .withColumn("rank", when(col("rank") > countsToFilter && col("responseVariable") === rowsToFilter, false).otherwise(true))
  .filter(col("rank"))
  .drop("rank")
  .show(false)
This should fulfil your requirement.
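As a quick sanity check, assuming you assign the filtered dataframe to a val (say filteredDf, an illustrative name) instead of ending the chain with show, the group counts should now match the expected output, with r1 reduced to the second group's count:
// filteredDf is assumed to hold the result of the filter/drop chain above
filteredDf.groupBy("market", "responseVariable")
  .agg(count("blabla").as("count"))
  .orderBy(col("count").desc)
  .show(false)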