very specific requirement for outlier treatment in Spark Dataframe - scala

I have a very specific requirement for outlier treatment in a Spark DataFrame (Scala).
I want to treat only the first (largest) outlier group and make its size equal to that of the second group.
Input:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | ds |
|A |r1 | s |
|A |r1 | f |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
Per market and responseVariable, I want to treat only the first outlier.
Counts per market and responseVariable:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 5 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to treat the outlier group market=A, responseVariable=r1 in the actual dataset: randomly remove records from group 1 so that its count equals that of group 2.
Expected output:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | s |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
group:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 3 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to repeat this for multiple markets.

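You first need the names and counts of the two largest groups, which can be found as below
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)
val first_two_values = df.groupBy("market", "responseVariable").agg(count("blabla").as("count"))
  .orderBy($"count".desc).take(2) // the two largest groups
  .map(row => (row(1) -> row(2))).toList // (responseVariable, count) pairs
val rowsToFilter = first_two_values(0)._1 // responseVariable of the largest group
val countsToFilter = first_two_values(1)._2 // count of the second-largest group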
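Once you know the first two groups, number the rows within each group with row_number and filter out the extra rows of the first group as below
import org.apache.spark.sql.expressions._
// orderBy("blabla") makes the removal deterministic; orderBy(rand()) would drop random rows instead
val windowSpec = Window.partitionBy("market","responseVariable").orderBy("blabla")
df.withColumn("rank", row_number().over(windowSpec))
  // false only for rows of the largest group beyond the second group's count
  .withColumn("rank", when(col("rank") > countsToFilter && col("responseVariable") === rowsToFilter, false).otherwise(true))
  .filter(col("rank"))
  .drop("rank")
  .show(false)
This trims the first group down to the size of the second. Note that take(2) picks the two largest groups across the whole dataframe, so with several markets the steps have to be repeated per market.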
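For the "repeat this for multiple markets" part, the same idea can probably be expressed with window functions only, so nothing has to be collected per market. This is just a sketch under assumptions: the rows for market B are invented, the helper columns groupCount, groupRank, secondCount, cap and rn are names made up for illustration, and ties between equally large groups are broken by responseVariable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("outlier-sketch").getOrCreate()
import spark.implicits._

// Market A is the data from the question; market B is invented for illustration.
val input = Seq(
  ("A", "r1", "da"), ("A", "r1", "ds"), ("A", "r1", "s"), ("A", "r1", "f"), ("A", "r1", "v"),
  ("A", "r2", "s"), ("A", "r2", "s"), ("A", "r2", "c"),
  ("A", "r3", "s"), ("A", "r3", "s"),
  ("A", "r4", "s"), ("A", "r5", "c"), ("A", "r6", "s"), ("A", "r7", "s"), ("A", "r8", "s"),
  ("B", "r1", "x"), ("B", "r1", "y"), ("B", "r1", "z"), ("B", "r1", "w"),
  ("B", "r2", "x"), ("B", "r2", "y")
).toDF("market", "responseVariable", "blabla")

// 1. Size of every (market, responseVariable) group.
val groupCounts = input.groupBy("market", "responseVariable").agg(count(lit(1)).as("groupCount"))

// 2. Rank the groups inside each market by size and look up the size of the next group.
val byMarket = Window.partitionBy("market").orderBy(col("groupCount").desc, col("responseVariable"))
val caps = groupCounts
  .withColumn("groupRank", row_number().over(byMarket))
  .withColumn("secondCount", lead("groupCount", 1).over(byMarket))
  // Only the largest group is capped; a market with a single group keeps all its rows.
  .withColumn("cap",
    when(col("groupRank") === 1, coalesce(col("secondCount"), col("groupCount")))
      .otherwise(col("groupCount")))
  .select("market", "responseVariable", "cap")

// 3. Number the rows of each group in random order and keep at most `cap` of them.
val randomOrder = Window.partitionBy("market", "responseVariable").orderBy(rand())
val treated = input
  .withColumn("rn", row_number().over(randomOrder))
  .join(caps, Seq("market", "responseVariable"))
  .filter(col("rn") <= col("cap"))
  .drop("rn", "cap")

treated.groupBy("market", "responseVariable").count().orderBy("market", "responseVariable").show(false)
```

Which rows of the largest group survive differs from run to run because of rand(); only the group sizes are stable.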

Related

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now; suppose there are 4 days in total {1,2,3,4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
Is there a way to get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
import org.apache.spark.sql.functions._
val data = df1
  // first group by key and build the sequence 1 to 4 (your number of days)
  .groupBy("key")
  .agg(sequence(lit(1), lit(4)).as("Time"))
  // explode the sequence, creating every 'Time' for each 'key'
  .withColumn("Time", explode(col("Time")))
  // finally, left join with the main table on 'key' and 'Time'
  .join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!
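If sequence is not available (as far as I know it was added in Spark 2.4), a cross join of the distinct keys with a small dataframe of days should give the same result. A minimal sketch, assuming the days are exactly 1 to 4 as in the question; allTimes and filled are names chosen here for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("fill-days-sketch").getOrCreate()
import spark.implicits._

// The main table from the question.
val df1 = Seq((1, 1, 1), (1, 2, 2), (1, 4, 3), (2, 2, 4), (2, 3, 5)).toDF("key", "Time", "Value")

// All days 1..4 as a one-column dataframe (the end of range is exclusive).
val allTimes = spark.range(1, 5).select(col("id").cast("int").as("Time"))

// Every (key, Time) combination, then a left join to pull in the known values.
val filled = df1.select("key").distinct()
  .crossJoin(allTimes)
  .join(df1, Seq("key", "Time"), "left")
  .orderBy("key", "Time")

filled.show(false)
```

The explicit orderBy is only there to make the printed order stable; the join itself does not guarantee any row order.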

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However the countries are either iso3 or iso2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have either USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 and iso_3 columns, join your actual DataFrame with it, and apply your logic to the joined data.
See the sample code below.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
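The answer does not show how countryDF is built or what happens to codes that match nothing. A small sketch of both, under assumptions: the two-row lookup table, the unmatched code XX and the "Unknown" label are all made up for illustration, and the broadcast hint is just a guess that the lookup table is small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("country-sketch").getOrCreate()
import spark.implicits._

// Hypothetical lookup table; a real one would hold every country.
val countries = Seq(
  ("United States of America", "US", "USA"),
  ("India", "IN", "IND")
).toDF("country_name", "iso_2", "iso_3")

// Mixed iso2/iso3 codes plus one code ("XX") that is in no lookup row.
val codes = Seq("USA", "US", "IN", "IND", "XX").toDF("country")

// A code matches a lookup row if it equals either the iso2 or the iso3 value.
val matchesCode = codes("country") === countries("iso_2") || codes("country") === countries("iso_3")

val countsByName = codes
  .join(broadcast(countries), matchesCode, "left")
  // The left join leaves country_name null for unmatched codes; label them explicitly.
  .withColumn("country_name", coalesce(col("country_name"), lit("Unknown")))
  .groupBy("country_name")
  .agg(count(lit(1)).as("cnt"))

countsByName.show(false)
```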

Converting dataframe column into onehotencoder like columns

I am trying to convert a specific column into one-hot-encoded columns. For example:
-------------
Content|type|
-------------
alpha | A |
beta | B |
gamma | C |
theta | A |
zeta | C |
neta | B |
-------------
And what I am trying to get is the following:
----------------------------
Content|type_A|type_B|type_C|
----------------------------
alpha | 1 | 0 | 0 |
beta | 0 | 1 | 0 |
gamma | 0 | 0 | 1 |
theta | 1 | 0 | 0 |
zeta | 0 | 0 | 1 |
neta | 0 | 1 | 0 |
-----------------------------
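I think pivot is what you are looking for
import org.apache.spark.sql.functions.count
import spark.implicits._ // for toDF (already in scope in spark-shell)
val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")
// one column per distinct type; the count is 1 where the type matches, missing combinations are filled with 0
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)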
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
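The pivoted columns come out as A, B, C rather than the type_A, type_B, type_C asked for. Renaming them afterwards could look like the sketch below; it assumes each Content value appears only once (otherwise the counts can exceed 1), and the names types, pivoted and oneHot are chosen here for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("one-hot-sketch").getOrCreate()
import spark.implicits._

val types = Seq(
  ("alpha", "A"), ("beta", "B"), ("gamma", "C"),
  ("theta", "A"), ("zeta", "C"), ("neta", "B")
).toDF("Content", "type")

// Same pivot as in the answer: one count column per distinct type.
val pivoted = types.groupBy("Content").pivot("type").agg(count("type")).na.fill(0)

// Prefix every pivoted column (everything except Content) with "type_".
val oneHot = pivoted.columns
  .filter(_ != "Content")
  .foldLeft(pivoted)((acc, c) => acc.withColumnRenamed(c, s"type_$c"))

oneHot.show(false)
```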

joining two dataframes having duplicate row

I have the following two dataframes
df1
+--------+-----------------------------
|id | amount | fee |
|1 | 10.00 | 5.0 |
|3 | 90 | 130.0 |
df2
+--------+--------------------------------
|exId | exAmount | exFee |
|1 | 10.00 | 5.0 |
|1 | 10.0 | 5.0 |
|3 | 90.0 | 130.0 |
I am joining them on all three columns and trying to identify the rows that are common to both dataframes and the ones that are not.
I'm looking for output:
+--------+--------------------------------------------
|id | amount | fee |exId | exAmount | exFee |
|1 | 10.00 | 5.0 |1 | 10.0 | 5.0 |
|null| null | null |1 | 10.0 | 5.0 |
|3 | 90 | 130.0|3 | 90.0 | 130.0 |
Basically I want the duplicate row in df2 with exId 1 to be listed separately.
Any thoughts?
One possible way is to partition by all three columns, generate a row number within each partition for each dataframe, and join on that extra column in addition to the other three. You should get what you desire.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
// number identical rows within each dataframe, so duplicates get 1, 2, ...
val windowSpec1 = Window.partitionBy("id", "amount", "fee").orderBy("fee")
val windowSpec2 = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")
df1.withColumn("sno", row_number().over(windowSpec1)).join(
    df2.withColumn("exSno", row_number().over(windowSpec2)),
    col("id") === col("exId") && col("amount") === col("exAmount") && col("fee") === col("exFee") && col("sno") === col("exSno"), "outer")
  .drop("sno", "exSno")
  .show(false)
and you should be getting
+----+------+-----+----+--------+-----+
|id |amount|fee |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|null|null |null |1 |10.0 |5.0 |
|3 |90 |130.0|3 |90 |130.0|
|1 |10.00 |5.0 |1 |10.00 |5.0 |
+----+------+-----+----+--------+-----+
I hope the answer is helpful
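Since the question also asks which rows are common and which are not, the joined result could be labelled explicitly. The sketch below repeats the row-number join on sample data typed as doubles (an assumption; the real column types are not shown in the question) and adds a status column whose name and values are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("dup-join-sketch").getOrCreate()
import spark.implicits._

val left = Seq((1, 10.0, 5.0), (3, 90.0, 130.0)).toDF("id", "amount", "fee")
val right = Seq((1, 10.0, 5.0), (1, 10.0, 5.0), (3, 90.0, 130.0)).toDF("exId", "exAmount", "exFee")

// Number identical rows inside each dataframe so that the n-th duplicate only matches the n-th duplicate.
val leftWindow = Window.partitionBy("id", "amount", "fee").orderBy("fee")
val rightWindow = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")

val joinCondition = col("id") === col("exId") && col("amount") === col("exAmount") &&
  col("fee") === col("exFee") && col("sno") === col("exSno")

val labelled = left.withColumn("sno", row_number().over(leftWindow))
  .join(right.withColumn("exSno", row_number().over(rightWindow)), joinCondition, "outer")
  // After a full outer join, a null id means the row exists only on the right side, and vice versa.
  .withColumn("status",
    when(col("id").isNull, "only_in_df2")
      .when(col("exId").isNull, "only_in_df1")
      .otherwise("both"))
  .drop("sno", "exSno")

labelled.show(false)
```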

GroupBy based on conditions in Spark dataframe

I have two dataframes.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is actual dataframe where I need to apply groupBy operation
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the Key and Value columns.
For each key in dataframe1, I need to take the respective value and use it as the groupBy column(s) on dataframe2:
Dframe= df.groupBy($"key").sum("amount").show()
Expected output: generate three dataframes, one per key in dataframe1
d1= df.groupBy($"key1").sum("amount").show()
which has to become: df.groupBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2=df.groupBy($"key2").sum("amount").show()
result: df.groupBy($"column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code :
d3=df.groupBy($"key3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In the future, if I add more keys, it has to produce the corresponding dataframes as well. Can someone help me?
Given the key/value dataframe as (I suggest not building a dataframe for this from the source data; the reason is given below)
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
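Because the key/value pairs are in a dataframe, they first have to be collected to the driver as an array of pairs (this is why I suggest not forming a dataframe for them in the first place)
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
And then loop over the pairs as
import org.apache.spark.sql.functions._
for (kv <- maps) {
  // kv._2 is the comma-separated column list, e.g. "Column1,Column3"
  df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
  // you can store the results in separate dataframes or write them to files or a database
}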
You should have the following outputs
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
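To keep the results instead of only printing them, and to follow the suggestion of not holding the key/value pairs in a dataframe, the groupings could live in a plain Scala Map. A minimal sketch, assuming the names groupings, amounts, results and the output column total, which are all made up here:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// In spark-shell `spark` already exists; getOrCreate() then simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("dynamic-groupby-sketch").getOrCreate()
import spark.implicits._

val amounts = Seq(
  ("A", "A1", "XYZ", 100), ("A", "A1", "XYZ", 100), ("A", "A2", "XYZ", 10), ("A", "A3", "PQR", 100),
  ("B", "B1", "XYZ", 200), ("B", "B2", "PQR", 280), ("B", "B3", "XYZ", 20)
).toDF("Column1", "Column2", "Column3", "Amount")

// Key -> comma-separated grouping columns, kept as a local Map instead of a dataframe.
val groupings = Map("key1" -> "Column1", "key2" -> "Column2", "key3" -> "Column1,Column3")

// One aggregated dataframe per key; adding a key to `groupings` adds a result.
val results: Map[String, DataFrame] = groupings.map { case (key, columnList) =>
  val groupCols = columnList.split(",").map(name => col(name.trim))
  key -> amounts.groupBy(groupCols: _*).agg(sum("Amount").as("total"))
}

results("key3").show(false)
```

The dataframes in results are evaluated lazily, so nothing is computed for a key until its dataframe is shown, written or collected.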