I'm using spark-sql-2.4.1v, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data.
My dataframe df:
+----+---------+-------------+----------+-----------+--------+
| id| date| revenue|con_dist_1| con_dist_2| state |
+----+---------+-------------+----------+-----------+--------+
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX |
| 10|1/15/2018| 0.010680705| 6|0.019875458| TX |
| 10|1/15/2018| 0.006628853| 4|0.816039063| AZ |
+----+---------+-------------+----------+-----------+--------+
How do I find the quantiles of the columns "con_dist_1" and "con_dist_2" for each state?
A possible solution could be:
scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id| date| revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX|
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
+---+---------+-----------+----------+-----------+-----+
scala> val df1 = input.groupBy("state").agg(collect_list("con_dist_1").as("combined_1"), collect_list("con_dist_2").as("combined_2"))
df1: org.apache.spark.sql.DataFrame = [state: string, combined_1: array<int> ... 1 more field]
scala> df1.show
+-----+----------+--------------------+
|state|combined_1| combined_2|
+-----+----------+--------------------+
| AZ| [4, 4]|[0.816039063, 0.8...|
| TX| [6, 4, 6]|[0.019875458, 0.0...|
+-----+----------+--------------------+
scala> df1.
| withColumn("comb1_Q1", sort_array($"combined_1")(((size($"combined_1")-1)*0.25).cast("int"))).
| withColumn("comb1_Q2", sort_array($"combined_1")(((size($"combined_1")-1)*0.5).cast("int"))).
| withColumn("comb1_Q3", sort_array($"combined_1")(((size($"combined_1")-1)*0.75).cast("int"))).
| withColumn("comb_2_Q1", sort_array($"combined_2")(((size($"combined_2")-1)*0.25).cast("int"))).
| withColumn("comb_2_Q2", sort_array($"combined_2")(((size($"combined_2")-1)*0.5).cast("int"))).
| withColumn("comb_2_Q3", sort_array($"combined_2")(((size($"combined_2")-1)*0.75).cast("int"))).
| show
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
|state|combined_1| combined_2|comb1_Q1|comb1_Q2|comb1_Q3| comb_2_Q1| comb_2_Q2| comb_2_Q3|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
| AZ| [4, 4]|[0.816039063, 0.8...| 4| 4| 4|0.816039063|0.816039063|0.816039063|
| TX| [6, 4, 6]|[0.019875458, 0.0...| 4| 6| 6|0.019875458|0.019875458|0.019875458|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
EDIT
I don't think we can achieve this using the approxQuantile method, since you want the quantiles per state: that would require grouping by the state column and aggregating the con_dist columns, and approxQuantile expects a whole column of integers or floats, not columns of array type.
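If you really want approxQuantile itself, one workaround is to loop over the distinct states and run it on each filtered subset. This is only a sketch, assuming the same input DataFrame as above; it launches one Spark job per state, so it is only reasonable for a small number of states:
import org.apache.spark.sql.functions.col

val states = input.select("state").distinct.collect.map(_.getString(0))
val quantilesByState = states.map { s =>
  val sub = input.filter(col("state") === s)
  s -> sub.stat.approxQuantile(Array("con_dist_1", "con_dist_2"), Array(0.25, 0.5, 0.75), 0.0)
}.toMap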
Another solution is to use Spark SQL, as shown below:
scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id| date| revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
| 10|1/15/2018| 0.01378215| 4|0.082049528| TX|
| 10|1/15/2018|0.010680705| 6|0.019875458| TX|
| 10|1/15/2018|0.006628853| 4|0.816039063| AZ|
+---+---------+-----------+----------+-----------+-----+
scala> input.createOrReplaceTempView("input")
scala> :paste
// Entering paste mode (ctrl-D to finish)
val query = "select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, " +
"percentile_approx(con_dist_1,0.5) as col1_quantile_2," +
"percentile_approx(con_dist_1,0.75) as col1_quantile_3, " +
"percentile_approx(con_dist_2,0.25) as col2_quantile_1,"+
"percentile_approx(con_dist_2,0.5) as col2_quantile_2," +
"percentile_approx(con_dist_2,0.75) as col2_quantile_3 " +
"from input group by state"
// Exiting paste mode, now interpreting.
query: String = select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, percentile_approx(con_dist_1,0.5) as col1_quantile_2,percentile_approx(con_dist_1,0.75) as col1_quantile_3, percentile_approx(con_dist_2,0.25) as col2_quantile_1,percentile_approx(con_dist_2,0.5) as col2_quantile_2,percentile_approx(con_dist_2,0.75) as col2_quantile_3 from input group by state
scala> val df2 = spark.sql(query)
df2: org.apache.spark.sql.DataFrame = [state: string, col1_quantile_1: int ... 5 more fields]
scala> df2.show
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|state|col1_quantile_1|col1_quantile_2|col1_quantile_3|col2_quantile_1|col2_quantile_2|col2_quantile_3|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
| AZ| 4| 4| 4| 0.816039063| 0.816039063| 0.816039063|
| TX| 4| 6| 6| 0.019875458| 0.019875458| 0.082049528|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
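Note that percentile_approx also accepts an array of percentages and then returns an array per group, so each con_dist column only needs one aggregate expression. A sketch against the same "input" view, kept in the same string-building style:
val query2 = "select state, " +
  "percentile_approx(con_dist_1, array(0.25, 0.5, 0.75)) as col1_quantiles, " +
  "percentile_approx(con_dist_2, array(0.25, 0.5, 0.75)) as col2_quantiles " +
  "from input group by state"
spark.sql(query2).show(false)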
Let me know if this helps!
I'm using spark-sql-2.4.1v, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data.
Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results?
My dataframe df:
+----+---------+-------------+----------+-----------+
| id| date| revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
| 10|1/15/2018| 0.01378215| 4|0.082049528|
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
+----+---------+-------------+----------+-----------+
I need to get the expected output/result as below:
+----+---------+-------------+-------------+------------+-------------+
| id| date| revenue| perctile_col| quantile_0 |quantile_10 |
+----+---------+-------------+-------------+------------+-------------+
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
+----+---------+-------------+-------------+------------+-------------+
I have already calculated the quantiles like this but need to add them to the output dataframe:
val col_list = Array("con_dist_1","con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0,0.1,0.5),0.0)
val percentile_0 = 0;
val percentile_10 = 1;
val Q0 = quantiles(col_list.indexOf("con_dist_1"))(percentile_0)
val Q10 =quantiles(col_list.indexOf("con_dist_1"))(percentile_10)
How do I get the expected output shown above?
An easy solution would be to create multiple dataframes, one for each "con_dist" column, and then merge them with union. This can easily be done by mapping over col_list as follows:
import org.apache.spark.sql.functions.lit

val col_list = Array("con_dist_1", "con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
val percentile_0 = 0  // index of the 0.0 quantile within each result array
val percentile_10 = 1 // index of the 0.1 quantile
val df2 = df.drop(col_list: _*) // we don't need these columns anymore
val result = col_list
.zipWithIndex
.map{case (col, colIndex) =>
val Q0 = quantiles(colIndex)(percentile_0)
val Q10 = quantiles(colIndex)(percentile_10)
df2.withColumn("perctile_col", lit(col))
.withColumn("quantile_0", lit(Q0))
.withColumn("quantile_10", lit(Q10))
}.reduce(_.union(_))
The final dataframe will then be:
+---+---------+-----------+------------+-----------+-----------+
| id| date| revenue|perctile_col| quantile_0|quantile_10|
+---+---------+-----------+------------+-----------+-----------+
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018| 0.01378215| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018| 0.01378215| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
+---+---------+-----------+------------+-----------+-----------+
I am trying to build a dataframe of 10k records to then save to a parquet file on Spark 2.4.3 standalone.
The following works at small scale, up to 1,000 records, but takes forever when ramping up to 10k:
scala> import spark.implicits._
import spark.implicits._
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> for ( i <- 1 to 1000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
scala> someDF.show
+---+------+
| x| y|
+---+------+
| 0| item0|
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
+---+------+
only showing top 20 rows
scala> var someDF = Seq((0, "item0")).toDF("x", "y")
someDF: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> someDF.show
+---+-----+
| x| y|
+---+-----+
| 0|item0|
+---+-----+
scala> for ( i <- 1 to 10000 ) {someDF = someDF.union(Seq((i,"item"+i)).toDF("x", "y")) }
I just want to save someDF to a parquet file and then load it into Impala.
//declare Range that you want
scala> val r = 1 to 10000
//create DataFrame with range
scala> val df = sc.parallelize(r).toDF("x")
//Add new column "y"
scala> val final_df = df.select(col("x"),concat(lit("item"),col("x")).alias("y"))
scala> final_df.show
+---+------+
| x| y|
+---+------+
| 1| item1|
| 2| item2|
| 3| item3|
| 4| item4|
| 5| item5|
| 6| item6|
| 7| item7|
| 8| item8|
| 9| item9|
| 10|item10|
| 11|item11|
| 12|item12|
| 13|item13|
| 14|item14|
| 15|item15|
| 16|item16|
| 17|item17|
| 18|item18|
| 19|item19|
| 20|item20|
+---+------+
scala> final_df.count
res17: Long = 10000
//Write final_df to path in parquet format
scala> final_df.write.format("parquet").save(<path to write>)
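For what it's worth, the loop in the question is slow because each union adds another branch to the query plan, so after 10k iterations Spark has to analyze a plan with 10k unions. If you prefer to stay in the DataFrame API instead of going through sc.parallelize, a similar sketch with spark.range (the output path is just a placeholder):
import org.apache.spark.sql.functions.{col, concat, lit}

val final_df2 = spark.range(1, 10001)
  .select(col("id").cast("int").as("x"), concat(lit("item"), col("id").cast("string")).as("y"))
final_df2.write.mode("overwrite").parquet("/tmp/items_parquet")  // placeholder path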
I have data in two text files:
file 1: (patient id, diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
| 3| u,t,k|
| 4| f,o,k|
| 5| e,o,u|
+----------+-------+
file 2: (diagnosis code, diagnosis description) at time T1
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| yen|
| t| ten|
| k| ken|
| u| uen|
| p| pen|
| f| fen|
| o| oen|
| e| een|
+-------+---------+
The data in file 2 is not fixed and keeps changing: at one point in time diagnosis code y can have the description yen, and at another point it can have the description ten. For example:
file 2 at time T2
+-------+---------+
|diag_cd|diag_desc|
+-------+---------+
| y| ten|
| t| yen|
| k| uen|
| u| oen|
| p| ken|
| f| pen|
| o| een|
| e| fen|
+-------+---------+
I have to read these two files in Spark, and I want only the patient ids of those who are diagnosed with uen.
It can be done using either Spark SQL or Scala.
I tried to read file 1 in spark-shell. The two columns in file 1 are pipe-delimited.
scala> val tes1 = sc.textFile("file1.txt").map(x => x.split('|')).filter(y => y(1).contains("u")).collect
tes1: Array[Array[String]] = Array(Array(2, u,t,p), Array(3, u,t,k), Array(5, e,o,u))
But since the diagnosis description associated with a diagnosis code is not constant in file 2, I will have to use a join. However, I don't know how to apply a join when the diag_cd column in file 1 has multiple values.
Any help would be appreciated.
Please find the answer below
//Read the file1 into a dataframe
val file1DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file1PATH")
//Read the file2 into a dataframe
val file2DF = spark.read.format("csv").option("delimiter","|")
.option("header",true)
.load("file2path")
//get the patient id dataframe for the diag_desc as uen
file1DF.join(file2DF,file1DF.col("diag_cd").contains(file2DF.col("diag_cd")),"inner")
.filter(file2DF.col("diag_desc") === "uen")
.select("patient_id").show
Convert the table from format 1 to format 2 using the explode method.
Format 1:
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y,t,k|
| 2| u,t,p|
+----------+-------+
to
file 1:(patient id,diagnosis code)
+----------+-------+
|patient_id|diag_cd|
+----------+-------+
| 1| y |
| 1| t |
| 1| k |
| 2| u |
| 2| t |
| 2| p |
+----------+-------+
Code:
scala> val data = Seq("1|y,t,k", "2|u,t,p")
data: Seq[String] = List(1|y,t,k, 2|u,t,p)
scala> val df1 = sc.parallelize(data).toDF("c1").withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).withColumn("col2", split(col("c1"), "\\|").getItem(1)).select("patient_id", "col2").withColumn("diag_cd", explode(split($"col2", "\\,"))).select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
scala> df1.collect()
res4: Array[org.apache.spark.sql.Row] = Array([1,y], [1,t], [1,k], [2,u], [2,t], [2,p])
I have created dummy data here for illustration. Note how we explode that particular column above using:
scala> val df1 = sc.parallelize(data).toDF("c1").
| withColumn("patient_id", split(col("c1"), "\\|").getItem(0)).
| withColumn("col2", split(col("c1"), "\\|").getItem(1)).
| select("patient_id", "col2").
| withColumn("diag_cd", explode(split($"col2", "\\,"))).
| select("patient_id", "diag_cd")
df1: org.apache.spark.sql.DataFrame = [patient_id: string, diag_cd: string]
Now you can create df2 for file 2 using:
scala> val df2 = sc.textFile("file2.txt").map(x => (x.split(",")(0),x.split(",")(1))).toDF("diag_cd", "diag_desc")
df2: org.apache.spark.sql.DataFrame = [diag_cd: string, diag_desc: string]
Join df1 with df2 and filter as per the requirement.
df1.join(df2, df1.col("diag_cd") === df2.col("diag_cd")).filter(df2.col("diag_desc") === "uen").select(df1.col("patient_id")).collect()
I am using Spark 1.6.0. I have an input RDD of (key, value) pairs and want to convert it to a dataframe.
Input RDD format:
((1, A, ABC), List(pz,A1))
((2, B, PQR), List(az,B1))
((3, C, MNR), List(cs,c1))
Output format:
+----+----+-----+----+----+
| c1 | c2 | c3 | c4 | c5 |
+----+----+-----+----+----+
| 1 | A | ABC | pz | A1 |
+----+----+-----+----+----+
| 2 | B | PQR | az | B1 |
+----+----+-----+----+----+
| 3 | C | MNR | cs | C1 |
+----+----+-----+----+----+
Can someone help me with this?
I would suggest you go with Datasets, since Datasets are optimized, type-safe DataFrames.
First you need to create a case class:
case class table(c1: Int, c2: String, c3: String, c4:String, c5:String)
Then you just need a map function to parse your data into the case class and call .toDS:
rdd.map(x => table(x._1._1, x._1._2, x._1._3, x._2(0), x._2(1))).toDS().show()
You should get the following output:
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 1| A|ABC| pz| A1|
| 2| B|PQR| az| B1|
| 3| C|MNR| cs| c1|
+---+---+---+---+---+
You can use a DataFrame as well; for that, use .toDF() instead of .toDS():
val a = Seq(((1,"A","ABC"),List("pz","A1")),((2, "B", "PQR"),
List("az","B1")),((3,"C", "MNR"), List("cs","c1")))
val a1 = sc.parallelize(a);
val a2 = a1.map(rec=>
(rec._1._1,rec._1._2,rec._1._3,rec._2(0),rec._2(1))).toDF()
a2.show()
+---+---+---+---+---+
| _1| _2| _3| _4| _5|
+---+---+---+---+---+
|  1|  A|ABC| pz| A1|
|  2|  B|PQR| az| B1|
|  3|  C|MNR| cs| c1|
+---+---+---+---+---+
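If you want the c1 to c5 column names from the expected output rather than the default _1 to _5, you can pass the names to .toDF(). A small sketch reusing a1 from above:
val a3 = a1.map(rec => (rec._1._1, rec._1._2, rec._1._3, rec._2(0), rec._2(1)))
  .toDF("c1", "c2", "c3", "c4", "c5")
a3.show()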
I want to filter a column of an RDD source:
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val src = source_primary_key.subtractByKey(destination_primary_key)
I want to use an IN clause in the filter condition to select only the rows of source whose keys are present in src, something like the below (EDITED):
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
The equivalent SQL code is:
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thank you
Since your code isn't reproducible, here is a small example using Spark SQL of how to do select * from t where id in (...):
// create a DataFrame for a range 'id' from 1 to 9.
scala> val df = spark.range(1,10).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
// values to exclude
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
// select * from df where id is not in the values to exclude
scala> df.filter(!col("id").isin(f : _*)).show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 8|
| 9|
+---+
// select * from df where id is in the values to exclude
scala> df.filter(col("id").isin(f : _*)).show
Here is the RDD version of the not-isin filter:
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> val f = Seq(5,6,7)
f: Seq[Int] = List(5, 6, 7)
scala> val rdd2 = rdd.filter(x => !f.contains(x))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at filter at <console>:28
Nevertheless, I still believe this is overkill since you are already using Spark SQL.
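If you do want to stay on plain RDDs of keys, note that subtractByKey only applies to pair RDDs; for the RDDs of key strings in your question, subtract is the closer equivalent of this kind of "not in" filter. A sketch reusing the names from the question:
val extra_in_source = source_primary_key.subtract(destination_primary_key)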
It seems that in your case you are actually dealing with DataFrames, so the solutions mentioned above don't work.
You can use the left anti join approach:
scala> val source = spark.read.format("csv").load("source.file")
source: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> val destination = spark.read.format("csv").load("destination.file")
destination: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 9 more fields]
scala> source.show
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Ravi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2|Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
scala> destination.show
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
|_c0| _c1| _c2| _c3| _c4|_c5|_c6| _c7| _c8| _c9| _c10|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
| 1| Ravi kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi1 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 1| Ravi2 kumar| Revi | kumar| MSO | 1| M|17-01-1994| 74.5| 24000.78| Alabama |
| 2| Shekhar shudhanshu| Shekhar|shudhanshu| Manulife | 2| M|18-01-1994|76.34| 250000| Alaska |
| 3|Preethi Narasingam1| Preethi|Narasingam| Retail | 3| F|19-01-1994|77.45|270000.01| Arizona |
| 4| Abhishek Nair1|Abhishek| Nair| Banking | 4| M|20-01-1994|78.65| 345000| Arkansas |
| 5| Ram Sharma| Ram| Sharma|Infrastructure | 5| M|21-01-1994|79.12| 45000| California |
| 6| Chandani Kumari|Chandani| Kumari| BNFS | 6| F|22-01-1994|80.13| 43000.02| Colorado |
| 7| Balaji Kumar| Balaji| Kumar| MSO | 1| M|23-01-1994|81.33| 1234678|Connecticut |
| 8| Naveen Shekrappa| Naveen| Shekrappa| Manulife | 2| M|24-01-1994| 100| 789414| Delaware |
| 9| Milind Chavan| Milind| Chavan| Retail | 3| M|25-01-1994|83.66| 245555| Florida |
| 10| Raghu Rajeev| Raghu| Rajeev| Banking | 4| M|26-01-1994|87.65| 235468| Georgia|
+---+-------------------+--------+----------+---------------+---+---+----------+-----+---------+------------+
You'll just need to do the following:
scala> val res1 = source.join(destination, Seq("_c0"), "leftanti")
scala> val res2 = destination.join(source, Seq("_c0"), "leftanti")
It's the same logic I mentioned in my answer here.
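The same left anti join can also be written in Spark SQL if you prefer to keep it as a query. A sketch over the two DataFrames above:
source.createOrReplaceTempView("src")
destination.createOrReplaceTempView("dst")
val res3 = spark.sql("select s.* from src s left anti join dst d on s._c0 = d._c0")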
You can try something like this in PySpark:
df.filter(~df.Dept.isin("30","20")).show()
# This lists all rows of df where Dept is NOT IN (30, 20)
You can try something similar in Java:
ds = ds.filter(functions.not(functions.col(COLUMN_NAME).isin(exclusionSet)));
where exclusionSet is a set of values that need to be removed from your dataset.
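In Scala, if the exclusion values live in a collection rather than an array, Spark 2.4+ also offers Column.isInCollection, which takes the collection directly instead of varargs. A sketch with a hypothetical df that has a Dept column:
import org.apache.spark.sql.functions.{col, not}

val exclusions = Seq("30", "20")  // hypothetical values to exclude
val filtered = df.filter(not(col("Dept").isInCollection(exclusions)))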