Apache Spark DataFrame: df.where() with Java:List attribute - scala

Imagine you have a DataFrame df like this:
a b
1 1
1 2
1 3
2 1
2 2
2 3
and you want to implement a generic .where functionality.
How can you filter by a List?
val l1: List[Int] = List(1, 2)
df.where($"b" === l1:_*) // does not work
Or is there even an option to ask for something like this:
df.where($"a" === l1:_* && $"b" === l1:_*)

If I got you right, you want IN semantics:
df.where($"b" isin (l1: _*)).show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
And
df.where(($"a" isin (l1: _*)) and ($"b" isin (l1: _*))).show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
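If you need to keep this generic, for example for an arbitrary set of column/values pairs, you can fold the per-column isin conditions into one filter. A minimal sketch, assuming the pairs arrive as a Map (the names filterByLists and conds are illustrative, not from the original post):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

// AND together one isin condition per column; keep everything if the map is empty.
def filterByLists(df: DataFrame, conds: Map[String, List[Int]]): DataFrame = {
  val combined: Column = conds
    .map { case (name, values) => col(name).isin(values: _*) }
    .reduceOption(_ && _)
    .getOrElse(lit(true))
  df.where(combined)
}

// e.g. filterByLists(df, Map("a" -> l1, "b" -> l1)) reproduces the second query above.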

Rename Duplicate Columns of a Spark DataFrame?

There are several good answers about managing duplicate columns from joined dataframes, e.g. (How to avoid duplicate columns after join?), but what if I'm simply presented with a DataFrame that has duplicate columns and I have to deal with it? I have no control over the processes leading up to this point.
What I have:
val data = Seq((1,2),(3,4)).toDF("a","a")
data.show
+---+---+
| a| a|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
What I want:
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
withColumnRenamed("a","a_2") does not work, for obvious reasons.
The simplest way I found to do this is:
val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
For a more general solution:
val data = Seq(
(1,2,3,4,5,6,7,8),
(9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show
+---+---+---+---+---+---+---+---+
| a| b| c| a| d| b| e| b|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

def dedupeColumnNames(df: DataFrame): DataFrame = {
  @tailrec
  def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
    if (columns.isEmpty) fixed_columns
    else {
      // number of occurrences of the current name among the columns not yet processed
      val count = columns.groupBy(identity).mapValues(_.size)(columns.head)
      if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
      else dedupe(s"${columns.head}_${count}" :: fixed_columns, columns.tail)
    }
  }

  // walk the columns from right to left so the last duplicate gets the highest suffix
  val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
  df.toDF(new_columns: _*)
}
data
  .transform(dedupeColumnNames)
  .show
+---+---+---+---+---+---+---+---+
| a| b| c|a_2| d|b_2| e|b_3|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+
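A shorter variant, not from the original answer but a sketch under the same assumptions, numbers duplicates left to right with a running count per name; for the example above it produces the same a, b, c, a_2, d, b_2, e, b_3 layout:
import scala.collection.mutable
import org.apache.spark.sql.DataFrame

def dedupeColumnNamesLtr(df: DataFrame): DataFrame = {
  val seen = mutable.Map.empty[String, Int].withDefaultValue(0)
  val newColumns = df.columns.map { name =>
    seen(name) += 1                                        // running count per name
    if (seen(name) == 1) name else s"${name}_${seen(name)}"
  }
  df.toDF(newColumns: _*)
}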

Use Iterator to get top k keywords

I am writing a Spark algorithm to get the top k keywords for each country. I already have a DataFrame containing all the records and plan to do
df.repartition($"country_id").mapPartitions()
to retrieve the top k keywords, but I am confused about how to write an iterator for it.
If I can write a method or call a native method, I can sort within each partition and take the top k, but that does not seem to be the right approach if the input is an iterator.
Does anyone have an idea?
You can achieve this using window functions. Let's assume that column _1 is your keyword and _2 is the keyword's count. In this case k = 2:
scala> df.show()
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 2|
| 1| 4|
| 1| 1|
| 2| 0|
| 1| 10|
| 2| 5|
+---+---+
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.row_number
scala> df.select('*, row_number().over(Window.orderBy('_2.desc).partitionBy('_1)).as("rn")).where('rn < 3).show()
+---+---+---+
| _1| _2| rn|
+---+---+---+
| 1| 10| 1|
| 1| 4| 2|
| 2| 5| 1|
| 2| 2| 2|
+---+---+---+
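For completeness, the mapPartitions route from the question can also work. Below is a minimal sketch, assuming columns named country_id, keyword and cnt (illustrative names, not from the post); note that hash repartitioning can still put several countries into one partition, so the iterator has to group by country itself, and each partition is materialized in memory:
import org.apache.spark.sql.Row

val k = 2
val topKPerCountry = df
  .repartition($"country_id")                  // co-locate rows of the same country
  .rdd
  .mapPartitions { rows: Iterator[Row] =>
    rows.toSeq
      .groupBy(_.getAs[Int]("country_id"))     // a partition may hold several countries
      .valuesIterator
      .flatMap(_.sortBy(r => -r.getAs[Int]("cnt")).take(k))
  }
// topKPerCountry is an RDD[Row] with at most k rows per country.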

spark aggregation count on condition

I'm trying to group a DataFrame and, when aggregating rows with a count, I want to apply a condition on the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the rows in column _2 where the value = 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. A PySpark version is shown first:
from pyspark.sql.functions import when, count, col
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
And the Scala equivalent:
import spark.implicits._
import org.apache.spark.sql.functions.{count, when}
val test = Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2" === "X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative, in Scala, it can be:
import org.apache.spark.sql.functions.{col, lit, sum, when}
val counter1 = test.select(col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
which gives the result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
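On newer Spark versions (assumed 3.0+), the same aggregation can also be expressed with the count_if SQL function through expr; this is just an extra sketch, not part of the original answer:
import org.apache.spark.sql.functions.expr

// count_if only counts the rows for which the predicate holds.
test.groupBy("_1")
  .agg(expr("count_if(_2 = 'X')").as("count"))
  .orderBy("_1")
  .show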

How to make VectorAssembler not compress the data?

I want to transform multiple columns into one column using VectorAssembler, but by default the output is stored in a compressed (sparse) representation and there is no option to change that.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val arr2 = Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames = Array("a","b","c","e","f")
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)
The input is:
+---+---+---+---+---+
| a| b| c| e| f|
+---+---+---+---+---+
| 1| 2| 0| 0| 0|
| 1| 2| 3| 0| 0|
| 1| 2| 4| 5| 0|
| 1| 2| 2| 5| 6|
+---+---+---+---+---+
The result is:
+---------------------+
|newCol |
+---------------------+
|(5,[0,1],[1.0,2.0]) |
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
My expect result is:
+---------------------+
|newCol |
+---------------------+
|[1.0,2.0,0.0,0.0,0.0]|
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
What should I do to get my expected result?
If you really want to coerce all vectors to their dense representation, you can do it with a user-defined function (UDF):
import org.apache.spark.sql.functions.udf

val toDense = udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)
transDF.select(toDense($"newCol")).show
+--------------------+
| UDF(newCol)|
+--------------------+
|[1.0,2.0,0.0,0.0,...|
|[1.0,2.0,3.0,0.0,...|
|[1.0,2.0,4.0,5.0,...|
|[1.0,2.0,2.0,5.0,...|
+--------------------+
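The truncation above is only a display artifact of show; if you want to read the full vectors, something like the following (keeping a readable column name) prints them untruncated:
// Rename the UDF output and disable truncation when printing.
transDF.select(toDense($"newCol").as("newCol")).show(false)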

Adding a Column in DataFrame from another column of same dataFrame Pyspark

I have a PySpark DataFrame df, like the following:
+---+----+---+
| id|name| c|
+---+----+---+
| 1| a| 5|
| 2| b| 4|
| 3| c| 2|
| 4| d| 3|
| 5| e| 1|
+---+----+---+
I want to add a column match_name that takes its value from the name column of the row where id == c.
Is it possible to do this with the withColumn() function?
Currently I have to create two dataframes and then perform a join, which is inefficient on a large dataset.
Expected Output:
+---+----+---+----------+
| id|name| c|match_name|
+---+----+---+----------+
| 1| a| 5| e|
| 2| b| 4| d|
| 3| c| 2| b|
| 4| d| 3| c|
| 5| e| 1| a|
+---+----+---+----------+
Yes, it is possible, with when:
from pyspark.sql.functions import when, col

condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name")))
result.show()
+---+----+-----+----------+
| id|name|match|match_name|
+---+----+-----+----------+
|  1|   a|    3|      null|
|  2|   b|    2|         b|
|  3|   c|    5|      null|
|  4|   d|    4|         d|
|  5|   e|    1|      null|
+---+----+-----+----------+
You may also use otherwise to provide a different value if the condition is not met.