Sort a Dataset[String] in Scala

I have dsString: Dataset[(String, Long)] (not a DataFrame or Dataset[Row]) and I'm trying to order it by the Long field, something like .orderBy(_._2).
My problem is that .orderBy() and .sort() only accept columns, and .sortBy is only available on RDDs.
DataFrame solution:
dsString.toDF("a", "b")
.orderBy("b")
RDD solution:
dsString.rdd
.sortBy(_._2)
How could Dataset[(String,Long)] do the same?

orderBy can also be applied to a Dataset. For example,
+---+---+
| _1| _2|
+---+---+
| c| 3|
| b| 5|
| a| 4|
+---+---+
this is my Dataset (call it df2), and
df2.orderBy(col("_1").desc).show
df2.orderBy(col("_2").asc).show
give the results as follows:
+---+---+
| _1| _2|
+---+---+
| c| 3|
| b| 5|
| a| 4|
+---+---+
+---+---+
| _1| _2|
+---+---+
| c| 3|
| a| 4|
| b| 5|
+---+---+
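Worth noting: sort and orderBy are typed transformations, so you can call them on the Dataset[(String, Long)] itself and the result stays a Dataset[(String, Long)]; there is no need to go through a DataFrame or an RDD. A minimal self-contained sketch (the data and the local SparkSession are just for illustration):

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("sort-typed-dataset").getOrCreate()
import spark.implicits._

val dsString: Dataset[(String, Long)] = Seq(("c", 3L), ("b", 5L), ("a", 4L)).toDS()

// orderBy/sort on a Dataset[T] return a Dataset[T], so the tuple type is preserved.
val sorted: Dataset[(String, Long)] = dsString.orderBy($"_2".asc)

sorted.show()
// +---+---+
// | _1| _2|
// +---+---+
// |  c|  3|
// |  a|  4|
// |  b|  5|
// +---+---+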

Related

Rename Duplicate Columns of a Spark DataFrame?

There are several good answers about managing duplicate columns from joined dataframes, e.g. (How to avoid duplicate columns after join?), but what if I'm simply handed a DataFrame with duplicate columns that I have to deal with? I have no control over the processes leading up to this point.
What I have:
val data = Seq((1,2),(3,4)).toDF("a","a")
data.show
+---+---+
| a| a|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
What I want:
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
withColumnRenamed("a","a_2") does not work, for obvious reasons: it renames every column called a, so both columns end up as a_2.
The simplest way I found to do this is:
val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show
+---+---+
| a|a_2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
For a more general solution:
val data = Seq(
(1,2,3,4,5,6,7,8),
(9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show
+---+---+---+---+---+---+---+---+
| a| b| c| a| d| b| e| b|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

def dedupeColumnNames(df: DataFrame): DataFrame = {

  @tailrec
  def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
    if (columns.isEmpty) fixed_columns
    else {
      // how many times the current name still occurs in the remaining (reversed) columns
      val count = columns.groupBy(identity).mapValues(_.size)(columns.head)
      if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
      else dedupe(s"${columns.head}_${count}" :: fixed_columns, columns.tail)
    }
  }

  // The columns are processed in reverse so the first occurrence keeps its name
  // and later duplicates get _2, _3, ... suffixes.
  val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
  df.toDF(new_columns: _*)
}
data
.transform(dedupeColumnNames)
.show
+---+---+---+---+---+---+---+---+
| a| b| c|a_2| d|b_2| e|b_3|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+---+---+

Use Iterator to get top k keywords

I am writing a Spark algorithm to get the top k keywords for each country. I already have a DataFrame containing all the records and plan to do
df.repartition($"country_id").mapPartitions(...)
to retrieve the top k keywords, but I am confused about how to write the iterator.
If I could write a method or call a native method, I would sort within each partition and take the top k, but that doesn't seem to be the right approach when the input is an iterator.
Does anyone have an idea?
You can achieve this using window functions. Let's assume that column _1 is your keyword and _2 is the keyword's count; in this example k = 2.
scala> df.show()
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 2|
| 1| 4|
| 1| 1|
| 2| 0|
| 1| 10|
| 2| 5|
+---+---+
scala> import org.apache.spark.sql.expressions.Window
scala> df.select('*, row_number().over(Window.orderBy('_2.desc).partitionBy('_1)).as("rn")).where('rn < 3).show()
+---+---+---+
| _1| _2| rn|
+---+---+---+
| 1| 10| 1|
| 1| 4| 2|
| 2| 5| 1|
| 2| 2| 2|
+---+---+---+
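If you specifically want the repartition + mapPartitions route mentioned in the question, here is a rough sketch. The case class and field names are made up, since the real schema isn't shown; the idea is that repartition($"country_id") puts all rows of a country into the same partition, so each partition can be grouped and sorted locally:

import org.apache.spark.sql.Dataset

// Hypothetical row type standing in for the real records.
case class KeywordCount(country_id: Int, keyword: String, cnt: Long)

def topKPerCountry(ds: Dataset[KeywordCount], k: Int): Dataset[KeywordCount] = {
  val spark = ds.sparkSession
  import spark.implicits._

  ds.repartition($"country_id")
    .mapPartitions { rows =>
      // Every row of a given country lands in this partition, so grouping locally is safe.
      rows.toList
        .groupBy(_.country_id)
        .valuesIterator
        .flatMap(_.sortBy(-_.cnt).take(k))
    }
}

Be aware that rows.toList buffers the whole partition in memory, so this only works when a partition fits on one executor; the window-function version above avoids that.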

spark aggregation count on condition

I'm trying to group a DataFrame and, when aggregating the rows with a count, apply a condition so that only matching rows are counted.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the rows of column _2 where the value is 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. Here is a PySpark version first:
from pyspark.sql.functions import col, count, when
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
And the same in Scala:
import spark.implicits._
import org.apache.spark.sql.functions._
val test = Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2" === "X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As another alternative in Scala, using when/otherwise with sum:
val counter1 = test.select(col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
which gives this result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
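On newer Spark versions the SQL built-in count_if can express the same aggregation a bit more directly (treat its availability in your version as something to verify):

import org.apache.spark.sql.functions.expr

test.groupBy("_1")
  .agg(expr("count_if(_2 = 'X')").as("count"))
  .orderBy("_1")
  .show

This produces the same counts as the count(when(...)) version above.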

spark dataframe groupby multiple times

val df = (Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
  (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")).
  toDF("col1", "col2", "col3"))
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 10|
| 1| b| 12|
| 1| c| 13|
| 2| a| 14|
| 2| c| 11|
| 1| b| 12|
| 2| c| 12|
| 3| r| 11|
+----+----+----+
My requirement is that I need to perform two levels of groupBy, as explained below.
Level 1:
If I group by col1 and sum col3, I will get the two columns below:
1. col1
2. sum(col3)
I will lose col2 here.
Level 2:
If I then group by col1 and col2 and sum col3, I will get the three columns below:
1. col1
2. col2
3. sum(col3)
So I need to perform both levels of groupBy and have the two sums (sum(col3) from level 1 and sum(col3) from level 2) in one final DataFrame.
How can I do this? Can anyone explain?
Spark: 1.6.2
Scala: 2.10
One option is to do the two sums separately and then join them back:
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 3| r| 11.0| 11.0|
| 1| a| 10.0| 47.0|
+----+----+----------+----------+
Another option is to use window functions, since sum_level1 is just the sum of sum_level2 within each col1 group:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 1| a| 10.0| 47.0|
| 3| r| 11.0| 11.0|
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
+----+----+----------+----------+

Apache Spark DataFrame: df.where() with Java:List attribute

Imagine you have a df like this:
a b
1 1
1 2
1 3
2 1
2 2
2 3
and you want to implement a generic .where functionality;
how can you filter by a List?
val l1: List[Int] = List(1, 2)
df.where($"b" === l1: _*) // does not work
Or is there even an option to ask for something like this:
df.where($"a" === l1: _* && $"b" === l1: _*)
If I got you right, you want IN semantics:
df.where($"b" isin (l1: _*)).show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
And for the combined condition on both columns:
df.where(($"a" isin (l1: _*)) and ($"b" isin (l1: _*))).show()
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
| 2| 1|
| 2| 2|
+---+---+
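If you are on Spark 2.4 or later (the version cutoff is from memory, worth double-checking), Column.isInCollection takes the Scala collection directly, so you can skip the varargs expansion:

df.where($"b".isInCollection(l1)).show()
df.where($"a".isInCollection(l1) && $"b".isInCollection(l1)).show()

Both give the same output as the isin versions above.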