My requirement is to get the top N items from a DataFrame.
I have this DataFrame:
val df = List(
("MA", "USA"),
("MA", "USA"),
("OH", "USA"),
("OH", "USA"),
("OH", "USA"),
("OH", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA")).toDF("value", "country")
I was able to map it to an RDD[((Int, String), Long)] called colValCount, to be read as ((colIdx, value), count):
((0,CT),5)
((0,MA),2)
((0,OH),4)
((0,NY),6)
((1,USA),17)
Now I need to get the top 2 items for each column index. So my expected output is this:
RDD[((Int, String), Long)]
((0,CT),5)
((0,NY),6)
((1,USA),17)
I've tried using the freqItems API on the DataFrame, but it's slow.
Any suggestions are welcome.
For example:
import org.apache.spark.sql.functions._
df.select(lit(0).alias("index"), $"value")
.union(df.select(lit(1), $"country"))
.groupBy($"index", $"value")
.count
.orderBy($"count".desc)
.limit(3)
.show
// +-----+-----+-----+
// |index|value|count|
// +-----+-----+-----+
// | 1| USA| 17|
// | 0| NY| 6|
// | 0| CT| 5|
// +-----+-----+-----+
where:
df.select(lit(0).alias("index"), $"value")
.union(df.select(lit(1), $"country"))
creates a two column DataFrame:
// +-----+-----+
// |index|value|
// +-----+-----+
// | 0| MA|
// | 0| MA|
// | 0| OH|
// | 0| OH|
// | 0| OH|
// | 0| OH|
// | 0| NY|
// | 0| NY|
// | 0| NY|
// | 0| NY|
// | 0| NY|
// | 0| NY|
// | 0| CT|
// | 0| CT|
// | 0| CT|
// | 0| CT|
// | 0| CT|
// | 1| USA|
// | 1| USA|
// | 1| USA|
// +-----+-----+
If you want specifically two values for each column:
import org.apache.spark.sql.DataFrame
def topN(df: DataFrame, key: String, n: Int) = {
df.select(
lit(df.columns.indexOf(key)).alias("index"),
col(key).alias("value"))
.groupBy("index", "value")
.count
.orderBy($"count")
.limit(n)
}
topN(df, "value", 2).union(topN(df, "country", 2)).show
// +-----+-----+-----+
// |index|value|count|
// +-----+-----+-----+
// | 0| NY| 6|
// | 0| CT| 5|
// | 1| USA| 17|
// +-----+-----+-----+
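If there are many columns, a small sketch (not part of the original answer) that maps the same helper over all of them and stacks the results:
df.columns.map(c => topN(df, c, 2)).reduce(_ union _).show
// for this DataFrame it is equivalent to the union above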
So like pault said - just "some combination of sort() and limit()".
The easiest way to do this - it is a natural window-function problem - is by writing SQL. Spark ships with SQL support, and SQL is a great, expressive tool for this problem.
Register your dataframe as a temp table, and then group and window on it.
spark.sql("""SELECT idx, value, ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) as r
FROM (
SELECT idx, value, COUNT(*) as c
FROM (SELECT 0 as idx, value FROM df UNION ALL SELECT 1, country FROM df)
GROUP BY idx, value)
HAVING r <= 2""").show()
I'd like to see if any of the procedural / scala approaches will let you perform the window function without an iteration or loop. I'm not aware of anything in the Spark API that would support it.
Incidentally, if you have an arbitrary number of columns you want to include then you can quite easily generate the inner section (SELECT 0 as idx, value ... UNION ALL SELECT 1, country, etc) dynamically using the list of columns.
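For example, a minimal sketch of that idea (assuming the DataFrame is registered as the temp view df, as above):
// Build one "SELECT <i> AS idx, <col> AS value FROM df" branch per column
// and glue the branches together with UNION ALL.
val stacked = df.columns.zipWithIndex
  .map { case (c, i) => s"SELECT $i AS idx, `$c` AS value FROM df" }
  .mkString(" UNION ALL ")
spark.sql(s"""
  SELECT idx, value, c FROM (
    SELECT idx, value, c, ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) AS r
    FROM (SELECT idx, value, COUNT(*) AS c FROM ($stacked) s GROUP BY idx, value) counted) ranked
  WHERE r <= 2""").show()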
Given your last RDD:
val rdd =
sc.parallelize(
List(
((0, "CT"), 5),
((0, "MA"), 2),
((0, "OH"), 4),
((0, "NY"), 6),
((1, "USA"), 17)
))
rdd.filter(_._1._1 == 0).sortBy(-_._2).take(2).foreach(println)
> ((0,NY),6)
> ((0,CT),5)
rdd.filter(_._1._1 == 1).sortBy(-_._2).take(2).foreach(println)
> ((1,USA),17)
We first keep the items for a given column index (.filter(_._1._1 == 0)), then sort them by decreasing count (.sortBy(-_._2)), and finally take at most the first 2 elements (.take(2)), which returns fewer elements if there are fewer than 2 records.
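If you prefer not to hard-code the column indices, a small sketch (not from the original answer) that repeats the same filter/sort/take for every index present in the RDD (still one scan per index, like above):
// Collect the distinct column indices, then take the top 2 entries per index.
val indices = rdd.map { case ((idx, _), _) => idx }.distinct.collect.sorted
indices.flatMap { idx =>
  rdd.filter { case ((i, _), _) => i == idx }.sortBy(-_._2).take(2)
}.foreach(println)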
You can apply this helper function defined in Sparkz to each partition (or group) and then combine the results:
package sparkz.utils
import scala.reflect.ClassTag
object TopElements {
  def topN[T: ClassTag](elems: Iterable[T])(scoreFunc: T => Double, n: Int): List[T] =
    elems.foldLeft((Set.empty[(T, Double)], Double.MaxValue)) {
      case (accumulator @ (topElems, minScore), elem) =>
        val score = scoreFunc(elem)
        if (topElems.size < n)
          (topElems + (elem -> score), math.min(minScore, score))
        else if (score > minScore) {
          val newTopElems = topElems - topElems.minBy(_._2) + (elem -> score)
          (newTopElems, newTopElems.map(_._2).min)
        }
        else accumulator
    }
      ._1.toList.sortBy(_._2).reverse.map(_._1)
}
Source: https://github.com/gm-spacagna/sparkz/blob/master/src/main/scala/sparkz/utils/TopN.scala
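For the colValCount RDD from the question, a possible way to wire it up (a sketch, assuming the ((colIdx, value), count) shape shown above):
import sparkz.utils.TopElements

colValCount
  .groupBy { case ((colIdx, _), _) => colIdx }     // one group per column index
  .flatMapValues(items => TopElements.topN(items.toList)(_._2.toDouble, 2))
  .values
  .foreach(println)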
I have a DataFrame with a high volume of data and "n" columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to compute a row-wise average, making sure that cells with "null" are not considered in the average.
This needs to be implemented in Spark / Scala.
What I have tried so far, by referring to Calculate row mean, ignoring NAs in Spark Scala:
val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(_ + _) / lit(df.length) as "row_mean")
Here df_raw contains the columns that need to be aggregated (row-wise, of course). There are more than 80 columns, and they contain an arbitrary mix of data and nulls; the number of nulls needs to be excluded from the denominator when calculating the average. It works fine when all the columns contain data, but even a single null in a row makes the result null.
Edit:
I've tried to adjust this answer by Terry Dactyl
def average(l: Seq[Double]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF.show(10,false)
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+
Spark >= 2.4
It is possible to use aggregate:
val row_mean = expr("""aggregate(
CAST(array(_1, _2, _3) AS array<double>),
-- Initial value
-- Note that aggregate is picky about types
CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>),
-- Merge function
(acc, x) -> (
acc.sum + coalesce(x, 0.0),
acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END),
-- Finalize function
acc -> acc.sum / acc.n)""")
Usage:
df.withColumn("row_mean", row_mean).show
Result:
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
Version independent
Compute sum and count of NOT NULL columns and divide one over another:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def row_mean(cols: Column*) = {
// Sum of values ignoring nulls
val sum = cols
.map(c => coalesce(c, lit(0)))
.foldLeft(lit(0))(_ + _)
// Count of not null values
val cnt = cols
.map(c => when(c.isNull, 0).otherwise(1))
.foldLeft(lit(0))(_ + _)
sum / cnt
}
Example data:
val df = Seq(
(None, None, None),
(Some(2.0), None, None),
(Some(50.0), Some(34.0), None),
(Some(1.0), Some(2.0), Some(3.0))
).toDF
Result:
df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
def average(l: Seq[Integer]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Integer]))
val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])
Testing on the data you supplied gives the correct result:
val df = List(
(Some(10),Some(5), Some(5), None, None),
(None, Some(5), Some(5), None, Some(5)),
(Some(2), Some(8), Some(5), Some(1), Some(2)),
(None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")
Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])
Note: if there are columns you do not want included, make sure they are filtered out when populating the array passed to the UDF.
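For instance, a minimal sketch (assuming a hypothetical id column that should be left out of the average):
val colsToAverage = df.columns.filterNot(_ == "id").map(col)   // drop the hypothetical id column before averaging
val avgDf = df.select(avgUdf(array(colsToAverage: _*)).as("average"))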
Finally:
val df = List(
(Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
Which again is the correct result.
If you want to use Doubles...
def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[java.lang.Double]))
val df = List(
(Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output, where each distinct item in each column is converted into a unique key, for example abc->1, def->2, xyz->3:
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hash code will be the same whenever the input value and data type are the same. Try this:
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames back to get the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
// +----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.
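For instance, the zero-based variant for the first column would simply be (df1ZeroBased is just an illustrative name):
val df1ZeroBased = rdd.map(_._1).distinct.zipWithIndex.toDF("col1", "c1key")   // keys start from 0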
Hi, I have two RDDs that I want to combine into one.
The first RDD is of the format
//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
I have another RDD
//((UserID,MovID),"NA")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))
So the second RDD has more keys, which overlap with those in RDD1. I need to combine the RDDs so that only those keys of the second RDD that are not already present in RDD1 are appended to RDD1.
You can do something like this -
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)
val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))
val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))
val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)
// Convert to Dataframe so it is easier to handle
val df1 = rdd1.toDF
val df2 = rdd2.toDF
// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
+----+---+----+
df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| -1|
| U5| M4| 3|
| U6| M6| -1|
| U3| M2| -1|
| U4| M2| -1|
| U4| M3| 5|
+----+---+----+
// Figure out the extra reviews in second dataframe that do not match (user, mov) in first
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")
// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).union(b.select(columns: _*))
}
// Final result of combining only unique values in df2
unionByName(df1, xtraReviews).show
+----+---+----+
|user|mov|rate|
+----+---+----+
| U1| M1| 1|
| U2| M2| 3|
| U3| M1| 3|
| U3| M2| 1|
| U4| M2| 2|
| U5| M4| 3|
| U4| M3| 5|
| U6| M6| -1|
+----+---+----+
It might also be possible to do it this way:
RDD operations are comparatively slow here, so read your data into DataFrames (or convert the RDDs) first.
Then union the two DataFrames and call dropDuplicates() on the key columns, e.g. df1.union(df2).dropDuplicates(Seq("user", "mov")), so that only one row per key survives.
The benefit is that you are doing it the Spark way, so you keep all the parallelism and speed.
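A rough sketch of that idea using the df1 and df2 defined earlier (note that dropDuplicates keeps an arbitrary row per duplicate key, so if the row from df1 must always win, a join-based approach like the one above is still needed):
// Stack both DataFrames, then keep a single row per (user, mov) key.
val combined = df1.union(df2).dropDuplicates(Seq("user", "mov"))
combined.show()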
I have a file consisting of 3 fields (Emp_ids, Groups, Salaries)
100 A 430
101 A 500
201 B 300
I want to get result as
1) Group name and count(*)
2) Group name and max( salary)
val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc= new SparkContext( conf)
val sal= sc.textFile(myfile)
Scala DSL:
case class Data(empId: Int, group: String, salary: Int)
val df = sqlContext.createDataFrame(sal.map { v =>
val arr = v.split(' ').map(_.trim())
Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
| 100| A| 430|
| 101| A| 500|
| 201| B| 300|
+-----+-----+------+
df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
Or with plain SQL:
df.registerTempTable("salaries")
sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
While Spark SQL is the recommended way to do such aggregations for performance reasons, it can easily be done with the RDD API:
val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))
rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))
rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))
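Both aggregates can also be computed in a single pass with aggregateByKey (a sketch, not part of the original answer), using a (count, maxSalary) pair as the accumulator:
rdd.map(v => (v.group, v.salary))
  .aggregateByKey((0, Int.MinValue))(
    (acc, salary) => (acc._1 + 1, math.max(acc._2, salary)),   // fold one salary into (count, max)
    (a, b) => (a._1 + b._1, math.max(a._2, b._2)))             // merge partial accumulators
  .collect()
// e.g. Array((A,(2,500)), (B,(1,300)))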
I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns with the same label (e.g., "prop1"), I found it easier (at least for me) to operate at the RDD level. We first transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{
x =>
(x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{
case(i,d) =>
def checkAllEqual(l: Seq[Double]) = if(l.toSet.size == 1) 0 else 1
val g = d.grouped(3).toList
val g1 = checkAllEqual(g.map(x => x(0)))
val g2 = checkAllEqual(g.map(x => x(1)))
val g3 = checkAllEqual(g.map(x => x(2)))
(i, g1,g2,g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+