Computing rank of a row - scala

I want to rank user id based on one field. For the same value of the field, rank should be same. That data is in Hive table.
e.g.
user value
a 5
b 10
c 5
d 6
Rank
a - 1
c - 1
d - 3
b - 4
How can i do that?

It is possible to use rank window function either with a DataFrame API:
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy($"value")
val df = sc.parallelize(Seq(
("a", 5), ("b", 10), ("c", 5), ("d", 6)
)).toDF("user", "value")
df.select($"user", rank.over(w).alias("rank")).show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT user, RANK() OVER (ORDER BY value) AS rank FROM df").show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
but it is extremely inefficient.
You can also try to use RDD API but it is not exactly straightforward. First lets convert DataFrame to RDD:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.RangePartitioner
val rdd: RDD[(Int, String)] = df.select($"value", $"user")
.map{ case Row(value: Int, user: String) => (value, user) }
val partitioner = new RangePartitioner(rdd.partitions.size, rdd)
val sorted = rdd.repartitionAndSortWithinPartitions(partitioner)
Next we have to compute ranks per partition:
def rank(iter: Iterator[(Int,String)]) = {
val zero = List((-1L, Integer.MIN_VALUE, "", 1L))
def f(acc: List[(Long,Int,String,Long)], x: (Int, String)) =
(acc.head, x) match {
case (
(prevRank: Long, prevValue: Int, _, offset: Long),
(currValue: Int, label: String)) => {
val newRank = if (prevValue == currValue) prevRank else prevRank + offset
val newOffset = if (prevValue == currValue) offset + 1L else 1L
(newRank, currValue, label, newOffset) :: acc
}
}
iter.foldLeft(zero)(f).reverse.drop(1).map{case (rank, _, label, _) =>
(rank, label)}.toIterator
}
val partRanks = sorted.mapPartitions(rank)
offset for each partition
def getOffsets(sorted: RDD[(Int, String)]) = sorted
.mapPartitionsWithIndex((i: Int, iter: Iterator[(Int, String)]) =>
Iterator((i, iter.size)))
.collect
.foldLeft(List((-1, 0)))((acc: List[(Int, Int)], x: (Int, Int)) =>
(x._1, x._2 + acc.head._2) :: acc)
.toMap
val offsets = sc.broadcast(getOffsets(sorted))
and the final ranks:
def adjust(i: Int, iter: Iterator[(Long, String)]) =
iter.map{case (rank, label) => (rank + offsets.value(i - 1).toLong, label)}
val ranks = partRanks
.mapPartitionsWithIndex(adjust)
.map{case (i, label) => (1 + i , label)}

Related

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using UDF is most simple way to achieve custom tasks here.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(_._1).map(_._2)})
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this can be an approach using Dataset:
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[a, b, c]|
| 4| 1| 3|[b, c, a]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c")
.withColumn("newcol", sorted_column_names($"column_map"))

Scala Dataframe get max value of specific row

Given a dataframe with an index column ("Z"):
val tmp= Seq(("D",0.1,0.3, 0.4), ("E",0.3, 0.1, 0.4), ("F",0.2, 0.2, 0.5)).toDF("Z", "a", "b", "c")
+---+---+---+---+
| Z | a| b| c|
---+---+---+---+
| "D"|0.1|0.3|0.4|
| "E"|0.3|0.1|0.4|
| "F"|0.2|0.2|0.5|
+---+---+---+---+
Say im interested in the first row where Z = "D":
tmp.filter(col("Z")=== "D")
+---+---+---+---+
| Z | a| b| c|
+---+---+---+---+
|"D"|0.1|0.3|0.4|
+---+---+---+---+
How do i get the min and max values of that Dataframe row and its corresponding column name while keeping the index column?
Desired output if i want top 2 max
+---+---+---
| Z | b|c |
+---+---+--+
| D |0.3|0.4|
+---+---+---
Desired output if i want min
+---+---+
| Z | a|
+---+---+
| D |0.1|
+---+---+
What i tried:
// first convert that DF to an array
val tmp = df.collect.map(_.toSeq).flatten
// returns
tmp: Array[Any] = Array(0.1, 0.3, 0.4) <---dont know why Any is returned
//take top values of array
val n = 1
tmp.zipWithIndex.sortBy(-_._1).take(n).map(_._2)
But got error:
No implicit Ordering defined for Any.
Any way to do it straight from dataframe instead of array?
You can do something like this
tmp
.where($"a" === 0.1)
.take(1)
.map { row =>
Seq(row.getDouble(0), row.getDouble(1), row.getDouble(2))
}
.head
.sortBy(d => -d)
.take(2)
Or if you have big amount of fields you can take schema and pattern match row fields against schema data types like this
import org.apache.spark.sql.types._
val schemaWithIndex = tmp.schema.zipWithIndex
tmp
.where($"a" === 0.1)
.take(1)
.map { row =>
for {
tuple <- schemaWithIndex
} yield {
val field = tuple._1
val index = tuple._2
field.dataType match {
case DoubleType => row.getDouble(index)
}
}
}
.head
.sortBy(d => -d)
.take(2)
Maybe there is easier way to do this.
Definitely not the fastest way, but straight from dataframe
More generic solution:
// somewhere in codebase
import spark.implicits._
import org.apache.spark.sql.functions._
def transform[T, R : Encoder](ds: DataFrame, colsToSelect: Seq[String])(func: Map[String, T] => Map[String, R])
(implicit encoder: Encoder[Map[String, R]]): DataFrame = {
ds.map(row => func(row.getValuesMap(colsToSelect)))
.toDF()
.select(explode(col("value")))
.withColumn("idx", lit(1))
.groupBy(col("idx")).pivot(col("key")).agg(first(col("value")))
.drop("idx")
}
Now it's about working with Map where the map key is a field name and map value is the field value.
def fuzzyStuff(values: Map[String, Any]): Map[String, String] = {
val valueForA = values("a").asInstanceOf[Double]
//Do whatever you want to do
// ...
//use map as a return type where key is a column name and value is whatever yo want to
Map("x" -> (s"fuzzyA-$valueForA"))
}
def maxN(n: Int)(values: Map[String, Double]): Map[String, Double] = {
println(values)
values.toSeq.sorted.reverse.take(n).toMap
}
Usage:
val tmp = Seq((0.1,0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
transform(filtered, colsToSelect = Seq("a", "b", "c"))(maxN(2))
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+
transform(filtered, colsToSelect = Seq("a", "b", "c"))(fuzzyStuff)
.show()
+----------+
| x|
+----------+
|fuzzyA-0.1|
+----------+
Define max and min functions
def maxN(values: Map[String, Double], n: Int): Map[String, Double] = {
values.toSeq.sorted.reverse.take(n).toMap
}
def min(values: Map[String, Double]): Map[String, Double] = {
Map(values.toSeq.min)
}
Create dataset
val tmp= Seq((0.1,0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
Exple and pivot map type
val df = filtered.map(row => maxN(row.getValuesMap(Seq("a", "b", "c")), 2)).toDF()
val exploded = df.select(explode($"value"))
+---+-----+
|key|value|
+---+-----+
| a| 0.1|
| b| 0.3|
+---+-----+
//Then pivot
exploded.withColumn("idx", lit(1))
.groupBy($"idx").pivot($"key").agg(first($"value"))
.drop("idx")
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+

Generic iterator over dataframe (Spark/scala)

I need to iterate over data frame in specific order and apply some complex logic to calculate new column.
In below example I'll be using simple expression where current value for s is multiplication of all previous values thus it may seem like this can be done using UDF or even analytic functions. However, in reality logic is much more complex.
Below code does what is needed
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val q = """
select 10 x, 1 y
union all select 10, 2
union all select 10, 3
union all select 20, 6
union all select 20, 4
union all select 20, 5
"""
val df = spark.sql(q)
def f_row(iter: Iterator[Row]) : Iterator[Row] = {
iter.scanLeft(Row(0,0,1)) {
case (r1, r2) => {
val (x1, y1, s1) = r1 match {case Row(x: Int, y: Int, s: Int) => (x, y, s)}
val (x2, y2) = r2 match {case Row(x: Int, y: Int) => (x, y)}
Row(x2, y2, s1 * y2)
}
}.drop(1)
}
val schema = new StructType().
add(StructField("x", IntegerType, true)).
add(StructField("y", IntegerType, true)).
add(StructField("s", IntegerType, true))
val encoder = RowEncoder(schema)
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
Output
scala> df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
+---+---+---+
| x| y| s|
+---+---+---+
| 20| 4| 4|
| 20| 5| 20|
| 20| 6|120|
| 10| 1| 1|
| 10| 2| 2|
| 10| 3| 6|
+---+---+---+
What I do not like about it is
1) I explicitly define schema even though Spark can infer names and types for data frame
scala> df
res1: org.apache.spark.sql.DataFrame = [x: int, y: int]
2) If I add any new column to data frame then I have to declare schema again and what is more annoying - re-define function!
Assume there is new column z in data frame. In this case I have to change almost every line in f_row.
def f_row(iter: Iterator[Row]) : Iterator[Row] = {
iter.scanLeft(Row(0,0,"",1)) {
case (r1, r2) => {
val (x1, y1, z1, s1) = r1 match {case Row(x: Int, y: Int, z: String, s: Int) => (x, y, z, s)}
val (x2, y2, z2) = r2 match {case Row(x: Int, y: Int, z: String) => (x, y, z)}
Row(x2, y2, z2, s1 * y2)
}
}.drop(1)
}
val schema = new StructType().
add(StructField("x", IntegerType, true)).
add(StructField("y", IntegerType, true)).
add(StructField("z", StringType, true)).
add(StructField("s", IntegerType, true))
val encoder = RowEncoder(schema)
df.withColumn("z", lit("dummy")).repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
Output
scala> df.withColumn("z", lit("dummy")).repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
+---+---+-----+---+
| x| y| z| s|
+---+---+-----+---+
| 20| 4|dummy| 4|
| 20| 5|dummy| 20|
| 20| 6|dummy|120|
| 10| 1|dummy| 1|
| 10| 2|dummy| 2|
| 10| 3|dummy| 6|
+---+---+-----+---+
Is there a way to implement logic in more generic way so I do not need to create function to iterate over every specific data frame?
Or at least to avoid code changes after adding new columns into data frame which are not used in calculation logic.
Please see updated question below.
Update
Below are two options to iterate in more generic way but still with some drawbacks.
// option 1
def f_row(iter: Iterator[Row]): Iterator[Row] = {
val r = Row.fromSeq(Row(0, 0).toSeq :+ 1)
iter.scanLeft(r)((r1, r2) =>
Row.fromSeq(r2.toSeq :+ r1.getInt(r1.size - 1) * r2.getInt(r2.fieldIndex("y")))
).drop(1)
}
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
// option 2
def f_row(iter: Iterator[Row]): Iterator[Row] = {
iter.map{
var s = 1
r => {
s = s * r.getInt(r.fieldIndex("y"))
Row.fromSeq(r.toSeq :+ s)
}
}
}
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
If a new column added to data frame then initial value for iter.scanLeft has to be changed in Option 1. Also I do not really like Option 2 because it uses mutable var.
Is there a way to improve the code so it's purely functional and no changes are needed when new column added to the data frame?
Well, sufficient solution is below
def f_row(iter: Iterator[Row]): Iterator[Row] = {
if (iter.hasNext) {
val head = iter.next
val r = Row.fromSeq(head.toSeq :+ head.getInt(head.fieldIndex("y")))
iter.scanLeft(r)((r1, r2) =>
Row.fromSeq(r2.toSeq :+ r1.getInt(r1.size - 1) * r2.getInt(r2.fieldIndex("y"))))
} else iter
}
val encoder =
RowEncoder(StructType(df.schema.fields :+ StructField("s", IntegerType, false)))
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row)(encoder).show
Update
Functions like getInt can be avoided in favor of more generic getAs.
Also, in order to be able to access rows of r1 by name we can generate GenericRowWithSchema which is subclass of Row.
Implicit parameter has been added to f_row so that function can use current schema of the data frame and in the same time it can be used as a parameter of the mapPartitions.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.catalyst.encoders.RowEncoder
implicit val schema = StructType(df.schema.fields :+ StructField("result", IntegerType))
implicit val encoder = RowEncoder(schema)
def mul(x1: Int, x2: Int) = x1 * x2;
def f_row(iter: Iterator[Row])(implicit currentSchema : StructType) : Iterator[Row] = {
if (iter.hasNext) {
val head = iter.next
val r =
new GenericRowWithSchema((head.toSeq :+ (head.getAs("y"))).toArray, currentSchema)
iter.scanLeft(r)((r1, r2) =>
new GenericRowWithSchema((r2.toSeq :+ mul(r1.getAs("result"), r2.getAs("y"))).toArray, currentSchema))
} else iter
}
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row).show
Finally, logic can be implemented in a tail recursive manner.
import scala.annotation.tailrec
def f_row(iter: Iterator[Row]) = {
#tailrec
def f_row_(iter: Iterator[Row], tmp: Int, result: Iterator[Row]): Iterator[Row] = {
if (iter.hasNext) {
val r = iter.next
f_row_(iter, mul(tmp, r.getAs("y")),
result ++ Iterator(Row.fromSeq(r.toSeq :+ mul(tmp, r.getAs("y")))))
} else result
}
f_row_(iter, 1, Iterator[Row]())
}
df.repartition($"x").sortWithinPartitions($"y").mapPartitions(f_row).show

Spark: reduce/aggregate by key

I am new to Spark and Scala, so I have no idea how this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
My data, but here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map{
case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
Better approach avoiding groupByKey is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), (x._2._2))) // Need to add quotes around V part for reduceByKey
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
As I stated in the columns it can be done with DataFrames for structured data, so run this below:
// This above should be enough.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do this to approximate the above of the CompactBuffer above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns - some renaming req'd and possible sorting:
---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks like follow:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
It is just an illustrative example, in reality I have about 200,000 unique userID and and 300 unique category.
What is the most efficient way to create the features DataFrame?
A little bit more DataFrame centric solution:
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(Seq(
(1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
(2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
(3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
.select($"category")
.distinct.map(_.getString(0))
.collect
.sorted
// Prepare vector assemble
val assembler = new VectorAssembler()
.setInputCols(categories)
.setOutputCol("features")
// Aggregation expressions
val exprs = categories.map(
c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(
df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
.select($"userID", $"features")
and an UDAF alternative:
import org.apache.spark.sql.expressions.{
MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.types.{
StructType, ArrayType, DoubleType, IntegerType}
import scala.collection.mutable.WrappedArray
class VectorAggregate (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType()
.add("i", IntegerType)
.add("v", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = new VectorUDT()
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val i = input.getInt(0)
val v = input.getDouble(1)
val buff = buffer.getAs[WrappedArray[Double]](0)
buff(i) += v
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
with example usage:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("category_idx")
.fit(df)
val indexed = indexer.transform(df)
.withColumn("category_idx", $"category_idx".cast("integer"))
.withColumn("frequency", $"frequency".cast("double"))
val n = indexer.labels.size + 1
val transformed = indexed
.groupBy($"userID")
.agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))
transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
In this case order of values is defined by indexer.labels:
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
In practice I would prefer solution by Odomontois so these are provided mostly for reference.
Suppose:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
Where userId and frequency are bigint columns which corresponds to scala.Long
We are creating intermediate mapping RDD:
val catMaps = cats.rdd
.groupBy(_.getAs[Long]("userId"))
.map { case (id, rows) => id -> rows
.map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
.toMap
}
Then collecting all presented categories in the lexicographic order
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
Or creating it manually
val catNames = cs.broadcast(1 to 10 map {n => s"cat$n"} toArray)
Finally we're transforming maps to arrays with 0-values for non-existing values
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
now catArrays.show() prints something like
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This could be not the most elegant solution for dataframes, as I barely familiar with this area of spark.
Note, that you could create your catNames manually to add zeros for missing cat3, cat5, ...
Also note that otherwise catMaps RDD is operated twice, you might want to .persist() it
Given your input:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
Just run:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dzZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
Finally, use VectorAssembler to create a org.apache.spark.ml.linalg.Vector
NOTE: I have not checked performances on this yet...
EDIT: Possibly more complex, but likely more efficient!
def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
(data: Seq[Row]) => {
val indices = data.map(_.getDouble(0).toInt).toArray
val values = data.map(_.getInt(1).toDouble).toArray
Vectors.sparse(size, indices, values)
}
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
NOTE: StringIndexer will sort categories by frequency => most frequent category will be at index=0 in indexerModel.labels. Feel free to use your own mapping if you'd like and pass that directly to toSparseVectorUdf.