Scala Dataframe get max value of specific row - scala

Given a dataframe with an index column ("Z"):
val tmp= Seq(("D",0.1,0.3, 0.4), ("E",0.3, 0.1, 0.4), ("F",0.2, 0.2, 0.5)).toDF("Z", "a", "b", "c")
+---+---+---+---+
|  Z|  a|  b|  c|
+---+---+---+---+
|  D|0.1|0.3|0.4|
|  E|0.3|0.1|0.4|
|  F|0.2|0.2|0.5|
+---+---+---+---+
Say I'm interested in the first row, where Z = "D":
tmp.filter(col("Z")=== "D")
+---+---+---+---+
| Z | a| b| c|
+---+---+---+---+
|"D"|0.1|0.3|0.4|
+---+---+---+---+
How do I get the min and max values of that DataFrame row and their corresponding column names, while keeping the index column?
Desired output if I want the top 2 max values:
+---+---+---+
|  Z|  b|  c|
+---+---+---+
|  D|0.3|0.4|
+---+---+---+
Desired output if I want the min:
+---+---+
| Z | a|
+---+---+
| D |0.1|
+---+---+
What I tried:
// first convert that DF to an array
val tmp = df.collect.map(_.toSeq).flatten
// returns
tmp: Array[Any] = Array(0.1, 0.3, 0.4) <--- don't know why Any is returned
//take top values of array
val n = 1
tmp.zipWithIndex.sortBy(-_._1).take(n).map(_._2)
But got error:
No implicit Ordering defined for Any.
Any way to do it straight from the DataFrame instead of an array?

You can do something like this
tmp
  .where($"a" === 0.1)
  .take(1)
  .map { row =>
    Seq(row.getDouble(0), row.getDouble(1), row.getDouble(2))
  }
  .head
  .sortBy(d => -d)
  .take(2)
Or, if you have a large number of fields, you can take the schema and pattern match the row fields against the schema's data types, like this:
import org.apache.spark.sql.types._
val schemaWithIndex = tmp.schema.zipWithIndex
tmp
  .where($"a" === 0.1)
  .take(1)
  .map { row =>
    for {
      tuple <- schemaWithIndex
    } yield {
      val field = tuple._1
      val index = tuple._2
      field.dataType match {
        case DoubleType => row.getDouble(index)
      }
    }
  }
  .head
  .sortBy(d => -d)
  .take(2)
Maybe there is an easier way to do this.
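For instance, a minimal sketch (not from the answer above) that keeps the Z index column and recovers the column names by treating the row as a name -> value map; it assumes the question's tmp with the Z column and spark.implicits._ in scope:
import org.apache.spark.sql.functions.col

val row = tmp.filter(col("Z") === "D").head()
val z = row.getAs[String]("Z")
// Map of column name -> value for the numeric columns
val numeric = row.getValuesMap[Double](Seq("a", "b", "c"))

val top2 = numeric.toSeq.sortBy(-_._2).take(2)   // e.g. Seq(("c", 0.4), ("b", 0.3))
val min1 = numeric.minBy(_._2)                   // e.g. ("a", 0.1)

// Rebuild a one-row DataFrame with the surviving columns, Z included
val top2Df = Seq((z, top2(0)._2, top2(1)._2)).toDF("Z", top2(0)._1, top2(1)._1)
top2Df.show()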

Definitely not the fastest way, but it works straight from the DataFrame.
A more generic solution:
// somewhere in codebase
import spark.implicits._
import org.apache.spark.sql.{DataFrame, Encoder}
import org.apache.spark.sql.functions._

def transform[T, R : Encoder](ds: DataFrame, colsToSelect: Seq[String])(func: Map[String, T] => Map[String, R])
                             (implicit encoder: Encoder[Map[String, R]]): DataFrame = {
  ds.map(row => func(row.getValuesMap(colsToSelect)))
    .toDF()
    .select(explode(col("value")))
    .withColumn("idx", lit(1))
    .groupBy(col("idx")).pivot(col("key")).agg(first(col("value")))
    .drop("idx")
}
Now it's about working with a Map, where the key is a field name and the value is the field value.
def fuzzyStuff(values: Map[String, Any]): Map[String, String] = {
  val valueForA = values("a").asInstanceOf[Double]
  // Do whatever you want to do
  // ...
  // use a Map as the return type, where the key is a column name and the value is whatever you want
  Map("x" -> s"fuzzyA-$valueForA")
}

def maxN(n: Int)(values: Map[String, Double]): Map[String, Double] = {
  println(values)
  values.toSeq.sortBy(-_._2).take(n).toMap // sort by value (descending), not by column name
}
Usage:
val tmp = Seq((0.1,0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
transform(filtered, colsToSelect = Seq("a", "b", "c"))(maxN(2))
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+
transform(filtered, colsToSelect = Seq("a", "b", "c"))(fuzzyStuff)
.show()
+----------+
| x|
+----------+
|fuzzyA-0.1|
+----------+
Define max and min functions:
def maxN(values: Map[String, Double], n: Int): Map[String, Double] = {
  values.toSeq.sortBy(-_._2).take(n).toMap // sort by value (descending), not by column name
}
def min(values: Map[String, Double]): Map[String, Double] = {
  Map(values.minBy(_._2))
}
Create the dataset:
val tmp = Seq((0.1, 0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
Explode and pivot the map type:
val df = filtered.map(row => maxN(row.getValuesMap(Seq("a", "b", "c")), 2)).toDF()
val exploded = df.select(explode($"value"))
+---+-----+
|key|value|
+---+-----+
|  c|  0.4|
|  b|  0.3|
+---+-----+
//Then pivot
exploded.withColumn("idx", lit(1))
.groupBy($"idx").pivot($"key").agg(first($"value"))
.drop("idx")
.show()
+---+---+
| b| c|
+---+---+
|0.3|0.4|
+---+---+
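Note that this pivoted result drops the Z index column from the original question. One way around that (a sketch, not part of the answer above) is to include Z in colsToSelect and carry it through the map; since the map then mixes strings and doubles, everything is returned as String here, and "tmp" means the original DataFrame from the question (the one that still has the Z column):
def maxNKeepZ(n: Int)(values: Map[String, Any]): Map[String, String] = {
  val z = values("Z").toString
  val numeric = values.collect { case (k, v: Double) => k -> v }
  // keep Z plus the top-n numeric columns, all as strings
  Map("Z" -> z) ++ numeric.toSeq.sortBy(-_._2).take(n).map { case (k, v) => k -> v.toString }
}

transform(tmp.filter(col("Z") === "D"), colsToSelect = Seq("Z", "a", "b", "c"))(maxNKeepZ(2)).show()
// e.g.
// +---+---+---+
// |  Z|  b|  c|
// +---+---+---+
// |  D|0.3|0.4|
// +---+---+---+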

Related

How to make column from a list in scala/pyspark dataframe ? Error: The feature is not supported: literal for 'List()

I am practising adding a list into a dataframe column. I can define a udf, register it and then apply it on the dataframe, but I want to try a different approach: extract a list from a dataframe column, map over it, and then re-add it to the original dataframe as a new column.
val df = spark.createDataFrame(Seq(("A",1),("B",2),("C",3))).toDF("Str", "Num")
+---+---+
|Str|Num|
+---+---+
| A| 1|
| B| 2|
| C| 3|
+---+---+
list collected:
scala> var ls : List[String] = df.select("Str").collect().map(f=>f.getString(0)).toList
var ls: List[String] = List(A, B, C, d)
Transformation:
def f(x: String): String = {
  if (x == "A") { x + "100" }
  else { x + x.length.toString }
}
apply transformation:
scala> ls.map(x => f(x))
val res95: List[String] = List(A100, B1, C1, d1)
add column from the list: ERROR
import org.apache.spark.sql.functions.{lit,col}
df.withColumn("new", lit(ls)).show()
error: org.apache.spark.SparkRuntimeException: The feature is not supported: literal for 'List(A100, B1, C1)' of class scala.collection.immutable.$colon$colon.
//Please correct here
Create the udf:
val myUdf = udf { x: String =>
  if (x == "A") { x + "100" }
  else { x + x.length.toString }
}
and then apply it to the df:
df.withColumn("new", myUdf(col("Str")))
to add a new column from a List:
df.withColumn("fromListColumn", array(Seq("one", "two").map(lit(_)):_*))

Add new column containing an Array of column names sorted by the row-wise values

Given a DataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names, sorted in decreasing order of the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using a UDF is the simplest way to achieve custom tasks here.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => v.zip(names).sortBy(-_._1).map(_._2)) // sort by value, descending
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
Something like this can be an approach using Dataset:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])

def function1()(implicit spark: SparkSession) = {
  import spark.implicits._
  val df0: DataFrame =
    spark.createDataFrame(spark.sparkContext
      .parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
      StructType(Seq(StructField("a", IntegerType, false),
        StructField("b", IntegerType, false),
        StructField("c", IntegerType, false))))
  val df1 = df0
    .flatMap(row => Seq(Columns(row.getAs[Int]("a"),
      row.getAs[Int]("b"),
      row.getAs[Int]("c"),
      Array(Element("a", row.getAs[Int]("a")),
        Element("b", row.getAs[Int]("b")),
        Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
  df1
}

def main(args: Array[String]): Unit = {
  implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
  function1().show()
}
gives:
+---+---+---+---------+
|  a|  b|  c| elements|
+---+---+---+---------+
|  1|  2|  3|[c, b, a]|
|  4|  1|  3|[a, c, b]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
  column_map.toSeq.sortBy(-_._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c"))
  .withColumn("newcol", sorted_column_names($"column_map"))

Spark: reduce/aggregate by key

I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Using my own sample data, but here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map {
  case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
A better approach, avoiding groupByKey, is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // re-key as ((date, name), count) for reduceByKey
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
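A further sketch (my own variant, not from the answer above): the whole aggregation can also be done in a single reduceByKey pass by merging per-date maps of name -> count, at the cost of building a small Map per record:
val rddAlt = rdd1
  .map { case (date, (name, n)) => (date, Map(name -> n)) }
  .reduceByKey { (m1, m2) =>
    // merge the two count maps
    m2.foldLeft(m1) { case (acc, (name, n)) => acc + (name -> (acc.getOrElse(name, 0) + n)) }
  }
rddAlt.collect
// e.g. Array((d2,Map(E -> 1)), (d1,Map(A -> 2, B -> 1)))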
As I stated in the comments, it can be done with DataFrames for structured data, so run this below:
// The above should be enough.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer output above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns (some renaming and possibly sorting required):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
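As a final note (a sketch on my part, not in the answer above), the case-class UDF can be avoided entirely: the built-in struct() function builds the pair column natively, giving an equivalent result up to field names:
val dfD2 = dfB
  .groupBy(col("c1"))
  .agg(collect_list(struct(col("c2"), col("sum"))).alias("xy_list"))
dfD2.show()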

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks as follows:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
It is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.
What is the most efficient way to create the features DataFrame?
A little bit more DataFrame-centric solution:
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(Seq(
(1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
(2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
(3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
.select($"category")
.distinct.map(_.getString(0))
.collect
.sorted
// Prepare the vector assembler
val assembler = new VectorAssembler()
.setInputCols(categories)
.setOutputCol("features")
// Aggregation expressions
val exprs = categories.map(
c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))
val transformed = assembler
  .transform(df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
  .select($"userID", $"features")
and a UDAF alternative:
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, ArrayType, DoubleType, IntegerType}
import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import scala.collection.mutable.WrappedArray

class VectorAggregate(n: Int) extends UserDefinedAggregateFunction {
  def inputSchema = new StructType()
    .add("i", IntegerType)
    .add("v", DoubleType)
  def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
  def dataType = new VectorUDT()
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, Array.fill(n)(0.0))
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val i = input.getInt(0)
      val v = input.getDouble(1)
      val buff = buffer.getAs[WrappedArray[Double]](0)
      buff(i) += v
      buffer.update(0, buff)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val buff1 = buffer1.getAs[WrappedArray[Double]](0)
    val buff2 = buffer2.getAs[WrappedArray[Double]](0)
    for ((x, i) <- buff2.zipWithIndex) {
      buff1(i) += x
    }
    buffer1.update(0, buff1)
  }

  def evaluate(buffer: Row) = Vectors.dense(
    buffer.getAs[Seq[Double]](0).toArray)
}
with example usage:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("category_idx")
.fit(df)
val indexed = indexer.transform(df)
.withColumn("category_idx", $"category_idx".cast("integer"))
.withColumn("frequency", $"frequency".cast("double"))
val n = indexer.labels.size + 1
val transformed = indexed
.groupBy($"userID")
.agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))
transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
In this case order of values is defined by indexer.labels:
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
In practice I would prefer the solution by Odomontois, so these are provided mostly for reference.
Suppose:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
Where userId and frequency are bigint columns, which correspond to scala.Long.
We create an intermediate mapping RDD:
val catMaps = cats.rdd
  .groupBy(_.getAs[Long]("userId"))
  .map { case (id, rows) =>
    id -> rows
      .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
      .toMap
  }
Then we collect all present categories in lexicographic order:
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
Or create it manually:
val catNames = cs.broadcast(1 to 10 map {n => s"cat$n"} toArray)
Finally we transform the maps to arrays, with 0 values for non-existing categories:
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
now catArrays.show() prints something like
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This may not be the most elegant solution for DataFrames, as I am barely familiar with this area of Spark.
Note that you could create your catNames manually to add zeros for the missing cat3, cat5, ...
Also note that otherwise the catMaps RDD is operated on twice; you might want to .persist() it.
Given your input:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
Just run:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dfZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
Finally, use VectorAssembler to create an org.apache.spark.ml.linalg.Vector.
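A sketch of that final step (my own wiring; the vector slot order simply follows the pivoted columns):
import org.apache.spark.ml.feature.VectorAssembler

val featureCols = dfZeros.columns.filter(_ != "userID")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val features = assembler.transform(dfZeros).select("userID", "features")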
NOTE: I have not checked performances on this yet...
EDIT: Possibly more complex, but likely more efficient!
import org.apache.spark.ml.linalg.{Vector, Vectors}

def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
  (data: Seq[Row]) => {
    val indices = data.map(_.getDouble(0).toInt).toArray
    val values = data.map(_.getInt(1).toDouble).toArray
    Vectors.sparse(size, indices, values)
  }
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices
  .groupBy("userId")
  .agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
NOTE: StringIndexer will sort categories by frequency => most frequent category will be at index=0 in indexerModel.labels. Feel free to use your own mapping if you'd like and pass that directly to toSparseVectorUdf.
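For example, a sketch of such a custom mapping (alphabetical rather than frequency-based; catToIdx and the idx column name are just illustrative):
// build your own category -> index map (alphabetical here)
val catToIdx = df.select("category").distinct()
  .collect().map(_.getString(0)).sorted.zipWithIndex.toMap
val toIdx = udf((c: String) => catToIdx(c).toDouble)
val dataWithIdx = df.withColumn("idx", toIdx($"category"))
// then group / sort_array / collect_list exactly as above and call toSparseVectorUdf(catToIdx.size)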

Computing rank of a row

I want to rank user ids based on one field. For the same value of the field, the rank should be the same. The data is in a Hive table.
e.g.
user value
a 5
b 10
c 5
d 6
Rank
a - 1
c - 1
d - 3
b - 4
How can I do that?
It is possible to use the rank window function, either with the DataFrame API:
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy($"value")
val df = sc.parallelize(Seq(
("a", 5), ("b", 10), ("c", 5), ("d", 6)
)).toDF("user", "value")
df.select($"user", rank.over(w).alias("rank")).show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
or raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT user, RANK() OVER (ORDER BY value) AS rank FROM df").show
// +----+----+
// |user|rank|
// +----+----+
// | a| 1|
// | c| 1|
// | d| 3|
// | b| 4|
// +----+----+
but, since the window definition has no PARTITION BY clause, all data is moved to a single partition, so it is extremely inefficient.
You can also try to use the RDD API, but it is not exactly straightforward. First let's convert the DataFrame to an RDD:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.RangePartitioner
val rdd: RDD[(Int, String)] = df.select($"value", $"user")
.map{ case Row(value: Int, user: String) => (value, user) }
val partitioner = new RangePartitioner(rdd.partitions.size, rdd)
val sorted = rdd.repartitionAndSortWithinPartitions(partitioner)
Next we have to compute ranks per partition:
def rank(iter: Iterator[(Int, String)]) = {
  val zero = List((-1L, Integer.MIN_VALUE, "", 1L))
  def f(acc: List[(Long, Int, String, Long)], x: (Int, String)) =
    (acc.head, x) match {
      case (
        (prevRank: Long, prevValue: Int, _, offset: Long),
        (currValue: Int, label: String)) => {
        val newRank = if (prevValue == currValue) prevRank else prevRank + offset
        val newOffset = if (prevValue == currValue) offset + 1L else 1L
        (newRank, currValue, label, newOffset) :: acc
      }
    }
  iter.foldLeft(zero)(f).reverse.drop(1).map { case (rank, _, label, _) =>
    (rank, label)
  }.toIterator
}
val partRanks = sorted.mapPartitions(rank)
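A quick driver-side sanity check of the per-partition rank function (my own check; ranks are still 0-based here, the offsets and +1 are applied below):
rank(Iterator((5, "a"), (5, "c"), (6, "d"), (10, "b"))).toList
// e.g. List((0,a), (0,c), (2,d), (3,b))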
Next we compute the start offset for each partition:
def getOffsets(sorted: RDD[(Int, String)]) = sorted
  .mapPartitionsWithIndex((i: Int, iter: Iterator[(Int, String)]) =>
    Iterator((i, iter.size)))
  .collect
  .foldLeft(List((-1, 0)))((acc: List[(Int, Int)], x: (Int, Int)) =>
    (x._1, x._2 + acc.head._2) :: acc)
  .toMap
val offsets = sc.broadcast(getOffsets(sorted))
and the final ranks:
def adjust(i: Int, iter: Iterator[(Long, String)]) =
  iter.map { case (rank, label) => (rank + offsets.value(i - 1).toLong, label) }

val ranks = partRanks
  .mapPartitionsWithIndex(adjust)
  .map { case (i, label) => (1 + i, label) }
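Collecting should reproduce the expected ranking from the question (a and c → 1, d → 3, b → 4), e.g.:
ranks.collect.sortBy(_._2)
// e.g. Array((1,a), (4,b), (1,c), (3,d))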