Spark reduce and aggregate on same data-set

Spark reduce and aggregate on same data-set - scala

I have a text file which I read and then split using the split operation. This results in an RDD with Array(A, B, C, D, E, F, G, H, I).
I would like to find max(F) - min(G) for every key E (reduce by key E). Then I want to combine the resulting values by key C and concatenate this sum result for every row with the same key.
For example:
+--+--+--+--+
| C| E| F| G|
+--+--+--+--+
|en| 1| 3| 1|
|en| 1| 4| 0|
|nl| 2| 1| 1|
|nl| 2| 5| 2|
|nl| 3| 9| 3|
|nl| 3| 6| 4|
|en| 4| 9| 1|
|en| 4| 2| 1|
+-----------+
Should result in
+--+--+-------------+---+
| C| E|max(F)-min(G)|sum|
+--+--+-------------+---+
|en| 1| 4 |12 |
|nl| 2| 4 |10 |
|nl| 3| 6 |10 |
|en| 4| 8 |12 |
+--+--+-------------+---+
What would be the best way to tackle this? Currently I am trying to perform the max(F)-min(G) by running
val maxCounts = logEntries.map(line => (line(4), line(5).toLong)).reduceByKey((x, y) => math.max(x, y))
val minCounts = logEntries.map(line => (line(4), line(6).toLong)).reduceByKey((x, y) => math.min(x, y))
val maxMinCounts = maxCounts.join(minCounts).map{ case(id, maxmin) => (id, (maxmin._1 - maxmin._2)) }
And then join the resulting RDDs. However, this becomes tricky when I also want to sum these values and append them to my existing data set.
I would love to hear any suggestions!

This kind of logic is easily implemented in the dataframe API (also). But you need to explicitly form your columns from the array:
val window = Window.partitionBy('C)
val df = rdd
.map { case Array(_, _, c, _, e, f, g, _, _) => (c,e,f,g) }
.toDF("C","E","F","G")
.groupBy('C,'E)
.agg((max('F) - min('G)).as("diff"))
.withColumn("sum",sum('diff).over(window))

assuming, like your sample data, that unique E's never span multiple C's... you could do something like this.
import math.{max,min}
case class FG(f: Int, g: Int) {
def combine(that: FG) =
FG(max(f, that.f), min(g, that.g))
def result = f - g
}
val result = {
rdd
.map{ case Array(_, _, c, _, e, f, g, _, _) =>
((c, e), FG(f, g)) }
.reduceByKey(_ combine _)
.map{ case ((c, _), fg) =>
(c, fg.result) }
.reduceByKey(_+_)
}

Related

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and pick the two most occurring values from the remaining columns for each main value
I did the following
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the most occurring two items from the list as two columns. Dunno how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
null should not be counted and the final values should be null only if there are no other values to fill
Is this the best way to do things ? Is there a better way of doing it ?

You can use an udf to select the two values from the array that occur the most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
array
.filter(_ != null) //remove null values
.groupBy(identity).mapValues(_.length) //count occurences of each value
.toSeq.sortWith(_._2 > _._2) //sort (3)
.map(_._1).take(2) //return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
using an udf here means that the whole array with all entries for a single value of main has to fit into the memory of one Spark executor
cache is required here or the the udf will be called twice, once for value1 and once for value2
the sortWith here is stable but it might be necessary to add some extra logic to handle the situation if two elements have the same number of occurences (like i, j and h for the main value XX)

Here is my try without udf.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values))
.groupBy('main, 'values).agg(count('values).as("count"))
.filter("values is not null")
.withColumn("target", concat(lit("value"), lit(row_number().over(w))))
.filter("target < 'value3'")
.groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has the null value because I have modified your dataframe in this way,
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+

How to retain the column structure of a Spark Dataframe following a map operation on rows

I am trying to apply a function to each row of a Spark DataFrame, as in the example.
val df = sc.parallelize(
Seq((1, 2, 0), (0, 0, 1), (0, 0, 0))).toDF("x", "y", "z")
df.show()
which yields
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 0| 0| 1|
| 0| 0| 0|
+---+---+---+
Suppose I want to do something to the values in each row, for example changing 0 to 5.
val b = df.map(row => row.toSeq.map(x => x match{
case 0 => 5
case x: Int => x
}))
b.show()
+---------+
| value|
+---------+
|[1, 2, 5]|
|[5, 5, 1]|
|[5, 5, 5]|
+---------+
The function worked, but I now have one column whose entries are Lists, instead of 3 columns of Ints. I would like my named columns back.

You can define an UDF to apply this substitution. For example:
def subsDef(k: Int): Int = if(k==0) 5 else k
val subs = udf[Int, Int](subsDef)
Then you can apply the UDF to a specific column or, if you desire, to every columns of the DF:
// to a single column, for example "x"
df = df.withColumn("x", subs(col("x")))
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 0|
| 5| 0| 1|
| 5| 0| 0|
+---+---+---+
// to every columns of DF
df.columns.foreach(c => {
df = df.withColumn(c, subs(col(c)))
})
df.show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 5|
| 5| 5| 1|
| 5| 5| 5|
+---+---+---+

Rather than transforming the DataFrame row-wise, consider using built-in Spark API function when/otherwise, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, 2, 0), (0, 0, 1), (0, 0, 0)).toDF("x", "y", "z")
val vFrom = 0
val vTo = 5
val cols = df.columns // Filter for specific columns if necessary
df.select( cols.map( c =>
when(col(c) === vFrom, vTo).otherwise(col(c)).as(c)
): _*
).show
// +---+---+---+
// | x| y| z|
// +---+---+---+
// | 1| 2| 5|
// | 5| 5| 1|
// | 5| 5| 5|
// +---+---+---+

There are various ways to do it here are some:
df.map(row => {
val size = row.size
var seq: Seq[Int] = Seq.empty[Int]
for (a <- 0 to size - 1) {
val value: Int = row(a).asInstanceOf[Int]
val newVal: Int = value match {
case 0 =>
5
case _ =>
value
}
seq = seq :+ newVal
}
Row.fromSeq(seq)
})(RowEncoder.apply(df.schema))
val columns = df.columns
df.select(
columns.map(c => when(col(c) === 0, 5).otherwise(col(c)).as(c)): _*)
.show()
def fun: (Int => Int) = { x =>
if (x == 0) 5 else x
}
val function = udf(fun)
df.select(function(col("x")).as("x"),
function(col("y")).as("y"),
function(col("z")).as("z"))
.show()
def checkZero(a: Int): Int = if (a == 0) 5 else a
df.map {
case Row(a: Int, b: Int, c: Int) =>
Row(checkZero(a), checkZero(b), checkZero(c))
} { RowEncoder.apply(df.schema) }
.show()

Rank per row over multiple columns in Spark Dataframe

I am using spark with Scala to transform a Dataframe , where I would like to compute a new variable which calculates the rank of one variable per row within many variables.
Example -
Input DF-
+---+---+---+
|c_0|c_1|c_2|
+---+---+---+
| 11| 11| 35|
| 22| 12| 66|
| 44| 22| 12|
+---+---+---+
Expected DF-
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 11| 35| 2| 3| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
This has aleady been answered using R - Rank per row over multiple columns in R,
but I need to do the same in spark-sql using scala. Thanks for the Help!
Edit- 4/1 . Encountered one scenario where if the values are same the ranks should be different. Editing first row for replicating the situation.

If I understand correctly, you want to have the rank of each column, within each row.
Let's first define the data, and the columns to "rank".
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
val cols = df.columns
Then we define a UDF that finds the index of an element in an array.
val pos = udf((a : Seq[Int], elt : Int) => a.indexOf(elt)+1)
We finally create a sorted array (in descending order) and use the UDF to find the rank of each column.
val ranks = cols.map(c => pos(col("array"), col(c)).as(c+"_rank"))
df.withColumn("array", sort_array(array(cols.map(col) : _*), false))
.select((cols.map(col)++ranks) :_*).show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 12| 35| 3| 2| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
EDIT:
As of Spark 2.4, the pos UDF that I defined can be replaced by the built in function array_position(column: Column, value: Any) that works exactly the same way (the first index is 1). This avoids using UDFs that can be slightly less efficient.
EDIT2:
The code above will generate duplicated indices in case you have duplidated keys. If you want to avoid it, you can create the array, zip it to remember which column is which, sort it and zip it again to get the final rank. It would look like this:
val colMap = df.columns.zipWithIndex.map(_.swap).toMap
val zip = udf((s: Seq[Int]) => s
.zipWithIndex
.sortBy(-_._1)
.map(_._2)
.zipWithIndex
.toMap
.mapValues(_+1))
val ranks = (0 until cols.size)
.map(i => 'zip.getItem(i) as colMap(i) + "_rank")
val result = df
.withColumn("zip", zip(array(cols.map(col) : _*)))
.select(cols.map(col) ++ ranks :_*)

One way to go about this would be to use windows.
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
d.withColumn(column+"_rank", rank() over Window.orderBy(desc(column))))
.show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 22| 12| 66| 2| 3| 1|
| 11| 21| 35| 3| 2| 2|
| 44| 22| 12| 1| 1| 3|
+---+---+---+--------+--------+--------+
But this is not a good idea. All the data will end up in one partition which will cause an OOM error if all the data does not fit inside one executor.
Another way would require to sort the dataframe three times, but at least that would scale to any size of data.
Let's define a function that zips a dataframe with consecutive indices (it exists for RDDs but not for dataframes)
def zipWithIndex(df : DataFrame, name : String) : DataFrame = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ (i+1)) }
val newSchema = df.schema.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
And let's use it on the same dataframe df:
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
zipWithIndex(d.orderBy(desc(column)), column+"_rank"))
.show
which provides the exact same result as above.

You could probably create a window function. Do note that this is susceptible to OOM if you have too much data. But, I just wanted to introduce to the concept of window functions here.
inputDF.createOrReplaceTempView("my_df")
val expectedDF = spark.sql("""
select
c_0
, c_1
, c_2
, rank(c_0) over (order by c_0 desc) c_0_rank
, rank(c_1) over (order by c_1 desc) c_1_rank
, rank(c_2) over (order by c_2 desc) c_2_rank
from my_df""")
expectedDF.show()
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 44| 22| 12| 3| 3| 1|
| 11| 21| 35| 1| 2| 2|
| 22| 12| 66| 2| 1| 3|
+---+---+---+--------+--------+--------+

How to get the last row from DataFrame?

I hava a DataFrame,the DataFrame hava two column 'value' and 'timestamp',,the 'timestmp' is ordered,I want to get the last row of the DataFrame,what should I do?
this is my input:
+-----+---------+
|value|timestamp|
+-----+---------+
| 1| 1|
| 4| 2|
| 3| 3|
| 2| 4|
| 5| 5|
| 7| 6|
| 3| 7|
| 5| 8|
| 4| 9|
| 18| 10|
+-----+---------+
this is my code:
val arr = Array((1,1),(4,2),(3,3),(2,4),(5,5),(7,6),(3,7),(5,8),(4,9),(18,10))
var df=m_sparkCtx.parallelize(arr).toDF("value","timestamp")
this is my expected result:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+

Try this, it works for me.
df.orderBy($"value".desc).show(1)

I would use simply the query that
- orders your table by descending order
- takes 1st value from this order
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY value DESC limit 1"""
latest_rec = self.sqlContext.sql(query_latest_rec)
latest_rec.show()

I'd simply reduce:
df.reduce { (x, y) =>
if (x.getAs[Int]("timestamp") > y.getAs[Int]("timestamp")) x else y
}

The most efficient way is to reduce your DataFrame. This gives you a single row which you can convert back to a DataFrame, but as it contains only 1 record, this does not make much sense.
sparkContext.parallelize(
Seq(
df.reduce {
(a, b) => if (a.getAs[Int]("timestamp") > b.getAs[Int]("timestamp")) a else b
} match {case Row(value:Int,timestamp:Int) => (value,timestamp)}
)
)
.toDF("value","timestamp")
.show
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
Less efficient (as it needs shuffling) although shorter is this solution:
df
.where($"timestamp" === df.groupBy().agg(max($"timestamp")).map(_.getInt(0)).collect.head)

If your timestamp column is unique and is in increasing order then there are following ways to get the last row
println(df.sort($"timestamp", $"timestamp".desc).first())
// Output [1,1]
df.sort($"timestamp", $"timestamp".desc).take(1).foreach(println)
// Output [1,1]
df.where($"timestamp" === df.count()).show
Output:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
If not create a new column with the index and select the last index as below
val df1 = spark.sqlContext.createDataFrame(
df.rdd.zipWithIndex.map {
case (row, index) => Row.fromSeq(row.toSeq :+ index)
},
StructType(df.schema.fields :+ StructField("index", LongType, false)))
df1.where($"timestamp" === df.count()).drop("index").show
Output:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+

Java:
Dataset<Row> sortDF = inputDF.orderBy(org.apache.spark.sql.functions.col(config.getIncrementingColumn()).desc());
Row row = sortDF.first()

You can also use this function desc: Column desc(String columnName)
df.orderBy(desc("value")).show(1)
which gives same result as
df.orderBy($"value".desc).show(1)

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the third did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?

You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+

First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns of the same label (e.g., "prod1"), I found it easier (at least for me) to operate on the RDD level. We fist transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{
x =>
(x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{
case(i,d) =>
def checkAllEqual(l: Seq[Double]) = if(l.toSet.size == 1) 0 else 1
val g = d.grouped(3).toList
val g1 = checkAllEqual(g.map(x => x(0)))
val g2 = checkAllEqual(g.map(x => x(1)))
val g3 = checkAllEqual(g.map(x => x(2)))
(i, g1,g2,g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark reduce and aggregate on same data-set - scala

Related

Collect most occurring unique values across columns after a groupby in Spark

How to retain the column structure of a Spark Dataframe following a map operation on rows

Rank per row over multiple columns in Spark Dataframe

How to get the last row from DataFrame?

How to combine (join) information across an Array[DataFrame]

Categories

Resources