Spark Scala row-wise average by handling null

Spark Scala row-wise average by handling null - scala

I've a dataframe with high volume of data and "n" number of columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to perform rowwise average keeping in mind not to consider the cell with "null" for averaging.
This needs to be implemented in Spark / Scala. I've tried to explain the same as in the attached image
What I have tried so far :
By referring - Calculate row mean, ignoring NAs in Spark Scala
val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(+) / lit(df.length) as "row_mean")
Where df_raw contains columns which needs to be aggregated (of course rowise). There are more than 80 columns. Arbitrarily they have data and null, count of Null needs to be ignored in the denominator while calculating average. It works fine, when all the column contain data, even a single Null in a column returns Null
Edit:
I've tried to adjust this answer by Terry Dactyl
def average(l: Seq[Double]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF.show(10,false)
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+

Spark >= 2.4
It is possible to use aggregate:
val row_mean = expr("""aggregate(
CAST(array(_1, _2, _3) AS array<double>),
-- Initial value
-- Note that aggregate is picky about types
CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>),
-- Merge function
(acc, x) -> (
acc.sum + coalesce(x, 0.0),
acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END),
-- Finalize function
acc -> acc.sum / acc.n)""")
Usage:
df.withColumn("row_mean", row_mean).show
Result:
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
Version independent
Compute sum and count of NOT NULL columns and divide one over another:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def row_mean(cols: Column*) = {
// Sum of values ignoring nulls
val sum = cols
.map(c => coalesce(c, lit(0)))
.foldLeft(lit(0))(_ + _)
// Count of not null values
val cnt = cols
.map(c => when(c.isNull, 0).otherwise(1))
.foldLeft(lit(0))(_ + _)
sum / cnt
}
Example data:
val df = Seq(
(None, None, None),
(Some(2.0), None, None),
(Some(50.0), Some(34.0), None),
(Some(1.0), Some(2.0), Some(3.0))
).toDF
Result:
df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+

def average(l: Seq[Integer]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Integer]))
val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])
Testing on the data you supplied gives the correct result:
val df = List(
(Some(10),Some(5), Some(5), None, None),
(None, Some(5), Some(5), None, Some(5)),
(Some(2), Some(8), Some(5), Some(1), Some(2)),
(None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")
Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])
Note if you have columns you do not want included make sure they are filtered when populating the array passed to the UDF.
Finally:
val df = List(
(Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
Which again is the correct result.
If you want to use Doubles...
def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[java.lang.Double]))
val df = List(
(Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])

Related

Dynamically apply aggregate function in SPARK data frame

How do I parameterize the below spark function. The groupBy and Pivot values are constant. I need to parameterize
var final_df_transpose=df_transpose.groupBy("_id").pivot("Type").agg(first("Value").alias("Value"),first("OType").alias("OType"),first("DateTime").alias("DateTime"))
Unable to pass alias as dynamically in the above scenario.
agg_Map scala.collection.mutable.Map[String,String] = Map( OType -> first, Type -> first, Value -> first, DateTime -> first)
var agg_Map = collection.mutable.Map[String, String]()
for (aggDataCol <- fin_agg_col) {
agg_Map1 += (aggDataCol -> "first")
}
df_transpose.groupBy("_id").pivot("Type").agg(agg_Map.toMap).show

I can think of two ways, but I'm not happy with either.
In the first, define your aggregates as a list of Columns. The annoying thing here is that to satisfy the method signature, you need to add a dummy column and then remove it after the aggregation:
scala> val in = spark.read.option("header", true).csv("""_id,Type,Value,OType,DateTime
| 0,a,b,c,d
| 1,aaa,bbb,ccc,ddd""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [_id: string, Type: string ... 3 more fields]
scala> in.show
+---+----+-----+-----+--------+
|_id|Type|Value|OType|DateTime|
+---+----+-----+-----+--------+
| 0| a| b| c| d|
| 1| aaa| bbb| ccc| ddd|
+---+----+-----+-----+--------+
scala> val aggColumns = Seq("Value", "OType", "DateTime").map{c => first(c).alias(c)}
aggColumns: Seq[org.apache.spark.sql.Column] = List(first(Value, false) AS `Value`, first(OType, false) AS `OType`, first(DateTime, false) AS `DateTime`)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(lit("dummy"), aggColumns : _*)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_dummy: string ... 7 more fields]
scala> df_intermediate.show
+---+-------+-------+-------+----------+---------+---------+---------+------------+
|_id|a_dummy|a_Value|a_OType|a_DateTime|aaa_dummy|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
| 0| dummy| b| c| d| dummy| null| null| null|
| 1| dummy| null| null| null| dummy| bbb| ccc| ddd|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
scala> val df_final = df_intermediate.drop(df_intermediate.schema.collect{case c if c.name.endsWith("_dummy") => c.name} : _*)
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_Value: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_Value|a_OType|a_DateTime|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| b| c| d| null| null| null|
| 1| null| null| null| bbb| ccc| ddd|
+---+-------+-------+----------+---------+---------+------------+
In the second, continue using a Map of agg expressions, then use a regex to find the renamed columns and change them back:
scala> val aggExprs = Map(("OType" -> "first"), ("Value" -> "first"), "DateTime" -> "first")
aggExprs: scala.collection.immutable.Map[String,String] = Map(OType -> first, Value -> first, DateTime -> first)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(aggExprs)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_first(OType, false): string ... 5 more fields]
scala> df_intermediate.show
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
|_id|a_first(OType, false)|a_first(Value, false)|a_first(DateTime, false)|aaa_first(OType, false)|aaa_first(Value, false)|aaa_first(DateTime, false)|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
scala> val regex = "^(.*)_first\\((.*), false\\)$".r
regex: scala.util.matching.Regex = ^(.*)_first\((.*), false\)$
scala> val replacements = df_intermediate.schema.collect{ case c if regex.findFirstMatchIn(c.name).isDefined =>
| val regex(pivotVal, colName) = c.name
| c.name -> s"${pivotVal}_$colName"
| }.toMap
replacements: scala.collection.immutable.Map[String,String] = Map(a_first(DateTime, false) -> a_DateTime, aaa_first(DateTime, false) -> aaa_DateTime, aaa_first(OType, false) -> aaa_OType, a_first(Value, false) -> a_Value, a_first(OType, false) -> a_OType, aaa_first(Value, false) -> aaa_Value)
scala> val df_final = replacements.foldLeft(df_intermediate){(df, c) => df.withColumnRenamed(c._1, c._2)}
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_OType: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_OType|a_Value|a_DateTime|aaa_OType|aaa_Value|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+-------+-------+----------+---------+---------+------------+
Take your pick, but both involve some unneeded steps.

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!

In this case, you can rely on the hashCode of the string. The hashcode will be the same if the input and datatype is same. Try this.
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>

If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert indvidual RDDs to DataFrames, and finally join the converted DataFrames for the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
//+----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using pyspark I have used the following code to get the above result. I could not rewrite such a code via scala. Do you have any suggestion? Thanking you in advance I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
# split string by ', ', drop duplicates and join back
words = re.split(',',row)
return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()

Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>

There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show

import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Nulls in spark.createDataset

I'm writing a unit test that the test data need to have some null values.
I tried putting the nulls straight in the tuples and I also tried using Options. It didn't work out.
Here is my code:
import sparkSession.implicits._
// Data set with null for even values
val sampleData = sparkSession.createDataset(Seq(
(1, Some("Yes"), None),
(2, None, None),
(3, Some("Okay"), None),
(4, None, None)))
.toDF("id", "title", "value")
Stack trace:
None.type (of class scala.reflect.internal.Types$UniqueSingleType)
scala.MatchError: None.type (of class scala.reflect.internal.Types$UniqueSingleType)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:472)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)

You should use None: Option[String] instead of None
scala> val maybeString = None: Option[String]
maybeString: Option[String] = None
scala> val sampleData = spark.createDataset(Seq(
| (1, Some("Yes"), maybeString),
| (2, maybeString, maybeString),
| (3, Some("Okay"), maybeString),
| (4, maybeString, maybeString))).toDF("id", "title", "value")
sampleData: org.apache.spark.sql.DataFrame = [id: int, title: string ... 1 more field]
scala> sampleData.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+

Or you can use: null.asInstanceOf[String] If you're just dealing with Strings
val df1 = sc.parallelize(Seq((1, "Yes", null.asInstanceOf[String]),
| (2, null.asInstanceOf[String], null.asInstanceOf[String]),
| (3, "Okay", null.asInstanceOf[String]),
| (4, null.asInstanceOf[String], null.asInstanceOf[String]))).toDF("id", "title", "value")
df1: org.apache.spark.sql.DataFrame = [id: int, title: string, value: string]
scala> df1.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+

Spark : Aggregating based on a column

I have a file consisting of 3 fields (Emp_ids, Groups, Salaries)
100 A 430
101 A 500
201 B 300
I want to get result as
1) Group name and count(*)
2) Group name and max( salary)
val myfile = "/home/hduser/ScalaDemo/Salary.txt"
val conf = new SparkConf().setAppName("Salary").setMaster("local[2]")
val sc= new SparkContext( conf)
val sal= sc.textFile(myfile)

Scala DSL:
case class Data(empId: Int, group: String, salary: Int)
val df = sqlContext.createDataFrame(lst.map {v =>
val arr = v.split(' ').map(_.trim())
Data(arr(0).toInt, arr(1), arr(2).toInt)
})
df.show()
+-----+-----+------+
|empId|group|salary|
+-----+-----+------+
| 100| A| 430|
| 101| A| 500|
| 201| B| 300|
+-----+-----+------+
df.groupBy($"group").agg(count("*") as "count").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
df.groupBy($"group").agg(max($"salary") as "maxSalary").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
Or with plain SQL:
df.registerTempTable("salaries")
sqlContext.sql("select group, count(*) as count from salaries group by group").show()
+-----+-----+
|group|count|
+-----+-----+
| A| 2|
| B| 1|
+-----+-----+
sqlContext.sql("select group, max(salary) as maxSalary from salaries group by group").show()
+-----+---------+
|group|maxSalary|
+-----+---------+
| A| 500|
| B| 300|
+-----+---------+
While Spark SQL is recommended way to do such aggregations due to performance reasons, it can be easily done with RDD API:
val rdd = sc.parallelize(Seq(Data(100, "A", 430), Data(101, "A", 500), Data(201, "B", 300)))
rdd.map(v => (v.group, 1)).reduceByKey(_ + _).collect()
res0: Array[(String, Int)] = Array((B,1), (A,2))
rdd.map(v => (v.group, v.salary)).reduceByKey((s1, s2) => if (s1 > s2) s1 else s2).collect()
res1: Array[(String, Int)] = Array((B,300), (A,500))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark Scala row-wise average by handling null - scala

Related

Dynamically apply aggregate function in SPARK data frame

How to assign keys to items in a column in Scala?

drop duplicate words in long string using scala

Nulls in spark.createDataset

Spark : Aggregating based on a column

Categories

Resources