How to get max value of corresponding item in many arrays in DataFrame column? - scala

A DataFrame as following:
import spark.implicits._
val df1 = List(
("id1", Array(0,2)),
("id1",Array(2,1)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+
I want to groupBy id to get max pooling of every value array. Max id1 value is Array(2,2). The result I want to get is:
import spark.implicits._
val res = List(
("id1", Array(2,2)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+

import spark.implicits._
val df1 = List(
("id1", Array(0,2,3)),
("id1",Array(2,1,4)),
("id2",Array(0,7,3))
).toDF("id", "value")
val df2rdd = df1.rdd
.map(x => (x(0).toString,x.getSeq[Int](1)))
.reduceByKey((x,y) => {
val arrlength = x.length
var i = 0
val resarr = scala.collection.mutable.ArrayBuffer[Int]()
while(i < arrlength){
if (x(i) >= y(i)){
resarr.append(x(i))
} else {
resarr.append(y(i))
}
i += 1
}
resarr
}).toDF("id","newvalue")

You can do like below
//Input df
+---+---------+
| id| value|
+---+---------+
|id1|[0, 2, 3]|
|id1|[2, 1, 4]|
|id2|[0, 7, 3]|
+---+---------+
//Solution approach:
import org.apache.spark.sql.functions.udf
val df1=df.groupBy("id").agg(collect_set("value").as("value"))
val maxUDF = udf{(s:Seq[Seq[Int]])=>s.reduceLeft((prev,next)=>prev.zip(next).map(tup=>if(tup._1>tup._2) tup._1 else tup._2))}
df1.withColumn("value",maxUDF(df1.col("value"))).show
//Sample Output:
+---+---------+
| id| value|
+---+---------+
|id1|[2, 2, 4]|
|id2|[0, 7, 3]|
+---+---------+
I hope, this will help you.

Related

Scala - Return the largest string within each group

DataSet:
+---+--------+
|age| name|
+---+--------+
| 33| Will|
| 26|Jean-Luc|
| 55| Hugh|
| 40| Deanna|
| 68| Quark|
| 59| Weyoun|
| 37| Gowron|
| 54| Will|
| 38| Jadzia|
| 27| Hugh|
+---+--------+
Here is my attempt but it just returns the size of the largest string rather than the largest string:
AgeName.groupBy("age")
.agg(max(length(AgeName("name")))).show()
The usual row_number trick should work if you specify the Window correctly. Using #LeoC's example,
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
val df2 = df.withColumn(
"rownum",
expr("row_number() over (partition by age order by length(name) desc)")
).filter("rownum = 1").drop("rownum")
df2.show
+---+---------+
|age| name|
+---+---------+
| 22|Alexander|
| 35| Michelle|
+---+---------+
Here's one approach using Spark higher-order function, aggregate, as shown below:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
df.
groupBy("age").agg(collect_list("name").as("names")).
withColumn(
"longest_name",
expr("aggregate(names, '', (acc, x) -> case when length(acc) < length(x) then x else acc end)")
).
show(false)
// +---+----------------------------+------------+
// |age|names |longest_name|
// +---+----------------------------+------------+
// |22 |[Jennifer, Alexander, Celia]|Alexander |
// |35 |[John, Michelle] |Michelle |
// +---+----------------------------+------------+
Note that higher-order functions are available only on Spark 2.4+.
object BasicDatasetTest {
def main(args: Array[String]): Unit = {
val spark=SparkSession.builder()
.master("local[*]")
.appName("BasicDatasetTest")
.getOrCreate()
val pairs=List((33,"Will"),(26,"Jean-Luc"),
(55, "Hugh"),
(26, "Deanna"),
(26, "Quark"),
(55, "Weyoun"),
(33, "Gowron"),
(55, "Will"),
(26, "Jadzia"),
(27, "Hugh"))
val schema=new StructType(Array(
StructField("age",IntegerType,false),
StructField("name",StringType,false))
)
val dataRDD=spark.sparkContext.parallelize(pairs).map(record=>Row(record._1,record._2))
val dataset=spark.createDataFrame(dataRDD,schema)
val ageNameGroup=dataset.groupBy("age","name")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageNameGroup.printSchema()
val ageGroup=dataset.groupBy("age")
.agg(max(length(col("name"))))
.withColumnRenamed("max(length(name))","length")
ageGroup.printSchema()
ageGroup.createOrReplaceTempView("age_group")
ageNameGroup.createOrReplaceTempView("age_name_group")
spark.sql("select ag.age,ang.name from age_group as ag, age_name_group as ang " +
"where ag.age=ang.age and ag.length=ang.length")
.show()
}
}

Evaluate formulas in Spark DataFrame

Is it possible to evaluate formulas in a dataframe which refer to columns? e.g. if I have data like this (Scala example):
val df = Seq(
( 1, "(a+b)/d", 1, 20, 2, 3, 1 ),
( 2, "(c+b)*(a+e)", 0, 1, 2, 3, 4 ),
( 3, "a*(d+e+c)", 7, 10, 6, 2, 1 )
)
.toDF( "Id", "formula", "a", "b", "c", "d", "e" )
df.show()
Expected results:
I have been unable to get selectExpr, expr, eval() or combinations of them to work.
You can use the scala toolbox eval in a UDF:
import org.apache.spark.sql.functions.col
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val cols = df.columns.tail
val eval_udf = udf(
(r: Seq[String]) =>
tb.eval(tb.parse(
("val %s = %s;" * cols.tail.size).format(
cols.tail.zip(r.tail).flatMap(x => List(x._1, x._2)): _*
) + r(0)
)).toString
)
val df2 = df.select(col("id"), eval_udf(array(df.columns.tail.map(col):_*)).as("result"))
df2.show
+---+------+
| id|result|
+---+------+
| 1| 7|
| 2| 12|
| 3| 63|
+---+------+
A slightly different version of mck's answer, by replacing the variables in the formula column by their corresponding values from the other columns then calling eval udf :
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val eval = udf((f: String) => {
val toolbox = currentMirror.mkToolBox()
toolbox.eval(toolbox.parse(f)).toString
})
val formulaExpr = expr(df.columns.drop(2).foldLeft("formula")((acc, c) => s"replace($acc, '$c', $c)"))
df.select($"Id", eval(formulaExpr).as("result")).show()
//+---+------+
//| Id|result|
//+---+------+
//| 1| 7|
//| 2| 12|
//| 3| 63|
//+---+------+

How write code that creates a Dataset with columns that have the elements of an array column as values and their names being positions?

Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is how look Input:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is how look Expected:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried use code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I have errors, because I need some :Unit
Another way to create a dataframe ---
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
OUTPUT---------
+----+---------+
|col2| col4|
+----+---------+
| 1|[4, 2, 1]|
| 4| [3, 2]|
+----+---------+
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object ArrayToCol extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")
val d = inptDf
.withColumn("0", col("value").getItem(0))
.withColumn("1", col("value").getItem(1))
.withColumn("2", col("value").getItem(2))
.drop("value")
d.show(false)
}
// Variant 2
val res = inptDf.select(
$"value".getItem(0).as("col0"),
$"value".getItem(1).as("col1"),
$"value".getItem(2).as("col2")
)
// Variant 3
val res1 = inptDf.select(
col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*
)
.drop("value")

Cumulative product in Spark

I try to implement a cumulative product in Spark Scala, but I really don't know how to it. I have the following dataframe:
Input data:
+--+--+--------+----+
|A |B | date | val|
+--+--+--------+----+
|rr|gg|20171103| 2 |
|hh|jj|20171103| 3 |
|rr|gg|20171104| 4 |
|hh|jj|20171104| 5 |
|rr|gg|20171105| 6 |
|hh|jj|20171105| 7 |
+-------+------+----+
And I would like to have the following output:
Output data:
+--+--+--------+-----+
|A |B | date | val |
+--+--+--------+-----+
|rr|gg|20171105| 48 | // 2 * 4 * 6
|hh|jj|20171105| 105 | // 3 * 5 * 7
+-------+------+-----+
As long as the number are strictly positive (0 can be handled as well, if present, using coalesce) as in your example, the simplest solution is to compute the sum of logarithms and take the exponential:
import org.apache.spark.sql.functions.{exp, log, max, sum}
val df = Seq(
("rr", "gg", "20171103", 2), ("hh", "jj", "20171103", 3),
("rr", "gg", "20171104", 4), ("hh", "jj", "20171104", 5),
("rr", "gg", "20171105", 6), ("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")
val result = df
.groupBy("A", "B")
.agg(
max($"date").as("date"),
exp(sum(log($"val"))).as("val"))
Since this uses FP arithmetic the result won't be exact:
result.show
+---+---+--------+------------------+
| A| B| date| val|
+---+---+--------+------------------+
| hh| jj|20171105|104.99999999999997|
| rr| gg|20171105|47.999999999999986|
+---+---+--------+------------------+
but after rounding should good enough for majority of applications.
result.withColumn("val", round($"val")).show
+---+---+--------+-----+
| A| B| date| val|
+---+---+--------+-----+
| hh| jj|20171105|105.0|
| rr| gg|20171105| 48.0|
+---+---+--------+-----+
If that's not enough you can define an UserDefinedAggregateFunction or Aggregator (How to define and use a User-Defined Aggregate Function in Spark SQL?) or use functional API with reduceGroups:
import scala.math.Ordering
case class Record(A: String, B: String, date: String, value: Long)
df.withColumnRenamed("val", "value").as[Record]
.groupByKey(x => (x.A, x.B))
.reduceGroups((x, y) => x.copy(
date = Ordering[String].max(x.date, y.date),
value = x.value * y.value))
.toDF("key", "value")
.select($"value.*")
.show
+---+---+--------+-----+
| A| B| date|value|
+---+---+--------+-----+
| hh| jj|20171105| 105|
| rr| gg|20171105| 48|
+---+---+--------+-----+
You can solve this using either collect_list+UDF or an UDAF. UDAF may be more efficient, but harder to implement due to the local aggregation.
If you have a dataframe like this :
+---+---+
|key|val|
+---+---+
| a| 1|
| a| 2|
| a| 3|
| b| 4|
| b| 5|
+---+---+
You can invoke an UDF :
val prod = udf((vals:Seq[Int]) => vals.reduce(_ * _))
df
.groupBy($"key")
.agg(prod(collect_list($"val")).as("val"))
.show()
+---+---+
|key|val|
+---+---+
| b| 20|
| a| 6|
+---+---+
Since Spark 2.4, you could also compute this using the higher order function aggregate:
import org.apache.spark.sql.functions.{expr, max}
val df = Seq(
("rr", "gg", "20171103", 2),
("hh", "jj", "20171103", 3),
("rr", "gg", "20171104", 4),
("hh", "jj", "20171104", 5),
("rr", "gg", "20171105", 6),
("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")
val result = df
.groupBy("A", "B")
.agg(
max($"date").as("date"),
expr("""
aggregate(
collect_list(val),
cast(1 as bigint),
(acc, x) -> acc * x)""").alias("val")
)
Spark 3.2+
product(e: Column): Column
Aggregate function: returns the product of all numerical elements in a group.
Scala
import spark.implicits._
var df = Seq(
("rr", "gg", 20171103, 2),
("hh", "jj", 20171103, 3),
("rr", "gg", 20171104, 4),
("hh", "jj", 20171104, 5),
("rr", "gg", 20171105, 6),
("hh", "jj", 20171105, 7)
).toDF("A", "B", "date", "val")
df = df.groupBy("A", "B").agg(max($"date").as("date"), product($"val").as("val"))
df.show(false)
// +---+---+--------+-----+
// |A |B |date |val |
// +---+---+--------+-----+
// |hh |jj |20171105|105.0|
// |rr |gg |20171105|48.0 |
// +---+---+--------+-----+
PySpark
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data = [('rr', 'gg', 20171103, 2),
('hh', 'jj', 20171103, 3),
('rr', 'gg', 20171104, 4),
('hh', 'jj', 20171104, 5),
('rr', 'gg', 20171105, 6),
('hh', 'jj', 20171105, 7)]
df = spark.createDataFrame(data, ['A', 'B', 'date', 'val'])
df = df.groupBy('A', 'B').agg(F.max('date').alias('date'), F.product('val').alias('val'))
df.show()
#+---+---+--------+-----+
#| A| B| date| val|
#+---+---+--------+-----+
#| hh| jj|20171105|105.0|
#| rr| gg|20171105| 48.0|
#+---+---+--------+-----+

MinMax Normalization in scala

I have an org.apache.spark.sql.DataFrame with multiple columns. I want to scale 1 column (lat_long_dist) using MinMax Normalization or any technique to scale the data between -1 and 1 and retain the data type as org.apache.spark.sql.DataFrame
scala> val df = sqlContext.csvFile("tenop.csv")
df: org.apache.spark.sql.DataFrame = [gst_id_matched: string,
ip_crowding: string, lat_long_dist: double, stream_name_1: string]
I found the StandardScaler option but that requires to transform the dataset before I can do the transformation.Is there a simple clean way.
Here's another suggestion when you are already playing with Spark.
Why don't you use MinMaxScaler in ml package?
Let's try this with the same example from zero323.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.sql.functions.udf
val df = sc.parallelize(Seq(
(1L, 0.5), (2L, 10.2), (3L, 5.7), (4L, -11.0), (5L, 22.3)
)).toDF("k", "v")
//val df.map(r => Vectors.dense(Array(r.getAs[Double]("v"))))
val vectorizeCol = udf( (v:Double) => Vectors.dense(Array(v)) )
val df2 = df.withColumn("vVec", vectorizeCol(df("v"))
val scaler = new MinMaxScaler()
.setInputCol("vVec")
.setOutputCol("vScaled")
.setMax(1)
.setMin(-1)
scaler.fit(df2).transform(df2).show
+---+-----+-------+--------------------+
| k| v| vVec| vScaled|
+---+-----+-------+--------------------+
| 1| 0.5| [0.5]|[-0.3093093093093...|
| 2| 10.2| [10.2]|[0.27327327327327...|
| 3| 5.7| [5.7]|[0.00300300300300...|
| 4|-11.0|[-11.0]| [-1.0]|
| 5| 22.3| [22.3]| [1.0]|
+---+-----+-------+--------------------+
Take advantage of scaling multiple columns at once.
val df = sc.parallelize(Seq(
(1.0, -1.0, 2.0),
(2.0, 0.0, 0.0),
(0.0, 1.0, -1.0)
)).toDF("a", "b", "c")
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("a", "b", "c"))
.setOutputCol("features")
val df2 = assembler.transform(df)
// Reusing the scaler instance above with the same min(-1) and max(1)
scaler.setInputCol("features").setOutputCol("scaledFeatures").fit(df2).transform(df2).show
+---+----+----+--------------+--------------------+
| a| b| c| features| scaledFeatures|
+---+----+----+--------------+--------------------+
|1.0|-1.0| 2.0|[1.0,-1.0,2.0]| [0.0,-1.0,1.0]|
|2.0| 0.0| 0.0| [2.0,0.0,0.0]|[1.0,0.0,-0.33333...|
|0.0| 1.0|-1.0|[0.0,1.0,-1.0]| [-1.0,1.0,-1.0]|
+---+----+----+--------------+--------------------+
I guess what you want is something like this
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{min, max, lit}
val df = sc.parallelize(Seq(
(1L, 0.5), (2L, 10.2), (3L, 5.7), (4L, -11.0), (5L, 22.3)
)).toDF("k", "v")
val (vMin, vMax) = df.agg(min($"v"), max($"v")).first match {
case Row(x: Double, y: Double) => (x, y)
}
val scaledRange = lit(2) // Range of the scaled variable
val scaledMin = lit(-1) // Min value of the scaled variable
val vNormalized = ($"v" - vMin) / (vMax - vMin) // v normalized to (0, 1) range
val vScaled = scaledRange * vNormalized + scaledMin
df.withColumn("vScaled", vScaled).show
// +---+-----+--------------------+
// | k| v| vScaled|
// +---+-----+--------------------+
// | 1| 0.5| -0.3093093093093092|
// | 2| 10.2| 0.27327327327327344|
// | 3| 5.7|0.003003003003003...|
// | 4|-11.0| -1.0|
// | 5| 22.3| 1.0|
// +---+-----+--------------------+
there is another solution. Take codes from Matt, Lyle and zero323, thanks!
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
val df = sc.parallelize(Seq(
(1L, 0.5), (2L, 10.2), (3L, 5.7), (4L, -11.0), (5L, 22.3)
)).toDF("k", "v")
val assembler = new VectorAssembler().setInputCols(Array("v")).setOutputCol("vVec")
val df2= assembler.transform(df)
val scaler = new MinMaxScaler().setInputCol("vVec").setOutputCol("vScaled").setMax(1).setMin(-1)
scaler.fit(df2).transform(df2).show
result:
+---+-----+-------+--------------------+
| k| v| vVec| vScaled|
+---+-----+-------+--------------------+
| 1| 0.5| [0.5]|[-0.3093093093093...|
| 2| 10.2| [10.2]|[0.27327327327327...|
| 3| 5.7| [5.7]|[0.00300300300300...|
| 4|-11.0|[-11.0]| [-1.0]|
| 5| 22.3| [22.3]| [1.0]|
+---+-----+-------+--------------------+
btw: the other solutions produce error on my side
java.lang.IllegalArgumentException: requirement failed: Column vVec must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:43)
at org.apache.spark.ml.feature.MinMaxScalerParams$class.validateAndTransformSchema(MinMaxScaler.scala:67)
at org.apache.spark.ml.feature.MinMaxScaler.validateAndTransformSchema(MinMaxScaler.scala:93)
at org.apache.spark.ml.feature.MinMaxScaler.transformSchema(MinMaxScaler.scala:129)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.MinMaxScaler.fit(MinMaxScaler.scala:119)
... 50 elided
THANKS A LOT whatever!