Spark Scala SQL: Take average of non-null columns

How do I take the average of the non-null values across the columns in an array cols of a DataFrame df? I can compute the average over all columns, but it returns null whenever any of the values is null.
val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)
I don't want to na.fill because I want to preserve the true average.

I guess you can do something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

val cols = Array("col1", "col2", "col3")

def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
  })

df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
But be careful: here the average is computed only over the non-null elements.
If you need exactly the solution from your code (the sum divided by the total number of columns):
val cols = Array("col1", "col2", "col3")

def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / cols.length
  })

df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
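To see the difference between the two variants on concrete numbers, here is a made-up two-row DataFrame (data is purely illustrative):
import spark.implicits._  // assumes a SparkSession named spark

// Running the first variant on this data gives 1.5 and 3.0 (average over non-null values);
// the second variant gives 1.0 and 3.0 (sum divided by the total number of columns).
val df = Seq(
  (Some(1.0), Some(2.0), Option.empty[Double]),
  (Some(3.0), Some(3.0), Some(3.0))
).toDF("col1", "col2", "col3")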

The aggregate higher-order function (available in the Scala DSL since Spark 3.0) can do it without a UDF:
import org.apache.spark.sql.functions._

val cols = Array($"col1", $"col2", $"col3")

df.withColumn(
  "avgCols",
  aggregate(
    array(cols: _*),
    struct(lit(0.0).alias("sum"), lit(0).alias("count")),
    (acc, x) => struct(
      (acc("sum") + coalesce(x, lit(0))).alias("sum"),
      (acc("count") + x.isNotNull.cast("int")).alias("count")),
    s => s("sum") / s("count")
  )
)
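Before Spark 3.0, or if you would rather avoid higher-order functions, the same per-row average over non-null values can be built from coalesce and when; a minimal sketch assuming the same cols as above:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $-notation, assuming a SparkSession named spark

val cols = Array($"col1", $"col2", $"col3")

// Sum the values with nulls treated as 0, and count how many of the columns are non-null per row.
val sumNonNull   = cols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countNonNull = cols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

// With the default (non-ANSI) settings, division by 0 yields null when every column is null.
df.withColumn("avgCols", sumNonNull / countNonNull)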

Related

For loop column expression

I'm computing average expressions over multiple columns. Is there any way I can loop over my list of columns so the expressions are built like the sequence in the example below?
val cols = List("col1", "col2", "col3", "col4")
val expressions = Seq("avg(col1) as col1", "avg(col2) as col2", "...")
df.selectExpr(expressions: _*)
PySpark equivalent:
exprs = [avg(_col).alias(_col) for _col in cols]
You can use something like this:
val cols = List("col1", "col2", "col3", "col4")
val expressions = cols.map(colName => avg(col(colName)).as(colName))
df.select(expressions: _*)
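If you prefer to keep the selectExpr form from the question, the string expressions can be generated the same way (a small variation on the above):
val cols = List("col1", "col2", "col3", "col4")
val expressions = cols.map(c => s"avg($c) as $c")
df.selectExpr(expressions: _*)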
This should work for you.
val cols = List("col1", "col2", "col3", "col4")
val expressions = cols.map(c => avg(c).as(c))
df.groupBy(cols.head, cols.tail: _*).agg(expressions.head, expressions.tail: _*)
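A quick toy run of the first answer's select variant, with made-up numbers just to see the shape of the result:
import spark.implicits._  // assumes a SparkSession named spark
import org.apache.spark.sql.functions.{avg, col}

val df = Seq(
  (1.0, 2.0, 3.0, 4.0),
  (5.0, 6.0, 7.0, 8.0)
).toDF("col1", "col2", "col3", "col4")

val cols = List("col1", "col2", "col3", "col4")
df.select(cols.map(c => avg(col(c)).as(c)): _*).show()
// +----+----+----+----+
// |col1|col2|col3|col4|
// +----+----+----+----+
// | 3.0| 4.0| 5.0| 6.0|
// +----+----+----+----+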

Spark scala - Dataframes comparison

How do I compare two DataFrames based on a PK?
Basically, I want to write Scala Spark code that compares two big DataFrames (10M records each, 100 columns each) and shows output like:
ID Diff
1 [ {Col1: [1,2]}, {col3: [5,10]} ...]
2 [ {Col3: [4,2]}, {col7: [2,6]} ...]
ID is the PK.
Diff column: for each column where the values differ, show the column name and then the two differing values from the left and right DataFrames.
Each differing column can be converted to a string, and then all columns are concatenated:
// ---- data ---
val leftDF = Seq(
  (1, 1, 5, 0),
  (2, 0, 4, 2)
).toDF("ID", "Col1", "col3", "col7")

val rightDF = Seq(
  (1, 2, 10, 0),
  (2, 0, 2, 6)
).toDF("ID", "Col1", "col3", "col7")

def getDifferenceForColumn(name: String): Column =
  when(
    col("l." + name) =!= col("r." + name),
    concat(lit("{" + name + ": ["), col("l." + name), lit(","), col("r." + name), lit("]}")))
    .otherwise(lit(""))

val diffColumn = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))
  .reduce((l, r) => concat(l,
    when(length(r) =!= 0 && length(l) =!= 0, lit(",")).otherwise(lit("")),
    r))

val diffColumnWithBraces = concat(lit("["), diffColumn, lit("]"))

leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("id"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))
Output:
+---+------------------------------+
|ID |DIFF |
+---+------------------------------+
|1 |[{Col1: [1,2]},{col3: [5,10]}]|
|2 |[{col3: [4,2]},{col7: [2,6]}] |
+---+------------------------------+
If the column values can never contain the string "}{", two variables in the solution above can be changed; this may perform better:
val diffColumns = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))

val diffColumnWithBraces = concat(lit("["), regexp_replace(concat(diffColumns: _*), "\\}\\{", "},{"), lit("]"))
A UDF can also be used; the input data and the output are the same as in my first answer:
val colNames = leftDF
  .columns
  .filter(_ != "ID")

val generateSeqDiff = (colNames: Seq[String], leftValues: Seq[Any], rightValues: Seq[Any]) => {
  val nameValues = colNames
    .zip(leftValues)
    .zip(rightValues)
    .filterNot({ case ((_, l), r) => l == r })
    .map({ case ((name, l), r) => s"{$name: [$l,$r]}" })
    .mkString(",")
  s"[$nameValues]"
}

val generateSeqDiffUDF = udf(generateSeqDiff)

leftDF
  .select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
  .alias("l")
  .join(
    rightDF
      .select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues"))
      .alias("r"), Seq("id"))
  .select($"ID", generateSeqDiffUDF(lit(colNames), $"leftValues", $"rightValues").alias("DIFF"))
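One caveat, as an assumption about your Spark version: lit has historically not accepted a Seq, so if lit(colNames) fails at runtime with an unsupported-literal error, typedLit is the usual replacement. A sketch of the same chain with that one change:
import org.apache.spark.sql.functions.{array, typedLit}

leftDF
  .select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
  .join(
    rightDF.select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues")),
    Seq("ID"))
  .select($"ID", generateSeqDiffUDF(typedLit(colNames.toSeq), $"leftValues", $"rightValues").alias("DIFF"))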

Spark Join of 2 dataframes which have 2 different column names in list

Is there a way to join two Spark Dataframes with different column names via 2 lists?
I know that if they had the same names in a list I could do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
or if I knew the different column names I could do this:
df1.join(
  df2,
  df1("col_a") <=> df2("col_x")
    && df1("col_b") <=> df2("col_y"),
  "left"
)
Since my method expects two lists as input, which specify the columns to be used for the join on each DataFrame, I was wondering if Spark Scala has a way of doing this?
P.S.
I'm looking for something like a pandas merge in Python:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')
You can easily define such a method yourself:
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
  import org.apache.spark.sql.functions.lit
  val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
  left.join(right, joinExpr, how)
}
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
If you expect two lists of strings:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
Just zip and reduce:
import org.apache.spark.sql.functions.col
val on = leftOn.zip(rightOn)
  .map { case (x, y) => df1(x) <=> df2(y) }
  .reduce(_ and _)
df1.join(df2, on, "left")
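A quick end-to-end check of the zip-and-reduce approach with toy frames (data and column names are made up for illustration):
import spark.implicits._  // assumes a SparkSession named spark

val df1 = Seq((1, "a"), (2, "b")).toDF("col_a", "col_b")
val df2 = Seq((1, "a")).toDF("col_x", "col_y")

val leftOn  = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")

val on = leftOn.zip(rightOn)
  .map { case (x, y) => df1(x) <=> df2(y) }
  .reduce(_ and _)

// Row (2, "b") keeps nulls on the right side, as expected for a left join.
df1.join(df2, on, "left").show()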

spark dataframe: how to explode a IntegerType column

val schema = StructType(Array(StructField("id", IntegerType, false),StructField("num", IntegerType, false)))
I want to generate continuous numbers from 0 to num for every id.
I don't know how to do it.
Thanks
You can use UDF and explode function:
import org.apache.spark.sql.functions.{udf, explode}
val range = udf((i: Int) => (0 to i).toArray)
df.withColumn("num", explode(range($"num")))
Try DataFrame.explode (deprecated since Spark 2.0, but available on older versions):
df.explode(col("id"), col("num")) { case row: Row =>
  val id = row(0).asInstanceOf[Int]
  val num = row(1).asInstanceOf[Int]
  (0 to num).map((id, _))
}
Or in RDD land, you can use flatMap for this:
df.rdd.flatMap(row => (0 to row.getInt(1)).map((row.getInt(0), _)))
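On Spark 2.4 or later, the built-in sequence function avoids the UDF entirely; a minimal sketch assuming the same id/num schema:
import org.apache.spark.sql.functions.{col, explode, lit, sequence}

// sequence(0, num) builds the array [0, 1, ..., num], which is then exploded per id.
df.withColumn("num", explode(sequence(lit(0), col("num"))))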

moving transformations from hive sql query to Spark

val temp = sqlContext.sql(s"SELECT A, B, C, (CASE WHEN (D) in (1,2,3) THEN ((E)+0.000)/60 ELSE 0 END) AS Z from TEST.TEST_TABLE")
val temp1 = temp.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3))) })
  .reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2)))
Instead of the above code, which does the computation (the CASE evaluation) in the Hive layer, I would like to have that transformation done in Scala. How would I do it?
Is it possible to do the same while filling the data inside the map?
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")

val tempTransform = temp.map(row => {
  val z = List[Double](1, 2, 3).contains(row.getDouble(3)) match {
    case true => row.getDouble(4) / 60
    case _ => 0d
  }
  Row(row.getShort(0), row.getString(1), row.getDouble(2), z)
})

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (temp.getDouble(2), temp.getDouble(3))) })
  .reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2)))
You can use this syntax as well:
new_df = old_df.withColumn('target_column', udf(df.name))
as shown in this example:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
  .toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
In your case, execute the SQL, which gives you a DataFrame, like below:
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
and then apply withColumn with a when/otherwise expression for the CASE logic, or if needed a Spark UDF that calls your Scala function logic instead of a Hive UDF, as in the sketch below.
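For example, a minimal sketch of the question's CASE expression written with when/otherwise (column names A-E taken from the question; types are assumed):
import org.apache.spark.sql.functions._
import sqlContext.implicits._

val temp = sqlContext.sql("SELECT A, B, C, D, E FROM TEST.TEST_TABLE")

// Same as: CASE WHEN D IN (1,2,3) THEN (E + 0.000)/60 ELSE 0 END AS Z
val withZ = temp.withColumn(
  "Z",
  when($"D".isin(1, 2, 3), ($"E" + lit(0.000)) / 60).otherwise(lit(0))
)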