Moving transformations from a Hive SQL query to Spark (Scala)

val temp = sqlContext.sql(s"SELECT A, B, C, (CASE WHEN (D) in (1,2,3) THEN ((E)+0.000)/60 ELSE 0 END) AS Z from TEST.TEST_TABLE")
// Key by (A, B); sum C and Z per key
val temp1 = temp.map(row => ((row.getShort(0), row.getString(1)), (row.getDouble(2), row.getDouble(3))))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
The code above performs the computation (the CASE evaluation) in the Hive layer. I would like to do that transformation in Scala instead. How would I do it?
Is it possible to do the same while filling the data inside the map?

val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
val tempTransform = temp.map(row => {
  // Z = E / 60 when D is 1, 2 or 3; otherwise 0
  val z = List[Double](1, 2, 3).contains(row.getDouble(3)) match {
    case true => row.getDouble(4) / 60
    case _ => 0.0
  }
  Row(row.getShort(0), row.getString(1), row.getDouble(2), z)
})
val temp1 = tempTransform.map(row => ((row.getShort(0), row.getString(1)), (row.getDouble(2), row.getDouble(3))))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
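The per-row logic can be checked without a cluster. Here is a minimal sketch on plain Scala collections, assuming D is an integer code and E is a Double (the question does not give the column types). `Rec`, `computeZ` and `sumByKey` are hypothetical names introduced only for illustration:

```scala
object CaseInScala {
  // Hypothetical record standing in for one row of TEST.TEST_TABLE.
  final case class Rec(a: Short, b: String, c: Double, d: Int, e: Double)

  // Mirrors: CASE WHEN D IN (1,2,3) THEN (E + 0.000) / 60 ELSE 0 END
  def computeZ(d: Int, e: Double): Double =
    if (Set(1, 2, 3).contains(d)) e / 60 else 0.0

  // Local stand-in for map + reduceByKey: sums C and Z per (A, B) key.
  def sumByKey(rows: Seq[Rec]): Map[(Short, String), (Double, Double)] =
    rows
      .map(r => ((r.a, r.b), (r.c, computeZ(r.d, r.e))))
      .groupBy(_._1)
      .map { case (key, pairs) =>
        key -> pairs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
      }
}
```

Keeping the CASE logic in a small pure function like this makes it easy to unit test before calling it from a map over rows or wrapping it in a UDF.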

You can also use this syntax:
new_df = old_df.withColumn('target_column', udf(df.name))
as shown in this example:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
In your case, run the SQL, which returns a DataFrame, as below:
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
Then apply withColumn with when/otherwise or, if needed, a Spark UDF that calls your Scala function logic instead of a Hive UDF.

Related

Spark scala - Dataframes comparison

How can I compare two DataFrames based on their primary key (PK)?
I want to write Scala Spark code that compares two big DataFrames (10M records and 100 columns each) and shows the output as:
ID Diff
1 [ {Col1: [1,2]}, {col3: [5,10]} ...]
2 [ {Col3: [4,2]}, {col7: [2,6]} ...]
ID is the PK.
The Diff column shows, for each column that differs, the column name followed by the two values that differ in that column.
Each differing column can be converted to a string, and then all the strings are concatenated:
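import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes a SparkSession named spark
// ---- data ---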
val leftDF = Seq(
(1, 1, 5, 0),
(2, 0, 4, 2)
).toDF("ID", "Col1", "col3", "col7")
val rightDF = Seq(
(1, 2, 10, 0),
(2, 0, 2, 6)
).toDF("ID", "Col1", "col3", "col7")
def getDifferenceForColumn(name: String): Column =
  when(
    col("l." + name) =!= col("r." + name),
    concat(lit("{" + name + ": ["), col("l." + name), lit(","), col("r." + name), lit("]}")))
    .otherwise(lit(""))
val diffColumn = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))
  .reduce((l, r) => concat(l,
    when(length(r) =!= 0 && length(l) =!= 0, lit(",")).otherwise(lit(""))
    , r))
val diffColumnWithBraces = concat(lit("["), diffColumn, lit("]"))
leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("id"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))
Output:
+---+------------------------------+
|ID |DIFF |
+---+------------------------------+
|1 |[{Col1: [1,2]},{col3: [5,10]}]|
|2 |[{col3: [4,2]},{col7: [2,6]}] |
+---+------------------------------+
If column values can never contain the string "}{", two variables in the solution above can be changed as follows, which may perform better:
val diffColumns = leftDF
.columns
.filter(_ != "ID")
.map(name => getDifferenceForColumn(name))
val diffColumnWithBraces = concat(lit("["), regexp_replace(concat(diffColumns: _*),"\\}\\{","},{"), lit("]"))
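To see why the regexp_replace variant gives the same result, here is a small plain-Scala sketch of the same string manipulation. `joinFragments` is a hypothetical helper, not part of the answer's code; it takes the per-column fragments (an empty string where the column does not differ), glues them together, and then inserts the missing commas:

```scala
object DiffSeparators {
  // Fragments are shaped like the output of getDifferenceForColumn:
  // "{name: [l,r]}" for a differing column, "" otherwise.
  def joinFragments(fragments: Seq[String]): String = {
    // Gluing leaves adjacent fragments touching, e.g. "{Col1: [1,2]}{col3: [5,10]}"
    val glued = fragments.mkString
    // Every "}{" boundary marks a missing separator
    val separated = glued.replaceAll("\\}\\{", "},{")
    s"[$separated]"
  }
}
```

The caveat from the text applies here too: a value that itself contains "}{" would also get a comma inserted, which is why this shortcut is only safe when the data cannot contain that sequence.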
A UDF can also be used; the input data and the output are the same as in my first answer:
val colNames = leftDF
.columns
.filter(_ != "ID")
val generateSeqDiff = (colNames: Seq[String], leftValues: Seq[Any], rightValues: Seq[Any]) => {
  val nameValues = colNames
    .zip(leftValues)
    .zip(rightValues)
    .filterNot({ case ((_, l), r) => l == r })
    .map({ case ((name, l), r) => s"{$name: [$l,$r]}" })
    .mkString(",")
  s"[$nameValues]"
}
val generateSeqDiffUDF = udf(generateSeqDiff)
leftDF
  .select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
  .alias("l")
  .join(
    rightDF
      .select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues"))
      .alias("r"), Seq("id"))
  .select($"ID", generateSeqDiffUDF(lit(colNames), $"leftValues", $"rightValues").alias("DIFF"))

Spark Scala SQL: Take average of non-null columns

How do I take the average of columns in an array cols with non-null values in a dataframe df? I can do this for all columns but it gives null when any of the values are null.
val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)
I don't want to na.fill because I want to preserve the true average.
I guess you can do something like this:
val cols = Array("col1", "col2", "col3")
def countAvg =
  udf((data: Row) => {
    // Indices of the columns that are not null in this row
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
  })
df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
Note that here the average is computed over the non-null elements only.
If you need to divide by the total number of columns, as in your code:
val cols = Array("col1", "col2", "col3")
def countAvg =
  udf((data: Row) => {
    // Sum the non-null values but divide by the total number of columns
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / cols.length
  })
df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
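The difference between the two variants is easiest to see on plain values. Below is a minimal sketch, with `None` standing in for a SQL NULL; `avgOfPresent` and `avgOverAll` are hypothetical names used only for this illustration:

```scala
object NonNullAverage {
  // Average over the values that are present; None when nothing is present.
  def avgOfPresent(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Sums the present values but divides by the total number of slots,
  // which is the same as treating every NULL as 0.
  def avgOverAll(values: Seq[Option[Double]]): Option[Double] =
    if (values.isEmpty) None else Some(values.flatten.sum / values.size)
}
```

For the values 2, NULL and 4, the first function returns 3 (6 / 2) and the second returns 2 (6 / 3), so the second is equivalent to filling nulls with zero before averaging.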
The aggregate function can do it without a UDF.
val cols = Array($"col1", $"col2", $"col3")
df.withColumn(
  "avgCols",
  aggregate(
    array(cols: _*), // aggregate expects a single array column
    struct(lit(0.0).alias("sum"), lit(0).alias("count")),
    // Add each non-null value to sum and count it; nulls contribute nothing
    (acc, x) => struct(
      (acc("sum") + coalesce(x.cast("double"), lit(0.0))).alias("sum"),
      (acc("count") + x.isNotNull.cast("int")).alias("count")),
    s => s("sum") / s("count")
  )
)

Collapse rows with flatMap or reduceByKey

I have a requirement to collapse rows and produce a WrappedArray. Here is the original data and the expected result. I need to do it in Spark Scala.
Original Data:
Column1 Column2 Units UnitsByDept
ABC BCD 3 [Dept1:1,Dept2:2]
ABC BCD 13 [Dept1:5,Dept3:8]
Expected Result:
ABC BCD 16 [Dept1:6,Dept2:2,Dept3:8]
It would probably be best to use the DataFrame API for what you need. If you prefer using row-based functions like reduceByKey, here's one approach:
1. Convert the DataFrame to a PairRDD.
2. Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept.
3. Convert the resulting RDD back to a DataFrame.
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val df = Seq(
("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
  map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
    ((c1, c2), (u, ubd))
  }.
  reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
  map{ case ((c1, c2), (u, ubd)) =>
    // Parse the "Dept:n" entries and sum n per Dept
    val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
      groupBy(_._1).mapValues(_.map(_._2).sum).
      map{ case (d, u) => d + ":" + u }
    ( c1, c2, u, aggUBD)
  }
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
// aggUBD is an Iterable[String], so convert it to a Seq for the array column
val rowRDD = rdd.map{ case (c1, c2, u, ubd) =>
  Row(c1, c2, u, ubd.toSeq)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
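The per-key merge of the "Dept:n" strings can also be tried on its own, outside Spark. This is a minimal plain-Scala sketch; `merge` is a hypothetical helper, and the sort step is an addition of mine (not part of the answer above) so that the output order is stable:

```scala
object MergeUnitsByDept {
  // Parses entries like "Dept1:5", sums the units per department,
  // and sorts by department name for a deterministic result.
  def merge(entries: Seq[String]): Seq[String] =
    entries
      .map { s =>
        val parts = s.split(":")
        (parts(0), parts(1).toInt)
      }
      .groupBy(_._1)
      .toSeq
      .map { case (dept, pairs) => (dept, pairs.map(_._2).sum) }
      .sortBy(_._1)
      .map { case (dept, total) => s"$dept:$total" }
}
```

Note that the sort is lexicographic, so a name like Dept10 would come before Dept2. The sketch also assumes every entry is well formed; a missing ":" or a non-numeric count would throw.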

Spark Join of 2 dataframes which have 2 different column names in list

Is there a way to join two Spark Dataframes with different column names via 2 lists?
I know that if they had the same names in a list I could do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
or if I knew the different column names I could do this:
df1.join(
df2,
df1("col_a") <=> df2("col_x")
&& df1("col_b") <=> df2("col_y"),
"left"
)
Since my method is expecting inputs of 2 lists which specify which columns are to be used for the join for each DF, I was wondering if Scala Spark had a way of doing this?
P.S
I'm looking for something like a python pandas merge:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')
You can easily define such a method yourself:
import org.apache.spark.sql.DataFrame
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
  import org.apache.spark.sql.functions.lit
  // AND together one equality per key pair, starting from a literal true
  val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
  left.join(right, joinExpr, how)
}
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
If you expect two lists of strings:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
Just zip and reduce:
import org.apache.spark.sql.functions.col
val on = leftOn.zip(rightOn)
.map { case (x, y) => df1(x) <=> df2(y) }
.reduce(_ and _)
df1.join(df2, on, "left")
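One thing to watch with zip and reduce: zip silently drops the extra elements when the two lists differ in length, and reduce throws on an empty list. A minimal sketch of a guard you could run before building the join condition; `pairKeys` is a hypothetical helper, not a Spark API:

```scala
object JoinKeys {
  // Pairs up the join keys, failing fast instead of letting zip drop extras.
  def pairKeys(leftOn: Seq[String], rightOn: Seq[String]): Seq[(String, String)] = {
    require(leftOn.nonEmpty, "at least one join key is required")
    require(leftOn.length == rightOn.length,
      s"key lists differ in length: ${leftOn.length} vs ${rightOn.length}")
    leftOn.zip(rightOn)
  }
}
```

The resulting pairs can then be mapped to column comparisons and reduced with `and`, as in the snippet above.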

How do I create a set of ngrams in Spark?

I am extracting Ngrams from a Spark 2.2 dataframe column using Scala, thus (trigrams in this example):
val ngram = new NGram().setN(3).setInputCol("incol").setOutputCol("outcol")
How do I create an output column that contains all of 1 to 5 grams? So it might be something like:
val ngram = new NGram().setN(1:5).setInputCol("incol").setOutputCol("outcol")
but that doesn't work.
I could loop through N and create new dataframes for each value of N but this seems inefficient. Can anyone point me in the right direction, as my Scala is ropey?
If you want to combine these into vectors, you can rewrite the Python answer by zero323 in Scala:
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
def buildNgrams(inputCol: String = "tokens",
                outputCol: String = "features", n: Int = 3) = {
  val ngrams = (1 to n).map(i =>
    new NGram().setN(i)
      .setInputCol(inputCol).setOutputCol(s"${i}_grams")
  )
  val vectorizers = (1 to n).map(i =>
    new CountVectorizer()
      .setInputCol(s"${i}_grams")
      .setOutputCol(s"${i}_counts")
  )
  val assembler = new VectorAssembler()
    .setInputCols(vectorizers.map(_.getOutputCol).toArray)
    .setOutputCol(outputCol)
  new Pipeline().setStages((ngrams ++ vectorizers :+ assembler).toArray)
}
val df = Seq((1, Seq("a", "b", "c", "d"))).toDF("id", "tokens")
Result
buildNgrams().fit(df).transform(df).show(1, false)
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |id |tokens |1_grams |2_grams |3_grams |1_counts |2_counts |3_counts |features |
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |1 |[a, b, c, d]|[a, b, c, d]|[a b, b c, c d]|[a b c, b c d]|(4,[0,1,2,3],[1.0,1.0,1.0,1.0])|(3,[0,1,2],[1.0,1.0,1.0])|(2,[0,1],[1.0,1.0])|[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]|
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
This could be simpler with a UDF:
val ngram = udf((xs: Seq[String], n: Int) =>
(1 to n).map(i => xs.sliding(i).filter(_.size == i).map(_.mkString(" "))).flatten)
spark.udf.register("ngram", ngram)
val ngramer = new SQLTransformer().setStatement(
"""SELECT *, ngram(tokens, 3) AS ngrams FROM __THIS__"""
)
ngramer.transform(df).show(false)
// +---+------------+-----------------------------------------+
// |id |tokens      |ngrams                                   |
// +---+------------+-----------------------------------------+
// |1  |[a, b, c, d]|[a, b, c, d, a b, b c, c d, a b c, b c d]|
// +---+------------+-----------------------------------------+
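The body of the UDF is ordinary Scala, so it can be tested without Spark. A minimal sketch of the same logic as a standalone function (the `NGrams` object is a hypothetical wrapper added for illustration):

```scala
object NGrams {
  // All 1- to n-grams of tokens, shortest first, each joined with a space.
  def ngrams(tokens: Seq[String], n: Int): Seq[String] =
    (1 to n).flatMap { i =>
      // sliding(i) yields one short window when tokens has fewer than i
      // elements, so keep only the windows of exactly size i
      tokens.sliding(i).filter(_.size == i).map(_.mkString(" "))
    }
}
```

The `filter(_.size == i)` step matters for short inputs: with two tokens and n = 3, `sliding(3)` would otherwise emit the two-token window a second time as a fake trigram.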