Collapse the rows with flatMap or reduceByKey - Scala

I have a requirement to collapse rows and end up with a WrappedArray column. Here are the original data and the expected result. I need to do it in Spark with Scala.
Original Data:
Column1  Column2  Units  UnitsByDept
ABC      BCD      3      [Dept1:1,Dept2:2]
ABC      BCD      13     [Dept1:5,Dept3:8]
Expected Result:
ABC      BCD      16     [Dept1:6,Dept2:2,Dept3:8]

It would probably be best to use the DataFrame API for what you need. If you prefer row-based functions like reduceByKey, here's one approach:
1. Convert the DataFrame to a pair RDD.
2. Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept.
3. Convert the resulting RDD back to a DataFrame.
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Outside spark-shell, toDF also needs: import spark.implicits._
val df = Seq(
  ("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
  ("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
  // Key each row by (Column1, Column2)
  map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
    ((c1, c2), (u, ubd))
  }.
  // Sum Units and concatenate the "Dept:n" lists per key
  reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
  map{ case ((c1, c2), (u, ubd)) =>
    // Parse "Dept:n", sum n per department, then render back to "Dept:n"
    val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
      groupBy(_._1).mapValues(_.map(_._2).sum).
      map{ case (d, u) => d + ":" + u }
    ( c1, c2, u, aggUBD)
  }
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
// aggUBD is an Iterable, so convert it to a Seq before building the Row
val rowRDD = rdd.map{ case (c1, c2, u, ubd) =>
  Row(c1, c2, u, ubd.toSeq)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept                |
// +-------+-------+-----+---------------------------+
// |ABC    |BCD    |16   |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
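The per-department merge in the middle of the job is the part most worth checking in isolation. Below is a minimal sketch on plain Scala collections (no Spark required), assuming every entry is a well-formed "Dept:count" string. The names DeptMerge, parse and mergeDepts are illustrative and not part of the answer above, and the result is sorted by department only to make the output deterministic.

```scala
object DeptMerge {
  // Parse "Dept1:5" into ("Dept1", 5); assumes a well-formed "name:int" entry.
  def parse(entry: String): (String, Int) = {
    val Array(dept, n) = entry.split(":", 2)
    (dept, n.trim.toInt)
  }

  // Sum the counts per department and render them back as "Dept:n",
  // sorted by department name.
  def mergeDepts(entries: Seq[String]): Seq[String] =
    entries.map(parse)
      .groupBy(_._1)
      .map { case (dept, xs) => dept -> xs.map(_._2).sum }
      .toSeq
      .sortBy(_._1)
      .map { case (dept, n) => s"$dept:$n" }
}
```

Calling DeptMerge.mergeDepts(Seq("Dept1:1", "Dept2:2", "Dept1:5", "Dept3:8")) gives Seq("Dept1:6", "Dept2:2", "Dept3:8"), which matches the expected UnitsByDept in the question.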


Spark Scala DataFrame: get value for each row and assign to variables

I have a DataFrame like the one below:
val df=spark.sql("select * from table")
I want to loop over it and get the values of each row into variables like this:
val value1="A1"
val value2="B1"
val value3="C1"
Please help me.
You have two options:
Solution 1: If your data is big, stick with DataFrames. To apply a function to every row, define a UDF.
Solution 2: If your data is small, collect it to the driver and then iterate over it with map.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
val df = Seq((1,2,3), (4,5,6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int) = a+b+c
// Solution 1: pass the columns to the UDF as a struct, which arrives as a Row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show
// Solution 2: collect to the driver and map over the local Array[Row]
df.collect.map(r => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output for both cases (Solution 2 returns the same values as a local Array):
+---+
|sum|
+---+
|  6|
| 15|
+---+
To bind each column to a named variable inside the UDF (myFunction stands for your own logic):
val myUDF = udf((r: Row) => {
  val value1 = r.getAs[Int](0)
  val value2 = r.getAs[Int](1)
  val value3 = r.getAs[Int](2)
  myFunction(value1, value2, value3)
})

Reduce and sum tuples by key

In my Spark Scala application I have an RDD with the following format:
(05/05/2020, (name, 1))
(05/05/2020, (name, 1))
(05/05/2020, (name2, 1))
(06/05/2020, (name, 1))
What I want to do is group these elements by date and sum the tuples that have the same "name" as key.
Expected Output:
(05/05/2020, List[(name, 2), (name2, 1)]),
(06/05/2020, List[(name, 1)])
In order to do that, I am currently using a groupByKey operation and some extra transformations in order to group the tuples by key and calculate the sum for those that share the same one.
For performance reasons, I would like to replace this groupByKey operation with a reduceByKey or an aggregateByKey in order to reduce the amount of data transferred over the network.
However, I can't get my head around how to do this. Both of these transformations take as a parameter a function over the values (tuples in my case), so I can't see how to group the tuples by key in order to calculate their sum.
Is it doable?
Here's how you can merge your Tuples using reduceByKey:
File /path/to/file1 (tab-separated):
15/04/2010 name
15/04/2010 name
15/04/2010 name2
15/04/2010 name2
15/04/2010 name3
16/04/2010 name
16/04/2010 name
File /path/to/file2 (tab-separated):
15/04/2010 name
15/04/2010 name3
import org.apache.spark.rdd.RDD
val filePaths = Array("/path/to/file1", "/path/to/file2").mkString(",")
// Read both files and turn each line into (date, (name, 1))
val rdd: RDD[(String, (String, Int))] = sc.textFile(filePaths).
  map{ line =>
    val pair = line.split("\\t", -1)
    (pair(0), (pair(1), 1))
  }
rdd.
  // Wrap each tuple in a single-entry Map so that values can be merged
  map{ case (k, (n, v)) => (k, Map(n -> v)) }.
  // Merge two Maps, summing the counts of names present in both
  reduceByKey{ (acc, m) =>
    acc ++ m.map{ case (n, v) => (n -> (acc.getOrElse(n, 0) + v)) }
  }.
  map(x => (x._1, x._2.toList)).
  collect
// res1: Array[(String, List[(String, Int)])] = Array(
// (15/04/2010, List((name,3), (name2,2), (name3,2))), (16/04/2010, List((name,2)))
// )
Note that the initial mapping is needed because we want to merge the Tuples as elements in a Map, and reduceByKey for RDD[(K, V)] requires the same data type V before and after the transformation:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
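The (V, V) => V constraint can be tried out without a cluster. The sketch below uses plain Scala collections: mergeCounts has exactly the shape reduceByKey expects with V = Map[String, Int], while reduceByKeyLocal is a hypothetical local stand-in (groupBy followed by reduce) that exists only for illustration. On a real RDD you would pass mergeCounts to reduceByKey instead. The per-date lists are sorted by name just to keep the result deterministic.

```scala
object MapMerge {
  // Same shape as the function given to reduceByKey: (V, V) => V
  // with V = Map[String, Int]. Counts for names in both maps are summed.
  def mergeCounts(acc: Map[String, Int], m: Map[String, Int]): Map[String, Int] =
    m.foldLeft(acc) { case (out, (name, v)) =>
      out.updated(name, out.getOrElse(name, 0) + v)
    }

  // Local stand-in for map + reduceByKey: wrap each tuple in a Map,
  // group by date, then reduce each group's Maps with mergeCounts.
  def reduceByKeyLocal(pairs: Seq[(String, (String, Int))]): Map[String, List[(String, Int)]] =
    pairs
      .map { case (date, (name, v)) => (date, Map(name -> v)) }
      .groupBy(_._1)
      .map { case (date, kvs) =>
        date -> kvs.map(_._2).reduce((a, b) => mergeCounts(a, b)).toList.sortBy(_._1)
      }
}
```

With the four records from the question, reduceByKeyLocal returns List((name,2), (name2,1)) for 05/05/2020 and List((name,1)) for 06/05/2020.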
Yes, aggregateByKey() can be used as follows:
import scala.collection.mutable.HashMap
// Add one (name, count) element to the per-key map and return the map
def merge(map: HashMap[String, Int], element: (String, Int)) = {
  if(map.contains(element._1)) map(element._1) += element._2 else map(element._1) = element._2
  map
}
val input = sc.parallelize(List(("05/05/2020",("name",1)),("05/05/2020", ("name", 1)),("05/05/2020", ("name2", 1)),("06/05/2020", ("name", 1))))
val output = input.aggregateByKey(HashMap[String, Int]())({
  // combining map & tuple
  case (map, element) => merge(map, element)
}, {
  // combining two maps
  case (map1, map2) => {
    val combined = (map1.keySet ++ map2.keySet).map { i=> (i,map1.getOrElse(i,0) + map2.getOrElse(i,0)) }.toMap
    collection.mutable.HashMap(combined.toSeq: _*)
  }
})
// output.mapValues(_.toList) gives the List form asked for in the question
credits: Best way to merge two maps and sum the values of same key?
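The two functions given to aggregateByKey play different roles: the first folds one element into a per-partition accumulator, the second merges accumulators coming from different partitions. The following sketch imitates that with immutable Maps on local collections. The names AggregateSketch, seqOp, combOp and aggregate are illustrative, and the nested Seq is only an assumed stand-in for partitions, not Spark's actual execution model.

```scala
object AggregateSketch {
  type Counts = Map[String, Int]

  // Zero value: an empty accumulator.
  val zero: Counts = Map.empty

  // seqOp: fold one (name, count) element into the accumulator.
  def seqOp(acc: Counts, e: (String, Int)): Counts =
    acc.updated(e._1, acc.getOrElse(e._1, 0) + e._2)

  // combOp: merge two accumulators by folding the entries of b into a.
  def combOp(a: Counts, b: Counts): Counts =
    b.foldLeft(a)(seqOp)

  // Aggregate each "partition" with seqOp, then merge the partial results with combOp.
  def aggregate(partitions: Seq[Seq[(String, Int)]]): Counts =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
}
```

For the values of key 05/05/2020 split across two assumed partitions, aggregate(Seq(Seq(("name", 1), ("name2", 1)), Seq(("name", 1)))) yields Map(name -> 2, name2 -> 1). Using immutable Maps avoids mutating the zero value, at the cost of some extra allocation.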
You could convert the RDD to a DataFrame and just use a groupBy with sum. Here is one way to do it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(StructField("date", StringType, false) :: StructField("name", StringType, false) :: StructField("value", IntegerType, false) :: Nil)
val rd = sc.parallelize(Seq(("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name", 1)),
  ("05/05/2020", ("name2", 1)),
  ("06/05/2020", ("name", 1))))
// Flatten (date, (name, value)) into a three-column Row
val df = spark.createDataFrame(rd.map{ case (a, (b, c)) => Row(a, b, c) }, schema)
df.show
// +----------+-----+-----+
// |      date| name|value|
// +----------+-----+-----+
// |05/05/2020| name|    1|
// |05/05/2020| name|    1|
// |05/05/2020|name2|    1|
// |06/05/2020| name|    1|
// +----------+-----+-----+
val sumdf = df.groupBy("date", "name").sum("value")
sumdf.show
// +----------+-----+----------+
// |      date| name|sum(value)|
// +----------+-----+----------+
// |06/05/2020| name|         1|
// |05/05/2020| name|         2|
// |05/05/2020|name2|         1|
// +----------+-----+----------+

How do I create a set of ngrams in Spark?

I am extracting Ngrams from a Spark 2.2 dataframe column using Scala, thus (trigrams in this example):
val ngram = new NGram().setN(3).setInputCol("incol").setOutputCol("outcol")
How do I create an output column that contains all of 1 to 5 grams? So it might be something like:
val ngram = new NGram().setN(1:5).setInputCol("incol").setOutputCol("outcol")
but that doesn't work.
I could loop through N and create new dataframes for each value of N but this seems inefficient. Can anyone point me in the right direction, as my Scala is ropey?
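If you want to combine these into vectors, you can rewrite the Python answer by zero323 in Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
def buildNgrams(inputCol: String = "tokens",
                outputCol: String = "features", n: Int = 3) = {
  // One NGram stage per size 1..n, writing to columns "1_grams", "2_grams", ...
  val ngrams = (1 to n).map(i =>
    new NGram().setN(i)
      .setInputCol(inputCol).setOutputCol(s"${i}_grams")
  )
  // One CountVectorizer per n-gram column, writing to "1_counts", "2_counts", ...
  val vectorizers = (1 to n).map(i =>
    new CountVectorizer()
      .setInputCol(s"${i}_grams")
      .setOutputCol(s"${i}_counts")
  )
  // Concatenate all count vectors into a single feature vector
  val assembler = new VectorAssembler()
    .setInputCols(vectorizers.map(_.getOutputCol).toArray)
    .setOutputCol(outputCol)
  new Pipeline().setStages((ngrams ++ vectorizers :+ assembler).toArray)
}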
val df = Seq((1, Seq("a", "b", "c", "d"))).toDF("id", "tokens")
buildNgrams().fit(df).transform(df).show(1, false)
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |id |tokens |1_grams |2_grams |3_grams |1_counts |2_counts |3_counts |features |
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
// |1 |[a, b, c, d]|[a, b, c, d]|[a b, b c, c d]|[a b c, b c d]|(4,[0,1,2,3],[1.0,1.0,1.0,1.0])|(3,[0,1,2],[1.0,1.0,1.0])|(2,[0,1],[1.0,1.0])|[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]|
// +---+------------+------------+---------------+--------------+-------------------------------+-------------------------+-------------------+-------------------------------------+
This could be simpler with a UDF:
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.functions.udf
// All 1..n grams of xs; windows shorter than i (input too short) are dropped
val ngram = udf((xs: Seq[String], n: Int) =>
  (1 to n).map(i => xs.sliding(i).filter(_.size == i).map(_.mkString(" "))).flatten)
spark.udf.register("ngram", ngram)
val ngramer = new SQLTransformer().setStatement(
  """SELECT *, ngram(tokens, 3) AS ngrams FROM __THIS__"""
)
ngramer.transform(df).show(false)
// +---+------------+-----------------------------------------+
// |id |tokens      |ngrams                                   |
// +---+------------+-----------------------------------------+
// |1  |[a, b, c, d]|[a, b, c, d, a b, b c, c d, a b c, b c d]|
// +---+------------+-----------------------------------------+
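The body of that UDF is ordinary Scala, so it can be checked on its own before being registered with Spark. A small sketch follows; the object and method names (NGrams, upTo) are made up for illustration.

```scala
object NGrams {
  // All 1..n grams of `tokens`, each joined with a single space.
  // sliding(i) yields one short window when tokens.size < i,
  // so the size filter drops it.
  def upTo(tokens: Seq[String], n: Int): Seq[String] =
    (1 to n).flatMap { i =>
      tokens.sliding(i).filter(_.size == i).map(_.mkString(" "))
    }
}
```

NGrams.upTo(Seq("a", "b", "c", "d"), 3) returns the nine entries a, b, c, d, "a b", "b c", "c d", "a b c", "b c d". Asking for 5-grams on the same four tokens adds only "a b c d", since no window of size 5 exists.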

Merge multiple RDD in a specific order

I am trying to merge multiple RDDs of strings into an RDD of Rows in a specific order. I tried creating a Map[String, RDD[Seq[String]]] (where each Seq contains only one element) and then merging them into an RDD[Row], but it doesn't seem to work (the content of the RDD[Seq[String]] is lost). Does anyone have any ideas?
val t1: StructType
val mapFields: Map[String, RDD[Seq[String]]]
var ordRDD: RDD[Seq[String]] = context.emptyRDD
t1.foreach(field => ordRDD = ordRDD ++ mapFields(field.name))
val rdd = ordRDD.map(line => Row.fromSeq(line))
Using the zip function led to a Spark exception, because my RDDs didn't have the same number of elements in each partition. I don't know how to make sure they all have the same number of elements per partition, so I zipped each of them with its index and then joined them in the right order using a ListMap. Maybe there is a trick with the mapPartitions function, but I don't know the Spark API well enough yet.
import scala.collection.immutable.ListMap
val mapFields: Map[String, RDD[String]]
var ord: ListMap[String, RDD[String]] = ListMap()
t1.foreach(field => ord = ord ++ Map(field.name -> mapFields(field.name)))
// Note : zip = SparkException: Can only zip RDDs with same number of elements in each partition
//val rdd: RDD[Row] = ord.toSeq.map(_._2.map(s => Seq(s))).reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map{ case (l1, l2) => l1 ++ l2 }).map(Row.fromSeq)
// Key every element by its index, then join the RDDs on that index
val zipRdd = ord.toSeq.map(_._2.map(s => Seq(s)).zipWithIndex().map{ case (d, i) => (i, d) })
val concatRdd = zipRdd.reduceLeft((rdd1, rdd2) => rdd1.join(rdd2).map{ case (i, (l1, l2)) => (i, l1 ++ l2)})
val rowRdd: RDD[Row] = concatRdd.map{ case (i, d) => Row.fromSeq(d) }
val df1 = spark.createDataFrame(rowRdd, t1)
The key here is using RDD.zip to "zip" the RDDs together (creating an RDD in which each record is the combination of the records with the same index in all RDDs):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// INPUT: Map does not preserve order (not the default implementation, at least) - using Seq
val rdds: Seq[(String, RDD[String])] = Seq(
  "field1" -> sc.parallelize(Seq("a", "b", "c")),
  "field2" -> sc.parallelize(Seq("1", "2", "3")),
  "field3" -> sc.parallelize(Seq("Q", "W", "E"))
)
// Use RDD.zip to zip all RDDs together, then convert to Rows
val rowRdd: RDD[Row] = rdds
  .map(_._2)
  .map(_.map(s => Seq(s)))
  .reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map { case (l1, l2) => l1 ++ l2 })
  .map(Row.fromSeq)
// Create schema using the column names:
val schema: StructType = StructType(rdds.map(_._1).map(name => StructField(name, StringType)))
// Create DataFrame:
val result: DataFrame = spark.createDataFrame(rowRdd, schema)
result.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// |     a|     1|     Q|
// |     b|     2|     W|
// |     c|     3|     E|
// +------+------+------+
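The reduceLeft-over-zip step is easier to follow on local collections first. The sketch below applies the same fold to plain Seqs instead of RDDs; ZipColumns and toRows are illustrative names. One difference worth keeping in mind: a local zip silently truncates to the shortest column, whereas RDD.zip fails when partitions or element counts differ, which is the exception reported in the question.

```scala
object ZipColumns {
  // Turn named columns into rows: wrap each value in a one-element Seq,
  // then zip the columns pairwise, concatenating the partial rows.
  // Assumes a non-empty list of columns (reduceLeft throws on an empty one).
  def toRows(columns: Seq[(String, Seq[String])]): Seq[Seq[String]] =
    columns
      .map(_._2.map(s => Seq(s)))
      .reduceLeft((a, b) => a.zip(b).map { case (l1, l2) => l1 ++ l2 })
}
```

Given the three fields from the example, toRows returns Seq(Seq("a", "1", "Q"), Seq("b", "2", "W"), Seq("c", "3", "E")), one inner Seq per future Row, in column order.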

Moving transformations from Hive SQL query to Spark

val temp = sqlContext.sql(s"SELECT A, B, C, (CASE WHEN (D) in (1,2,3) THEN ((E)+0.000)/60 ELSE 0 END) AS Z from TEST.TEST_TABLE")
val temp1 = temp.rdd.map({ row => ((row.getShort(0), row.getString(1)), (row.getDouble(2), row.getDouble(3)))})
  .reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Instead of the above code, which does the computation (the CASE evaluation) in the Hive layer, I would like to have the transformation done in Scala. How would I do it?
Is it possible to do the same while filling the data inside map?
import org.apache.spark.sql.Row
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
// Evaluate the CASE expression in Scala instead of in the query
val tempTransform = temp.rdd.map(row => {
  val z = List[Double](1, 2, 3).contains(row.getDouble(3)) match {
    case true => row.getDouble(4) / 60
    case _ => 0.0
  }
  Row(row.getShort(0), row.getString(1), row.getDouble(2), z)
})
val temp1 = tempTransform.map({ row => ((row.getShort(0), row.getString(1)), (row.getDouble(2), row.getDouble(3)))})
  .reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
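The CASE expression itself reduces to a small pure function, which can be verified separately from the Spark job. A minimal sketch, assuming D and E are both read as Double (the query does not state the column types); CaseWhen and z are illustrative names.

```scala
object CaseWhen {
  // Mirrors: CASE WHEN D IN (1, 2, 3) THEN (E + 0.000) / 60 ELSE 0 END
  // Assumption: d and e arrive as Double.
  def z(d: Double, e: Double): Double =
    if (Set(1.0, 2.0, 3.0).contains(d)) e / 60 else 0.0
}
```

For example, CaseWhen.z(2.0, 120.0) is 2.0 and CaseWhen.z(4.0, 120.0) is 0.0. Inside the map over rows, the call would take row.getDouble(3) and row.getDouble(4) as its arguments.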
You can use this syntax as well:
new_df = old_df.withColumn('target_column', udf(
as shown in this example:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
In your case, run the SQL to get a DataFrame, as below:
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
Then apply withColumn with when/otherwise (the equivalent of CASE) or, if needed, a Spark UDF that calls your Scala function logic instead of a Hive UDF.