Is there an explode function equivalent in plain Scala?

I am looking for the explode function, or its equivalent, in plain Scala rather than Spark.
Using the explode function in Spark, I was able to flatten a row with an array of multiple elements into multiple rows, as shown below.
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode
scala> val test = spark.read.json(spark.sparkContext.parallelize(Seq("""{"a":1,"b":[2,3]}""")))
scala> test.schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true), StructField(b,ArrayType(LongType,true),true))
scala> test.show
+---+------+
| a| b|
+---+------+
| 1|[2, 3]|
+---+------+
scala> val flat = test.withColumn("b",explode($"b"))
flat: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> flat.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
+---+---+
Is there an equivalent of explode in plain Scala, without using Spark? If there is no such function available, is there any way I can implement it myself?

A simple flatMap should help in this case. I don't know the exact data structure you would like to work with in Scala, but let's take a slightly artificial example:
val l: List[(Int, List[Int])] = List(1 -> List(2, 3))
val result: List[(Int, Int)] = l.flatMap {
  case (a, b) => b.map(i => a -> i)
}
println(result)
which produces the following result:
List((1,2), (1,3))
UPDATE
As suggested in the comment section by @jwvh, the same result may be achieved with a for-comprehension, hiding the explicit flatMap and map invocations:
val result2: List[(Int, Int)] = for((a, bList) <- l; b <- bList) yield a -> b
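If this pattern comes up repeatedly, it can also be wrapped in a small generic helper. A minimal sketch (the name explode and the List-of-pairs signature are my own choice, not a standard library function):
// "Explodes" each (key, values) pair into one (key, value) pair per element
def explode[A, B](pairs: List[(A, List[B])]): List[(A, B)] =
  pairs.flatMap { case (a, bs) => bs.map(b => a -> b) }
explode(List(1 -> List(2, 3))) // List((1,2), (1,3))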
Hope this helps!

Related

Apache Spark - How to define a UserDefinedAggregateFunction in Spark 3.0?

I'm using Spark 3.0, and in order to use a user-defined function as a window function, I need an instance of UserDefinedAggregateFunction. Initially I thought that using the new Aggregator and udaf would solve this problem (as shown here), but udaf returns a UserDefinedFunction, not a UserDefinedAggregateFunction.
Since Spark 3.0, UserDefinedAggregateFunction has been deprecated, as stated here (even though it can still be found around).
So the question is: is there a correct (not deprecated) way in Spark 3.0 to define a proper UserDefinedAggregateFunction and use it as a window function?
In Spark 3, the new API uses Aggregator to define user-defined aggregations:
abstract class Aggregator[-IN, BUF, OUT] extends Serializable:
A base class for user-defined aggregations, which can be used in Dataset operations to take all of the elements of a group and reduce them to a single value.
Aggregator brings performance improvements compared to the deprecated UDAF; see the issue Efficient User Defined Aggregators.
Here's an example of how to define a mean Aggregator and register it using the functions.udaf method:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.{Aggregator, Window}
import org.apache.spark.sql.functions.udaf
import spark.implicits._ // for $ and toDF (already in scope in spark-shell)

val meanAgg = new Aggregator[Long, (Long, Long), Double]() {
  def zero = (0L, 0L)                                                      // init the buffer: (sum, count)
  def reduce(y: (Long, Long), x: Long) = (y._1 + x, y._2 + 1)              // fold one input value into the buffer
  def merge(a: (Long, Long), b: (Long, Long)) = (a._1 + b._1, a._2 + b._2) // combine two partial buffers
  def finish(r: (Long, Long)) = r._1.toDouble / r._2                       // sum / count
  def bufferEncoder: Encoder[(Long, Long)] = ExpressionEncoder[(Long, Long)]()
  def outputEncoder: Encoder[Double] = ExpressionEncoder[Double]()
}
val meanUdaf = udaf(meanAgg)
Using it with Window:
val df = Seq(
  (1, 2), (1, 5),
  (2, 3), (2, 1)
).toDF("id", "value")
df.withColumn("mean", meanUdaf($"value").over(Window.partitionBy($"id"))).show
//+---+-----+----+
//| id|value|mean|
//+---+-----+----+
//| 1| 2| 3.5|
//| 1| 5| 3.5|
//| 2| 3| 2.0|
//| 2| 1| 2.0|
//+---+-----+----+
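If you also need the same aggregation from SQL, the Aggregator-backed UDAF can be registered under a name and used with an OVER clause. A short sketch (the function name mean_udaf and the view name t are made up for illustration):
// Register the Aggregator-backed UDAF so it can be called from SQL
spark.udf.register("mean_udaf", udaf(meanAgg))
df.createOrReplaceTempView("t")
spark.sql("SELECT id, value, mean_udaf(value) OVER (PARTITION BY id) AS mean FROM t").show()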

Convert spark dataframe to sequence of sequences and vice versa in Scala [duplicate]

This question already has an answer here:
How to get Array[Seq[String]] from DataFrame?
(1 answer)
Closed 3 years ago.
I have a DataFrame and I want to convert it into a sequence of sequences and vice versa.
Now the thing is, I want to do it dynamically, and write something which runs for DataFrame with any number/type of columns.
In summary, these are the questions:
1. How to convert Seq[Seq[String]] to a DataFrame?
2. How to convert a DataFrame to Seq[Seq[String]]?
3. How to perform 1, but also have the DataFrame infer the schema and decide column types by itself?
UPDATE 1
This is not a duplicate of the linked question, because the solution provided there is not dynamic: it either works for two columns, or the number of columns has to be hardcoded. I am looking for a dynamic solution.
This is how you can dynamically create a dataframe from Seq[Seq[String]]:
scala> val seqOfSeq = Seq(Seq("a","b", "c"),Seq("3","4", "5"))
seqOfSeq: Seq[Seq[String]] = List(List(a, b, c), List(3, 4, 5))
scala> val lengthOfRow = seqOfSeq(0).size
lengthOfRow: Int = 3
scala> val tempDf = sc.parallelize(seqOfSeq).toDF
tempDf: org.apache.spark.sql.DataFrame = [value: array<string>]
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias(s"col$i")): _*)
requiredDf: org.apache.spark.sql.DataFrame = [col0: string, col1: string ... 1 more field]
scala> requiredDf.show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| 3| 4| 5|
+----+----+----+
How to convert a DataFrame to Seq[Seq[String]]:
val newSeqOfSeq = requiredDf.collect().map(row => row.toSeq.map(_.toString).toSeq).toSeq
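One caveat: _.toString throws a NullPointerException if a cell is null. A null-safe variant (my own tweak, not part of the original answer):
// Null cells become null strings instead of throwing an NPE
val newSeqOfSeqSafe: Seq[Seq[String]] =
  requiredDf.collect().map(_.toSeq.map(cell => Option(cell).map(_.toString).orNull).toSeq).toSeq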
To use custom column names:
scala> val myCols = Seq("myColA", "myColB", "myColC")
myCols: Seq[String] = List(myColA, myColB, myColC)
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias( myCols(i) )): _*)
requiredDf: org.apache.spark.sql.DataFrame = [myColA: string, myColB: string ... 1 more field]
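Regarding question 3 (letting Spark infer the column types by itself): one possible approach is to join each inner sequence into a delimited line and let Spark's CSV reader infer the schema. A rough sketch, assuming no cell contains the delimiter:
import spark.implicits._
// One CSV line per row; inferSchema lets Spark pick the column types
val csvLines = seqOfSeq.map(_.mkString(",")).toDS()
val inferredDf = spark.read.option("inferSchema", "true").csv(csvLines)
inferredDf.printSchema()
// With the sample data above the columns stay strings (the first row contains letters);
// with all-numeric data they would be inferred as numeric types.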

Spark: reduce/aggregate by key

I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Using my own sample data, but here goes, step-wise:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map {
  case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
A better approach, avoiding the first groupByKey, is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // Key by (date, name), keep the count as the value for reduceByKey
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
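A further alternative does the whole thing in a single shuffle with aggregateByKey, building a per-date map of counts (the Map-based buffer is my own variation, not part of the answer above):
// Build a Map[name -> count] per date
val rddAlt = rdd1.aggregateByKey(Map.empty[String, Int])(
  (acc, v) => acc + (v._1 -> (acc.getOrElse(v._1, 0) + v._2)),                                 // fold one (name, count) into the map
  (m1, m2) => m2.foldLeft(m1) { case (acc, (k, n)) => acc + (k -> (acc.getOrElse(k, 0) + n)) } // merge partial maps
)
rddAlt.collect // e.g. Array((d2,Map(E -> 1)), (d1,Map(A -> 2, B -> 1)))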
As I stated in the comments, this can also be done with DataFrames for structured data, so run the code below.
// The RDD approach above should be enough, but here is the DataFrame alternative.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer result above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns (some renaming required and possibly sorting):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+
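The renaming (and sorting) mentioned above can be done directly in the aggregation, for example (the column name xys is arbitrary):
// Alias the collected column and order the output by key
val dfE = dfC
  .groupBy(col("c1"))
  .agg(collect_list(col("xy")).alias("xys"))
  .orderBy(col("c1"))
dfE.show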

Spark: How to convert a Dataset[(String , array[int])] to Dataset[(String, int, int, int, int, ...)] using Scala [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
The Dataset I have has the following schema: [String, array[int]]. Now, I want to convert it to a Dataset (or Dataframe) with the following schema: [String, int, int, int, ...]. Note that array[int] is dynamic, it can therefore have different length for different rows.
The problem stems from the fact that the tuple (String, Array[Int]) is a specific type and it is the same type no matter how many elements are in the Array.
On the other hand, the tuple (String, Int) is a different type from (String, Int, Int), which is different still from (String, Int, Int, Int), and so on. Being a strongly typed language, Scala doesn't easily allow for a method that takes one type as input and produces one of many possible, and unrelated, types as output.
Perhaps if you describe why you think you want to do this we can offer a better solution for your situation.
As @jwvh suggests, you probably cannot do this in a type-safe Dataset way. If you relax type safety, you can probably do this using DataFrames (assuming your arrays are not crazy long; I believe the number of columns is currently restricted to Int.MaxValue).
Here is the solution using (primarily) DataFrames on Spark 2.0.2:
We start with a toy example:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
// ds: org.apache.spark.sql.Dataset[(String, Array[Int])] = [_1: string, _2: array<int>]
ds.show()
+-----+-------------------+
| _1| _2|
+-----+-------------------+
|Hello| [1, 2, 3]|
|There|[1, 2, 10, 11, 100]|
| A| [5, 6, 7, 8]|
+-----+-------------------+
Next we compute the max length of the arrays we have (which we hope is not too long):
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
Next, we want a function to select an element of the array at a particular index. We express the array selection function as a UDF:
val u = udf((a: Seq[Int], i: Int) => if(a.size <= i) null.asInstanceOf[Int] else a(i)) // note: null.asInstanceOf[Int] evaluates to 0, so missing entries show up as 0
Now we create all the columns we want to generate:
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
Then hopefully we are done:
ds.select(columns:_*).show()
+-----+----+----+----+----+----+
| _1|a[0]|a[1]|a[2]|a[3]|a[4]|
+-----+----+----+----+----+----+
|Hello| 1| 2| 3| 0| 0|
|There| 1| 2| 10| 11| 100|
| A| 5| 6| 7| 8| 0|
+-----+----+----+----+----+----+
Here is the complete code for copy-paste:
import org.apache.spark.sql.functions._
import spark.implicits._
val ds = spark.createDataset(("Hello", Array(1,2,3)) :: ("There", Array(1,2,10,11,100)) :: ("A", Array(5,6,7,8)) :: Nil)
val maxLen = ds.select(max(size($"_2")).as[Long]).collect().head
val u = udf((a: Seq[Int], i: Int) => if(a.size <= i) null.asInstanceOf[Int] else a(i))
val columns = ds.col("_1") +: (for(i <- 0 until maxLen.toInt ) yield u(ds.col("_2"), lit(i)).as(s"a[$i]"))
ds.select(columns:_*).show()
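As a side note, a similar result can be obtained without a UDF by indexing the array column directly with getItem; out-of-range indices then come back as null rather than 0 (a sketch, not part of the original answer):
// Column.getItem returns null for indices beyond the array length
val columns2 = ds.col("_1") +: (0 until maxLen.toInt).map(i => ds.col("_2").getItem(i).as(s"a[$i]"))
ds.select(columns2: _*).show()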

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to write a scenario in Spark using the Scala API.
I am passing a user-defined function to a DataFrame, which processes each row one by one and returns a tuple (Row, Row). How can I change the resulting RDD[(Row, Row)] into a DataFrame of single Rows? See the code sample below:
Calling the map function:
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
UDF definition:
def add(x: Row, dates: String*): (Row, Row) = {
  // ... (processing logic elided)
  var result1, result2: Row = Row()
  // ... (results populated here)
  (result1, result2)
}
Now df_temp is an RDD of (Row, Row) tuples. My requirement is to turn it into a single RDD or DataFrame by breaking each tuple into two records, i.e. an RDD[Row]. Appreciate your help.
You can use flatMap to flatten your Row tuples, say if we start from this example rdd:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
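For reference, the example RDD of Row pairs shown above could be built like this (a hypothetical construction, just to make the snippet reproducible):
import org.apache.spark.sql.Row
// Two (Row, Row) tuples matching the collected output above
val rddExample = sc.parallelize(Seq((Row(1, 2), Row(3, 4)), (Row(2, 1), Row(4, 2))))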
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to a DataFrame:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true) ::
  StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+