Scala DataTable to List of Maps

Could you please suggest how I can implement the following:
I have a DataTable in a Cucumber feature file such as:
|A |B |C |
|1 |2 |3 |
|11 |22 |33 |
|111|222|333|
I'm trying to get a List of Maps like this:
A:1,11,111; B:2,22,222; C:3,33,333
If I do it like this:
List[Map[String, Any]] =
data.asMaps(classOf[String], classOf[Any]).asScala.map(_.asScala.toMap).toList
I get something a bit different: A:1, B:2, C:3, A:11 ...

Transpose, and then map to Maps.
val source = List(
  List("A", "B", "C"),
  List(1, 2, 3),
  List(11, 22, 33),
  List(111, 222, 333)
)
val transposed = source.transpose
println(transposed) // List(List(A, 1, 11, 111), List(B, 2, 22, 222), List(C, 3, 33, 333))
val mapped = transposed.map(l => Map(l.head -> l.tail))
println(mapped) // List(Map(A -> List(1, 11, 111)), Map(B -> List(2, 22, 222)), Map(C -> List(3, 33, 333)))
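For completeness, a minimal sketch of applying the same idea to the Cucumber DataTable from the question. It assumes data is the DataTable passed to the step definition and that the raw cells are available via asLists(classOf[String]) (the exact accessor varies between Cucumber versions):
import scala.collection.JavaConverters._
// Assumption: data is the Cucumber DataTable from the step definition.
val source: List[List[String]] =
  data.asLists(classOf[String]).asScala.map(_.asScala.toList).toList
// Transpose so each inner list starts with the header, then build header -> values maps.
val byColumn: List[Map[String, List[String]]] =
  source.transpose.map(l => Map(l.head -> l.tail))
// List(Map(A -> List(1, 11, 111)), Map(B -> List(2, 22, 222)), Map(C -> List(3, 33, 333)))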

Conditional Spark map() function based on input columns

What I'm trying to achieve here is to send the Spark SQL map function a set of conditionally generated columns, depending on whether they contain null, 0, or any other value I may want to filter on.
Take for example this initial DF.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
From that initial DataFrame I want to generate yet another column which will be a map, like this.
initialDF.withColumn("thisMap", MY_FUNCTION)
My current approach is basically to take a Seq[String] of column names in a method and flatMap them into the key-value pairs that the Spark SQL map function receives, like this:
def toMap(columns: String*): Column = {
  map(
    columns.flatMap(column => List(lit(column), col(column))): _*
  )
}
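For reference, this is presumably invoked along these lines (the question doesn't show the call site, so the exact usage is an assumption):
// Hypothetical call site: pass every column name of the DataFrame to toMap.
initialDF.withColumn("thisMap", toMap(initialDF.columns: _*))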
But then, filtering becomes a Scala thing and is quite a mess.
What I would like to obtain after the processing would be, for each of those rows, the next DataFrame.
val initialDF = Seq(
("a", "b", 1, Map("field1" -> "a", "field2" -> "b", "field3" -> 1)),
("a", "b", null, Map("field1" -> "a", "field2" -> "b")),
("a", null, 0, Map("field1" -> "a"))
)
.toDF("field1", "field2", "field3", "thisMap")
I was wondering if this can be achieved using the Column API, which is way more intuitive with .isNull or .equalTo?
Here's a small improvement on Lamanus' answer (below), which only loops over df.columns once:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
df.withColumn("thisMap", map_concat(
df.columns.map { colName =>
when(col(colName).isNull or col(colName) === 0, map())
.otherwise(map(lit(colName), col(colName)))
}: _*
)).show(false)
// +------+------+------+---------------------------------------+
// |field1|field2|field3|thisMap |
// +------+------+------+---------------------------------------+
// |a |b |1 |[field1 -> a, field2 -> b, field3 -> 1]|
// |a |b |null |[field1 -> a, field2 -> b] |
// |a |null |0 |[field1 -> a] |
// +------+------+------+---------------------------------------+
UPDATE
I found a way to achieve the expected result but it is a bit dirty.
val df2 = df.columns.foldLeft(df) { (df, n) => df.withColumn(n + "_map", map(lit(n), col(n))) }
val col_cond = df.columns.map(n =>
  when(not(col(n + "_map").getItem(n).isNull || col(n + "_map").getItem(n) === lit("0")), col(n + "_map"))
    .otherwise(map()))
df2.withColumn("map", map_concat(col_cond: _*))
  .show(false)
ORIGINAL
Here is my attempt with the function map_from_arrays, which is available in Spark 2.4+.
df.withColumn("array", array(df.columns.map(col): _*))
.withColumn("map", map_from_arrays(lit(df.columns), $"array")).show(false)
Then, the result is:
+------+------+------+---------+---------------------------------------+
|field1|field2|field3|array |map |
+------+------+------+---------+---------------------------------------+
|a |b |1 |[a, b, 1]|[field1 -> a, field2 -> b, field3 -> 1]|
|a |b |null |[a, b,] |[field1 -> a, field2 -> b, field3 ->] |
|a |null |0 |[a,, 0] |[field1 -> a, field2 ->, field3 -> 0] |
+------+------+------+---------+---------------------------------------+

Scala Transformation and action

I have an RDD of (String, List[Int]) pairs, like List(("A", List(1,2,3,4)), ("B", List(5,6,7))).
How can I transform it to List(("A",1),("A",2),("A",3),("A",4),("B",5),("B",6),("B",7))?
Then the action would reduce by key and generate a result like List(("A",2.5), ("B",6)), where 2.5 is the average for "A" and 6 is the average for "B".
I have tried using map(e => List(e._1, e._2)), but it's not giving the desired result.
Help me with this set of transformations and actions.
Thanks in advance.
There are several ways to get what you want. You could use a for comprehension as well (see the sketch after the output below), but the very first one that came to my mind is this implementation:
val l = List(("A", List(1, 2, 3)), ("B", List(1, 2, 3)))
val flattenList = l.flatMap {
  case (elem, elemList) =>
    elemList.map((elem, _))
}
Output:
List((A,1), (A,2), (A,3), (B,1), (B,2), (B,3))
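A minimal sketch of the for-comprehension alternative mentioned above, which desugars to the same flatMap/map combination:
val flattenList2 = for {
  (elem, elemList) <- l
  value <- elemList
} yield (elem, value)
// List((A,1), (A,2), (A,3), (B,1), (B,2), (B,3))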
If what you want in the end is the average of each list, then it's not necessary to break it up into individual elements with a flatMap. Doing so would unnecessarily shuffle a lot of data on a large data set.
Since they are already aggregated by key, just transform them with something like this:
val l = spark.sparkContext.parallelize(Seq(
  ("A", List(1, 2, 3, 4)),
  ("B", List(5, 6, 7))
))
val avg = l.map(r => (r._1, r._2.sum.toDouble / r._2.length))
avg.collect.foreach(println)
Bear in mind that this will fail if any of your lists have length 0. If you have some zero-length lists, you'll have to put a check condition in the map (see the sketch after the output below).
The above code gives you:
(A,2.5)
(B,6.0)
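A minimal sketch of such a guard, assuming an empty list should simply map to 0.0 (the question does not say what an empty key should produce):
val avgSafe = l.map { case (key, values) =>
  // Guard against zero-length lists to avoid dividing by zero.
  val mean = if (values.isEmpty) 0.0 else values.sum.toDouble / values.length
  (key, mean)
}
avgSafe.collect.foreach(println)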
You can try explode()
scala> val df = List(("A",List(1,2,3,4)),("B",List(5,6,7))).toDF("x","y")
df: org.apache.spark.sql.DataFrame = [x: string, y: array<int>]
scala> df.withColumn("z",explode('y)).show(false)
+---+------------+---+
|x |y |z |
+---+------------+---+
|A |[1, 2, 3, 4]|1 |
|A |[1, 2, 3, 4]|2 |
|A |[1, 2, 3, 4]|3 |
|A |[1, 2, 3, 4]|4 |
|B |[5, 6, 7] |5 |
|B |[5, 6, 7] |6 |
|B |[5, 6, 7] |7 |
+---+------------+---+
scala> val df2 = df.withColumn("z",explode('y))
df2: org.apache.spark.sql.DataFrame = [x: string, y: array<int> ... 1 more field]
scala> df2.groupBy("x").agg(sum('z)/count('z) ).show(false)
+---+-------------------+
|x |(sum(z) / count(z))|
+---+-------------------+
|B |6.0 |
|A |2.5 |
+---+-------------------+

How to replicate an element in Spark dataframe in Scala?

Suppose I have a DataFrame:
val testDf = sc.parallelize(Seq(
(1,2,"x", Array(1,2,3,4)))).toDF("one", "two", "X", "Array")
+---+---+---+------------+
|one|two| X| Array|
+---+---+---+------------+
| 1| 2| x|[1, 2, 3, 4]|
+---+---+---+------------+
I want to replicate the single elements, let's say 4 times, in order to achieve a single row DataFrame with each field as an array of four elements. The desired output would be:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
You can use the built-in array function to replicate a column of your choice n times.
Below is PoC code.
import org.apache.spark.sql.functions._
val replicate = (n: Int, colName: String) => array((1 to n).map(_ => col(colName)): _*)
val replicatedCol = Seq("one", "two", "X").map(s => replicate(4, s).as(s))
val cols = col("Array") +: replicatedCol
val testDf = sc.parallelize(Seq((1, 2, "x", Array(1, 2, 3, 4))))
  .toDF("one", "two", "X", "Array")
  .select(cols: _*)
testDf.show(false)
+------------+------------+------------+------------+
|Array |one |two |X |
+------------+------------+------------+------------+
|[1, 2, 3, 4]|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|
+------------+------------+------------+------------+
In case you want a different n for each column:
val testDf = sc.parallelize(Seq((1, 2, "x", Array(1, 2, 3, 4))))
  .toDF("one", "two", "X", "Array")
  .select(replicate(2, "one").as("one"), replicate(3, "X").as("X"), replicate(4, "two").as("two"), $"Array")
testDf.show(false)
+------+---------+------------+------------+
|one |X |two |Array |
+------+---------+------------+------------+
|[1, 1]|[x, x, x]|[2, 2, 2, 2]|[1, 2, 3, 4]|
+------+---------+------------+------------+
Well, here is my solution:
First declare the columns you want to replicate:
val columnsToReplicate = List("one", "two", "X")
Then define the replication factor and the udf to perform it:
val replicationFactor = 4
val replicate = (s: String) => {
  for {
    i <- 1 to replicationFactor
  } yield s
}
val replicateudf = functions.udf(replicate)
Then just perform the foldLeft on the DataFrame whenever the column name belongs to your list of desired column names:
testDf.columns.foldLeft(testDf) { (acc, colname) =>
  if (columnsToReplicate.contains(colname)) acc.withColumn(colname, replicateudf(acc.col(colname)))
  else acc
}
Output:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
Note: You need this import:
import org.apache.spark.sql.functions
EDIT:
A variable replicationFactor, as suggested in the comments:
val mapColumnsToReplicate = Map("one" -> 4, "two" -> 5, "X" -> 6)
val replicateudf2 = functions.udf((s: String, replicationFactor: Int) =>
  for {
    i <- 1 to replicationFactor
  } yield s
)
testDf.columns.foldLeft(testDf) { (acc, colname) =>
  if (mapColumnsToReplicate.keys.toList.contains(colname))
    acc.withColumn(colname, replicateudf2($"$colname", functions.lit(mapColumnsToReplicate(colname))))
  else acc
}
Output with those values above:
+------------+---------------+------------------+------------+
| one| two| X| Array|
+------------+---------------+------------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2, 2]|[x, x, x, x, x, x]|[1, 2, 3, 4]|
+------------+---------------+------------------+------------+
You can use explode and groupBy/collect_list:
val testDf = sc.parallelize(
  Seq((1, 2, "x", Array(1, 2, 3, 4)),
      (3, 4, "y", Array(1, 2, 3)),
      (5, 6, "z", Array(1)))
).toDF("one", "two", "X", "Array")

testDf
  .withColumn("id", monotonically_increasing_id())
  .withColumn("tmp", explode($"Array"))
  .groupBy($"id")
  .agg(
    collect_list($"one").as("cl_one"),
    collect_list($"two").as("cl_two"),
    collect_list($"X").as("cl_X"),
    first($"Array").as("Array")
  )
  .select(
    $"cl_one".as("one"),
    $"cl_two".as("two"),
    $"cl_X".as("X"),
    $"Array"
  )
  .show()
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
| [5]| [6]| [z]| [1]|
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
| [3, 3, 3]| [4, 4, 4]| [y, y, y]| [1, 2, 3]|
+------------+------------+------------+------------+
This solution has the advantage that it does not rely on constant array sizes.

efficient way to reformat/shift time series data using Spark

I want to build some time series models using Spark. The first step is to reformat the sequence data into training samples. The idea is:
original sequential data (each t* is a number)
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
desired output
t1 t2 t3 t4 t5 t6
t2 t3 t4 t5 t6 t7
t3 t4 t5 t6 t7 t8
..................
How can I write a function in Spark to do this?
The function signature should be something like
reformat(Array[Integer], n: Integer)
and the return type should be a DataFrame or Vector.
==========The code I tried on Spark 1.6.1 =========
val arraydata=Array[Double](1,2,3,4,5,6,7,8,9,10)
val slideddata = arraydata.sliding(4).toSeq
val rows = arraydata.sliding(4).map{x=>Row(x:_*)}
sc.parallelize(arraydata.sliding(4).toSeq).toDF("Values")
The final line does not compile, failing with the error:
Error:(52, 48) value toDF is not a member of org.apache.spark.rdd.RDD[Array[Double]]
sc.parallelize(arraydata.sliding(4).toSeq).toDF("Values")
I was not able to figure out the significance of n, since it could be used either as the window size or as the value by which to shift.
Hence, both flavours are below:
If n is the window size :
def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
sc.parallelize(arrayOfInteger.sliding(shiftValue).toSeq).toDF("values")
}
On REPL:
scala> def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
| sc.parallelize(arrayOfInteger.sliding(shiftValue).toSeq).toDF("values")
| }
reformat: (arrayOfInteger: Array[Int], shiftValue: Int)org.apache.spark.sql.DataFrame
scala> val arrayofInteger=(1 to 10).toArray
arrayofInteger: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> reformat(arrayofInteger,3).show
+----------+
| values|
+----------+
| [1, 2, 3]|
| [2, 3, 4]|
| [3, 4, 5]|
| [4, 5, 6]|
| [5, 6, 7]|
| [6, 7, 8]|
| [7, 8, 9]|
|[8, 9, 10]|
+----------+
If n is the value to be shifted:
def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
val slidingValue=arrayOfInteger.size-shiftValue
sc.parallelize(arrayOfInteger.sliding(slidingValue).toSeq).toDF("values")
}
On REPL:
scala> def reformat(arrayOfInteger:Array[Int], shiftValue: Int) ={
| val slidingValue=arrayOfInteger.size-shiftValue
| sc.parallelize(arrayOfInteger.sliding(slidingValue).toSeq).toDF("values")
| }
reformat: (arrayOfInteger: Array[Int], shiftValue: Int)org.apache.spark.sql.DataFrame
scala> val arrayofInteger=(1 to 10).toArray
arrayofInteger: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> reformat(arrayofInteger,3).show(false)
+----------------------+
|values |
+----------------------+
|[1, 2, 3, 4, 5, 6, 7] |
|[2, 3, 4, 5, 6, 7, 8] |
|[3, 4, 5, 6, 7, 8, 9] |
|[4, 5, 6, 7, 8, 9, 10]|
+----------------------+

Sum the Distance in Apache-Spark dataframes

The following code gives a DataFrame having three values in each column, as shown below.
import org.graphframes._
import org.apache.spark.sql.DataFrame
val v = sqlContext.createDataFrame(List(
("1", "Al"),
("2", "B"),
("3", "C"),
("4", "D"),
("5", "E")
)).toDF("id", "name")
val e = sqlContext.createDataFrame(List(
("1", "3", 5),
("1", "2", 8),
("2", "3", 6),
("2", "4", 7),
("2", "1", 8),
("3", "1", 5),
("3", "2", 6),
("4", "2", 7),
("4", "5", 8),
("5", "4", 8)
)).toDF("src", "dst", "property")
val g = GraphFrame(v, e)
val paths: DataFrame = g.bfs.fromExpr("id = '1'").toExpr("id = '5'").run()
paths.show()
val df=paths
df.select(df.columns.filter(_.startsWith("e")).map(df(_)) : _*).show
The output of the above code is given below:
+-------+-------+-------+
| e0| e1| e2|
+-------+-------+-------+
|[1,2,8]|[2,4,7]|[4,5,8]|
+-------+-------+-------+
In the above output, we can see that each column has three values and they can be interpreted as follows.
e0: source 1, destination 2, distance 8
e1: source 2, destination 4, distance 7
e2: source 4, destination 5, distance 8
Basically e0, e1, and e2 are the edges. I want to sum the third element of each column, i.e. add the distance of each edge to get the total distance. How can I achieve this?
It can be done like this:
import org.apache.spark.sql.functions.col

val total = df.columns.filter(_.startsWith("e"))
  .map(c => col(s"$c.property")) // or col(c).getItem("property")
  .reduce(_ + _)
df.withColumn("total", total)
I would make a collection of the columns to sum and then use a foldLeft on a UDF:
scala> val df = Seq((Array(1,2,8),Array(2,4,7),Array(4,5,8))).toDF("e0", "e1", "e2")
df: org.apache.spark.sql.DataFrame = [e0: array<int>, e1: array<int>, e2: array<int>]
scala> df.show
+---------+---------+---------+
| e0| e1| e2|
+---------+---------+---------+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]|
+---------+---------+---------+
scala> val colsToSum = df.columns
colsToSum: Array[String] = Array(e0, e1, e2)
scala> val accLastUDF = udf((acc: Int, col: Seq[Int]) => acc + col.last)
accLastUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List(IntegerType, ArrayType(IntegerType,false)))
scala> df.withColumn("dist", colsToSum.foldLeft(lit(0))((acc, colName) => accLastUDF(acc, col(colName)))).show
+---------+---------+---------+----+
| e0| e1| e2|dist|
+---------+---------+---------+----+
|[1, 2, 8]|[2, 4, 7]|[4, 5, 8]| 23|
+---------+---------+---------+----+