Transpose in Spark scala logic - scala

I have the following DataFrame in Spark (Scala):
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  6|  7|  8|
|  9| 10| 11| 12|
| 13| 14| 15| 16|
+---+---+---+---+
My code turns every row into a map. The code I tried is:
df.select(map(df.columns.flatMap(c => Seq(lit(c),col(c))):_*).as("map"))
This gives a Map(String -> String) column with only 4 records:
Map(a->1,b->2,c->3,d->4)
Map(a->5,b->6,c->7,d->8)
Map(a->9,b->10,c->11,d->12)
Map(a->13,b->14,c->15,d->16)
But I want one key/value pair per record, like below:
a->1
b->2
c->3
d->4
a->5
b->6
c->7
d->8
a->9
b->10
c->11
d->12
a->13
b->14
c->15
d->16
Any suggestions on what to change or add to get the desired result? I think it needs some kind of transpose logic. I am fairly new to Scala.

Use explode to turn each map entry into its own row. Try the code below.
df.select(map(df.columns.flatMap(c => Seq(lit(c),col(c))):_*).as("map"))
.select(explode($"map"))
.show(false)
Without a map, using an array of structs instead:
val colExpr = array(
  df.columns.map(c => struct(lit(c).as("key"), col(c).as("value"))): _*
).as("map")
df
  .select(colExpr)
  .select(explode($"map").as("map"))
  .select($"map.*")
  .show(false)
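To see what the two steps do without a Spark session, here is a minimal sketch on plain Scala collections. It is not Spark code: the object name ExplodeSketch and the in-memory rows are assumptions made for illustration, with zip standing in for map/array and flatMap standing in for explode.

```scala
object ExplodeSketch {
  // Column names and rows of the example DataFrame, modelled as plain collections
  val columns: Seq[String] = Seq("a", "b", "c", "d")
  val rows: Seq[Seq[Int]] = Seq(
    Seq(1, 2, 3, 4),
    Seq(5, 6, 7, 8),
    Seq(9, 10, 11, 12),
    Seq(13, 14, 15, 16)
  )

  // Step 1 (the role of map/array): pair every value with its column name, per row.
  // Step 2 (the role of explode): flatten, so each pair becomes its own record.
  val exploded: Seq[(String, Int)] = rows.flatMap(row => columns.zip(row))

  def main(args: Array[String]): Unit =
    exploded.foreach { case (key, value) => println(s"$key->$value") }
}
```

Four rows with four columns each give sixteen key/value records, in the same order as the desired output in the question.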

Related

assign values to a new column depending on old column values in dataframe

I have assigned values to 4 variables in a conf or application.properties file,
A = 1
B = 2
C = 3
D = 4
I have a dataframe as follows,
+-----+
|name |
+-----+
| A |
| C |
| B |
| D |
| B |
+-----+
I want to add a new column that has the values assigned from the conf variables declared above for A,B,C,D respectively depending on the value in the name column.
Final Dataframe should have,
+----+----------+
|name|NAME_VALUE|
+----+----------+
| A | 1 |
| C | 3 |
| B | 2 |
| D | 4 |
| B | 2 |
+----+----------+
I tried using lit inside withColumn with conf.getInt($"name"), but that does not accept a Column where a String is required, so I have to hardcode the variable names in lit. Is there any way to pass the respective conf variable names dynamically, so that the values are assigned to the new column automatically in Spark Scala?
For the moment I don't have any idea how to do it the way you intended, with dynamic use of the val names.
My suggestion is to use a Seq of tuples instead of multiple vals. In that case you could create a UDF and map the value for each row, but you can also use a join, which I show in the example below:
val data = Seq(("A"),("C"), ("B"), ("D"), ("B"))
val df = data.toDF("name")
val mappings = Seq(("A",1), ("B",2), ("C",3), ("D",4))
val mappingsDf = mappings.toDF("name", "value")
df.join(broadcast(mappingsDf), df("name") === mappingsDf("name"), "left")
.select(
df("name"),
mappingsDf("value")
).show
output is as expected:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| C| 3|
| B| 2|
| D| 4|
| B| 2|
+----+-----+
This solution is pretty generic: your mappings are a DataFrame here, so you can hardcode them as shown in my example or load them from a CSV or JSON file with the Spark API.
Thanks to the broadcast join it should be quite efficient (remove this hint if you have a large number of mappings!).
I think it's easy to understand and maintain, as it uses only the Spark API and no UDF.
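The lookup behind the left join can be sketched on plain Scala collections. This is only an illustration of the semantics, not Spark code; the object name MappingLookupSketch and the helper withValue are hypothetical.

```scala
object MappingLookupSketch {
  val names: Seq[String] = Seq("A", "C", "B", "D", "B")
  val mappings: Map[String, Int] = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)

  // Left-join semantics: every name is kept; a name without a mapping yields None
  // (the DataFrame join would show null in that position).
  def withValue(ns: Seq[String]): Seq[(String, Option[Int])] =
    ns.map(n => n -> mappings.get(n))

  val result: Seq[(String, Option[Int])] = withValue(names)

  def main(args: Array[String]): Unit =
    result.foreach { case (name, value) => println(s"$name -> ${value.getOrElse("null")}") }
}
```

The same Map could be captured in a UDF, which is the alternative the answer mentions; the join keeps the mappings as data, so they can be loaded from a file instead of being compiled in.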

How to display mismatched report with a label in spark 1.6 - scala except function?

Consider there are 2 dataframes df1 and df2.
df1 has below data
A | B
-------
1 | m
2 | n
3 | o
df2 has below data
A | B
-------
1 | m
2 | n
3 | p
df1.except(df2) returns
A | B
-------
3 | o
3 | p
How to display the result as below?
df1: 3 | o
df2: 3 | p
As per the API docs, df1.except(df2) returns a new DataFrame containing rows in this frame but not in another frame, i.e. it returns the rows that are in df1 and not in df2. A custom except function could therefore be written as:
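import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Label the rows that exist on only one side with the name of their source frame
def except(df1: DataFrame, df2: DataFrame): DataFrame = {
  val edf1 = df1.except(df2).withColumn("df", lit("df1"))
  val edf2 = df2.except(df1).withColumn("df", lit("df2"))
  edf1.unionAll(edf2) // unionAll on Spark 1.6; union replaces it from Spark 2.0 on
}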
//Output
+---+---+---+
| A| B| df|
+---+---+---+
| 3| o|df1|
| 3| p|df2|
+---+---+---+
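The two-directional difference can be checked on plain Scala collections. This is a rough sketch rather than Spark code: the object name LabeledExceptSketch and the Row type alias are made up for illustration, and the de-duplication assumes that except behaves like a set difference.

```scala
object LabeledExceptSketch {
  type Row = (Int, String)

  val df1: Seq[Row] = Seq((1, "m"), (2, "n"), (3, "o"))
  val df2: Seq[Row] = Seq((1, "m"), (2, "n"), (3, "p"))

  // Rows of `left` that do not occur in `right`, de-duplicated like a set difference
  def except(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    val rightSet = right.toSet
    left.distinct.filterNot(rightSet.contains)
  }

  // Run the difference in both directions and tag each row with its source
  val labeled: Seq[(String, Row)] =
    except(df1, df2).map("df1" -> _) ++ except(df2, df1).map("df2" -> _)

  def main(args: Array[String]): Unit =
    labeled.foreach { case (source, (a, b)) => println(s"$source: $a | $b") }
}
```

A single difference only reports the rows missing from the other side, which is why both directions are needed before the results are combined.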

Histogram -Doing it in a parallel way

+----+----+--------+
| Id | M1 | trx |
+----+----+--------+
| 1 | M1 | 11.35 |
| 2 | M1 | 3.4 |
| 3 | M1 | 10.45 |
| 2 | M1 | 3.95 |
| 3 | M1 | 20.95 |
| 2 | M2 | 25.55 |
| 1 | M2 | 9.95 |
| 2 | M2 | 11.95 |
| 1 | M2 | 9.65 |
| 1 | M2 | 14.54 |
+----+----+--------+
With the above DataFrame I can generate a histogram using the code below.
A similar question is here.
val (Range,counts) = df
.select(col("trx"))
.rdd.map(r => r.getDouble(0))
.histogram(10)
// Range: Array[Double] = Array(3.4, 5.615, 7.83, 10.045, 12.26, 14.475, 16.69, 18.905, 21.12, 23.335, 25.55)
// counts: Array[Long] = Array(2, 0, 2, 3, 0, 1, 0, 1, 0, 1)
But the issue is: how can I create the histograms in parallel based on column 'M1'? That is, I need two histogram outputs, one for each of the column values M1 and M2.
First, you need to know that histogram generates two separate sequential jobs. One to detect the minimum and maximum of your data, one to compute the actual histogram. You can check this using the Spark UI.
We can follow the same scheme to build histograms on as many columns as you wish, with only two jobs. Yet, we cannot use the histogram function which is only meant to handle one collection of doubles. We need to implement it by ourselves. The first job is dead simple.
val Row(min_trx : Double, max_trx : Double) = df.select(min('trx), max('trx)).head
Then we compute the ranges of the histogram locally. Note that I use the same ranges for all values of the M1 column. This makes it easy to compare the results between them (by plotting them on the same figure). Having different ranges per value would only be a small modification of this code, though.
val hist_size = 10
val hist_step = (max_trx - min_trx) / hist_size
val hist_ranges = (1 until hist_size)
.scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
// I add max_trx manually to avoid rounding errors that would exclude the value
That was the first part. Then, we can use a UDF to determine in what range each value ends up, and compute all the histograms in parallel with spark.
val range_index = udf((x : Double) => hist_ranges.lastIndexWhere(x >= _))
val hist_df = df
.withColumn("rangeIndex", range_index('trx))
.groupBy("M1", "rangeIndex")
.count()
// And voilà, all the data you need is there.
hist_df.show()
+---+----------+-----+
| M1|rangeIndex|count|
+---+----------+-----+
| M2| 2| 2|
| M1| 0| 2|
| M2| 5| 1|
| M1| 3| 2|
| M2| 3| 1|
| M1| 7| 1|
| M2| 10| 1|
+---+----------+-----+
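The bucket assignment can be reproduced on plain Scala collections, which makes it easy to check the table above by hand. This sketch is not Spark code; the object name HistogramIndexSketch and the camelCase names are assumptions, and the data is copied from the example table.

```scala
object HistogramIndexSketch {
  // (M1, trx) pairs from the example table
  val data: Seq[(String, Double)] = Seq(
    "M1" -> 11.35, "M1" -> 3.4, "M1" -> 10.45, "M1" -> 3.95, "M1" -> 20.95,
    "M2" -> 25.55, "M2" -> 9.95, "M2" -> 11.95, "M2" -> 9.65, "M2" -> 14.54
  )

  val histSize = 10
  val minTrx: Double = data.map(_._2).min
  val maxTrx: Double = data.map(_._2).max
  val histStep: Double = (maxTrx - minTrx) / histSize

  // histSize + 1 edges; the maximum is appended explicitly to avoid rounding drift
  val histRanges: Seq[Double] =
    (1 until histSize).scanLeft(minTrx)((a, _) => a + histStep) :+ maxTrx

  // Index of the last edge that is <= x (the same rule as the UDF)
  def rangeIndex(x: Double): Int = histRanges.lastIndexWhere(x >= _)

  // Counterpart of groupBy("M1", "rangeIndex").count()
  val counts: Map[(String, Int), Int] =
    data.groupBy { case (m1, trx) => (m1, rangeIndex(trx)) }
      .map { case (key, rows) => key -> rows.size }

  def main(args: Array[String]): Unit =
    counts.toSeq.sortBy(_._1).foreach { case ((m1, idx), n) => println(s"$m1 $idx $n") }
}
```

Note that the maximum value lands on index 10, one past the last regular bucket, because it equals the final edge. RDD.histogram instead counts the maximum in the last bucket, so clamp the index to hist_size - 1 if the two results need to match.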
As a bonus, you can shape the data to use it locally (within the driver), either using the RDD API or by collecting the dataframe and modifying it in scala.
Here is one way to do it with spark since this is a question about spark ;-)
val hist_map = hist_df.rdd
  .map(row => row.getAs[String]("M1") ->
    (row.getAs[Int]("rangeIndex"), row.getAs[Long]("count")))
  .groupByKey
  .mapValues(_.toMap)
  // rangeIndex runs from 0 to hist_size (hist_size is reached only by the maximum value)
  .mapValues(hists => (0 to hist_size)
    .map(i => hists.getOrElse(i, 0L)).toArray)
  .collectAsMap
EDIT: how to build one range per column value:
Instead of computing a single min and max of trx, we compute them for each value of the M1 column with groupBy.
val min_max_map = df.groupBy("M1")
.agg(min('trx), max('trx))
.rdd.map(row => row.getAs[String]("M1") ->
(row.getAs[Double]("min(trx)"), row.getAs[Double]("max(trx)")))
.collectAsMap // maps each column value to a tuple (min, max)
Then we adapt the UDF so that it uses this map and we are done.
// for clarity, let's define a function that generates histogram ranges
def generate_ranges(min_trx : Double, max_trx : Double, hist_size : Int) = {
val hist_step = (max_trx - min_trx) / hist_size
(1 until hist_size).scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
}
// and use it to generate one range per column value
val range_map = min_max_map.keys
.map(key => key ->
generate_ranges(min_max_map(key)._1, min_max_map(key)._2, hist_size))
.toMap
val range_index = udf((x : Double, m1 : String) =>
range_map(m1).lastIndexWhere(x >= _))
Finally, just replace range_index('trx) by range_index('trx, 'M1) and you will have one range per column value.
The way I do histograms with Spark is as follows:
val binEdges = 0.0 to 25.0 by 5.0
val bins = binEdges.init.zip(binEdges.tail).toDF("bin_from","bin_to")
df
.join(bins,$"trx">=$"bin_from" and $"trx"<$"bin_to","right")
.groupBy($"bin_from",$"bin_to")
.agg(
count($"trx").as("count")
// add more, e.g. sum($"trx")
)
.orderBy($"bin_from",$"bin_to")
.show()
gives:
+--------+------+-----+
|bin_from|bin_to|count|
+--------+------+-----+
| 0.0| 5.0| 2|
| 5.0| 10.0| 2|
| 10.0| 15.0| 4|
| 15.0| 20.0| 0|
| 20.0| 25.0| 1|
+--------+------+-----+
Now if you have more dimensions, just add that to the groupBy-clause
df
.join(bins,$"trx">=$"bin_from" and $"trx"<$"bin_to","right")
.groupBy($"M1",$"bin_from",$"bin_to")
.agg(
count($"trx").as("count")
)
.orderBy($"M1",$"bin_from",$"bin_to")
.show()
gives:
+----+--------+------+-----+
| M1|bin_from|bin_to|count|
+----+--------+------+-----+
|null| 15.0| 20.0| 0|
| M1| 0.0| 5.0| 2|
| M1| 10.0| 15.0| 2|
| M1| 20.0| 25.0| 1|
| M2| 5.0| 10.0| 2|
| M2| 10.0| 15.0| 2|
+----+--------+------+-----+
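The bin construction and the half-open comparison can be tried on plain collections first. This is a small sketch, not Spark code; the object name BinCountSketch and the unbinned helper value are assumptions, and the edges are built from integers so that no fractional range step is needed.

```scala
object BinCountSketch {
  // trx values from the example table
  val trx: Seq[Double] =
    Seq(11.35, 3.4, 10.45, 3.95, 20.95, 25.55, 9.95, 11.95, 9.65, 14.54)

  // Bin edges 0, 5, ..., 25
  val binEdges: Seq[Double] = (0 to 5).map(_ * 5.0)

  // Consecutive edges form half-open bins [from, to)
  val bins: Seq[(Double, Double)] = binEdges.init.zip(binEdges.tail)

  // Counting per bin keeps empty bins, like the right join does
  val counts: Seq[((Double, Double), Int)] =
    bins.map { case (from, to) => (from, to) -> trx.count(x => x >= from && x < to) }

  // Values that fall outside every bin are dropped silently
  val unbinned: Seq[Double] =
    trx.filterNot(x => bins.exists { case (from, to) => x >= from && x < to })

  def main(args: Array[String]): Unit = {
    counts.foreach { case ((from, to), n) => println(s"[$from, $to): $n") }
    println(s"outside all bins: $unbinned")
  }
}
```

With these edges the value 25.55 is not counted anywhere, which is why the counts in the first table add up to 9 rather than 10. Extending the edges to 30 would cover it.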
You may need to tweak the code a bit to get the output you want, but this should get you started. You could also use the UDAF approach I posted here: Spark custom aggregation : collect_list+UDF vs UDAF
I think it's not easily possible using RDDs, because histogram is only available on DoubleRDD, i.e. RDDs of Double. If you really need to use the RDD API, you can run the jobs in parallel using Scala's parallel collections:
import scala.collection.parallel.immutable.ParSeq
val List((rangeM1,histM1),(rangeM2,histM2)) = ParSeq("M1","M2")
.map(c => df.where($"M1"===c)
.select(col("trx"))
.rdd.map(r => r.getDouble(0))
.histogram(10)
).toList
println(rangeM1.toSeq,histM1.toSeq)
println(rangeM2.toSeq,histM2.toSeq)
gives:
(WrappedArray(3.4, 5.155, 6.91, 8.665000000000001, 10.42, 12.175, 13.930000000000001, 15.685, 17.44, 19.195, 20.95),WrappedArray(2, 0, 0, 0, 2, 0, 0, 0, 0, 1))
(WrappedArray(9.65, 11.24, 12.83, 14.420000000000002, 16.01, 17.6, 19.19, 20.78, 22.37, 23.96, 25.55),WrappedArray(2, 1, 0, 1, 0, 0, 0, 0, 0, 1))
Note that the bins differ here for M1 and M2

Map a multimap to columns of dataframe

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
in the following Spark dataframe:
+-------+--------------+-------+
|rownum | plant        | tipo  |
+-------+--------------+-------+
| 1     | Melfi        | gomme |
| 2     | Pomigliano   | telaio|
| 3     | Torino       | null  |
+-------+--------------+-------+
replacing missing values with null. My issue is how to apply a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance
As noted in my comments, I'm really not sure the multimap format is well suited to your problem (did you have a look at the Spark XML parsing modules?). That said, here is one way to do it.
The pivot table solution
The idea is to flatten your input table into an (elementPosition, columnName, columnValue) format:
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max
// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx =>
  input.map { case (colName, allColValues) =>
    (rowIdx, colName, if (allColValues.size > rowIdx) allColValues(rowIdx) else null)
  })
// Probably faster at runtime to write it this way (less iterations) :
// val flatRows = input.flatMap({ case (colName, existingValues) => (0 until numberOfRows).zipAll(existingValues, null, null).map(t => (t._1.asInstanceOf[Int], colName, t._2)) })
// To dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
Will output :
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem :
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best looking solution, but it is fully scalable to any number of columns.
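For comparison, the same row-wise view can be built directly on the driver with plain Scala collections when the multimap is small. This is only a sketch of the padding logic, not Spark code; the object name MultimapRowsSketch and the explicit columns list are assumptions, and None stands in for null.

```scala
object MultimapRowsSketch {
  val input: Map[String, List[String]] = Map(
    "rownum" -> List("1", "2", "3"),
    "plant" -> List("Melfi", "Pomigliano", "Torino"),
    "tipo" -> List("gomme", "telaio")
  )

  // Fix the column order explicitly, since Map iteration order is not guaranteed
  val columns: Seq[String] = Seq("rownum", "plant", "tipo")

  // The longest list determines the number of rows
  val numberOfRows: Int = input.values.map(_.size).max

  // lift(i) returns None past the end of a list, which pads the short columns
  val rows: Seq[Seq[Option[String]]] =
    (0 until numberOfRows).map(i => columns.map(c => input(c).lift(i)))

  def main(args: Array[String]): Unit =
    rows.foreach(row => println(row.map(_.getOrElse("null")).mkString(" | ")))
}
```

The pivot approach above keeps the reshaping inside Spark, while this variant reshapes before any DataFrame is created, so it only suits inputs that fit comfortably in driver memory.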

Scala Spark - How to reduce a dataframe with many paired columns to a single pair of columns?

I have a DataFrame with many pairs of (score, count) columns.
This is not a pivot; it is closer to an unpivot.
Example:
|house_score | house_count | mobile_score | mobile_count | sport_score | sport_count | ....<other couple columns>.....|
| 20 2 48 6 6 78 |
| 40 78 47 74 69 6 |
I want a new DataFrame with only two columns, score and count, that stacks all of the column pairs into a single pair of columns.
_________________
| score | count |
| 20 | 2 |
| 40 | 78 |
| 48 | 6 |
| 47 | 74 |
| 6 | 78 |
| 69 | 6 |
|_______________|
What's the best solution (elegant code/performance)?
You can achieve this using a foldLeft over the column names (excluding the part after the _). This is reasonably efficient since all intensive operations are distributed, and the code is fairly clean and concise.
// df from example
val df = sc.parallelize(List((20,2,48,6,6,78), (40,78,47,74,69,6) )).toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")
// grab column names (part before the _)
val cols = df.columns.map(col => col.split("_")(0)).distinct
// fold left over all columns
val result = cols.tail.foldLeft(
// init with cols.head column
df.select(col(s"${cols.head}_score").as("score"), col(s"${cols.head}_count").as("count"))
){case (acc,c) => {
// union current column c
acc.unionAll(df.select(col(s"${c}_score").as("score"), col(s"${c}_count").as("count")))
}}
result.show
Using unionAll as suggested in another answer requires scanning the data multiple times, and on each scan projecting the DataFrame down to only 2 columns. From a performance perspective, scanning the data multiple times should be avoided if the work can be done in one pass, especially if you have large datasets that cannot be cached or you would need many scans.
You can do it in one pass by generating all the (score, count) tuples for each row and then flat mapping them. I'll let you decide how elegant it is:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val df = List((20,2,48,6,6,78), (40,78,47,74,69,6))
.toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")
df.show
val result = df
.flatMap(r => Range(0, 5, 2).map(i => (r.getInt(i), r.getInt(i + 1))))
.toDF("score", "count")
result.show
// Exiting paste mode, now interpreting.
+-----------+-----------+------------+------------+-----------+-----------+
|house_score|house_count|mobile_score|mobile_count|sport_score|sport_count|
+-----------+-----------+------------+------------+-----------+-----------+
| 20| 2| 48| 6| 6| 78|
| 40| 78| 47| 74| 69| 6|
+-----------+-----------+------------+------------+-----------+-----------+
+-----+-----+
|score|count|
+-----+-----+
| 20| 2|
| 48| 6|
| 6| 78|
| 40| 78|
| 47| 74|
| 69| 6|
+-----+-----+
df: org.apache.spark.sql.DataFrame = [house_score: int, house_count: int ... 4 more fields]
result: org.apache.spark.sql.DataFrame = [score: int, count: int]
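The pairing step inside the flatMap can be sketched on plain Scala collections. This is an illustration only, not Spark code; the object name UnpivotPairsSketch is made up, and grouped(2) is swapped in for the index arithmetic of Range(0, 5, 2) under the assumption that every row holds its columns in (score, count) order.

```scala
object UnpivotPairsSketch {
  // Each row holds (score, count) pairs side by side, as in the example DataFrame
  val rows: Seq[Seq[Int]] = Seq(
    Seq(20, 2, 48, 6, 6, 78),
    Seq(40, 78, 47, 74, 69, 6)
  )

  // grouped(2) walks a row in steps of two, so every row is read exactly once
  val pairs: Seq[(Int, Int)] =
    rows.flatMap(_.grouped(2).map { case Seq(score, count) => (score, count) })

  def main(args: Array[String]): Unit =
    pairs.foreach { case (score, count) => println(s"$score | $count") }
}
```

The result is ordered row by row, like the one-pass output above, whereas the union-based answer orders it column pair by column pair.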