Map a multimap to columns of dataframe - scala

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
in the following Spark dataframe:
+-------+--------------+-------+
|rownum | plant | tipo |
+-------+--------------+-------+
| 1 | Melfi | gomme |
| 2 | Pomigliano | telaio|
| 3 | Torino | null |
+-------+--------------+-------+
replacing missing values with "null" values. My issue is applying a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance

Although (see my comments) I'm really not sure the multimap format is well suited to your problem (did you have a look at the Spark XML parsing modules?), here is one approach.
The pivot table solution
The idea is to flatten your input table into a (elementPosition, columnName, columnValue) format:
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max
// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx => input.map({ case (colName, allColValues) => (rowIdx, colName, if(allColValues.size > rowIdx) allColValues(rowIdx) else null)}))
// Probably faster at runtime to write it this way (less iterations) :
// val flatRows = input.flatMap({ case (colName, existingValues) => (0 until numberOfRows).zipAll(existingValues, null, null).map(t => (t._1.asInstanceOf[Int], colName, t._2)) })
// To dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
Will output :
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem :
// Requires: import org.apache.spark.sql.functions.expr
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best looking solution, but it is fully scalable to any number of columns.
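The flattening step can be checked on plain Scala collections before involving Spark. The following is only a minimal sketch of that logic, not the Spark code itself: MultimapRows and toRows are hypothetical names, and missing values are represented as None instead of null.

```scala
object MultimapRows {
  // Hypothetical helper: turns (columnName, values) pairs into rows,
  // using None where a column has no value at that row index.
  def toRows(columns: Seq[(String, List[String])]): Seq[Seq[Option[String]]] = {
    // The longest list decides how many rows there are (0 for empty input).
    val numberOfRows = columns.map(_._2.size).foldLeft(0)(_ max _)
    (0 until numberOfRows).map { rowIdx =>
      columns.map { case (_, values) => values.lift(rowIdx) }
    }
  }

  // Same data as the question, kept as an ordered Seq so the column order is stable.
  val input: Seq[(String, List[String])] = Seq(
    "rownum" -> List("1", "2", "3"),
    "plant"  -> List("Melfi", "Pomigliano", "Torino"),
    "tipo"   -> List("gomme", "telaio")
  )

  def main(args: Array[String]): Unit =
    toRows(input).foreach(row => println(row.map(_.getOrElse("null")).mkString(" | ")))
}
```

Keeping the input as an ordered Seq of pairs rather than a Map preserves the column order of the question (rownum, plant, tipo).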

Related

How to create multiples columns from a MapType columns efficiently (without foldleft)

My goal is to create columns from another MapType column. The names of the columns being the keys of the Map and their associated values.
Below my starting dataframe:
+-----------+---------------------------+
|id | mapColumn |
+-----------+---------------------------+
| 1 |Map(keyA -> 0, keyB -> 1) |
| 2 |Map(keyA -> 4, keyB -> 2) |
+-----------+---------------------------+
Below the desired output:
+-----------+----+----+
|id |keyA|keyB|
+-----------+----+----+
| 1 | 0| 1|
| 2 | 4| 2|
+-----------+----+----+
I found a solution with a foldLeft with accumulators (it works but is extremely slow):
val colsToAdd = startDF.collect()(0)(1).asInstanceOf[Map[String,Integer]].map(x => x._1).toSeq
res1: Seq[String] = List(keyA, keyB)
val endDF = colsToAdd.foldLeft(startDF)((startDF, key) => startDF.withColumn(key, lit(0)))
//(lit(0) for testing)
The real starting dataframe being enormous, I need optimization.
You could simply use the explode function to explode the map type column and then use pivot to get each key as a new column. Something like this:
val df = Seq((1,Map("keyA" -> 0, "keyB" -> 1)), (2,Map("keyA" -> 4, "keyB" -> 2))
).toDF("id", "mapColumn")
df.select($"id", explode($"mapColumn"))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
.show()
Gives:
+---+----+----+
| id|keyA|keyB|
+---+----+----+
| 1| 0| 1|
| 2| 4| 2|
+---+----+----+
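To see what explode followed by pivot does to the data, here is a rough equivalent on local Scala collections. It is a sketch under assumptions: ExplodePivot and its two functions are hypothetical names, and a key that is missing for some id is shown as None (where Spark would show null).

```scala
object ExplodePivot {
  type Row = (Int, Map[String, Int])

  // "explode": one (id, key, value) triple per map entry.
  def explode(rows: Seq[Row]): Seq[(Int, String, Int)] =
    rows.flatMap { case (id, m) => m.toSeq.map { case (k, v) => (id, k, v) } }

  // "pivot": the sorted distinct keys become the columns;
  // each id maps to its values aligned with those columns.
  def pivot(triples: Seq[(Int, String, Int)]): (Seq[String], Map[Int, Seq[Option[Int]]]) = {
    val columns = triples.map(_._2).distinct.sorted
    val byId = triples.groupBy(_._1).map { case (id, ts) =>
      val values = ts.map(t => t._2 -> t._3).toMap
      id -> columns.map(values.get)
    }
    (columns, byId)
  }
}
```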

Histogram -Doing it in a parallel way

+----+----+--------+
| Id | M1 | trx |
+----+----+--------+
| 1 | M1 | 11.35 |
| 2 | M1 | 3.4 |
| 3 | M1 | 10.45 |
| 2 | M1 | 3.95 |
| 3 | M1 | 20.95 |
| 2 | M2 | 25.55 |
| 1 | M2 | 9.95 |
| 2 | M2 | 11.95 |
| 1 | M2 | 9.65 |
| 1 | M2 | 14.54 |
+----+----+--------+
With the above dataframe, I should be able to generate a histogram using the code below.
A similar question is here.
// Lowercase names are needed here: an uppercase name in a pattern is not bound as a new variable
val (ranges, counts) = df
.select(col("trx"))
.rdd.map(r => r.getDouble(0))
.histogram(10)
// ranges: Array[Double] = Array(3.4, 5.615, 7.83, 10.045, 12.26, 14.475, 16.69, 18.905, 21.12, 23.335, 25.55)
// counts: Array[Long] = Array(2, 0, 2, 3, 0, 1, 0, 1, 0, 1)
But the issue is: how can I create the histograms in parallel based on column 'M1'? That is, I need two histogram outputs, one for the value M1 and one for the value M2.
First, you need to know that histogram generates two separate sequential jobs. One to detect the minimum and maximum of your data, one to compute the actual histogram. You can check this using the Spark UI.
We can follow the same scheme to build histograms on as many columns as you wish, with only two jobs. Yet, we cannot use the histogram function which is only meant to handle one collection of doubles. We need to implement it by ourselves. The first job is dead simple.
val Row(min_trx : Double, max_trx : Double) = df.select(min('trx), max('trx)).head
Then we compute the ranges of the histogram locally. Note that I use the same ranges for all the columns. This makes it easy to compare the results between the columns (by plotting them on the same figure). Having different ranges per column would just be a small modification of this code though.
val hist_size = 10
val hist_step = (max_trx - min_trx) / hist_size
val hist_ranges = (1 until hist_size)
.scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
// I add max_trx manually to avoid rounding errors that would exclude the value
That was the first part. Then, we can use a UDF to determine in what range each value ends up, and compute all the histograms in parallel with spark.
val range_index = udf((x : Double) => hist_ranges.lastIndexWhere(x >= _))
val hist_df = df
.withColumn("rangeIndex", range_index('trx))
.groupBy("M1", "rangeIndex")
.count()
// And voilà, all the data you need is there.
hist_df.show()
+---+----------+-----+
| M1|rangeIndex|count|
+---+----------+-----+
| M2| 2| 2|
| M1| 0| 2|
| M2| 5| 1|
| M1| 3| 2|
| M2| 3| 1|
| M1| 7| 1|
| M2| 10| 1|
+---+----------+-----+
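The range and index logic can be tried without a cluster. Below is a minimal local sketch using the sample data from the question; SharedRangeHistogram is a hypothetical name. One deliberate difference, which is an assumption of this sketch and not part of the answer's code: the maximum value is clamped into the last bin, so indices run from 0 to histSize - 1 instead of reaching histSize as in the table above.

```scala
object SharedRangeHistogram {
  // Same edge construction as above: histSize + 1 edges from min to max.
  def ranges(min: Double, max: Double, histSize: Int): IndexedSeq[Double] = {
    val step = (max - min) / histSize
    (1 until histSize).scanLeft(min)((a, _) => a + step) :+ max
  }

  // Index of the bin containing x; the maximum is clamped into the last bin.
  def binIndex(edges: IndexedSeq[Double], x: Double): Int =
    math.min(edges.lastIndexWhere(x >= _), edges.size - 2)

  // Sample data from the question as (M1, trx) pairs.
  val data: Seq[(String, Double)] = Seq(
    "M1" -> 11.35, "M1" -> 3.4, "M1" -> 10.45, "M1" -> 3.95, "M1" -> 20.95,
    "M2" -> 25.55, "M2" -> 9.95, "M2" -> 11.95, "M2" -> 9.65, "M2" -> 14.54
  )

  // One sequence of counts per M1 value, all sharing the same edges.
  def histograms(histSize: Int): Map[String, Seq[Long]] = {
    val values = data.map(_._2)
    val edges = ranges(values.min, values.max, histSize)
    data.groupBy(_._1).map { case (key, rows) =>
      val counts = rows.groupBy(r => binIndex(edges, r._2))
        .map { case (i, rs) => i -> rs.size.toLong }
      key -> (0 until histSize).map(i => counts.getOrElse(i, 0L))
    }
  }
}
```

Adding the two count sequences element by element gives the same totals as the single histogram(10) call in the question.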
As a bonus, you can shape the data to use it locally (within the driver), either using the RDD API or by collecting the dataframe and modifying it in scala.
Here is one way to do it with spark since this is a question about spark ;-)
val hist_map = hist_df.rdd
.map(row => row.getAs[String]("M1") ->
(row.getAs[Int]("rangeIndex"), row.getAs[Long]("count")))
.groupByKey
.mapValues( _.toMap)
.mapValues( hists => (0 to hist_size) // rangeIndex runs from 0 to hist_size (the max value gets index hist_size)
.map(i => hists.getOrElse(i, 0L)).toArray )
.collectAsMap
EDIT: how to build one range per column value:
Instead of computing a single global min and max of trx, we compute them for each value of the column M1 with groupBy.
val min_max_map = df.groupBy("M1")
.agg(min('trx), max('trx))
.rdd.map(row => row.getAs[String]("M1") ->
(row.getAs[Double]("min(trx)"), row.getAs[Double]("max(trx)")))
.collectAsMap // maps each column value to a tuple (min, max)
Then we adapt the UDF so that it uses this map and we are done.
// for clarity, let's define a function that generates histogram ranges
def generate_ranges(min_trx : Double, max_trx : Double, hist_size : Int) = {
val hist_step = (max_trx - min_trx) / hist_size
(1 until hist_size).scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
}
// and use it to generate one range per column value
val range_map = min_max_map.keys
.map(key => key ->
generate_ranges(min_max_map(key)._1, min_max_map(key)._2, hist_size))
.toMap
val range_index = udf((x : Double, m1 : String) =>
range_map(m1).lastIndexWhere(x >= _))
Finally, just replace range_index('trx) by range_index('trx, 'M1) and you will have one range per column value.
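The per-value variant can also be sketched on local collections. PerGroupHistogram is a hypothetical name, and as an assumption of this sketch the maximum of each group is clamped into the last bin, which is also how the RDD histogram function counts it.

```scala
object PerGroupHistogram {
  // histSize + 1 edges from min to max, built like generate_ranges above.
  def ranges(min: Double, max: Double, histSize: Int): IndexedSeq[Double] = {
    val step = (max - min) / histSize
    (1 until histSize).scanLeft(min)((a, _) => a + step) :+ max
  }

  // Histogram of one group, with edges computed from that group's own min and max.
  def histogram(values: Seq[Double], histSize: Int): (IndexedSeq[Double], Seq[Long]) = {
    val edges = ranges(values.min, values.max, histSize)
    val counts = Array.fill(histSize)(0L)
    values.foreach { x =>
      // Clamp so that the group's maximum lands in the last bin.
      val i = math.min(edges.lastIndexWhere(x >= _), histSize - 1)
      counts(i) += 1
    }
    (edges, counts.toSeq)
  }

  // One (edges, counts) pair per key.
  def perGroup(data: Seq[(String, Double)], histSize: Int): Map[String, (IndexedSeq[Double], Seq[Long])] =
    data.groupBy(_._1).map { case (key, rows) => key -> histogram(rows.map(_._2), histSize) }
}
```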
The way I do histograms with Spark is as follows:
val binEdges = (0 to 25 by 5).map(_.toDouble) // same edges as 0.0 to 25.0 by 5.0, which no longer compiles on Scala 2.13
val bins = binEdges.init.zip(binEdges.tail).toDF("bin_from","bin_to")
df
.join(bins,$"trx">=$"bin_from" and $"trx"<$"bin_to","right")
.groupBy($"bin_from",$"bin_to")
.agg(
count($"trx").as("count")
// add more, e.g. sum($"trx")
)
.orderBy($"bin_from",$"bin_to")
.show()
gives:
+--------+------+-----+
|bin_from|bin_to|count|
+--------+------+-----+
| 0.0| 5.0| 2|
| 5.0| 10.0| 2|
| 10.0| 15.0| 4|
| 15.0| 20.0| 0|
| 20.0| 25.0| 1|
+--------+------+-----+
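The join condition above amounts to assigning each value to a half-open bin [bin_from, bin_to). A small local sketch of that rule, with BinJoin as a hypothetical name: empty bins keep a count of 0 (the effect of the right join), and values outside every bin, such as 25.55, are not counted.

```scala
object BinJoin {
  // Consecutive edges paired into half-open bins [from, to).
  val binEdges: Seq[Double] = (0 to 25 by 5).map(_.toDouble)
  val bins: Seq[(Double, Double)] = binEdges.init.zip(binEdges.tail)

  // Count per bin; values that fall outside every bin are ignored.
  def countPerBin(values: Seq[Double]): Seq[((Double, Double), Int)] =
    bins.map { case bin @ (from, to) => bin -> values.count(x => x >= from && x < to) }
}
```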
Now if you have more dimensions, just add that to the groupBy-clause
df
.join(bins,$"trx">=$"bin_from" and $"trx"<$"bin_to","right")
.groupBy($"M1",$"bin_from",$"bin_to")
.agg(
count($"trx").as("count")
)
.orderBy($"M1",$"bin_from",$"bin_to")
.show()
gives:
+----+--------+------+-----+
| M1|bin_from|bin_to|count|
+----+--------+------+-----+
|null| 15.0| 20.0| 0|
| M1| 0.0| 5.0| 2|
| M1| 10.0| 15.0| 2|
| M1| 20.0| 25.0| 1|
| M2| 5.0| 10.0| 2|
| M2| 10.0| 15.0| 2|
+----+--------+------+-----+
You may tweak the code a bit to get the output you want, but this should get you started. You could also use the UDAF approach I posted here: Spark custom aggregation : collect_list+UDF vs UDAF
I think it's not easily possible using RDDs, because histogram is only available on DoubleRDD, i.e. RDDs of Double. If you really need to use the RDD API, you can do it in parallel by firing parallel jobs; this can be done using Scala's parallel collections:
import scala.collection.parallel.immutable.ParSeq
val List((rangeM1,histM1),(rangeM2,histM2)) = ParSeq("M1","M2")
.map(c => df.where($"M1"===c)
.select(col("trx"))
.rdd.map(r => r.getDouble(0))
.histogram(10)
).toList
println(rangeM1.toSeq,histM1.toSeq)
println(rangeM2.toSeq,histM2.toSeq)
gives:
(WrappedArray(3.4, 5.155, 6.91, 8.665000000000001, 10.42, 12.175, 13.930000000000001, 15.685, 17.44, 19.195, 20.95),WrappedArray(2, 0, 0, 0, 2, 0, 0, 0, 0, 1))
(WrappedArray(9.65, 11.24, 12.83, 14.420000000000002, 16.01, 17.6, 19.19, 20.78, 22.37, 23.96, 25.55),WrappedArray(2, 1, 0, 1, 0, 0, 0, 0, 0, 1))
Note that the bins differ here for M1 and M2.

create a simple DF after reading a parquet file

I am new to Scala and am having trouble writing some simple Spark code. I have this DF, which I get after reading a parquet file:
ID Timestamp
1 0
1 10
1 11
2 20
3 15
What I want is to create a result DF from the first DF (for example, if the ID = 2, the timestamp should be multiplied by two). So I created a new class:
case class OutputData(id: bigint, timestamp:bigint)
And here is my code :
val tmp = spark.read.parquet("/user/test.parquet").select("id", "timestamp")
val outputData:OutputData = tmp.map(x:Row => {
var time_result
if (x.getString("id") == 2) {
time_result = x.getInt(2)* 2
}
if (x.getString("id") == 1) {
time_result = x.getInt(2) + 10
}
OutputData2(x.id, time_result)
})
case class OutputData2(id: bigint, timestamp:bigint)
Can you help me please ?
To make the implementation easier, you can convert your DataFrame to a typed Dataset using a case class, and then process that Dataset with object notation instead of accessing the Row each time you want the value of some element. Apart from that, since your input and output have the same format, you can use the same case class instead of defining two.
Code looks like:
// Sample input data
val df = Seq(
(1, 0L),
(1, 10L),
(1, 11L),
(2, 20L),
(3, 15L)
).toDF("ID", "Timestamp")
df.show()
// Case class as helper
case class OutputData(ID: Integer, Timestamp: Long)
val newDF = df.as[OutputData].map(record=>{
val newTime = if(record.ID == 2) record.Timestamp*2 else record.Timestamp // identify your id and apply logic based on that
OutputData(record.ID, newTime)// return same format with updated values
})
newDF.show()
The output of above code:
// original
+---+---------+
| ID|Timestamp|
+---+---------+
| 1| 0|
| 1| 10|
| 1| 11|
| 2| 20|
| 3| 15|
+---+---------+
// new one
+---+---------+
| ID|Timestamp|
+---+---------+
| 1| 0|
| 1| 10|
| 1| 11|
| 2| 40|
| 3| 15|
+---+---------+
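The question mentions two rules (multiply by two for ID 2, add 10 for ID 1), while the code above applies only the first. A minimal sketch of both rules as a plain function follows; TimestampRules and transform are hypothetical names, and leaving every other ID unchanged is an assumption.

```scala
object TimestampRules {
  final case class OutputData(id: Int, timestamp: Long)

  // The two rules from the question; any other id is left unchanged (an assumption).
  def transform(record: OutputData): OutputData = record.id match {
    case 1 => record.copy(timestamp = record.timestamp + 10)
    case 2 => record.copy(timestamp = record.timestamp * 2)
    case _ => record
  }
}
```

Keeping the rule in a pure function makes it testable on a local Seq; the same function could presumably be passed to the Dataset map, provided the case class fields match the column names.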

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
The Reasons dataframe has only 9 records and the IDs dataframe has 40k records; I want to randomly assign a reason to each and every ID.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you have just a few reasons (as in the question), you can probably broadcast them or embed them in a UDF and easily solve this problem.
The general idea is to create an index (sequential) for the reasons and then random values from 0 to N (where N is the number of reasons) on the IDs dataset and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numerOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand() * numerOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
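For the small case mentioned at the start (only a handful of reasons), the same idea reduces to drawing a random index per ID. A minimal local sketch, assuming the reasons fit in memory; RandomAssign and assign are hypothetical names, and in Spark this logic would live inside a UDF or use a broadcast value.

```scala
import scala.util.Random

object RandomAssign {
  // Pairs every id with a reason drawn uniformly at random (with replacement).
  // reasons must be non-empty, since nextInt requires a positive bound.
  def assign(ids: Seq[Long], reasons: IndexedSeq[String], rng: Random): Seq[(Long, String)] =
    ids.map(id => id -> reasons(rng.nextInt(reasons.size)))
}
```

Passing the Random instance in explicitly lets a fixed seed reproduce the same assignment, which helps when testing.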
pyspark random join itself
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
A fast way to randomly join dataA (a huge dataframe) and dataB (a smaller dataframe, sorted by any column), assuming pyspark.sql.functions is imported as F and Window comes from pyspark.sql:
dfB = dataB.withColumn(
"index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is small and already sorted, assigning row numbers over an ordered window is cheap, even though a window without partitionBy is evaluated in a single partition. F.rand() is evaluated independently for each row, so adding the index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: converting the dataframe to an RDD, calling zipWithIndex, and converting back to a dataframe can be very expensive.
monotonically_increasing_id: needs to be combined with row_number, which collects all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

Dataframe.map need to result with more than the rows in dataset

I am using Scala and Spark and have a simple dataframe.map that produces the required transformation on the data. However, I need to provide an additional row of data along with the modified original. How can I use dataframe.map to produce this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfTransform= udf[Int,Int] { (age) => if (age<25) 25 else age }
val df2=df1.withColumn("age2", udfTransform($"age")).
where("age!=age2").
drop("age2")
df1.withColumn("age", udfTransform($"age")).
unionAll(df2).
orderBy("id").
show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfArr= udf[Array[Int],Int] { (age) =>
if (age<25) Array(age,25) else Array(age) }
val df2=df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
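Returning a collection per input row and flattening it is the general pattern behind the array-plus-explode version. A minimal sketch on local collections follows; ExtraRows, Person and expand are hypothetical names. Spark Datasets offer flatMap as well, so a function of this shape could be applied to a typed Dataset, though that is not shown or tested here.

```scala
object ExtraRows {
  final case class Person(id: Int, name: String, age: Int)

  // Emits the corrected row, plus the original row when the age was below the minimum.
  def expand(p: Person, minAge: Int = 25): Seq[Person] =
    if (p.age < minAge) Seq(p.copy(age = minAge), p) else Seq(p)
}
```

With flatMap, rows that need no correction produce exactly one output row and corrected rows produce two, so no separate union step is needed.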