Fill in missing weeks within a given date interval in Spark (Scala)

Consider the following DataFrame:
val df = Seq("20140101", "20170619")
  .toDF("date")
  .withColumn("date", to_date($"date", "yyyyMMdd"))
  .withColumn("week", date_format($"date", "Y-ww"))
The code yields:
date: date
week: string
date        week
2014-01-01  2014-01
2017-06-19  2017-25
What I would like to do is thicken the dataframe so I'm left with one row for each week in the interval between 2014-01 and 2017-25. The date column isn't important so it can be discarded.
This needs to be done over a myriad of customer/product id combinations, so I'm looking for an efficient solution, preferably using nothing beyond java.sql.Date and the built-in date functions in Spark.

Check this out. I have used the default "Sunday" as the week start day.
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val a = java.sql.Date.valueOf("2014-01-01")
a: java.sql.Date = 2014-01-01
scala> val b = java.sql.Date.valueOf("2017-12-31")
b: java.sql.Date = 2017-12-31
scala> val a1 = a.toLocalDate.toEpochDay.toInt
a1: Int = 16071
scala> val b1 = b.toLocalDate.toEpochDay.toInt
b1: Int = 17531
scala> val c1 = (a1 until b1).map(LocalDate.ofEpochDay(_)).map(x => (x,x.format(DateTimeFormatter.ofPattern("Y-ww")),x.format(DateTimeFormatter.ofPattern("E")) ) ).filter( x=> x._3 =="Sun" ).map(x => (java.sql.Date.valueOf(x._1),x._2) ).toMap
c1: scala.collection.immutable.Map[java.sql.Date,String] = Map(2014-06-01 -> 2014-23, 2014-11-02 -> 2014-45, 2017-11-05 -> 2017-45, 2016-10-23 -> 2016-44, 2014-11-16 -> 2014-47, 2014-12-28 -> 2015-01, 2017-04-30 -> 2017-18, 2015-01-04 -> 2015-02, 2015-10-11 -> 2015-42, 2014-09-07 -> 2014-37, 2017-09-17 -> 2017-38, 2014-04-13 -> 2014-16, 2014-10-19 -> 2014-43, 2014-01-05 -> 2014-02, 2016-07-17 -> 2016-30, 2015-07-26 -> 2015-31, 2016-09-18 -> 2016-39, 2015-11-22 -> 2015-48, 2015-10-04 -> 2015-41, 2015-11-15 -> 2015-47, 2015-01-11 -> 2015-03, 2016-12-11 -> 2016-51, 2017-02-05 -> 2017-06, 2016-03-27 -> 2016-14, 2015-11-01 -> 2015-45, 2017-07-16 -> 2017-29, 2015-05-24 -> 2015-22, 2017-06-18 -> 2017-25, 2016-03-13 -> 2016-12, 2014-11-09 -> 2014-46, 2014-09-21 -> 2014-39, 2014-01-26 -> 2014-05...
scala> val df = Seq( (c1) ).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<date,string>]
scala> val df2 = df.select(explode('a).as(Seq("dt","wk")) )
df2: org.apache.spark.sql.DataFrame = [dt: date, wk: string]
scala> df2.orderBy('dt).show(false)
+----------+-------+
|dt |wk |
+----------+-------+
|2014-01-05|2014-02|
|2014-01-12|2014-03|
|2014-01-19|2014-04|
|2014-01-26|2014-05|
|2014-02-02|2014-06|
|2014-02-09|2014-07|
|2014-02-16|2014-08|
|2014-02-23|2014-09|
|2014-03-02|2014-10|
|2014-03-09|2014-11|
|2014-03-16|2014-12|
|2014-03-23|2014-13|
|2014-03-30|2014-14|
|2014-04-06|2014-15|
|2014-04-13|2014-16|
|2014-04-20|2014-17|
|2014-04-27|2014-18|
|2014-05-04|2014-19|
|2014-05-11|2014-20|
|2014-05-18|2014-21|
+----------+-------+
only showing top 20 rows
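If Spark 2.4+ is available, a similar week list can also be generated without leaving the DataFrame API. The following is an untested sketch that assumes the built-in sequence function: it explodes one row per day in the interval and then keeps a single representative row per week label, so every week between the two dates appears exactly once.
import org.apache.spark.sql.functions._

val weeks = spark.range(1)
  .select(explode(sequence(
    to_date(lit("2014-01-01")),
    to_date(lit("2017-06-19")),
    expr("interval 1 day")
  )).as("date"))
  .withColumn("week", date_format(col("date"), "Y-ww"))
  .dropDuplicates("week")

weeks.orderBy("date").show(5)
The date that survives dropDuplicates for each week is arbitrary, which is fine here since only the week label matters.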

Related

Scala - How to convert Spark DataFrame to Map

How to convert a Spark DataFrame to a Map like below: I want to convert it into a Map and then JSON. Pivot didn't work to reshape the column, so any help to convert it into a Map like below will be appreciated.
Input DataFrame :
+-----+-----+-------+--------------------+
|col1 |col2 |object |values              |
+-----+-----+-------+--------------------+
|one  |two  |main   |[101 -> A, 202 -> B]|
+-----+-----+-------+--------------------+
Expected Output DataFrame :
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|col1 |col2 |object | values | newMap |
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|one | two |main |[101 -> A, 202 -> B]|[col1 -> one, col2 -> two, object -> main, main -> [101 -> A, 202 -> B]]|
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
tried like below, but no success :
val toMap = udf((col1: String, col2: String, object: String, values: Map[String, String])) => {
col1.zip(values).toMap // need help for logic
// col1 -> col1_value, col2 -> col2_values, object -> object_value, object_value -> [values_of_Col_Values].toMap
})
df.withColumn("newMap", toMap($"col1", $"col2", $"object", $"values"))
I am stuck on formatting the code properly and getting the output; please help, either in Scala or Spark.
It's quite straightforward. The precondition is that all the columns must have the same type; otherwise you will get a Spark error.
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = Seq(("Foo", "L", "10"), ("Boo", "XL", "20"))
  .toDF("brand", "size", "sales")

// Prepare your map columns. A bit of nasty iteration work is required.
var preCol: Column = null
var counter = 1
val size = df.schema.fields.length
val mapColumns = df.schema.flatMap { field =>
  val res = if (counter == size)
    Seq(preCol, col(field.name))
  else
    Seq(lit(field.name), col(field.name))
  // assign the current field name for tracking and increment the counter by 1
  preCol = col(field.name)
  counter += 1
  res
}
df.withColumn("new", map(mapColumns: _*)).show(false)
Result
+-----+----+-----+---------------------------------------+
|brand|size|sales|new |
+-----+----+-----+---------------------------------------+
|Foo |L |10 |Map(brand -> Foo, size -> L, L -> 10) |
|Boo |XL |20 |Map(brand -> Boo, size -> XL, XL -> 20)|
+-----+----+-----+---------------------------------------+
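As a side note, if the goal is a plain column-name -> value map (rather than the variant above, where the last entry is keyed by the previous column's value), a shorter hedged sketch over the same df would be:
import org.apache.spark.sql.functions._

// pair each column name (as a literal) with the column itself
val kvCols = df.columns.flatMap(c => Seq(lit(c), col(c)))
df.withColumn("new", map(kvCols: _*)).show(false)
// gives e.g. Map(brand -> Foo, size -> L, sales -> 10)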

Convert keys into column names and values into rows (Map)

I have a dataframe that contains a Map column and an id column.
key1 -> value1, key2 -> value2
key1 -> value3, key2 -> value4
I want to have as a result a dataframe like this:
id  key1    key2
1   value1  value2
2   value3  value4
Thanks for your help.
I assume you are talking about Spark DataFrame. In that case, you can use the map method of the DataFrame to extract out the values you want. Here is an example using spark-shell (which automatically imports many of the implicit methods).
Note that toDF is used twice, once to load the sequence from built-in data structures, and another time to rename the columns in the new DataFrame obtained from the map method of the original DataFrame.
The show method is called to display "before" and "after".
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val m = Map(1-> Map("key1" -> "v1", "key2" -> "v2"), 2 -> Map("key1" -> "v3", "key2" -> "v4"))
m: scala.collection.immutable.Map[Int,scala.collection.immutable.Map[String,String]] = Map(1 -> Map(key1 -> v1, key2 -> v2), 2 -> Map(key1 -> v3, key2 -> v4))
scala> val df = m.toSeq.toDF("id", "map_value")
df: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> df.show()
+---+--------------------+
| id| map_value|
+---+--------------------+
| 1|[key1 -> v1, key2...|
| 2|[key1 -> v3, key2...|
+---+--------------------+
scala> val get_map:Function1[Row, Map[String,String]] = r => r.getAs[Map[String, String]]("map_value")
get_map: org.apache.spark.sql.Row => Map[String,String] = <function1>
scala> df.map(r => (r.getAs[Int]("id"), get_map(r).get("key1"), get_map(r).get("key2"))).toDF("id", "val1", "val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| v1| v2|
| 2| v3| v4|
+---+----+----+
Edit:
This addresses how to handle a variable number of columns. Here, N is the number of map keys plus one (there are 7 keys, k1 through k7, so N is 8); likewise, the 3 in Range(1, 3) is the number of rows plus one (there are 2 rows).
It is more convenient to use the select method of the DataFrame in this case, to avoid having to dynamically create tuples.
scala> val N = 8
N: Int = 8
scala> val map_value:Function1[Int,Map[String,String]] = (i: Int) => Map((for (n <- Range(1, N)) yield (s"k${n}", s"v${n*i}")).toList:_*)
map_value: Int => Map[String,String] = <function1>
scala> val m = Map((for (i <- Range(1, 3)) yield (i, map_value(i))).toList:_*)
m: scala.collection.immutable.Map[Int,Map[String,String]] = Map(1 -> Map(k2 -> v2, k5 -> v5, k6 -> v6, k7 -> v7, k1 -> v1, k4 -> v4, k3 -> v3), 2 -> Map(k2 -> v4, k5 -> v10, k6 -> v12, k7 -> v14, k1 -> v2, k4 -> v8, k3 -> v6))
scala> val df0 = m.toSeq.toDF("id", "map_value")
df0: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> val column_names:List[String] = (for (n <- Range(1, N)) yield (s"map_value.k${n}")).toList
column_names: List[String] = List(map_value.k1, map_value.k2, map_value.k3, map_value.k4, map_value.k5, map_value.k6, map_value.k7)
scala> df0.select("id", column_names:_*).show()
+---+---+---+---+---+---+---+---+
| id| k1| k2| k3| k4| k5| k6| k7|
+---+---+---+---+---+---+---+---+
| 1| v1| v2| v3| v4| v5| v6| v7|
| 2| v2| v4| v6| v8|v10|v12|v14|
+---+---+---+---+---+---+---+---+
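If you prefer not to list the key names yourself, here is a hedged alternative sketch that stays in the DataFrame API, assuming the same df0 as above: explode the map into (key, value) rows, then pivot the keys back into columns.
import org.apache.spark.sql.functions._

// explode yields one (key, value) row per map entry; pivot spreads the keys into columns
df0.select(col("id"), explode(col("map_value")))
  .groupBy("id")
  .pivot("key")
  .agg(first("value"))
  .orderBy("id")
  .show()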

Scala Maps, merging of multiple maps by key [duplicate]

This question already has answers here:
Scala: Merge map
(14 answers)
Closed 3 years ago.
I am trying to merge different maps that have the same key (account id), but different values such as average spending, max amount spent and min amount spent.
val a = Map("account1" -> 350.2, "account2" -> 547.5, "account3" -> 754.4)
val b = Map("account1" -> 1250, "account2" -> 3221.785, "account3" -> 2900)
val c = Map("account1" -> 50, "account2" -> 21.5, "account3" -> 12.7)
I want:
val d = Map("account1" -> (350.2, 1250, 50), "account2" -> (547.5, 3221.785, 21.5), ... , ... )
I'd also like to create a list like:
(("account1", 350.2, 1250, 50), ("account2", 547.5, 3221.785), ... )
Any help would be great, thank you very much.
as stated here
// convert maps to seq, to keep duplicate keys and concat
val merged = Map(1 -> 2).toSeq ++ Map(1 -> 4).toSeq
// merged: Seq[(Int, Int)] = ArrayBuffer((1,2), (1,4))
// group by key
val grouped = merged.groupBy(_._1)
// grouped: scala.collection.immutable.Map[Int,Seq[(Int, Int)]] = Map(1 -> ArrayBuffer((1,2), (1,4)))
// remove key from value set and convert to list
val cleaned = grouped.mapValues(_.map(_._2).toList)
// cleaned: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(2, 4))
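Applied to the question's a, b and c maps, the same idea looks roughly like this (a sketch; the grouped values come out as a List rather than a tuple):
// the question's maps, with string keys
val a = Map("account1" -> 350.2, "account2" -> 547.5, "account3" -> 754.4)
val b = Map("account1" -> 1250.0, "account2" -> 3221.785, "account3" -> 2900.0)
val c = Map("account1" -> 50.0, "account2" -> 21.5, "account3" -> 12.7)

// d: Map[String, List[Double]], e.g. "account1" -> List(350.2, 1250.0, 50.0)
val d = (a.toSeq ++ b.toSeq ++ c.toSeq)
  .groupBy(_._1)
  .mapValues(_.map(_._2).toList)
  .toMap

// flat list of (account, avg, max, min) tuples
val asList = d.toList.collect { case (k, List(x, y, z)) => (k, x, y, z) }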

Collapse the rows with flatMap or reduceByKey

I have a requirement to collapse the rows into a wrapped array. Here are the original data and the expected result; this needs to be done in Spark Scala.
Original Data:
Column1  Column2  Units  UnitsByDept
ABC      BCD      3      [Dept1:1,Dept2:2]
ABC      BCD      13     [Dept1:5,Dept3:8]
Expected Result:
ABC      BCD      16     [Dept1:6,Dept2:2,Dept3:8]
It would probably be best to use DataFrame APIs for what you need. If you prefer using row-based functions like reduceByKey, here's one approach:
Convert the DataFrame to a PairRDD
Apply reduceByKey to sum up Units and aggregate UnitsByDept by Dept
Convert the resulting RDD back to a DataFrame:
Sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val df = Seq(
  ("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
  ("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
val rdd = df.rdd.
  map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
    ((c1, c2), (u, ubd))
  }.
  reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
  map{ case ((c1, c2), (u, ubd)) =>
    val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
      groupBy(_._1).mapValues(_.map(_._2).sum).
      map{ case (d, u) => d + ":" + u }
    (c1, c2, u, aggUBD)
  }
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Iterable[String]) =>
  Row(c1, c2, u, ubd.toSeq)
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
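For comparison, here is a hedged sketch of the DataFrame-API route mentioned at the top of this answer, assuming the same df: explode the UnitsByDept array, split each "Dept:units" string, aggregate per department, and join the result back onto the summed Units.
import org.apache.spark.sql.functions._

// one row per (Column1, Column2, dept, units)
val perDept = df
  .withColumn("dept_units", explode(col("UnitsByDept")))
  .withColumn("dept", split(col("dept_units"), ":")(0))
  .withColumn("units", split(col("dept_units"), ":")(1).cast("int"))

// sum units per department, then re-collect the "Dept:units" strings into an array
val deptAgg = perDept
  .groupBy("Column1", "Column2", "dept")
  .agg(sum("units").as("dept_total"))
  .groupBy("Column1", "Column2")
  .agg(collect_list(concat_ws(":", col("dept"), col("dept_total").cast("string")))
    .as("UnitsByDept"))

// total Units per key, joined with the per-department aggregate
df.groupBy("Column1", "Column2").agg(sum("Units").as("Units"))
  .join(deptAgg, Seq("Column1", "Column2"))
  .show(false)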

How to convert map to dataframe?

m is a map as follows:
scala> m
res119: scala.collection.mutable.Map[Any,Any] = Map(A-> 0.11164610291904906, B-> 0.11856755943424617, C -> 0.1023171832681312)
I want to get:
name  score
A     0.11164610291904906
B     0.11856755943424617
C     0.1023171832681312
How to get the final dataframe?
First convert it to a Seq, then you can use the toDF() function. (Note that the key and value types need to be concrete, e.g. String and Double rather than Any, so that an encoder is available.)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val m = Map("A" -> 0.11164610291904906, "B" -> 0.11856755943424617, "C" -> 0.1023171832681312)
val df = m.toSeq.toDF("name", "score")
df.show
Will give you:
+----+-------------------+
|name| score|
+----+-------------------+
| A|0.11164610291904906|
| B|0.11856755943424617|
| C| 0.1023171832681312|
+----+-------------------+