Scala Maps, merging of multiple maps by key [duplicate]

This question already has answers here:
Scala: Merge map
(14 answers)
Closed 3 years ago.
I am trying to merge different maps that have the same keys (account ids) but different values, such as average spending, maximum amount spent and minimum amount spent.
val a = Map("account1" -> 350.2, "account2" -> 547.5, "account3" -> 754.4)
val b = Map("account1" -> 1250.0, "account2" -> 3221.785, "account3" -> 2900.0)
val c = Map("account1" -> 50.0, "account2" -> 21.5, "account3" -> 12.7)
I want:
val d = Map("account1" -> (350.2, 1250.0, 50.0), "account2" -> (547.5, 3221.785, 21.5), ...)
I'd also like to create a list like:
(("account1", 350.2, 1250.0, 50.0), ("account2", 547.5, 3221.785, 21.5), ...)
Any help would be great, thank you very much.

As stated in the answers to the linked duplicate (Scala: Merge map):
// convert maps to seq, to keep duplicate keys and concat
val merged = Map(1 -> 2).toSeq ++ Map(1 -> 4).toSeq
// merged: Seq[(Int, Int)] = ArrayBuffer((1,2), (1,4))
// group by key
val grouped = merged.groupBy(_._1)
// grouped: scala.collection.immutable.Map[Int,Seq[(Int, Int)]] = Map(1 -> ArrayBuffer((1,2), (1,4)))
// remove key from value set and convert to list
val cleaned = grouped.mapValues(_.map(_._2).toList)
// cleaned: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(2, 4))
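For the specific maps in the question (a, b and c as defined above), the values can also be combined by key directly to get the desired d and the flattened list. A minimal sketch, assuming every account id is present in all three maps:
// account -> (average, max, min); b(k) and c(k) throw if a key is missing
val d = a.map { case (k, avg) => k -> (avg, b(k), c(k)) }
// d: Map[String,(Double, Double, Double)] = Map(account1 -> (350.2,1250.0,50.0), account2 -> (547.5,3221.785,21.5), account3 -> (754.4,2900.0,12.7))

// flattened to (account, average, max, min) tuples
val list = d.toList.map { case (k, (avg, mx, mn)) => (k, avg, mx, mn) }
// list: List[(String, Double, Double, Double)]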

Related

Convert MapPartitionsRDD to DataFrame and grouping data by 2 keys

I have a dataframe which looks like this:
country | user | count
----------------------
Germany | Sarah| 2
China | Paul | 1
Germany | Alan | 3
Germany | Paul | 1
...
What I am trying to do is to convert this dataframe to another which looks like this:
dimension | value
--------------------------------------------
Country | [Germany -> 6, China -> 1]
--------------------------------------------
User | [Sarah -> 2, Paul -> 2, Alan -> 3]
...
At first I tried to do it like this:
var newDF = Seq.empty[(String, Map[String, Long])].toDF("dimension", "value")
df.collect().foreach { row =>
  Array(0, 1).map { pos =>
    newDF = newDF.union(
      Seq((df.columns.toSeq(pos).toString,
           Map(row.mkString(",").split(",")(pos) -> row.mkString(",").split(",")(2).toLong))).toDF()
    )
  }
}
val newDF2 = newDF.groupBy("dimension")
  .agg(collect_list("value"))
  .as[(String, Seq[Map[String, Long]])]
  .map { case (id, list) => (id, list.reduce(_ |+| _)) }
  .toDF("dimension", "value")
But the collect() was killing my driver. Therefore, I have tried to do it like this:
class DimItem[T](val dimension: String, val value: String, val metric: T)

val items: RDD[DimItem[Long]] = df.rdd.flatMap { row =>
  dims.zipWithIndex.map { case (dim, i) =>
    new DimItem(dim, row(i).toString, row(13).asInstanceOf[Long])
  }
}
// with the format [DimItem(Country, Germany, 2), DimItem(User, Sarah, 2)], ...

val itemsGrouped: RDD[((String, String), Iterable[DimItem[Long]])] = items.groupBy(x => (x.dimension, x.value))

val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map { case (key, items) =>
  new DimItem(key._1, key._2, items.reduce((a, b) => a.metric + b.metric))
}
The idea is to store objects like (Country, China, 1), (Country, Germany, 3), (Country, Germany, 1), ... in an RDD and then group them by the first two fields: (Country, China), (Country, Germany), ... Once grouped, sum the counts they have; for example, (Country, Germany, 3) and (Country, Germany, 1) become (Country, Germany, 4).
But once I get here, it tells me that in items.reduce() there is a mismatch: it expects a DimItem[Long] but gets a Long.
The next step will be to group by the key "dimension", build the Map[String, Int] format in the "value" column, and convert it to a DataFrame.
I have 2 questions.
First: is this last code correct?
Second: How can I convert this MapPartitionsRDD into a DF?
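Regarding the first question: the reduce fails because the function passed to reduce must return a DimItem[Long], not a Long. A minimal sketch that keeps the types aligned (reusing DimItem and itemsGrouped from the code above) could be:
// sum the metrics inside each group and wrap the total back into a DimItem
val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map { case ((dim, value), group) =>
  new DimItem(dim, value, group.map(_.metric).sum)
}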
Here is one solution based on the DataFrame API:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, sum, map_from_arrays, collect_list}

def transform(df: DataFrame, colName: String): DataFrame =
  df.groupBy(colName)
    .agg(sum("count").as("sum"))
    .agg(
      map_from_arrays(
        collect_list(colName),
        collect_list("sum")
      ).as("value")
    )
    .select(lit(colName).as("dimension"), $"value")
val countryDf = transform(df, "country")
val userDf = transform(df, "user")
countryDf.unionByName(userDf).show(false)
// +---------+----------------------------------+
// |dimension|value |
// +---------+----------------------------------+
// |Country |[Germany -> 6, China -> 1] |
// |User |[Sarah -> 2, Alan -> 3, Paul -> 2]|
// +---------+----------------------------------+
Analysis: first we get the sum per country and per user by grouping on country and user respectively. Next we add one more aggregation to the pipeline which collects the previous results into a map; the map is built with the map_from_arrays function introduced in Spark 2.4.0, and its keys and values are gathered with collect_list. Finally we union the two dataframes to produce the final result.
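If more dimension columns are needed later, the same transform can be applied to each column name and the results unioned, for example:
// one dataframe per dimension column, then a single union over all of them
val dimensions = Seq("country", "user")
val result = dimensions.map(transform(df, _)).reduce(_ unionByName _)
result.show(false)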

Fill in missing weeks within a given date interval in Spark (Scala)

Consider the following DataFrame:
val df = Seq("20140101", "20170619")
.toDF("date")
.withColumn("date", to_date($"date", "yyyyMMdd"))
.withColumn("week", date_format($"date", "Y-ww"))
The code yields:
date: date
week: string
date week
2014-01-01 2014-01
2017-06-19 2017-25
What I would like to do is thicken the dataframe so I'm left with one row for each week in the interval between 2014-01 and 2017-25. The date column isn't important so it can be discarded.
This needs to be done over a myriad of customer / product id combinations, so I'm looking for an efficient solution, preferably using nothing beyond java.sql.date and the built-in date functionalities in Spark.
Check this out. I have used the default "Sunday" as the week start day.
scala> import java.time._
import java.time._
scala> import java.time.format._
import java.time.format._
scala> val a = java.sql.Date.valueOf("2014-01-01")
a: java.sql.Date = 2014-01-01
scala> val b = java.sql.Date.valueOf("2017-12-31")
b: java.sql.Date = 2017-12-31
scala> val a1 = a.toLocalDate.toEpochDay.toInt
a1: Int = 16071
scala> val b1 = b.toLocalDate.toEpochDay.toInt
b1: Int = 17531
scala> val c1 = (a1 until b1).map(LocalDate.ofEpochDay(_)).map(x => (x,x.format(DateTimeFormatter.ofPattern("Y-ww")),x.format(DateTimeFormatter.ofPattern("E")) ) ).filter( x=> x._3 =="Sun" ).map(x => (java.sql.Date.valueOf(x._1),x._2) ).toMap
c1: scala.collection.immutable.Map[java.sql.Date,String] = Map(2014-06-01 -> 2014-23, 2014-11-02 -> 2014-45, 2017-11-05 -> 2017-45, 2016-10-23 -> 2016-44, 2014-11-16 -> 2014-47, 2014-12-28 -> 2015-01, 2017-04-30 -> 2017-18, 2015-01-04 -> 2015-02, 2015-10-11 -> 2015-42, 2014-09-07 -> 2014-37, 2017-09-17 -> 2017-38, 2014-04-13 -> 2014-16, 2014-10-19 -> 2014-43, 2014-01-05 -> 2014-02, 2016-07-17 -> 2016-30, 2015-07-26 -> 2015-31, 2016-09-18 -> 2016-39, 2015-11-22 -> 2015-48, 2015-10-04 -> 2015-41, 2015-11-15 -> 2015-47, 2015-01-11 -> 2015-03, 2016-12-11 -> 2016-51, 2017-02-05 -> 2017-06, 2016-03-27 -> 2016-14, 2015-11-01 -> 2015-45, 2017-07-16 -> 2017-29, 2015-05-24 -> 2015-22, 2017-06-18 -> 2017-25, 2016-03-13 -> 2016-12, 2014-11-09 -> 2014-46, 2014-09-21 -> 2014-39, 2014-01-26 -> 2014-05...
scala> val df = Seq( (c1) ).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: map<date,string>]
scala> val df2 = df.select(explode('a).as(Seq("dt","wk")) )
df2: org.apache.spark.sql.DataFrame = [dt: date, wk: string]
scala> df2.orderBy('dt).show(false)
+----------+-------+
|dt |wk |
+----------+-------+
|2014-01-05|2014-02|
|2014-01-12|2014-03|
|2014-01-19|2014-04|
|2014-01-26|2014-05|
|2014-02-02|2014-06|
|2014-02-09|2014-07|
|2014-02-16|2014-08|
|2014-02-23|2014-09|
|2014-03-02|2014-10|
|2014-03-09|2014-11|
|2014-03-16|2014-12|
|2014-03-23|2014-13|
|2014-03-30|2014-14|
|2014-04-06|2014-15|
|2014-04-13|2014-16|
|2014-04-20|2014-17|
|2014-04-27|2014-18|
|2014-05-04|2014-19|
|2014-05-11|2014-20|
|2014-05-18|2014-21|
+----------+-------+
only showing top 20 rows
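A DataFrame-only alternative, assuming Spark 2.4+ (for the sequence function) and the same interval as above: generate one date every 7 days with sequence, explode it, and format with the same "Y-ww" pattern.
import org.apache.spark.sql.functions.{date_format, explode, expr}

// one date per week between the two bounds, plus its week label
val weeks = spark.range(1)
  .select(expr("sequence(to_date('2014-01-01'), to_date('2017-06-19'), interval 7 days)").as("dates"))
  .select(explode($"dates").as("date"))
  .select($"date", date_format($"date", "Y-ww").as("week"))

weeks.show(5)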

How can I sort an RDD[(Int, (val1, val2))] by val2, when only sortByKey is available as an option?

I have an RDD[(Int, (val1, val2))] which I want to sort by val2, but the only option that seems available is sortByKey.
Is sortBy available only in older Scala versions?
Is there another option than collecting it to the driver?
In my code I only do:
val nonslack = slacks.filter(x => Vlts.contains(x._1))
where Vlts is an Array[Int] and slacks is an RDD read from a file.
There is a sortBy in RDD:
val rdd = spark.sparkContext.parallelize(Seq(("one", ("one" -> 1)), ("two", ("two" -> 2)), ("three", ("three" -> 3))))
rdd.sortBy(_._2._2).collect().foreach(println(_))
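Applied to the RDD from the question, sorting nonslack by the second element of the value tuple (assuming that element is numeric, or otherwise has an Ordering) would look like:
// sort by val2, i.e. the second element of the value tuple
val sorted = nonslack.sortBy { case (_, (_, v2)) => v2 }
sorted.collect().foreach(println)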

Data frame creation from MAP of N elements with schemaDetails of N elements

How do I convert the input5 data format into a DataFrame, using the schema details mentioned in schemanames?
The conversion should be dynamic, without using Row(r(0), r(1)): the number of columns can increase or decrease in both the input and the schema, so the code should be dynamic.
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(Entry("a","b",0,Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))
val schemanames= "col1,ref"
The target dataframe should be built only from the Map in input5 (i.e. col1 and ref). There can be many other columns (col2, col3, ...); if the Map contains more columns, the same columns will also be listed in the schema names.
The schemanames variable should be used to create the structure, and input5.row (the Map) should be the data source, as the number of columns in the schema names can be in the hundreds; the same applies to the data in input5.row.
This would work for any number of columns, as long as they're all Strings, and each Entry contains a map with values for all of these columns:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// split to column names:
val columns = schemanames.split(",")
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, StringType)))
// convert input5 to Seq[Row], selecting the values from the "row" Map in the same order as columns
val rows = input5.map(_.row)
  .map(valueMap => columns.map(valueMap.apply).toSeq)
  .map(Row.fromSeq)
// finally - create the dataframe
val dataframe = spark.createDataFrame(sc.parallelize(rows), schema)
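For the single entry in input5 from the question, this should produce a one-row dataframe along these lines:
dataframe.show(false)
// +-------+--------------------------+
// |col1   |ref                       |
// +-------+--------------------------+
// |0000555|2017-08-12 12:12:12.266528|
// +-------+--------------------------+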
You can go through entries in schemanames (which are presumably selected keys in the Map based on your description) along with a UDF for Map manipulation to assemble the dataframe as shown below:
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(
Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")),
Entry("c", "b", 1, Map("col1" -> "0000444", "col2" -> "0000444", "ref" -> "2017-08-14 14:14:14.0")),
Entry("a", "d", 0, Map("col2" -> "0000666", "ref" -> "2017-08-16 16:16:16.0")),
Entry("e", "f", 0, Map("col1" -> "0000777", "ref" -> "2017-08-17 17:17:17.0", "others" -> "x"))
)
val schemanames= "col1, ref"
// Create dataframe from input5
val df = input5.toDF
// A UDF to get a column value from the Map, falling back to "n/a" when the key is missing
import org.apache.spark.sql.functions.{col, udf}

def getColVal(c: String) = udf(
  (m: Map[String, String]) =>
    m.get(c).getOrElse("n/a")
)

// Add columns based on entries in schemanames
val cols = schemanames.split(",").map(_.trim)
val df2 = cols.foldLeft( df )(
  (acc, c) => acc.withColumn( c, getColVal(c)(df("row")) )
)
val df3 = df2.select(cols.map(c => col(c)): _*)
df3.show(truncate=false)
+-------+--------------------------+
|col1 |ref |
+-------+--------------------------+
|0000555|2017-08-12 12:12:12.266528|
|0000444|2017-08-14 14:14:14.0 |
|n/a |2017-08-16 16:16:16.0 |
|0000777|2017-08-17 17:17:17.0 |
+-------+--------------------------+

Read CSV File to OrderedMap

I'm reading a CSV file and adding the data to a Map in Scala.
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import org.apache.commons.csv.{CSVFormat, CSVParser}

val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val inputFormat = CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"')
val csvRecords = new CSVParser(br, inputFormat).getRecords.asScala
val buffer = for (csvRecord <- csvRecords; if csvRecords != null && csvRecords.nonEmpty)
  yield csvRecord.toMap.asScala
buffer.toList
But as the Map is not ordered I'm not able to read the columns in order. Is there any way to read the csvRecords in order?
The CSV file contains comma-separated values along with a header. It should generate output in List[mutable.LinkedHashMap[String, String]] format, something like [["fname", "A", "lname", "B"], ["fname", "C", "lname", "D"]].
The above code works, but it does not preserve the order: for example, if the CSV file contains the columns in the order fname, lname, the output map has lname first and fname last.
If I understand your question correctly, here's one way to create a list of LinkedHashMaps with elements in order:
// Assuming your CSV File has the following content:
fname,lname,grade
John,Doe,A
Ann,Cole,B
David,Jones,C
Mike,Duke,D
Jenn,Rivers,E
import collection.mutable.LinkedHashMap
// Get indexed header from CSV
val indexedHeader = io.Source.fromFile("/path/to/csvfile").
getLines.take(1).next.
split(",").
zipWithIndex
indexedHeader: Array[(String, Int)] = Array((fname,0), (lname,1), (grade,2))
// Aggregate LinkedHashMap using foldLeft
val ListOfLHM = for ( csvRecord <- csvRecords ) yield
indexedHeader.foldLeft(LinkedHashMap[String, String]())(
(acc, x) => acc += (x._1 -> csvRecord.get(x._2))
)
ListOfLHM: scala.collection.mutable.Buffer[scala.collection.mutable.LinkedHashMap[String,String]] = ArrayBuffer(
Map(fname -> John, lname -> Doe, grade -> A),
Map(fname -> Ann, lname -> Cole, grade -> B),
Map(fname -> David, lname -> Jones, grade -> C),
Map(fname -> Mike, lname -> Duke, grade -> D),
Map(fname -> Jenn, lname -> Rivers, grade -> E)
)
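Another option, if you would rather not read the header a second time with io.Source: the CSVParser already knows the column order through its header map (column name -> index), so the LinkedHashMaps can be built from it directly. A sketch, assuming the same fileName and delimiter variables as in the question:
import java.io.{BufferedReader, File, FileInputStream, InputStreamReader}
import scala.collection.JavaConverters._
import scala.collection.mutable.LinkedHashMap
import org.apache.commons.csv.{CSVFormat, CSVParser}

val br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), "UTF-8"))
val parser = new CSVParser(br, CSVFormat.newFormat(delimiter.charAt(0)).withHeader().withQuote('"'))

// column names in file order, recovered from the parser's header map
val header = parser.getHeaderMap.asScala.toSeq.sortBy(_._2.intValue).map(_._1)

// one LinkedHashMap per record, with keys in the original column order
val result: List[LinkedHashMap[String, String]] =
  parser.getRecords.asScala.map { rec =>
    LinkedHashMap(header.map(h => h -> rec.get(h)): _*)
  }.toList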