Scala - group list by key and map values

I'm trying to do something in Scala:
import scala.collection.mutable.ListBuffer

var tupleList = new ListBuffer[(String,List[String])]
val num1 = List("1","2")
val num2= List("3","4")
val num3= List("5","6")
tupleList+=(("Joe",num1))
tupleList+=(("Ben",num2))
tupleList+=(("Joe",num3))
I want to merge the lists that share the name 'Joe' and create one list of numbers: 1, 2, 5, 6.
So I thought to use groupByKey, but how do I flatten the values? mapValues? How can I use reduceLeft in this case?
I tried something like this:
val try3 = vre.groupMapReduce(_._1)(_._2)(_ :: _)
Thanks!

The principal problem was using :: instead of :::.
With immutable collections, your example will look like this:
val input =
("Joe" -> List("1","2")) ::
("Ben" -> List("3","4")) ::
("Joe" -> List("5","6")) ::
Nil
val result = input.groupMapReduce(_._1)(_._2)(_ ::: _)
print(result.mkString("{", ", ", "}"))
// Output:
// {Joe -> List(1, 2, 5, 6), Ben -> List(3, 4)}
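For Scala versions before 2.13, groupMapReduce is not available; a groupBy followed by flatMap gives the same result (a sketch, using the same sample data):

```scala
// Pre-2.13 alternative: group by the name, then concatenate each group's lists
val input = List(
  "Joe" -> List("1", "2"),
  "Ben" -> List("3", "4"),
  "Joe" -> List("5", "6")
)
val result = input
  .groupBy(_._1)                                  // Map[String, List[(String, List[String])]]
  .map { case (k, vs) => k -> vs.flatMap(_._2) }  // keep only the values, concatenated in order
```

groupBy preserves the original order of elements within each group, so "Joe" maps to List("1", "2", "5", "6") here as well.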

Related

How to avoid Duplicates in List.newBuilder Scala?

How do I avoid duplicates for this code:
val lastUpdatesBuilder = List.newBuilder[(String, Int)]
val somelist = List("a", "a")
for (v <- somelist) {
  lastUpdatesBuilder += v -> 1
}
println(lastUpdatesBuilder.result())
Result is List((a,1), (a,1)) and I want it to be List((a,1)) only.
Here you go:
object Demo extends App {
  val lastUpdatesBuilder = Set.newBuilder[(String, Int)]
  val somelist = List("a", "a")
  for (v <- somelist) {
    lastUpdatesBuilder += v -> 1
  }
  println(lastUpdatesBuilder.result())
}
Though I would suggest not using a mutable set; you can do something like this instead:
val ans = somelist.map { key =>
  key -> 1
}.toMap
println(ans)
Or you can first remove the duplicates using distinct and then create a map out of it:
val somelist = List("a", "a").distinct
val ans = somelist.map { key =>
  key -> 1
}.toMap
This is what the distinct method does.
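One caveat: .toMap returns a Map, which does not guarantee ordering. If you want the result to stay a List of tuples in first-occurrence order, distinct alone is enough (a minimal sketch):

```scala
val somelist = List("a", "a", "b")
// distinct drops duplicates while preserving the order of first occurrences
val ans = somelist.distinct.map(key => key -> 1)
```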

How to head DataFrame with Map[String,Long] column and preserve types?

I have a data frame on which I have applied a filter condition:
val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
Out of all the selected rows, I just want the last column of one row.
The last column type is Map[String, Long]. I want all the keys of the map as List[String].
I tried the below syntax:
val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .head
  .getMap(14)
  .keySet
  .toList
  .map(_.toString)
I am using map(_.toString) to convert a List[Nothing] to List[String]. The error that I am getting is:
missing parameter type for expanded function ((x$1) => x$1.toString)
[error] val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head().getMap(14).keySet.toList.map(_.toString)
The df is as follows:
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
|division_name| low| call_type|fiscal_year|fiscal_month| region_name|abandon_rate_percent|answered_calls|connects|equiv_week_calls|equiv_weeks|equivalent_calls|num_customers|offered_calls| pv|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
| NATIONAL|PHONE|CABLE CARD| 2016| 1|ALL DIVISIONS| 0.02| 10626| 0| 0.0| 0.0| 10649.8| 0| 10864|Map(subscribers_c...|
| NATIONAL|PHONE|CABLE CARD| 2016| 1| CENTRAL| 0.02| 3591| 0| 0.0| 0.0| 3598.6| 0| 3667|Map(subscribers_c...|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
One row, with just the last column selected, is:
[Map(subscribers_connects -> 5521287, disconnects_hsd -> 7992, subscribers_xfinity home -> 6277491, subscribers_bulk units -> 4978892, connects_cdv -> 41464, connects_disconnects -> 16945, connects_hsd -> 32908, disconnects_internet essentials -> 10319, disconnects_disconnects -> 3506, disconnects_video -> 8960, connects_xfinity home -> 43012)]
I'd like to get the keys of the last column as List[String] after applying the filter condition and taking just one row from the data frame.
The type problem is easy to solve by explicitly specifying the type parameters at the source, which is getMap(14). Since you know that you are expecting a Map of String -> Long key-value pairs, just replace getMap(14) with getMap[String, Long](14).
And as far as getMap[String, Long](14) being an empty Map: that has to do with your data, and you simply have an empty map at index 14 in the head row.
More Details
In Scala when you create a List[A], Scala infers the type by using the information available.
For example,
// Explicitly provide the type parameter info
scala> val l1: List[Int] = List(1, 2)
// l1: List[Int] = List(1, 2)
// Infer the type parameter by using the arguments passed to List constructor,
scala> val l2 = List(1, 2)
// l2: List[Int] = List(1, 2)
So, what happens when you create an empty list?
// Explicitly provide the type parameter info
scala> val l1: List[Int] = List()
// l1: List[Int] = List()
// Infer the type parameter by using the arguments passed to List constructor,
// but surprise, there are no arguments, since you are creating an empty list
scala> val l2 = List()
// l2: List[Nothing] = List()
So, now when Scala does not know anything, it will choose the most suitable type it can find, which is the bottom type Nothing.
The same thing happens when you do a toList on other collection objects, it tries to infer the type parameter from the source object.
scala> val ks1 = Map.empty[Int, Int].keySet
// ks1: scala.collection.immutable.Set[Int] = Set()
scala> val l1 = ks1.toList
// l1: List[Int] = List()
scala> val ks2 = Map.empty.keySet
// ks2: scala.collection.immutable.Set[Nothing] = Set()
scala> val l2 = ks2.toList
// l2: List[Nothing] = List()
Similarly, the getMap(14) which you called on the head Row of the DataFrame infers the type parameters for the Map using the values it gets from the Row at index 14. So, if it does not get anything at the said index, the returned map will be the same as Map.empty, which is a Map[Nothing, Nothing].
Which means that your whole,
val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head.getMap(14).keySet.toList.map(_.toString)
is equivalent to,
val colNames = Map.empty.keySet.toList.map(_.toString)
And hence,
scala> val l = List()
// l: List[Nothing] = List()
val colNames = l.map(_.toString)
To summarise the above, any List[Nothing] can only be an empty list.
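The same point can be seen in plain Scala, outside Spark: supplying the type parameters up front keeps the element type useful even when the collection is empty (a minimal sketch):

```scala
// With explicit type parameters the keys are already strings
val ks = Map.empty[String, Long].keySet    // Set[String]
val names: List[String] = ks.toList        // no .map(_.toString) needed

// Without them everything degenerates to Nothing
val untyped = Map.empty.keySet.toList      // List[Nothing], necessarily empty
```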
Now, there are two problems: one is the type problem with List[Nothing], the other is that the list is empty.
After the filter you can just select the column and get it as a Map, as below:
first().getAs[Map[String, Long]]("pv").keySet
Since you're accessing a single column only (at the 14th position), why not make your fellow developers' lives a bit easier (and help the people who will support your code later)?
Try the following:
val colNames = customerCountDF
  .where($"fiscal_year" === maxYear)   // Split one long filter into two
  .where($"fiscal_month" === maxMnth)  // where is a SQL-like alias of filter
  .select("pv")                        // Take just the field you need to work with
  .as[Map[String, Long]]               // Map it to the proper type
  .head                                // Load just the single field (all others are left aside)
  .keySet                              // That's just pure Scala
I think the above code says what it does in a very clear way (and it should be the fastest of the provided solutions, since it loads just the single pv field into a JVM object on the driver).
A workaround to get the final result as a List[String]. Check this out:
scala> val customerCountDF=Seq((2018,12,Map("subscribers_connects" -> 5521287L, "disconnects_hsd" -> 7992L, "subscribers_xfinity home" -> 6277491L, "subscribers_bulk units" -> 4978892L, "connects_cdv" -> 41464L, "connects_disconnects" -> 16945L, "connects_hsd" -> 32908L, "disconnects_internet essentials" -> 10319L, "disconnects_disconnects" -> 3506L, "disconnects_video" -> 8960L, "connects_xfinity home" -> 43012L))).toDF("fiscal_year","fiscal_month","mapc")
customerCountDF: org.apache.spark.sql.DataFrame = [fiscal_year: int, fiscal_month: int ... 1 more field]
scala> val maxYear =2018
maxYear: Int = 2018
scala> val maxMnth = 12
maxMnth: Int = 12
scala> val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).first.getMap(2).keySet.mkString(",").split(",").toList
colNames: List[String] = List(subscribers_connects, disconnects_hsd, subscribers_xfinity home, subscribers_bulk units, connects_cdv, connects_disconnects, connects_hsd, disconnects_internet essentials, disconnects_disconnects, disconnects_video, connects_xfinity home)

Full outer join in Scala

Given a list of lists, where each list has an object that represents the key, I need to write a full outer join that combines all the lists. Each record in the resulting list is the combination of all the fields of all the lists. In case that one key is present in list 1 and not present in list 2, then the fields in list 2 should be null or empty.
One solution I thought of is to embed an in-memory database, create the tables, run a select and get the result. However, I'd like to know if there are any libraries that handle this in a simpler way. Any ideas?
For example, let's say I have two lists, where the key is the first field in the list:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val allLists = List (list1, list2)
The full outer joined list would be:
val allListsJoined = List ((1,2,"A"), (3,4,None), (5,6,None), (7,None,"B"))
NOTE: the solution needs to work for N lists
def fullOuterJoin[K, V1, V2](xs: List[(K, V1)], ys: List[(K, V2)]): List[(K, Option[V1], Option[V2])] = {
  val map1 = xs.toMap
  val map2 = ys.toMap
  val allKeys = map1.keySet ++ map2.keySet
  allKeys.toList.map(k => (k, map1.get(k), map2.get(k)))
}
Example usage:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
println(fullOuterJoin(list1, list2))
Which prints:
List((1,Some(2),Some(A)), (3,Some(4),None), (5,Some(6),None), (7,None,Some(B)))
Edit per suggestion in comments:
If you're interested in joining an arbitrary number of lists and don't care about type info, here's a version that does that:
def fullOuterJoin[K](xs: List[List[(K, Any)]]): List[(K, List[Option[Any]])] = {
  val maps = xs.map(_.toMap)
  val allKeys = maps.map(_.keySet).reduce(_ ++ _)
  allKeys.toList.map(k => (k, maps.map(m => m.get(k))))
}
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val list3 = List((1, 3.5), (7, 4.0))
val lists = List(list1, list2, list3)
println(fullOuterJoin(lists))
which outputs:
List((1,List(Some(2), Some(A), Some(3.5))), (3,List(Some(4), None, None)), (5,List(Some(6), None, None)), (7,List(None, Some(B), Some(4.0))))
If you want both an arbitrary number of lists and well-typed results, that's probably beyond the scope of a stackoverflow answer but could probably be accomplished with shapeless.
Here is a way to do it using collect separately on both lists:
val list1Ite = list1.collect {
  case ele if list2.exists(_._1 == ele._1) =>
    // list2 contains a matching key: find it and join
    val left = list2.find(_._1 == ele._1)
    (ele._1, ele._2, left.get._2)
  case others => (others._1, others._2, None) // no match: add None as _3
}
// list1Ite: List[(Int, Int, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None))
Do a similar operation, but exclude the elements that are already covered by list1Ite:
val list2Ite = list2.collect {
  case ele if !list1.exists(_._1 == ele._1) => (ele._1, None, ele._2)
}
// list2Ite: List[(Int, None.type, String)] = List((7,None,B))
Combine list1Ite and list2Ite to get the result:
val result = list1Ite ++ list2Ite
// result: List[(Int, Any, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None), (7,None,B))

Map one value to all values with a common relation Scala

Having a set of data:
{sentenceA1}{\t}{sentenceB1}
{sentenceA1}{\t}{sentenceB2}
{sentenceA2}{\t}{sentenceB1}
{sentenceA3}{\t}{sentenceB1}
{sentenceA4}{\t}{sentenceB2}
I want to map a sentenceA to all the sentenceAs that have a common sentenceB in Scala, so the result will be something like this:
{sentenceA1}->{sentenceA2,sentenceA3,sentenceA4} or
{sentenceA2}->{sentenceA1, sentenceA3}
val lines = List(
  "sentenceA1\tsentenceB1",
  "sentenceA1\tsentenceB2",
  "sentenceA2\tsentenceB1",
  "sentenceA3\tsentenceB1",
  "sentenceA4\tsentenceB2"
)
val afterSplit = lines.map(_.split("\t"))

val ba = afterSplit
  .groupBy(_(1))
  .mapValues(_.map(_(0)))

val ab = afterSplit
  .groupBy(_(0))
  .mapValues(_.map(_(1)))

val result = ab.map { case (a, b) =>
  a -> b.foldLeft(Set[String]())(_ ++ ba(_)).diff(Set(a))
}
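Tracing this on the sample input: ba maps sentenceB1 to the A-sentences {A1, A2, A3} and sentenceB2 to {A1, A4}; folding those back through ab and subtracting each key itself gives the expected groupings. A self-contained version of the same pipeline (using an explicit map instead of mapValues, which is deprecated in Scala 2.13):

```scala
val lines = List(
  "sentenceA1\tsentenceB1",
  "sentenceA1\tsentenceB2",
  "sentenceA2\tsentenceB1",
  "sentenceA3\tsentenceB1",
  "sentenceA4\tsentenceB2"
)
val afterSplit = lines.map(_.split("\t"))
// sentenceB -> all sentenceAs that mention it
val ba = afterSplit.groupBy(_(1)).map { case (k, v) => k -> v.map(_(0)) }
// sentenceA -> all sentenceBs it mentions
val ab = afterSplit.groupBy(_(0)).map { case (k, v) => k -> v.map(_(1)) }
// for each sentenceA, collect every sentenceA sharing any of its sentenceBs
val result = ab.map { case (a, bs) =>
  a -> bs.foldLeft(Set.empty[String])(_ ++ ba(_)).diff(Set(a))
}
```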

How does Scala's mutable Map update [map(key) = newValue] syntax work?

I'm working through Cay Horstmann's Scala for the Impatient book where I came across this way of updating a mutable map.
scala> val scores = scala.collection.mutable.Map("Alice" -> 10, "Bob" -> 3, "Cindy" -> 8)
scores: scala.collection.mutable.Map[String,Int] = Map(Bob -> 3, Alice -> 10, Cindy -> 8)
scala> scores("Alice") // retrieve the value of type Int
res2: Int = 10
scala> scores("Alice") = 5 // Update the Alice value to 5
scala> scores("Alice")
res4: Int = 5
It looks like scores("Alice") hits apply in MapLike.scala. But this only returns the value, not something that can be updated.
Out of curiosity I tried the same syntax on an immutable map and was presented with the following error,
scala> val immutableScores = Map("Alice" -> 10, "Bob" -> 3, "Cindy" -> 8)
immutableScores: scala.collection.immutable.Map[String,Int] = Map(Alice -> 10, Bob -> 3, Cindy -> 8)
scala> immutableScores("Alice") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[String,Int]
immutableScores("Alice") = 5
^
Based on that, I'm assuming that scores("Alice") = 5 is transformed into scores update ("Alice", 5) but I have no idea how it works, or how it is even possible.
How does it work?
This is an example of the apply, update syntax.
When you call map("Something") this calls map.apply("Something") which in turn calls get.
When you call map("Something") = "SomethingElse" this calls map.update("Something", "SomethingElse") which in turn calls put.
Take a look at this for a fuller explanation.
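The rewrite is not special to Map: the compiler turns obj(args) into obj.apply(args), and obj(args) = value into obj.update(args, value), for any class that defines those methods. A minimal sketch with a hypothetical Scores class:

```scala
class Scores {
  private var store = Map.empty[String, Int]
  def apply(name: String): Int = store(name)     // enables scores("Alice")
  def update(name: String, value: Int): Unit =   // enables scores("Alice") = 5
    store += name -> value
}

val scores = new Scores
scores("Alice") = 5   // compiled as scores.update("Alice", 5)
```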
You can try this to update a map of lists:
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._
import scala.collection.concurrent
val map: concurrent.Map[String, List[String]] =
  new ConcurrentHashMap[String, List[String]].asScala

def updateMap(key: String, map: concurrent.Map[String, List[String]], value: String): Unit = {
  map.get(key) match {
    case Some(list: List[String]) =>
      val new_list = value :: list
      map.put(key, new_list)
    case None => map += (key -> List(value))
  }
}
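A quick usage sketch of that helper (note that new values are prepended with ::, so each list comes back in reverse insertion order):

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._
import scala.collection.concurrent

val m: concurrent.Map[String, List[String]] =
  new ConcurrentHashMap[String, List[String]].asScala

// same shape as the updateMap helper above
def updateMap(key: String, map: concurrent.Map[String, List[String]], value: String): Unit =
  map.get(key) match {
    case Some(list) => map.put(key, value :: list)
    case None       => map += (key -> List(value))
  }

updateMap("k", m, "v1")
updateMap("k", m, "v2")
```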
The problem is that the map type you're using doesn't define an update method. I had the same error message when my map was declared as
var m = new java.util.HashMap[String, Int]
But when I replaced the definition with
var m = new scala.collection.mutable.HashMap[String, Int]
the m.update worked.