Scala: How to create a map over a collection from a set of keys?

Say I have a set of people, Set[People]. Each person has an age. I want to create a function which creates a Map[Int, Seq[People]] where, for each age from, say, 0 to 100, there is a sequence of people of that age, or an empty sequence if there were no people of that age in the original collection.
I.e. I'm doing something along the lines of
Set[People].groupBy(_.age)
where the output was
Map[Int, Seq[People]](0 -> Seq(John, Mary), 1 -> Seq(), 2 -> Seq(Bill), ...)
groupBy of course omits all those ages for which there are no people. How should I implement this?

Configure a default value for your map:
val grouped = people.groupBy(_.age).withDefaultValue(Set())
If you need the values to be sequences, you can map them:
val grouped = people.groupBy(_.age).mapValues(_.toSeq).withDefaultValue(Seq())
Remember that, as the documentation puts it:
Note: `get`, `contains`, `iterator`, `keys`, etc are not affected by `withDefault`.
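For example, a quick check of that caveat (a sketch, assuming a simple Person case class with an age field):
case class Person(name: String, age: Int)
val people = Set(Person("John", 0), Person("Mary", 0), Person("Bill", 2))
val grouped = people.groupBy(_.age).mapValues(_.toSeq).withDefaultValue(Seq())
grouped(1)     // Seq() -- apply falls back to the default
grouped.get(1) // None  -- get is not affected by withDefault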

Since you've got a map with non-empty sequences for the ages that occur, you can fill in the rest with empty collections:
val fullMap = (0 to 100).map(index => index -> map.getOrElse(index, Seq.empty)).toMap
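Combined with the grouping from the first answer, the whole pipeline reads (a sketch; people is the original Set[People]):
val grouped = people.groupBy(_.age).mapValues(_.toSeq)
val fullMap = (0 to 100).map(age => age -> grouped.getOrElse(age, Seq.empty)).toMap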

Related

How to sort spark countByKey() result which is in scala.collection.Map[(String, String),Long] based on value?

CSV table stored in location "/user/root/sqoopImport/orders"
val orders = sc.textFile("/user/root/sqoopImport/orders")
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().foreach(println)
Here I am getting this result, unsorted by key (String, String):
((2014-03-19 00:00:00.0,PENDING),9)
((2014-04-18 00:00:00.0,ON_HOLD),11)
((2013-09-17 00:00:00.0,ON_HOLD),8)
((2014-07-10 00:00:00.0,COMPLETE),57)
I want to sort it, so I have tried
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
<console>:30: error: value sortBy is not a member of scala.collection.Map[(String, String),Long]
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
countByKey() is an action. It finishes the Spark calculation and gives you a normal Scala Map. Since a Map is unordered, it makes no sense to sort it: you need to convert it to a Seq first, using toSeq. If you want to stay in Spark land, you should use a transformation instead, in this case reduceByKey():
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).reduceByKey(_ + _).sortBy(_._1).foreach(println)
Also, please note that foreach(println) will only work as you expect in local mode: https://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd.
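If you do want to keep countByKey(), the toSeq conversion mentioned above would look like this (a sketch):
orders.map(_.split(",")).map(x => ((x(1), x(3)), 1))
  .countByKey().toSeq.sortBy(_._1).foreach(println)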
A Map is an unordered collection. You would need to convert that map into a collection that maintains order and sort it by key. ex:
val sorted = map.toSeq.sortBy {
  case (key, _) => key
}
This is because
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey()
returns a Map[(String, String),Long], on which we cannot call the sortBy() function.
What you can do is
val result = orders.map(_.split(",")).
  map(x => ((x(1), x(3)), 1)).countByKey().toSeq
// and apply the sortBy function on a new RDD
sc.parallelize(result).sortBy(_._1).collect().foreach(println)
Hope this helps!

spark-graphx finding the most active user?

I have a graph of this form:
    _ 3 _
   /'   '\
 (1)     (1)
  /       \
 1--(2)--->2
I want to find the most active user (the one who follows the most; here it's user 1, who follows user 2 twice and user 3 once).
My graph is of this form, Graph[Int, Int]:
val edges = Array(Edge(1,10,1), Edge(10,1,1), Edge(11,1,1), Edge(1,11,1), Edge(1,12,1))
val vertices = Array((12L,12), (10L,10), (11L,11), (1L,1))
val graph = Graph(sc.parallelize(vertices),sc.parallelize(edges),0)
My idea is to group the edges by srcId and to count using the iterator, then sort, but I have trouble using the iterator; the types are quite complex:
graph.edges.groupBy(_.dstId).collect() has type:
Array[(org.apache.spark.graphx.VertexId,Iterable[org.apache.spark.graphx.Edge[Int]])]
Any ideas ?
Your idea of grouping by srcId is good, since you are looking for the relation "follows" and not "is followed by" (your example uses dstId, by the way).
val group = graph.edges.groupBy(_.srcId)
group now contains the edges going out of each vertex. We can now sum the attributes to get the total number of times each user follows any other user.
val followCount = group.map {
  case (vertex, edges) => (vertex, edges.map(_.attr).sum)
}.collect
Which produces
Array((10,1), (11,1), (1,3))
Now if you want to extract the user who follows the most, you can simply sort in descending order and take the head of the list, which gives the most active user.
val mostActiveUser = followCount.sortBy(- _._2).head
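As a variation (a sketch), you can stay in RDD land the whole way by summing edge attributes per source with reduceByKey and taking the top entry:
val mostActive = graph.edges
  .map(e => (e.srcId, e.attr))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .first() // (1, 3) for the sample graph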

Count unique values in list of sub-lists

I have an RDD of the following structure (RDD[(String, Map[String, List[Product with Serializable]])]):
Here is some sample data:
(600,Map(base_data -> List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)), additional_data -> List((1,2))))
(601,Map(base_data -> List((10:01 01-08-2016,600,111,1,2), (10:02 01-08-2016,619,111,1,2), (10:01 01-08-2016,600,111,1,4)), additional_data -> List((5,6))))
I want to calculate the number of unique values in the last field of the sub-lists.
For instance, let's take the first entry. Its list is List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)). It contains 2 unique values (1 and 5) in the last field of its sub-lists.
As to the second entry, it also contains 2 unique values (2 and 4), because 2 appears twice.
The resulting RDD should be of the format RDD[Map[String,Any]].
I tried to solve this task as follows:
val result = myRDD.map({
  line => Map(("id", line._1),
    ("unique_count", line._2.get("base_data").groupBy(l => l).count(_))))
})
However, this code does not do what I need. In fact, I don't know how to properly indicate that I want to group by the last field...
You are quite close to the solution. There is no need to call groupBy: you can access the items of the tuples by index, transform the resulting List into a Set, and then just return the size of the Set, which corresponds to the number of unique elements:
("unique_count", line._2("base_data").map(bd => bd.productElement(4)).toSet.size)

Scala filter by set

Say I have a map that looks like this
val map = Map("Shoes" -> 1, "heels" -> 2, "sneakers" -> 3, "dress" -> 4, "jeans" -> 5, "boyfriend jeans" -> 6)
And also I have a set or collection that looks like this:
val set = Array(Array("Shoes", "heels", "sneakers"), Array("dress", "maxi dress"), Array("jeans", "boyfriend jeans", "destroyed jeans"))
I would like to perform a filter operation on my map so that only one element from each of my sets is retained. Expected output should be something like this:
map = Map("Shoes" -> 1, "dress" -> 4 ,"jeans" -> 5)
The purpose of doing this is so that if I have multiple sets that indicate different categories of outfits, my output map doesn't "repeat" itself on technically the same objects.
Any help is appreciated, thanks!
First, let's get rid of a source of confusion: your "sets" are actually arrays. For the rest of the example I will use this definition instead:
val arrays = Array(Array("Shoes", "heels", "sneakers"), Array("dress", "maxi dress"), Array("jeans", "boyfriend jeans", "destroyed jeans"))
So in a sense you have an array of arrays of equivalent objects and want to remove all but one of them?
Well, first you have to find which of the elements in an array are actually used as keys in the map. So we just filter out all elements that are not used as keys:
array.filter(map.keySet)
Now, we have to choose one element. As you said, we just take the first one:
array.filter(map.keySet).head
As your "sets" are actually arrays, this is really the first element in your array that is also used as a key. If you would actually use sets this code would still work as sets actually have a "first element". It is just highly implementations specific and it might not even be deterministic over various executions of the same program. At least for immutable sets it should however be deterministic over several calls to head, i.e., you should always get the same element.
Instead of the first element we are actually interested in all other elements, as we want to remove them from the map:
array.filter(map.keySet).tail
Now, we just have to remove those from the map:
map -- array.filter(map.keySet).tail
And to do it for all arrays:
map -- arrays.flatMap(_.filter(map.keySet).tail)
This works fine as long as the arrays are disjoint. If they are not, we cannot use the initial map to filter the array in every step. Instead, we have to use one array to compute a new map, then process the next one starting with the result of the last, and so on. Luckily, we do not have to do much:
arrays.foldLeft(map){(m,a) => m -- a.filter(m.keySet).tail}
Note: Sets are also functions from elements to Boolean; that is why this solution works.
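A quick check against the sample data (a sketch; note that tail would throw if an array shared no key with the map, which does not happen here):
val filtered = arrays.foldLeft(map) { (m, a) => m -- a.filter(m.keySet).tail }
// filtered == Map("Shoes" -> 1, "dress" -> 4, "jeans" -> 5)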
This code solves the problem:
var newMap = map
set.foreach { list =>
  var remove = false
  list.foreach { _key =>
    if (remove) {
      newMap -= _key
    }
    if (newMap.contains(_key)) {
      remove = true
    }
  }
}
I'm completely new to Scala. I have taken this as my first Scala exercise; any hints from Scala gurus are welcome.
The basic idea is to use groupBy. Something like
map.groupBy { case (k, v) => g(k) }.
  map { case (_, kvs) => kvs.head }
This is the general way to group similar things (using some function g). Now the question is just how to make the g that you need. One way is
val g = set.zipWithIndex.
  flatMap { case (a, i) => a.map(x => x -> i) }.
  toMap
which labels each set with a number, and then forms a map so you can look it up. Maps have an apply function, so you can use it as above.
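For instance, with the sample map and set above (a sketch; head on a Map picks an arbitrary representative, so which duplicate survives is not guaranteed):
val deduped = map.groupBy { case (k, _) => g(k) }
  .map { case (_, kvs) => kvs.head }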
A slightly simpler version:
set.flatMap(_.find(map.contains).map(y => y -> map(y))).toMap

scala neat way to match and map out Iterable

I am still learning to code in Scala/Spark, and I'd greatly appreciate your help.
I have an Iterable of (Key, Double) pairs.
The keys in the Iterable are a subset of, say, 5 possible keys:
population
age
gender
height
weight
and the Double is the corresponding reading.
My question is that I want to represent my data in a flat format:
(Double,Double,Double,Double,Double)
which corresponds to the readings from the keys in this specific order:
(population,age,gender,height,weight)
but where a key does not exist in the Iterable, I still need to pad it with a 0. So for example:
Iterable((population,10),(age,21),(gender,0))
I want to be able to represent this as
(10,21,0,0,0) // the last 2 zeros are padding because there are no keys matching height and weight
So far I've been matching each key individually (if it exists, copy the Double; if not, pad with zero), but I want to know if there is a neater way of doing this.
Thanks
Personally, I'd create a Map. So say you've got your Iterable:
val values = Seq(("population", 10), ("age", 21), ("gender", 0)).toIterable
Convert it to a map:
val keyValueMap = values.toMap
And when you extract the values from it, just use the getOrElse function:
keyValueMap.getOrElse("height", defaultHeight)
What I'd do is first define the order of the elements, so:
val keyOrder = List("population", "age", "gender", "height", "weight")
Next, you can just do something like:
val valMap = Map("population" -> 199D, "gender" -> 1D, "weight" -> 50D)
keyOrder.map(k => valMap.getOrElse(k, 0D))
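Putting it together with the sample readings from the question (a sketch):
val keyOrder = List("population", "age", "gender", "height", "weight")
val readings = Iterable(("population", 10D), ("age", 21D), ("gender", 0D))
val readingsMap = readings.toMap
val padded = keyOrder.map(k => readingsMap.getOrElse(k, 0D))
// padded == List(10.0, 21.0, 0.0, 0.0, 0.0)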