I have an array of names. I need to group equal elements and select the name with the highest count. If there is a tie, I want the alphabetically last name. I have the below so far:
val names = Array("Adam", "Eve", "Adam", "Eve", "John", "Doe")
val countNames = names.map(x => (x, names.count(_ == x))).toSeq.sortBy(_._2)
My result should be Eve. How can I get that?
You can use the natural ordering of a (Int, String) tuple, and get the last element:
scala> names.map(x => (names.count(_ == x), x)).sorted.last._2
res0: String = Eve
This works because the natural tuple ordering compares the counts first, so the highest count sorts last; among tuples with the same count the names are compared alphabetically, which again places the alphabetically "higher" name last.
P.S. The grouping can also be done using groupBy:
names.groupBy(identity) // Map(Adam -> Array(Adam, Adam), Doe -> Array(Doe), Eve -> Array(Eve, Eve), John -> Array(John))
.mapValues(_.length) // Map(Adam -> 2, Doe -> 1, Eve -> 2, John -> 1)
.toSeq.map(_.swap) // ArrayBuffer((2,Adam), (1,Doe), (2,Eve), (1,John))
.sorted // ArrayBuffer((1,Doe), (1,John), (2,Adam), (2,Eve))
.last._2 // Eve
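Since only the largest tuple is needed, the full sort can be skipped. The following is a minimal sketch, assuming the same names array as in the question (the name winner is only for illustration): it takes max over (count, name) tuples, which relies on the same natural tuple ordering described above.

```scala
val names = Array("Adam", "Eve", "Adam", "Eve", "John", "Doe")

// Build (count, name) tuples and take the maximum under the natural
// tuple ordering: highest count wins, ties go to the later name.
// `.iterator` keeps the result a plain sequence of tuples; mapping the
// Map directly would key it by count and silently drop tied names.
val winner: String =
  names.groupBy(identity)
    .iterator
    .map { case (name, group) => (group.length, name) }
    .max
    ._2
```

For the sample data this should select Eve, the same answer as the sorted version.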
Define a lexicographical order in which the count has priority over the string. Then use it to sort your list in descending order and pick the first element:
// countNames holds (name, count) pairs, so the name is _1
countNames.sortWith( (x,y) =>
(x._2 > y._2) ||
(x._2 == y._2 && x._1 > y._1) )(0)._1
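The same lexicographical order can also be written as a reusable Ordering value. This is only a sketch under the question's definitions; the names byCountThenName and best are made up for illustration.

```scala
val names = Array("Adam", "Eve", "Adam", "Eve", "John", "Doe")

// (name, count) pairs, built the same way as in the question
val countNames: Seq[(String, Int)] =
  names.map(x => (x, names.count(_ == x))).toSeq

// The count has priority; the name is only compared to break ties.
val byCountThenName: Ordering[(String, Int)] =
  Ordering.by[(String, Int), (Int, String)] { case (name, count) => (count, name) }

// The maximum under this ordering is the most frequent name,
// with ties resolved in favour of the alphabetically last one.
val best: String = countNames.max(byCountThenName)._1
```

Passing byCountThenName.reverse to sorted and taking head would give the same element, at the cost of sorting the whole sequence.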
I have a Cassandra table that I access through SparkContext.cassandraTable(), which gives me all of my CassandraRow objects.
Now I want to keep three pieces of information: (user, city, byte).
I store them like this:
rddUsersFilter.map(row =>
(row.getString("user"),(row.getString("city"),row.getString("byte").replace(",","").toLong))).groupByKey
I obtain an RDD[(String, Iterable[(String, Long)])].
Now, for each user, I want to sum all the bytes and build a Map of cities like "city" -> occurrences (how many times this city appears for this user).
Previously, I split this code into two different RDDs: one to sum the bytes, the other to build the map as described.
Example of counting occurrences per city:
rddUsers.map(user => (user._1, user._2.size, user._2.groupBy(identity).map(city => (city._1,city._2.size))))
That worked because I could reach the second element of my tuple with ._2. But now?
My second element is an Iterable[(String, Long)], and I can't map over it the way I did before.
Is there a way to retrieve all of this information with just one RDD and a single MapReduce pass?
You can do this by first aggregating the byte total and the occurrence count per (user, city) pair, and then grouping by user:
val data = Array(("user1","city1",100),("user1","city1",100),
("user1","city1",100),("user1","city2",100),("user1","city2",100),
("user1","city3",100),("user1","city2",100),("user2","city1",100),
("user2","city2",100))
val rdd = sc.parallelize(data)
val res = rdd.map(x=> ((x._1,x._2),(1,x._3)))
.reduceByKey((x,y)=> (x._1+y._1,x._2+y._2))
.map(x => (x._1._1,(x._1._2,x._2._1,x._2._2)))
.groupByKey
val userCityUsageRdd = res.map(x => {
val m = x._2.toList
(x._1 ,m.map(y => (y._1->y._2)).toMap, m.map(x => x._3).reduce(_+_))
})
Output:
res20: Array[(String, scala.collection.immutable.Map[String,Int], Int)] =
Array((user1,Map(city1 -> 3, city3 -> 1, city2 -> 3),700),
(user2,Map(city1 -> 1, city2 -> 1),200))
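The per-user step can also be expressed as a plain function over the Iterable[(String, Long)] that groupByKey produces in the question. The sketch below uses only standard collections so it can be tried without a cluster; the name summarize is an assumption for illustration, not part of Spark.

```scala
// Folds one user's (city, bytes) pairs into the total number of bytes
// and a map of city -> number of occurrences, in a single traversal.
def summarize(visits: Iterable[(String, Long)]): (Long, Map[String, Int]) =
  visits.foldLeft((0L, Map.empty[String, Int])) {
    case ((total, cities), (city, bytes)) =>
      (total + bytes, cities.updated(city, cities.getOrElse(city, 0) + 1))
  }

// Small local check with made-up data
val sample = Seq(("city1", 100L), ("city1", 100L), ("city2", 50L))
val (totalBytes, cityCounts) = summarize(sample)
```

On the grouped RDD from the question, such a function could presumably be applied with mapValues(summarize), which would keep everything in one RDD.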
How would I count the number of occurrences of an element in a Map?
example
val myMap = Map("word1" -> "foo", "word3" -> "word4", "word5" -> "foo")
myMap contains "foo" count //???
// returns 2
You can just use count with a predicate:
myMap.count({ case (k, v) => v == "foo" })
Alternatively:
myMap.values.count(_ == "foo")
Or even:
myMap.count(_._2 == "foo") // _2 is the second tuple element
Note: this counts values, not keys. Keys are unique.
In general, if you want to count occurrences in a Map, you can group the entries by value and then map each group to its size:
scala> val occurrences = myMap groupBy ( _._2 ) mapValues ( _.size )
occurrences: scala.collection.immutable.Map[String,Int] = Map(word4 -> 1, foo -> 2)
This is handy if you need the counts for every value, and not only for a single one.
Otherwise @Ven's solution is more efficient.
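To make the two options concrete, here is a small sketch using the map from the question. The names fooCount and valueCounts are illustrative only, and a strict map is used instead of mapValues so that the sizes are computed once.

```scala
val myMap = Map("word1" -> "foo", "word3" -> "word4", "word5" -> "foo")

// Count for a single value: one pass, no intermediate map.
val fooCount: Int = myMap.values.count(_ == "foo")

// Counts for every value: group equal values, then keep only the group sizes.
val valueCounts: Map[String, Int] =
  myMap.values.groupBy(identity).map { case (value, same) => value -> same.size }
```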
Yet another way,
myMap.values.filter { _ == "foo" }.size
I have a Map like:
Map("product1" -> List(Product1ObjectTypes), "product2" -> List(Product2ObjectTypes))
where ProductObjectType has a field usage. Based on the other field (counter) I have to update all ProductXObjectTypes.
The issue is that this update depends on previous ProductObjectType, and I can't find a way to get previous item when iterating over mapValues of this map. So basically, to update current usage I need: CurrentProduct1ObjectType.counter - PreviousProduct1ObjectType.counter.
Is there any way to do this?
I started it like:
val reportsWithCalculatedUsage =
reportsRefined.flatten.flatten.toList.groupBy(_._2.product).mapValues(f)
but I don't know in mapValues how to access previous list item.
I'm not sure if I understand completely, but if you want to update the values inside the lists based on their predecessors, this can generally be done with a fold:
case class Thing(product: String, usage: Int, counter: Int)
val m = Map(
"product1" -> List(Thing("Fnord", 10, 3), Thing("Meep", 0, 5))
//... more mappings
)
//> Map(product1 -> List(Thing(Fnord,10,3), Thing(Meep,0,5)))
m mapValues { list => list.foldLeft(List[Thing]()){
case (Nil, head) =>
List(head)
case (tail, head) =>
val previous = tail.head
val current = head copy (usage = head.usage + head.counter - previous.counter)
current :: tail
} reverse }
//> Map(product1 -> List(Thing(Fnord,10,3), Thing(Meep,2,5)))
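An alternative to the fold is to pair every element with its predecessor by zipping the list with its own tail. This is only a sketch under the same assumed Thing class; withUsage is a hypothetical helper name. Because copy changes only usage, reading counter from the original predecessor gives the same result as the fold.

```scala
case class Thing(product: String, usage: Int, counter: Int)

// Keeps the first element as-is and updates every later element
// from the counter of the element directly before it.
def withUsage(things: List[Thing]): List[Thing] = things match {
  case Nil => Nil
  case first :: rest =>
    first :: things.zip(rest).map { case (prev, cur) =>
      cur.copy(usage = cur.usage + cur.counter - prev.counter)
    }
}

val updated = withUsage(List(Thing("Fnord", 10, 3), Thing("Meep", 0, 5)))
```

It could then be applied to each list in the map, for example with m.map { case (k, v) => k -> withUsage(v) }.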
Note that a regular Map is an unordered collection; you need something like TreeMap to get a predictable order of iteration.
Anyway, from what I understand, you want to get pairs of consecutive entries in a map. Try something like this:
scala> val map = Map(1 -> 2, 2 -> 3, 3 -> 4)
scala> (map, map.tail).zipped.foreach((t1, t2) => println(t1 + " " + t2))
(1,2) (2,3)
(2,3) (3,4)
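If the iteration order matters, the same pairing can be done on a TreeMap. A minimal sketch with made-up data follows; the names ordered, entries and consecutive are illustrative.

```scala
import scala.collection.immutable.TreeMap

// A TreeMap iterates in key order, regardless of insertion order.
val ordered = TreeMap(3 -> 4, 1 -> 2, 2 -> 3)

// Pair every entry with the entry that follows it.
val entries = ordered.toList
val consecutive = entries.zip(entries.drop(1))
```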
Suppose you have
val docs = List(List("one", "two"), List("two", "three"))
where e.g. List("one", "two") represents a document containing terms "one" and "two", and you want to build a map with the document frequency for every term, i.e. in this case
Map("one" -> 1, "two" -> 2, "three" -> 1)
How would you do that in Scala? (And in an efficient way, assuming a much larger dataset.)
My first Java-like thought is to use a mutable map:
val freqs = mutable.Map.empty[String,Int]
for (doc <- docs)
for (term <- doc)
freqs(term) = freqs.getOrElse(term, 0) + 1
which works well enough but I'm wondering how you could do that in a more "functional" way, without resorting to a mutable map?
Try this:
scala> docs.flatten.groupBy(identity).mapValues(_.size)
res0: Map[String,Int] = Map(one -> 1, two -> 2, three -> 1)
If you are going to be accessing the counts many times, then you should avoid mapValues since it is "lazy" and, thus, would recompute the size on every access. This version gives you the same result but won't require the recomputations:
docs.flatten.groupBy(identity).map(x => (x._1, x._2.size))
The identity function just means x => x.
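One caveat worth checking: flattening counts every occurrence of a term, so a term repeated inside one document is counted more than once. If document frequency is meant strictly (the number of documents containing the term), each document can be de-duplicated first. The sketch below assumes an input with a repeated term; docsWithRepeats and docFreqs are illustrative names.

```scala
// Hypothetical input in which "one" occurs twice inside the first document.
val docsWithRepeats = List(List("one", "one", "two"), List("two", "three"))

// `distinct` makes each document contribute a term at most once.
val docFreqs: Map[String, Int] =
  docsWithRepeats
    .flatMap(_.distinct)
    .groupBy(identity)
    .map { case (term, occurrences) => term -> occurrences.size }
```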
docs.flatten.foldLeft(new Map.WithDefault(Map[String,Int](),Function.const(0))){
(m,x) => m + (x -> (1 + m(x)))}
What a train wreck!
[Edit]
Ah, that's better!
docs.flatten.foldLeft(Map[String,Int]() withDefaultValue 0){
(m,x) => m + (x -> (1 + m(x)))}
Starting Scala 2.13, after flattening the list of lists, we can use groupMapReduce which is a one-pass alternative to groupBy/mapValues:
// val docs = List(List("one", "two"), List("two", "three"))
docs.flatten.groupMapReduce(identity)(_ => 1)(_ + _)
// Map[String,Int] = Map("one" -> 1, "three" -> 1, "two" -> 2)
This:
flattens the List of Lists as a List
groups list elements (identity) (group part of groupMapReduce)
maps each grouped value occurrence to 1 (_ => 1) (map part of groupMapReduce)
reduces values within a group of values (_ + _) by summing them (reduce part of groupMapReduce).
Apologies: I'm well noob
I have an items class
class item(ind:Int,freq:Int,gap:Int){}
I have an ordered list of ints
val listVar = a.toList
where a is an array
I want a list of items called metrics where
ind is the (unique) integer
freq is the number of times that ind appears in list
gap is the minimum gap between ind and the number in the list before it
so far I have:
def metrics = for {
n <- 0 until 255
listVar filter (x == n) count > 0
}
yield new item(n, (listVar filter == n).count,0)
It's crap and I know it - any clues?
Well, some of it is easy:
val freqMap = listVar groupBy identity mapValues (_.size)
This gives you ind and freq. To get gap I'd use a fold:
val gapMap = listVar.sliding(2).foldLeft(Map[Int, Int]()) {
case (map, List(prev, ind)) =>
map + (ind -> (map.getOrElse(ind, Int.MaxValue) min ind - prev))
}
Now you just need to unify them:
freqMap.keys.map( k => new item(k, freqMap(k), gapMap.getOrElse(k, 0)) )
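Put together, the three steps might look like the sketch below. It makes a few assumptions that are not in the question: a case class Item stands in for the item class so that results can be compared, a strict map replaces mapValues, and an extra fallback case covers lists with fewer than two elements.

```scala
// Hypothetical case-class stand-in for the question's `item` class.
case class Item(ind: Int, freq: Int, gap: Int)

def metrics(listVar: List[Int]): Set[Item] = {
  // ind -> number of occurrences
  val freqMap = listVar.groupBy(identity).map { case (k, v) => k -> v.size }

  // ind -> smallest difference to the element directly before it
  val gapMap = listVar.sliding(2).foldLeft(Map.empty[Int, Int]) {
    case (acc, List(prev, ind)) =>
      acc + (ind -> math.min(acc.getOrElse(ind, Int.MaxValue), ind - prev))
    case (acc, _) => acc // fewer than two elements: no gaps to record
  }

  // Values that never had a predecessor get a gap of 0, as in the answer.
  freqMap.keySet.map(k => Item(k, freqMap(k), gapMap.getOrElse(k, 0)))
}

val result = metrics(List(1, 1, 3, 7, 7, 8))
```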
Ideally you want to traverse the list only once and, for each distinct Int along the way, increment a counter (the frequency) and keep track of the minimum gap.
You can use a case class to store the frequency and the minimum gap; the stored value is immutable. Note that minGap may not be defined.
case class Metric(frequency: Int, minGap: Option[Int])
In the general case you can use a Map[Int, Metric] to look up the immutable Metric for each Int. Finding the minimum gap is the harder part. For that you can use the sliding(2) method: it traverses the list with a sliding window of size two, which lets you compare each Int with the value before it and compute the gap.
Finally, you need to accumulate and update this information as you traverse the list. This can be done by folding each element into an intermediate result until the whole list has been traversed and the result is complete.
Putting things together:
// Seed the map with the first element: the fold below only counts the
// second element of each window, so the head would otherwise be missed.
val seed = listVar.headOption
.map(h => Map(h -> Metric(1, None)))
.getOrElse(Map.empty[Int, Metric])
.withDefaultValue(Metric(0, None))
listVar.sliding(2).foldLeft(seed) {
case (map, List(a, b)) =>
val metric = map(b)
val newGap = metric.minGap match {
case None => math.abs(b - a)
case Some(gap) => math.min(gap, math.abs(b - a))
}
val newMetric = Metric(metric.frequency + 1, Some(newGap))
map + (b -> newMetric)
case (map, List(a)) =>
map + (a -> Metric(1, None))
case (map, _) =>
map
}
Result for listVar: List[Int] = List(2, 2, 4, 4, 0, 2, 2, 2, 4, 4)
scala.collection.immutable.Map[Int,Metric] = Map(2 -> Metric(5,Some(0)),
4 -> Metric(4,Some(0)), 0 -> Metric(1,Some(4)))
You can then turn the result into your desired item class using map.toSeq.map { case (i, m) => new item(i, m.frequency, m.minGap.getOrElse(-1)) }.
You can also create the item objects directly during the fold, but I thought the code would be harder to read.