My dataset contains 5 columns, with the last column as the class index. I want the counts of the combinations of each column's values with the class-index values.
"sunny", "hot", "high", "false","no"
"sunny", "hot", "high", "true","no"
"overcast", "hot", "high", "false","yes"
"rainy", "mild", "high", "false","yes"
I want the combination counts: sunny & yes = 0, sunny & no = 2, overcast & yes = 1, rainy & yes = 1.
Gather each row into a case class Weather with 5 properties,
case class Weather(p1: String, p2: String, p3: String, p4: String, p5: String)
so the data becomes
val xs = Array(
Weather("sunny", "hot", "high", "false","no"),
Weather("sunny", "hot", "high", "true","no"),
Weather("overcast", "hot", "high", "false","yes"),
Weather("rainy", "mild", "high", "false","yes"))
group the entries by the first and last properties, then count the number of instances in each group, for instance like this:
xs.groupBy( w => (w.p1,w.p5) ).mapValues(_.size)
which delivers
Map((overcast,yes) -> 1, (sunny,no) -> 2, (rainy,yes) -> 1)
However, this approach does not account for missing or undeclared groups such as ("sunny", "yes").
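One way to cover those with this approach (a minimal sketch, assuming the xs array above) is to give the counted map a default of 0 and enumerate every (p1, p5) combination:
val counts = xs.groupBy(w => (w.p1, w.p5)).mapValues(_.size).withDefaultValue(0)
// cross product of all observed p1 and p5 values, each looked up with default 0
val allCounts = for (p1 <- xs.map(_.p1).toSet; p5 <- xs.map(_.p5).toSet)
  yield ((p1, p5), counts((p1, p5)))
// Set(((sunny,yes),0), ((sunny,no),2), ((overcast,yes),1), ((overcast,no),0), ((rainy,yes),1), ((rainy,no),0))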
The description of your dataset seems a little vague to me; which data structure are you using to represent it?
Assuming it is a list, you could try something like:
l => (l.head, l.last)
Applying this to the entire set:
val dataset = List(
  "sunny"::"hot"::"high"::"false"::"no"::Nil,
  "sunny"::"hot"::"high"::"true"::"no"::Nil,
  "overcast"::"hot"::"high"::"false"::"yes"::Nil,
  "rainy"::"mild"::"high"::"false"::"yes"::Nil
)
val qualified = dataset.map(l => (l.head, l.last))
Once you have your elements qualified with the "yes"/"no" class, you can group the occurrences and count the number of elements in each group:
val countMap = qualified.groupBy(x => x).map(kv => (kv._1, kv._2.size))
Or the shorter form:
val countMap = qualified.groupBy(x => x).mapValues(_.size)
In order to list all possibilities, even those whose count is 0, you can generate all possible combinations and use the map to look up each count value:
(
for(
st <- dataset.map(_.head).toSet[String];
q <- dataset.map(_.last).toSet[String]
) yield (st,q)
).map(k => (k, countMap.getOrElse(k,0)))
> Set(((rainy,no),0), ((sunny,yes),0), ((sunny,no),2), ((rainy,yes),1), ((overcast,yes),1), ((overcast,no),0))
I have a HashSet that has ids like:
d1, d2, d3
and a HashMap that has some variables as keys and ids as values:
(c1,d2), (c2,d1), (c1,d1), (c1,d3)
What I want is to find the keys in the HashMap that have all the ids from the set as values and put them in another set.
For example, for the data above, the only key I would put in the new set is c1. That's because afterwards I want to do some computations and I want to prevent any NaNs. I am thinking about writing a function that computes it, but I was wondering if there is an easier and faster way in Scala that I don't know. Does anyone have any clue?
A HashMap cannot contain distinct entries with the same key, e.g.
Map((c1,d2), (c2,d1), (c1,d1), (c1,d3)) will collapse to Map((c2,d1), (c1,d3)).
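A quick REPL illustration of that collapse:
scala> Map("c1" -> "d2", "c2" -> "d1", "c1" -> "d1", "c1" -> "d3")
res0: scala.collection.immutable.Map[String,String] = Map(c1 -> d3, c2 -> d1)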
val set = Set("d1", "d2", "d3")
val arr = Array(("c1", "d2"), ("c2", "d1"), ("c1", "d1"), ("c1", "d3"))
arr.groupBy(_._1)
  .mapValues(_.map(_._2).toSet)
  .filter(_._2 == set).keys
Here's one approach:
val s = Set("d1", "d2", "d3")
val l = List(("c1", "d2"), ("c2", "d1"), ("c1", "d1"), ("c1", "d3"))
l.groupBy(_._1).mapValues(_.map(_._2).toSet).
filter{ case (_, v) => (v intersect s) == s }.
keys.toList
// res2: List[String] = List(c1)
I have an array of names. I need to group by element and select the name with the highest count. If there are ties, return the alphabetically last name. I have the below so far:
val names = Array("Adam", "Eve", "Adam", "Eve", "John", "Doe")
val countNames = names.map(x => (x, names.count(_ == x))).toSeq.sortBy(_._2)
My result should be Eve. How can I get that?
You can use the natural ordering of an (Int, String) tuple, and get the last element:
scala> names.map(x => (names.count(_ == x), x)).sorted.last._2
res0: String = Eve
This would work as expected because this ordering would put the highest count last, and among tuples with same count would sort alphabetically, again placing "higher" alphabetical values last.
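For instance, a quick illustration of that tuple ordering:
scala> Seq((2, "Adam"), (1, "Doe"), (2, "Eve"), (1, "John")).sorted
res1: Seq[(Int, String)] = List((1,Doe), (1,John), (2,Adam), (2,Eve))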
p.s. the grouping can also be done using groupBy:
names.groupBy(identity) // Map(Adam -> Array(Adam, Adam), Doe -> Array(Doe), Eve -> Array(Eve, Eve), John -> Array(John))
.mapValues(_.length) // Map(Adam -> 2, Doe -> 1, Eve -> 2, John -> 1)
.toSeq.map(_.swap) // ArrayBuffer((2,Adam), (1,Doe), (2,Eve), (1,John))
.sorted // ArrayBuffer((1,Doe), (1,John), (2,Adam), (2,Eve))
.last._2 // Eve
Define a lexicographical order, where the count has priority over the string. Then use it to sort your list and pick the first element:
countNames.sortWith( (x, y) =>
  (x._2 > y._2) ||
  (x._2 == y._2 && x._1 > y._1) )(0)._1
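A compact alternative sketch (same names array; maxBy is my substitution, not part of the original answer) takes the maximum under the (count, name) ordering instead of sorting everything:
names.groupBy(identity).mapValues(_.length).maxBy { case (name, count) => (count, name) }._1
// res: String = Eve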
I have a Cassandra table, accessed via SparkContext.cassandraTable(), from which I retrieve all my CassandraRows.
Now I want to store 3 pieces of information: (user, city, byte).
I store them like this:
rddUsersFilter.map(row =>
(row.getString("user"),(row.getString("city"),row.getString("byte").replace(",","").toLong))).groupByKey
I obtain an RDD[(String, Iterable[(String, Long)])].
Now, for each user, I want to sum all the bytes and create a map of cities like "city" -> occurrences (how many times this city appears for this user).
Previously, I split this code into two different RDDs, one to sum the bytes, the other to create the map as described.
Example of counting occurrences per city:
rddUsers.map(user => (user._1, user._2.size, user._2.groupBy(identity).map(city => (city._1,city._2.size))))
That worked because I could access the second element of my tuple via the ._2 method. But now?
My second element is an Iterable[(String, Long)], and I can't map it like I did before.
Is there a solution to retrieve all my information with just one RDD and a single MapReduce?
You could do this easily by first aggregating the byte totals and city occurrences per (user, city) pair, and then grouping by user:
val data = Array(("user1","city1",100),("user1","city1",100),
("user1","city1",100),("user1","city2",100),("user1","city2",100),
("user1","city3",100),("user1","city2",100),("user2","city1",100),
("user2","city2",100))
val rdd = sc.parallelize(data)
// sum occurrence counts and bytes per (user, city), then regroup by user
val res = rdd.map(x => ((x._1, x._2), (1, x._3)))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .map(x => (x._1._1, (x._1._2, x._2._1, x._2._2)))
  .groupByKey

// per user: the city -> occurrences map and the total bytes
val userCityUsageRdd = res.map(x => {
  val m = x._2.toList
  (x._1, m.map(y => y._1 -> y._2).toMap, m.map(_._3).sum)
})
Output:
res20: Array[(String, scala.collection.immutable.Map[String,Int], Int)] =
Array((user1,Map(city1 -> 3, city3 -> 1, city2 -> 3),700),
(user2,Map(city1 -> 1, city2 -> 1),200))
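If you want to avoid groupByKey entirely (it shuffles every raw record), here is a hedged alternative sketch that merges per-user state with a single reduceByKey; the names are mine, not from the answer above:
val res2 = rdd
  .map { case (user, city, bytes) => (user, (Map(city -> 1), bytes)) }
  .reduceByKey { case ((cities1, b1), (cities2, b2)) =>
    // merge the two city-count maps and add the byte totals
    val merged = cities2.foldLeft(cities1) { case (m, (c, n)) => m.updated(c, m.getOrElse(c, 0) + n) }
    (merged, b1 + b2)
  }
  .map { case (user, (cityCounts, totalBytes)) => (user, cityCounts, totalBytes) }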
Below is a data structure, a List of tuples of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id, so the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List here for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the list for each id in the map and sum up the integer occurrences, as sketched below. That is straightforward, but if you still need help, ask in the comments.
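A minimal sketch of the complete approach (untested, the val name is mine):
val summed = data3
  .groupBy(_._1)                                              // Map(id1 -> List((id1,a,1), ...), id2 -> List((id2,a,1)))
  .map { case (id, xs) => (id, xs.head._2, xs.map(_._3).sum) }
  .toList                                                     // List((id1,a,3), (id2,a,1)) -- map order may vary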
The following is the most readable, efficient, and scalable:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. By using reduceByKey the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute across 10 cores.
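If the original (String, String, Int) triple shape is needed afterwards, a final map restores it (a small sketch):
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
.map { case ((key1, key2), sum) => (key1, key2, sum) }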
A few more notes about the other answers:
Using head here is unnecessary: the grouping key already carries the first two fields, so .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) can be replaced with .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really horrible when .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
I have a Map like:
Map("product1" -> List(Product1ObjectTypes), "product2" -> List(Product2ObjectTypes))
where ProductObjectType has a field usage. Based on the other field (counter) I have to update all ProductXObjectTypes.
The issue is that this update depends on the previous ProductObjectType, and I can't find a way to get the previous item when iterating over mapValues of this map. So basically, to update the current usage I need: CurrentProduct1ObjectType.counter - PreviousProduct1ObjectType.counter.
Is there any way to do this?
I started it like:
val reportsWithCalculatedUsage =
reportsRefined.flatten.flatten.toList.groupBy(_._2.product).mapValues(f)
but I don't know how to access the previous list item inside mapValues.
I'm not sure if I understand completely, but if you want to update the values inside the lists based on their predecessors, this can generally be done with a fold:
case class Thing(product: String, usage: Int, counter: Int)
val m = Map(
"product1" -> List(Thing("Fnord", 10, 3), Thing("Meep", 0, 5))
//... more mappings
)
//> Map(product1 -> List(Thing(Fnord,10,3), Thing(Meep,0,5)))
m mapValues { list => list.foldLeft(List[Thing]()){
case (Nil, head) =>
List(head)
case (tail, head) =>
val previous = tail.head
val current = head copy (usage = head.usage + head.counter - previous.counter)
current :: tail
} reverse }
//> Map(product1 -> List(Thing(Fnord,10,3), Thing(Meep,2,5)))
Note that a regular Map is an unordered collection; you need to use something like TreeMap to have a predictable iteration order.
Anyway, from what I understand, you want to get pairs of consecutive entries in a map. Try something like this:
scala> val map = Map(1 -> 2, 2 -> 3, 3 -> 4)
scala> (map, map.tail).zipped.foreach((t1, t2) => println(t1 + " " + t2))
(1,2) (2,3)
(2,3) (3,4)
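With an ordered map, consecutive pairs can also be taken with sliding (a small sketch, same map as above):
scala> map.toList.sliding(2).foreach { case Seq(a, b) => println(a + " " + b) }
(1,2) (2,3)
(2,3) (3,4)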