How to create Map using multiple lists in Spark - scala

I am trying to figure out how to access particular elements from RDD myRDD with example entries below:
(600,List((600,111,7,1), (615,111,3,5)))
(601,List((622,112,2,1), (615,111,3,5), (456,111,9,12)))
I want to extract some data from a Redis DB, using the third field of each sub-list entry as the ID. For example, for (600,List((600,111,7,1), (615,111,3,5))), the IDs are 7 and 3.
For (601,List((622,112,2,1), (615,111,3,5), (456,111,9,12))), the IDs are 2, 3 and 9.
The problem is that I don't know how to collect values using multiple IDs. In the code below I use line._2(3), but that is not correct, because it indexes into the list of sub-lists instead of reading a field inside each sub-list.
Should I use flatMap or similar?
val newRDD = myRDD.mapPartitions(iter => {
  val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), "localhost", 6379, 2000))
  iter.map({ line => (line._1,
    redisPool.withJedisClient { client =>
      val start_date: String = Dress.up(client).hget("id:"+line._2(3),"start_date")
      val end_date: String = Dress.up(client).hget("id:"+line._2(3),"end_date")
      val additionalData = List((start_date,end_date))
      Map(("base_data", line._2), ("additional_data", additionalData))
    })
  })
})
newRDD.collect().foreach(println)
If we assume that Redis DB contains some relevant data, then the result newRDD could be the following:
(600,Map("base_data" -> List((600,111,7,1), (615,111,3,5)), "additional_data" -> List((2014,2015),(2015,2016))))
(601,Map("base_data" -> List((622,112,2,1), (615,111,3,5), (456,111,9,12)), "additional_data" -> List((2010,2015),(2011,2016),(2014,2016))))

To get a list of the third elements of each tuple in line._2, use line._2.map(_._3) (assuming the type of line is (Int, List[(Int, Int, Int, Int)]), as it appears from your example, and that types like Any aren't involved). Overall, your code should look something like this:
iter.map({ case (first, second) => (first,
  redisPool.withJedisClient { client =>
    val additionalData = second.map { tuple =>
      val start_date: String = Dress.up(client).hget("id:"+tuple._3,"start_date")
      val end_date: String = Dress.up(client).hget("id:"+tuple._3,"end_date")
      (start_date, end_date)
    }
    Map(("base_data", second), ("additional_data", additionalData))
  })
})
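The core of this answer, mapping over the inner list and reading the third field of each tuple, can be tried without Spark or Redis. The following is only a minimal sketch: the object name and fakeStore are hypothetical stand-ins (a plain Map replaces the hget calls), and the stored dates are the ones from the question's expected output.

```scala
object ThirdFieldSketch {
  // Hypothetical stand-in for the Redis lookups: "id:<n>" -> (start_date, end_date)
  val fakeStore: Map[String, (String, String)] = Map(
    ("id:7", ("2014", "2015")),
    ("id:3", ("2015", "2016"))
  )

  // One entry shaped like the RDD elements in the question
  val line: (Int, List[(Int, Int, Int, Int)]) =
    (600, List((600, 111, 7, 1), (615, 111, 3, 5)))

  // Third field of every inner tuple
  val ids: List[Int] = line._2.map(_._3)

  // One (start_date, end_date) pair per inner tuple, in the same order;
  // missing IDs fall back to empty strings (an assumption of this sketch)
  val additionalData: List[(String, String)] =
    ids.map(id => fakeStore.getOrElse("id:" + id, ("", "")))

  val result: (Int, Map[String, Any]) =
    (line._1, Map("base_data" -> line._2, "additional_data" -> additionalData))
}
```

Because the lookup happens inside second.map, the resulting list of dates always has the same length and order as the inner list.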

Related

Scala create immutable nested map

I have two strings:
val keyMap = "androidApp,key1;iosApp,key2;xyz,key3"
val tentMap = "androidApp,tenant1; iosApp,tenant1; xyz,tenant2"
I want to create a nested immutable map like this:
tenant1 -> (androidApp -> key1, iosApp -> key2),
tenant2 -> (xyz -> key3)
So basically I want to group by tenant and create a map of the keyMap entries.
Here is what I tried, but it uses a mutable map, which I do not want. Is there a way to create this using an immutable map?
case class TenantSetting() {
val requesterKeyMapping = new mutable.HashMap[String, String]()
}
val requesterKeyMapping = keyMap.split(";")
.map { keyValueList => keyValueList.split(',')
.filter(_.size==2)
.map(keyValuePair => (keyValuePair[0],keyValuePair[1]))
.toMap
}.flatten.toMap
val config = new mutable.HashMap[String, TenantSetting]
tentMap.split(";")
.map { keyValueList => keyValueList.split(',')
.filter(_.size==2)
.map { keyValuePair =>
val requester = keyValuePair[0]
val tenant = keyValuePair[1]
if (!config.contains(tenant)) config.put(tenant, new TenantSetting)
config.get(tenant).get.requesterKeyMapping.put(requester, requesterKeyMapping.get(requester).get)
}
}
The logic to break the strings into a map can be the same for both as it's the same syntax.
What you had for the first string was not quite right: the filter was applied to each string produced by the inner split, not to the resulting array itself. That is also why you were using [] on keyValuePair, which was of type String and not Array[String] as I think you were expecting. You also need a trim to cope with the spaces in the second string, and you might want to trim the key and value as well to avoid other whitespace issues.
Additionally in this case the combination of map and filter can be more succinctly done with collect as shown here:
How to convert an Array to a Tuple?
The use of the pattern with 2 elements ensures you filter out anything with length other than 2 as you wanted.
The iterator makes the combination of map and collect more efficient by requiring only one pass over the collection returned from the first split.
With both strings turned into maps, it just takes a groupBy that groups the entries of the first map by the value the second map holds for the same key. Obviously this only works if every key of the first map is also present in the second map.
def toMap(str: String): Map[String, String] =
  str
    .split(";")
    .iterator
    .map(_.trim.split(','))
    .collect { case Array(key, value) => (key.trim, value.trim) }
    .toMap
val keyMap = toMap("androidApp,key1;iosApp,key2;xyz,key3")
val tentMap = toMap("androidApp,tenant1; iosApp,tenant1; xyz,tenant2")
val finalMap = keyMap.groupBy { case (k, _) => tentMap(k) }
Printing out finalMap gives:
Map(tenant2 -> Map(xyz -> key3), tenant1 -> Map(androidApp -> key1, iosApp -> key2))
Which is what you wanted.
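If a key of the first map may be missing from the second, tentMap(k) throws a NoSuchElementException. One possible variation, sketched here under the assumption that such entries should simply be dropped, looks up the tenant with get instead. The object name and the extra "orphan" entry are made up for illustration.

```scala
object NestedMapSketch {
  // Same parsing as the toMap helper above: "k1,v1;k2,v2" => Map(k1 -> v1, k2 -> v2)
  def toMap(str: String): Map[String, String] =
    str.split(";").iterator
      .map(_.trim.split(','))
      .collect { case Array(key, value) => (key.trim, value.trim) }
      .toMap

  // "orphan" has a key but no tenant (hypothetical extra entry)
  val keyMap  = toMap("androidApp,key1;iosApp,key2;xyz,key3;orphan,key4")
  val tentMap = toMap("androidApp,tenant1; iosApp,tenant1; xyz,tenant2")

  // Keep only apps that have a tenant, then group the (app -> key) pairs by tenant
  val finalMap: Map[String, Map[String, String]] =
    keyMap.toSeq
      .flatMap { case (app, key) => tentMap.get(app).map(tenant => (tenant, app -> key)) }
      .groupBy(_._1)
      .map { case (tenant, entries) => tenant -> entries.map(_._2).toMap }
}
```

Every collection involved is immutable, so the result can be shared freely.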

How to use map / flatMap on a scala Map?

I have two sequences, i.e. prices: Seq[Price] and overrides: Seq[Override]. I need to do some magic on them yet only for a subset based on a shared id.
So I grouped each of them into a Map via groupBy:
val pricesById = prices.groupBy(_.someId) // Map[Int, Seq[Price]]
val overridesById = overrides.groupBy(_.someId) // Map[Int, Seq[Override]]
I expected to be able to create my wanted sequence via flatMap:
val applyOverrides: (Int, Seq[Price]) => Seq[Price] = (someId, prices) => {
val applicableOverrides = overridesById.getOrElse(someId, Seq())
magicMethod(prices, applicableOverrides) // returns Seq[Price]
}
val myPrices: Seq[Price] = pricesById.flatMap(applyOverrides)
I expected myPrices to contain just one big Seq[Price].
Instead, I get a strange type mismatch inside the flatMap call, mentioning NonInferedB, that I am unable to resolve.
In Scala, a Map is a collection of tuples: each entry is a single (key, value) pair rather than two separate values.
The function passed to flatMap therefore takes one parameter, the tuple (key, value), not two parameters key and value.
Since you can access the first element of a tuple via _1, the second via _2 and so on, you can write the desired function like so:
val pricesWithMagicApplied = pricesById.flatMap(tuple =>
  applyOverrides(tuple._1, tuple._2)
)
Another approach is to use pattern matching:
val pricesWithMagicApplied: Seq[Price] = pricesById.flatMap {
case (someId, prices) => applyOverrides(someId, prices)
}.toSeq
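A third option is Function2's tupled method, which converts a two-argument function into one that accepts a single tuple. The sketch below is self-contained but rests on assumptions: the fields of Price and Override and the body of magicMethod are invented for illustration, since the question does not show them.

```scala
object FlatMapOnMapSketch {
  // Hypothetical models; the real Price/Override fields are not shown in the question
  case class Price(someId: Int, amount: Int)
  case class Override(someId: Int, amount: Int)

  // Hypothetical stand-in for magicMethod: the last override, if any, replaces the amount
  def magicMethod(prices: Seq[Price], overrides: Seq[Override]): Seq[Price] =
    overrides.lastOption.fold(prices)(o => prices.map(_.copy(amount = o.amount)))

  val prices    = Seq(Price(1, 10), Price(1, 12), Price(2, 20))
  val overrides = Seq(Override(1, 99))

  val pricesById    = prices.groupBy(_.someId)
  val overridesById = overrides.groupBy(_.someId)

  val applyOverrides: (Int, Seq[Price]) => Seq[Price] = (someId, ps) =>
    magicMethod(ps, overridesById.getOrElse(someId, Seq.empty))

  // .tupled turns (Int, Seq[Price]) => Seq[Price] into ((Int, Seq[Price])) => Seq[Price],
  // which is the shape flatMap expects for the entries of a Map
  val myPrices: Seq[Price] = pricesById.toSeq.flatMap(applyOverrides.tupled)
}
```

Note that the iteration order of a Map is not guaranteed, so sort the result if the order of myPrices matters.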

Access tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD of (String, String) tuples. Since they aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me output of type ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
The individual Strings can recur in different combinations, and I essentially want to run a word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this produces is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1.
For example, the line above produces mom->1, dad->1 even if the tuples coming out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to give mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misunderstanding how it works.
How can I convert the sequence of Tuple2 into a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help. Note that the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD.
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, as key or as value:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
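On the toMap point raised in the question's update: a Map holds one value per key, so converting a sequence with repeated keys keeps only the last value for each key, and the values have to be combined before (or instead of) calling toMap. The following sketch uses plain Scala collections rather than Spark, with made-up counts, to show the difference; groupBy followed by sum plays the role that reduceByKey(_ + _) plays on an RDD.

```scala
object MarginalCountSketch {
  // ((first, second), count) entries, shaped like countByValue().toSeq output (made-up data)
  val pairCount: Seq[((String, String), Long)] = Seq(
    (("mom", "dad"), 30L),
    (("dad", "granny"), 59L),
    (("mom", "granny"), 2L)
  )

  // Key each count by the first String of the inner tuple
  val keyed: Seq[(String, Long)] = pairCount.map { case ((x, _), n) => x -> n }

  // toMap keeps only the last value seen for each key: Map(mom -> 2, dad -> 59)
  val collapsed: Map[String, Long] = keyed.toMap

  // Summing per key, the local analogue of reduceByKey(_ + _): Map(mom -> 32, dad -> 59)
  val summed: Map[String, Long] =
    keyed.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
}
```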
To get the histograms for the (String, String) RDD, I used this code:
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD

Filtering futures using values in another future

I have two futures.
One future (idsFuture) holds the computation to get the list of ids. The type of the idsFuture is Future[List[Int]]
The other future (dataFuture) holds an array of A, where A is defined as case class A(id: Int, data: String). The type of dataFuture is Future[Array[A]].
I want to filter the contents of dataFuture using the ids present in idsFuture.
For example:
case class A(id: Int, data: String)
val dataFuture = Future(Array(A(1,"a"), A(2,"b"), A(3,"c")))
val idsFuture = Future(List(1,2))
I should get another future holding Array(A(1,"a"), A(2,"b")).
I currently do
idsFuture.flatMap { ids =>
  dataFuture.map(datas => datas.filter(data => ids.contains(data.id)))
}
Is there a better solution?
You could use for-comprehension here instead of flatMap + map like this:
for {
ds <- dataFuture
idsList <- idsFuture
ids = idsList.toSet
} yield ds filter { d => ids(d.id) }
Note that apply on a Set is faster than contains on a List.
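For completeness, here is a runnable sketch of the for-comprehension above, wrapped in a helper. The names FilterFuturesSketch, filterByIds and run are hypothetical, and the Await call is there only so the example produces a value; real code would normally keep working with the Future.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FilterFuturesSketch {
  case class A(id: Int, data: String)

  // Combines both futures and keeps only the elements whose id is in the id list
  def filterByIds(dataFuture: Future[Array[A]], idsFuture: Future[List[Int]])
                 (implicit ec: ExecutionContext): Future[Array[A]] =
    for {
      ds      <- dataFuture
      idsList <- idsFuture
      ids      = idsList.toSet // build the Set once, then use it as a predicate
    } yield ds.filter(d => ids(d.id))

  def run(): List[A] = {
    import ExecutionContext.Implicits.global
    // Both futures are created (and therefore started) before they are combined
    val dataFuture = Future(Array(A(1, "a"), A(2, "b"), A(3, "c")))
    val idsFuture  = Future(List(1, 2))
    // Blocking is for demonstration only
    Await.result(filterByIds(dataFuture, idsFuture), 5.seconds).toList
  }
}
```

Because both futures already exist when the for-comprehension runs, they execute concurrently; the comprehension only sequences the handling of their results.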

Scala: batch process Map of index-based form fields

Part of a web app I'm working on handles forms that need to be bound to a collection of model (case class) instances. See this question
So, if I were to add several users at one time, form fields would be named email[0], email[1], password[0], password[1], etc.
Posting the form results in a Map[String, Seq[String]]
Now, what I would like to do is to process the Map in batches, by index, so that for each iteration I can bind a User instance, creating a List[User] as the final result of the bindings.
The hacked approach I'm thinking of is to regex match against "[\d]" in the Map keys and then find the highest index via filter or count; with that, then (0..n).toList map{ ?? } through the number of form field rows, calling the binding/validation method (which also takes a Map[String, Seq[String]]) accordingly.
What is a concise way to achieve this?
Assuming that:
All map keys are in form "field[index]"
There is only one value in Seq for each key.
If there is an entry for "email[x]", then there is an entry for "password[x]", and vice versa.
I would do something like this:
val request = Map(
"email[0]" -> Seq("alice#example.com"),
"email[1]" -> Seq("bob#example.com"),
"password[0]" -> Seq("%vT*n7#4"),
"password[1]" -> Seq("Bfts7B&^")
)
case class User(email: String, password: String)
val Field = """(.+)\[(\d+)\]""".r
val userList = request.groupBy { case (Field(_, idx), _) => idx.toInt }
  .mapValues { userMap =>
    def extractField(name: String) =
      userMap.collect { case (Field(`name`, _), values) => values.head }.head
    User(extractField("email"), extractField("password"))
  }
  .toList.sortBy(_._1).map(_._2)
// Exiting paste mode, now interpreting.
request: scala.collection.immutable.Map[String,Seq[String]] = Map(email[0] -> List(alice#example.com),
email[1] -> List(bob#example.com), password[0] -> List(%vT*n7#4), password[1] -> List(Bfts7B&^))
defined class User
Field: scala.util.matching.Regex = (.+)\[(\d+)\]
userList: List[User] = List(User(alice#example.com,%vT*n7#4), User(bob#example.com,Bfts7B&^))
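If the three assumptions listed above cannot be guaranteed, the groupBy version throws a MatchError on keys that do not match the pattern, and head fails on missing fields. A more defensive variant might look like the sketch below; bindUsers and FormBindingSketch are hypothetical names, and silently skipping incomplete rows is an assumption that real validation code may not want.

```scala
object FormBindingSketch {
  case class User(email: String, password: String)

  private val Field = """(.+)\[(\d+)\]""".r

  def bindUsers(request: Map[String, Seq[String]]): List[User] = {
    // (index, fieldName, firstValue) for every key of the form field[index];
    // keys that do not match, or that carry no value, are ignored
    val cells = request.toList.collect {
      case (Field(name, idx), values) if values.nonEmpty => (idx.toInt, name, values.head)
    }
    cells
      .groupBy(_._1) // one group per form row
      .toList
      .sortBy(_._1)  // keep rows in index order
      .flatMap { case (_, row) =>
        val fields = row.map { case (_, name, value) => name -> value }.toMap
        // A row yields a User only when both fields are present
        for {
          email    <- fields.get("email")
          password <- fields.get("password")
        } yield User(email, password)
      }
  }
}
```

Calling FormBindingSketch.bindUsers(request) with the request map above should give the same two users, in index order.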