How to get the globally declared MapState value in RichCoMapFunction [ Apache Flink ]? - scala

I'm implementing a Flink DataStream job for real-time data calculation. The data comes from two different sources, and I need to apply a transformation based on a key. When I use RichCoMapFunction, the MapState does not seem to be visible globally. My program is as follows:
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoMapFunction

class Transformer extends RichCoMapFunction[(String, Map[String, String]), (String, Map[String, String]), Map[String, String]] {
  private var sourceMap1: MapState[String, Map[String, String]] = _
  private var sourceMap2: MapState[String, Map[String, String]] = _

  override def map1(in1: (String, Map[String, String])): Map[String, String] = {
    sourceMap1.put(in1._2("key"), in1._2)
    println(sourceMap1.keys()) // Works, shows the updated values
    println(sourceMap2.keys()) // Always returns an empty value
    in1._2
  }

  override def map2(in2: (String, Map[String, String])): Map[String, String] = {
    sourceMap2.put(in2._2("key"), in2._2)
    println(sourceMap1.keys()) // Always returns an empty value
    println(sourceMap2.keys()) // Works, shows the updated values
    in2._2
  }

  // Registers both map states with the runtime context when the function is opened.
  override def open(parameters: Configuration): Unit = {
    val desc1: MapStateDescriptor[String, Map[String, String]] = new MapStateDescriptor[String, Map[String, String]]("sourceMap1", classOf[String], classOf[Map[String, String]])
    sourceMap1 = getRuntimeContext.getMapState(desc1)
    val desc2: MapStateDescriptor[String, Map[String, String]] = new MapStateDescriptor[String, Map[String, String]]("sourceMap2", classOf[String], classOf[Map[String, String]])
    sourceMap2 = getRuntimeContext.getMapState(desc2)
  }
}
I need to access sourceMap2 in the map1 function, since it is declared globally. But when I print the keys of sourceMap2 in map1, the result is always empty, whereas printing sourceMap1 in map1 shows all the added keys.

When using keyed state, Flink will store a separate state value for each key value. This means that if you have a stateful mapper m with state s and you process records (x1, y1) and (x2, y2) where x is the key, Flink will store s(x1) = (x1, v1) and s(x2) = (x2, v2) in its state backend.
When processing (x2, y2), then you only have access to s(x2) and it is not possible to access s(x1).
I assume that this is the reason why you see a seemingly empty MapState. The incoming records for map1 and map2 have different keys, so in map1 you access sourceMap2 under a key (not the map key but the keyBy key) for which no key-value pairs have been stored. The same applies to map2, where you access sourceMap1 under a key for which no key-value pairs have been stored yet.
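To make the scoping concrete, here is a minimal plain-Scala model of keyed state. It does not use the Flink API at all: KeyedStateSketch, PerKeyState, backend and stateFor are hypothetical names, and it assumes the streams are keyed by the first tuple element, which the question does not state.

```scala
import scala.collection.mutable

// Plain-Scala model of keyed state; this is NOT the Flink API.
// Assumption: both streams are keyed by the first tuple element.
object KeyedStateSketch {
  // One pair of maps exists per keyBy key, like sourceMap1/sourceMap2.
  final class PerKeyState {
    val sourceMap1 = mutable.Map.empty[String, Map[String, String]]
    val sourceMap2 = mutable.Map.empty[String, Map[String, String]]
  }

  // Hypothetical stand-in for the state backend: keyBy key -> state of that key.
  private val backend = mutable.Map.empty[String, PerKeyState]

  private def stateFor(keyByKey: String): PerKeyState =
    backend.getOrElseUpdate(keyByKey, new PerKeyState)

  // Mirrors map1: writes sourceMap1, then returns the sourceMap2 keys in scope.
  def map1(in: (String, Map[String, String])): Set[String] = {
    val state = stateFor(in._1)
    state.sourceMap1.put(in._2("key"), in._2)
    state.sourceMap2.keySet.toSet
  }

  // Mirrors map2: writes sourceMap2, then returns the sourceMap1 keys in scope.
  def map2(in: (String, Map[String, String])): Set[String] = {
    val state = stateFor(in._1)
    state.sourceMap2.put(in._2("key"), in._2)
    state.sourceMap1.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    map2(("k2", Map("key" -> "b")))
    println(map1(("k1", Map("key" -> "a")))) // nothing was stored in sourceMap2 under k1
    println(map1(("k2", Map("key" -> "c")))) // sourceMap2 under k2 holds the entry "b"
  }
}
```

In this model, map1 only sees sourceMap2 entries when an earlier map2 record carried the same keyBy key, which matches the behaviour described in the question.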

Your Transformer class is being applied to two connected, keyed streams. sourceMap1 and sourceMap2 are keyed state, meaning that you have a separate, nested hash map for every key of the two connected streams. One pair of these maps is in scope each time map1 or map2 is called, i.e., the pair corresponding to the key of the item being mapped.
If instead you want to have global state, shared across all the keys, have a look at the broadcast state pattern.

Related

Scala map with function call results in references to the function instead of results

I have a list of keys for which I want to fetch data. The data is fetched via a function call for each key. I want to end up with a Map of key -> data. Here's what I've tried:
case class MyDataClass(val1: Int, val2: Boolean)
def getData(key: String): MyDataClass = {
// Dummy implementation
MyDataClass(1, true)
}
def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] = {
val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
dataMap
}
This results in a type mismatch error:
type mismatch;
found : scala.collection.immutable.Map[String,String => MyDataClass]
required: Map[String,MyDataClass]
val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
Why is it setting the values in the resulting Map to instances of the getData() function, rather than its result? How do I make it actually CALL the getData() function for each key and put the results as the values in the Map?
The code you wrote is the same as the following statements:
keys.map((_, getData(_)))
keys.map(x => (x, getData(_)))
keys.map(x => (x, y => getData(y)))
This should clarify why you obtain the error.
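As suggested in the comments, stay away from _ except in simple cases where it occurs only once.
The gist of the issue is that each _ in (_, getData(_)) is expanded separately, so the second tuple element becomes a function rather than the result of calling getData. Naming the parameter explicitly makes the call happen for each key. Note that -> does not create a Map; key -> getData(key) is just another way to write the tuple (key, getData(key)), and toMap builds the Map from those tuples.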
...
val dataMap: Map[String, MyDataClass] = keys.map(key => (key -> getData(key))).toMap
...
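For reference, a complete version of the fix might look like the sketch below. The wrapper object DataMapSketch is a hypothetical name added only to keep the example self-contained; MyDataClass, getData and getDataMapForKeys come from the question.

```scala
object DataMapSketch {
  case class MyDataClass(val1: Int, val2: Boolean)

  // Dummy implementation, as in the question.
  def getData(key: String): MyDataClass = MyDataClass(1, true)

  // The lambda parameter is named, so getData is called once per key.
  def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] =
    keys.map(key => key -> getData(key)).toMap

  def main(args: Array[String]): Unit = {
    println(getDataMapForKeys(Seq("a", "b")))
    // `->` and the tuple literal build the same Tuple2 value.
    println(("a" -> 1) == (("a", 1)))
  }
}
```

Writing keys.map(key => (key, getData(key))).toMap would behave the same way; the choice between -> and a tuple literal is purely stylistic.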

Extend / Replicate Scala collections syntax to create your own collection?

I want to build a map however I want to discard all keys with empty values as shown below:
import scala.annotation.tailrec

@tailrec
def safeFiltersMap(
    map: Map[String, String],
    accumulator: Map[String, String] = Map.empty): Map[String, String] =
  if (map.isEmpty) accumulator
  else {
    val (key, value) = map.head
    safeFiltersMap(
      map.tail,
      // Keep the entry only when its value is non-empty.
      if (value.nonEmpty) accumulator + (key -> value)
      else accumulator
    )
  }
Now this is fine however I need to use it like this:
val safeMap = safeFiltersMap(Map("a"->"b","c"->"d"))
whereas I want to use it like the way we instantiate a map:
val safeMap = safeFiltersMap("a"->"b","c"->"d")
What syntax can I follow to achieve this?
The -> syntax isn't a special syntax in Scala. It's actually just a fancy way of constructing a 2-tuple. So you can write your own functions that take 2-tuples as well. You don't need to define a new Map type. You just need a function that filters the existing one.
def safeFiltersMap(args: (String, String)*): Map[String, String] =
  Map(args: _*).filter { case (_, value) => value.nonEmpty }
Then call using
safeFiltersMap("a"->"b","c"->"d")

Efficient way to add a key and value to a Map of Sets in Scala

I have a Scala Map like:
val myMap: mutable.Map[String, mutable.Set[String]]= mutable.Map[String, mutable.Set[String]]()
I would like to add a key (a String) and a value to it in the most efficient way. The addition should check whether the key is already in the Map; if it is, the new value is added to the corresponding Set. If the key is not present, the key is added together with a new Set whose first element is the value.
Regards
Yasset
Given a (key, value) pair, if key is present in the map then value is added to its set. If key is not present, then key is added to the map with a new set containing value.
import scala.collection.mutable

def update(key: String, value: String, map: mutable.Map[String, mutable.Set[String]]): Unit =
  map.get(key) match {
    case Some(set) => set += value                    // key exists: extend its set
    case None      => map(key) = mutable.Set(value)   // key missing: start a new set
  }
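A more compact variant is possible with mutable.Map's getOrElseUpdate, which returns the existing set or inserts and returns a fresh one. The sketch below is only an illustration; MultiValueSketch, SetMap and addValue are hypothetical names.

```scala
import scala.collection.mutable

object MultiValueSketch {
  type SetMap = mutable.Map[String, mutable.Set[String]]

  // Looks the key up once; creates an empty set on first use, then adds the value.
  def addValue(map: SetMap, key: String, value: String): Unit = {
    map.getOrElseUpdate(key, mutable.Set.empty[String]) += value
    ()
  }

  def main(args: Array[String]): Unit = {
    val myMap: SetMap = mutable.Map.empty
    addValue(myMap, "a", "x")
    addValue(myMap, "a", "y")
    addValue(myMap, "b", "z")
    println(myMap)
  }
}
```

Because the default argument of getOrElseUpdate is passed by name, the empty set is only allocated when the key is actually missing.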
What you basically need is a MultiMap (note that scala.collection.mutable.MultiMap is deprecated since Scala 2.13).
import collection.mutable.{ HashMap, MultiMap, Set }
val mm = new HashMap[Int, Set[String]] with MultiMap[Int, String]
mm.addBinding(1, "a")
mm.addBinding(2, "b")
mm.addBinding(1, "c")
println(mm)
//prints Map(2 -> Set(b), 1 -> Set(c, a))
Guava comes with different types of MultiMap: array-backed, hash-based, linked-list-based, etc. You also don't have to mix in a trait to create multimaps. The API is quite elaborate.
Scala's MultiMap restricts you to Map[A, Set[B]]. You cannot do new HashMap[Int, List[String]] with MultiMap[Int, String].

Typesafe keys for a map

Given the following code:
val m: Map[String, Int] = .. // fetch from somewhere
val keys: List[String] = m.keys.toList
val keysSubset: List[String] = ... // choose random keys
We can define the following method:
def sumValues(m: Map[String, Int], ks: List[String]): Int =
ks.map(m).sum
And call this as:
sumValues(m, keysSubset)
However, the problem with sumValues is that if ks happens to contain a key that is not present in the map, the code will still compile but throw an exception at runtime. For example:
// assume m = Map("two" -> 2, "three" -> 3)
sumValues(m, "one" :: Nil) // throws NoSuchElementException: "one" is not a key of m
What I want instead is a definition for sumValues such that the ks argument should, at compile time, be guaranteed to only contain keys that are present on the map. As such, my guess is that the existing sumValues type signature needs to accept some form of implicit evidence that the ks argument is somehow derived from the list of keys of the map.
I'm not limited to a scala Map however, as any record-like structure would do. The map structure however won't have a hardcoded value, but something derived/passed on as an argument.
Note: I'm not really after summing the values, but rather after a type signature for sumValues such that calls to it only compile if the ks argument provably comes from the list of keys of the map (or record-like structure).
Another solution could be to map only the intersection (i.e. : between m keys and ks).
For example :
scala> def sumValues(m: Map[String, Int], ks: List[String]): Int = {
| m.keys.filter(ks.contains).map(m).sum
| }
sumValues: (m: Map[String,Int], ks: List[String])Int
scala> val map = Map("hello" -> 5)
map: scala.collection.immutable.Map[String,Int] = Map(hello -> 5)
scala> sumValues(map, List("hello", "world"))
res1: Int = 5
I think this solution is better than providing a default value because it is more generic (i.e., you can use it for more than sums). However, I suspect it is less efficient because of the intersection.
EDIT: As @jwvh pointed out in their message below, ks.intersect(m.keys.toSeq).map(m).sum is, in my opinion, more readable than m.keys.filter(ks.contains).map(m).sum.
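Another runtime-safe variant, not taken from the answers above, is to look each key up with m.get, which returns an Option, so no separate intersection step is needed. This is only a sketch and it does not give the compile-time guarantee the question asks for; SumPresentKeysSketch is a hypothetical name.

```scala
object SumPresentKeysSketch {
  // Sums the values of only those keys in ks that exist in m.
  // m.get returns None for a missing key, so such keys contribute nothing.
  def sumValues(m: Map[String, Int], ks: List[String]): Int =
    ks.flatMap(k => m.get(k)).sum

  def main(args: Array[String]): Unit = {
    val map = Map("hello" -> 5)
    println(sumValues(map, List("hello", "world")))
  }
}
```

Note that this version iterates ks directly, so a key that appears twice in ks is counted twice.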

idiomatic "get or else update" for immutable.Map?

What is the idiomatic way to do a getOrElseUpdate for immutable.Map instances? I use the snippet below, but it seems verbose and inefficient:
var map = Map[Key, Value]()
def foo(key: Key) = {
val value = map.getOrElse(key, new Value)
map += key -> value
value
}
I would probably implement a getOrElseUpdated method like this:
def getOrElseUpdated[K, V](m: Map[K, V], key: K, op: => V): (Map[K, V], V) =
  m.get(key) match {
    case Some(value) => (m, value)
    case None =>
      val newval = op
      (m.updated(key, newval), newval)
  }
which either returns the original map if m has a mapping for key or another map with the mapping key -> op added. The definition of this method is similar to getOrElseUpdate of mutable.Map.
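A short usage sketch of this method, with the caller reassigning its own var, might look as follows. The definition is repeated so the example is self-contained; GetOrElseUpdatedSketch and cache are hypothetical names.

```scala
object GetOrElseUpdatedSketch {
  // Same definition as in the answer, repeated to keep the example self-contained.
  def getOrElseUpdated[K, V](m: Map[K, V], key: K, op: => V): (Map[K, V], V) =
    m.get(key) match {
      case Some(value) => (m, value)
      case None =>
        val newval = op
        (m.updated(key, newval), newval)
    }

  def main(args: Array[String]): Unit = {
    var cache = Map.empty[String, Int]

    // First call: "a" is missing, so op is evaluated and a new map is returned.
    val (m1, v1) = getOrElseUpdated(cache, "a", 1)
    cache = m1

    // Second call: "a" is present, so op (99) is never evaluated.
    val (m2, v2) = getOrElseUpdated(cache, "a", 99)
    cache = m2

    println((v1, v2, cache)) // v1 == 1, v2 == 1, cache == Map("a" -> 1)
  }
}
```

Since op is a by-name parameter, the potentially expensive default is only computed when the key is absent.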
Let me summarise your problem:
You want to call a method on a immutable data structure
You want it to return some value and reassign a var
Because the data structure is immutable, you’ll need to
return a new immutable data structure, or
do the assignment inside the method, using a supplied closure
So, either your signature has to look like
def getOrElseUpdate(key: K): Tuple2[V, Map[K,V]]
//... use it like
val (v, m2) = getOrElseUpdate(k)
map = m2
or
def getOrElseUpdate(key: K, setter: (Map[K,V]) => Unit): V
//... use it like
val v = getOrElseUpdate(k, map = _)
If you can live with one of these solutions, you could add your own version with an implicit conversion, but judging by the signatures alone, I don't think either of these is in the standard library.
There's no such way: mutating (updating) a map while getting a value from it is a side effect, which contradicts immutability and the functional style of programming.
When you want to make a new immutable map with the default value, if another value for the specified key doesn't exist, you can do the following:
map + (key -> map.getOrElse(key, new Value))
Why not use withDefault or withDefaultValue if you have an immutable map?
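As a rough illustration of that suggestion, the sketch below uses withDefaultValue on an immutable Map of counters. WithDefaultSketch and counts are hypothetical names, and Int values stand in for the question's Value type.

```scala
object WithDefaultSketch {
  // Lookups via apply fall back to 0 for missing keys; the map itself is unchanged.
  val counts: Map[String, Int] = Map("a" -> 1).withDefaultValue(0)

  def main(args: Array[String]): Unit = {
    println(counts("a"))                // stored value
    println(counts("missing"))          // default value, nothing is inserted
    println(counts.contains("missing")) // the key is still absent
    println(counts.get("missing"))      // get ignores the default
  }
}
```

The default is never stored in the map, so this fits best when the value can be recreated on every lookup; if the question's new Value must be remembered per key, an explicit update is still needed.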