Using Flink to get Counts Within a Keyed Window - scala

I'm using Flink via the Scala interface to do some data processing. I have some user data that comes in tuples:
(user1, "titanic")
(user1, "titanic")
(user1, "batman")
(user2, "star wars")
(user2, "star wars")
(user2, "batman")
I want to key by the user, create a window and then count the number of times that a user has viewed a particular movie within that window, so that I end up with a Map from each movie to the number of view counts for each user. For example, for user1, the correct output is Map("titanic" -> 2, "batman" -> 1).
I know that the first part of my code should look something like this:
keyedStream.keyBy(0).window(EventTimeSessionWindows.withGap(Time.minutes(10)))
But I don't know how to do a further aggregation within the window so that I end up with a Map of view counts for each user/window. I've attempted to write my own AggregateFunction that collects these counts into a mutable Map but unfortunately a mutable Map is not serializable, so it fails.
How might I do this?

You should be able to solve the problem by using an AggregateFunction:
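import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
source
  .keyBy(0)
  .timeWindow(Time.seconds(10L))
  .aggregate(new AggregateFunction[(String, String), (String, Map[String, Int]), (String, Map[String, Int])] {
    // Accumulator is (user, immutable map of movie -> count); the user is filled in on the first add.
    override def createAccumulator(): (String, Map[String, Int]) = ("", Map())
    override def add(value: (String, String), accumulator: (String, Map[String, Int])): (String, Map[String, Int]) = {
      val counter = accumulator._2.getOrElse(value._2, 0)
      (value._1, accumulator._2 + (value._2 -> (counter + 1)))
    }
    override def getResult(accumulator: (String, Map[String, Int])): (String, Map[String, Int]) = accumulator
    override def merge(a: (String, Map[String, Int]), b: (String, Map[String, Int])): (String, Map[String, Int]) = {
      // An empty accumulator carries "" as its user, so take the user from whichever side has one.
      val user = if (a._1.nonEmpty) a._1 else b._1
      (user, (a._2.keySet ++ b._2.keySet).map(k => k -> (a._2.getOrElse(k, 0) + b._2.getOrElse(k, 0))).toMap)
    }
  })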
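The add and merge steps can be checked without a running Flink job by exercising the same logic on plain Scala collections. This is only a minimal sketch: the object ViewCounts and its helper names are made up for illustration and are not part of Flink's API, and groupBy merely stands in for keyBy plus a window.

```scala
object ViewCounts {
  // Accumulator: immutable map from movie title to view count.
  type Acc = Map[String, Int]

  // Mirrors AggregateFunction.add: bump the count for one movie.
  def add(movie: String, acc: Acc): Acc =
    acc.updated(movie, acc.getOrElse(movie, 0) + 1)

  // Mirrors AggregateFunction.merge: sum the counts of two partial accumulators.
  def merge(a: Acc, b: Acc): Acc =
    b.foldLeft(a) { case (acc, (movie, n)) =>
      acc.updated(movie, acc.getOrElse(movie, 0) + n)
    }

  // Stand-in for keyBy + window: group events by user, then fold each group.
  def countsPerUser(events: Seq[(String, String)]): Map[String, Acc] =
    events.groupBy(_._1).map { case (user, views) =>
      user -> views.foldLeft(Map.empty[String, Int])((acc, v) => add(v._2, acc))
    }
}
```

Because every step returns a new immutable Map instead of mutating one, the accumulator needs no mutable state, which is the same idea the AggregateFunction above relies on.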

Context bound for varargs

A few days ago I started to learn Cats, and I want to implement a method appendOptional for Map[String, _: Show].
I started with the following idea:
def appendOptional[T: Show](to: Map[String, String], values: (String, Option[T])*): Map[String, String] =
  values.foldLeft(to) {
    case (collector, (key, Some(value))) =>
      collector + (key -> implicitly[Show[T]].show(value))
    case (collector, _) => collector
  }
And to use it like:
def createProps(initial: Map[String, String], name: Option[String], age: Option[Int])
val initial = Map("one" -> "one", "two" -> "two")
val props = appendOptional(initial, "name" -> name, "age" -> age)
I understand that this approach is quite naive and straightforward, because implicitly[Show[T]].show(value) will actually look up Show[Any].
Also, I had an idea to accept HList with context bound, but I have not found any example of that.
One more variant is to create a lot of overloaded methods (as is done in a lot of libraries):
def appendOptional[T1: Show, T2: Show](to: Map[String, String], v1: (String, Option[T1]), v2: (String, Option[T2]))
Question: Is there a way to define context bound for varargs functions?
Strictly speaking, the first is the correct way to define the bound; varargs don't mean the type of arguments varies, only their number.
The way to achieve what you want with varying types is more involved and requires packing Show instances together with values. E.g.
case class HasShow[A](x: A)(implicit val ev: Show[A])
def appendOptional(to: Map[String, String], values: (String, Option[HasShow[_]])*): Map[String, String] =
  values.foldLeft(to) {
    // value.ev.show(value.x) can be extracted into a method on HasShow as well
    case (collector, (key, Some(value: HasShow[a]))) =>
      collector + (key -> value.ev.show(value.x))
    case (collector, _) => collector
  }
val props = appendOptional(initial, "name" -> name.map(HasShow(_)), "age" -> age.map(HasShow(_)))
You can sprinkle some more implicit conversions for HasShow to simplify the call-site, but this way you can see what's going on better.
For this specific case I think a better and simpler solution would be
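implicit class MapOp(val self: Map[String, String]) extends AnyVal {
  // Show[A] summons the instance; Show.show(...) would instead build a new instance from a function.
  def appendOptional[A: Show](key: String, value: Option[A]): Map[String, String] =
    value.fold(self)(x => self + (key -> Show[A].show(x)))
}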
val props = initial.appendOptional("name", name).appendOptional("age", age)
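For readers without Cats on the classpath, the extension-method approach can be tried with a tiny stand-in type class. In this sketch the Show trait and its two instances are local assumptions that only mimic the shape of cats.Show, and the object name AppendOptionalSketch is made up; with Cats you would import the real type class instead.

```scala
object AppendOptionalSketch {
  // Local stand-in for cats.Show (an assumption, not the real library type).
  trait Show[A] { def show(a: A): String }
  object Show {
    def apply[A](implicit ev: Show[A]): Show[A] = ev
    implicit val stringShow: Show[String] = new Show[String] { def show(a: String): String = a }
    implicit val intShow: Show[Int] = new Show[Int] { def show(a: Int): String = a.toString }
  }

  // Each call resolves its own Show[A], so the value type can differ per key.
  implicit class MapOps(self: Map[String, String]) {
    def appendOptional[A: Show](key: String, value: Option[A]): Map[String, String] =
      value.fold(self)(x => self + (key -> Show[A].show(x)))
  }

  // Mirrors the createProps usage from the question.
  def createProps(initial: Map[String, String], name: Option[String], age: Option[Int]): Map[String, String] =
    initial.appendOptional("name", name).appendOptional("age", age)
}
```

Chaining one call per key sidesteps the varargs problem entirely: the compiler picks a Show instance at each call site, so no single type parameter has to cover both String and Int.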

toMap error in mapping a Data Set

I'm having an error
error: Cannot prove that (Int, String, String, String, String, Double, String) <:< (T, U).
}.collect.toMap
when executing my application having the following code snippet.
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
val fields = line.split(",")
// format: (trainID,trainName,departure,arrival,cost,trainClass)
(fields(0).toInt, fields(1),fields(2),fields(3),fields(4).toDouble,fields(5))
}.collect.toMap
What could be the cause, and can anyone please suggest a solution?
If you want to call toMap on a Seq, you should have a Seq of Tuple2. The ScalaDoc of toMap states:
This method is unavailable unless the elements are members of Tuple2,
each ((T, U)) becoming a key-value pair in the map
So you should do:
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
val fields = line.split(",")
// format: (trainID,trainName,departure,arrival,cost,trainClass)
(fields(0).toInt, // first element of Tuple2 -> "key"
(fields(1),fields(2),fields(3),fields(4).toDouble,fields(5)) // 2nd element of Tuple2 -> "value"
)
}.collect.toMap
so that your map statement returns RDD[(Int, (String, String, String, Double, String))], i.e. an RDD of key-value pairs.
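The same reshaping can be checked on a local collection, without a Spark context. The sample lines below are invented for illustration and simply follow the six-field format from the question; TrainsSketch is a hypothetical name.

```scala
object TrainsSketch {
  // Invented sample rows: trainID,trainName,departure,arrival,cost,trainClass
  val lines: Seq[String] = Seq(
    "1,Express,Paris,Lyon,59.9,first",
    "2,Regional,Lyon,Nice,35.0,second"
  )

  // Nest everything except the ID into an inner tuple so each element is a Tuple2.
  val trains: Map[Int, (String, String, String, Double, String)] =
    lines.map { line =>
      val fields = line.split(",")
      fields(0).toInt -> ((fields(1), fields(2), fields(3), fields(4).toDouble, fields(5)))
    }.toMap
}
```

With a flat six-element tuple the final toMap call would not compile, because toMap needs evidence that the element type is a pair.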

applying partial function on a tuple field, maintaining the tuple structure

I have a PartialFunction[String,String] and a Map[String,String].
I want to apply the partial function to the map values and collect the entries for which it was applicable.
i.e. given:
val m = Map( "a"->"1", "b"->"2" )
val pf : PartialFunction[String,String] = {
case "1" => "11"
}
I'd like to somehow combine _._2 with pf and be able to do this:
val composedPf : PartialFunction[(String,String),(String,String)] = /*someMagicalOperator(_._2,pf)*/
val collected : Map[String,String] = m.collect( composedPf )
// collected should be Map( "a"->"11" )
so far the best I got was this:
val composedPf = new PartialFunction[(String,String),(String,String)]{
override def isDefinedAt(x: (String, String)): Boolean = pf.isDefinedAt(x._2)
override def apply(v1: (String, String)): (String,String) = v1._1 -> pf(v1._2)
}
is there a better way?
Here is the magical operator:
val composedPf: PartialFunction[(String, String), (String, String)] =
{case (k, v) if pf.isDefinedAt(v) => (k, pf(v))}
Another option, without creating a composed function, is this:
m.filter(e => pf.isDefinedAt(e._2)).mapValues(pf)
There is a function in Scalaz, that does exactly that: second
scala> m collect pf.second
res0: scala.collection.immutable.Map[String,String] = Map(a -> 11)
This works, because PartialFunction is an instance of Arrow (a generalized function) typeclass, and second is one of the common operations defined for arrows.
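If pulling in Scalaz is not an option, the guard-based pattern from the first answer can be wrapped in a small generic helper. This is a sketch only: the name onSecond is hypothetical and is not a standard-library or Scalaz function.

```scala
object SecondSketch {
  // Lifts a partial function on values to one on (key, value) pairs,
  // leaving the key untouched. Defined only where pf is defined.
  def onSecond[K, A, B](pf: PartialFunction[A, B]): PartialFunction[(K, A), (K, B)] = {
    case (k, v) if pf.isDefinedAt(v) => (k, pf(v))
  }

  val m: Map[String, String] = Map("a" -> "1", "b" -> "2")
  val pf: PartialFunction[String, String] = { case "1" => "11" }

  // Only the entry whose value pf accepts survives the collect.
  val collected: Map[String, String] = m.collect(onSecond[String, String, String](pf))
}
```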

Scala map and/or groupby functions

I am new to Scala and I am trying to figure out some scala syntax.
So I have a list of strings.
wordList: List[String] = List("this", "is", "a", "test")
I have a function that returns a list of pairs that contains consonants and vowels counts per word:
def countFunction(words: List[String]): List[(String, Int)]
So, for example:
countFunction(List("test")) => List(('Consonants', 3), ('Vowels', 1))
I now want to take a list of words and group them by count signatures:
def mapFunction(words: List[String]): Map[List[(String, Int)], List[String]]
//using wordList from above
mapFunction(wordList) => List(('Consonants', 3), ('Vowels', 1)) -> Seq("this", "test")
List(('Consonants', 1), ('Vowels', 1)) -> Seq("is")
List(('Consonants', 0), ('Vowels', 1)) -> Seq("a")
I'm thinking I need to use GroupBy to do this:
def mapFunction(words: List[String]): Map[List[(String, Int)], List[String]] = {
words.groupBy(F: (A) => K)
}
I've read the scala api for Map.GroupBy and see that F represents discriminator function and K is the type of keys you want returned. So I tried this:
words.groupBy(countFunction => List[(String, Int)]
However, scala doesn't like this syntax. I tried looking up some examples for groupBy and nothing seems to help me with my use case. Any ideas?
Based on your description, your count function should take a word instead of a list of words. I would have defined it like this:
def countFunction(word: String): List[(String, Int)]
If you do that you should be able to call words.groupBy(countFunction), which is the same as:
words.groupBy(word => countFunction(word))
If you cannot change the signature of countFunction, then you should be able to call group by like this:
words.groupBy(word => countFunction(List(word)))
You shouldn't put the return type of the function in the call. The compiler can figure this out itself. You should just call it like this:
words.groupBy(countFunction)
If that doesn't work, please post your countFunction implementation.
Update:
I tested it in the REPL and this works (note that my countFunction has a slightly different signature from yours):
scala> def isVowel(c: Char) = "aeiou".contains(c)
isVowel: (c: Char)Boolean
scala> def isConsonant(c: Char) = ! isVowel(c)
isConsonant: (c: Char)Boolean
scala> def countFunction(s: String) = (('Consonants, s count isConsonant), ('Vowels, s count isVowel))
countFunction: (s: String)((Symbol, Int), (Symbol, Int))
scala> List("this", "is", "a", "test").groupBy(countFunction)
res1: scala.collection.immutable.Map[((Symbol, Int), (Symbol, Int)),List[java.lang.String]] = Map((('Consonants,0),('Vowels,1)) -> List(a), (('Consonants,1),('Vowels,1)) -> List(is), (('Consonants,3),('Vowels,1)) -> List(this, test))
You can include the type of the function passed to groupBy, but like I said you don't need it. If you want to pass it in you do it like this:
words.groupBy(countFunction: String => ((Symbol, Int), (Symbol, Int)))
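For completeness, here is a sketch that keeps the List[(String, Int)] shape the question asked for, assuming countFunction is changed to take a single word as suggested above. The vowel test and the object name GroupBySketch are illustrative assumptions.

```scala
object GroupBySketch {
  private def isVowel(c: Char): Boolean = "aeiou".contains(c)

  // Count signature for a single word, in the List[(String, Int)] shape from the question.
  def countFunction(word: String): List[(String, Int)] =
    List("Consonants" -> word.count(c => !isVowel(c)), "Vowels" -> word.count(isVowel))

  // groupBy only needs the key function; its result type is inferred.
  def mapFunction(words: List[String]): Map[List[(String, Int)], List[String]] =
    words.groupBy(countFunction)
}
```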

How to get Map-like sugar in another constructor

What I need is a class X I can construct with a Map that takes Strings into either other Strings or Maps that take Strings into Strings, and then an arbitrary number of other instances of X. With my limited grasp of Scala, I know I can do this:
class Person (stringParms : Map[String, String],
mapParms : Map[String, Map[String, String]],
children : List[X]) {
}
but that doesn't look very Scala-ish ("Scalish"? "Scalerific"? "Scalogical"?). What I'd like to be able to do is the following:
Person bob = Person("name" -> "Bob", "pets" -> ("cat" -> "Mittens", "dog" -> "Spot"), "status" -> "asleep",
firstChild, secondChild)
I know I can get rid of the "new" by using the companion object, and I'm sure I can look up Scala varargs. What I'd like to know is:
How I can use -> (or some similarly plausible operator) to construct elements to be made into a Map in the construction?
How I can define a single map so either it can do an Option-like switch between two very disparate types or becomes a recursive tree, where each (named) node points to either a leaf in the form of a String or another node like itself?
The recursive version really appeals to me because, although it doesn't address a problem I actually have today, it maps neatly into a subset of JSON containing only objects and strings (no numbers or arrays).
Any help, as always, greatly appreciated.
-> is just syntactic sugar for making a pair (A, B), so you can use it too. The Map object's apply method takes a vararg of pairs:
def apply [A, B] (elems: (A, B)*) : Map[A, B]
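As a small illustration of that point, the sketch below shows that -> and tuple syntax build the same pair, and that any companion apply taking a vararg of pairs gets the Map-like call syntax. Props is a hypothetical class used only for this example.

```scala
object ArrowSugar {
  // "a" -> 1 and ("a", 1) build the same Tuple2.
  val viaArrow: (String, Int) = "a" -> 1
  val viaTuple: (String, Int) = ("a", 1)

  // A hypothetical container whose apply mirrors Map.apply's varargs-of-pairs shape.
  final class Props(val entries: Map[String, String])
  object Props {
    def apply(elems: (String, String)*): Props = new Props(elems.toMap)
  }

  // No "new" and no explicit Map at the call site.
  val props: Props = Props("name" -> "Bob", "status" -> "asleep")
}
```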
You should first check out The Architecture of Scala Collections if you're interested in mimicking the collections library.
Having said that, I don't think the signature you proposed for Person looks like Map's, because it expects variable arguments, yet the children don't fit the (String, A) pattern of the other arguments. If you write "child1" -> Alice and internally store Alice separately, you could define:
def apply(elems: (String, Any)*): Person
in the companion object. If Any is too loose, you could define PersonElem trait,
def apply(elems: (String, PersonElem)*): Person
and implicit conversion between String, Map[String, String], Person, etc to PersonElem.
This gets you almost there. There is still a Map that I can't easily get rid of.
The basic approach is to have somewhat artificial parameter types that inherit from a common type. This way the apply method takes just a single vararg.
Using implicit conversion methods, I get rid of the ugly constructors for the parameter types:
case class Child()
case class Person(stringParms: Map[String, String],
mapParms: Map[String, Map[String, String]],
children: List[Child]) { }
sealed abstract class PersonParameter
case class MapParameter(tupel: (String, Map[String, String])) extends PersonParameter
case class StringParameter(tupel: (String, String)) extends PersonParameter
case class ChildParameter(child: Child) extends PersonParameter
object Person {
def apply(params: PersonParameter*): Person = {
var stringParms = Map[String, String]()
var mapParms = Map[String, Map[String, String]]()
var children = List[Child]()
for (p ← params) {
p match {
case StringParameter(t) ⇒ stringParms += t
case MapParameter(t) ⇒ mapParms += t
case ChildParameter(c) ⇒ children = c :: children
}
}
new Person(stringParms, mapParms, children)
}
implicit def tupel2StringParameter(t: (String, String)) = StringParameter(t)
implicit def child2ChildParameter(c: Child) = ChildParameter(c)
implicit def map2MapParameter(t: (String, Map[String, String])) = MapParameter(t)
def main(args: Array[String]): Unit = {
val firstChild = Child()
val secondChild = Child()
val bob: Person = Person("name" -> "Bob", "pets" -> Map("cat" -> "Mittens", "dog" -> "Spot"),
"status" -> "asleep",
firstChild, secondChild)
println(bob)
}
}
Here's one way:
sealed abstract class PersonParam
object PersonParam {
implicit def toTP(tuple: (String, String)): PersonParam = new TupleParam(tuple)
implicit def toMap(map: (String, Map[String, String])): PersonParam = new MapParam(map)
implicit def toSP(string: String): PersonParam = new StringParam(string)
}
class TupleParam(val tuple: (String, String)) extends PersonParam
class MapParam(val map: (String, Map[String, String])) extends PersonParam
class StringParam(val string: String) extends PersonParam
class Person(params: PersonParam*) {
val stringParams = Map(params collect { case parm: TupleParam => parm.tuple }: _*)
val mapParams = Map(params collect { case parm: MapParam => parm.map }: _*)
val children = params.collect { case parm: StringParam => parm.string }.toList
}
Usage:
scala> val bob = new Person("name" -> "Bob",
| "pets" -> Map("cat" -> "Mittens", "dog" -> "Spot"),
| "status" -> "asleep",
| "little bob", "little ann")
bob: Person = Person@5e5fada2
scala> bob.stringParams
res11: scala.collection.immutable.Map[String,String] = Map((name,Bob), (status,asleep))
scala> bob.mapParams
res12: scala.collection.immutable.Map[String,Map[String,String]] = Map((pets,Map(cat -> Mittens, dog -> Spot)))
scala> bob.children
res13: List[String] = List(little bob, little ann)