Flink: sum after window won't show result - scala

This is my original code, problem is that it won't print data
withTimestampsAndWatermarks
.map(line => (line._1,line._3))
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5L)))
// .sum(1)
// .print()
Then a test with reduce so I can see the process
.reduce(new ReduceFunction[(String, Int)] {
override def reduce(value1: (String, Int), value2: (String, Int)): (String, Int) = {
println("----------reduce invoked-------" + s"${value1._2} + ${value2._2} ==> ${value1._2+value2._2}" )
(value1._1,value1._2 + value2._2)
}
})
.print()
My input
a,1000,1
a,2000,2
a,3000,3
a,5000,4
a,6000,5
a,9999,6
Output
----------reduce invoked-------1 + 2 ==> 3
----------reduce invoked-------3 + 3 ==> 6
----------reduce invoked-------4 + 5 ==> 9
----------reduce invoked-------9 + 6 ==> 15
Seems window's setting is correct, but it won't print result with sum()

Related

How to get Map with matching values

I have a file with values like this :
user id | item id | rating | timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
......
I want to find all values where item id = "560"
and make Map from rating values(1-5) like this {1->6,2-5,3-10,4-6,5-14}
object Parse {
def main(args: Array[String]): Unit = {
//вытаскиваем данные с u.data
var a: List[(String, String, String, String)] = List()
for (line <- io.Source.fromFile("F:\\big data\\u.data").getLines) {
val newLine = line.replace("\t", ",")
if (newLine.split(",").length < 4) {
break
} else {
val asd = newLine.split(",")
val userId = asd(0)
val itemId = asd(1)
val rating = asd(2)
val timestamp = asd(3)
a = a :+ ((userId, itemId, rating, timestamp))
}
a = a.filter(_._2.equals("590")) <- filter list of tuples correctly
val empty: List[String] = a.map(_._2) <- have tyed to get list of all rating, but it does not work
}
}
How can I create a map of rating?
here as I can see we can generate a map of matching values
Scala groupBy for a list
If what you want is a Map of rating->count for a given "item id", this should do it.
util.Using(io.Source.fromFile("../junk.txt")) { file =>
val rec = raw"\d+\s+590\s+(\d+)\s+\d+".r //only this item id
file.getLines()
.collect { case rec(rating) => rating }
.foldLeft(Map.empty[String, Int]) {
case (m, r) => m + (r -> (m.getOrElse(r, 0) + 1))
}
}.getOrElse(Map.empty[String,Int])
Note that fromFile() is automatically closed at the end of the Using block.
I think using for-loop is not the better decision. Please, look at your problem from the data-stream problem not array. scala.io.Source.fromFile("F:\\big data\\u.data").getLines() returns to you Iterator[String] of your lines. It is more suitable to use it as data stream not as array of data. And in your conditions is better just use combination of map, filter, collect and groupBy functions to get grouped rows by rank.
Full correct code:
val sourceFile = scala.io.Source.fromFile("F:\\big data\\u.data")
try {
val linesOfArrays = sourceFile.getLines().map{
line => line.split(",")
}
require(!linesOfArrays.exists(_.length < 4)) // your data schema validation
val ratingCountsMap: Map[String, Int] = linesOfArrays.collect{
case rowValuesArray if rowValuesArray(1) == "590" =>
// in this line you will get rating and 1 for his counting
rowValuesArray(2) -> 1
}.toSeq
.groupBy{ case (rating, _) => rating }
.mapValues{ groupWithSameRating => groupWithSameRating.length }
} finally sourceFile.close()
And don't forget to release resource (in your case this is file) using close method in finally section or use scala-arm library (more about resources here)

Strange results when using Scala collections

I have some tests with results that I can't quite explain.
The first test does a filter, map and reduce on a list containing 4 elements:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{ i =>
counter.incrementAndGet()
true
}
val mapped = filtered.map{ i =>
counter.incrementAndGet()
i*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(11 == counter.get)
}
The counter is incremented 11 times as I expected: once for each element during filtering, once for each element during mapping and three times to add up the 4 elements.
Using wildcards the result changes:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
I can't work out how to use wildcards in the reduce (code doesnt compile), but now, the counter is only incremented 5 times!!
So, question #1: Why do wildcards change the number of times the counter is called and how does that even work?
Then my second, related question. My understanding of views was that they would lazily execute the functions passed to the monadic methods, but the following code doesn't show that.
{
val counter = new AtomicInteger(0)
val l = Seq(1, 2, 3, 4).view
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
println("after filter: " + counter.get)
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
println("after map: " + counter.get)
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("after reduce: " + counter.get)
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
The output is:
after filter: 1
after map: 2
after reduce: 5
counted 5 and result is 20
Question #2: How come the functions are being executed immediately?
I'm using Scala 2.10
You're probably thinking that
filter {
println
_ > 0
}
means
filter{ i =>
println
i > 0
}
but Scala has other ideas. The reason is that
{ println; _ > 0 }
is a statement that first prints something, and then returns the > 0 function. So it interprets what you're doing as a funny way to specify the function, equivalent to:
val p = { println; (i: Int) => i > 0 }
filter(p)
which in turn is equivalent to
println
val temp = (i: Int) => i > 0 // Temporary name, forget we did this!
val p = temp
filter(p)
which as you can imagine doesn't quite work out the way you want--you only print (or in your case do the increment) once at the beginning. Both your problems stem from this.
Make sure if you're using underscores to mean "fill in the parameter" that you only have a single expression! If you're using multiple statements, it's best to stick to explicitly named parameters.

How to fix my Fibonacci stream in Scala

I defined a function to return Fibonacci stream as follows:
def fib:Stream[Int] = {
Stream.cons(1,
Stream.cons(2,
(fib zip fib.tail) map {case (x, y) => println("%s + %s".format(x, y)); x + y}))
}
The functions work ok but it looks inefficient (see the output below)
scala> fib take 5 foreach println
1
2
1 + 2
3
1 + 2
2 + 3
5
1 + 2
1 + 2
2 + 3
3 + 5
8
So, it looks like the function calculates the n-th fibonacci number from the very beginning. Is it correct? How would you fix it?
That is because you have used a def. Try using a val:
lazy val fib: Stream[Int]
= 1 #:: 2 #:: (fib zip fib.tail map { case (x, y) => x + y })
Basically a def is a method; in your example you are calling the method each time and each time the method call constructs a new stream. The distinction between def and val has been covered on SO before, so I won't go into detail here. If you are from a Java background, it should be pretty clear.
This is another nice thing about scala; in Java, methods may be recursive but types and values may not be. In scala both values and types can be recursive.
You can do it the other way:
lazy val fibs = {
def f(a: Int, b: Int): Stream[Int] = a #:: f(b, a + b)
f(0, 1)
}

Integer partitioning in Scala

Given n ( say 3 people ) and s ( say 100$ ), we'd like to partition s among n people.
So we need all possible n-tuples that sum to s
My Scala code below:
def weights(n:Int,s:Int):List[List[Int]] = {
List.concat( (0 to s).toList.map(List.fill(n)(_)).flatten, (0 to s).toList).
combinations(n).filter(_.sum==s).map(_.permutations.toList).toList.flatten
}
println(weights(3,100))
This works for small values of n. ( n=1, 2, 3 or 4).
Beyond n=4, it takes a very long time, practically unusable.
I'm looking for ways to rework my code using lazy evaluation/ Stream.
My requirements : Must work for n upto 10.
Warning : The problem gets really big really fast. My results from Matlab -
---For s =100, n = 1 thru 5 results are ---
n=1 :1 combinations
n=2 :101 combinations
n=3 :5151 combinations
n=4 :176851 combinations
n=5: 4598126 combinations
---
You need dynamic programming, or memoization. Same concept, anyway.
Let's say you have to divide s among n. Recursively, that's defined like this:
def permutations(s: Int, n: Int): List[List[Int]] = n match {
case 0 => Nil
case 1 => List(List(s))
case _ => (0 to s).toList flatMap (x => permutations(s - x, n - 1) map (x :: _))
}
Now, this will STILL be slow as hell, but there's a catch here... you don't need to recompute permutations(s, n) for numbers you have already computed. So you can do this instead:
val memoP = collection.mutable.Map.empty[(Int, Int), List[List[Int]]]
def permutations(s: Int, n: Int): List[List[Int]] = {
def permutationsWithHead(x: Int) = permutations(s - x, n - 1) map (x :: _)
n match {
case 0 => Nil
case 1 => List(List(s))
case _ =>
memoP getOrElseUpdate ((s, n),
(0 to s).toList flatMap permutationsWithHead)
}
}
And this can be even further improved, because it will compute every permutation. You only need to compute every combination, and then permute that without recomputing.
To compute every combination, we can change the code like this:
val memoC = collection.mutable.Map.empty[(Int, Int, Int), List[List[Int]]]
def combinations(s: Int, n: Int, min: Int = 0): List[List[Int]] = {
def combinationsWithHead(x: Int) = combinations(s - x, n - 1, x) map (x :: _)
n match {
case 0 => Nil
case 1 => List(List(s))
case _ =>
memoC getOrElseUpdate ((s, n, min),
(min to s / 2).toList flatMap combinationsWithHead)
}
}
Running combinations(100, 10) is still slow, given the sheer numbers of combinations alone. The permutations for each combination can be obtained simply calling .permutation on the combination.
Here's a quick and dirty Stream solution:
def weights(n: Int, s: Int) = (1 until s).foldLeft(Stream(Nil: List[Int])) {
(a, _) => a.flatMap(c => Stream.range(0, n - c.sum + 1).map(_ :: c))
}.map(c => (n - c.sum) :: c)
It works for n = 6 in about 15 seconds on my machine:
scala> var x = 0
scala> weights(100, 6).foreach(_ => x += 1)
scala> x
res81: Int = 96560646
As a side note: by the time you get to n = 10, there are 4,263,421,511,271 of these things. That's going to take days just to stream through.
My solution of this problem, it can computer n till 6:
object Partition {
implicit def i2p(n: Int): Partition = new Partition(n)
def main(args : Array[String]) : Unit = {
for(n <- 1 to 6) println(100.partitions(n).size)
}
}
class Partition(n: Int){
def partitions(m: Int):Iterator[List[Int]] = new Iterator[List[Int]] {
val nums = Array.ofDim[Int](m)
nums(0) = n
var hasNext = m > 0 && n > 0
override def next: List[Int] = {
if(hasNext){
val result = nums.toList
var idx = 0
while(idx < m-1 && nums(idx) == 0) idx = idx + 1
if(idx == m-1) hasNext = false
else {
nums(idx+1) = nums(idx+1) + 1
nums(0) = nums(idx) - 1
if(idx != 0) nums(idx) = 0
}
result
}
else Iterator.empty.next
}
}
}
1
101
5151
176851
4598126
96560646
However , we can just show the number of the possible n-tuples:
val pt: (Int,Int) => BigInt = {
val buf = collection.mutable.Map[(Int,Int),BigInt]()
(s,n) => buf.getOrElseUpdate((s,n),
if(n == 0 && s > 0) BigInt(0)
else if(s == 0) BigInt(1)
else (0 to s).map{k => pt(s-k,n-1)}.sum
)
}
for(n <- 1 to 20) printf("%2d :%s%n",n,pt(100,n).toString)
1 :1
2 :101
3 :5151
4 :176851
5 :4598126
6 :96560646
7 :1705904746
8 :26075972546
9 :352025629371
10 :4263421511271
11 :46897636623981
12 :473239787751081
13 :4416904685676756
14 :38393094575497956
15 :312629484400483356
16 :2396826047070372396
17 :17376988841260199871
18 :119594570260437846171
19 :784008849485092547121
20 :4910371215196105953021

Difference between fold and foldLeft or foldRight?

NOTE: I am on Scala 2.8—can that be a problem?
Why can't I use the fold function the same way as foldLeft or foldRight?
In the Set scaladoc it says that:
The result of folding may only be a supertype of this parallel collection's type parameter T.
But I see no type parameter T in the function signature:
def fold [A1 >: A] (z: A1)(op: (A1, A1) ⇒ A1): A1
What is the difference between the foldLeft-Right and fold, and how do I use the latter?
EDIT: For example how would I write a fold to add all elements in a list? With foldLeft it would be:
val foo = List(1, 2, 3)
foo.foldLeft(0)(_ + _)
// now try fold:
foo.fold(0)(_ + _)
>:7: error: value fold is not a member of List[Int]
foo.fold(0)(_ + _)
^
Short answer:
foldRight associates to the right. I.e. elements will be accumulated in right-to-left order:
List(a,b,c).foldRight(z)(f) = f(a, f(b, f(c, z)))
foldLeft associates to the left. I.e. an accumulator will be initialized and elements will be added to the accumulator in left-to-right order:
List(a,b,c).foldLeft(z)(f) = f(f(f(z, a), b), c)
fold is associative in that the order in which the elements are added together is not defined. I.e. the arguments to fold form a monoid.
fold, contrary to foldRight and foldLeft, does not offer any guarantee about the order in which the elements of the collection will be processed. You'll probably want to use fold, with its more constrained signature, with parallel collections, where the lack of guaranteed processing order helps the parallel collection implements folding in a parallel way. The reason for changing the signature is similar: with the additional constraints, it's easier to make a parallel fold.
You're right about the old version of Scala being a problem. If you look at the scaladoc page for Scala 2.8.1, you'll see no fold defined there (which is consistent with your error message). Apparently, fold was introduced in Scala 2.9.
For your particular example you would code it the same way you would with foldLeft.
val ns = List(1, 2, 3, 4)
val s0 = ns.foldLeft (0) (_+_) //10
val s1 = ns.fold (0) (_+_) //10
assert(s0 == s1)
Agree with other answers. thought of giving a simple illustrative example:
object MyClass {
def main(args: Array[String]) {
val numbers = List(5, 4, 8, 6, 2)
val a = numbers.fold(0) { (z, i) =>
{
println("fold val1 " + z +" val2 " + i)
z + i
}
}
println(a)
val b = numbers.foldLeft(0) { (z, i) =>
println("foldleft val1 " + z +" val2 " + i)
z + i
}
println(b)
val c = numbers.foldRight(0) { (z, i) =>
println("fold right val1 " + z +" val2 " + i)
z + i
}
println(c)
}
}
Result is self explanatory :
fold val1 0 val2 5
fold val1 5 val2 4
fold val1 9 val2 8
fold val1 17 val2 6
fold val1 23 val2 2
25
foldleft val1 0 val2 5
foldleft val1 5 val2 4
foldleft val1 9 val2 8
foldleft val1 17 val2 6
foldleft val1 23 val2 2
25
fold right val1 2 val2 0
fold right val1 6 val2 2
fold right val1 8 val2 8
fold right val1 4 val2 16
fold right val1 5 val2 20
25
There is two way to solve problems, iterative and recursive. Let's understand by a simple example.let's write a function to sum till the given number.
For example if I give input as 5, I should get 15 as output, as mentioned below.
Input: 5
Output: (1+2+3+4+5) = 15
Iterative Solution.
iterate through 1 to 5 and sum each element.
def sumNumber(num: Int): Long = {
var sum=0
for(i <- 1 to num){
sum+=i
}
sum
}
Recursive Solution
break down the bigger problem into smaller problems and solve them.
def sumNumberRec(num:Int, sum:Int=0): Long = {
if(num == 0){
sum
}else{
val newNum = num - 1
val newSum = sum + num
sumNumberRec(newNum, newSum)
}
}
FoldLeft: is a iterative solution
FoldRight: is a recursive solution
I am not sure if they have memoization to improve the complexity.
And so, if you run the foldRight and FoldLeft on the small list, both will give you a result with similar performance.
However, if you will try to run a FoldRight on Long List it might throw a StackOverFlow error (depends on your memory)
Check the following screenshot, where foldLeft ran without error, however foldRight on same list gave OutofMemmory Error.
fold() does parallel processing so does not guarantee the processing order.
where as foldLeft and foldRight process the items in sequentially for left to right (in case of foldLeft) or right to left (in case of foldRight)
Examples of sum the list -
val numList = List(1, 2, 3, 4, 5)
val r1 = numList.par.fold(0)((acc, value) => {
println("adding accumulator=" + acc + ", value=" + value + " => " + (acc + value))
acc + value
})
println("fold(): " + r1)
println("#######################")
/*
* You can see from the output that,
* fold process the elements of parallel collection in parallel
* So it is parallel not linear operation.
*
* adding accumulator=0, value=4 => 4
* adding accumulator=0, value=3 => 3
* adding accumulator=0, value=1 => 1
* adding accumulator=0, value=5 => 5
* adding accumulator=4, value=5 => 9
* adding accumulator=0, value=2 => 2
* adding accumulator=3, value=9 => 12
* adding accumulator=1, value=2 => 3
* adding accumulator=3, value=12 => 15
* fold(): 15
*/
val r2 = numList.par.foldLeft(0)((acc, value) => {
println("adding accumulator=" + acc + ", value=" + value + " => " + (acc + value))
acc + value
})
println("foldLeft(): " + r2)
println("#######################")
/*
* You can see that foldLeft
* picks elements from left to right.
* It means foldLeft does sequence operation
*
* adding accumulator=0, value=1 => 1
* adding accumulator=1, value=2 => 3
* adding accumulator=3, value=3 => 6
* adding accumulator=6, value=4 => 10
* adding accumulator=10, value=5 => 15
* foldLeft(): 15
* #######################
*/
// --> Note in foldRight second arguments is accumulated one.
val r3 = numList.par.foldRight(0)((value, acc) => {
println("adding value=" + value + ", acc=" + acc + " => " + (value + acc))
acc + value
})
println("foldRight(): " + r3)
println("#######################")
/*
* You can see that foldRight
* picks elements from right to left.
* It means foldRight does sequence operation.
*
* adding value=5, acc=0 => 5
* adding value=4, acc=5 => 9
* adding value=3, acc=9 => 12
* adding value=2, acc=12 => 14
* adding value=1, acc=14 => 15
* foldRight(): 15
* #######################
*/