Strange results when using Scala collections - scala

I have some tests with results that I can't quite explain.
The first test does a filter, map and reduce on a list containing 4 elements:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{ i =>
counter.incrementAndGet()
true
}
val mapped = filtered.map{ i =>
counter.incrementAndGet()
i*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(11 == counter.get)
}
The counter is incremented 11 times as I expected: once for each element during filtering, once for each element during mapping and three times to add up the 4 elements.
Using wildcards the result changes:
{
val counter = new AtomicInteger(0)
val l = List(1, 2, 3, 4)
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
I can't work out how to use wildcards in the reduce (the code doesn't compile), but now the counter is only incremented 5 times!
So, question #1: Why do wildcards change the number of times the counter is called and how does that even work?
Then my second, related question. My understanding of views was that they would lazily execute the functions passed to the monadic methods, but the following code doesn't show that.
{
val counter = new AtomicInteger(0)
val l = Seq(1, 2, 3, 4).view
val filtered = l.filter{
counter.incrementAndGet()
_ > 0
}
println("after filter: " + counter.get)
val mapped = filtered.map{
counter.incrementAndGet()
_*2
}
println("after map: " + counter.get)
val reduced = mapped.reduce{ (a, b) =>
counter.incrementAndGet()
a+b
}
println("after reduce: " + counter.get)
println("counted " + counter.get + " and result is " + reduced)
assert(20 == reduced)
assert(5 == counter.get)
}
The output is:
after filter: 1
after map: 2
after reduce: 5
counted 5 and result is 20
Question #2: How come the functions are being executed immediately?
I'm using Scala 2.10

You're probably thinking that
filter {
println
_ > 0
}
means
filter{ i =>
println
i > 0
}
but Scala has other ideas. The reason is that
{ println; _ > 0 }
is a block expression that first prints something, and then returns the function _ > 0 as its value. So Scala interprets what you're doing as a funny way to specify the function, equivalent to:
val p = { println; (i: Int) => i > 0 }
filter(p)
which in turn is equivalent to
println
val temp = (i: Int) => i > 0 // Temporary name, forget we did this!
val p = temp
filter(p)
which as you can imagine doesn't quite work out the way you want--you only print (or in your case do the increment) once at the beginning. Both your problems stem from this.
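To make the difference concrete, here is a minimal sketch that counts calls both ways. The object and method names (BlockVsLambda, counts) are made up for illustration; only AtomicInteger and List.filter come from the question.

```scala
import java.util.concurrent.atomic.AtomicInteger

object BlockVsLambda {
  // Returns (calls counted with a block argument, calls counted with a named parameter).
  def counts(): (Int, Int) = {
    val blockCounter = new AtomicInteger(0)
    val keptByBlock = List(1, 2, 3, 4).filter {
      blockCounter.incrementAndGet() // runs once, while the block itself is evaluated
      _ > 0                          // the block's value: the function handed to filter
    }

    val lambdaCounter = new AtomicInteger(0)
    val keptByLambda = List(1, 2, 3, 4).filter { i =>
      lambdaCounter.incrementAndGet() // part of the function body, runs once per element
      i > 0
    }

    (blockCounter.get, lambdaCounter.get)
  }

  def main(args: Array[String]): Unit =
    println(counts()) // (1,4)
}
```

The first counter ends at 1 because the increment belongs to the block, not to the function the block returns.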
Make sure if you're using underscores to mean "fill in the parameter" that you only have a single expression! If you're using multiple statements, it's best to stick to explicitly named parameters.
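With explicitly named parameters the view behaves lazily, as the question expected. The sketch below is an illustration under that assumption (LazyViewDemo and run are hypothetical names). It only checks that nothing runs before reduce; the exact total afterwards can depend on the collections implementation of your Scala version, so it is not printed here.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyViewDemo {
  // Returns (count before reduce, count after reduce, reduced value).
  def run(): (Int, Int, Int) = {
    val counter = new AtomicInteger(0)
    val mapped = Seq(1, 2, 3, 4).view
      .filter { i => counter.incrementAndGet(); i > 0 }
      .map { i => counter.incrementAndGet(); i * 2 }
    val beforeReduce = counter.get // still 0: the view has not applied filter or map yet
    val reduced = mapped.reduce { (a, b) => counter.incrementAndGet(); a + b }
    (beforeReduce, counter.get, reduced)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run()
    println("before reduce: " + before + ", after reduce: " + after + ", result: " + result)
  }
}
```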

Related

How to print a value outside a for loop in scala?

I am new to Scala and would like to know how to access a val that is defined inside a for loop, so that I can write it to a file outside the loop.
def time[A](logFile: String, description: String)(job: => A): Unit = {
var totalDuration: Long = 0
for (i <- 1 to 3) {
val currentTime = System.currentTimeMillis
val result = job
val finalTime = System.currentTimeMillis
val duration = finalTime - currentTime
totalDuration = if (totalDuration == 0) duration else totalDuration.min(duration)
}
val pw = new PrintWriter(new FileOutputStream(new File(logFile),true))
pw.write(description + " " + result + " " + totalDuration +" ms"+"\n")
pw.flush()
pw.close
}
In the above code I am calculating my result, which contains the number of bytes read by another function, and I would like to measure the time it takes to read all the bytes. I would like to iterate 3 times and take the minimum of the three durations. The val result contains the byte length, which also needs to be written to a file. I get an error because I am accessing the val result outside the scope of the for loop. Can someone please help me solve this error? How can I access the val result outside the for loop to write it to a file?
Thanks in advance!!
While your question has been answered, the for loop is not in its typical form, which would look more like this:
def time[A] (logFile: String, description: String) (job: => A): Unit = {
val (result, totalDuration): (A, Long) = (for { i <- 1 to 3
currentTime = System.currentTimeMillis
result = job
finalTime = System.currentTimeMillis
duration = finalTime - currentTime
} yield (result, duration)).minBy {case (r, d) => d}
val pw = new PrintWriter (new FileOutputStream (new File (logFile), true))
pw.write (description + " " + result + " " + totalDuration +" ms"+"\n")
pw.flush()
pw.close
}
That is, if I understood your code correctly. I don't know whether side effects lead to different results for each job invocation.
I missed the discussion behind the design of the for loop and why the keyword val is omitted here, but it is quite handy.
More importantly, you usually put all temporary assignments in the first part of the for, that is, for (...here...) { not here }. The consequences of using round versus curly braces in that first part aren't totally clear to me, but if you use round parentheses, you have to terminate most statements with a semicolon.
scala> for (i <- 1 to 3;
| j <- 4 to 5;
| k = j-i;
| l = k/2) yield l * l;
res2: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 1, 1, 0, 1)
Note that none of i, j, k, and l were declared as val or var.
scala> for {i <- 1 to 3
| j <- 4 to 5
| k = j-i
| l = k/2} yield l * l;
res3: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 1, 1, 0, 1)
You will find multiple questions here that explain how a for loop is, and can be, translated to a flatMap/map combination, possibly with a filter:
scala> for {i <- 1 to 3
| j <- 4 to 5
| k = j-i
| if (k > 1)
| l = k/2 } yield l * l;
res5: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 1, 1, 1)
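As a rough illustration of that translation, the sketch below puts the comprehension next to a hand-simplified combinator chain. It is not the exact code the compiler generates (the compiler threads intermediate values through tuples), and the names ForDesugaring, viaFor and viaCombinators are made up for this example.

```scala
object ForDesugaring {
  // The comprehension from the REPL session.
  def viaFor: IndexedSeq[Int] =
    for {
      i <- 1 to 3
      j <- 4 to 5
      k = j - i
      if k > 1
      l = k / 2
    } yield l * l

  // A hand-simplified equivalent using flatMap, map and withFilter.
  def viaCombinators: IndexedSeq[Int] =
    (1 to 3).flatMap { i =>
      (4 to 5)
        .map(j => j - i)        // k = j - i
        .withFilter(k => k > 1) // if k > 1
        .map(k => k / 2)        // l = k / 2
        .map(l => l * l)        // yield l * l
    }

  def main(args: Array[String]): Unit = {
    println(viaFor)         // Vector(1, 4, 1, 1, 1)
    println(viaCombinators) // Vector(1, 4, 1, 1, 1)
  }
}
```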
And instead of yielding just one value, you can yield a tuple, and assign by
val (a:X, b:Y) = (for ..... yield (aa, bb))
The
yield (result, duration)).minBy {case (r, d) => d}
takes a tuple (result, duration) and selects the minimum by duration, but yields both values.
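A small standalone sketch of minBy on (result, duration) pairs; the sample data and the name MinByDemo are invented for illustration.

```scala
object MinByDemo {
  // Hypothetical (result, duration in ms) pairs standing in for three timed runs.
  val runs: Seq[(String, Long)] = Seq(("run-1", 42L), ("run-2", 17L), ("run-3", 29L))

  // minBy compares by duration only, but returns the whole tuple.
  val (bestResult, bestDuration) = runs.minBy { case (_, duration) => duration }

  def main(args: Array[String]): Unit =
    println(bestResult + " took " + bestDuration + " ms") // run-2 took 17 ms
}
```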
You can use yield. It returns the data collected by the for loop once the loop completes, and you can then use that data as needed. Since I do not have your full code, see this example:
val j = for (i <- 1 to 10) yield i
println(j)
output of j will be
Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
To replace your for loop when calculating the duration of jobs that are executed sequentially, you could use foldLeft.
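One possible shape for such a foldLeft, as a sketch only: timeMin is a hypothetical helper, it assumes runs is at least 1, and it keeps the result of the last run together with the smallest duration seen.

```scala
object MinDurationFold {
  // Runs `job` `runs` times (assumed >= 1) and returns the last result
  // together with the shortest observed duration in milliseconds.
  def timeMin[A](runs: Int)(job: => A): (A, Long) = {
    def once(): (A, Long) = {
      val start = System.currentTimeMillis
      val result = job
      (result, System.currentTimeMillis - start)
    }
    // Seed the fold with the first run, then fold over the remaining runs.
    (2 to runs).foldLeft(once()) { case ((_, best), _) =>
      val (result, duration) = once()
      (result, best.min(duration))
    }
  }

  def main(args: Array[String]): Unit = {
    val (result, ms) = timeMin(3)((1 to 1000).sum)
    println(result + " computed in at best " + ms + " ms")
  }
}
```

Both values are now available after the loop, so they can be written to the log file as in the question.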
Use yield, or any of the fold methods. If you are comfortable with recursion, you can also write a tail-recursive method that returns what you need. Yield is the simplest way to do it.
val answer = for(i<- 1 to 10) yield i
println(answer) // Vector(1,2,3,4,5,6,7,8,9,10)

Failed to print get count result in forloop

I have been trying to count inside a for loop, but the result just ends up as a pair of parentheses. I am only printing the key of the map here.
var count = 0
xs.foreach(x => (myMap += ((count+=1).toString+","+java.util.UUID.randomUUID.toString -> x)))
Output:
(),901e9926-be1e-4dc4-b3e3-6c3b2feea2c4
Expected output:
1,901e9926-be1e-4dc4-b3e3-6c3b2feea2c4
Within your foreach, count += 1 would be of type Unit. If I understand your question correctly, the example below (using an arbitrary xs collection) might be what you're looking for:
val xs = List("a", "b", "c", "d")
var count = 0
var myMap = Map[String, String]()
xs.foreach{ x =>
count += 1
myMap += ((count.toString + "," + java.util.UUID.randomUUID.toString) -> x)
}
myMap.keys
// res1: Iterable[String] = Set(
// 1,bd971c44-b9d0-41a0-b59f-3acbf2e0dee0, 2,5459eed9-309d-4f9c-afd7-10aced9df2a0,
// 3,5816ea42-d8ed-4beb-8b30-0376d0674700, 4,30f6f22f-1e6d-4eec-86af-5bc6734d5196
// )
In case you want a more idiomatic approach, using zip for the count and foldLeft for the Map aggregation would produce a similar result:
val myMap = Map[String, String]()
val resultMap = xs.zip(Stream from 1).foldLeft( myMap )(
(m, x) => m + ((x._2.toString + "," + java.util.UUID.randomUUID.toString) -> x._1)
)
What you are printing here is actually (count+=1).toString. In Scala, an assignment like this will be evaluated to Unit, which is expressed by parentheses. That's why you print () and not the value of count. If you check the count variable value afterwards you will see that it is 1 as expected.
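A tiny sketch that shows this directly (UnitAssignment and demo are hypothetical names): the assignment expression has type Unit, whose string form is "()", while the variable itself is still updated.

```scala
object UnitAssignment {
  // Returns (string form of the assignment expression, value of count afterwards).
  def demo(): (String, Int) = {
    var count = 0
    val assignment: Unit = (count += 1) // the assignment itself evaluates to Unit
    (assignment.toString, count)
  }

  def main(args: Array[String]): Unit =
    println(demo()) // ((),1)
}
```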
Additionally, what you are trying to do could be expressed more concisely, e.g.:
val myMap = xs.zipWithIndex.map(x => (x._2 + 1) + "," + java.util.UUID.randomUUID -> x._1).toMap

How to combine the results of spark computations in the following case?

The question is to calculate average of each of the columns corresponding to each class. Class number is given in the first column.
I am giving a part of test file for better clarity.
2 0.819039 -0.408442 0.120827
3 -0.063763 0.060122 0.250393
4 -0.304877 0.379067 0.092391
5 -0.168923 0.044400 0.074417
1 0.053700 -0.088746 0.228501
2 0.196758 0.035607 0.008134
3 0.006971 -0.096478 0.123718
4 0.084281 0.278343 -0.350414
So the task is to calculate
1: avg(), avg(), avg()
.
.
.
I am very new to Scala. After juggling a lot with the code I came up with the following code
val inputfile = sc.textFile ("testfile.txt")
val myArray = inputfile.map { line =>
(line.split(" ").toList)
}
var Avgmap:Map[String,List[Double]] = Map()
var countmap:Map[String,Int] = Map()
for( a <- myArray ){
//println( "Value of a: " + a + " " + a.size );
if(!countmap.contains(a(0))){
countmap += (a(0) -> 0)
Avgmap += (a(0) -> List.fill(a.size-1)(1.0))
}
var c = countmap(a(0)) + 1
val countmap2 = countmap + (a(0) -> c)
countmap = countmap2
var p = List[Double]()
for( i <- 1 to a.size - 1) {
var temp = (Avgmap(a(0))(i-1)*(countmap(a(0)) - 1) + a(i).toDouble)/countmap(a(0))
// println("i: "+i+" temp: "+temp)
var q = p :+ temp
p = q
}
val Avgmap2 = Avgmap + (a(0) -> p)
Avgmap = Avgmap2;
println("--------------------------------------------------")
println(countmap)
println(Avgmap)
}
When I execute this code I seem to be getting the results in two halves of the dataset. Please help me in combining them.
Edit: About the variables I am using. countmap keeps record of classnumber -> number of vectors encountered. Similarly Avgmap keeps record of average so far of each columns corresponding to the key.
First, use the DataFrame API. Second, what you want is just one row:
df.select(df.columns.map(c => mean(col(c))) :_*).show
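For the per-class part of the question, the logic can be sketched with plain Scala collections. This is not Spark code, only an illustration of grouping by the first column and averaging the remaining ones; ClassAverages and averages are made-up names, and the input format is assumed to be whitespace-separated as in the sample file.

```scala
object ClassAverages {
  // Parses lines like "2 0.819039 -0.408442 0.120827" and returns,
  // for each class label, the average of every numeric column.
  def averages(lines: Seq[String]): Map[String, Seq[Double]] =
    lines
      .map(_.trim.split("\\s+").toList)
      .collect { case cls :: values if values.nonEmpty => cls -> values.map(_.toDouble) }
      .groupBy(_._1)
      .map { case (cls, rows) =>
        val vectors = rows.map(_._2)
        // transpose turns rows into columns, so each column can be averaged on its own
        cls -> vectors.transpose.map(column => column.sum / column.size)
      }

  def main(args: Array[String]): Unit =
    println(averages(Seq("1 1.0 2.0", "1 3.0 4.0", "2 5.0 6.0")))
}
```

In Spark the same idea would typically be expressed as a grouped aggregation rather than by mutating maps inside a loop over the RDD.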

flatMap in scala, the compiler says it's wrong

I have a file which contains lines which contain items separated by ","
for example:
2 1,3
3 2,5,7
5 4
Now I want to flatMap this file into an RDD like this:
2 1
2 3
3 2
3 5
3 7
5 4
I wonder how to implement this in Scala:
val pairs = lines.flatMap { line =>
val a = line.split(" ")(0)
val partb = line.split(" ")(1)
for (b <- partb.split(",")) {
yield a + " " + b
}
}
Is this correct?
Thank you for clarifying your code example. In your case, the only problem is the location of your yield keyword. Move it to before the curly braces, like so:
for (b <- partb.split(",")) yield {
a + " " + b
}
You need to do yield THEN the return logic
yield {a}
The way you are doing it now is a for loop, not a for comprehension, so the compiler will complain about the yield keyword, and even if it did not, the loop would return Unit.
val pairs = lines.flatMap { line =>
for (a <- line.split(",")) yield {
a
}
}
In addition to relocating yield so that a collection is returned, as already explained, consider this possible refactoring where we extract the first two entries from split,
val pairs = lines.flatMap { line =>
val Array(a, partb, _*) = line.split(" ")
for (b <- partb.split(","))
yield a + " " + b
}
and yet more concise is
val pairs = lines.flatMap { line =>
val Array(a, tail @ _*) = line.split(" |,")
for (t <- tail) yield s"$a $t"
}
where we split by either " " or "," and extract the head and the tail, then we apply string interpolation to produce the desired result.
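To try the head/tail idea without a Spark context, here is a sketch that uses a plain Seq[String] in place of the RDD. ExpandPairs and expand are hypothetical names, and the input is the sample from the question.

```scala
object ExpandPairs {
  // Turns "3 2,5,7" into "3 2", "3 5", "3 7": the first field is the key,
  // every remaining field becomes its own "key value" line.
  def expand(lines: Seq[String]): Seq[String] =
    lines.flatMap { line =>
      val fields = line.split("[ ,]") // split on either a space or a comma
      val key = fields.head
      for (value <- fields.tail.toSeq) yield s"$key $value"
    }

  def main(args: Array[String]): Unit =
    expand(Seq("2 1,3", "3 2,5,7", "5 4")).foreach(println)
}
```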

Scala - can 'for-yield' clause yields nothing for some condition?

In Scala, I want to write a function that yields the odd numbers within a given range. The function prints a log message when it encounters an even number. The first version of the function is:
def getOdds(N: Int): Traversable[Int] = {
val list = new mutable.MutableList[Int]
for (n <- 0 until N) {
if (n % 2 == 1) {
list += n
} else {
println("skip even number " + n)
}
}
return list
}
If I omit the logging, the implementation becomes very simple:
def getOddsWithoutPrint(N: Int) =
for (n <- 0 until N if (n % 2 == 1)) yield n
However, I don't want to lose the logging part. How do I rewrite the first version more compactly? It would be great if it could be rewritten similar to this:
def IWantToDoSomethingSimilar(N: Int) =
for (n <- 0 until N) if (n % 2 == 1) yield n else println("skip even number " + n)
def IWantToDoSomethingSimilar(N: Int) =
for {
n <- 0 until N
if n % 2 != 0 || { println("skip even number " + n); false }
} yield n
Using filter instead of a for expression would be slightly simpler though.
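A filter-based version might look like the sketch below (OddsWithFilter is a made-up wrapper name). Because the predicate uses an explicitly named parameter, the log line runs for every even element.

```scala
object OddsWithFilter {
  // Keeps odd numbers in [0, n) and logs each even number that is skipped.
  def getOdds(n: Int): Seq[Int] =
    (0 until n).filter { i =>
      val odd = i % 2 == 1
      if (!odd) println("skip even number " + i)
      odd
    }

  def main(args: Array[String]): Unit =
    println(getOdds(6)) // logs 0, 2 and 4 as skipped, then prints Vector(1, 3, 5)
}
```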
If you want to keep your processing sequential (handling odds and evens in order, not separately), you can use something like this (edited):
def IWantToDoSomethingSimilar(N: Int) =
(for (n <- (0 until N)) yield {
if (n % 2 == 1) {
Option(n)
} else {
println("skip even number " + n)
None
}
// Flatten transforms the Seq[Option[Int]] into Seq[Int]
}).flatten
EDIT, following the same concept, a shorter solution :
def IWantToDoSomethingSimilar(N: Int) =
(0 until N) map {
case n if n % 2 == 0 => println("skip even number "+ n)
case n => n
} collect {case i:Int => i}
If you want to dig into a functional approach, something like the following is a good starting point.
First some common definitions:
// use scalaz 7
import scalaz._, Scalaz._
// transforms a function returning either E or B into a
// function returning an optional B and optionally writing a log of type E
def logged[A, E, B, F[_]](f: A => E \/ B)(
implicit FM: Monoid[F[E]], FP: Pointed[F]): (A => Writer[F[E], Option[B]]) =
(a: A) => f(a).fold(
e => Writer(FP.point(e), None),
b => Writer(FM.zero, Some(b)))
// helper for fixing the log storage format to List
def listLogged[A, E, B](f: A => E \/ B) = logged[A, E, B, List](f)
// shorthand for a String logger with List storage
type W[+A] = Writer[List[String], A]
Now all you have to do is write your filtering function:
def keepOdd(n: Int): String \/ Int =
if (n % 2 == 1) \/.right(n) else \/.left(n + " was even")
You can try it instantly:
scala> List(5, 6) map(keepOdd)
res0: List[scalaz.\/[String,Int]] = List(\/-(5), -\/(6 was even))
Then you can use the traverse function to apply your function to a list of inputs, and collect both the logs written and the results:
scala> val x = List(5, 6).traverse[W, Option[Int]](listLogged(keepOdd))
x: W[List[Option[Int]]] = scalaz.WriterTFunctions$$anon$26@503d0400
// unwrap the results
scala> x.run
res11: (List[String], List[Option[Int]]) = (List(6 was even),List(Some(5), None))
// we may even drop the None-s from the output
scala> val (logs, results) = x.map(_.flatten).run
logs: List[String] = List(6 was even)
results: List[Int] = List(5)
I don't think this can be done easily with a for comprehension. But you could use partition.
def getOdds(N: Int) = {
val (evens, odds) = 0 until N partition { x => x % 2 == 0 }
evens foreach { x => println("skipping " + x) }
odds
}
EDIT: To avoid printing the log messages after the partitioning is done, you can change the first line of the method like this:
val (evens, odds) = (0 until N).view.partition { x => x % 2 == 0 }