Scala: aggregate column based file

I have a huge file (does not fit into memory) which is tab separated with two columns (key and value), and pre-sorted on the key column. I need to call a function on all values for a key and write out the result. For simplicity, one can assume that the values are numbers and the function is addition.
So, given an input:
A 1
A 2
B 1
B 3
The output would be:
A 3
B 4
For this question, I'm not so much interested in reading/writing the file as in the list-comprehension side. It is important, though, that neither the input nor the output fits into memory as a whole. I'm new to Scala, and coming from Java I'm interested in what the functional/Scala way to do that would be.
Update:
Based on AmigoNico's comment, I came up with the constant-memory solution below.
Any comments / improvements are appreciated!
val writeAggr = (kv : (String, Int)) => {println(kv._1 + " " + kv._2)}
writeAggr(
( ("", 0) /: scala.io.Source.fromFile("/tmp/xx").getLines ) { (keyAggr, line) =>
val Array(k,v) = line split ' '
if (keyAggr._1.equals(k)) {
(k, keyAggr._2 + v.toInt)
} else {
if (!keyAggr._1.equals("")) {
writeAggr(keyAggr)
}
(k, v.toInt)
}
}
)

This can be done quite elegantly with Scalaz streams (and unlike iterator-based solutions, it's "truly" functional):
import scalaz.stream._
val process =
io.linesR("input.txt")
.map { _.split("\\s") }
.map { case Array(k, v) => k -> v.toInt }
.pipe(process1.chunkBy2(_._1 == _._1))
.map { kvs => s"${ kvs.head._1 } ${ kvs.map(_._2).sum }\n" }
.pipe(text.utf8Encode)
.to(io.fileChunkW("output.txt"))
Not only will this read from the input, aggregate the lines, and write to the output in constant memory, but you also get nice guarantees about resource management that e.g. source.getLines can't offer.

You probably want to use a fold, like so:
scala> ( ( Map[String,Int]() withDefaultValue 0 ) /: scala.io.Source.fromFile("/tmp/xx").getLines ) { (map,line) =>
val Array(k,v) = line split ' '
map + ( k -> ( map(k) + v.toInt ) )
}
res12: scala.collection.immutable.Map[String,Int] = Map(A -> 3, B -> 4)
Folds are great for accumulating results (unlike for-comprehensions). And since getLines returns an Iterator, only one line is held in memory at a time.
UPDATE: OK, there is a new requirement that we not hold the results in memory either. In that case I think I'd just write a recursive function and use it like so:
scala> val kvPairs = scala.io.Source.fromFile("/tmp/xx").getLines map { line =>
val Array(k,v) = line split ' '
( k, v.toInt )
}
kvPairs: Iterator[(String, Int)] = non-empty iterator
scala> final def loop( key:String, soFar:Int ) {
if ( kvPairs.hasNext ) {
val (k,v) = kvPairs.next
if ( k == key )
loop( k, soFar+v )
else {
println( s"$key $soFar" )
loop(k,v)
}
} else println( s"$key $soFar" )
}
loop: (key: String, soFar: Int)Unit
scala> val (k,v) = kvPairs.next
k: String = A
v: Int = 1
scala> loop(k,v)
A 3
B 4
But the only thing functional about that is that it uses a recursive function rather than a loop. If you are OK with holding all of the values for a particular key in memory you could write a function that iterates over the lines of the file producing an Iterator of Iterators of like-keyed pairs, which you could then just sum and print, but the code would still not be particularly functional and it would be slower.
Travis's Scalaz pipeline solution looks like an interesting one along those lines, but with the iteration hidden behind some handy constructs. If you specifically want a functional solution, I'd say his is the best answer.
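For comparison, here is a minimal sketch of the "iterate over like-keyed pairs" idea using only the standard library. The names aggregateSorted and parseLine are hypothetical helpers, not part of any answer above, and the sketch assumes the input is already grouped by key and whitespace-separated. Only the running sum for the current key is held in memory.

```scala
// Hypothetical helper: sums runs of consecutive pairs that share a key.
// Assumes the input is already sorted (or at least grouped) by key.
def aggregateSorted(pairs: Iterator[(String, Int)]): Iterator[(String, Int)] =
  new Iterator[(String, Int)] {
    private val in = pairs.buffered
    def hasNext: Boolean = in.hasNext
    def next(): (String, Int) = {
      val (key, first) = in.next()
      var sum = first
      // Consume the following pairs for as long as the key is unchanged.
      while (in.hasNext && in.head._1 == key) sum += in.next()._2
      (key, sum)
    }
  }

// Hypothetical parser; the whitespace separator is an assumption.
def parseLine(line: String): (String, Int) =
  line.split("\\s+") match {
    case Array(k, v) => (k, v.toInt)
    case _           => throw new IllegalArgumentException(s"malformed line: $line")
  }

val demo = aggregateSorted(Iterator("A 1", "A 2", "B 1", "B 3").map(parseLine)).toList
```

For a real file, the same function could be fed with scala.io.Source.fromFile(path).getLines().map(parseLine), writing each result as it is produced instead of calling toList.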

Related

How do you pad a string in Scala with a character for missing elements in a Vector?

If I have a sparse list of numbers:
Vector(1,3,7,8,9)
and I need to generate a string of a fixed size which replaces the 'missing' numbers with a given character that might look like this:
1.3...789
How would I do this in Scala?
Well, I'm not sure of the range of the integers, so I'm assuming that they may not fit into a char and have used strings. Try this:
val v = Vector(1,3,7,8,9)
val fixedStr = ( v.head to v.last )
.map( i => if (v.contains(i)) i.toString else "." )
.mkString
If you are only dealing with single digits then you may change the strings to chars in the above.
-- edit --
OK, so I couldn't help myself: I wanted to address the sparse-vector case, so I changed it to use the sliding function. It does no good sitting on my PC, so I'm sharing it here:
v.sliding(2)
.map( (seq) => if (seq.size == 2) seq else seq ++ seq ) //normalize window to size 2
.foldLeft( new StringBuilder )( (sb, seq) => //fold into stringbuilder
seq match { case Seq(a,b) => sb.append(a).append( "." * (b - a - 1) ) } )
.append( v.last )
.toString
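A side note on the range-based version: v.contains(i) scans the vector for every integer in the range. A possible variation, sketched here with a hypothetical helper name (padRange) and an assumed placeholder parameter, converts the vector to a Set first. It assumes the vector is sorted in ascending order.

```scala
// Hypothetical helper; `placeholder` is an assumed parameter name.
// Assumes `v` is sorted in ascending order.
def padRange(v: Vector[Int], placeholder: String = "."): String =
  if (v.isEmpty) ""
  else {
    val present = v.toSet // set membership instead of a linear scan per element
    (v.head to v.last)
      .map(i => if (present(i)) i.toString else placeholder)
      .mkString
  }

val padded = padRange(Vector(1, 3, 7, 8, 9))
```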
One way to do this is using sliding and pattern matching:
def mkNiceString(v: Vector[Int]) = {
v.sliding(2).map{
case Seq(a) => ""
case Seq(a,b) =>
val gap = b-a;
a.toString + (if(gap>1) "." * (gap-1) else "")
}.mkString + v.last
}
In the REPL:
scala> mkNiceString(Vector(1,3,7,8,9,11))
res22: String = 1.3...789.11
If the vector is sparse, this will be more efficient than checking the range between the first and the last number.
def padVector(numbers: Vector[Int], placeHolder: String) = {
def inner(nums: Vector[Int], prevNumber: Int, acc: String) : String =
if (nums.length == 0) acc
else (nums.head - prevNumber) match {
// the difference is 1 -> no gap between this and previous number
case 1 => inner(nums.tail, nums.head, acc + nums.head)
// gap between numbers -> add placeholder x times
case x => inner(nums.tail, nums.head, acc + (placeHolder * (x-1)) + nums.head)
}
if (numbers.length == 0) ""
else inner(numbers.tail, numbers.head, numbers.head.toString)
}
Output:
scala> padVector(Vector(1,3,7,8,9), ".")
res4: String = 1.3...789

Scala for ( ) vs for { }

I'm trying to understand for comprehensions in Scala, and I have a lot of examples that I sort of understand...
One thing I'm having a hard time figuring out is for ( ) vs for { }. I've tried both, and it seems like I can do one thing in one but it breaks in the other.
For example, this does NOT work:
def encode(number: String): Set[List[String]] =
if (number.isEmpty) Set(List())
else {
for (
split <- 1 to number.length
word <- wordsForNum(number take split)
rest <- encode(number drop split)
) yield word :: rest
}.toSet
However, if you change it to { }, it does compile:
def encode(number: String): Set[List[String]] =
if (number.isEmpty) Set(List())
else {
for {
split <- 1 to number.length
word <- wordsForNum(number take split)
rest <- encode(number drop split)
} yield word :: rest
}.toSet
These examples are from a Coursera class I'm taking. The professor didn't mention the "why" in the video & I was wondering if anyone else knows.
Thanks!
From the syntax in the spec, it might seem that parens and braces are interchangeable:
http://www.scala-lang.org/files/archive/spec/2.11/06-expressions.html#for-comprehensions-and-for-loops
but because the generators are separated by semis, the following rules kick in:
http://www.scala-lang.org/files/archive/spec/2.11/01-lexical-syntax.html#newline-characters
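As I recall from that section, the gist is that newlines are enabled inside braces, which is to say that a newline character there is taken as the token nl, which serves as a semi. Inside parentheses newlines are disabled, so generators on separate lines run together unless you write the semicolons yourself.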
So you can put the generators on separate lines instead of using semicolons.
This is the usual "semicolon inference" that lets you not write semicolons as statement terminators. So the newline in the middle of the generator is not taken as a semi, for instance:
scala> for (c <-
| List(1,2,3)
| ) yield c+1
res0: List[Int] = List(2, 3, 4)
scala> for { c <-
| List(1,2,3)
| i = c+1
| } yield i
res1: List[Int] = List(2, 3, 4)
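To make the difference concrete, here is a small sketch (the value names are made up for illustration) with the same two generators written both ways. With parentheses the generators need explicit semicolons; with braces the newline is enough.

```scala
// Parentheses: generators on separate lines need explicit semicolons.
val withParens =
  for (
    i <- 1 to 2;
    j <- 1 to 2
  ) yield (i, j)

// Braces: the newline between generators acts as the separator.
val withBraces =
  for {
    i <- 1 to 2
    j <- 1 to 2
  } yield (i, j)
```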
In Scala, ( ) is usually used when you have only one statement. Something like this would have worked:
def encode(number: String): Set[Int] =
if (number.isEmpty) Set()
else {
for (
split <- 1 to number.length // Only one generator in the comprehension
) yield split
}.toSet
Add a second generator on its own line (without a semicolon) and it fails to compile. The same is true for map, for example:
OK
List(1,2,3).map(number =>
number.toString
)
Not OK (have to use curly braces)
List(1,2,3).map(number =>
println("Hello world")
number.toString
)
Why that is, I have no idea :)
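For completeness, a sketch of the multi-statement lambda written with braces, which is the form that compiles (at least in Scala 2, which is what this thread is about). The name labels and the doubling step are made up for illustration.

```scala
// With braces, the lambda body may contain several statements;
// the last expression is the result.
val labels = List(1, 2, 3).map { number =>
  val doubled = number * 2 // first statement
  doubled.toString         // result expression
}
```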

Scala unable to change values in mutable Map[Char, Map[Int, Double]] with default values

All maps in this code are mutable maps due to the import statement earlier on in the full code. The nGramGetter.getNGrams(...) method call returns a Map[String, Int].
def train(files: Array[java.io.File]): Map[Char, Map[Int, Double]] = {
val scores = Map[Char, Map[Int, Double]]().withDefault( x => Map[Int, Double]().withDefaultValue(0.0))
for{
i <- 1 to 4
nGram <- nGramGetter.getNGrams(files, i).filter( x => (x._1.size == 1 || x._2 > 4) && !hasUnwantedChar(x._1) )
char <- nGram._1
} scores(char)(i) += nGram._2
println(scores.size)
val nonUnigramTotals = scores.mapValues( x => x.values.reduce(_+_)-x(1) )
val unigramTotals = scores.mapValues( x => x(1) )
scores.map( x => x._1 -> x._2.map( y => y._1 -> (if(y._1 > 1) y._2/unigramTotals(x._1) else (y._2-nonUnigramTotals(x._1))/unigramTotals(x._1)) ) )
}
I have replaced the scores(char)(i) += nGram._2 line with a few print statements (printing the keys, values and individual chars in each key) to check the output, and the method call is NOT returning an empty list. The line that prints the size of scores, however, is printing a zero. I am almost sure I have used exactly this method to populate a frequency map before, but this time, the map always comes out empty. I have changed withDefault to withDefaultValue and passed in the result of the current function literal as the argument. I have tried both withDefault and withDefaultValue with Map[Int, Double](1->0.0,2->0.0,3->0.0,4->0.0). I am a bit of a Scala noob, so maybe I just don't understand something about the language that is causing the problem. Any idea what's wrong?
The methods withDefault and withDefaultValue do not add entries to the map. They only determine what a lookup returns when the key is missing; the returned default is not stored. Let's remove the syntactic sugar from your statement to see where it goes wrong:
scores(char)(i) += nGram._2
scores(char)(i) = scores(char)(i) + nGram._2
scores.apply(char)(i) = scores.apply(char)(i) + nGram._2
scores.apply(char).update(i, scores.apply(char).apply(i) + nGram._2)
Now, since scores has no entry for char, scores.apply(char) returns the default, Map[Int, Double]().withDefaultValue(0.0), and that map gets modified. Unfortunately, it never gets stored in scores, because update is never called on scores itself. Try the code below; it's untested, but it shouldn't be hard to get it to work:
scores(char) = scores(char) // initializes the map for that key, if it doesn't exist
scores(char)(i) += nGram._2
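An alternative sketch, assuming you are free to drop the outer default altogether: getOrElseUpdate on a mutable map stores the inner map the first time a key is seen. The helper name tally and the sample entries are hypothetical; they only stand in for the nGram loop from the question.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the nGram loop: each entry is (char, n, count).
def tally(entries: Seq[(Char, Int, Double)]): mutable.Map[Char, mutable.Map[Int, Double]] = {
  val scores = mutable.Map.empty[Char, mutable.Map[Int, Double]]
  for ((char, n, count) <- entries) {
    // getOrElseUpdate inserts the inner map into `scores` if the key is missing.
    val inner = scores.getOrElseUpdate(char, mutable.Map.empty[Int, Double].withDefaultValue(0.0))
    inner(n) += count
  }
  scores
}

val tallied = tally(Seq(('a', 1, 2.0), ('a', 1, 3.0), ('b', 2, 1.0)))

// For contrast: a lookup on a map with a default does not insert anything.
val defaulted = mutable.Map.empty[Char, Int].withDefaultValue(0)
val lookedUp = defaulted('x')
```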

How to remove duplicates from a list then sort by most frequent

I have a list of assorted keywords that may repeat. I need to generate a list of the distinct keywords, sorted by how frequently they appear in the original list (most frequent first).
What would be the idiomatic Scala for that? Here is a working but ugly implementation:
val keys = List("c","a","b","b","a","a")
keys.groupBy(p => p).toList.sortWith( (a,b) => a._2.size > b._2.size ).map(_._1)
// List("a","b","c")
Shorter version:
keys.distinct.sortBy(keys count _.==).reverse
That is not particularly efficient, however. The groupBy version ought to perform better, though it can be improved:
keys.groupBy(identity).toSeq.sortBy(-_._2.size).map(_._1)
One can also get rid of the reverse in the first version by declaring an Ordering:
val ord = Ordering by (keys count (_: String).==)
keys.distinct.sorted(ord.reverse)
Note that reverse in this version just produces a new Ordering that works in the opposite manner of the original. This version also suggests a way to get better performance:
val freq = collection.mutable.Map.empty[String, Int] withDefaultValue 0
keys foreach (k => freq(k) += 1)
val ord = Ordering by freq
keys.distinct.sorted(ord.reverse)
Nothing wrong with that implementation that comments can't fix!
Seriously, break it down a bit and describe what & why you're taking each step.
Not as "concise" perhaps, but the purpose of concise code in Scala is to make code more readable. When concise code is not clear, it's time to back up, break it up (introduce well-named local variables), and comment.
Here's my take, don't know if it's less "ugly":
scala> keys.groupBy(p => p).values.toList.sortBy(_.size).reverse.map(_.head)
res39: List[String] = List(a, b, c)
fold version:
val keys = List("c","a","b","b","a","a")
val keysCounts =
(Map.empty[String, Int] /: keys) { case (counts, k) =>
counts updated (k, (counts getOrElse (k, 0)) + 1)
}
keysCounts.toList sortBy { case (_, count) => -count } map { case (w, _) => w }
Perhaps,
val mapCount = keys.map(x => (x,keys.count(_ == x))).distinct
// mapCount : List[(java.lang.String, Int)] = List((c,1), (a,3), (b,2))
val sortedList = mapCount.sortWith(_._2 > _._2).map(_._1)
// sortedList : List[java.lang.String] = List(a, b, c)
How about:
keys.distinct.sorted
Newbie didn't read the question carefully. Let me try again:
keys.foldLeft (Map[String,Int]()) { (counts, elem) => counts + (elem -> (counts.getOrElse(elem, 0) - 1))}
.toList.sortBy(_._2).map(_._1)
Could use a mutable Map if you prefer. Negative frequency counts are stored in the map. If that bothers you, you can make them positive and negate the sortBy argument.
Just a little change from @Daniel's 4th version, which may perform better:
scala> def sortByFreq[T](xs: List[T]): List[T] = {
| val freq = collection.mutable.Map.empty[T, Int] withDefaultValue 0
| xs foreach (k => freq(k) -= 1)
| xs.distinct sortBy freq
| }
sortByFreq: [T](xs: List[T])List[T]
scala> sortByFreq(keys)
res2: List[String] = List(a, b, c)
My preferred versions would be:
Most canonical / expressive?
keys.groupBy(identity).toList.map{ case (k,v) => (-v.size,k) }.sorted.map(_._2)
Shortest and probably most efficient?
keys.groupBy(identity).toList.sortBy(-_._2.size).map(_._1)
Straightforward
keys.groupBy(identity).values.toList.sortBy(-_.size).map(_.head)
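One thing the groupBy-based versions leave open is the order of keywords with equal counts, since the intermediate Map has no defined order. A minimal sketch that keeps ties in order of first appearance, assuming that is what you want (sortByFrequency is a hypothetical name):

```scala
// Hypothetical helper: most frequent first; ties keep first-appearance order.
def sortByFrequency[T](xs: List[T]): List[T] = {
  // Count occurrences once, up front.
  val counts = xs.groupBy(identity).map { case (k, v) => k -> v.size }
  // distinct preserves first-occurrence order and sortBy is stable,
  // so equal counts stay in that order.
  xs.distinct.sortBy(x => -counts(x))
}

val byFrequency = sortByFrequency(List("c", "a", "b", "b", "a", "a"))
```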

How to implement lazy sequence (iterable) in scala?

I want to implement a lazy iterator that yields the next element in each call, in a 3-level nested loop.
Is there something similar in scala to this snippet of c#:
foreach (int i in ...)
{
foreach (int j in ...)
{
foreach (int k in ...)
{
yield return do(i,j,k);
}
}
}
Thanks, Dudu
Scala sequence types all have a .view method which produces a lazy equivalent of the collection. You can play around with the following in the REPL (after issuing :silent to stop it from forcing the collection to print command results):
def log[A](a: A) = { println(a); a }
for (i <- 1 to 10) yield log(i)
for (i <- (1 to 10) view) yield log(i)
The first will print out the numbers 1 to 10, the second will not until you actually try to access those elements of the result.
There is nothing in Scala directly equivalent to C#'s yield statement, which pauses the execution of a loop. You can achieve similar effects with the delimited continuations that were added for Scala 2.8.
If you join iterators together with ++, you get a single iterator that runs over both. And the reduceLeft method helpfully joins together an entire collection. Thus,
def doIt(i: Int, j: Int, k: Int) = i+j+k
(1 to 2).map(i => {
(1 to 2).map(j => {
(1 to 2).iterator.map(k => doIt(i,j,k))
}).reduceLeft(_ ++ _)
}).reduceLeft(_ ++ _)
will produce the iterator you want. If you want it to be even more lazy than that, you can add .iterator after the first two (1 to 2) also. (Replace each (1 to 2) with your own more interesting collection or range, of course.)
You can use a Sequence Comprehension over Iterators to get what you want:
for {
i <- (1 to 10).iterator
j <- (1 to 10).iterator
k <- (1 to 10).iterator
} yield doFunc(i, j, k)
If you want to create a lazy Iterable (instead of a lazy Iterator) use Views instead:
for {
i <- (1 to 10).view
j <- (1 to 10).view
k <- (1 to 10).view
} yield doFunc(i, j, k)
Depending on how lazy you want to be, you may not need all of the calls to iterator / view.
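The following sketch illustrates the laziness of the iterator version. doFunc is not defined in the answer above, so the definition here (and the calls counter) is an assumption made purely for the demonstration.

```scala
var calls = 0 // counts how often the loop body actually runs

// Assumed stand-in for the undefined doFunc from the answer.
def doFunc(i: Int, j: Int, k: Int): Int = { calls += 1; i * 100 + j * 10 + k }

val lazyResults: Iterator[Int] =
  for {
    i <- (1 to 3).iterator
    j <- (1 to 3).iterator
    k <- (1 to 3).iterator
  } yield doFunc(i, j, k)

val callsBefore = calls                    // nothing has been computed yet
val firstTwo = lazyResults.take(2).toList  // forces only the elements taken
val callsAfter = calls                     // far fewer than the 27 combinations
```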
If your 3 iterators are generally small (i.e., you can fully iterate them without concern for memory or CPU) and the expensive part is computing the result given i, j, and k, you can use Scala's Stream class.
val tuples = for (i <- 1 to 3; j <- 1 to 3; k <- 1 to 3) yield (i, j, k)
val stream = Stream(tuples: _*) map { case (i, j, k) => i + j + k }
stream take 10 foreach println
If your iterators are too large for this approach, you could extend this idea and create a Stream of tuples that calculates the next value lazily by keeping state for each iterator. For example (although hopefully someone has a nicer way of defining the product method):
def product[A, B, C](a: Iterable[A], b: Iterable[B], c: Iterable[C]): Iterator[(A, B, C)] = {
if (a.isEmpty || b.isEmpty || c.isEmpty) Iterator.empty
else new Iterator[(A, B, C)] {
private val aItr = a.iterator
private var bItr = b.iterator
private var cItr = c.iterator
private var aValue: Option[A] = if (aItr.hasNext) Some(aItr.next) else None
private var bValue: Option[B] = if (bItr.hasNext) Some(bItr.next) else None
override def hasNext = cItr.hasNext || bItr.hasNext || aItr.hasNext
override def next = {
if (cItr.hasNext)
(aValue get, bValue get, cItr.next)
else {
cItr = c.iterator
if (bItr.hasNext) {
bValue = Some(bItr.next)
(aValue get, bValue get, cItr.next)
} else {
aValue = Some(aItr.next)
bItr = b.iterator
bValue = Some(bItr.next) // restart b from its first element for the new a
(aValue get, bValue get, cItr.next)
}
}
}
}
}
val stream = product(1 to 3, 1 to 3, 1 to 3).toStream map { case (i, j, k) => i + j + k }
stream take 10 foreach println
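This approach evaluates its inputs lazily, so it also works with infinitely large inputs (although with an infinite innermost input the outer values never advance).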
I think the below code is what you're actually looking for... I think the compiler ends up translating it into the equivalent of the map code Rex gave, but is closer to the syntax of your original example:
scala> def doIt(i:Int, j:Int) = { println(i + ","+j); (i,j); }
doIt: (i: Int, j: Int)(Int, Int)
scala> def x = for( i <- (1 to 5).iterator;
j <- (1 to 5).iterator ) yield doIt(i,j)
x: Iterator[(Int, Int)]
scala> x.foreach(print)
1,1
(1,1)1,2
(1,2)1,3
(1,3)1,4
(1,4)1,5
(1,5)2,1
(2,1)2,2
(2,2)2,3
(2,3)2,4
(2,4)2,5
(2,5)3,1
(3,1)3,2
(3,2)3,3
(3,3)3,4
(3,4)3,5
(3,5)4,1
(4,1)4,2
(4,2)4,3
(4,3)4,4
(4,4)4,5
(4,5)5,1
(5,1)5,2
(5,2)5,3
(5,3)5,4
(5,4)5,5
(5,5)
scala>
You can see from the output that the print in "doIt" isn't called until the next value of x is iterated over, and this style of for generator is a bit simpler to read/write than a bunch of nested maps.
Turn the problem upside down. Pass "do" in as a closure. That's the entire point of using a functional language.
Iterator.zip will do it:
iterator1.zip(iterator2).zip(iterator3).map(tuple => doSomething(tuple))
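Note that zip walks its iterators in lockstep and pairs elements positionally, which is not the same as the nested loop in the question. A quick comparison, with made-up value names:

```scala
// zip pairs the first elements, then the second elements, and so on.
val zipped = Iterator(1, 2).zip(Iterator(10, 20)).toList

// A for comprehension over iterators visits every combination, like nested loops.
val nested = (for {
  a <- Iterator(1, 2)
  b <- Iterator(10, 20)
} yield (a, b)).toList
```

So zip fits only if lockstep traversal is what you need; with three zipped iterators the elements also arrive as nested pairs of the form ((a, b), c).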
Just read the first 20 or so related links that are shown on the side (and, indeed, were shown to you when you first wrote the title of your question).