Processing a file in batches in Scala

I have a flat file which contains several million lines like the one below:
59, 254, 2016-09-09T00:00, 1, 6, 3, 40, 18, 0
I want to process this file in batches of X rows at a time, so I wrote this code:
import scala.io.Source

def func(x: Int) = {
  for {
    batches <- Source.fromFile("./foo.txt").getLines().sliding(x, x)
  } yield batches.map("(" + _ + ")").mkString(",")
}
func(2).foreach(println)
This code produces exactly the output I want: the function walks through the entire file, taking 2 rows at a time and batching them into one string.
(59, 828, 2016-09-09T00:00, 0, 8, 2, 52, 0, 0),(59, 774, 2016-09-09T00:00, 0, 10, 2, 51, 0, 0)
But when I see Scala pros write code, everything happens inside the for comprehension and you just return the last thing from it.
So, in order to be a Scala pro, I changed my code:
for {
  batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
  line <- batches.map("(" + _ + ")").mkString(",")
} yield line
This produces one character per line, not the output I expected. Why did the behavior change so completely? On a first reading, the two versions look the same to me.

In the line line <- batches.map("(" + _ + ")").mkString(","), the right-hand side produces a String (the result of mkString), and the generator iterates over that string. Iterating over a string yields its individual characters, so line ends up being a single character. What you actually want is not to iterate over the string, but to bind it to the name line, which you do by replacing the <- with =: line = batches.map("(" + _ + ")").mkString(",").
By the way, sliding(2,2) can be more clearly written as grouped(2).
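Putting both fixes together, a minimal sketch of the corrected comprehension:
for {
  batches <- Source.fromFile("./foo.txt").getLines().grouped(2)
  line = batches.map("(" + _ + ")").mkString(",")  // = binds the string instead of iterating it
} yield line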

#dhg has given the explanation; here's my suggestion on how this could be done another way:
for {
  batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
  batch = batches.map("(" + _ + ")")
} yield batch.mkString(",")
With the = binding, batch is a collection of the two wrapped lines, and mkString(",") joins them into the final string.

Related

How do I iterate a sequence with varying starting positions?

Say I have an array:
[10,12,20,50]
I can iterate through this array as normal, looking at position 0, then 1, 2, and 3.
What if I wanted to start at any arbitrary position in the array, and then go through all the numbers in order?
So the other permutations would be:
10,12,20,50
12,20,50,10
20,50,10,12
50,10,12,20
Is there a general function that would allow me to do this type of sliding iteration?
Looking at the index positions, the iterations above would be:
0,1,2,3
1,2,3,0
2,3,0,1
3,0,1,2
It would be great if some languages have this built in, but I want to know the algorithm to do this also so I understand.
Let's iterate over an array.
val arr = Array(10, 12, 20, 50)
for (i <- 0 to arr.length - 1) {
  println(arr(i))
}
With output:
10
12
20
50
Pretty basic.
What about:
val arr = Array(10, 12, 20, 50)
for (i <- 2 to (2 + arr.length - 1)) {
  println(arr(i))
}
Oops. Out of bounds. But what if we modulo that index by the length of the array?
val arr = Array(10, 12, 20, 50)
for (i <- 2 to (2 + arr.length - 1)) {
  println(arr(i % arr.length))
}
20
50
10
12
Now you just need to wrap it up in a function that replaces 2 in that example with an argument.
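For instance, a minimal sketch (the name printFrom is mine, not from the original answer):
def printFrom(arr: Array[Int], start: Int): Unit =
  for (i <- start until start + arr.length)
    println(arr(i % arr.length))

printFrom(Array(10, 12, 20, 50), 2)  // prints 20, 50, 10, 12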
There is no language built-in. There is a similar method, permutations, but it generates all permutations regardless of order, which doesn't really fit your need.
Your requirement can be implemented with a simple algorithm that just concatenates two slices:
def orderedPermutation(in: List[Int]): Seq[List[Int]] = {
  for (i <- 0 until in.size) yield
    in.slice(i, in.size) ++ in.slice(0, i)
}
orderedPermutation(List(10,12,20,50)).foreach(println)

Spark rdd split doesn't return last column?

I have the following data:
17|ABC|3|89|89|0|0|2|
17|DFD|3|89|89|0|0|2|
17|RFG|3|89|89|0|0|2|
17|TRF|3|89|89|0|0|2|
When I use the following code, I just get 8 elements instead of 9, since the last one doesn't contain any value. I can't use DataFrames, as my CSV is not fixed; every line can have a different number of elements. How can I get the last column value even if it's null/None?
My current code:
data_rdd.filter(x => x contains '|').map { line => line.split('|') }.foreach(elem => {
  println("size of element ->" + elem.size)
  elem.foreach { elem =>
    println(elem)
  }
})
In both Scala and Java, split will not return any trailing empty strings by default. Instead, you can use a slightly different version of split that takes a second argument (an overload that is also available from Scala and described in the Java docs).
The method definition is:
split(String regex, int limit)
The second argument limits how many times the regex pattern is applied; using a negative number applies it as many times as possible and keeps any trailing empty strings.
Therefore, change the code to use:
.map{line => line.split("\\|", -1)}
Note that this version of split takes a regex, not a plain string or char, which is why the pipe character must be escaped.
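For contrast, the default single-argument split drops the trailing empty string (a quick check of my own; the res number is illustrative):
scala> "17|ABC|3|89|89|0|0|2|".split("\\|")
res23: Array[String] = Array(17, ABC, 3, 89, 89, 0, 0, 2)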
You can split your string as below:
scala> "17|ABC|3|89|89|0|0|2|".split("\\|", -1)
res24: Array[String] = Array(17, ABC, 3, 89, 89, 0, 0, 2, "")
Updated code:
data_rdd.filter(x => x contains '|').map { line => line.split("\\|", -1) }.foreach(elem => {
  println("size of element ->" + elem.size)
  elem.foreach { elem =>
    println(elem)
  }
})

add prefix to spark rdd elements

I have two string elements in my RDD:
"53 45 61","0 1 2".
I would like to zip and map them together as key-value pairs, adding a prefix "C" to each of the keys.
expected output:
C53 -> 0, C45 -> 1, C61 -> 2
Currently this is the code I am using:
val prefix = "C"
newRDD = RDD.map(x => (prefix + (x._1.split(" ")) zip x._2.split(" ")))
This is the result I'm receiving:
53 -> 0, C45 -> 1, 61 -> 2
What am I missing here?
You're currently adding your prefix to the whole Array(53, 45, 61), not to its elements (I didn't know you could do that). Do you mean to do x._1.split(" ").map(prefix + _) to add it to each element instead?
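A minimal sketch of the corrected transformation (assuming the RDD holds pairs of strings as in the question; rdd and newRDD are illustrative names):
val prefix = "C"
val newRDD = rdd.map { case (keys, values) =>
  keys.split(" ").map(prefix + _) zip values.split(" ")
}
// each element becomes Array((C53,0), (C45,1), (C61,2))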

What are some good use cases of lazy evaluation in Scala?

When working with large collections, we usually hear the term "lazy evaluation". I want to better demonstrate the difference between strict and lazy evaluation, so I tried the following example: getting the first two even numbers from a list.
scala> var l = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
l: List[Int] = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
scala> l.filter(_ % 2 == 0).take(2)
res0: List[Int] = List(38, 46)
scala> l.toStream.filter(_ % 2 == 0).take(2)
res1: scala.collection.immutable.Stream[Int] = Stream(38, ?)
I noticed that when I'm using toStream, I'm getting Stream(38, ?). What does the "?" mean here? Does this have something to do with lazy evaluation?
Also, what are some good examples of lazy evaluation? When should I use it, and why?
One benefit of using lazy collections is to "save" memory, e.g. when mapping to large data structures. Consider this:
val r = (1 to 10000)
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
And using lazy evaluation:
val r = (1 to 10000).toStream
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
The first statement will generate 10000 Seqs of size 10000 and keep them all in memory, while in the second case only one Seq at a time needs to exist in memory, so it's much faster...
Another use case is when only part of the data is actually needed. I often use lazy collections together with take, takeWhile, etc., as in the sketch below.
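A small illustrative sketch (my own example, not from the original answer): find the first three squares above 1000 without computing any more than needed.
val firstThree = (1 to 1000000).toStream
  .map(n => n * n)      // only evaluated on demand
  .filter(_ > 1000)
  .take(3)
  .toList               // List(1024, 1089, 1156)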
Let's take a real-life scenario: instead of having a list, you have a big log file from which you want to extract the first 10 lines that contain "Success".
The straightforward solution would be to read the file line by line and, once you find a line that contains "Success", print it and continue to the next line.
But since we love functional programming, we don't want to use the traditional loops. Instead, we want to achieve our goal by composing functions.
First attempt:
Source.fromFile("log_file").getLines.toList.filter(_.contains("Success")).take(10)
Let's try to understand what actually happened here:
we read the whole file
we filtered the relevant lines
we took the first 10 elements
If we try to print Source.fromFile("log_file").getLines.toList, we will get the whole file, which is obviously a waste, since not all lines are relevant for us.
Why did we get all the lines, and only then perform the filtering? Because List is a strict data structure: when we call toList, it is evaluated immediately, and only after the whole data is in memory is the filtering applied.
Luckily, Scala provides lazy data structures, and Stream is one of them:
Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success")).take(10)
In order to demonstrate the difference, let's try:
Source.fromFile("log_file").getLines.toStream
Now we get something like:
scala.collection.immutable.Stream[String] = Stream(That's the first line, ?)
toStream evaluates only one element: the first line of the file. The next element is represented by a "?", which indicates that the stream has not evaluated it yet. That's because toStream is lazy, and the next item is evaluated only when it is used.
Now, when we apply the filter function, the stream reads lines until it finds the first one that contains "Success":
> var res = Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success"))
scala.collection.immutable.Stream[String] = Stream(First line contains Success!, ?)
Now we apply the take function. Still no action is performed, but the stream knows it should pick 10 lines, and nothing is evaluated until we use the result:
res foreach println
Finally, when we print res, we get a Stream containing the first 10 matching lines, as expected.
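Putting the pieces of this walkthrough together, the full lazy pipeline looks like this:
val res = Source.fromFile("log_file").getLines.toStream
  .filter(_.contains("Success"))
  .take(10)
res foreach println  // only now are lines read and filtered, and only as many as needed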

help rewriting in functional style

I'm learning Scala as my first functional-ish language. As one of the problems, I was trying to find a functional way of generating the sequence S up to n places. S is defined so that S(1) = 1, and S(x) = the number of times x appears in the sequence. (I can't remember what this is called, but I've seen it in programming books before.)
In practice, the sequence looks like this:
S = 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7 ...
I can generate this sequence pretty easily in Scala using an imperative style like this:
def genSequence(numItems: Int) = {
  require(numItems > 0, "numItems must be >= 1")
  var list: List[Int] = List(1)
  var seq_no = 2
  var no = 2
  var no_nos = 0
  var num_made = 1
  while (num_made < numItems) {
    if (no_nos < seq_no) {
      list = list :+ no
      no_nos += 1
      num_made += 1
    } else if (no % 2 == 0) {
      no += 1
      no_nos = 0
    } else {
      no += 1
      seq_no += 1
      no_nos = 0
    }
  }
  list
}
But I don't really have any idea how to write this without using vars and the while loop.
Thanks!
Pavel's answer has come closest so far, but it's also inefficient. Two flatMaps and a zipWithIndex are overkill here :)
My understanding of the required output:
The results contain all the positive integers (starting from 1) at least once
each number n appears in the output (n/2) + 1 times
As Pavel has rightly noted, the solution is to start with a Stream then use flatMap:
Stream from 1
This generates a Stream, a potentially never-ending sequence that only produces values on demand. In this case, it generates 1, 2, 3, 4... all the way up to infinity (in theory) or Int.MaxValue (in practice).
Streams can be mapped over, as with any other collection. For example: (Stream from 1) map { 2 * _ } generates a Stream of even numbers.
You can also use flatMap on Streams, allowing you to map each input element to zero or more output elements; this is key to solving your problem:
val strm = (Stream from 1) flatMap { n => Stream.fill(n/2 + 1)(n) }
So... How does this work? For the element 3, the lambda { n => Stream.fill(n/2 + 1)(n) } will produce the output stream 3,3. For the first 5 integers you'll get:
1 -> 1
2 -> 2, 2
3 -> 3, 3
4 -> 4, 4, 4
5 -> 5, 5, 5
etc.
and because we're using flatMap, these will be concatenated, yielding:
1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, ...
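You can check this against the 19-element sample from the question:
strm.take(19).toList
// List(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7)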
Streams are memoised, so once a given value has been calculated it'll be saved for future reference. However, all the preceding values have to be calculated at least once. If you want the full sequence then this won't cause any problems, but it does mean that generating S(10796) from a cold start is going to be slow! (a problem shared with your imperative algorithm). If you need to do this, then none of the solutions so far is likely to be appropriate for you.
The following code produces exactly the same sequence as yours:
val seq = Stream.from(1)
  .flatMap(Stream.fill(2)(_))
  .zipWithIndex
  .flatMap(p => Stream.fill(p._1)(p._2))
  .tail
However, if you want to produce the Golomb sequence (that complies with the definition, but differs from your sample code result), you may use the following:
val seq = 1 #:: a(2)
def a(n: Int): Stream[Int] = (1 + seq(n - seq(seq(n - 2) - 1) - 1)) #:: a(n + 1)
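As a quick sanity check (this needs to be compiled inside an object or pasted as one block, since seq and a refer to each other), the first terms match the Golomb sequence (OEIS A001462):
seq.take(16).toList
// List(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7)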
You may check my article for more examples of how to deal with number sequences in functional style.
Here is a translation of your code to a more functional style:
def genSequence(numItems: Int): List[Int] = {
  genSequenceR(numItems, 2, 2, 0, 1, List[Int](1))
}

def genSequenceR(numItems: Int, seq_no: Int, no: Int, no_nos: Int, numMade: Int, list: List[Int]): List[Int] = {
  if (numMade < numItems) {
    if (no_nos < seq_no) {
      genSequenceR(numItems, seq_no, no, no_nos + 1, numMade + 1, list :+ no)
    } else if (no % 2 == 0) {
      genSequenceR(numItems, seq_no, no + 1, 0, numMade, list)
    } else {
      genSequenceR(numItems, seq_no + 1, no + 1, 0, numMade, list)
    }
  } else {
    list
  }
}
genSequenceR is the recursive function that accumulates values in the list and calls itself with new values based on the conditions. Like the while loop, it keeps recursing while numMade is less than numItems, and then returns the list to genSequence.
This is a fairly rudimentary functional translation of your code. It can be improved and there are better approaches typically used. I'd recommend trying to improve it with pattern matching and then work towards the other solutions that use Stream here.
Here's an attempt from a Scala tyro. Keep in mind I don't really understand Scala, I don't really understand the question, and I don't really understand your algorithm.
def genX_Ys[A](howMany: Int, ofWhat: A): List[A] = howMany match {
  case 1 => List(ofWhat)
  case _ => ofWhat :: genX_Ys(howMany - 1, ofWhat)
}
def makeAtLeast(startingWith: List[Int], nextUp: Int, howMany: Int, minimumLength: Int): List[Int] = {
  if (startingWith.size >= minimumLength)
    startingWith
  else
    makeAtLeast(startingWith ++ genX_Ys(howMany, nextUp),
      nextUp + 1, howMany + (if (nextUp % 2 == 1) 1 else 0), minimumLength)
}
def genSequence(numItems: Int) = makeAtLeast(List(1), 2, 2, numItems).slice(0, numItems)
This seems to work, but re-read the caveats above. In particular, I am sure there is a library function that performs genX_Ys, but I couldn't find it.
EDIT: Could be
def genX_Ys[A](howMany: Int, ofWhat: A): Seq[A] =
  (1 to howMany) map { x => ofWhat }
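Incidentally, the standard library does have such a function: List.fill, the same fill used with Stream elsewhere in this thread.
List.fill(3)("x")   // List(x, x, x), equivalent to genX_Ys(3, "x")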
Here is a very direct "translation" of the definition of the Golomb seqence:
val it = Iterator.iterate((1, 1, Map(1 -> 1, 2 -> 2))) { case (n, i, m) =>
  val c = m(n)                                   // how often the current n still needs to appear
  if (c == 1) (n + 1, i + 1, m + (i -> n) - n)   // counter exhausted: move on to n+1, drop n's entry
  else (n, i + 1, m + (i -> n) + (n -> (c - 1))) // otherwise keep n and decrease its counter
}.map(_._1)
println(it.take(10).toList)                      // List(1, 2, 2, 3, 3, 4, 4, 4, 5, 5)
The triple (n, i, m) contains the actual number n, the index i, and a Map m which records how often an n must be repeated. When the counter in the Map for our n reaches 1, we increase n (and can drop n from the map, as it is no longer needed); otherwise we just decrease n's counter in the map and keep n. In either case we add the new pair i -> n into the map, which will be used as a counter later (when a subsequent n reaches the value of the current i).
[Edit]
Thinking about it, I realized that I don't need indexes or even a lookup (because the "counters" are already in the "right" order), which means that I can replace the Map with a Queue:
import collection.immutable.Queue
val it = 1 #:: Iterator.iterate((2, 2, Queue[Int]())) {
  case (n, 1, q) => (n + 1, q.head, q.tail.enqueue(n + 1))
  case (n, c, q) => (n, c - 1, q.enqueue(n))
}.map(_._1).toStream
The Iterator works correctly when starting from 2, so I had to add a 1 at the beginning. The second tuple element is now the counter for the current n (taken from the Queue). The counter could be kept in the Queue as well, so we'd have only a pair, but then it's less clear what's going on due to the more complicated Queue handling:
val it = 1 #:: Iterator.iterate((2, Queue[Int](2))) {
  case (n, q) if q.head == 1 => (n + 1, q.tail.enqueue(n + 1))
  case (n, q) => (n, ((q.head - 1) +: q.tail).enqueue(n))
}.map(_._1).toStream
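A quick check of the final version (forcing the lazy Stream with take):
println(it.take(10).toList)   // List(1, 2, 2, 3, 3, 4, 4, 4, 5, 5)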