Scala getLines() with yield not as I expect - scala

My understanding of the "Programming in Scala" book is that the following should return an Array[String] when instead it returns an Iterator[String]. What am I missing?
val data = for (line <- Source.fromFile("data.csv").getLines()) yield line
I'm using Scala 2.9.
Thanks in advance.

The chapter you want to read to understand what's happening is http://www.artima.com/pins1ed/for-expressions-revisited.html
for (x <- expr_1) yield expr_2
is translated to
expr_1.map(x => expr_2)
So if expr_1 is an Iterator[String] as it is in your case, then expr_1.map(line => line) is also an Iterator[String].
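To make the translation concrete, here is a minimal sketch. It uses an in-memory Iterator[String] as a stand-in for getLines(), and the names ForYieldDemo, sampleLines, viaFor and viaMap are made up for illustration.

```scala
object ForYieldDemo {
  // Stand-in for Source.fromFile("data.csv").getLines(); any Iterator[String] behaves the same way.
  def sampleLines(): Iterator[String] = Iterator("a,b", "c,d")

  // The for/yield form from the question. The result type follows the generator, so it is an Iterator.
  def viaFor(lines: Iterator[String]): Iterator[String] =
    for (line <- lines) yield line

  // The map call that the for/yield above is translated to.
  def viaMap(lines: Iterator[String]): Iterator[String] =
    lines.map(line => line)
}
```

Both methods type-check only with Iterator[String] as the result; declaring either one as Array[String] would be a compile error, which matches what the question observed.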

Nope, it returns an Iterator. See: http://www.scala-lang.org/api/current/index.html#scala.io.BufferedSource
But the following should work if an Array is your goal:
Source.fromFile("data.csv").getLines().toArray
If you want to convert an Iterator to an Array (as mentioned in your comment), then try the following after you've yielded your Iterator:
data.toArray

@dhg is correct, and here is a bit more detail on why.
The code in your example calls Source.fromFile, which returns a BufferedSource. Then you call getLines, which returns an Iterator[String]. The for/yield maps over that iterator, so what ends up stored in data is still an iterator.
Calling toArray on the Iterator will get you an array of Strings like you want.
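As a rough sketch of the whole round trip, the helper below reads a file into an Array[String]. The name readLinesToArray is hypothetical, and closing the Source is an addition that the answers above do not discuss; the array has to be built before the close because the iterator reads lazily from the open file.

```scala
import scala.io.Source

object ReadLines {
  // Reads every line of the file at `path` into an Array[String].
  def readLinesToArray(path: String): Array[String] = {
    val source = Source.fromFile(path)
    // toArray consumes the lazy iterator while the file is still open.
    try source.getLines().toArray
    finally source.close()
  }
}
```

Usage would look like val data: Array[String] = ReadLines.readLinesToArray("data.csv").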

Related

Equivalent of Iterator.continually for an Iterable?

I need to produce a java.lang.Iterable[T] where the supply of T is some long-running operation. In addition, after T is supplied, it is wrapped and further computation is done to prepare for the next iteration.
Initially I thought I could do this with Iterator.continually. However, calling toIterable on the result actually creates a Stream[T] - with the problem being that the head is eagerly evaluated, which I don't want.
How can I either:
Create an Iterable[T] from a supplying function or
Convert an Iterator[T] into an Iterable[T] without using Stream?
In Scala 2.13, you can use LazyList:
LazyList.continually(1)
Unlike Stream, LazyList is also lazy in its head.
Because java.lang.Iterable is a very simple API, it's trivial to go from a scala.collection.Iterator to it.
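import scala.jdk.CollectionConverters._ // provides asJava in Scala 2.13; older versions use scala.collection.JavaConverters._
case class IterableFromIterator[T](override val iterator:java.util.Iterator[T]) extends java.lang.Iterable[T]
val iterable:java.lang.Iterable[T] = IterableFromIterator(Iterator.continually(...).asJava)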
Note this contradicts the expectation that iterable.iterator() produces a fresh Iterator each time; instead, iterable.iterator() can only be called once.
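If the single-use restriction is a problem, one possible alternative is to build the Iterable from the supplying function itself, so that each iterator() call hands out a new iterator. This is only a sketch: SupplierIterable and continually are hypothetical names, and it implements java.util.Iterator directly instead of converting a Scala iterator. Every iterator still draws from the same supplier, so any state inside the supplier is shared between them.

```scala
object SupplierIterable {
  // Wraps a supplying function as a java.lang.Iterable without evaluating it up front.
  def continually[T](supply: () => T): java.lang.Iterable[T] =
    new java.lang.Iterable[T] {
      // A new, never-ending iterator is created on every call.
      override def iterator(): java.util.Iterator[T] =
        new java.util.Iterator[T] {
          override def hasNext(): Boolean = true
          // The supplier runs only when an element is actually requested.
          override def next(): T = supply()
        }
    }
}
```

Nothing is computed when the Iterable or an iterator is created; the long-running operation is invoked once per next() call.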

Apache Spark in Scala not printing rdd values

I am new to Spark and Scala as well, so this might be a very basic question.
I created a text file with 4 lines of some words. The rest of the code is as below:
val data = sc.textFile("file:///home//test.txt").map(x=> x.split(" "))
println(data.collect)
println(data.take(2))
println(data.collect.foreach(println))
All of the above println calls produce output like: [Ljava.lang.String;@1ebec410
Any idea how I can display the actual contents of the RDD? I have even tried saveAsTextFile, and it also saves the same line as java...
I am using the IntelliJ IDE for Spark with Scala, and yes, I have gone through other related posts, but they did not help. Thank you in advance.
The type of data is RDD[Array[String]], so each element you were printing is an Array[String], which prints as something like [Ljava.lang.String;@1ebec410. Arrays do not override toString(), so you get the default Object.toString(): the JVM type name followed by the object's hash code in hex.
You can convert each Array[String] to a List[String] with toList. List overrides toString() and shows its contents.
That means if you try
data.collect.foreach(arr => println(arr.toList))
this will show you the content. Or, as @Raphael has suggested,
data.collect().foreach(arr => println(arr.mkString(", ")))
will also work, because arr.mkString(", ") converts the array into a single String with the elements separated by ", ".
Hope this clears up your doubt.
Thanks
data is of type RDD[Array[String]]; what you print is the toString of each Array[String] ([Ljava.lang.String;@1ebec410). Try this:
data.collect().foreach(arr => println(arr.mkString(", ")))
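The same behavior can be reproduced without Spark. The sketch below uses a plain Array[Array[String]] as a stand-in for the result of data.collect(); the sample lines and the names ArrayPrinting, sampleRows and render are invented for illustration.

```scala
object ArrayPrinting {
  // Stand-in for data.collect(): lines split on spaces, as in the question.
  def sampleRows(): Array[Array[String]] =
    Array("spark is fast", "scala is fun").map(_.split(" "))

  // Renders each row's contents instead of relying on the array's default toString.
  def render(rows: Array[Array[String]]): List[String] =
    rows.map(_.mkString(", ")).toList
}
```

Calling ArrayPrinting.render(ArrayPrinting.sampleRows()).foreach(println) prints the words of each line, whereas println on a row itself prints the [Ljava.lang.String;@... form.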

Tail on scala iterator

I want to check that every line in a file, apart from the first header line, contains the string "14022015". I wanted to do this the Scala way and I came up with something clever (I thought) using foldLeft:
assert(Source.fromFile(new File(s"$outputDir${File.separator}priv.txt"))
.getLines().foldLeft(true)((bool, line) => (bool && line.contains("14022015"))))
Until I found out about the header line, which needs to be excluded from the test. tail will not work as getLines returns an Iterator and not a List. Is there something else I can do (Scala wise)?
Simply:
val res: Boolean = myFile.getLines.drop(1).forall(_.contains("14022015"))
iterator.drop(1) will help you achieve exactly what you need.
UPD: A side note: consider using a short-circuiting solution (a recursive one, for example) instead of fold in this case. fold always scans the entire iterator, and based on your code it looks like you want to stop at the first failing line, just as a && b && c stops evaluating as soon as it encounters a false. forall, as used above, already stops at the first line that fails the check.
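A small sketch of the drop(1).forall approach on an in-memory iterator follows. HeaderCheck and allDataLinesContain are hypothetical names, and the iterator stands in for myFile.getLines.

```scala
object HeaderCheck {
  // True when every line after the first (header) line contains `needle`.
  // drop(1) skips the header; forall stops at the first line that fails the check.
  def allDataLinesContain(lines: Iterator[String], needle: String): Boolean =
    lines.drop(1).forall(_.contains(needle))
}
```

For a file that has only a header line, or no lines at all, this returns true, because forall over an empty iterator is true.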

How to iterate through lazy iterable in scala? from stanford-tmt

Scala newbie here.
I'm using Stanford's topic modelling toolkit, and it has a lazy iterable of type LazyIterable[(String, Array[Double])].
How should I iterate through all the elements in this iterable, say, to print all of these values?
I tried doing this with
while(it.hasNext){
  System.out.println(it.next())
}
This gives an error:
error: value next is not a member of scalanlp.collection.LazyIterable[(String, Array[Double])]
This is the API source -> iterable_name ->
InferCVB0DocumentTopicDistributions in
http://nlp.stanford.edu/software/tmt/tmt-0.4/api/edu/stanford/nlp/tmt/stage/package.html
Based on its source code, I can see that the LazyIterable implements the standard Scala Iterable interface, which means you have access to all the standard higher-order functions that all Scala collections implement - such as map, flatMap, filter, etc.
The one you will be interested in for printing all the values is foreach. So try this (no need for the while-loop):
it.foreach(println)
This looks like a method invocation problem. Check the source code of LazyIterable and look at line 46:
override def iterator : Iterator[A]
Once you have an instance of LazyIterable, call its iterator method. The resulting Iterator has hasNext and next, so you can then do what you want.
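Both suggestions can be sketched against the standard Iterable interface, assuming (as the first answer states) that LazyIterable implements it. PrintTopics, formatRow and formatAll are hypothetical names. One detail worth noting: each element is a tuple containing an Array[Double], and arrays print as [D@... by default, so the sketch formats the array with mkString.

```scala
object PrintTopics {
  // Formats one (id, distribution) pair with the array contents spelled out.
  def formatRow(row: (String, Array[Double])): String =
    row._1 + ": " + row._2.mkString(", ")

  // Explicit-iterator version of the loop from the question:
  // obtain the iterator first, then use hasNext and next on it.
  def formatAll(rows: Iterable[(String, Array[Double])]): List[String] = {
    val it = rows.iterator
    val out = List.newBuilder[String]
    while (it.hasNext) out += formatRow(it.next())
    out.result()
  }
}
```

With foreach, the equivalent would be rows.foreach(row => println(PrintTopics.formatRow(row))).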

Removing empty strings from maps in scala

val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
In the above code snippet, the tokenize function may return empty strings. How do I skip adding them to the map in that case, or remove the empty entries after adding them to the map?
tokenizedLines.filter(_.nonEmpty)
The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String; instead, it is added through an implicit conversion. With value classes being used, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.
Instead, one could call String's own isEmpty, which avoids the conversion. RDD has no filterNot, so negate the predicate inside filter:
tokenizedLines.filter(!_.isEmpty)
You could use flatMap with Option.
Something like this:
lines.flatMap {
  case "" => None
  case s => Some(s)
}
val tokenizedLines = (lines.map(Tokenizer.tokenize)).filter(_.nonEmpty)
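The answers above can be compared on a plain Seq[String], which offers map, filter and flatMap with the same shape as an RDD. In this sketch, tokenize is a hypothetical stand-in (it just trims the line), since the real Tokenizer.tokenize is not shown in the question, and DropEmpty, viaFilter and viaFlatMap are made-up names.

```scala
object DropEmpty {
  // Hypothetical stand-in for Tokenizer.tokenize; the real function is not shown in the question.
  def tokenize(line: String): String = line.trim

  // Tokenize, then keep only the non-empty results.
  def viaFilter(lines: Seq[String]): Seq[String] =
    lines.map(tokenize).filter(_.nonEmpty)

  // Same result with flatMap: None contributes nothing, Some(t) contributes t.
  def viaFlatMap(lines: Seq[String]): Seq[String] =
    lines.map(tokenize).flatMap(t => if (t.isEmpty) None else Some(t))
}
```

Both variants drop the empty entries after tokenizing and keep the remaining elements in their original order.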