How to write program in Spark to replace word - scala

In Hadoop it is easy to do this with .replace(), for example:
String[] valArray = value.toString().replace("\\N", "")
But it doesn't work in Spark. I wrote this Scala in spark-shell:
val outFile=inFile.map(x=>x.replace("\\N",""))
So how should I deal with it?

For some reason your x is an Array[String]. How did you get it like that? You can call .toString.replace on it if you like, but that will probably not give you what you want (and would give the wrong output in Java anyway). You probably want another layer of map: inFile.map(x => x.map(_.replace("\\N","")))
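To illustrate the nested map, here is a minimal sketch that uses a plain Seq[Array[String]] as a stand-in for the RDD, so it runs without Spark. The sample rows and the names ReplaceSketch, rows, and cleaned are made up for the example; the same lambda should work unchanged inside inFile.map if inFile really is an RDD[Array[String]].

```scala
object ReplaceSketch {
  // Hypothetical rows standing in for the contents of an RDD[Array[String]].
  // "\\N" is the two-character string backslash + N.
  val rows: Seq[Array[String]] = Seq(Array("a", "\\N", "c"), Array("\\N", "e"))

  // The outer map walks the rows, the inner map walks the fields of each row.
  val cleaned: Seq[Array[String]] = rows.map(_.map(_.replace("\\N", "")))
}
```

String.replace takes literal text, not a regex, so only the Scala string escape needs the doubled backslash.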

Is it Scala style to use a for loop in Scala/Spark?

I have heard that it is good practice in Scala to eliminate for loops and do things "the Scala way". I even found a Scala style checker at http://www.scalastyle.org. Are for loops a no-no in Scala? In a course at https://www.udemy.com/course/apache-spark-with-scala-hands-on-with-big-data/learn/lecture/5363798#overview I found this example, which makes me think that for loops are okay to use, as long as they follow Scala's format and syntax, on a single line rather than spread over multiple lines like traditional Java for loops. See this example from that Udemy course:
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
That for loop prints this result, as expected:
Enterprise
Defiant
Voyager
Deep Space Nine
I was wondering whether using for as in the example above is acceptable Scala style, or whether it is a no-no, and why. Thank you!
There is no problem with this for loop, but you can use the methods of List to do the same work in a more functional way.
E.g. instead of using
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
You can use
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
shipList.foreach(element => println(element) )
or
shipList.foreach(println)
You can use for loops in Scala; there is no problem with that. The difference is that a for loop without yield evaluates to Unit, so it produces no useful value, and you need a mutable variable to get a result out of it. Scala prefers working with immutable values.
In your example you print messages to the console, which is a side effect and breaks referential transparency. To get a value out of such a loop you either depend on the IO operation or mutate a variable in the enclosing scope, and that variable may be accessed by another thread or concurrent task, so there is no guarantee that the value you collect is the one you expect. These concerns apply mainly to concurrent/parallel programming, which is where Scala and the immutable style help.
A for loop is fine for showing the elements of a collection, but to count the total number of characters you would use an expression like:
val chars = shipList.foldLeft(0)((a, b) => a + b.length)
To sum up, most Scala code you will read uses an immutable style of programming, though not always, because Scala supports the other style too. It is unusual, however, to find code in a classic Java OOP style that mutates object instances and uses getters and setters.
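The contrast between these styles can be sketched side by side. This is only an illustration; the object name LoopStyles and the member names are invented for the example, and the ship list is the one from the question.

```scala
object LoopStyles {
  val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")

  // for/yield is an expression: it builds a new List (it desugars to map).
  val lengths: List[Int] = for (ship <- shipList) yield ship.length

  // A for loop without yield returns Unit, so a result needs a mutable var.
  def totalCharsWithVar: Int = {
    var total = 0
    for (ship <- shipList) total += ship.length
    total
  }

  // The same total as a single expression, with no mutation.
  val totalChars: Int = shipList.foldLeft(0)((acc, ship) => acc + ship.length)
}
```

Both totals are equal; the fold just avoids the var, which matters most once the code runs concurrently.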

recursive variable needs type

I have code where I want to update an RDD as below:
val xRDD = xRDD.zip(tempRDD)
This gave me the error: recursive value x needs type
I want to maintain xRDD across iterations, modifying it with tempRDD in each iteration. How can I achieve this?
Thanks in advance.
The compiler is telling you that you're attempting to define a value in terms of itself: xRDD appears in its own definition. To say this another way, you're attempting to use something that doesn't exist yet in order to define it.
Edit:
If you have a list of actions that each produce a new RDD that you'd like to zip together, perhaps you should look at a fold:
listMyActions.foldLeft(origRDD) { (rdd, f) =>
  val tempRDD = f(rdd)
  rdd.zip(tempRDD)
}
Don't forget that vals are immutable, which means you can't reassign a previously defined val. If you do want reassignment, you can use a var instead, although that is not recommended. This question is more about Scala than about Apache Spark. For more information, you can consult this post: Use of def val and vars in scala.
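Here is a runnable sketch of the fold idea, using a plain Vector[Int] as a stand-in for the RDD so that it runs without Spark. Everything in it is an assumption made for illustration: the name ZipFoldSketch, the two hypothetical steps, and in particular the extra map after zip, which merges each pair back into an Int so that the accumulator keeps one type across iterations (a bare zip changes the element type to a tuple on every pass).

```scala
object ZipFoldSketch {
  // Hypothetical per-iteration steps; each derives a "temp" collection
  // from the current one.
  val steps: List[Vector[Int] => Vector[Int]] = List(_.map(_ * 2), _.map(_ + 1))

  def run(start: Vector[Int]): Vector[Int] =
    steps.foldLeft(start) { (current, step) =>
      val temp = step(current)
      // zip pairs the elements up; map merges each pair back into an Int.
      current.zip(temp).map { case (a, b) => a + b }
    }
}
```

The fold threads the updated collection from one iteration to the next, so no val is ever redefined and no var is needed.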

Create SortedMap from Iterator in scala

I have a val it:Iterator[(A,B)] and I want to create a SortedMap[A,B] with the elements I get out of the Iterator. The way I do it now is:
val map = SortedMap[A,B]() ++ it
It works fine but feels a little bit awkward to use. I checked the SortedMap doc but couldn't find anything more elegant. Is there something like:
it.toSortedMap
or
SortedMap.from(it)
in the standard Scala library that maybe I've missed?
Edit: mixing both ideas from @Rex's answer, I came up with this:
SortedMap(it.to[Seq]:_*)
Which works just fine and avoids having to specify the type signature of SortedMap. Still looks funny though, so further answers are welcome.
The feature you are looking for does exist for other combinations, but not the one you want. If your collection requires just a single parameter, you can use .to[NewColl]. So, for example,
import collection.immutable._
Iterator(1,2,3).to[SortedSet]
Also, the SortedMap companion object has a varargs apply that can be used to create sorted maps like so:
SortedMap( List((1,"salmon"), (2,"herring")): _* )
(note the : _* which means use the contents as the arguments). Unfortunately this requires a Seq, not an Iterator.
So your best bet is the way you're doing it already.
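For comparison, here is a small sketch of the options side by side, with hypothetical sample data and made-up names (SortedMapSketch, pairs). The last line assumes Scala 2.13 or later, where, as far as I know, the SortedMap companion gained a from method that accepts an Iterator directly; on 2.12 and earlier only the first two forms apply.

```scala
import scala.collection.immutable.SortedMap

object SortedMapSketch {
  // Hypothetical sample data. An Iterator can only be traversed once,
  // so this def builds a fresh one on every call.
  def pairs: Iterator[(Int, String)] = Iterator(2 -> "herring", 1 -> "salmon")

  // The approach from the question: start empty and append the iterator.
  val viaConcat: SortedMap[Int, String] = SortedMap.empty[Int, String] ++ pairs

  // The varargs apply needs a Seq, hence toSeq.
  val viaVarargs: SortedMap[Int, String] = SortedMap(pairs.toSeq: _*)

  // Assumes Scala 2.13+: from accepts any IterableOnce, including an Iterator.
  val viaFrom: SortedMap[Int, String] = SortedMap.from(pairs)
}
```

All three produce the same map, ordered by key.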

Reading a two-dimensional array using Scala

Suppose I have a text file named "input.txt" and I want to read it in with Scala. The dimensions of the file are not known in advance.
So how do I construct an Array[Array[Float]] from it? I want a simple and neat way, rather than writing Java-style code that iterates over the lines and parses each number. I think functional programming should be quite good at this, but I have not come up with a solution so far.
Best Regards
If your input is well-formed, you can do it this way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
Update: for additional information about using Source check this thread
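A slightly more defensive variant might look like the sketch below. It assumes Scala 2.13 or later for scala.util.Using, and the names MatrixReader, parse, and readMatrix are invented for the example. Parsing is split from file handling so that it can be tried on in-memory lines, and rows are split on any run of whitespace rather than a single space.

```scala
import scala.io.Source
import scala.util.Using

object MatrixReader {
  // Parses whitespace-separated numbers, one row per line; blank lines are skipped.
  // toFloat still throws NumberFormatException on malformed tokens.
  def parse(lines: Iterator[String]): Array[Array[Float]] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.split("\\s+").map(_.toFloat)).toArray

  // Using.resource closes the Source even if parsing throws.
  // toArray inside parse forces the lazy iterator before the file is closed.
  def readMatrix(path: String): Array[Array[Float]] =
    Using.resource(Source.fromFile(path))(src => parse(src.getLines()))
}
```

Usage would be MatrixReader.readMatrix("input.txt"); rows may have different lengths, since nothing here enforces a rectangular shape.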

Get only the file names using listFiles in Scala

Is there an idiomatic Scala solution to obtain only the file names from File.listFiles?
Perhaps something like:
val names = new File(dir).listFiles.somemagicTrait(_getName)
and have names become a List[String]?
I know I can just loop and add them to a mutable list.
How about:
new File(dir).listFiles.map(_.getName).toList
I'm always wary of answering the wrong part of the question, but as Jean-Phillipe commented, you can get an array of the names from
new File(dir).list
and if you really need a list, call toList on that.
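One caveat worth sketching: both File.listFiles and File.list return null when the path is not a directory or cannot be read, so calling map on the result directly can throw a NullPointerException. A minimal null-safe version, with the hypothetical names FileNames and fileNames, and with an extra filter that drops subdirectories (an assumption about what "file names" means here), could be:

```scala
import java.io.File

object FileNames {
  // Returns the names of the regular files directly inside dir,
  // or Nil when dir is not a readable directory.
  def fileNames(dir: File): List[String] =
    Option(dir.listFiles)   // listFiles returns null instead of throwing
      .map(_.toList)
      .getOrElse(Nil)
      .filter(_.isFile)     // assumption: subdirectories are not wanted
      .map(_.getName)
}
```

Drop the filter line if directory names should be included too. The order of the result is whatever the file system returns, so sort it if the order matters.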