Merging a list of Strings using mkString vs foldRight - scala

I am currently trying out things in Scala, trying to get accustomed to functional programming as well as learning a new language again (it's been a while since the last time).
Now given a list of strings if I want to merge them into one long string (e.g. "scala", "is", "fun" => "scalaisfun") I figured one way to do it would be to do a foldRight and apply concatenation on the respective elements. Another way, admittedly much simpler, is to call mkString.
I checked on GitHub but couldn't really find the source code for the respective functions (any help on that would be appreciated), so I am not sure how the functions are implemented. Off the top of my head, I think mkString is more flexible, but it feels like there might be a foldRight somewhere in its implementation. Is there any truth to that?
Also, the scaladocs mention that mkString calls toString on each element. Seeing that they are already strings to start with, that could be one negative point for mkString in this particular case. Any comments on the pros and cons of both methods, with respect to performance, simplicity/elegance etc?

Simple answer: use mkString.
someString.toString returns the same object.
mkString is implemented with a single StringBuilder and it creates only 1 new string. With foldLeft you'll create N-1 new strings.
You could use StringBuilder in foldLeft, it will be as fast as mkString, but mkString is shorter:
strings.foldLeft(new StringBuilder){ (sb, s) => sb append s }.toString
strings.mkString // same result, at least the same speed
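To see the three variants side by side, here is a small sketch. The object and value names are made up for illustration, and the sample list is the one from the question.

```scala
object MergeStrings {
  val strings = List("scala", "is", "fun")

  // Plain concatenation in a fold: every step allocates a new intermediate String.
  val viaConcat: String = strings.foldLeft("")(_ + _)

  // One StringBuilder threaded through the fold: only the final toString creates a String.
  val viaBuilder: String =
    strings.foldLeft(new StringBuilder)((sb, s) => sb.append(s)).toString

  // mkString does the StringBuilder bookkeeping for you.
  val viaMkString: String = strings.mkString

  def main(args: Array[String]): Unit =
    println(Seq(viaConcat, viaBuilder, viaMkString).distinct) // List(scalaisfun)
}
```

All three produce the same string; they differ only in how many temporary objects are created along the way.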

Don't use foldRight unless you really need it, as it will overflow your stack for large collections (for some types of collections). foldLeft or fold will work (does not store intermediate data on the stack), but will be slower and more awkward than mkString. If the list is nonempty, reduce and reduceLeft will also work.
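The "nonempty" caveat for reduce can be sketched like this (names are illustrative only):

```scala
object ReduceDemo {
  val words = List("scala", "is", "fun")
  val none = List.empty[String]

  // reduceLeft needs at least one element to start from.
  val reduced: String = words.reduceLeft(_ + _)

  // On an empty list reduceLeft throws UnsupportedOperationException;
  // reduceLeftOption returns None instead.
  val safe: Option[String] = none.reduceLeftOption(_ + _)

  // foldLeft and mkString need no special case for the empty list.
  val folded: String = none.foldLeft("")(_ + _)
  val made: String = none.mkString
}
```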

If memory serves, mkString uses a StringBuilder to build the String, which is efficient. You could accomplish the same thing using a Scala StringBuilder as the accumulator to foldRight, but why bother if mkString can already do all that good stuff for you? Plus, mkString gives you the added benefit of an optional delimiter. You could do that in foldRight too, but it's already done for you with mkString.
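A quick sketch of the delimiter point, with made-up names. The foldRight version is only a rough equivalent: it decides whether to add a separator by checking for an empty accumulator, so it would misplace separators if the list itself contained empty strings.

```scala
object DelimiterDemo {
  val words = List("scala", "is", "fun")

  // Separator only.
  val spaced: String = words.mkString(" ")

  // Prefix, separator and suffix.
  val bracketed: String = words.mkString("[", ", ", "]")

  // Hand-rolled version with foldRight: the separator logic is now your problem.
  val viaFoldRight: String =
    words.foldRight("")((w, acc) => if (acc.isEmpty) w else w + " " + acc)
}
```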

Related

Is it Scala style to use a for loop in Scala/Spark?

I have heard that it is good practice in Scala to eliminate for loops and do things "the Scala way". I even found a Scala style checker at http://www.scalastyle.org. Are for loops a no-no in Scala? In a course at https://www.udemy.com/course/apache-spark-with-scala-hands-on-with-big-data/learn/lecture/5363798#overview I found this example, which makes me think that for loops are okay to use, but using the Scala format and syntax of course, in a single line and not like the traditional Java for loops in multiple lines of code. See this example I found from that Udemy course:
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
That for loop prints this result, as expected:
Enterprise
Defiant
Voyager
Deep Space Nine
I was wondering if using for as in the example above is acceptable Scala style, or if it is a no-no, and why. Thank you!
There is no problem with this for loop, but you can use the methods of List to do the same work in a more functional way.
e.g. instead of using
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
You can use
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
shipList.foreach(element => println(element) )
or
shipList.foreach(println)
You can use for loops in Scala; there is no problem with that. The difference is that this for loop does not produce a useful value (its result is Unit), so you need a mutable variable to get any result out of it. Scala prefers working with immutable values.
In your example you print messages to the console. To get a value out of such a loop you need a side effect, which breaks referential transparency: either you depend on the IO operation to extract the value, or you have to mutate a variable in the enclosing scope. That variable may be accessed by another thread or concurrent task, so there is no guarantee that the value you collect will be what you expect. Admittedly, these concerns apply mainly to concurrent or parallel programming, which is where Scala and the immutable style help.
To show the elements of a collection you can use a for loop, but if you want to count the total number of chars, in Scala you would use an expression like:
val chars = shipList.foldLeft(0)((a, b) => a + b.length)
To sum up, most of the Scala code that you will read uses an immutable style of programming, although not always, because Scala supports the other way of coding too. Still, it is unusual to find code written in a classic Java OOP style, mutating object instances and using getters and setters.
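As a rough illustration of the difference (names other than shipList are made up), a for with yield is an expression that returns a new collection, while a plain for loop has to mutate something to produce a result:

```scala
object ShipDemo {
  val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")

  // for ... yield is an expression: it builds a new list from the old one.
  val lengths: List[Int] = for (ship <- shipList) yield ship.length

  // Same total as the foldLeft version, with no mutable state.
  val totalChars: Int = shipList.map(_.length).sum

  // The imperative alternative: works, but needs a var that the loop mutates.
  def totalCharsMutable: Int = {
    var total = 0
    for (ship <- shipList) total += ship.length
    total
  }
}
```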

What's the most idiomatic way to express an iterable of a single element in Scala?

So far when I need to pass around an Iterable that consists of just one element, I pass a Some value; but that requires an implicit conversion.
In Java I would use java.util.Collections.singleton, and I guess there's something equivalent in Scala that better fits this use case.
Iterable(x), just as to get a Seq of a single element you write Seq(x), List(x), etc.
After taking a look at the implementation of the apply methods (constructors) of the collections that could fit the bill (Iterable, Seq, List), I see that they take varargs, which requires wrapping the object in an array and then looping over it or calling another method.
So I think I'm going to stick with consing the object like x :: Nil; that looks like the most lightweight way to achieve it and it's explicit that you are making a collection.
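For comparison, the options mentioned in this thread look like this in a small sketch (value names are made up; the Option variant converts explicitly with toList instead of relying on the implicit conversion):

```scala
object SingletonDemo {
  val x = 42

  // Varargs factory on the Iterable companion.
  val viaIterable: Iterable[Int] = Iterable(x)

  // Consing onto the empty list: a single list cell, no varargs array.
  val viaCons: Iterable[Int] = x :: Nil

  // Explicit conversion from an Option.
  val viaOption: Iterable[Int] = Some(x).toList
}
```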

Immutable DataStructures In Scala

We know that Scala supports immutable data structures, i.e. each time you update a list it creates a new object and a new reference on the heap.
Example
val xs:List[Int] = List.apply(22)
val newList = xs :+ 33
So when I append the second element to the list, it creates a new list which contains both 22 and 33. This works exactly like immutable Strings in Java.
So the question is: each time I append an element to the list, a new object is created. This does not look efficient to me.
Are some special data structures, such as persistent data structures, used to deal with this? Does anyone know about this?
Appending to a list has O(n) complexity and is inefficient. A general approach is to prepend to a list while building it, and finally reverse it.
Now, your question on creating new object still applies to the prepend. Note that since xs is immutable, newList just points to xs for the rest of the data after the prepend.
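A minimal sketch of both points, assuming nothing beyond the standard library (the object name and the squares helper are made up for illustration). The eq check compares references, so it shows that the prepended list reuses xs rather than copying it.

```scala
import scala.annotation.tailrec

object ListSharing {
  val xs: List[Int] = List(22)

  // Prepending allocates one new cell whose tail is the existing list.
  val prepended: List[Int] = 33 :: xs
  val tailIsShared: Boolean = prepended.tail eq xs // reference equality

  // Appending has to copy the cells of xs to put 33 at the end.
  val appended: List[Int] = xs :+ 33

  // Build by prepending, then reverse once at the end.
  def squares(n: Int): List[Int] = {
    @tailrec
    def loop(i: Int, acc: List[Int]): List[Int] =
      if (i > n) acc.reverse else loop(i + 1, (i * i) :: acc)
    loop(1, Nil)
  }
}
```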
While @manojlds is correct in his analysis, the original post asked about the efficiency of duplicating list nodes whenever you do an operation.
As @manojlds said, constructing lists often requires thinking backwards, i.e., building a list and then reversing it. There are a number of other situations where list building requires "needless" copying.
To that end, there is a mutable data structure available in Scala called ListBuffer which you can use to build up your list and then extract the result as an immutable list:
import scala.collection.mutable.ListBuffer
val xsa = ListBuffer[Int](22)
xsa += 33
val newList = xsa.toList
However, the fact that the list data structure is, in general, immutable means that you have some very useful tools to analyze, de-compose and re-compose the list. Many builtin operations take advantage of the immutability. By extension, your own programs can also take advantage of this immutability.

Is string concatenation in scala as costly as it is in Java?

In Java, it's a common best practice to do string concatenation with StringBuilder due to the poor performance of appending strings using the + operator. Is the same practice recommended for Scala, or has the language improved on how Java performs its string concatenation?
Scala uses Java strings (java.lang.String), so its string concatenation is the same as Java's: the same thing is taking place in both. (The runtime is the same, after all.) There is a special StringBuilder class in Scala, that "provides an API compatible with java.lang.StringBuilder"; see http://www.scala-lang.org/api/2.7.5/scala/StringBuilder.html.
But in terms of "best practices", I think most people would generally consider it better to write simple, clear code than maximally efficient code, except when there's an actual performance problem or a good reason to expect one. The + operator doesn't really have "poor performance", it's just that s += "foo" is equivalent to s = s + "foo" (i.e. it creates a new String object), which means that, if you're doing a lot of concatenations to (what looks like) "a single string", you can avoid creating unnecessary objects — and repeatedly recopying earlier portions from one string to another — by using a StringBuilder instead of a String. Usually the difference is not important. (Of course, "simple, clear code" is slightly contradictory: using += is simpler, using StringBuilder is clearer. But still, the decision should usually be based on code-writing considerations rather than minor performance considerations.)
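To make the difference concrete, here is a small sketch with two hypothetical helpers. Both return the same result; the first creates a new String on every iteration, the second appends into a single buffer.

```scala
object ConcatDemo {
  // s += p is s = s + p: each step builds a new String and recopies what came before.
  def joinWithPlus(parts: Seq[String]): String = {
    var s = ""
    for (p <- parts) s += p
    s
  }

  // One StringBuilder for the whole loop; a String is created only at the end.
  def joinWithBuilder(parts: Seq[String]): String = {
    val sb = new StringBuilder
    for (p <- parts) sb.append(p)
    sb.toString
  }
}
```

For a handful of short strings the two are practically indistinguishable; the builder only starts to matter when the loop runs many times.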
Scala's String concatenation works the same way as Java's does.
val x = 5
"a"+"b"+x+"c"
is translated to
new StringBuilder().append("ab").append(BoxesRunTime.boxToInteger(x)).append("c").toString()
StringBuilder is scala.collection.mutable.StringBuilder. That's the reason why the value appended to the StringBuilder is boxed by the compiler.
You can check this behavior by disassembling the bytecode with javap.
I want to add: if you have a sequence of strings, then there is already a method to create a new string out of them (all items, concatenated). It's called mkString.
Example: (http://ideone.com/QJhkAG)
val example = Seq("11111", "2222", "333", "444444")
val result = example.mkString
println(result) // prints "111112222333444444"
Scala uses java.lang.String as the type for strings, so it is subject to the same characteristics.

Scala: read and save all elements of an Iterable

I have an Iterable[T] that is really a stream of unknown length, and want to read it all and save it into something that is still an instance of Iterable. I really do have to read it and save it; I can't do it in a lazy way. The original Iterable can have a few thousand elements, at least. What's the most efficient/best/canonical way? Should I use an ArrayBuffer, a List, a Vector?
Suppose xs is my Iterable. I can think of doing these possibilities:
xs.toArray.toIterable // Ugh?
xs.toList // Fast?
xs.copyToBuffer(anArrayBuffer)
Vector(xs: _*) // There's no toVector, sadly. Is this construct as efficient?
EDIT: I see by the questions I should be more specific. Here's a strawman example:
def f(xs: Iterable[SomeType]) { // xs might be a stream, though I can't be sure
val allOfXS = <xs all read in at once>
g(allOfXS)
h(allOfXS) // Both g() and h() take an Iterable[SomeType]
}
This is easy. A few thousand elements is nothing, so it hardly matters unless it's a really tight loop. So the flippant answer is: use whatever you feel is most elegant.
But, okay, let's suppose that this is actually in some tight loop, and you can predict or have benchmarked your code enough to know that this is performance-limiting.
Your best performance for an immutable solution will likely be a Vector, used like so:
Vector() ++ xs
In my hands, this can copy a 10k iterable about 4k-5k times per second. List is about half the speed.
If you're willing to try a mutable solution under the hood, xs.toArray.toIterable usually takes the cake with about 10k copies per second. ArrayBuffer is about the same speed as List.
If you actually know the size of the target (i.e. size is O(1) or you know it from somewhere else), you can shave off another 20-30% of the execution speed by allocating just the right size and writing a while loop.
If it's actually primitives, you can gain a factor of 10 by writing your own specialized Iterable-like-thing that acts on arrays and converts to regular collections via the underlying array.
Bottom line: for a great blend of power, speed, and flexibility, use Vector() ++ xs in most situations. xs.toIndexedSeq defaults to the same thing, with the benefit that if xs is already a Vector it will take no time at all (and it chains nicely without using parens), and the drawback that you are relying upon a convention, not a specification, for behavior (and it takes 1-3 more characters to type).
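A sketch of the Vector() ++ xs approach, assuming Scala 2.13 or later. The lazy source, the counter, and the bodies of g and h are stand-ins invented for the example; the counter is only there to show that the source is not consulted again once the copy exists.

```scala
object MaterializeDemo {
  var evaluations = 0

  // Hypothetical lazy source: the function runs again on every traversal of the view.
  val source: Iterable[Int] = (1 to 5).view.map { i => evaluations += 1; i * 2 }

  // Read everything once into a strict, immutable Vector.
  val strict: Iterable[Int] = Vector() ++ source
  val evaluationsAfterCopy: Int = evaluations

  // Stand-ins for g and h from the question.
  def g(xs: Iterable[Int]): Int = xs.sum
  def h(xs: Iterable[Int]): Int = xs.max

  // Both calls traverse the copy, not the lazy source.
  val results: (Int, Int) = (g(strict), h(strict))
  val evaluationsAfterUse: Int = evaluations
}
```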
How about Stream.force?
Forces evaluation of the whole stream and returns it.
This is hard. An Iterable's methods are defined in terms of its iterator, but that gets overridden by subtraits. For instance, IndexedSeq methods are usually defined in terms of apply.
There is the question of why do you want to copy the Iterable, but I suppose you might be guarding against the possibility of it being mutable. If you do not want to copy it, then you need to rephrase your question.
If you are going to copy it, and you want to be sure all elements are copied in a strict manner, you could use .toList. That will not copy a List, but a List does not need to be copied. For anything else, it will produce a new copy.
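A short sketch of that behavior, with made-up names. The ArrayBuffer plays the role of a mutable source you want to guard against.

```scala
import scala.collection.mutable.ArrayBuffer

object CopyDemo {
  // toList on something that is already a List hands back the same instance.
  val list: List[Int] = List(1, 2, 3)
  val sameInstance: Boolean = list.toList eq list

  // toList on a mutable collection takes a strict snapshot.
  val buffer: ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)
  val snapshot: List[Int] = buffer.toList

  // Mutating the source afterwards does not affect the snapshot.
  buffer += 4
  val snapshotUnchanged: Boolean = snapshot == List(1, 2, 3)
}
```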