What are the differences between the List and LazyList collection types in Scala?
LazyList is a type introduced in the Scala 2.13 standard library.
It is an immutable type and lives in the scala.collection.immutable package. The major difference from the common List type is that the elements of a LazyList are computed lazily, so only those elements that are actually requested are computed. This means a lazy list can have an infinite number of elements.
In terms of performance, the two types (LazyList and List) are comparable.
A LazyList is constructed with the #:: operator, which deliberately resembles List's cons operator (::).
Being lazy, a LazyList need not produce the StackOverflowError that building a strict List in a recursive loop can trigger.
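The point above can be sketched with the classic infinite Fibonacci sequence (assumes Scala 2.13+; the names are illustrative):

```scala
// An infinite lazy sequence of Fibonacci numbers. Because LazyList is
// lazy, defining it terminates instantly; elements are computed on demand.
lazy val fibs: LazyList[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibs.zip(fibs.tail).map { case (a, b) => a + b }

// Only the first ten elements are ever computed here.
val firstTen = fibs.take(10).toList
// firstTen == List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
```

The same definition with a strict List would never terminate, since List must materialize every element up front.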
The question
What are the differences between LazyList and List?
can be rephrased as the question
What are the differences between Stream and List?
because, according to the Scala 2.13 release notes,
immutable.LazyList replaces immutable.Stream. Stream had
different laziness behavior and is now deprecated. (#7558,
#7000)
and the answer to the rephrased question is provided by the existing question what is the difference between Scala Stream vs Scala List vs Scala Sequence.
Performance judgements are best addressed by measurements within particular scenarios.
Related
I am new to Scala and learned that a List in Scala is a singly linked list under the hood.
Here is the documentation for the same:
A class for immutable linked lists representing ordered collections of elements of type A.
This class comes with two implementing case classes scala.Nil and scala.:: that implement the abstract members isEmpty, head and tail.
This class is optimal for last-in-first-out (LIFO), stack-like access patterns. If you need another access pattern, for example, random access or FIFO, consider using a collection more suited to this than List.
Why is List a linked list internally?
Isn't that less efficient when random access is required?
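The trade-off the documentation describes can be sketched as follows (a minimal illustration, not from the question):

```scala
// Prepending to an immutable List is O(1): the new cons cell simply
// points at the existing list, which is shared unchanged.
val tail0 = List(2, 3, 4)
val full  = 1 :: tail0       // no copying of tail0 takes place

// Stack-like (LIFO) access is cheap:
val top  = full.head         // O(1)
val rest = full.tail         // O(1); rest is the very same object as tail0

// Random access, by contrast, must walk the chain of cells: O(n).
val third = full(2)          // traverses 1 -> 2 -> 3
```

This structural sharing is also what makes the immutable List safe to pass around freely, and it is why the docs steer random-access workloads toward Vector or an array-backed collection instead.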
There are a lot of RDDs in Spark; from the docs:
AsyncRDDActions
CoGroupedRDD
DoubleRDDFunctions
HadoopRDD
JdbcRDD
NewHadoopRDD
OrderedRDDFunctions
PairRDDFunctions
PartitionPruningRDD
RDD
SequenceFileRDDFunctions
ShuffledRDD
UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are
ParallelCollectionRDD
MapPartitionsRDD
which are not listed though they appear very often in my spark-shell as objects.
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two kinds of operations on RDDs: those for RDDs whose elements are pairs (x, y), and all the other operations. So I would expect there to be just an RDD class and a PairRDD class, and that's it.
What I suspect
I suspect that I got it partly wrong, and that many of the RDD classes could actually have been a single RDD class, but that would make things less tidy. So instead, the developers put different methods into different classes and, in order to make those methods available on any RDD, use implicits to coerce between the class types. I suspect this because many of the RDD class names end with "Functions" or "Actions", and the text in the respective Scaladocs reads that way.
Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).
However - I am not sure about any of this.
First of all, roughly half of the listed classes don't extend RDD; they are type classes designed to augment RDD with different methods specific to the stored type.
One common example is RDD[(T, U)], commonly known as PairRDD, which is enriched by methods provided by PairRDDFunctions, like combineByKeyWithClassTag, which is the basic building block for all byKey transformations. It is worth noting that there is no such class as PairRDD or PairwiseRDD; these names are purely informal.
There are also a few commonly used subclasses of RDD which are not part of the public API and as such are not listed above. Some examples worth mentioning are ParallelCollectionRDD and MapPartitionsRDD.
RDD is an abstract class which leaves two important methods unimplemented:
compute, which computes the result for a given partition
getPartitions, which returns the sequence of partitions for a given RDD
In general there are two reasons to subclass RDD:
to create a class representing an input source (e.g. ParallelCollectionRDD, JdbcRDD)
to create an RDD which provides non-standard transformations
So to summarize:
RDD class provides a minimal interface for RDDs.
subclasses of RDD provide internal logic required for actual computations based on external sources and / or parent RDDs. These are either private or part of the developer API and, excluding debug strings or Spark UI, are not exposed directly to the final user.
type classes provide additional methods based on the type of the values which are stored in the RDD and not dependent on how it has been created.
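The enrichment mechanism described in the last point can be sketched without any Spark dependency. MiniRDD and MiniPairFunctions below are made-up names illustrating the same implicit-wrapper pattern Spark uses with PairRDDFunctions:

```scala
// A simplified stand-in for RDD; the name MiniRDD is hypothetical.
final case class MiniRDD[T](data: Seq[T])

// Extra methods that only make sense when the elements are pairs (K, V),
// supplied by an implicit wrapper class. The compiler inserts the
// conversion automatically, so MiniRDD itself stays minimal.
implicit class MiniPairFunctions[K, V](rdd: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): MiniRDD[(K, V)] =
    MiniRDD(rdd.data.groupBy(_._1).map { case (k, vs) =>
      k -> vs.map(_._2).reduce(f)
    }.toSeq)
}

val pairs  = MiniRDD(Seq("a" -> 1, "b" -> 2, "a" -> 3))
val summed = pairs.reduceByKey(_ + _)
// A MiniRDD(Seq(1, 2)) has no reduceByKey: the call would not compile,
// because the implicit only applies when the element type is a pair.
```

This is why, in Spark, byKey operations appear on RDD[(K, V)] even though the RDD class itself declares none of them.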
TraversableOnce: "A template trait for collections which can be traversed either once only or one or more times."
I don't understand this sentence. Why can it be traversed more than once? Isn't it only once?
Thank you!
The Scaladoc also says
This trait exists primarily to eliminate code duplication between Iterator and Traversable, and thus implements some of the common methods that can be implemented solely in terms of foreach without access to a Builder.
Iterators can only be 'traversed' once. A Traversable can be traversed many times.
Essentially, TraversableOnce is an interface that abstracts away how you handle Iterators and Traversables. Your code could receive either an Iterator or a Traversable and handle them in exactly the same way!
For a good explanation of many of the traits used in the Collections library, I believe the majority (if not all) of the Scala 2.8 Collections Design Tutorial is still correct.
Note that with Scala 2.13 (June 2019), there is no more Traversable and TraversableOnce: They remain only as deprecated aliases for Iterable and IterableOnce. (initially part of the Collections rework)
IterableOnce also has the same sentence:
A template trait for collections which can be traversed either once only or one or more times.
This time:
The goal is to provide a minimal interface without any sequential operations.
This allows third-party extension like Scala parallel collections to integrate at the level of IterableOnce without inheriting unwanted implementations.
Because some things can only be traversed once, e.g.:
Iterator.continually(scala.io.StdIn.readLine())
This expression creates an iterator, but it can only be traversed once; otherwise it would have to store all the data it has read, which is usually a waste.
And many containers can be traversed as many times as you want, like Array, Map and so on.
If a Traversable can be traversed more than once, it can certainly be traversed once. So every Traversable is also a TraversableOnce: a TraversableOnce can be traversed at least once, but possibly more times.
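The difference is easy to observe directly (a small illustration, assuming Scala 2.13 where Iterable plays the Traversable role):

```scala
// An Iterator is consumed as it is traversed: a second pass sees nothing.
val it = Iterator(1, 2, 3)
val firstPass  = it.sum   // 6 -- this exhausts the iterator
val secondPass = it.sum   // 0 -- nothing is left to traverse

// An immutable List can be walked as many times as you like.
val xs    = List(1, 2, 3)
val again = xs.sum + xs.sum   // 12 -- both passes see every element
```

Code written against the common supertype (TraversableOnce, or IterableOnce in 2.13) must therefore assume it gets only one pass, which is exactly the contract both kinds of collection can honor.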
In the case of Set or List, the choice seems easier, but what do I use for the equivalent of Java's Collection or Iterable? Do I go for Seq? Traversable? GenTraversableOnce?
You need to decide based on your needs. For example, according to the Scala documentation, the definition of Seq is:
Sequences are special cases of iterable collections of class Iterable. Unlike iterables, sequences always have a defined order of elements. Sequences provide a method apply for indexing.
So if you want to benefit from ordering, or you want to retrieve elements by index, you can use Seq.
Again, according to the Scala documentation, if you are mainly interested in iterating over your collection, Traversable is sufficient.
Also note the general good practice that in function signatures (return type included) you should use the most general (abstract) data type that works, to avoid imposing an unnecessary performance penalty on the function's callers.
As often, it will depend on the needs of your caller.
Traversable is pretty high level (you only get foreach), but it might be sufficient. Seq would be the choice if you need a defined order of elements. GenTraversableOnce would be a bit too abstract for me, and possibly for your fellow coders.
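The advice above can be sketched with a function typed against a general interface (the example uses Iterable, the 2.13 replacement for Traversable; the names are illustrative):

```scala
// Accepting a general collection type lets every caller pass whatever
// concrete collection they already have, with no conversion needed.
def total(xs: Iterable[Int]): Int = xs.foldLeft(0)(_ + _)

val fromList   = total(List(1, 2, 3))     // works
val fromVector = total(Vector(1, 2, 3))   // works
val fromSet    = total(Set(1, 2, 3))      // works
```

Had `total` demanded a List, the Vector and Set callers would each pay for a conversion before every call.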
How can I refer to ArrayBuffer and Vector in a more generic way?
For example - one of my functions takes a Vector as an argument, while another returns an ArrayBuffer.
What is a common "interface" that I can use?
For example, in Java I could use List or Collection interface to pass them around.
See here for an overview of the inheritance relationship between the collections classes.
You'll see that IndexedSeq is a common trait for both ArrayBuffer and Vector.
EDIT: IndexedSeq vs. Seq:
From the doc: "Indexed sequences do not add any new methods wrt Seq, but promise efficient implementations of random access patterns." This means that, in this context, you could just as well use Seq, since the implementations will be provided by ArrayBuffer and Vector in any case.
I would use SeqLike, or the more generic TraversableOnce, which would also apply to Maps.
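The IndexedSeq suggestion above can be sketched as follows (a minimal example; `middle` is a made-up function):

```scala
import scala.collection.mutable.ArrayBuffer

// A function written against scala.collection.IndexedSeq accepts both
// the immutable Vector and the mutable ArrayBuffer, and may rely on
// the efficient random access both implementations promise.
def middle[A](xs: scala.collection.IndexedSeq[A]): A = xs(xs.length / 2)

val fromVector = middle(Vector(1, 2, 3))               // 2
val fromBuffer = middle(ArrayBuffer("a", "b", "c"))    // "b"
```

Note the fully qualified scala.collection.IndexedSeq: the unqualified IndexedSeq defaults to the immutable variant in Scala 2.13, which an ArrayBuffer does not extend.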