I am new to Scala and learned that a List in Scala is a singly linked list under the hood.
Here is what the documentation says:
A class for immutable linked lists representing ordered collections of elements of type A.
This class comes with two implementing case classes scala.Nil and scala.:: that implement the abstract members isEmpty, head and tail.
This class is optimal for last-in-first-out (LIFO), stack-like access patterns. If you need another access pattern, for example, random access or FIFO, consider using a collection more suited to this than List.
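To make the quoted trade-off concrete, here is a minimal sketch (the object and value names are made up for illustration). It shows that pushing and popping at the head reuse the existing list, while indexed access has to walk the cells; Vector is used only as one example of a collection better suited to random access.

```scala
object ListAccessDemo {
  // Prepending with :: allocates one new cell and shares the old list as its tail: O(1).
  val stack: List[Int] = 3 :: 2 :: 1 :: Nil

  // "Push" and "pop" both work at the head only.
  val pushed: List[Int] = 4 :: stack     // List(4, 3, 2, 1)
  val top: Int = pushed.head             // 4
  val popped: List[Int] = pushed.tail    // the very same object as `stack`

  // Indexed access walks the cells one by one: O(n) on a List.
  val third: Int = stack(2)              // 1

  // A Vector offers effectively constant-time indexed access instead.
  val indexed: Vector[Int] = stack.toVector
  val thirdFast: Int = indexed(2)        // 1
}
```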
Why is List implemented as a linked list internally?
Isn't that less efficient when random access is required?
The current collection framework favors the to method for converting to a target collection type, and with an implicit conversion available from the various collection companion objects to the Factory argument, this creates a neat, uniform interface. Unfortunately, it makes optimisations that were easy in the old framework with CanBuildFrom very hard. Let's say I have a custom collection type Unique[T], which is a combination of Set and Seq in that the order of the elements follows the order of insertion, but any element can occur only once. Because it offers fast indexOf, apply(i: Int) and contains, conversions to both Set and Seq can be O(1) with a simple wrapper. I can override toSet and toSeq, but I see no way of determining inside to (other than extremely questionable reflection) whether the target factory builds a Seq or a Set, because the Factory implementation used by the implicit conversion is generic and not a prototype instance like the old ReusableCBF.
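For readers who have not used it, this is roughly what the uniform to interface looks like on standard collections, assuming Scala 2.13 or later. It is only a sketch of the call site (the object and value names are made up) and does not attempt the Unique[T] optimisation itself.

```scala
object ToConversionDemo {
  val xs: List[Int] = List(3, 1, 3, 2)

  // The companion objects Vector and Set are implicitly converted
  // to the Factory argument that `to` expects.
  val asVector: Vector[Int] = xs.to(Vector)   // keeps order and duplicates
  val asSet: Set[Int] = xs.to(Set)            // drops the duplicate 3
}
```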
There are a lot of RDDs in Spark; from the docs:
AsyncRDDActions
CoGroupedRDD
DoubleRDDFunctions
HadoopRDD
JdbcRDD
NewHadoopRDD
OrderedRDDFunctions
PairRDDFunctions
PartitionPruningRDD
RDD
SequenceFileRDDFunctions
ShuffledRDD
UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are
ParallelCollectionRDD
MapPartitionsRDD
which are not listed though they appear very often in my spark-shell as objects.
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: those for RDDs which hold pairs (x, y), and all the other operations. So I would expect to have a class RDD and a class PairRDD, and that's it.
What I suspect
I suspect that I got it partly wrong and that what is actually the case is that a lot of the RDD classes could be just one RDD class, but that would make things less tidy. So instead, the developers decided to put different methods into different classes, and in order to provide those to any RDD class type, they use implicits to coerce between the class types. I suspect this because many of the RDD class types end with "Functions" or "Actions" and the text in the respective scaladocs sounds like this.
Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).
However - I am not sure about any of this.
First of all, roughly half of the listed classes don't extend RDD; they are type classes designed to augment RDD with methods specific to the stored type.
One common example is RDD[(T, U)], commonly known as PairRDD, which is enriched by methods provided by PairRDDFunctions, like combineByKeyWithClassTag, which is a basic building block for all byKey transformations. It is worth noting that there is no such class as PairRDD or PairwiseRDD; these names are purely informal.
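The enrichment idea can be sketched in plain Scala without Spark. Everything below is hypothetical: MiniRDD and MiniPairFunctions are made-up stand-ins for RDD and PairRDDFunctions, and this reduceByKey is a toy written for the example, not Spark's implementation. The point is only that the extra methods become available when, and only when, the element type is a pair.

```scala
object EnrichmentSketch {
  // Hypothetical stand-in for RDD: a thin wrapper around a local Seq.
  final class MiniRDD[T](val items: Seq[T]) {
    def map[U](f: T => U): MiniRDD[U] = new MiniRDD(items.map(f))
  }

  // Hypothetical stand-in for PairRDDFunctions: these methods exist
  // only when the element type is a pair (K, V).
  implicit class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
    def reduceByKey(f: (V, V) => V): MiniRDD[(K, V)] =
      new MiniRDD(
        self.items
          .groupBy(_._1)
          .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
          .toSeq
      )
  }

  val words = new MiniRDD(Seq("a", "b", "a"))
  // words.reduceByKey(...) would not compile here: String is not a pair.
  val counts: Map[String, Int] =
    words.map(w => (w, 1)).reduceByKey(_ + _).items.toMap
}
```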
There are also a few commonly used subclasses of RDD which are not part of the public API and, as such, are not listed above. Some examples worth mentioning are ParallelCollectionRDD and MapPartitionsRDD.
RDD is an abstract class which doesn't implement two important methods:
compute, which computes the result for a given partition
getPartitions, which returns a sequence of partitions for a given RDD
In general, there are two reasons to subclass RDD:
create a class representing an input source (e.g. ParallelCollectionRDD, JdbcRDD)
create an RDD which provides non-standard transformations
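A heavily simplified model of these two kinds of subclass might look like the sketch below. All names here (ToyRDD, Part, ToyCollectionRDD, ToyMappedRDD) are hypothetical, and the signatures are deliberately not Spark's; for instance, Spark's real compute also receives a task context. The sketch assumes numSlices is positive.

```scala
object SubclassSketch {
  // Hypothetical, simplified model; not Spark's actual classes or signatures.
  final case class Part(index: Int)

  abstract class ToyRDD[T] {
    def getPartitions: Seq[Part]
    def compute(part: Part): Iterator[T]
    // Other operations can be built on top of the two abstract methods.
    def collect(): List[T] = getPartitions.flatMap(p => compute(p)).toList
  }

  // Input source: splits a local collection into slices.
  final class ToyCollectionRDD[T](data: Vector[T], numSlices: Int) extends ToyRDD[T] {
    private val sliceSize =
      math.max(1, math.ceil(data.size.toDouble / numSlices).toInt)
    def getPartitions: Seq[Part] = (0 until numSlices).map(i => Part(i))
    def compute(part: Part): Iterator[T] =
      data.slice(part.index * sliceSize, (part.index + 1) * sliceSize).iterator
  }

  // Transformation: reuses the parent's partitions and maps each one.
  final class ToyMappedRDD[T, U](parent: ToyRDD[T], f: Iterator[T] => Iterator[U])
      extends ToyRDD[U] {
    def getPartitions: Seq[Part] = parent.getPartitions
    def compute(part: Part): Iterator[U] = f(parent.compute(part))
  }

  val source = new ToyCollectionRDD(Vector(1, 2, 3, 4, 5), numSlices = 2)
  val doubled = new ToyMappedRDD[Int, Int](source, _.map(_ * 2))
}
```

Note that the transformation never touches the data directly; it only asks its parent to compute a partition and then applies its function to the resulting iterator.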
So to summarize:
The RDD class provides a minimal interface for RDDs.
Subclasses of RDD provide the internal logic required for actual computations based on external sources and/or parent RDDs. These are either private or part of the developer API and, apart from debug strings or the Spark UI, are not exposed directly to the end user.
Type classes provide additional methods based on the type of the values stored in the RDD, independent of how the RDD was created.
TraversableOnce: "A template trait for collections which can be traversed either once only or one or more times."
I don't understand this sentence. Why can it be traversed more than once? Isn't it only once?
Thank you!
The Scaladoc also says
This trait exists primarily to eliminate code duplication between Iterator and Traversable, and thus implements some of the common methods that can be implemented solely in terms of foreach without access to a Builder.
Iterators can only be 'traversed' once. A Traversable can be traversed many times.
Essentially, TraversableOnce is an interface that abstracts away how you handle Iterators and Traversables. Your code could receive either an Iterator or a Traversable and handle them in exactly the same way!
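As a small sketch of that idea, assuming Scala 2.13 or later, where IterableOnce plays the role of TraversableOnce (the function name total is made up for the example):

```scala
object TraverseOnceSketch {
  // Accepts anything that can be traversed at least once.
  def total(xs: IterableOnce[Int]): Int = xs.iterator.sum

  // A List can still be traversed again after the call.
  val fromList: Int = total(List(1, 2, 3))
  // An Iterator is consumed by the call.
  val fromIterator: Int = total(Iterator(1, 2, 3))
}
```

The function only promises to make a single pass, which is why it can accept both kinds of argument.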
For a good explanation of many of the traits used in the Collections library, I believe the majority (if not all) of the Scala 2.8 Collections Design Tutorial is still correct.
Note that as of Scala 2.13 (June 2019), Traversable and TraversableOnce are gone as separate traits: they remain only as deprecated aliases for Iterable and IterableOnce (this was part of the collections rework).
IterableOnce also has the same sentence:
A template trait for collections which can be traversed either once only or one or more times.
This time:
The goal is to provide a minimal interface without any sequential operations.
This allows third-party extension like Scala parallel collections to integrate at the level of IterableOnce without inheriting unwanted implementations.
Because some things can only be traversed once, e.g.:
Iterator.continually(scala.io.StdIn.readLine())
This expression creates an iterator, but it can only be traversed once; otherwise it would have to store all the data it has read, which is wasteful most of the time.
Many containers, on the other hand, can be traversed as many times as you want, like Array, Map and so on.
If a Traversable can be traversed more than once, it can certainly be traversed once. So every Traversable is also a TraversableOnce: a TraversableOnce can be traversed at least once, and possibly more times.
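The difference is easy to see on a small in-memory example (a sketch; the names are made up):

```scala
object SinglePassSketch {
  val it: Iterator[Int] = Iterator(1, 2, 3)
  val firstPass: List[Int] = it.toList   // consumes the iterator
  val exhausted: Boolean = !it.hasNext   // nothing is left for a second pass

  val xs: List[Int] = List(1, 2, 3)
  val passOne: Int = xs.sum              // first traversal
  val passTwo: Int = xs.sum              // second traversal, same result
}
```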
I have defined a case class...
case class QueryRef[A](id: UUID, descriptor: A => Boolean, selector: immutable.Iterable[A] => A)
...that will be passed as a message between Akka Actors. The receiver will filter some collection of type A using the descriptor and then select a single element from the resulting filtered collection using the selector.
As written it will only work if the receiving actor's collection has type immutable.Seq[A]. I would like to generalize the above so that it would work with a generic collection of elements of type A. Is this possible?
Scala collections have a hierarchy, illustrated below. You just need to choose which level of the hierarchy is appropriate for your use-case. Iterable could be a good candidate for you if you want Maps and Sets to be allowed.
Of course, you can then only use the functions that are available at that level of the hierarchy; you wouldn't be able to use any Seq-specific functionality.
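For instance, a version typed against immutable.Iterable could look like the sketch below. The answer helper and the largestEven query are hypothetical names invented for the example, and the Akka messaging part is left out entirely.

```scala
import java.util.UUID
import scala.collection.immutable

object QueryRefSketch {
  final case class QueryRef[A](
    id: UUID,
    descriptor: A => Boolean,
    selector: immutable.Iterable[A] => A
  )

  // Hypothetical receiver-side helper: filter first, then select.
  def answer[A](query: QueryRef[A], items: immutable.Iterable[A]): A =
    query.selector(items.filter(query.descriptor))

  val largestEven: QueryRef[Int] =
    QueryRef[Int](UUID.randomUUID(), _ % 2 == 0, _.max)

  // The same query works against a List and a Set.
  val fromList: Int = answer(largestEven, List(1, 2, 3, 4))
  val fromSet: Int = answer(largestEven, Set(6, 7, 8))
}
```

A selector such as _.max throws on an empty collection, so a real receiver would probably want to handle the case where nothing matches the descriptor.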
How can I refer to ArrayBuffer and Vector in a more generic way?
For example - one of my functions takes a Vector as an argument, while another returns an ArrayBuffer.
What is a common "interface" that I can use?
For example, in Java I could use List or Collection interface to pass them around.
See here for an overview of the inheritance relationship between the collections classes.
You'll see that IndexedSeq is a common trait for both ArrayBuffer and Vector.
EDIT: IndexedSeq vs. Seq:
From the doc: Indexed sequences do not add any new methods wrt Seq, but promise efficient implementations of random access patterns. This means that, in this context, you could just as well use Seq, as the implementations will be provided by ArrayBuffer and Vector in any case.
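A minimal sketch of a function that accepts both (the name middle is made up). One caveat: since Scala 2.13 the unqualified names Seq and IndexedSeq refer to the immutable variants, which a mutable ArrayBuffer does not implement, so the sketch spells out scala.collection.IndexedSeq explicitly.

```scala
import scala.collection.mutable.ArrayBuffer

object CommonSeqSketch {
  // scala.collection.IndexedSeq is a supertype of both the immutable
  // Vector and the mutable ArrayBuffer.
  def middle[A](xs: scala.collection.IndexedSeq[A]): A = xs(xs.length / 2)

  val fromVector: Int = middle(Vector(1, 2, 3))
  val fromBuffer: Int = middle(ArrayBuffer(10, 20, 30))
}
```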
I would use SeqLike, or the more generic TraversableOnce, which would also apply to Maps.