There are a lot of RDDs in Spark; from the docs:
AsyncRDDActions
CoGroupedRDD
DoubleRDDFunctions
HadoopRDD
JdbcRDD
NewHadoopRDD
OrderedRDDFunctions
PairRDDFunctions
PartitionPruningRDD
RDD
SequenceFileRDDFunctions
ShuffledRDD
UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are
ParallelCollectionRDD
MapPartitionsRDD
which are not listed though they appear very often in my spark-shell as objects.
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: Those for RDDs which have pairs (x, y) and all the other operations. So I would expect to have class RDD and PairRDD and that's it.
What I suspect
I suspect that I got it partly wrong and what is actually the case is that a lot of RDD classes could be just one RDD class - but that would make things less tidy. So instead, the developers decided to put different methods into different classes and in order to provide those to any RDD class type, they use implicit to coerce between the class types. I suspect that due to the fact that many of the RDD class types end with "Functions" or "Actions" and text in the respective scaladocs sound like this.
Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).
However - I am not sure about any of this.
First of all, roughly half of the listed classes don't extend RDD but are type classes designed to augment RDD with methods specific to the stored type.
One common example is RDD[(T, U)], commonly known as PairRDD, which is enriched by methods provided by PairRDDFunctions, such as combineByKeyWithClassTag, the basic building block for all byKey transformations. It is worth noting that there is no such class as PairRDD or PairwiseRDD; these names are purely informal.
There are also a few commonly used subclasses of RDD which are not part of the public API and as such are not listed above. Some examples worth mentioning are ParallelCollectionRDD and MapPartitionsRDD.
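To make the enrichment idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. MiniRDD, MiniPairFunctions and toPairFunctions are hypothetical names invented for illustration; in Spark itself the corresponding conversion is rddToPairRDDFunctions in the RDD companion object, and its signature is more involved than what is shown here.

```scala
import scala.language.implicitConversions

object EnrichmentSketch {
  // Hypothetical stand-in for RDD[T]: it only wraps a local Seq.
  final class MiniRDD[T](val data: Seq[T]) {
    def map[U](f: T => U): MiniRDD[U] = new MiniRDD(data.map(f))
    def collect(): Seq[T] = data
  }

  // Extra methods that only make sense for key/value elements,
  // similar in spirit to PairRDDFunctions.
  final class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
    def reduceByKey(f: (V, V) => V): MiniRDD[(K, V)] = {
      val merged = self.data.groupBy(_._1).map {
        case (k, kvs) => k -> kvs.map(_._2).reduce(f)
      }
      new MiniRDD(merged.toSeq)
    }
  }

  // The conversion applies only when the element type is a pair.
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)

  val words = new MiniRDD(Seq("a", "b", "a"))
  val counts: Map[String, Int] =
    words.map(w => (w, 1)).reduceByKey(_ + _).collect().toMap

  // new MiniRDD(Seq(1, 2, 3)).reduceByKey(_ + _)  // does not compile: elements are not pairs
}
```

The point of the sketch is that reduceByKey is never defined on MiniRDD itself; the compiler only finds it when the element type happens to be a tuple.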
RDD is an abstract class which doesn't implement two important methods:
compute, which computes the result for a given partition
getPartitions, which returns a sequence of partitions for a given RDD
In general there are two reasons to subclass RDD:
create a class representing an input source (e.g. ParallelCollectionRDD, JdbcRDD)
create an RDD which provides non-standard transformations
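The two abstract methods and the two reasons for subclassing can be sketched with a toy model in plain Scala. This is not Spark's API: MiniRDD, CollectionSource and MappedPartitions are hypothetical names, partitions are reduced to plain Int ids, and the real signatures additionally involve Partition, TaskContext and a SparkContext.

```scala
object RddModelSketch {
  // Hypothetical, local-only model of the abstract base class.
  abstract class MiniRDD[T] {
    def getPartitions: Seq[Int]               // partition ids
    def compute(partition: Int): Iterator[T]  // data of one partition
    def collect(): Seq[T] = getPartitions.flatMap(p => compute(p))
  }

  // Reason 1: an input source (compare ParallelCollectionRDD).
  // Elements are assigned round-robin for brevity.
  final class CollectionSource[T](data: Seq[T], slices: Int) extends MiniRDD[T] {
    def getPartitions: Seq[Int] = 0 until slices
    def compute(partition: Int): Iterator[T] =
      data.iterator.zipWithIndex.collect {
        case (x, i) if i % slices == partition => x
      }
  }

  // Reason 2: a transformation over a parent (compare MapPartitionsRDD).
  final class MappedPartitions[T, U](parent: MiniRDD[T], f: Iterator[T] => Iterator[U])
      extends MiniRDD[U] {
    def getPartitions: Seq[Int] = parent.getPartitions
    def compute(partition: Int): Iterator[U] = f(parent.compute(partition))
  }

  val source = new CollectionSource(Seq(1, 2, 3, 4, 5), slices = 2)
  val doubled = new MappedPartitions[Int, Int](source, _.map(_ * 2))
  val result: Seq[Int] = doubled.collect()
}
```

Note how the transformation never touches the data directly: it only asks its parent for an iterator over the same partition, which is roughly how lineage is chained.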
So to summarize:
RDD class provides a minimal interface for RDDs.
subclasses of RDD provide internal logic required for actual computations based on external sources and / or parent RDDs. These are either private or part of the developer API and, excluding debug strings or Spark UI, are not exposed directly to the final user.
type classes provide additional methods based on the type of the values which are stored in the RDD and not dependent on how it has been created.
Related
I am new to Scala and learned that a List in Scala is a singly linked list under the hood.
Here is the documentation for the same:
A class for immutable linked lists representing ordered collections of elements of type A.
This class comes with two implementing case classes scala.Nil and scala.:: that implement the abstract members isEmpty, head and tail.
This class is optimal for last-in-first-out (LIFO), stack-like access patterns. If you need another access pattern, for example, random access or FIFO, consider using a collection more suited to this than List.
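A small sketch of what the documentation describes, using only the standard library. The object name ListSketch and the helper sum are made up for illustration.

```scala
object ListSketch {
  val xs: List[Int] = 1 :: 2 :: 3 :: Nil

  // Prepending allocates one new cell and reuses the old list as its tail.
  val ys: List[Int] = 0 :: xs
  val sharesTail: Boolean = ys.tail eq xs

  // Stack-like (LIFO) access via pattern matching on head and tail.
  @scala.annotation.tailrec
  def sum(list: List[Int], acc: Int = 0): Int = list match {
    case Nil          => acc
    case head :: tail => sum(tail, acc + head)
  }

  // Indexed access on a List walks the cells from the front.
  // Vector is one standard alternative when random access dominates.
  val v: Vector[Int] = xs.toVector
  val third: Int = v(2)
}
```

Because the list is immutable, sharing the tail between xs and ys is safe, which is a large part of why prepending is cheap.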
Why is List implemented as a linked list internally?
Isn't it less efficient when random access is required?
I have some data stored as parquet files and case classes matching the data schema. Spark deals well with regular Product types so if I have
case class A(s:String, i:Int)
I can easily do
spark.read.parquet(file).as[A]
But from what I understand, Spark doesn't handle disjunction types, so when I have enums in my parquet, previously encoded as integers, and a scala representation like
sealed trait E
case object A extends E
case object B extends E
I cannot do
spark.read.parquet(file).as[E]
// java.lang.UnsupportedOperationException: No Encoder found for E
Makes sense so far, but then, probably too naively, I try
implicit val eEncoder = new org.apache.spark.sql.Encoder[E] {
def clsTag = ClassTag(classOf[E])
def schema = StructType(StructField("e", IntegerType, nullable = false)::Nil)
}
And I still get the same "No Encoder found for E" :(
My question at this point is: why is the implicit missing from scope (or not recognized as an Encoder[E])? And even if it were found, how would such an interface allow me to actually decode the data? I would still need to map the value to the proper case object.
I did read a related answer that says "TL;DR There is no good solution right now, and given Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future." But I'm struggling to understand why a custom Encoder couldn't do the trick.
But I'm struggling to understand why a custom Encoder couldn't do the trick.
Two main reasons:
There is no API for custom Encoders. Publicly available are only "binary" Kryo and Java Encoders, which create useless (in case of DataFrame / Dataset[Row]) blobs with no support for any meaningful SQL / DataFrame operations.
Code like this would work fine
import org.apache.spark.sql.Encoders
spark.createDataset(Seq(A, B): Seq[E])(Encoders.kryo[E])
but it is nothing more than a curiosity.
DataFrame is a columnar store. It is technically possible to encode type hierarchies on top of this structure (private UserDefinedType API does that) but it is cumbersome (as you have to provide storage for all possible variants, see for example How to define schema for custom type in Spark SQL?) and inefficient (in general complex types are somewhat second class citizens in Spark SQL, and many optimizations are not accessible with complex schema, subject to future changes).
In a broader sense, the DataFrame API is effectively relational (as in relational algebra), and tuples (the main building block of relations) are by definition homogeneous, so by extension there is no place in the SQL / DataFrame API for heterogeneous structures.
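One way to live with this limitation, offered as a workaround sketch rather than something the answer prescribes, is to keep the raw integer in the row type and convert at the boundary. The integer codes 0 and 1, the Record class and the helpers fromCode and toCode are assumptions for illustration; the real codes depend on how the parquet data was written. The sketch is plain Scala and does not touch Spark.

```scala
object EnumColumnSketch {
  sealed trait E
  case object A extends E
  case object B extends E

  // Assumed integer codes; unknown codes are reported as None.
  def fromCode(code: Int): Option[E] = code match {
    case 0 => Some(A)
    case 1 => Some(B)
    case _ => None
  }

  def toCode(e: E): Int = e match {
    case A => 0
    case B => 1
  }

  // Hypothetical row type that keeps the raw Int column,
  // i.e. a regular Product type of the kind Spark can already encode.
  final case class Record(s: String, e: Int) {
    def decoded: Option[E] = fromCode(e)
  }

  val records = Seq(Record("x", 0), Record("y", 1), Record("z", 7))
  val decoded: Seq[Option[E]] = records.map(_.decoded)
}
```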
TraversableOnce: "A template trait for collections which can be traversed either once only or one or more times."
I don't understand this sentence. Why can it be traversed more than once? Isn't it only once?
Thank you!
The Scaladoc also says
This trait exists primarily to eliminate code duplication between Iterator and Traversable, and thus implements some of the common methods that can be implemented solely in terms of foreach without access to a Builder.
Iterators can only be 'traversed' once. A Traversable can be traversed many times.
Essentially, TraversableOnce is an interface that abstracts away how you handle Iterators and Traversables. Your code could receive either an Iterator or a Traversable and handle them in exactly the same way!
For a good explanation of many of the traits used in the Collections library, I believe the majority (if not all) of the Scala 2.8 Collections Design Tutorial is still correct.
Note that with Scala 2.13 (June 2019), there is no more Traversable and TraversableOnce: They remain only as deprecated aliases for Iterable and IterableOnce. (initially part of the Collections rework)
IterableOnce also has the same sentence:
A template trait for collections which can be traversed either once only or one or more times.
This time:
The goal is to provide a minimal interface without any sequential operations.
This allows third-party extension like Scala parallel collections to integrate at the level of IterableOnce without inheriting unwanted implementations.
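As a minimal sketch, assuming Scala 2.13 or later, a method that accepts IterableOnce works for both an Iterator and a regular collection. The names IterableOnceSketch and total are made up for illustration.

```scala
object IterableOnceSketch {
  // Accepts anything that can be traversed at least once.
  def total(xs: IterableOnce[Int]): Int = xs.iterator.sum

  val fromList: Int = total(List(1, 2, 3))
  val fromIterator: Int = total(Iterator(1, 2, 3))
}
```

Inside total the argument is deliberately consumed only once, since that is all the parameter type promises.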
Because some things can only be traversed once, e.g.:
Iterator.continually(scala.io.StdIn.readLine())
This expression creates an iterator that can be traversed only once; otherwise it would have to store all the data it has read, which is a waste most of the time.
Many containers, on the other hand, can be traversed as many times as you want, like Array, Map and so on.
If a Traversable can be traversed more than once, it can certainly be traversed once. So every Traversable is also a TraversableOnce: a TraversableOnce can be traversed at least once, and possibly more times.
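The difference can be seen in a small sketch. The helper drain is a made-up name; it uses only hasNext and next, which are the two Iterator methods that are always safe to call repeatedly.

```scala
object TraverseOnceSketch {
  // Collects whatever is left in the iterator.
  def drain[A](it: Iterator[A]): List[A] = {
    val buf = List.newBuilder[A]
    while (it.hasNext) buf += it.next()
    buf.result()
  }

  val it: Iterator[Int] = Iterator(1, 2, 3)
  val firstPass: List[Int] = drain(it)
  val secondPass: List[Int] = drain(it) // nothing left: the iterator is exhausted

  val list: List[Int] = List(1, 2, 3)
  val firstSum: Int = list.sum
  val secondSum: Int = list.sum // a List can be traversed again
}
```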
I have very large result sets being imported from JSON. Each row of data in the JSON comes in a very specific "column" order, which I would like to iterate through quickly. I'd prefer to avoid the overhead of checking/matching keys to process each piece of data. Unfortunately, scala.util.parsing.json puts these columns into a Map object, and when iterating through the Map, the iteration order is arbitrary and does not necessarily mirror the order of the columns in the JSON result. Is there a way to make the parser enforce the order of the JSON columns? One thought was to tell the parser to use LinkedHashMap or ListMap as it generates the objects. Would this be possible by extending the class or adding other traits? Do I have alternative options?
I'd strongly discourage you from relying on the order of key/value pairs. JSON objects are defined as:
An object is an unordered set of name/value pairs.
Relying on the order will most likely introduce hard-to-find bugs and incompatibilities into your code. Trading correctness for speed is always a bad deal.
Instead, I'd suggest finding a fast, correct parser. I've used Jackson before, which is very fast and works well with Scala. You annotate an arbitrary class of yours and Jackson parses JSON into instances of that class. Then you can process these instances as native Java/Scala objects, which is both very fast and robust.
I would consider trying something like json4s.
It appears the JObject type has ordered fields.
https://github.com/json4s/json4s
Otherwise I would ask why you need them ordered?
You can always map.get by key.
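A minimal sketch of the lookup-by-key approach, assuming the column order is known up front. The column names and the sample row are made up for illustration.

```scala
object ColumnOrderSketch {
  // Assumed column order, known from the JSON producer.
  val columns: List[String] = List("id", "name", "score")

  // A parsed row as a Map; its iteration order is not something to rely on.
  val row: Map[String, Any] = Map("score" -> 9.5, "id" -> 1, "name" -> "x")

  // Look each column up by key, in the order we want.
  val ordered: List[Option[Any]] = columns.map(row.get)
}
```

This keeps the processing order under your control without depending on how the parser happens to store the fields.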
I want to read a rather large csv file and process it (slice, dice, summarize etc.) interactively (data exploration). My idea is to read the file into a database (H2) and use SQL to process it:
Read the file: I use Ostermiller csv parser
Determine the type of each column: I randomly select 50 rows and derive the type (int, long, double, date, string) of each column
I want to use Squeryl to process. To do so I need to create a case class dynamically. That's the bottleneck so far!
I upload the file to H2 and use any SQL command.
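The type-derivation step could look roughly like the sketch below. It is only an illustration under assumptions: all names are made up, date detection is left out for brevity, and every sampled row is assumed to have the same number of cells.

```scala
import scala.util.Try

object ColumnTypeSketch {
  sealed trait ColType
  case object IntCol extends ColType
  case object LongCol extends ColType
  case object DoubleCol extends ColType
  case object StringCol extends ColType

  // Narrowest type that parses a single cell.
  def cellType(s: String): ColType =
    if (Try(s.trim.toInt).isSuccess) IntCol
    else if (Try(s.trim.toLong).isSuccess) LongCol
    else if (Try(s.trim.toDouble).isSuccess) DoubleCol
    else StringCol

  // Widening order used to combine the cells of one column.
  private val rank: Map[ColType, Int] =
    Map(IntCol -> 0, LongCol -> 1, DoubleCol -> 2, StringCol -> 3)

  // The widest cell type seen in the sample decides the column type.
  def columnType(sample: Seq[String]): ColType =
    if (sample.isEmpty) StringCol else sample.map(cellType).maxBy(rank)

  // rows: sampled rows, each a sequence of raw cell strings.
  def inferSchema(rows: Seq[Seq[String]]): Seq[ColType] =
    rows.transpose.map(columnType)

  val sample = Seq(Seq("1", "2.5", "a"), Seq("3000000000", "4", "b"))
  val schema: Seq[ColType] = inferSchema(sample)
}
```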
My questions:
Is there a better general interactive way of doing this in Scala?
Is there a way to solve the 3rd point? To state it differently, given a list of types (corresponding to the columns in the csv file), is it possible to dynamically create a case class corresponding to the table in Squeryl? To my understanding I can do that using macros, but I do not have enough exposure to do that.
I think your approach to the first question sounds reasonable.
Regarding your 2nd question - as an addition to drexin's answer - it is possible to generate the bytecode, with a library such as ASM. With such a library you can generate the same byte code as a case class would.
As Scala is a statically typed language, there is no way to create classes dynamically except through reflection, which is slow and dangerous and should therefore be avoided. Even with macros you cannot do this. Macros are evaluated at compile time, not at runtime, so you need to know the structure of your data at compile time. What do you need the case classes for, if you don't even know what your data looks like? What benefit do you expect from this over using a Map[String, Any]?
I think you want to create a sealed base class and then a series of case classes as subclasses of it. Each subclass will wrap a different type that you support.
Then you can use match statements and deconstruction to deal with the individual types, and treat them generically via the base class in the places where it doesn't matter.
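A minimal sketch of that idea follows. The names Cell, IntCell, DoubleCell, StringCell, asDouble and render are made up for illustration, and only three cell types are covered.

```scala
object CellSketch {
  sealed trait Cell
  final case class IntCell(value: Int) extends Cell
  final case class DoubleCell(value: Double) extends Cell
  final case class StringCell(value: String) extends Cell

  type Row = Seq[Cell]

  // Type-specific handling via pattern matching and deconstruction.
  def asDouble(cell: Cell): Option[Double] = cell match {
    case IntCell(i)    => Some(i.toDouble)
    case DoubleCell(d) => Some(d)
    case StringCell(_) => None
  }

  // Generic handling: callers only see the base type Cell.
  def render(row: Row): String = row.map {
    case IntCell(i)    => i.toString
    case DoubleCell(d) => d.toString
    case StringCell(s) => s
  }.mkString(",")

  val row: Row = Seq(IntCell(1), DoubleCell(2.5), StringCell("x"))
  val numericSum: Double = row.flatMap(c => asDouble(c)).sum
}
```

Because the trait is sealed, the compiler can warn when a match forgets one of the cell types, which is the main safety benefit over Map[String, Any].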
You can't create a class for an entire row since you don't know enough about it at compile time. Even if you could dynamically generate a class (maybe by invoking the compiler at runtime), you wouldn't be able to benefit from type-safety and most of your code would have to treat it generically anyway.