What is RDD in spark - scala

Definition says:
RDD is immutable distributed collection of objects
I don't quite understand what this means. Is it like data (partitioned objects) stored on hard disk? If so, how come RDDs can hold user-defined classes (such as Java, Scala or Python classes)?
From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html It mentions:
Users create RDDs in two ways: by loading an external dataset, or by
distributing a collection of objects (e.g., a list or set) in their
driver program
I am really confused about RDDs in general and in relation to Spark and Hadoop.
Can someone please help?

An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc.
The formal definition is:
RDDs are fault-tolerant, parallel data structures that let users
explicitly persist intermediate results in memory, control their
partitioning to optimize data placement, and manipulate them using a
rich set of operators.
If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

An RDD is a logical reference to a dataset that is partitioned across many server machines in the cluster. RDDs are immutable and recover themselves in case of failure.
The dataset could be data loaded externally by the user. It could be a JSON file, a CSV file or a text file with no specific data structure.
UPDATE: Here is the paper that describes RDD internals:
Hope this helps.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
RDDs have the following properties –
Immutability and partitioning:
RDDs are composed of a collection of records which are partitioned. A partition is the basic unit of parallelism in an RDD, and each partition is one logical division of data which is immutable and created through some transformations on existing partitions. Immutability helps to achieve consistency in computations.
Users can define their own criteria for partitioning based on keys on which they want to join multiple datasets if needed.
Coarse grained operations:
Coarse-grained operations are operations which are applied to all elements in a dataset. For example, a map, filter or groupBy operation will be performed on all elements in a partition of an RDD.
Fault Tolerance:
Since RDDs are created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph.
For example –
val firstRDD = sc.textFile("hdfs://...")
val secondRDD = firstRDD.filter(someFunction)
val thirdRDD = secondRDD.map(someFunction)
val result = thirdRDD.count()
If we lose some partition of an RDD, we can replay the transformations in the lineage on that partition to achieve the same computation, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computations.
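To make the lineage idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is a toy model under stated assumptions, not how Spark is implemented: the names LineageSketch, Dataset, Source, Mapped and Filtered are made up for illustration, and the in-memory Source only stands in for data in stable storage.

```scala
// Toy model of lineage-based recovery. NOT Spark's implementation;
// every name in this object is hypothetical.
object LineageSketch {
  // A dataset is described by how to compute each partition, not by stored results.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Vector[A]
  }

  // Stands in for data in stable storage (for example a file in HDFS).
  final case class Source[A](partitions: Vector[Vector[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.length
    def compute(partition: Int): Vector[A] = partitions(partition)
  }

  // Each transformation only remembers its parent and the function to apply.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
  }

  val first: Dataset[Int]  = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
  val second: Dataset[Int] = Filtered[Int](first, _ % 2 == 0)
  val third: Dataset[Int]  = Mapped[Int, Int](second, _ * 10)

  // If partition 1 of `third` is lost, replay the lineage for that partition only.
  val recovered: Vector[Int] = third.compute(1)
}
```

Only the lost partition is recomputed, and nothing had to be copied to another node beforehand. That is the trade-off the answer describes: recomputation instead of replication.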
Lazy evaluations:
Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations. So, in the above example, the RDDs will be evaluated only when the count() action is invoked.
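The same behaviour can be seen in plain Scala with an Iterator, which is also lazy. This is only an analogy, assuming that an Iterator pipeline is a fair stand-in for a chain of transformations; LazySketch and its members are hypothetical names, and no Spark API is involved.

```scala
// Analogy for lazy evaluation using a standard-library Iterator (no Spark involved).
object LazySketch {
  // Counts how many elements the map function has actually processed.
  var evaluated = 0

  // filter and map on an Iterator are lazy: they only describe the pipeline.
  def pipeline(data: Seq[Int]): Iterator[Int] =
    data.iterator.filter(_ % 2 == 0).map { x => evaluated += 1; x * 10 }

  val pending: Iterator[Int] = pipeline(Seq(1, 2, 3, 4))
  val before: Int = evaluated             // nothing has run yet
  val result: List[Int] = pending.toList  // plays the role of an action: forces the pipeline
  val after: Int = evaluated              // one evaluation per element that passed the filter
}
```

Because nothing runs until toList is called, filter and map are applied element by element in a single pass, which is the pipelining the answer refers to.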
Persistence:
Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk).
These properties of RDDs make them useful for fast computations.

Resilient Distributed Dataset (RDD) is the way Spark represents data. The data can come from various sources :
Text File
CSV File
JSON File
Database (via JDBC driver)
RDD in relation to Spark
Spark is simply an implementation of RDD.
RDD in relation to Hadoop
The power of Hadoop resides in the fact that it lets users write parallel computations without having to worry about work distribution and fault tolerance. However, Hadoop is inefficient for applications that reuse intermediate results. For example, iterative machine learning algorithms, such as PageRank, K-means clustering and logistic regression, reuse intermediate results.
RDDs allow intermediate results to be stored in RAM. Hadoop would have to write them to an external stable storage system, which generates disk I/O and serialization. With RDDs, Spark is up to 20X faster than Hadoop for iterative applications.
Further implementation details about Spark
Coarse-Grained transformations
The transformations applied to an RDD are coarse-grained. This means that the operations on an RDD are applied to the whole dataset, not to its individual elements. Therefore, operations like map, filter, group, reduce are allowed, but operations like set(i) and get(i) are not.
The inverse of coarse-grained is fine-grained. A fine-grained storage system would be a database.
Fault Tolerant
RDDs are fault tolerant, a property that enables the system to continue working properly in the event of the failure of one of its components.
The fault tolerance of Spark is strongly linked to its coarse-grained nature. The only way to implement fault tolerance in a fine-grained storage system is to replicate its data or log updates across machines. However, in a coarse-grained system like Spark, only the transformations are logged. If a partition of an RDD is lost, the RDD has enough information to recompute it quickly.
Data storage
The RDD is "distributed" (separated) into partitions. Each partition can be present in memory or on the disk of a machine. When Spark wants to launch a task on a partition, it sends the task to the machine containing the partition. This is known as "locality-aware scheduling".
Sources :
Great research papers about Spark :
http://spark.apache.org/research.html
These include the paper suggested by Ewan Leith.

RDD = Resilient Distributed Dataset
Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed
RDD is defined as (from LearningSpark - OREILLY): The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.
This means the data is always available. Also, Spark can run without Hadoop, and hence data is NOT replicated. One of the best characteristics of Hadoop 2.0 is 'High Availability' with the help of a passive standby NameNode. The same is achieved by RDDs in Spark.
A given RDD (data) can span various nodes in a Spark cluster (as in a Hadoop-based cluster).
If any node crashes, Spark can recompute the RDD and load the data on some other node, so the data is always available.
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)

To compare RDDs with Scala collections, below are a few differences:
The same, but runs on a cluster
Lazy in nature, whereas Scala collections are strict
An RDD is always immutable, i.e., you cannot change the state of the data in the collection
RDDs are self-recovering, i.e., fault-tolerant

RDDs (Resilient Distributed Datasets) are an abstraction for representing data. Formally, they are a read-only, partitioned collection of records that provides a convenient API.
RDDs provide a performant solution for processing large datasets on cluster computing frameworks such as MapReduce by addressing some key issues:
data is kept in memory to reduce disk I/O; this is particularly relevant for iterative computations -- not having to persist intermediate data to disk
fault tolerance (resilience) is obtained not by replicating data but by keeping track of all transformations applied to the initial dataset (the lineage). This way, in case of failure, lost data can always be recomputed from its lineage, and avoiding data replication again reduces storage overhead
lazy evaluation, i.e. computations are carried out only when they're needed
RDDs have two main limitations:
they're immutable (read-only)
they only allow coarse-grained transformations (i.e. operations that apply to the entire dataset)
One nice conceptual advantage of RDDs is that they pack together data and code, making it easier to reuse data pipelines.
Sources: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, An Architecture for Fast and General Data Processing on Large Clusters

An RDD is a way of representing data in Spark. The source of data can be a JSON file, a CSV text file or some other source.
An RDD is fault tolerant, which means that it stores data in multiple locations (i.e. the data is stored in distributed form), so if a node fails the data can be recovered.
In an RDD, data is available at all times.
However, RDDs are slow and hard to code against, hence outdated.
They have been replaced by the concepts of DataFrame and Dataset.

RDD
is a Resilient Distributed Dataset.
It is a core part of Spark.
It is a low-level API of Spark.
DataFrames and Datasets are built on top of RDDs.
RDDs are nothing but row-level data that sits on n executors.
RDDs are immutable, which means you cannot change an RDD. But you can create a new RDD using transformations.

Resilient Distributed Datasets (RDDs)
Resilient: If a node performing an operation in Spark is lost, the dataset can be reconstituted from history.
Distributed: Data in RDDs is divided into one or many partitions and distributed as in-memory collections of objects across worker nodes in the cluster.
Dataset: RDDs are datasets that consist of records. Records are uniquely identifiable data collections within a dataset.

Related

How to do simple cache file in Flink-Scala?

I am new to Flink. I am really confused about how to do file caching and load the file into a dataset. I can't find a simple example. Why do we need to create a dataset first in order to call "RichMapFunction"? How do I cache a file that has nothing to do with any other dataset? In the sample I found, it performed a kind of join with another dataset. Thank you.
For the case of joining two datasets where one dataset is small, use broadcast to avoid a shuffle. Without broadcasting, it is a pain to shuffle a large dataset.
E.g. one dataset has 1 billion records and another has 100 records. With broadcast, the small dataset will be distributed to all task managers processing those 1 billion records, so the 1 billion records do not move for the join. Without broadcast, the typical behaviour of a join operation is to shuffle both the 1 billion records and the 100 records so that records with the same key are on the same machine, which is much more expensive than broadcast.
The RichMapFunction provides the open() method and a method to access the RuntimeContext. In open(), the Flink job can get the broadcast dataset through getRuntimeContext().getBroadcastVariable(). The open() function is called only once for each operator, so the broadcast dataset is initialised once and can then be applied to all incoming records. That is the reason to use RichMapFunction instead of MapFunction.
Note - Broadcast applies when the dataset to broadcast is small. You need to create a dataset first and then broadcast it to all operators. Please refer to here for the usage of the API.
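The open-once, map-per-record pattern described above can be sketched without Flink. This is a plain-Scala model under stated assumptions: LookupMapper is a hypothetical stand-in for a rich function and is not Flink's RichMapFunction API, and the two inner sequences stand in for partitions of the large dataset.

```scala
// Plain-Scala model of a broadcast join; no Flink classes are used here.
object BroadcastJoinSketch {
  // Hypothetical stand-in for a rich function: open() runs once, map() runs per record.
  final class LookupMapper(broadcastSide: Seq[(Int, String)]) {
    private var lookup: Map[Int, String] = Map.empty
    var openCalls = 0

    // Build the lookup table once per task from the small, broadcast dataset.
    def open(): Unit = { openCalls += 1; lookup = broadcastSide.toMap }

    // Join one record of the large dataset against the local lookup table.
    def map(record: (Int, Double)): Option[(Int, Double, String)] =
      lookup.get(record._1).map(name => (record._1, record._2, name))
  }

  val small: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b")

  // Two partitions of the large side; their records never leave their partition.
  val largePartitions: Seq[Seq[(Int, Double)]] =
    Seq(Seq((1, 0.5), (3, 1.5)), Seq((2, 2.5), (1, 3.5)))

  // One mapper per partition, each receiving its own copy of the small dataset.
  val mappers: Seq[LookupMapper] = largePartitions.map(_ => new LookupMapper(small))

  val joined: Seq[(Int, Double, String)] =
    largePartitions.zip(mappers).flatMap { case (part, mapper) =>
      mapper.open()
      part.flatMap(r => mapper.map(r).toList)
    }
}
```

The large side is only read in place. The cost is one copy of the small dataset per task, which is why this only pays off when that dataset is small.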
Distributed file caching is for the case where an operation (e.g. a map operation) needs to load an external file once and use it in the operation.
E.g. a trained model is saved on HDFS. The Flink job needs to load the model and apply it to each record. For this case, the Flink job can use the distributed file cache API. The model file will be pulled from HDFS to the local machine, and all tasks running on that machine can share the pulled file locally, which saves network traffic and time.
You do not need to create a dataset for the file to be distributed; use registerCachedFile() instead. For the same reason as with a broadcast dataset, using RichMapFunction allows the Flink job to load and initialise the distributed file once.
Please refer to this document for the usage.

Parallelised collections in Spark

What is the concept of "parallelized collections" in Spark, and how can this concept improve the overall performance of a job? Besides, how should partitions be configured for that?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing in Scala. The basic idea is that when you perform operations like map, filter, etc. on a collection, it is possible to parallelize them using a thread pool. This type of parallelization is called data parallelization because it is based on the data itself. This happens locally in the JVM, and Scala will use as many threads as there are cores available to the JVM.
Spark, on the other hand, is based on RDDs, an abstraction that represents a distributed dataset. Unlike Scala parallel collections, these datasets are distributed across several nodes. Spark is also based on data parallelism, but this time it is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related to data shuffling.
In summary, Spark implements a distributed data parallelism system, so every time you execute a map, filter, etc. you are doing something similar to what a Scala parallel collection would do, but in a distributed fashion. Also, the unit of parallelism in Spark is the partition, while in Scala collections it is each element.
You could always use Scala parallel collections inside a Spark task to parallelize within that task, but you won't necessarily see a performance improvement, especially if your data is already evenly distributed in your RDD and each task needs about the same computational resources.

Understanding Spark partitioning

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like the one in the picture (orange boxes are the stages). The two groupBy operations and the join are supposed to be very heavy if the RDDs are not partitioned.
Is it wise then to apply .partitionBy(new HashPartitioner(properValue)) to P1, P2, P3 and P4 to avoid a shuffle? What's the cost of partitioning an existing RDD? When is it not appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is because you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are distributed already at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you want this number to be 2-3 per core per machine.
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
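As a rough illustration of what hash partitioning by key means, here is a plain-Scala sketch. It models the general idea (the key's hash code modulo the number of partitions, kept non-negative) and is not Spark's HashPartitioner class; HashPartitionSketch, partitionFor and partition are hypothetical names.

```scala
// Plain-Scala model of hash partitioning by key; not Spark's HashPartitioner itself.
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // hashCode can be negative, so the remainder is shifted back into range.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group key-value pairs by the partition their key hashes to.
  def partition[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  val pairs: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b", 5 -> "c", -3 -> "d", 1 -> "e")
  val byPartition: Map[Int, Seq[(Int, String)]] = partition(pairs, 4)
}
```

Both records with key 1 land in the same partition, so a later groupBy or join on that key does not need to move them again. That is the benefit being traded against the upfront shuffle.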
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may choose to coalesce even, which causes an intermediate shuffle, to reduce the number of partitions and potentially leave some machines and cores idle because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that you can preserve partitions during a map transformation with mapPartitions and mapPartitionsWithIndex.
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just push it to another place. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as a part of the join and the subsequent groupBy (unless you keep the same key and use a transformation which preserves partitioning).

How to transform RDD, Dataframe or Dataset straight to a Broadcast variable without collect?

Is there any way (or any plans) to be able to turn Spark distributed collections (RDDs, DataFrames or Datasets) directly into Broadcast variables without the need for a collect? The public API doesn't seem to have anything "out of the box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for these kinds of operations. To explain what I mean in detail, let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
sc.broadcast(myStringPairRdd.collect().toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
This causes all the data to be collected to the driver, then the data is broadcasted. This means the data is sent over the network essentially twice.
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
BONUS
Furthermore, there could be a Monoid-like API (a bit like combineByKey) for situations where the .toMap or whatever operation on Array[T] is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive, this kind of functionality could result in awesome scope for algorithm design. This CPU activity can also be run while the IO is running too - while the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
CLARIFICATION
Joining is not the (main) use case here; it can be assumed that I use the broadcast data structure sparsely. For example, the keys in someOtherRdd by no means cover the keys in myUberMap, but I don't know which keys I need until I traverse someOtherRdd, AND suppose I use myUberMap multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.
While this is an interesting idea in theory, I will argue that, although possible, it has very limited practical applications. Obviously I cannot speak for the PMC, so I cannot say if there are any plans to implement this type of broadcasting mechanism at all.
Possible implementation:
Since Spark already provides a torrent broadcasting mechanism whose behavior is described as follows:
The driver divides the serialized object into small chunks and
stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager.
If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
other executors if available.
Once it gets the chunks, it puts the chunks in its own
BlockManager, ready for other executors to fetch from.
it should be possible to reuse the same mechanism for direct node-to-node broadcasting.
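The chunking step in the quoted description can be pictured with a small plain-Scala sketch. This is an illustration only, not Spark's broadcast code: ChunkSketch, toChunks and reassemble are hypothetical names, and the chunk size of 5 bytes is an arbitrary value chosen to keep the example small.

```scala
// Toy model of splitting a serialized object into blocks and reassembling it.
// Not Spark's broadcast implementation; all names are hypothetical.
object ChunkSketch {
  // Split a serialized payload into fixed-size blocks (the last one may be shorter).
  def toChunks(payload: Array[Byte], chunkSize: Int): Vector[Array[Byte]] =
    payload.grouped(chunkSize).toVector

  // A receiver that has fetched every block, from any peer, can rebuild the payload.
  def reassemble(chunks: Seq[Array[Byte]]): Array[Byte] =
    chunks.foldLeft(Array.emptyByteArray)(_ ++ _)

  val payload: Array[Byte] = "broadcast me".getBytes("UTF-8")
  val chunks: Vector[Array[Byte]] = toChunks(payload, 5)
  val roundTrip: String = new String(reassemble(chunks), "UTF-8")
}
```

Since each block is independent, it does not matter which node a block is fetched from, as long as some node advertises the complete list of blocks.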
It is worth noting that this approach cannot completely eliminate driver communication. Even though blocks could be created locally you still need a single source of truth to advertise a set of blocks to fetch.
Limited applications
One problem with broadcast variables is that they are quite expensive. Even if you can eliminate the driver bottleneck, two problems remain:
Memory required to store deserialized object on each executor.
Cost of transferring broadcasted data to every executor.
The first problem should be relatively obvious. It is not only about direct memory usage but also about GC cost and its effect on overall latency. The second one is rather subtle. I partially covered this in my answer to Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark but let's discuss this further.
From a network traffic perspective, broadcasting a whole dataset is pretty much equivalent to creating a Cartesian product. So if the dataset is large enough for the driver to become a bottleneck, it is unlikely to be a good candidate for broadcasting, and a targeted approach like a hash join can be preferred in practice.
Alternatives:
There are some methods which can be used to achieve results similar to direct broadcast and address the issues enumerated above, including:
Passing data via a distributed file system.
Using a replicated database collocated with worker nodes.
I don't know if we can do it for an RDD, but you can do it for a DataFrame:
import org.apache.spark.sql.{DataFrame, functions}
val df: DataFrame = your_data_frame
val broadcasted_df = functions.broadcast(df)
Now you can use the variable broadcasted_df and it will be broadcast to the executors.
Make sure the broadcasted_df DataFrame is not too big to be sent to the executors.
broadcasted_df will be broadcast in operations like, for example,
other_df.join(broadcasted_df)
and in this case the join() operation executes faster because every executor has one partition of other_df and the whole of broadcasted_df.
As for your question, I am not sure you can do what you want. You cannot use one RDD inside the map() method of another RDD because Spark doesn't allow transformations inside transformations. And in your case you need to call the collect() method to create a map from your RDD, because you can only use an ordinary Map object inside map(); you cannot use an RDD there.

Understanding parallelism in Spark and Scala

I have some confusion about parallelism in Spark and Scala. I am running an experiment in which I have to read many (CSV) files from disk, change/process certain columns and then write them back to disk.
In my experiments, using only SparkContext's parallelize method does not seem to have any impact on performance. However, simply using Scala's parallel collections (through par) reduces the time almost by half.
I am running my experiments in localhost mode with the arguments local[2] for the spark context.
My question is: when should I use Scala's parallel collections and when should I use SparkContext's parallelize?
SparkContext does additional processing in order to support the generality of multiple nodes. This overhead is constant with respect to the data size, so it may be negligible for huge data sets. On 1 node this overhead will make it slower than Scala's parallel collections.
Use Spark when
You have more than 1 node
You want your job to be ready to scale to multiple nodes
The Spark overhead on 1 node is negligible because the data is huge, so you might as well choose the richer framework
SparkContext's parallelize may make your collection suitable for processing on multiple nodes, as well as on multiple local cores of your single worker instance (local[2]), but then again, you probably get too much overhead from running Spark's task scheduler and all that magic. Of course, Scala's parallel collections should be faster on a single machine.
http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#parallelized-collections - are your files big enough to be automatically split into multiple slices? Did you try setting the number of slices manually?
Did you try running the same Spark job on a single core and then on two cores?
Expect the best results from Spark with one really big uniformly structured file, not with multiple smaller files.