What is the difference between transformations and RDD functions in Spark? - scala

I am reading Spark textbooks and I see transformations and actions, and then I also read about RDD functions, so I am confused. Can anyone explain the basic difference between transformations and Spark RDD functions?
Both are used to change the RDD's data contents and return a new RDD, but I want to know the precise explanation.

Spark RDD functions cover both transformations and actions. A transformation is a function that derives a new RDD from the existing data, while an action is a function that does not produce a new RDD but returns a result instead.
For example:
map, filter, union, etc. are all transformations, as they derive new RDDs from the existing data.
reduce, collect, count are all actions, as they return an output rather than a new RDD.
For more info, visit Spark and Jacek.

RDDs support only two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
"RDD functions" is just the generic term the textbook uses for both of these kinds of operations.
For example, MAP is a transformation that passes each dataset element through a function and returns a new RDD representing the results. REDUCE is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

Since Spark's collections are immutable, we can't change the data once an RDD is created.
Transformations are functions that apply to RDDs and produce other RDDs as output (e.g. map, flatMap, filter, join, groupBy, ...).
Actions are functions that apply to RDDs and produce non-RDD output such as an Array, a List, a number, or a side effect (e.g. count, saveAsTextFile, foreach, collect, ...).
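A minimal sketch of the distinction; the local SparkContext setup and sample data are assumed for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))

val nums = sc.parallelize(Seq(1, 2, 3, 4))   // RDD[Int]

// Transformations: lazily describe a new RDD, nothing runs yet
val doubled = nums.map(_ * 2)                // RDD[Int]
val evens = doubled.filter(_ % 4 == 0)       // RDD[Int]

// Actions: trigger the computation and return a non-RDD value to the driver
val total = evens.reduce(_ + _)              // Int
val asArray = evens.collect()                // Array[Int]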

Related

Does it help to persist data between transformations in Scala Spark?

One
First I read tweets and parse them into a tweet case class by mapping my parsing function parseTweet over them:
val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
Two
Then I use a function, toPairRdd, that maps the tweets into a pair RDD of the form (hashtag, likes):
val pairedRDD = toPairRdd(tweets).persist()
Question
After reading in my RDD in (one) above, does it help to persist it, given that what follows in (two) is a transformation? I am thinking that, since both are lazy, persisting is actually a waste of memory.
Three
After computing the pair RDD, I want to compute scores for each hashtag; toScores uses reduceByKey:
val scores = toScores(pairedRDD).persist()
Question
I use reduceByKey. Does this pairRDD method result in shuffling? I have read a paper that states:
"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD.
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin,
groupByKey, reduceByKey, combineByKey, distinct, intersection,
repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"
The same paper also states that reduceByKey follows the same partition as the original RDD.
It matters to use persist (in memory, on disk, or both) when you have many actions that would otherwise re-run the same set of transformations each time, and when those transformations take too long to recompute again and again.
In your case no persist or caching is required, as it is a one-pass process. You should know that stages are generated by putting as many transformations together as possible before a shuffle; you would have 2 stages here.
If you were to process some other data requirements using the pairedRDD, then persist would be advisable.
The actions are more relevant in any event.
If you have multiple actions using the same RDD then it's advisable to persist. I don't see any action so far in your code, so I don't see any reason to cache the RDD. Persist/cache is also lazily evaluated.
Persist/cache: it is not guaranteed that the data will stay around for the lifetime of the execution, since persisting follows an LRU (least recently used) policy that may evict the least recently used RDDs when memory is full. Keep all of this in mind while using persist.
reduceByKey: it's a wide transformation, so a shuffle may happen. But it first combines the data per key within each partition and only then performs the reduce across partitions, so it's less costly. Always prefer it to groupByKey, which shuffles the data directly without combining it per key inside each partition; avoid groupByKey in your code.
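As a minimal sketch of when persist pays off, reusing the tweets / toPairRdd / toScores names from the question and adding a hypothetical second action:

import org.apache.spark.storage.StorageLevel

val pairedRDD = toPairRdd(tweets)                 // (hashtag, likes)

// Persist only because the two independent actions below would otherwise each
// re-read the input file and re-run the parsing and pairing transformations.
pairedRDD.persist(StorageLevel.MEMORY_ONLY)

val scores = toScores(pairedRDD)                  // reduceByKey inside, per the question
scores.collect()                                  // action 1
val distinctHashtags = pairedRDD.keys.distinct().count()   // action 2 (hypothetical)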

Why in Scio do you prefer aggregate over groupByKey?

From:
https://github.com/spotify/scio/wiki/Scio-data-guideline
"Prefer combine/aggregate/reduce transforms over groupByKey. Keep in mind that a reduce operation must be associative and commutative."
Why in particular would one prefer an aggregate over a groupByKey?
Combine, aggregation, and reduce transforms are preferred over groupByKey because the former are more memory efficient during pipeline execution. This is due to the implementation of the primitive GroupByKey and Combine transforms in Apache Beam. The answer to this question isn't necessarily specific to Scio.
GroupByKey requires that all key-value pairs for a window remain in memory at once, which could result in OutOfMemoryErrors. Scio's groupByKey uses Beam's primitive GroupByKey transform.
Aggregations remove the need to hold all values in memory because values are continually combined/reduced during the execution of the transform. Values are combined/reduced in a non-deterministic order, which is why all combine/reduce operations must be associative. Scio's implementation of aggregateByKey uses Beam's primitive Combine transform.
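A minimal Scio sketch of the preferred pattern, assuming an aggregateByKey signature that mirrors Spark's (a zero value plus per-element and merge functions); the sample data is made up:

import com.spotify.scio._

object HashtagSums {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)
    sc.parallelize(Seq(("scala", 1), ("beam", 2), ("scala", 3)))
      // values are combined incrementally as the transform runs, so no worker
      // has to hold every value for a key in memory at once
      .aggregateByKey(0)(_ + _, _ + _)
      .debug()   // prints ("scala", 4) and ("beam", 2), in no particular order
    sc.run()
  }
}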
References:
1. Scio groupByKey
2. Scio aggregateByKey
3. Apache Beam GroupByKey
4. Apache Beam Combine
5. Google Cloud Dataflow Combine

Improve groupby operation in Spark 1.5.2

We are facing poor performance using Spark.
I have 2 specific questions:
When debugging we noticed that a few of the groupby operations done on Rdd are taking more time
Also a few of the stages are appearing twice, some finishing very quickly, some taking more time
Currently we are running locally, with shuffle partitions set to 2 and the number of partitions set to 5; the data is around 100,000 records.
Speaking of the groupBy operation, we are grouping a DataFrame (which is the result of several joins) on two columns, and then applying a function to get some result.
val groupedRows = rows.rdd.groupBy(row => (
row.getAs[Long](Column1),
row.getAs[Int](Column2)
))
val rdd = groupedRows.values.map(Criteria)
where Criteria is some function applied to the grouped rows. Can we optimize this groupBy in any way?
I would suggest that you not convert the existing DataFrame to an RDD for the complex processing you are performing.
If you want to apply the Criteria function after grouping on two columns (Column1 and Column2), you can do this directly on the DataFrame. It is best if your Criteria can be expressed as a combination of built-in functions, but you can always use UDFs for custom rules.
What I would suggest is to groupBy on the DataFrame and apply aggregation functions:
rows.groupBy("Column1", "Column2").agg(Criteria function)
You can use window functions if you want multiple rows back from the grouped DataFrame; more info here.
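As a minimal sketch of that DataFrame route (the value column and the built-in aggregations stand in for the real Criteria logic):

import org.apache.spark.sql.functions._

// hypothetical schema: Column1: Long, Column2: Int, value: Double
val result = rows
  .groupBy("Column1", "Column2")
  .agg(
    sum("value").as("total"),      // built-in aggregations do partial (map-side)
    avg("value").as("average"),    // aggregation before the shuffle
    count(lit(1)).as("n")
  )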
.groupBy is known not to be the most efficient approach:
Note: This operation may be very expensive. If you are grouping in
order to perform an aggregation (such as a sum or average) over each
key, using PairRDDFunctions.aggregateByKey or
PairRDDFunctions.reduceByKey will provide much better performance.
Sometimes it is better to use .reduceByKey or .aggregateByKey, as explained here:
While both of these functions will produce the correct answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
Why do .reduceByKey and .aggregateByKey work faster than .groupBy? Because part of the aggregation happens during the map phase, so less data is shuffled between worker nodes during the reduce phase. Here is a good explanation of how aggregateByKey works.
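A minimal sketch of the difference at the RDD level, assuming sc is an existing SparkContext and the sample pairs are made up:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey ships every (key, value) pair across the network, and the
// grouped values are only summed after the shuffle
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey first sums values per key inside each partition (map-side combine),
// so only one partial sum per key per partition is shuffled
val viaReduce = pairs.reduceByKey(_ + _)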

Confusion about spark streaming's transform function

I am a bit confused about the transform function of a DStream. For example, if I have the following.
val statusesSorted = statuses.transform(rdd => rdd.sortByKey())
Would the whole DStream be sorted by key, or would the individual RDDs inside the DStream be sorted separately? If it is the latter, how can I sort the whole DStream by key?
The transform function in Spark allows you to perform any Spark transformation on the RDDs within your DStream.
The map transform performs a similar operation but on an element-by-element basis, whereas the transform operation on a DStream lets you apply a transformation to each complete RDD.
To answer your questions,
Would the whole DStream be sorted by key or the individual RDDs inside
the DStream would be sorted separately.
It will sort the individual RDDs in your DStream.
If that is indeed the case, how can I sort keys of the whole DStream.
To answer this, understand that Spark Streaming processes one batch at a time, and the records in a batch form one RDD. So sorting the records within a batch (i.e. within an RDD) makes sense, because they are the data for that computation. Sorting across the whole, unbounded DStream is not meaningful.
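A minimal sketch, assuming a socket text stream and word-count style pairs; transform sorts each batch's RDD independently, so there is no ordering across batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("sorted-batches").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val counts = ssc.socketTextStream("localhost", 9999)   // hypothetical source
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// sortByKey is applied to the RDD of each 10-second micro-batch separately
val sortedPerBatch = counts.transform(rdd => rdd.sortByKey())
sortedPerBatch.print()

ssc.start()
ssc.awaitTermination()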

What is the efficient way to update value inside Spark's RDD?

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but in each iteration I only edit a portion of it, namely a subtree rooted at a given node and the nodes on the path between that node and the root.
The iterations have dependencies, which means iteration i+1 needs the result from iteration i, so I need to store the result of each iteration for the next step.
I'm trying to find an efficient way to update an RDD, but have no clue so far. I found that pair RDDs have a lookup function which could reduce the computation time from O(N) to O(M), where N denotes the total number of objects in the RDD and M denotes the number of elements in each partition.
So I'm wondering: is there any way I could update an object in the RDD in O(M)? Or, more ideally, O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)
Another thing: if I could achieve O(M) for updating the RDD, could I increase the number of partitions beyond the number of cores I have and achieve better performance?
As functional data structures, RDDs are immutable and an operation on an RDD generates a new RDD.
Immutability of the structure does not necessarily mean full replication. Persistent data structures are a common functional pattern in which operations on immutable structures yield a new structure while previous versions are maintained and often reused.
GraphX (a 'module' on top of Spark) is a graph API that uses exactly this concept. From the docs:
Changes to the values or structure of the graph are accomplished by
producing a new graph with the desired changes. Note that substantial
parts of the original graph (i.e., unaffected structure, attributes,
and indices) are reused in the new graph reducing the cost of this
inherently functional data-structure.
It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
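A minimal GraphX sketch of that idea, with a toy three-node tree and a made-up attribute update; sc is assumed to be an existing SparkContext:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 0), (2L, 0), (3L, 0)))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1)))
val graph = Graph(vertices, edges)

// "Updating" a vertex attribute produces a new Graph value; the unaffected
// structure and indices are shared with the original graph rather than copied
val updated = graph.mapVertices((id, attr) => if (id == 2L) attr + 1 else attr)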
An RDD is a distributed data set; a partition is the unit of RDD storage, and an element is the unit for processing an RDD.
For example, if you read a large file from HDFS as an RDD, then the elements of this RDD are Strings (the lines of that file), and Spark stores this RDD across the cluster by partition. As a Spark user, you only need to care about how to deal with the lines of that file, just as if you were writing a normal program reading a file from the local file system line by line. That's the power of Spark :)
Anyway, you have no idea which elements will be stored in a given partition, so it doesn't make sense to update a particular partition.
The MapReduce programming model (and FP) doesn't really support updates of single values. Rather one is supposed to define a sequence of transformations.
Now, when you have interdependent values, i.e. you cannot perform your transformation with a simple map but need to aggregate multiple values and update based on them, then what you need to do is think of a way of grouping those values together and transforming each group, or define a monoidal operation so that the work can be distributed and chopped up into sub-steps.
Group By Approach
Now I'll try to be a little more specific for your particular case. You say you have subtrees; is it possible to first map each node to a key that indicates the corresponding subtree? If so, you could do something like this:
nodes.map(n => (getSubTreeKey(n), n)).groupByKey().map ...
Monoid
(Strictly speaking you want a commutative monoid.) It is best to read http://en.wikipedia.org/wiki/Monoid#Commutative_monoid
For example, + is a monoidal operation: when one wishes to compute the sum of, say, an RDD of Ints, the underlying framework can chop the data into chunks, compute the sum of each chunk, then sum up the resulting sums (possibly in more than just two steps). If you can find a monoid that ultimately produces the same results you require from single updates, then you have a way to distribute your processing. E.g.
nodes.reduce(_ myMonoid _)
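As a minimal sketch of reducing with a commutative monoid (the SubtreeStats type and merge operation are hypothetical, and sc is assumed to be an existing SparkContext):

// a per-subtree summary maintained via aggregation instead of point updates
case class SubtreeStats(nodeCount: Long, totalWeight: Double)

// associative and commutative combine: the monoid operation, so Spark can
// merge partial results within and across partitions in any order
def merge(a: SubtreeStats, b: SubtreeStats): SubtreeStats =
  SubtreeStats(a.nodeCount + b.nodeCount, a.totalWeight + b.totalWeight)

val stats = sc.parallelize(Seq(SubtreeStats(1, 0.5), SubtreeStats(1, 2.0), SubtreeStats(1, 1.5)))
val combined = stats.reduce(merge)   // SubtreeStats(3, 4.0)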