I am a newbie to Spark, and I have observed that there are cases where a window-function approach and a groupBy approach are alternatives to each other. Here I want to understand which one is better from a performance perspective, and why. Both approaches cause re-shuffling of data, but under which scenarios will one be more effective than the other?
From my understanding, groupBy is more performant because it uses partial aggregation. With groupBy, not all records are shuffled, only the partial aggregates (for avg, for example, those would be the sum and the count).
A window function, on the other hand, will always shuffle your records, with the aggregation done afterwards, and should therefore be slower.
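As a minimal illustration of partial aggregation, here is a plain-Scala sketch of the idea (not Spark code; `partialAvg` is a made-up name):

```scala
// Each "partition" is first reduced locally to a small (sum, count) pair;
// only those pairs would need to be shuffled, then merged and finished.
def partialAvg(partitions: Seq[Seq[Double]]): Double = {
  // map side: one (sum, count) per partition
  val partials = partitions.map(p => (p.sum, p.size.toLong))
  // reduce side: merge the small partials, then finish the average
  val (sum, count) = partials.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  sum / count
}
```

A window function has no such shortcut: every row of a group must be moved to the same place before the aggregate can be attached to each row.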
But in reality the choice is rarely groupBy vs. window functions, since in most cases you will need to combine the groupBy result with a join back to the original data (which can be expensive unless you can use a broadcast join), and quite often you cannot express the logic with groupBy at all (running sum/average, lead/lag, etc.).
But unfortunately, there is very little (official) literature on topics like this...
Related
I am new to PySpark. I read somewhere: "By applying bucketing on the convenient columns in the data frames before shuffle required operations, we might avoid multiple probable expensive shuffles. Bucketing boosts performance by already sorting and shuffling data before performing sort-merge joins."
So I am keen to know how we can "avoid multiple probable expensive shuffles" by bucketing two dataframes before a join.
Here's a great article that helps you understand bucketing and sorting.
Basically it comes down to this: you "pre-chew" your data so that it is easy to join. You do this by creating table definitions with CLUSTERED BY and BUCKET. If you regularly join two tables, using identical CLUSTERED BY/bucketing on both tables will enable very fast joins between them (matching buckets can be mapped to the same reducer, which speeds up the join).
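To make the "same reducer" idea concrete, here is a toy plain-Scala model of a bucketed join (the names `bucketOf`, `bucketize`, and `bucketJoin` are made up; real bucketing also writes the buckets out pre-sorted):

```scala
// Both tables are hashed into the same number of buckets on the join key,
// so equal keys always land in the same bucket number on both sides and
// each bucket pair can be joined locally, without a shuffle.
def bucketOf(key: Int, numBuckets: Int): Int =
  math.floorMod(key.hashCode, numBuckets)

def bucketize[A](rows: Seq[(Int, A)], numBuckets: Int): Map[Int, Seq[(Int, A)]] =
  rows.groupBy { case (k, _) => bucketOf(k, numBuckets) }

// Join bucket by bucket: only rows in the same bucket are ever compared.
def bucketJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)],
                     numBuckets: Int): Seq[(Int, (A, B))] = {
  val lb = bucketize(left, numBuckets)
  val rb = bucketize(right, numBuckets)
  (0 until numBuckets).toSeq.flatMap { b =>
    for {
      (k, a)  <- lb.getOrElse(b, Seq.empty)
      (k2, c) <- rb.getOrElse(b, Seq.empty)
      if k == k2
    } yield (k, (a, c))
  }
}
```

In Spark itself, the analogous setup is writing both tables with the DataFrame writer's bucketBy (and optionally sortBy) on the join key before joining them.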
I am new to Spark and Scala. I was reading up on Spark's distinct() function, but I could not find any proper details. I have a few doubts which I could not resolve and have written them down.
How is distinct() implemented in Spark?
I am not good enough with the Spark source code to be able to identify the whole flow.
When I check the execution plan, I can only see a ShuffledRDD.
What is the time complexity of distinct?
From Googling, I also found that it uses hashing and sorting in some way.
So I wondered whether it uses the same principle as getting unique elements from an array with the help of a HashSet.
If it were a single system, I would have guessed that the time complexity is O(n log n).
But since the data is distributed among many partitions and shuffled, what would the time complexity be?
Is there a way to avoid shuffling in particular cases?
If I make sure to partition my data properly for my use case, can I avoid shuffling?
For example, say exploding an ArrayType column in a dataframe with unique rows creates new rows with the other columns duplicated, and I then select those other columns.
In this way I have made sure the duplicates are unique per partition.
Since I know the duplicates are unique per partition, can I avoid the shuffle and simply drop the duplicates within each partition?
I also found this: Does spark's distinct() function shuffle only the distinct tuples from each partition.
Thanks for your help.
Please correct me if I am wrong anywhere.
How is distinct() implemented in Spark?
By applying a dummy aggregation with a None value. Roughly:
rdd.map((_, None)).reduceByKey((a, b) => a).map(_._1)
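The same trick can be run on a plain Scala collection (a toy analog, not Spark's actual implementation; here `groupBy` stands in for the shuffle):

```scala
// Pair every element with a dummy value, keep one pair per key,
// then drop the dummy again - exactly the shape of the RDD version.
def distinctViaReduceByKey[A](xs: Seq[A]): Seq[A] =
  xs.map(x => (x, None))
    .groupBy(_._1)                                  // stands in for the shuffle
    .map { case (_, vs) => vs.reduce((a, b) => a) } // keep one copy per key
    .map(_._1)                                      // drop the dummy value
    .toSeq
```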
What is the time complexity of distinct?
Given the overall complexity of the process, it is hard to estimate. It is at least O(N log N), as the shuffle requires a sort, but given the multiple other operations required to build additional off-core data structures (including associative arrays) and to serialize/deserialize the data, it can be higher; in practice it is dominated by IO operations, not pure algorithmic complexity.
Is there a way to avoid shuffling in particular cases?
Yes, if potential duplicates are guaranteed to be placed on the same partition.
You can use mapPartitions to deduplicate the data, especially if the data is sorted or otherwise guaranteed to have its duplicates in an isolated neighborhood. Without that you may be limited by memory requirements, unless you accept approximate results with a probabilistic filter (like a Bloom filter).
In general, though, it is not possible, and an operation like this will be non-local.
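Assuming the duplicates really are co-located, the per-partition deduplication could be sketched like this in plain Scala (`dedupPartition` is a made-up name; in Spark you would pass such a function to mapPartitions):

```scala
import scala.collection.mutable

// Deduplicate one partition locally; no data crosses partitions.
def dedupPartition[A](partition: Iterator[A]): Iterator[A] = {
  val seen = mutable.HashSet.empty[A]
  partition.filter(seen.add) // add returns false for elements already seen
}

// Applied independently to every partition - note that a value appearing
// in two partitions survives in both, which is why the co-location
// guarantee is essential.
def dedupAll[A](partitions: Seq[Seq[A]]): Seq[Seq[A]] =
  partitions.map(p => dedupPartition(p.iterator).toList)
```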
One
First I read the tweets and parse them into a tweet case class, mapping over my parsing function parseTweet:
val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
Two
Then I use a function that turns this into a pair RDD of the form (hashtags, likes), through a map inside toPairRdd:
val pairedRDD = toPairRdd(tweets).persist()
Question
After reading in my RDD in (one) above, does it help to persist it, given that what follows in (two) is a transformation? I am thinking that, since both are lazy, persisting is actually a waste of memory.
Three
After computing the pairRDD, I want to compute the score of each hashtag: toScores uses reduceByKey:
val scores = toScores(pairedRDD).persist()
Question
I use reduceByKey. Does this pair-RDD method result in shuffling? I have read a paper that states:
"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD.
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin,
groupByKey, reduceByKey, combineByKey, distinct, intersection,
repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"
The same paper also states that reduceByKey follows the same partitioning as the original RDD.
It matters to use persist (in memory / on disk / both) when you have many actions that repeatedly perform the same transformations, and when it takes too long to recompute them again and again.
In your case no persist or caching is required, as it is a one-pass process. You need to know that stages are generated by putting as many transformations together as possible before a shuffle; you would have 2 here.
If you were to process some other data requirements using the pairedRDD, then persist would be advisable.
The actions are more relevant in any event.
If you have multiple actions using the same RDD, then it's advisable to persist. I don't see any action in your code so far, so I don't see any reason to cache the RDD. Persist/cache is also lazily evaluated.
Persist/cache - it is not guaranteed that the data will stay around for the lifetime of the execution: persisting follows an LRU (least recently used) policy, which may flush data based on the least recently used RDD when memory is full. All of this needs to be kept in mind while using persist.
Reducebykey - it's a wide transformation, as a shuffle may happen. But it first combines the data per key inside each partition and only then performs the reduce, so it's less costly. Always avoid groupByKey, which shuffles the data directly without combining it per key within a partition. Please avoid groupByKey while coding.
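A toy plain-Scala count of how many records would cross the network under the two strategies (the function name is illustrative):

```scala
// With reduceByKey-style map-side combining, each partition first collapses
// its records per key, so only one combined record per (partition, key)
// is shuffled; groupByKey-style shuffling moves every single record.
def recordsShuffled[K](partitions: Seq[Seq[(K, Int)]], combineFirst: Boolean): Int =
  if (combineFirst)
    partitions.map(p => p.groupBy(_._1).size).sum // one record per key per partition
  else
    partitions.map(_.size).sum                    // every record crosses the network
```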
We are facing poor performance using Spark.
I have 2 specific questions:
When debugging, we noticed that a few of the groupBy operations done on RDDs are taking more time.
Also, a few of the stages appear twice; some finish very quickly, while others take more time.
Here is a screenshot (image not included).
Currently running locally, with shuffle partitions set to 2 and the number of partitions set to 5; the data is around 100,000 records.
Speaking of the groupBy operation, we are grouping a dataframe (which is the result of several joins) on two columns and then applying a function to get some result.
val groupedRows = rows.rdd.groupBy(row => (
row.getAs[Long](Column1),
row.getAs[Int](Column2)
))
val rdd = groupedRows.values.map(Criteria)
Where Criteria is some function applied to the grouped resulting rows. Can we optimize this groupBy in any way?
Here is a screenshot (image not included).
I would suggest that you not convert the existing dataframe to an RDD to perform this complex processing.
If you want to apply the Criteria function to two columns (Column1 and Column2), you can do this directly on the dataframe. Moreover, if your Criteria can be reduced to a combination of built-in functions, that would be ideal; but you can always use udf functions for custom rules.
What I would suggest is to groupBy on the dataframe and apply aggregation functions:
rows.groupBy("Column1", "Column2").agg(Criteria function)
You can use Window functions if you want multiple rows from the grouped dataframe. More info here.
.groupBy is known not to be the most efficient approach:
Note: This operation may be very expensive. If you are grouping in
order to perform an aggregation (such as a sum or average) over each
key, using PairRDDFunctions.aggregateByKey or
PairRDDFunctions.reduceByKey will provide much better performance.
Sometimes it is better to use .reduceByKey or .aggregateByKey, as explained here:
While both of these functions will produce the correct answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
Why do .reduceByKey and .aggregateByKey work faster than .groupBy? Because part of the aggregation happens during the map phase, so less data is shuffled around the worker nodes during the reduce phase. Here is a good explanation of how aggregateByKey works.
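The seqOp/combOp shape of aggregateByKey can be mimicked on plain collections. This sketch (not Spark code; the name `aggregateByKeyAvg` and the average example are just for illustration) accumulates (sum, count) per key within each "partition" and only merges those small accumulators afterwards:

```scala
// Per-key average: seqOp folds one value into a partition-local accumulator,
// combOp merges accumulators coming from different partitions.
def aggregateByKeyAvg[K](partitions: Seq[Seq[(K, Double)]]): Map[K, Double] = {
  val zero = (0.0, 0L)
  val seqOp  = (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1)
  val combOp = (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)

  // map side: one small (sum, count) accumulator per key per partition
  val perPartition: Seq[Map[K, (Double, Long)]] =
    partitions.map(_.groupBy(_._1).map { case (k, vs) =>
      k -> vs.map(_._2).foldLeft(zero)(seqOp)
    })

  // "shuffle" only the accumulators, merge them, then finish the average
  perPartition.flatten
    .groupBy(_._1)
    .map { case (k, accs) =>
      val (sum, count) = accs.map(_._2).reduce(combOp)
      k -> sum / count
    }
}
```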
I'm using enumerators/iteratees from the Play framework.
I have several enumerators that each provide a sorted sequence of values. I want to write an Iteratee/Enumeratee that merges the values from these enumerators into a single sorted sequence of all the values.
Is it a good idea to use an Iteratee, or should I implement an Enumeratee directly?
I know that I can zip values from the enumerators, reconstruct their data streams in memory, and then merge that data.
But I'm wondering if there is a way to implement a "classic" merge sort: "read" the first value from all enumerators, select the minimal value, and then let the enumerator that provided it read another value (while the other enumerators are on hold). As a result, I want an enumeratee that provides the resulting sorted sequence without storing all the streams in memory. I would also like to follow a functional style and keep everything immutable.
Thanks for ideas.
You will still need to do some insertion sorting in a standard collection in memory. Imagine this pathological case:
Enumerator(3, 2, 1) and Enumerator(4, -1 , -2, -3)
Here you cannot just take the smallest element and tack it onto the end of your collection; you will have to put values at arbitrary places in the collection as you go. This is part of what makes comparison sorting fundamentally O(n log n): you have to know the full universe of what you have to sort in order to do it any faster than that. (Bucket sort is a linear-time sorting algorithm, assuming you know the distribution of the values you are trying to sort.)
To address your question more specifically:
The enumerator/iteratee library isn't really expressive enough for your use case. If you want to merge enumerators, you can use Enumerator.interleave and do some insertion sorting in your Iteratee with whatever elements come in first.
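For inputs that really are sorted, the "take the smaller head, then pull the next value from that source" loop the question describes can be sketched on plain LazyLists (logic only, not wired into enumerators/iteratees):

```scala
// Lazily merge two sorted streams; nothing is materialized up front and
// each step inspects only the two current head elements.
def mergeSorted[A](xs: LazyList[A], ys: LazyList[A])
                  (implicit ord: Ordering[A]): LazyList[A] =
  if (xs.isEmpty) ys
  else if (ys.isEmpty) xs
  else if (ord.lteq(xs.head, ys.head)) xs.head #:: mergeSorted(xs.tail, ys)
  else ys.head #:: mergeSorted(xs, ys.tail)
```

For more than two sources, fold the merge over the sequence of streams. This only works when each input is sorted; for the pathological unsorted case above, insertion into a buffer is unavoidable.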
If this mechanic is important to you, you could consider the recently released Akka Streams, with which you can implement a custom FlexiMerge push/pull stage that will allow you to do what you seek.