Spark can read through a dataset of a billion rows in about 4 seconds, but counting the distinct values in a DataFrame is slow and inefficient: it takes more than 5 minutes even for a small set of data. I have tried these approaches:
value1 = df.where(df['values'] == '1').count()
or
df.groupBy("values").count().orderBy("count").show()
Both return the correct result, but time is of the essence here.
I know that count() is an action that triggers a full pass over the data, but is there an alternate approach to solve this problem?
TIA
count() is a function that makes Spark literally count through the rows. Operations like count() and distinct() inevitably take time due to the nature of what they compute, and are generally advised against in a distributed environment.
Spark's data structures do not maintain indexes, so count() effectively requires a full scan of the data.
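If an exact figure is not strictly required, or the same DataFrame is counted more than once, the cost of these full scans can be reduced. A minimal sketch in Scala (the source path is a placeholder, not the asker's data; the same calls exist in PySpark):

import org.apache.spark.sql.functions.{approx_count_distinct, desc}

val df = spark.read.parquet("/path/to/data") // placeholder source
df.cache() // repeated counts reuse the cached data instead of rescanning the source

// Per-value counts, as in the asker's second attempt (the result column is named "count").
df.groupBy("values").count().orderBy(desc("count")).show()

// If only the number of distinct values is needed, a HyperLogLog-based estimate is
// far cheaper than an exact distinct count, at the cost of a small relative error.
df.agg(approx_count_distinct("values", 0.01)).show()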
Related
I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for SELECT COUNT(*) FROM table. I need this and other similar counts for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are 10 petabytes of data, for example, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which resolves the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over count) and increments a counter. This is not going to finish within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work, query the system.size_estimates table of each node, and split each range according to its size such that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a TOKEN restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans for the SELECT * RDDs, so you just have to replicate that.
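For illustration, a rough sketch of that token-range approach using the DataStax Java driver from Scala. The keyspace, table, and partition-key names are placeholders, and the sketch splits the token range evenly rather than sizing the splits from system.size_estimates, so treat it as an outline only:

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

val numSplits = 1000
val fullRange = BigInt(Long.MaxValue) - BigInt(Long.MinValue)
val step = fullRange / numSplits

val total = (0 until numSplits).map { i =>
  val lo = (BigInt(Long.MinValue) + step * i).toLong
  val hi =
    if (i == numSplits - 1) Long.MaxValue
    else (BigInt(Long.MinValue) + step * (i + 1)).toLong
  // Each sub-query is restricted to one token range, so no single node has to
  // materialize the whole table for a COUNT(*).
  session
    .execute(s"SELECT COUNT(*) FROM ks.table WHERE token(pk) > $lo AND token(pk) <= $hi")
    .one()
    .getLong(0)
}.sum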
The easiest, and probably safer and more accurate (but less efficient), option is to use Spark to read the entire data set and then count it, without pushing down an aggregation function.
How long does it take to run this query directly, without Spark? I don't think it is possible to parallelize COUNT queries like this, so you won't benefit from using Spark to perform them.
I am new to Spark and Scala. I was reading up on Spark's distinct() function but could not find proper details. I have a few doubts which I could not resolve and have written them down.
How is distinct() implemented in Spark?
I am not familiar enough with the Spark source code to trace the whole flow. When I check the execution plan, I can only see a ShuffledRDD.
What is the time complexity of distinct()?
From searching around, I also found that it uses hashing and sorting in some way, so I wondered whether it uses the same principle as getting unique elements from an array with the help of a HashSet. On a single machine, I would have guessed the time complexity is O(n log n). But since the data is distributed across many partitions and shuffled, what would the order of time complexity be?
Is there a way to avoid shuffling in particular cases?
If I make sure to partition my data properly for my use case, can I avoid shuffling? For example, exploding an ArrayType column of a dataframe with unique rows creates new rows with the other columns duplicated, and I then select those other columns. In this way I have made sure the duplicates are unique per partition. Since I know the duplicates are unique per partition, I should be able to avoid the shuffle and simply drop duplicates within each partition.
I also found this: Does spark's distinct() function shuffle only the distinct tuples from each partition?
Thanks for your help, and please correct me if I am wrong anywhere.
How is distinct() implemented in Spark?
By applying a dummy aggregation with a None value. Roughly:
rdd.map((_, None)).reduceByKey((a, b) => a).keys
What is the time complexity of distinct()?
Given the overall complexity of the process, it is hard to estimate. It is at least O(N log N), as the shuffle requires a sort, but given the multiple other operations required, building additional off-core data structures (including associative arrays) and serializing/deserializing the data, it can be higher, and in practice it is dominated by I/O, not pure algorithmic complexity.
Is there a way to avoid shuffling in particular cases?
Yes, if potential duplicates are guaranteed to be placed on the same partition.
You can use mapPartitions to deduplicate the data, especially if the data is sorted or otherwise guaranteed to have its duplicates in an isolated neighborhood (a sketch is shown below). Without this you may be limited by memory requirements, unless you accept approximate results with a probabilistic filter (such as a Bloom filter).
In general, though, it is not possible, and an operation like this will be non-local.
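For the partition-local case described above, a minimal sketch with mapPartitions might look like this. It is only equivalent to distinct() when duplicates are already guaranteed to share a partition, for example because the RDD was partitioned by the deduplication key:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def partitionLocalDistinct[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions { iter =>
    val seen = scala.collection.mutable.HashSet.empty[T]
    iter.filter(seen.add) // `add` returns true only the first time an element is seen
  }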
We are facing poor performance using Spark.
I have 2 specific questions:
When debugging, we noticed that a few of the groupBy operations done on RDDs are taking more time.
Also, a few of the stages appear twice; some finish very quickly, while others take more time.
We are currently running locally, with shuffle partitions set to 2 and the number of partitions set to 5; the data is around 100,000 records.
Regarding the groupBy operation, we are grouping a dataframe (which is the result of several joins) on two columns and then applying a function to get some result.
val groupedRows = rows.rdd.groupBy(row => (
  row.getAs[Long](Column1),
  row.getAs[Int](Column2)
))
val rdd = groupedRows.values.map(Criteria)
where Criteria is some function applied to the grouped rows. Can we optimize this groupBy in any way?
I would suggest that you not convert the existing dataframe to an RDD to do the complex processing you are performing.
If you want to apply the Criteria function to two columns (Column1 and Column2), you can do this directly on the dataframe. Moreover, if your Criteria can be reduced to a combination of built-in functions, that would be ideal; otherwise you can always use udf functions for custom rules.
What I would suggest is a groupBy on the dataframe followed by aggregation functions:
rows.groupBy("Column1", "Column2").agg(Criteria function)
You can use Window functions if you want multiple rows from the grouped dataframe; more info here.
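For illustration, here is a sketch of both options; "someValue" and the aggregate functions used are only stand-ins for the asker's Criteria, which is not shown in the question:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Plain aggregation: one output row per (Column1, Column2) group.
val aggregated = rows
  .groupBy("Column1", "Column2")
  .agg(count(lit(1)).as("n"), avg("someValue").as("avgValue"))

// Window alternative: keep every input row and attach the per-group result.
val byGroup = Window.partitionBy("Column1", "Column2")
val withGroupAvg = rows.withColumn("avgValue", avg(col("someValue")).over(byGroup))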
.groupBy is known not to be the most efficient approach:
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
Sometimes it is better to use .reduceByKey or .aggregateByKey, as explained here:
While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
Why do .reduceByKey and .aggregateByKey work faster than .groupBy? Because part of the aggregation happens during the map phase, so less data is shuffled between worker nodes during the reduce phase. Here is a good explanation of how aggregateByKey works.
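For example, if Criteria could be reduced to something like a per-group count, the RDD version might look like the sketch below. The count is only a stand-in, since the real Criteria is not shown; Column1 and Column2 are the column-name constants from the question:

import org.apache.spark.rdd.RDD

val keyed: RDD[((Long, Int), Long)] = rows.rdd.map { row =>
  ((row.getAs[Long](Column1), row.getAs[Int](Column2)), 1L)
}

// reduceByKey merges values inside each partition before the shuffle, so only
// one partial result per key per partition crosses the network, whereas
// groupBy/groupByKey ships every row of a group to a single reducer.
val perGroupCounts: RDD[((Long, Int), Long)] = keyed.reduceByKey(_ + _)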
I'm currently developing a Spark application that uses dataframes to compute and aggregate specific columns from a Hive table.
Aside from using the count() function on dataframes/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a dataframe?
I just need to know whether there is a specific function I need to override or something along those lines.
Any replies would be appreciated. I'm currently using Apache Spark 1.6.
Thank you.
Aside from using the count() function on dataframes/RDDs, is there a more optimal approach to get the number of records processed or the record count of a dataframe?
Nope. Since an RDD may have an arbitrarily complex execution plan, involving JDBC table queries, file scans, etc., there's no a priori way to determine its size short of counting.
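If the records are already being processed by the job anyway, one thing that can help is counting them with an accumulator as a side effect, rather than paying for a separate count() pass. A rough sketch follows; it uses the Spark 2.x longAccumulator API (Spark 1.6 has sc.accumulator instead), the path and per-row work are placeholders, and note that retried tasks can make accumulator values overcount, so this measures records processed rather than giving an exact dataframe count:

// Count rows while they are being processed, instead of running a
// separate count() action afterwards.
val processed = spark.sparkContext.longAccumulator("processedRecords")

val transformed = df.rdd.map { row =>
  processed.add(1L)
  row // real per-row processing would go here
}
transformed.saveAsTextFile("/tmp/out") // an action must run before the value is meaningful
println(s"records processed: ${processed.value}")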
I recently needed to collect a large amount of data from an RDD in Spark. The problem is that the data is too big to collect back and store entirely in the driver's memory, so I have tried multiple ways to implement this but still failed.
First, I need to mark the records in the RDD with an index, so I used the zipWithIndex() function.
After that, I tried the first method: I used the take() function to collect the first 10 million records from the RDD, iterated over those 10 million records to do some processing, then filtered the first 10 million records out of the RDD, took the first 10 million records from the new (filtered) RDD, and iterated again, looping until the RDD was empty. This doesn't work, because I need to count the initial RDD to get the total number of records, which seems to consume a lot of memory; in any case, it failed in the end.
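A rough sketch of that first approach, for clarity; the batch size and the process function are illustrative, and checking for an empty batch stands in for the upfront count:

val batchSize = 10000000 // 10 million records per chunk
var remaining = rdd.zipWithIndex() // pairs each record with a stable index
var finished = false

while (!finished) {
  val batch = remaining.take(batchSize) // pulls one chunk to the driver
  if (batch.isEmpty) {
    finished = true
  } else {
    batch.foreach { case (record, _) => process(record) } // driver-side work
    val lastIndex = batch.last._2
    // Drop the chunk just handled; note the lineage grows each iteration,
    // which is part of why this approach gets slower and slower.
    remaining = remaining.filter { case (_, idx) => idx > lastIndex }
  }
}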
Next I tried the second method: I used toLocalIterator to get an iterator over the RDD directly and did the iteration and processing on the driver. This seems like a solution, but it turns out the performance is really bad: it spends too much time in hasNext(). Testing is still ongoing, but even if it works, the performance won't meet the requirement.
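And the second approach, for reference (process is again illustrative):

// toLocalIterator streams the RDD to the driver one partition at a time,
// so driver memory only needs to hold a single partition.
rdd.toLocalIterator.foreach { record =>
  process(record)
}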
So is there any other way to implement this kind of thing? Many thanks.