Why does the count() method not get the true number of rows? - scala

I was running my program on a cluster and I encountered a strange problem. When I try to count the number of rows in a particular dataset ds, I find that count() does not return the correct value. When I add ds.persist() before ds.count(), the count result is correct. Why is that? Doesn't count() itself perform operations such as cache or persist? Why are rows lost?
Thank you for your help.

I think it has to do with lazy evaluation. I remember an old thread on the Spark developer community saying that, due to the optimisations on the DataFrame and Dataset APIs, a count may not necessarily trigger the entire DataFrame/Dataset evaluation. Hence the count may not be accurate.
However, if you do a df.rdd.count or ds.rdd.count, or do a cache or persist first on the dataset/dataframe and then do a count, it will evaluate the entire dataframe or dataset and the count will be accurate.
Looking at another thread, How to force DataFrame evaluation in Spark, please see the reply by Vince.Bdn, which is in line with my chain of thought.
If you want to further validate it, create a large dataframe, do one count before a persist and another one after the persist, and look at the DAG; that should validate the same. In my case I went with a dataframe of 1 million records with 6 columns.
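As a rough illustration of that experiment, here is a minimal Scala sketch (the data and column names are made up, not from the original question); run it and compare the DAGs of the two count jobs in the Spark UI:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-vs-persist").getOrCreate()

// Roughly a million rows with a handful of derived columns.
val df = spark.range(1000000L)
  .selectExpr("id", "id % 7 as a", "id % 13 as b", "cast(id as string) as c")

println(df.count())   // count before persisting

df.persist()
println(df.count())   // count after persisting; compare this job's DAG with the previous one

df.unpersist()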

Related

Pyspark Dataframe count taking too long

So we have a PySpark DataFrame which has around 25k records. We are trying to perform a count/empty check on this and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us with what we are doing wrong?
Note : There are no duplicate values in df and we have done multiple joins to form the df
Without looking at the df.explain() it's challenging to know specifically what the issue is, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the other partitions to finish.) If you are on a recent version of Spark there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
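These can be set on the active session at runtime; for example (a minimal sketch, assuming spark is your SparkSession):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")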
Count is not taking too long. It's taking the time it needs to complete what you asked Spark to do. To refine what it's doing you should do things you are likely already doing: filter the data before joining so only critical data is being transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source [data/tables] to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from only one [table/data source] (and apply the limit to all others) and find the table that is the source of your problem. Then study that [table/data source] and figure out how you can work around the issue (if you can't use adaptive query execution to fix it), as in the sketch below.
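An illustrative sketch of that diagnostic loop (tableA, tableB, and the join key "key" are hypothetical stand-ins for your actual sources):

val smallA = tableA.limit(1000)
val smallB = tableB.limit(1000)

// If this finishes quickly, the logic is fine and one of the full sources is the problem.
smallA.join(smallB, Seq("key")).count()

// Next iteration: remove the limit from one source at a time to find the offender.
tableA.join(smallB, Seq("key")).count()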
(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;

Performance Improvement in scala dataframe operations

I am using a table which is partitioned by the load_date column and is optimized weekly with the delta optimize command, as the source dataset for my use case.
The table schema is as shown below:
+-----------------+--------------------+------------+---------+--------+---------------+
| ID| readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+
Later this table will be pivoted on columns item_txt and item_value_txt and many operations are applied using multiple window functions as shown below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

val windowSpec = Window.partitionBy("id", "readout_date")
val windowSpec1 = Window.partitionBy("id", "readout_date").orderBy(col("readout_id").desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
These window functions are used to implement multiple pieces of logic on the data. There are also a few joins used to process the data.
The final table is partitioned by readout_date and id, and the performance is very poor, as it takes a long time even for 100 ids and 100 readout_dates.
If I do not partition the final table, I get the below error.
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected count of id in production is in the billions, and I expect much worse throttling and performance issues while processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if anything is wrong with how I am repartitioning, and about any methods to improve cluster utilization and performance.
Any leads appreciated!
spark.driver.maxResultSize is just a setting; you can increase it. BUT it's set at 4 GiB to warn you that you are doing something bad and should optimize your work. You are doing the correct thing by asking for help to optimize.
The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be achieved using groupBy, and this will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine?) Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.
Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year/month. This would likely give you really speedy performance with the loss of some granularity. That would give you enough insight to decide whether to look into something deeper... or not. A sketch of the groupBy reframing follows.
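Purely as an illustration (not the asker's actual logic), windowSpec1 above, presumably used to pick the latest readout per id and readout_date, could be reframed as a groupBy plus a join:

import org.apache.spark.sql.functions.max

// Latest readout_id per (id, readout_date), computed as a plain aggregation.
val latest = df.groupBy("id", "readout_date")
  .agg(max("readout_id").as("readout_id"))

// Keep only the rows belonging to that latest readout.
val deduped = df.join(latest, Seq("id", "readout_date", "readout_id"))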
If you wanted more accuracy, I'd suggest:
Converting your nulls to 0's.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date").asc) // asc is important as it flips the relationship so that it groups the previous nulls
Then create a running total on the SIG_XX VAL or whatever signal you want to look into. Call the new column 'null-partitions'.
This will effectively allow you to group the numbers (by null-partitions), and you can then run aggregate functions using group by to complete your calculations. Window and group by can do the same thing; windows are just more expensive in how they move data, slowing things down. Group by uses more of the cluster to do the work and speeds up the process.
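One plausible reading of that suggestion, sketched below; sig_val stands in for whatever signal column you care about, and the final aggregation is only illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

val runningNulls = Window.partitionBy("id").orderBy(col("readout_date").asc)

val grouped = df
  // nulls -> 0 so the signal can be aggregated safely
  .withColumn("sig_val_filled", when(col("sig_val").isNull, 0).otherwise(col("sig_val")))
  // running count of nulls; rows between two nulls share a value, forming the "null-partitions"
  .withColumn("null_partitions", sum(when(col("sig_val").isNull, 1).otherwise(0)).over(runningNulls))
  // finish the calculation with a group by instead of another window
  .groupBy("id", "null_partitions")
  .agg(sum("sig_val_filled").as("sig_val_total"))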

How to calculate number of rows obtained after a join, a filter or a write without using count function - Pyspark

I am using PySpark to join, filter and write large dataframe to a csv.
After each filter, join, or write, I count the number of lines with df.count().
However, counting the number of rows means reloading the data and re-performing the various operations.
How could I count the number of lines after each operation without reloading and recomputing, as df.count() does?
I am aware that the cache function could be a solution to avoid reloading and recalculating, but I am looking for another solution as it's not always the best one.
Thank you in advance!
Why not look at the Spark UI to get a feel for what's happening instead of using count? This might help you get a feel without actually doing counts. Jobs/Tasks can help you find the bottlenecks. The SQL tab can help you look at your plan to understand what's actually happening.
If you want something better than count:
countApprox is cheaper (RDD-level tooling; see the sketch below). You should be caching if you are going to count a dataframe and then use it again afterwards. Actually, count is sometimes used to force caching.
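A minimal Scala sketch of the approximate count (the timeout and confidence values here are arbitrary):

val approx = df.rdd.countApprox(timeout = 10000L, confidence = 0.95)
println(approx.initialValue.mean)   // approximate row count, available within roughly the timeout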

Efficient way to check if there are NA's in pyspark

I have a PySpark dataframe, named df. I want to know if its columns contain NAs; I don't care if it is just one row or all of them. The problem is, my current way to know if there are NAs is this one:
from pyspark.sql import functions as F

if df.where(F.isnull('column_name')).count() >= 1:
    print("There are nulls")
else:
    print("Yey! No nulls")
The issue I see here is that I need to compute the number of nulls in the whole column, and that is a huge amount of time wasted, because I want the process to stop when it finds the first null.
I thought about this solution but I am not sure it works (because I work on a cluster with a lot of other people, so the execution time depends on the multiple jobs other people run on the cluster, and I can't compare the two approaches under equal conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help? Are there more efficient ways to achieve this?
There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you're planning on running this query multiple times, with the answer changing each time, you should be aware (I don't mean to imply that you aren't) that if the answer is "there are no null values in the entire dataframe", then you will have to scan the entire dataframe to know this, and there isn't a fast way to do that. If you need this kind of information frequently and the answer can frequently be "no", you'll almost certainly want to persist this kind of information somewhere, and update it whenever you insert a record that might have null values by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case Spark uses a wide transformation: it actually applies LocalLimit on each partition and shuffles partial results to perform a GlobalLimit.
In the take case Spark uses a narrow transformation and evaluates LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
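For instance, a hedged Scala sketch of the same null check written with take (the PySpark equivalent using df.head(1) is analogous):

import org.apache.spark.sql.functions.col

// Returns as soon as enough partitions have been scanned to satisfy take(1).
val hasNulls = df.where(col("column_name").isNull).take(1).nonEmpty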
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
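Continuing the sketch above (lit(true) being the Scala counterpart of F.lit(True)):

import org.apache.spark.sql.functions.{col, lit}

// Project a constant so the workers only ship a boolean instead of full rows.
val hasNulls = df.where(col("column_name").isNull).select(lit(true)).take(1).nonEmpty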
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). Now I know it is not yet processed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
..and so on. It would take a long time each time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that I could run these new actions as quickly as on the original sets? (At first I thought of 'collect', but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join two DataFrames, Spark doesn't do any work, because it evaluates lazily: nothing happens until an action is called on the resulting dataframe. Actions are show, count, print, etc.
Now when show or count is called on nDF, Spark evaluates the resultant dataframe every time, i.e. once when you call show, again when count is called, and so on. This means internally it performs map/reduce every time an action is called on the resultant dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so with df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call count/show, it will evaluate nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However, the first evaluation might be a little slower, and you also need to use a little more executor memory.
If the memory available to you is good with respect to the size of your dataset, what you're probably looking for is df.cache(). If your dataset is too big, consider using df.persist(), as it allows different levels of persistence.
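A hedged sketch of that, reusing the a and b DataFrames from the question (the join key "key" and the storage level are illustrative assumptions):

import org.apache.spark.storage.StorageLevel

val nDF = a.join(b, "key").persist(StorageLevel.MEMORY_AND_DISK)

nDF.count()   // first action materializes the join and caches the result
nDF.show()    // subsequent actions reuse the cached data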
Hope this is what you're looking for. Cheers