Getting the count of records in a DataFrame quickly - Scala

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.

It's going to take a long time anyway, at least the first time.
One way is to cache the DataFrame, so you will be able to do more with it than just count.
E.g.
df.cache()
df.count()
Subsequent operations don't take much time.
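Note that cache() is lazy: the first action still scans all the data and populates the cache, so only later actions get the speed-up. A minimal sketch (the filter condition is hypothetical):
df.cache()                            // lazy: nothing is materialized yet
df.count()                            // first action scans the data and fills the cache
df.count()                            // now answered from the cached data
df.where("status = 'active'").count() // other actions reuse the cache too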

The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to perform counts on Parquet files than CSV/JSON files. Parquet files store row counts in the file footer, so Spark doesn't need to read all the rows in the file and actually perform the count; it can just grab the footer metadata. CSV/JSON files don't have any such metadata.
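For example, a rough sketch of the difference (the paths are hypothetical):
val parquetCount = spark.read.parquet("s3a://bucket/events_parquet/").count() // per the point above, largely answered from footer metadata
val csvCount = spark.read.csv("s3a://bucket/events_csv/").count()             // has to scan every row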
If the data is stored in a Postgres database, then the count operation will be performed by Postgres and count execution time will be a function of the database performance.
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally faster than gzip because it is splittable by Spark and faster to decompress.
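The codec can be chosen at write time; a minimal sketch (the output path is hypothetical):
df.write.option("compression", "snappy").parquet("s3a://bucket/events_snappy/")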
approx_count_distinct, which is powered by HyperLogLog under the hood, will be more performant for distinct counts, at the cost of precision.
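A sketch of that trade-off (the column name is hypothetical):
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}
df.select(countDistinct("user_id")).show()               // exact, but requires a full distinct aggregation
df.select(approx_count_distinct("user_id", 0.05)).show() // approximate; 0.05 is the allowed relative error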
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.

Related

Performance of Post-Aggregations in Apache Druid

What are the performance trade-offs that I must consider when using post-aggregations as opposed to defining metrics in the ingestion spec when rollup is enabled?
I guess it all depends on the result set.
When you do this at ingestion time, the cost is only paid once, when the data is pushed into your Druid cluster. Selecting the data is then just a matter of retrieving it from the segments.
A post-aggregation runs through the result of your query and "re-processes" it, so it adds some overhead. How much is hard to tell, though.
When you need more speed, or want to reduce CPU usage, I would recommend doing the calculation at ingestion time. However, the downside is that this takes extra disk space, as you store the result of your calculation as a new column.
If disk space is a problem, you may be better off with post-aggregations.

How to improve low throughput groupbykey in dataflow pipeline

I have an Apache Beam batch pipeline (written in Java) to transform raw analytics data from BigQuery into an aggregated form. Session records (which might now be extended by the next day's worth of page events) and a new set of page events are read from BigQuery. The pipeline then performs a groupByKey operation to group by user id (across both datasets) before the aggregation operation that creates session records. The groupByKey operation is performing very slowly (a throughput of ~50 per sec) on the larger dataset (~8,400,000 records), whereas the throughput for the other input (~1,000,000 records) was much higher (~10,000 per sec). Does anyone have any advice on how I can troubleshoot and ultimately improve the speed of this operation?
From research online I am aware that it can sometimes be more efficient to use a Combine operation rather than groupByKey (among others, this article), but I did not think that would be appropriate for the data I'm grouping (BQ TableRow records).
Further info that might be useful:
The groupByKey is reducing the ~8,400,000 records to approximately 3,500,000 grouped records, with anywhere from 1 to ~2,000 records combined per key.
I fully acknowledge that I am lacking a full understanding of the intricacies of Apache Beam and Dataflow and am keen to understand a lot more, as I will be building out a number of different pipelines.
Below is a screenshot of the dataflow graph
Individual stages in Beam get fused together when running on Dataflow, meaning the throughput of one stage gets tied to the others, so it's entirely possible that it's not the GroupByKey but rather adjacent DoFns that are causing the slowness. If you click on a step, you can see, in the step-info tab, a field that gives the wall time for executing that particular step. I would check whether a particular step in your pipeline around that GroupByKey has a high wall time.

s3 parquet write - too many partitions, slow writing

I have a Scala Spark job that writes to S3 as Parquet. It's 6 billion records so far and it will keep growing daily. As per the use case, our API will query the Parquet based on id. So to make the query results faster, I am writing the Parquet with partitions on id. However, we have 1,330,360 unique ids, so this is creating 1,330,360 parquet files while writing; the writing step is therefore very slow, it has been writing for the past 9 hours and is still running.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read query fast? Or is there any other, better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.
You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest something specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
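A minimal sketch of the bucketed-partition idea (the bucket count of 1000 and the reuse of the question's output DataFrame are assumptions):
import org.apache.spark.sql.functions.{col, lit, pmod}
// Derive a coarse partition key so you end up with ~1000 directories instead of 1,330,360
val bucketed = output.withColumn("id_bucket", pmod(col("id"), lit(1000)))
bucketed.write.mode("append").partitionBy("id_bucket").parquet("s3a://datalake/db/")
// Readers then filter on both columns so partition pruning still applies:
// spark.read.parquet("s3a://datalake/db/").where(col("id_bucket") === pmod(lit(42), lit(1000)) && col("id") === 42)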

Caching one big RDD or many small RDDs

I have a large RDD (R) which I cut into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache only depends on the size of the RDD (e.g. 10 seconds per MB), then caching the individual chunks is better.
However, I suspect there is some additional overhead I'm not aware of, like seek time in the case of persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, I'm currently running the application on my computer, but in the end it will run on a cluster consisting of 10 nodes, each of which has 8 cores. However, since we only have access to the cluster for a small amount of time, I wanted to experiment locally on my computer first.
From my understanding, the application won't need a lot of shuffling as I can partition it rather nicely, such that each chunk runs on a single node.
However, I'm still thinking about the partitioning, so it is not yet 100% decided.
Spark performs its computations in memory, so there is no real extra overhead when you cache data to memory. Caching to memory essentially says: reuse these intermediate results. The only issue you can run into is having too much data in memory, at which point it spills to disk and you incur disk read costs. If you run into memory limitations, unpersist() can be used to swap things out of memory as you finish with the various intermediate results.
When determining where to cache your data you need to look at the flow of your data. If you read in a file, transform it three times (here, selecting three different columns), and write out each result separately, without caching you will end up reading that file in three times.
val data = spark.read.parquet("file:///testdata/").limit(100)
data.select("col1").write.parquet("file:///test1/")
data.select("col2").write.parquet("file:///test2/")
data.select("col3").write.parquet("file:///test3/")
If you read in the file, cache it, and then do the three transformations and write out the results, you will read the file in once and then write out each result.
val data = spark.read.parquet("file:///testdata/").limit(100).cache()
data.select("col1").write.parquet("file:///test4/")
data.select("col2").write.parquet("file:///test5/")
data.select("col3").write.parquet("file:///test6/")
The general test you can use for what to cache is: "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example, if you break the large RDD into chunks and the large RDD isn't cached, you will most likely be recalculating the large RDD every time you perform an action on it. Then, if you don't cache the chunks and you perform multiple actions on them, you will have to recalculate those chunks every time.
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to repartition your large RDD properly according to the number of executors and then cache it. Then, if you perform more than one action on each of the chunks that you create from the large RDD, you may want to cache those as well.
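A rough sketch of that flow (the partition count, input path, and the way the chunks are derived are all hypothetical):
val r = spark.sparkContext.textFile("file:///testdata/big.txt")
  .repartition(80)                // e.g. number of executors * cores per executor
  .cache()                        // R is reused 20 times below, so cache it

// derive the 20 chunks; cache a chunk only if you will run several actions on it
val chunks = (0 until 20).map(i => r.filter(line => math.abs(line.hashCode) % 20 == i))

chunks.foreach { c =>
  c.cache()
  val n = c.count()               // first action materializes this chunk's cache
  val sample = c.take(5)          // second action reuses it
  c.unpersist()                   // free memory before moving on to the next chunk
}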

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out-of-heap exception when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I have successfully processed CSV data of over 1 TB on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.
- If you repartition an RDD, it requires additional computation that has overhead above your heap size. Instead, try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
- Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still facing out-of-memory errors, try lowering the amount of data per partition (by increasing the partition count).
- Increase the driver memory and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the Spark context (a sketch of these suggestions follows below).
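A rough sketch of those suggestions, assuming a local SparkSession and a hypothetical input path (the memory values, core count, and split size are placeholders):
import org.apache.spark.sql.SparkSession

// Memory settings must be in place before the context starts; in practice
// spark.driver.memory is usually passed via spark-submit or spark-defaults.conf.
val spark = SparkSession.builder()
  .appName("aggregate-tsv")
  .master("local[8]")
  .config("spark.executor.memory", "6g")
  .getOrCreate()
val sc = spark.sparkContext

// Smaller splits -> more partitions -> less data per task.
// This is the property behind TextInputFormat.SPLIT_MAXSIZE.
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024)

val lines = sc.textFile("file:///data/tsv/")

// mapPartitions lets you keep per-partition state instead of allocating per record.
val triplets = lines.mapPartitions { iter =>
  iter.flatMap { line =>
    val cols = line.split("\t")
    if (cols.length >= 3) Some((cols(0), cols(1), cols(2))) else None
  }
}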
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
To add another perspective based on code (as opposed to configuration): sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was that I was collecting all the results back on the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
    print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, the PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should only use collect() on smaller datasets, usually after filter(), group(), count(), etc. Retrieving a larger dataset this way results in an out-of-memory error.