I have a parquet I read from disk (20,000 partitions) and the display command df.display() returns almost right away, whereas df.limit(1).display() literally takes hours to execute. I don't understand what is going on here. It is also not only the display() command that is slow, but also a join I would actually like to perform. By contrast, df.show(n=1) returns almost instantaneously.
limit() runs per partition first, then combines the partial results into a final result. Since your data has 20,000 partitions, this takes a long time to execute.
One way to keep using limit() is to reduce the number of partitions first: df.coalesce(1).limit(1).display(). This is not recommended, however, because all the data is pulled into a single partition, which may cause an out-of-memory exception.
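As a rough illustration of the "per partition first, then combine" behaviour described above, here is a toy model in plain Python. It is not Spark code: the function names and the list-of-lists stand-in for partitions are assumptions made purely for illustration.

```python
from itertools import islice

def local_limit(partition, n):
    """Take at most n rows from a single partition."""
    return list(islice(partition, n))

def toy_limit(partitions, n):
    """Toy model of limit(n): a local limit on every partition, then a global limit."""
    partial = [local_limit(p, n) for p in partitions]  # one unit of work per partition
    combined = [row for part in partial for row in part]
    return combined[:n], len(partial)

# 20,000 hypothetical partitions with 5 rows each
partitions = [[(i, j) for j in range(5)] for i in range(20_000)]
rows, partitions_touched = toy_limit(partitions, 1)
print(rows, partitions_touched)
```

Even though only one row is requested, the model touches every partition once, which is why the partition count dominates the cost in this picture.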
I have a database table with around 5-6 million entries, and it takes around 20 minutes to vacuum. Since one field of this table is updated very frequently, there are a lot of dead rows to deal with.
As an estimate, with our current user base it can accumulate 2 million dead tuples per day. So vacuuming this table requires both:
Read IO: as the whole table is not present in shared memory.
Write IO: as there are a lot of entries to update.
What would be an ideal way to vacuum this table? Should I increase autovacuum_vacuum_cost_limit to allow more operations per autovacuum run? As far as I can see, that will increase IOPS, which again might hurt performance. Currently, I have autovacuum_vacuum_scale_factor = 0.2. Should I decrease it? If I decrease it, vacuum will run more often; write IO will decrease, but there will be more periods of high read IO.
Also, as the user base grows, vacuuming will take more and more time, since the table will grow and vacuum will have to read a lot from disk. So, what should I do?
The solutions I have thought of:
1. Separate the highly updated column into its own table.
2. Tweak the parameters to make vacuum run more often to decrease write IO (as discussed above). But how do I handle the extra read IO, since vacuum will now run more often?
3. Combine point 2 with increasing RAM to reduce read IO as well.
In general, what approach do people take? I assume people must have very big tables, 10GB or more, that need to be vacuumed.
Separating the column is a viable strategy, but it would be a last resort for me. PostgreSQL already has a high per-row overhead, and doing this would double it (which might also remove most of the benefit). Plus, it would make your queries uglier, harder to read, harder to maintain, and more prone to bugs. Where splitting would be most attractive is if index-only scans on a set of columns not including this one are important to you, and splitting it out lets you keep the visibility map for those remaining columns in a better state.
Why do you care that it takes 20 minutes? Is that causing something bad to happen? At that rate, you could vacuum this table 72 times a day, which seems to be way more often than it actually needs to be vacuumed. In v12, the default value for autovacuum_vacuum_cost_delay was dropped 10 fold, to 2ms. This change in default was not driven by changes in the code in v12, but rather by the realization that the old default was just out of date with modern hardware in most cases. I would have no trouble pushing that change into v11 config; but I don't think doing so would address your main concern, either.
Do you actually have a problem with the amount of IO you are generating, or is it just conjecture? The IO done is mostly sequential, but how important that is would depend on your storage hardware. Do you see latency spikes while the vacuum is happening? Are you charged per IO and your bill is too high? High IO is not inherently a problem, it is only a problem if it causes a problem.
Currently, I have autovacuum_vacuum_scale_factor = 0.2. Should I decrease it? If I decrease it it will run more often, although write IO will decrease but it will lead to more number of time period with high read IO.
Running more often probably won't decrease your write IO by much if any. Every table/index page with at least one obsolete tuple needs to get written, during every vacuum. Writing one page just to remove one obsolete tuple will cause more writing than waiting until there are a lot of obsolete tuples that can all be removed by one write. You might be writing somewhat less per vacuum, but doing more vacuums will make up for that, and probably far more than make up for it.
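To put rough numbers on how often autovacuum would fire, here is a small sketch in Python. It uses the documented trigger rule (autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples, with a default threshold of 50). The row count of 5.5 million is an assumed midpoint of the "5-6 million" in the question, and 0.05 is just a hypothetical lower setting for comparison.

```python
def autovacuum_trigger(reltuples, scale_factor=0.2, base_threshold=50):
    """Dead-tuple count at which autovacuum starts a vacuum on a table."""
    return base_threshold + scale_factor * reltuples

rows = 5_500_000          # assumed midpoint of "5-6 million" rows
dead_per_day = 2_000_000  # estimate given in the question

for sf in (0.2, 0.05):    # current setting and a hypothetical lower one
    trigger = autovacuum_trigger(rows, sf)
    runs_per_day = dead_per_day / trigger
    print(f"scale_factor={sf}: trigger at {trigger:,.0f} dead tuples, "
          f"about {runs_per_day:.1f} runs per day")

# Upper bound if vacuums ran back to back at 20 minutes each
print(24 * 60 // 20, "vacuums per day at most")
```

Under these assumptions the current setting triggers roughly twice a day, far below the back-to-back ceiling, so lowering the scale factor mainly changes how many times pages get rewritten rather than whether vacuum can keep up.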
There are two approaches:
Reduce autovacuum_vacuum_cost_delay for that table so that autovacuum becomes faster. It will still consume I/O, CPU and RAM.
Set the fillfactor for the table to a value less than 100 and make sure that the column you update frequently is not indexed. Then you could get HOT updates which don't require VACUUM.
If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
then in the temp.parquet folder I get as many files as there are rows. I think I don't fully understand Parquet, but is this expected?
Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)
Therefore, as suggested by @Amar, it's better to use repartition.
You can set the number of partitions to 1 to save as a single file:
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")
Although the previous answers are correct, you have to understand the repercussions of repartitioning or coalescing to a single partition. All your data will have to be transferred to a single worker just to immediately write it to a single file.
As is repeatedly mentioned throughout the internet, you should use repartition in this scenario despite the shuffle step that gets added to the execution plan. This step helps to use your cluster's power instead of sequentially merging files.
There is at least one alternative worth mentioning. You can write a simple script that would merge all the files into a single one. That way you will avoid generating massive network traffic to a single node of your cluster.
I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end in this unpleasant situation where some queries run for way longer than expected (even simple ones, exceeding 15 minutes), and the cluster storage starts increasing to the point that if you don't terminate the long-standing queries it gets to 100% storage occupied.
I wonder why this may happen. My experience is varied: sometimes it's been a single query doing this, and sometimes it's been different concurrent queries being run at the same time.
One specific scenario where we saw this happen related to LISTAGG. The type of LISTAGG is varchar(65535), and while Redshift optimizes away the implicit trailing blanks when stored to disk, the full width is required in memory during processing.
If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG, which is roughly 65 gigabytes. That can quickly get you into a situation like what you describe, with queries taking unexpectedly long or failing with “Disk Full” errors.
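The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name is made up for illustration, and it assumes one LISTAGG column per row at the full varchar(65535) width.

```python
LISTAGG_WIDTH_BYTES = 65_535  # full width of varchar(65535)

def listagg_working_bytes(row_count, width=LISTAGG_WIDTH_BYTES):
    """Bytes needed if every row carries one full-width LISTAGG value."""
    return row_count * width

total = listagg_working_bytes(1_000_000)
print(f"{total:,} bytes")
print(f"{total / 10**9:.1f} GB (decimal)")
print(f"{total / 2**30:.1f} GiB (binary)")
```

Doubling the row count or adding a second LISTAGG column doubles the figure, which is how a seemingly simple query can fill the disk on a small cluster.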
My team discussed this a bit more on our team blog the other day.
This typically happens when a poorly constructed query spills too much data to disk, for instance when the user accidentally specifies a Cartesian product (every row from tblA joined to every row of tblB).
If this happens regularly you can implement a QMR rule that limits the amount of disk spill before a query is aborted.
QMR Documentation: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
QMR Rule Candidates query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_qmr_rule_candidates.sql
I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
It's going to take so much time anyway. At least the first time.
One way is to cache the dataframe, so that you can do more with it than just count.
E.g.
df.cache()
df.count()
Subsequent operations don't take much time.
The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to perform counts on Parquet files than CSV/JSON files. Parquet files store counts in the file footer, so Spark doesn't need to read all the rows in the file and actually perform the count, it can just grab the footer metadata. CSV / JSON files don't have any such metadata.
If the data is stored in a Postgres database, then the count operation will be performed by Postgres and count execution time will be a function of the database performance.
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally faster than gzip because it is splittable by Spark and faster to inflate.
approx_count_distinct that's powered by HyperLogLog under the hood will be more performant for distinct counts, at the cost of precision.
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.
We have a pipeline for which the initial stages are properly scalable - using several dozen workers apiece.
One of the last stages is
dataFrame.write.format(outFormat).mode(saveMode).
partitionBy(partColVals.map(_._1): _*).saveAsTable(tname)
For this stage we end up with a single worker. This clearly does not work for us - in fact the worker runs out of disk space - on top of being very slow.
Why would that command end up running on a single worker/single task only?
Update: The output format was parquet. The number of partition columns did not affect the result (tried one column as well as several columns).
Another update: None of the following conditions (as posited by an answer below) held:
coalesce or partitionBy statements
window / analytic functions
Dataset.limit
sql.shuffle.partitions
The problem is unlikely to be related in any way to saveAsTable.
A single task in a stage indicates that the input data (Dataset or RDD) has only one partition. This is in contrast to cases where there are multiple tasks but one or more have significantly higher execution time, which normally correspond to partitions containing positively skewed keys. Also, you should not confound a single-task scenario with low CPU utilization. The latter is usually a result of insufficient IO throughput (high CPU wait times are the most obvious indication of that), but in rare cases can be traced to usage of shared objects with low-level synchronization primitives.
Since standard data sources don't shuffle data on write (including cases where partitionBy and bucketBy options are used) it is safe to assume that data has been repartitioned somewhere in the upstream code. Usually it means that one of the following happened:
Data has been explicitly moved to a single partition using coalesce(1) or repartition(1).
Data has been implicitly moved to a single partition for example with:
Dataset.limit
Window function applications with window definition lacking PARTITION BY clause.
df.withColumn(
"row_number",
row_number().over(Window.orderBy("some_column"))
)
sql.shuffle.partitions option is set to 1 and upstream code includes a non-local operation on a Dataset.
Dataset is a result of applying a global aggregate function (without a GROUP BY clause). This is usually not an issue, unless the function is non-reducing (collect_list or comparable).
While there is no evidence that it is the problem here, in the general case you should also consider the possibility that the data contains only a single partition all the way from the source. This usually happens when input is fetched using a JDBC source, but third-party formats can exhibit the same behavior.
To identify the source of the problem you should either check the execution plan for the input Dataset (explain(true)) or check SQL tab of the Spark Web UI.