I'm trying to run a recursive query in Amazon QuickSight, but it is aborting after 1001 iterations.
I've tried setting the recursion depth at the top of the query but am still receiving an error.
SET max_sp_recursion_depth = 2000;
I've also tried increasing it at the bottom, also with little success.
OPTION (MAXRECURSION 2000);
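For reference, in engines that accept this hint (SQL Server-style dialects), OPTION (MAXRECURSION ...) goes at the very end of the statement that consumes the recursive CTE; the sketch below uses placeholder names, not my actual query:

-- placeholder recursive CTE; the hint attaches to the outer SELECT
WITH nums AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 2000
)
SELECT n FROM nums
OPTION (MAXRECURSION 2000);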
I am specifically looking for a solution when running the query in QuickSight.
I am using a table that is partitioned by the load_date column and optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case.
The table schema is as shown below:
+----+------------+--------------+-----------+----------+----------------+
| ID | readout_id | readout_date | load_date | item_txt | item_value_txt |
+----+------------+--------------+-----------+----------+----------------+
Later, this table is pivoted on the item_txt and item_value_txt columns, and many operations are applied using multiple window functions, as shown below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

val windowSpec  = Window.partitionBy("id", "readout_date")
val windowSpec1 = Window.partitionBy("id", "readout_date").orderBy(col("readout_id").desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
These window functions are used to implement several pieces of logic on the data, and a few joins are also used while processing it.
The final table is partitioned by readout_date and id, and performance is very poor: it takes a long time even for 100 ids and 100 readout_date values.
If I do not partition the final table, I get the error below.
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected count of ids in production is in the billions, and I expect far worse throttling and performance issues when processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if I am doing anything wrong with the repartitioning, and whether there are ways to improve cluster utilization and performance.
Any leads appreciated!
spark.driver.maxResultSize is just a setting; you can increase it. BUT it is set at 4 GiB to warn you that you are doing something expensive and should optimize your work. You are doing the right thing by asking for help with optimizing.
The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be achieved with a groupBy, and that will perform better. The last two windows are definitely harder to reframe as a groupBy, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine?) Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.
Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year or month. This would likely give you really speedy performance at the loss of some granularity, and enough insight to decide whether to look into something deeper... or not.
If you want more accuracy, I'd suggest:
Converting your nulls to 0s.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date").asc) // asc is important, as it flips the relationship so that it groups the previous nulls
Then creating a running total on the SIG_XX VAL (or whatever signal you want to look into) and calling the new column 'null-partitions'.
This effectively lets you group the numbers (by null-partitions), and you can then run aggregate functions with a group by to complete your calculations. A window and a group by can do the same thing; the window is just more expensive in how it moves data, which slows things down. A group by uses more of the cluster to do the work, which speeds up the process.
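A minimal sketch of the two ideas above, assuming a source DataFrame df and a hypothetical signal column sig_val (both placeholders, not names from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Replace the first windows with a plain groupBy aggregate per (id, readout_date).
val latestPerReadout = df
  .groupBy("id", "readout_date")
  .agg(max("readout_id").as("latest_readout_id"), max("sig_val").as("sig_val"))

// The "null-partitions" trick: flag non-null rows, then a running total of the flag
// (ordered ascending) gives every run of nulls the same group number.
val flagged = df.withColumn("has_value", when(col("sig_val").isNull, lit(0)).otherwise(lit(1)))
val withGroups = flagged.withColumn(
  "null_partitions",
  sum("has_value").over(Window.partitionBy("id").orderBy(col("readout_date").asc))
)

// Ordinary groupBy aggregates can now finish the calculation per group.
val result = withGroups.groupBy("id", "null_partitions").agg(max("sig_val").as("sig_val"))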
What I'm trying to do:
I'm trying to move about 2m records from one table into another. To do this, I'm doing an insert statement which is fed by a select query.
insert into my_table (
select a, b, c
from my_other_table
where (condition)
)
However, while running this, I keep running out of memory.
What I expected (and why I'm confused):
If the working set was larger than could fit in memory, I totally thought Postgres would buffer pages onto the disk and do the write iteratively behind the scenes.
However, what's happening is that it apparently tries to read all of the selected content into memory prior to stuffing it into the other table.
Even on our chunky r5.2xl instance, it consumes all of the memory until eventually the OOM killer fires and Aurora reboots the instance.
This graph shows freeable memory dipping to zero every time I run the query. The memory shooting back up is due to the instance automatically being killed and rebooted because of OOM.
My main question:
Why is Postgresql not being smarter and doing its own pagination behind the scenes?
Is there some setting I need to enable to get it to be aware of its memory limitations?
What I've tried:
Adjusting shared_buffers and work_mem parameters.
Aurora's default shared_buffers value allocates 20 GB to our instance. I've tried dialing this down to 10 GB, and then 6.5 GB (restarting each time), but to no avail. The only effect was to make the query take ages and still ultimately consume all available memory after running for about 30 minutes.
I similarly tried setting work_mem all the way down to the allowable minimum, but this seemingly had no effect on the end result either.
What I can do as a work around:
I could, of course, do the pagination / batching from the client:
computeBatchOffsets(context).forEach(batchOffset ->
    context.insertInto(BLAH)
           .select(context.select(DSL.asterisk())
                          .from(FOO)
                          .limit(BATCH_SIZE)     // the batch size (placeholder), not the offset
                          .offset(batchOffset))  // a stable ORDER BY would be needed for deterministic paging
           .execute()
);
But, in addition to it being slower than just letting the database do it, it "feels" like something the database should surely be able to do internally. So, I'm confused why I'd need to handle it at the client level.
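For reference, the batching above boils down to something like the following plain SQL; the id key column used for key-range paging is just a hypothetical (and key ranges are usually cheaper than LIMIT/OFFSET):

-- one batch; repeat with the next id range until the source table is exhausted
insert into my_table
select a, b, c
from my_other_table
where (condition)
  and id >= 0 and id < 100000;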
I'm running a query out of pgAdmin 4 against a Postgres 9.5 database. Is there any method to get an estimate of how long this query will run? It has now been running for nearly 20 hours.
I only found information about logging and similar approaches to get the actual execution time after the query has finished.
The query orders about 300,000 PostGIS points using st_distance in a recursive CTE.
Does SQL or Postgres have any mechanism to prevent infinitely running queries? The recursion should stop at some point. Is there maybe a way to peek at the last constructed row, which would give me a hint as to how far along the recursion is?
If your transaction is in a deadlock, PostgreSQL may resolve it by killing one (or some) of the transactions involved.
Otherwise, you have to keep track of what you're doing yourself.
When you use EXPLAIN (without ANALYZE), the planner estimates the cost of your query, but this value has to be taken as relative, not absolute.
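Postgres does not expose per-row progress for a recursive CTE, but as a small example, pg_stat_activity at least shows how long each statement has been running:

-- how long each active statement has been running
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80)     AS query
FROM   pg_stat_activity
WHERE  state = 'active'
ORDER  BY runtime DESC;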
When I try to:
set wlm_query_slot_count to 10;
I get an error msg, "...query cannot run because wlm_query_slot_count is greater than the query concurrency for the queue, ... increase the query concurrency if you need more slots"
I've searched and searched, but cannot figure out where I can change the concurrency level (I'd like to raise it from its current 5 to 10).
Found the answer in the AWS docs, under parameter groups: change the concurrency setting for the queue there.
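As a quick check, here is a sketch (assuming the stv_wlm_service_class_config system view is available in your cluster) of how to see the current per-queue concurrency before asking for more slots:

-- user-defined queues; num_query_tasks is the queue's concurrency level
select service_class, num_query_tasks
from stv_wlm_service_class_config
where service_class > 4;

-- once the queue's concurrency has been raised in the parameter group,
-- a session can claim more slots again:
set wlm_query_slot_count to 10;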
Hi, I have a problem executing SQL in PostgreSQL.
I have a query similar to this:
select A, B, lower(C) from myTable ORDER BY A, B;
Without the ORDER BY clause I get the result in 11 ms, but with the ORDER BY it takes more than 4 minutes to retrieve the same results.
These columns contain a lot of data (1,000,000 rows or more) and have many duplicate values.
Can anyone suggest a solution?
Thank you
but with order by, it took more than 4 minutes to retrieve the same results.
udo already explained how indexes can be used to speed up sorting; this is probably the way you want to go.
But another solution (probably) is increasing the work_mem variable. This is almost always beneficial, unless you have many queries running at the same time.
When sorting large result sets, which don't fit in your work_mem setting, PostgreSQL resorts to a slow disk-based sort. If you allow it to use more memory, it will do fast in-memory sorts instead.
NB! Whenever you ask questions about PostgreSQL performance, you should post the EXPLAIN ANALYZE output for your query, and also the version of Postgres.
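For example (the work_mem value below is only an illustration; size it to the sort, not the whole table):

-- raise work_mem for the current session only
SET work_mem = '256MB';

-- EXPLAIN ANALYZE reports whether the sort stayed in memory or spilled to disk:
EXPLAIN ANALYZE
SELECT A, B, lower(C) FROM myTable ORDER BY A, B;
-- look for "Sort Method: quicksort  Memory: ..." (in-memory)
-- versus "Sort Method: external merge  Disk: ..." (spilled to disk)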
Have you tried putting an index on A,B?
That should speed things up.
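Something along these lines (the index name is just a placeholder):

-- a composite index matching the ORDER BY lets Postgres return rows already sorted
CREATE INDEX mytable_a_b_idx ON myTable (A, B);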
Did you try using DISTINCT to eliminate the duplicates? This should be more efficient than an order by statement.