I have a query in which I need a column with an aggregate computed over a specific window frame, for example:
SUM(subquery1.value) OVER (
PARTITION BY subquery1.entity
ORDER BY subquery1.timestamp
ROWS BETWEEN 55 PRECEDING and CURRENT ROW
)
It works, gives me the expected result, and doesn't take much time. Here's an EXPLAIN ANALYZE of it:
I know the statistics are a bit off, but I'm happy with it.
The problem is that 55 preceding rows is not really what I want; what I actually want is closer to 50,000, and that's when it gets slower. If I bump the PRECEDING to 555 it already gets much slower, and for higher values I haven't waited for the run to finish.
So I don't know what's happening. I have tried the same aggregate directly against a table with 300,000+ rows and it takes less than one second. This example is just over 50k rows, so it should be much, much faster.
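For reference, one rewrite that is sometimes used when a wide ROWS frame gets slow (depending on the engine and data type, the frame may be re-aggregated for every row) is to express the bounded running sum as the difference of two cumulative values. This is only a sketch assuming a PostgreSQL-style engine: "subquery1" stands for the poster's subquery, the CTE name "cum" is made up, and it assumes "timestamp" is unique per entity.

-- Sketch only: the bounded running sum as a difference of two cumulative
-- aggregates, so each row does constant work instead of re-scanning up to
-- 50,000 preceding rows.
WITH cum AS (
    SELECT
        entity,
        "timestamp",
        SUM(value) OVER (
            PARTITION BY entity
            ORDER BY "timestamp"
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM subquery1
)
SELECT
    entity,
    "timestamp",
    running_total
      - COALESCE(
            LAG(running_total, 50000) OVER (PARTITION BY entity ORDER BY "timestamp"),
            0
        ) AS windowed_sum   -- equivalent to ROWS BETWEEN 49999 PRECEDING AND CURRENT ROW
FROM cum;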
I have a table with 100Mil+ records and 237 fields.
One of the fields is a varchar(1) field with three possible values (Y, N, I).
I need to find all of the records with N.
Right now I have a b-tree index built and the query below takes about 20 min to run.
Is there another index I can use to get better performance?
SELECT * FROM tableone WHERE export_value='N';
Assuming your values are roughly equally frequent (say at least 15% of each value) and roughly evenly distributed throughout the table (some physically near the beginning, some in the middle, some at the end), then no.
If you think about it you'll see why. You'll have to look up tens of millions of disk blocks in the index and then fetch them from the disk one by one. By the time you have done that, it would have been quicker to just scan the whole table and pick out the values as they match. The planner knows this and would probably not use the index at all.
However, if you only have 17 rows with "N", or they were all added to the table very recently and so physically happen to be close to each other, then yes, an index can help.
If you only had a few rows with "N" you would have mentioned it, so we can ignore that one.
If, however, you mostly insert into this table, you might find a BRIN index helpful. It can let the planner see that, e.g., the first 80% of your table doesn't have any "N" blocks, so it only needs to look at the last part.
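For reference, a minimal sketch of the two index options discussed above, assuming PostgreSQL 9.5+ for BRIN; the index names are made up:

-- Sketch only: a BRIN index, useful when rows are mostly appended so that
-- 'N' rows cluster physically near the end of the table.
CREATE INDEX tableone_export_value_brin
    ON tableone USING brin (export_value);

-- Sketch only: a partial B-tree index, useful in the other case mentioned
-- above, i.e. when only a small fraction of rows have export_value = 'N'.
CREATE INDEX tableone_export_value_n_idx
    ON tableone (export_value)
    WHERE export_value = 'N';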
I am struggling with the performance of my dashboard, which runs queries on Redshift via the JDBC driver.
The query looks like this:
select <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_,
sum(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) as col_1_0_ from <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME> where <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$1
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$2
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$3
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$4
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$5
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$6
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$7
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$8
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$9
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$10
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$11
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$12
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$13
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$14
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$15
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$16
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$17
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$18
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$19
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$20
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$21
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$22
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$23
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$24
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$25
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$26
or ........
For the dashboard we use Spring and Hibernate (I am not 100% sure about that, though).
The query can sometimes stretch to $1000+ parameters, depending on the filters/options selected in the UI.
The problem we are seeing is that the first time this query is run by the reports, it takes 40 to 60 seconds to respond. After the first time, the query runs quite fast and takes only a few seconds.
We initially suspected something was wrong with Redshift caching, but it turns out that even simple queries like these (but huge) take considerable time to COMPILE. This is clear from the svl_compile table, which shows this query was compiled in over 35 seconds.
What should I do to handle such issues?
I recommend restructuring the query generated by your dashboard to use an IN list. Redshift should be able to reuse the already compiled query segments for IN lists of different lengths.
Note that IN lists with fewer than 10 values will still be evaluated as OR predicates. https://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
SELECT <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_
, SUM(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) AS col_1_0_
FROM <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME>
WHERE <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME> IN ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11 … $1000 )
;
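To confirm whether compile time is still the bottleneck after the rewrite, per-segment compile times can be read from svl_compile, which the question already mentions. A sketch only; 123456 is a placeholder for the query id taken from STL_QUERY, not a real id from this report:

-- Sketch: per-segment compile times for one query.
SELECT query,
       segment,
       compile,                                         -- 1 if the segment had to be compiled
       datediff(milliseconds, starttime, endtime) AS compile_ms
FROM svl_compile
WHERE query = 123456
ORDER BY segment;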
I have a query (a relationship between CONTRACT <-> ORDERS) that I decided to break up into two parts (contract and orders) so I can reuse it in another stored procedure.
Before the break-up the code took around 10 seconds to run; however, when I use a function to get the contract, pump the data into a temp table first, and then join to the other parts, it takes 2m30s. Why the difference in time?
The function takes less than a second to run and returns only one row i.e. details of one contract (contract_id is the parameter supplied to the function).
The part that most affects performance is the ORDERS subquery; it hits the largest table in the query (4.1 million rows) and joins to a few other tables. However, if I run the orders subquery in isolation with a particular filter, i.e. the contract id, it takes less than a second and happens to return zero records for the contract I am testing with (due to the filtering on the type of order it is looking for).
Based on the above, you would think: at most 1 second for the function + at most 1 second to get the orders + the summarising = 2 seconds at most, not two and a half minutes!
Where am I going wrong, and how do I begin to isolate the cause of the time difference?
I know someone is going to tell me to paste the code, but surely this is a question of the database and indexes, or perhaps of how the engine treats raw code versus code broken up into parts. Is there an area of the code I can look at before having to post the whole thing? I have tried variations of OUTER APPLY vs. LEFT JOIN from the contract temp table to the orders subquery, and both give me about the same result. Any ideas?
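One way to start isolating the difference (a sketch only; SQL Server is assumed from the OUTER APPLY and temp-table wording, and the procedure names and parameter value are hypothetical) is to compare timing and I/O statistics for the two versions side by side, along with their actual execution plans:

-- Sketch: compare elapsed/CPU time and logical reads of the two versions.
-- The procedure names and parameter value are placeholders.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

EXEC dbo.usp_GetContractOrders_Inline       @contract_id = 12345;
EXEC dbo.usp_GetContractOrders_WithFunction @contract_id = 12345;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;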
I don't think the issue was with the code but with the network I was running it on. It is bizarre, though: I had two versions of the proc running side by side, and before the weekend one was running in 10 seconds (it is still running in 10 seconds three days later), while my new version (using the function) was taking anywhere between 2 and 3 minutes. This morning it is running in 2 or 3 seconds! So I don't know whether the difference came from switching from SELECT ... INTO #Contract to declaring the table structure and using a table variable, or from the network, or from precompilation. Whatever it was, it is no longer an issue. Should I delete this post?
I have a table, say 'T', in kdb which has over 6 billion rows. When I try to execute a query like this
select from T where i < 10
it throws a wsfull exception. Is there any way I can execute queries like this on a table with such a large amount of data?
10#T
The expression as you wrote it first builds a boolean vector marking the elements where i (the row number) < 10, which is as long as one of your columns. It then applies where (the result of which is just til 10) and then retrieves those rows from each column. You can skip straight to that last step with:
T[til 10]
but 10#T is shorter.
Assuming you have a partitioned table here, it is normally beneficial to have the partitioning column (date, int, etc.) as the first item in the where clause of your query; otherwise, as mentioned previously, you are reading a six-billion-item list into memory, which will result in a 'wsfull signal on any machine with less than the requisite amount of RAM.
Bear in mind that the row index i starts at 0 in each partition and does not reflect position in the overall table. The query you gave as an example would therefore return the first ten rows of each partition of table T in your database.
In order to do this without reaching your memory limit, you can try running the following (if your database is date-partitioned):
raze{10#select from T where date=x}each date
I was using hstore on PostgreSQL 9.3.4 to store a count of how many times an event happened on a given day, with an update like the following:
days_count = days_count || hstore('x', (coalesce((days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the year. After running a simulation of expected production behaviour I ended up with a table that was 150 MB + 2 GB of TOAST + 25-30 MB for the index, after ANALYZE and VACUUM.
I am now instead breaking the above column up into one column per month, like the following:
y_month_days_count = y_month_days_count || hstore('x', (coalesce((y_month_days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the month, and y is the month of the year.
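For context, such a fragment would typically appear as the SET clause of an UPDATE along these lines. This is a sketch only: the table name event_stats, the column entity_id, the key '17' and the WHERE value are made up, not from the original post.

-- Sketch only: how the increment fragment above is typically used.
UPDATE event_stats
SET y_month_days_count = y_month_days_count
        || hstore('17', (coalesce((y_month_days_count -> '17')::integer, 0) + 1)::text)
WHERE entity_id = 42;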
I am still running the simulation right now, but so far, a third of the way through, I am at 60 MB + a pretty steady 20-30 MB of TOAST + 25-30 MB for the index. That means I should end up with about 180 MB + 30-40 MB of TOAST + 25-30 MB for the index after ANALYZE and VACUUM.
So first, are there any known issues with hstore and TOAST bloat that would explain the problem with my first setup?
Second, will my current solution of breaking up the columns cause any issues with hstore performance in the future because of the number of hstore columns on one table? It seems steady now with row counts in the hundreds of thousands, and while I know more columns can make things slower, I am unsure whether this is worse with hstore columns.
Finally, I did find something out. I have one hstore column that represents each hour of a day, so it has 24 different keys. When I run the simulation for just that column I end up with almost no TOAST (in the KB range), but when I run the whole simulation, with the days broken up into month columns, my largest hstore has 52 keys.
So for a simple store of either a counter or a word or two, the maximum number of keys before I see any amount of TOAST for an hstore is somewhere between 24 and 52.
So first, are there any known issues with hstore and TOAST bloat that would explain the problem with my first setup?
Yes.
When you update any part of an out-of-line stored TOASTed field like text, hstore or json the whole field must be re-written as a new row version. This is a consequence of MVCC - it's necessary to retain a copy of every version of the row that might still be visible to another transaction.
The old one can be vacuumed away when it's no longer required by any running transaction, so in practice this has minimal impact so long as autovacuum is running aggressively enough.
So if you're updating lots of rows with big text, hstore or json fields, or updating them frequently, tune autovacuum up so it runs more often and does its work faster. Make sure you don't have long-running <IDLE> in transaction connections.
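A minimal sketch of the kind of per-table autovacuum tuning meant here; the table name and the numbers are placeholders, not recommendations for this workload:

-- Sketch: make autovacuum visit this table more often and run unthrottled.
ALTER TABLE event_stats SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are dead, instead of the default 20%
    autovacuum_vacuum_cost_delay   = 0      -- don't throttle the vacuum worker on this table
);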
You say the table sizes you quoted were "after analyze and vacuum" but I'm guessing you only ran a regular vacuum, so the table bloat would've been freed for re-use by PostgreSQL but not released back to the OS. See if VACUUM FULL compacts it.
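To see whether that helps, the heap, TOAST and index sizes can be checked before and after with something like this (a sketch; the table name is a placeholder):

-- Sketch: break the table's on-disk size into heap, TOAST and indexes.
-- pg_table_size includes the FSM/visibility map, so toast_size is approximate.
SELECT pg_size_pretty(pg_relation_size('event_stats'))                                AS heap_size,
       pg_size_pretty(pg_table_size('event_stats') - pg_relation_size('event_stats')) AS toast_size,
       pg_size_pretty(pg_indexes_size('event_stats'))                                  AS index_size;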
Will my current solution of breaking up the columns cause any issues with hstore performance in the future because of the number of hstore columns on one table?
Depends on your query patterns and workload, but probably not.