I am struggling with the performance of my dashboard, which runs queries on Redshift through the JDBC driver.
The query looks like this:
select <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_,
sum(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) as col_1_0_ from <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME> where <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$1
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$2
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$3
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$4
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$5
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$6
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$7
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$8
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$9
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$10
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$11
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$12
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$13
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$14
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$15
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$16
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$17
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$18
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$19
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$20
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$21
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$22
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$23
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$24
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$25
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$26
or ........
For the dashboard we use Spring and Hibernate (though I am not 100% sure about that).
The query can sometimes stretch to $1000+ parameters, depending on the filters/options selected in the UI.
The problem we are seeing is that the first time this query is run by the reports, it takes 40 to 60 seconds to respond. After that, the query runs quite fast, taking only a few seconds.
We initially suspected something was wrong with Redshift caching, but it turns out that even simple queries like this one (just very large) take considerable time to COMPILE. This is clear from the svl_compile table, which shows this query took over 35 seconds to compile.
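For reference, the per-segment compile time can be read from svl_compile roughly like this (a sketch; the query id is a placeholder you would look up in stl_query):

```sql
-- Per-segment compile times for one query; 123456 is a placeholder query id
SELECT query,
       segment,
       compile,                                    -- 1 = segment was compiled
       DATEDIFF(ms, starttime, endtime) AS compile_ms
FROM svl_compile
WHERE query = 123456
ORDER BY segment;
```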
What should I do to handle such issues?
I recommend restructuring the query generated by your dashboard to use an IN list. Redshift should be able to reuse the already-compiled query segments for IN lists of different lengths.
Note that IN lists with fewer than 10 values are still evaluated as a series of OR predicates. https://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
SELECT <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_
, SUM(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) AS col_1_0_
FROM <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME>
WHERE <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME> IN ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11 … $1000 )
;
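One additional idea, purely an assumption on my part rather than something from the docs: since compiled segments are cached per distinct query shape, the dashboard could also pad the generated IN list up to a small set of fixed sizes (e.g. 16, 64, 256, 1024), binding the last real value into the unused slots. The application then emits only a handful of distinct SQL texts instead of one per filter count:

```sql
-- 13 real values padded to the 16-slot bucket; the application binds
-- $14..$16 to the same value as $13, so every 11-16 value filter
-- produces the exact same query text
WHERE <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>
      IN ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16)
```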
Related
I have a dataset in Quicksight that is connected to a Redshift table but pulls from SPICE. The dataset is scheduled to refresh from Redshift to SPICE daily. It is a small table, I am using only a fraction of my SPICE capacity, and this method has been working fine for almost two years.
For some reason, the SPICE table is suddenly not refreshing completely, and I can't figure out why.
There are 183 records in the table, but only 181 records in the SPICE dataset. If I query the table using SQL Workbench/J I get 183 records, but only 181 in the QuickSight dataset.
I have tried refreshing multiple times and have also set the dataset to query directly, bypassing SPICE, and still cannot get those other two rows returned.
Nothing has changed in our permissions or anything about the Redshift-Quicksight IAM config.
Any ideas about what could possibly be going on here?
Thanks for any help!
UPDATE: As I mentioned, if I select * from the table with SQL Workbench/J, I get the 183 rows that I expect. However, if I select * directly from the AWS query editor v2, I only get 181 rows. Can anyone explain to me what is causing this discrepancy?
SOLVED: The difference is that my processing now requires an explicit COMMIT, where it did not require the COMMIT statement before.
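For anyone who hits the same symptom: if the load job runs inside an explicit transaction, other sessions, including the SPICE refresh, keep seeing the pre-transaction snapshot until a COMMIT is issued. A minimal sketch with placeholder object names:

```sql
BEGIN;

-- placeholder load step; your actual ETL goes here
INSERT INTO my_schema.my_table
SELECT * FROM my_schema.my_staging;

COMMIT;  -- without this, other sessions (and the QuickSight refresh) miss the new rows
```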
So I have a query where I need a column with an aggregate over a fixed window frame, for example:
SUM(subquery1.value) OVER (
PARTITION BY subquery1.entity
ORDER BY subquery1.timestamp
ROWS BETWEEN 55 PRECEDING and CURRENT ROW
)
And it works: it gives me the expected result and doesn't take much time. Here's an EXPLAIN ANALYZE of it:
I know the statistics are a bit off, but I'm happy with it.
The problem is that 55 preceding rows is not really what I want; what I actually want is closer to 50,000, and that's when it gets slow. If I bump the PRECEDING to 555 it already gets much slower; for higher values I haven't waited for the run to finish.
So I don't know what's happening. I have tried the same aggregate directly on a table with 300,000+ rows and it takes less than one second. This example is just over 50k rows, so it should be much faster.
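One workaround worth trying, offered as an untested sketch (double-check the off-by-one on the frame bound, and note that an `N PRECEDING` frame end requires PostgreSQL 11+): a large sliding ROWS frame can be rewritten as the difference of two running totals, so each output row no longer depends on re-aggregating tens of thousands of rows. For the 55-row case above it would look like:

```sql
-- sum over rows (i-55 .. i) = cumulative(i) - cumulative(i-56)
SELECT entity,
       "timestamp",
       SUM(value) OVER (PARTITION BY entity ORDER BY "timestamp"
                        ROWS UNBOUNDED PRECEDING)
     - COALESCE(SUM(value) OVER (PARTITION BY entity ORDER BY "timestamp"
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND 56 PRECEDING), 0) AS sliding_sum
FROM subquery1;
```

For the 50,000-row target, the second window would end at 50001 PRECEDING. Whether this is actually faster depends on the planner, so treat it as an experiment.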
I have a query (a relationship between CONTRACT and ORDERS) that I decided to break into two parts (contract and orders) so I can reuse them in another stored procedure.
Before the break-up, the code took around 10 seconds to run; however, when I use a function to get the contract, pump the data into a temp table first, and then join to the other parts, it takes 2m30s. Why the difference?
The function takes less than a second to run and returns only one row i.e. details of one contract (contract_id is the parameter supplied to the function).
The part most affecting performance is ORDERS, the largest table in the query, which has 4.1 million rows and joins to a few other tables. However, if I run the orders subquery in isolation with a particular filter (the contract id), it takes less than a second and happens to return zero records for the contract I am testing (due to filtering on the type of order it is looking for).
Based on the above, you would expect at most 1 second for the function + at most 1 second to get the orders + the summarizing = 2 seconds at most, not two and a half minutes!
Where am I going wrong, and how do I begin to isolate this time difference?
I know someone is going to tell me to paste the code, but surely this is an issue of the database vs. indexes, or of how the engine performs with raw code versus code broken into parts. Is there an area of the code I can look at before having to post the whole thing? I have tried variations of OUTER APPLY vs. LEFT JOIN from the contract temp table to the orders subquery, and both give about the same result. Any ideas?
I don't think the issue was with the code but with the network I was running it on, bizarre as that sounds. I had two versions of the proc running side by side: before the weekend, one was running in 10 seconds, and it is still running in 10 seconds three days later, while my new version (using the function) was taking anywhere between 2 and 3 minutes. This morning it is running in 2 or 3 seconds! So I don't know whether switching from SELECT ... INTO #Contract to declaring the table structure and using a table variable made the difference, or the network, or whether precompiling has an effect. Whatever it was, it is no longer an issue. Should I delete this post?
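For later readers, one pattern that helps isolate this kind of regression (a sketch with hypothetical object names, not the actual code from the question): materialize the function's single row into a real temp table, which gets statistics unlike a table variable, and request a fresh plan on the join so the optimizer sees the true cardinalities:

```sql
-- fn_GetContract, Orders, and the column names are placeholders
SELECT c.*
INTO #Contract
FROM dbo.fn_GetContract(@contract_id) AS c;

SELECT c.contract_id,
       COUNT(*) AS order_count
FROM #Contract AS c
JOIN dbo.Orders AS o
  ON o.contract_id = c.contract_id
WHERE o.order_type = @order_type
GROUP BY c.contract_id
OPTION (RECOMPILE);  -- compile this statement against the actual temp-table row count
```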
I have executed a query in the Hive CLI that should generate around 11,000,000 rows; I know the result because I have run the same query in MS SQL Server Management Studio.
The problem is that the Hive CLI keeps printing rows on and on (it has been more than 12 hours since I started the execution), and all I want to know is the processing time, which is shown only after the results.
So I have 2 questions :
How do I skip printing the row results in the Hive command line?
If I execute the query in Beeswax, how do I see statistics like execution time, similar to SET STATISTICS TIME ON in T-SQL?
You can check progress using the tracking link given in the log, but it won't give you the total processing time remaining.
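One way to skip printing millions of rows in the Hive CLI is to write the result to a directory instead; the console then only shows the MapReduce progress and the final "Time taken" line (the output path below is a placeholder):

```sql
-- Rows go to HDFS instead of the console
INSERT OVERWRITE DIRECTORY '/tmp/query_out'
SELECT ...;   -- your original query
```

Alternatively, running the query non-interactively with `hive -e "..."` and redirecting stdout to /dev/null should discard the rows while the timing lines, which go to stderr, remain visible.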
I have a resultset with columns:
interval_start(timestamp) , camp , queue , other columns
2012-09-10 11:10 c1 q1
2012-09-10 11:20 c1 q2
interval_start has values at 10-minute intervals, like:
2012-09-10 11:10,
2012-09-10 11:20,
2012-09-10 11:30 ....
Using the Joda-Time library and the interval_start field, I have created a variable that builds a string such that if the minutes of interval_start lie between 00 and 30, the minutes are set to 30; otherwise they are set to 00.
I want to group the data as :
camp as group1
variable created as group2
queue as group3
and then perform some aggregations.
But in my report result, I am getting the same queue many times in the same interval.
I have used ORDER BY camp, interval_start, queue, but the problem still exists.
Attaching screenshot for your reference:
Is there any way to sort the resultset according to the created variable?
My best guess would be an issue with your actual SQL query. You say the same queue is being repeated, but from looking at your image it is not actually repeated; it is a different row.
Your query is going to be tough to pull off, as you really want it to have an ORDER BY of camp, (rounded) interval_start, queue. Without that, it orders by the camp column, then the non-rounded interval_start, and then queue, which means the data is not in the correct order for JasperReports to group it the way you want. The real kicker is that JasperReports cannot sort the data once it receives it; it is up to the developer to supply sorted data.
So you have a few options:
Update your SQL query to do the rounding of your time. Depending on your database this is done in different ways, but it will likely require some type of stored procedure or function to handle it (see this T-SQL function for example).
Instead of having the SQL query in the report, move it outside the report and process the data, doing the rounding and sorting on the Java side. Then pass it in as the REPORT_DATASOURCE parameter.
Add a column to your table to store the rounded time. You may be able to create a trigger to handle this entirely in the database, without having to change any other code in your application.
Honestly, none of these options is ideal, and I hope someone comes along and provides an answer that proves me wrong. But I do not think there is currently a better way.
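For the first option, the rounding can often be done inline rather than in a stored procedure. In T-SQL, for instance, flooring interval_start to the half hour looks roughly like this (the source table name is a placeholder, and you would adjust the expression if you need the exact 00/30 mapping described in the question):

```sql
SELECT camp,
       DATEADD(minute,
               (DATEDIFF(minute, 0, interval_start) / 30) * 30,
               0) AS interval_rounded,   -- floored to :00 or :30
       queue
FROM my_resultset                        -- placeholder source
ORDER BY camp, interval_rounded, queue;
```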