I have changed the test database server from Windows to Synology. After this change, there is one query that takes 9-10 seconds to return a result on the Synology server; on the Windows server it took 0.18 seconds. When I exported the data, the indexes and triggers were exported as well, so there should be no issue with the indexes.
I have also compared the "Explain SQL" output on both servers; it is identical, no difference.
select SUM(work_hours) as apt_hour, `employee_id`
from `job_status`
where `job_date` = '2020-10-23'
  and `status` != '0'
  and `job_status`.`deleted_at` is null
group by `employee_id`
What is causing the issue here, and how can I decrease the execution time?
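Since the execution plan is the same on both servers, I assume the difference is in the server configuration or hardware rather than the query itself. For what it's worth, these are the checks I plan to compare on both machines, in particular whether the table and its index fit in the buffer pool (a minimal sketch, assuming the InnoDB storage engine):

-- how much memory InnoDB has for caching data and indexes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- reads served from memory vs. reads that had to go to disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';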
I am struggling with the performance of my dashboard, which runs queries on Redshift through the JDBC driver.
The query looks like this:
select <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_,
sum(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) as col_1_0_ from <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME> where <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$1
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$2
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$3
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$4
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$5
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$6
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$7
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$8
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$9
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$10
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$11
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$12
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$13
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$14
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$15
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$16
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$17
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$18
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$19
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$20
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$21
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$22
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$23
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$24
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$25
or <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME>=$26
or ........
For the dashboard we use Spring and Hibernate (I am not 100% sure about that, though).
The query can sometimes stretch to $1000+ parameters, depending on the filters/options selected in the UI.
The problem we are seeing is that the first time this query is run by the reports, it takes 40-60 seconds to respond. After the first time, the query runs quite fast and takes only a few seconds.
We initially suspected something was wrong with Redshift caching, but it turns out that even simple queries like these (just huge) take considerable time to COMPILE, which became clear when we looked at the svl_compile table: it shows this query was compiled in over 35 seconds.
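For anyone who wants to check the same thing, this is roughly how we looked at the per-segment compile times (a minimal sketch against the SVL_COMPILE system view; <query_id> is whatever query id Redshift assigned to the statement):

-- compile = 1 means the segment was actually compiled,
-- compile = 0 means a previously compiled segment was reused
select query,
       segment,
       compile,
       datediff(ms, starttime, endtime) as compile_ms
from svl_compile
where query = <query_id>
order by segment;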
What should I do to handle such issues?
I recommend restructuring the query generated by your dashboard to use an IN list. Redshift should be able to reuse the already compiled query segments for IN lists of different lengths.
Note that IN lists with fewer than 10 values are still evaluated as a series of OR predicates: https://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
SELECT <ALIAS_TO_SCHEMA.TABLENAME>.<ANOTHER_COLUMN_NAME> as col_0_0_
, SUM(<ALIAS_TO_SCHEMA.TABLENAME>.devicecount) AS col_1_0_
FROM <table_schema>.<table_name> <ALIAS_TO_SCHEMA.TABLENAME>
WHERE <ALIAS_TO_SCHEMA.TABLENAME>.<COLUMN_NAME> IN ( $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11 … $1000 )
;
I am pulling down stock market data and inserting it into a PostgreSQL database. I have 500 stocks and 60 days of historical data. Each day has 390 trading minutes, and each minute is a row in the database table. The summary of the issue is that the first 20-50 minutes of each day are missing for each stock. Sometimes it is fewer than 50, but it is never more than 50; every minute after that for each day is fine (EDIT: on further inspection there are missing minutes all over the place). The maximum matches the maximum number of concurrent goroutines (https://github.com/korovkin/limiter).
The hardware is set up in my home. I have a laptop that pulls the data and an 8-year-old gaming computer repurposed as a Postgres server running Ubuntu. They are connected through a Netgear Nighthawk X6 router and communicate over the LAN.
The laptop runs a Go program that pulls the data down and performs concurrent inserts. I loop through the 60 days; for each day I loop through each stock, and for each stock I loop through each minute and insert it into the database via an INSERT statement. Inside the minute loop I use a library that limits the maximum number of goroutines.
I am working around it by grabbing the data again and inserting until the Postgres server first responds that an entry is a duplicate and violates the unique constraint on the table, then breaking out of the loop for that stock.
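For reference, a minimal sketch of the kind of table and INSERT involved, assuming a unique constraint over the symbol and the minute timestamp (all names here are illustrative, not my actual schema):

CREATE TABLE minute_bars (
    symbol      text        NOT NULL,
    minute_ts   timestamptz NOT NULL,
    open_price  numeric,
    high_price  numeric,
    low_price   numeric,
    close_price numeric,
    volume      bigint,
    UNIQUE (symbol, minute_ts)
);

-- one row per stock per trading minute; the duplicate-key error I break on
-- comes from the UNIQUE (symbol, minute_ts) constraint
INSERT INTO minute_bars (symbol, minute_ts, open_price, high_price, low_price, close_price, volume)
VALUES ($1, $2, $3, $4, $5, $6, $7);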
However, I'd like to know what happened, as I want to better understand how these problems can arise under load. Any ideas?
limit := NewConcurrencyLimiter(50) // at most 50 inserts in flight at once
for _, m := range ms {
    limit.Execute(func() {
        m.Insert()
    })
}
limit.Wait()
The issue is that the closure captures the loop variable m, which is reused on every iteration, so by the time a goroutine actually runs it can see a different element than the one intended (and calling the method on that captured receiver effectively passes it by reference). I needed to copy the value I wanted inserted within the for loop, and change the method from one with a receiver to a function that takes the value as an input parameter:
for i := range ms {
    value := ms[i] // copy the element before handing it to the goroutine
    limit.Execute(func() {
        Insert(value) // each closure now captures its own copy
    })
}
limit.Wait()
I am using JasperSoft Reports v6.2.1, and when running a report in the Studio preview the output comes back after 2 seconds.
Running the same report (output xlsx) on the server takes more than half a minute, although there is no data volume issue (crosstab, 500 lines, 17 columns in Excel, "ignore pagination" = true).
I am using $P{LoggedInUsername} to filter data in the WHERE part of a WITH clause (based on the user's rights). When I run the report with a fixed value (the user's id as a string) instead of the parameter in the query, the execution speed is good.
The same holds against the Oracle DB from SQL Developer: the query result set for a user's id string comes back in 2 seconds.
Also, printing $P{LoggedInUsername} in a TextField produces a String.
As soon as I switch back to the $P{LoggedInUsername} parameter in the query, the report takes ages again or runs out of heap memory in Studio or on the server.
What could be the issue?
Finally, my problem was solved by using the expression user_id = '$P!{LoggedInUsername}' instead of $P{LoggedInUsername} in the WHERE part of my query.
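For context, $P{...} is passed to the database as a JDBC prepared-statement bind parameter, while $P!{...} is substituted as plain text into the SQL before it is executed, which appears to be what made the difference here. A minimal sketch of the two variants, with an illustrative WITH clause and made-up table/column names (not my actual query):

-- slow variant: the user id is sent as a bind variable
WITH user_rights AS (
    SELECT r.object_id
    FROM   rights r
    WHERE  r.user_id = $P{LoggedInUsername}
)
SELECT d.*
FROM   report_data d
JOIN   user_rights ur ON ur.object_id = d.object_id;

-- fast variant: the user id is injected as a literal before execution
WITH user_rights AS (
    SELECT r.object_id
    FROM   rights r
    WHERE  r.user_id = '$P!{LoggedInUsername}'
)
SELECT d.*
FROM   report_data d
JOIN   user_rights ur ON ur.object_id = d.object_id;

Note that $P!{...} bypasses parameter binding entirely, so it should only be used with values that are safe to inline; here the value comes from the server's own LoggedInUsername parameter.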
I have a script that runs a function on every item in my database to extract academic citations. The database is large, so the script takes about a week to run.
During that time, items are added and removed from the database.
The database is too large to pull completely into memory, so I have to chunk through it to process all the items.
Is there a way to ensure that when the script finishes, all of the items have been processed? Is this a pattern with a simple solution? So far my research hasn't revealed anything useful.
PS: Locking the table for a week isn't an option!
I would add a timestamp column "modified_at" to the table that defaults to null, so any new (not yet processed) item can be identified.
Your script can then pick the chunks to work on based on that column.
update items
set modified_at = current_timestamp
from (
select id
from items
where modified_at is null
limit 1000 --<<< this defines the size of each "chunk" that you work on
) t
where t.id = items.id
returning items.*;
This will mark 1000 rows that have not yet been processed as processed and return those rows, all in a single statement. Your job can then work on the returned items.
New rows need to be added with modified_at = null and your script will pick them up based on the where modified_at is null condition the next time you run it.
If you also change items while processing them, you need to update the modified_at accordingly. In your script you will then need to store the last start of your processing somewhere. The next run of your script can then select items to be processed using
where modified_at is null
or modified_at < (last script start time)
If you process each item only once (and then never again), you don't really need a timestamp; a simple boolean (e.g. is_processed) would do as well.
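A minimal sketch of the two column variants, assuming a Postgres table called items as in the update above (the boolean column name is just an example):

-- timestamp variant: null = not yet processed, otherwise the time it was picked up
alter table items add column modified_at timestamp default null;

-- boolean variant, if each item is processed exactly once
alter table items add column is_processed boolean not null default false;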
I have executed a query in the Hive CLI that should return around 11,000,000 rows; I know the expected result because I have also executed the query in MS SQL Server Management Studio.
The problem is that in the Hive CLI the rows keep printing on and on (right now it is more than 12 hours since I started the execution), and all I want to know is the processing time, which is shown only after all the results have been displayed.
So I have 2 questions:
How do I skip printing the result rows in the Hive command line?
If I execute the query in Beeswax, how do I see statistics like execution time, similar to SET STATISTICS TIME ON in T-SQL?
You can check it using the tracking link given in the log, but it won't tell you how much processing is left.
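Regarding the first question, one common workaround is to materialize the result instead of streaming it to the console, so the CLI only prints the job progress and the elapsed time; a minimal sketch with placeholder table and column names:

-- write the 11M rows into a table rather than to the terminal
CREATE TABLE my_result AS
SELECT col_a, col_b
FROM source_table
WHERE some_condition = 1;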