Ecto.Adapters.SQL.query! gives a different result - postgresql

So this is apparently one of these weird days... And I know this makes 0 sense.
I'm executing a query in DataGrip (a tool for running raw queries) against the exact same database as my Phoenix application, and the two return different results.
The query is quite complicated, but it's the only query that shows different results, so I cannot simplify it. I've tried other queries to make sure I'm hitting the same database, restarted the server, etc.
Here is the exact same query executed from my console. As you can see it is not the same result. A few rows are missing.
I have also checked if this is a timing issue by executing select now() => same result (more or less obviously). If I execute only the generate_series part, it returns the same result. So it could have something to do with the join.
I also checked the last few entries in the ttnmessages table just to be sure there is no general caching issue; the queries also give the same result there.
So my question is: does Ecto do anything differently when executing a query? How can I figure this out? I'm grateful for any hint.
EDIT: The query is in both cases:
SELECT g.series AS time, MAX((t.payload ->'pulse')::text::numeric) as pulse
FROM generate_series(date_trunc('hour', now())- INTERVAL '12 hours', date_trunc('hour', now()), INTERVAL '60 min') AS g(series)
LEFT JOIN ttnmessages t
ON t.inserted_at < g.series + INTERVAL '60 min'
AND t.inserted_at > g.series
WHERE t.hardware_serial LIKE '093B55DF0C2C525A'
GROUP BY g.series
ORDER BY g.series;
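One thing worth noting about the query itself: because the hardware_serial filter sits in the WHERE clause rather than in the join condition, the LEFT JOIN effectively behaves like an inner join, and generated hours with no matching ttnmessages row are dropped. A minimal sketch of that effect using a toy schema with Python's stdlib sqlite3 (table and column names here are illustrative, not the original schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE series(hour INTEGER);
    CREATE TABLE msgs(hour INTEGER, serial TEXT);
    INSERT INTO series VALUES (1), (2), (3);
    INSERT INTO msgs VALUES (1, 'ABC');  -- only hour 1 has a message
""")

# Filtering the LEFT JOINed table in WHERE: unmatched hours have serial NULL,
# NULL LIKE 'ABC' is not true, so those rows are dropped (inner-join behaviour).
where_rows = conn.execute("""
    SELECT s.hour FROM series s
    LEFT JOIN msgs m ON m.hour = s.hour
    WHERE m.serial LIKE 'ABC'
    ORDER BY s.hour
""").fetchall()

# Filtering in the ON clause instead: every hour survives the outer join.
on_rows = conn.execute("""
    SELECT s.hour FROM series s
    LEFT JOIN msgs m ON m.hour = s.hour AND m.serial LIKE 'ABC'
    ORDER BY s.hour
""").fetchall()

print(where_rows)  # [(1,)]
print(on_rows)     # [(1,), (2,), (3,)]
```

Moving the filter into the ON clause keeps one row per generated hour, with NULLs for hours that have no messages.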

While I did not find out the cause, I changed the query to the following:
SELECT MAX(t.inserted_at) as time, (t.payload ->'pulse')::text::numeric as pulse
FROM ttnmessages t
WHERE t.inserted_at > now() - INTERVAL '12 hours'
AND t.payload ->'pulse' IS NOT NULL
AND t.hardware_serial LIKE '093B55DF0C2C525A'
GROUP BY (t.payload ->'pulse')
ORDER BY time;
Runtime is < 50ms, so I'm happy with the result.
And I'll ignore the different results from the question. The query here returns the same result just like it's supposed to.

Related

Postgres SQL query runs very slow when using current_date

We have a SQL query using a where clause to filter the data set by date. Currently the where clause in the query is set as follows -
WHERE date_field BETWEEN current_date - 90 AND current_date (where 90 is the interval, i.e. the last 90 days)
This query has suddenly started to slow down over the past couple of days, taking upwards of 10 minutes. If we remove current_date from the query and hard-code the dates instead, it runs as it used to, in less than 10 seconds.
Hard-coded version
WHERE date_field BETWEEN '03-12-2020' AND '06-12-2020'
This query runs against the Postgres engine in Amazon Aurora. The same WHERE clause is used in other queries filtering on the same field, and those are not affected by this issue.
How can we determine what suddenly caused this, and how can we resolve it?
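One common workaround, independent of the root cause (which may be stale planner statistics or a plan change around the current_date expression), is to compute the date bounds in the application and pass them as query parameters, so the planner sees concrete values just as in the hard-coded version. A rough sketch in Python; the query text, table name, and psycopg2-style %s placeholders are illustrative assumptions:

```python
from datetime import date, timedelta

# Compute the 90-day window in the application instead of using
# current_date arithmetic inside the query text.
end = date.today()
start = end - timedelta(days=90)

# Hypothetical parameterised query (psycopg2-style placeholders);
# only date_field is taken from the question.
sql = "SELECT * FROM some_table WHERE date_field BETWEEN %s AND %s"
params = (start, end)
print(sql, params)
```

Passing the bounds as parameters also makes the query's date format unambiguous, unlike string literals such as '03-12-2020'.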

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL selects?
To add to Alex's answer: the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in its run time, so the runtime there isn't a very good indicator of the query's performance.
To get the actual runtime of the query, check stl_wlm_query for total_exec_time.
select total_exec_time
from stl_wlm_query
where query = <query_id>;  -- note: query is a numeric id, not a string
There are some usefuls tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of said scripts stripped out to give you query run times in milliseconds. Play with the filters, ordering etc to show the results you are looking for:
select userid,
       label,
       stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime,
       endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
    select query, trim(split_part(event, ':', 1)) as event
    from STL_ALERT_EVENT_LOG
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100

How to execute SELECT DISTINCT ON query using SQLAlchemy

I have a requirement to display spend estimation for the last 30 days. SpendEstimation is calculated multiple times a day. This can be achieved with a simple SQL query:
SELECT DISTINCT ON (date) date(time) AS date, resource_id, time
FROM spend_estimation
WHERE resource_id = '<id>'
  AND time > now() - interval '30 days'
ORDER BY date DESC, time DESC;
Unfortunately I can't seem to do the same using SQLAlchemy. It always creates a SELECT DISTINCT on all columns; the generated query does not contain DISTINCT ON.
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    'date'
).order_by(
    'date',
    SpendEstimation.time
)
SELECT DISTINCT
date(time) AS date,
resource_id,
time
FROM spend
ORDER BY date, time
It is missing the ON (date) bit. If I use query.group_by, then SQLAlchemy adds DISTINCT ON, though I can't think of a solution to the given problem using GROUP BY.
I also tried using the function in the distinct part and the order_by part:
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    func.date(SpendEstimation.time).label('date')
).order_by(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.time
)
Which resulted in this SQL:
SELECT DISTINCT
date(time) AS date,
resource_id,
time,
date(time) AS date # only difference
FROM spend
ORDER BY date, time
Which is still missing DISTINCT ON.
Your SqlAlchemy version might be the culprit.
Sqlalchemy with postgres. Try to get 'DISTINCT ON' instead of 'DISTINCT'
Link to the bug report:
https://bitbucket.org/zzzeek/sqlalchemy/issues/2142
A fix wasn't backported to 0.6; it looks like it was fixed in 0.7.
Stupid question: have you tried distinct on SpendEstimation.date instead of 'date'?
EDIT: It just struck me that you're trying to use the named column from the SELECT. SQLAlchemy is not that smart. Try passing the func expression into the distinct() call.
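For illustration, here is what that suggestion looks like compiled against the PostgreSQL dialect: passing the func expression itself (not its string label) to distinct() renders DISTINCT ON. This sketch uses the modern select() API and a stand-in model, so the details differ from the 0.x Query API in the question:

```python
from sqlalchemy import Column, DateTime, Integer, String, func, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SpendEstimation(Base):
    __tablename__ = "spend_estimation"
    id = Column(Integer, primary_key=True)
    resource_id = Column(String)
    time = Column(DateTime)

# Pass the expression itself to distinct(), not the string label 'date'.
stmt = (
    select(
        func.date(SpendEstimation.time).label("date"),
        SpendEstimation.resource_id,
        SpendEstimation.time,
    )
    .distinct(func.date(SpendEstimation.time))
    .order_by(func.date(SpendEstimation.time).desc(),
              SpendEstimation.time.desc())
)

# Compile against the PostgreSQL dialect to see the DISTINCT ON clause.
sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)
```

The compiled SQL should start with SELECT DISTINCT ON (date(spend_estimation.time)); note that expression arguments to distinct() are a PostgreSQL-only feature.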

Amazon redshift query planner

I'm facing a situation with Amazon Redshift that I haven't been able to explain to myself yet. The query planner seems unable to handle the same table appearing in the subqueries of two derived tables in a join.
I have essentially four tables, Source_A, Source_B, Target_1 and Target_2, and a query like:
SELECT sa.a, sa.b, sb.c, sb.d FROM
(
    SELECT a, b FROM Source_A WHERE date > (SELECT max(date) FROM Target_1)
) AS sa
INNER JOIN
(
    SELECT c, d FROM Source_B WHERE date > (SELECT max(date) FROM Target_2)
) AS sb
ON sa.a = sb.c
The query works fine as long as Target_1 and Target_2 are different tables. If I change the query so that Target_2 = Target_1, something happens: the query starts to take about 10 times longer, and in the performance monitor I can see that all of this extra time is spent with only the leader node active.
When I run EXPLAIN on both variants I see practically no difference in the output; all the steps are the same. The difference is that EXPLAIN itself takes seconds for one variant and almost half an hour for the one where the Target tables are the same.
So to summarise, what I think I have observed is that on a join, if I use the same table in a subquery of each derived table, the query planner goes nuts.

Syntax for Avoiding Multiple Subqueries In Postgres

I'm new to postgres, so please take it easy on me.
I'm trying to write a query so that for any user, I can pull ALL of the log entries (both their activity and the activity of others) from one minute before to one minute after their name appears in the logs, within the same batchstamp.
chat.batchstamp is varchar
chat.datetime is timestamp
chat.msg is text
chat.action is text (this is the field with the username)
Here are the separate commands I want to use; I just don't know how to put them together, or whether this is really the right path to take.
SELECT batchstamp, datetime, msg FROM chat WHERE action LIKE 'username';
Anticipated output:
batchstamp datetime msg
abc 2010-12-13 23:18:00 System logon
abc 2010-12-13 10:12:13 System logon
def 2010-12-14 11:12:18 System logon
SELECT * FROM chat WHERE datetime BETWEEN datetimefrompreviousquery - interval '1 minute' AND datetimefrompreviousquery + interval '1 minute';
Can you please help explain to me what I should do to feed data from the previous query in to the second query? I've looked at subqueries, but do I need to run two subqueries? Should I build a temporary table?
After this is all done, how do I make sure that the times the query matches are within the same batchstamp?
If you're able to point me in the right direction, that's great. If you're able to provide the query, that's even better. If my explanation doesn't make sense, maybe I've been looking at this too long.
Thanks for your time.
Based on nate c's code below, I used this:
SELECT * FROM chat,
    ( SELECT batchstamp, datetime FROM chat WHERE action = 'fakeuser' ) AS log
WHERE chat.datetime BETWEEN log.datetime - interval '1 minute'
                        AND log.datetime + interval '1 minute';
It doesn't seem to return every hit of 'fakeuser' and when it does, it pulls the logs from every 'batchstamp' instead of just the one where 'fakeuser' was found. Am I in for another nested query? What's this type of procedure called so I can further research it?
Thanks again.
Your first query can go in the FROM clause with brackets around it and an 'AS alias' name. After that you can reference it as you would a normal table in the rest of the query.
SELECT *
FROM chat,
    (
        SELECT batchstamp, datetime, msg
        FROM chat
        WHERE action LIKE 'username'
    ) AS log
WHERE chat.datetime BETWEEN log.datetime - interval '1 minute'
                        AND log.datetime + interval '1 minute';
That should get you started.
A colleague at work came up with the following solution which seems to provide the results I'm looking for. Thanks for everyone's help.
SELECT batchstamp, datetime, msg
INTO temptable
FROM chat
WHERE action = 'fakeusername';

SELECT a.batchstamp, a.action, a.datetime, a.msg
FROM chat a, temptable b
WHERE a.batchstamp = b.batchstamp
  AND a.datetime BETWEEN b.datetime - interval '1 minute'
                     AND b.datetime + interval '1 minute'
  AND a.batchstamp = '2011-3-1 21:21:37'
GROUP BY a.batchstamp, a.action, a.datetime, a.msg
ORDER BY a.datetime;
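The temp-table step can also be folded into a single self-join: anchor on the rows where the user appears, then pull every row in the same batchstamp within one minute of each anchor. A minimal sketch of the pattern using Python's stdlib sqlite3 (toy data, and SQLite's datetime() modifiers in place of Postgres intervals):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chat(batchstamp TEXT, datetime TEXT, action TEXT, msg TEXT);
    INSERT INTO chat VALUES
      ('abc', '2010-12-13 23:18:00', 'fakeuser', 'System logon'),
      ('abc', '2010-12-13 23:18:30', 'other',    'chat line'),
      ('abc', '2010-12-13 23:25:00', 'other',    'too late'),
      ('def', '2010-12-13 23:18:10', 'other',    'wrong batch');
""")

# b anchors the rows where the user appears; a pulls everything in the same
# batchstamp within +/- 1 minute of each anchor. DISTINCT plays the role of
# the GROUP BY dedup when one row falls inside several anchor windows.
rows = conn.execute("""
    SELECT DISTINCT a.batchstamp, a.datetime, a.action, a.msg
    FROM chat a
    JOIN chat b ON a.batchstamp = b.batchstamp
               AND a.datetime BETWEEN datetime(b.datetime, '-1 minutes')
                                  AND datetime(b.datetime, '+1 minutes')
    WHERE b.action = 'fakeuser'
    ORDER BY a.datetime
""").fetchall()

print(rows)
# Only the two 'abc' rows within a minute of the 23:18:00 anchor survive.
```

The same shape works in Postgres with b.datetime - interval '1 minute' and b.datetime + interval '1 minute' as the BETWEEN bounds.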