Postgres issue with Count / GROUP BY / date_trunc()

I know that there are a couple of threads regarding this, but I read them all w/o any luck.
I have the following query:
select coalesce(count(*)), date_trunc('month', generate_series(min(item.updated_at), max(item.updated_at), '1 month'))
from item
group by date_trunc('month', item.updated_at)
order by date_trunc ;
but it only shows months in which items were updated; it just skips months with 0 matches.
I tried adding coalesce and generating the series with generate_series(), but it's still not working.
Any clues?
Thanks a lot in advance.
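For what it's worth, the usual fix for this is to generate the month series as a row source and LEFT JOIN the items onto it, counting a nullable column so that empty months come out as 0. A minimal sketch, assuming only the item table and updated_at column from the question:

SELECT months.month, COUNT(i.updated_at) AS item_count
FROM (
    SELECT generate_series(
        date_trunc('month', min(updated_at)),
        date_trunc('month', max(updated_at)),
        interval '1 month'
    ) AS month
    FROM item
) AS months
LEFT JOIN item i
    ON date_trunc('month', i.updated_at) = months.month
GROUP BY months.month
ORDER BY months.month;

COUNT(i.updated_at) skips the NULLs produced by the LEFT JOIN, so months with no matches show 0 instead of 1.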

Ecto.Adapters.SQL.query! gives a different result

So this is apparently one of those weird days... and I know this makes no sense.
I'm executing a query in DataGrip (a tool for executing raw queries) against the exact same database as my Phoenix application, and the two are returning different results.
The query is quite complicated, and it's the only query that shows different results, so I cannot simplify it. I've tried other queries to be sure that I'm hitting the same database, restarted the server, etc.
Here is the exact same query executed from my console. As you can see, it is not the same result; a few rows are missing.
I have also checked whether this is a timing issue by executing select now() => same result (more or less, obviously). If I execute only the generate_series part, it returns the same result, so it could have something to do with the join.
I also checked the last few entries in the ttnmessages table just to be sure there is no general caching issue. The queries also give the same result there.
So my question is: Is there anything that Ecto does differently upon executing a query? How can I figure this out? I'm grateful for any hint.
EDIT: The query is in both cases:
SELECT g.series AS time, MAX((t.payload ->'pulse')::text::numeric) as pulse
FROM generate_series(date_trunc('hour', now())- INTERVAL '12 hours', date_trunc('hour', now()), INTERVAL '60 min') AS g(series)
LEFT JOIN ttnmessages t
ON t.inserted_at < g.series + INTERVAL '60 min'
AND t.inserted_at > g.series
WHERE t.hardware_serial LIKE '093B55DF0C2C525A'
GROUP BY g.series
ORDER BY g.series;
While I did not find out the cause, I changed the query to the following:
SELECT MAX(t.inserted_at) as time, (t.payload ->'pulse')::text::numeric as pulse
FROM ttnmessages t
WHERE t.inserted_at > now() - INTERVAL '12 hours'
AND t.payload ->'pulse' IS NOT NULL
AND t.hardware_serial LIKE '093B55DF0C2C525A'
GROUP BY (t.payload ->'pulse')
ORDER BY time;
Runtime is < 50 ms, so I'm happy with the result.
I'll ignore the different results from the original question; the query here returns the same result in both tools, just as it's supposed to.
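One thing worth noting about the original query (an observation, not a confirmed explanation of the DataGrip/Ecto discrepancy): the WHERE filter on the left-joined table discards every series row that has no matching ttnmessages row, effectively turning the LEFT JOIN into an inner join. Moving that predicate into the ON clause keeps all 12 hourly buckets, with NULL pulse for empty hours:

SELECT g.series AS time, MAX((t.payload -> 'pulse')::text::numeric) AS pulse
FROM generate_series(date_trunc('hour', now()) - INTERVAL '12 hours',
                     date_trunc('hour', now()),
                     INTERVAL '60 min') AS g(series)
LEFT JOIN ttnmessages t
    ON t.inserted_at > g.series
    AND t.inserted_at < g.series + INTERVAL '60 min'
    AND t.hardware_serial LIKE '093B55DF0C2C525A'
GROUP BY g.series
ORDER BY g.series;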

GROUP BY date error after update to MySQL 5.7

I have a simple script that counts form leads and displays the counts by month and year. It worked fine until I upgraded to MySQL 5.7. Now I get this error:
There was an error running the query [Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'form.form_25.submission_date' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by]
My query is:
SELECT YEAR(`submission_date`) AS yr,
MONTH(`submission_date`) AS mth,
DATE_FORMAT(`submission_date`,'%M %Y') AS display_date,
COUNT(*) AS leadcount
FROM form_25
WHERE `submission_date` >= CURRENT_DATE - INTERVAL 1 YEAR
GROUP BY yr,mth
ORDER BY yr DESC, mth DESC
I realize this is because only_full_group_by is enabled, but I don't want to disable it.
I've researched this problem, but it seems like all of the suggested solutions involve grouping by a unique column. That isn't a solution in this case, because grouping by my primary key column does not produce the lead counts properly.
Thanks in advance for your help.
Okay, I figured out a solution that is good enough for my purposes. I discovered that the error only happens when this line is included:
DATE_FORMAT(`submission_date`,'%M %Y') AS display_date,
So I removed that line and recreated the display_date variable in PHP by using the yr and mth aliases.
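For what it's worth, MySQL 5.7 also provides ANY_VALUE() for exactly this situation. Since every row in a given (yr, mth) group shares the same '%M %Y' string, picking an arbitrary value per group is safe here; a sketch of the original query kept entirely in SQL (untested against the original schema):

SELECT YEAR(`submission_date`) AS yr,
       MONTH(`submission_date`) AS mth,
       ANY_VALUE(DATE_FORMAT(`submission_date`,'%M %Y')) AS display_date,
       COUNT(*) AS leadcount
FROM form_25
WHERE `submission_date` >= CURRENT_DATE - INTERVAL 1 YEAR
GROUP BY yr, mth
ORDER BY yr DESC, mth DESC;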

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that there are many rows for each given calendar day, and no fixed number of rows for a given day.
Based on this dataset, I need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3, I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old-fashioned way and tried with a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT yy.UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time. I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help, but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on exact speed, but it should be a lot faster than your current query. Hopefully you have indexes on both of these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
    SELECT calday, COUNT(DISTINCT userid) AS daily
    FROM data_table
    GROUP BY calday
) t1
JOIN data_table t2
    ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. This keeps the join much smaller and returns quickly (just under 1 second for 45,000 rows in the source table on my system).
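Regarding the "hopefully you have indexes" remark: the join condition is a range predicate on calday, so an index there does most of the work. A minimal sketch, assuming the table and column names from the thread (the index names are made up):

-- speeds up the BETWEEN range join on calday
CREATE INDEX idx_data_table_calday ON data_table (calday);
-- a composite index can additionally let the DISTINCT userid counting
-- be served by an index-only scan (PostgreSQL 9.2+)
CREATE INDEX idx_data_table_calday_userid ON data_table (calday, userid);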

How to execute SELECT DISTINCT ON query using SQLAlchemy

I have a requirement to display a spend estimation for the last 30 days. SpendEstimation is calculated multiple times a day. This can be achieved with a simple SQL query:
SELECT DISTINCT ON (date) date(time) AS date, resource_id, time
FROM spend_estimation
WHERE
resource_id = '<id>'
and time > now() - interval '30 days'
ORDER BY date DESC, time DESC;
Unfortunately, I can't seem to do the same using SQLAlchemy: it always creates a SELECT DISTINCT over all columns, and the generated query does not contain DISTINCT ON.
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    'date'
).order_by(
    'date',
    SpendEstimation.time
)
SELECT DISTINCT
date(time) AS date,
resource_id,
time
FROM spend
ORDER BY date, time
It is missing the ON (date) bit. If I use query.group_by, then SQLAlchemy adds DISTINCT ON, though I can't think of a solution to the given problem using GROUP BY.
I tried using the function in the DISTINCT part and the ORDER BY part as well:
query = session.query(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    func.date(SpendEstimation.time).label('date')
).order_by(
    func.date(SpendEstimation.time).label('date'),
    SpendEstimation.time
)
Which resulted in this SQL:
SELECT DISTINCT
date(time) AS date,
resource_id,
time,
date(time) AS date  -- only difference
FROM spend
ORDER BY date, time
Which is still missing DISTINCT ON.
Your SQLAlchemy version might be the culprit. See this related question: "Sqlalchemy with postgres. Try to get 'DISTINCT ON' instead of 'DISTINCT'", which links to this bug report:
https://bitbucket.org/zzzeek/sqlalchemy/issues/2142
A fix wasn't backported to 0.6; it looks like it was fixed in 0.7.
Stupid question: have you tried distinct on SpendEstimation.date instead of 'date'?
EDIT: It just struck me that you're trying to use the named column from the SELECT. SQLAlchemy is not that smart. Try passing in the func expression into the distinct() call.
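To make that last suggestion concrete: on SQLAlchemy 0.7+, passing a column expression (not a string alias) to Query.distinct() emits PostgreSQL's DISTINCT ON. A sketch, assuming the model from the question and the DESC ordering of the raw SQL:

from sqlalchemy import func

date_expr = func.date(SpendEstimation.time)
query = session.query(
    date_expr.label('date'),
    SpendEstimation.resource_id,
    SpendEstimation.time
).distinct(
    date_expr  # renders DISTINCT ON (date(spend_estimation.time)) on PostgreSQL
).order_by(
    date_expr.desc(),          # DISTINCT ON expression must lead the ORDER BY
    SpendEstimation.time.desc()
)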

Syntax for Avoiding Multiple Subqueries In Postgres

I'm new to Postgres, so please take it easy on me.
I'm trying to write a query so that for any user I can pull ALL of the log entries (both their activity and the activity of others) from one minute before to one minute after their name appears in the logs, within the same batchstamp.
chat.batchstamp is varchar
chat.datetime is timestamp
chat.msg is text
chat.action is text (this is the field with the username)
Here are the separate commands I want to use; I just don't know how to put them together, or whether this is really the right path to take.
SELECT batchstamp, datetime, msg FROM chat WHERE action LIKE 'username';
Anticipated output:
batchstamp datetime msg
abc 2010-12-13 23:18:00 System logon
abc 2010-12-13 10:12:13 System logon
def 2010-12-14 11:12:18 System logon
SELECT * FROM chat WHERE datetime BETWEEN datetimefrompreviousquery - interval '1 minute' AND datetimefrompreviousquery + interval '1 minute';
Can you please explain what I should do to feed data from the previous query into the second query? I've looked at subqueries, but do I need to run two subqueries? Should I build a temporary table?
After this is all done, how do I make sure that the times the query matches are within the same batchstamp?
If you're able to point me in the right direction, that's great. If you're able to provide the query, that's even better. If my explanation doesn't make sense, maybe I've been looking at this too long.
Thanks for your time.
Based on nate c's code below, I used this:
SELECT * FROM chat,
    ( SELECT batchstamp, datetime FROM chat WHERE action = 'fakeuser' ) AS log
WHERE chat.datetime BETWEEN log.datetime - interval '1 minute'
                        AND log.datetime + interval '1 minute';
It doesn't seem to return every hit of 'fakeuser', and when it does, it pulls the logs from every batchstamp instead of just the one where 'fakeuser' was found. Am I in for another nested query? What is this type of procedure called, so I can research it further?
Thanks again.
Your first query can go in the FROM clause with parentheses around it and an 'AS alias' name. After that, you can reference it as you would a normal table in the rest of the query.
SELECT
    *
FROM chat,
    (
        SELECT
            batchstamp,
            datetime,
            msg
        FROM chat
        WHERE action LIKE 'username'
    ) AS log
WHERE chat.datetime BETWEEN
    log.datetime - interval '1 minute'
    AND log.datetime + interval '1 minute';
That should get you started.
A colleague at work came up with the following solution, which seems to provide the results I'm looking for. Thanks for everyone's help.
SELECT batchstamp, datetime, msg INTO temptable FROM chat WHERE action = 'fakeusername';

SELECT a.batchstamp, a.action, a.datetime, a.msg
FROM chat a, temptable b
WHERE a.batchstamp = b.batchstamp
  AND (
    a.datetime BETWEEN b.datetime - interval '1 minute'
               AND b.datetime + interval '1 minute'
  )
  AND a.batchstamp = '2011-3-1 21:21:37'
GROUP BY a.batchstamp, a.action, a.datetime, a.msg
ORDER BY a.datetime;
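For reference, the same result can be had without the temporary table by using a CTE; a sketch assuming the schema from the question, with DISTINCT standing in for the GROUP BY that was only de-duplicating rows:

WITH hits AS (
    SELECT batchstamp, datetime
    FROM chat
    WHERE action = 'fakeusername'
)
SELECT DISTINCT c.batchstamp, c.action, c.datetime, c.msg
FROM chat c
JOIN hits h
    ON c.batchstamp = h.batchstamp
    AND c.datetime BETWEEN h.datetime - interval '1 minute'
                       AND h.datetime + interval '1 minute'
ORDER BY c.datetime;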