PostgreSQL DISTINCT problem: works locally but not on server

I've come across a vexing problem with a PostgreSQL query. This works in my local development environment:
SELECT distinct (user_id) user_id, created_at, is_goodday
FROM table
WHERE ((created_at >= '2011-07-01 00:00:00') AND user_id = 95
AND (created_at < '2011-08-01 00:00:00'))
ORDER BY user_id, created_at ASC;
...but gives the following error on my QA server (which is on Heroku):
PGError: ERROR: syntax error at or near "user_id"
LINE 1: SELECT distinct (user_id) user_id, created_at,
^
Why could this be?
Other possibly relevant info:
I have tried single-quoting and double-quoting the field names
It's a Rails 3 app, but I'm using this SQL raw, i.e. no ActiveRecord magic
My local version of Postgres is 9.0.4 on Mac, but I have no idea what version Heroku is using

As per your comment, the standard PostgreSQL version of that query would be:
SELECT user_id, created_at, is_goodday
FROM table
WHERE created_at >= '2011-07-01 00:00:00'
AND created_at < '2011-08-01 00:00:00'
AND user_id = 95
ORDER BY created_at DESC, id DESC
LIMIT 1
You don't need user_id in the ORDER BY because the WHERE clause already restricts the result to user_id = 95. You want created_at DESC in the ORDER BY to put the most recent created_at at the top, and then LIMIT 1 slices off just the first row of the result set. GROUP BY can be used to enforce uniqueness or to group rows for an aggregate function, but you need it for neither here: ORDER BY and LIMIT give you uniqueness, and the ORDER BY does the work of the aggregate (i.e. you don't need MAX because sorting by created_at DESC and taking the first row finds the maximum for you).
Since you have user_id = 95 in your WHERE, you don't need user_id in the SELECT but you can leave it in if that makes it easier for you in Ruby-land.
It is possible that you could have multiple entries with the same created_at so I added an id DESC to the ORDER BY to force PostgreSQL to choose the one with the highest id. There's nothing wrong with being paranoid when they really are out to get you and bugs definitely are out to get you.
Also, you want DESC in your ORDER BY to get the highest values at the top; ASC puts the lowest values at the top. The more recent timestamps are the higher ones.
In general, the GROUP BY and SELECT have to match up because:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
But that doesn't matter here because you don't need a GROUP BY at all. I linked to the 8.3 version of the documentation to match the PostgreSQL version you're using.
There are probably various other ways to do this, but this one is probably as straightforward and clear as you're going to get.

Put quotes around the user_id value, like user_id = '95'. Your query should be:
SELECT distinct (user_id) as uid, created_at, is_goodday
FROM table
WHERE ((created_at >= '2011-07-01 00:00:00') AND user_id = '95'
AND (created_at < '2011-08-01 00:00:00'))
ORDER BY user_id, created_at ASC;

You're using DISTINCT ON (without writing the ON). Perhaps you should write the ON. Perhaps your postgres server dates from before the feature was implemented (which is pretty old by now).
If all else fails, you can always do that with some GROUP BY...
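A minimal sketch of the query with the ON written out, as suggested above. The table name checkins and the sample rows are hypothetical (the question used the placeholder name table, which is a reserved word); the column names come from the question. PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions, which the question's ORDER BY already does.

```sql
-- Hypothetical table standing in for the question's "table".
CREATE TABLE checkins (
    id         serial PRIMARY KEY,
    user_id    integer   NOT NULL,
    created_at timestamp NOT NULL,
    is_goodday boolean   NOT NULL
);

-- Made-up sample rows for illustration only.
INSERT INTO checkins (user_id, created_at, is_goodday) VALUES
    (95, '2011-06-28 08:00:00', false),  -- outside the July range
    (95, '2011-07-03 09:00:00', true),
    (95, '2011-07-20 18:30:00', false),
    (96, '2011-07-11 12:00:00', true);

-- One row per user_id. Within each user_id the rows are sorted by
-- created_at ASC, so DISTINCT ON keeps the earliest July row.
SELECT DISTINCT ON (user_id) user_id, created_at, is_goodday
FROM checkins
WHERE created_at >= '2011-07-01 00:00:00'
  AND created_at <  '2011-08-01 00:00:00'
ORDER BY user_id, created_at ASC;
-- With the sample rows: (95, 2011-07-03 09:00:00, t) and (96, 2011-07-11 12:00:00, t)
```

Switching the second sort key to created_at DESC would keep the most recent row per user instead.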

Related

Find time difference between two most recent orders

I am trying to estimate the time of a new order from repeat customers by finding the time difference between the most recent order and the second most recent order, and then adding that difference to the most recent order.
I have been trying limit and offset, but this returns a blanket date for every row. I am thinking I need to do a lateral join, but not sure how to implement it correctly. When I try to do it, I receive no output.
select public.orders.customer_id,
max(public.orders.created_at) as last_order_date,
(select created_at from public.orders group by created_at order by created_at desc limit 1 offset 1) as second_last
from public.orders
inner join
(select
customer_id, count(*)
from public.orders
where status = 'fulfilled'
group by public.orders.customer_id
having count(customer_id) >1) repeat_customers
on public.orders.customer_id = repeat_customers.customer_id
group by public.orders.customer_id;
I wanted the second_last field to be populated by the second most recent date for each customer_id, but the output is the second most recent date for the entire table, resulting in the same date for every entry.
Your second_last subquery isn't limited to the current customer, so it returns the second most recent date in the whole table, just like the results you've seen. Tie it to the outer customer_id with a WHERE clause, and sort descending so that OFFSET 1 skips the most recent order rather than the oldest one:
(SELECT
po.created_at
FROM
public.orders po
WHERE
po.customer_id = public.orders.customer_id
ORDER BY
po.created_at DESC
LIMIT 1 OFFSET 1) AS second_last
I've also aliased the table because I wasn't sure if it would complain about ambiguity since the same table is mentioned in the main select.
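As an alternative to a correlated subquery, the same result can probably be had with window functions. This is only a sketch: the orders table below is a hypothetical cut-down version of public.orders, the sample rows are made up, and it assumes that only fulfilled orders should count towards the estimate. lag() fetches the previous order date per customer, and lead() is used to spot the most recent row (the one with no following order).

```sql
-- Hypothetical minimal version of public.orders, reduced to the columns used here.
CREATE TABLE orders (
    customer_id integer   NOT NULL,
    created_at  timestamp NOT NULL,
    status      text      NOT NULL
);

-- Made-up sample rows for illustration only.
INSERT INTO orders (customer_id, created_at, status) VALUES
    (1, '2019-01-01 10:00:00', 'fulfilled'),
    (1, '2019-01-11 10:00:00', 'fulfilled'),
    (1, '2019-01-18 10:00:00', 'fulfilled'),
    (2, '2019-01-05 09:00:00', 'fulfilled');  -- single order: not a repeat customer

SELECT customer_id,
       created_at      AS last_order_date,
       prev_created_at AS second_last,
       -- last order plus the gap between the last two orders
       created_at + (created_at - prev_created_at) AS estimated_next_order
FROM (
    SELECT customer_id,
           created_at,
           lag(created_at)  OVER w AS prev_created_at,
           lead(created_at) OVER w AS next_created_at
    FROM orders
    WHERE status = 'fulfilled'
    WINDOW w AS (PARTITION BY customer_id ORDER BY created_at)
) ordered
WHERE next_created_at IS NULL        -- the most recent order per customer
  AND prev_created_at IS NOT NULL;   -- only customers with at least two orders
-- With the sample rows: customer 1, 2019-01-18 10:00:00, 2019-01-11 10:00:00, 2019-01-25 10:00:00
```

Because both window functions share one window definition, the "last" and "previous" rows stay consistent with each other even when two orders carry the same timestamp.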

Calculate previous order date and status in Postgres

I have a simple table of orders, and I need to calculate some stats for each order. Essentially I have a Postgres db with fields:
Order_ID (unique), User_ID, Created_at (date), City, Total
I want to write a query that will generate, for each Order_ID:
1) the Created_at date of the user's most recent order prior to the current Order_ID (so if a customer placed order with Order_ID=200005b on 9/20/14, what is the date of that user's most recent previous order?)
2) another field showing a user's "Status" based on this date, given the following cases:
-- if this is user's first order, Status="new";
-- if most recent previous order date <= 60 days before the given/current order, Status="active";
-- if most recent previous order date > 60 days before the given/current order, Status="reactivated"
I think there's a way to write this query using some nested SELECTS, and maybe a self-join, but I don't know PostgreSQL well enough to understand the ordering of queries. I have been able to generate an "Order_N" field using the following query that I could use to lookup (Order_N)-1 to find the date, but I get stuck once trying to use that in nesting.
SELECT
user_id,
order_id,
created_at,
row_number() over (partition by user_id order by created_at ) as order_n
FROM orders -- table name assumed; the question does not state it
order by user_id, created_at;
Does anyone have any ideas?
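One possible approach, sketched under assumptions: the lag() window function can fetch each user's previous order date directly, without a self-join or a lookup on order_n - 1, and a CASE expression can then derive the status. The table name orders, the column types, and the sample rows are hypothetical; the column names and the 60-day rule come from the question. Since created_at is a date, subtracting two values yields a whole number of days in PostgreSQL.

```sql
-- Hypothetical table matching the fields listed in the question.
CREATE TABLE orders (
    order_id   text PRIMARY KEY,
    user_id    integer NOT NULL,
    created_at date    NOT NULL,
    city       text,
    total      numeric
);

-- Made-up sample rows for illustration only.
INSERT INTO orders (order_id, user_id, created_at) VALUES
    ('200001a', 7, '2014-03-01'),
    ('200003c', 7, '2014-04-15'),
    ('200005b', 7, '2014-09-20'),
    ('200004d', 8, '2014-09-01');

SELECT order_id,
       user_id,
       created_at,
       prev_created_at,
       CASE
           WHEN prev_created_at IS NULL            THEN 'new'          -- first order
           WHEN created_at - prev_created_at <= 60 THEN 'active'
           ELSE 'reactivated'
       END AS status
FROM (
    SELECT order_id,
           user_id,
           created_at,
           -- order_id is only a tie-breaker for orders placed on the same day
           lag(created_at) OVER (PARTITION BY user_id
                                 ORDER BY created_at, order_id) AS prev_created_at
    FROM orders
) o
ORDER BY user_id, created_at;
-- With the sample rows, user 7 gets 'new', 'active', 'reactivated' and user 8 gets 'new'.
```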

postgreSQL classification limit

I have a requirement to query the top 5 news items for each type and return them to the frontend, implemented with JPA.
I have two solutions now:
A) Manually build a UNION SQL statement in an annotation.
B) Call a service in a loop, once per type parameter.
In fact, what I want is SQL like the following:
(select id, title, content
from portal p
where p.type = 'NEWS'
order by create_date
limit 5)
union
(select id, title, content
from portal p
where p.type = 'MAG'
order by create_date
limit 5)
union...
Solution A needs many SQL statements coded in Java, while solution B is inefficient because there are more than 10 types.
Is there any other way to query the data, by annotation or with a PostgreSQL function? I'm new to both JPA and Postgres.
Thanks in advance.
You can do this with a single SQL statement. I'm not sure whether you'll be able to avoid a table scan. You might need to include some more columns, depending most likely on whether you need to sort by them.
select *
from (select
id, title, content,
row_number() over (partition by type order by create_date asc) row_num
from portal
) as numbered_rows
where row_num <= 5;
One advantage of this kind of SQL statement is that it requires no maintenance. It will continue to work correctly no matter how many different types you add.
Think carefully whether you need the first five (order by create_date ASC) or the latest five (order by create_date DESC).

PostgreSQL Aggregate groups with limit performance

I'm a newbie in PostgreSQL. Is there a way to improve execution time of the following query:
SELECT s.id, s.name, s.url,
(SELECT array_agg(p.url)
FROM (
SELECT url
FROM pages
WHERE site_id = s.id ORDER BY created DESC LIMIT 5
) as p
) as last_pages
FROM sites s
I haven't found a way to put a LIMIT clause inside the aggregate call, as can be done with ordering.
There are indexes on created (timestamp) and site_id (integer) in table pages, but unfortunately there is no foreign key from sites.id to pages.site_id. The query is intended to return a list of sites with sublists of the 5 most recently created pages.
PostgreSQL version is 9.1.5.
You need to start by thinking like the database management system. You also need to think very carefully about what you are asking from the database.
Your fundamental problem here is that you likely have a very large number of separate index scans happening, when a single sequential scan may be quite a bit faster. Your current query gives the planner very little flexibility because the subqueries must be correlated.
A much better way to do this would be with a view (inline or not) and a window function:
SELECT s.id, s.name, s.url, array_agg(p.url) AS last_pages
FROM sites s
JOIN (select site_id, url,
row_number() OVER (partition by site_id order by created desc) as num
from pages) p on s.id = p.site_id
WHERE num <= 5
GROUP BY s.id, s.name, s.url;
This will likely change a very large number of index scans to a single large sequential scan.

Is there a way to find TOP X records with grouped data?

I'm working with a Sybase 12.5 server and I have a table defined as such:
CREATE TABLE SomeTable(
[GroupID] [int] NOT NULL,
[DateStamp] [datetime] NOT NULL,
[SomeName] varchar(100),
PRIMARY KEY CLUSTERED (GroupID,DateStamp)
)
I want to be able to list, per [GroupID], only the latest X records by [DateStamp]. The kicker is X > 1, so plain old MAX() won't cut it. I'm assuming there's a wonderfully nasty way to do this with cursors and what-not, but I'm wondering if there is a simpler way without that stuff.
I know I'm missing something blatantly obvious and I'm gonna kick myself for not getting it, but .... I'm not getting it. Please help.
Is there a way to find TOP X records, but with grouped data?
According to the online manual, Sybase 12.5 supports WINDOW functions and ROW_NUMBER(), though their syntax differs from standard SQL slightly.
Try something like this:
SELECT SP.*
FROM (
SELECT *, ROW_NUMBER() OVER (windowA ORDER BY [DateStamp] DESC) AS RowNum
FROM SomeTable
WINDOW windowA AS (PARTITION BY [GroupID])
) AS SP
WHERE SP.RowNum <= 3
ORDER BY RowNum DESC;
I don't have an instance of Sybase, so I haven't tested this. I'm just synthesizing this example from the doc.
I made a mistake. The doc I was looking at was for Sybase SQL Anywhere 11. It seems that Sybase ASE does not support the WINDOW clause at all, even in the most recent version.
Here's another query that could accomplish the same thing. You can use a self-join to match each row of SomeTable to all rows with the same GroupID and a later DateStamp. If there are fewer than three later rows, then we've got one of the top three.
SELECT s1.[GroupID], s1.[DateStamp], s1.[SomeName]
FROM SomeTable s1
LEFT OUTER JOIN SomeTable s2
ON s1.[GroupID] = s2.[GroupID] AND s1.[DateStamp] < s2.[DateStamp]
GROUP BY s1.[GroupID], s1.[DateStamp], s1.[SomeName]
HAVING COUNT(s2.[DateStamp]) < 3
ORDER BY s1.[GroupID], s1.[DateStamp] DESC;
Note that you must list the same columns in the SELECT list as you list in the GROUP BY clause. Basically, all columns from s1 that you want this query to return.
Here's quite an unscalable way!
SELECT GroupID, DateStamp, SomeName
FROM SomeTable ST1
WHERE X >
(SELECT COUNT(*)
FROM SomeTable ST2
WHERE ST1.GroupID=ST2.GroupID AND ST2.DateStamp > ST1.DateStamp)
Edit: Bill's solution is vastly preferable though.