Postgres reporting "window function call requires an OVER clause" in a query that has an OVER clause - postgresql

I've got a table, me, that's got an e_id column, a first_name, and a last_name. (A few other columns too but I'm just trying to figure out the window function stuff.) When I try to do a query and pick out the first name values in the table for a given e_id value:
SELECT e_id, first_value(first_name), first_value(last_name)
OVER (PARTITION BY e_id)
FROM me
I get the "window function call requires an OVER clause" error. Now, like I said, I don't know what I'm doing yet, but I'm pretty sure there's at least an attempt at an OVER clause in that query. OK, so when I try it without the functions:
SELECT e_id, first_name, last_name
OVER (PARTITION BY e_id)
FROM me
I get a syntax error at OVER. I'm running psql version 9.4.4 against a 9.1.13 server. I'm staring at the documentation for 9.1 and it looks to me like OVER is documented there. Am I just missing something basic here?

Each window function must have its own OVER clause. The problem with your first query is that the first_value(first_name) window function does not have an OVER clause.
And the problem with your second query is that you have an OVER clause that is not preceded by a window function.
Try this
SELECT e_id,
first_value(first_name) OVER (PARTITION BY e_id),
first_value(last_name) OVER (PARTITION BY e_id)
FROM me

Related

how to remove DISTINCT from query, in Postgresql

I have the following view:
CREATE OR REPLACE VIEW {accountId}.last_assets AS
SELECT DISTINCT ON (coin) *
from {accountId}.{tableAssetsName}
ORDER BY coin, ts DESC;
The view returns, for each coin, the latest update.
and, in an unrelated question, a comment was:
Note: DISTINCT is always a red flag. (almost always)
I read about this and it seems to have a performance issue with Postgres. In my scenario, I don't have any issues with performance, but I would like to understand:
how could such a query be rewritten to not use DISTINCT then?
There is a substantial difference between DISTINCT and the Postgres proprietary DISTINCT ON () operator.
Does whoever wrote that "DISTINCT is always a red flag" actually know the difference between DISTINCT and DISTINCT ON in Postgres?
The problem that your view solves, is known as greatest-n-per-group and in Postgres distinct on is typically more efficient than the alternatives.
The problem can be solved with e.g. window functions:
CREATE OR REPLACE VIEW {accountId}.last_assets AS
SELECT ....
FROM (
SELECT *,
row_number() over (partition by coin order by ts desc) as rn
from {accountId}.{tableAssetsName}
) t
WHERE rn = 1;
But I wouldn't be surprised if that is actually slower than you current solution.

PostgreSQL use function result in ORDER BY

Is there a way to use the results of a function call in the order by clause?
My current attempt (I've also tried some slight variations).
SELECT it.item_type_id, it.asset_tag, split_part(it.asset_tag, 'ASSET', 2)::INT as tag_num
FROM serials.item_types it
WHERE it.asset_tag LIKE 'ASSET%'
ORDER BY split_part(it.asset_tag, 'ASSET', 2)::INT;
While my general assumption is that this can't be done, I wanted to know if there was a way to accomplish this that I wasn't thinking of.
EDIT: The query above gives the following error [22P02] ERROR: invalid input syntax for integer: "******"
Your query is generally OK, the problem is that for some row the result of split_part(it.asset_tag, 'ASSET', 2) is the string ******. And that string cannot be cast to an integer.
You may want to remove the order by and the cast in the select list and add a where split_part(it.asset_tag, 'ASSET', 2) = '******', for instance, to narrow down that data issue.
Once that is resolved, having such a function in the order by list is perfectly fine. The quoted section of the documentation in the comments on the question is referring to applying an order by clause to the results of UNION, INTERSECTION, etc. queries. In other words, the order by found in this query:
(select column1 as result_column1 from table1
union
select column2 from table 2)
order by result_column1
can only refer to the accumulated result columns, not to expressions on individual rows.

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows. and tried again. This changed the query plan so it was doing a hash match to get distinctness rather than the former sort, and now the second version which forced it to select for distinctness first executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex which was a builtin function and considered deterministic. This gave me the same results with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, its possible that further testing would find situations that would coax the optimizer to do something more sophisticated, but it appears that if you want to apply the distinct before the function is called then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.

Proper GROUP BY syntax

I'm fairly proficient in mySQL and MSSQL, but I'm just getting started with postgres. I'm sure this is a simple issue, so to be brief:
SQL error:
ERROR: column "incidents.open_date" must appear in the GROUP BY clause or be used in an aggregate function
In statement:
SELECT date(open_date), COUNT(*)
FROM incidents
GROUP BY 1
ORDER BY open_date
The type for open_date is timestamp with time zone, and I get the same results if I use GROUP BY date(open_date).
I've tried going over the postgres docs and some examples online, but everything seems to indicate that this should be valid.
The problem is with the unadorned open_date in the ORDER BY clause.
This should do it:
SELECT date(open_date), COUNT(*)
FROM incidents
GROUP BY date(open_date)
ORDER BY date(open_date);
This would also work (though I prefer not to use integers to refer to columns for maintenance reasons):
SELECT date(open_date), COUNT(*)
FROM incidents
GROUP BY 1
ORDER BY 1;
"open_date" is not in your select list, "date(open_date)" is.
Either of these will work:
order by date(open_date)
order by 1
You can also name your columns in the select statement, and then refer to that alias:
select date(open_date) "alias" ... order by alias
Some databases require the keyword, AS, before the alias in your select.

Is there a way to find TOP X records with grouped data?

I'm working with a Sybase 12.5 server and I have a table defined as such:
CREATE TABLE SomeTable(
[GroupID] [int] NOT NULL,
[DateStamp] [datetime] NOT NULL,
[SomeName] varchar(100),
PRIMARY KEY CLUSTERED (GroupID,DateStamp)
)
I want to be able to list, per [GroupID], only the latest X records by [DateStamp]. The kicker is X > 1, so plain old MAX() won't cut it. I'm assuming there's a wonderfully nasty way to do this with cursors and what-not, but I'm wondering if there is a simpler way without that stuff.
I know I'm missing something blatantly obvious and I'm gonna kick myself for not getting it, but .... I'm not getting it. Please help.
Is there a way to find TOP X records, but with grouped data?
According to the online manual, Sybase 12.5 supports WINDOW functions and ROW_NUMBER(), though their syntax differs from standard SQL slightly.
Try something like this:
SELECT SP.*
FROM (
SELECT *, ROW_NUMBER() OVER (windowA ORDER BY [DateStamp] DESC) AS RowNum
FROM SomeTable
WINDOW windowA AS (PARTITION BY [GroupID])
) AS SP
WHERE SP.RowNum <= 3
ORDER BY RowNum DESC;
I don't have an instance of Sybase, so I haven't tested this. I'm just synthesizing this example from the doc.
I made a mistake. The doc I was looking at was Sybase SQL Anywhere 11. It seems that Sybase ASA does not support the WINDOW clause at all, even in the most recent version.
Here's another query that could accomplish the same thing. You can use a self-join to match each row of SomeTable to all rows with the same GroupID and a later DateStamp. If there are three or fewer later rows, then we've got one of the top three.
SELECT s1.[GroupID], s1.[Foo], s1.[Bar], s1.[Baz]
FROM SomeTable s1
LEFT OUTER JOIN SomeTable s2
ON s1.[GroupID] = s2.[GroupID] AND s1.[DateStamp] < s2.[DateStamp]
GROUP BY s1.[GroupID], s1.[Foo], s1.[Bar], s1.[Baz]
HAVING COUNT(*) < 3
ORDER BY s1.[DateStamp] DESC;
Note that you must list the same columns in the SELECT list as you list in the GROUP BY clause. Basically, all columns from s1 that you want this query to return.
Here's quite an unscalable way!
SELECT GroupID, DateStamp, SomeName
FROM SomeTable ST1
WHERE X <
(SELECT COUNT(*)
FROM SomeTable ST2
WHERE ST1.GroupID=ST2.GroupID AND ST2.DateStamp > ST1.DateStamp)
Edit Bill's solution is vastly preferable though.