PostgreSQL statistical mode value

I am using the SQL query
SELECT round(avg(int_value)) AS modal_value FROM t;
to obtain a modal value. That, of course, is not correct, but it is a first placeholder to show some result.
So, my question is: "How do I do the thing right?"
With PostgreSQL 8.3+ we can use this user-defined aggregate to define the mode:
CREATE FUNCTION _final_mode(anyarray) RETURNS anyelement AS $f$
  -- return the most frequent element; ties are broken by the smallest value
  SELECT a FROM unnest($1) a
  GROUP BY 1 ORDER BY count(1) DESC, 1
  LIMIT 1;
$f$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE mode(anyelement) (
  SFUNC     = array_append,  -- collect every input value into an array
  STYPE     = anyarray,
  FINALFUNC = _final_mode,   -- pick the winner from the collected array
  INITCOND  = '{}'
);
but, like any user-defined aggregate built this way, it can be slow with big tables (compare a sum/count implementation with the built-in AVG function). With PostgreSQL 9+, is there really no direct (built-in) function to calculate the statistical mode? Perhaps using pg_stats... How can I do something like
SELECT (most_common_vals(int_value))[1] AS modal_value FROM t;
Can the pg_stats view be used for this kind of task (even as a one-off, by hand)?
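For the record, a hedged sketch of what such a hand-run pg_stats query could look like (my assumption of the syntax, not from any answer here; most_common_vals is an anyarray, so it has to be squeezed through a text cast, and it holds estimates that exist only after an ANALYZE):
SELECT (most_common_vals::text::int[])[1] AS approx_modal_value
FROM pg_stats
WHERE tablename = 't' AND attname = 'int_value';
-- estimate only: depends on the sampling done by the last ANALYZE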

Since PostgreSQL 9.4 there is a built-in aggregate function mode. It is used like
SELECT mode() WITHIN GROUP (ORDER BY some_value) AS modal_value FROM tbl;
Read more about ordered-set aggregate functions here:
36.10.3. Ordered-Set Aggregates
Built-in Ordered-Set Aggregate Functions
See other answers for dealing with older versions of Postgres.
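As an aside (my addition, not part of the original answer), the same ordered-set aggregate combines naturally with GROUP BY when one mode per group is wanted; "category" below is a hypothetical column:
SELECT category,
       mode() WITHIN GROUP (ORDER BY some_value) AS modal_value
FROM tbl
GROUP BY category;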

You can try something like:
SELECT int_value, count(*)
FROM t
GROUP BY int_value
ORDER BY count(*) DESC
LIMIT 1;
The idea behind it: you get the count for every int_value, then order them so that the biggest count comes first, then LIMIT the query to the first row only, to get the int_value with the highest count.
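One caveat: with LIMIT 1 a tie is broken arbitrarily. A hedged variant (my addition, assuming PostgreSQL 8.4+ for window functions) that returns every value sharing the top count:
SELECT int_value
FROM (
    SELECT int_value,
           rank() OVER (ORDER BY count(*) DESC) AS rnk  -- rank the groups by frequency
    FROM t
    GROUP BY int_value
) ranked
WHERE rnk = 1;  -- keep all values tied for first place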

If you want to do it by groups:
select int_value * 10 / (select max(int_value) from t) AS g,
       min(int_value) AS "from",
       max(int_value) AS "to",
       count(*) AS total
from t
group by 1
order by 4 desc
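The integer division above carves the value range into roughly ten buckets. For comparison, a hedged sketch (my addition, assuming PostgreSQL 8.4+ and positive int_value) of the same bucketing with the built-in width_bucket():
SELECT width_bucket(int_value, 1, (SELECT max(int_value) + 1 FROM t), 10) AS bucket,
       min(int_value) AS "from",
       max(int_value) AS "to",
       count(*) AS total
FROM t
GROUP BY 1
ORDER BY 4 DESC;  -- the top row is the modal interval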

In the question introduction I cited this link with a good SQL-coded solution (and @IgorRomanchenko used the same algorithm in his answer here). @ClodoaldoNeto shows a "new solution", but it is for scalars and measures, as I noted in the comments, and is not an answer to the current question.
After 2 months and ~40 views, nothing new has come up...
Conclusions
Conclusions drawn only from the information (and the evidence of absence of further information) on this page and the cited links. Summary:
The user-defined aggregate mode() is enough; we do not need a built-in (compiled) version.
There is no infrastructure for optimization; a built-in would do the same thing as the user-defined aggregate.
I tested the cited SQL aggregate function in contexts like
SELECT mode(some_value) AS modal_value FROM t;
and, in my tests, it was fast... So it does not justify a built-in function (like Oracle's STATS_MODE), except perhaps in a "statistical package" demand context -- but if you are going to spend time and memory installing something, I suggest the R language.
Another implicit question was whether a statistical package could "prepare" or make use of some PostgreSQL infrastructure (like pg_stats)... A good clue for a "canonical answer" is in @IgorRomanchenko's comment: "pg_stat (...) contains only estimates, not the exact value". So the mode function cannot make use of that infrastructure, as I had supposed.
NOTE: we must remember that, for "modal intervals", we can use another function; see @ClodoaldoNeto's answer.

The mode is the value that has occurred most often, so I overrode the function I found here and made this:
CREATE OR REPLACE FUNCTION _final_mode(anyarray)
  RETURNS anyelement AS
$BODY$
  SELECT CASE
           -- return the winner only when it is strictly more frequent
           -- than the runner-up; a tie yields NULL
           WHEN t2.cnt IS NULL OR t1.cnt <> t2.cnt THEN t1.a
           ELSE NULL
         END
  FROM (SELECT a, COUNT(*) AS cnt
        FROM unnest($1) a
        WHERE a IS NOT NULL
        GROUP BY 1
        ORDER BY COUNT(*) DESC, 1
        LIMIT 1
       ) AS t1                 -- the most frequent value
  LEFT JOIN
       (SELECT a, COUNT(*) AS cnt
        FROM unnest($1) a
        WHERE a IS NOT NULL
        GROUP BY 1
        ORDER BY COUNT(*) DESC, 1
        LIMIT 1 OFFSET 1
       ) AS t2 ON true         -- the runner-up, if there is one
$BODY$
LANGUAGE sql IMMUTABLE;
-- Tell Postgres how to use our aggregate
CREATE AGGREGATE mode(anyelement) (
  SFUNC     = array_append,  -- function to call for each row; just builds the array
  STYPE     = anyarray,
  FINALFUNC = _final_mode,   -- function to call after everything has been added to the array
  INITCOND  = '{}'           -- initialize an empty array when starting
);
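A short usage sketch (my addition) to make the changed tie behavior explicit:
-- With this variant a tie between the two most frequent values yields
-- NULL instead of an arbitrary winner:
SELECT mode(int_value) AS modal_value FROM t;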

Related

Tiebreaker criterion of the mode() in postgres

When using the mode() aggregate function, which tiebreaker criterion does it use?
select mode() within group (order by my_field) FROM my_table
I couldn't find any documentation related to that.
What happens if the column has two values with an equal number of occurrences?
select my_field, count(*) FROM my_table group by 1
my_field | count
---------+------
    4096 |   24
    4098 |   24
In the example above I am getting 4096, but I would like to confirm whether it actually picks the lowest value, or whether this happens for another reason.
UPDATE:
I still don't know how to fix this so that it's not an arbitrary choice; for now I'm using another ORDER BY:
select mode() within group (order by my_field) FROM my_table order by my_field
Per the docs, it is arbitrary:
mode () WITHIN GROUP ( ORDER BY anyelement ) → anyelement
Computes the mode, the most frequent value of the aggregated argument
(arbitrarily choosing the first one if there are multiple
equally-frequent values). The aggregated argument must be of a
sortable type.
https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
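If a deterministic result matters, a hedged workaround (my addition, not from the answer) is to skip mode() and spell out the ranking, so "smallest value wins" on a tie:
SELECT my_field
FROM my_table
GROUP BY my_field
ORDER BY count(*) DESC,  -- most frequent first
         my_field ASC    -- deterministic tiebreak: smallest value
LIMIT 1;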

Retrieval of columns from functions that returns table (or setof record)

Every time I run into a variation of this problem I can't remember the workaround, only "oops, it was so simple, but how?"... Perhaps there are some patterns, and a best way to work with each pattern. Let's look at the main one, exemplified by unnest() and ts_stat().
First, good examples, no problems, because unnest() returns only one column:
SELECT * FROM unnest(array[1,2,3]) t(id); -- ok, the int column is there!
SELECT unnest(array[1,2,3]) AS id;        -- ok, the int column
WITH t AS (SELECT unnest(array[1,2,3]) AS id)
SELECT id, unnest(array[4,id]) AS x
FROM t; -- more complex, but ok!
Now a function that returns a defined SETOF RECORD,
SELECT * FROM ts_stat('SELECT kx FROM terms where id=2') -- GOOD:
-- shows all the word|ndoc|nentry columns
SELECT ts_stat('SELECT kx FROM terms where id=2') as x   -- BAD:
-- the columns are lost, only a single "x" column shows... but it works
-- NOTE: you can imagine any other such function, e.g. json_each()
See the GOOD/BAD remarks... So, this is the problem: a SETOF RECORD with more than one column. In the simplest case (unnest above), the solution is to use the function on the "FROM side", as a table; but when the RECORD has multiple fields, the problem arises.
--MAIN EXAMPLE FOR THE DISCUSSION:
WITH t AS (SELECT unnest(array[1,2,3]) as id)
SELECT id, ts_stat('SELECT kx FROM terms where id='||id) as x
FROM t; -- BAD, but works...
Now, in this main example, it seems impossible to use ts_stat() on the "FROM side", which characterizes the pattern: a function that returns a TABLE or a SETOF RECORD, in a query where we need its columns, but where the function apparently can't go on the "FROM side".
QUESTION: What is the generic (and most elegant) solution to this pattern? What syntax pattern exposes the columns?
NOTE: another problem is that, if you don't remember the exact syntax of the solution, you try things that don't work... In this case, an error:
WITH t AS (SELECT unnest(array[1,2,3]) as id)
SELECT id, x.word, x.ndoc, x.nentry
FROM (
SELECT t.nsid,
ts_stat('SELECT kx FROM terms where id='||id) as x
FROM t
) s;
SQL PARSER ERROR (PostgreSQL 9.5): no table "x" in the FROM clause.
You should never use a set-returning function (SRF) in a SELECT list. The main example should be written with an implicit LATERAL JOIN:
SELECT v.id, x.*
FROM (VALUES (1),(2),(3)) v(id)
JOIN ts_stat('SELECT kx FROM terms where id=' || v.id) x ON true;
The lateral join is implicit here because an SRF can refer to columns of relations specified before it in the FROM clause without using the keyword LATERAL. In the example above, the SRF ts_stat() makes a lateral reference to the column id of relation v. You can also do this with e.g. sub-queries, but then you have to use the keyword LATERAL explicitly.
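For illustration, a hedged sketch (my addition) of that explicit form: wrapping the SRF in a sub-query makes the LATERAL keyword mandatory.
SELECT v.id, x.word, x.ndoc, x.nentry
FROM (VALUES (1),(2),(3)) v(id)
JOIN LATERAL (
    -- the sub-query references v.id, hence LATERAL is required
    SELECT * FROM ts_stat('SELECT kx FROM terms WHERE id=' || v.id)
) x ON true;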
Note that while you can use an SRF in a select list, its use is discouraged. You give the example of unnest(anyarray), which is interesting because there is also the overloaded variant unnest(anyarray, ...) (i.e. unnest multiple arrays in one call) which throws an error when used in a select list; it can only be used as a row source. The reason you should not use SRFs in a select list is that there is no obvious behavior when multiple SRFs each produce a different number of rows.
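To illustrate that restriction (my addition, assuming PostgreSQL 9.4+ for multi-argument unnest), the multi-array variant used as a row source; the shorter array is padded with NULLs:
-- Works only in the FROM clause; the same call in a select list errors out.
SELECT * FROM unnest(array[1,2,3], array['a','b']) AS u(n, letter);
--  n | letter
-- ---+--------
--  1 | a
--  2 | b
--  3 |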

nested SELECT statements interact in ways that I don't understand

I thought I understood how I can do a SELECT from the results of another SELECT statement, but there seems to be some sort of blurring of scope that I don't understand. I am using SQL Server 2008R2.
It is easiest to explain with an example.
Create a table with a single nvarchar column - load the table with a single text value and a couple of numbers:
CREATE TABLE #temptable( a nvarchar(30) );
INSERT INTO #temptable( a )
VALUES('apple');
INSERT INTO #temptable( a )
VALUES(1);
INSERT INTO #temptable( a )
VALUES(2);
select * from #temptable;
This will return: apple, 1, 2
Use IsNumeric to get only the rows of the table that can be cast to numeric - this will leave the text value apple behind. This works fine.
select cast(a as int) as NumA
from #temptable
where IsNumeric(a) = 1 ;
This returns: 1, 2
However, if I use that exact same query as an inner select, and try to do a numeric WHERE clause, it fails saying cannot convert nvarchar value 'apple' to data type int. How has it got the value 'apple' back??
select x.NumA
from (
    select cast(a as int) as NumA
    from #temptable
    where IsNumeric(a) = 1
) x
where x.NumA > 1;
Note that the failing query works just fine without the WHERE clause:
select x.NumA
from (
    select cast(a as int) as NumA
    from #temptable
    where IsNumeric(a) = 1
) x;
I find this very surprising. What am I not getting? TIA
If you take a look at the estimated execution plan you'll find that it has optimized the inner query into the outer and combined the WHERE clauses.
Using a CTE to isolate the operations works (in SQL Server 2008 R2):
declare @temptable as table ( a nvarchar(30) );  -- a table variable, not a #temp table
INSERT INTO @temptable( a )
VALUES ('apple'), ('1'), ('2');

with Numbers as (
    select cast(a as int) as NumA
    from @temptable
    where IsNumeric(a) = 1
)
select * from Numbers;
The reason you are getting this is plain and simple. When a query is executed, several steps are followed: parse, algebrize, optimize, compile.
The algebrize step collects all the objects needed for the query. The optimizer uses those objects to create the best query plan, which is then compiled and executed...
So when you look into that part, you will see it does a table scan on #temptable, and #temptable is defined the way you created your table. That you then do some computation on it is a different matter... The column still has the nvarchar datatype.
To know how this works you have to know how to read a query. First all the objects are retrieved (FROM table, INNER JOIN table), then the predicates (WHERE, ON), then the grouping and such, then the SELECT of the columns (with the cast), and then the ORDER BY.
So with that in mind, when you have a combination of selects, the optimizer will still process it that way. Since your SELECT is subordinate to the FROM and JOIN parts of the query, that is why you can get this error.
I hope I made it a little clearer.
The optimizer is free to move expressions around in the query plan in order to produce the most cost-efficient plan for retrieving the data (the evaluation order of the predicates is not guaranteed). I think using a CASE expression like the one below produces a NULL in the absence of an ELSE clause and thus filters 'apple' out:
select a from #temptable where case when isnumeric(a) = 1 then a end > 1
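Building on that, a hedged sketch (my addition) of the same guard applied to the original derived-table shape, so the cast is protected no matter where the optimizer moves the predicate:
select x.NumA
from (
    -- the CASE guard keeps cast() from ever seeing 'apple'
    select case when IsNumeric(a) = 1 then cast(a as int) end as NumA
    from #temptable
) x
where x.NumA > 1;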

Why does this Oracle 10g SQL run slow only when I query a subquery with a where clause?

I can't paste in the entire SQL for various reasons, so consider this example:
select *
from
(select nvl(get_quantity(1), 10) available_qty
from dual)
where available_qty > 30;
get_quantity is a function that makes a calculation based on the ID of a record that's passed through it. If it returns null, I use nvl() to force it to 10.
The query runs very slow when I use the WHERE clause in the parent query. When I comment out the WHERE clause, however, it runs very fast. What I don't get is why it can display the data very fast, but it can't query it just as fast. I am querying the results of a subquery, too. I was under the impression that subqueries return a "rendered" dataset. It's almost as if querying the available_qty identifier is causing it to reference something within the subquery.
This is why I don't think the contents of the get_quantity function are relevant here, so I didn't bother posting it. Instead, I think it's a misunderstanding on my part of how Oracle handles subqueries and whatnot.
Do any of you Oracle gurus have any idea what I am doing wrong?
Afterthought: as I was entering tags for this question, the tag "correlated subquery" came up. In doing some quick research, it seems that this type of subquery somewhat depends on the outer query. Could this be related to my problem?
Let's try an experiment. First we'll run the following query:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd from dual) b;
The "a" subquery will return 5 rows with values from 1 to 5. The "b" subquery will return one row with a random value. If the function is run before the two tables are join (by Cartesian), the same random value will be returned for each row. The actual results:
       LVL        RND
---------- ----------
         1 .417932089
         2 .963531718
         3 .617016889
         4 .128395638
         5 .069405568

5 rows selected.
Clearly the function was run for each of the joined rows, not once for the subquery before the join. This is a result of Oracle's optimizer deciding that the best path for the query is to do things in that order. To prevent this, we have to add something to the second subquery that will make Oracle run it in its entirety before performing the join. We'll add rownum to the subquery, since Oracle knows rownum would change if it were computed after the join. The following query demonstrates this:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
     (select dbms_random.value() rnd, rownum from dual) b;
As you can see from the results, the function was only run once in this case:
       LVL        RND
---------- ----------
         1 .028513902
         2 .028513902
         3 .028513902
         4 .028513902
         5 .028513902

5 rows selected.
In your case, it seems likely that the filter provided by the where clause is making the optimizer take a different path, where it's running the function repeatedly, rather than once. By making Oracle run the subquery as written, you should get more consistent run-times.
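Applied to the query in the question, a hedged sketch (my addition) of the same ROWNUM trick, so that get_quantity() should run once before the filter is evaluated:
select *
from (select nvl(get_quantity(1), 10) available_qty,
             rownum rn  -- should make Oracle run the subquery first, as above
      from dual)
where available_qty > 30;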

Is there a way to find TOP X records with grouped data?

I'm working with a Sybase 12.5 server and I have a table defined as such:
CREATE TABLE SomeTable(
  [GroupID] [int] NOT NULL,
  [DateStamp] [datetime] NOT NULL,
  [SomeName] varchar(100),
  PRIMARY KEY CLUSTERED (GroupID, DateStamp)
)
I want to be able to list, per [GroupID], only the latest X records by [DateStamp]. The kicker is X > 1, so plain old MAX() won't cut it. I'm assuming there's a wonderfully nasty way to do this with cursors and what-not, but I'm wondering if there is a simpler way without that stuff.
I know I'm missing something blatantly obvious and I'm gonna kick myself for not getting it, but .... I'm not getting it. Please help.
Is there a way to find TOP X records, but with grouped data?
According to the online manual, Sybase 12.5 supports WINDOW functions and ROW_NUMBER(), though their syntax differs from standard SQL slightly.
Try something like this:
SELECT SP.*
FROM (
    SELECT *, ROW_NUMBER() OVER (windowA ORDER BY [DateStamp] DESC) AS RowNum
    FROM SomeTable
    WINDOW windowA AS (PARTITION BY [GroupID])
) AS SP
WHERE SP.RowNum <= 3
ORDER BY RowNum DESC;
I don't have an instance of Sybase, so I haven't tested this. I'm just synthesizing this example from the doc.
I made a mistake. The doc I was looking at was for Sybase SQL Anywhere 11. It seems that Sybase ASE does not support the WINDOW clause at all, even in the most recent version.
Here's another query that can accomplish the same thing. You can use a self-join to match each row of SomeTable to all rows with the same GroupID and a later DateStamp. If there are two or fewer later rows, then we've got one of the latest three.
SELECT s1.[GroupID], s1.[DateStamp], s1.[Foo], s1.[Bar], s1.[Baz]
FROM SomeTable s1
LEFT OUTER JOIN SomeTable s2
  ON s1.[GroupID] = s2.[GroupID] AND s1.[DateStamp] < s2.[DateStamp]
GROUP BY s1.[GroupID], s1.[DateStamp], s1.[Foo], s1.[Bar], s1.[Baz]
HAVING COUNT(*) < 3
ORDER BY s1.[DateStamp] DESC;
Note that you must list the same columns in the SELECT list as you list in the GROUP BY clause. Basically, all columns from s1 that you want this query to return.
Here's quite an unscalable way!
SELECT GroupID, DateStamp, SomeName
FROM SomeTable ST1
WHERE X >   -- X = the number of latest rows to keep per group
      (SELECT COUNT(*)
       FROM SomeTable ST2
       WHERE ST1.GroupID = ST2.GroupID AND ST2.DateStamp > ST1.DateStamp)
Edit: Bill's solution is vastly preferable, though.