Tiebreaker criterion of mode() in PostgreSQL

When using the mode() aggregate function, which tiebreaker criterion does it use?
select mode() within group (order by my_field) FROM my_table
I couldn't find any documentation about this.
What happens if two values occur in the column an equal number of times?
select my_field, count(*) FROM my_table group by 1
 my_field | count
----------+-------
     4096 |    24
     4098 |    24
In the example above I am getting 4096, but I would like to confirm whether it actually picks the lowest value, or whether this happens for some other reason.
UPDATE:
I still don't know how to make this a non-arbitrary choice; for now I'm using another ORDER BY:
select mode() within group (order by my_field) FROM my_table order by my_field

Per the docs, it is arbitrary:
mode () WITHIN GROUP ( ORDER BY anyelement ) → anyelement
Computes the mode, the most frequent value of the aggregated argument
(arbitrarily choosing the first one if there are multiple
equally-frequent values). The aggregated argument must be of a
sortable type.
https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
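If you need a deterministic result, a common workaround (a sketch, reusing the question's my_table/my_field names) is to compute the mode by hand with an explicit tiebreaker:
select my_field
from my_table
group by my_field
order by count(*) desc, my_field asc  -- most frequent first; ties broken by the lowest value
limit 1;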

Related

Postgres: counting records in two groups (existing foreign key or null)

I have a table items and a table batches. A batch can have n items associated by items.batch_id.
I'd like to write a query that counts items in two groups, batched and unbatched:
items WHERE batch_id IS NOT NULL (batched)
items WHERE batch_id IS NULL (unbatched)
The result should look like this:
 batched | unbatched
---------+-----------
 1200000 |       100
Any help appreciated, thank you!
EDIT:
I got stuck with using GROUP BY which turned out to be the wrong tool for the job.
You can use COUNT with a FILTER (WHERE ...) clause; this is called a conditional count.
CREATE TABLE items(item_id int, batch_id int);
INSERT INTO items VALUES (1,NULL),(2,NULL),(3,1);
CREATE TABLE batch(batch_id int);

select
    count(*) filter (where batch_id is not null) as matched,
    count(*) filter (where batch_id is null)     as unmatched
from items;

 matched | unmatched
---------+-----------
       1 |         2
(1 row)
The count() function seems to be the most likely basic tool here. Given an expression, it returns a count of the number of rows where that expression evaluates to non-null. Given the argument *, it counts all rows in the group.
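For example, the two forms side by side (a quick sketch against the items table from the question):
select count(*)        as all_rows,  -- counts every row
       count(batch_id) as batched    -- counts only rows where batch_id is not null
from items;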
To the extent that there is a trick, it is getting the batched and unbatched counts in the same result row. There are at least three ways to do that:
Using subqueries:
select
(select count(batch_id) from items) as batched,
(select count(*) from items where batch_id is null) as unbatched
-- no FROM
That's pretty straightforward. Each subquery is executed and produces one column of the result. Because no FROM clause is given in the outer query, there will be exactly one result row.
Using window functions:
select
count(batch_id) over () as batched,
(count(*) over () - count(batch_id) over ()) as unbatched
from items
limit 1
That computes the batched and unbatched results for the whole table on every row, one result row per row of the items table, but only one result row is actually returned. It is reasonable to hope (though you would definitely want to test) that Postgres doesn't actually compute those counts for all the rows culled by the LIMIT clause. You might, for example, compare the performance of this option with that of the previous one.
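For instance, EXPLAIN ANALYZE shows whether the window aggregates are computed before the LIMIT takes effect (a sketch; run it for each variant and compare the timings):
explain (analyze, buffers)
select
    count(batch_id) over () as batched,
    (count(*) over () - count(batch_id) over ()) as unbatched
from items
limit 1;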
Using count() with a filter clause, as described in detail in another answer.

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 German, 1 French, 1 British). Is there a simple way to do that?
Normally a simple GROUP BY would suffice for this type of problem; however, as you have specified that you want ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule, it is important to specify the sort column(s) (ORDER BY) for all windowed or paged queries, to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window; please update that (or the question) with any other field you would like. The PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
    SELECT *
         , ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Name) AS _rn
    FROM Customers
    WHERE Country IN ('Germany', 'France', 'UK')
) AS numbered  -- PostgreSQL requires an alias on a derived table
WHERE _rn = 1;
The PARTITION BY restarts the ROW_NUMBER count for each distinct Country value, starting at 1, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause on Country could have gone in the outer query if you really wanted, but ROW_NUMBER() can only appear in the SELECT list or the ORDER BY clause of a query, so to filter on it we are forced to wrap the results in some way.

Select until row matches in postgresql?

Is there a way to select rows until some condition is met? I.e. a type of limit, but not limited to N rows, but to all the rows until the first non-matching row?
For example, say I have the table:
CREATE TABLE t (id SERIAL PRIMARY KEY, rank INTEGER, value INTEGER);
INSERT INTO t (rank, value) VALUES (1,1), (2,1), (2,2), (3,1);
that is:
test=# SELECT * FROM t;
 id | rank | value
----+------+-------
  1 |    1 |     1
  2 |    2 |     1
  3 |    2 |     2
  4 |    3 |     1
(4 rows)
I want to order by rank, and select up until the first row that is over 1.
I.e. SELECT * FROM t ORDER BY rank UNTIL value>1
and I want the first 2 rows back?
One solution is to use a subquery and bool_and:
SELECT * FROM
( SELECT id, rank, value, bool_and(value<2) OVER (order by rank, id) AS ok FROM t ORDER BY rank) t2
WHERE ok=true
But won't that end up going through all rows, even if I only want a handful?
(real-world context: I have timestamped events in a table; I can use a lead/lag window query to get the time between two events, and I want all events from now going back as long as they happened less than 10 minutes apart. The lead/lag window query complicates things, hence the simplified example here)
edit: made window-function order by rank, id
What you want is a sort of stop condition. As far as I am aware there is no such thing in SQL, at least in PostgreSQL's dialect.
What you can do is use a PL/PgSQL procedure to read rows from a cursor and return them until the stop condition is met. It won't be super fast, but it'll be alright. It's just a FOR loop over a query with an IF expression THEN exit; ELSE return next; END IF;. No explicit cursor is required because PL/PgSQL will use one internally if you FOR loop over a query.
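A minimal sketch of that approach, assuming the table t and the stop condition value > 1 from the question:
CREATE OR REPLACE FUNCTION rows_until_stop() RETURNS SETOF t AS $$
DECLARE
    rec t;
BEGIN
    FOR rec IN SELECT * FROM t ORDER BY rank, id LOOP
        IF rec.value > 1 THEN
            EXIT;            -- stop condition met: stop reading
        END IF;
        RETURN NEXT rec;     -- emit the row and keep going
    END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM rows_until_stop();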
Another option is to create a cursor and read chunks of rows from it in the application, then discard part of the last chunk once the stop condition is met.
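For example (the chunk size of 100 is arbitrary; the stop-condition check happens in the application between fetches):
BEGIN;
DECLARE ev CURSOR FOR SELECT * FROM t ORDER BY rank, id;
FETCH FORWARD 100 FROM ev;  -- inspect the chunk client-side
FETCH FORWARD 100 FROM ev;  -- repeat until the stop condition appears, then discard the tail
CLOSE ev;
COMMIT;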
Either way, a cursor is going to be what you want.
A stop expression wouldn't actually be too hard to implement in PostgreSQL by the way. You'd have to implement a new executor node type, but the new CustomScan support would make that practical to do in an extension. Then you'd just evaluate an expression to decide whether or not to carry on fetching rows.
You can try something such as:
select *
from t,
     (select rank from t where value = 1 order by rank limit 1) x
where t.rank <= x.rank
order by rank;
It will make two passes through the first part of the table (which you might be able to cut by creating an index on (rank, value = 1)) but shouldn't evaluate the rest of the table if you have an index on rank.
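A sketch of those indexes (names are hypothetical; verify with EXPLAIN that they are actually used):
CREATE INDEX t_rank_idx ON t (rank);                         -- lets the outer scan stop early
CREATE INDEX t_rank_value1_idx ON t (rank) WHERE value = 1;  -- partial index for the subquery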
[If you could have window expressions in WHERE clauses, you could use one to make sure no previous row had value = 1... but even if that were possible, getting the query evaluator to use it to limit the search would be yet another challenge.]
This may be no better than your solution, since you yourself asked, "won't that end up going through all rows?"
I can tell you this: the explain plan is different from that of your solution. I don't know how the guts of PostgreSQL work, but if I were writing a "max" function, I would expect it to always be O(n). By contrast, your solution has an ORDER BY, which is O(n log n) on average and O(n^2) in the worst case.
That said, I cannot deny that this will go through all rows:
select * from sandbox.t
where id < (select min(id) from sandbox.t where value > 1);
One thing to clarify, though, is that unless you scan all rows, I'm not sure how you could determine the minimum value. Any time you invoke an aggregate concept across all records, doesn't that mean that you must read all rows?

PostgreSQL statistical mode value

I am using the SQL query
SELECT round(avg(int_value)) AS modal_value FROM t;
to obtain the modal value. That, of course, is not correct, but it is a first option to show some result.
So, my question is: "How to do this the right way?"
With PostgreSQL 8.3+ we can use this user-defined aggregate to compute the mode:
CREATE FUNCTION _final_mode(anyarray) RETURNS anyelement AS $f$
    SELECT a FROM unnest($1) a
    GROUP BY 1 ORDER BY COUNT(1) DESC, 1
    LIMIT 1;
$f$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE mode(anyelement) (
    SFUNC=array_append, STYPE=anyarray,
    FINALFUNC=_final_mode, INITCOND='{}'
);
but, like a user-defined average, it can be slow with big tables (compare a sum/count implementation with the built-in AVG function). With PostgreSQL 9+, is there no direct (built-in) function for calculating the statistical mode? Perhaps something using pg_stats... How to do something like
SELECT (most_common_vals(int_value))[1] AS modal_value FROM t;
Can the pg_stats view be used for this kind of task (even just once, by hand)?
Since PostgreSQL 9.4 there is a built-in aggregate function mode. It is used like
SELECT mode() WITHIN GROUP (ORDER BY some_value) AS modal_value FROM tbl;
Read more about ordered-set aggregate functions here:
36.10.3. Ordered-Set Aggregates
Built-in Ordered-Set Aggregate Functions
See other answers for dealing with older versions of Postgres.
You can try something like:
SELECT int_value, count(*)
FROM t
GROUP BY int_value
ORDER BY count(*) DESC
LIMIT 1;
The idea behind it: you get the count for every int_value, then order them (so that the biggest count comes first), then LIMIT the query to the first row only, to get the int_value with the highest count.
If you want to do it by groups:
select
int_value * 10 / (select max(int_value) from t) g,
min(int_value) "from",
max(int_value) "to",
count(*) total
from t
group by 1
order by 4 desc
In the question's introduction I cited this link with a good SQL-coded solution (and @IgorRomanchenko used the same algorithm in this answer). @ClodoaldoNeto shows a "new solution", but it is for scalars and measures, as I commented, not an answer to the current question.
Two months and ~40 views later, no new issues...
Conclusions
Conclusions using only the information (and the evidence of absence of further info) on this page and the cited links. Summary:
The user-defined aggregate mode() is enough; we do not need a built-in (compiled) version.
There is no infrastructure for optimization; a built-in version would do the same thing as the user-defined one.
I tested the cited SQL aggregate function in contexts like
SELECT mode(some_value) AS modal_value FROM t;
And, in my tests, it was fast... So it does not justify a built-in function (like Oracle's STATS_MODE), except where there is demand for a "statistical package" -- but if you are going to spend time and memory installing something, I suggest the R language.
Another implicit question was whether a statistical package could "prepare" or make use of some PostgreSQL infrastructure (like pg_stats)... A good clue for a "canonical answer" is in @IgorRomanchenko's comment: "pg_stat (...) contains only estimates, not the exact value". So the mode function cannot make use of that infrastructure, as I had supposed.
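For reference, this is how one can peek at those estimates (a sketch; most_common_vals holds the values ANALYZE sampled as most frequent, not exact counts):
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 't'
  AND attname = 'int_value';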
NOTE: remember that for "modal intervals" we can use another function; see @ClodoaldoNeto's answer.
The mode is the value that has occurred the most, so I rewrote the function I found here and made this:
CREATE OR REPLACE FUNCTION _final_mode(anyarray)
RETURNS anyelement AS
$BODY$
SELECT
    CASE
        -- no runner-up, or a strictly higher count: a unique mode exists
        WHEN t2.cnt IS NULL OR t1.cnt <> t2.cnt THEN t1.a
        ELSE NULL  -- the two most frequent values tie: no unique mode
    END
FROM
    (SELECT a, COUNT(*) AS cnt
     FROM unnest($1) a
     WHERE a IS NOT NULL
     GROUP BY 1
     ORDER BY COUNT(*) DESC, 1
     LIMIT 1
    ) AS t1
    LEFT JOIN
    (SELECT a, COUNT(*) AS cnt
     FROM unnest($1) a
     WHERE a IS NOT NULL
     GROUP BY 1
     ORDER BY COUNT(*) DESC, 1
     LIMIT 1 OFFSET 1
    ) AS t2 ON true
$BODY$
LANGUAGE sql IMMUTABLE;
-- Tell Postgres how to use our aggregate
CREATE AGGREGATE mode(anyelement) (
    SFUNC=array_append,    -- function called for each row; just builds the array
    STYPE=anyarray,
    FINALFUNC=_final_mode, -- function called after every row has been added to the array
    INITCOND='{}'          -- start from an empty array
);
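Usage is the same as with the earlier aggregate, except that a tie between the two most frequent values now yields NULL instead of an arbitrary pick (a sketch against the table t above):
SELECT mode(int_value) AS modal_value FROM t;  -- NULL when the top two counts tie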

ROWID equivalent in postgres 9.2

Is there any way to get the rowid of a record in Postgres?
In Oracle I can use something like:
SELECT MAX(BILLS.ROWID) FROM BILLS
Yes, there is a ctid column, which is the equivalent of rowid. But it is useless for you: rowid and ctid are physical row/tuple identifiers, so they can change after a rebuild or VACUUM.
See: Chapter 5. Data Definition > 5.4. System Columns
The PostgreSQL row_number() window function can be used for most purposes where you would use rowid. Whereas in Oracle the rowid is an intrinsic identifier of the stored row, in Postgres row_number() computes a numbering within a logical ordering of the returned data. Normally, if you want to number the rows, you expect them in a particular order, so you would specify which column(s) to order by when numbering them:
select client_name, row_number() over (order by date) from bills;
If you just want the rows numbered arbitrarily you can leave the over clause empty:
select client_name, row_number() over () from bills;
If you want to calculate an aggregate over the row number you'll have to use a subquery:
select max(rownum) from (
select row_number() over () as rownum from bills
) r;
If all you need is the last item from a table, and you have a column to sort sequentially, there's a simpler approach than using row_number(). Just reverse the sort order and select the first item:
select * from bills
order by date desc limit 1;
Use a Sequence. You can choose 4- or 8-byte values.
http://www.neilconway.org/docs/sequences/
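A minimal sketch (serial is backed by a 4-byte int, bigserial by an 8-byte one; the table reuses the bills example above):
CREATE TABLE bills (
    rowid       bigserial PRIMARY KEY,  -- use serial for 4-byte values
    client_name text,
    date        date
);

SELECT max(rowid) FROM bills;  -- the analogue of SELECT MAX(BILLS.ROWID) FROM BILLS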
Add a unique column to your table (named rowid, perhaps),
and prevent changes to it by creating a BEFORE UPDATE trigger that raises an exception if someone tries to update it.
You can populate the column from a sequence, as @JohnMudd mentioned.
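A sketch of that trigger (names are hypothetical; EXECUTE PROCEDURE keeps it compatible with 9.2):
CREATE OR REPLACE FUNCTION forbid_rowid_update() RETURNS trigger AS $$
BEGIN
    IF NEW.rowid IS DISTINCT FROM OLD.rowid THEN
        RAISE EXCEPTION 'rowid is immutable';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bills_rowid_guard
BEFORE UPDATE ON bills
FOR EACH ROW EXECUTE PROCEDURE forbid_rowid_update();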