I have a table of elements with a number of items that need de-duping based on a priority. The following is a grossly simplified but representative example:
sophia=> select * from numbers order by value, priority;
 value | priority | label
-------+----------+-------
     1 |        1 | One
     1 |        2 | Eins
     2 |        1 | Two
     2 |        2 | Zwei
     3 |        2 | Drei
     4 |        1 | Four
     4 |        2 | Vier
(7 rows)
I want to restrict this to returning only a single row per number. Easy enough: I can use the first() aggregate function detailed in https://wiki.postgresql.org/wiki/First/last_(aggregate)
sophia=> select value, first(label) from numbers group by value order by value;
 value | first
-------+-------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
The problem with this is that the order isn't well defined, so if the DB rows had been inserted in a different order, I might get this:
sophia=> select value, first(label) from numbers group by value order by value;
 value | first
-------+-------
     1 | Eins
     2 | Zwei
     3 | Drei
     4 | Vier
(4 rows)
Of course, the solution to that also seems simple, in that I could just do an order by:
sophia=> select value, first(label) from (select * from numbers order by priority) foo group by value order by value;
 value | first
-------+-------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
However, the problem here is that the query optimizer is free to discard the ORDER BY in subqueries, meaning that this doesn't always work and breaks in random, nasty places.
I have a solution that I'm currently using in a handful of places that relies on array_agg.
sophia=> select value, (array_agg(label order by priority))[1] as best_label from numbers group by value;
 value | best_label
-------+------------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
This provides robust ordering, but it involves creating a bunch of extra arrays at query time that just get thrown away, and hence the performance on larger datasets rather sucks.
So the question is, is there a better, cleaner, faster way of dealing with this?
Your last attempt includes the answer to your question; you just didn't realise it:
array_agg(label order by priority)
Note the order by clause inside the aggregate function. This isn't special to array_agg, but is a general part of the syntax for using aggregate functions:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. In many cases this does not matter; for example, min produces the same result no matter what order it receives the inputs in. However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering. The order_by_clause has the same syntax as for a query-level ORDER BY clause, as described in Section 7.5, except that its expressions are always just expressions and cannot be output-column names or numbers.
Thus the solution to your problem is simply to put an order by inside your first aggregate expression:
select value, first(label order by priority) from numbers group by value order by value;
Given how elegant this is, I'm surprised that first and last are still not implemented as built-in aggregates.
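For reference, the first aggregate from the wiki page linked in the question is defined roughly as follows (a minimal sketch; check the wiki for the exact, current version):

-- Transition function: always keep the existing state ($1).
-- Because it is STRICT, the first non-null input becomes that state,
-- so the aggregate returns the first non-null value it sees.
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS $$
    SELECT $1;
$$;

CREATE AGGREGATE first (anyelement) (
    sfunc = first_agg,
    stype = anyelement
);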
The Postgres SELECT statement has a clause called DISTINCT ON, which is extremely useful when you would like to return just one row per group. In this case, you would use:
SELECT DISTINCT ON (value) value, label
FROM numbers
ORDER BY value, priority;
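Against the sample table above, DISTINCT ON (value) keeps only the first row of each value group under the given ORDER BY, so the query should return:

 value | label
-------+-------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)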
Using DISTINCT ON is generally faster than other methods involving groups or window functions.
Suppose I have a table A with the columns bucket_start_date, bucket_end_date.
A
bucket_start_date | bucket_end_date
2015.05.02 | 2015.05.08
2015.05.08 | 2015.05.12
Also suppose I have a table B with the columns date, coins.
B
date | coins
2015.05.02 | 5
2015.05.06 | 11
2015.05.09 | 32
How do I do a join in kdb that logically looks like
select A.bucket_start_date, A.bucket_end_date, sum(coins) from A join B where B.date BETWEEN A.bucket_start_date and A.bucket_end_date group by A.bucket_start_date, A.bucket_end_date
So I want the result to look like
bucket_start_date | bucket_end_date | sum(coins)
2015.05.02 | 2015.05.08 | 16
2015.05.08 | 2015.05.12 | 32
A window join is a natural way of achieving this result. Below is a wj1 call that will get what you are after:
q)wj1[A`bucket_start_date`bucket_end_date;`date;A;(B;(sum;`coins))]
bucket_start_date bucket_end_date coins
---------------------------------------
2015.05.02        2015.05.08      16
2015.05.08        2015.05.12      32
The first argument is a pair of lists of dates, the first list being the beginning dates and the second the end dates.
The second argument names the common column; in this case you want to use the date column, since you are looking at which window each date fits into.
The third and fourth arguments contain the tables to join, and finally (sum;`coins) specifies the function to be applied to the given column. Again, in this case you are summing the coins column within each window.
A wj considers prevailing values on entry to each interval, whilst wj1 considers only values occurring in each interval. You can change wj1 to wj in the call to see the difference.
Firstly, it is good convention not to use _ in names, as _ is also used as the drop operator in q.
q)data:([]bucketSt:2015.05.02 2015.05.08;bucketEnd:2015.05.08 2015.05.12)
q)daterange:([]date:2015.05.02 2015.05.06 2015.05.09; coins: 5 11 32)
But the question can also be solved without a window join, using a fairly straightforward select statement.
update coins:({exec sum coins from daterange where date within x} each get each data) from data
Starting from the inside of the () brackets:
q)get each data
2015.05.02 2015.05.08
2015.05.08 2015.05.12
returns the start and end dates for each row.
A simple exec statement with aggregation then gets the necessary result from the daterange table for each of these pairs. Finally, an update statement writes the new values onto the original table, returning the table as follows:
bucketSt   bucketEnd  coins
---------------------------
2015.05.02 2015.05.08 16
2015.05.08 2015.05.12 32
A window join is also possible and would be more efficient, but this version should be easier to follow. Hope it helps!
I have multiple datasets for different years as a shapefile and converted them to postgres tables, which confronts me with the following situation:
I got tables boris2018, boris2017, boris2016 and so on.
They all share an identical schema; for now let's focus on the following columns (the example is one row out of the boris2018 table). The rows represent actual PostGIS geometries with certain properties.
 brw | brwznr |  gema   | entw | nuta
-----+--------+---------+------+------
 290 | 285034 | Sieglar | B    | W
The 'brwznr' column is an ID of some kind, but it does not seem to be entirely consistent across all years for each geometry.
Then again, most of the tables contain duplicate information. The geometry should be identical in every year, although this is not guaranteed either.
What I first did was to match the brwznr of each year with the 2018 data, adding brw17, brw16, ... columns to my boris2018 data, like so:
 brw18 | brw17 | brw16 | brwznr |  gema   | entw | nuta
-------+-------+-------+--------+---------+------+------
   290 |   260 |   250 | 285034 | Sieglar | B    | W
This led to some data getting lost (because there was no matching brwznr found), some data wrongly matched (because some matching was wrong due to inconsistencies in the data) and it didn't feel right.
What I actually want to achieve is having fast queries that get me the different brw values for a certain coordinate, something around the lines of
SELECT ortst, brw, gema, gena
FROM boris2018, boris2017, boris2016
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
or
SELECT ortst, brw18, brw17, brw16, gema, gena
FROM boris
WHERE ST_Intersects(geom,ST_SetSRID(ST_Point(7.130577, 50.80292),4326));
although this is obviously wrong / has its deficits.
Since I am new to databases in general, I can't really tell whether this is a querying problem or a database structure problem.
I hope anyone can help, your time and effort is highly appreciated!
Tim
Have you tried using a CTE?
WITH j AS (
-- geom must be selected in each branch so it is available for the spatial filter below
SELECT ortst, brw, gema, gena, geom FROM boris2016
UNION
SELECT ortst, brw, gema, gena, geom FROM boris2017
UNION
SELECT ortst, brw, gema, gena, geom FROM boris2018)
SELECT * FROM j
WHERE ST_Intersects(j.geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));
Depending on your needs, you might want to use UNION ALL. Note that this approach might not be the fastest one when dealing with very large tables. If that is the case, consider merging the results of these three queries into another table and creating an index on the geom field. Let me know in the comments if that is the case.
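For illustration, the merged table could look roughly like this (boris_all and the year column are made-up names; adjust the column list to your actual schema):

CREATE TABLE boris_all AS
  SELECT 2016 AS year, ortst, brw, gema, gena, geom FROM boris2016
  UNION ALL
  SELECT 2017, ortst, brw, gema, gena, geom FROM boris2017
  UNION ALL
  SELECT 2018, ortst, brw, gema, gena, geom FROM boris2018;

CREATE INDEX boris_all_geom_idx ON boris_all USING GIST (geom);

A point lookup then becomes a single query that can use the spatial index:

SELECT year, ortst, brw, gema, gena
FROM boris_all
WHERE ST_Intersects(geom, ST_SetSRID(ST_Point(7.130577, 50.80292), 4326));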
I'm new to this page and this is the first time I post a question. Sorry for anything wrong. The question may be old, but I just can't find any answer for SQL Anywhere.
I have a table like
Order | Mark
======|========
1 | AA
2 | BB
1 | CC
2 | DD
1 | EE
I want to have result as following
Order | Mark
1 | AA,CC,EE
2 | BB,DD
My current SQL is
Select Order, Cast(Mark as NVARCHAR(20))
From #Order
Group by Order
and it just gives me a result that is completely the same as the original table.
Any idea for this?
You can use the ASA LIST() aggregate function (untested; you might need to enclose the Order column name in quotes, as it is also a reserved word):
SELECT Order, LIST( Mark )
FROM #Order
GROUP BY Order;
You can customize the separator character and order if you need.
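For example, an explicit delimiter and ordering would look something like this (again untested, and the same quoting caveat for Order applies):

SELECT Order, LIST( Mark, ',' ORDER BY Mark )
FROM #Order
GROUP BY Order;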
Note: it is rather a bad idea to
name your table and columns like regular SQL keywords (ORDER BY)
use the same name for a column and a table (Order)
I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
| 1          | A         | $10           |
| 1          | B         | $20           |
| 2          | A         | $10           |
| 2          | B         | $20           |
| 2          | C         | $30           |
| 3          | C         | $30           |
| 3          | D         | $40           |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My data source is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server table that tells whether the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer to do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only invoices that contain a row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it will calculate the value of the expression inside the parentheses, and then only include invoice_ids with at least one non-null value for the internal expression. (The implicit else for the if statement "returns" null.)
The condition tab for a dimension field equates to a HAVING clause in SQL.
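In rough SQL terms, the condition above corresponds to something like this (illustrative only; charges stands in for whatever your invoice/charge table is called):

SELECT Invoice_ID
FROM charges
GROUP BY Invoice_ID
HAVING COUNT(CASE WHEN Charge_ID = 'A' THEN 'Y' END) > 0;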
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields, just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but then can reuse them in other ways: like testing set membership in a calculated field (like a SQL IN clause), or by creating new sets using intersection and union operators. You can think of sets like named filters, such as the set of invoices that contain type A charge.
Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL? Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent of the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your projects table that uses a call to multinextval('projects_table', NEW.Project_ID) for insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...
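For instance, a simplified sketch that skips the separate-transaction trick might look like the following. It relies on ON CONFLICT (PostgreSQL 9.5+); counter rows stay locked until the inserting transaction commits, so concurrent inserts for the same project serialize, but the numbering stays race-free. The issues table and its project_id/issue_no columns are placeholder names:

CREATE TABLE multiseq (
    partition_id integer PRIMARY KEY,
    next_val     bigint  NOT NULL
);

CREATE FUNCTION multinextval(p_partition integer) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    v bigint;
BEGIN
    -- Insert the first value for this partition, or bump the existing counter
    INSERT INTO multiseq AS m (partition_id, next_val)
    VALUES (p_partition, 1)
    ON CONFLICT (partition_id)
        DO UPDATE SET next_val = m.next_val + 1
    RETURNING m.next_val INTO v;
    RETURN v;
END;
$$;

CREATE FUNCTION issues_assign_number() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.issue_no := multinextval(NEW.project_id);
    RETURN NEW;
END;
$$;

CREATE TRIGGER issues_assign_number
    BEFORE INSERT ON issues
    FOR EACH ROW EXECUTE PROCEDURE issues_assign_number();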