array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?

According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.

It seems there is a similar (or almost the same) question here:
PostgreSQL array_agg order
I prefer ebk's answer:
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
But you can still add an ORDER BY inside the array_agg function to force the same order.
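If you do decide to pin the order down explicitly, the safest form is to sort both aggregates by the same unambiguous key so the pairing can never drift. A minimal sketch, using ctid purely for illustration (a real table would normally order by its own unique id column):
select array_agg(a order by ctid) as a_agg,
       array_agg(b order by ctid) as b_agg
from foo;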

Related

get postgres to use an index when querying timestamps in a function

I have a system with a large number of tables that contain historical data. Each table has a ts_from and ts_to column which are of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= #at_timestamp AND ts_to >= #at_timestamp WHERE clause to utilise the index.
However, I wanted to create a function called Temporal.at which would take the #at_timestamp value and the ts_from / ts_to columns and hide the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to create a function called Temporal.between which would take a #from_timestamp and #to_timestamp and select all rows that were valid between those two timestamps. That function would not be trivial, as one would have to check where rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?
This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment at a point in time, use #at_timestamp <@ from_to
for interval overlap, use tstzrange(#from_timestamp, #to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.
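A minimal sketch of that layout, with illustrative names (history, payload, from_to) rather than anything from the real schema:
create table history (
    id       bigint primary key,
    payload  text,
    from_to  tstzrange not null
);

create index history_from_to_gist on history using gist (from_to);

-- rows valid at a single point in time
select * from history
where '2024-06-01 00:00:00+00'::timestamptz <@ from_to;

-- rows whose validity period overlaps a given interval
select * from history
where from_to && tstzrange('2024-06-01 00:00:00+00', '2024-07-01 00:00:00+00');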

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
 user | timestamp           | event_type
------+---------------------+------------
    1 | 2021-01-01 12:00:00 | foo
    1 | 2021-01-01 15:00:00 | bar
    2 | 2021-01-01 16:00:00 | foo
    2 | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would be to have an aggregate column for each event type:
 user | nb_events | ... | nb_foo | nb_bar
------+-----------+-----+--------+--------
    1 |         2 | ... |      1 |      1
    2 |         2 | ... |      2 |      0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
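For reference, the vertical form described above could be produced with something like the following sketch; the events source table and the user_id / event_type / cnt names are assumptions standing in for the real schema:
select user_id,
       event_type,
       count(*) as cnt
from events
group by user_id, event_type;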
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
 user | nb_events | ... | count_by_event_type (SUPER)
------+-----------+-----+-----------------------------
    1 |         2 | ... | {"foo": 1, "bar": 1}
    2 |         2 | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? The Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
 nb_events | ... | count_by_event_type (SUPER)
-----------+-----+-----------------------------
         4 | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
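For what it's worth, the SUPER construction that the listagg string is standing in for might look roughly like the sketch below: build a JSON string per group and parse it with json_parse. The events_by_type table (a pre-aggregated (user_id, event_type, cnt) form) is an assumption, and event_type values containing quotes would need escaping:
select user_id,
       json_parse(
           -- assemble one JSON object per user from its key/value pairs
           '{' || listagg('"' || event_type || '":' || cnt::text, ',') || '}'
       ) as count_by_event_type
from events_by_type
group by user_id;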
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Search engine like full text search in PostgreSQL

I have a list of titles and descriptions in a table which are indexed in a tsvector column. How can I implement Google-style full text search functionality in Postgres for these fields? I tried various functions offered by standard Postgres, such as:
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as a row has at least one of these terms, so it doesn't put the most relevant results, which should contain both terms, at the top.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query. But I want results including both apple and orange first but can still have results including even one of these terms just later in the results.
phraseto_tsquery('apple orange') -- apple <-> orange
This function only matches apple followed by orange, but not vice versa. But for me, orange followed by apple (orange <-> apple) is still relevant.
I also tried websearch_to_tsquery() but it behaves very similarly to the above functions.
How can I ask Postgres to list the most relevant rows first, those containing most of the terms in the search query regardless of term order, followed by rows containing fewer of the terms?
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as a row has at least one of these terms, so it doesn't put the most relevant results, which should contain both terms, at the top.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list the most relevant rows first, those containing most of the terms in the search query regardless of term order, followed by rows containing fewer of the terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then fall back to | if you don't get enough rows.
The built-in ranking functions don't care about order, but also don't care about proximity, so they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced some corresponding ranking functions. But since they invented only ordered proximity, not proximity without order, it is unlikely they would do what you want even if they did exist.
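A minimal sketch of the "match with |, then rank" approach; the docs table and its tsv tsvector column are assumed names:
select title,
       ts_rank(tsv, query) as rank
from docs,
     to_tsquery('english', 'apple | orange') as query
where tsv @@ query          -- match anything containing either term
order by rank desc          -- rows containing both terms rank higher
limit 20;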

How to force postgresql to use particular index when all fields match

Consider I have a table T with fields a, b, c, d and two indices: the first on fields (a, b, c) and the second on (a, b, d). The types of a, b, c and d are all integer. Both indices are almost the same (in production they are both about 2 GB in size, have the same creation time and the same usage statistics; the table overall has about 60 million rows).
I make two queries:
select * from T where a=... and b=... and c=...;
select * from T where a=... and b=... and d=...;
I expect the first query to use the index on (a, b, c) and the second to use the index on (a, b, d). However, that's not the case: both queries use the first index, the second one with a "Filter" step (I used EXPLAIN ANALYZE to establish this). For me such behaviour is unacceptable, because in some circumstances the number of rows hit by the filter grows very fast, and autovacuum/analyze (which is what actually nudges the planner towards the right index) is too slow to prevent unexpected latencies and downtime.
So my question is: how can I force PostgreSQL not to use the wrong index with a filter, but rather use the index whose columns exactly match the fields in the query's WHERE clause?
Finally I found a solution. It's not perfect and I won't mark it as the best one, but it works and could help someone.
What I actually did was change the indices: instead of (a, b, c) and (a, b, d), I now have (c, a, b) and (d, a, b).
One problem appeared: I needed an index on 'a', because some queries rely on it. However, when I add an index solely on 'a', the problem from the original post appears again (the index on 'a' is used whenever the planner thinks its cost is lower than that of (c, a, b) or (d, a, b)). So I decided to add a new field to the table that is a copy of 'a' (it holds the same data), let's call it 'a1', and I added the index on that field. Now whenever I need to filter on 'a', I filter on 'a1' instead. It's weird, but I couldn't find another solution.
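A rough sketch of the resulting DDL; the index names are assumptions, while the table and columns come from the question:
-- (after dropping the old (a, b, c) / (a, b, d) indices)
create index t_c_a_b_idx on T (c, a, b);
create index t_d_a_b_idx on T (d, a, b);

-- duplicate column so that queries filtering on "a" alone have their own
-- index without tempting the planner away from the two indices above
alter table T add column a1 integer;
update T set a1 = a;
create index t_a1_idx on T (a1);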

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging by number of results.
And now I'm faced with a challenge: whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider adding another (indexed) column which contains the lowercased first letter of name, and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data, but that's okay for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
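A hedged T-SQL sketch of that suggestion; the dbo.people table, its id primary key, and the trigger/index names are assumptions (a persisted computed column would be a simpler alternative to the trigger):
alter table dbo.people add name_first_char_lower nchar(1) null;

create index IX_people_name_first_char_lower
    on dbo.people (name_first_char_lower);
go

create trigger trg_people_name_first_char_lower
on dbo.people
after insert, update
as
begin
    set nocount on;
    -- keep the derived column in sync with name on every write
    update p
    set name_first_char_lower = lower(left(i.name, 1))
    from dbo.people p
    join inserted i on i.id = p.id;
end;
go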
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the LEFT(..., 5) version. As an aside, my overall query does hit some indexes.
I would always suggest using the LIKE operator when the search column has an index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3) = 'AAA' OR left(column_name,3) = 'ABA' OR ... up to 9 OR clauses. The count returned 7,301,477 records in 4 seconds with LEFT versus 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. See http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in the (non-Core) Entity Framework EntityFunctions, so I'm not sure how to do it for EF6.