Most efficient way to join to a two-part key, with a fallback to matching only the first part? - tsql

In purely technical terms
Given a table with a two-column unique key, and input values for those two columns, what is the most efficient way to return the first matching row based on a two-step match:
If an exact match exists on both key parts, return that
Otherwise, return the first (if any) matching row based on the first part alone
This operation will be done in many different places, on many rows. The "payload" of the match will be a single string column (nvarchar(400)). I want to optimize for fast reads; paying for this with slower inserts and updates and more storage is acceptable. So having multiple indexes with the payload included is an option, as long as there is a good way to execute the two-step match described above. There will absolutely be a unique index on (key1, key2) with the payload included, so essentially all reads will go off this index alone, unless there is some clever approach that would use additional indexes.
A method that returns the entire matching row is preferred, but if a scalar function that only returns the payload is an order of magnitude faster, then that is worth considering.
I've tried three different methods, two of which I have posted as answers below. The third method was about 20x more expensive in the explain plan cost, and I've included it at the end of this post as an example of what not to do.
I'm curious to see if there are better ways, though, and will happily accept someone else's suggestion as the answer if it is better. In my dev database the query planner estimates similar costs for my two approaches, but my dev database doesn't have anywhere near the volume of multilingual text that will be in production, so it's hard to know whether this accurately reflects the comparative read performance on a large data set. As tagged, the platform is SQL Server 2012, so if there are applicable features new as of that version, do make use of them.
Business context
I have a table LabelText that represents translations of user-supplied dynamic content:
create table Label ( LabelID bigint identity(1,1) not null primary key );
create table LabelText (
LabelTextID bigint identity(1,1) not null primary key
, LabelID bigint not null
, LanguageCode char(2) not null
, LabelText nvarchar(400) not null
, constraint FK_LabelText_Label
foreign key ( LabelID ) references Label ( LabelID )
);
There is a unique index on LabelID and LanguageCode, so there can be only one translation of a text item per ISO two-character language code. The LabelText column is also included, so reads can be served from the index alone without having to fetch back from the underlying table:
create unique index UQ_LabelText
on LabelText ( LabelID, LanguageCode )
include ( LabelText );
I'm looking for the fastest-performing way to return the best match from the LabelText table in a two-step match, given a LabelID and LanguageCode.
For the examples, let's say we have a Component table that looks like this:
create table Component (
ComponentID bigint identity(1,1) not null primary key
, NameLabelID bigint not null
, DescriptionLabelID bigint not null
, constraint FK_Component_NameLabel
foreign key ( NameLabelID ) references Label ( LabelID )
, constraint FK_Component_DescLabel
foreign key ( DescriptionLabelID ) references Label ( LabelID )
);
Users will each have a preferred language, but there is no guarantee that a text item will have a translation in their language. In this business context it makes more sense to show any available translation rather than none, when the user's preferred language is not available. So for example a German user may call a certain widget the 'linkenpfostenklammer'. A British user would prefer to see an English translation if one is available, but until there is one it is better to see the German (or Spanish, or French) version than to see nothing.
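To make the business rule concrete, here is a small hypothetical data set (not from the original post; the IDs assume an empty table) and the lookups it should produce:
insert into Label default values;  -- LabelID 1
insert into Label default values;  -- LabelID 2
insert into LabelText ( LabelID, LanguageCode, LabelText ) values
  ( 1, 'de', N'linkenpfostenklammer' )
, ( 2, 'de', N'Gerät' )
, ( 2, 'en', N'widget' );
-- For @LanguageCode = 'en', the two-step match should return:
--   LabelID 1 -> 'linkenpfostenklammer'  (no English row, fall back to German)
--   LabelID 2 -> 'widget'                (exact match on both key parts)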
What not to do: Cross apply with dynamic sort
Whether encapsulated in a table-valued function or included inline, the following use of cross apply with a dynamic sort was about 20x more expensive (per explain plan estimate) than either the scalar-valued function in my first answer or the union all approach in my second answer:
declare @LanguageCode char(2) = 'de';
select
    c.ComponentID
  , c.NameLabelID
  , n.LanguageCode as NameLanguage
  , n.LabelText as NameText
from Component c
outer apply (
    select top 1
        lt.LanguageCode
      , lt.LabelText
    from LabelText lt
    where lt.LabelID = c.NameLabelID
    order by
        (case when lt.LanguageCode = @LanguageCode then 0 else 1 end)
) n

I think this is going to be the most performant:
select lt.*, c.*
from (
    select LabelID, LabelText
    from LabelText
    where LabelID = @LabelID
      and LanguageCode = @LanguageCode
    union all
    select LabelID, min(LabelText)  -- arbitrary but deterministic fallback
    from LabelText
    where LabelID = @LabelID
      and not exists (
          select 1
          from LabelText
          where LabelID = @LabelID
            and LanguageCode = @LanguageCode
      )
    group by LabelID
) lt
join Component c
  on c.NameLabelID = lt.LabelID

OP solution 1: Scalar function
A scalar function would make it easy to encapsulate the lookup for reuse elsewhere, though it does not return the language code of the text actually returned. I'm also unsure of the cost of executing it multiple times per row in denormalized views.
create function GetLabelText(@LabelID bigint, @LanguageCode char(2))
returns nvarchar(400)
as
begin
    declare @text nvarchar(400);

    -- Step 1: exact match on both key parts
    select @text = LabelText
    from LabelText
    where LabelID = @LabelID
      and LanguageCode = @LanguageCode;

    -- Step 2: fall back to any available translation for the label
    if @text is null
    begin
        select @text = LabelText
        from LabelText
        where LabelID = @LabelID;
    end

    return @text;
end
Usage would look like this:
declare @LanguageCode char(2) = 'de';
select
    ComponentID
  , NameLabelID
  , DescriptionLabelID
  , dbo.GetLabelText(NameLabelID, @LanguageCode) AS NameText
  , dbo.GetLabelText(DescriptionLabelID, @LanguageCode) AS DescriptionText
from Component
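For comparison, the same two-step lookup can also be written as a single inline expression with isnull, avoiding the function call entirely. This is only a sketch under the schema above; the fallback's order by LanguageCode is an arbitrary but deterministic tiebreak, not something from the original post:
declare @LanguageCode char(2) = 'de';
select
    c.ComponentID
    -- exact match first; otherwise any available translation
  , isnull(
        ( select LabelText
          from LabelText
          where LabelID = c.NameLabelID
            and LanguageCode = @LanguageCode )
      , ( select top 1 LabelText
          from LabelText
          where LabelID = c.NameLabelID
          order by LanguageCode )
    ) as NameText
from Component c;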

OP solution 2: Inline table-valued function using top 1, union all
A table-valued function is nice because it encapsulates the lookup for reuse just as a scalar function does, but it also returns the LanguageCode of the row that was actually selected. In my dev database with limited data, the explain plan cost of the following use of top 1 and union all is comparable to the scalar function approach in "OP Solution 1":
create function GetLabelText(@LabelID bigint, @LanguageCode char(2))
returns table
as
return (
    select top 1
        A.LanguageCode
      , A.LabelText
    from (
        select
            0 as MatchRank  -- exact match on both key parts
          , LanguageCode
          , LabelText
        from LabelText
        where LabelID = @LabelID
          and LanguageCode = @LanguageCode
        union all
        select
            1 as MatchRank  -- fallback: any translation of this label
          , LanguageCode
          , LabelText
        from LabelText
        where LabelID = @LabelID
    ) A
    order by A.MatchRank  -- without this, top 1 over union all is not guaranteed to prefer the exact match
);
Usage:
declare @LanguageCode char(2) = 'de';
select
    c.ComponentID
  , c.NameLabelID
  , n.LanguageCode AS NameLanguage
  , n.LabelText AS NameText
  , c.DescriptionLabelID
  , d.LanguageCode AS DescriptionLanguage
  , d.LabelText AS DescriptionText
from Component c
outer apply dbo.GetLabelText(c.NameLabelID, @LanguageCode) n
outer apply dbo.GetLabelText(c.DescriptionLabelID, @LanguageCode) d

Related

How to return records of different formats from a single PL/pgSQL function?

I am a frontend developer, but I have started to write backend code. I have spent quite some time trying to figure out how to solve this, and I really need some help.
Here are the simplified definitions and relations of two tables:
Relationship between tables
CREATE TABLE IF NOT EXISTS items (
item_id uuid NOT NULL DEFAULT gen_random_uuid() ,
parent_id uuid DEFAULT NULL ,
parent_table parent_tables NOT NULL
);
CREATE TABLE IF NOT EXISTS collections (
collection_id uuid NOT NULL DEFAULT gen_random_uuid() ,
parent_id uuid DEFAULT NULL
);
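Note: the custom type parent_tables is not defined in the question. Judging by the literals the answer below compares against ('i' and 'c'), a plausible sketch of the missing definition is:
-- Hypothetical enum distinguishing the two possible parent tables:
-- 'i' for items, 'c' for collections (must be created before the tables above).
CREATE TYPE parent_tables AS ENUM ('i', 'c');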
Our product is an online document collaboration tool; pages can have nested pages.
I have a piece of PostgreSQL code that gets all ancestor records for given item_ids.
WITH RECURSIVE ancestors AS (
SELECT *
FROM items
WHERE item_id in ( ${itemIds} )
UNION
SELECT i.*
FROM items i
INNER JOIN ancestors a ON a.parent_id = i.item_id
)
SELECT * FROM ancestors
It works fine for nested regular pages. But if I am going to support nesting collection pages, meaning some items' parent_id might refer to the collections table's collection_id, this code will not work anymore. From my limited experience, I don't think pure SQL can solve it. I think writing a PL/pgSQL function might be a solution, but I need to get all ancestor records for the given itemIds, which means returning a mix of items and collections records.
So how can I return records of different formats from a single PL/pgSQL function? I did some research but haven't found any example.
You can make it work by returning a superset row comprising both item and collection. One of the two will be NULL in each result row.
WITH RECURSIVE ancestors AS (
SELECT 0 AS lvl, i.parent_id, i.parent_table, i AS _item, NULL::collections AS _coll
FROM items i
WHERE item_id IN ( ${itemIds} )
UNION ALL -- !
SELECT lvl + 1, COALESCE(i.parent_id, c.parent_id), COALESCE(i.parent_table, 'i'), i, c
FROM ancestors a
LEFT JOIN items i ON a.parent_table = 'i' AND i.item_id = a.parent_id
LEFT JOIN collections c ON a.parent_table = 'c' AND c.collection_id = a.parent_id
WHERE a.parent_id IS NOT NULL
)
SELECT lvl, _item, _coll
FROM ancestors
-- ORDER BY ?
db<>fiddle here
UNION ALL, not UNION.
Assuming a collection's parent is always an item, while an item can go either way.
We need LEFT JOIN on both potential parent tables to stay in the race.
I added an optional lvl to keep track of the level of hierarchy.
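To get individual fields back out of the composite columns, the row values can be decomposed in the final SELECT. A minimal sketch against the query above (field names taken from the table definitions in the question):
SELECT lvl
     , (_item).item_id            -- NULL when the ancestor is a collection
     , (_coll).collection_id      -- NULL when the ancestor is an item
FROM ancestors;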
About decomposing row types:
Combine postgres function with query
Record returned from function has columns concatenated

postgresql group by datetime in join query

I have two tables in my PostgreSQL TimescaleDB database (version 12.06) that I am trying to query with an inner join.
Tables' structure:
CREATE TABLE currency(
id serial PRIMARY KEY,
symbol TEXT NOT NULL,
name TEXT NOT NULL,
quote_asset TEXT
);
CREATE TABLE currency_price (
currency_id integer NOT NULL,
dt timestamp WITHOUT time ZONE NOT NULL,
open NUMERIC NOT NULL,
high NUMERIC NOT NULL,
low NUMERIC NOT NULL,
close NUMERIC,
volume NUMERIC NOT NULL,
PRIMARY KEY (
currency_id,
dt
),
CONSTRAINT fk_currency FOREIGN KEY (currency_id) REFERENCES currency(id)
);
The query I'm trying to make is:
SELECT currency_id AS id, symbol, MAX(close) AS close, DATE(dt) AS date
FROM currency_price
JOIN currency ON
currency.id = currency_price.currency_id
GROUP BY currency_id, symbol, date
LIMIT 100;
Basically, it returns all the rows that exist in the currency_price table. I know that Postgres doesn't allow selecting columns without either an aggregate function or their inclusion in the GROUP BY clause. So if I don't include the dt column in my select query, I receive the expected results; but if I include it, the output shows rows for every single day of each currency, while I only want the max value for every currency, to be filtered by various dates afterwards.
I'm very inexperienced with SQL in general.
Any suggestions for solving this would be much appreciated.
There are several ways to do it; the easiest one that comes to mind is using window functions.
select *
from (
    SELECT currency_id, symbol, close, dt
         , row_number() over (partition by currency_id, symbol
                              order by close desc, dt desc) as rr
    FROM currency_price
    JOIN currency ON currency.id = currency_price.currency_id
    where dt::date = '2021-06-07'
) q1
where rr = 1
General window functions: https://www.postgresql.org/docs/9.5/functions-window.html
This also works with standard aggregate functions like SUM, AVG, MAX, and MIN.
Some examples: https://www.postgresqltutorial.com/postgresql-window-function/
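A PostgreSQL-specific alternative, offered only as a sketch (it is not from the answer above): DISTINCT ON keeps exactly one row per group, the first one under the ORDER BY, which expresses "max close per currency" without a subquery:
SELECT DISTINCT ON (cp.currency_id)
       cp.currency_id AS id, c.symbol, cp.close, cp.dt
FROM currency_price cp
JOIN currency c ON c.id = cp.currency_id
WHERE cp.dt::date = '2021-06-07'
ORDER BY cp.currency_id, cp.close DESC, cp.dt DESC;  -- highest close per currency, latest dt as tiebreak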

postgresql: search records based on array field value with multiple values

I have a table that has an array field.
CREATE TABLE notifications
(
id integer NOT NULL DEFAULT nextval('notifications_id_seq'::regclass),
title character(100) COLLATE pg_catalog."default" NOT NULL,
tags text[] COLLATE pg_catalog."default",
CONSTRAINT notifications_pkey PRIMARY KEY (id)
)
and the tags field can have multiple values from
["a","b","c","d"]
Now I want all the records whose tags contain a or d ("a","d").
I can use PostgreSQL's IN, but that only matches a single value. How can I achieve this?
You could use ANY:
SELECT *
FROM notifications
WHERE 'a' = ANY(tags) OR 'b' = ANY(tags);
DBFiddle Demo
If the values 'a' and 'b' are static (you only need to check for those two values in every query), then you can go with the solution that Lukasz Szozda provided.
But if the values you want to check for are dynamic and differ between queries (sometimes {'a','b'}, sometimes {'b','f','m'}), you can build the intersection of the two arrays and check whether it is empty.
For example:
If we have the following table and data:
CREATE TABLE test_table_1(description TEXT, tags TEXT[]);
INSERT INTO test_table_1(description, tags) VALUES
('desc1', array['a','b','c']),
('desc2', array['c','d','e']);
If we want to get all of the rows from test_table_1 that have one of the following tags b, f, or m, we could do it with the following query:
SELECT *
FROM test_table_1 tt1
WHERE array_length(
        ARRAY(
            SELECT UNNEST(tt1.tags)
            INTERSECT
            SELECT UNNEST(array['b','f','m'])
        ), 1) > 0;
In the query above we use array_length to check if the intersection is empty.
Writing the query this way can also be useful if you want to add an additional constraint on the number of matched tags.
For example, if you want all of the rows that have at least 2 tags from the group {'a','b','c'}, you just need to require array_length(...) > 1.
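For completeness, a sketch of the array overlap operator &&, which expresses "has at least one of these tags" directly and can be served by a GIN index on tags:
SELECT *
FROM notifications
WHERE tags && ARRAY['a','d'];  -- true if tags shares any element with the given array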

PostgreSQL inserts values incorrectly

I want to add a denormalized table for some data of a GTFS feed. For that I created a new table:
CREATE TABLE denormalized_trips (
stops_coords json NOT NULL,
stops_object json NOT NULL,
agency_key text NOT NULL,
trip_id text NOT NULL,
route_id text NOT NULL,
service_id text NOT NULL,
shape_id text,
route_color text,
route_long_name text,
route_desc text,
direction_id text
);
CREATE INDEX denormalized_trips_index ON denormalized_trips (agency_key, trip_id);
CREATE UNIQUE INDEX denormalized_trips_unique_index ON denormalized_trips (agency_key, route_id);
Now I want to transfer data from one table to the other via an insert statement. The statement is rather complex.
INSERT INTO denormalized_trips
SELECT
trps.stops_coords,
trps.stops_object,
trps.trip_id,
trps.service_id,
trps.route_id,
trps.direction_id,
trps.agency_key,
trps.shape_id,
trps.route_color,
trps.route_long_name,
trps.route_desc
FROM (
SELECT
array_to_json(ARRAY_AGG(array[stop_lat, stop_lon])) AS stops_coords,
array_to_json(ARRAY_AGG(array[
stops.stop_id,
CAST ( stop_times.stop_sequence AS TEXT ),
stops.stop_name,
stop_times.departure_time,
CAST ( stop_times.departure_time_seconds AS TEXT ),
stop_times.arrival_time,
CAST ( stop_times.arrival_time_seconds AS TEXT )
])) AS stops_object,
trips.trip_id,
trips.service_id,
trips.direction_id,
trips.agency_key,
trips.shape_id,
routes.route_id,
routes.route_color,
routes.route_long_name,
routes.route_desc
FROM gtfs_stop_times AS stop_times
INNER JOIN gtfs_trips AS trips
ON trips.trip_id = stop_times.trip_id AND trips.agency_key = stop_times.agency_key
INNER JOIN gtfs_routes AS routes ON trips.agency_key = routes.agency_key AND routes.route_id = trips.route_id
INNER JOIN gtfs_stops AS stops
ON stops.stop_id = stop_times.stop_id
AND stops.agency_key = stop_times.agency_key
AND NOT EXISTS (
SELECT 0
FROM denormalized_max_stop_sequence AS max
WHERE max.agency_key = stop_times.agency_key
AND max.trip_id = stop_times.trip_id
AND max.trip_max = stop_times.stop_sequence
)
GROUP BY
trips.trip_id,
trips.service_id,
trips.direction_id,
trips.agency_key,
trips.shape_id,
routes.route_id,
routes.route_color,
routes.route_long_name,
routes.route_desc
) as trps
If I just run the inner select statement, I get the right results. They look something like this (the screenshot, omitted here, does not show all columns because the result is too wide):
But if I execute the insert statement and display the content of the table, I get something like this:
As you may notice, the contents are not inserted into the right columns of the table. The agency_key now has the values of the trip_id, and the direction_id is now the service_id (and more columns are mixed up like this).
So my question is: what am I doing wrong that my insert statement inserts the contents into the wrong columns of the newly created table?
Thanks for your help.
Postgres, by default, will insert your values in the order the columns are declared in the table; it has nothing to do with what your columns are named in the query.
https://www.postgresql.org/docs/9.5/static/sql-insert.html
If no list of column names is given at all, the default is all the columns of the table in their declared order; or the first N column names, if there are only N columns supplied by the VALUES clause or query.
You can alter your insert to declare the order of the columns you're inserting, or you can change the order of your select to match the order of columns in the table.
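Applied to the statement in the question, the first option would look like this sketch. The derived-table subquery is unchanged and elided here as "..."; the point is that the target-column list now matches the SELECT order:
INSERT INTO denormalized_trips
    (stops_coords, stops_object, trip_id, service_id, route_id,
     direction_id, agency_key, shape_id, route_color,
     route_long_name, route_desc)
SELECT
    trps.stops_coords, trps.stops_object, trps.trip_id,
    trps.service_id, trps.route_id, trps.direction_id,
    trps.agency_key, trps.shape_id, trps.route_color,
    trps.route_long_name, trps.route_desc
FROM ( ... ) AS trps;  -- same subquery as in the question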

An empty row with null-like values in not-null fields

I'm using PostgreSQL 9.0 beta 4.
After inserting a lot of data into a partitioned table, I found a weird thing: when I query the table, I can see an empty row with null-like values in 'not null' fields.
That weird query result is shown in a screenshot (omitted here): the 689th row is empty. The first 3 fields (stid, d, ticker) make up the primary key, so they should not be null. The query I used is this:
select * from st_daily2 where stid=267408 order by d
I can even do the group by on this data.
select stid, date_trunc('month', d) ym, count(*) from st_daily2
where stid=267408 group by stid, date_trunc('month', d)
The 'group by' result still has the empty row; in that screenshot the 1st row is empty.
But if I query where 'stid' or 'd' is null, it returns nothing.
Is this a bug in PostgreSQL 9.0 beta 4, or some data corruption?
EDIT :
I added my table definition.
CREATE TABLE st_daily
(
stid integer NOT NULL,
d date NOT NULL,
ticker character varying(15) NOT NULL,
mp integer NOT NULL,
settlep double precision NOT NULL,
prft integer NOT NULL,
atr20 double precision NOT NULL,
upd timestamp with time zone,
ntrds double precision
)
WITH (
OIDS=FALSE
);
CREATE TABLE st_daily2
(
CONSTRAINT st_daily2_pk PRIMARY KEY (stid, d, ticker),
CONSTRAINT st_daily2_strgs_fk FOREIGN KEY (stid)
REFERENCES strgs (stid) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT st_daily2_ck CHECK (stid >= 200000 AND stid < 300000)
)
INHERITS (st_daily)
WITH (
OIDS=FALSE
);
The data in this table is simulation results. Multiple multithreaded simulation engines written in C# insert data into the database using Npgsql.
psql also shows the empty row.
You'd better leave a posting at http://www.postgresql.org/support/submitbug
Some questions:
Could you show us the table definitions and constraints for the partitions?
How did you load your data?
Do you get the same result when using another tool, like psql?
The answer to your problem may very well lie in your first sentence:
I'm using postgresql 9.0 beta 4.
Why would you do that? Upgrade to a stable release. Preferably the latest point-release of the current version.
This is 9.1.4 as of today.
I got to the same point: "what in the heck is that blank value?"
No, it's not a NULL, it's a -infinity.
To filter for such a row use:
WHERE case when mytestcolumn = '-infinity'::timestamp
             or mytestcolumn = 'infinity'::timestamp
           then NULL
           else mytestcolumn
      end IS NULL
instead of:
WHERE mytestcolumn IS NULL
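As an aside, PostgreSQL's isfinite() can flag such rows more directly; a sketch assuming the affected column is the date column d from the table above:
SELECT * FROM st_daily2 WHERE NOT isfinite(d);  -- rows where d is -infinity or infinity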