PostgreSQL: How to use a lookup table to select data across multiple tables?

I have a schema with many large tables which all have the same structure. Each table has an index on its id. I also have a separate table with all the ids across the other tables, pointing to their table name; for example, the tables in the schema:
Table 'A'
 id | content
  1 | ...
  2 | ...
  3 | ...
Table 'B'
 id | content
  4 | ...
  5 | ...
  6 | ...
Table 'C'
 id | content
  5 | ...
  6 | ...
  7 | ...
(As you can see, the ids are not always unique across the tables.) And then the lookup table:
Table 'lookup'
 id | tablename
  1 | 'A'
  2 | 'A'
  3 | 'A'
  4 | 'B'
  5 | 'B'
  5 | 'C'
  6 | 'B'
  6 | 'C'
  7 | 'C'
Now, how can I make a view like this?
SELECT id, content
FROM   view
WHERE  id = 6
where it would select the content from B and C (where id is 6). Also, it should only do an index scan on B and C to reduce search time. Once again, there are many tables and they are very large. The vast majority of the ids are unique across the tables.
How can I do this? (Or maybe, should I do it this way?)
PS: The content of the tables is not stored in a single table because the volume is constantly growing, and inserting/copying into one large indexed table becomes very slow after a while. Also, it is easier to remove specific data by simply truncating separate tables.
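For reference, one naive way to build such a view is to union the per-table queries and join the lookup table (a sketch only, using quoted table names from the example; by itself this does not guarantee that only B and C are touched):
CREATE VIEW combined AS
SELECT 'A' AS tablename, id, content FROM "A"
UNION ALL
SELECT 'B', id, content FROM "B"
UNION ALL
SELECT 'C', id, content FROM "C";

SELECT c.id, c.content
FROM   lookup l
JOIN   combined c ON c.tablename = l.tablename AND c.id = l.id
WHERE  l.id = 6;
The answer below suggests partitioning instead, which maintains this union automatically.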

You are looking for table partitioning:
CREATE TABLE object (
  country text,
  id bigint,
  content bytea,
  PRIMARY KEY (country, id)
) PARTITION BY LIST (country);
CREATE TABLE object_a PARTITION OF object FOR VALUES IN ('A');
CREATE TABLE object_b PARTITION OF object FOR VALUES IN ('B');
CREATE TABLE object_c PARTITION OF object FOR VALUES IN ('C');
This alone won't give you quick access by id, but simplifies managing the union and querying by country name. You'd still need the lookup table for the countries:
CREATE MATERIALIZED VIEW lookup AS SELECT country, id FROM object;
Then you can do
SELECT content
FROM   object
JOIN   lookup ON object.country = lookup.country AND object.id = lookup.id
WHERE  lookup.id = 6;
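Not part of the original answer, but in practice you would probably also add a unique index on the materialized view, both so lookups by id can use an index and so the view can be refreshed without locking out readers:
CREATE UNIQUE INDEX ON lookup (id, country);
REFRESH MATERIALIZED VIEW CONCURRENTLY lookup;  -- re-run whenever new ids arrive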

Related

Unnest vs just having every row needed in table

I have a choice in how a data table is created and am wondering which approach is more performant:
1. Making a table with a row for every data point, or
2. Making a table with an array column that will allow repeated content to be unnested.
That is, if I have the data:
day | val1 | val2
Mon |    7 |   11
Tue |    7 |   11
Wed |    8 |    9
Thu |    1 |    4
Is it better to enter the data in as shown, or instead:
day       | val1 | val2
(Mon,Tue) |    7 |   11
(Wed)     |    8 |    9
(Thu)     |    1 |    4
And then use unnest() to explode those into unique rows when I need them?
Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
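For concreteness, the array variant described above might look roughly like this (a sketch with hypothetical names and types; the question gives no actual DDL):
CREATE TABLE day_data_packed (
  days text[]   -- e.g. '{Mon,Tue}'
, val1 int
, val2 int
);

SELECT unnest(days) AS day, val1, val2
FROM   day_data_packed;  -- explodes each array element into its own row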
I believe making a table with a row for every data point is the option I would go for, as unnest for large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:
-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
RETURNS uuid
LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
-- data table
CREATE TABLE data (
data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)
);
-- reference data_id in next table
CREATE TABLE day_data (
day text
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id) -- must be unique?
);
db<>fiddle here
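To read the data back in the original wide shape, you would join the two tables (a usage sketch; names as in the DDL above):
SELECT d.day, t.col1, t.col2  -- ... through t.col20
FROM   day_data d
JOIN   data t USING (data_id)
WHERE  d.day = 'Mon';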
With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
In-depth explanation in this closely related answer:
Why doesn't my UNIQUE constraint trigger?
You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).
About generated columns:
Computed / calculated / virtual / derived columns in PostgreSQL
Basic technique to avoid duplicates while inserting new data:
INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
If there can be concurrent writes, see:
How to use RETURNING with ON CONFLICT in PostgreSQL?
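Since ON CONFLICT DO NOTHING returns no row when the data already exists, a common pattern is to fall back to a SELECT on the hash. A rough sketch, reusing the two-column example above and ignoring the concurrency caveats covered in the linked answer:
WITH ins AS (
   INSERT INTO data (col1, col2)
   VALUES ('foo', 'bam')
   ON CONFLICT DO NOTHING
   RETURNING data_id
   )
SELECT data_id FROM ins
UNION ALL
SELECT data_id FROM data
WHERE  uniq_hash = public.f_uniq_hash20('foo', 'bam', ... )  -- remaining arguments elided as above
LIMIT  1;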

SQL (Redshift) to get the intersect of multiple tables

I'm using Redshift and have 6 tables of IDs. I want to get the intersection between each pair of tables.
So my final output would look something like this:
Table 1 & Table 2 have 10% common IDs
Table 1 & Table 3 have 50% common IDs
.....
.....
Table 6 & Table 4 have 20% common IDs
Table 6 & Table 5 have 3% common IDs
I can easily get the data, but it would involve a lot of repetition of the same SQL, so I've tried to create some tables of all the IDs and the tables they are in, but I'm stuck as to how to get the data in one or two SQL statements.
Any ideas welcome!
You could try to full join all these tables by ID in a subquery and then use conditional aggregation, so that "Table 1 & Table 2 have 10% common IDs" would be expressed as
100.0*sum(case when id1 is not null and id2 is not null then 1 end)/count(id1)
(taking Table 1 row count as denominator)
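Spelled out for the first pair of tables, that might look like this (a sketch; table and column names are assumptions):
SELECT 100.0 * SUM(CASE WHEN t1.id IS NOT NULL AND t2.id IS NOT NULL THEN 1 END)
       / COUNT(t1.id) AS pct_common
FROM   table1 t1
FULL JOIN table2 t2 ON t1.id = t2.id;
One such query (or a UNION ALL of them) per pair of tables covers all the combinations.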

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo.
To my understanding, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync: writing to foo_rows for every write to foo (also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.
You mentioned that you were worried about a performance hit from inserting into multiple tables. Multiple-table inserts are a perfectly valid data modeling technique. Cassandra can handle an enormous amount of writes. As for it being a pain to keep in sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think about partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application queries. If I were using IP addresses as the row, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, from CQL version 3.1.1, Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
#edofic
Partition (row) keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
| row key  | column1/value     | column2/value    |
+----------+-------------------+------------------+
| 1        | 1/'1-1'           | 2/'1-2'          |
| 2        | 1/'2-1'           |                  |
+----------+-------------------+------------------+

maximum number of arguments postgresql function can take

I have a view consisting of two tables, let's say TableA and TableB.
Now TableA has around 20 columns and TableB has 4 columns.
TableA (
  id datatype,
  uid datatype,
  .
  .
  .
  18 more);
TableB (
  id datatype,
  uid datatype,
  a_id datatype,
  amount datatype,
  CONSTRAINT tablea_tableb_fkey FOREIGN KEY (a_id)
    REFERENCES tablea (id) MATCH SIMPLE
    ON UPDATE RESTRICT ON DELETE RESTRICT
);
So there is one to many relationship between TableA and TableB. Now I have written the view as follows...
CREATE OR REPLACE VIEW AB AS
SELECT a.id, a.uid, ..., array_agg(b.amount) AS amounts
FROM TableA a
JOIN TableB b ON a.id = b.a_id
GROUP BY a.id;
Now I want to write an insert rule for this view, which I am doing by writing a helper function. The function takes around 18 parameters for inserting into TableA (id is excluded, and uid has a default value) and 1 parameter, an array, for TableB.
So the total number of parameters for the function is 19. I want to know: what is the maximum number of arguments that I can pass to a function in PostgreSQL? Is it wise to pass this many arguments? Is there a better way to write a function with this many arguments?
FUNC_MAX_ARGS is a compile-time parameter (you can change it and recompile), and in my 9.2 source code it is 100.
If you have more parameters, then using arrays is a good idea.
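For reference, the limit of a running server can be checked without reading the source:
SHOW max_function_args;   -- 100 on a stock build
And a rough sketch of the array/composite approach, with hypothetical names and assumed types (the question does not give the real column list):
CREATE FUNCTION insert_ab(a_row tablea, amounts numeric[])
RETURNS void
LANGUAGE sql AS
$$
   INSERT INTO tablea SELECT (a_row).*;        -- the ~20 TableA columns arrive as one composite value
   INSERT INTO tableb (a_id, amount)
   SELECT (a_row).id, unnest(amounts);         -- one TableB row per array element
$$;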

SQL Server 2008 View Columns match to underlying Table Columns

I've been attempting to write some SQL code that when provided with a view will locate the columns that the view columns reference and work out if there are any indexes on those columns. The end aim is to provide users with a list of columns it would be possible to use when querying the view.
Currently, though, I can only find the columns that the view uses (and by extension their indexes), but I can't match them back to the index's columns.
For example:
I have TableA, which has 5 Columns: ID, Name, Val1, Val2, TableBID
I have TableB, which has 3 Columns: ID, Name, Code
I then create a view, View1, which is:
SELECT A.ID,
       A.Name,
       A.Val1
FROM   TableA A
       INNER JOIN TableB B ON A.TableBID = B.ID
WHERE  B.Code = 'AAA'
When I query for references using:
SELECT *
FROM sys.dm_sql_referenced_entities('dbo.View1', 'OBJECT')
I'll get a list of the Table/Column references within it, but no indication of which View Column references what.
Is there any way I can access the information I need? Bear in mind I cannot do name matching, as the columns in the view may use aliases and therefore may not have the same names as the underlying data.
I'm using SQL Server 2008 SP1 if that has any impact.
Rename the columns in the view using the unique combination TableName + '_' + ColumnName for every view column. Then you can split a view column back into its table and column.
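For example, the view from the question could expose that naming convention directly (a sketch of the suggestion, not tested against the original schema):
CREATE VIEW dbo.View1 AS
SELECT A.ID    AS TableA_ID,
       A.Name  AS TableA_Name,
       A.Val1  AS TableA_Val1
FROM   dbo.TableA A
       INNER JOIN dbo.TableB B ON A.TableBID = B.ID
WHERE  B.Code = 'AAA';
A consumer can then split each view column name on '_' to recover the underlying table and column, as long as callers can rely on the convention being applied consistently.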