Unnest vs just having every row needed in table - postgresql

I have a choice in how a data table is created and am wondering which approach is more performant.
Making a table with a row for every data point,
Making a table with an array column that will allow repeated content to be unnested
That is, if I have the data:
day
val1
val2
Mon
7
11
Tue
7
11
Wed
8
9
Thu
1
4
Is it better to enter the data in as shown, or instead:
day
val1
val2
(Mon,Tue)
7
11
(Wed)
8
9
(Thu)
1
4
And then use unnest() to explode those into unique rows when I need them?
Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.

I believe making a table with a row for every data point would be the option I would go for. As unnest for large amounts of data would be just as if not slower. Plus
unless your data will be very repeated 20 columns is alot to align.

"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:
-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
RETURNS uuid
LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
-- data table
CREATE TABLE data (
data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)
);
-- reference data_id in next table
CREATE TABLE day_data (
day text
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id) -- must be unique?
);
db<>fiddle here
With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
In-depth explanation in this closely related answer:
Why doesn't my UNIQUE constraint trigger?
You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).
About generated columns:
Computed / calculated / virtual / derived columns in PostgreSQL
Basic technique to avoid duplicates while inserting new data:
INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
If there can be concurrent writes, see:
How to use RETURNING with ON CONFLICT in PostgreSQL?

Related

Is there a way to add the same row multiple times with different ids into a table with postgresql?

I am trying to add the same data for a row into my table x number of times in postgresql. Is there a way of doing that without manually entering the same values x number of times? I am looking for the equivalent of the go[count] in sql for postgres...if that exists.
Use the function generate_series(), e.g.:
insert into my_table
select id, 'alfa', 'beta'
from generate_series(1,4) as id;
Test it in db<>fiddle.
Idea
Produce a resultset of a given size and cross join it with the record that you want to insert x times. What would still be missing is the generation of proper PK values. A specific suggestion would require more details on the data model.
Query
The sample query below presupposes that your PK values are autogenerated.
CREATE TABLE test ( id SERIAL, a VARCHAR(10), b VARCHAR(10) );
INSERT INTO test (a, b)
WITH RECURSIVE Numbers(i) AS (
SELECT 1
UNION ALL
SELECT i + 1
FROM Numbers
WHERE i < 5 -- This is the value `x`
)
SELECT adhoc.*
FROM Numbers n
CROSS JOIN ( -- This is the single record to be inserted multiple times
SELECT 'value_a' a
, 'value_b' b
) adhoc
;
See it in action in this db fiddle.
Note / Reference
The solution is adopted from here with minor modifications (there are a host of other solutions to generate x consecutive numbers with SQL hierachical / recursive queries, so the choice of reference is somewhat arbitrary).

Does the index column order matter on row insert in Postgresql?

I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint,
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create composite index on columns request_date and user_id to speed up execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with (request_date desc, user_id) btree index and (user_id, request_date desc) btree index. With request_date leading column index select queries are executed about 10% faster, but in general performance of any of this indexes is acceptable.
So my question is does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on insert query. Which index will be more build time efficient under "high load"?
It is hard to believe your test were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.

Predict partition number for Postgres hash partitioning

I'm writting an app which uses partitions in Postgres DB. This is will be send to customers and run on their server. This implies that I have to be prepared for many different scenarios.
Lets start with simple table schema:
CREATE TABLE dir (
id SERIAL,
volume_id BIGINT,
path TEXT
);
I want to partition that table by volume_id column.
What I would like to achieve:
limited number of partitions (right now it's 500 but I'm will be tweaking this parameter later)
Do not create all partitions at once - add them only when they are needed
support volume ids up to 100K
[NICE TO HAVE] - been able for human to calculate partition number from volume_id
Solution that I have right now:
partition by LIST
each partition handles volume_id % 500 like this:
CREATE TABLE dir_part_1 PARTITION OF dir FOR VALUES IN (1, 501, 1001, 1501, ..., 9501);
This works great because I can create partition when it's needed, and I know exactly to which partition given volume_id belongs. But I have to manually declare numbers and I cannot support high volume_ids because speed of insert statements decrease drastically (more than 2 times).
It looks like I could try HASH partitioning but my biggest concern is that I have to create all partitions at the very beginning and I would like to be able to create them dynamically when they are needed, because planning time increases significantly up to 5 seconds for 500 partitions. For example I know that I will be adding rows with volume_id=5. How can I tell which partition should I create?
I was able to force Postgres to use dummy hash function by adding hash operator for partitioned table.
CREATE OR REPLACE FUNCTION partition_custom_bigint_hash(value BIGINT, seed BIGINT)
RETURNS BIGINT AS $$
-- this number is UINT64CONST(0x49a0f4dd15e5a8e3) from
-- https://github.com/postgres/postgres/blob/REL_13_STABLE/src/include/common/hashfn.h#L83
SELECT value - 5305509591434766563;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OPERATOR CLASS partition_custom_bigint_hash_op
FOR TYPE int8
USING hash AS
OPERATOR 1 =,
FUNCTION 2 partition_custom_bigint_hash(BIGINT, BIGINT);
Now you can declare partitioned table like this:
CREATE TABLE some_table (
id SERIAL,
partition_id BIGINT,
value TEXT
) PARTITION BY HASH (partition_id);
CREATE TABLE some_table_part_2 PARTITION OF some_table FOR VALUES WITH (modulus 3, remainder 2);
Now you can safely assume that allow rows with partition_id % 3 = 2 will land in some_table_part_2 partition. So if you are sure what values you will receive in partition_id column you can create only required partitions.
DISCLAIMER 1: Unfortunately this will not work correctly right now (Postgres 13.1) because of bug #16840
DISCLAIMER 2: There is not point of using this technic unless you are planning to create large number of partitions (I would say 50 or more) and prolonged planning time is an issue.

Cap on number of table rows matching a certain condition using Postgres exclusion constraint?

If I have a Postgresql db schema for a tires table like this (a user has many tires):
user_id integer
description text
size integer
color text
created_at timestamp
and I want to enforce a constraint that says "a user can only have 4 tires".
A naive way to implement this would be to do:
SELECT COUNT(*) FROM tires WHERE user_id='123'
, compare the result to 4 and insert if it's lower than 4. It's suspectible to race conditions and so is a naive approach.
I don't want to add a count column. How can I do this (or can I do this) using an exclusion constraint? If it's not possible with exclusion constraints, what is the canonical way?
The "right" way is using locks here.
Firstly, you need to lock related rows. Secondly, insert new record if a user has less than 4 tires. Here is the SQL:
begin;
-- lock rows
select count(*)
from tires
where user_id = 123
for update;
-- throw an error of there are more than 3
-- insert new row
insert into tires (user_id, description)
values (123, 'tire123');
commit;
You can read more here How to perform conditional insert based on row count?

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to preform the equivalent of SELECT DISTINCT row FROM foo
As for my understanding it should be possible to execute this query efficiently inside Cassandra's data model(given the way compound primary keys are implemented) as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync - writing to foo_rows for any write in foo(also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.
You mentioned that you were worried about a performance hit on inserting into multiple tables. Multiple table inserts are a perfectly valid data modeling technique. Cassandra can do a enormous amount of writes. As for it being a pain to sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think partition columns. This is what we call a index or query table. The important thing to consider in any Cassandra data model is the application queries. If I was using IP address as the row, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
according to the documentation, from CQL version 3.11, cassandra understands DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
#edofic
Partition row keys are used as unique index to distinguish different rows in the storage engine so by nature, row keys are always distinct. You don't need to put DISTINCT in the SELECT clause
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+----------+-------------------+------------------+
| row key | column1/value | column2/value |
+----------+-------------------+------------------+
| 1 | 1/'1' | 2/'2' |
| 2 | 1/'1' | |
+----------+-------------------+------------------+