How to solve PostgreSQL index width problem - postgresql

I have a docker container which contains a PostgreSQL database. An application then connects to the database. In the database I have a table defined as:
CREATE TABLE IF NOT EXISTS configuration (
id SERIAL PRIMARY KEY,
board BIGINT NOT NULL REFERENCES boards ( id ),
date_time decimal(20,0) NOT NULL,
version INTEGER NOT NULL,
data TEXT NOT NULL,
UNIQUE(board, date_time, version, data)
);
I am not explicitly creating any indices or any of the other meta-objects associated with this table.
The application used to be able to write to this table without problem, but now I am getting the following error:
Failure during 'insert_configuration_record': ERROR: index row size 5992 exceeds btree version 4 maximum 2704 for index "configuration_board_date_time_version_data_key"
DETAIL: Index row references tuple (2,12) in relation "configuration".
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
It is possible that the PostgreSQL version changed at some point when I rebuilt the docker container, but I have not seen any messages refusing to load the database from the persistent storage or asking me to upgrade it. The current database version is (PostgreSQL) 12.9 (Ubuntu 12.9-0ubuntu0.20.04.1).
It is possible that previously the data I was writing was short enough to not hit the limit.
How do I use "a function index of an MD5 hash of the value" to avoid this problem?

If you had that unique constraint before with the same data, then you must have built PostgreSQL with a block size greater than the default 8kB.
Anyway, you should do what the hint tells you, and create a unique index instead of the unique constraint:
CREATE UNIQUE INDEX ON configuration (
board,
date_time,
version,
md5(data)
);
You cannot turn this index into a unique constraint, because such a constraint can only be defined on plain columns, not on expressions. However, the behavior will be just the same as a constraint.
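As a rough sketch of the migration this implies, the existing constraint has to be dropped before the expression index can take its place. The constraint name below is taken from the error message in the question; the new index name is a made-up example, not something PostgreSQL requires.

```sql
BEGIN;

-- Drop the implicit unique constraint (name taken from the error message).
ALTER TABLE configuration
    DROP CONSTRAINT configuration_board_date_time_version_data_key;

-- Replace it with a unique index over the MD5 digest of "data".
-- The index name is hypothetical; any unused name works.
CREATE UNIQUE INDEX configuration_board_date_time_version_md5_key
    ON configuration (board, date_time, version, md5(data));

COMMIT;
```

Since md5() always returns 32 hexadecimal characters, the index entry stays small no matter how long data gets. The CREATE UNIQUE INDEX step will fail if the table already contains duplicate rows.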

Related

Cannot create primary key using already created index

I have a table ideas with columns idea_id, element_id and element_value.
Initially, I had created a composite primary key(ideas_pkey) using all three columns but I started facing size limit issues with the index associated with the primary key as the element_value column had a huge value.
Hence, I created another unique index hashing the column with possible large values
CREATE UNIQUE INDEX ideas_pindex ON public.ideas USING btree (idea_id, element_id, md5(element_value))
Now I deleted the initial primary key ideas_pkey and wanted to recreate it using this newly created index like so
alter table ideas add constraint ideas_pkey PRIMARY KEY ("idea_id", "element_id", "element_value") USING INDEX ideas_pindex;
But this fails with the following error
ERROR: syntax error at or near "ideas_pindex"
LINE 2: ...a_id", "element_id", "element_value") USING INDEX ideas_...
^
SQL state: 42601
Character: 209
What am I doing wrong?
A primary key index can't be a functional index. You can instead just have a unique index on your table, or create another column storing the md5() of your larger column and use it in the PK.
That being said, there is another error in your query: if you want to specify an index name, you can't specify the PK columns (they are derived from the underlying index). And if you want to specify the PK columns, you can't specify the index name or definition, as the index will be created automatically. See the ALTER TABLE documentation.
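The "another column storing the md5()" option could look roughly like the sketch below. It assumes PostgreSQL 12 or later (for stored generated columns), and element_value_md5 is a hypothetical column name, not something from the question.

```sql
-- Assumption: PostgreSQL 12+ (stored generated columns).
-- element_value_md5 is a hypothetical column name.
ALTER TABLE ideas
    ADD COLUMN element_value_md5 text
    GENERATED ALWAYS AS (md5(element_value)) STORED;

-- The primary key is now defined on plain columns only, so no
-- expression index is involved and PostgreSQL builds the index itself.
ALTER TABLE ideas
    ADD CONSTRAINT ideas_pkey
    PRIMARY KEY (idea_id, element_id, element_value_md5);
```

Primary key columns must not be NULL, so this only works if element_value is never NULL. Once the key is in place, the separate ideas_pindex index is redundant and can be dropped.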

Converting PostgreSQL table to TimescaleDB hypertable

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
event_time timestamp with time zone NOT NULL,
pair_id integer NOT NULL,
entry_id bigint NOT NULL,
event_data int NOT NULL,
CONSTRAINT con1 UNIQUE (pair_id, entry_id ),
CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
REFERENCES public.pairs (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
migrate_data => TRUE
);
I get the Error: ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From this post, "How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing", my understanding is that this is because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert this? I have added some details on how I plan to use the data and on the structure of the data below.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_id's which are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 Billion rows but should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp in to an array somehow and keep them in one row. I think this would have downsides on compression and provide less flexibility on querying the data. Also I would probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand unique constraint con1 to include event_time - would that not somehow decrease performance while at the same time opening up the possibility of accidentally inserting 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is a likely thing to happen though)
You understand correctly that UNIQUE (pair_id, entry_id ) prevents the table from being converted to a hypertable, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, will help with uniqueness.
Removing the unique constraint will allow you to create the hypertable but, as you mentioned, you will lose the ability to check the constraint.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows inserting duplicates with different timestamps, as you mentioned. It will perform worse than option 2 during inserts. You can replace the index on event_time (which you need, since you query on this column, and which TimescaleDB creates automatically) with a unique index to save a little overhead, e.g.,
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
Manually create a unique constraint on each chunk table. This will guarantee uniqueness within a chunk, but it will still be possible to have duplicates in different chunks. The main drawback is that you will need to figure out how to create the constraint whenever a new chunk is created.
Unique constraints without the partition key are not supported in TimescaleDB, since checking uniqueness would require accessing all existing chunks, which would kill performance (or it would require a global index, which can be large). I don't think it is common for time series data to have unique constraints, as they are usually related to artificially generated counter-based identifiers.
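Put together, the third option (uniqueness that includes event_time) might be applied along these lines. This is only a sketch using the table and constraint names from the question; the index name is a hypothetical choice.

```sql
-- Drop the constraint that lacks the partitioning column.
ALTER TABLE data DROP CONSTRAINT con1;

-- Hypothetical index name. event_time comes first so the same index
-- can also serve queries that filter on time.
CREATE UNIQUE INDEX data_event_time_pair_entry_idx
    ON data (event_time, pair_id, entry_id);

-- Same call as in the question; it no longer conflicts with con1.
SELECT create_hypertable(
    'data',
    'event_time',
    chunk_time_interval => INTERVAL '1 hour',
    migrate_data => TRUE
);
```

As noted above, this only rejects rows that repeat all three values; two rows with the same pair_id and entry_id but different event_time values would both be accepted.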

Does PostgreSQL create an internal key (probably an int type) as primary key for a table without a primary key specified?

From https://stackoverflow.com/a/40597571/3284469
If you don't specify a primary key, RDBMS will help you choose an unique and non-null key, OR create an internal key (probably an int type) as primary key for this table.
Could you give some examples for the "OR" case, where a RDBMS (PostgreSQL in particular, and possibly also MySQL or SQL Server) create an "internal key (probably an int type) as primary key" for a table without a primary key specified?
Does PostgreSQL have something similar to MySQL?
Thanks.
for Postgres:
From "5.4. System Columns":
oid
The object identifier (object ID) of a row. This column is only present if the table was created using WITH OIDS, or if the default_with_oids configuration variable was set at the time. This column is of type oid (same name as the column); see Section 8.18 for more information about the type.
and
ctid
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.
Both come close to what you're searching for but have restrictions as you can read in the documentation. So, as the manual states, using a user-defined PK is the better choice.
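To see why ctid is not a stable identifier, a small experiment along these lines can help. The table ctid_demo is a throwaway, hypothetical example.

```sql
-- Hypothetical scratch table with no primary key.
CREATE TABLE ctid_demo (val text);
INSERT INTO ctid_demo VALUES ('a');

-- ctid is a system column; it must be selected explicitly.
SELECT ctid, val FROM ctid_demo;

-- An UPDATE writes a new row version, which normally lands at a
-- different physical location.
UPDATE ctid_demo SET val = 'b';
SELECT ctid, val FROM ctid_demo;

DROP TABLE ctid_demo;
```

Comparing the two SELECT results should show the same logical row reported under a different ctid after the update.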
for SQL Server:
There is the undocumented pseudo column %%physloc%%. It describes the physical location of a row. That, however, might be subject to change if the row gets physically moved for whatever reason. And it's undocumented, that is, its behavior might change at any time between releases or even just patches, or it might be removed completely without further notice. So using a user-defined PK is the better choice here as well.

POSTGRESQL:autoincrement for varchar type field

I'm switching from MongoDB to PostgreSQL and was wondering how I can implement the same concept MongoDB uses for uniquely identifying each row by its MongoId.
After migration, the existing unique fields in our database are saved as a character type. I am looking for minimal source code changes.
So is there any way in PostgreSQL to generate an auto-incrementing unique ID for each row inserted into a table?
The closest thing to MongoDB's ObjectId in PostgreSQL is the uuid type. Note that ObjectId has only 12 bytes, while UUIDs have 128 bits (16 bytes).
You can convert your existing IDs by appending (or prepending), for example, '00000000' to them.
alter table some_table
alter id_column
type uuid
using (id_column || '00000000')::uuid;
Ideally you would do this while migrating the schema and data. If you can't do it during the migration, you need to update your IDs (while they are still varchars, so that the change propagates to the referencing columns), drop the foreign keys, do the alter type, and then re-apply the foreign keys.
You can generate various UUIDs (for default values of the column) with the uuid-ossp module.
create extension "uuid-ossp";
alter table some_table
alter id_column
set default uuid_generate_v4();
Use a sequence as a default for the column:
create sequence some_id_sequence
start with 100000
owned by some_table.id_column;
The start with value should be bigger than your current maximum number.
Then use that sequence as a default for your column:
alter table some_table
alter id_column set default nextval('some_id_sequence')::text;
The better solution would be to change the column to an integer column. Storing numbers in a text (or varchar) column is a really bad idea.
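Rather than hard-coding the start value, the sequence can also be positioned from the existing data. The sketch below assumes every existing id_column value is a string of digits; if any value is not numeric, the cast will raise an error.

```sql
-- Assumption: every existing id_column value consists only of digits.
-- After this call, the next nextval() returns the current maximum + 1.
-- On an empty table the subquery yields NULL and the sequence is left
-- unchanged.
SELECT setval(
    'some_id_sequence',
    (SELECT max(id_column::bigint) FROM some_table)
);
```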

About clustered index in postgres

I'm using psql to access a postgres database. When viewing the metadata of a table, is there any way to see whether an index of a table is a clustered index?
I heard that the PRIMARY KEY of a table is automatically associated with a clustered index, is it true?
Note that PostgreSQL uses the term "clustered index" to mean something vaguely similar to, and yet very different from, what SQL Server means by it.
If a particular index has been nominated as the clustering index for a table, then psql's \d command will indicate the clustered index, e.g.,
Indexes:
"timezone_description_pkey" PRIMARY KEY, btree (timezone) CLUSTER
PostgreSQL does not nominate indices as clustering indices by default. Nor does it automatically arrange table data to correlate with the clustered index even when so nominated: the CLUSTER command has to be used to reorganise the table data.
In PostgreSQL the clustered attribute is held in the metadata of the corresponding index, rather than the relation itself. It is the indisclustered attribute in pg_index catalogue. Note, however, that clustering relations within postgres is a one-time action: even if the attribute is true, updates to the table do not maintain the sorted nature of the data. To date, automatic maintenance of data clustering remains a popular TODO item.
There is often confusion between clustered and integrated indexes, particularly since the popular textbooks use conflicting names, and the terminology is different again in the manuals of postgres and SQL server (to name just two). When I talk about an integrated index (also called a main index or primary index) I mean one in which the relation data is contained in the leaves of the index, as opposed to an external or secondary index in which the leaves contain index entries that point to the table records. The former type is necessarily always clustered. Unfortunately postgres only supports the latter type. Anyhow, the fact that an integrated (primary) index is always clustered may have given rise to the belief that "a PRIMARY KEY of a table is automatically associated with a clustered index". The two statements sound similar, but are different.
PostgreSQL does not have a direct implementation of a clustered index the way Microsoft SQL Server does.
Reference taken from this blog:
In PostgreSQL, there is a CLUSTER command, which is similar to a clustered index.
Once you have created your table's primary key or any other index, you can execute the CLUSTER command with that index name to physically order the table data.
When a table is clustered, it is physically reordered based on the index information. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no attempt is made to store new or updated rows according to their index order.
Syntax of Cluster:
The first time, you must execute CLUSTER with the index name.
CLUSTER table_name USING index_name;
Cluster the table:
Once you have executed CLUSTER with an index, the next time you only need to execute CLUSTER with the table name, because PostgreSQL remembers which index the table was clustered on.
CLUSTER table_name;
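One way to check the effect, sketched here with the same placeholder names table_name and index_name, is to refresh the planner statistics after clustering and look at the correlation column of the pg_stats view.

```sql
-- Reorder the table along the index, then refresh statistics.
CLUSTER table_name USING index_name;
ANALYZE table_name;

-- correlation close to 1 or -1 means the physical row order closely
-- follows the order of that column's values.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'table_name';
```

Because clustering is a one-time operation, the correlation for the indexed column can drift away from 1 again as rows are inserted and updated.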
is there any way to see whether an index of a table is a clustered index
PostgreSQL does not have a clustered index, so you won't be able to see them.
I heard that the PRIMARY KEY of a table is automatically associated with a clustered index, is it true?
No, that's not true (see above)
You can manually cluster a table along an index, but this is nothing that will be maintained automatically (as e.g. with SQL Server's clustered indexes).
For more details, see the description of the CLUSTER command in the manual.
Cluster Indexing
A clustered index tells the database to store rows with close key values physically close to one another on disk. A clustered index can uniquely identify the rows in the table. A table can have at most one clustered index, and it can cover more than one column. In some databases, such as SQL Server, a primary key gets a clustered index by default; in PostgreSQL it does not.
dictionary
A dictionary is itself like a table with a clustered index, because all the data is physically stored in alphabetical order.
Non-Cluster Indexing
A non-clustered index is like the index of a book. It is used only for fast retrieval of data and is not guaranteed to contain unique keys. A non-clustered index contains the index keys and pointers to the corresponding data locations. For example, a book's content index contains the key of a topic or chapter and its page location.
book content index
A book's table of contents holds the content name and its page location. The data is not guaranteed to be unique, because the same paragraph, text line, or word can appear many times.
PostgreSQL Indexing
PostgreSQL automatically creates indexes for the PRIMARY KEY and every UNIQUE constraint of a table. Log in to a database in the psql terminal and type \d table_name. All indexes on the table will be listed. If one of them is marked as the clustering index, that will be indicated as well.
Creating a table
CREATE TABLE IF NOT EXISTS profile(
uid serial NOT NULL UNIQUE PRIMARY KEY,
username varchar(30) NOT NULL UNIQUE,
phone varchar(11) NOT NULL UNIQUE,
age smallint CHECK(age>12),
address text NULL
);
Three indexes will be created automatically. All of these indexes are non-clustered:
"profile_pkey" PRIMARY KEY, btree (uid)
"profile_phone_key" UNIQUE CONSTRAINT, btree (phone)
"profile_username_key" UNIQUE CONSTRAINT, btree (username)
Create our own index with uid and username
CREATE INDEX profile_index ON profile(uid, username);
This actually creates a non-clustered index. To make it clustered, run the next part.
Transform a non-clustered index into a clustered one
ALTER TABLE profile CLUSTER ON profile_index;
Check the table with \d profile. It will be like this:
Table "public.profile"
Column | Type | Collation | Nullable | Default
----------+-----------------------+-----------+----------+--------------------------------------
uid | integer | | not null | nextval('profile_uid_seq'::regclass)
username | character varying(30) | | not null |
phone | character varying(11) | | not null |
age | smallint | | |
address | text | | |
Indexes:
"profile_pkey" PRIMARY KEY, btree (uid)
"profile_phone_key" UNIQUE CONSTRAINT, btree (phone)
"profile_username_key" UNIQUE CONSTRAINT, btree (username)
"profile_index" btree (uid, username) CLUSTER
Check constraints:
"profile_age_check" CHECK (age > 12)
Notice that profile_index is now marked "CLUSTER".
Now, re-cluster the table so that its rows follow the order of the clustering index:
CLUSTER profile;
If you want to know if a given table is CLUSTERed using SQL, you can use the following query to show the index being used (tested in Postgres versions 9.5 and 9.6):
SELECT
i.relname AS index_for_cluster
FROM
pg_index AS idx
JOIN
pg_class AS i
ON
i.oid = idx.indexrelid
WHERE
idx.indisclustered
AND idx.indrelid::regclass = 'your_table_name'::regclass;