Implicit Index for table - postgresql

I am learning PostgreSQL and databases in general. I have a simple statement like this and I want to understand what it does:
CREATE TABLE adempiere.c_mom(
c_mom_id NUMERIC(10,0) NOT NULL,
isactive character(1) DEFAULT 'Y'::bpchar NOT NULL,
start_date date NOT NULL,
start_time timestamp without time zone NOT NULL,
end_time timestamp without time zone NOT NULL,
CONSTRAINT c_mom_pkey PRIMARY KEY (c_mom_id)
);
So after I execute this I got
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "c_mom_pkey" for table "c_mom"
Now I know that my PK is c_mom_id, but what is the purpose of creating an implicit index under the name c_mom_pkey?
And what does DEFAULT 'Y'::bpchar mean, or in general, what does :: do in PostgreSQL?
Thank you

The :: notation is a PostgreSQL-specific type cast notation, in this case to type bpchar (blank-padded char).
An index is created to back primary keys to make them efficient. If there wasn't an index to back it, each insert statement would have to scan the whole table just to figure out if that insertion would create a duplicate key or not. Using an index speeds that up (dramatically if the table is large).
This is not PostgreSQL specific. A lot of relational databases will create unique indexes to back primary keys.
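As a rough illustration of both points, the sketch below casts a literal with the PostgreSQL-specific :: operator and with the standard CAST syntax, then lists the indexes on the table from the question through the pg_indexes system view. It assumes the adempiere.c_mom table above has already been created.

```sql
-- Two equivalent ways to cast a string literal to bpchar (blank-padded char).
SELECT 'Y'::bpchar         AS pg_style_cast,
       CAST('Y' AS bpchar) AS sql_standard_cast;

-- List the indexes on the table. The index backing the primary key
-- should be listed under the constraint name, c_mom_pkey.
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'adempiere'
  AND tablename  = 'c_mom';
```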

Related

Multicolumn index vs single column index for time series data in Postgres

This table started out as short-term storage for meter data before it was going to be validated and added to some long-term storage tables.
It turns out the client wants to keep this data for a long time, since we saved it, and it is growing fast.
create table metering_meterreading
(
id bigserial not null, -- primary key
created_at timestamp with time zone not null,
updated_at timestamp with time zone not null,
timestamp timestamp with time zone not null, -- BTREE index
value numeric(15, 3) not null,
meter_device_id uuid not null, -- FK to meter_device, BTREE index
series_id uuid not null, -- FK to series, BTREE index
organization_id uuid not null -- FK to org, BTREE index
);
I am planning on dropping the primary key since (org_id, meter_device_id, series_id, timestamp) makes a row unique. The id column was just added by my ORM (Django) and I didn't care when we started.
But since I pretty much always want to filter on organization, meter_device, and series to get a range of time series data, I am wondering if it would be more efficient to have a multicolumn index on (organization_id, meter_device_id, series_id, timestamp) instead of the separate indexes.
I read somewhere that if I had a range it should be the rightmost in the index.
This is still not a super efficient table for time series data, since it will grow large, but I am planning on fixing that by partitioning on range, or maybe even using Timescale. But before partitioning I would like it to be as efficient as possible to look up data in it.
I also saw an example somewhere that used a separate table to identify the metric:
-- column types are not given in the example; these are assumed from the table above
create table metric
(
id bigserial primary key,
organization_id uuid not null,
meter_device_id uuid not null,
series_id uuid not null,
unique (organization_id, meter_device_id, series_id)
);
create table metering_meterreading
(
metric_id bigint not null references metric (id), -- FK to metric, BTREE index
timestamp timestamp with time zone not null, -- BTREE index
value numeric(15, 3) not null,
created_at timestamp with time zone not null,
updated_at timestamp with time zone not null
);
But I am not sure if that is actually better than just putting them all in one table. It might impact the ingestion rate since there is another table involved now.
If (org_id, meter_device_id, series_id, timestamp) uniquely determine a table row, you need to use a multi-column primary key over all of them. So you automatically have a 4-column index on these columns. Just make sure that timestamp is last in the list, then that index will support your query ideally.
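A minimal sketch of that change against the first table definition in the question, followed by the kind of range query the resulting index is meant to serve. The constraint name metering_meterreading_pkey is an assumption (PostgreSQL's default naming), and the UUIDs and dates are placeholders, not values from the question.

```sql
-- Assumption: the existing primary key constraint has PostgreSQL's default
-- name. Check with \d metering_meterreading before running this.
ALTER TABLE metering_meterreading
    DROP CONSTRAINT metering_meterreading_pkey;

-- Equality columns first, the range column (timestamp) last.
ALTER TABLE metering_meterreading
    ADD PRIMARY KEY (organization_id, meter_device_id, series_id, "timestamp");

-- A typical lookup: equality on the three leading columns, a range on the
-- last one. EXPLAIN shows whether the planner picks the new index.
EXPLAIN
SELECT "timestamp", value
FROM metering_meterreading
WHERE organization_id = '00000000-0000-0000-0000-000000000001'
  AND meter_device_id = '00000000-0000-0000-0000-000000000002'
  AND series_id       = '00000000-0000-0000-0000-000000000003'
  AND "timestamp" >= '2024-01-01 00:00:00+00'
  AND "timestamp" <  '2024-02-01 00:00:00+00';
```

Once the composite key exists, the separate single-column indexes may be redundant for this access pattern, but whether to drop them depends on the other queries that hit the table.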

Slow UPDATE appending on array inside JSONB column

I have a table with the following structure (massiveJS default):
CREATE TABLE "myTable" (
id uuid NOT NULL DEFAULT uuid_generate_v4(),
body jsonb NOT NULL,
created_at timestamptz NULL DEFAULT now(),
updated_at timestamptz NULL,
CONSTRAINT mytable_pkey PRIMARY KEY (id)
);
This table has a GIN index on the body column and a btree index on the primary key (default).
The json object stored inside the body column has an array that can grow really large (up to 50k items).
We use this statement to append new items to the array:
UPDATE "myTable"
SET body = jsonb_set(body, '{myArray}', (body->'myArray')
|| '["item1", "item2", "item3", "itemx"]')
WHERE id = 'e6e325da-0e8b-4d2e-9481-bc7e03c195b1'
The system updates this array in several records frequently and when the system is under heavy load this UPDATE starts getting really slow, PostgreSQL CPU goes to 100% and the whole application slows down.
Is there any way we could speed up this UPDATE statement so that we can append or remove elements from the array without putting so much load on PostgreSQL?

How to create TimescaleDB Hypertable with time partitioning on non unique timestamp?

I have just started to use TimescaleDB and want to create a hypertable on a table with events.
Originally I thought of following the conventional pattern of:
CREATE TABLE event (
id serial PRIMARY KEY,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
CREATE INDEX event_ts_idx on event(ts);
However, when I tried to create the hypertable with the following query:
SELECT create_hypertable('event', 'ts');
I got: ERROR: cannot create a unique index without the column "ts" (used in partitioning)
After doing some research, it seems that the timestamp itself needs to be the (or part of the) primary key.
However, I do not want the timestamp ts to be unique. It is very likely that these high frequency events will coincide in the same microsecond (the maximum resolution of the timestamp type). It is the whole reason why I am looking into TimescaleDB in the first place.
What is the best practice in this case?
I was thinking of maybe keeping the serial id as part of the primary key, and making it composite like this:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
PRIMARY KEY (id, ts)
);
SELECT create_hypertable('event_hyper', 'ts');
This sort of works, but I am unsure if it is the right approach, or if I am creating a complicated primary key which will slow down inserts or create other problems.
What is the right approach when you have possible collision in timestamps when using TimescaleDB hypertables?
There is no need to create a unique constraint on the time dimension (unique constraints are not required). This works:
CREATE TABLE event (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event', 'ts');
Note that the primary key on id is removed.
If you want to create a unique constraint or primary key, TimescaleDB requires that it include the time dimension. This is similar to PostgreSQL's limitation in declarative partitioning, which requires the partition key to be included in any unique constraint:
Unique constraints (and hence primary keys) on partitioned tables must include all the partition key columns. This limitation exists because PostgreSQL can only enforce uniqueness in each partition individually.
TimescaleDB also enforces uniqueness in each chunk individually. Maintaining uniqueness across chunks can affect ingest performance dramatically.
The most common way to fix the issue with the primary key is to create a composite key that includes the time dimension, as proposed in the question. If the index on the time dimension is not needed (no queries filtering only on time are expected), then creating that index can be skipped:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
PRIMARY KEY (id, ts)
);
SELECT create_hypertable('event_hyper', 'ts', create_default_indexes => FALSE);
It is also possible to use an integer column as the time dimension. It is important that such a column has the properties of a time dimension: its value increases over time, which is important for insert performance, and queries select a range of it, which is critical for query performance on a large database. The common case is storing a Unix epoch.
Since id in event_hyper is SERIAL, it will increase with time. However, I doubt that queries will select a range on it. For completeness, the SQL would be:
CREATE TABLE event_hyper (
id serial PRIMARY KEY,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event_hyper', 'id', chunk_time_interval => 1000000);
To build on @k_rus's answer, it seems like the generated primary key here is not actually what you're looking for. What meaning does that id have? Isn't it just identifying a unique (details, ts) combination? Or can there meaningfully be two rows that have the same timestamp and the same details but different ids, where the difference carries some semantic meaning? That seems somewhat nonsensical to me, in which case I would put the primary key on (details, ts), which should provide the uniqueness condition that you need. I do not know if your ORM will like this; ORMs tend to be overly dependent on generated primary keys because, among other things, not all databases support composite primary keys. But in general, my advice for cases like this is to use a composite primary key with logical meaning.
Now if you actually care about multiple messages with the same details at the same timestamp, I might suggest a table structure something like
CREATE TABLE event_hyper (
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
count int,
PRIMARY KEY (details, ts)
);
with which you can do an INSERT ON CONFLICT DO UPDATE in order to increment it.
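A minimal sketch of that upsert against the table just above. The timestamp and details values are made up for illustration.

```sql
-- The first event with a given (details, ts) pair inserts a row with count = 1.
-- A repeat of the same pair hits the primary key and increments the existing
-- row's count instead of inserting a second row.
INSERT INTO event_hyper (ts, details, count)
VALUES ('2024-01-01 12:00:00+00', 'door opened', 1)
ON CONFLICT (details, ts)
DO UPDATE SET count = event_hyper.count + EXCLUDED.count;
```

EXCLUDED refers to the row that was proposed for insertion, so the same statement also works for batches where a single row carries a count greater than 1.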
I wish that ORMs were better about doing this sort of thing, but you can usually trick an ORM into reading from other tables, or from a view over them (with a view, the ORM assumes it can't update records, and updating is why it wants the generated PK in the first place). Then it just means that there's a little bit of custom ingest code to write that inserts into the hypertable. It's often better to do this anyway because, in general, I've found that ORMs don't always follow best practices for high-volume inserts, and often don't use bulk loading techniques.
So a table like that, with a view that just selects * from the table, should allow you to use the ORM for reads and write a very small amount of custom code to do ingest into the time series table, and voila, it works. The rest of your relational model, which is the part the ORM excels at, can live in the ORM, with a minor integration here consisting of a bit of custom SQL and a few custom methods.
The limitation is:
All partitioning columns (primary and secondary, if any) need to be part of a unique key of the table.
Refer: https://github.com/timescale/timescaledb/issues/447#issuecomment-369371441
There are 2 choices in my opinion:
partition by a single column that is a unique key (e.g. the primary key), or
partition with a second, space partition key, in which case the 2 columns need to form a combined unique key.
I got the same problem.
The solution was to avoid this field:
id: 'id'
I think I'm replying a little bit too late, but still.
You can try something like this:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event_hyper', 'ts', partitioning_column => 'id', number_partitions => X);
Where X is the desired number of hash partitions on column 'id'.
https://docs.timescale.com/api/latest/hypertable/create_hypertable/#optional-arguments
As you can also notice there's no PRIMARY KEY constraint in table 'event_hyper'.
Output of create_hypertable() operation should be:
create_hypertable
---------------------------
(1,public,event_hyper,t)

References to multiple tables in PostgreSQL

I have many time series stored in a PostgreSQL database over multiple tables. I would like to create a table 'anomalies' which references time series with particular behaviour, for instance a value that is exceptionally high.
My question is the following: what is the best way to link the entries of 'anomalies' with other tables?
I could create a foreign key in each table referencing an entry in anomalies, but then it would not be so obvious how to go from the anomaly to the entry referencing it.
The other possibility I see is to store the name of the corresponding table in the entries of anomalies, but that does not seem like a good idea, as the table name might change, or the table might get deleted.
Is there a more elegant solution to do this?
CREATE TABLE type_1(
type_1_id SERIAL PRIMARY KEY,
type_1_name TEXT NOT NULL,
unique(type_1_name)
);
CREATE TABLE type_1_ts(
date DATE NOT NULL,
value REAL NOT NULL,
type_1_id INTEGER REFERENCES type_1(type_1_id) NOT NULL,
PRIMARY KEY(type_1_id, date)
);
CREATE TABLE type_2(
type_2_id SERIAL PRIMARY KEY,
type_2_name TEXT NOT NULL,
unique(type_2_name)
);
CREATE TABLE type_2_ts(
date DATE NOT NULL,
value REAL NOT NULL,
state INTEGER NOT NULL,
type_2_id INTEGER REFERENCES type_2(type_2_id) NOT NULL,
PRIMARY KEY(type_2_id, date)
);
CREATE TABLE anomalies(
anomaly_id SERIAL PRIMARY KEY,
date DATE NOT NULL,
property TEXT NOT NULL,
value REAL NOT NULL,
-- reference to a table_name and an entry id?
table_name TEXT,
data_id INTEGER
);
What I'd like to do at the end is to be able to do:
SELECT * FROM ANOMALIES WHERE table_name='type_1',
or simply list the data_type corresponding to the entries

PostgreSQL bigserial & nextval

I've got a PostgreSQL 9.4.3 server set up. Previously I was only using the public schema, where for example I created a table like this:
CREATE TABLE ma_accessed_by_members_tracking (
reference bigserial NOT NULL,
ma_reference bigint NOT NULL,
membership_reference bigint NOT NULL,
date_accessed timestamp without time zone,
points_awarded bigint NOT NULL
);
Using the Windows Program PgAdmin III I can see it created the proper information and sequence.
However I've recently added another schema called "test" to the same database and created the exact same table, just like before.
However this time I see:
CREATE TABLE test.ma_accessed_by_members_tracking
(
reference bigint NOT NULL DEFAULT nextval('ma_accessed_by_members_tracking_reference_seq'::regclass),
ma_reference bigint NOT NULL,
membership_reference bigint NOT NULL,
date_accessed timestamp without time zone,
points_awarded bigint NOT NULL
);
My question / curiosity is why in the public schema the reference column shows bigserial, but in the test schema it shows bigint with a nextval?
Both work as expected. I just do not understand why the difference in schemas would show different table definitions. I realize that bigint and bigserial allow the same range of integers to be used.
Merely A Notational Convenience
According to the documentation on Serial Types, smallserial, serial, and bigserial are not true data types. Rather, they are a notation for creating both a sequence and a column whose default value points to that sequence in one step.
I created a test table in the public schema. The psql command \d shows the column type as bigint. Maybe it's PgAdmin behavior?
Update
I checked the PgAdmin source code. The function pgColumn::GetDefinition() scans the pg_depend table for an auto dependency and, when it finds one, replaces bigint with bigserial to reproduce the original CREATE TABLE code.
When you create a serial column in the standard way:
CREATE TABLE new_table (
new_id serial);
Postgres creates a sequence with commands:
CREATE SEQUENCE new_table_new_id_seq ...
ALTER SEQUENCE new_table_new_id_seq OWNED BY new_table.new_id;
From the documentation: The OWNED BY option causes the sequence to be associated with a specific table column, such that if that column (or its whole table) is dropped, the sequence will be automatically dropped as well.
The standard name of a sequence is built from the table name, the column name and the suffix _seq.
If a serial column was created in this way, PgAdmin shows its type as serial.
If a sequence has a non-standard name or is not associated with a column, PgAdmin shows nextval() as the default value.
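One way to check which case applies is to ask PostgreSQL whether a sequence is owned by the column, using the built-in pg_get_serial_sequence function. This is only a sketch against the two tables from the question; it says nothing about why the ownership might differ between them.

```sql
-- Returns the name of the sequence owned by the column, or NULL if the
-- column's default uses a sequence that is not associated with it
-- (or if the column has no sequence at all).
SELECT pg_get_serial_sequence('public.ma_accessed_by_members_tracking', 'reference');
SELECT pg_get_serial_sequence('test.ma_accessed_by_members_tracking', 'reference');
```

A NULL result for the test schema table would match the second case described above, where PgAdmin falls back to showing nextval() as the default.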