How to create TimescaleDB Hypertable with time partitioning on non-unique timestamp? - postgresql

I have just started to use TimescaleDB and want to create a hypertable on a table with events.
Originally I thought of following the conventional pattern of:
CREATE TABLE event (
id serial PRIMARY KEY,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
CREATE INDEX event_ts_idx on event(ts);
However, when I tried to create the hypertable with the following query:
SELECT create_hypertable('event', 'ts');
I got: ERROR: cannot create a unique index without the column "ts" (used in partitioning)
After doing some research, it seems that the timestamp itself needs to be the (or part of the) primary key.
However, I do not want the timestamp ts to be unique. It is very likely that these high frequency events will coincide in the same microsecond (the maximum resolution of the timestamp type). It is the whole reason why I am looking into TimescaleDB in the first place.
What is the best practice in this case?
I was thinking of maybe keeping the serial id as part of the primary key, and making it composite like this:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
PRIMARY KEY (id, ts)
);
SELECT create_hypertable('event_hyper', 'ts');
This sort of works, but I am unsure if it is the right approach, or if I am creating a complicated primary key which will slow down inserts or create other problems.
What is the right approach when you have possible collision in timestamps when using TimescaleDB hypertables?

There is no need to create a unique constraint on the time dimension (unique constraints are not required at all). This works:
CREATE TABLE event (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event', 'ts');
Note that the primary key on id is removed.
If you want to create a unique constraint or a primary key, then TimescaleDB requires that it include the time dimension. This is similar to the limitation of PostgreSQL declarative partitioning, which requires the partition key to be part of any unique constraint:
Unique constraints (and hence primary keys) on partitioned tables must include all the partition key columns. This limitation exists because PostgreSQL can only enforce uniqueness in each partition individually.
TimescaleDB also enforces uniqueness in each chunk individually. Maintaining uniqueness across chunks can affect ingest performance dramatically.
The most common way to fix the issue with the primary key is to create a composite key that includes the time dimension, as proposed in the question. If a separate index on the time dimension is not needed (i.e., no queries filter on time alone), the default index on the time dimension can be skipped:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
PRIMARY KEY (id, ts)
);
SELECT create_hypertable('event_hyper', 'ts', create_default_indexes => FALSE);
It is also possible to use an integer column as the time dimension. It is important that such a column has the properties of a time dimension: its value increases over time, which matters for insert performance, and queries select ranges on it, which is critical for query performance over a large database. The common case is storing a Unix epoch.
Since id in event_hyper is SERIAL, it will increase with time. However, I doubt that queries will select ranges on it. For completeness, the SQL would be:
CREATE TABLE event_hyper (
id serial PRIMARY KEY,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event_hyper', 'id', chunk_time_interval => 1000000);
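To illustrate the point about range queries, chunk exclusion on such a hypertable only helps when the query constrains the dimension column to a range; the values here are purely illustrative:
SELECT *
FROM event_hyper
WHERE id >= 5000000 AND id < 6000000;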

To build on @k_rus's answer, it seems like the generated primary key here is not actually what you're looking for. What meaning does that id have? Isn't it just identifying a unique (details, ts) combination? Or can there meaningfully be two rows with the same timestamp and the same details but different ids, where that difference carries some semantic meaning? That seems somewhat nonsensical to me, in which case I would put a primary key on (details, ts), which should give you the uniqueness condition that you need. I don't know whether your ORM will like this; ORMs tend to be overly dependent on generated primary keys because, among other things, not all databases support composite primary keys. But in general, my advice for cases like this is to use a composite primary key with logical meaning.
Now if you actually care about multiple messages with the same details at the same timestamp, I might suggest a table structure something like
CREATE TABLE event_hyper (
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
count int,
PRIMARY KEY (details, ts)
);
with which you can do an INSERT ON CONFLICT DO UPDATE in order to increment it.
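A minimal sketch of that upsert against the event_hyper table above (the literal values are just placeholders):
INSERT INTO event_hyper (ts, details, count)
VALUES ('2020-01-01 10:00:00+00', 'some event', 1)
ON CONFLICT (details, ts)
DO UPDATE SET count = event_hyper.count + 1;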
I wish that ORMs were better about this sort of thing, but you can usually trick an ORM into reading from another table (or from a view over it; the ORM then assumes it cannot update records there, which is why it wants a generated PK in the first place). It then just means writing a little bit of custom ingest code that inserts into the hypertable. That is often better anyway: in general I've found that ORMs don't always follow best practices for high-volume inserts, and often don't use bulk-loading techniques.
So a table like that, with a view that simply selects * from it, should let you use the ORM for reads, while a very small amount of custom code handles ingest into the time-series table, and voila, it works. The rest of your relational model, the part the ORM excels at, can live in the ORM, with a minor integration here consisting of a bit of custom SQL and a few custom methods.
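A minimal sketch of that setup; the view name event_hyper_read is just an example:
CREATE VIEW event_hyper_read AS
SELECT * FROM event_hyper;
The ORM is then pointed at event_hyper_read for reads, while the small custom ingest path writes directly to event_hyper.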

The limitation is:
All partitioning columns (the primary one and, if any, the secondary one) need to be part of a unique key of the table.
Refer: https://github.com/timescale/timescaledb/issues/447#issuecomment-369371441
Two choices in my opinion:
partition by a single column which is a unique key (e.g. the primary key), or
partition with a second, space partitioning key, and make the two columns a combined unique key (see the sketch below).
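A hedged sketch of the second choice, with hash (space) partitioning on id in addition to time partitioning on ts and a combined unique key covering both columns; the partition count of 4 is arbitrary:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL,
UNIQUE (ts, id)
);
SELECT create_hypertable('event_hyper', 'ts', 'id', 4);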

I had the same problem.
The solution was to avoid this field:
id: 'id'

I think I'm replying a little bit too late, but still.
You can try something like this:
CREATE TABLE event_hyper (
id serial,
ts timestamp with time zone NOT NULL,
details varchar(255) NOT NULL
);
SELECT create_hypertable('event_hyper', 'ts', partitioning_column => 'id', number_partitions => X);
where X is the desired number of hash partitions on column 'id'.
https://docs.timescale.com/api/latest/hypertable/create_hypertable/#optional-arguments
As you can also see, there is no PRIMARY KEY constraint on table 'event_hyper'.
The output of the create_hypertable() call should be:
create_hypertable
---------------------------
(1,public,event_hyper,t)

Related

Would this PostgresQL model work for long-term use and security?

I'm making a real-time chat app and was stuck figuring out how the DB model should look. I've made this diagram, but would this work? My issue is more to do with foreign keys.
I know this is a very vague question, but I have been struggling with this model for a while now. This is the first database I'm setting up, so it probably has a load of errors.
Actually you are fairly close, but have over-complicated it a bit. At the conceptual/logical level you have just two entities, Users and Messages, with a many-to-many relationship. At the physical level the Channels table resolves the M:M into the two one-to-many relationships you have described. But viewing it this way reveals a couple of issues. The attribute user is not required in the Messages table, and if physically implemented it requires a validation, not easily done, that the user there actually exists in the Channels table. Further, everything the Message:User relationship provides is already available via the Users:Channels:Messages relationship. A similar argument applies to the Channels column in Users: it is completely resolved by the resolution table. Suggestion: drop user from the messages table and channels from users.
Now let's look at the columns of Channels. It looks like you are using boilerplate for created_at and updated_at, but are they necessary? At least for updated_at the answer is no. What could be updated? If either the User or the Message changes, you have a brand new entry. It may seem like the same physical row (actually it is not), but its meaning is completely different. How about last message? What does it indicate that the maximum created_at for the user does not already give you? I cannot see anything. I suppose you could change created_at, but what is the point of tracking when that column was changed? Suggestion: drop last message sent and updated_at (unless required by institution standards) from the message table.
That leaves the Users table itself. Besides Channels, mentioned above, there is the Contacts column. Physically, as an array, it violates 1NF and becomes difficult to manage (as well as to validate that each contact is in fact a user). Logically it creates an M:M relationship of User:User. So resolve it the same way as User:Messages: pull it out into another table, say User_Contacts, with two foreign keys to the Users table. Suggestion: drop contacts from the users table and create a resolution table.
Unfortunately, I do not have a good ERD diagrammer, so I just provide DDL.
create table users (
user_id integer generated always as identity primary key
, name text
, phone_number text
, last_login timestamptz
, created_at timestamptz
, updated_at timestamptz
) ;
create type message_type as enum ('short', 'long'); -- list all values
create table messages(
msg_id integer generated always as identity primary key
, msg_type message_type
, message text
, created_at timestamptz
, updated_at timestamptz
);
create table channels( -- resolves M:M Users:Messages
user_id integer
, msg_id integer
, created_at timestamptz
, constraint channels_pk
primary key (user_id, msg_id)
, constraint channels_2_users_fk
foreign key (user_id)
references users(user_id)
, constraint channels_2_messages_fk
foreign key (msg_id)
references messages(msg_id )
);
create table user_contacts( -- resolves M:M Users:Users
user_id integer
, contact_id integer
, created_at timestamptz
, constraint user_contacts_pk
primary key (user_id, contact_id)
, constraint user_2_users_fk
foreign key (user_id)
references users(user_id)
, constraint contact_2_user_fk
foreign key (contact_id)
references users(user_id)
, constraint contact_not_me_check check (user_id <> contact_id)
);
Notes:
Do not use text as PK, use either integer (bigint) or UUID, and generate them during insert.
Caution on ENUM. In Postgres you can add new values, but you cannot remove a value. Depending upon the number of values and how often they change, consider creating a lookup/reference table for them (a sketch follows these notes).
Do not use the data type TIME. It is really not that useful without the date. A simple example: I log in today at 15:00, you log in tomorrow at 13:00. Now, from the database alone, which of us logged in first?
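A minimal sketch of such a lookup table as an alternative to the enum (the names are illustrative, not part of the design above):
create table message_types (
msg_type text primary key
);
insert into message_types (msg_type) values ('short'), ('long');
-- messages.msg_type would then be declared as: msg_type text references message_types(msg_type)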

Converting PostgreSQL table to TimescaleDB hypertable

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
event_time timestamp with time zone NOT NULL,
pair_id integer NOT NULL,
entry_id bigint NOT NULL,
event_data int NOT NULL,
CONSTRAINT con1 UNIQUE (pair_id, entry_id ),
CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
REFERENCES public.pairs (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
migrate_data => TRUE
);
I get the error: ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From this post, How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing, my understanding is that this is because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert it? I have added some details below on how I plan to use the data and on its structure.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_id's which are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 Billion rows but should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would have downsides for compression and provide less flexibility in querying the data. Also, I would probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand unique constraint con1 to include event_time - would that not decrease performance somewhat, while at the same time opening the door to accidentally inserting two rows with the same entry_id and pair_id but different event_time? (I doubt this is likely to happen, though.)
You understand correctly that UNIQUE (pair_id, entry_id) doesn't allow creating a hypertable from the table, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, will help with the uniqueness.
Removing the unique constraint will allow creating the hypertable, but, as you mentioned, you lose the ability to check the constraint.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows inserting duplicates with different timestamps, as you mentioned. It will also perform worse than option 2 during inserts. You can replace the index on event_time (which you need, since you query on this column, and which TimescaleDB creates automatically) with the unique index, so you save a little bit, e.g.:
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
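Putting that option together, a rough sketch of the conversion, assuming the existing constraint can be dropped and the default time index skipped in favor of the unique one:
ALTER TABLE public.data DROP CONSTRAINT con1;
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
create_default_indexes => FALSE,
migrate_data => TRUE
);
CREATE UNIQUE INDEX indx ON public.data (event_time, pair_id, entry_id);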
Manually create a unique constraint on each chunk table. This will guarantee uniqueness within the chunk, but it will still be possible to have duplicates across different chunks. The main drawback is that you will need to figure out how to create the constraint whenever a new chunk is created.
Unique constraints without the partition key are not supported in TimescaleDB, since checking uniqueness would require accessing all existing chunks, which would kill performance (or would require a global index, which can be large). I don't think unique constraints are a common case for time-series data, as they are usually tied to artificially generated counter-based identifiers.

TimescaleDB/PostgreSQL: how to use unique constraint when creating hypertables?

I am trying to create a table in PostgreSQL that will contain lots of data, and for that reason I want to use a TimescaleDB hypertable, as in the example below.
CREATE TABLE "datapoints" (
"tstz" timestamptz NOT NULL,
"id" bigserial UNIQUE NOT NULL,
"entity_id" bigint NOT NULL,
"value" real NOT NULL,
PRIMARY KEY ("id", "tstz", "entity_id")
);
SELECT create_hypertable('datapoints','tstz');
However, this throws the error shown below. As far as I have figured out, the error arises because such a unique constraint isn't allowed in hypertables, but I really need the uniqueness. Does anyone have an idea how to solve or work around this?
ERROR: cannot create a unique index without the column "tstz" (used in partitioning)
SQL state: TS103
There is no way to avoid that.
TimescaleDB uses PostgreSQL partitioning, and it is not possible to have a primary key or unique constraint on a partitioned table that does not contain the partitioning key.
The reason behind that is that an index on a partitioned table consists of individual indexes on the partitions (these are the partitions of the partitioned index). Now the only way to guarantee uniqueness for such a partitioned index is to have the uniqueness implicit in the definition, which is only the case if the partitioning key is part of the index.
So you either have to sacrifice the uniqueness constraint on id (uniqueness is pretty much a given anyway if you use a sequence) or you have to do without partitioning.
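For completeness, a minimal sketch of the first option: dropping the standalone UNIQUE on "id" and keeping only the composite primary key, which already contains the partitioning column "tstz", so "id" is no longer globally unique by constraint:
CREATE TABLE "datapoints" (
"tstz" timestamptz NOT NULL,
"id" bigserial NOT NULL,
"entity_id" bigint NOT NULL,
"value" real NOT NULL,
PRIMARY KEY ("tstz", "id", "entity_id")
);
SELECT create_hypertable('datapoints', 'tstz');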

Index to query sorted values in keyed time range

Suppose I have key/value/timerange tuples, e.g.:
CREATE TABLE historical_values(
key TEXT,
value NUMERIC,
from_time TIMESTAMPTZ,
to_time TIMESTAMPTZ
)
and would like to be able to efficiently query values (sorted descending) for a specific key and time, e.g.:
SELECT value
FROM historical_values
WHERE
key = [KEY]
AND from_time <= [TIME]
AND to_time >= [TIME]
ORDER BY value DESC
What kind of index/types should I use to get the best lookup performance? I suspect my solution will involve a tstzrange and a gist index, but I'm
not sure how to make that play well with the key matching and value ordering requirements.
Edit: Here's some more information about usage.
Ideally uses features available in Postgres v9.6.
Relation will contain approx. 1k keys and 5m values per key. Values are large integers (up to 32 bytes), mostly unique. Time ranges between few hours to a couple years. Time horizon is 5 years. No NULL values allowed, but some time ranges are open-ended (could either use NULL or a time far into the future for to_time).
The primary key is the key and time range (as there is only one historical value for a time range, per key).
Common operations are a) updating to_time to "close" a historical value, and b) inserting a new value with from_time = NOW.
All values may be queried. Partitioning is an option.
DB design
For a big table like that ("1k keys and 5m values per key") I would suggest optimizing storage like this:
CREATE TABLE hist_keys (
key_id serial PRIMARY KEY
, key text NOT NULL UNIQUE
);
CREATE TABLE hist_values (
hist_value_id bigserial PRIMARY KEY -- optional, see below!
, key_id int NOT NULL REFERENCES hist_keys
, value numeric
, from_time timestamptz NOT NULL
, to_time timestamptz NOT NULL
, CONSTRAINT range_valid CHECK (from_time <= to_time) -- or < ?
);
Also helps index performance.
And consider partitioning: list-partitioning on key_id. Maybe even add sub-partitioning (range partitioning this time) on from_time. Read the manual here.
With one partition per key_id (and constraint exclusion enabled!), Postgres would only look at the small partition (and index) for the given key, instead of the whole big table. Major win.
But I would strongly suggest upgrading to at least Postgres 10 first, which added "declarative partitioning". It makes managing partitions a lot easier.
Better yet, skip forward to Postgres 11 (currently beta), which adds major improvements for partitioning (incl. performance improvements). Most notably, for your goal to get the best lookup performance, quoting the chapter on partitioning in release notes for Postgres 11 (currently beta):
Allow faster partition elimination during query processing (Amit Langote, David Rowley, Dilip Kumar)
This speeds access to partitioned tables with many partitions.
Allow partition elimination during query execution (David Rowley, Beena Emerson)
Previously partition elimination could only happen at planning time,
meaning many joins and prepared queries could not use partition elimination.
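To make the list-partitioning suggestion concrete, a rough sketch in declarative syntax (Postgres 10+). The partition names and key_id values are illustrative, and the PK and FK from the design above are omitted here because their support on partitioned tables depends on the Postgres version:
CREATE TABLE hist_values (
hist_value_id bigserial
, key_id int NOT NULL
, value numeric
, from_time timestamptz NOT NULL
, to_time timestamptz NOT NULL
, CONSTRAINT range_valid CHECK (from_time <= to_time)
) PARTITION BY LIST (key_id);

CREATE TABLE hist_values_key_1 PARTITION OF hist_values FOR VALUES IN (1);
CREATE TABLE hist_values_key_2 PARTITION OF hist_values FOR VALUES IN (2);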
Index
From the perspective of the value column, the small subset of selected rows is arbitrary for every new query. I don't expect you'll find a useful way to support ORDER BY value DESC with an index. I'd concentrate on the other columns. Maybe add value as last column to each index if you can get index-only scans out of it (possible for btree and GiST).
Without partitioning:
CREATE UNIQUE INDEX hist_btree_idx ON hist_values (key_id, from_time, to_time DESC);
UNIQUE is optional, but see below.
Note the importance of opposing sort orders for from_time and to_time. See (closely related!):
Optimizing queries on a range of timestamps (two columns)
This is almost the same index as the one implementing your PK on (key_id, from_time, to_time). Unfortunately, we cannot use it as PK index. Quoting the manual:
Also, it must be a b-tree index with default sort ordering.
So I added a bigserial as surrogate primary key in my suggested table design above and NOT NULL constraints plus the UNIQUE index to enforce your uniqueness rule.
In Postgres 10 or later consider an IDENTITY column instead:
Auto increment table column
You might even do without a PK constraint in this exceptional case, to avoid duplicating the index and to keep the table at minimum size. It depends on the complete situation. You may need it for FK constraints or similar. See:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
A GiST index like you already suspected may be even faster. I suggest to keep your original timestamptz columns in the table (16 bytes instead of 32 bytes for a tstzrange) and add key_id after installing the additional module btree_gist:
CREATE INDEX hist_gist_idx ON hist_values
USING GiST (key_id, tstzrange(from_time, to_time, '[]'));
The expression tstzrange(from_time, to_time, '[]') constructs a range including upper and lower bound. Read the manual here.
Your query needs to match the index:
SELECT value
FROM hist_values
WHERE key_id = [KEY_ID]
AND tstzrange(from_time, to_time, '[]') @> [TIME]::timestamptz
ORDER BY value DESC;
It's equivalent to your original.
@> being the range "contains" operator.
With list-partitioning on key_id
With a separate table for each key_id, we can omit key_id from the index, improving size and performance - especially for the GiST index - for which we then also don't need the additional module btree_gist. Results in ~ 1000 partitions and the corresponding indexes:
CREATE INDEX hist999_gist_idx ON hist_values USING GiST (tstzrange(from_time, to_time, '[]'));
Related:
Store the day of the week and time?

How to ensure the accuracy of aggregated data in PostgreSQL table?

I have two PostgreSQL tables: table A contains individual clients' credit movement records (increases/decreases) and table B contains aggregated data from table A. Simplified structure of the tables (I removed FKs and rules):
CREATE TABLE "public"."credit_review" (
"id" SERIAL,
"client_id" INTEGER NOT NULL,
"credit_change" INTEGER DEFAULT 0 NOT NULL,
"itime" TIMESTAMP(0) WITH TIME ZONE DEFAULT now()
) WITHOUT OIDS;
CREATE TABLE "public"."credit_review_aggregated" (
"id" SERIAL,
"credit_amount" INT DEFAULT 0 NOT NULL,
"valid_to_review_id" INT NOT NULL,
"client_id" INTEGER NOT NULL,
"itime" TIMESTAMP(0) WITH TIME ZONE DEFAULT now()
) WITHOUT OIDS;
Column "credit_review_aggregated.valid_to_review_id" is FK to "credit_review.id".
Because it is very important for the data in the aggregation table to be correct, I'm looking for a way to ensure this. What occurred to me:
Disable deleting and updating records in both tables
On the aggregated table, create a trigger to check whether the entered data is correct (and if not, reject the insert). I don't like this too much because, when a record is inserted into the aggregation table, the credit_amount value would be computed twice (once in the application and a second time in the trigger).
Do you have any advice for me on how to handle this?
I'm not entirely clear on the invariant you're trying to enforce, but from the general outlines of the problem, I would be inclined to use trigger code to enforce it, and use SERIALIZABLE transactions. Enforcing invariants across multiple tables becomes very tricky very quickly otherwise.
http://wiki.postgresql.org/wiki/SSI
Full disclosure: Because my employer needed to enforce complex integrity rules across multiple tables, I worked on adding SSI to PostgreSQL, along with Dan R.K. Ports of MIT.
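For illustration only, a minimal sketch of the serializable approach, writing the detail row and refreshing the aggregate row in one transaction; client_id 42 and the aggregation query are placeholders, not your actual business rule:
BEGIN ISOLATION LEVEL SERIALIZABLE;

INSERT INTO credit_review (client_id, credit_change)
VALUES (42, 100);

INSERT INTO credit_review_aggregated (client_id, credit_amount, valid_to_review_id)
SELECT client_id, sum(credit_change), max(id)
FROM credit_review
WHERE client_id = 42
GROUP BY client_id;

COMMIT;
Under SERIALIZABLE isolation, concurrent transactions that would make the aggregate inconsistent fail with a serialization error and can be retried, which is what makes this approach workable across multiple tables.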