Can a foreign key be created to point to a non-unique column in a date-range-versioned table in PostgreSQL?

I would like to do two things in PostgreSQL:
version rows in a table by a date range
ensure the integrity of the table by setting up single column foreign keys
It seems I can only do one of the two at a time.
Detailed example:
I need to version the contents of a table by date range, so that at any particular point in time there is only one matching row: the (customId, validFrom, validUntil) combination is covered by a unique index and there are no overlapping ranges. It is important, though, that none of those columns is unique by itself.
Using this method I can query my table and get the valid entity for any point in time, but I could not figure out how to link another table to this one via the customId key so that referential integrity is enforced.
The problem is that customId is not unique on its own, since the same key appears once for every range that has been recorded.
One solution I have used before is creating a separate x_history table when I am only interested in the latest state of the entity, copying the old state into the history table on every change. That would not work well here, though, because it is essentially "random" which version of the data a select is interested in, so I would constantly have to query both tables.
Example by data:
table a:
id (PK)
custom_id (unique at any single point in time via the above composite unique index)
valid_from (timestamp, storing the start of the validity of a)
valid_until (timestamp, storing the end of the validity of a)
table b:
id (PK)
a__custom_id (unique at any single point in time)
valid_from (timestamp, storing the start of the validity of b)
valid_until (timestamp, storing the end of the validity of b)
I would like to insert only those rows into table b for which:
b.a__custom_id exists in a.custom_id
the combination (b.a__custom_id, b.valid_from, b.valid_until) is unique

You cannot easily have both foreign keys and historical data.
One way would be to have the validity range as part of the primary key, but then you have to update many rows whenever you modify an entry in the referenced table.
I think you can get away with a history table if you include the currently active version in the history table. Then you can just query the history table, and the table with the current values is just there for foreign keys.
The history table would have an exclusion constraint over the primary key and the time range.
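A minimal sketch of that layout, assuming a text custom_id and timestamptz validity columns (all table and column names here are illustrative, not taken from the question):
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- current values only; this is the table foreign keys point at
CREATE TABLE a_current (
    custom_id text PRIMARY KEY
);

-- full history, including the currently active version
CREATE TABLE a_history (
    custom_id   text NOT NULL REFERENCES a_current (custom_id),
    valid_from  timestamptz NOT NULL,
    valid_until timestamptz NOT NULL,
    -- at most one version of each custom_id at any point in time
    EXCLUDE USING gist (
        custom_id WITH =,
        tstzrange(valid_from, valid_until) WITH &&
    )
);
Other tables then reference a_current (custom_id), while point-in-time queries go against a_history; tstzrange treats valid_until as exclusive, so adjacent ranges do not count as overlapping.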

Related

Converting PostgreSQL table to TimescaleDB hypertable

I have a PostgreSQL table which I am trying to convert to a TimescaleDB hypertable.
The table looks as follows:
CREATE TABLE public.data
(
    event_time timestamp with time zone NOT NULL,
    pair_id    integer NOT NULL,
    entry_id   bigint NOT NULL,
    event_data int NOT NULL,
    CONSTRAINT con1 UNIQUE (pair_id, entry_id),
    CONSTRAINT pair_id_fkey FOREIGN KEY (pair_id)
        REFERENCES public.pairs (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
);
When I attempt to convert this table to a TimescaleDB hypertable using the following command:
SELECT create_hypertable(
'data',
'event_time',
chunk_time_interval => INTERVAL '1 hour',
migrate_data => TRUE
);
I get this error: ERROR: cannot create a unique index without the column "event_time" (used in partitioning)
Question 1: From the post "How to convert a simple postgresql table to hypertable or timescale db table using created_at for indexing", my understanding is that this is because I have specified a unique constraint (con1) which does not contain the column I am partitioning by, event_time. Is that correct?
Question 2: How should I change my table or hypertable to be able to convert this? I have added some data on how I plan to use the data and the structure of the data below.
Data Properties and usage:
There can be multiple entries with the same event_time - those entries would have entry_id's which are in sequence
This means that if I have 2 entries (event_time 2021-05-18::10:16, id 105, <some_data>) and (event_time 2021-05-18::10:16, id 107, <some_data>) then the entry with id 106 would also have event_time 2021-05-18::10:16
The entry_id is not generated by me and I use the unique constraint con1 to ensure that I am not inserting duplicate data
I will query the data mainly on event_time e.g. to create plots and perform other analysis
At this point the database contains around 4.6 Billion rows but should contain many more soon
I would like to take advantage of TimescaleDB's speed and good compression
I don't care too much about insert performance
Solutions I have been considering:
Pack all the events which have the same timestamp into an array somehow and keep them in one row. I think this would hurt compression and provide less flexibility when querying the data. I would also probably end up having to unpack the data on each query.
Remove the unique constraint con1 - then how do I ensure that I don't add the same row twice?
Expand unique constraint con1 to include event_time - would that not somehow decrease performance, while at the same time opening up the possibility of accidentally inserting 2 rows with the same entry_id and pair_id but different event_time? (I doubt this is a likely thing to happen though)
You understand correctly that UNIQUE (pair_id, entry_id) doesn't allow the table to be converted into a hypertable, since unique constraints need to include the partition key, i.e., event_time in your case.
I don't follow how the first option, where records with the same timestamp are packed into a single record, will help with uniqueness.
Removing the unique constraint will allow you to create the hypertable, but as you mentioned you lose the ability to check the constraint.
Adding the time column, e.g., UNIQUE (pair_id, entry_id, event_time), is quite a common approach, but it allows duplicates with different timestamps, as you mentioned, and it will perform worse than option 2 during inserts. You can replace the index on event_time (which you need anyway, since you query on this column, and which TimescaleDB creates automatically) with a unique index, so you save a little, e.g. (see the sketch after these options):
CREATE UNIQUE INDEX indx ON data (event_time, pair_id, entry_id);
Another option is to manually create a unique constraint on each chunk table. This guarantees uniqueness within a chunk, but it is still possible to have duplicates in different chunks. The main drawback is that you will need to figure out how to create the constraint whenever a new chunk is created.
Unique constraints that do not include the partition key are not supported in TimescaleDB, since checking them would require accessing all existing chunks (or maintaining a global index, which can become large), which would kill performance. I don't think unique constraints are a common case for time-series data, as they are usually tied to artificially generated counter-based identifiers.
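For completeness, a rough end-to-end sketch of that third option, using the same table and column names as above (create_default_indexes is only set to avoid a redundant plain index on event_time):
CREATE TABLE public.data
(
    event_time timestamp with time zone NOT NULL,
    pair_id    integer NOT NULL REFERENCES public.pairs (id),
    entry_id   bigint NOT NULL,
    event_data int NOT NULL
);

-- the unique index includes the partition column and doubles as the time index
CREATE UNIQUE INDEX data_time_pair_entry_idx ON public.data (event_time, pair_id, entry_id);

SELECT create_hypertable(
    'data',
    'event_time',
    chunk_time_interval => INTERVAL '1 hour',
    create_default_indexes => FALSE,
    migrate_data => TRUE
);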

Partition Based on Date is Problematic for Table's Primary Key - Any Workaround?

I have a table that I would like to partition based on the "deleted" date.
create table resources(
id int,
type varchar(30),
deleted date
)
I want to have a foreign key from another table pointing to this table's id column. However since I have the partition based on the deleted date, I must include it in the primary key. Adding the deleted column to the primary key does not make sense and will also prevent the other table's FK from pointing to this table. Is there a workaround?
Thanks
No, there is no workaround for foreign keys to a partitioned table. You have to add deleted to the primary key or unique constraint and also to the table that references the partitioned table.
If you don't want that, you'll have to go without a foreign key constraint.
Perhaps it would be useful to add deleted to the referencing table and partition that along the same boundaries as resources. Then you could get rid of old data in a coordinated fashion, and if you add deleted to the join condition, you can profit from enable_partitionwise_join = on.
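A sketch of that layout (PostgreSQL 12 or later supports foreign keys that reference a partitioned table; the referencing table and its columns are made up for illustration):
CREATE TABLE resources (
    id      int,
    type    varchar(30),
    deleted date,
    PRIMARY KEY (id, deleted)
) PARTITION BY RANGE (deleted);

-- hypothetical referencing table, partitioned along the same boundaries
CREATE TABLE resource_links (
    resource_id int,
    deleted     date,
    FOREIGN KEY (resource_id, deleted) REFERENCES resources (id, deleted)
) PARTITION BY RANGE (deleted);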

Foreign key from events table 1-1 or many?

I'm likely overthinking a problem here and may well get downvoted but I'm prepared to take the hit. I'm building my first schema in a data warehouse.
2 tables: events and contacts:
events(id(pk), cid, other, fields, here)
contacts(id (pk), cid(fk), other, fields, here)
Someone visits our website and registers. A row is generated in events (with its own "id"), a "cid" for contacts is generated, and a new record is added to contacts.
I have two questions:
Can I make the primary key of contacts cid? Thus the primary key is also a foreign key?
I'm using MySQL Workbench to create the schema. When I create the contacts table, I am able to set the foreign key of cid and the cardinality as either 1-1 or 1-many. From the point of view of the contacts table, is the relationship 1-1 or 1-many? There will only ever be one cid record in contacts, but if that user does multiple things (like receive an email from us, etc.) they will appear multiple times in the events table. So, logically, 1-many. But when I create this in Workbench, the relation line appears as though it is a 1-many relation with the "many" end at contacts, not the other way around as desired. Should it be the other way around?
What is the relationship between events.cid and contacts.cid?
If a user's registration results in a single contact_ record while each user visit to the web site (each Session started) results in an event_ record belonging to that user’s contact_ record, then you have a One-To-Many relationship.
`contact_` = parent table (the One)
`event_` = child table (the Many)
Notice how I boiled down that relationship into a single sentence. That should be your goal when doing analysis work to determine table structure.
Relationships are almost always defined as a link from a primary key on parent table to a foreign key on a child table.
How you define the primary key is up to you. First decide whether you want a natural key or a surrogate key. In my experience a natural key never works out as the values always eventually change.
If using a surrogate key, decide what type. The usual choices are an integer tied to an automatically incrementing sequence generator, or a UUID. If ever federating data with other databases or systems then UUID is the way to go. If using an integer, decide on size, with 32-bit integers handling a total of 2-4 billion records. A 64-bit integer can track 18 quintillion records.
The foreign key in child table is simply a copy of its assigned parent’s primary key value. So the foreign key must have same data type as that parent primary key.
If a particular parent record owns multiple records in the child table, each of those child records will carry a copy of that parent’s primary key. So if the user Susan has five events, her primary key value appears once in the contact_ table and that same value appears five times in the event_ table stored in the foreign key column.
If cid uniquely identifies each contact_ record amongst all the other contact_ records, then you have a primary key. No need to define another.
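A minimal sketch of that shape (MySQL syntax, since the question mentions MySQL Workbench; every column other than cid and id is a placeholder for the "other fields here"):
CREATE TABLE contact_ (
    cid   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255)                    -- other fields here
);

CREATE TABLE event_ (
    id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cid BIGINT UNSIGNED NOT NULL,         -- copy of the parent contact_ key
    CONSTRAINT fk_event_contact FOREIGN KEY (cid) REFERENCES contact_ (cid)
);
Here cid is both the primary key of contact_ and the foreign key column in event_, which is exactly the one-to-many shape described above: one contact_ row, many event_ rows carrying its key.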

How to maintain record history on table with one-to-many relationships?

I have a "services" table for detailing services that we provide. Among the data that needs recording are several small one-to-many relationships (all with a foreign key constraint to the service_id) such as:
service_owners -- user_ids responsible for delivery of service
service_tags -- e.g. IT, Records Management, Finance
customer_categories -- ENUM value
provider_categories -- ENUM value
software_used -- self-explanatory
The problem I have is that I want to keep a history of updates to a service, for which I'm using an update trigger on the table, that performs an insert into a history table matching the original columns. However, if a normalized approach to the above data is used, with separate tables and foreign keys for each one-to-many relationship, any update on these tables will not be recognised in the history of the service.
Does anyone have any suggestions? It seems like I need to store child keys in the service table to maintain the integrity of the service history. Is a delimited text field a valid approach here, or, as I am using PostgreSQL, perhaps arrays are also a valid option? These feel somewhat dirty though!
Thanks.
If your table is:
create table T (
    ix int identity primary key,
    val nvarchar(50)
)
And your history table is:
create table THistory (
    historyId int identity primary key,
    ix int,                 -- key of the T row this version belongs to
    val nvarchar(50),
    updateType char(1),     -- C=Create, U=Update or D=Delete
    updateTime datetime,
    updateUsername sysname
)
Then you just need to put an update trigger on all tables of interest. You can then find out what the state of any/all of the tables were at any point in history, to determine what the relationships were at that time.
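Since the question mentions PostgreSQL, a rough PL/pgSQL sketch of such an update trigger could look like this (function and trigger names are made up, and it assumes PostgreSQL-typed equivalents of the T and THistory tables above):
CREATE OR REPLACE FUNCTION t_log_update() RETURNS trigger AS $$
BEGIN
    -- copy the row being overwritten into the history table
    INSERT INTO THistory (ix, val, updateType, updateTime, updateUsername)
    VALUES (OLD.ix, OLD.val, 'U', now(), current_user);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_history_trigger
    AFTER UPDATE ON T
    FOR EACH ROW EXECUTE FUNCTION t_log_update();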
I'd avoid using arrays in any database whenever possible.
I don't like updates for the exact reason you are saying here... you lose information as it is overwritten. My answer is quite simple... don't update. Not sure if you're at a point where this can be implemented, but if you can, I'd recommend using the main table itself to store historical data (no need for a second set of history tables).
Add a column to your main header table called 'active'. This can be a character or a bit (0 is off and 1 is on). Then it's a bit of trigger magic: when an update is performed, you insert a row into the table identical to the record being overwritten, with a status of '0' (inactive), and then update the existing row (this process keeps the ID column on the active record the same; the newly inserted record is the inactive one with a new ID).
This way no data is ever lost (admittedly you are storing quite a few rows...) and the history can easily be viewed with a select where active = 0.
The pain here is if you are working on something already implemented: every existing query that hits this table will need to be updated to include a check for the active column. That makes this solution very easy to implement if you are designing a new system, but a pain if it's a long-standing application. Unfortunately, existing reports will include both off and on records (without throwing an error) until you can modify the where clause.
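A rough PL/pgSQL sketch of that trigger (the services table and its name/owner columns are hypothetical, and active is assumed to be an integer flag):
CREATE OR REPLACE FUNCTION services_keep_old_version() RETURNS trigger AS $$
BEGIN
    -- re-insert the row being overwritten as an inactive copy with a new id
    INSERT INTO services (name, owner, active)
    VALUES (OLD.name, OLD.owner, 0);
    RETURN NEW;   -- the active row keeps its original id and takes the new values
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER services_versioning
    BEFORE UPDATE ON services
    FOR EACH ROW
    WHEN (OLD.active = 1)
    EXECUTE FUNCTION services_keep_old_version();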

Is this kind of DB relation design favourable and correct? Should it be converted to a no-sql solution?

First of all, I did my research, but being rather a newbie I am not that well acquainted with the terminology, so I might have failed to find the correct words. I beg your pardon in case of a possible duplicate.
Question #1:
I have a table consisting of ID [PK] and LABEL [Varchar 128]. Each record (row) here is unique. What I want is, to define relations between these LABELS.
Requisite:
There will be n groups, each group containing one or more of these LABELS. Within a group, each LABEL either exists or does not exist (meaning a group does not contain the same LABEL twice).
How should I define this relation?
I thought of creating another table with ID [PK] - Group ID [randomly assigned unique key] - LABEL_ID [ID of Labels table pointing to a single Label]
Is this correct and favourable? If a group has 10 LABELS then there will be 10 records with unique ID, same uniquely assigned Group ID and LABEL_ID pointing to LABELS table.
Question #2:
Should I let go of the relational solution (as described above) and opt for a NoSQL solution, where each group is stored on its own as a single entry in the database with an ID [PK] and Data [containing either labels or IDs of labels pointing to the Label table]?
If NoSQL is the way to go, how should I store this data?
a) Should I have ID - Data (containing Labels)?
b) ID - Data (containing IDs of Labels)?
Question #3:
If NoSQL solution here is the best way, which NoSQL database should I choose for this use case?
Thank you.
There's no real need for an ID column in this GroupLabels table:
CREATE TABLE GroupLabels (
GroupID int not null,
LabelID int not null,
constraint PK_GroupLabels PRIMARY KEY (GroupID,LabelID),
constraint FK_GroupLabels_Groups FOREIGN KEY (GroupID) references Groups,
constraint FK_GroupLabels_Labels FOREIGN KEY (LabelID) references Labels
)
By doing the above, we've automatically achieved a constraint - that the same label can't be added to the same group more than once.
With the above, I'd say it's a reasonably common SQL solution.
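For example, with some made-up ids, attaching the same label to a group twice is rejected by the composite primary key:
INSERT INTO GroupLabels (GroupID, LabelID) VALUES (1, 7);
INSERT INTO GroupLabels (GroupID, LabelID) VALUES (1, 7);  -- fails: duplicate key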
There is too little information here to make recommendations on the question of "to SQL or not to SQL".
However, the relational approach would be as you describe, I think.
CREATE TABLE Group
(
    GroupId int PRIMARY KEY
)
CREATE TABLE Label
(
    LabelId int PRIMARY KEY,
    Value varchar(100) UNIQUE
)
CREATE TABLE GroupLabel
(
    GroupId int FOREIGN KEY REFERENCES Group,
    LabelId int FOREIGN KEY REFERENCES Label,
    UNIQUE (GroupId, LabelId)
)
Here every label is unique, many labels may be in each group, and each label may be in many groups, but each label can only appear in each group once.
As Damien_The_Unbeliever indicates, the Group table can be omitted if you don't need to store any additional attributes about each group; in that case GroupLabel simply drops its foreign key on GroupId and relies on the UNIQUE (GroupId, LabelId) constraint alone.
You might need to change the syntax slightly for whatever RDBMS you're using.
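For example, listing the labels of one group (group 42 is made-up sample data) is then a simple join:
SELECT l.Value
FROM GroupLabel gl
JOIN Label l ON l.LabelId = gl.LabelId
WHERE gl.GroupId = 42;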