How are composite keys evaluated? - postgresql

Are composite keys guaranteed to be unique as long as the individual values of the columns it consists of are unique (as in the column values are separately evaluated), or is it the resulting value (as in a concatenation of the column values) that makes up the key and has to be unique?
Would, for example, the following two rows result in the same key, or would they both be considered unique and therefore allowed:
PRIMARY KEY (user_id, friend_id)

 user_id | friend_id
---------+-----------
      10 |       283
    1028 |         3
Now, I'm obviously no database expert, it's actually the first time I'm thinking about using composite keys (never had reason to before), so it might be something that "everyone" knows about or is just really easy to find in the documentation, but I've been unable to find an answer to it.
You'd expect the example above to work (logically, why shouldn't it? The two rows are certainly distinct), but I just really want to be sure before I proceed with my project.

A PRIMARY KEY constraint is implemented by a UNIQUE index on the involved columns plus a NOT NULL constraint on each of them.
"Unique" means that the combination of all columns is unique. What you are worrying about is the concatenation of the textual representation of two values ('10' || '283') = ('1028' || '3') but that's not how composite types operate at all. All fields are considered separately and as values of the defined data types, not as text representations.
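You can see the tuple-wise comparison in action. A minimal sketch using SQLite through Python's `sqlite3` module (PostgreSQL evaluates composite keys the same way; the `friends` table here is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friends (
        user_id   INTEGER,
        friend_id INTEGER,
        PRIMARY KEY (user_id, friend_id)  -- composite key: the PAIR must be unique
    )
""")

# Both rows are accepted: (10, 283) and (1028, 3) are different pairs,
# even though '10' || '283' equals '1028' || '3' as concatenated text.
conn.execute("INSERT INTO friends VALUES (10, 283)")
conn.execute("INSERT INTO friends VALUES (1028, 3)")

# Re-inserting an existing pair violates the key.
try:
    conn.execute("INSERT INTO friends VALUES (10, 283)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

row_count = conn.execute("SELECT COUNT(*) FROM friends").fetchone()[0]
print(row_count)          # 2
print(duplicate_allowed)  # False
```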
NULL values are never considered equal, but NULLs are not allowed in primary key columns anyway.
The order of columns is relevant for performance: the accompanying composite index favors the leading columns. More details in this closely related answer:
PostgreSQL composite primary key

Each unique constraint (including a primary key constraint) demands that no two rows in the relation agree on all of the attributes named in the constraint; in other words, the whole row is functionally dependent on its projection to those attributes.
What that means is that, in your example,
 user_id | friend_id
---------+-----------
       1 |         1
       1 |         2
       2 |         1
       4 |         5
are all allowed, since no pair of <user_id, friend_id> occurs more than once. Given the above, you could not insert another <1, 1>, because it would conflict with the first row.

Related

Implement revision number while keeping primary key in sql

Suppose I have a psql table with a primary key and some data:
pkey                 | price
---------------------+-------
0075QlyLvw8bi7q6XJo7 |    20
(1 row)
However, I would like to save historical updates on it without losing the functionality that comes from referencing its key in other tables as a foreign key.
I am thinking of doing some kind of revision_number + timestamp approach where each "update" would be a new row, example:
pkey                 | price | rev_no
---------------------+-------+--------
0075QlyLvw8bi7q6XJo7 |    20 |      0
0075QlyLvw8bi7q6XJo7 |    15 |      1
(2 rows)
Then create a view that always takes the highest revision number of the table and reference keys from that view.
However, to me this workaround seems a bit too heavy for a task that, in my opinion, should be fairly common. Is there something I'm missing? Do you have a better solution, or is there a well-known pattern for these types of problems that I don't know about?
Assuming pkey is actually the defined primary key, you cannot implement the revision scheme you outlined without creating a history table and moving old data into it: the primary key must be unique across all revisions. But if you have a properly normalized table there are several valid methods; the following is one:
Review the other attributes and identify the candidate business key (columns with business meaning that could be declared unique, perhaps the item name).
If not already present, add two columns: an effective timestamp and a superseded timestamp.
Now create a partial unique index on the column(s) identified in step 1, filtered on the superseded timestamp being NULL, meaning "this is the currently active version".
Create a simple view as SELECT * FROM table. Since this is a simple view it is fully updatable. Use this view for SELECT, INSERT and DELETE, but
for UPDATE create an INSTEAD OF trigger. The trigger sets the superseded timestamp of the currently active row and inserts a new row with the update applied and the version number incremented.
With the above you get uniqueness on the currently active revision, and you keep the history of all relationships at each version. (See demo, including a couple of useful functions.)
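The partial-unique-index part of this scheme can be sketched as follows (SQLite via Python's `sqlite3`, purely for illustration; the `product` table and `update_price` function are made up, and in PostgreSQL the INSTEAD OF trigger would do what `update_price` does here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        pkey          TEXT NOT NULL,
        price         INTEGER NOT NULL,
        rev_no        INTEGER NOT NULL DEFAULT 0,
        superseded_at TEXT              -- NULL means "currently active revision"
    );
    -- Only ONE active revision per business key is allowed.
    CREATE UNIQUE INDEX product_active_uq
        ON product (pkey) WHERE superseded_at IS NULL;
""")

def update_price(conn, pkey, new_price):
    """Simulates the trigger: close the active row, then insert a new revision."""
    (rev,) = conn.execute(
        "SELECT rev_no FROM product WHERE pkey = ? AND superseded_at IS NULL", (pkey,)
    ).fetchone()
    conn.execute(
        "UPDATE product SET superseded_at = datetime('now') "
        "WHERE pkey = ? AND superseded_at IS NULL", (pkey,)
    )
    conn.execute(
        "INSERT INTO product (pkey, price, rev_no) VALUES (?, ?, ?)",
        (pkey, new_price, rev + 1),
    )

conn.execute("INSERT INTO product (pkey, price) VALUES ('0075QlyLvw8bi7q6XJo7', 20)")
update_price(conn, '0075QlyLvw8bi7q6XJo7', 15)

rows = conn.execute("SELECT price, rev_no FROM product ORDER BY rev_no").fetchall()
print(rows)    # [(20, 0), (15, 1)]
active = conn.execute("SELECT price FROM product WHERE superseded_at IS NULL").fetchone()
print(active)  # (15,)
```

All revisions stay in one table, yet the partial index guarantees exactly one "current" row per key, which is what foreign keys from other tables would target through the view.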

Is it possible to make a list partition in PostgreSQL based on a join of the partition list key?

Consider the following tables
Table "public.Foo"
Column | Type |
------------------+-----------------------------+
foo_id | integer | PK
bar_id | integer | FK to bars
....
Table "public.Bar"
Column | Type |
------------------+-----------------------------+
bar_id | integer | PK
....
Table "public.very_big"
Column | Type |
------------------+-----------------------------+
foo_id | integer
....
Bar is mapped to Foo such that a Bar has many Foos. There are < 50 Bars, each of which has hundreds of Foos. Then, I have a separate table that is quite large, > 200 million rows, which references Foos (among other tables), creating a large composite primary key that results in a large table size.
In an ideal world, I'd like to partition very_big on bar_id, without having to add a column for bar_id, by just using the reference of foo_id. This would give a good number of partitions, 1-50 order of magnitude, while partitioning on foo_id directly would be thousands, which is advised against in the documentation.
In a sense, this ideal world is a list partition based on a join. Based on the documentation, it doesn't seem like this is possible to me. It also raises some hairy questions, like what happens to data in one partition if the foo-bar relationship is changed such that the data would have to move partitions? Because of this, this doesn't seem possible/doesn't make sense.
One alternative is a manually defined list partition. This would result in each manually defined list having on the order of 300-500 values. Some cons of this approach: 1) if references between foo and bar change, as discussed above, the partitioning is totally ruined (I don't foresee this happening, otherwise I wouldn't be considering this solution, but it's still a downside); 2) large, non-contiguous list partition definitions of 300-500 values might perform poorly in the Postgres query planner; 3) newly added Foos won't have a place to go, as they won't have been added to the partition definition.
Another alternative is to suck it up and add a bar_id column to very_big, but this is denormalized and redundant.
Is the "join" based list partition possible? (I think not)
If not, are there performance consequences of non-contiguous list partitions with hundreds of values each? (Unsure, but there are other costs here.)
Are there other solutions to this problem that I'm potentially missing? For the moment I'm planning to go with the "denormalized" solution: add bar_id to very_big and define the partition based on that.
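For what it's worth, the redundancy of the denormalized route can be kept consistent mechanically by resolving bar_id from foo at insert time, so callers never supply it. A rough sketch (SQLite via Python's `sqlite3`, toy schema mirroring the question; in PostgreSQL the same INSERT ... SELECT works, and the partition key must be known before row routing anyway):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bar (bar_id INTEGER PRIMARY KEY);
    CREATE TABLE foo (
        foo_id INTEGER PRIMARY KEY,
        bar_id INTEGER NOT NULL REFERENCES bar
    );
    CREATE TABLE very_big (
        foo_id INTEGER NOT NULL REFERENCES foo,
        bar_id INTEGER NOT NULL REFERENCES bar   -- redundant, but the partition key
    );
""")
conn.execute("INSERT INTO bar VALUES (1)")
conn.execute("INSERT INTO foo VALUES (100, 1)")

def insert_very_big(conn, foo_id):
    # Resolve bar_id from foo at insert time so the caller never supplies it.
    conn.execute(
        "INSERT INTO very_big (foo_id, bar_id) "
        "SELECT foo_id, bar_id FROM foo WHERE foo_id = ?", (foo_id,)
    )

insert_very_big(conn, 100)
row = conn.execute("SELECT foo_id, bar_id FROM very_big").fetchone()
print(row)  # (100, 1)
```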
Thank you

In Postgresql, is there a way to restrict a column's values to be an enum?

In postgresql, I can create a table documenting which type of vehicle people have.
CREATE TABLE IF NOT EXISTS person_vehicle_type
( id SERIAL NOT NULL PRIMARY KEY
, name TEXT NOT NULL
, vehicle_type TEXT
);
This table might have values such as
id | name | vehicle_type
----+---------+---------
1 | Joe | sedan
2 | Sue | truck
3 | Larry | motorcycle
4 | Mary | sedan
5 | John | truck
6 | Crystal | motorcycle
7 | Matt | sedan
The values in the vehicle_type column are restricted to the set {sedan, truck, motorcycle}.
Is there a way to formalize this restriction in PostgreSQL?
Personally, I would use a foreign key and a lookup table.
Anyway, you could also use enums. I recommend reading the article PostgreSQL Domain Integrity In Depth:
A few RDBMSes (PostgreSQL and MySQL) have a special enum type that
ensures a variable or column must be one of a certain list of values.
This is also enforcible with custom domains.
However the problem is technically best thought of as referential
integrity rather than domain integrity, and usually best enforced with
foreign keys and a reference table. Putting values in a regular
reference table rather than storing them in the schema treats those
values as first-class data. Modifying the set of possible values can
then be performed with DML (data manipulation language) rather than
DDL (data definition language)....
However when the possible enumerated values are very unlikely to
change, then using the enum type provides a few minor advantages.
Enum values have human-readable names, but internally they are simple integers. They don't take much storage space. To compete with
this efficiency using a reference table would require using an
artificial integer key, rather than a natural primary key of the value
description. Even then the enum does not require any foreign key
validation or join query overhead.
Enums and domains are enforced everywhere, even in stored procedure arguments, whereas lookup table values are not. Reference
table enumerations are enforced with foreign keys, which apply only to
rows in a table.
The enum type defines an automatic (but customizable) order relation:
CREATE TYPE log_level AS ENUM ('notice', 'warning', 'error', 'severe');
CREATE TABLE log(i SERIAL, level log_level);
INSERT INTO log(level)
VALUES ('notice'::log_level), ('error'::log_level), ('severe'::log_level);
SELECT * FROM log WHERE level >= 'warning';
DBFiddle Demo
Drawback:
Unlike a restriction of values enforced by foreign key, there is no way to delete a value from an existing enum type. The only workarounds are messing with system tables or renaming the enum, recreating it with the desired values, then altering tables to use the replacement enum. Not pretty.
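A minimal sketch of the lookup-table option recommended above (here using SQLite through Python's `sqlite3` purely for illustration; the `vehicle_type` lookup table mirrors the question's schema, and PostgreSQL enforces the foreign key the same way without needing the pragma):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE vehicle_type (name TEXT PRIMARY KEY);
    INSERT INTO vehicle_type VALUES ('sedan'), ('truck'), ('motorcycle');

    CREATE TABLE person_vehicle_type (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        vehicle_type TEXT REFERENCES vehicle_type (name)
    );
""")

conn.execute("INSERT INTO person_vehicle_type (name, vehicle_type) VALUES ('Joe', 'sedan')")

# A value outside the lookup table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO person_vehicle_type (name, vehicle_type) VALUES ('Bob', 'boat')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

Adding or removing an allowed value is then a plain INSERT or DELETE on vehicle_type, which is exactly the DML-over-DDL advantage the quoted article describes.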

What is Hash and Range Primary Key?

I am not able to understand what Range / primary key is here in the docs on Working with Tables and Data in DynamoDB
How does it work?
What do they mean by "unordered hash index on the hash attribute and a sorted range index on the range attribute"?
"Hash and Range Primary Key" means that a single row in DynamoDB has a unique primary key made up of both the hash and the range key. For example with a hash key of X and range key of Y, your primary key is effectively XY. You can also have multiple range keys for the same hash key but the combination must be unique, like XZ and XA. Let's use their examples for each type of table:
Hash Primary Key – The primary key is made of one attribute, a hash
attribute. For example, a ProductCatalog table can have ProductID as
its primary key. DynamoDB builds an unordered hash index on this
primary key attribute.
This means that every row is keyed off this value. Every row in DynamoDB will have a required, unique value for this attribute. "Unordered hash index" means what it says: the data is not ordered, and you are given no guarantees about how it is stored. You won't be able to make queries on an unordered index such as "Get me all rows that have a ProductID greater than X". You write and fetch items based on the hash key, for example "Get me the row from that table that has ProductID X". You are making a query against an unordered index, so your gets against it are basically key-value lookups: very fast, and using very little throughput.
Hash and Range Primary Key – The primary key is made of two
attributes. The first attribute is the hash attribute and the second
attribute is the range attribute. For example, the forum Thread table
can have ForumName and Subject as its primary key, where ForumName is
the hash attribute and Subject is the range attribute. DynamoDB builds
an unordered hash index on the hash attribute and a sorted range index
on the range attribute.
This means that every row's primary key is the combination of the hash and range key. You can make direct gets on single rows if you have both the hash and range key, or you can make a query against the sorted range index. For example, "Get me all rows from the table with hash key X that have range keys greater than Y", or other queries to that effect. These queries perform better and use less capacity than Scans and Queries against fields that are not indexed. From their documentation:
Query results are always sorted by the range key. If the data type of
the range key is Number, the results are returned in numeric order;
otherwise, the results are returned in order of ASCII character code
values. By default, the sort order is ascending. To reverse the order,
set the ScanIndexForward parameter to false.
I probably missed some things as I typed this out and I only scratched the surface. There are a lot more aspects to take into consideration when working with DynamoDB tables (throughput, consistency, capacity, other indices, key distribution, etc.). You should take a look at the sample tables and data page for examples.
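To make the two access patterns concrete, here is a toy simulation in Python (nothing DynamoDB-specific, and the forum/date values are made up; it just mimics the "hash lookup, then sorted range scan" behavior described above):

```python
from bisect import insort
from collections import defaultdict

# hash key -> sorted list of (range_key, item)
table = defaultdict(list)

def put(hash_key, range_key, item):
    insort(table[hash_key], (range_key, item))

def get(hash_key, range_key):
    """Direct get: needs BOTH keys, like a hash-and-range primary key."""
    for rk, item in table[hash_key]:
        if rk == range_key:
            return item
    return None

def query(hash_key, range_key_min):
    """Range query within one hash key: 'all rows with range key >= X'."""
    return [item for rk, item in table[hash_key] if rk >= range_key_min]

put("ForumA", "2014-01-01", {"subject": "first"})
put("ForumA", "2015-06-01", {"subject": "second"})
put("ForumB", "2014-03-01", {"subject": "other forum"})

print(get("ForumA", "2015-06-01"))    # {'subject': 'second'}
print(query("ForumA", "2015-01-01"))  # [{'subject': 'second'}]
```

Note how the range query never crosses hash keys: "ForumB" rows are invisible to a query on "ForumA", which is exactly why the hash key is mandatory in every DynamoDB Query.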
A well-explained answer is already given by @mkobit, but I will add the big picture of the range key and hash key.
In simple words: hash key + range key = composite primary key. See Core Components of DynamoDB:
A primary key consists of a hash key and an optional range key.
Hash key is used to select the DynamoDB partition. Partitions are
parts of the table data. Range keys are used to sort the items in the
partition, if they exist.
So each has a different purpose, and together they enable complex queries.
In the above example, hashkey1 can have multiple (n) range keys. Another example of range and hash key: in a game, userA (hash key) can play N games (range key).
The Music table described in Tables, Items, and Attributes is an
example of a table with a composite primary key (Artist and
SongTitle). You can access any item in the Music table directly, if
you provide the Artist and SongTitle values for that item.
A composite primary key gives you additional flexibility when querying
data. For example, if you provide only the value for Artist, DynamoDB
retrieves all of the songs by that artist. To retrieve only a subset
of songs by a particular artist, you can provide a value for Artist
along with a range of values for SongTitle.
https://www.slideshare.net/InfoQ/amazon-dynamodb-design-patterns-best-practices
https://www.slideshare.net/AmazonWebServices/awsome-day-2016-module-4-databases-amazon-dynamodb-and-amazon-rds
https://ceyhunozgun.blogspot.com/2017/04/implementing-object-persistence-with-dynamodb.html
Since the whole thing can get confusing, let's look at a function and some code to simulate concisely what it means.
The only way to get a row is via its primary key:
getRow(pk: PrimaryKey): Row
Primary key data structure can be this:
// If you decide your primary key is just the partition key.
class PrimaryKey(partitionKey: String)
// and in this case
getRow(somePartitionKey): Row
However, you can decide your primary key is partition key + sort key, in which case:
// if you decide your primary key is partition key + sort key
class PrimaryKey(partitionKey: String, sortKey: String)
getRow(partitionKey, sortKey): Row
getMultipleRows(partitionKey): Row[]
So the bottom line:
Decided that your primary key is the partition key only? Get a single row by partition key.
Decided that your primary key is partition key + sort key?
2.1 Get a single row by (partition key, sort key), or get a range of rows by (partition key).
Either way you get a single row by primary key; the only question is whether you defined that primary key as the partition key only, or as partition key + sort key.
The building blocks are:
Table
Item
KV attribute
Think of an Item as a row, and of a KV attribute as a cell in that row.
You can get an item (a row) by primary key.
You can get multiple items (multiple rows) by specifying (HashKey, RangeKeyQuery).
You can do the latter only if you decided that your PK is composed of (HashKey, SortKey).
More visually as its complex, the way I see it:
+----------------------------------------------------------------------------------+
|Table |
|+------------------------------------------------------------------------------+ |
||Item | |
||+-----------+ +-----------+ +-----------+ +-----------+ | |
|||primaryKey | |kv attr | |kv attr ...| |kv attr ...| | |
||+-----------+ +-----------+ +-----------+ +-----------+ | |
|+------------------------------------------------------------------------------+ |
|+------------------------------------------------------------------------------+ |
||Item | |
||+-----------+ +-----------+ +-----------+ +-----------+ +-----------+ | |
|||primaryKey | |kv attr | |kv attr ...| |kv attr ...| |kv attr ...| | |
||+-----------+ +-----------+ +-----------+ +-----------+ +-----------+ | |
|+------------------------------------------------------------------------------+ |
| |
+----------------------------------------------------------------------------------+
+----------------------------------------------------------------------------------+
|1. Always get item by PrimaryKey |
|2. PK is (Hash,RangeKey), great get MULTIPLE Items by Hash, filter/sort by range |
|3. PK is HashKey: just get a SINGLE ITEM by hashKey |
| +--------------------------+|
| +---------------+ |getByPK => getBy(1 ||
| +-----------+ +>|(HashKey,Range)|--->|hashKey, > < or startWith ||
| +->|Composite |-+ +---------------+ |of rangeKeys) ||
| | +-----------+ +--------------------------+|
|+-----------+ | |
||PrimaryKey |-+ |
|+-----------+ | +--------------------------+|
| | +-----------+ +---------------+ |getByPK => get by specific||
| +->|HashType |-->|get one item |--->|hashKey ||
| +-----------+ +---------------+ | ||
| +--------------------------+|
+----------------------------------------------------------------------------------+
So what is happening above? Notice the following observations. As we said, our data belongs to (Table, Item, KVAttribute), and every Item has a primary key. The way you compose that primary key determines how you can access the data.
If you decide that your primary key is simply a hash key, then you can get a single item out of it. If you decide that your primary key is hash key + sort key, then you can also do a range query on it, because you will get your items by (HashKey + SomeRangeFunction(on range key)). So you can get multiple items with a primary-key query.
Note: I did not refer to secondary indexes.
@vnr You can retrieve all the sort keys associated with a partition key just by querying with the partition key. No scan is needed. The point here is that the partition key is compulsory in a query; sort keys are used only to get a range of data.
Just for anyone who is also confused about the terminology:
The partition key of an item is also known as its hash attribute. The
term hash attribute derives from the use of an internal hash function
in DynamoDB that evenly distributes data items across partitions,
based on their partition key values.
The sort key of an item is also known as its range attribute. The term
range attribute derives from the way DynamoDB stores items with the
same partition key physically close together, in sorted order by the
sort key value.
Source: "Core components of Amazon DynamoDB"

Keeping a table column empty when it is indexed as unique

Is it possible to keep a table column empty if it's defined as unique?
Table schema
Column | Type | Modifiers | Description
-------------------+------------------------+---------------+-------------
id | integer | not null |
name | character varying(64) | |
Indexes
Indexes:
"clients_pkey" PRIMARY KEY, btree (id)
"clients_name_idx" UNIQUE, btree (name)
Has OIDs: yes
Due to modifications to the application, the name column sometimes needs to be empty. Is this possible at all?
If the column can contain NULL values, then that is OK, as NULL is not included in the index.
Note that some databases don't implement the standard properly (some versions of SQL Server allowed only one NULL value per unique constraint, though I'm not sure if that is still the case).
Using NULL is the better option, but you could also use a conditional unique index:
CREATE UNIQUE INDEX unique_clients_name ON clients (name) WHERE name <> '';
And avoid OIDs; they are useless and obsolete.
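Both behaviors are easy to verify. A quick sketch (SQLite through Python's `sqlite3` here; PostgreSQL behaves the same on these points: a UNIQUE index allows any number of NULLs, and the partial index additionally exempts empty strings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );
    -- Partial unique index: rows where name = '' (or name IS NULL) are not indexed.
    CREATE UNIQUE INDEX unique_clients_name ON clients (name) WHERE name <> '';
""")

# Any number of NULLs and empty strings are fine.
conn.executemany("INSERT INTO clients (name) VALUES (?)",
                 [(None,), (None,), ("",), ("",), ("alice",)])

# A duplicate non-empty name is still rejected.
try:
    conn.execute("INSERT INTO clients (name) VALUES ('alice')")
    dup_ok = True
except sqlite3.IntegrityError:
    dup_ok = False

count = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
print(count)   # 5
print(dup_ok)  # False
```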