how to relate many rows to a row from the same table postgresql? - postgresql

consider Amazon product category architecture (one product may have 7 parent categories another might have 2). I want to build the same thing using Postgres.
A: Is there any scaleable logical way to do this? or I must consider using a graph database.
ps: the project will not be AMAZON BIG. this is a monolith project, not a microservice.
B: my thoughts are that I should have a field named parent_categories in my category table which is an array of UUIDs of categories then a field named category_id for the products table that is related to the last category parent would work.
something like this:
CREATE TABLE categories (
id UUID PRIMARY KEY NOT NULL DEFAULT gen_random_uuid (),
name VARCHAR NOT NULL,
parent_categories UUID[]
);
CREATE TABLE products (
id UUID PRIMARY KEY NOT NULL DEFAULT gen_random_uuid (),
name VARCHAR NOT NULL,
category_id UUID[],
CONSTRAINT fk_category FOREIGN KEY(category_id) REFERENCES categories(id)
);
the problem is with joining the chained categories I'm expecting a result like the below when fetching categories (I'm using node.js) and I don't know how to join every element of that array.
categories: [{
id: "id",
name: "name",
parent_categories: [{
id: "id",
name: "name"
}]
}]

This question is about relational theory.
You have a pair of tables containing id and name, that's lovely.
Discard the array attributes, and then
CREATE TABLE product_category (
product_id UUID REFERENCES products(id),
category_id UUID REFERENCES categories(id),
PRIMARY KEY (product_id, category_id)
)
Now you are perfectly set up for 3-way JOINs.
Consider adopting the "table names are singular" convention,
rather than the current plural-form names.
Add a parent_id column to categories,
so the table supports self-joins.
Then use WITH RECURSIVE to navigate
the hierarchical tree of categories.
(Classic example in the Oracle documentation
shows how manager can be used for emp
self-joins to produce a deeply nested org chart.)

Related

Use index to speed up query using values from different tables

I have a table products, a table orders and a table orderProducts.
Products have a name as a PK (apple, banana, mango) and a price .
orders have a created_at date and an id as a PK.
orderProducts connects orders and products, so they have a product_name and an order_id. Now I would like to show all orders for a given product that happened in the last 24 hours.
I use the following query:
SELECT
orders.id,
orders.created_at,
products.name,
products.price
FROM
orderProducts
JOIN products ON
products.name=orderProducts.product
JOIN orders ON
orders.id=orderProducts.order
WHERE
products.name='banana'
AND
orders.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY
orders.created_at
This works, but I would like to optimize this query with an index. This index would need to first be ordered by
the product name, so it can be filtered
then the created_at of the order in descending order, so it can select only the ones from 24 hours ago
The problem is, that from what I have seen, indexes can only be created on a single table, without the possibility of joining another tables values to it. Since two individual index do not solve this problem either, I was wondering if there was an alternative way to optimize this particular query.
Here are the table scripts:
CREATE TABLE products
(
name text PRIMARY KEY,
price integer,
)
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW(),
)
CREATE TABLE orderProducts
(
product text REFERENCES products(name),
"order" integer REFERENCES orders(id),
)
First of all. Please do not put indices everywhere - that lead to slower changing operations...
As proposed by #Laurenz Albe - do not guess - check.
Other than that. Note that you know product name, price is repeated - so you can query that once. Question if in your case two queries are going to be faster then single one... Check that.
Please read docs. I would try this index:
create index orders_id_created_at on orders(created_at desc, id)
Normally id should go first, since that is unique, however here system should be able to filter out on both predicates - where/join. Just guessing here.
orderProducts I would like to see index on both columns, however for this query only one should be needed. In practice you are going from products to orders, or other way - both paths are possible, that is why I've wrote about indexing both columns. I would use two separate indexes:
create index orderproducts_product_id on orderproducts (product_id) include (order_id);
create index orderproducts_order_id on orderproducts (order_id) include (product_id);
Probably that is not changing much, but... idea is to use only index, but not the table itself.
These rules are important in terms of performance:
Integer index faster than string index, therefore, you should try to make the primary keys always be an integer. Because join the tables uses primary keys too.
If when in where clauses always use two fields then we must create an index for both fields.
Foreign-Keys are not indexed, you must create an index for foreign-key fields manually.
So, recommended table scripts will be are that:
CREATE TABLE products
(
id serial primary key,
name text,
price integer
);
CREATE UNIQUE INDEX products_name_idx ON products USING btree (name);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX orders_created_at_idx ON orders USING btree (created_at);
CREATE TABLE orderProducts
(
product_id integer REFERENCES products(id),
order_id integer REFERENCES orders(id)
);
CREATE INDEX orderproducts_product_id_idx ON orderproducts USING btree (product_id, order_id);
---- OR ----
CREATE INDEX orderproducts_product_id ON orderproducts (product_id);
CREATE INDEX orderproducts_order_id ON orderproducts (order_id);

Fine on SQLite, broken in Postgresql: column must appear in the GROUP BY clause or be used in an aggregate function

I have a query which works fine on SQLite, but when I run it on the same data in Postgresql I get:
column "role.id" must appear in the GROUP BY clause or be used in an aggregate function
I have three tables, for people, exhibitions, and a table that links the two: "One person in one exhibition performing a particular role" (such as "Artist" or "Curator"):
CREATE TABLE "person" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "exhibition" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "role" (`id` integer NOT NULL PRIMARY KEY AUTOINCREMENT,
`name` varchar(30) NOT NULL,
`exhibition_id` integer NOT NULL,
`person_id` integer NOT NULL,
FOREIGN KEY(`exhibition_id`) REFERENCES `exhibition`(`id`),
FOREIGN KEY(`person_id`) REFERENCES `person`(`id`));
I want to display the people involved in an exhibition ordered by how many things they've done. So, I get the IDs of the people in an exhibition (1,2,3,4) and then do this:
SELECT
*,
COUNT(person.id) AS role_count
FROM person
INNER JOIN role
ON person.id = role.person_id
WHERE person.id IN ( 1, 2, 3, 4 )
GROUP BY person.id
ORDER BY role_count DESC
That orders the people by role_count, which is the number of roles they've had across all exhibitions
It works fine on SQLite, but not in Postgresql. I've tried putting role.id into the GROUP BY (instead of, and as well as, person.id) but that changes the results.
You know when you struggle for ages, post an SO question, and then immediately stumble on the answer?
From this answer I realised that I couldn't select role.id (which the SELECT * is implicitly doing) as it wasn't in the GROUP BY.
I couldn't add it to the GROUP BY (because that changes the results) so the solution was to not select it.
So I changed the SELECT part to:
SELECT
person.*,
COUNT(person.id) AS role_count
FROM person
...
Now role.id is not being selected. And that works.
If I needed any other fields from the role table, like name, I could add those explicitly too:
SELECT
person.*,
role.name,
COUNT(person.id) AS role_count
FROM person
...
Just like the error says, Standard SQL doesn't let you SELECT anything other than one of the GROUP BY columns or a call to an aggregate function. (For a logical reason: How would the RDBMS know which role.id to select when there are multiple rows to select from within a group?) PostgreSQL actually enforces this rule; SQLite ignores it and just returns data from an arbitrary row in the group.
As you discovered, omitting role.id from the SELECT fixes your error. But if you do want SQLite's behavior of selecting the ID from an arbitrary row, you can simply wrap it in an aggregate function, e.g., SELECT MAX(role.id) instead of just SELECT role.id).

Is there a database can represent set-like structure?

I mainly focus on the query operation, not union or intersection.
Here is an example.
Let say we have a multi-level category:
CATEGORY-TOP-LEVEL:
CATEGORY1:
CATEGORY1.1:
item1
CATEGORY2:
CATEGORY2.1:
item2
Here, item[N] is the data. Category is a tree structure to represent which category the item belongs to.
Now, suppose I'd like to query all data in category 1, the database should give me item1.
Suppose I'd like to query all data in category-top-level, the database should give me item1 and item2.
It's like set theory. Because item1 belongs to CATEGORY1.1, and CATEGORY1.1 belongs to CATEGORY1. Thus item1 belongs to CATEGORY1.
One solution is use Materialized Paths: We put an field in item, named path, the value is like: ",CATEGORY-TOP-LEVEL,CATEGORY1,CATEGORY1.2". But the problem is it will cause a lot of writing operations when I change a category's name or the hierarchy of the category.
Can MongoDB support that? if not, is there a database can support that?
P.S. Let's take query performance into consideration.
Every modern relational database can support that.
There are different ways of modeling this in a relational database, the most common one is called the "adjacency model":
create table category
(
id integer primary key,
name varchar(100) not null,
parent_category_id integer references category
);
If an item can only ever belong to a single category, the item table would look like this:
create table item
(
id integer primary key,
name varchar(100) not null,
category_id integer not null rerences category
);
If an item can belong to more then one category, you need a many-to-many relationship (also very common in the relational world)
To get all categories below a certain category you can use a recursive query:
with recursive cat_tree as (
select id, name, parent_category_id
from category
where id = 42 -- this is the category where you want to start
union all
select c.id, c.name, c.parent_category_id
from category c
join cat_tree p on p.id = c.parent_category_id
)
select *
from cat_tree;
To get the items together with the categories, just join the above to the item table.
The above query is standard ANSI SQL.
Other popular models are the nested set model, the materialized path (you mentioned that) and the closure table.
This gets asked a lot. See the tags recursive-query and hierarchical-data for many more examples.

postgreSQL table design

I need to create a table (postgresql 9.1) and I am stuck. Could you possibly help?
The incoming data can assume either of the two formats:
client id(int), shop id(int), asof(date), quantity
client id(int), , asof(date), quantity
The given incoming CSV template is: {client id, shop id, shop type, shop genre, asof, quantity}
In the first case, the key is -- client id, shop id, asof
In the second case, the key is -- client id, shop type, shop genre, asof
I tried something like:
create table(
client_id int references...,
shop_id int references...,
shop_type int references...,
shop_genre varchar(30),
asof date,
quantity real,
primary key( client_id, shop_id, shop_type, shop_genre, asof )
);
But then I ran into a problem. When data is of format 1, the inserts fail because of nulls in pk.
The queries within a client can be either by shop id, or by a combination of shop type and genre. There are no use cases of partial or regex matches on genre.
What would be a suitable design? Must I split this into 2 tables and then take a union of search results? Or, is it customary to put 0's and blanks for missing values and move along?
If it matters, the table is expected to be 100-500 million rows once all historic data is loaded.
Thanks.
You could try partial unique indexes aka filtered unique index aka conditional unique indexes.
http://www.postgresql.org/docs/9.2/static/indexes-partial.html
Basically what it comes down to is the uniqueness is filtered based on a where clause,
For example(Of course test for correctness and impact on performance):
CREATE TABLE client(
pk_id SERIAL,
client_id int,
shop_id int,
shop_type int,
shop_genre varchar(30),
asof date,
quantity real,
PRIMARY KEY (pk_id)
);
CREATE UNIQUE INDEX uidx1_client
ON client
USING btree
(client_id, shop_id, asof, quantity)
WHERE client_id = 200;
CREATE UNIQUE INDEX uidx2_client
ON client
USING btree
(client_id, asof, quantity)
WHERE client_id = 500;
A simple solution would be to create a field for the primary key which would use one of two algorithms to generate its data depending on what is passed in.
If you wanted a fully normalised solution, you would probably need to split the shop information into two separate tables and have it referenced from this table using outer joins.
You may also be able to use table inheritance available in postgres.

Implement 1:N relation in postgreSQL (object-relational)

I'm struggling with postgreSQL, as I don't know how to link one instance of type A to a set of instances of type B. I'll give a brief example:
Let's say we want to set up a DB containing music albums and people, each having a list of their favorite albums. We could define the types like that:
CREATE TYPE album_t AS (
Artist VARCHAR(50),
Title VARCHAR(50)
);
CREATE TYPE person_t AS (
FirstName VARCHAR(50),
LastName VARCHAR(50),
FavAlbums album_t ARRAY[5]
);
Now we want to create tables of those types:
CREATE TABLE Person of person_t WITH OIDS;
CREATE TABLE Album of album_t WITH OIDS;
Now as I want to make my DB as object-realational as it gets, I don't want to nest album "objects" in the row FavAlbums of the table Person, but I want to "point" to the entries in the table Album, so that n Person records can refer to the same Album record without duplicating it over and over.
I read the manual, but it seems that it lacks some vital examples as object-relational features aren't being used that often. I'm also familiar with the realational model, but I want to use extra tables for the relations.
Why you create a new type in postgresql to do what you need?
Why you don't use tables directly?
With n-n relation:
CREATE TABLE album (
idalbum integer primary key,
Artist VARCHAR(50),
Title VARCHAR(50)
);
CREATE TABLE person (
idperson integer primary key,
FirstName VARCHAR(50),
LastName VARCHAR(50)
);
CREATE TABLE person_album (
person_id integer,
album_id integer,
primary key (person_id, album_id),
FOREIGN KEY (person_id)
REFERENCES person (idperson),
FOREIGN KEY (album_id)
REFERENCES album (idalbum));
Or with a "pure" 1-n relation:
CREATE TABLE person (
idperson integer primary key,
FirstName VARCHAR(50),
LastName VARCHAR(50)
);
CREATE TABLE album (
idalbum integer primary key,
Artist VARCHAR(50),
Title VARCHAR(50),
person_id integer,
FOREIGN KEY (person_id)
REFERENCES person (idperson)
);
I hope that I help you.
Now as I want to make my DB as object-realational as it gets, I don't want to nest album "objects" in the row FavAlbums of the table Person, but I want to "point" to the entries in the table Album, so that n Person records can refer to the same Album record without duplicating it over and over.
Drop the array column, add an id primary key column (serial type) to each table, drop the oids (note that the manual recommends against using them). And add a FavoriteAlbum table with two columns (PersonId, AlbumId), the latter of which are a primary key. (Your relation is n-n, not 1-n.)
Sorry for answering my own question, but I just wanted to give some pieces of information I gained by toying around with that example.
ARRAY Type
I found out that the ARRAY Type in PostgreSQL is useful if you want to associate a variable number of values with one attribute, but only if you can live with duplicate entries. So that technique is not suitable for referencing "objects" by their identity.
References to Objects/Records by identity
So if you want to, as in my example, create a table of albums and want to be able to reference one album by more than one person, you should use a separate table to establish these relationships (Maybe by using the OIDs as keys).
Another crazy thing one could do is referencing albums by using an ARRAY of OIDs in the person table. But that is very awkward and really does not improve on the classic relational style.