PostgreSQL hierarchy(?) structure - postgresql

Please excuse my ignorance. I'm certain this is a FAQ, but I don't know the terminology well enough to know what to look for.
My company uses the following structure in terms of territory (example following):
Customer -> Market -> Area -> District -> Region
XYZ Co. -> Queens -> NYC -> Mid Atlantic -> Northeast
Each customer has only one market. Each market has only one area, and so forth. (I'm not sure if you'd call that one-to-many or many-to-one. I don't want to label it incorrectly.)
This is how I have things set up right now:
create table region(
id int not null primary key,
name varchar(24)
);
create table district(
id int not null primary key,
name varchar(24),
region_id int references region(id) on update cascade
);
create table area(
id int not null primary key,
name varchar(24),
district_id int references district(id) on update cascade
);
create table market(
id int not null primary key,
name varchar(24),
area_id int references area(id) on update cascade
);
create table customer(
id int not null primary key,
name varchar(32),
sixweekavg numeric,
market_id int references market(id) on update cascade
);
Right now I have an opportunity to improve that setup as I'm more or less rewriting the site. I looked at this popular page:
What are the options for storing hierarchical data in a relational database?
And I'm sure that my best scenario lies there, but I don't know enough to figure out which one.
It's a reporting site, so there are way more reads than writes. Some of my pages show aggregated data at each level, customer through region (and top, too). So right now on a page that shows district-level data I would write something like:
select d.name, sum(sixweekavg) as avg from customer c
inner join market m on m.id = c.market_id
inner join area a on a.id = m.area_id
inner join district d on d.id = a.district_id
group by d.name order by d.name;
Pretty standard stuff, right? I'm sure a whole separate conversation could be had about materialized views, but for now I'd like to explore a better option for structuring the hierarchy (if that's even the correct term for this).
So given the following summary
PostgreSQL (it can be assumed this will not change)
Fixed hierarchy (my employer may at some point add or remove a tier, but every row in the customer table will always have the same number of "parents")
Significantly more reads than writes
Is there one method that may be better than the others for setting this up?
ltree
I did look at ltree, but I'm not quite sure how that would work. On pages where a user can select a district, for example, I query the district table for the names of each district. I had the idea to add an ltree column in my customers table which would hold the hierarchy, but still maintain the other tables. Is that a feasible and reasonable approach? I've searched for real-world examples of ltree but came up short - most that I found were designed for a random number of parent/child nodes, like a threaded comment section.
I appreciate your help and your patience!
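For what it's worth, the idea described under "ltree" above (an ltree column on the customer table, alongside the existing tier tables) might look roughly like the sketch below. It assumes the ltree extension can be installed; the table name customer_path_demo, the second and third sample rows, and the label spelling (underscores instead of spaces) are made up for illustration.

```sql
-- Assumption: the ltree extension is available on the server.
create extension if not exists ltree;

-- Hypothetical stand-in for the customer table, with the hierarchy as a path.
create table customer_path_demo (
    id         int primary key,
    name       varchar(32),
    sixweekavg numeric,
    -- region.district.area.market, most general label first
    path       ltree not null
);

create index customer_path_demo_gist on customer_path_demo using gist (path);

-- Illustrative data only.
insert into customer_path_demo values
    (1, 'XYZ Co.',  10, 'Northeast.Mid_Atlantic.NYC.Queens'),
    (2, 'ABC Inc.',  5, 'Northeast.Mid_Atlantic.NYC.Brooklyn'),
    (3, 'Foo LLC',   7, 'Northeast.New_England.Boston.Back_Bay');

-- District-level totals: the first two labels identify a district.
select subpath(path, 0, 2) as district, sum(sixweekavg) as total
from customer_path_demo
group by subpath(path, 0, 2)
order by 1;

-- All customers under one district.
select name
from customer_path_demo
where path <@ 'Northeast.Mid_Atlantic';
```

Because the hierarchy is fixed, the label position alone tells you the tier, which is what makes the subpath grouping work. The tier tables can still be kept for populating drop-downs; the trade-off is that the path has to be kept in sync when a market moves to another area.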

Related

Do I really need an individual table for each of my three types of users?

I have three types of users: sellers, consumers, and salespeople. Should I keep their details (name, email, password, and other credentials) in a single table with a role_type table, or create a separate table for each type? Which is the best approach for a large project, considering DBMS engineering principles such as normalization?
Also, does it affect the performance of the app if certain operations need lots of joins between tables?
If the only thing that distinguishes those people is the role but all details are the same, then I would definitely go for a single table.
The question is however, can a single person have more than one role? If that is never the case, then add a role_type column to the person table. Depending on how fixed those roles are maybe use a lookup table and a foreign key, e.g.:
create table role_type
(
id integer primary key,
name varchar(20) not null unique
);
create table person
(
id integer primary key,
.... other attributes ...,
role_id integer not null references role_type
);
However, in my experience the restriction to exactly one role per person usually doesn't hold, so you would need a many-to-many relationship:
create table role_type
(
id integer primary key,
name varchar(20) not null unique
);
create table person
(
id integer primary key,
.... other attributes ...
);
create table person_role
(
person_id integer not null references person,
role_id integer not null references role_type,
primary key (person_id, role_id)
);
It sounds like this is a case of trying to model inheritance in your relational database. Complex topic, discussed here and here.
It sounds like your "seller, consumer, sales person" will need lots of different attributes and relationships. A seller typically belongs to a department, has targets, is linked to sales. A consumer has purchase history, maybe a credit limit, etc.
If that's the case, I'd suggest that "class table inheritance" might be the right solution.
That might look something like this.
create table user_account
(id int not null primary key,
username varchar not null,
password varchar not null
....);
create table buyer
(id int not null primary key,
user_account_id int not null references user_account(id),
credit_limit float not null,
....);
create table seller
(id int not null primary key,
user_account_id int not null references user_account(id),
sales_target float,
....);
To answer your other question - relational databases are optimized for joining tables. Decades of research and development have gone into this area, and a well-designed database (with indexes on the columns you're joining on) will show no noticeable performance impact due to joins. From practical experience, queries with hundreds of millions of records and ten or more joins run very fast on modern hardware.
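As a minimal sketch of "indexes on the columns you're joining on": PostgreSQL creates an index for a primary key automatically, but not for the referencing column of a foreign key, so that one usually has to be added by hand. The tables below are cut-down, hypothetical versions of the ones above (only a few columns, and numeric instead of float for the credit limit), and the index name is made up.

```sql
-- Cut-down, illustrative versions of the tables discussed above.
create table user_account (
    id       int primary key,          -- indexed automatically
    username varchar not null
);

create table buyer (
    id              int primary key,
    user_account_id int not null references user_account(id),
    credit_limit    numeric not null
);

-- The foreign key column is not indexed automatically; add the index
-- so joins from user_account to buyer can use it.
create index buyer_user_account_id_idx on buyer (user_account_id);

-- A join that can use both indexes.
select u.username, b.credit_limit
from user_account u
join buyer b on b.user_account_id = u.id;
```

Running the query under EXPLAIN on realistic data volumes is the way to check whether the planner actually uses the index.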

PostgreSQL: Link people and addresses by year

I am building my first PostgreSQL database. It covers where people lived and worked over several decades (1890 to 1930). I have people, address, and restaurant name tables. The people moved around, changing both their residences and their places of work.
How do I establish the link to say from a person to the address for certain years? In other words there might be from 1 to ~20 years (some people stayed put), but I'll want to query for each year (actually it's going to become a map).
I understand that if it were only once, it would be a foreign key.
I'm also going to be linking restaurant names to various addresses. In some cases I only have the names of the owners, so I'll have a link by year and whether they were employees or owners. I'll tackle that one next. Maybe with the first question answered I'll see my way to this one.
Thanks for any help.
You need a m:n relationship qualified by date. That's typically implemented with a "join table".
e.g. given:
create table person (
person_id integer primary key,
...
);
create table address (
address_id integer primary key,
...
);
you might write a table like:
create table residence_period (
person_id integer references person(person_id),
address_id integer references address(address_id),
moved_in date not null,
moved_out date,
constraint moved_in_before_moved_out
check (moved_out is null or moved_in < moved_out)
);
You could use a daterange type instead of two fields if you preferred.
If you want to assert that someone didn't have multiple overlapping residences you can use an exclusion constraint.
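A rough sketch of the daterange variant with an exclusion constraint follows. It assumes PostgreSQL 9.2 or later and the btree_gist extension (needed so the integer person_id can take part in a GiST exclusion constraint). The name and street columns and the sample rows are placeholders, not part of the original schema.

```sql
-- Assumption: btree_gist is available; it lets "=" on integers be used in GiST.
create extension if not exists btree_gist;

-- Minimal placeholder tables; real ones would carry more columns.
create table person  (person_id  integer primary key, name   text);
create table address (address_id integer primary key, street text);

create table residence_period (
    person_id  integer not null references person(person_id),
    address_id integer not null references address(address_id),
    period     daterange not null,
    -- no two rows for the same person may have overlapping periods
    exclude using gist (person_id with =, period with &&)
);

-- Illustrative data: one move on 1900-01-01, second stay open-ended.
insert into person  values (1, 'A. Example');
insert into address values (10, '1 First St'), (20, '2 Second St');
insert into residence_period values
    (1, 10, daterange('1890-01-01', '1900-01-01')),
    (1, 20, daterange('1900-01-01', null));

-- Where did each person live at some point during 1895?
select p.name, a.street
from residence_period r
join person  p using (person_id)
join address a using (address_id)
where r.period && daterange('1895-01-01', '1896-01-01');
```

Ranges built with daterange(lower, upper) include the lower bound and exclude the upper bound by default, so the two periods above touch without overlapping. The per-year query is the piece that would feed a map for a given year.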

postgreSQL table design

I need to create a table (postgresql 9.1) and I am stuck. Could you possibly help?
The incoming data can assume either of the two formats:
client id(int), shop id(int), asof(date), quantity
client id(int), shop type(int), shop genre(varchar), asof(date), quantity
The given incoming CSV template is: {client id, shop id, shop type, shop genre, asof, quantity}
In the first case, the key is -- client id, shop id, asof
In the second case, the key is -- client id, shop type, shop genre, asof
I tried something like:
create table(
client_id int references...,
shop_id int references...,
shop_type int references...,
shop_genre varchar(30),
asof date,
quantity real,
primary key( client_id, shop_id, shop_type, shop_genre, asof )
);
But then I ran into a problem. When data is of format 1, the inserts fail because of nulls in pk.
The queries within a client can be either by shop id, or by a combination of shop type and genre. There are no use cases of partial or regex matches on genre.
What would be a suitable design? Must I split this into 2 tables and then take a union of search results? Or, is it customary to put 0's and blanks for missing values and move along?
If it matters, the table is expected to be 100-500 million rows once all historic data is loaded.
Thanks.
You could try partial unique indexes (also known as filtered or conditional unique indexes).
http://www.postgresql.org/docs/9.2/static/indexes-partial.html
Basically, uniqueness is enforced only for the rows that match a WHERE clause.
For example (test for correctness and performance impact, of course):
CREATE TABLE client(
pk_id SERIAL,
client_id int,
shop_id int,
shop_type int,
shop_genre varchar(30),
asof date,
quantity real,
PRIMARY KEY (pk_id)
);
CREATE UNIQUE INDEX uidx1_client
ON client
USING btree
(client_id, shop_id, asof, quantity)
WHERE client_id = 200;
CREATE UNIQUE INDEX uidx2_client
ON client
USING btree
(client_id, asof, quantity)
WHERE client_id = 500;
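Applied to the two key formats in the question, the same technique might look like the sketch below: one partial unique index per format, selected by whether shop_id is present. This is an assumption about how the two formats can be told apart, and the table and index names are hypothetical.

```sql
-- Hypothetical table; columns follow the CSV template in the question.
create table client_quantity (
    client_id  int  not null,
    shop_id    int,
    shop_type  int,
    shop_genre varchar(30),
    asof       date not null,
    quantity   real
);

-- Format 1: key is (client id, shop id, asof).
create unique index client_quantity_by_shop
    on client_quantity (client_id, shop_id, asof)
    where shop_id is not null;

-- Format 2: key is (client id, shop type, shop genre, asof).
create unique index client_quantity_by_type_genre
    on client_quantity (client_id, shop_type, shop_genre, asof)
    where shop_id is null;

-- One illustrative row per format.
insert into client_quantity (client_id, shop_id, asof, quantity)
values (1, 7, '2012-01-31', 3);
insert into client_quantity (client_id, shop_type, shop_genre, asof, quantity)
values (1, 2, 'grocery', '2012-01-31', 5);
```

Note that a unique index treats NULLs as distinct, so a format 2 row with a NULL shop_type or shop_genre would escape the second index; NOT NULL checks for that case would have to be added separately.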
A simple solution would be to create a field for the primary key which would use one of two algorithms to generate its data depending on what is passed in.
If you wanted a fully normalised solution, you would probably need to split the shop information into two separate tables and have it referenced from this table using outer joins.
You may also be able to use table inheritance available in postgres.

Is this kind of DB relation design favourable and correct? Should it be converted to a no-sql solution?

First of all, I did my research, but being rather a newbie I am not well acquainted with the terminology, so I may have failed to find the right search terms. I beg your pardon if this is a duplicate.
Question #1:
I have a table consisting of ID [PK] and LABEL [Varchar 128]. Each record (row) here is unique. What I want is, to define relations between these LABELS.
Requisite:
There will be n groups, each group containing one or more of these LABELS. In each group, each LABEL can either exist or not exist (meaning a group cannot contain the same LABEL twice).
How should I define this relation?
I thought of creating another table with ID [PK] - Group ID [randomly assigned unique key] - LABEL_ID [ID of Labels table pointing to a single Label]
Is this correct and favourable? If a group has 10 LABELS then there will be 10 records with unique ID, same uniquely assigned Group ID and LABEL_ID pointing to LABELS table.
Question #2:
Should I let go of the relational solution (as described above) and opt for a NoSQL solution, where each group is stored on its own as a single entry in the database with an ID [PK] - Data [containing either labels or IDs of labels pointing to the Label table]?
If NoSQL is the way to go, how should I store this data?
a) Should I have ID - Data (containing Labels)?
b) ID - Data (containing IDs of Labels)?
Question #3:
If NoSQL solution here is the best way, which NoSQL database should I choose for this use case?
Thank you.
There's no real need for an ID column in this GroupLabels table:
CREATE TABLE GroupLabels (
GroupID int not null,
LabelID int not null,
constraint PK_GroupLabels PRIMARY KEY (GroupID,LabelID),
constraint FK_GroupLabels_Groups FOREIGN KEY (GroupID) references Groups,
constraint FK_GroupLabels_Labels FOREIGN KEY (LabelID) references Labels
)
By doing the above, we've automatically achieved a constraint - that the same label can't be added to the same group more than once.
With the above, I'd say it's a reasonably common SQL solution.
There is too little information here to make recommendations on the question of "to SQL or not to SQL".
However, the relational approach would be as you describe, I think.
CREATE TABLE Group
(
GroupId int PRIMARY KEY
)
CREATE TABLE GroupLabel
(
GroupId int FOREIGN KEY REFERENCES Group,
LabelId int FOREIGN KEY REFERENCES Label,
UNIQUE (GroupId, LabelId)
)
CREATE TABLE Label
(
LabelId int PRIMARY KEY,
Value varchar(100) UNIQUE
)
Here, every label is unique. Many labels may be in each group and each label may be in many groups, but each label can only be in each group once.
As @Damien_The_Unbeliever indicates, the Group table can be omitted if you don't need to store any additional attributes about each group by making the GroupId column on the GroupLabels table solely unique.
You might need to change the syntax slightly for whatever RDBMS you're using.

Database schema dilemma with foreign keys pointing to multiple tables (exclusive arc)

Hopefully my description is a little better than the title, but basically I'm having an issue with one part of a new application schema and I'm stuck on what the most manageable and elegant table structure would be.
Bare bones table structure with only relevant fields showing would be as follows:
airline (id, name, ...)
hotel (id, name, ...)
supplier (id, name, ...)
event (id, name,...)
eventComponent (id,name) {e.g Food Catering, Room Hire, Audio/Visual...}
eventFlight (id, eventid, airlineid, ...)
eventHotel (id, eventid, hotelid, ...)
eventSupplier (id, eventid, supplierid, hotelid, eventcomponentid, ...)
So airline, hotel, and supplier are all reference tables, and an Event is created with one-to-many relationships to these reference tables. E.g. an Event may have 2 flight entries, 3 other component entries, and 2 hotel entries. But the issue is that in the EventSupplier table the supplier can be either a Supplier or an existing Hotel. So after the user has built their new event on the front end, I need to store this in a fashion that doesn't make it a nightmare to return this data later.
I've been doing a lot of reading on polymorphic relations and exclusive arcs, and I think my scenario is definitely more along the lines of an exclusive arc relationship.
I was thinking:
CREATE TABLE eventSupplier (
id SERIAL PRIMARY KEY,
eventid INT NOT NULL,
hotelid INT,
supplierid INT,
UNIQUE (eventid, hotelid, supplierid), -- UNIQUE permits NULLs
CHECK (hotelid IS NOT NULL OR supplierid IS NOT NULL),
FOREIGN KEY (hotelid) REFERENCES hotel(id),
FOREIGN KEY (supplierid) REFERENCES supplier(id)
);
And then for the retrieval of this data just use an outer join to both tables to work out which one is linked.
select es.eventid, coalesce(h.name,s.name) as supplier
from eventSupplier es
left outer join
supplier s on s.id = es.supplierid
left outer join
hotel h on h.id = es.hotelid
where h.id is not null OR s.id is not null
My other options were to have a single foreign key in the eventSupplier table with another field for the "type", which seems harder to retrieve data from, though it does seem quite flexible if I want to extend this down the track without making schema changes. Alternatively, I could store the hotelid directly in the Supplier table and just declare some suppliers as being a "hotel", though there would then be redundant data, which I don't want.
Any thoughts on this would be much appreciated!
Cheers
Phil
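One note on the table proposed in the question: its CHECK only requires at least one of hotelid and supplierid, and a plain UNIQUE over nullable columns does not stop duplicates, because NULLs count as distinct. If the intent is a strict exclusive arc (exactly one of the two set), a sketch could look like the following. The table name event_supplier_sketch and the index names are hypothetical, and the hotel and supplier tables are reduced to the bare minimum.

```sql
-- Minimal reference tables for the sketch.
create table hotel    (id serial primary key, name text not null);
create table supplier (id serial primary key, name text not null);

create table event_supplier_sketch (
    id         serial primary key,
    eventid    int not null,
    hotelid    int references hotel(id),
    supplierid int references supplier(id),
    -- exclusive arc: exactly one of the two references must be set
    check ((hotelid is null) <> (supplierid is null))
);

-- Enforce uniqueness separately for each side of the arc.
create unique index event_supplier_hotel_uq
    on event_supplier_sketch (eventid, hotelid)
    where hotelid is not null;

create unique index event_supplier_supplier_uq
    on event_supplier_sketch (eventid, supplierid)
    where supplierid is not null;
```

With that CHECK in place, the coalesce(h.name, s.name) retrieval from the question always finds exactly one name per row, and the final WHERE filter becomes unnecessary.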
How about handling events one-by-one and using an EventGroup to group them together?
EDIT:
I have simply renamed entities to fit the latest comments. This is as close as I can get to this; admittedly, I do not understand the problem properly.
A good way to test your solution is to think about what would happen if an airline became a supplier. Does your solution handle that, or does it start to get complicated?
Why do you explicitly need to find hotel data down the supplier route if you don't need that level of data for other types of supplier? I would suggest that a supplier is a supplier, whether it's a hotel or not, for these purposes.
If you want to flag a supplier as a hotel, then simply put hotelid on the supplier table or else wait and hook in the supplier later via whatever mechanism you use to get detail on other suppliers.