How to decompose the relation into 5NF? - database-normalization

The example provided in the Fifth Normal Form has ACP(Agent, Company, Product) relation with the following data:
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | bus |
| Jones | Ford | car |
-----------------------------
The rule applied is: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
The relation is decomposed into 3 relations according to constraint 3D: AC(agent, company), AP(agent, product) and CP(company, product). Hence the join dependency is *{(agent, company), (agent, product), (company, product)}. According to the definition of 5NF, a table R is in fifth normal form (5NF) or project-join normal form (PJ/NF) if and only if every join dependency in R is implied by the keys of R. But none of the projections contain a key, since the key is {agent, company, product} itself.
Another example provided in the book by C.J. Date contains a similar relation shipments(supplier_number, part_number, project_number) and the relation is decomposed similarly, citing constraint 3D. However, the definition of 5NF doesn't mention constraint 3D.
So, I have a few questions on the scenario presented above:
What is constraint 3D?
What if the relation has one more attribute "region" making ACPR(agent, company, product, region)?
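The join dependency mentioned in the question can be checked mechanically against the sample data. The following is only a minimal Python sketch (the variable names are made up for illustration): it projects ACP onto AC, AP and CP, joins the projections back, and compares the result with the original rows. It shows that joining only two projections produces spurious rows, while joining all three reproduces the relation. It says nothing about whether the dependency holds for every legal state of the relation, which is what 5NF is concerned with.

```python
# Sample rows of ACP(agent, company, product) from the question.
acp = {
    ("Smith", "Ford", "car"),
    ("Smith", "Ford", "truck"),
    ("Smith", "GM", "car"),
    ("Smith", "GM", "bus"),
    ("Jones", "Ford", "car"),
}

# The three binary projections.
ac = {(a, c) for a, c, p in acp}
ap = {(a, p) for a, c, p in acp}
cp = {(c, p) for a, c, p in acp}

# Natural join of AC and AP on agent only.
two_way = {(a, c, p) for a, c in ac for a2, p in ap if a == a2}

# Joining CP as well keeps only rows whose (company, product) pair exists.
three_way = {(a, c, p) for a, c, p in two_way if (c, p) in cp}

print(len(two_way), "rows after joining two projections")
print(three_way == acp, "<- three-way join equals the original relation")
```

For this data the two-way join contains rows such as (Smith, Ford, bus) that were never in ACP; the third projection is what filters them out.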

Related

How can I compare 2 tables in PostgreSQL?

I have a table named hotel with 2 columns: hotel_name and hotel_price
hotel_name | hotel_price
hotel1 | 5
hotel2 | 20
hotel3 | 100
hotel4 | 50
and another table named city that contains the columns city_name and average_prices
city_name | average_prices
paris | 20
london | 30
rome | 75
madrid | 100
I want to find which hotels have a price that is higher than the average prices in the cities. For example, in the end I want to get something like this:
hotel_name | city_name
hotel3 | paris --hotel3 is more expensive than the average price in paris
hotel3 | london --hotel3 is more expensive than the average price in london etc.
hotel3 | rome
hotel4 | paris
hotel4 | london
(I found the hotels that are more expensive than the average prices of the cities)
Any help would be valuable, thank you.
A simple join is all that is needed. Typically tables are joined on a defined relationship (PK/FK), but nothing requires that; the join condition can be any boolean expression.
select h.hotel_name, c.city_name
from hotel h
join city c
on h.hotel_price > c.average_prices;
However, while you can get the desired results, it's pretty meaningless. You cannot tell whether a particular hotel is even in a given city.
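To try the non-equi join without a PostgreSQL server, here is a small sketch that runs the same query against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here (an assumption, the question is about PostgreSQL); the table and column names are taken from the question, and the ORDER BY is added just to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotel (hotel_name TEXT, hotel_price INTEGER);
CREATE TABLE city (city_name TEXT, average_prices INTEGER);
INSERT INTO hotel VALUES ('hotel1', 5), ('hotel2', 20), ('hotel3', 100), ('hotel4', 50);
INSERT INTO city VALUES ('paris', 20), ('london', 30), ('rome', 75), ('madrid', 100);
""")

# Join on an inequality instead of a key: every (hotel, city) pair
# where the hotel is strictly more expensive than the city's average.
rows = conn.execute("""
SELECT h.hotel_name, c.city_name
FROM hotel h
JOIN city c ON h.hotel_price > c.average_prices
ORDER BY h.hotel_name, c.average_prices
""").fetchall()

for hotel_name, city_name in rows:
    print(hotel_name, city_name)
```

Note that hotel2 (price 20) is not paired with paris (average 20), because the comparison is strict.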

Postgres database: how to model multiple attributes that can have also multiple value, and have relations to other two entities

I have three entities, Items, Categories, and Attributes.
An Item can be in one or multiple Categories, so there is N:M relation.
Item ItemCategories Categories
id name item_id category_id id name
1 alfa 1 1 1 chipset
1 2 2 interface
An Item can have multiple Attributes depending on the 'Categories' they are in.
For example, the items in Category 'chipset' can have as attributes: 'interface', 'memory' 'tech'.
These attributes have a set of predefined values that don't change often, but they can change.
For example: 'memory' can only be ddr2, ddr3, ddr4.
Attributes CategoryAttributes
id name values category_id attribute_id
1 memory {ddr2, ddr3, ddr4} 1 1
An Item that is in the 'chipset' Category has access to the Attribute and can only have Null or the predefined value of the attribute.
I thought to use Enum or Json for Attribute values, but I have two other conditions:
ItemAttributes
item_id attribute_id value
1 1 {ddr2, ddr4}
1) If an Attribute appears in 2 Categories, and an Item is in both categories, the attribute should be shown only once.
2) I need to use the value for ranking, so if two corresponding attribute values appear for an item, the rank should be greater than if only one appears, or the value doesn't exist.
3) Creating separate tables for Attributes is not an option, because the number is not fixed, and can be big.
So, I don't know exactly what the best database design options are to constrain the values and use them for order ranking.
The problem you are describing is a typical open schema or vertical database, which is a classic use case for some kind of EAV model.
EAV is a complex yet powerful paradigm that allows a potentially open schema while respecting the database normal forms, and it gives you what you need: a variable number of attributes depending on specific instances of the same entity.
That is what typically happens in e-commerce on relational databases, since different products have different attributes (e.g. a lipstick has a color, but for a hard drive you probably don't care about color, only about capacity). It doesn't make sense to have a single wide table with a column per attribute, because the number of attributes is not fixed and can be big, and for most rows there would be a lot of NULL values (that is the mathematical notion of a sparse matrix, which looks very ugly in a DB table).
You can take a look at Magento DB Model, a true reference in pure EAV at scale, or Wikipedia, but probably you can do that later, and for now, you just need the basics:
The basic idea is to store attributes, and their corresponding values as rows, instead of columns, in a single table.
In the simplest implementation the table has at least three columns: entity (usually a foreign key to an entity, or entity type/category), attribute (this can be a string, or a foreign key in more complex systems), and value.
In my previous example, oversimplifying, we could have a table like this, which lists attribute names and their values for each item
Item table Attributes table
+------+--------------+ +-------------+-----------+-------------+
| id | name | | item_id | attribute | value |
+------+--------------+ +-------------+-----------+-------------+
| 1 | "hard drive" | | 2 | "color" | "red" |
+------+--------------+ +-------------+-----------+-------------+
| 2 | "lipstick" | | 2 | "price" | 10 |
+------+--------------+ +-------------+-----------+-------------+
| 1 | "capacity"| "1TB" |
+-------------+-----------+-------------+
| 1 | "price" | 200 |
+-------------+-----------+-------------+
So for every item, you can have a list of attributes.
Since your model is more complex and has a few more constraints, we need to adapt this model.
Since you want to limit the possible values, you will need a table for values
Since you will have a values table, the values have to refer to an attribute, so you need the attributes to have an id, so you will have an attribute table
To make explicit and strict what categories have what attributes, you need a category-attribute table
With this, you end up with something like
Categories table
List of categories ids and names
+------+--------------+
| id | name |
+------+--------------+
| 1 | "chipset" |
+------+--------------+
| 2 | "interface" |
+------+--------------+
Attributes table
List of attribute ids and their name
+------+--------------+
| id | name |
+------+--------------+
| 1 | "interface" |
+------+--------------+
| 2 | "memory" |
+------+--------------+
| 3 | "tech" |
+------+--------------+
| 4 | "price" |
+------+--------------+
Category-Attribute table
What category has what attributes. Note that one attribute (e.g. attribute 4) can belong to 2 categories
+--------------+--------------+
| attribute_id | category_id |
+--------------+--------------+
| 1 | 1 |
+--------------+--------------+
| 2 | 1 |
+--------------+--------------+
| 3 | 1 |
+--------------+--------------+
| 4 | 1 |
+--------------+--------------+
| 4 | 2 |
+--------------+--------------+
Value table
List of possible values for every attribute
+----------+--------------+----------+
| value_id | attribute_id | value    |
+----------+--------------+----------+
| 1        | 2            | "ddr2"   |
+----------+--------------+----------+
| 2        | 2            | "ddr3"   |
+----------+--------------+----------+
| 3        | 2            | "ddr4"   |
+----------+--------------+----------+
| 4        | 3            | "tech_1" |
+----------+--------------+----------+
| 5        | 3            | "tech_2" |
+----------+--------------+----------+
| 6        | ...          | ...      |
+----------+--------------+----------+
| 7        | ...          | ...      |
+----------+--------------+----------+
And finally, as you might expect, the Item-Attribute table
Lists one attribute value per row (the value column holds a value_id from the Value table)
+----------+--------------+-------+
| item_id  | attribute_id | value |
+----------+--------------+-------+
| 1        | 2            | 1     |
+----------+--------------+-------+
| 1        | 2            | 3     |
+----------+--------------+-------+
Meaning that item 1, for attribute 2 (`memory`), has values 1 and 3 (`ddr2` and `ddr4`)
This will cover all your conditions:
Number of attributes is unlimited, as big as needed and not fixed
You can define clearly what category has what attributes
Two categories can have the same attribute
If 1 item belongs to two categories that have the same attribute, you can show it only once (i.e. SELECT DISTINCT attribute_id FROM CategoryAttributes WHERE category_id IN (SELECT category_id FROM ItemCategories WHERE item_id = ...) will give you the list of eligible attributes, each of them only once even if 2 categories share it)
You can do a rank. I don't think I have enough info to write this query, but since this is a fully normalized model, you can definitely do a rank. You have pretty much the full model here, so you should be able to figure out the query.
This is very similar to the model that Magento uses. It is very powerful but of course, it can get hard to manage, but it is the best way if we want to keep the model strict and make sure that it will enforce the constraints and that will accept all the SQL functions.
For systems less strict, it is always an option to go for a NoSQL database with much more flexible schemas.
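The model above can be sketched end to end as follows. This is only an illustration under several assumptions: it uses Python's sqlite3 module with an in-memory SQLite database instead of Postgres, the snake_case table names are my own, and the composite foreign key on (attribute_id, value_id) is one possible way to make sure an item can only take a value that is predefined for that same attribute. Ranking is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE item_category (item_id INTEGER, category_id INTEGER,
                            PRIMARY KEY (item_id, category_id));
CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category_attribute (category_id INTEGER,
                                 attribute_id INTEGER REFERENCES attribute (id),
                                 PRIMARY KEY (category_id, attribute_id));
CREATE TABLE attribute_value (value_id INTEGER PRIMARY KEY,
                              attribute_id INTEGER NOT NULL REFERENCES attribute (id),
                              value TEXT NOT NULL,
                              UNIQUE (attribute_id, value_id));
-- The composite FK ties each chosen value to its own attribute.
CREATE TABLE item_attribute (item_id INTEGER, attribute_id INTEGER, value_id INTEGER,
                             PRIMARY KEY (item_id, attribute_id, value_id),
                             FOREIGN KEY (attribute_id, value_id)
                                 REFERENCES attribute_value (attribute_id, value_id));

INSERT INTO item_category VALUES (1, 1), (1, 2);
INSERT INTO attribute VALUES (1, 'interface'), (2, 'memory'), (3, 'tech'), (4, 'price');
INSERT INTO category_attribute VALUES (1, 1), (1, 2), (1, 3), (1, 4), (2, 4);
INSERT INTO attribute_value VALUES (1, 2, 'ddr2'), (2, 2, 'ddr3'), (3, 2, 'ddr4');
INSERT INTO item_attribute VALUES (1, 2, 1), (1, 2, 3);
""")

# Attributes available to item 1; 'price' is in both categories but listed once.
eligible = [name for _id, name in conn.execute("""
SELECT DISTINCT a.id, a.name
FROM item_category ic
JOIN category_attribute ca ON ca.category_id = ic.category_id
JOIN attribute a ON a.id = ca.attribute_id
WHERE ic.item_id = ?
ORDER BY a.id
""", (1,))]

# Values actually stored for item 1.
memory_values = [value for (value,) in conn.execute("""
SELECT av.value
FROM item_attribute ia
JOIN attribute_value av ON av.value_id = ia.value_id
WHERE ia.item_id = ?
ORDER BY av.value_id
""", (1,))]

# Value 1 ('ddr2') belongs to 'memory', so using it for 'price' must fail.
try:
    conn.execute("INSERT INTO item_attribute VALUES (1, 4, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(eligible, memory_values, rejected)
```

The sketch does not check that an item's attribute is actually allowed by one of its categories; that would need an extra constraint or a trigger.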

Database design with multiple units and multiple attributes

Supposing I have this database design which I have researched.
Table: Products
ProductId | Name | BaseUnitId
1 | Lab gown | 1
2 | Gloves | 1
FK: BaseUnitId references Units.UnitId
Table: Units
UnitId | Name
1 | Each / Pieces
2 | Dozen
3 | Box
Table: Unit Conversion
ProdID | BaseUnitID | Factor | ConvertToUnitID
1 | 1 | 12 | 2
2 | 1 | 100 | 3
FK: BaseUnitId references Units.UnitID
FK: ConvertToUnitId references Units.UnitID
Table: Product Attribute
AttribId | Prod_ID | Attribute | Value
1 | 1 | Color | Blue
2 | 1 | Size | Large
3 | 2 | Color | Violet
4 | 2 | Size | Small
5 | 2 | Size | Medium
6 | 2 | Size | Large
7 | 2 | Color | White
FK: Prod_ID references Product.ProductID
Table: Inventory
Prod_ID | Base Unit Qty | Expiry
1 | 12 | n/a
2 | 100 | 2020-01-01
2 | 100 | 2021-12-31
FK: Prod_ID references Product.ProductID
How can I break down the inventory per unit per attribute?
e.g. How can I get the inventory of SMALL VIOLET GLOVES? LARGE WHITE GLOVES?
Any suggestions? My idea is to create another table which will link product unit, product attribute and quantity.
But I don't know how to link the size attribute and color attribute to a unit.
Lastly, is there something wrong with this design?
I think it is quite wrong to split off the attributes of a product into a different table. I understand the desire to normalize, but it should be done differently.
I'd handle a product and its attributes like this:
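-- Assumes lookup tables unit, product_color and product_size already exist.
CREATE TABLE product (
id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
name text NOT NULL,
baseunit_id bigint NOT NULL REFERENCES unit
);
CREATE TABLE inventory (
id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
product_id bigint NOT NULL REFERENCES product,
-- common attributes get their own column; NULL means "does not apply"
color integer REFERENCES product_color,
size integer REFERENCES product_size,
-- rare attributes are bunched together here
other_attributes jsonb
);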
That also makes sense if you think about it in natural language terms: “How many dozens of large blue gloves do we have in stock?”
Attributes that do not apply to a certain product can be left NULL.
I make a distinction between common and rare attributes. Common attributes have their own column. Rare attributes are bunched together in a jsonb column. I know that the latter is not normalized nor pretty, but varying attributes are not very suited for a relational model. A GIN index on the column will allow searches to be efficient.

How to find out the keywords in two hadoop tables with Spark?

I have two tables in HDFS. One table (Table-1) has some keywords, as you can see below. Another table (Table-2) has a text column. Every row of Table-2 could contain more than one keyword from Table-1. I need to find all the keywords from Table-1 that match the text column in Table-2, and output the keyword list for every row in Table-2.
Example :
Table-1:
ID | Name | Age | City | Gender
---------------------------------
111 | Michael | 19 | NY | male
222 | George | 23 | CA | male
333 | Linda | 22 | LA | female
Table-2:
Text_Description
------------------------------------------------------------------------
1-Linda and my cousin left the house.
2-Michael who is 19 years old, and George are going to a rock concert in CA.
3-Shopping card is ready at the NY for male persons.
Output:
1- Linda
2- Michael, 19, George, CA
3- NY, male
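The matching itself can be sketched in plain Python before worrying about Spark. The sketch below is not Spark code and makes a few assumptions: every Table-1 cell except ID counts as a keyword (inferred from the expected output), matching is case-sensitive and on whole words, and the function name matched_keywords is made up. In Spark the same per-row function could in principle be applied to the text column with the keyword set shared across workers; that part is not shown here.

```python
import re

# Table-1 rows from the question (spelling "Michael" used consistently).
table1 = [
    {"ID": 111, "Name": "Michael", "Age": 19, "City": "NY", "Gender": "male"},
    {"ID": 222, "Name": "George", "Age": 23, "City": "CA", "Gender": "male"},
    {"ID": 333, "Name": "Linda", "Age": 22, "City": "LA", "Gender": "female"},
]

# Table-2 text column, without the row labels.
texts = [
    "Linda and my cousin left the house.",
    "Michael who is 19 years old, and George are going to a rock concert in CA.",
    "Shopping card is ready at the NY for male persons.",
]

# Assumption: every cell except ID is a keyword.
keywords = {str(value) for row in table1 for key, value in row.items() if key != "ID"}


def matched_keywords(text, keywords):
    """Return the keywords found in text, in order of first appearance."""
    found = []
    for token in re.findall(r"\w+", text):  # whole-word tokens only
        if token in keywords and token not in found:
            found.append(token)
    return found


results = [matched_keywords(text, keywords) for text in texts]
for number, result in enumerate(results, start=1):
    print(f"{number}- {', '.join(result)}")
```

Because tokens are compared as whole words, "male" does not match inside "female".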

Group with levenshtein distance

I have PostgreSQL 9.2.
My task is to find similar names in a table (limited by some Levenshtein distance).
For example, the distance is 3, the table has data:
| name |
|***************************|
| Marcus Miller |
| Marcos Miller |
| Macus Miler |
| David Bowie |
| Dave Grohl |
| Dav Grol |
| ... |
The result I want to get is like this:
| Marcus Miller, Marcos Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
Or
| Marcus Miller, Marcos Miller |
| Marcus Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
I tried this:
SELECT a.name, b.name
FROM my_table a
JOIN my_table b ON b.id < a.id AND levenshtein(b.name, a.name) < 3;
But it is too slow with my data.
There is a significant conceptual error with your question; GROUP BY takes certain equivalence relations (in the mathematical sense) as an argument and uses that to partition a SQL relation into equivalence classes.
The problem is that the relation you describe, namely, "are two strings within a certain edit distance of each other", is not an equivalence relation. It's symmetric and reflexive, but not transitive. To illustrate, what should the answer be if I added a series of names to your dataset that morphed "Marcus Miller" into "Dave Grohl", with each name in the series being within that edit distance from the previous?
However, there are algorithms to partition a data set using things that aren't equivalence relations, such as geometrical distance. K-means clustering is one of the best known examples. Perhaps there is a way to adapt k-means or something similar to this problem, I don't know.
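The lack of transitivity can be seen directly on the sample names. The sketch below is one possible workaround rather than something the answer proposes: it treats names as graph nodes, links every pair with distance below the limit (using `< 3` as in the question's query), and returns the connected components as groups. The function names are made up, and the edit distance is a plain Python implementation, not PostgreSQL's levenshtein(). It still compares all pairs, so it does not address the speed problem, and chains of near matches will merge into one group exactly as the answer warns.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with insert, delete and substitute, via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def group_by_distance(names, limit):
    """Group names into connected components of the 'distance < limit' graph."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if levenshtein(names[i], names[j]) < limit:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["Marcus Miller", "Marcos Miller", "Macus Miler",
         "David Bowie", "Dave Grohl", "Dav Grol"]

for group in group_by_distance(names, 3):
    print(", ".join(group))
```

Here "Marcos Miller" and "Macus Miler" end up in the same group even though their distance is 3, only because both are close to "Marcus Miller".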