To what level is this schema normalized?

logical design:
Pet(name, type, birthday, cost)
determinants:
name->type
name->birthday
name->cost
Here is some data:
name type birthday cost
Bruno cat 1/1/1982 free
Poppy cat 1/2/1982 20.00
Silly cat 12/2/1995 free
Sam dog 2/3/1989 100.00
Tuffy dog 3/3/1974 free
There's repeated data between rows but no duplicate columns. I think it's in BCNF.

Yes, the schema is in BCNF, provided the dependencies given are a cover of all the dependencies holding in the schema. In this case, name is the only candidate key, and the left-hand side of each non-trivial dependency (including those implied by the cover) is a superkey. So the relation is in BCNF.
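This argument can be verified mechanically with the standard attribute-closure algorithm. A minimal Python sketch, using the Pet schema and the dependencies listed above:

```python
# Attribute closure: repeatedly absorb the right-hand side of every FD
# whose left-hand side is already contained in the closure.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

pet = {"name", "type", "birthday", "cost"}
pet_fds = [({"name"}, {"type"}),
           ({"name"}, {"birthday"}),
           ({"name"}, {"cost"})]

# name determines every attribute, so it is a candidate key ...
print(closure({"name"}, pet_fds) == pet)  # True
# ... and every given FD has a superkey determinant, so BCNF holds.
print(all(closure(lhs, pet_fds) == pet for lhs, _ in pet_fds))  # True
```

The same closure function works for any set of FDs, which is why it reappears in normalization exercises like the ones below.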


Database Normalization mistake

I'm preparing an exam and on my texts I found an example I don't understand.
On the Relation R(A,B,C,D,E,F) I got the following functional dependencies:
FD1 A,B -> C
FD2 C -> B
FD3 C,D -> E
FD4 D -> F
Now I think all the FDs are in 3NF (none is in BCNF), but the text says FD1 and FD2 are in 2NF and FD3 and FD4 are in 1NF. Where am I making a mistake (or is the text wrong)?
I found the candidate keys to be ABD and ACD.
Terminology
It is improper to say that “a Functional Dependency is in a certain Normal Form”, since only a relation schema can be (or not be) in a Normal Form. What can be said is that a Functional Dependency violates a certain Normal Form (so that a schema containing it is not in that Normal Form).
Normal forms
It can be shown that a relation schema is in BCNF if every given FD has a superkey as its determinant. Since, as you have correctly noted, the only candidate keys here are ABD and ACD, every dependency violates that Normal Form. So the schema is not in BCNF.
To be in 3NF, every given functional dependency must be such that either the determinant is a superkey, or every attribute of the dependent (the right-hand side) is a prime attribute, that is, an attribute of some candidate key. In your example this is true for B and C, but not for E and F, so FD3 and FD4 violate 3NF. So the schema is not in 3NF either.
The 2NF, which is mainly of historical interest and not particularly useful in normalization theory, requires that no non-prime attribute depends on a proper subset of a candidate key. Again FD3 and FD4 violate this (E and F are non-prime and depend on CD and D, which are proper subsets of the candidate keys), so the relation is not in 2NF either.
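The candidate keys ABD and ACD can be confirmed by brute force over attribute subsets, smallest first; a minimal Python sketch:

```python
from itertools import combinations

# Attribute closure under a set of FDs (lhs and rhs are attribute sets).
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

R = set("ABCDEF")
fds = [(set("AB"), set("C")), (set("C"), set("B")),
       (set("CD"), set("E")), (set("D"), set("F"))]

# Enumerate subsets by increasing size; keep those whose closure is all
# of R and that contain no previously found (smaller) key.
keys = []
for size in range(1, len(R) + 1):
    for combo in combinations(sorted(R), size):
        s = set(combo)
        if closure(s, fds) == R and not any(k < s for k in keys):
            keys.append(s)

print(sorted("".join(sorted(k)) for k in keys))  # ['ABD', 'ACD']
```

Exponential enumeration is fine at exam scale; note that A and D must belong to every key, since neither ever appears on a right-hand side.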

Relational algebra SQL GROUP BY, SORT BY, ORDER BY

I wanted to know: what are the equivalents of GROUP BY, SORT BY and ORDER BY in relational algebra?
None of these exist in basic relational algebra, but people have created some "extensions" for these operations. (Note: in the original text, parts of the notation are written as subscripts.)
GROUP BY: according to the book Fundamentals of Database Systems (Elmasri, Navathe 2011, 6th ed):
Another type of request that cannot be expressed in the basic relational algebra is to
specify mathematical aggregate functions on collections of values from the database.
...
We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced script F), to specify these types of requests as follows:
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions—such as SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT—and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.
ORDER BY (SORT BY): from John L. Donaldson's lecture notes* (no longer available):
Since a relation is a set (or a bag), there is no ordering defined for a relation. That is, two relations are the same if they contain the same tuples, irrespective of ordering. However, a user frequently wants the output of a query to be listed in some particular order. We can define an additional operator τ which sorts a relation if we are willing to allow an operator whose output is not a relation, but an ordered list of tuples.
For example, the expression
τ LastName, FirstName (Student)
generates a list of all the Student tuples, ordered by LastName (as the primary sort key) then FirstName (as a secondary sort key). (The secondary sort key is used only if two tuples agree on the primary sort key. A sorting operation can list any number of sort keys, from most significant to least significant.)
*John L. Donaldson's (Emeritus Professor) lecture notes from the course CSCI 311 Database Systems at the Oberlin College Computer Science. Referenced 2015. Checked 2022 and not available anymore.
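The composite sort key of τ behaves exactly like lexicographic tuple ordering; a small Python analogue (the Student rows are made up for illustration):

```python
# Hypothetical Student rows: (LastName, FirstName, Year).
students = [("Lee", "Ada", 2), ("Cruz", "Bo", 1), ("Lee", "Abe", 3)]

# tau_{LastName,FirstName}(Student): LastName is the primary sort key,
# FirstName breaks ties -- an ordinary lexicographic tuple sort.
ordered = sorted(students, key=lambda t: (t[0], t[1]))
print(ordered)
# [('Cruz', 'Bo', 1), ('Lee', 'Abe', 3), ('Lee', 'Ada', 2)]
```

As the notes say, the output is an ordered list of tuples, not a relation.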
You can use projection π for the columns that you want to group the table by, without aggregating (the PROJECT operation removes any duplicate tuples), as follows:
π c1,c2,c3 (R)
where c1, c2, c3 are columns (attributes) and R is the table (the relation).
According to this SQL to relational algebra converter tool, we have:
SELECT agents.agent_code, agents.agent_name, SUM(orders.advance_amount)
FROM agents, orders
WHERE agents.agent_code = orders.agent_code
GROUP BY agents.agent_code, agents.agent_name
ORDER BY agents.agent_code
Written as a composition of these operators, roughly:
τ agents.agent_code
γ agent_code, agent_name, SUM(advance_amount)
σ agents.agent_code = orders.agent_code (agents × orders)
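The same pipeline can be emulated step by step in plain Python; a sketch with hypothetical agents and orders rows (σ as a filtered product, γ as dictionary grouping, τ as a sort):

```python
from collections import defaultdict

# Hypothetical rows standing in for the agents and orders tables.
agents = [("A001", "Alex"), ("A002", "Bea")]
orders = [("A001", 100.0), ("A001", 50.0), ("A002", 75.0)]

# sigma over a cross product: join agents and orders on agent_code.
joined = [(code, name, amount)
          for code, name in agents
          for ocode, amount in orders if code == ocode]

# gamma: group by (agent_code, agent_name), SUM(advance_amount).
sums = defaultdict(float)
for code, name, amount in joined:
    sums[(code, name)] += amount

# tau: order by agent_code.
result = sorted((code, name, total) for (code, name), total in sums.items())
print(result)  # [('A001', 'Alex', 150.0), ('A002', 'Bea', 75.0)]
```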
(The converter also renders the expression as an operator-tree diagram, not reproduced here.)

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A,B and C, who share a certain number of attributes (about 10-15). I created a table called ENTITIES, and a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30).
I'm unsure what is the best approach to follow in modelling the tables:
Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
Use an hstore column, each entity will store its unique flags in this column.
???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different or a changing set of attributes, especially since you can even have functional indices on attributes inside the column:
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
EAV type of storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns; I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600, and that limit applies regardless of how many of them are NULL. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, though it's more complex than that). That's only about 10 - 20 bytes extra per row. Contrary to what one might assume (!), it is most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly-changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an hstore or jsonb bucket for the rest.

How to normalize a doctor table to follow 2NF?

There is a base table called doctor in my database where I have the columns
Name, d_ID, Age, Gender, Contact_NO, Speciality, beg_Date, End_Date
I wish to normalize my table. I have found the dependencies for doctor table as follows:
Name, d_ID ---> Age, Gender, Speciality
d_ID ---> Name, Contact_NO, beg_Date, End_Date
There are a few more base tables with a similar structure.
I have computed the closures and found that I have 2 candidate keys which are {d_ID} and {Name,d_ID}. I chose {d_ID} to be the primary key and {Name,d_ID} to be the secondary key.
My questions are:
1) I want to know if my table is in 2NF already. If not, please let me know how to break down the relation.
2) I have an intermediate table called patient_record which has doctor id, patient id, nurse id, bed id (foreign keys) and so on. My confusion lies in whether normalization has to be done only to the intermediate tables and not to the base tables. I believe this because the base tables would only have unique identifiers for their columns, and hence they would automatically fall under 2NF?
I computed the closures and found that I have 2 candidate keys which are {d_ID} and {Name,d_ID} (please correct me if I am wrong).
No. By definition, candidate keys are irreducible. If d_ID is a candidate key, then {Name, d_ID} is not. {Name, d_ID} is not a candidate key, because it's reducible. Drop the attribute "Name", and you've got a candidate key (d_ID).
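The irreducibility test is easy to state mechanically: a candidate key is a superkey none of whose proper subsets is a superkey. A minimal Python sketch over the doctor schema (assuming, as in the question, that d_ID determines all attributes):

```python
# Attribute closure under a set of FDs.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

doctor = {"Name", "d_ID", "Age", "Gender", "Contact_NO",
          "Speciality", "beg_Date", "End_Date"}
fds = [({"d_ID"}, doctor)]  # d_ID determines every attribute

def is_superkey(s):
    return closure(s, fds) == doctor

def is_candidate_key(s):
    # A candidate key is a superkey that stops being one
    # when any single attribute is dropped (irreducibility).
    return is_superkey(s) and not any(is_superkey(s - {a}) for a in s)

print(is_candidate_key({"d_ID"}))          # True
print(is_superkey({"Name", "d_ID"}))       # True: it is a superkey ...
print(is_candidate_key({"Name", "d_ID"}))  # False: ... but it is reducible
```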
1) I want to know if my table is in 2NF already. If not, please let me know how to break down the relation?
It's really hard to say in this case. Although you have a unique ID number for every doctor, in your case it only serves to identify a row, not a doctor. Your table allows this kind of data.
d_ID Name Age Gender Contact_NO Speciality beg_Date End_Date
--
117 John Smith 45 M 123-456-7890 Cardio 2013-01-01 2015-12-31
199 John Smith 45 M 123-456-7890 Cardio 2013-01-01 2015-12-31
234 John Smith 45 M 123-456-7890 Cardio 2013-01-01 2015-12-31
How many doctors are there? (I made up the data, so I'm really the only one who knows the right answer.) There are two. 234 is an accidental duplicate of 117. 199 is a different doctor than 117; it's just a coincidence that they're both heart specialists at the same hospital, and their hospital privileges start and stop on the same dates.
That's the difference between identifying a row and identifying a doctor.
Whether it's in 2NF depends on other functional dependencies that might not yet be identified. There might be several of these.
2) I have an intermediate table called patient_record which has the doctor id, patient id, nurse id, bed id (foreign key) and so on. I am confused if normalization has to be only done to intermediate tables and not the other base tables.
Normalization is usually done to all tables.
Because base tables would only have unique identifiers for the columns and hence they would automatically fall under 2NF?
No, that's not true. For clarification, see my answer to Learning database normalization, confused about 2NF.
Identifying a row and identifying a thing
It's a subtle point, but it's really, really important.
Let's look at a well-behaved table that has three candidate keys.
create table chemical_elements (
element_name varchar(35) not null unique,
symbol varchar(3) not null unique,
atomic_number integer not null unique
);
All three attributes in that table are declared not null unique, which is the SQL idiom for identifying candidate keys. If you feel uncomfortable not having at least one candidate key declared as primary key, then just pick one. It doesn't really matter which one.
insert into chemical_elements
(atomic_number, element_name, symbol)
values
(1, 'Hydrogen', 'H'),
(2, 'Helium', 'He'),
(3, 'Lithium', 'Li'),
(4, 'Beryllium', 'Be'),
[snip]
(116, 'Ununhexium', 'Uuh'),
(117, 'Ununseptium', 'Uus'),
(118, 'Ununoctium', 'Uuo');
Each of the three candidate keys--atomic_number, element_name, symbol--unambiguously identifies an element in the real world. There's only one possible answer to the question, "What is the atomic number for beryllium?"
Now look at the table of doctors. There's more than one possible answer to the question, "What is the ID number of the doctor named 'John Smith'?" In fact, there's more than one possible answer for the very same doctor, because 234 and 117 refer to the same person.
It doesn't help to include more columns, because the data is the same for both doctors. You can get more than one answer to the question, "What's the ID number for the 45-year-old male doctor whose name is 'John Smith', whose phone number is 123-456-7890, and whose specialty is 'Cardio'?"
If you find people making appointments for these two doctors, you'll probably find their ID numbers written on a yellow sticky and stuck on their monitor.
Dr. John Smith who looks like Brad Pitt (!), ID 117.
Other Dr. John Smith, ID 199.
Each ID number unambiguously identifies a row in the database, but each ID number doesn't unambiguously identify a doctor. You know that ID 117 identifies a doctor named John Smith, but if both John Smiths were standing in front of you, you wouldn't be able to tell which one belonged to ID number 117. (Unless you had read that yellow sticky, and you knew what Brad Pitt looked like. But that information isn't in the database.)
What does this have to do with your question?
Normalization is based on functional dependencies. What "function" are we talking about when we talk about "functional dependencies"? We're talking about identification: the value of the determinant must identify one, and only one, value of the dependent attributes.
Here is the normalization process:
Identify all the candidate keys of the relation.
Identify all the functional dependencies in the relation.
Examine the determinants of the functional dependencies. If any determinant is not a candidate key, the relation is not well formed. Then:
i) Place the columns of the functional dependency in a new relation of their own.
ii) Make the determinant of the functional dependency the primary key of the new relation.
iii) Leave a copy of the determinant as a foreign key in the original relation.
iv) Create a referential integrity constraint between the original and the new relation.
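Steps i)-iii) above can be sketched as a small decomposition function; the Order/Customer relation below is a hypothetical example, not taken from the question:

```python
# Toy relation Order(order_id, customer_id, customer_name) with the
# offending FD customer_id -> customer_name (customer_id is not a
# candidate key of Order).
def decompose(relation, fd):
    lhs, rhs = fd
    new_relation = lhs | rhs     # i) the FD's columns form a new relation
    remaining = relation - rhs   # iii) the determinant stays behind as FK
    return remaining, new_relation

order_attrs = {"order_id", "customer_id", "customer_name"}
offending_fd = ({"customer_id"}, {"customer_name"})

orders, customers = decompose(order_attrs, offending_fd)
print(sorted(orders))     # ['customer_id', 'order_id']
print(sorted(customers))  # ['customer_id', 'customer_name']
```

Step ii) (making customer_id the primary key of the new relation) and step iv) (the referential integrity constraint) happen in the DDL, not in this set-level sketch.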

NoSQL, HBase, Cassandra db conceptualization

Suppose I have a 1-to-N relationship, for example Student and College.
Student Attributes:
Name,Surname,CollegeFKey,
College attributes:
CollegeKey,Other,Other.
Suppose I have a program which reads students and exams from a plain text file, and in this file there are duplicated colleges and duplicated students.
Like in denormalized tables:
CollegeId,Other,Other,Name,Surname,CollegeFkey.
e.g.
1,x,y,Mike,M,1
1,x,y,R,P,1
...
...
...
You see, in this case I always have to check that I have not already inserted key 1 into the College table in my normalized db.
How can I solve this in HBase or Cassandra? I mean, if I have 10000.. tables and rows, I don't want to check every primary key and then every FK to see whether it was inserted correctly.
How can I solve that? Can I use a NoSQL db to work directly on denormalized data?
Can you link me to an example that solves this problem?
You can use Cassandra (http://wiki.apache.org/cassandra/) with a high-level language client (I use Hector for Java, https://github.com/rantav/hector). In Cassandra you would describe a ColumnFamily College, and in this ColumnFamily you write Student columns which contain the information about the students.