I'm preparing an exam and on my texts I found an example I don't understand.
On the Relation R(A,B,C,D,E,F) I got the following functional dependencies:
FD1 A,B -> C
FD2 C -> B
FD3 C,D -> E
FD4 D -> F
Now I think all The FD are in 3NF (none is in BCNF), but the text says FD1 and FD2 to be in 2NF and FD3 and FD4 to be in 1NF. Where am I making mistakes (or is it the text wrong).
I found alternative keys to be ABD and ACD
Terminology
It is highly improper to say that: “a Functional Dependency in is in a certain Normal Form”, since only a relation schema can be (or not) in a Normal Form. What can be said is that a Functional Dependency violates a certain Normal Form (so that the schema that contains it is not in that Normal Form).
Normal forms
It can be shown that a relation schema is in BCNF if every FD given has as determinant a superkey. Since, has you have correctly noted, the only candidate keys here are ABD and ACD, every dependency violates that Normal Form. So, the schema is not in BCNF.
To be in 3NF, a relation schema must have all the given functional dependencies such that either the determinant is a superkey, or every attribute of the determinate is a prime attribute, that is it is an attribute of some candidate key. In your example this is true for B and C, but not for E and F, so FD3 and FD4 violates the 3NF. So, the schema is neither in 3NF.
The 2NF, which is only of historical interest and not particularly useful in the normalization theory, is a normal form for which the relation schema does not have functional dependencies in which non-prime attributes depend on part of keys. This is not true again for FD3 and FD4, so that the relation is neither in 2NF.
Related
I have seen many examples about fully functional dependencies, but they use to say that:
x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
But, what if y is determined by an attribute other than the proper subset or subset of x.
Suppose that I have a students table which consists of rollno(primary key), name, phone no unique not null, email unique not null.
As rollno is a primary key, let it be x and take name as y.
now x->y, but phone or email also determine y(name) which are not subsets of x. Is this still called a fully functional dependent?
If yes, should we check the determinants of y which are only subsets of x?
If no, what is the mistake I did?
x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
You are confusing the definition of "full functional dependency" with the definition of "2NF". The definition of fully functionally dependent has nothing to do with superkeys or candidate keys or primary keys. And for a relation to be in 2NF, if X is a candidate key and Y is non-prime then Y can't be determined by any proper subset of X.
A functional dependency X -> Y is partial when Y is also functionally dependent on a proper/smaller subset of X. Otherwise it is full. It doesn't matter what else is true.
A superkey is a column or set of columns that functionally determines every column. If there is no smaller superkey inside it then it is a candidate key. A relation is in 2NF when every attribute is fully functionally dependent on every candidate key. It doesn't matter what else is true.
You can pick one candidate key to call primary key. So a primary key is a candidate key. Otherwise the notion of "primary key" is irrelevant to functional dependencies and normalization.
(In SQL primary key means the same as unique not null, namely superkey. Which is a candidate key only if there's no smaller superkey in it. So a set declared primary key might not even be a primary key. And in SQL you can't declare {} as a superkey.)
As rollno is a primary key, let it be x and take name as y.
A primary key is a candidate key, so {rollno} determines every attribute and no proper subset of {rollno} determines every attribute. So {}, the only proper subset of {rollno}, is not a superkey. ({} is a superkey when there can only ever be at most one row in the table.) But it's still possible that {} -> name. (That would be if the name column only contains at most one name at a time.) Then {rollno} -> name would be partial because its proper subset {} determines name.
now x->y, but phone or email also determine y(name) which are not subsets of x. Is this still called a fully functional dependent?
If no proper subset of {rollno} determines name then {rollno} -> name fully, otherwise partially. That's what the definition says. Nothing else matters. But we don't know whether proper subset of {rollno} determines name because you didn't say whether {} -> name.
If {rollno}, {phoneno} and {email} are candidate keys and {} doesn't determine name then name is fully functionally dependent on all three (because no proper subset of any of them determines name).
You are saying:
x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
but this mixes two different concepts, that of “full functional dependency”, and that of “key”.
A functional dependency is full if you cannot remove any element of the left part without losing the propoerty of determining the right part. So if a functional dependency has only one attribute on the left part (like rollno → name), it is always complete.
A (candidate) key on the other hand is a set of attributes that determines all the attributes of a relation, and such that you cannot remove any attribute from it without losing the property of being a key (so, it is not a superkey).
In your example there are three different keys, rollno, phone, and email, each of them composed by a single attribute.
Of course, if you know that the set of attributes X is a key, you can write that X → T, where T are all the attributes of the relation, and this functional dependency is complete.
So, as part of my assignment, I have to prove that any relation with two attributes is in BCNF.
As per my understanding, if for a relation we have 3rd normal form and one non key attribute functionally determine key attribute, it violates the BCNF.
Say my relation consists of two attributes A1,A2
Scenario1(only one functional dependency)
A1 -> A2 (so A1 is the key, and A2 does not FD A1 : so no violation)
same applies for
A2 -> A1
But what if
A1->A2 and A2->A1
Here key can be either A1, A2. And the other non key attribute functionally determines the key.
In each functional dependency X -> Y, X and Y are sets of attributes. This requires special attention when either X or Y is an empty set1. So, in the example with only two attributes A1 and A2, we have all the possible non-trivial dependencies:
1. {} -> {A1}
2. {} -> {A2}
3. {} -> {A1 A2}
4. {A1} -> {A2}
5. {A2} -> {A1}
All the other possible dependencies are trivial dependencies, i.e. the right set is a subset of the left set (for instance {A1} -> {}, {} -> {}, {A1} -> {A1}, {A1 A2} -> {A1}, etc.). We know that these dependencies always hold, so they are not considered in the definition of the normal forms.
1. When empty sets are excluded from dependencies, the theorem is true
Consider the dependencies 4 and 5. We have four possible cases:
1. Only 4 holds, so we have: {A1} -> {A2}
this means that {A1} is a candidate key (since from {A1} -> {A2} we can derive that {A1}->{A1 A2}), and the BCNF condition is satisfied since each dependency has a superkey as determinant;
2. Only 5 holds, so we have: {A2} -> {A1}
equivalent to the previous case, only the role of A1 and A2 is exchanged;
3. Neither 4 nor 5 hold (no functional dependencies),
so the BCNF is formally satisfied (since no dependency violates the BCNF); and, finally:
4. both hold, so we have {A1} -> {A2} and {A2} -> {A1}
also in this case the relation is in BCNF, since {A1} and {A2} are both candidate keys, since they determine all the attributes (simply put together 1 and 2 above).
2. When we allow the empty set in the functional dependencies, the theorem is not true
Consider a relation R(A1, A2), with a cover F of the dependencies
F = { {}-> {A1} }
The meaning of {} -> {A1}, by recalling the definition of functional dependency, is that the column A1 has a constant value. So we have a relation with two columns, one of which has always the same value. In this case the only candidate key is {A2}, since {A2}+ = {A1 A2}, with {A1 A2} a superkey, and the relation is not in BCNF since a non-trivial functional dependency ({} -> {A1}) has a determinant which is not a superkey.
1 Note that in the scientific literature on the subject (as well as in books on databases) the possibility of empty sets in functional dependences is sometimes explicitly excluded (for instance, see: Tsou, Don-Min, and Patrick C. Fischer. “Decomposition of a Relation Scheme into Boyce-Codd Normal Form.” ACM SIGACT News 14, no. 3 (July 1, 1982): 23–29. https://doi.org/10.1145/990511.990513), while sometimes it is allowed, or not discussed.
For any relation to be in BCNF, the following must holds.
X → Y is a trivial functional dependency (Y ⊆ X).
X is a superkey for schema R
Wikipedia link here
For Example, there is a relation R = {A,B} with two attributes.
The only possible (non-trivial) FD's are {A}->{B} and {B}->{A}.
So, there are four possible cases:
1. No FD's holds in R. {C.K = AB}, Since it is an all key relation it's always in BCNF.
2. Only A->B holds. In this case {C.K = A} and relation satisfies BCNF.
3. Only B->A holds. In this case {C.K = B} and relation satisfies BCNF.
4. Both A->B and B->A holds. In this case there are two keys {CK = A and B} and
relation satisfies BCNF.
Hence, every Binary Relation (A relation with two attributes) is always in BCNF!
To prove any relation with two attributes is in BCNF.
Rule For Boyce-Codd Normal Form:
A relation R is in BCNF if R is in Third Normal Form and for every FD,LHS is super key
so if, A1 and A2 are the only attributes: A1 -> A2 and A2 -> A1 as functional dependencies, then in both functional dependencies, the left-hand side is a super key. Which satisfies the condition of BCNF.
I wanted to know what is the equivalent in GROUP BY, SORT BY and ORDER BY in algebra relational ?
Neither is possible in relational algebra but people have been creating some "extensions" for these operations (Note: in the original text, part of the text is written as subscript).
GROUP BY, According to the book Fundamentals of Database Systems (Elmasri, Navathe 2011 6th ed):
Another type of request that cannot be expressed in the basic relational algebra is to
specify mathematical aggregate functions on collections of values from the database.
...
We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced
script F)7, to specify these types of requests as follows:
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair,
<function> is one of the allowed functions—such as SUM, AVERAGE, MAXIMUM,
MINIMUM,COUNT—and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.
ORDER BY (SORT BY), John L. Donaldson's lecture notes* (not available anymore):
Since a relation is a set (or a bag), there is no ordering defined for a relation. That is, two relations are the same if they contain the same tuples, irrespective of ordering. However, a user frequently wants the output of a query to be listed in some particular order. We can define an additional operator τ which sorts a relation if we are willing to allow an operator whose output is not a relation, but an ordered list of tuples.
For example, the expression
τLastName,FirstName(Student)
generates a list of all the Student tuples, ordered by LastName (as the primary sort key) then FirstName (as a secondary sort key). (The secondary sort key is used only if two tuples agree on the primary sort key. A sorting operation can list any number of sort keys, from most significant to least significant.)
*John L. Donaldson's (Emeritus Professor) lecture notes from the course CSCI 311 Database Systems at the Oberlin College Computer Science. Referenced 2015. Checked 2022 and not available anymore.
You can use projection π for the columns that you want group the table by them without aggregating (The PROJECT operation removes any duplicate tuples)
as following:
π c1,c2,c3 (R)
where c1,c2,c3 are columns(attributes) and R is the table(the relation)
According to this SQL to relational algebra converter tool, we have:
SELECT agents.agent_code, agents.agent_name, SUM(orders.advance_amount)
FROM agents, orders
WHERE agents.agent_code = orders.agent_code
GROUP BY agents.agent_code, agents.agent_name
ORDER BY agents.agent_code
Written in functions sort of like:
τ agents.agent_code
γ agent_code, agent_name, SUM(advance_amount)
σ agents.agent_code = orders.agent_code (agents × orders)
With a diagram like:
I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand by the adding of a Primary key (not a natural key) this will then allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS whats is the "best practice" method? Is this all done in one package for example?
Is that roughly how it happens?
In this case do we have to be particularly careful in what order we load our data, or we could be loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the Dimensions first. And when the
fact data is brought over a lookup is performed that "pushes" the
Foreign key into the Fact table? At what point is this done? Within
SSIS whats is the "best practice" method? Is this all done in one
package for example?
It would depend on your schema and table design.
Assuming it's star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
^
|
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's snowflake schema:
DIM1_1
^
|
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (mostly a number) instead of the data value itself (kinda an optimization when dealing with huge amount of data and/or strings as dimension values), you won't need to wait until you insert the data into DIM table. I'm sure it's very confusing :), so I'll try to explain in short. The steps involved would be something like (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which say is a string), using a reproducible algorithm (e.g. SHA1, given same string, it always gives same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel. as long as there is NO constraint in place. A join on a numeric column would be more efficient than one of a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
HTH
Aheo asks if it is ok to have a table with just one column. How about one with no columns, or, given that this seems difficult to do in most modern "relational" DBMSes, a relation with no attributes?
There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing roles equivalent to 1 and 0 in ordinary algebra.
A table with a single column is a set -- as long as you don't care about ordering the values, or associating any other info with them, it seems fine. You can check for membership in it, and basically that's all you can do. (If you don't have a UNIQUE constraint on the single column I guess you could also count number of occurrences... a multiset).
But what in blazes would a table with no columns (or a relation with no attributes) mean -- or, how would it be any good?!
DEE and cartesian product form a monoid. In practice, if you have Date's relational summarize operator, you'd use DEE as your grouping relation to obtain grand-totals. There are many other examples where DEE is practically useful, e.g. in a functional setting with a binary join operator you'd get n-ary join = foldr join dee
"There are exactly two relations with no attributes, one with an empty tuple, and one without. In The Third Manifesto, Date and Darwen (somewhat) humorously name them TABLE_DEE and TABLE_DUM (respectively).
They are useful to the extent that they are the identity of a variety of relational operators, playing a roles equivalent to 1 and 0 in ordinary algebra."
And of course they also play the role of "TRUE" and "FALSE" in boolean algebra. Meaning that they are useful when propositions such as "The shop is open" and "The alarm is set" are to be represented in a database.
A consequence of this is that they can also be usefully employed in any expression of the relational algebra for their properties of "acting as an IF/ELSE" : joining to TABLE_DUM means retaining no tuples at all from the other argument, joining to TABLE_DEE means retaining them all. So joining R to a relvar S which can be equal to either TABLE_DEE or TABLE_DUM, is the RA equivalent of "if S then R else FI", with FI standing for the empty relation.
Hm. So the lack of "real-world examples" got to me, and I tried my best. Perhaps surprisingly, I got half way there!
cjs=> CREATE TABLE D ();
CREATE TABLE
cjs=> SELECT COUNT (*) FROM D;
count
-------
0
(1 row)
cjs=> INSERT INTO D () VALUES ();
ERROR: syntax error at or near ")"
LINE 1: INSERT INTO D () VALUES ();
A table with a single column would make sense as a simple lookup. Let's say you have a list of strings you want to filter against for user inputed text. That table would store the words you would want to filter out.
It is difficult to see utility of TABLE_DEE and TABLE_DUM from SQL Database perspective. After all it is not guaranteed that your favorite db vendor allows you creating one or the other.
It is also difficult to see utility of TABLE_DEE and TABLE_DUM in relational algebra. One have to look beyond that. To get you a flavor how these constants can come alive consider relational algebra put into proper mathematical shape, that is as close as it is possible to Boolean algebra. D&D Algebra A is a step in this direction. Then, one can express classic relational algebra operations via more fundamental ones and those two constants become really handy.