Decomposition into ABC & CDE and preserving functional dependencies

Consider a relation R with five attributes ABCDE. Now
assume that R is decomposed into two smaller relations ABC and CDE.
Define S to be the relation (ABC NaturalJoin CDE).
a) Assume that the above decomposition is a lossless join. What is the
dependency that guarantees the lossless-join property?
b) Give an additional FD such that the “dependency preserving” property is
violated by this decomposition.
c) Give two additional FDs that would be preserved by this
decomposition.
The question seems different to me because there is no FD given, and it's asking:
a)
R1 = (A,B,C), R2 = (C,D,E), R1 ∩ R2 = C (how can I determine the dependency now?)
F1' = {A->B, A->C, B->C, B->A, C->A, C->B, AB->C, AC->B, BC->A...}
F2' = {C->D, C->E, D->E....}
then I will find F'?
b, c) How do I check? Do I need to look at all possible FDs for R1 and R2?

The question is definitely assuming things it hasn't said clearly. ABCDE could be subject to the JD (join dependency) *{ABC,CDE} while not being subject to any nontrivial FDs at all.
But suppose that the relation is subject to some FDs and isn't subject to any JDs other than ones that they imply. If C is a CK (candidate key) then the join is lossless. But then C -> ABCDE holds, because a CK determines all attributes, and C -> ABDE holds, because a CK determines all other attributes. That no other FD's holding would imply that the join is lossless requires tedium (checking every possible case of CKs) or inspiration to show.
Both of these FDs guarantee losslessness, although if one of them holds then the other holds, so they express the same condition. So the question is sloppy. Or the question might consider that the two expressions express the same FD in the sense of a condition; but an FD is an expression, not a condition, so that would also be sloppy.
I suspect that the questioner really just wanted you to give some FD whose holding would guarantee losslessness. That would get rid of the complications.
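As a side note, the standard test here (not stated in the question) is that a binary decomposition {R1, R2} of R is lossless with respect to a set F of FDs iff F implies R1 ∩ R2 → R1 or R1 ∩ R2 → R2. Below is a minimal Python sketch of the attribute-closure algorithm that mechanizes that test; the sample FD set is a made-up one under which C determines all of CDE:

def closure(attrs, fds):
    # standard fixpoint loop: keep firing FDs whose left side is covered
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    # lossless iff the common attributes determine all of r1 or all of r2
    common = set(r1) & set(r2)
    return set(r1) <= closure(common, fds) or set(r2) <= closure(common, fds)

fds = [({'C'}, {'D'}), ({'C'}, {'E'})]  # hypothetical: C -> D, C -> E
print(lossless_binary({'A','B','C'}, {'C','D','E'}, fds))  # True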

Related

Could someone please give me an example of a 3NF *DECOMPOSITION* that is not in BCNF? (I have no problem determining this for non-decompositions.)

It seems to me that Bernstein's synthesis / 3NF synthesis always yields BCNF subrelations, but that's apparently not true.
When one uses 3NF synthesis, one will have subrelations as a result, and each of them will consist of one of the following:
just one functional dependency along with all attributes of its schema, so the left side of the lone functional dependency will be a superkey, and that subrelation will therefore be in BCNF.
multiple functional dependencies, each of which has the same left side, so each left side is a superkey, and that subrelation will therefore be in BCNF.
no functional dependency, where the schema includes the attributes making up the primary key of the original / non-decomposed relation, which would satisfy BCNF vacuously because there are no functional dependencies.
What is an example of the 3NF synthesis algorithm yielding a non-BCNF decomposition and why it is so?
Bernstein's algorithm returns (one or more) components in EKNF, which lies between 3NF & BCNF.
Your claims of "that subrelation will therefore be in BCNF" are wrong. The FDs that hold in a component are all the ones in the closure of the original relation's FDs whose attributes are all in the component. So FDs could hold in a component whose determinants are not superkeys of it. (Which, by the definition of BCNF, is just another way of saying a component could fail to be in BCNF. Obviously--since we are told that the algorithm doesn't always give BCNF.)
Since your reasoning is unsound, finding a counterexample seems moot. But just about any presentation of BCNF gives an example non-BCNF 3NF relation, which it then decomposes to BCNF. You can join the non-BCNF 3NF relation with a projection on attributes of one of its CKs extended by a fresh non-prime attribute, and Bernstein's algorithm can decompose back to the 2 tables.
Chris Date's classic An Introduction to Database Systems has a non-BCNF 3NF schema R(S, J, T) with minimal/irreducible cover
{S, J} -> T
{T} -> J
CKs are {S, J} & {S, T}. Bernstein gives component (S, J, T)--the non-BCNF 3NF input R--in which both given FDs hold--plus the redundant component (T, J).
For an example with an additional non-redundant component, extend the cover by {T} -> X. CKs are the same. {S, J} -> T again gives (S, J, T)--non-BCNF--plus component (T, J, X).
So, could someone please give me an example of the 3NF synthesis algorithm yielding a non-BCNF decomposition and tell why it is so?
A better "So, [...]" would be, So, what is wrong with my reasoning? You would do well to examine the assumptions you made about what FDs could hold in a component. (That article happens to point out (with reference) that "A 3NF table that does not have multiple overlapping candidate keys is guaranteed to be in BCNF.")
There is no "why" in mathematics. We assume things ("assumptions", "axioms", "premises") & other things follow. We can ask for a proof of something, but the proof does not say "why" the something is so, it's a demonstration that it is so. "Why" might be used trying to ask for a proof or for steps that you got wrong in or are missing from whatever almost-proof you have in mind.
PS Such a ubiquitous non-BCNF 3NF relation is Today's Court Bookings in the Wikipedia article on BCNF as I write. But beware that that particular example has perhaps unintuitive FDs. Indeed beware that almost every relational model Wikipedia page--including that one--has errors & misconceptions. So do many, many textbooks, especially re normalization.
philipxy's answer is correct. Since you are asking for an example, here are a couple of them.
The relation (with a cover of the functional dependencies):
R (A B C D)
A B → C
C → D
D → B
is decomposed by the synthesis algorithm into:
R1 (A B C)
R2 (C D)
R3 (B D)
and R1 is not in BCNF because of the dependency C → B, whose determinant C is not a superkey of R1 (the candidate keys of R1 are AB and AC). Note that C → B is not present in the original cover, but is a dependency implied by it.
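A quick Python check (my own encoding of this example) that the implied dependency C → B indeed holds and violates BCNF in R1(ABC):

def closure(attrs, fds):
    # attribute closure under the given FDs (fixpoint loop)
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({'A', 'B'}, {'C'}), ({'C'}, {'D'}), ({'D'}, {'B'})]
print('B' in closure({'C'}, fds))         # True: C -> B is implied by the cover
r1 = {'A', 'B', 'C'}
print(r1 <= closure({'C'}, fds))          # False: C is not a superkey of R1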
Here is another (classical) example:
Phones (AreaCode, PhoneNumber, Subscriber, Town, Street)
AreaCode, PhoneNumber → Town
AreaCode, PhoneNumber → Subscriber
AreaCode, PhoneNumber → Street
Town → AreaCode
Bernstein's synthesis algorithm produces two subschemas:
R1 (AreaCode, PhoneNumber, Subscriber, Town, Street)
AreaCode, PhoneNumber → Town
AreaCode, PhoneNumber → Subscriber
AreaCode, PhoneNumber → Street
and:
R2 (Town, AreaCode)
Town → AreaCode
Since R2 is included in R1, the algorithm eliminates the second relation. The resulting relation is in 3NF but not in BCNF, since it has two candidate keys, (AreaCode, PhoneNumber) and (PhoneNumber, Town), and the functional dependency Town → AreaCode violates BCNF because its determinant, Town, is not a superkey.
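For completeness, here is a rough Python sketch of the synthesis procedure used in both examples (group a minimal cover by left-hand side, drop schemas contained in others, add a key schema if none contains a candidate key). It is a simplification that assumes the input FDs already form a minimal cover; the shortened attribute names are my own:

from collections import defaultdict

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def synthesize(attrs, fds):
    # 1. group the (assumed minimal) cover by left-hand side
    groups = defaultdict(set)
    for lhs, rhs in fds:
        groups[frozenset(lhs)] |= set(rhs)
    schemas = [set(lhs) | rhs for lhs, rhs in groups.items()]
    # 2. drop any schema strictly contained in another one
    schemas = [s for s in schemas if not any(s < t for t in schemas)]
    # 3. if no schema contains a candidate key of the whole relation, add one
    if not any(closure(s, fds) >= set(attrs) for s in schemas):
        key = set(attrs)
        for a in sorted(attrs):  # greedily shrink to a candidate key
            if closure(key - {a}, fds) >= set(attrs):
                key -= {a}
        schemas.append(key)
    return schemas

# the Phones example, with shortened names (AC=AreaCode, PN=PhoneNumber, Sub=Subscriber)
fds = [(('AC', 'PN'), ('Town', 'Sub', 'Street')), (('Town',), ('AC',))]
print(synthesize({'AC', 'PN', 'Town', 'Sub', 'Street'}, fds))
# [{'AC', 'PN', 'Town', 'Sub', 'Street'}] -- the contained schema (Town, AC) is dropped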

Definition of 3NF

I'm rather confused about the definition of 3NF.
Let R be a relation with attribute set X.
Suppose Y -> A is a functional dependency where A is a non-prime attribute and Y is a subset of X.
If Y is a proper subset of any candidate key for R, then the relation is not in 3NF (and not even in 2NF) because this is a partial dependency, which is not permitted in 2NF (and by extension 3NF).
If Y is a non-prime attribute, the relation is not in 3NF because this is a transitive dependency of the non-prime attribute A on any candidate key through the non-prime attribute Y.
But what if Y is a set containing both prime and non-prime attributes? What if A is a subset of Y? What if Y contains only prime attributes, but those prime attributes come from different keys of R so that Y is not a proper subset of any particular key of R? What if Y contains only, but multiple non-prime attributes? Which of these cases violates the requirements of 3NF and why?
TL;DR Get definitions straight.
To know whether a case violates 3NF you have to look at the criteria used in some definition.
Your question is rather like asking, I know an even number is one that is divisible by 2 or one whose decimal representation ends in 0, 2, 4, 6 or 8, but what if it's three times a square? Well, you have to use the definition--show that the given conditions imply that it's divisible by two or that its decimal representation ends in one of those digits. Why do you even care about other properties than the ones in the definition?
When some FDs (functional dependencies) hold, others must also hold. We say the latter are implied by the former. So when given FDs hold, usually tons of others also hold. So one or more arbitrary FDs holding doesn't necessarily tell you anything about which normal forms hold. E.g. when U is a superset of V, U → V must hold; such FDs are called trivial because they are implied by any collection of FDs. E.g. when U → V holds, every superset of U determines every subset of V. Armstrong's axioms are rules that can be mechanically applied to find all FDs that hold. There are algorithms to find a canonical/minimal/irreducible cover of a given set: a set of FDs that implies all those in the given set, with no proper subset that does. There are also algorithms to determine whether a relation satisfies certain NFs (normal forms), and to decompose it into components in higher NFs when it doesn't.
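For reference, the axioms just mentioned can be written out (standard textbook statements in LaTeX notation, not from the original post):

\begin{align*}
\text{Reflexivity:}  \quad & V \subseteq U \implies U \to V \quad (\text{the trivial FDs}) \\
\text{Augmentation:} \quad & U \to V \implies UW \to VW \\
\text{Transitivity:} \quad & U \to V \ \text{and} \ V \to W \implies U \to W
\end{align*}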
Sometimes we think there is a case that the definition doesn't handle but really we have got the definition wrong.
The definition you are trying to refer to for a relation being in 3NF actually requires that there be no transitive functional dependence of a non-prime attribute on a candidate key.
In your non-3NF example you should say there is a transitive FD, not "this is a transitive FD", because the violating FD is of the form CK → A, not Y → A. Also, U → V is transitive when there is an X where U → X AND X → V AND NOT X → U. It doesn't matter whether X is or contains a prime attribute.
PS It's not very helpful to ask "why" something is or isn't so in mathematics. We describe a situation in terms of some givens, and a bunch of things follow. We can say that if certain of the givens weren't so then that thing wouldn't be so. But if certain other givens weren't so then it might also not be so. We can give a proof that something is or isn't so as "why" but it's not the only proof.

How does aggregate generalise fold and fold generalise reduce?

As far as I understand, aggregate is a generalisation of fold, which in turn is a generalisation of reduce.
Similarly, combineByKey is a generalisation of aggregateByKey, which in turn is a generalisation of foldByKey, which in turn is a generalisation of reduceByKey.
However, I have trouble finding, for each of those seven methods, a simple example that can only be expressed by that method and not by its less general versions. For example I found http://blog.madhukaraphatak.com/spark-rdd-fold/ giving an example for fold, but I have been able to use reduce in the same situation as well.
What I found out so far:
I read that the more generalised methods can be more efficient, but that would be a non-functional requirement and I would like to get examples which can not be implemented with the more specific method.
I also read that e.g. the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally: https://stackoverflow.com/a/25158790/4533188 (However, I still don't know any good simple example.) whereas in https://stackoverflow.com/a/26635928/4533188 I read that fold needs both properties to hold...
We could think of the zero value as a feature (e.g. for fold over reduce) as in "add all elements and add 3" and using 3 as the zero value, but that would be misleading, because 3 would be added for each partition, not just once. Also this is simply not the purpose of fold as far as I understood - it wasn't meant as a feature, but as a necessity to implement it to be able to take non-commutative functions.
What would simple examples for those seven methods be?
Let's work through what is actually needed logically.
First, note that if your collection is unordered, any (binary) operation applied to it needs to be both commutative and associative, or you'll get different answers depending on which (arbitrary) order you choose each time. Since reduce, fold, and aggregate all use binary operations, if you use these things on a collection that is unordered (or is viewed as unordered), everything must be commutative and associative.
reduce is an implementation of the idea that if you can take two things and turn them into one thing, you can collapse an arbitrarily long collection into a single element. Associativity is exactly the property that it doesn't matter how you pair things up as long as you eventually pair them all and keep the left-to-right order unchanged, so that's exactly what you need.
a   b   c   d          a   b   c   d           a   b   c   d
a # b   c   d          a # b   c   d           a   b # c   d
(a#b)   c # d          (a#b) # c   d           a   (b#c)   d
(a#b) # (c#d)          ((a#b)#c) # d           a # ((b#c)#d)
All of the above are the same as long as the operation (here called #) is associative. There is no reason to swap around which things go on the left and which go on the right, so the operation does not need to be commutative (addition is: a+b == b+a; concat is not: ab != ba).
reduce is mathematically simple and requires only an associative operation
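A quick Python illustration of that point (string concatenation is associative but not commutative, so any grouping agrees as long as left-to-right order is kept; the data is made up):

from functools import reduce

xs = ["a", "b", "c", "d"]
concat = lambda x, y: x + y
left   = reduce(concat, xs)                           # ((a#b)#c)#d
paired = concat(concat("a", "b"), concat("c", "d"))   # (a#b)#(c#d)
print(left == paired, left)                  # True abcd: grouping doesn't matter
print(concat("a", "b") == concat("b", "a"))  # False: concat is not commutative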
Reduce is limited, though, in that it doesn't work on empty collections, and in that you can't change the type. If you're working sequentially, you can use a function that takes a new type and the old type, and produces something with the new type. This is a sequential fold (left-fold if the new type goes on the left, right-fold if it goes on the right). There is no choice about the order of operations here, so commutativity and associativity and everything are irrelevant. There's exactly one way to work through your list sequentially. (If you want your left-fold and right-fold to always be the same, then the operation must be associative and commutative, but since left- and right-folds don't generally get accidentally swapped, this isn't very important to ensure.)
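A tiny Python illustration of the type-changing point (functools.reduce with an initializer acts as a sequential left-fold; the data is made up):

import functools

xs = [1, 2, 3, 4]
# reduce: (T, T) -> T; cannot change the element type, and fails on an empty list
total = functools.reduce(lambda a, b: a + b, xs)                    # 10
# sequential left-fold: (NewType, T) -> NewType; the initial value [] seeds the new type
strings = functools.reduce(lambda acc, x: acc + [str(x)], xs, [])  # ['1', '2', '3', '4']
print(total, strings)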
The problem comes when you want to work in parallel. You can't sequentially go through your collection; that's not parallel by definition! So you have to insert the new type at multiple places! Let's call our fold operation #, and we'll say that the new type goes on the left. Furthermore, we'll say that we always start with the same element, Z. Now we could do any of the following (and more):
a   b   c   d              a   b   c   d              a   b   c   d
Z#a b   c   d              Z#a b   Z#c d              Z#a Z#b Z#c Z#d
(Z#a) # b   c   d          (Z#a) # b   (Z#c) # d
((Z#a)#b) # c   d
(((Z#a)#b)#c) # d
Now we have a collection of one or more things of the new type. (If the original collection was empty, we just take Z.) We know what to do with that! Reduce! So we make a reduce operation for our new type (let's call it $, and remember it has to be associative), and then we have aggregate:
a   b   c   d              a   b   c   d              a   b   c   d
Z#a b   c   d              Z#a b   Z#c d              Z#a Z#b Z#c Z#d
(Z#a) # b   c   d          (Z#a) # b   (Z#c) # d      Z#a $ Z#b   Z#c $ Z#d
((Z#a)#b) # c   d          ((Z#a)#b) $ ((Z#c)#d)      ((Z#a)$(Z#b)) $ ((Z#c)$(Z#d))
(((Z#a)#b)#c) # d
Now, these things all look really different. How can we make sure that they end up being the same? There is no single concept that describes this, but the Z# operation has to be zero-like and $ and # have to be homomorphic, in that we need (Z#a)#b == (Z#a)$(Z#b). That's the actual relationship that you need (and it is technically very similar to a semigroup homomorphism). There are all sorts of ways to pick badly even if everything is associative and commutative. For example, if Z is the double value 0.0 and # is actually +, then Z is zero-like and # is associative and commutative. But if $ is actually *, which is also associative and commutative, everything goes wrong:
(0.0+2) * (0.0+3) == 2.0 * 3.0 == 6.0
((0.0+2) + 3) == 2.0 + 3 == 5.0
One example of a non-trivial aggregate is building a collection, where # is the "append an element" operator and $ is the "concatenate two collections" operation.
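Here is a plain-Python sketch of that collection-building aggregate (simulating two partitions by hand; in Spark this corresponds to rdd.aggregate(zero, seqOp, combOp)):

import functools

def aggregate(partitions, zero, seq_op, comb_op):
    # fold each partition from zero with seq_op, then merge the results with comb_op
    partials = [functools.reduce(seq_op, part, zero) for part in partitions]
    return functools.reduce(comb_op, partials)

seq_op  = lambda acc, x: acc + [x]   # '#': append an element   (list, element) -> list
comb_op = lambda a, b: a + b         # '$': concatenate lists   (list, list) -> list
print(aggregate([[1, 2], [3, 4]], [], seq_op, comb_op))  # [1, 2, 3, 4]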
aggregate is tricky and requires an associative reduce operation, plus a zero-like value and a fold-like operation that is homomorphic to the reduce
The bottom line is that aggregate is not simply a generalization of reduce.
But there is a simplification (a less general form) if you're not actually changing the type. If Z is actually z, an actual zero of the operation, we can just stick it in wherever we want and use reduce. Again, we don't need commutativity conceptually; we just stick in one or more z's and reduce, and our # and $ operations can be the same thing, namely the original # we used in the reduce:
a   b   c   d              ()  <- empty
z#a z#b                    z
z#a (z#b)#c
z#a ((z#b)#c)#d
(z#a)#((z#b)#c)#d
If we just delete the z's from here, it works perfectly well, and in fact is equivalent to if (empty) z else reduce. But there's another way it could work too. If the operation # is also commutative, and z is not actually a zero but just occupies a fixed point of # (meaning z#z == z, but z#a is not necessarily just a), then you can run the same thing, and since commutativity lets you switch the order around, you conceptually can reorder all the z's together at the beginning and then merge them all together.
And this is a parallel fold, which is really a rather different beast than a sequential fold.
(Note that neither fold nor aggregate are strictly generalizations of reduce even for unordered collections where operations have to be associative and commutative, as some operations do not have a sensible zero! For instance, reducing strings by shortest length has as its "zero" the longest possible string, which conceptually doesn't exist, and practically is an absurd waste of memory.)
fold requires an associative reduce operation plus either a zero value or a reduce operation that's commutative plus a fixed-point value
Now, when would you ever use a parallel fold that wasn't just a reduceOrElse(zero)? Probably never, actually, though such cases can exist. For example, if you have a ring, you often have fixed points of the type we need. For instance, 10 % 45 == (10*10) % 45, and * is both associative and commutative in the integers mod 45. Thus, if our collection is numbers mod 45, we can fold with a "zero" of 10 and an operation of * (mod 45), and parallelize however we please while still getting the same result. Pretty weird.
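A quick Python check of that mod-45 curiosity (10 is a fixed point of * mod 45, i.e. (10*10) % 45 == 10, so any partitioning agrees; the data values are made up):

op = lambda x, y: (x * y) % 45   # associative and commutative on integers mod 45
z = 10                           # fixed point: op(z, z) == z, but op(z, a) != a in general

# one partition: plain sequential fold from z
seq = op(op(op(z, 7), 8), 11)
# two partitions [7] and [8, 11], each folded from z, then merged with the same op
par = op(op(z, 7), op(op(z, 8), 11))
print(seq, par)  # 40 40 -- the same, despite z being inserted twice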
However, note that you can just plug the zero and operation of fold into aggregate and get exactly the same result, so aggregate is a proper generalization of fold.
So, bottom line:
Reduce requires only an associative merge operation, but doesn't change the type, and doesn't work on empty collections.
Parallel fold tries to extend reduce but requires a true zero, or a fixed point and the merge operation must be commutative.
Aggregate changes the type by (conceptually) running sequential folds followed by a (parallel) reduce, but there are complex relationships between the reduce operation and the fold operation--basically they have to be doing "the same thing".
An unordered collection (e.g. a set) always requires an associative and commutative operation for any of the above.
With regard to the byKey stuff: it's just the same as this, except it only applies it to the collection of values associated with a (potentially repeated) key.
If Spark actually requires commutativity where the above analysis does not suggest it's needed, one could reasonably consider that a bug (or at least an unnecessary limitation of the implementation, given that operations like map and filter preserve order on ordered RDDs).
the function passed to fold only has to be associative, while the one for reduce has to be commutative additionally.
It is not correct. fold on RDDs requires the function to be commutative as well. It is not the same operation as fold on Iterable, which is pretty well described in the official documentation:
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
As you can see, the order of merging partial values is not part of the contract, hence the function used for fold has to be commutative.
I read that the more generalised methods can be more efficient
Technically speaking there should be no significant difference. For fold vs reduce you can check my answers to reduce() vs. fold() in Apache Spark and Why is the fold action necessary in Spark?
Regarding *byKey methods all are implemented using the same basic construct which is combineByKeyWithClassTag and can be reduced to three simple operations:
createCombiner - create "zero" value for a given partition
mergeValue - merge values into accumulator
mergeCombiners - merge accumulators created for each partition.
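As a hedged sketch of how those three operations plug together (assuming a running PySpark SparkContext named sc; the data is made up), here is combineByKey collecting the values for each key into a list:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.combineByKey(
    lambda v: [v],              # createCombiner: start an accumulator from the first value
    lambda acc, v: acc + [v],   # mergeValue: fold another value into a partition's accumulator
    lambda a, b: a + b,         # mergeCombiners: merge accumulators from different partitions
)
print(sorted(grouped.collect()))  # [('a', [1, 3]), ('b', [2])]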

In Answer Set Programming, what is the difference between a model and a least model?

I'm taking an artificial intelligence class and we are working with Answer Set Programming (Clingo specifically). We're talking mostly theory at the moment and I am having some trouble differentiating between models and least models. I have the following definitions:
Satisfying rules, models, least models and answer sets of definite
program
A program is called definite if it does not have “not” in the body of its rules.
A set S is said to satisfy a rule of the form a :- b1, …, bm, not c1, …, not cn. if the body's being satisfied by S (i.e., b1 … bm are in S and none of c1 … cn are in S) implies that its head is satisfied by S (i.e., a is in S).
A set S is said to satisfy a program if it satisfies all rules of that program.
A set S is said to be an answer set of a definite program P if (a) S satisfies P (also referred to as: S is a model of P) and (b) no strict subset of S satisfies P (i.e., S is the least model of P).
With the question (pulled from the lecture slides, not homework):
P is defined as:
a :- b,e.
b.
c :- d,b.
d.
Which of the following are models and least models?
{}, {b}, {b,d}, {b,d,c}, {b,d,c,e}, {b,d,c,e,a}
Can anyone let me know what the answer to the above question is? I can probably figure out the difference from there, although if someone could explain the difference in common speak (rather than text-book definition), that would be wonderful. I'm not sure which forum to post this question under - please let me know if it should be posted somewhere else.
Thanks
First of all, note that this section of your slides is talking about the answer sets of a positive program P (also called a definite program), even though it also mentions not. The positive program is the simple case, as for a positive program P there always exists a unique least model LM(P), which is the intersection of all its models.
Allowing not rules in the rule bodies makes things more complex. The body of a rule is the right side of :-.
The answer to the question would be, set by set:
S={} is not a model, since b and d are facts (b. and d.)
S={b} is not a model, since d is a fact (d.)
S={b,d} is not a model, since c is implied by c :- d,b. and c is not in S
S={b,d,c} is a model
S={b,d,c,e} is not a model, since a is implied by a :- b,e. and a is not in S
S={b,d,c,e,a} is a model
So what is the least model? It's S={b,c,d} since no strict subset of S satisfies P.
We can arrive at the least model of a positive program P in two ways:
Enumerate all models and take their intersection (here {b,c,d} ∩ {a,b,c,d,e} = {b,c,d}).
Start with the facts (here b. d.) and iteratively add implied atoms (here c via c :- d,b.), repeating until S is a model and stopping at that point (see the sketch below).
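Here is a small Python sketch of that second method for this particular program (the encoding of rules as (head, body) pairs is my own):

# rules of P as (head, body) pairs; facts have an empty body
rules = [("b", []), ("d", []), ("a", ["b", "e"]), ("c", ["d", "b"])]

s = set()
changed = True
while changed:
    changed = False
    for head, body in rules:
        if all(atom in s for atom in body) and head not in s:
            s.add(head)
            changed = True
print(sorted(s))  # ['b', 'c', 'd'] -- the least model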
Like your slides say, the definition of an answer set for a positive program P is: S is an answer set of P if S is the least model of P. To be precise, this is actually an if and only if, since the least model LM(P) is unique.
As a last note, so you are not later confused by them: a constraint :- a, b. is actually just shorthand for x :- not x, a, b. (with x a fresh atom). Thus programs containing constraints are not positive programs, though they might look like they are at first, since the body of a constraint seemingly doesn't contain a not.

Odd BCNF Decomposition on my past exam

I have a problem breaking this into BCNF:
Relation: R[A E P M N S T L]
FD's:
A -> EM,
A -> L,
M -> ST,
M -> N,
S -> T,
E -> P,
P -> E,
L -> A
This one was on one of my past exams, and I don't really know how to solve it.
I learned this on Coursera from Jennifer Widom, who wrote our course literature:
-------------- BCNF ALGORITHM ------------
1. Take a FD that violates BCNF.
2. Using that FD, decompose the relation into two new relations:
3. First relation: the attributes of the FD
4. Second relation: the rest of the relation plus the left-hand side of the chosen FD
5. Iterate until all the new relations have a key on the left-hand side
-------------- BCNF ALGORITHM ------------
And I also tried another one that is basically the same, but written in a different way:
X -> Y: R1({X}+), R2((R - {X}+) ∪ X); that is, the relation minus {X}+ ({X}+ being XY in this case), plus X.
So far, I'm here:
Obviously, A is a key, so its FDs are already in BCNF. The question is: can I erase any redundant FDs? If so, what is the rule of thumb?
R1(MST) <-- BCNF.
R2(A E P M N L)
R3()
And I have no idea where to go from here.
1. Take a FD that violates BCNF.
5. Iterate until all the new relations have a key on the left-hand side
Step 1's mention of a BCNF violation can only make sense if it means you should identify the keys and non-key-implied FDs of each original and generated relation. Also, step 5's termination cannot be merely when there is a key on the left-hand side, because a non-key determiner could be present. Clearly it is necessary and sufficient to stop when all relations are in BCNF. A particular full description of an algorithm may describe equivalent reorganizations and/or tests without explicitly mentioning BCNF.
If A is a key then so is L since L->A.
MSTNAELP keys A and L plus M->ST, M->N, S->T, E<->P
Those FDs are from non-keys. You picked M->ST. We get:
MST key M plus S->T
MNAELP keys A and L plus M->N, E<->P
Both have determiners not from keys, hence are not in BCNF. Pick S->T, E->P and M->N. We get:
ST key S in BCNF
SM key M in BCNF
EP keys P and E in BCNF
MN key M in BCNF
MAEL keys A and L in BCNF
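To make the procedure concrete, here is a rough Python sketch of it (my own encoding; for brevity it checks only the given FDs against each component, which happens to suffice for this input, whereas in general one must check all FDs in the projection of the closure):

def closure(attrs, fds):
    # attribute closure: keep firing FDs whose left side is covered
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(rel, fds):
    # split on any FD X -> Y inside rel whose determinant X is not a superkey of rel
    rel = set(rel)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs) & rel
        if lhs <= rel and rhs - lhs and not rel <= closure(lhs, fds):
            return (bcnf_decompose(lhs | rhs, fds)
                    + bcnf_decompose((rel - rhs) | lhs, fds))
    return [rel]

fds = [({'A'}, {'E', 'M'}), ({'A'}, {'L'}), ({'M'}, {'S', 'T'}),
       ({'M'}, {'N'}), ({'S'}, {'T'}), ({'E'}, {'P'}),
       ({'P'}, {'E'}), ({'L'}, {'A'})]
print(bcnf_decompose('AEPMNSTL', fds))
# five components, as sets: ST, MS, MN, EP, AEML -- matching the decomposition above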