Is a functional dependency (FD) x -> y full when z -> y also holds and z is not a subset of x?

I have seen many examples of full functional dependencies, and they usually say:
x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
But what if y is also determined by an attribute that is not a subset of x?
Suppose I have a students table with the columns rollno (primary key), name, phoneno (unique, not null) and email (unique, not null).
As rollno is a primary key, let it be x and take name as y.
now x->y, but phone or email also determine y(name) which are not subsets of x. Is this still called a fully functional dependent?
If yes, do we only need to check those determinants of y that are subsets of x?
If not, what mistake did I make?

x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
You are confusing the definition of "full functional dependency" with the definition of "2NF". The definition of fully functionally dependent has nothing to do with superkeys or candidate keys or primary keys. And for a relation to be in 2NF, if X is a candidate key and Y is non-prime then Y can't be determined by any proper subset of X.
A functional dependency X -> Y is partial when Y is also functionally dependent on a proper/smaller subset of X. Otherwise it is full. It doesn't matter what else is true.
A superkey is a column or set of columns that functionally determines every column. If there is no smaller superkey inside it then it is a candidate key. A relation is in 2NF when every non-prime attribute is fully functionally dependent on every candidate key. It doesn't matter what else is true.
You can pick one candidate key to call primary key. So a primary key is a candidate key. Otherwise the notion of "primary key" is irrelevant to functional dependencies and normalization.
(In SQL, primary key means the same as unique not null, namely superkey, and a superkey is a candidate key only if there's no smaller superkey in it. So a set declared as primary key in SQL might not even be a primary key in the relational sense. Also, in SQL you can't declare {} as a superkey.)
As rollno is a primary key, let it be x and take name as y.
A primary key is a candidate key, so {rollno} determines every attribute and no proper subset of {rollno} determines every attribute. So {}, the only proper subset of {rollno}, is not a superkey. ({} is a superkey when there can only ever be at most one row in the table.) But it's still possible that {} -> name. (That would be if the name column only contains at most one name at a time.) Then {rollno} -> name would be partial because its proper subset {} determines name.
now x->y, but phone or email also determine y(name) which are not subsets of x. Is this still called a fully functional dependent?
If no proper subset of {rollno} determines name then {rollno} -> name fully, otherwise partially. That's what the definition says. Nothing else matters. But we don't know whether a proper subset of {rollno} determines name, because you didn't say whether {} -> name.
If {rollno}, {phoneno} and {email} are candidate keys and {} doesn't determine name then name is fully functionally dependent on all three (because no proper subset of any of them determines name).
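The definition can be tried out on sample data. The following is only a sketch: the helper names (determines, proper_subsets, is_full) and the sample rows are made up for illustration, and a single table instance can only refute an FD, never prove that it holds for every valid state of the table.

```python
from itertools import combinations

def determines(rows, lhs, rhs):
    """True if lhs -> rhs holds in this particular instance (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def proper_subsets(attrs):
    """All proper subsets of attrs, including the empty set {}."""
    attrs = list(attrs)
    for size in range(len(attrs)):
        yield from combinations(attrs, size)

def is_full(rows, lhs, rhs):
    """lhs -> rhs holds and no proper subset of lhs determines rhs."""
    return determines(rows, lhs, rhs) and not any(
        determines(rows, sub, rhs) for sub in proper_subsets(lhs)
    )

# Hypothetical sample data for the students table from the question.
students = [
    {"rollno": 1, "name": "Ann", "phoneno": "555-01", "email": "ann@example.com"},
    {"rollno": 2, "name": "Bob", "phoneno": "555-02", "email": "bob@example.com"},
]
print(is_full(students, ["rollno"], ["name"]))   # True: {} does not determine name
print(is_full(students, ["phoneno"], ["name"]))  # True, regardless of rollno

# If every row carries the same name, {} -> name holds and {rollno} -> name is partial.
same_name = [dict(row, name="Ann") for row in students]
print(is_full(same_name, ["rollno"], ["name"]))  # False
```

Note that is_full never looks at phoneno or email when it checks {rollno} -> name; only the proper subsets of the left-hand side matter.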

You are saying:
x->y such that y shouldn't be determined by any proper subset of x, x has to be a key.
but this mixes two different concepts, that of “full functional dependency”, and that of “key”.
A functional dependency is full if you cannot remove any element of the left part without losing the property of determining the right part. So if a functional dependency has only one attribute on the left part (like rollno → name), it is always full (leaving aside the degenerate case in which the empty set already determines the right part).
A (candidate) key, on the other hand, is a set of attributes that determines all the attributes of a relation, and such that you cannot remove any attribute from it without losing the property of being a key (that is, it is a minimal superkey).
In your example there are three different keys, rollno, phone, and email, each consisting of a single attribute.
Of course, if you know that the set of attributes X is a key, you can write X → T, where T is the set of all the attributes of the relation, and this functional dependency is full.

Does Erlang Mnesia select on an ordered_set give a list in Erlang Term order?

In the documentation it isn't clear to me whether I need to iterate in order with next or perhaps foldl (it is mentioned that foldr goes in the opposite order for ordered_set, so presumably foldl goes in the same order), or whether I can use select and rely on the result being ordered (assuming an ordered_set table).
can I use select and rely upon it being ordered (assuming ordered_set table)
ets:select/2:
For tables of type ordered_set, objects are visited in the same order as in a first/next traversal. This means that the match specification is executed against objects with keys in the first/next order and the corresponding result list is in the order of that execution.
ets:first/1:
Returns the first key Key in table Tab. For an ordered_set table, the first key in Erlang term order is returned.
Table Traversal:
Traversals using match and select functions may not need to scan the entire table depending on how the key is specified. A match pattern with a fully bound key (without any match variables) will optimize the operation to a single key lookup without any table traversal at all. For ordered_set a partially bound key will limit the traversal to only scan a subset of the table based on term order.
It would make no sense to me for a table of type ordered_set to return search results in a random order.

Database Normalization mistake

I'm preparing for an exam and in my textbook I found an example I don't understand.
For the relation R(A,B,C,D,E,F) I have the following functional dependencies:
FD1 A,B -> C
FD2 C -> B
FD3 C,D -> E
FD4 D -> F
Now I think all the FDs are in 3NF (none is in BCNF), but the text says that FD1 and FD2 are in 2NF and FD3 and FD4 are in 1NF. Where am I going wrong (or is the text wrong)?
I found the candidate keys to be ABD and ACD.
Terminology
It is highly improper to say that “a Functional Dependency is in a certain Normal Form”, since only a relation schema can be (or not be) in a Normal Form. What can be said is that a Functional Dependency violates a certain Normal Form (so that the schema that contains it is not in that Normal Form).
Normal forms
It can be shown that a relation schema is in BCNF if every given FD has a superkey as its determinant. Since, as you have correctly noted, the only candidate keys here are ABD and ACD, every dependency violates that Normal Form. So, the schema is not in BCNF.
For a relation schema to be in 3NF, every given functional dependency must either have a superkey as its determinant or have only prime attributes on its right-hand side, that is, attributes that belong to some candidate key. In your example this is true for B and C, but not for E and F, so FD3 and FD4 violate 3NF. So, the schema is not in 3NF either.
The 2NF, which is only of historical interest and not particularly useful in normalization theory, is a normal form in which the relation schema has no functional dependencies where non-prime attributes depend on part of a key. This again fails for FD3 and FD4, so the relation is not in 2NF either.
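These checks can be reproduced mechanically with attribute closures. The sketch below is not from the textbook: the function names are made up, and the 2NF and 3NF tests only inspect the four given FDs, which is enough for this example but is not a general decision procedure.

```python
from itertools import combinations

ATTRS = set("ABCDEF")
FDS = [("AB", "C"), ("C", "B"), ("CD", "E"), ("D", "F")]  # FD1..FD4

def closure(attrs, fds=FDS):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys():
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(ATTRS) + 1):      # smallest sets first
        for combo in combinations(sorted(ATTRS), size):
            s = set(combo)
            if closure(s) == ATTRS and not any(k < s for k in keys):
                keys.append(s)
    return keys

keys = candidate_keys()
prime = set().union(*keys)                      # attributes of some candidate key

def violates_bcnf(lhs, rhs):
    return closure(lhs) != ATTRS                # determinant is not a superkey

def violates_3nf(lhs, rhs):
    return closure(lhs) != ATTRS and not (set(rhs) - set(lhs)) <= prime

def violates_2nf(lhs, rhs):
    # A non-prime attribute depends on a proper subset of some candidate key.
    return bool(set(rhs) - prime) and any(set(lhs) < k for k in keys)

print(["".join(sorted(k)) for k in keys])             # ['ABD', 'ACD']
print([violates_bcnf(l, r) for l, r in FDS])          # [True, True, True, True]
print([violates_3nf(l, r) for l, r in FDS])           # [False, False, True, True]
print([violates_2nf(l, r) for l, r in FDS])           # [False, False, True, True]
```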

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about the following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
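As a rough illustration of that check, here is a toy cuckoo map that stores (key, value) pairs and compares the stored key on every lookup. The class name, the two hash functions derived from Python's built-in hash(), and the table size are all assumptions made for this sketch; it does not rehash or resize, so it is not a complete implementation.

```python
class CuckooMap:
    """Toy cuckoo hash map: two tables, each slot holds (key, value) or None."""

    def __init__(self, n=11, max_kicks=32):
        self.n = n
        self.max_kicks = max_kicks
        self.tables = [[None] * n, [None] * n]

    def _slot(self, which, key):
        # Two hypothetical hash functions, one per table.
        return hash((which, key)) % self.n

    def get(self, key, default=None):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            # Compare the stored key: equal hash codes or equal slots
            # do not imply equal keys.
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        for which in (0, 1):                    # overwrite if the key is present
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            # Place the entry and pick up whatever occupied the slot before.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which                   # evicted entry tries the other table
        raise RuntimeError("too many evictions; a real map would rehash or resize")

m = CuckooMap()
m.put(-1, "A")
print(m.get(-1))  # A
print(m.get(-2))  # None: -2 was never inserted, and the key comparison notices
```

On CPython, hash(-1) and hash(-2) happen to be equal, so -1 and -2 land in the same slots here, which is exactly the scenario from the question. The key comparison costs one equality test per probed slot, so lookups still inspect at most two slots.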

Algebra Relational sql GROUP BY SORT BY ORDER BY

I wanted to know what the equivalents of GROUP BY, SORT BY and ORDER BY are in relational algebra.
Neither is possible in basic relational algebra, but people have defined "extensions" for these operations. (Note: in the original texts, parts of the expressions are written as subscripts.)
GROUP BY, According to the book Fundamentals of Database Systems (Elmasri, Navathe 2011 6th ed):
Another type of request that cannot be expressed in the basic relational algebra is to specify mathematical aggregate functions on collections of values from the database.
...
We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced script F), to specify these types of requests as follows:
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions (such as SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT) and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.
ORDER BY (SORT BY), John L. Donaldson's lecture notes* (not available anymore):
Since a relation is a set (or a bag), there is no ordering defined for a relation. That is, two relations are the same if they contain the same tuples, irrespective of ordering. However, a user frequently wants the output of a query to be listed in some particular order. We can define an additional operator τ which sorts a relation if we are willing to allow an operator whose output is not a relation, but an ordered list of tuples.
For example, the expression
τ LastName,FirstName (Student)
generates a list of all the Student tuples, ordered by LastName (as the primary sort key) then FirstName (as a secondary sort key). (The secondary sort key is used only if two tuples agree on the primary sort key. A sorting operation can list any number of sort keys, from most significant to least significant.)
*John L. Donaldson's (Emeritus Professor) lecture notes from the course CSCI 311 Database Systems at the Oberlin College Computer Science. Referenced 2015. Checked 2022 and not available anymore.
If you want to group the table by some columns without aggregating, you can use the projection π on those columns (the PROJECT operation removes any duplicate tuples), as follows:
π c1,c2,c3 (R)
where c1, c2, c3 are columns (attributes) and R is the table (the relation).
According to this SQL to relational algebra converter tool, we have:
SELECT agents.agent_code, agents.agent_name, SUM(orders.advance_amount)
FROM agents, orders
WHERE agents.agent_code = orders.agent_code
GROUP BY agents.agent_code, agents.agent_name
ORDER BY agents.agent_code
Written as operators, roughly:
τ agents.agent_code
γ agent_code, agent_name, SUM(advance_amount)
σ agents.agent_code = orders.agent_code (agents × orders)
(The tool also shows this as an operator tree diagram, which is not reproduced here.)
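To make the τ / γ / σ / × expression above concrete, here is a small sketch that models a relation as a list of dicts. The function names and the sample rows are invented for illustration, and the γ helper only supports SUM.

```python
from collections import defaultdict

# Hypothetical sample data; attribute names are qualified to avoid clashes.
agents = [
    {"agents.agent_code": "A2", "agents.agent_name": "Bo"},
    {"agents.agent_code": "A1", "agents.agent_name": "Al"},
]
orders = [
    {"orders.agent_code": "A1", "orders.advance_amount": 100},
    {"orders.agent_code": "A2", "orders.advance_amount": 50},
    {"orders.agent_code": "A1", "orders.advance_amount": 25},
]

def product(r, s):
    """× : every pairing of a tuple from r with a tuple from s."""
    return [{**a, **b} for a in r for b in s]

def select(pred, r):
    """σ : keep the tuples that satisfy pred."""
    return [t for t in r if pred(t)]

def group_sum(group_attrs, sum_attr, r):
    """γ : one tuple per distinct combination of group_attrs, plus a SUM."""
    totals = defaultdict(int)
    for t in r:
        totals[tuple(t[a] for a in group_attrs)] += t[sum_attr]
    out = f"SUM({sum_attr})"
    return [{**dict(zip(group_attrs, k)), out: v} for k, v in totals.items()]

def tau(sort_attrs, r):
    """τ : returns an ordered list, which is no longer a relation."""
    return sorted(r, key=lambda t: tuple(t[a] for a in sort_attrs))

joined = select(
    lambda t: t["agents.agent_code"] == t["orders.agent_code"],
    product(agents, orders),
)
grouped = group_sum(
    ["agents.agent_code", "agents.agent_name"], "orders.advance_amount", joined
)
result = tau(["agents.agent_code"], grouped)
for row in result:
    print(row)
```

With this sample data the first row printed is the one for agent A1, with a sum of 125, followed by A2 with 50.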

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys matter. The stored data is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is what matters here: if you set CompareWith to UTF8Type, the keys are interpreted as UTF8Type; if you set it to TimeUUIDType, the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page (http://wiki.apache.org/cassandra/API); it is a good place to start. You might also find this article useful (http://www.sodeso.nl/?p=80); in the third part or so he talks about slice ranging his queries.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Conceptually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys, however, since even the randomization of the default RandomPartitioner is limited and you could end up clustering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row, with a start value and an end value; it is used mostly for wide rows, since columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying their names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a where clause uses secondary indexes. You may use inequality operators there, but there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't keep lexical ordering.
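The last point is easy to see outside Cassandra. The sketch below uses plain hashlib, and md5_token is only a hypothetical stand-in for a partitioner token (the real token computation may differ in detail); the date-like keys are made up.

```python
import hashlib

def md5_token(key: str) -> int:
    """Hypothetical stand-in for a token: the MD5 digest read as one big integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

# Date-like row keys, already in lexical (and chronological) order.
keys = [f"2010-09-{day:02d}" for day in range(1, 31)]

# Order in which the keys would sit around the ring if placed by token.
by_token = sorted(keys, key=md5_token)

print(keys[:5])
print(by_token[:5])  # neighbouring dates are scattered, so a key range is not a date range
```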
What you want to use is a Column Family based index using a Wide Row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key ("shard key") that would split the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem; it's how it is designed. So what you must ask yourself is "what do I need to query", and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.