There is a base table called doctor in my database where I have the columns
Name, d_ID, Age, Gender, Contact_NO, Speciality, beg_Date, End_Date
I wish to normalize my table. I have found the dependencies for doctor table as follows:
Name, d_ID ---> Age, Gender, Speciality
d_ID ---> Name, Contact_NO, Beg_Date, End_Date
There are a few more base tables with a similar structure.
I have computed the closures and found that I have 2 candidate keys which are {d_ID} and {Name,d_ID}. I chose {d_ID} to be the primary key and {Name,d_ID} to be the secondary key.
My question is:
I want to know if my table is in 2NF already. If not, please let
me know how to break down the relation?
I have an intermediate table called patient_record which has
doctor id, patient id, nurse id, bed id (foreign keys) and so on. My
confusion is whether normalization only has to be done to the intermediate tables
and not to the other base tables. I believe this because the base tables would only have
unique identifiers for their columns and hence they would
automatically fall under 2NF?
I computed the closures and found that I have 2 candidate keys which are {d_ID} and {Name, d_ID} (please correct me if I am wrong).
No. By definition, candidate keys are irreducible. If {d_ID} is a candidate key, then {Name, d_ID} is not a candidate key, because it's reducible: drop the attribute "Name", and you're left with a candidate key ({d_ID}).
1) I want to know if my table is in 2NF already. If not, please let me know how to break down the relation?
It's really hard to say in this case. Although you have a unique ID number for every doctor, in your case it only serves to identify a row, not a doctor. Your table allows this kind of data.
d_ID   Name         Age   Gender   Contact_NO     Speciality   beg_Date     End_Date
----   ----------   ---   ------   ------------   ----------   ----------   ----------
117    John Smith   45    M        123-456-7890   Cardio       2013-01-01   2015-12-31
199    John Smith   45    M        123-456-7890   Cardio       2013-01-01   2015-12-31
234    John Smith   45    M        123-456-7890   Cardio       2013-01-01   2015-12-31
How many doctors are there? (I made up the data, so I'm really the only one who knows the right answer.) There are two. 234 is an accidental duplicate of 117. 199 is a different doctor than 117; it's just a coincidence that they're both heart specialists at the same hospital, and their hospital privileges start and stop on the same dates.
That's the difference between identifying a row and identifying a doctor.
Whether it's in 2NF depends on other functional dependencies that might not yet be identified. There might be several of these.
2) I have an intermediate table called patient_record which has the doctor id, patient id, nurse id, bed id (foreign keys) and so on. I am confused whether normalization has to be done only to intermediate tables and not the other base tables.
Normalization is usually done to all tables.
Because base tables would only have unique identifiers for the columns and hence they would automatically fall under 2NF?
No, that's not true. For clarification, see my answer to Learning database normalization, confused about 2NF.
Identifying a row and identifying a thing
It's a subtle point, but it's really, really important.
Let's look at a well-behaved table that has three candidate keys.
create table chemical_elements (
    element_name  varchar(35) not null unique,
    symbol        varchar(3)  not null unique,
    atomic_number integer     not null unique
);
All three attributes in that table are declared not null unique, which is the SQL idiom for identifying candidate keys. If you feel uncomfortable not having at least one candidate key declared as primary key, then just pick one. It doesn't really matter which one.
insert into chemical_elements
(atomic_number, element_name, symbol)
values
(1, 'Hydrogen', 'H'),
(2, 'Helium', 'He'),
(3, 'Lithium', 'Li'),
(4, 'Beryllium', 'Be'),
[snip]
(116, 'Ununhexium', 'Uuh'),
(117, 'Ununseptium', 'Uus'),
(118, 'Ununoctium', 'Uuo');
Each of the three candidate keys--atomic_number, element_name, symbol--unambiguously identifies an element in the real world. There's only one possible answer to the question, "What is the atomic number for beryllium?"
Now look at the table of doctors. There's more than one possible answer to the question, "What is the ID number of the doctor named 'John Smith'?" In fact, there's more than one possible answer for the very same doctor, because 234 and 117 refer to the same person.
It doesn't help to include more columns, because the data is the same for both doctors. You can get more than one answer to the question, "What's the ID number for the 45-year-old male doctor whose name is 'John Smith', whose phone number is 123-456-7890, and whose specialty is 'Cardio'?"
If you find people making appointments for these two doctors, you'll probably find their ID numbers written on a yellow sticky and stuck on their monitor.
Dr. John Smith who looks like Brad Pitt (!), ID 117.
Other Dr. John Smith, ID 199.
Each ID number unambiguously identifies a row in the database, but each ID number doesn't unambiguously identify a doctor. You know that ID 117 identifies a doctor named John Smith, but if both John Smiths were standing in front of you, you wouldn't be able to tell which one belonged to ID number 117. (Unless you had read that yellow sticky, and you knew what Brad Pitt looked like. But that information isn't in the database.)
What does this have to do with your question?
Normalization is based on functional dependencies. What "function" are we talking about when we talk about "functional dependencies"? We're talking about the identity function.
Here is the normalization process:
Identify all the candidate keys of the relation.
Identify all the functional dependencies in the relation.
Examine the determinants of the functional dependencies. If any determinant is not a candidate key, the relation is not well formed. Then:
i) place the columns of the functional dependency in a new relation of their own;
ii) make the determinant of the functional dependency the primary key of the new relation;
iii) leave a copy of the determinant as a foreign key in the original relation;
iv) create a referential integrity constraint between the original and the new relation.
(A sketch applying these steps to a hypothetical dependency follows below.)
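For illustration only, suppose (hypothetically - this is not one of the dependencies you listed) that the doctor relation also satisfied Speciality -> Department. Speciality is not a candidate key, so steps i) to iv) would split the table roughly like this (column types are guesses):

CREATE TABLE speciality (
    speciality varchar(30) PRIMARY KEY,   -- the determinant becomes the primary key of the new relation
    department varchar(30) NOT NULL       -- the hypothetical dependent attribute
);

CREATE TABLE doctor (
    d_id       integer PRIMARY KEY,
    name       varchar(50) NOT NULL,
    age        integer,
    gender     char(1),
    contact_no varchar(20),
    speciality varchar(30) REFERENCES speciality (speciality),  -- determinant left behind as a foreign key
    beg_date   date,
    end_date   date
);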
logical design:
Pet(name, type, birthday, cost)
determinants:
name->type
name->birthday
name->cost
Here is some data:
name    type   birthday    cost
Bruno   cat    1/1/1982    free
Poppy   cat    1/2/1982    20.00
Silly   cat    12/2/1995   free
Sam     dog    2/3/1989    100.00
Tuffy   dog    3/3/1974    free
There's repeated data between rows but no duplicate columns. I think it's in BCNF.
Yes, the schema is in BCNF, if the dependencies given are a cover of all the dependencies holding in the schema. In this case, name is the only candidate key, and the left part of each non-trivial dependency (including those implied by the cover) is a super key. So the relation is in BCNF.
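Since name is the only candidate key, declaring it the primary key is enough to capture the given dependencies. A possible SQL sketch (the column types are guesses, not given in the question):

CREATE TABLE pet (
    name     varchar(30) PRIMARY KEY,   -- the only candidate key
    type     varchar(10) NOT NULL,
    birthday date        NOT NULL,
    cost     varchar(10) NOT NULL       -- holds values like 'free' or '20.00'
);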
I would like to replace some of the sequences I use for id's in my postgresql db with my own custom made id generator. The generator would produce a random number with a checkdigit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value matching [1-9][0-9]{9}, where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic?
How do I avoid returning the same id twice? (This would be caught by trying to insert it into a column with a unique constraint, but by then it's too late, I think.)
Is this a good idea at all?
Note1: I do not want to use uuid since it is to be communicated to customers, and 10 digits is far simpler to communicate than a 36-character uuid.
Note2: The function would rarely be called directly with SELECT get_new_rand_id(); it would instead be assigned as the default value of the id column, in place of nextval().
EDIT: OK, good discussion below! Here is some explanation of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (a generated serial number in the db). Since I communicate that number to the customer, it is a fairly simple task for my competitors to monitor my business (there are other numbers, such as invoice nr and order nr, that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible, but harder).
Why the check digit?
Before there was any talk of hiding the serial nr, I added a check digit to the order nr since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here:
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?
For generating unique and random-looking identifiers from a serial, using ciphers might be a good idea. Since their output is bijective (there is a one-to-one mapping between input and output values), you will not have any collisions, unlike with hashes. This also means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
foo_id
------------
1241588087
1500453386
1755259484
(3 rows)
I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key - not the primary key - that is visible to the user. That key could use the primary key as the seed for the hash function you describe and be the one that you use to do lookups. That key would be generated by a trigger after insert (which is much simpler than trying to ensure atomicity of the operation), and that is the key that you share with your clients, never the PK. I know there is debate (albeit I can't understand why) over whether PKs should be visible to user applications or not. Modern database design practices, and my personal experience, all seem to suggest that PKs should NOT be visible to users. They tend to attach meaning to them and, over time, that is a very bad thing - regardless of whether there is a check digit in the key or not.
Your joins will still be done using the PK. This other generated key is just for client lookups. It is the face; the PK is the guts.
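A minimal sketch of that idea, assuming PostgreSQL (the table/column names and the md5-based derivation are invented for illustration; the answer suggests an after-insert trigger, while this sketch fills the column in a BEFORE INSERT trigger, which is one simple way to do it):

ALTER TABLE customers ADD COLUMN public_id text UNIQUE;

CREATE OR REPLACE FUNCTION set_public_id() RETURNS trigger AS $$
BEGIN
    -- derive the exposed key from the primary key; md5 plus a secret salt is
    -- just a placeholder for whatever derivation you prefer (this one yields
    -- a 10-character hex string, not the purely numeric format discussed above)
    NEW.public_id := substr(md5(NEW.customer_id::text || 'some-secret-salt'), 1, 10);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_public_id
BEFORE INSERT ON customers
FOR EACH ROW EXECUTE PROCEDURE set_public_id();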
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.
I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use a postgresql SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2? Use that on the column instead. Plus it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to look up clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
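A minimal sketch of this suggestion (table and column names are invented): the SERIAL column is the hidden, autogenerated surrogate primary key, and a separate uniquely-constrained column is the one exposed for client lookups.

CREATE TABLE customers (
    customer_id serial PRIMARY KEY,   -- autogenerated surrogate key, kept internal
    lookup_code text UNIQUE,          -- exposed secondary key used for client lookups
    name        text NOT NULL
);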
How you generate the random and unique ids is a useful question - but you seem to be making a counterproductive assumption about when to generate them!
My point is that you do not need to generate these id's at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random id's for future use, that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementally 1,2,3 etc forever. The user does not need to see this internal id.
Then I have another table - random_ids with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
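A rough sketch of that nightly batch in PostgreSQL (the table and column names and the 10000-row block are taken loosely from the description above; everything else is invented):

CREATE TABLE random_ids (
    order_id  integer PRIMARY KEY,     -- internal sequential id
    random_id bigint  NOT NULL UNIQUE  -- pre-generated public id
);

-- Reserve random public ids for the next 10000 internal ids. The UNIQUE
-- constraint rejects the (rare) collision; since this runs in a batch job,
-- it can simply be retried without affecting any user.
INSERT INTO random_ids (order_id, random_id)
SELECT gs,
       (floor(random() * 9000000000) + 1000000000)::bigint   -- a 10-digit number
FROM generate_series(
       (SELECT coalesce(max(order_id), 0) + 1     FROM random_ids),
       (SELECT coalesce(max(order_id), 0) + 10000 FROM random_ids)
     ) AS gs;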
Your best bet would probably be some form of hash function, and then a checksum added to the end.
If you're not using this too often (you do not have a new customer every second, do you?) then it is feasible to just get a random number and then try to insert the record. Just be prepared to retry inserting with another number when it fails with unique constraint violation.
I'd use numbers 100000 to 999999 (900,000 possible numbers of the same length) and a check digit using the UPC or ISBN-10 algorithm. Two check digits would be better, though, as they'll eliminate about 99% of human errors instead of about 90%.
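For reference, here is one possible implementation of a UPC-style check digit in PL/pgSQL (the function name and the choice of UPC weighting are mine, not taken from the answers above):

CREATE OR REPLACE FUNCTION upc_check_digit(digits text) RETURNS int AS $$
DECLARE
    total int := 0;
    i     int;
BEGIN
    FOR i IN 1 .. length(digits) LOOP
        -- UPC weighting: odd positions (counting from the left) count three times
        IF i % 2 = 1 THEN
            total := total + 3 * substr(digits, i, 1)::int;
        ELSE
            total := total + substr(digits, i, 1)::int;
        END IF;
    END LOOP;
    RETURN (10 - (total % 10)) % 10;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Example: SELECT upc_check_digit('03600029145');  -- returns 2, the UPC-A check digit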
I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A,B and C, who share a certain number of attributes (about 10-15). I created a table called ENTITIES, and a column for each of the common attributes.
A, B, C also have some (mostly) unique attributes (all boolean, roughly 10 to 30 of them).
I'm unsure which is the best approach to follow in modelling the tables:
1. Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
2. Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
3. Use an hstore column: each entity stores its unique flags in this column.
4. ???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among others, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
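Despite those caveats, the basic shape is simple. A minimal sketch (table and column names are invented for illustration):

CREATE TABLE entities (           -- the shared attributes
    entity_id serial PRIMARY KEY,
    name      text NOT NULL       -- ... more common attributes here
);

CREATE TABLE entities_a (         -- type A adds its own flags
    a_flag1 boolean,
    a_flag2 boolean               -- ... and so on
) INHERITS (entities);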
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different attributes or a changing set of attributes, especially since you can even have functional indices on attributes inside the column (a short sketch follows the links below):
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
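As a minimal sketch of this option with jsonb (Postgres 9.4+; the table, column and flag names are invented for illustration):

CREATE TABLE entities (
    entity_id   serial PRIMARY KEY,
    entity_type text  NOT NULL,
    name        text  NOT NULL,
    attrs       jsonb NOT NULL DEFAULT '{}'   -- the per-type boolean flags live here
);

-- Expression index on a single flag inside the jsonb column:
CREATE INDEX entities_is_certified_idx ON entities (((attrs ->> 'is_certified')::boolean));

-- Or index the whole column for containment queries:
CREATE INDEX entities_attrs_gin_idx ON entities USING gin (attrs);

SELECT * FROM entities WHERE attrs @> '{"is_certified": true}';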
(2) Separate tables (EAV)
EAV-type storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_type to mark the type. Enforcing constraints per type is a bit awkward with lots of columns; I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600, and that upper limit applies even when most of the columns are NULL. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, but it's more complex than that) - that's only around 10 - 20 bytes extra per row. Contrary to what one might assume (!), that is most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
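A rough sketch of that reverse approach (all names are invented for illustration):

CREATE TABLE entities (
    entity_id   serial PRIMARY KEY,
    entity_type char(1) NOT NULL CHECK (entity_type IN ('A', 'B', 'C')),
    name        text NOT NULL,   -- ... more shared attributes here
    a_flag1     boolean,         -- attributes used only by type A
    a_flag2     boolean,
    b_flag1     boolean          -- attributes used only by type B, and so on
);

-- One view per entity type, exposing only the columns of interest:
CREATE VIEW entities_a AS
SELECT entity_id, name, a_flag1, a_flag2
FROM   entities
WHERE  entity_type = 'A';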
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly-changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns, and the different/less common ones in an hstore or jsonb bucket.
I am trying out FileMaker Pro 12 right now with no previous FM experience, although I have other basic DB experience. The issue I have is trying to do filtered queries for a report that span one-to-many relationships. Here is an example:
The 2 tables:

Sample_Replicate
    PK
    Sample FK
    other fields

Weights
    Sample_Replicate_FK (linked to PK of Sample_Replicate)
    Weight
    Measurement type (tare, gross, dry, ash)
    Wash type (null or from list of lab assays)
I want to create a report that displays: (gross-tare), (dry-tare)/(gross-tare), (ash-tare)/(gross-tare), and (dry-tare)/(gross-tare) for all dry weights with non null wash types.
It seems that FM wants me to create columns for each of these values (which is doable, as the list of lab assays changes minimally and updating the database would be acceptable, though not preferred). I have tried to add a gross wt, tare wt, etc. to the Sample_Replicate table, but it is only returning the first record (tare wt) when I use a calculated field and this method:
tare wt field = Case ( Weights::Measurement type = "Tare"; Weights::Weights )
gross wt field = Case ( Weights::Measurement type = "Gross"; Weights::Weights )
etc...
It also seems to be failing when I add the criteria:
and Is Empty(Weights::Wash type )
Could someone point me in the right direction on this issue? Thanks.
EDIT:
I came across this: http://www.filemakertoday.com/com/showthread.php/14084-Calculation-based-on-1-to-many-relationship
It seems that I can create ~15 calculated fields for each combination of measurement and wash type on the weights table, then do a sum of these columns in the sample_replicate after adding these 15 columns to the table. This seems absolutely asinine. Isn't there a better way to filter results of a one-to-many relationship in FM?
What about the following structure:

Replicate
    ID

Wash Weight
    Replicate ID
    Type (null or from list of lab assays)
    Tare
    Gross
    Dry
    Ash
    + calculated fields
I assume you only calculate weight ratios of the same wash type. The weight types (tare, gross, etc.) are not just labels here; since you use them in formulas in specific places, they are more like roles, so I think they deserve their own fields.
Add a tare wt field, etc., in the Weights table, but then add a calc field in your Sample_Replicate table to get the sum of all related values.
Ex: add a field "total tare wt" defined as "Sum ( Weights::tare wt )".
Suppose I have a relationship 1 to N, for example
Student, College.
Student attributes:
Name, Surname, CollegeFKey
College attributes:
CollegeKey, Other, Other.
Suppose that I have a program which reads students and exams from a plain text file, and in this file I have duplicated colleges and duplicated students.
Like in denormalized tables:
CollegeId,Other,Other,Name,Surname,CollegeFkey.
e.g.
1,x,y,Mike,M,1
1,x,y,R,P,1
...
...
...
You see, in this case I always have to check that I have not already inserted the key 1 into the College table in my normalized db.
How can I solve this in HBase or Cassandra? I mean, if I have 10000.. tables and rows, I don't want to check, for every primary key and then for every FK, whether it was inserted OK.
Can I use a NoSQL db to work directly with denormalized data?
Can you link me to an example that solves this problem?
You can use Cassandra (http://wiki.apache.org/cassandra/) with some high-level language client (I use Hector for Java: https://github.com/rantav/hector). In Cassandra you would describe a column family College, and in this column family you write Student columns which contain the information about the students.
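For a more current illustration, here is a minimal sketch in CQL, Cassandra's SQL-like query language (which postdates the Hector-era answer above; the names are taken from the question):

CREATE TABLE college_students (
    college_id int,
    student_id int,
    name       text,
    surname    text,
    PRIMARY KEY (college_id, student_id)
);

-- In Cassandra every INSERT is an upsert: writing the same primary key twice
-- simply overwrites the same row, so loading the denormalized file needs no
-- "was this key already inserted?" check.
INSERT INTO college_students (college_id, student_id, name, surname) VALUES (1, 10, 'Mike', 'M');
INSERT INTO college_students (college_id, student_id, name, surname) VALUES (1, 10, 'Mike', 'M');  -- harmless duplicate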