How to maintain integrity? - rdbms

I am just curious how do you all create tables to maintain integrity?
tblUserProfile
UserId
EyeColorId
HairColorId
RelationShipStatusId
etc.
Each of these values like EyeColorId has a set of values like Brown,Black,Yellow,Green. Similary HairColorId has black, blonde, crimson etc and RelationShipStatusId has Divorced, Married, Single etc. Should i create different tables for each one of these? like
tblEyeColor
EyeColorId
EyeColorCode
then :-
tblHairColor
HairColorId
HairColorCode
and likewise keep creating tables? There are many such tables(approximately 20-25). If i keep creating these tables and make joins on them, it will terribly slow down my performance. How do you all maintain this sort of enum values? Should i keep a check constraint or should i keep making tables?

I would say that Color looks like it could be a single table that both hair and eye could both use. How important is it to your integrity to enforce the fact that no one should have blonde eyes or blue hair?
Some people go with the idea of a single lookup table, which would have an id, a group, and a value and perhaps a description for each row. The id and group would be the primary key, so you'd have id = 1 and group = 1 for hair as the BLONDE hair color, id = 1 and group = 2 for eyes as the HAZEL eye color, etc.
Others would deride this as a poor denormalized design.
The joins will only slow you down when the number for a particular query gets large.
My advice would be to do a normalized design until you have some data to suggest that it's performing poorly. When that happens, profile your app to find out where the problem is and refactor or denormalize appropriately.
I would say that indexing will have a greater impact on your performance than these JOINs. You might be guilty of premature optimization without data to support it.

There is no need of creating tables if the number of options are fixed You can use Enum Type instead in your table.
e.g. Column EyeColor will be Enum of Black, Brown, Blue
However I've never seen someone with Green Eyes. LOL

That is the traditional method, yes.
Most modern RDBMSs will cache small tables like these lookups for extended periods of time, which alleviates the potential multi-join issues.
Also, you could pre-load all of the lookup tables in your application, and reverse-convert the enums to ids in code before accessing the database. This way you won't need joins in your queries.

Related

Will one Foreign key to two tables cause confusion?

If I have table A, and table B, and I have data that start off in table A but end up in table B, and I have a table C which has a foreign key that points to the primary key of A, but when the data get removed from A and ends up in table B, it should point to B instead (having the same id as A's data did). Will this cause confusion. Heres and example to show what I mean:
A (Pending results)
id =3
B( Completed Results)
null
C(user)
id = 1
results id = 3 (foreign key to both A and B)
After three minutes, the results have been posted.
A (Pending results)
null
B( Completed Results)
id = 3
C(user)
id = 1
results id = 3 (foreign key to both A and B)
Is there anything wrong with this implementation. Or would it be better to have A and B as one table? The table could grow very large which is what I am worried about. As separate tables, the reads to table A would be far greater than the reads to table B and table A would be much smaller, as it is just pending results. If A and B were combined into one table, then it would be both pending and a history of all completed results, so finding the ones which are pending would take much more time I assume. All of this is being done is postgresql if that makes a difference.
So I guess my question is: Is this implementation fine for a medium scale, or given the information I just said, should I combine table A and B (Even though B will grow infinitely large whereas A only contains present data and is significantly smaller).
Sounds like you've already found that this does not work. I couldn't follow your example properly because "A", "B", and "C" never work for me. I suspect those kinds of formulaic labels are better than specifics for other people. You just can't win ;-) In any case, it sounds like you're facing a practical concern about table size, and are being tempted to use a design that splits a natural table into two parts. (Hot and old.) As you found, that doesn't really work with the keys in a system. The relational model (etc., etc.) doesn't have a concept for "this thing is a child of this or that." So, you're swimming up stream there. Regardless, this kind of setup is very commonplace in the wild, so much so that it's got a name. Well, several names. "Polymorphic Association" from Bill Karwin's SQL Anti-Patterns is common. That's a good book, and short, by the way. Similarly, "promiscuous association" is a term you'll see. Or sometimes you'll see the table itself listed as a "jump table", or a "hub", etc.
I suspect there's a reason this non-relational pattern is so widely used: It makes sense to humans. An area where the relational model is always a tight pinch is when you have things which are kinds of things. Like, people who are staff or student. So many fields in common, several that are distinct to their specific type. One table? Two? Three? Table inheritance in Postgres might help...at least it's trying to. Anyway, polymorphic relations are problematic in an RDBMS because they're not able to be modeled or constrained automatically. You need custom code to figure out that this record is a child of that table...or the other table. You can't bake that into the relations. If you're interested in various solutions to this design problem, Karwin's chapter is quite good, easy to read, and full of alternative designs. If you don't feel like tracking down the book but are a bit interested, check out this article from a few years ago:
https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
Chances are, your interest right now is more day-to-day. Is sounds like you've got a processing pipeline with a few active records and an ever-increasing collection of older records. You don't mention your Postgres version, but you might have less to worry about than you imagine. First up, you could consider partitioning the table. A partitioned table has a single logical table that you talk to in your queries with a collection of smaller physical tables under the hood. You can get at the partitions directly, but you don't need to. You just talk to my_big_table and Postgres figures out where to look. So, you could split the data on week, month, etc. so that no one bucket every gets too big for you. In this case, the individual partitions have their own indexes too. So, you'll end up with smaller tables and smaller indexes under the hood. For this, you're best off using PG 11, or maybe PG 10. Partitioning is a big topic, and the Postgres feature set isn't a perfect match for every situation...you have to work within its limits. I'll leave it at that now as it's likely not what you need first.
Simpler than partitioning is an awesome Postgres feature you may not know about, partial indexes. This isn't unique to Postgres (SQL Server calls the same sort of feature a "filtered" index), but I don't think MySQL has it. Okay, the idea is really simple: Build an index that only includes rows that match a condition. Here's an example:
CREATE INDEX tasks_pending
ON tasks (status)
WHERE status = 'Pending'
If you're table has 100M records, a full B-tree has to catalog all 100M rows. You need that for a uniqueness check on a primary key...but it's big and expensive. Now imagine your 100M records have only 1,000 rows where status = pending. You've got an index with just those 1,000 rows. Tiny, fast, perfect. The beauty part here is that the partial index doesn't necessarily get bigger as your historical data set grows. And, shout out to historical data sets, they're very nice to have when you need to get aggregates, etc. in a simple search. If you split things into multiple tables, you'll need to write longer queries with UNION. (That wouldn't be the case with partitions where the physical division is masked by the logical partition master table.)
HTH

Is it good practice to have 2 or more tables with the same columns?

I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options on storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation. You now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars, now you need a third table with a third set of code. It gets messy quick. It would be better to have two tables, a data table (say Establishment) with most of the columns (name, location, etc.) and then a second table that's a "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc. And of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but were talking at least hundreds of thousands of rows before it should matter in Postgres). If this is the reason I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and logical option.
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.

ef-core - bool index vs historical table

I´m using aspnet-core, ef-core with sql server. I have an 'order' entity. As I'm expecting the orders table to be large and a the most frequent query would get the active orders only for certain customer (active orders are just a tiny fraction of the whole table) I like to optimize the speed of the query but I can decide from this two approaches:
1) I don't know if this is possible as I haven't done this before, but I was thinking about creating a Boolean column named 'IsActive' and make it an index thus when querying only Active orders would be faster.
2) When an order becomes not active, move the order to another table, i.e HistoricalOrders, thus keeping the orders table small.
Which of the two would have better results?, or none of this is a good solution and a third approach could be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition. This includes the clustered index. This is quite awkward. The query optimizer requires that you add a dummy predicate where IsActive IN (0, 1) to make it able to still seek on such indexes. Of course, this will now also touch the cold data. So you probably need to know the concrete value of IsActive or try the 1 value first and be sure that it matches 99% of the time.
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you are about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note, that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before think about the new table HistoricalOrders,Just create a column name IsActive and test it with your data.You don't need to make it as index column.B'cos Indexes eat up storage and it slows down writes and updates.So we must very careful when we create an index.When you query the data do it as shown below.On the below query where data selection (or filter) is done on the SQL srever side (IQueryable).So it is very fast.
Note : Use AsNoTracking() too.It will boost the perfromnace too.
var activeOrders =_context.Set<Orders>().Where(o => o.IsActive == true).AsNoTracking()
.ToList();
Reference : AsNoTracking()

PostgreSQL: What is the maximum number of tables can store in postgreSQL database?

Q1: What is the maximum number of tables can store in database?
Q2: What is the maximum number of tables can union in view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) on number of tables; expect planning times to increase, and problems with things like autovacuum as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. Same goes for unions: if you have to union more than a handful of tables you probably should look at your table structure.
One practical scenario where this can happen is with Postgis: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of Postgis IMHO), but that would typically be handled at the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

What db fits me?

I am currently using mysql. I am finding that my schema is getting incredibly complicated. I seek to find a new db that will suit my needs:
Let's assume I am building a news aggregrator (which collects news from multiple website). I then run algorithms to determine if two news from different sites are actually referring to the same topic. I run this algorithm to cluster news together. The relationship is depicted below:
cluster
\--news1
\--word1
\--word2
\--news2
\--word3
\--news3
\--word1
\--word3
And then I will apply some magic and determine the importance of each word. Summing all the importance of each word gives me the importance of a news article. Summing the importance of each news article gives me the importance of a cluster.
Note that above cluster there are also subgroups( like split by region etc), and categories (like sports, etc) which I have to determine the importance of that in a particular day per se.
I have used views in the past to do so, but I realized that views are very slow. So i will normally do an insert into an actual table and index them for better performance. As you can see this leads to multiple tables derived like (cluster, importance), (news, importance), (words, importance) etc which can get pretty messy.
Also the "importance" metric will change. It has become increasingly difficult to alter tables, update data (which I am using TRUNCATE TABLE) and then inserting from null.
I am currently looking into something schemaless like Mongodb. I do not need distributedness. I would very much want something that is reasonably fast (which can be indexed) and something that is a lot more flexible that traditional RDMBS.
NEW
As requested by various people, I will post my usage to this database (they are not actual SQL queries since I hope everyone here could understand)
TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)
You might notice that TABLE_word has an extra news_id column. This is to correspond to TABLE_word_importance column because the same word can have different importance in different articles (if you are familiar with tfidf, this is basically something like that).
All the "importance" table now calculates the importance of each entity by averaging the importance of all the sub-entities below it. This means that Each cluster's importance is determined by all the news inside it, each news's importance is determined by all the words inside it etc.
TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ODER BY word_importance score.
3) SELECT groups ordered by importance score.
As you can see, I am deriving a lot of scores from each layer, and someone have been telling me to use a materialized view for this purpose (which postgresql supports it). However, as you can see, this simple schema already consists of 8 tables (my actual db consists of 26 tables of crap like that, which is adding so much additional layers of complexity for maintainance).
NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.
When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:
news-analysis-example http://github.com/neo4j-examples/domain-models/raw/master/news-analysis.png
In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.
How about db4o? db4o
ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".
I don't understand why distributedness is not required. Could you elaborate on that?
Personally, I would reccomend Cassandra. It still has reasonably close ties to (by which I mean easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.
Cassandra, unfortunately, does not have something that serializes/deserializes objects very well yet. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra #classmethod really be that big a deal?
Postgresql may be "schema based" but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed db or a particularly schema-less design (which it doesn't sound like offhand you do, but you appear to think you do) then I'm not sure why you would want mongodb. Postgres has lots of indexing options and it sounds like its built in full text searching would be good for you. If you're used to MySQL and altering tables (you mentioned issues there) can be a nightmare, mostly its better in Postgres. I'm a fan on Postgres and MongoDB - it just don't sound like there's a good reason to move away from a relational db for data that certainly sounds relational in nature.
In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.
MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.
The good news:
This will make it easy to modify fields.
Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the data back to the "news" item and all will be well.
The bad news:
It's easy to lose track of the structure of data with something like Mongo. Hadoop and Hive typically force your schema little more. But in any case, you'll need to write down some form of schema or just drown.
If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.