SQL table structure - tsql

I am starting a new project that will handle surveys and reviews. At this point I am trying to figure out what would be the best sql table structure to store and handle such information.
Basically, the survey will contain ratings, text reviews and additional optional information that clients can share. I am thinking of either storing each piece of information in a separate column or merging all this data and storing it as XML in one column.
I am not sure what would be a better solution, but I have the following issues on my mind:
- would a possible increase in the information collected be a problem with a single XML column?
- would a single XML column have a serious impact on performance when extracting and handling information from it?

If you ever have a reason to query on a single piece of info, or update it alone, then don't store that data in XML, but instead as a separate column.
It is rare, IMO, that storing XML (or any other composite type of data) is a good idea in a DB. Although there are always exceptions.

Well, to keep this simple, you have two choices: dynamic or static surveys.
Dynamic surveys would look like this:
Not only would reporting be more complicated, but so would the UI. The number of questions is unknown and you would eventually need logic to handle order, grouping, and data types.
Static surveys would look more like this:
Although you certainly give up some flexibility, the solution (including reports) is considerably simpler. You need not handle order, grouping, or data types (at least dynamically).
I like to argue that "Simplicity is the best Design" in almost everything.
Since I cannot know your requirements in detail, I cannot assume which is the better fit. But I can tell you this, the dynamic is often built when the static is sufficient.
Best of luck!

If you don't want to fight with a relational database that expects relational data you probably want reasonably normalized data. I don't see in your case what advantage the XML would give you. If you have multiple values entered in the survey, you probably want another table for survey entries with a foreign key to the survey.
If this is going to be a relatively extensive application you might think about a table for survey definition, a table for survey question, a table for survey response, and a table for survey question response. If the survey data can be multiple types, you might need a table for each kind of question that might be asked, though in some cases a column might do.
EDIT - I think you would at least have one row per answer to a question. If the answer is complex (doesn't correspond to just one instance of a simple data type) it might actually be multiple rows (though denormalizing into multiple columns is probably O.K. if the number of columns is small and fixed). If an answer to one question needs to be stored in multiple rows, you would almost certainly end up with one table that represents the answer, and has one row per answer, plus another table that represents pieces of the answer, and has one row per piece.
If the reason you are considering XML is that the answers are going to be of very different types (for example, a review with a rating, a title, a header, a body, and a comments section for one question; a list of hyperlinks for another question, etc.) then the answer table might actually have to be several tables, so that you can model the data for each type of question. That would be a pretty complicated case though.
Hopefully one row per response, in a single table, would be sufficient.
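The four-table layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite (via Python's standard library) stands in for SQL Server so the example runs without a server, and every table and column name here is hypothetical rather than taken from the question.

```python
import sqlite3

# SQLite is a stand-in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE survey (
    survey_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE survey_question (
    question_id INTEGER PRIMARY KEY,
    survey_id   INTEGER NOT NULL REFERENCES survey(survey_id),
    prompt      TEXT NOT NULL
);
CREATE TABLE survey_response (
    response_id INTEGER PRIMARY KEY,
    survey_id   INTEGER NOT NULL REFERENCES survey(survey_id),
    client_name TEXT
);
-- One row per answer to a question.
CREATE TABLE survey_question_response (
    response_id INTEGER NOT NULL REFERENCES survey_response(response_id),
    question_id INTEGER NOT NULL REFERENCES survey_question(question_id),
    rating      INTEGER,
    review_text TEXT,
    PRIMARY KEY (response_id, question_id)
);
""")

conn.execute("INSERT INTO survey VALUES (1, 'Service feedback')")
conn.execute("INSERT INTO survey_question VALUES (1, 1, 'Overall rating')")
conn.executemany(
    "INSERT INTO survey_response VALUES (?, 1, ?)",
    [(1, "Ann"), (2, "Bob")],
)
conn.executemany(
    "INSERT INTO survey_question_response VALUES (?, 1, ?, ?)",
    [(1, 5, "Great"), (2, 3, None)],
)


def average_rating(question_id):
    """Aggregate over a plain column; no XML parsing is needed."""
    row = conn.execute(
        "SELECT AVG(rating) FROM survey_question_response WHERE question_id = ?",
        (question_id,),
    ).fetchone()
    return row[0]
```

Because each rating sits in its own column, reporting queries such as the average above stay ordinary SQL.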

To piggyback off of Flimzy's answer, you want to simply store the data in the database and not in a specific format (e.g. XML). You might have a requirement at the moment for XML, but tomorrow it might be a CSV or a fixed-width DAT file. Also, if you store just the data, then you can use the "power" of the database to search on specific columns of information and then return it as XML, if desired.

Related

POSTGRESQL JSONB column for storing follower-following relationship [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
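For comparison, here is a minimal sketch of the normalized alternative: a junction table with one row per selected value. SQLite (through Python's standard library) is used only so the example is runnable, and the table names `form_entry`, `option_lookup` and `entry_option` are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per (entry, selected option).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE option_lookup (
    option_id INTEGER PRIMARY KEY,
    label     TEXT NOT NULL UNIQUE
);
CREATE TABLE form_entry (
    entry_id INTEGER PRIMARY KEY
);
CREATE TABLE entry_option (
    entry_id  INTEGER NOT NULL REFERENCES form_entry(entry_id),
    option_id INTEGER NOT NULL REFERENCES option_lookup(option_id),
    PRIMARY KEY (entry_id, option_id)   -- enforces uniqueness per entry
);
INSERT INTO option_lookup VALUES (1, 'email'), (2, 'phone'), (3, 'post');
INSERT INTO form_entry VALUES (10), (11);
INSERT INTO entry_option VALUES (10, 1), (10, 2), (11, 2);
""")


def try_insert(entry_id, option_id):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO entry_option VALUES (?, ?)", (entry_id, option_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def entries_with_option(option_id):
    """Search by value with a plain equality test instead of a regex."""
    rows = conn.execute(
        "SELECT entry_id FROM entry_option WHERE option_id = ? ORDER BY entry_id",
        (option_id,),
    )
    return [r[0] for r in rows]


def option_count(entry_id):
    """Counting the 'list' is a simple aggregate."""
    return conn.execute(
        "SELECT COUNT(*) FROM entry_option WHERE entry_id = ?", (entry_id,)
    ).fetchone()[0]


def remove_option(entry_id, option_id):
    """Delete one value without fetching the rest."""
    with conn:
        conn.execute(
            "DELETE FROM entry_option WHERE entry_id = ? AND option_id = ?",
            (entry_id, option_id),
        )
```

In this sketch the duplicate and the unknown option are rejected by the database itself, and searching, counting and deleting need no string handling in application code.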
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
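The bit-flag idea mentioned above could look something like this sketch. The checkbox names are hypothetical, and the encoded integer would be stored in a single integer column. Note that this remains a packed value: it avoids parsing strings, but it still cannot be joined to a lookup table the way a normalized design can.

```python
from enum import IntFlag


class Choice(IntFlag):
    """Hypothetical checkboxes; each one occupies its own bit."""
    EMAIL = 1
    PHONE = 2
    POST = 4


def encode(selected):
    """Combine the selected checkboxes into one integer for storage."""
    value = Choice(0)
    for choice in selected:
        value |= choice
    return int(value)


def decode(stored):
    """Recover the list of selected checkboxes from the stored integer."""
    return [choice for choice in Choice if choice & stored]
```

For example, `encode([Choice.EMAIL, Choice.POST])` sets bits 1 and 4, and `decode` reverses the mapping.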
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column; it could be implemented as an XML field.
It could be converted to a comma-delimited list as necessary.
See: querying an XML list in SQL Server using XQuery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with a delimited list, and it can be converted to a delimited list as needed.
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/depersists the key/value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).

PostgreSQL: JSON column or one-to-many table for config options

We currently have a table which stores information about users. Some of the columns hold information such as user ID, name etc., but many other columns (booleans, integers and varchars etc) hold configuration options for each user.
This has over time resulted in the width of the table becoming quite big and I think the time has come to migrate this to something new, so I want to remove all the "option"-related columns to a separate data structure.
The typical way of doing this, from my experience, would be to have a new table which would simply have option_id and option_name, and a second new table which would contain user_id, option_id, option_value, for example.
However, a colleague suggested using the new jsonb column type as an alternative, but I don't know if I like the idea of storing relational data in a non-relational way. From a Java point of view, it's pretty much the same as far as I can tell - it'll just be turned into a POJO and then cached on the object.
I should mention the number of users will be quite low, only going into the thousands, and number of columns could and will go into the hundreds.
Does anyone have advice on the best way forward here?
Technically, you have already de-normalized your database structure by adding columns to a table that are irrelevant to some of the entities stored therein.
Using JSON is just another way to de-normalize, cramming a bunch of values into a single row-column field. The excellent binary support for JSON in Postgres (the jsonb data type) then lets you index elements within those JSON documents, as a way to quickly access those embedded values. This is quite screwy from a relational point of view, but is handy for some situations.
Either approach is commonly done for this kind of problem, and is not necessarily bad. In general, de-normalizing is often a pay-now-or-pay-later kind of solution. But for something like user preferences, there may not be a pay-later penalty, as there often is with most business-oriented problem domains.
Nevertheless, you should consider a normalized database structure.
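A minimal sketch of the normalized layout from the question (one table naming the options, one table holding each user's values) might look like this. SQLite via Python's standard library is only a stand-in for PostgreSQL, and the table names `app_user`, `option_def` and `user_option` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE app_user (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE option_def (
    option_id   INTEGER PRIMARY KEY,
    option_name TEXT NOT NULL UNIQUE
);
CREATE TABLE user_option (
    user_id      INTEGER NOT NULL REFERENCES app_user(user_id),
    option_id    INTEGER NOT NULL REFERENCES option_def(option_id),
    option_value TEXT NOT NULL,
    PRIMARY KEY (user_id, option_id)
);
INSERT INTO app_user VALUES (1, 'alice');
INSERT INTO option_def VALUES (1, 'theme'), (2, 'page_size');
INSERT INTO user_option VALUES (1, 1, 'dark'), (1, 2, '50');
""")


def load_options(user_id):
    """Return one user's options as a name -> value mapping."""
    rows = conn.execute(
        """
        SELECT d.option_name, u.option_value
        FROM user_option AS u
        JOIN option_def AS d ON d.option_id = u.option_id
        WHERE u.user_id = ?
        """,
        (user_id,),
    )
    return dict(rows.fetchall())
```

One trade-off visible in this sketch is that every value is stored as text, so booleans and integers have to be converted by the application (or split into typed columns).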
By the way, this kind of table-structure Question might be better asked in the sister site, http://DBA.StackExchange.com/.
I suggest searching Stack Overflow, that DBA site, and the wider Internet for discussions of database design for storing user preferences. Like this.

Is it good practice to have 2 or more tables with the same columns?

I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options on storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation, you now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars? Now you need a third table with a third set of code. It gets messy quickly. It would be better to have two tables: a data table (say Establishment) with most of the columns (name, location, etc.) and a second table that's a "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc., and of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
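As a rough sketch of the Establishment / EstablishmentType layout described above (SQLite through Python's standard library stands in for PostgreSQL, and the sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE establishment_type (
    type_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE establishment (
    establishment_id INTEGER PRIMARY KEY,
    type_id          INTEGER NOT NULL REFERENCES establishment_type(type_id),
    name             TEXT NOT NULL,
    location         TEXT
);
INSERT INTO establishment_type VALUES (1, 'Restaurant'), (2, 'Cafe');
INSERT INTO establishment VALUES
    (1, 1, 'Trattoria Roma', 'Main St'),
    (2, 2, 'Bean There', 'Oak Ave');
""")


def names_of_type(type_name):
    """List establishments of one type; the same query serves every type."""
    rows = conn.execute(
        """
        SELECT e.name
        FROM establishment AS e
        JOIN establishment_type AS t ON t.type_id = e.type_id
        WHERE t.name = ?
        ORDER BY e.name
        """,
        (type_name,),
    )
    return [r[0] for r in rows]


# Tracking bars later needs a new row in the type table, not a new table.
conn.execute("INSERT INTO establishment_type VALUES (3, 'Bar')")
conn.execute("INSERT INTO establishment VALUES (3, 3, 'The Tap', 'Elm St')")
```

The query function did not change when the new type was added, which is the maintainability argument in miniature.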
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but we're talking at least hundreds of thousands of rows before it should matter in Postgres.) If this is the reason, I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and more logical option.
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.

JSON or relational tables for complex user profiles

I am trying to design a Postgres database for holding a variety of information about users and see two obvious ways to go about it - specifically, the different many-many relations.
Store the basic user data in a user_info table. In separate tables, store the many-many relations like what schools someone attended, places they worked at, and so on. There will be a large number of such tables, (it is easy to add things like what places someone visited, what books they've read, etc. etc. I expect this to grow to a rather large list of tables).
In the main user_info table, store a JSON blob (properly organized of course) with all this additional info.
Which of these two options should I choose? Naturally, read performance is more important. I know that JSON is generally slower than ordinary relational tables but I am unsure if looking up info from a lot of different tables (as in option 1) will be slower than getting a single json blob and displaying it in the browser. As a further note, the JSONB format, in Postgres, actually has good indexing options.
Update:
Following some comments that a graphdb is what needs to be used: I should clarify the question is not about the choice of technology (rdbms vs graph db). But about the choice of data type given the technology (rdbms).
NoSQL is great for when you don't know what data you're going to store or how it's going to be used, or it fits well with the list/hash model. Relational databases are great for when you have a lot of certainty about the data, how it will be used, and when it fits into the relational model. I would suggest a hybrid approach, especially given PostgreSQL 9.2's JSON performance improvements.
Make traditional relationships for things you know are solid.
Make use of JSON for data that you want to capture but aren't sure you need.
For simple lists, make use of PostgreSQL arrays or JSON rather than join tables.
Abstract this all behind model classes.
As you gain more knowledge about the data, change how it's stored.
For example, make tables for People, Schools, Work and Places, and join tables between them. Fields like People.name and Places.address are normal columns. Store things like "list of a person's pets" as an array of TEXT or a JSON field until you feel you need a Pets table. Put any extra information you don't immediately know what to do with, like "how big is a school's endowment", into a JSON metadata column.
Using model classes allows you to refactor your database without worrying about every piece of code that touches the database. Just be sure that all code which makes assumptions about the table structure goes into model methods.
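A minimal sketch of that idea, assuming a hypothetical `PersonModel` class: callers only see `add` and `pets`, so the pets list could later move from a JSON column to a proper Pets table without touching them. SQLite with a TEXT column holding JSON is used here only so the example runs anywhere; in PostgreSQL the column would more likely be `jsonb` or `text[]`.

```python
import json
import sqlite3


class PersonModel:
    """Hypothetical model class; all knowledge of the table layout lives here."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS people (
                person_id INTEGER PRIMARY KEY,
                name      TEXT NOT NULL,
                pets_json TEXT NOT NULL DEFAULT '[]'
            )
            """
        )

    def add(self, name, pets=()):
        """Insert a person and return the generated id."""
        cur = self.conn.execute(
            "INSERT INTO people (name, pets_json) VALUES (?, ?)",
            (name, json.dumps(list(pets))),
        )
        return cur.lastrowid

    def pets(self, person_id):
        """Return the pets as a Python list, however they are stored."""
        row = self.conn.execute(
            "SELECT pets_json FROM people WHERE person_id = ?", (person_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []


conn = sqlite3.connect(":memory:")
people = PersonModel(conn)
alice_id = people.add("Alice", ["Rex", "Tom"])
bob_id = people.add("Bob")
```

If a Pets table is introduced later, only the bodies of `add` and `pets` would need to change.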

One big and wide table or many not so big for statistics data

I'm writing a simple analytics system for my company. I have about 100 different event types that should be collected for tens of projects. We are not interested in cross-project analytic queries, but the events have similar types across all projects. I use PostgreSQL as the primary storage for this system. Now I need to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all event types. It will have about 20 or more columns, many of them nullable. Partitioning might be used to split this table by event type, but the table will still be wide.
The second architecture is a lot of tables (fairly big in terms of row count, but not so wide), with one table per event type.
I am going to retrieve analytic data from these tables using various join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once or a little at a time), the volume of your data per project (hundreds of data points vs. millions of data points) and the querying pattern (i.e., querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be whether new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some numeric and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
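The combined layout could be sketched like this: one table for the common attributes and a narrow extension table per event type. SQLite via Python's standard library stands in for PostgreSQL, only three common columns are shown instead of ten, and all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Attributes shared by every event type.
CREATE TABLE event (
    event_id    INTEGER PRIMARY KEY,
    event_type  TEXT NOT NULL,
    occurred_at TEXT NOT NULL,
    user_id     INTEGER
);
-- Attributes that only purchase events have.
CREATE TABLE purchase_event (
    event_id     INTEGER PRIMARY KEY REFERENCES event(event_id),
    amount_cents INTEGER NOT NULL
);
INSERT INTO event VALUES
    (1, 'purchase', '2024-01-01', 7),
    (2, 'login',    '2024-01-01', 7),
    (3, 'purchase', '2024-01-02', 8);
INSERT INTO purchase_event VALUES (1, 1500), (3, 500);
""")


def events_per_type():
    """Queries on common attributes never touch the extension tables."""
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM event "
        "GROUP BY event_type ORDER BY event_type"
    )
    return rows.fetchall()


def purchase_total_by_day():
    """Type-specific analytics join the common table to one extension table."""
    rows = conn.execute(
        """
        SELECT e.occurred_at, SUM(p.amount_cents)
        FROM event AS e
        JOIN purchase_event AS p ON p.event_id = e.event_id
        GROUP BY e.occurred_at
        ORDER BY e.occurred_at
        """
    )
    return rows.fetchall()
```

The extension tables stay typed and narrow, while reports that only need the shared attributes read a single table.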
Lastly, you could use one of PostgreSQL's "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of Pgsql or, at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
I didn't ask, but it would be nice to know whether each event within a project has more than one data point (i.e., are you logging changes, or just updating data). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.