Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save the selections as a comma-separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table scan. You may have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
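For contrast, a minimal sketch of the normalized design might look like this (table and column names are illustrative, and it assumes an existing entity table keyed by entity_id):

    -- Lookup table for the values the check boxes represent
    CREATE TABLE checkbox_option (
        option_id  INT PRIMARY KEY,
        label      VARCHAR(50) NOT NULL UNIQUE
    );

    -- Junction table: one row per (entity, option) pair
    CREATE TABLE entity_option (
        entity_id  INT NOT NULL,
        option_id  INT NOT NULL,
        PRIMARY KEY (entity_id, option_id),  -- no duplicate values per entity
        FOREIGN KEY (entity_id) REFERENCES entity (entity_id),
        FOREIGN KEY (option_id) REFERENCES checkbox_option (option_id)
    );

    -- "Which entities have option 2?" becomes an indexable join, not a regex scan
    SELECT e.*
    FROM entity AS e
    JOIN entity_option AS eo ON eo.entity_id = e.entity_id
    WHERE eo.option_id = 2;

Most of the problems in the list above become a constraint or a plain SQL feature instead of application code.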
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma-separated list
how to get records that have only a specific set of 2/3/etc. values from that comma-separated list
Another problem with the comma-separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
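To make the bit-flag idea concrete, here is a rough sketch (the table, column, and flag assignments are made up):

    -- Each check box gets one bit: 1 = red, 2 = green, 4 = blue, ...
    CREATE TABLE form_response (
        response_id  INT PRIMARY KEY,
        flags        INT NOT NULL DEFAULT 0
    );

    -- "red and blue checked" is stored as 1 | 4 = 5
    INSERT INTO form_response (response_id, flags) VALUES (1, 5);

    -- Find every response where the "blue" bit (4) is set
    SELECT * FROM form_response WHERE flags & 4 <> 0;

Note that a predicate like flags & 4 <> 0 still can't use an ordinary index, so this buys compactness and type safety rather than query speed.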
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
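For what it's worth, the binding half of that advice can be sketched even in plain SQL with a server-side prepared statement; application drivers expose the same idea through placeholders. (The table and columns here are hypothetical; MySQL syntax.)

    -- Values are bound to placeholders, never concatenated into the SQL string
    PREPARE save_selection FROM
        'INSERT INTO form_data (id, selections) VALUES (?, ?)';
    SET @id = 1, @selections = '1,2,3';
    EXECUTE save_selection USING @id, @selections;
    DEALLOCATE PREPARE save_selection;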
I needed a multi-value column; it could be implemented as an XML field.
It could be converted to a comma-delimited list as necessary.
An XML list can be queried in SQL Server using XQuery.
By being an XML field, some of the concerns can be addressed:
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with a delimited list AND can be converted to a delimited list as needed.
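As a rough illustration of the SQL Server side of this (the table and element names are illustrative):

    -- Hypothetical table with the multi-value column stored as XML
    CREATE TABLE xml_response (
        response_id  INT PRIMARY KEY,
        option_ids   XML NOT NULL
    );

    INSERT INTO xml_response (response_id, option_ids)
    VALUES (1, '<ids><id>1</id><id>2</id><id>5</id></ids>');

    -- Responses whose list contains the value 2 (XQuery exist() method)
    SELECT response_id
    FROM xml_response
    WHERE option_ids.exist('/ids/id[. = "2"]') = 1;

    -- Shred the list back into rows (nodes() + value()), e.g. to join or export as CSV
    SELECT r.response_id,
           x.id.value('.', 'INT') AS option_id
    FROM xml_response AS r
    CROSS APPLY r.option_ids.nodes('/ids/id') AS x(id);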
Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting "NoSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key/value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
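For reference, the MySQL SET syntax is roughly as follows (the member names are placeholders):

    -- A SET column stores any combination of the listed members in one value
    CREATE TABLE responses (
        response_id  INT PRIMARY KEY,
        choices      SET('red', 'green', 'blue') NOT NULL DEFAULT ''
    );

    INSERT INTO responses (response_id, choices) VALUES (1, 'red,blue');

    -- Rows where 'blue' is among the selected members
    SELECT * FROM responses WHERE FIND_IN_SET('blue', choices) > 0;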
I have been looking into text search (without tsvector) of a varchar field (more or less between 10 and 400 chars) that has the following format:
field,field_a,field_b,field_c,...,field_n
The query I am planning to run is probably similar to:
select * from information_table where fields like '%field_x%'
As there are no spaces in the field, I wonder if there are performance issues if I run the search across 500k+ rows.
Any insights into this?
Any documentation on the performance of varchar, and maybe on varchar indexes?
I am not sure if tsvector will work on a full string without spaces. What do you think about this solution? Do you see other solutions that could help improve the performance?
In general the text search parser will treat commas and spaces the same, so if you want to use FTS, the structure with commas does not pose a problem. pg_trgm also treats commas and spaces the same, so if you want to use that method instead it will also not have a problem due to the commas.
The performance is going to depend on how popular or rare the tokens in the query are in the body of text. It is hard to generalize that based on one example row and one example query, neither of which looks very realistic. Best way to figure it out would be to run some real queries with real (or at least realistic) data with EXPLAIN (ANALYZE, BUFFERS) and with track_io_timing turned on.
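If it helps, a bare-bones version of that test with pg_trgm might look like this (the index name is made up; the table and column follow the question):

    -- Trigram index so that LIKE '%field_x%' can use an index scan
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX information_table_fields_trgm
        ON information_table USING gin (fields gin_trgm_ops);

    -- Measure against realistic data; track_io_timing may require superuser rights
    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM information_table WHERE fields LIKE '%field_x%';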
I have a materialized view (which is very much a table) against which I need to run WHERE ... IN kinds of queries.
The column I want to query (say view_id) definitely has repeated values (15-20 repetitions).
The WHERE ... IN queries would also be very large, i.e. they would contain a lot of view_id values to query.
Should I go ahead and create an index on this column?
Will it give me some performance improvements?
I have another column which would help me create a multi-column (unique) index. Would this be a better option?
With questions such as these on performance, there is no substitute for testing it with your exact case. There's little harm in trying it out (even on a production system, but utilize a test system if you can!), other than perhaps slowing performance until you undo what you did. Postgres makes this kind of tinkering safe.
@tim-biegeleisen's first comment is spot on: with your setup, your cardinality is reduced, but that doesn't mean it's not a win.
In short, try it and see. There is no better answer you will get than what your own dataset and access patterns will give you.
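A minimal version of that experiment, assuming the materialized view is called report_view and the second column other_col (both names are placeholders):

    -- Plain index on the repeated column
    CREATE INDEX report_view_view_id_idx ON report_view (view_id);

    -- Compare plan and timing before and after creating the index
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM report_view WHERE view_id IN (101, 102, 103);

    -- If most filters also involve the other column, test a multi-column unique index too
    CREATE UNIQUE INDEX report_view_combo_idx ON report_view (other_col, view_id);

An index you decide against is cheap to undo with DROP INDEX, which is what makes this kind of tinkering safe.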
I'd like to emulate this type of Solr query:
http://wiki.apache.org/solr/MoreLikeThis
with PostgreSQL using its full text search facility.
Is there a way to do something like a "more like this" query with pure postgres?
Not out of the box I am afraid. It might be possible to compare two tsvectors to determine if they are similar enough, or pull the top n similar tsvectors, but there is no out of the box functionality to do this. The good news is that since tsvectors support GIN indexing, the complicated part is done for you.
What I think you'd need to do is create a function in C which determines the intersection of two tsvectors. From there you could create a function which determines if they overlap and an operator which addresses this. From there it shouldn't be too hard to create a ranking based on largest overlap.
Of course I suspect that this will be easiest to do in a language like C but you could probably use other procedural languages as well if you need to.
The wonderful thing about PostgreSQL is that anything is possible. Of course, the downside is that when you move further from core functionality, you get to do a lot of it yourself.
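To make the overlap idea a bit more concrete, here is a rough, unindexed sketch in plain SQL, assuming a documents table with an id and a tsvector column body_tsv (no custom C code involved; tsvector_to_array() needs PostgreSQL 9.6 or later):

    -- Count how many lexemes each document shares with document 42,
    -- then return the closest matches. tsvector_to_array() exposes the
    -- lexemes of a tsvector as text[], so ordinary joins can compare them.
    SELECT d.id,
           (SELECT count(*)
            FROM unnest(tsvector_to_array(d.body_tsv))   AS a(lexeme)
            JOIN unnest(tsvector_to_array(src.body_tsv)) AS b(lexeme) USING (lexeme)
           ) AS shared_lexemes
    FROM documents AS d,
         documents AS src
    WHERE src.id = 42
      AND d.id <> src.id
    ORDER BY shared_lexemes DESC
    LIMIT 10;

A C function and a proper operator, as described above, would let a GIN index do the candidate filtering instead of this brute-force comparison.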
I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated...
Check the "Waiting for 9.1 – Faster LIKE/ILIKE" blog post from depesz for a solution using trigrams.
You'd need to use the yet-unreleased PostgreSQL 9.1 for this. And your writes would be much slower then, as trigram indexes are huge.
Full text search suggested by user12861 would help only if you're searching for words, not substrings.
You probably want to look into full text indexing. It's a bit complicated; maybe someone else can give a better description, or you might try some links, like this one for example:
http://wiki.postgresql.org/wiki/Full_Text_Indexing_with_PostgreSQL
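For reference, the bare-bones version of that setup looks roughly like this (table and column names are placeholders):

    -- Expression GIN index over the text column's tsvector
    CREATE INDEX items_body_fts_idx
        ON items USING gin (to_tsvector('english', body));

    -- Word-level search that can use the index
    SELECT *
    FROM items
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'search words');

As noted above, this only helps for word searches; for true substring matching the trigram approach is the better fit.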