I want to store user-defined segments. A segment will consist of several different rules. I was thinking I would either create a separate table of "Rules" with three columns: attribute name, operator, and value. For example, if a segment is users in the United States, the rule would be "country = US" in their respective columns. A segment can have many rules.
The other option is to store these as JSONB via Postgres in a "Rules" column in the Segment table. I'd follow a similar pattern to the above with an array of rules or something. What are the pros and cons of each method?
Maybe neither one of these is the right approach.
The choice is basically about the way you wish to read the data.
You are better off with JSON if:
you are not going to filter (with a WHERE clause) through the Rules
you do not need to get statistics (i.e. GROUP BY)
you will not impose any constraints on attributes/operators/values
you simply select the values (SELECT ..., Rules)
If you meet these requirements you can store data as JSON, thus eliminating JOINs and subselects, eliminating the overhead of primary key and indexes on Rules, etc.
But if you don't meet these you should store the data in a common relational design - your approach 1.
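As a minimal sketch of the JSON option, assuming the application only ever reads a segment's rules as a whole: SQLite (via Python's standard library) stands in for Postgres here, so the `rules` column is plain text rather than JSONB, and the table and key names are illustrative.

```python
import json
import sqlite3

# SQLite stands in for Postgres; in Postgres the rules column would be JSONB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments (id INTEGER PRIMARY KEY, name TEXT NOT NULL, rules TEXT NOT NULL)"
)

# Hypothetical rule shape: one object per rule, mirroring the three columns.
rules = [
    {"attribute": "country", "operator": "=", "value": "US"},
    {"attribute": "age", "operator": ">=", "value": "18"},
]
conn.execute(
    "INSERT INTO segments (name, rules) VALUES (?, ?)",
    ("US adults", json.dumps(rules)),
)

# Reading a segment is a single-row SELECT with no JOIN; the application
# decodes the rules after fetching them.
name, raw_rules = conn.execute(
    "SELECT name, rules FROM segments WHERE id = 1"
).fetchone()
loaded_rules = json.loads(raw_rules)
```

Nothing in this layout stops a malformed rule from being stored, which is the trade-off described in the list above.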
I would go with the first approach of storing the pieces of data individually in a relational database. It sounds like your data (segments->rules) will always contain the same structure (which is fairly simple), so there isn't a pressing reason to store the data as JSON.
As a side note, I think you will need another column in the "Rules" table, serving as a foreign key to the "Segments" table.
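A rough sketch of that relational layout, including the foreign key, might look like the following. SQLite stands in for Postgres, and the table names, column names, and the set of allowed operators are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE segments (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE rules (
    id             INTEGER PRIMARY KEY,
    segment_id     INTEGER NOT NULL REFERENCES segments(id),
    attribute_name TEXT NOT NULL,
    operator       TEXT NOT NULL CHECK (operator IN ('=', '!=', '<', '<=', '>', '>=')),
    value          TEXT NOT NULL
);
""")

conn.execute("INSERT INTO segments (id, name) VALUES (1, 'US users')")
conn.execute("INSERT INTO segments (id, name) VALUES (2, 'Canadian users')")
conn.executemany(
    "INSERT INTO rules (segment_id, attribute_name, operator, value) VALUES (?, ?, ?, ?)",
    [(1, "country", "=", "US"), (2, "country", "=", "CA"), (1, "age", ">=", "18")],
)

# Rules can be filtered and aggregated directly in SQL.
us_segments = conn.execute(
    "SELECT s.name FROM segments s JOIN rules r ON r.segment_id = s.id "
    "WHERE r.attribute_name = 'country' AND r.value = 'US'"
).fetchall()
rule_counts = conn.execute(
    "SELECT segment_id, COUNT(*) FROM rules GROUP BY segment_id ORDER BY segment_id"
).fetchall()

# A rule pointing at a missing segment is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO rules (segment_id, attribute_name, operator, value) "
        "VALUES (99, 'x', '=', 'y')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```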
Pros to approach 1:
Data is easy to search and select. Your SQL statements can directly access specific information about the rules (the specific operators, name, value, etc) without having to parse the JSON object for the desired rule.
The above will result in reduced processing time
Only need to parse the JSON once (before the insert)
Cons to approach 1:
Requires parsing of JSON before the insert
Requires multiple inserts per segment
Regarding your last sentence, it is hard to prescribe a database design without knowing more about your intended functionality. For example, if the attribute names have meaning beyond a single segment, you would want to store the attribute names separately and reference them in the Rules table.
Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with; one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
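To make a few of the points above concrete, here is a small sketch comparing a comma-separated column with a normalized child table. It uses SQLite through Python's standard library, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Antipattern: ids packed into one string column.
CREATE TABLE entity_csv (id INTEGER PRIMARY KEY, idlist TEXT);
INSERT INTO entity_csv VALUES (1, '1,2,3'), (2, '12,20'), (3, '2');

-- Normalized: one row per (entity, value) pair.
CREATE TABLE entity_value (
    entity_id INTEGER NOT NULL,
    value     INTEGER NOT NULL,
    PRIMARY KEY (entity_id, value)   -- also prevents duplicates like 3,3,3
);
INSERT INTO entity_value VALUES (1, 1), (1, 2), (1, 3), (2, 12), (2, 20), (3, 2);
""")

# A naive substring search on the CSV column also matches 12 and 20.
naive = [r[0] for r in conn.execute(
    "SELECT id FROM entity_csv WHERE idlist LIKE '%2%' ORDER BY id")]

# The normalized table answers the same question with plain equality.
exact = [r[0] for r in conn.execute(
    "SELECT entity_id FROM entity_value WHERE value = 2 ORDER BY entity_id")]

# Counting elements per entity is an ordinary aggregate.
counts = conn.execute(
    "SELECT entity_id, COUNT(*) FROM entity_value "
    "GROUP BY entity_id ORDER BY entity_id").fetchall()
```

Here `naive` wrongly includes entity 2, while `exact` returns only the entities that really contain the value 2.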
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
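The bit-flag idea could be sketched as follows. The checkbox names are hypothetical, and the resulting integer would go into an ordinary integer column.

```python
from enum import IntFlag

class Options(IntFlag):
    """Hypothetical checkboxes; each one owns a single bit."""
    NEWSLETTER = 1
    SMS_ALERTS = 2
    PHONE_CALLS = 4

# Pack the ticked boxes into a single integer for storage.
selected = Options.NEWSLETTER | Options.PHONE_CALLS
stored = int(selected)

# Unpack the integer again when reading the row back.
loaded = Options(stored)
has_sms = Options.SMS_ALERTS in loaded
has_phone = Options.PHONE_CALLS in loaded
```

This keeps the value typed and compact, but like the comma-separated list it cannot be tied to a lookup table with a foreign key.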
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
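For reference, binding the submitted values as parameters looks roughly like this. The sketch uses SQLite with a made-up `choices` table; other drivers follow the same idea but may use a different placeholder style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE choices (user_id INTEGER, choice TEXT)")

# Untrusted form input, including an attempted injection.
submitted = ["red", "blue'); DROP TABLE choices; --"]

# Placeholders keep the values out of the SQL text entirely.
conn.executemany(
    "INSERT INTO choices (user_id, choice) VALUES (?, ?)",
    [(42, c) for c in submitted],
)

# The hostile string was stored as plain data and the table still exists.
stored = [r[0] for r in conn.execute(
    "SELECT choice FROM choices WHERE user_id = ? ORDER BY rowid", (42,))]
```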
I needed a multi-value column, and it can be implemented as an XML field.
It can be converted to a comma-delimited list as necessary.
See: querying an XML list in SQL Server using XQuery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key/value pairs then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
In Redshift we have a table (let's call it entity) which, among other columns, has two important ones: hierarchy_id and entity_timestampt. The hierarchy_id is a combination of the ids of three hierarchical dimensions (A, B, C; each one having a one-to-many relationship with the next one).
Thus: hierarchy_id == A.a_id || '-' || B.b_id || '-' || C.c_id
Additionally the table is distributed according to DISTKEY(hierarchy_id) and sorted using COMPOUND SORTKEY(hierarchy_id, entity_timestampt).
Over this table we need to generate multiple reports. Some of them are fixed to the deepest level of the hierarchy, while others will be filtered by the higher parts and group the results by the lower ones. However, the first layer of the hierarchy (the A dimension) is what defines our security model: users will never have access to A dimensions other than the one they belong to (this is our tenant information).
The current design proved useful for that when we were prototyping the reports in plain SQL, as we could do things like this for the deepest-level queries:
WHERE
entity.hierarchy_id = 'fixed_a_id-fixed_b_id-fixed_c_id' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Or like this for filtering by other points of the hierarchy:
WHERE
entity.hierarchy_id LIKE 'fixed_a_id-%' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
Now we want to use QuickSight for creating and sharing those reports using the embedding capabilities. But we haven't found a way to filter the data of the analysis as we want.
We tried to use RLS with tags for anonymous users, but we have found two problems:
How to inject the A.a_id part of the query in the API that generates the embedding URL in a secure way (i.e. so that users can't change it), while still allowing them to configure the other parts of the hierarchy, and finally how to combine those independent pieces in the filter without needing to generate a new URL each time users change the other parts.
(We could live with this limitation, though.)
How to do partial filters, i.e. the ones that looked like LIKE 'fixed_a_id-fixed_b_id-%', since it seems RLS is always an equals condition.
Is there any way to make QuickSight work as we want with our current table design? Or would we need to change the design?
For the latter, we have thought of keeping the three dimension ids as separate columns. That way we could add RLS for the A.a_id column and use parameters for the other ones. The problem would be the reports that group by lower parts of the hierarchy: it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
COMPOUND SORTKEY(hierarchy_id, entity_timestampt)
You are aware you are sorting on only the first eight bytes of hierarchy_id? and the ability of the zone map to differentiate between blocks is based purely on the first eight bytes of the string?
I suspect you would have done a lot better to have had three separate columns.
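As a sketch of what three separate columns buy you logically: the partial-path LIKE filter becomes a set of plain equality predicates, which is the shape an equals-only row-level filter can express. SQLite is used here only to show that the two filters select the same rows. It has no DISTKEY or SORTKEY, so this says nothing about Redshift performance, and the sample ids are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    a_id TEXT NOT NULL,
    b_id TEXT NOT NULL,
    c_id TEXT NOT NULL,
    entity_timestampt TEXT NOT NULL,
    hierarchy_id TEXT NOT NULL      -- kept only to compare with the old filter
);
INSERT INTO entity VALUES
    ('a1',  'b1', 'c1', '2023-01-05', 'a1-b1-c1'),
    ('a1',  'b1', 'c2', '2023-01-06', 'a1-b1-c2'),
    ('a1',  'b2', 'c3', '2023-01-07', 'a1-b2-c3'),
    ('a10', 'b1', 'c4', '2023-01-08', 'a10-b1-c4');
""")

# Old style: partial path expressed as a string prefix.
like_rows = conn.execute(
    "SELECT c_id FROM entity WHERE hierarchy_id LIKE 'a1-b1-%' ORDER BY c_id"
).fetchall()

# New style: the same filter as equality on separate columns.
column_rows = conn.execute(
    "SELECT c_id FROM entity WHERE a_id = ? AND b_id = ? ORDER BY c_id",
    ("a1", "b1"),
).fetchall()

# Filter by the tenant (A) and group by the next level down (B).
per_b = conn.execute(
    "SELECT b_id, COUNT(*) FROM entity WHERE a_id = ? GROUP BY b_id ORDER BY b_id",
    ("a1",),
).fetchall()
```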
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
I may be wrong - I would need to check - but I think if you use operators of any kind (such as functions, or LIKE, or even addition or subtraction) on a sortkey, the zone map does not operate and you read all blocks.
Also in your case, it may be - I've not tried using it yet - if you have AQUA enabled, because you're using LIKE, your entire query is being processed by AQUA. The performance consequences of this, positive and/or negative, are completely unknown to me.
Have you been using the system tables to verify your expectations of what is going on with your queries when it comes to zone map use?
the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
You are now facing the fundamental nature of sorted column-store; the sorting you choose defines the queries you can issue and so also defines the queries you cannot issue.
You either alter your data design, in some way, so what you want becomes possible, or you can duplicate the table in question where each duplicate has different sorting orders.
The first is an art, the second has obvious costs.
As an aside, although I've never used QuickSight, my experience with all SQL generators has been that they are completely oblivious to sorting and so the SQL they issue cannot be used on Big Data (as sorting is the method by which Big Data can be handled in a timely manner).
If you do not have Big Data, you'll be fine, but the question then is why are you using Redshift?
If you do have Big Data, the only solution I know of is to create a single aggregate table per dashboard, about 100k rows, and have the given dashboard use and only use that one table. The dashboard should normally simply read the entire table, which is fine, and then you avoid the nightmare SQL it normally will produce.
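A minimal sketch of the one-aggregate-table-per-dashboard idea, again with SQLite standing in for Redshift: the `amount` column, the `dashboard_daily` table name, and the choice of grouping columns are assumptions, and in practice the table would be rebuilt on whatever schedule the dashboard needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (a_id TEXT, b_id TEXT, entity_timestampt TEXT, amount INTEGER);
INSERT INTO entity VALUES
    ('a1', 'b1', '2023-01-05', 10),
    ('a1', 'b1', '2023-01-05', 5),
    ('a1', 'b2', '2023-01-06', 7);

-- Precompute exactly what the dashboard shows; the dashboard then reads
-- this small table in full instead of querying the large one.
CREATE TABLE dashboard_daily AS
SELECT a_id, b_id, entity_timestampt AS day, COUNT(*) AS n, SUM(amount) AS total
FROM entity
GROUP BY a_id, b_id, entity_timestampt;
""")

dashboard_rows = conn.execute(
    "SELECT a_id, b_id, day, n, total FROM dashboard_daily ORDER BY b_id"
).fetchall()
```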
Use case
I need to store texts assigned to some entity. It's important to note that I only ever care about the most current texts that have been assigned to that entity. When new texts are inserted, older ones might even be deleted. And that "might" is the problem, because I can't rely on only the most current texts being available.
The only thing I'm unsure about how to design is the case that some INSERT can provide either 1 or N texts for some entity. In the latter case, I need to know which N texts belong to the most current INSERT done for one and the same entity. Additionally, inserting N instead of 1 text will be pretty rare.
I know that things could be implemented using two different tables: One calculating some main-ID and the other mapping individual texts with their own IDs to that main-ID. Because multiple texts should happen rarely and a one table design already provides columns which could easily be reused for grouping multiple texts together, I prefer using one only.
Additionally, I thought of which concept would make a good grouping key in general as well. I somewhat doubt that others really always implement the two table-approach only and therefore created this question to get a better understanding. Of course I simply might be wrong and everybody avoids such "hacks" at all costs. :-)
Possible keys
Transaction-local timestamp
Postgres supports the concept of a transaction-local timestamp using current_timestamp. I need to have one of those to store when the texts have been stored anyway, so they might be used for grouping as well?
While there's in theory the possibility of collisions, timestamps have a resolution of 1 microsecond, which is in practice enough for my needs. Texts are uploaded by human users and it is very unlikely that multiple humans upload texts for the same entity at the same time at all.
That timestamp won't be used as a primary key of course, only to group multiple texts if necessary.
Transaction-ID
Postgres supports txid_current to get the ID of the current transaction, which should be ever increasing over the lifetime of the current installation. The good thing is that this value is always available and the app doesn't need to do anything on its own. But things can easily break in case of restores, can't they? Can TXIDs e.g. occur again with the restored cluster?
People knowing things better than me write the following:
Do not use the transaction ID for anything at the application level. It is an internal system level field. Whatever you are trying to do, it's likely that the transaction ID is not the right way to do it.
https://stackoverflow.com/a/32644144/2055163
You shouldn't be using the transaction ID for anything except an identifier for a transaction. You can't even assume that a lower transaction ID is an older transaction, due to transaction ID wrap-around.
https://stackoverflow.com/a/20602796/2055163
Isn't my grouping a valid use case for wanting to know the ID of the current transaction?
Custom sequence
Grouping simply needs a unique key per transaction, which can be achieved using a custom sequence for that purpose only. I don't see any downsides, its values consume less storage than e.g. UUIDs and can easily be queried.
Reusing first unique ID
The table to store the texts contains a serial-column, so each inserted text gets an individual ID already. Therefore, the ID of the first inserted text could simply always be additionally reused as the group-key for all later added texts.
In case of only inserting one text, one should easily be able to use currval and doesn't even need to explicitly query the ID of the inserted row. In case of multiple texts this doesn't work anymore, though, because currval would provide updated IDs instead of the first one per transaction only. So some special handling would be necessary.
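The special handling could look something like this sketch. SQLite stands in for Postgres, so `cursor.lastrowid` plays the part that `INSERT ... RETURNING id` would play there, and the table layout and the `insert_texts` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE texts (
    id        INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL,
    group_id  INTEGER,
    body      TEXT NOT NULL
)""")

def insert_texts(conn, entity_id, bodies):
    """Insert one or more texts in one transaction; return the group key."""
    with conn:  # commits on success, rolls back on error
        first, *rest = bodies
        cur = conn.execute(
            "INSERT INTO texts (entity_id, body) VALUES (?, ?)", (entity_id, first))
        group_id = cur.lastrowid  # id of the first text becomes the group key
        conn.execute(
            "UPDATE texts SET group_id = ? WHERE id = ?", (group_id, group_id))
        conn.executemany(
            "INSERT INTO texts (entity_id, group_id, body) VALUES (?, ?, ?)",
            [(entity_id, group_id, b) for b in rest])
    return group_id

old_group = insert_texts(conn, 7, ["old text"])
new_group = insert_texts(conn, 7, ["new text 1", "new text 2"])

# Because ids only grow, the highest group_id marks the most current INSERT.
current = [r[0] for r in conn.execute(
    "SELECT body FROM texts WHERE group_id = "
    "(SELECT MAX(group_id) FROM texts WHERE entity_id = ?) ORDER BY id", (7,))]
```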
APP-generated random UUID
Each request to store multiple texts could simply generate some UUID and group by that. The mainly used database Postgres even supports a corresponding data type.
I mainly see two downsides with this: it feels really hacky already and consumes more space than necessary. OTOH, compared to the texts to store, the latter might simply be negligible.
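For comparison, the app-generated UUID variant might be sketched like this. SQLite has no UUID type, so the key is stored as text here, and the `store_texts` helper and column names are made up for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE texts (
    id        INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL,
    group_key TEXT NOT NULL,
    body      TEXT NOT NULL
)""")

def store_texts(conn, entity_id, bodies):
    """Store all texts of one request under a freshly generated UUID."""
    group_key = str(uuid.uuid4())
    with conn:  # one transaction per request
        conn.executemany(
            "INSERT INTO texts (entity_id, group_key, body) VALUES (?, ?, ?)",
            [(entity_id, group_key, b) for b in bodies])
    return group_key

first_key = store_texts(conn, 7, ["old text"])
second_key = store_texts(conn, 7, ["new text 1", "new text 2"])

# Random UUIDs carry no order, so "most current" is found via the serial id.
latest = [r[0] for r in conn.execute(
    "SELECT body FROM texts WHERE group_key = "
    "(SELECT group_key FROM texts WHERE entity_id = ? ORDER BY id DESC LIMIT 1) "
    "ORDER BY id", (7,))]
```

Unlike a sequence value or a reused serial ID, the UUID only groups rows; ordering still has to come from another column.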
I'm creating a web-app that lets users search for restaurants and cafes. Since I currently have no data other than their type to differentiate the two, I have two options on storing the list of eateries.
Use a single table for both restaurants and cafes, and have an enum (text) column stating if an entry is a restaurant or cafe.
Create two separate tables, one for restaurants, and one for cafes.
I will never need to execute a query that collects data from both, so the only thing that matters to me I guess is performance. What would you suggest as the better option for PostgreSQL?
Typical database modeling would lend itself to a single table. The main reason is maintainability. If you have two tables with the same columns and your client decides they want to add a column, say hours of operation, you now have to write two sets of code for creating the column, reading the new column, updating the new column, etc. Also, what if your client wants you to start tracking bars? Now you need a third table with a third set of code. It gets messy quickly. It would be better to have two tables: a data table (say Establishment) with most of the columns (name, location, etc.) and a second table that's a "type" table (say EstablishmentType) with a row for Restaurant, Cafe, Bar, etc., and of course a foreign key linking the two. This way you can have "X" types and only need to maintain a single set of code.
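A small sketch of that Establishment/EstablishmentType layout, using SQLite in place of PostgreSQL: the column list and sample rows are assumptions, and `find_by_type` is a hypothetical helper showing that one query serves every type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE EstablishmentType (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE Establishment (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    location TEXT NOT NULL,
    type_id  INTEGER NOT NULL REFERENCES EstablishmentType(id)
);
INSERT INTO EstablishmentType (id, name) VALUES (1, 'Restaurant'), (2, 'Cafe');
INSERT INTO Establishment (name, location, type_id) VALUES
    ('Trattoria Roma', 'Main St', 1),
    ('Bean There', 'Oak Ave', 2);

-- Adding bars later is one new row, not a new table.
INSERT INTO EstablishmentType (id, name) VALUES (3, 'Bar');
INSERT INTO Establishment (name, location, type_id) VALUES ('The Tap', 'Elm Rd', 3);
""")

def find_by_type(conn, type_name):
    """One query, and one set of code, for every establishment type."""
    return [r[0] for r in conn.execute(
        "SELECT e.name FROM Establishment e "
        "JOIN EstablishmentType t ON t.id = e.type_id "
        "WHERE t.name = ? ORDER BY e.name", (type_name,))]

cafes = find_by_type(conn, "Cafe")
bars = find_by_type(conn, "Bar")
```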
There are of course exceptions to this rule where you may want separate tables:
Performance due to a HUGE data set. (It depends on your server, but we're talking at least hundreds of thousands of rows before it should matter in Postgres.) If this is the reason I would suggest table inheritance to keep much of the proper maintainability while speeding up performance.
Cafes and Restaurants have two completely different sets of functionality in your website. If the entirety of your code is saying if Cafe, do this, if Restaurant, do that, then you already have two sets of code to maintain, with the added hassle of if logic in your code. If that's the case, two separate tables is a much cleaner and logical option.
In the end I chose to use 2 separate tables, as I really will never need to search for both at the same time, and this way I can expand a single table in the future if I need to add another data field specific to cafes, for example.
I'm trying to precompute a user-defined function on a per-row basis. The idea is that I have a JSON object stored as text in one of the fields, and I want to parse out some other 'fields' from it, which can be returned in queries just like any other true field. However, the overhead of parsing the JSON is significant. Is there any way to precompute this parsing function in a way that speeds up queries?
Please refrain from arguing that there shouldn't be JSON as text on the database in the first place; I am aware of the pros and cons.
First off, you may be interested in the upcoming JSON data type of PostgreSQL 9.2 (to be released soon, now).
As to your question, you are looking for a materialized view (or the simpler form: a redundant precomputed column in your table). "Materialized View" is just the established term, not a special object in a PostgreSQL database. Basically you create a redundant table with precomputed values, that you refresh at certain events or on a timely basis.
A search for the term will give you some answers.
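The simpler form, a redundant precomputed column, can be sketched as below. The application parses the JSON once at insert time and stores the extracted field next to the raw text. SQLite stands in for PostgreSQL, the `docs` table and `author` field are invented, and inside the database the same column could instead be maintained by a trigger.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE docs (
    id      INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,   -- raw JSON kept as text
    author  TEXT             -- redundant copy of payload's "author" field
)""")

def insert_doc(conn, payload_text):
    """Parse the JSON once, at write time, and store the extracted field."""
    author = json.loads(payload_text).get("author")
    conn.execute(
        "INSERT INTO docs (payload, author) VALUES (?, ?)", (payload_text, author))

insert_doc(conn, '{"author": "alice", "title": "First"}')
insert_doc(conn, '{"author": "bob", "title": "Second"}')

# Queries read the plain column; no JSON parsing happens at read time.
alice_ids = [r[0] for r in conn.execute(
    "SELECT id FROM docs WHERE author = ?", ("alice",))]
```

The cost is that every code path that changes `payload` must also refresh `author`, which is the usual price of redundancy.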
In addition to a materialized view, keep in mind that PostgreSQL can also index functions' output so you can do something like:
CREATE INDEX my_foo_bar_udf_idx ON foo (bar(baz));
This works only if the UDF is marked as immutable meaning output only depends on arguments. This gives you an option to run your function against the query arguments and then scan the index instead of the table. It doesn't meet all use cases, but it does meet many of them and it can often save you the headaches of materializing views.
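The same idea can be tried out locally with SQLite, which also supports indexes on expressions. In this sketch `bar` is a hypothetical Python UDF, and `deterministic=True` (available from Python 3.8) roughly plays the role of marking a PostgreSQL function immutable; SQLite rejects non-deterministic functions in index expressions.

```python
import json
import sqlite3

def bar(baz):
    """Hypothetical UDF: pull the "author" field out of a JSON text value."""
    return json.loads(baz).get("author")

conn = sqlite3.connect(":memory:")
# The function must be registered as deterministic before it can be indexed.
conn.create_function("bar", 1, bar, deterministic=True)

conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, baz TEXT NOT NULL)")
conn.execute("CREATE INDEX my_foo_bar_udf_idx ON foo (bar(baz))")

# The function runs once per row at write time to fill the index.
conn.executemany(
    "INSERT INTO foo (baz) VALUES (?)",
    [('{"author": "alice"}',), ('{"author": "bob"}',)],
)

# A query using the same expression is a candidate for the index.
bob_ids = [r[0] for r in conn.execute(
    "SELECT id FROM foo WHERE bar(baz) = ?", ("bob",))]
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'foo'")]
```

Whether the planner actually picks the index depends on the database and the query, so it is worth checking the query plan rather than assuming it.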