precomputing user-defined functions in postgresql - postgresql

I'm trying to precompute a user-defined function on a per row basis. The idea is I have JSON object as a text object in one of the fields, and I want to parse out some other 'fields' from it, which can be returned in queries just like any other true field. However, the overhead of parsing the JSON is significant. Is there any way to precompute this parsing function in a way that speeds up queries?
Please refrain from arguing that there shouldn't be JSON as text on the database in the first place; I am aware of the pros and cons.

First off, you may be interested in the upcoming JSON data type of PostgreSQL 9.2 (to be released soon, now).
As to your question, you are looking for a materialized view (or the simpler form: a redundant precomputed column in your table). "Materialized View" is just the established term, not a special object in a PostgreSQL database. Basically you create a redundant table with precomputed values, that you refresh at certain events or on a timely basis.
A search for the term will give you some answers.

In addition to a materialized view, keep in mind that PostgreSQL can also index functions' output so you can do something like:
CREATE INDEX my_foo_bar_udf_idx ON foo (bar(baz));
This works only if the UDF is marked as immutable meaning output only depends on arguments. This gives you an option to run your function against the query arguments and then scan the index instead of the table. It doesn't meet all use cases, but it does meet many of them and it can often save you the headaches of materializing views.

Related

POSTGRESQL JSONB column for storing follower-following relationship [duplicate]

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as #OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc specific value from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column, it could be implemented as an xml field
It could be converted to a comma delimited as necessary
querying an XML list in sql server using Xquery.
By being an xml field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.**
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.**
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with delimited list AND can be converted to a delimited list as needed
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well I've been using a key/value pair tab separated list in a NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries but on the other hand, if you have a library that persists/derpersists the key value pair then it's not a that bad idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use a INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR (0) (nullable) for each. You could also use a SET (I forget the exact syntax).

Storing user-defined segments JSONB vs separate table?

I want to store user-defined segments. A segment will consist of several different rules. I was thinking I would either create a separate separate table of "Rules" with three columns: attribute name, operator, and value. For example, if a Segment is users in the united states the rule would be "country = US" in their respective columns. A segment can have many rules.
The other option is to store these as JSONB via Postgres in a "Rules" column in the Segment table. I'd follow a similar pattern to the above with an array of rules or something. What are the pros and cons of each method?
Maybe neither one of these is the right approach.
The choice is basically about the way you wish to read the data.
You are better off with JSON if:
you are not going to filter (with a WHERE clause) through the Rules
you do not need to get statistics (i.e. GROUP BY)
you will not imply any constraints on attributes/operators/values
you simply select the values (SELECT ..., Rules)
If you meet these requirements you can store data as JSON, thus eliminating JOINs and subselects, eliminating the overhead of primary key and indexes on Rules, etc.
But if you don't meet these you should store the data in a common relational design - your approach 1.
I would go with the first approach of storing the pieces of data individually in a relational database. It sounds like your data (segments->rules) will always contain the same structure (which is fairly simple), so there isn't a pressing reason to store the data as JSON.
As a side note, I think you will need another column in the "Rules" table, serving as a foreign key to the "Segments" table.
Pros to approach 1:
Data is easy to search and select. Your SQL statements can directly access specific information about the rules (the specific operators, name, value, etc) without having to parse the JSON object for the desired rule.
The above will result in reduced processing time
Only need to parse the JSON once (before the insert)
Cons to approach 1:
Requires parsing of JSON before the insert
Requires multiple inserts per segment
Regarding your last sentence, it is hard to prescribe a database design without knowing more about your intended functionality. For example, if the attribute names have meaning beyond a single segment, you would want to store the attribute names separately and reference them in the Rules table.

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it'd be good to have. Ditto arrays (there was work to get foreign keys to arrays to work but I don't think it ever got comitted).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accomodate garbage ORMs to the point where you avoid using features that'll greatly improve the data model and solve real world problems - for example, if you have the choice of storing native hstore fields, or using an EAV schema, consider just using jstore (or better, in 9.4, json with hstore features).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.

hierarchical data in a database: recursive query vs. closure tables vs. graph database

I'm starting on a new project that has some hierarchical data and I'm looking at all the options for storing that in a database at the moment.
I am using PostgreSQL, which does allow recursive querying. I also looked into design patterns for relational databases, such as closure tables and I had a look at graph database solutions such as neo4j.
I'm finding it difficult to decide between those options. For example: given that my RDBMS allows recursive queries, would it still make sense to use closure tables and how does that compare to graph database solutions in terms of maintainability and performance?
Any opinions/experience would be much appreciated!
The whole closure table is redundant if you can use recursive queries :)
I think it's much better to have a complicated recursive query that you have to figure out once than deal with the extra IO (and disk space) of a separate table and associated triggers.
I have done some simple tests with recursive queries in postgres. With a few million rows in the table queries were still < 10ms for returning all parents of a particular child. Returning all children was fast too, depending on the level of the parent. It seemed to depend more on disk IO fetching the rows rather than the query speed itself. This was done single user, so not sure how it would perform under load. I suspect it would be very fast still if you can also hold most of the table in memory (and setup postgres correctly). Clustering the table by parent id also seemed to help.
The level-field ("depth") of the closure table is redundant. It takes only one recursive query to compute it. That about sums it up.

Adding enumerated types in Postgres

I'm trying to add a new value to an enumerated type in my Postgres database but the type is already in use by various fields. Postgres won't let me do this as the type is already in use.
Previously I've accomplished this by:
copying all fields using the type
to a temporary varchar field,
dropping all the fields and functions using the
type,
deleting the type,
creating a new type with the extra
enumerated value,
Setting all
the temp fields and functions back to use the
enumerated type.
A big job in any situation, but an impossibly big task if the type is used in dozens of times throughout the DB in tables, views and functions. Surely there must be an easier way just to merely add a new value to an enumerated type?
Many thanks for any help.
#a_horse_with_no_name covered the optimal way to solve this problem. If a complete re-architecture doesn't suit your fancy, you can take advantage of PostgreSQL's support for transactions in both DDL and DML operations.
So you could, in theory, perform all five of your steps in a single transactional operation. Because of MVCC you will be able to safely make this change and have a minimal functional impact to users of your database. You'll probably incur a huge disk overhead (depending on the size of the tables) and substantial database bloat (if the transaction takes a lot time, the vacuum process won't run).
All of that being said, it's perfectly doable.
There will be an easy way in the new 9.1 version:
http://developer.postgresql.org/pgdocs/postgres/sql-altertype.html