Adding enumerated types in Postgres - postgresql

I'm trying to add a new value to an enumerated type in my Postgres database but the type is already in use by various fields. Postgres won't let me do this as the type is already in use.
Previously I've accomplished this by:
1. copying all fields using the type to a temporary varchar field,
2. dropping all the fields and functions using the type,
3. deleting the type,
4. creating a new type with the extra enumerated value,
5. setting all the temp fields and functions back to use the enumerated type.
A big job in any situation, but an impossibly big task if the type is used dozens of times throughout the DB in tables, views and functions. Surely there must be an easier way to add a new value to an enumerated type?
Many thanks for any help.

@a_horse_with_no_name covered the optimal way to solve this problem. If a complete re-architecture doesn't suit your fancy, you can take advantage of PostgreSQL's support for transactions in both DDL and DML operations.
So you could, in theory, perform all five of your steps in a single transaction. Because of MVCC you will be able to make this change safely, with minimal functional impact on users of your database. You'll probably incur a large disk overhead (depending on the size of the tables) and substantial database bloat (if the transaction runs for a long time, vacuum cannot clean up dead rows in the meantime).
All of that being said, it's perfectly doable.
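As a rough sketch, the five steps could be wrapped in one transaction like this. The type, table and column names (ticket_state, tickets, state) are made up for illustration, and any views or functions that depend on the column would have to be dropped and recreated inside the same transaction.

```sql
-- Hypothetical starting point, for illustration only.
CREATE TYPE ticket_state AS ENUM ('open', 'closed');
CREATE TABLE tickets (id integer PRIMARY KEY, state ticket_state);
INSERT INTO tickets VALUES (1, 'open'), (2, 'closed');

BEGIN;

-- 1. Copy the enum values into a temporary varchar column.
ALTER TABLE tickets ADD COLUMN state_tmp varchar;
UPDATE tickets SET state_tmp = state::text;

-- 2. Drop the column that uses the type.
ALTER TABLE tickets DROP COLUMN state;

-- 3. Delete the old type.
DROP TYPE ticket_state;

-- 4. Recreate the type with the extra value.
CREATE TYPE ticket_state AS ENUM ('open', 'on_hold', 'closed');

-- 5. Restore the column and copy the values back.
ALTER TABLE tickets ADD COLUMN state ticket_state;
UPDATE tickets SET state = state_tmp::ticket_state;
ALTER TABLE tickets DROP COLUMN state_tmp;

COMMIT;
```

If anything fails before COMMIT, the whole change rolls back. Note that the ALTER TABLE statements hold their locks on the table until the transaction ends, so it is worth rehearsing this on a copy of the data first.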

There will be an easy way in the new 9.1 version:
http://developer.postgresql.org/pgdocs/postgres/sql-altertype.html
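A minimal sketch of what that looks like from 9.1 onwards, using a made-up type (mood) and table (person):

```sql
-- Hypothetical enum type and a table that already uses it.
CREATE TYPE mood AS ENUM ('sad', 'happy');
CREATE TABLE person (name text PRIMARY KEY, current_mood mood);

-- Append a new label in place; existing columns keep using the type.
ALTER TYPE mood ADD VALUE 'ecstatic';

-- Or place the new label at a specific position in the sort order.
ALTER TYPE mood ADD VALUE 'ok' BEFORE 'happy';

-- List the labels in their sort order.
SELECT enumlabel
FROM pg_enum
WHERE enumtypid = 'mood'::regtype
ORDER BY enumsortorder;
```

On versions before PostgreSQL 12, ALTER TYPE ... ADD VALUE cannot be run inside a transaction block, so run it as a standalone statement there.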

Related

postgres many tables vs one huge table

I am using a PostgreSQL database.
My application manages many objects of the same type.
For each object my application does intensive database writing: each object has a row inserted at least once every 30 seconds. I also need to retrieve the data by object id.
My question is how best to design the database. Should I use one huge table for all the objects (slower inserts) or a table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.
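To make the partitioning point concrete, here is a small sketch. It assumes PostgreSQL 10 or later for declarative partitioning (11 or later for the index on the parent table), and the table and column names are invented for the example.

```sql
-- Hypothetical table: one row per object reading, partitioned by month.
CREATE TABLE readings (
    object_id   integer     NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     text
) PARTITION BY RANGE (recorded_at);

CREATE TABLE readings_2024_01 PARTITION OF readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE readings_2024_02 PARTITION OF readings
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- A single index serves retrieval by object id; keep the index count low,
-- since every extra index adds work to each INSERT.
CREATE INDEX ON readings (object_id, recorded_at);

INSERT INTO readings VALUES (42, '2024-02-10 12:00:00+00', 'sample');

-- Retrieval by object id still goes through the one logical table.
SELECT * FROM readings
WHERE object_id = 42
ORDER BY recorded_at DESC
LIMIT 10;

-- Getting rid of a month of old data is a quick drop, not a big DELETE.
DROP TABLE readings_2024_01;
```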

Databases or schemas for an application on Postgres with many tables

I'm in the process of rolling out a new feature on my webapp that will ultimately result in users having the ability to create dynamic tables in the database. Over time I expect that this may result in thousands, or tens of thousands of tables being created.
I understand that Postgres doesn't have explicit limits on the number of tables in a database, but that performance might degrade if that number gets too large. In order to mitigate this I'm thinking of breaking up the underlying storage into either different databases or different schemas. My main question is: is one of those choices better than the other? If so, why? It seems easier to implement with schemas, however I'm not sure if that will actually solve some of the potential longer term performance issues that might come up.
Note that the tables are completely independent, so there are no concerns about needing to join with other tables.
Also, assume I'm handling any validation that might get me into trouble with malicious and/or unexpected users being able to create database tables.
From the Database File Layout of the manual:
Each table and index is stored in a separate file.
So, this is the first point to take into account. You should have a filesystem which does a good job with a large number of files in a single directory, unless you use different tablespaces.
Note that you can have different tablespaces even in the same schema or in the same database, so the use of different schemas could be motivated by other reasons, like having tables with the same name (actually, schemas in PostgreSQL are just a way of partitioning the namespace).
As for databases, I think a single database could work well for you; I assume that each additional database introduces a non-trivial overhead.
Finally: since the system works by using its own catalog, which is a set of relational tables, I suppose you could scale quite well. You may need to add some indexes on the catalog tables, if they are not already present.
One last piece of advice: before investing time and resources in the project, simulate it by programmatically generating a thousand tables, filling them with random data, and exercising them under the load you expect for your system.
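A simulation like that could start from something as small as the following sketch. The schema name (sim), the table layout and the row counts are arbitrary assumptions; adjust them to resemble the tables your users would create.

```sql
-- Keep the generated tables in their own schema so cleanup is one statement.
CREATE SCHEMA sim;

DO $$
BEGIN
    FOR i IN 1..1000 LOOP
        -- Create one table per iteration: sim.t_1, sim.t_2, ...
        EXECUTE format(
            'CREATE TABLE sim.%I (id integer PRIMARY KEY, val double precision)',
            't_' || i);
        -- Fill it with 1000 rows of random data.
        EXECUTE format(
            'INSERT INTO sim.%I (id, val)
             SELECT g, random() FROM generate_series(1, 1000) AS g',
            't_' || i);
    END LOOP;
END
$$;

-- Check how many tables were created.
SELECT count(*) FROM pg_tables WHERE schemaname = 'sim';

-- Clean up afterwards:
-- DROP SCHEMA sim CASCADE;
```

The whole DO block runs as one transaction, so for much larger table counts you may need to raise max_locks_per_transaction or create the tables in batches.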

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it'd be good to have. Ditto arrays (there was work to get foreign keys to arrays to work but I don't think it ever got committed).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accommodate garbage ORMs to the point where you avoid using features that'll greatly improve the data model and solve real world problems. For example, if you have the choice of storing native hstore fields or using an EAV schema, consider just using hstore (or better, in 9.4, json with hstore features).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
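A sketch of that idea, with invented names (local_timestamp, events) and no claim that this is the only reasonable layout:

```sql
-- Composite type pairing a wall-clock timestamp with its original zone name.
CREATE TYPE local_timestamp AS (
    ts     timestamp without time zone,
    tzname text
);

-- Hypothetical table that uses the type as a single logical value.
CREATE TABLE events (
    id          integer PRIMARY KEY,
    occurred_at local_timestamp
);

INSERT INTO events
VALUES (1, ROW('2014-03-09 01:30:00', 'America/New_York')::local_timestamp);

-- Fields are read with (column).field; AT TIME ZONE turns the pair
-- into an absolute instant (timestamp with time zone) when needed.
SELECT (occurred_at).ts,
       (occurred_at).tzname,
       (occurred_at).ts AT TIME ZONE (occurred_at).tzname AS instant
FROM events;
```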
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.
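For instance, the data model can stay normalized while a query builds the array on the fly. The tables below (parent, child) are placeholders for whatever your model contains:

```sql
-- Normalized storage: children live in a regular side table.
CREATE TABLE parent (
    id   integer PRIMARY KEY,
    name text NOT NULL
);
CREATE TABLE child (
    id        integer PRIMARY KEY,
    parent_id integer NOT NULL REFERENCES parent (id),
    label     text NOT NULL
);

INSERT INTO parent VALUES (1, 'first'), (2, 'second');
INSERT INTO child VALUES (10, 1, 'a'), (11, 1, 'b'), (12, 2, 'c');

-- Arrays appear only in the query result, not in the stored rows.
SELECT p.id, p.name, array_agg(c.label ORDER BY c.label) AS labels
FROM parent p
JOIN child c ON c.parent_id = p.id
GROUP BY p.id, p.name
ORDER BY p.id;
```

Adding or changing one child here touches a single small row in child, rather than rewriting a wide parent row that holds an array.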

precomputing user-defined functions in postgresql

I'm trying to precompute a user-defined function on a per-row basis. The idea is that I have a JSON object stored as text in one of the fields, and I want to parse some other 'fields' out of it, which can be returned in queries just like any other true field. However, the overhead of parsing the JSON is significant. Is there any way to precompute this parsing function in a way that speeds up queries?
Please refrain from arguing that there shouldn't be JSON as text on the database in the first place; I am aware of the pros and cons.
First off, you may be interested in the upcoming JSON data type of PostgreSQL 9.2 (to be released soon, now).
As to your question, you are looking for a materialized view (or the simpler form: a redundant precomputed column in your table). "Materialized view" is the established term; as of 9.2 it is not a special object in a PostgreSQL database (a native CREATE MATERIALIZED VIEW arrived in 9.3). Basically you create a redundant table with precomputed values, which you refresh at certain events or on a schedule.
A search for the term will give you some answers.
In addition to a materialized view, keep in mind that PostgreSQL can also index functions' output so you can do something like:
CREATE INDEX my_foo_bar_udf_idx ON foo (bar(baz));
This works only if the UDF is marked as immutable, meaning its output depends only on its arguments. This gives you the option of running your function against the query arguments and then scanning the index instead of the table. It doesn't meet all use cases, but it does meet many of them and it can often save you the headaches of materializing views.
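A fuller sketch of the same idea follows. The function body is only a stand-in for the real JSON parsing (a regular expression that pulls out a "name" value), and the table definition is assumed:

```sql
-- Hypothetical table with JSON stored as text.
CREATE TABLE foo (
    id  integer PRIMARY KEY,
    baz text NOT NULL
);

-- Stand-in for the expensive parser; IMMUTABLE is what allows indexing it.
CREATE FUNCTION bar(doc text) RETURNS text
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT substring(doc from '"name"\s*:\s*"([^"]*)"') $$;

-- The function result is computed once per row, when the row is written.
CREATE INDEX my_foo_bar_udf_idx ON foo (bar(baz));

INSERT INTO foo VALUES (1, '{"name": "alice", "age": 30}');

-- A query using the same expression can be answered from the index.
SELECT id FROM foo WHERE bar(baz) = 'alice';
```

The index only helps queries that filter or sort on exactly the indexed expression; returning bar(baz) in the select list for every row still calls the function, which is where a precomputed column would come in.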

hierarchical data in a database: recursive query vs. closure tables vs. graph database

I'm starting on a new project that has some hierarchical data and I'm looking at all the options for storing that in a database at the moment.
I am using PostgreSQL, which does allow recursive querying. I also looked into design patterns for relational databases, such as closure tables and I had a look at graph database solutions such as neo4j.
I'm finding it difficult to decide between those options. For example: given that my RDBMS allows recursive queries, would it still make sense to use closure tables and how does that compare to graph database solutions in terms of maintainability and performance?
Any opinions/experience would be much appreciated!
The whole closure table is redundant if you can use recursive queries :)
I think it's much better to have a complicated recursive query that you have to figure out once than deal with the extra IO (and disk space) of a separate table and associated triggers.
I have done some simple tests with recursive queries in Postgres. With a few million rows in the table, queries still took < 10ms to return all parents of a particular child. Returning all children was fast too, depending on the level of the parent. It seemed to depend more on disk IO fetching the rows than on the query speed itself. This was done single user, so I'm not sure how it would perform under load. I suspect it would still be very fast if you can also hold most of the table in memory (and set up Postgres correctly). Clustering the table by parent id also seemed to help.
The level-field ("depth") of the closure table is redundant. It takes only one recursive query to compute it. That about sums it up.
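For reference, a recursive query of the kind discussed above might look like this. The table (node) and its sample rows are invented; the pattern is a plain adjacency list with WITH RECURSIVE, and the depth is computed on the fly:

```sql
-- Adjacency list: each row points at its parent.
CREATE TABLE node (
    id        integer PRIMARY KEY,
    parent_id integer REFERENCES node (id),
    name      text NOT NULL
);
CREATE INDEX node_parent_id_idx ON node (parent_id);

INSERT INTO node VALUES
    (1, NULL, 'root'),
    (2, 1, 'a'),
    (3, 2, 'b'),
    (4, 2, 'c');

-- All parents of node 3, with their distance from it.
WITH RECURSIVE ancestors AS (
    SELECT id, parent_id, name, 0 AS depth
    FROM node
    WHERE id = 3
  UNION ALL
    SELECT n.id, n.parent_id, n.name, a.depth + 1
    FROM node n
    JOIN ancestors a ON n.id = a.parent_id
)
SELECT id, name, depth FROM ancestors WHERE depth > 0 ORDER BY depth;

-- All children (descendants) of node 1, with their depth below it.
WITH RECURSIVE descendants AS (
    SELECT id, parent_id, name, 0 AS depth
    FROM node
    WHERE id = 1
  UNION ALL
    SELECT n.id, n.parent_id, n.name, d.depth + 1
    FROM node n
    JOIN descendants d ON n.parent_id = d.id
)
SELECT id, name, depth FROM descendants WHERE depth > 0 ORDER BY depth, id;
```

The index on parent_id is what keeps the descendants query cheap; the ancestors query walks up through the primary key.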