Update timezone columns in a Postgres DB

I have started creating a product database using timestamp without timezone. Then, realizing my error, I started using timestamp with timezone. Now I'd like to unify this to the latter.
Question: Is it possible in an existing Postgres 8.4 DB already containing data to convert all the columns of type timestamp without TZ to ones with TZ?
The best solution would be a script that would do this in one execution (of course). Even a script that fixed a single column at a time would be great. The problem is that a naïve ALTER of the column fails on some existing VIEWs that use it in their output (though I fail to see why that is a problem here - it's just widening the output type a bit).

You want ALTER TABLE ... ALTER COLUMN ... TYPE ... USING (...), which does what you would expect. You will need to decide what time zone these times are in and supply a suitable AT TIME ZONE expression for your USING clause.
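For example, assuming a hypothetical table orders with a created_at column whose existing values actually represent UTC, the conversion could look like this:
-- Sketch only: the table, column and time zone are assumptions, not from the question.
-- AT TIME ZONE tells Postgres how to interpret the existing naive values.
ALTER TABLE orders
  ALTER COLUMN created_at TYPE timestamp with time zone
  USING created_at AT TIME ZONE 'UTC';
Any views that reference the column still need to be dropped beforehand and recreated afterwards, as discussed below.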
These ALTERs will rewrite each table, so allow time for that. You may want to CLUSTER the tables afterwards.
However, you seem to think that the two types are interchangeable. They are not, which is why you need to drop and rebuild your views. You will also want to adjust any applications appropriately.
If you can't see why they are different, make a good hot cup of tea or coffee, sit down with the date/time sections of the manual and spend an hour or so reading them thoroughly. Perhaps some of the Q&As here too. This is not necessarily a minor change. I'd be especially wary of any daylight-saving / summer-time shifts in whatever time zone(s) you decide apply.

Related

PostgreSQL support for timestamps to nanosecond resolution

The data I'm receiving has timestamps down to nanoseconds (which we actually care about). Is there a way for Postgres timestamps to go to nanoseconds?
As others have pointed out, Postgres doesn't provide such a type out of the box. However, because Postgres is open source, it's relatively simple to create an extension that supports nanosecond resolution. I faced similar issues a while ago and created this timestamp9 extension for Postgres.
It internally stores the timestamp as a bigint and defines it as the number of nanoseconds since the UNIX epoch. It provides some convenience functions around it that make it easy to view and manipulate the timestamps. If you can live with the limited time range that these timestamps can have, between the year 1970 and the year 2262, then this is a good solution.
Disclaimer: I'm the author of the extension
Nope, but you could trim timestamps to milliseconds and store the nanosecond part in a separate column.
You can create an index on both, add a view or function that returns the nanosecond timestamp you want, and you can even create an index on that function.
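A minimal sketch of that split-column idea (table, column and function names are all invented) might be:
-- Sketch only: millisecond-precision timestamp plus the leftover sub-millisecond
-- nanoseconds in a separate integer column.
CREATE TABLE readings (
  ts       timestamptz(3) NOT NULL,  -- trimmed to milliseconds
  ts_nanos integer        NOT NULL   -- remaining 0..999999 nanoseconds
);
CREATE INDEX readings_ts_idx ON readings (ts, ts_nanos);

-- Reassemble a sortable nanoseconds-since-epoch value...
CREATE FUNCTION epoch_ns(ts timestamptz, nanos integer) RETURNS numeric
  LANGUAGE sql IMMUTABLE
  AS $$ SELECT extract(epoch FROM ts)::numeric * 1000000000 + nanos $$;

-- ...and index that expression as well.
CREATE INDEX readings_ns_idx ON readings (epoch_ns(ts, ts_nanos));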
On the one hand, the documented resolution is 1 microsecond.
On the other hand, PostgreSQL is open source. Maybe you can hack together something to support nanoseconds.

Postgres: is it possible to lock some rows for changing?

I have pretty old tables, which hold records for clients' payments and commissions for several years. In the regular business cycle it's sometimes necessary to recalculate commissions and update the table, but the recalculation period is usually only 1 or 2 months back, not more.
Recently, as a result of a bug in a PHP script, our developer recalculated commissions since the very beginning 0_0. The recalculation process is really complicated, so it can't be undone by just grabbing yesterday's backup - the data changes in numerous databases, so restoring it is a really complicated and awfully expensive procedure. And complaints from clients, and changes in accounting... you know... Horror.
We can't split the tables by periods. (Well, we can, but it would take a year to rework all the data selects.)
What I'm thinking about is setting up an update trigger that would check the date of the record being changed against an allowed date, which should be earlier than the update date. Then, in case of a mistake or bug, anyone trying to update such a 'restricted' row would get an exception and the data would stay unchanged.
Is that a good approach? How can it be done - I mean the trigger?
For Postgres you can use a check constraint to ensure the allowed_date is always less than the update_date:
ALTER TABLE mytable ADD CONSTRAINT datecheck CHECK (allowed_date < update_date);
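For the trigger approach described in the question, a minimal sketch could look like the following (the table name, the payment_date column and the two-month window are all assumptions):
-- Sketch only: names and the cut-off rule are hypothetical.
CREATE OR REPLACE FUNCTION forbid_old_commission_updates() RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
  -- Rows older than the allowed recalculation window are closed for changes.
  IF OLD.payment_date < date_trunc('month', now()) - interval '2 months' THEN
    RAISE EXCEPTION 'Row dated % is closed for changes', OLD.payment_date;
  END IF;
  RETURN NEW;
END;
$$;

CREATE TRIGGER commissions_protect
  BEFORE UPDATE ON commissions
  FOR EACH ROW EXECUTE PROCEDURE forbid_old_commission_updates();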

Postgres partitioning?

My software runs a cronjob every 30 minutes, which pulls data from Google Analytics / Social networks and inserts the results into a Postgres DB.
The data looks like this:
url text NOT NULL,
rangeStart timestamp NOT NULL,
rangeEnd timestamp NOT NULL,
createdAt timestamp DEFAULT now() NOT NULL,
...
(various integer columns)
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table. At this rate, the cronjob will generate about 480 000 records a day and about 14.5 million a month.
I think the solution would be using several tables, for example I could use a specific table to store data generated in a given month: stats_2015_09, stats_2015_10, stats_2015_11 etc.
I know Postgres supports table partitioning. However, I'm new to this concept, so I'm not sure what's the best way to do this. Do I need partitioning in this case, or should I just create these tables manually? Or maybe there is a better solution?
The data will be queried later in various ways, and those queries are expected to run fast.
EDIT:
If I end up with 12-14 tables, each storing 10-20 million rows, Postgres should still be able to run select statements quickly, right? Inserts don't have to be super fast.
Partitioning is a good idea under various circumstances. Two that come to mind are:
Your queries have a WHERE clause that can be readily mapped onto one or a handful of partitions.
You want a speedy way to delete historical data (dropping a partition is faster than deleting records).
Without knowledge of the types of queries that you want to run, it is difficult to say if partitioning is a good idea.
I think I can say that splitting the data into different tables is a bad idea because it is a maintenance nightmare:
You can't have foreign key references into the table.
Queries spanning multiple tables are cumbersome, so simple questions are hard to answer.
Maintaining tables becomes a nightmare (adding/removing a column).
Permissions have to be carefully maintained, if you have users with different roles.
In any case, the place to start is with Postgres's documentation on partitioning, which is here. I should note that Postgres's implementation is a bit more awkward than in other databases, so you might want to review the documentation for MySQL or SQL Server to get an idea of what it is doing.
Firstly, I would like to challenge the premise of your question:
Since one query returns 10 000+ items, it's obviously not a good idea to store this data in a single table.
As far as I know, there is no fundamental reason why the database would not cope fine with a single table of many millions of rows. At the extreme, if you created a table with no indexes, and simply appended rows to it, Postgres could simply carry on writing these rows to disk until you ran out of storage space. (There may be other limits internally, I'm not sure; but if so, they're big.)
The problems only come when you try to do something with that data, and the exact problems - and therefore exact solutions - depend on what you do.
If you want to regularly delete all rows which were inserted more than a fixed timescale ago, you could partition the data on the createdAt column. The DELETE would then become a very efficient DROP TABLE, and all INSERTs would be routed through a trigger to the "current" partition (or could even bypass it if your import script was aware of the partition naming scheme). SELECTs, however, would probably not be able to specify a range of createdAt values in their WHERE clause, and would thus need to query all partitions and combine the results. The more partitions you keep around at a time, the less efficient this would be.
Alternatively, you might examine the workload on the table and see that all queries either already do, or easily can, explicitly state a rangeStart value. In that case, you could partition on rangeStart, and the query planner would be able to eliminate all but one or a few partitions when planning each SELECT query. INSERTs would need to be routed through a trigger to the appropriate table, and maintenance operations (such as deleting old data that is no longer needed) would be much less efficient.
Or perhaps you know that once rangeEnd becomes "too old" you will no longer need the data, and can get both benefits: partition by rangeEnd, ensure all your SELECT queries explicitly mention rangeEnd, and drop partitions containing data you are no longer interested in.
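Whichever column you pick, a rough sketch of the inheritance-based approach, here partitioning by rangeEnd with a single illustrative monthly child, looks something like this (names and dates are only examples):
-- Sketch only: parent table, one monthly child, and an INSERT-routing trigger.
CREATE TABLE stats (
  url        text      NOT NULL,
  rangeStart timestamp NOT NULL,
  rangeEnd   timestamp NOT NULL,
  createdAt  timestamp DEFAULT now() NOT NULL
  -- ... the various integer columns ...
);

-- The CHECK constraint lets the planner skip irrelevant children
-- (constraint_exclusion = partition, the default since 8.4).
CREATE TABLE stats_2015_09 (
  CHECK (rangeEnd >= '2015-09-01' AND rangeEnd < '2015-10-01')
) INHERITS (stats);

CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
  IF NEW.rangeEnd >= '2015-09-01' AND NEW.rangeEnd < '2015-10-01' THEN
    INSERT INTO stats_2015_09 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'No partition for rangeEnd %', NEW.rangeEnd;
  END IF;
  RETURN NULL;  -- keep the row out of the parent table
END;
$$;

CREATE TRIGGER stats_insert
  BEFORE INSERT ON stats
  FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();

-- Removing an old month is then just: DROP TABLE stats_2015_09;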
To borrow Linus Torvalds's terminology from git, the "plumbing" for partitioning is built into Postgres in the form of table inheritance, as documented here, but there is little in the way of "porcelain" other than examples in the manual. However, there is a very good extension called pg_partman which provides functions for managing partition sets based on either IDs or date ranges; it's well worth reading through the documentation to understand the different modes of operation. In my case, none quite matched, but forking that extension was significantly easier than writing everything from scratch.
Remember that partitioning does not come free, and if there is no obvious candidate for a column to partition by based on the kind of considerations above, you may actually be better off leaving the data in one table, and considering other optimisation strategies. For instance, partial indexes (CREATE INDEX ... WHERE) might be able to handle the most commonly queried subset of rows; perhaps combined with "covering indexes", where Postgres can return the query results directly from the index without reference to the main table structure ("index-only scans").
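For instance, a partial index along these lines could serve the hot subset of rows (the columns and date literal are made up, and because the predicate must be a constant the index would need to be recreated periodically):
-- Sketch only: covers recently-created rows that most queries are assumed to touch.
CREATE INDEX stats_recent_idx
  ON stats (url, rangeStart)
  WHERE createdAt >= DATE '2015-09-01';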

Can Postgres tables be optimized by the column order?

I recently had to propose a set of new Postgres tables to our DB team that will be used by an application I am writing. They failed the design because my table had fields that were listed like so:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id, FOREIGN KEY INT
some_text : CHARACTER VARYING(100)
some_flag : BOOLEAN
They said that the table would not be optimal because some_text appears before some_flag, and since CHARACTER VARYING fields search slower than BOOLEANs, when doing a table scan, it is faster to have a table structure whose columns are sequenced from greatest precision to least precision; so, like this:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id, FOREIGN KEY INT
some_flag : BOOLEAN
some_text : CHARACTER VARYING(100)
These DBAs come from a Sybase background and have only recently switched over as our Postgres DBAs. I am thinking that this is perhaps a Sybase optimization that doesn't apply to Postgres (I would think Postgres is smart enough to somehow not care about column sequence).
Either way I can't find any Postgres documentation that confirms or denies. Looking for any battle-worn Postgres DBAs to weigh-in as to whether this is a valid or bogus (or conditionally-valid!) claim.
Speaking from my experience with Oracle on similar issues, where there was a big change in behaviour between versions 9 and 10 (or 8 and 9) if memory serves (due to CPU overhead in finding column data within a row), I don't believe you should rely on documented behaviour for an issue like this when a practical experiment would be fairly straightforward and conclusive.
So I'd suggest that you create a test case for this. Create two tables with exactly the same data and the columns in a different order, and run repeated and varied tests. Try to implement the entire test as a single script that can be run on a development or test system and which tells you the answer. Maybe the DBAs are right, and you can say, "Hey, confirmed your thoughts on this, thanks a lot", or you might find no measurable and significant difference. In the latter case you can hand the entire test to the DBAs, and explain how you can't reproduce the problem. Let them run the tests.
Either way, someone is going to learn something, and you've got a test case you can apply to future (or past) versions.
Lastly, post here on what you found ;)
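A skeleton for such a test (entirely made-up tables and data, just to show the shape of it) might be:
-- Sketch only: identical data, different column order, compare scan timings.
CREATE TABLE t_text_first (
  id        serial PRIMARY KEY,
  other_id  int,
  some_text varchar(100),
  some_flag boolean
);
CREATE TABLE t_flag_first (
  id        serial PRIMARY KEY,
  other_id  int,
  some_flag boolean,
  some_text varchar(100)
);

INSERT INTO t_text_first (other_id, some_text, some_flag)
  SELECT i, md5(i::text), i % 2 = 0 FROM generate_series(1, 1000000) AS i;
INSERT INTO t_flag_first (other_id, some_flag, some_text)
  SELECT i, i % 2 = 0, md5(i::text) FROM generate_series(1, 1000000) AS i;

VACUUM ANALYZE t_text_first;
VACUUM ANALYZE t_flag_first;

-- Run each of these several times and compare the timings.
EXPLAIN ANALYZE SELECT count(*) FROM t_text_first WHERE some_flag;
EXPLAIN ANALYZE SELECT count(*) FROM t_flag_first WHERE some_flag;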
What your DBAs are probably referring to is the access strategy for "getting to" the boolean value in a given tuple (row).
In THEIR proposed design, a system can "get to" that value by looking at byte 9.
In YOUR proposed design, the system must first inspect the LENGTH field of all varying-length columns [that come before your boolean column], before it can know the byte offset where the boolean value can be found. That is ALWAYS going to be slower than "their" way.
Their consideration is one of PHYSICAL design (and it is a correct one). Damir's answer is also correct, but it is an answer from the perspective of LOGICAL design.
If the remark by your DBAs is really intended as "criticism of a 'bad' design", then it deserves to be pointed out to them that LOGICAL design is your job (and column order doesn't matter at that level), and PHYSICAL design is their job. And if they expect you to do the PHYSICAL design (their job) as well, then there is no longer any reason for the boss to keep them employed.
From a database design point of view, there is no difference between your design and what your DBAs suggest -- your application should not care. In relational databases there is (logically) no such thing as an order of columns; in fact, if the order of columns matters logically, the design fails 1NF.
So, simply pass all create table scripts to your DBAs and let them implement (reorder columns) in whatever way they feel is optimal at the physical level. You simply continue with the application.
Database design can not fail on order of columns -- it is simply not part of the design process.
Future users of large data banks must be protected from having to know how the data is organized in the machine ... the problems treated here are those of data independence -- the independence of application programs and terminal activities from growth in data types and changes ...
E.F. Codd ~ 1979
Changes to the physical level ... must not require a change to an application ...
Rule 8: Physical data independence (E.F. Codd ~ 1985)
So here we are -- 33 years later ...

Cassandra doesn't update (after some time?)

Very weird problem: some RowKeys look like they are getting 'locked' after some time.
At first they are created fine and I can update them for a while. Then after some time updates stop working, but I can still update freshly created keys fine.
Anyone have an idea? Is phpcassa screwing with me, or Cassandra?
Have you checked the timestamps you are using when writing to Cassandra?
The client specifies a timestamp for each column you write to Cassandra. Some portion of your code could have a bug where it is setting the timestamps incorrectly, leading to updates getting dropped.
In general, it's also worth making sure different clients are using the same timestamp granularity. The standard is microseconds-since-epoch, so if you were to use something that uses milliseconds-since-epoch, it wouldn't be able to overwrite data created with the larger timestamp numbers. In this case, both phpcassa and the Cassandra CLI conform to the standard, so unless you are using a third tool that you didn't mention, that should be fine.