SQL Server stats and indexes after a partition swap - sql-server-2008-r2

This question might be a bit too general but I thought I would ask. I'm working with a terabyte scale data warehouse in SQL Server 2008 R2. There is a large fact table with data going back 5 years. I have aggregated a lot of this old data to a different table at a higher level of granularity. The next step is to remove the old data from my fact table.
I've decided that partition swapping is probably the best way to remove the older rows from the fact table and put them in an archive table, but I was wondering what a partition swap will do to stats and indexes on my fact table. Should I consider manually updating statistics after a partition swap (auto update is set to off)? Will my indexes be fragmented and need reorganising or rebuilding?
Thanks for your help!

Partition switching is a metadata operation, so it will not cause fragmentation: no physical data actually moves, only the logical references to it.
You should probably be updating statistics on a large table regularly, but it's not especially needed after a partition switch.
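As a rough sketch of what the switch and the follow-up checks could look like in T-SQL: the table names dbo.FactSales and dbo.FactSales_Archive and the partition number are hypothetical, and the archive table is assumed to be empty, structurally identical to the fact table, and on the same filegroup as the partition being switched out.

```sql
-- Switch the oldest partition out of the fact table into the empty archive table.
-- This changes metadata only; no rows are physically moved.
ALTER TABLE dbo.FactSales
    SWITCH PARTITION 1 TO dbo.FactSales_Archive;

-- With auto update off, refresh statistics as part of regular maintenance
-- (default sampling; not strictly required right after the switch).
UPDATE STATISTICS dbo.FactSales;

-- Optional: confirm that fragmentation of the remaining partitions is unchanged.
SELECT index_id,
       partition_number,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.FactSales'), NULL, NULL, 'LIMITED');
```

Comparing the fragmentation query before and after the switch is a cheap way to verify on your own data that no rebuild is needed.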

Related

How to re-partition table with hash in PostgreSQL?

I'm currently designing a table and want to partition it by account_name.
For now I'm thinking of going with a small number of partitions (e.g. 8) but since I expect a lot of data there is a chance I will need to re-partition it and make more partitions.
What is the best way to do this? If I understand correctly I can't just attach new partitions since I need to change modulus for previously used ones.
Should I copy and re-insert all the data, or is there an easier way?
Repartitioning would mean completely rewriting the table, as in
INSERT INTO new_tab SELECT * FROM old_tab;
which will cause extensive downtime. One way around this is to use logical replication with new_tab on the standby side (possible from v13 on).
But my recommendation is not to do that. Choose a reasonable number of partitions and stick with that.
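For illustration, a hash-partitioned table with a fixed set of 8 partitions might be set up roughly as follows. The table name account_data and its columns are made up for the example; only the partition key account_name comes from the question.

```sql
-- Parent table, hash-partitioned on account_name (PostgreSQL 11 or later).
CREATE TABLE account_data (
    account_name text NOT NULL,
    payload      text
) PARTITION BY HASH (account_name);

-- Create the 8 partitions: one per remainder 0..7 with modulus 8.
DO $$
BEGIN
    FOR r IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE account_data_p%s PARTITION OF account_data
                 FOR VALUES WITH (MODULUS 8, REMAINDER %s)',
            r, r);
    END LOOP;
END
$$;
```

Each row lands in the partition whose remainder matches the hash of account_name modulo 8, which is why changing the partition count later means moving the existing rows.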

Do you need to add an index on a partitioned table (postgres 11)?

My team is looking at moving our non partitioned table with ~1TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. partition for every day), would this act in a similar way to an index?
We would only be doing queries on a maximum resolution of one day.
I am reluctant to add an index, as we've tried this in the past and it never completed (probably because we didn't turn off writes, and turning off writes for an extended period is not really an option).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12, you can probably go up to a few thousand; in earlier versions you will run into performance problems sooner.
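A minimal sketch of the daily range partitioning discussed above, with no index on the partition key. The table name events and its columns are hypothetical.

```sql
-- Parent table, range-partitioned on the timestamp column.
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

-- One partition per day; the upper bound is exclusive.
CREATE TABLE events_2024_01_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
CREATE TABLE events_2024_01_02 PARTITION OF events
    FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');

-- A one-day query: the plan should touch only events_2024_01_01
-- and scan it sequentially, without any index on created_at.
EXPLAIN
SELECT count(*)
FROM events
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-01-02';
```

Running the EXPLAIN against your own data shows whether the planner prunes down to a single partition as expected.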

Archive old data in Postgresql

I'm looking for advice on the process I plan to use for DB archiving.
I have a database (DB-1) with 2 very large tables, one holding 25 GB of data and the other 20 GB. This causes major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1, so that it holds only the last 2 years of records. If I need old data, I can connect to DB-2.
Every month, move old data from DB-1 to DB-2 and delete the moved rows from DB-1.
That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reason, you can also just detach the partition from the table. Then the data is still "lying around", but would not show up in the (partitioned) table. You could query the (detached) partition directly to access that data. You could even move that (detached) partition to a slower hard disk to free up space on your fast disks if you have more than one.
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
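The drop and detach options described above could look roughly like this. The table orders, its columns, and the tablespace slow_disk are hypothetical, and the tablespace would have to be created beforehand.

```sql
-- Parent table with one range partition per year.
CREATE TABLE orders (
    id         bigint NOT NULL,
    order_date date   NOT NULL,
    amount     numeric
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2016 PARTITION OF orders
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE orders_2017 PARTITION OF orders
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Option 1: remove a year of data entirely.
DROP TABLE orders_2016;

-- Option 2: keep the data, but take it out of the partitioned table.
ALTER TABLE orders DETACH PARTITION orders_2017;

-- The detached table can still be queried directly ...
SELECT count(*) FROM orders_2017;

-- ... and moved to slower storage (assumes the tablespace exists).
ALTER TABLE orders_2017 SET TABLESPACE slow_disk;
```

Both the drop and the detach avoid the row-by-row DELETE that the monthly copy-and-delete process would need.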
While you should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now and going to community-based Postgres 11), partitioning is still a much better answer than creating a second database, even if you can't upgrade.
By recreating your table as a set of partitions within the same database, you will be able to add/remove data in a much cleaner fashion, and it will make dealing with Vacuums much easier. Even in 9.5, you can take advantage of table inheritance to build out partitions by first adding partitions for incoming data, and then creating partitions at various intervals (probably monthly, since you want to run monthly cleanup) and moving the data into those partitions. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
I suspect you can probably manage this yourself (you need basic sql and the ability to write simple triggers/functions... here is a link to the 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.
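A minimal sketch of the inheritance-based approach on 9.5, under the assumption of a table called measurements with a recorded_at timestamp (both names are invented for the example). It uses a data-modifying CTE so that the rows are removed from the parent and inserted into the child in a single statement, which is one way to get the atomic move mentioned above.

```sql
-- Existing (parent) table.
CREATE TABLE measurements (
    id          bigserial,
    recorded_at timestamptz NOT NULL,
    value       numeric
);

-- Child table for one month; the CHECK constraint lets the planner
-- skip it when constraint_exclusion applies.
CREATE TABLE measurements_2016_01 (
    CHECK (recorded_at >= '2016-01-01' AND recorded_at < '2016-02-01')
) INHERITS (measurements);

-- Route new rows for that month into the child table.
CREATE FUNCTION measurements_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.recorded_at >= '2016-01-01' AND NEW.recorded_at < '2016-02-01' THEN
        INSERT INTO measurements_2016_01 VALUES (NEW.*);
        RETURN NULL;   -- row was stored in the child; skip the parent
    END IF;
    RETURN NEW;        -- no matching child: keep the row in the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_insert
    BEFORE INSERT ON measurements
    FOR EACH ROW EXECUTE PROCEDURE measurements_insert_router();

-- Move the existing rows for that month out of the parent in one statement.
WITH moved AS (
    DELETE FROM ONLY measurements
    WHERE recorded_at >= '2016-01-01' AND recorded_at < '2016-02-01'
    RETURNING *
)
INSERT INTO measurements_2016_01
SELECT * FROM moved;
```

A real setup would need one child table and one routing branch per interval; the single month here only shows the pattern.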

postgres many tables vs one huge table

I am using a PostgreSQL database.
My application manages many objects of the same type.
For each object, my application performs intense DB writing: each object has a row inserted into the database at least once every 30 seconds. I also need to retrieve the data by object ID.
My question is how best to design the database: use one huge table for all the objects (slower inserts), or one table per object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So your second option, one table per object, doesn't look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these superfluous steps can cause enormous inefficiencies. For instance, if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.
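As a starting point, the single-table design with as few indexes as possible might look like this. The table and column names are placeholders; only the lookup by object ID comes from the question.

```sql
-- One table for all objects; each insert appends one row.
CREATE TABLE object_readings (
    object_id   integer     NOT NULL,
    recorded_at timestamptz NOT NULL DEFAULT now(),
    reading     double precision
);

-- A single index supporting "retrieve the data by object ID",
-- newest rows first. Every extra index adds work to each insert.
CREATE INDEX object_readings_object_time_idx
    ON object_readings (object_id, recorded_at);

-- Typical retrieval for one object.
SELECT recorded_at, reading
FROM object_readings
WHERE object_id = 42
ORDER BY recorded_at DESC
LIMIT 100;
```

Measuring insert throughput with this layout gives the baseline against which a split or a partitioned design can be judged.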

Billions rows in PostgreSql: partition or not to partition?

What I have:
A simple server with one Xeon with 8 logical cores, 16 GB RAM, and an mdadm RAID1 of 2x 7200 rpm drives.
PostgreSQL
A lot of data to work with. Up to 30 million rows are imported per day.
Time: complex queries can run for up to an hour.
Simplified schema of the table that will be very big:
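 Column    | Type    | Modifiers
-----------+---------+----------------------------------------------------
 id        | integer | not null default nextval('table_id_seq'::regclass)
 url_id    | integer | not null
 domain_id | integer | not null
 position  | integer | not null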
The problem with the schema above is that I don't have the exact answer on how to partition it.
Data for all periods is going to be used (NO queries will have date filters).
I thought about partitioning on "domain_id" field, but the problem is that it is hard to predict how many rows each partition will have.
My main question is:
Does it make sense to partition the data if I don't use partition pruning and I am not going to delete old data?
What would be the pros and cons of that?
How will my import speed degrade if I don't partition?
Another question related to normalization:
Should url be exported to another table?
Pros of normalization
Table is going to have rows with average size of 20-30 bytes.
Joins on "url_id" are supposed to be much faster than on "url" field
Pros of denormalization
Data can be imported much, much faster, as I don't have to do a lookup in the "url" table before each insert.
Can anybody give me any advice? Thanks!
Partitioning is most useful if you are going to either have selection criteria in most queries which allow the planner to skip access to most of the partitions most of the time, or if you want to periodically purge all rows that are assigned to a partition, or both. (Dropping a table is a very fast way to delete a large number of rows!) I have heard of people hitting a threshold where partitioning helped keep indexes shallower, and therefore boost performance; but really that gets back to the first point, because you effectively move the first level of the index tree to another place -- it still has to happen.
On the face of it, it doesn't sound like partitioning will help.
Normalization, on the other hand, may improve performance more than you expect; by keeping all those rows narrower, you can get more of them into each page, reducing overall disk access. I would do proper 3rd normal form normalization, and only deviate from that based on evidence that it would help. If you see a performance problem while you still have disk space for a second copy of the data, try creating a denormalized table and seeing how performance is compared to the normalized version.
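One possible way to keep the narrow, normalized rows without a lookup per inserted row is to load each batch into a staging table and resolve the URLs set-wise. This is only a sketch: the table names urls, positions, and import_staging are invented, and ON CONFLICT requires PostgreSQL 9.5 or later.

```sql
-- Lookup table: one row per distinct URL.
CREATE TABLE urls (
    url_id serial PRIMARY KEY,
    url    text NOT NULL UNIQUE
);

-- Narrow fact table, following the simplified schema in the question.
CREATE TABLE positions (
    id         serial  NOT NULL,
    url_id     integer NOT NULL REFERENCES urls (url_id),
    domain_id  integer NOT NULL,
    "position" integer NOT NULL
);

-- Staging table that receives the raw import (for example via COPY).
CREATE TABLE import_staging (
    url        text    NOT NULL,
    domain_id  integer NOT NULL,
    "position" integer NOT NULL
);

-- Step 1: add URLs that are not known yet, in one statement per batch.
INSERT INTO urls (url)
SELECT DISTINCT url FROM import_staging
ON CONFLICT (url) DO NOTHING;

-- Step 2: resolve url -> url_id with a join and load the fact rows.
INSERT INTO positions (url_id, domain_id, "position")
SELECT u.url_id, s.domain_id, s."position"
FROM import_staging s
JOIN urls u ON u.url = s.url;

-- Step 3: empty the staging table for the next batch.
TRUNCATE import_staging;
```

Whether this is fast enough for 30 million rows per day on the described hardware is something only a test import can show.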
I think it makes sense, depending on your use cases. I don't know how far back in time your 30B row history goes, but it makes sense to partition if your transactional database doesn't need more than a few of the partitions you decide on.
For example, partitioning by month makes perfect sense if you only query for two months' worth of data at a time. The other ten months of the year can be moved into a reporting warehouse, keeping the transactional store smaller.
There are restrictions on the fields you can use in the partition. You'll have to be careful with those.
Get a performance baseline, do your partition, and remeasure to check for performance impacts.
With the given amount of data in mind, you'll be waiting on IO mostly. If possible, perform some tests with different HW configurations trying to get best IO figures for your scenarios. IMHO, 2 disks will not be enough after a while, unless there's something else behind the scenes.
Your table will be growing daily at a known rate, and most likely it will be queried daily. As you haven't mentioned data being purged (if it will be, then do partition it), this means that queries will run slower each day. At some point you'll start looking at how to optimize your queries. One of the possibilities is to parallelize queries at the application level. But for that, some conditions should be met:
your table should be partitioned in order to parallelize queries;
HW should be capable of delivering the requested amount of IO in N parallel streams.
All answers should be given by the performance tests of different setups.
And as others mentioned, partitioned tables bring more benefits for the DBA, so I, personally, would go for partitioning any table that is expected to receive more than 5M rows per interval, be it day, week or month.