Piecemeal restore of database with filestream filegroups

I have tried taking a backup of a database containing filestream data and it works fine. I have also tried restoring it on another server, and that works too.
Now I am facing a problem. Our database is big (approx. 320 GB) and it takes a long time to back up and restore. Therefore, the client wants us to suggest a technique to reduce the time.
I have tried piecemeal restore, which allows you to back up and restore individual filegroups in the database. It works fine with all the other filegroups except the filestream ones: I am able to take the backup but not able to restore it.
Do you have any idea?
Regards,
Prashant.

This issue was solved by changing the recovery model of the database to Full, then taking the backups and restoring them WITH RECOVERY.
Thanks,
Prashant.

Related

PostgreSQL: even read access changes data files on disk, leading to large incremental backups using pgbackrest

We are using pgbackrest to back up our database to Amazon S3. We do a full backup once a week and an incremental backup every other day.
Size of our database is around 1TB, a full backup is around 600GB and an incremental backup is also around 400GB!
We found out that even read access (pure select statements) on the database has the effect that the underlying data files (in /usr/local/pgsql/data/base/xxxxxx) change. This results in large incremental backups and also in very large storage (costs) on Amazon S3.
Usually the files with low index names (e.g. 391089.1) change on read access.
On an update, we see changes in one or more files - the index could correlate to the age of the row in the table.
Some more facts:
Postgres version 13.1
Database is running in docker container (docker version 20.10.0)
OS is CentOS 7
We see the phenomenon on multiple servers.
Can someone explain why PostgreSQL changes data files on pure read access?
We tested on a pure database without any other resources accessing the database.
This is normal. Some cases I can think of right away are:
a SELECT or other SQL statement setting a hint bit
This is a shortcut for subsequent statements that access the data, so they don't have to consult the commit log any more.
a SELECT ... FOR UPDATE writing a row lock
autovacuum removing dead row versions
These are leftovers from DELETE or UPDATE.
autovacuum freezing old visible row versions
This is necessary to prevent data corruption if the transaction ID counter wraps around.
The only way to fairly reliably prevent PostgreSQL from modifying a table in the future is:
never perform an INSERT, UPDATE or DELETE on it
run VACUUM (FREEZE) on the table and make sure that there are no concurrent transactions
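To see which segment files actually change after a read-only workload, one option is to hash every file in the database directory before and after running the queries. The following is only a sketch: the function names are made up for illustration, and the directory you pass in (for example the /usr/local/pgsql/data/base/xxxxxx path from the question) is whatever your installation uses.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each regular file under `directory` (by relative path) to its digest."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def changed_files(before, after):
    """Return the sorted names of files that were added, removed, or modified."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Take one snapshot, run the SELECT statements, take a second snapshot, and pass both to changed_files. Hint bits are set in shared buffers first and reach disk later, so issuing a CHECKPOINT before each snapshot should make the comparison more meaningful.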

Best way to backup and restore data in PostgreSQL for testing

I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed and deleting any records that have been added. So far this strategy seems to work best for us because each test changes little data and the database is not very big.
Now I'm looking for a way to do essentially the same thing with PostgreSQL. Before doing so, I was wondering whether anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are not particularly interested in re-inventing the wheel by checking for diffs in your own code, you should try creating a new DB (per run) via the template feature of the createdb command (or the CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always the exact same DB by design (no custom logic)
The copy is a file transfer (not a DB restore), so it takes far less time (i.e. it doesn't run SQL again, recreate indexes, restore tables, etc.)
Cons:
Takes 2x the disk space (although template could be on a low performance NFS etc)
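If the test harness is scripted, the per-run template copy could be wrapped roughly as below. This is a sketch, not a finished tool: the function names and the injectable `runner` parameter are hypothetical, it assumes createdb and dropdb are on the PATH with connection settings taken from the environment, and it assumes no sessions are connected to the template database while it is being copied.

```python
import subprocess


def createdb_command(new_db, template_db):
    """Build the createdb invocation that clones `template_db` into `new_db`."""
    return ["createdb", "-T", template_db, new_db]


def dropdb_command(db):
    """Build the dropdb invocation; --if-exists avoids an error on the first run."""
    return ["dropdb", "--if-exists", db]


def recreate_from_template(new_db, template_db, runner=subprocess.run):
    """Drop any previous copy, then clone the template again.

    `runner` is injectable so the sequence can be checked without a server.
    """
    runner(dropdb_command(new_db), check=True)
    runner(createdb_command(new_db, template_db), check=True)
```

Calling recreate_from_template("todayDB", "snapshotDB") would run the two commands in order. Dropping the old copy requires that the application has closed its connections to it.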
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy of the database.
There are 3 types of changes.
For INSERT records - find max(id) from clean table and delete any record on working table that has higher ID
For UPDATE or DELETE records - find all records in clean table EXCEPT records found in working table. Then UPSERT those records into working table.
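The two rules above can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the upsert is written as INSERT OR REPLACE rather than PostgreSQL's INSERT ... ON CONFLICT. The table naming scheme (clean_<name> and work_<name>), the function names, and the assumption that every table has an ever-increasing integer id primary key are all illustrative and not taken from the original setup.

```python
import sqlite3


def reset_table(conn, table, columns):
    """Bring work_<table> back in line with clean_<table>."""
    clean, work = f"clean_{table}", f"work_{table}"
    cols = ", ".join(columns)
    # Rows inserted by the test have ids above the clean table's maximum.
    conn.execute(
        f"DELETE FROM {work} WHERE id > (SELECT COALESCE(MAX(id), 0) FROM {clean})"
    )
    # Rows updated or deleted by the test are clean rows with no identical
    # counterpart in the working table; write them back.
    conn.execute(
        f"INSERT OR REPLACE INTO {work} ({cols}) "
        f"SELECT {cols} FROM {clean} EXCEPT SELECT {cols} FROM {work}"
    )
    conn.commit()


def demo():
    """Simulate one test run that dirties the working table, then reset it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clean_users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE work_users (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO clean_users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO work_users SELECT * FROM clean_users;
        UPDATE work_users SET name = 'BOB' WHERE id = 2;
        DELETE FROM work_users WHERE id = 3;
        INSERT INTO work_users VALUES (4, 'dan');
        """
    )
    reset_table(conn, "users", ["id", "name"])
    return conn.execute("SELECT id, name FROM work_users ORDER BY id").fetchall()
```

Because the database connection stays open throughout, this approach avoids the reconnect problem mentioned in the question.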

Copying a MongoDB record for record

We have a MongoDB sitting at 600GB. We've deleted a lot of documents, and in the hopes of shrinking it, we repaired it onto a 2TB drive.
It ran for hours and eventually ran out of the 2TB of space. When I looked at the repair directory, it had created far more files than the original database had.
Anyway, I'm trying to look for alternative options. My first thought was to create a new MongoDB, and copy each row from the old to the new. Is it possible to do this, and what's the fastest way?
I have had a lot of success copying databases using the db.copyDatabase command:
link to mongodb copyDatabase
I have also used MongoVUE, software that makes it easy to copy databases from one location to another (it is just a graphical interface on top of MongoDB).
If you have no luck with copyDatabase, I suggest you dump the database to external files and restore it from there, using something like mongodump, or take a volume snapshot with lvcreate.
Here is a full read on backup and restore which should allow you to copy the database easily: http://docs.mongodb.org/manual/core/backups/
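A dump-and-restore round trip along these lines could be scripted as shown below. This is a sketch only: the function name, the database names, the dump directory, and the target host are placeholders, and it assumes the mongodump and mongorestore tools are installed and that the dump directory has enough free space.

```python
def dump_and_restore_commands(source_db, target_db, dump_dir, target_host="localhost"):
    """Build the two command lines for copying a database via a dump on disk."""
    # mongodump writes one subdirectory per database under --out.
    dump = ["mongodump", "--db", source_db, "--out", dump_dir]
    # mongorestore loads that subdirectory into the (new) target database.
    restore = [
        "mongorestore",
        "--host", target_host,
        "--db", target_db,
        f"{dump_dir}/{source_db}",
    ]
    return [dump, restore]
```

Each returned list can be passed to subprocess.run(cmd, check=True). Restoring into a fresh database writes the documents out anew, which is the effect the question was hoping to get from the repair.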

Is it possible to run Postgres on a write-protected file system? Or a shared file system?

I'm trying to set up a distributed processing environment,
with all of the data sitting in a single shared network drive.
I'm not going to write anything to it, and just be reading from it,
so we're considering write-protecting the network drive as well.
I remember when I was working with MSSQL,
I could back up databases to a DVD and load it directly as a read-only database.
If I can do something like that in Postgres,
I should be able to give it an abstraction like a read-only DVD,
and all will be good.
Is something like this possible in Postgres,
if not, any alternatives? (MySQL? sqlite even?)
Or if that's not possible is there some way to specify a shared file system?
(Make it know that other processes are reading from it as well?)
For various reasons, using a parallel dbms is not possible,
and I need two DB processes running parallel...
Any help is greatly appreciated.
Thanks!!
Write-protecting the data directory will cause PostgreSQL to fail to start, as it needs to be able to write postmaster.pid. PostgreSQL also needs to be able to write temporary files and tablespaces, set hint bits, manage the visibility map, and more.
In theory it might be possible to modify the PostgreSQL server to support running on a read-only database, but right now AFAIK this is not supported. Don't expect it to work. You'll need to clone the data directory for each instance.
If you want to run multiple PostgreSQL instances for performance reasons, having them fighting over shared storage would be counter-productive anyway. If the DB is small enough to fit in RAM it'd be OK ... but in that case it's also easy to just clone it to each machine. If the DB isn't big enough to be cached in RAM then both DB instances would be I/O bottlenecked and unlikely to perform any better than (probably slightly worse than) a single DB not subject to storage contention.
There's some chance that you could get it to work by:
Moving the constant data into a new tablespace onto read-only shared storage
Taking a basebackup of the database, minus the newly separated tablespace for shared data
Copying the basebackup of the DB to read/write private storage on each host that'll run a DB
Mounting the shared storage and linking the tablespace in place where Pg expects it
Starting pg
... at least if you force hint-bit setting and VACUUM FREEZE everything in the shared tablespace first. It isn't supported, it isn't tested, it probably won't work, there's no benefit over running private instances, and I sure as hell wouldn't do it, but if you really insist you could try it. Crashes, wrong query results, and other bizarre behaviour are not unlikely.
I've never tried it, but it may be possible to run postgres with a data dir which is mostly on a RO file system if all your use is indeed read-only. You will need to be sure to disable autovacuum. I think even read activity may generate xlog mutation, so you will probably have to symlink the pg_xlog directory onto a writeable file system. Sometimes read queries will spill to disk for large sorts or other temp requirements, so you should also link base/pgsql_tmp to a writeable disk area.
As Richard points out there are visibility hint bits in the data heap. May want to try VACUUM FULL FREEZE ANALYZE on the db before putting it on the RO file system.
"Is something like this possible in Postgres, if not, any alternatives? (MySQL? sqlite even?)"
I'm trying to figure out if I can do this with postgres as well, to port over a system from sqlite. I can confirm that this works just fine with sqlite3 database files on a read-only NFS share. Sqlite does work nicely for this purpose.
When done with sqlite, we cut over to a new directory with new sqlite files whenever there are updates. We don't ever insert into the in-use database. I'm not sure if inserts would pose any problems (with either database). Caching read-only data at the OS level could be an issue if another database instance mounted the dir read-write. This is something I would ideally like to be able to do.

Postgresql PITR backup: best practices to handle multiple databases?

Hi guys, I have a PostgreSQL 8.3 server with many databases.
Currently, I'm planning to back up those databases with a script that stores each backup in a folder named after the database, for example:
/mypath/backup/my_database1/
/mypath/backup/my_database2/
/mypath/backup/foo_database/
I make one dump every 2 hours, overwriting the files each day. For example, in the my_database1 folder I have:
my_database1.backup-00.sql //backup made everyday at the 00.00 AM
my_database1.backup-02.sql //backup made everyday at the 02.00 AM
my_database1.backup-04.sql //backup made everyday at the 04.00 AM
my_database1.backup-06.sql //backup made everyday at the 06.00 AM
my_database1.backup-08.sql //backup made everyday at the 08.00 AM
my_database1.backup-10.sql //backup made everyday at the 10.00 AM
[...and so on...]
This is how I currently make sure I can restore every database while losing at most 2 hours of data.
2 hours still looks like too much.
I've had a look at PostgreSQL PITR through the WAL files, but those files seem to contain the data for all my databases.
I'll need to separate those files, in the same way I separate the dump files.
How can I do that?
Otherwise, is there another easy-to-install backup procedure that would allow me to restore just one database to its state 10 seconds earlier, without creating a dump file every 10 seconds?
It is not possible with one instance of PostgreSQL.
You can divide your 500 tables between several instances, each listening on different port, but it would mean that they will not use resources like memory effectively (memory reserved but unused in one instance can not be used by another).
Slony will also not work here, as it does not replicate DDL statements, like dropping a table.
I'd recommend doing both:
continue to do your pg_dump backups, but try to smooth them out - throttle pg_dump I/O bandwidth so it will not cripple the server, and run it continuously - when it finishes with the last database, immediately start again with the first one;
additionally setup PITR.
This way you can restore a single database fast, but you can lose some data. If you decide that you cannot afford to lose that much data, then you can restore your PITR backup to a temporary location (with fsync=off and pg_xlog symlinked to a ramdisk for speed), pg_dump the affected database from there and restore it to your main database.
Why do you want to separate the databases?
The way the PITR works, it is not possible to do since it works on the complete cluster.
What you can do in that case is to create a data directory and a separate cluster for each of those databases (not recommended though since it will require different ports, and postmaster instances).
I believe that the benefits of using PITR instead of regular dumps outweigh having separate backups for each database, so perhaps you can re-think the reasons for why you need to separate it.
Another way could be to set up some replication with Slony-I but that would require a separate machine (or instance) that receives the data. On the other hand, that way you would have a replicated system in near real-time.
Update for comment:
To recover from mistakes, like deleting a table, PITR would be perfect since you can replay to a specific time. However, for 500 databases I understand that can be a lot of overhead. Slony-I would probably not work, since it is replicating. Not sure how it handles table deletions.
I am not aware of any other ways you can go. What I would do would probably still be to go for PITR and just not make any mistakes ;). Jokes aside, depending on how frequently mistakes are made, this could be a solution:
Set it up for PITR
have a second instance ready on standby.
When a mistake happens, replay the restore to the point in time on the second instance.
Do a pg_dump of the affected database from that instance.
Do a pg_restore on the production instance for that database.
However, it would require you to have a second instance ready, either on the same server or a different one (different is recommended). Also, the restore time would be a bit longer since it would require you to do one extra dump and restore.
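The last two steps of that procedure (dump the affected database from the recovered instance, then restore it into production) could be scripted roughly as follows. This is a sketch under assumptions: the function name, ports, database name, and dump file path are placeholders, and it presumes the second instance has already been replayed to the desired point in time.

```python
def single_db_recovery_commands(db, standby_port, production_port, dump_file):
    """Build the pg_dump and pg_restore command lines for one database."""
    # Custom-format dump (-Fc) taken from the recovered standby instance.
    dump = ["pg_dump", "-p", str(standby_port), "-Fc", "-f", dump_file, db]
    # --clean drops existing objects in the target database before recreating them.
    restore = [
        "pg_restore",
        "-p", str(production_port),
        "-d", db,
        "--clean",
        dump_file,
    ]
    return [dump, restore]
```

For example, single_db_recovery_commands("my_database1", 5433, 5432, "/tmp/my_database1.dump") returns the two command lists, which can be run in order with subprocess.run(cmd, check=True).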
I think the way you are doing this is flawed. You should have one database with multiple schemas and roles. Then you can use PITR. However PITR is not a replacement for dumps.