postgresql replication + scrubbing

Is there an easy way (built-in, add-on, open-source or commercial) to do master-slave replication on PostgreSQL such that the data inside the slave is scrubbed for PCI compliance while being replicated across? How about ETL tools? It does not have to be instantaneous ... up to an hour of lag is acceptable, but the faster the better, of course.
If this doesn't work, how about using triggers on the slave database to achieve this?

Perhaps you should try creating a view of the tables you wish to scrub (performing your scrubbing in the SELECT), and then replicate the view to your offsite location.
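For example (a minimal sketch; the table and column names are made up for illustration), the scrubbing could happen in the view definition:

-- Expose only scrubbed data through a view.
CREATE VIEW customers_scrubbed AS
SELECT id,
       name,
       -- mask the card number down to the last four digits
       'XXXX-XXXX-XXXX-' || substr(card_number, length(card_number) - 3) AS card_number
FROM customers;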

I believe triggers on the slave would put you at risk of non-compliance, since data could leak out. If you want a packaged solution, I'd probably look at Bucardo, looking specifically into its custom replication hooks on the slave, to filter out (or modify) the columns you don't need/want. If that won't work, the idea of using views is probably your next best bet.

Yes. Use slony, add triggers to the master to materialize what you want to replicate and replicate only those materialized views. If you scrub on the master, that should do what you want. Since Slony will happily replicate only part of your database, that should work fine (on the other hand, remember, Slony will happily replicate only part of your database).
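A minimal sketch of the scrub-on-master idea, assuming made-up table and column names (only the INSERT path is shown; the scrubbed table is the one you would add to the Slony replication set):

-- Trigger-maintained, scrubbed copy on the master; replicate this table only.
CREATE TABLE customers_scrubbed (id integer PRIMARY KEY, name text, card_number text);

CREATE OR REPLACE FUNCTION scrub_customer() RETURNS trigger AS $$
BEGIN
    INSERT INTO customers_scrubbed (id, name, card_number)
    VALUES (NEW.id, NEW.name,
            'XXXX-XXXX-XXXX-' || substr(NEW.card_number, length(NEW.card_number) - 3));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_scrub
AFTER INSERT ON customers
FOR EACH ROW EXECUTE PROCEDURE scrub_customer();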

Related

Can A Postgres Replication Publication And Subscription Exist On The Same Server

I have a request asking for a read-only schema replica for a role in PostgreSQL. After reading the documentation and getting a better understanding of replication in PostgreSQL, I'm trying to identify whether or not I can create the publisher and subscriber within the same database.
Any thoughts on the best approach without having a second server would be greatly appreciated.
You asked two different questions. Same database? No. Since Pub/Sub requires tables to have the same name (including schema) on both ends, you would be trying to replicate a table onto itself. Using logical replication plugins other than the built-in one might get around this restriction.
Same server? Yes. You can replicate between two databases of the same instance (but see the note in the docs about some extra hoops you need to jump through) or between two instances on the same host. So whichever of those things you meant by "same server", yes, you can.
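A rough sketch of the same-instance case, assuming a source database src_db and a target database dst_db on the same cluster (all names are placeholders; the manually created slot is the extra hoop the docs mention, since CREATE SUBSCRIPTION would otherwise hang waiting on its own connection):

-- In src_db:
CREATE PUBLICATION my_pub FOR TABLE my_schema.my_table;
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');

-- In dst_db, which must already contain an empty my_schema.my_table:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=localhost dbname=src_db user=repl_user'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_slot');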
But it seems like an odd way to do this. If the access is read only, why does it matter whether it is to a replica of the real data or to the real data itself?

What is the best way to sync Postgres and ElasticSearch?

I have two options for syncing ES with the latest changes in my Postgres DB:
1- Postgres Listen/Notify:
I should create a trigger -> use pg_notify -> and create a listener in a separate service (a rough sketch is below).
2- Async queries to ES:
I can update Elasticsearch asynchronously after a change on the DB, i.e.:
model.save().then(() => { model.saveES(); }).catch(err => console.error(err));
Which one will scale best?
PS: We tried ZomboDB in production but it didn't go well; it slowed production down.
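For context, option 1 would look roughly like this on the Postgres side (table and channel names are placeholders; the listener runs as a separate service):

-- Notify a listener with the changed row as JSON (payloads are limited to ~8 kB).
CREATE OR REPLACE FUNCTION notify_es() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('es_sync', row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_notify_es
AFTER INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE PROCEDURE notify_es();

-- The separate service issues LISTEN es_sync; and pushes each payload to Elasticsearch.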
Since you are asking about the ways to do this, I assume you want to know the possibilities in order to choose the better architecture. I would point you to the advice given by Confluent here: https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
I recommend you consider https://github.com/debezium/debezium. It has Postgresql support and implements the change capture model proposed in other posts instead of the dual write model.
debezium benefits:
low latency change streaming
stores changes in a replicated log for durability
emits only write events (creates, updates, deletes) which can be consumed and piped into other systems.
UPD. Here is a simple github repository, which shows how it works
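On the PostgreSQL side, a logical-decoding change-capture tool like Debezium mainly needs wal_level set to logical and a role with replication rights; a hedged sketch (the role name and password are placeholders, and the connector configuration itself is not shown):

-- Requires a server restart to take effect.
ALTER SYSTEM SET wal_level = 'logical';

-- Role the connector can use for logical decoding.
CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'secret';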

Postgresql: master-slave replication of 1 table

Help me to choose a simple (lightweight) solution for master-slave replication of one table between two PostgreSQL databases. The table contains a large object.
Here you'll find a very good overview of the replication tools for PostgreSQL. Please, take a look and hopefully you'll be able to pick one.
Otherwise, if you need something really lightweight, you can do it yourself. You'll need a trigger and a couple of functions, plus the dblink module if you need almost immediate propagation of changes; otherwise you can survive with cron. A rough sketch is below.
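For instance, a very rough sketch of the trigger + dblink route (the connection string, table, and column names are placeholders, only the INSERT case is handled, and a large object column would need extra care):

-- Needs the dblink extension/contrib module installed.
CREATE OR REPLACE FUNCTION push_row_to_slave() RETURNS trigger AS $$
BEGIN
    PERFORM dblink_exec(
        'host=slavehost dbname=mydb user=repl password=secret',
        format('INSERT INTO my_table (id, payload) VALUES (%L, %L)', NEW.id, NEW.payload)
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_push
AFTER INSERT ON my_table
FOR EACH ROW EXECUTE PROCEDURE push_row_to_slave();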

Postgresql replication: londiste vs. slony

Has anyone had much experience using Londiste? It is an alternative to Slony for Postgres replication. I have been beating my head against the wall trying to get Slony to work the way I need it and was looking for an easier way.
Londiste seems like a good alternative, but I wanted to see if anyone has any pros/cons before I commit to a switch.
I have used both and for my requirements Londiste is a good option.
We have a simple setup where a subset of tables is replicated from a staging server to live, via large batch updates and inserts plus smaller intraday updates, running on Postgres 8.4, CentOS 5.5, and Skytools 2; we also use it as the queue component for event-based actions. Previously I used Slony from the 1.* series, so I can't comment on more recent versions.
Some Pros for Londiste
Simple to set up
Generally simple to administer
Haven't had any issues with robustness of replication in 8 months of production use
Also can be used as a generic queuing system outside of replication, and it is quite simple to write your own consumer
Some Cons
Documentation is pretty scant
You need to be careful when implementing ddl changes
It won't stop you from making changes in the slave
Can't be used for cascading replication or failover/switchover use case
I will limit my comment on Slony to my experience: it was complex to set up and administer, and the version I used did not compare favourably with Londiste on tolerance to network issues, but it could have been used for cascading replication and switchover use cases.
As mentioned before, Londiste is simpler to use, indeed. And as of version 3, released in March 2012, Londiste supports cascading replication and failover/switchover, as well as a bunch of other new cool features.

How to prevent Write Ahead Logging on just one table in PostgreSQL?

I am considering log-shipping of Write Ahead Logs (WAL) in PostgreSQL to create a warm-standby database. However, I have one table in the database that receives a huge number of INSERTs/DELETEs each day, but whose data I don't care about protecting. To reduce the amount of WAL produced, I was wondering: is there a way to prevent any activity on one table from being recorded in the WAL?
Ran across this old question, which now has a better answer. Postgres 9.1 introduced "Unlogged Tables", which are tables that don't log their DML changes to WAL. See the docs for more info, but at least now there is a solution for this problem.
See Waiting for 9.1 - UNLOGGED tables by depesz, and the 9.1 docs.
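In 9.1 and later it is just a flag on the table definition, for example:

-- Changes to an unlogged table are not written to WAL, so it is faster
-- but not crash-safe, and it will not be present on a WAL-shipped standby.
CREATE UNLOGGED TABLE session_stats (
    session_id bigint,
    hits       integer
);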
Unfortunately, I don't believe there is. The WAL logging operates on the page level, which is much lower than the table level and doesn't even know which page holds data from which table. In fact, the WAL files don't even know which pages belong to which database.
You might consider moving your high activity table to a completely different instance of PostgreSQL. This seems drastic, but I can't think of another way off the top of my head to avoid having that activity show up in your WAL files.
To offer one option to my own question: there are temp tables - "temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below)" - which I think don't generate WAL. Even so, this might not be ideal, as the table creation & design will have to be in the code.
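For completeness, the temp-table variant would look something like this (the column names are made up):

-- Data lives only for the session and is cleared at the end of each transaction.
CREATE TEMPORARY TABLE scratch_counts (
    item_id bigint,
    n       integer
) ON COMMIT DELETE ROWS;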
I'd consider memcached for use-cases like this. You can even spread the load over a bunch of cheap machines too.