What is the best way to sync Postgres and ElasticSearch?

I have two options for syncing Elasticsearch with the latest changes in my Postgres DB:
1- Postgres LISTEN / NOTIFY:
I would create a trigger -> use pg_notify -> and run a listener in a separate service (a sketch is below).
2- Async queries to ES:
I can update Elasticsearch asynchronously after a change to the DB, i.e.:
model.save().then(() => model.saveES()).catch(err => console.error(err))
Which one will scale best?
PS: We tried ZomboDB in production but it didn't go well; it slowed production down.
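To make option 1 concrete, here is a minimal sketch of what I have in mind with node-postgres; the table (books), channel (es_changes) and function names are made up, and the actual Elasticsearch call is left as a stub:

// Option 1 sketch: trigger -> pg_notify -> listener running in a separate service.
// Requires Postgres 11+ for EXECUTE FUNCTION (use EXECUTE PROCEDURE on older versions).
const { Client } = require('pg');

const pg = new Client(); // connection settings come from the usual PG* env vars

async function main() {
  await pg.connect();

  // One-time setup: publish every change on "books" as a JSON payload.
  // Note: pg_notify payloads are capped at roughly 8 kB, so very large rows may need
  // to be re-read by the listener instead of being sent inline.
  await pg.query(`
    CREATE OR REPLACE FUNCTION notify_es_changes() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('es_changes', json_build_object(
        'table', TG_TABLE_NAME,
        'op',    TG_OP,
        'row',   row_to_json(COALESCE(NEW, OLD))
      )::text);
      RETURN COALESCE(NEW, OLD);
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS books_es_sync ON books;
    CREATE TRIGGER books_es_sync
      AFTER INSERT OR UPDATE OR DELETE ON books
      FOR EACH ROW EXECUTE FUNCTION notify_es_changes();
  `);

  // The listener side (this would live in the separate service).
  await pg.query('LISTEN es_changes');
  pg.on('notification', (msg) => {
    const change = JSON.parse(msg.payload);
    console.log(change); // push to Elasticsearch here
  });
}

main().catch(console.error);

One caveat I'm aware of: NOTIFY messages are only delivered to listeners that are connected at the time, so anything emitted while the listener service is down is lost.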

Since you are asking about the possible approaches, I assume you want to pick the better architecture, so I would point you to the advice given by Confluent here:
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

I recommend you consider https://github.com/debezium/debezium. It has PostgreSQL support and implements the change data capture model proposed in other posts, instead of the dual-write model.
Debezium benefits:
low-latency change streaming
stores changes in a replicated log for durability
emits only write events (creates, updates, deletes), which can be consumed and piped into other systems
UPD. Here is a simple github repository, which shows how it works
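To give an idea of the consuming side, here is a minimal sketch (not the repository mentioned above) that reads Debezium change events from Kafka with kafkajs and applies them to Elasticsearch; the topic name, index name, id column and v8-style Elasticsearch client calls are all assumptions:

// Sketch of a Debezium consumer: Kafka topic -> Elasticsearch index.
// Topic (dbserver1.public.books), index (books) and id column (isbn) are made-up examples.
const { Kafka } = require('kafkajs');
const { Client } = require('@elastic/elasticsearch');

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const es = new Client({ node: 'http://localhost:9200' });

async function run() {
  const consumer = kafka.consumer({ groupId: 'es-sync' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'dbserver1.public.books', fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return; // tombstone
      const event = JSON.parse(message.value.toString());
      const payload = event.payload ?? event; // depends on whether schemas are enabled
      const { op, before, after } = payload;  // op is c/u/r (upsert) or d (delete)

      if (op === 'd') {
        await es.delete({ index: 'books', id: String(before.isbn) });
      } else {
        await es.index({ index: 'books', id: String(after.isbn), document: after });
      }
    },
  });
}

run().catch(console.error);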

Related

How to hook into all database operations in PostgreSQL

Wondering if there is a way to create a trigger of some sort in PostgreSQL where you can listen for all modification queries on all records on all tables in a database. Basically watch everything in a database as it changes.
For example, it looks like MongoDB has this feature using Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog.
Wondering how the same sort of thing can be accomplished in PostgreSQL.
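One way to get close to that in PostgreSQL is a single generic trigger function (for example the pg_notify-based one sketched in the question above) attached to every table. A rough sketch of wiring it up, with the schema and trigger names as assumptions:

// Sketch: attach one generic row-level trigger to every table in the public schema.
// Assumes a trigger function notify_es_changes() like the one sketched earlier.
const { Client } = require('pg');
const pg = new Client();

async function attachToAllTables() {
  await pg.connect();
  await pg.query(`
    DO $$
    DECLARE t text;
    BEGIN
      FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public' LOOP
        EXECUTE format('DROP TRIGGER IF EXISTS watch_all_changes ON %I', t);
        EXECUTE format(
          'CREATE TRIGGER watch_all_changes
             AFTER INSERT OR UPDATE OR DELETE ON %I
             FOR EACH ROW EXECUTE FUNCTION notify_es_changes()', t);
      END LOOP;
    END $$;
  `);
  await pg.end();
}

attachToAllTables().catch(console.error);

This only sees committed row changes, not arbitrary queries; for a log-level view, the CDC tools mentioned above (e.g. Debezium) are the closer equivalent of MongoDB change streams.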

Sync postgreSql data with ElasticSearch

Ultimately I want a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch; however, I have not found a usable solution. The solutions I have found involve using the jdbc-input to query all data from Postgres on an interval, and delete events are not captured.
I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.
If you also need to be notified of DELETEs and delete the respective records in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution based on the write-ahead log (WAL), as suggested here.
However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch, and you can exclude soft-deleted records from your searches with a simple term query on the deleted field.
Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
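A minimal sketch of that exclusion, assuming an index named books, the boolean deleted flag described above, and the v8-style Elasticsearch JS client:

// Sketch: exclude soft-deleted documents from search results.
const { Client } = require('@elastic/elasticsearch');
const es = new Client({ node: 'http://localhost:9200' });

async function searchBooks(text) {
  const result = await es.search({
    index: 'books',
    query: {
      bool: {
        must:     [{ match: { title: text } }],
        must_not: [{ term: { deleted: true } }], // hide soft-deleted records
      },
    },
  });
  return result.hits.hits;
}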
You can also take a look at PGSync.
It's similar to Debezium but a lot easier to get up and running.
PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.
You simply define a JSON schema describing the structure of the data in Elasticsearch.
Here is an example schema (you can also have nested objects):

{
    "nodes": {
        "table": "book",
        "columns": [
            "isbn",
            "title",
            "description"
        ]
    }
}
PGSync generates the queries for your document on the fly, so there is no need to write queries as you would with Logstash. It also supports and tracks deletion operations.
It operates with both a polling and an event-driven model to capture changes: the initial sync polls the database for changes made since the last time the daemon was run, and thereafter event notifications (based on triggers and handled by pg_notify) pick up changes as they occur.
It has very little development overhead:
1- Create a schema as described above
2- Point PGSync at your Postgres database and Elasticsearch cluster
3- Start the daemon
You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
Have a look at the github repo for more details.
You can install the package from PyPI
Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data.
I created a simple github repository, which shows how it works

Detecting new data in a replicated MongoDB slave

Background: I have a very specific use case where I have an existing MongoDB that I need to interact with via reads, but I have to ensure that the data can never be modified. However, I also need to trigger some form of event when new data comes in so I can do post-processing on it.
The current plan is to use replication to get the data onto a slave for the read processing. However, for my purposes I only care about new data in various document stores. Part of the issue is that I cannot modify the existing MongoDB and not all the data is timestamped, so there is no incremental way to handle this that I can think of.
Question: Is it possible to fire an event from a slave that would tell me I have new data and what it is? I will only have access to the slave DB, as the master will be locked.
I may have some limited ability to change the master DB, but I can not expect to change the document structure at all.
Instead of a master/slave configuration, you could use a replica set with a priority 0 secondary (so that it can never become primary).
You can tail the oplog on that secondary looking for insert operations.
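A rough sketch of doing that from Node.js, assuming you connect directly to that priority 0 secondary and only care about inserts into a hypothetical mydb.events collection:

// Sketch: tail the oplog on a secondary and react to new inserts.
const { MongoClient } = require('mongodb');

async function tailOplog() {
  const client = new MongoClient(
    'mongodb://secondary-host:27017/?directConnection=true&readPreference=secondary'
  );
  await client.connect();
  const oplog = client.db('local').collection('oplog.rs');

  // 'i' = insert entries only; in practice you would also filter on the ts field
  // so you can resume from where you left off after a restart.
  const cursor = oplog.find(
    { op: 'i', ns: 'mydb.events' },
    { tailable: true, awaitData: true } // keep the cursor open, like `tail -f`
  );

  for await (const entry of cursor) {
    console.log('new document:', entry.o); // entry.o holds the inserted document
  }
}

tailOplog().catch(console.error);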

How can I use MongoDB as a cache for Postgresql?

I have an application that cannot afford to lose data, so PostgreSQL is my choice of database (ACID).
However, the speed and query advantages of MongoDB are very attractive, but based on what I've read so far, MongoDB can report a successful write which may not have gone to disk, so I can't make it my mission-critical DB (I'll also need transactions).
I've seen references to people using MySQL and MongoDB together, one for the transactions and the other for queries. Please note that I'm not talking about keeping some data in one DB and the rest in another. I want to use PostgreSQL as a gateway for data entry, and MongoDB for reads.
Are there any resources that offer an architecture/guide for Postgresql + MongoDB usage in this way? I can remember seeing this topic in Postgresql conference agenda, but I could not find the link.
I don't think you'll get much speed using MongoDB just as a cache. Its strengths are replication and horizontal scalability. On one computer you'd make Mongo and Postgres compete for memory, IO bandwidth and processor time.
As you cannot afford to lose transactions, you'll be better off with Postgres only. Its efficient caching, sophisticated query planner, prepared queries and wide indexing support mean that read-only queries will be very fast - really comparable to MongoDB on a single computer.
Postgres can even scale horizontally now using asynchronous or, from version 9.1, synchronous replication.
One way to achieve this would be to set up a master-slave replication with the PostgreSQL database as master, and the MongoDB database as slave. You would then do all reads from MongoDB, and all writes to PostgreSQL.
This post discusses such a setup using a tool called Bucardo:
http://blog.endpoint.com/2011/06/mongodb-replication-from-postgres-using.html
You may also be able to do it with Tungsten Replicator, although it seems designed to be used with MySQL:
http://code.google.com/p/tungsten-replicator/wiki/TRCHeterogeneousReplication
"I can remember seeing this topic in a PostgreSQL conference agenda, but I could not find the link."
Maybe you are talking about this: https://www.postgresqlconference.org/content/hybrid-applications-using-mongodb-and-postgres
Depending on how important transactions are to you, one option is to use the MongoDB driver's safe mode and drop PostgreSQL.
http://www.mongodb.org/display/DOCS/getLastError+Command
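In current drivers the old safe mode / getLastError behaviour is expressed as a write concern; a minimal sketch with the Node.js driver (database and collection names are made up):

// Sketch: acknowledged, journaled writes - the modern equivalent of "safe mode".
const { MongoClient } = require('mongodb');

async function insertOrder(order) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // w: 'majority' waits for a majority of replica set members to acknowledge,
  // j: true waits until the write has reached the on-disk journal.
  await orders.insertOne(order, { writeConcern: { w: 'majority', j: true } });

  await client.close();
}

insertOrder({ item: 'book', qty: 1 }).catch(console.error);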
How can you expect transactional consistency from Postgres but trust MongoDB for reads? How would you support rollbacks in this scenario? How do you detect when they've gotten out of sync?
I think you're better off going with memcache and implementing a higher level object cache. Alternatively, you could consider a replication slave for reads. If you have performance needs beyond what a dedicated read slave can provide, consider denormalizing your tables on your slave system.
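For what a higher-level object cache can look like, here is a minimal cache-aside sketch; the in-process Map stands in for memcached, and the table and key names are made up:

// Cache-aside sketch: read through a cache, fall back to Postgres on a miss.
const { Client } = require('pg');

const pg = new Client();   // connection settings from the usual PG* env vars
const cache = new Map();   // stand-in for a real memcached client

async function getProduct(id) {
  const key = `product:${id}`;
  if (cache.has(key)) return cache.get(key);

  const { rows } = await pg.query('SELECT * FROM products WHERE id = $1', [id]);
  if (rows[0]) cache.set(key, rows[0]); // with memcached you would also set a TTL
  return rows[0] ?? null;
}

async function main() {
  await pg.connect();
  console.log(await getProduct(42));
  await pg.end();
}

main().catch(console.error);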
Make sure that any of this is actually needed. For thin tables with PK lookups, most modern database engines like Postgres or InnoDB will generally keep up with NoSQL solutions. Don't fall into the ROFLSCALE trap:
http://www.youtube.com/watch?v=b2F-DItXtZs
I think you can run a Mongo replica set, let's say 3 slaves and 1 master. Then in your app you would run all write transactions on PostgreSQL and then on the Mongo replica set, and after that run read operations against the Mongo replica set.
But synchronization will be a problem; you'll have to work on it.
You may find some replacements for Mongo here or here that are safer and just as fast,
but I would advise simplifying your solution instead of making a complicated design.
Visual Guide to NoSQL Systems
In MongoDB you can set the writeConcern property to require that a write reaches the journal and/or a given number of instances before it is acknowledged, and MongoDB now also has transactions. I'm not sure why you would need Postgres behind it.

postgresql replication + scrubbing

Is there an easy way (built-in, add-on, open-source or commercial) to do replication in PostgreSQL (master-slave) so that the data in the slave is scrubbed for PCI compliance while being replicated across? How about ETL tools? It does not have to be instantaneous; up to an hour of lag is acceptable, but the faster the better, of course.
If this doesn't work, how about possibly using triggers on the slave database to achieve this?
Perhaps you should try creating a view of the tables you wish to scrub (performing your scrubbing in the SELECT), and then replicate the view to your offsite location.
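A minimal sketch of such a scrubbing view, with made-up table and column names and a masking expression that is only an example:

// Sketch: a view that masks PCI data so only scrubbed values are exposed downstream.
const { Client } = require('pg');
const pg = new Client();

async function createScrubbedView() {
  await pg.connect();
  await pg.query(`
    CREATE OR REPLACE VIEW customers_scrubbed AS
    SELECT id,
           name,
           'XXXX-XXXX-XXXX-' || right(card_number, 4) AS card_number
    FROM customers
  `);
  await pg.end();
}

createScrubbedView().catch(console.error);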
I believe triggers on the slave would put you at risk of non-compliance, since data could leak out. If you want a packaged solution, I'd probably look at Bucardo, specifically at custom replication hooks into the slave to filter out (or modify) the columns you don't need/want. If that won't work, the idea of using views is probably your next best bet.
Yes. Use Slony, add triggers to the master to materialize what you want to replicate, and replicate only those materialized views. If you scrub on the master, that should do what you want. Since Slony will happily replicate only part of your database, that should work fine (on the other hand, remember, Slony will happily replicate only part of your database).