PostgreSQL >= 10: Filtered publications and subscriptions

I have gone nuts searching for information about this. MS SQL Server offers this option, as described here: https://www.mssqltips.com/sqlservertip/4116/sql-server-transactional-replication-static-row-and-column-filters/ , but I could not find a way to do this in Postgres.
This is the case:
There is a central server storing the information of resources to be consumed locally by other local servers.
Each of the local servers is interested only in the resources that belong to it (i.e. there is a central repository of books, and I only want the books written in my language).
Additionally, as each server belongs to a separate client, there should be a "great wall" preventing clients from accessing each other's information.
We have thought of several approaches for developing this:
Use a socket (already implemented) to push changes from central to local via API.
Use triggers to push changes from central to local at the database level.
Use logical replication, as explained in the question.
I also have no information on which method would be more efficient computationally or in terms of I/O. The table is small (fewer than 15 columns and fewer than 10,000 rows), so I guess there should not be a problem, although updates to this table may happen several times per second (an estimated average of 2 or 3 per second).
Logical replication (publisher + subscribers at the DB level) seems like the proper solution, but I am stuck.
Ideas?

One way to do it with logical replication would be to partition the table by client and replicate only the respective partition.
But with 10000 rows that seems like overkill.
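For reference, a minimal sketch of that route on PostgreSQL 10+ (the table, client and connection details are made up):
-- Central server (publisher): one partition per client, published individually
CREATE TABLE resources (
    id        int,
    client_id int,
    title     text
) PARTITION BY LIST (client_id);
CREATE TABLE resources_client1 PARTITION OF resources FOR VALUES IN (1);
ALTER TABLE resources_client1 ADD PRIMARY KEY (id);  -- replica identity for UPDATE/DELETE
CREATE PUBLICATION pub_client1 FOR TABLE resources_client1;

-- Local server of client 1 (subscriber): a matching table plus a subscription
CREATE TABLE resources_client1 (
    id        int PRIMARY KEY,
    client_id int,
    title     text
);
CREATE SUBSCRIPTION sub_client1
    CONNECTION 'host=central.example.com dbname=repo user=client1'
    PUBLICATION pub_client1;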
Instead, I would enable row-level security on the table so that every client can only see the data that belongs to them. Then define a foreign table in each client's database and pull the whole thing over (a sketch follows after this list). You can either
truncate and replace the table every time
or
if you have foreign key dependencies, first remove the rows that no longer exist in the central database, then insert or update the rows that are new or were modified
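A minimal sketch of that approach, assuming a central resources table with a client_name column naming the role each row belongs to (all object names here are placeholders):
-- Central server: each client connects with its own (non-owner) role and sees only its rows
ALTER TABLE resources ENABLE ROW LEVEL SECURITY;
CREATE POLICY client_isolation ON resources
    USING (client_name = current_user);

-- Each local server: map the central table via postgres_fdw and pull it over
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER central FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'central.example.com', dbname 'repo');
CREATE USER MAPPING FOR local_app SERVER central
    OPTIONS (user 'client1', password 'secret');
CREATE FOREIGN TABLE central_resources (
    id int, client_name text, title text
) SERVER central OPTIONS (table_name 'resources');

-- Simplest refresh: replace the local copy (local_resources) wholesale
BEGIN;
TRUNCATE local_resources;
INSERT INTO local_resources SELECT * FROM central_resources;
COMMIT;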

The pglogical extension works for this purpose, as stated by @a_horse_with_no_name.
Latency seems good so far; for us, it did the job.
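For anyone landing here later, a hedged sketch of the row-filter setup with pglogical 2.x (the names and the filter expression are made up; check the pglogical docs for the exact function signatures in your version):
-- Provider (central server): add the table to a replication set with a row filter
SELECT pglogical.create_node(node_name := 'provider', dsn := 'host=central dbname=repo');
SELECT pglogical.create_replication_set(set_name := 'books_es');
SELECT pglogical.replication_set_add_table(
    set_name   := 'books_es',
    relation   := 'public.books'::regclass,
    row_filter := $$language = 'es'$$
);

-- Subscriber (local server): subscribe only to that set
SELECT pglogical.create_node(node_name := 'subscriber_es', dsn := 'host=local dbname=repo');
SELECT pglogical.create_subscription(
    subscription_name := 'books_es_sub',
    provider_dsn      := 'host=central dbname=repo',
    replication_sets  := ARRAY['books_es']
);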

Related

Archive old data in Postgresql

I'm currently hoping somebody can advise me on the process I'm going to use for DB archiving.
I have a database (DB-1) which has 2 very large tables, one holding 25 GB of data and the other 20 GB. They cause major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1, so it only holds the last 2 years of records. If I ever need the old data, I can connect to DB-2.
Every month, move old data from DB-1 to DB-2 and delete the moved rows from DB-1.
That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reasons, you can also just detach the partition from the table. Then the data is still "lying around", but would not show up in the (partitioned) table. You could query the (detached) partition directly to access that data. You could even move that (detached) partition to a slower harddisk to free up space on your fast disks if you have more than one.
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
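A hedged sketch of that layout on Postgres 11, with made-up table and column names:
-- Parent table partitioned by year of a timestamp column
CREATE TABLE measurements (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- To get rid of an old year entirely:
DROP TABLE measurements_2017;

-- Or, to keep the data around but outside the partitioned table:
ALTER TABLE measurements DETACH PARTITION measurements_2017;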
While you should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now and going to community-based Postgres 11), even if you can't upgrade, partitioning is still a much better answer than creating a second database.
By recreating your table as a set of partitions within the same database, you will be able to add/remove data in a much cleaner fashion, and it will make dealing with Vacuums much easier. Even in 9.5, you can take advantage of table inheritance to build out partitions by first adding partitions for incoming data, and then creating partitions at various intervals (probably monthly, since you want to run monthly cleanup) and moving the data into those partitions. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
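A rough sketch of that inheritance-based pattern on 9.5, again with hypothetical table and column names:
-- Child table that inherits from the big table, one per interval
CREATE TABLE orders_2016 (
    CHECK (created_at >= DATE '2016-01-01' AND created_at < DATE '2017-01-01')
) INHERITS (orders);

-- Move the matching rows into the partition atomically
BEGIN;
INSERT INTO orders_2016
    SELECT * FROM ONLY orders
    WHERE created_at >= DATE '2016-01-01' AND created_at < DATE '2017-01-01';
DELETE FROM ONLY orders
    WHERE created_at >= DATE '2016-01-01' AND created_at < DATE '2017-01-01';
COMMIT;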
I suspect you can probably manage this yourself (you need basic SQL and the ability to write simple triggers/functions; see the table-inheritance chapter of the PostgreSQL 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.

How to implement an optimistic (or pessimistic) locking when using two databases that need to be in sync?

I'm working on a solution in which we have two databases that are used with the following purposes:
An Elasticsearch used for search purposes
A Postgres database that acts as a source of truth for the data
Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc... And two typical use cases are:
Retrieve products by name: a search is performed using elasticsearch, and then the IDs retrieved by ES are used on a secondary query against Postgres to obtain the actual and trustworthy data (so we get fast searches on big tables while getting trustworthy data)
Update product fields: we allow users to update any product information (kind of a collaborative wiki). First we store the data in Postgres, and then in Elasticsearch.
However, as I feared, as the number of people using the app increased we ran into race conditions: if user #1 changed the name of a product to "Banana" and user #2 changed the name of the same product to "Apple" at the same time, sometimes the last record saved in Elasticsearch would be "Banana" while "Apple" would be the last value in Postgres, creating a serious inconsistency between the databases.
So I've ventured into reading about optimistic/pessimistic locking in order to solve my problem, but so far all the articles I find assume you only use one relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). Our combined storage solution of ES + Postgres requires more "ballet" than that.
What techniques/options are available to me to solve my kind of problem?
Well, I may attract some criticism, but let me explain it the way I understand it. This problem/concern is more of an architectural matter than a design/code matter.
Immediate Consistency and of course eventual consistency
From Application layer
For immediate consistency between the two databases, the only way to achieve it is to do polyglot persistence in a transactional way, so that either the same data gets updated in both Postgres and Elasticsearch or in neither of them. I wouldn't recommend this, purely because it would put a lot of pressure on the application and you would find it very difficult to scale/maintain.
So basically GUI --> Application Layer --> Postgres/Elasticsearch
Queue/Real Time Streaming mechanism
You need a messaging queue so that updates go to the queue in an event-based approach.
GUI --> Application Layer --> Postgres --> Queue --> Elasticsearch
Eventual consistency but not immediate consistency
Have a separate application; let's call it the indexer. The purpose of this tool is to pick up updates from Postgres and push them into Elasticsearch.
The indexer can have multiple configurations, one per source, each of which would provide:
An option to do a select * and index everything into Elasticsearch (a full crawl).
This would be used when you want to delete/reindex the entire data set in Elasticsearch.
The ability to detect only the updated rows in Postgres and push just those into Elasticsearch (an incremental crawl).
For this you would need a select query with a where clause based on a status column on your Postgres rows (e.g. pull records with status 0 for documents that were recently updated), or based on a timestamp to pull records that were updated in the last 30 seconds/1 minute, depending on your needs. This is the incremental query.
Once you perform the incremental crawl, if you implement it using a status column, you need to change the status to 1 (success) or -1 (failure) so that the same document doesn't get picked up in the next crawl. This is the post-incremental query.
Basically, schedule jobs to run the above queries as part of the indexing operations; a sketch of both queries follows below.
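A hedged sketch of those two queries, assuming a hypothetical products table with status and updated_at columns:
-- Incremental query: pick up rows flagged as needing indexing
SELECT id, name, price, description
FROM products
WHERE status = 0;

-- Alternative incremental query, based on a timestamp instead of a status flag
SELECT id, name, price, description
FROM products
WHERE updated_at > now() - interval '1 minute';

-- Post-incremental query: mark the rows that were just indexed
-- (use -1 instead of 1 if pushing them to Elasticsearch failed)
UPDATE products
SET status = 1
WHERE id IN (1, 2, 3);  -- the ids the indexer just pushed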
Basically we would have GUI --> Application Layer --> Postgres --> Indexer --> Elasticsearch
Summary
I do not think it would be wise to look for a fail-proof way; rather, we should have a system that can recover in the quickest possible time when it comes to providing consistency between two different data sources.
Having the systems decoupled would help greatly in scaling and in figuring out issues related to data correctness/quality, and at the same time would help you deal with frequent updates as well as with the growth rate of the data.
Hope it helps!

Commit to a log like Kafka + database with ACID properties?

I'm planning to test how to make this kind of architecture work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Where all the data is stored as facts in a log, but the validations for a posted change must be made against a table. For example, if I send a "Create invoice for customer 1" command, I need to validate that the customer exists (among other things); when the validation passes, I commit the change to the log and apply it to the table, so the table has the most up-to-date information while I still have the full history of changes.
I could put the log into a table in the database (I use PostgreSQL). However, I'm concerned about the scalability of doing that; also, I want to subscribe to the event stream from multiple clients, and neither PG nor any other RDBMS I know lets me do this without polling.
But if I use Kafka, I worry about ACID guarantees between the two stores, i.e. Kafka could end up with data that PG rolled back, or something similar.
So:
1. Is it possible to keep consistency between an RDBMS and a log store, OR
2. Is it possible to subscribe in real time and tune PG (or another RDBMS) for fast event storage?
Easy(1) answers to the questions as asked:
Setting up your transaction isolation level properly may be enough to achieve consistency and not worry about DB rollbacks. You can still occasionally create inconsistency, unless you set the isolation level to 'serializable'. Even then, you're guaranteed to be consistent, but could still see undesirable behaviors. For example, a client creates a customer and puts in an invoice in rapid succession using an async API, and the invoice event hits your backend system first. In that case the invoice event would be invalidated and the client would need to retry, hoping that the customer was created by then. Easy to avoid if you control the clients and mandate that they use a sync API.
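A minimal sketch of that in Postgres, with hypothetical customers/events/invoices tables (retry handling on serialization failures is left to the application):
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- validate against the current state: the customer must exist
SELECT 1 FROM customers WHERE id = 1;
-- append the event and update the current-state table in the same transaction
INSERT INTO events (event_type, payload) VALUES ('invoice_created', '{"customer_id": 1}');
INSERT INTO invoices (customer_id) VALUES (1);
COMMIT;
-- if COMMIT fails with a serialization error (SQLSTATE 40001), retry the whole transaction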
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big-time Postgres fan, and there is a lot you can do to make event lookups blazingly fast. My rule of thumb: if your operating table size is below 200-300 GB and you have a decent server, Postgres is the way to go. With event sourcing there are typically no joins, and a common access pattern is to get all events by id (optionally restricted by timestamp). Postgres excels at this kind of query, provided you index smartly. However, event subscribers will need to pull this data, so it may not be good if you have thousands of subscribers, which is rarely the case in practice.
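A hedged sketch of the kind of events table and index this implies (the column names are illustrative, not from the question):
CREATE TABLE events (
    event_id   bigserial PRIMARY KEY,
    stream_id  uuid        NOT NULL,  -- the aggregate/entity the event belongs to
    event_type text        NOT NULL,
    payload    jsonb       NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- "all events by id, optionally restricted by timestamp" maps to one composite index
CREATE INDEX events_stream_time_idx ON events (stream_id, created_at);

SELECT * FROM events
WHERE stream_id = '00000000-0000-0000-0000-000000000001'  -- placeholder id
  AND created_at >= now() - interval '1 day'
ORDER BY created_at;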
"Conceptually correct" answer:
If you still want to pursue the streaming approach and fundamentally resolve race conditions, then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order the 'add customer 1' event and the 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that would work for your particular case, e.g. in the example above you can partition your events by 'customerId' early, as they hit the backend; then you have a guarantee that all events related to the same customer will be processed (roughly) in the order they were created.
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

Neo4j's MERGE command on big datasets

Currently, I am working on a project implementing a Neo4j (V2.2.0) database in the field of web analytics. After loading some samples, I'm trying to load a big data set (>1GB, >4M lines). The problem I am facing is that using the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous on what the best way is to load big sets of data when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize, in this situation I am just loading the nodes; relations are the next step.
Basically there are three methods
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, etc})
In my experience, this will return an error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, etc})
I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to iii will improve execution time, as a smaller sample gave conflicting results. Are there methods which I am overlooking that could improve execution time? If I am not mistaken, the batch inserter only works for the CREATE command, and not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work; you can also increase the PERIODIC COMMIT batch size.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of Neo4j, you can go for the batch inserter via Java, or the neo4j-import tool (sadly not available prior to 2.2). What they both have in common is that they don't use transactions.
Finally, either way you go you should read Michael Hunger's article on importing data into neo4j as it provides a good conceptual discussion of what's happening, and why you need to skip transactions if you're going to load big huge piles of data into neo4j.

sql server split mirror db on to multiple devices

Say I have a large, mirrored 1 TB production DB that resides on a single MDF device, and I would like to split that up into, say, five 200 GB devices.
I want to do this without interruption to Production.
I thought I could break the mirror and use the RESTORE process for creating a mirror to achieve the split to multiple devices quickly and without interruption to Production. Doing this twice would allow me to get this done in a few hours.
Has anyone done this? Is it the preferred method, seeing as we are mirroring anyway?
What are my other alternatives, with pros and cons? Any gotchas?
Also, I recall another, more organic process where one would create the 5 new devices and somehow, over time, get the objects to move over to them. I'm not sure of the process for this, but I seem to recall it being discussed. It sounds like this could take a long time and possibly cause some blocking at times?
Thanks
...Ray
This isn't quite as simple a process as it first looks. The reason is that just adding the files to SQL Server isn't enough: even if you were to add 4 new files, they would all be empty space. You would have one file with 1 TB of data in it and 4 empty ones, which would eventually fill up because SQL Server uses a proportional fill method for the files, but most of your queries would still be hitting the single file.
I take it you are doing this to improve performance? If so, you will need to move data around into different files in order to actually split the data up. Whether you can do this online or not depends on whether you are running Enterprise Edition or not (as this allows you to rebuild indexes online).
An easy way to move a table (or, more accurately, a clustered index, which is pretty much the same thing as the table for all intents and purposes) is to add a new filegroup with a new data file and then rebuild the clustered index specifying the new filegroup. You can use the following to do this:
CREATE CLUSTERED INDEX Existing_Index_Name ON schema_name.table_name (column_name)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [new_filegroup_name]
GO
This code will create the new index on the new filegroup and get rid of the old one, and if you are running Enterprise Edition, it will do it all without blocking users.
See the following link for more methods of moving the data between filegroups:
Move data between SQL Server database filegroups
You should also look into partitioning your tables to help improve performance too:
Partitioning Tables and Indexes
With regards to your mirroring setup, you should break the mirror, then add all your files/filegroups on the primary, move the data between the filegroups, back up the modified database on the primary, restore it on the mirror (so all the files are set up the same there), and then re-establish your mirroring.
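A hedged sketch of the add-files/filegroups step on the principal (the database name, paths and sizes are placeholders):
-- Run on the principal after breaking the mirror
ALTER DATABASE BigDb ADD FILEGROUP FG2;
ALTER DATABASE BigDb ADD FILE (
    NAME = BigDb_Data2,
    FILENAME = N'D:\Data\BigDb_Data2.ndf',
    SIZE = 200GB,
    FILEGROWTH = 1GB
) TO FILEGROUP FG2;
GO
-- repeat for FG3..FG5, then rebuild the clustered indexes onto the new filegroups as shown above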