Why does database replication optimize reading and writing from the database? - postgresql

Can someone explain why database replication usually optimizes reading and writing from the database? I cannot understand how it works... after all, replication adds a lot of extra operations, such as master-slave synchronization. The database I'm considering is a cloud-based PostgreSQL database that writes about 1.5M records per day; only 1% of the operations are reads.

PostgreSQL replication does not optimize reading and writing.
You can use it for load balancing, but it is not perfect for that:
your application has to know that it must write to one database and read from the other
data written on the primary server doesn't become visible on the standby right away
If your workload consists of 99% writes, that won't help you at all.
Look into sharding if you want to distribute the load.
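To illustrate the routing and lag caveats above, here is a minimal sketch of application-side read/write splitting; the hosts, credentials and the events table are made-up placeholders, and pg_is_in_recovery() is just used to check which server is the standby.

```python
# Minimal sketch of application-side read/write routing between a
# primary and a streaming-replication standby. Hosts, credentials and
# the "events" table are hypothetical placeholders.
import psycopg2

primary = psycopg2.connect(host="db-primary.example.com", dbname="app",
                           user="app", password="secret")
standby = psycopg2.connect(host="db-standby.example.com", dbname="app",
                           user="app", password="secret")

def is_standby(conn):
    """True if the server is in recovery, i.e. a streaming standby."""
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery()")
        return cur.fetchone()[0]

assert not is_standby(primary) and is_standby(standby)

# Writes must always go to the primary ...
with primary.cursor() as cur:
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
primary.commit()

# ... while reads can go to the standby, accepting that the row above
# may not be visible there yet (replication lag).
with standby.cursor() as cur:
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone()[0])
```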

Related

PostgreSQL: point-in-time recovery for individual database and not whole cluster

As per the standard Postgres documentation:
As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset.
From this, I understood that it is not possible to set up PITR for individual databases in a cluster (i.e., a database instance holding multiple databases).
If my understanding is incorrect, probably the next part of the question is not relevant, but if not, here it is:
I still do not see the theoretical problem with setting this up, since I assumed each database generates its own WAL archive.
The problem here is: I need to set up multiple Postgres clusters, and I only have 2 RHEL 7.6 machines to handle this. I am trying to reduce the number of clusters on these 2 machines to only 2. I am planning to create multiple databases rather than multiple instances to handle customer applications. But that means I have to sacrifice PITR, as PITR can only be performed at the instance/cluster level and not at the database level (as per the official documentation).
Could someone please help clarify my misunderstanding?
You are correct, you can only do PITR on a PostgreSQL database cluster, not on an individual database.
There is only one WAL stream for the complete database cluster; WAL is not split up per database.
Don't hesitate to run several PostgreSQL clusters on a single machine if that is advantageous for you.
There is little overhead in running a second database cluster. The biggest resource that is hogged by a cluster is shared buffers, but you want that to be only a fraction of the available RAM anyway. Most of the memory should be left to the filesystem cache that is shared by all PostgreSQL clusters.
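For illustration, here is a rough sketch of creating and starting a second cluster on the same machine; the data directories, ports, shared_buffers value and the use of Python's subprocess module are all assumptions (in practice you would likely use your distribution's packaging tools).

```python
# Rough sketch: initialize and start two independent PostgreSQL clusters
# on one host, each with its own data directory, port and WAL stream.
# Paths and ports are hypothetical; initdb/pg_ctl must be on PATH.
import subprocess

clusters = [
    {"datadir": "/var/lib/pgsql/cluster_a", "port": 5432},
    {"datadir": "/var/lib/pgsql/cluster_b", "port": 5433},
]

for c in clusters:
    # Create the data directory (one cluster = one WAL stream = one PITR unit).
    subprocess.run(["initdb", "-D", c["datadir"]], check=True)
    # Start it on its own port; keep shared_buffers modest so the
    # filesystem cache stays available to both clusters.
    subprocess.run(
        ["pg_ctl", "-D", c["datadir"],
         "-o", f"-p {c['port']} -c shared_buffers=1GB",
         "start"],
        check=True,
    )
```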

Processing multiple concurrent read queries in Postgres

I am planning to use AWS RDS Postgres version 10.4 or above for storing data in a single table comprising ~15 columns.
My use case is to serve:
1. Periodically (every hour) store/update rows in this table.
2. Periodically (every hour) fetch data from the table, say 500 rows at a time.
3. Frequently fetch small amounts of data (10 rows) from the table (100s of queries in parallel).
Does AWS RDS Postgres support serving all of the above use cases?
I am aware of read replica support, but is there any built-in load balancer to serve the queries that come in parallel?
How many read queries can Postgres process concurrently?
Thanks in advance
Your use cases seem to be a normal fit for any relational database system, so I would say: yes.
The question is how fast the DB can handle the hundreds of parallel queries in (3).
In general, the PostgreSQL documentation is one of the best I have ever read, so give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try it without read replicas first! You might not need them.
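As a rough illustration of use case (3), here is a sketch of firing many small reads in parallel through a client-side connection pool; the DSN, the items table and the pool sizes are invented, and on RDS you would size max_connections (or add a pooler such as PgBouncer) accordingly.

```python
# Sketch: hundreds of small parallel reads through a client-side pool.
# The DSN, the "items" table and the pool limits are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=5, maxconn=50,
                              dsn="host=mydb.example.rds.amazonaws.com "
                                  "dbname=app user=app password=secret")

def fetch_batch(offset):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM items ORDER BY id LIMIT 10 OFFSET %s",
                        (offset,))
            return cur.fetchall()
    finally:
        pool.putconn(conn)

# 200 small queries at once; PostgreSQL handles each in its own backend
# process, so the practical limit is max_connections and the instance's
# CPU and I/O capacity.
with ThreadPoolExecutor(max_workers=50) as ex:
    results = list(ex.map(fetch_batch, range(0, 2000, 10)))
```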

Data mining with postgres in production environment - is there a better way?

There is a web application which has been running for years, and during its lifetime the application has gathered a lot of user data. The data is stored in a relational DB (Postgres). Not all of this data is needed to run the application (to do the business). However, from time to time business people ask me to provide reports on this data. And this causes some problems:
sometimes these SQL queries are long-running
queries are executed against the production DB (not cool)
it is not so easy to deliver reports on a weekly or monthly basis
some parts of the data are stored in a way that is not suitable for such querying (queries are inefficient)
My idea (note that I am a developer, not a data mining specialist) for improving this whole process of delivering reports is:
create a separate DB which is regularly updated with production data
optimize how the data is stored
create a dashboard to present the reports
Question: But is there a better way? Is there another DB that is better suited to such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
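As a small sketch of that idea (the users table, its columns and the connection settings are invented), you can pre-aggregate data with a materialized view and refresh it on a schedule:

```python
# Sketch: maintain a pre-aggregated reporting table as a materialized
# view. Connection settings and the "users" table are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=report password=secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_signups AS
        SELECT date_trunc('day', created_at) AS day, count(*) AS signups
        FROM users
        GROUP BY 1
    """)

# Run this from cron (or a trigger-driven job) to bring the report up to date.
with conn, conn.cursor() as cur:
    cur.execute("REFRESH MATERIALIZED VIEW daily_signups")
conn.close()
```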
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays within the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want to try data mining with PostgreSQL, there are some tools you can use.
The simplest way is KNIME. It is easy to install and has full-featured data mining tools. You can access your data directly from the database, process it and save it back to the database.
The hardcore way is MADlib. It installs data mining functions written in Python and C directly in Postgres, so you can mine with SQL queries.
Both projects are stable enough to try.
For reporting, we use a non-transactional (read-only) database, and we don't care about normalization. If I were you, I would use another database for reporting. I would design the tables following OLAP principles (star schema, snowflake) and use an ETL tool to dump the data periodically (maybe weekly) to the read-only database, and start creating reports from there.
Reports are used for decision support, so they don't have to be real-time and usually don't have to be current. In other words, it is acceptable to create reports up to last week or last month.
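A bare-bones sketch of such a periodic dump (both connection strings and the orders/fact_orders tables are invented for illustration; a real ETL tool adds incremental loading, error handling and so on):

```python
# Bare-bones weekly ETL sketch: copy recent orders from the
# transactional DB into a star-schema fact table in the reporting DB.
# Connection strings and table/column names are hypothetical.
import psycopg2
from psycopg2.extras import execute_values

src = psycopg2.connect("host=prod-db dbname=app user=etl password=secret")
dst = psycopg2.connect("host=report-db dbname=reports user=etl password=secret")

with src.cursor() as cur:
    cur.execute("""
        SELECT o.id, o.customer_id, o.created_at::date, o.total_amount
        FROM orders o
        WHERE o.created_at >= now() - interval '7 days'
    """)
    rows = cur.fetchall()

with dst, dst.cursor() as cur:
    execute_values(cur,
        "INSERT INTO fact_orders (order_id, customer_key, date_key, amount) "
        "VALUES %s ON CONFLICT (order_id) DO NOTHING",
        rows)

src.close()
dst.close()
```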

How can one replicate a graph database like Neo4J

I am a noob in the NoSQL world, but after going through the basics of how Neo4j works, I didn't understand how replication would be faster compared to column or document databases or a plain key-value DB.
It has nodes and edges, which are nothing but relations between those nodes, something similar to joins in an RDBMS.
So how does replication work here as compared to an RDBMS?
Each NoSQL database will behave in different ways when it comes to replication. NoSQL is quite a wide term, so you should not expect good replication performance from all of them. In fact, Neo4j Enterprise has some support for replication, but the design of Neo4j does not naturally lead to scaling. It is certainly not one of the core objectives, unlike others such as Cassandra, for example.
What do you mean by replication? Neo4j Enterprise comes with a high-availability cluster that replicates your data across a number of machines.
If it is just about replicating the data, you can also shut down your database and copy the database files (or, in Enterprise, execute an online backup).

How can I use MongoDB as a cache for Postgresql?

I have an application that cannot afford to lose data, so PostgreSQL is my choice of database (ACID).
However, the speed and query advantages of MongoDB are very attractive. Based on what I've read so far, MongoDB can report a successful write which may not have gone to disk, so I can't make it my mission-critical DB (I'll also need transactions).
I've seen references to people using MySQL and MongoDB together, one for the transactions and the other for queries. Please note that I'm not talking about keeping some data in one DB and the rest in another. I want to use PostgreSQL as a gateway to data entry, and MongoDB for reads.
Are there any resources that offer an architecture/guide for Postgresql + MongoDB usage in this way? I can remember seeing this topic in Postgresql conference agenda, but I could not find the link.
I don't think you'll get much speed using MongoDB just as a cache. Its strengths are replication and horizontal scalability. On one computer you'd make Mongo and Postgres compete for memory, IO bandwidth and processor time.
As you cannot afford to lose transactions, you'll be better off with Postgres only. It has efficient caching, a sophisticated query planner, prepared queries and wide indexing support, so read-only queries will be very fast - really comparable to MongoDB on a single computer.
Postgres can even scale horizontally now using asynchronous or, from version 9.1, synchronous replication.
One way to achieve this would be to set up a master-slave replication with the PostgreSQL database as master, and the MongoDB database as slave. You would then do all reads from MongoDB, and all writes to PostgreSQL.
This post discusses such a setup using a tool called Bucardo:
http://blog.endpoint.com/2011/06/mongodb-replication-from-postgres-using.html
You may also be able to do it with Tungsten Replicator, although it seems designed to be used with MySQL:
http://code.google.com/p/tungsten-replicator/wiki/TRCHeterogeneousReplication
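If you end up rolling something yourself instead of using one of those tools, the pattern boils down to something like this hand-made sketch; all connection strings, the products table and its updated_at column are assumptions, it ignores deletes and conflicts, and it is not how Bucardo or Tungsten actually work.

```python
# Hand-rolled sketch of "write to PostgreSQL, read from MongoDB":
# periodically copy rows changed since the last sync into Mongo.
# Connection strings, the "products" table and its updated_at column
# are hypothetical; tools like Bucardo do this far more robustly.
import psycopg2
from pymongo import MongoClient

pg = psycopg2.connect("dbname=app user=sync password=secret")
mongo = MongoClient("mongodb://localhost:27017")["app_read"]["products"]

def sync_changed_rows(last_sync):
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, price, updated_at FROM products "
            "WHERE updated_at > %s", (last_sync,))
        for pid, name, price, updated_at in cur:
            # Upsert into the read-side copy, keyed by the Postgres id.
            mongo.replace_one(
                {"_id": pid},
                {"_id": pid, "name": name, "price": float(price),
                 "updated_at": updated_at},
                upsert=True)
```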
I can remember seeing this topic in a PostgreSQL conference agenda, but I could not find the link.
Maybe, you are talking about this: https://www.postgresqlconference.org/content/hybrid-applications-using-mongodb-and-postgres
Depending on how important transactions are to you, one option is to use the MongoDB driver's safe mode and drop PostgreSQL.
http://www.mongodb.org/display/DOCS/getLastError+Command
How can you expect transactional consistency from Postgres but trust MongoDB for reads? How would you support rollbacks in this scenario? How do you detect when they've gotten out of sync?
I think you're better off going with memcache and implementing a higher level object cache. Alternatively, you could consider a replication slave for reads. If you have performance needs beyond what a dedicated read slave can provide, consider denormalizing your tables on your slave system.
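A cache-aside sketch of that idea (the memcached instance, the key scheme and the users table are hypothetical): check the cache first, fall back to Postgres, then populate the cache with a short TTL.

```python
# Cache-aside sketch: serve reads from memcached when possible and fall
# back to PostgreSQL, caching the result with a short TTL.
# The memcached host, key scheme and "users" table are hypothetical.
import json
import psycopg2
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = psycopg2.connect("dbname=app user=app password=secret")

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    with db.cursor() as cur:
        cur.execute("SELECT id, name, email FROM users WHERE id = %s",
                    (user_id,))
        row = cur.fetchone()
    if row is None:
        return None

    user = {"id": row[0], "name": row[1], "email": row[2]}
    cache.set(key, json.dumps(user), expire=300)  # 5-minute TTL
    return user
```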
Make sure that any of this is actually needed. For thin tables with PK lookups, most modern database engines like Postgres or InnoDB are generally going to keep up with NoSQL solutions. Don't fall into the ROFLSCALE trap:
http://www.youtube.com/watch?v=b2F-DItXtZs
I think you can run a Mongo replica set, let's say 3 slaves and 1 master. Then in your app you should run all write transactions on PostgreSQL and then on the Mongo replica set. After that you can serve read operations from the Mongo replica set.
But synchronization will be a problem; you will have to work on it.
You may find some replacements for Mongo here or here that are safer and fast as well.
But I advise simplifying your solution instead of making a complicated design.
Visual Guide to NoSQL Systems
In MongoDB we can specify the writeConcern property to require that a write goes to the journal and/or a given number of instances before the acknowledgement is sent, and I think MongoDB even has the concept of transactions now. Not sure why we need Postgres behind it.
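For reference, a small sketch of how the write concern part looks with pymongo (the database and collection names are made up): journaling plus majority acknowledgement is required before the write is reported as successful.

```python
# Sketch: ask MongoDB to acknowledge a write only after it is journaled
# and replicated to a majority of the replica set.
# The "app" database and "orders" collection are hypothetical.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
orders = client["app"].get_collection(
    "orders", write_concern=WriteConcern(w="majority", j=True))

# insert_one() returns only after the write meets the write concern above.
orders.insert_one({"customer_id": 42, "total": 99.95})
```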