How to read from a specific instance of a DocumentDB cluster - MongoDB

I am having a replica lag issue with DocumentDB: I write some data to a collection and read it back immediately. But because I am using a distributed system, I am not able to read the already-written data from the replica sets.
Here's the cluster design: [cluster diagram]
So, is it possible to read from the primary instance in Node.js, or is it possible to read from a specific instance?

How big is the replication lag? It might be worth investigating the cause of the lag; maybe bigger instances are needed or the queries have to be optimized.
If your application can't tolerate eventual consistency, or read-after-write consistency is required, then use readPreference: primaryPreferred to instruct the driver to read from the primary instance when available. However, in this case the replicas will not be used to scale read traffic horizontally.
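For example, a minimal Node.js sketch (the hostname and credentials are placeholders; DocumentDB connections also typically need TLS and retryWrites=false):

```javascript
const { MongoClient } = require('mongodb');

// Placeholder cluster endpoint and credentials.
// readPreference=primaryPreferred sends reads to the primary whenever it is available.
const uri =
  'mongodb://user:password@mycluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/' +
  '?tls=true&replicaSet=rs0&readPreference=primaryPreferred&retryWrites=false';

async function main() {
  const client = new MongoClient(uri);
  await client.connect();
  // This read is served by the primary, so it sees the latest writes.
  const doc = await client.db('mydb').collection('items').findOne({});
  console.log(doc);
  await client.close();
}

main().catch(console.error);
```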
Amazon DocumentDB has other endpoints too:
reader endpoint - points to the replica instances; it's found in the configuration section of the cluster (console or the aws cli describe-db-clusters command)
instance endpoint - each instance has its own endpoint; it's found in the instances section (console or the aws cli describe-db-instances command)
The best practice is to connect as a replica set, using the readPreference parameter to adjust the preference. Instance endpoints can be useful when, for example, there's a need for large analytics queries and a bigger instance is deployed temporarily to run them.
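As a sketch, the two connection styles in Node.js might look like this (the endpoints below are placeholders):

```javascript
// Replica-set mode against the cluster endpoint (recommended):
const clusterUri =
  'mongodb://user:password@mycluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/' +
  '?tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false';

// Direct connection to a single instance endpoint,
// e.g. a bigger instance deployed temporarily for analytics queries:
const instanceUri =
  'mongodb://user:password@analytics-instance.xxxx.us-east-1.docdb.amazonaws.com:27017/' +
  '?tls=true&directConnection=true&retryWrites=false';
```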

Related

Mongo DB - difference between standalone & 1-node replica set

I needed to use Mongo DB transactions, and recently I understood that transactions don't work for Mongo standalone mode, but only for replica sets
(Mongo DB with C# - document added regardless of transaction).
Also, I read that standalone mode is not recommended for production.
So I found out that simply defining a replica set name in the mongod.cfg is enough to run Mongo DB as a replica set instead of standalone.
After changing that, Mongo transactions started working.
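For reference, the relevant part of my mongod.cfg now looks like this (the replica set name rs0 is arbitrary), followed by a one-time rs.initiate() in the mongo shell after restarting mongod:

```yaml
# mongod.cfg: the only addition needed to run as a single-node replica set
replication:
  replSetName: rs0
```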
However, it feels a bit strange using it as a replica set although I'm not really using the replication functionality, and I want to make sure I'm using a valid configuration.
So my questions are:
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalability features? (As said, I need it to allow transactions.)
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
I've read that standalone mode is not recommended for production, although it sounds like the most basic configuration. I understand that this configuration is not used in most scenarios, but sometimes you may want to use it as a standard DB on a local machine. So why is standalone mode not recommended? Is it not stable enough, or are there other reasons?
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalability features?
You don't have the high availability afforded by a proper replica set, so it's not recommended for a production deployment. It is fine for development, though.
Note that a replica set's function is primarily about high availability instead of scaling.
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
A single-node replica set would have the oplog. This means that you'll use more disk space to store the oplog, and also any insert/update operation would be written to the oplog as well (write amplification).
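If you're curious, you can check the oplog's footprint yourself; a small Node.js sketch (the connection string is a placeholder):

```javascript
const { MongoClient } = require('mongodb');

async function oplogStats() {
  const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
  await client.connect();
  // The oplog is the capped collection "oplog.rs" in the "local" database.
  const stats = await client.db('local').command({ collStats: 'oplog.rs' });
  console.log(`oplog uses ${stats.size} bytes of a ${stats.maxSize}-byte cap`);
  await client.close();
}

oplogStats().catch(console.error);
```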
So why is standalone mode not recommended? Is it not stable enough, or are there other reasons?
MongoDB in production was designed with a replica set deployment in mind, for:
High availability in the face of node failures
Rolling maintenance/upgrades with no downtime
Possibility to scale out reads
Possibility to have a replica of data in a special-purpose node that is not part of the high availability nodes
In short, MongoDB was designed to be a fault-tolerant distributed database (scales horizontally) instead of the typical SQL monolithic database (scales vertically). The idea is, if you lose one node of your replica set, the others will immediately take over. Most of the time your application won't even know there's a failure on the database side. In contrast, a failure in a monolithic database server would immediately disrupt your application.
I think kevinadi answered well, but I still want to add to it.
A standalone is an instance of mongod that runs on a single server but is not part of a replica set. Standalone instances are used for testing and development, but it is always recommended to use replica sets in production.
A single-node replica set would have the oplog, which records all changes to its data sets. This means that you'll use more disk space to store the oplog, and also any insert/update operation will be written to the oplog as well (write amplification). It also supports point-in-time recovery.
Please follow Convert a Standalone to a Replica Set if you would like to convert the standalone database to a replica set.
Transactions were introduced in MongoDB version 4.0. Starting in version 4.0, for situations that require atomicity of updates to multiple documents or consistency between reads of multiple documents, MongoDB provides multi-document transactions for replica sets. Transactions are not available on standalone deployments because they require the oplog to maintain strong consistency within a cluster.
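As an illustration, a multi-document transaction with the Node.js driver looks roughly like this (the database and collection names are made up):

```javascript
const { MongoClient } = require('mongodb');

async function transfer() {
  const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0');
  await client.connect();
  const session = client.startSession();
  try {
    // withTransaction commits on success and retries on transient errors.
    await session.withTransaction(async () => {
      const db = client.db('shop');
      await db.collection('orders').insertOne({ item: 'widget', qty: 1 }, { session });
      await db.collection('stock').updateOne(
        { item: 'widget' },
        { $inc: { qty: -1 } },
        { session }
      );
    });
  } finally {
    await session.endSession();
    await client.close();
  }
}

transfer().catch(console.error);
```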

MongoDB standalone vs replica set and how to migrate data from a standalone to a replica set

I have a few questions about MongoDB standalone and replica sets; I don't really get it.
When should I use either of them?
Why do all the replica set tutorials show 3 connections? Is there a reason?
Can I create a replica set with only 1 instance? And in that case, how is it different from a standalone MongoDB instance?
How do I migrate data from a standalone instance to a replica set?
I'm asking all these questions because recently I was trying to implement transactions, and sessions can only be started on replica sets. I don't really get the difference at all.
When should I use either of them?
Replication is the process of synchronizing data across multiple servers. Replication provides redundancy and increases data availability with multiple copies of data on different database servers. Replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
To keep your data safe
High (24*7) availability of data
Disaster recovery
No downtime for maintenance (like backups, index rebuilds, compaction)
Read scaling (extra copies to read from)
Replica set is transparent to the application
Why do all the replica set tutorials show 3 connections? Is there a reason?
The basic implementation that takes full advantage of replication calls for at least one primary node with two secondary nodes, which is why the examples always show 3 nodes. Not only that: with 3 nodes, if the primary goes down, the remaining 2 nodes can still form a voting majority and elect a new primary, leaving you one primary and one secondary for high availability.
Can I create a replica set with only 1 instance? And in that case, how is it different from a standalone MongoDB instance?
It does not make sense to have a single instance with Mongo replication.
How do I migrate data from a standalone instance to a replica set?
Follow Convert a Standalone to a Replica Set. Your existing data will be replicated to all replica set instances once they are up and running after the conversion from standalone.
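The linked procedure boils down to a few steps; a sketch (the ports, paths, and hostnames are placeholders):

```
# 1. Shut down the standalone mongod, then restart it with a replica set name:
mongod --port 27017 --dbpath /data/db --replSet rs0

# 2. In the mongo shell, initiate the replica set; the existing data is kept:
rs.initiate()

# 3. Once the other mongod processes are running, add them as members:
rs.add("mongo2.example.net:27017")
rs.add("mongo3.example.net:27017")
```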

How do I use Read Replicas?

I've read all the docs on the Google Cloud SQL site, and I now understand how to create and manage Read Replicas, but I have not seen any information about how to use them.
Does Google automatically load-balance connections between all instances?
Do I have to manually connect to a specific Read Replica to avoid hitting the Master? If so, do I have to manage reconnecting on replica failure myself?
Does Google automatically load-balance connections between all instances?
No, it doesn't. Each instance is independent. You can connect to the replicas and use them for reads while using the master for reads/writes, but you need to design that logic into your application.
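A minimal sketch of that logic in Node.js with the pg library, assuming a Postgres instance (the addresses and credentials are placeholders):

```javascript
const { Pool } = require('pg');

// One pool per instance: writes go to the master, reads to a replica.
const master = new Pool({ host: '10.0.0.1', database: 'app', user: 'app', password: 'secret' });
const replica = new Pool({ host: '10.0.0.2', database: 'app', user: 'app', password: 'secret' });

async function createUser(name) {
  // Writes (and reads that must see your own writes) go to the master.
  return master.query('INSERT INTO users(name) VALUES($1) RETURNING id', [name]);
}

async function listUsers() {
  // Reads that can tolerate replication lag go to the replica.
  return replica.query('SELECT id, name FROM users ORDER BY id');
}
```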
Do I have to manually connect to a specific Read Replica to avoid hitting the Master? If so, do I have to manage reconnecting on replica failure myself?
Yes, you have to connect to a specific read replica. Right now you can't even save and reuse the instance IP like you can with Compute Engine instances (sigh, I hope they fix this soon...).
There is now a failover replica option that you can use so you don't need to connect to the read replica yourself, but it only activates on failure; it is not a load balancer.
Read replicas can be used by setting up ProxySQL. You can configure ProxySQL to distribute the database queries. Here is a community tutorial providing more details on the architecture and a configuration example.
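As a sketch of the idea, ProxySQL is configured through SQL-like statements on its admin interface; the hostgroup numbers and addresses below are illustrative:

```sql
-- Hostgroup 10 = the master, hostgroup 20 = the read replicas
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '10.0.0.1', 3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, '10.0.0.2', 3306);

-- Route SELECTs to the replicas; everything else defaults to the master
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT', 20, 1);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```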
How do I use Read Replicas?
Use them for disaster recovery, or to migrate your database to another region by promoting a read replica to become a primary database.
https://cloud.google.com/sql/docs/postgres/replication/cross-region-replicas
Use them for separating read workloads from production workloads. This blog post covers using Read Replicas for analytics workloads:
Use Cloud SQL Read Replicas to separate your analytics and production workloads
Cloud SQL does not provide load balancing between replicas.
ref: https://cloud.google.com/sql/docs/sqlserver/replication

MongoDB data replication in Kubernetes

I've been configuring pods in Kubernetes to hold a mongodb and golang image each, with a service to load-balance. The major issue I am facing is data replication between databases. Replication controllers/replica sets do not seem to do what the name implies; rather, they spin up blank-slate copies instead of replicas of existing/currently running pods. I cannot seem to find any examples or clear answers on how Kubernetes addresses this, or whether it even does.
For example, data insertions sent by the Go program will automatically be load-balanced to one of X replicated mongodb instances by the service. This poses a problem, since the instances will all maintain separate documents with no relation to one another once Kubernetes begins to balance connections among the pods. Is there a way to address this in Kubernetes, or does it require a complete rewrite of the Go code to expect data replication among numerous available databases?
Sorry, I'm relatively new to Kubernetes and couldn't seem to find much information regarding this.
You're right: a Kubernetes replica set is not a replica of another container; it's just a container with the same configuration spun up within the same logical unit.
A replica set (or deployment, which is the resource you should be using now) will have multiple pods, and it's up to you, the operator, to configure the mongodb part.
I would recommend reading this example of how to set up a replica set with multiple mongodb containers:
https://medium.com/google-cloud/mongodb-replica-sets-with-kubernetes-d96606bd9474#.e8y706grr
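The usual shape of such a setup is a headless Service plus a StatefulSet, so that each mongod pod gets a stable DNS name (mongo-0.mongo, mongo-1.mongo, ...) the MongoDB replica set can use. A minimal sketch; the names, image, and sizes are illustrative, and rs.initiate() still has to be run once against the pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongo
spec:
  clusterIP: None            # headless: stable per-pod DNS, no load balancing
  selector:
    app: mongo
  ports:
    - port: 27017
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: mongo
  replicas: 3
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      containers:
        - name: mongod
          image: mongo:4.4
          command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```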

Autoscaling limited by RDS connection

I have some nightly jobs that run on EC2, and the number of machines is scaled by the number of messages in SQS. My process requires reads from a Postgres RDS database. These are the issues I am facing:
I am not able to scale beyond a certain number of machines because no more database connections are available.
I tried creating a connection pool using pgbouncer, and tried different settings as well, but a lot of data is missing from the result set.
Make your PostgreSQL RDS installation Multi-AZ. Then you can create read replicas on demand and scale read performance with your load.
To answer the comments:
Some extra "plumbing" is required to make the connections to the read replica. Maybe route53 dynamically updated records as the scaling happens or something like haproxy
The reason I mention multi AZ is that this would help prevent downtime during an auto scaling event bringing up the read replica
It would be simpler (but more costly) to permanently bring up a read replica and use DNS round robin to share the load
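A hypothetical haproxy.cfg fragment for the HAProxy option mentioned above (the replica endpoints are placeholders):

```
listen postgres_read
    bind *:5433
    mode tcp
    balance roundrobin
    server replica1 replica1.xxxx.us-east-1.rds.amazonaws.com:5432 check
    server replica2 replica2.xxxx.us-east-1.rds.amazonaws.com:5432 check
```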
See https://aws.amazon.com/blogs/aws/amazon-rds-announcing-read-replicas/ for information on read replicas