MongoDB replica set over a slow internet connection - mongodb

Here's my problem:
I have a MongoDB replica set that I will have to run over extremely slow connections (mainly in inner Africa). Has anyone here gone through something similar?
If so, could you tell me how much replication throughput you got on such a line?
If not, can you give me estimates?
I'll be testing it out soon, but I'd really prefer to be prepared. I do know it won't be slower than MySQL on these...
Thanks for your replies.

The type of bandwidth you'll need depends on the size of the object you're inserting or the size of the updates you're making to existing objects; and obviously how many operations you're performing a second. So we need to know more about the structure of your objects to give an idea of performance.
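To make that dependency concrete, here is a back-of-the-envelope sketch. The function name and the workload numbers are made up for illustration, and the real size of an oplog entry will differ from the size of the document itself, so treat the result as a rough order of magnitude only.

```python
def replication_bandwidth_kbps(ops_per_second: float, avg_oplog_entry_bytes: float) -> float:
    """Rough sustained bandwidth (kilobits per second) needed to ship the oplog."""
    bytes_per_second = ops_per_second * avg_oplog_entry_bytes
    return bytes_per_second * 8 / 1000


# Hypothetical workload: 20 writes per second, about 500 bytes of oplog each.
needed = replication_bandwidth_kbps(20, 500)
link_kbps = 128  # hypothetical capacity of the slow link
print(f"need ~{needed:.0f} kbit/s, link offers {link_kbps} kbit/s")
```

If the required figure is close to (or above) what the link can sustain, the secondary will fall further and further behind.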
See this blog post (and others from Kristina) for details on the inner workings of the oplog, so that you understand what is actually being replicated: http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/
What you're going to battle with is unreliable connections. In my experience, MongoDB doesn't handle unreliable connections well. I've run replication between the US and UK and have had numerous problems where replication dies and simply doesn't start again without manual intervention.
If you have large databases, you need to consider what you're going to do if you have to resync your secondaries from scratch as it may take too long to bring them back online if you're on slow connections.
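A quick way to judge whether a full resync is realistic is to estimate the raw transfer time. This is only a sketch: the function name, the 70% efficiency factor and the example figures are assumptions, and it ignores index builds and any writes that arrive while the copy is running.

```python
def full_resync_hours(data_size_gb: float, link_kbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to copy data_size_gb over a link, using a fraction of its capacity."""
    total_bits = data_size_gb * 8 * 1000**3
    usable_bits_per_second = link_kbps * 1000 * efficiency
    return total_bits / usable_bits_per_second / 3600


# Hypothetical case: a 10 GB database over a 512 kbit/s line.
print(f"{full_resync_hours(10, 512):.1f} hours")
```

If the answer comes out in days rather than hours, it may be worth planning to seed a new secondary from a physical copy of the data instead of over the wire.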

Related

What are the possible use cases of the OrientDb Live Query feature?

I apologise if the question is naive. I wanted to understand what could be a few possible use cases of the live query feature.
Let's say - My database state changes but it doesn't change every minute (or hour). If I execute a live query against my database/class/cluster, I'm not really expecting the callback to be called anytime soon. But, hey, I would still want to be notified when there's a state change.
My need with OrientDB is more along the lines of ElasticSearch's percolator bundled with a publish-subscribe system.
Is live query meant to cater to such use cases too? Or is my understanding of live query very limited? What could be a few possible use cases for the live query feature?
Thanks!
Whether or not Live Queries will be appropriate for your use case depends on a few things. There are several reasons why live queries can make sense. A few questions to ask are:
How frequently does the data change?
How soon after the data changes do you need to know about it?
How many different groups of data (e.g. classes, clusters) do you need to deal with?
How many clients are connected to the server?
If the data does not change very often, or if you can wait a set period of time before an update, or you don't have many clients (hitting the DB directly), or if you only have one thing feeding the database, then you might want to just do polling. There is a balance between holding a connection open that you send a message on very infrequently (live queries) and polling too often.
For example, it's possible that you have an application server (Tomcat, Node, etc.) and that your clients connect via web sockets. Now let's say your app server makes one (or a few pooled) live queries to the database, and your database has an update. The update might just go from the database to the app server (e.g. Node). Node may then be responsible for fanning out that message across 100 web sockets (one for each connected client). In this case, the fact that Node holds a persistent connection to the database with a live query open is not that big of a deal.
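The fan-out arrangement described above can be sketched like this. It is a toy in-memory model, not OrientDB's or any web socket library's API: `FanOutHub`, `add_client` and `on_db_change` are hypothetical names, and the callbacks stand in for real socket sends.

```python
from typing import Callable, Dict


class FanOutHub:
    """One upstream subscription, many downstream client callbacks."""

    def __init__(self) -> None:
        self._clients: Dict[str, Callable[[dict], None]] = {}

    def add_client(self, client_id: str, send: Callable[[dict], None]) -> None:
        self._clients[client_id] = send

    def remove_client(self, client_id: str) -> None:
        self._clients.pop(client_id, None)

    def on_db_change(self, change: dict) -> int:
        """Called once per database notification; returns how many clients were notified."""
        notified = 0
        for send in list(self._clients.values()):
            send(change)
            notified += 1
        return notified


hub = FanOutHub()
inbox = []  # stands in for 100 open web sockets
for i in range(100):
    hub.add_client(f"client-{i}", inbox.append)

# A single notification from the database reaches every connected client.
print(hub.on_db_change({"class": "Order", "op": "UPDATE"}), len(inbox))
```

The database only ever sees one subscriber here, no matter how many browsers are connected to the app server.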
The question is: if you have thousands of clients connected, do they all need an immediate update? If so, are you planning on having them poll at a short interval? If so, you could probably benefit from a live query. Lots of clients polling at a short interval will generate a lot of unnecessary traffic and queries.
Unfortunately, at the end of the day, the answer is "it depends". You probably need to prototype and then instrument under load to see what your tradeoffs are. But in principle, it is less about how frequently updates come, and more about how often you would have clients poll and how many clients you have. If the answer is "short intervals and a lot of clients", give live queries a try.
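To put rough numbers on "short intervals and a lot of clients", here is a small sketch comparing the two approaches. The function names and the figures are hypothetical, and it assumes every change is pushed to every client, which is the worst case for the push side.

```python
def polling_queries_per_hour(clients: int, poll_interval_s: float) -> float:
    """Queries hitting the database per hour when every client polls on a timer."""
    return clients * 3600 / poll_interval_s


def push_messages_per_hour(clients: int, changes_per_hour: float) -> float:
    """Messages sent per hour when each change is pushed to each client."""
    return clients * changes_per_hour


# Hypothetical load: 5000 clients, polling every 2 seconds, 10 real changes per hour.
print(polling_queries_per_hour(5000, 2))
print(push_messages_per_hour(5000, 10))
```

With data that rarely changes, almost every one of those polls comes back empty, which is the unnecessary traffic mentioned above.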

Experimenting with high concurrent connection server

I am trying to build a server which can handle as many concurrent connections as possible. (100k at least, for a start)
Right now, when I test it over the LAN, it can go up to 50k+ concurrent connections easily (I have not tested more yet). However, when I test it from outside my LAN, it never goes beyond about 8k...
To be more precise, when going past 8k, the first sockets no longer receive any data, as if the new ones replaced them...
Does anyone have any idea what could cause this?
I have done some research, and it seems, although it isn't clear, that routers/modems may have a limited amount of supported concurrent connections, is that true?
If so, and if that's my problem, do I have to get one that can support more? Or get rid of it somehow?

How to connect meteor to an existing backend?

I recently discovered Meteor, and I really love the simplicity that it brings to programming new apps. My question is: how do you connect it to an existing back-end? We have a substantial amount of existing Clojure code, also running with MongoDB. What I would like to do is use Meteor to build the front-end of my app. I guess I could connect my Meteor app directly to the MongoDB instance of the back-end, but this does not seem like a good practice... or is it?
Another option I imagined was to access the DB from either the webapp or the Clojure code and create a separate way of communication between the two with a queue mechanism, or sockets. Any hint or pointer to relevant documentation would be helpful!
Take a look at Meteor's environment variable settings. By setting these variables you can easily define an external MongoDB instance. In particular, it would be:
$ export MONGO_URL="mongodb://yourmongodbserver/your-db"
There is a screencast on eventedmind.com about this specific topic, https://eventedmind.com/feed/sg3ejYnmhxpBNoWan, which is quite useful.
Regarding how to point them at the same database, @Michael's answer is spot on; just point your Meteor web servers at the same MongoDB.
Regarding whether or not you should, that depends on your situation. Having everything run off the same DB certainly simplifies things.
Having separate dbs can potentially reduce the load on your db tier as you could selectively choose which writes/updates to replicate between the clojure and Meteor dbs.
One issue with either method is speed of notification of changes. Currently, Meteor servers poll the DB every 10 secs to recognize changes. Happily, once the oplog branch gets merged into master, it will give a large speed improvement in how quickly external changes made in the DB (as opposed to directly through a Meteor server) are reflected in the Meteor clients. The oplog support will enable Meteor servers to emulate a replica-set instance, tailing the oplog which will mean practically instant notification of db changes.
Using a queue as a middle-ware layer introduces complexity and adds another point of failure. It also increases latency of notification. These issues can be mitigated, though, and there may be other pieces of your infrastructure in the future that would benefit from such a middle-ware queue. For example, other interested systems could register with the queue to receive notification of changes without querying or needing to know about your db. You can also scale your MongoDB instances independently and tune the queue to determine what "eventually" means in the "eventually consistent" guarantee.
I think the questions to ask are:
how much overlap is there between the clojure dataset and the Meteor dataset
how quickly do you need changes to be reflected between the two
will a middle-ware queue be useful in other circumstances as you grow
Regarding possible queue technologies to look into, I've heard very good things about RabbitMQ. The Oct. 2013 talk at the Clojure NYC meetup included a description of switching to RabbitMQ from Amazon SQS due to latency issues with SQS and anecdotally RabbitMQ has been rock-solid for them.

How to implement real-time replication of MongoDB (or CouchDB) to many remote clients

I'm considering how to design a mechanism for replicating a (potentially large) MongoDB or other NoSQL (CouchDB, etc) database to dozens of clients at once. The clients would function like a replica set, but the replication would be one-way and the remote clients would belong to other parties. Specifically, I am looking for the following features:
real-time: changes to the master database should be pushed out to the clients as quickly as possible
replication to new clients: a new client must be able to connect, automatically sync the majority of existing data, then receive real-time updates.
efficient: both the initial synchronization/transfer of data and tracking of real-time updates ("diffs", if you will) are computationally efficient, with multiple clients connected.
secure: the master database presents an interface to which remote clients (who do not belong to the same owner or system) can connect: i.e., we cannot just add all the clients to the master's replica set.
robust: a temporary connection failure between a client and the master database should be easily and efficiently recoverable.
In some sense, the server is publishing a collection of data and the clients are subscribing to it. I realize that this is a hard software engineering problem, and to my knowledge no piece of software has implemented this exactly yet. However, some approaches have come to mind as close, which I'll list below.
Meteor's DDP protocol: It's designed to do this with Mongo-like collections and exactly implements the model of publishing and subscribing to a set of data (rather than a stream of messages). It manages the initial sync and sends along live changes. However, it's still in development and far from being an industrial-strength solution. Current drawbacks are that the server keeps a copy of every client's state in a possibly inefficient way and that it is only tested on collections that can fit in the memory of a web app. Also, it appears that DDP cannot efficiently sync an out-of-date database without fetching everything from scratch. If anyone can point to some examples of how large a collection can be synced over DDP, that would be great. (See also: https://stackoverflow.com/q/10128430/586086)
Broadcasting the Mongo oplog: Using a high-throughput message bus like Apache Kafka, one may be able to efficiently send the oplog to many clients at once. This tackles some of the system implementation challenges. However, this requires that the clients start with an initial sync that gets them close enough to the current master state somehow and then start replaying the oplog from the appropriate point.
Continuous replication a la CouchDB: I'm not sure how this is implemented and how robust it is, given the sparsity of the documentation. However, it does seem to work over remote database connections. How efficient is this, though, when multiple clients are trying to replicate at the same time? (A similar hack to this would be to make the clients MongoDB Priority 0 replica set members; however, that seems to be far from its intended use. See also: http://guide.couchdb.org/draft/replication.html)
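To illustrate what "initial sync, then replay from the appropriate point" means in the oplog option above, here is a toy sketch. It uses a plain list of tuples as a stand-in for the oplog (not MongoDB's real entry format), assumes the log is ordered by timestamp, and `replay` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

# (timestamp, operation, key, value): a simplified stand-in for real oplog entries
LogEntry = Tuple[int, str, str, object]


def replay(state: Dict[str, object], log: Iterable[LogEntry], after_ts: int) -> int:
    """Apply entries newer than after_ts to state; return the last timestamp applied."""
    last = after_ts
    for ts, op, key, value in log:
        if ts <= after_ts:
            continue  # already reflected in the client's copy
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        last = ts
    return last


log = [(1, "set", "a", 1), (2, "set", "b", 2), (3, "set", "a", 5), (4, "delete", "b", None)]

# The client starts from a snapshot taken at timestamp 2, then catches up.
client_state = {"a": 1, "b": 2}
position = replay(client_state, log, after_ts=2)

# After a dropped connection the client resumes from its last position.
log.append((5, "set", "c", 9))
position = replay(client_state, log, after_ts=position)
print(client_state, position)
```

The only thing the client has to persist for recovery is its last applied timestamp, which is what makes a reconnect cheap as long as the log still reaches back that far.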
Please give pointers to software or pieces of software that already implement parts of this, or suggestions on the algorithms/data structures needed to do this efficiently.
If you are looking specifically for real-time replication, I'd recommend you look into SaaS offerings specifically for this purpose, such as https://www.firebase.com/

Postgres 9.0 and pgpool replication : single point of failure?

My application uses PostgreSQL 9.0 and is composed of one or more stations that interact with a global database. It is like a common client-server application, but to avoid any additional hardware, every station includes both client and server: a main station is promoted to also act as the server, and all the others act as clients to it. This lets me scale: a user may initially need a single station and can decide to expand to more in the future, without a useless separate server in the initial phase.
I'm trying to avoid all the other stations stopping if the main station goes down. The best solution could be to continuously replicate the main database to an unused database on one or more stations.
Searching, I found that pgpool can be used for my needs, but from all the examples and tutorials it seems that the point of failure moves from the main database to the server that runs pgpool.
I read something about multiple pgpool instances and a heartbeat tool, but it isn't clear how to set that up.
Considering my architecture, where there are no separate, specialized servers, can someone give me some hints? In case of failover it seems that pgpool does everything automatically. Can I assume that a failover can be handled by a standard user, without the intervention of an administrator?
For this kind of application I really like Amazon's Dynamo design. The linked document is quite big, but it is worth reading. In fact, there are applications that already implement this approach:
MongoDB
Cassandra
Project Voldemort
There may be others that I'm not aware of. Cassandra started within Facebook; Voldemort is the one used by LinkedIn. By making things distributed and adding redundancy to your data distribution, you step away from traditional master-slave replication approaches.
If you'd like to stay with PostgreSQL, it shouldn't be a big deal to implement such an approach. You will need to implement an extra layer (a proxy) that decides, based on pre-configured options, how to retrieve and save the data.
The proxying layer can be implemented:
1. in the application (requires lots of work IMHO);
2. in the database;
3. as middleware.
You can use PL/Proxy, a project that originated at Skype, for the middleware layer. It is deeply integrated into PostgreSQL, so I'd say it is a combination of options 2 and 3. PL/Proxy will require you to use functions for all kinds of queries against the database.
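As a rough idea of what such a proxy layer decides, here is a minimal sketch of picking which stations hold the copies of a given key. This is not PL/Proxy's API; `pick_replicas`, the station names and the choice of an MD5-based hash are all assumptions made for illustration.

```python
import hashlib
from typing import List


def pick_replicas(key: str, nodes: List[str], copies: int = 2) -> List[str]:
    """Choose `copies` distinct nodes for a key using a stable hash of the key."""
    if not 1 <= copies <= len(nodes):
        raise ValueError("copies must be between 1 and the number of nodes")
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "big") % len(nodes)
    # Walk the node list from the hashed position, wrapping around at the end.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


stations = ["station-1", "station-2", "station-3"]  # hypothetical station names
print(pick_replicas("customer:42", stations))
```

Because the hash is stable, every station computes the same placement for the same key without asking a central server. Note that with this simple modulo scheme, adding or removing a station moves most keys, which is one of the problems the Dynamo design addresses.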
If you hit performance issues, PgBouncer can be used.
Last note: whichever way you decide to go, a certain amount of development will be required.
EDIT:
It all depends on what you call “failure” and what you consider to be an interrupted state of the system.
Let's look at the pgpool features.
Connection Pooling: PostgreSQL uses a single process (fork) per session. Obviously, if you have a very busy site, you'll hit the OS limit. To overcome this, connection poolers are used. They also allow you to use your resources evenly, so it's generally a good idea to have a pooler in front of your database. In case of a pgpool outage, a large number of clients will be unable to reach your database. If you point them directly at the database, bypassing the pooler, you'll face performance issues.
Replication: All your queries will be auto-replicated to the slave instances. This matters for DML and DDL queries. In case of a pgpool outage, your replication will stop and the slaves will not be able to catch up with the master, as there's no change tracking done outside pgpool (as far as I know).
Load Balance: Your read-only queries will be spread across several instances, giving nice response times and allowing you to put more load on the system. In case of a pgpool outage, your queries will suddenly run much slower, if the system is capable of handling such a load at all. And that is assuming the master database picks up the load in place of the failed pgpool.
Limiting Exceeding Connections: pgpool will queue connections that cannot be processed immediately. In case of a pgpool outage, all such connections will be aborted, which might break the DB/application protocol, e.g. if the application was designed to never see connection aborts.
Parallel Query: A single query is executed on several nodes to reduce response time. In case of a pgpool outage, such queries will not be possible, resulting in longer processing.
If you're fine with facing such conditions and you don't treat them as a failure, then pgpool can serve you well. But if 5 minutes of outage will cost your company several thousand dollars, then you should look for a more solid solution.
The higher the cost of an outage, the more finely tuned the failover system should be.
Typically, it is not just a single tool that is used to achieve failover automation.
In each failure you will have to:
tweak DNS, unless you want to reconfigure all clients;
re-initialize backups and failover procedures;
make sure the old master will not try to fight for its role in case it comes back (STONITH).
In my experience, it is people from the DBA, SysAdmin, Architecture and Operations departments who decide on the proper strategies.
Finally, in my view, pgpool is a good tool, and I do use it. But it is not designed as a complete failover solution, not without extra thinking, measures taken and scripts written. That is why I've provided links to the distributed databases: they provide a much higher level of availability.
And PostgreSQL can be made distributed with a little effort thanks to its great extensibility.
First of all, I'd recommend checking out pgBouncer rather than pgpool. Next, what level of scaling are you attempting to reach? You might just choose to run your connection pooler on all your client systems (bouncer is light enough for this to work).
That said, vyegorov's answer is probably the direction you should really be looking at in this day and age. Are you sure you really need a database?
EDIT
So, the rather obvious answer is that pgPool creates a single point of failure if you only have one box running it. The obvious solution is to run multiple poolers across multiple boxes. You then need to engineer your application code to handle database disconnections. This is not as easy as it sounds, but basically you need to use 2-phase commit for non-idempotent changes. So to the greatest extent possible you should make your changes idempotent.
Based on your comments, I'd guess that maybe you have limited experience dealing with database replication? pgPool does statement based replication. There are tradeoffs here. The benefit is that it's very easy to set up. The downside is that there is no guarantee that data on the replicated databases will be identical. It is also (I believe but haven't checked lately) not compatible with 2pc.
My prior comment asking if you really need a database was driven by my perception that you have designed a system without going into much detail around this part of it. I have about 2 decades experience working on "this part" of similar systems. I expect you will find that there are no out of the box solutions and that the issues involved get very complicated. In other words, I'm suggesting you re-consider your design.
Try reading this blog (with lots of information about PostgreSQL and PgPool-II):
https://www.itenlight.com/blog/2016/05/21/PostgreSQL+HA+with+pgpool-II+-+Part+5
Search for "WATCHDOG" on that same blog. With that you can configure a PgPool-II cluster. Two machines on the same subnet are required, though, and a virtual IP on the same subnet.
Hope that this is useful for anyone trying the same thing (even if this answer is a lot late).
PGPool certainly becomes a single point of failure, but it is a much smaller one than a Postgres instance.
Though I have not attempted it yet, it should be possible to have two machines with PGPool installed, but only running on one. You can then use Linux-HA to restart PGPool on the standby host if the primary becomes unavailable, and to optionally fail it back again when the primary comes back. You can at the same time use Linux-HA to move a single virtual IP over as well, so that your clients can connect to a single IP for their Postgres services.
Death of the postgres server will make PGPool send queries to the backup Postgres (promoting it to master if necessary).
Death of the PGPool server will cause a brief outage (configurable, but likely in the region of <1min) until PGPool starts up on the standby, the IP address is claimed, and a gratuitous ARP sent out. Of course, the client will have to be intelligent enough to reconnect without dying.
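A client that is "intelligent enough to reconnect" can be as simple as a retry loop with a growing delay. This is a generic sketch: `connect_with_retry` is a hypothetical helper, and it assumes the connect function raises `ConnectionError` on failure; with a real PostgreSQL driver you would catch that driver's own error type instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(delay)
            delay *= 2
    raise ValueError("attempts must be at least 1")
```

With the default settings the client waits 1, 2, 4 and 8 seconds between tries, so it gives up after roughly 15 seconds. For the sub-minute failover window described above you would raise `attempts` or `base_delay_s` so that the total wait covers the expected outage.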