I am using MongoDB via its driver for Node.js.
I typically open a connection (via the connect() method) any time I need to perform an operation and close it (via the close() method) as soon as I am finished. Naturally, my programs perform many operations against MongoDB, so I end up opening and closing the connection many times.
I am wondering whether this is good practice, or whether it would be better to open the connection when the first operation is executed, store it in a variable, reuse the already opened connection for the following operations, and close it when the program ends.
Any advice is very much appreciated.
It is best practice to open the connection once, store it in a variable and close it at the end. MongoDB explicitly recommends this. This is the reason why opening and closing a connection is part of the MongoDB API rather than have it happen automatically for each query.
Opening and closing a connection for each query introduces significant overhead in terms of performance (CPU and latency), network traffic, and memory management (creating and deleting objects), not only for the client but also for the server itself, which in turn affects other clients.
About the terminology of connection: in some drivers like Java, what is actually created and stored in a variable is not a physical connection, but a MongoClient instance. It looks like a connection from an abstract (API) perspective, but it actually encapsulates the actual physical connection(s) and hides complexity from the user.
Creating the MongoClient instance only once, for the drivers that support this, will also allow you to benefit from connection pooling where the driver maintains active connections in parallel for you, so that you also only need to create one MongoClient instance across multiple threads.
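As a minimal sketch of the "open once, reuse, close at the end" pattern in Node.js: the helper name `createSharedClient` and the stand-in `openClient` function below are hypothetical and not part of the MongoDB driver. In a real program, `openClient` would wrap whatever connect call your driver version provides.

```javascript
// Hypothetical helper: opens the client on first use and reuses it afterwards.
function createSharedClient(openClient) {
  let clientPromise = null; // cached promise, so concurrent callers share one open

  return {
    // Returns a promise for the shared client, opening it only once.
    get() {
      if (clientPromise === null) {
        clientPromise = Promise.resolve().then(openClient);
      }
      return clientPromise;
    },
    // Closes the shared client, if it was ever opened.
    async close() {
      if (clientPromise === null) return;
      const pending = clientPromise;
      clientPromise = null;
      const client = await pending;
      await client.close();
    },
  };
}

// Stand-in for a real driver connection, so the sketch runs on its own.
let timesOpened = 0;
const shared = createSharedClient(async () => {
  timesOpened += 1;
  return { closed: false, async close() { this.closed = true; } };
});

async function demo() {
  const first = await shared.get();  // opens the client
  const second = await shared.get(); // reuses the same client
  await shared.close();              // once, when the program is shutting down
  return { sameClient: first === second, timesOpened, closed: first.closed };
}
```

Note that this sketch caches a failed open as well; a production version would probably reset the cached promise when `openClient` rejects.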
It's my understanding that you can use prepared statements or connection pooling (with tools like pgPool/pgBouncer) with PostgreSQL, but can benefit from only one of them at a time (at least with the Npgsql driver for .NET; in addition, the library author suggests turning off client-side connection pooling when using PgBouncer). Am I right?
If so, is this true for other runtimes and languages, like Java, Python, Go? Or is it an implementation-specific issue?
It's a complex question, but here are some answers.
As @laurenz-albe writes, you can use pgbouncer and prepared statements but need to use session pooling. This allows you to use prepared statements for the duration of your connection (i.e. as long as your NpgsqlConnection instance is open). However, if you're in a short-lived connection scenario (e.g. web app which opens and closes a connection for each HTTP request), you're out of luck. In this sense, one could say that pooling and prepared statements aren't compatible.
However, if you use Npgsql's internal pooling mechanism (on by default) instead of pgbouncer, then your prepared statements are automatically persisted across connection open/close. In other words, when you call NpgsqlCommand.Prepare(), if the physical connection happened to have already prepared the SQL, then the prepared statement is reused. This is done specifically to unlock the speed advantages of prepared statements for short-lived connection scenarios. This is a pretty unique behavior of Npgsql, see the docs for more info.
This is one of the advantages of an in-process connection pool, as opposed to an out-of-process pool such as pgbouncer - Npgsql retains information on the physical connection as it is passed around, in this case a table of which statements are prepared (name and SQL).
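The idea of a physical connection carrying its own table of prepared statements can be illustrated with a toy model. This is only a sketch of the concept, not Npgsql's implementation; the class names, the `stmt_N` naming scheme, and the `serverPrepares` counter are all invented for illustration.

```javascript
// Toy model only: a physical connection that remembers what it has prepared.
class PhysicalConnection {
  constructor() {
    this.prepared = new Map(); // SQL text -> server-side statement name
    this.serverPrepares = 0;   // counts simulated "prepare" round-trips
  }

  // Prepares the SQL only if this physical connection has not seen it before.
  prepare(sql) {
    if (!this.prepared.has(sql)) {
      this.serverPrepares += 1;
      this.prepared.set(sql, `stmt_${this.prepared.size + 1}`);
    }
    return this.prepared.get(sql);
  }
}

// Toy in-process pool: closing returns the connection with its state intact.
class InProcessPool {
  constructor() {
    this.idle = [];
  }
  open() {
    return this.idle.pop() || new PhysicalConnection();
  }
  close(conn) {
    this.idle.push(conn); // the prepared-statement table is kept, not reset
  }
}

const toyPool = new InProcessPool();
const userQuery = "SELECT * FROM users WHERE id = $1";

// First short-lived "request": prepares on the server.
const conn1 = toyPool.open();
const name1 = conn1.prepare(userQuery);
toyPool.close(conn1);

// Second "request": gets the same physical connection and reuses the statement.
const conn2 = toyPool.open();
const name2 = conn2.prepare(userQuery);
toyPool.close(conn2);
```

An out-of-process pooler sits between the client and the server and, in this model, has no access to that per-connection table, which is why the in-process pool can do something it cannot.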
I think this is a generic question, so I'll give a generic answer. What aspect is applicable to a specific connection pool implementation will probably vary.
There are several modes of connection pooling:
A thread retains a connection for the duration of a session (session pooling):
In that case, persistent state like prepared statements can be held for the duration of the session, but you should clean the state when the session is returned to the pool.
A thread retains a connection for the duration of a database transaction (transaction pooling):
In that case, you'd have to clean the state after each transaction, so prepared statements don't make much sense.
A thread retains a connection for the duration of a statement (statement pooling):
This is only useful in very limited cases where you don't need transactions spanning more than a single statement. Obviously, no state like prepared statements can be shared.
It depends what kind of connection pool you use. Basically, the longer a thread retains a connection, the more sense it makes to use prepared statements.
Of course, if you know what you are doing, you can also create a prepared statement right after the database connection is established and never deallocate it. This will only work if all threads need the same prepared statements. It is easy to screw up with a setup like that.
My server queries the db often.
But more often than not, the query retrieves unchanged data.
Therefore I would like to create and store a cached result.
My main MongoDB is hosted at a remote address and therefore takes slightly longer to respond than a local MongoDB instance would. I thought it would therefore be beneficial to have an additional, smaller, more static MongoDB running on localhost.
Real-time queries would then run on the remote main DB, while smaller, time-efficient queries would run on the cached collections on localhost for speed.
Is this something that can be done?
Is it something people recommend to avoid?
How would I set up two connections, one to my main remote server and one to my local server?
This seems wrong to me
var mongooseMain = require ('mongoose');
var mongooseLocal = require ('mongoose');
mongooseMain.connect(mainDBInfo.url);
mongooseLocal.connect(localDBInfo.url);
In principle, you have the right idea! Caching is a big part of building performant web applications.
First of all, MongoDB wants to cache everything it's using in memory and has a very well designed system for deciding what to keep in memory and what to toss out of its cache. When an object is asked for that is not in its cache, it has to read it from disk. When MongoDB reads from disk instead of memory, it's called a page fault.
Of course, this memory cache is on a remote server so you still have network latency to deal with.
To eliminate this latency, I would recommend saving the serialized objects you read from often, but rarely write to, in Redis. This is what Redis was built to do. It's basically a dictionary (key:value) which you can easily SET and GET from. You can run redis-server easily on your local machine and even use SETEX to set your objects to the dictionary with some unique key and an expiry for when it should be evicted from the cache.
You can also manually evict objects from the cache whenever they do get updated (I would recommend re-writing them to the cache at this moment). Then, whenever you need an object, just make sure you always try to read from your cache first and fall back to MongoDB if the cache returns null for a key.
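The "read from the cache first, fall back to the database" flow described above can be sketched as follows. To keep the example self-contained, an in-memory `TtlCache` class stands in for Redis and `loadUserFromDb` stands in for a MongoDB query; both are hypothetical, as is the `readThrough` helper. The expiry here is in milliseconds for simplicity, whereas Redis SETEX takes seconds.

```javascript
// In-memory stand-in for a key/value cache with expiry (assumption: not Redis).
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, which makes expiry easy to demonstrate
    this.entries = new Map();
  }
  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: behave as if it was never cached
      return null;
    }
    return entry.value;
  }
  delete(key) {
    this.entries.delete(key); // manual eviction, e.g. after an update
  }
}

// Try the cache first; on a miss, load from the database and cache the result.
async function readThrough(cache, key, ttlMs, loadFromDb) {
  const cached = cache.get(key);
  if (cached !== null) return JSON.parse(cached);
  const fresh = await loadFromDb(key);
  if (fresh !== null) cache.set(key, JSON.stringify(fresh), ttlMs);
  return fresh;
}

// Demo with a fake clock and a fake database that counts its reads.
let fakeNow = 0;
let dbReads = 0;
const demoCache = new TtlCache(() => fakeNow);
async function loadUserFromDb(key) {
  dbReads += 1;
  return { key, name: "Ada" };
}

async function cacheDemo() {
  await readThrough(demoCache, "user:1", 60000, loadUserFromDb);              // miss: hits the DB
  const hit = await readThrough(demoCache, "user:1", 60000, loadUserFromDb);  // served from cache
  const readsBeforeExpiry = dbReads;
  fakeNow += 60001;                                                           // let the entry expire
  await readThrough(demoCache, "user:1", 60000, loadUserFromDb);              // miss again
  return { hit, readsBeforeExpiry, readsAfterExpiry: dbReads };
}
```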
Check it out and good luck with your application!
My application has 2 sections mainly,
User interface written in angular which uses a Django python back end.
Heavy map reduce kind of process.
Both use Postgres for lookups, so my concern is that if I use the same connection pool for both, then while the map reduce is running its heavy lookups, my other application won't work because no connection is available. Is there any workaround for this? (Avoiding Postgres itself is in the backlog.)
PS: I am using pgbouncer for pooling
Simplest approach would be separating the two sections.
At least with respect to the connection resources.
(Whether e.g. memory consumption and gc would benefit from restructuring is not asked for)
You may achieve this using one of the following approaches:
1. Use two separate pools, one for each section.
   This way, you can set up each pool according to that section's connection requirements.
2. Change your code to maintain sufficient "free" resources for the other section.
   This is quite tedious and only useful once the resource requirements need fine-grained control depending on the internal state of the algorithms.
Usually you'd want to go with suggestion 1.
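To illustrate why separate pools isolate the two sections, here is a toy model of two bounded pools. The `BoundedPool` class and the sizes 5 and 20 are placeholders invented for this sketch; a real pool would hand out connection objects rather than just counting them.

```javascript
// Toy bounded pool: only tracks how many connections are checked out.
class BoundedPool {
  constructor(name, maxConnections) {
    this.name = name;
    this.maxConnections = maxConnections;
    this.inUse = 0;
  }
  // Returns true if a connection could be handed out, false if the pool is exhausted.
  tryAcquire() {
    if (this.inUse >= this.maxConnections) return false;
    this.inUse += 1;
    return true;
  }
  release() {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

const webPool = new BoundedPool("web", 5);          // UI / Django back end
const batchPool = new BoundedPool("mapreduce", 20); // heavy lookup job

// The batch job grabs every connection it is allowed to have...
let batchAcquired = 0;
while (batchPool.tryAcquire()) batchAcquired += 1;

// ...and the web section can still get one from its own pool.
const webStillServed = webPool.tryAcquire();
```

With PgBouncer, one way to approximate this is to have the two sections connect as different users or through different database entries, since PgBouncer keeps a separate pool per database/user pair.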
I would like to know if it is at all possible to have MongoDB failover using only a single address. I know replica sets are typically used for this, relying on the driver to make the switchover, but I was hoping there may be a solution out there that would allow one address or hostname to automatically change over when the MongoDB instance is recognized as being down.
Any such luck? I know there are solutions for MySQL, but I haven't had much luck with finding something for MongoDB.
Thanks!
Yes, it is possible. The driver holds a cached map of your replica set, which it queries for a new primary when the set goes through an election. This map is refreshed every so often. However, if your application restarts (the process quits, or each request runs in a fresh process as in PHP fork mode), the driver has no choice but to rebuild its map from the single address it was given. If that address is down at that point, you will suffer connectivity problems.
Of course the best thing to do is to add a seedlist.
Using a single IP defies the redundancy that is in-built into MongoDB.
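A seed list is simply several hosts in the connection string, so that the driver can discover the replica set even when one of them is down. The sketch below builds such a string in the standard `mongodb://host1,host2,.../db?replicaSet=name` form; the helper name `buildSeedListUri`, the host names, the database name, and the replica set name are all made up for illustration.

```javascript
// Builds a MongoDB connection string with a seed list of several hosts.
function buildSeedListUri(hosts, database, replicaSet) {
  if (hosts.length === 0) {
    throw new Error("at least one host is required");
  }
  const seedList = hosts.join(","); // comma-separated host:port pairs
  return `mongodb://${seedList}/${database}?replicaSet=${encodeURIComponent(replicaSet)}`;
}

// Hypothetical hosts: replace with the members of your own replica set.
const seedListUri = buildSeedListUri(
  ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
  "app",
  "rs0"
);
```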
We are running a java6/hibernate/c3p0/postgresql stack.
Our JDBC driver is 8.4-701.jdbc3.
I have a few questions about prepared statements. I have read an excellent document about prepared statements, but I still have a question about how to configure c3p0 with PostgreSQL.
At the moment we have
c3p0.maxStatements = 0
c3p0.maxStatementsPerConnection = 0
In my understanding the prepared statements and statement pooling are two different things:
Our hibernate stack uses prepared statements. Postgresql is caching the
execution plan. Next time the same statement is used, postgresql reuses the
execution plan. This saves time planning statements inside DB.
Additionally c3p0 can cache java instances of "java.sql.PreparedStatement"
which means it is caching the java object. So when using
c3p0.maxStatementsPerConnection = 100 it caches at most 100 different
objects. It saves time on creating objects, but this has nothing to do with
the postgresql database and its prepared statements.
Right?
As we use about 100 different statements I would set
c3p0.maxStatementsPerConnection = 100
But the c3p0 docs say, under "known shortcomings":
"The overhead of Statement pooling is too high. For drivers that do not perform significant preprocessing of PreparedStatements, the pooling overhead outweighs any savings. Statement pooling is thus turned off by default. If your driver does preprocess PreparedStatements, especially if it does so via IPC with the RDBMS, you will probably see a significant performance gain by turning Statement pooling on. (Do this by setting the configuration property maxStatements or maxStatementsPerConnection to a value greater than zero.)"
So: Is it reasonable to activate maxStatementsPerConnection with c3p0 and Postgresql?
Is there a real benefit activating it?
kind regards
Janning
I don't remember offhand if Hibernate actually stores PreparedStatement instances itself, or relies on the connection provider to reuse them. (A quick scan of BatcherImpl suggests it reuses the last PreparedStatement if executing the same SQL multiple times in a row)
I think the point that the c3p0 documentation is trying to make is that for many JDBC drivers, a PreparedStatement isn't useful: some drivers will end up simply splicing the parameters in client-side and then passing the built SQL statement to the database anyway. For these drivers, PreparedStatements are no advantage at all, and any effort to reuse them is wasted. (The PostgreSQL JDBC FAQ says this was the case for PostgreSQL before server protocol version 3, and there is more detailed information in the documentation.)
For drivers that do handle PreparedStatements usefully, it's still likely necessary to actually reuse PreparedStatement instances to get any benefit. For example if the driver implements:
Connection.prepareStatement(sql) - create a server-side statement
PreparedStatement.execute(..) etc - execute that server-side statement
PreparedStatement.close() - deallocate the server-side statement
Given this, if the application always opens a prepared statement, executes it once and then closes it again, there's still no benefit; in fact, it might be worse since there are now potentially more round-trips. So the application needs to hang on to PreparedStatement instances. Of course, this leads to another problem: if the application hangs on to too many, and each server-side statement consumes some resources, then this can lead to server-side issues. In the case where someone is using JDBC directly, this might be managed by hand: some statements are known to be reusable and hence are prepared; some aren't and just use transient Statement instances instead. (This is skipping over the other benefit of prepared statements: handling argument escaping.)
So this is why c3p0 and other connection pools also have prepared statement caches: they allow application code to avoid dealing with all this. The statements are usually kept in some limited LRU pool, so common statements reuse a PreparedStatement instance.
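A limited LRU pool of prepared statements can be sketched like this. It is a toy model of the idea, not c3p0's implementation: the `StatementCache` class is invented, and the `prepare` callback stands in for Connection.prepareStatement(sql). One such cache would exist per connection.

```javascript
// Toy per-connection LRU cache of prepared statements.
class StatementCache {
  constructor(capacity, prepare) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
    this.prepare = prepare;      // stand-in for Connection.prepareStatement(sql)
    this.statements = new Map(); // Map insertion order doubles as recency order
  }

  // Returns a cached statement for this SQL, preparing (and evicting) as needed.
  get(sql) {
    let stmt = this.statements.get(sql);
    if (stmt !== undefined) {
      this.statements.delete(sql); // removed here, re-inserted below as most recent
    } else {
      stmt = this.prepare(sql);
      if (this.statements.size >= this.capacity) {
        // The first entry in the Map is the least recently used one.
        const [oldestSql, oldest] = this.statements.entries().next().value;
        this.statements.delete(oldestSql);
        oldest.close(); // deallocate the evicted statement
      }
    }
    this.statements.set(sql, stmt);
    return stmt;
  }
}

// Demo with fake statements that record when they are prepared and closed.
const closedSql = [];
let prepares = 0;
const stmtCache = new StatementCache(2, (sql) => {
  prepares += 1;
  return { sql, close() { closedSql.push(sql); } };
});

stmtCache.get("SELECT a"); // miss: prepared
stmtCache.get("SELECT b"); // miss: prepared
stmtCache.get("SELECT a"); // hit: "SELECT a" becomes most recently used
stmtCache.get("SELECT c"); // miss: evicts and closes "SELECT b"
```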
The final pieces of the puzzle are that JDBC drivers may themselves decide to be clever and do this; and servers may themselves also decide to be clever and detect a client submitting a statement that is structurally similar to a previous one.
Given that Hibernate doesn't itself keep a cache of PreparedStatement instances, you need to have c3p0 do that in order to get the benefit of them. (Which should be reduced overhead for common statements due to reusing cached plans). If c3p0 doesn't cache prepared statements, then the driver will just see the application preparing a statement, executing it, and then closing it again. Looks like the JDBC driver has a "threshold" setting for avoiding the prepare/execute server overhead in the case where the application always does this. So, yes, you need to have c3p0 do statement caching.
Hope that helps, sorry it's a bit long winded. The answer is yes.
Remember that statements have to be cached per connection, which means the cache can consume quite a chunk of memory and it will take a while before you see any benefit. If you allow 100 statements per connection (maxStatementsPerConnection = 100), that is up to 100 × the number of connections in total; if you instead set a global limit (maxStatements = 100), each connection gets only about 100 / the number of connections. Either way, the cache needs time to warm up before it has any meaningful effect.