How do I place a read lock on MongoDB?

My application needs to access a MongoDB database where, if more than one process/thread is reading from a specific collection, bad things will happen.
I need to restrict the ability of a group of processes to read from the collection (or db, if need be). So for example, if there are multiple processes trying to read from the db, they read sequentially, not in parallel.

This could be done at the driver level. If you set the connection pool size to 1, then all access to the database will happen in sequence.
In Node.js you can configure the driver like this:
MongoClient.connect(url, {
  poolSize: 1
});
From the documentation:
poolSize, this allows you to control how many tcp connections are opened in parallel. The default value for this is 5 but you can set it as high as you want. The driver will use a round-robin strategy to dispatch and read from the tcp connection.

Related

How does vertx database connection work under the hood?

I am new to the reactive world and trying to understand how database connections work under the hood with the vertx-sql-clients.
I am using io.vertx:vertx-mysql-client, io.vertx:vertx-oracle-client and io.vertx:vertx-pg-client in my project.
As these are reactive database clients, I understand that one thread can handle multiple db-connection objects. I have set pool.maxSize to 1 and concurrently executed, 5000 times, a function which gets a connection from the pool, executes a select query against the database and fetches some rows from the DB.
Now my question is: even though I have configured a single db connection in the pool, it can still handle my 5000 requests concurrently. How does this work under the hood? Do all the selects run over a single database connection? If so, how can it handle transaction management with a single db connection?

Options for tuning mongoose

I am new to MongoDB and having a hard time understanding the different flags used inside the connect() method, which are passed inside the second object argument.
const connectDB = async () => {
  const conn = await mongoose.connect(process.env.MONGO_URI, {
    useNewUrlParser: true,
    useCreateIndex: true,
    useFindAndModify: false,
    useUnifiedTopology: true
  });
};
There is documentation available for mongoose.connect:
[options] «Object» passed down to the MongoDB driver's connect() function, except for 4 mongoose-specific options explained below.
[options.bufferCommands=true] «Boolean» Mongoose specific option. Set to false to disable buffering on all models associated with this connection.
[options.dbName] «String» The name of the database we want to use. If not provided, use database name from connection string.
[options.user] «String» username for authentication, equivalent to options.auth.user. Maintained for backwards compatibility.
[options.pass] «String» password for authentication, equivalent to options.auth.password. Maintained for backwards compatibility.
[options.autoIndex=true] «Boolean» Mongoose-specific option. Set to false to disable automatic index creation for all models associated with this connection.
[options.useNewUrlParser=false] «Boolean» False by default. Set to true to opt in to the MongoDB driver's new URL parser logic.
[options.useUnifiedTopology=false] «Boolean» False by default. Set to true to opt in to the MongoDB driver's replica set and sharded cluster monitoring engine.
[options.useCreateIndex=true] «Boolean» Mongoose-specific option. If true, this connection will use createIndex() instead of ensureIndex() for automatic index builds via Model.init().
[options.useFindAndModify=true] «Boolean» True by default. Set to false to make findOneAndUpdate() and findOneAndRemove() use native findOneAndUpdate() rather than findAndModify().
[options.reconnectTries=30] «Number» If you're connected to a single server or mongos proxy (as opposed to a replica set), the MongoDB driver will try to reconnect every reconnectInterval milliseconds for reconnectTries times, and give up afterward. When the driver gives up, the mongoose connection emits a reconnectFailed event. This option does nothing for replica set connections.
[options.reconnectInterval=1000] «Number» See reconnectTries option above.
[options.promiseLibrary] «Class» Sets the underlying driver's promise library.
[options.poolSize=5] «Number» The maximum number of sockets the MongoDB driver will keep open for this connection. By default, poolSize is 5. Keep in mind that, as of MongoDB 3.4, MongoDB only allows one operation per socket at a time, so you may want to increase this if you find you have a few slow queries that are blocking faster queries from proceeding. See Slow Trains in MongoDB and Node.js.
[options.bufferMaxEntries] «Number» The MongoDB driver also has its own buffering mechanism that kicks in when the driver is disconnected. Set this option to 0 and set bufferCommands to false on your schemas if you want your database operations to fail immediately when the driver is not connected, as opposed to waiting for reconnection.
[options.connectTimeoutMS=30000] «Number» How long the MongoDB driver will wait before killing a socket due to inactivity during initial connection. Defaults to 30000. This option is passed transparently to Node.js' socket#setTimeout() function.
[options.socketTimeoutMS=30000] «Number» How long the MongoDB driver will wait before killing a socket due to inactivity after initial connection. A socket may be inactive because of either no activity or a long-running operation. This is set to 30000 by default; you should set this to 2-3x your longest running operation if you expect some of your database operations to run longer than 20 seconds. This option is passed to Node.js' socket#setTimeout() function after the MongoDB driver successfully completes.
[options.family=0] «Number» Passed transparently to Node.js' dns.lookup() function. May be either 0, 4, or 6. '4' means use IPv4 only, '6' means use IPv6 only, '0' means try both.

Haskell database connections

Please look at this scotty app (it's taken directly from this old answer from 2014):
{-# LANGUAGE OverloadedStrings #-}
import Web.Scotty
import Database.MongoDB
import qualified Data.Text.Lazy as T
import Control.Monad.IO.Class

runQuery :: Pipe -> Query -> IO [Document]
runQuery pipe query = access pipe master "nutrition" (find query >>= rest)

main = do
  pipe <- connect $ host "127.0.0.1"
  scotty 3000 $ do
    get "/" $ do
      res <- liftIO $ runQuery pipe (select [] "stock_foods")
      text $ T.pack $ show res
You see how the database connection (pipe) is created only once, when the web app launches. Subsequently, thousands if not millions of visitors will hit the "/" route simultaneously and read from the database using that same connection (pipe).
I have questions about how to properly use Database.MongoDB:
Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have built-in support for simultaneous usage?
Even if you create a connection per client, you won't be able to create very many of them: you will hit the ulimit, and once you do, the client that hit it will get a runtime error.
It also doesn't make sense because the MongoDB server would spend too much time polling all those connections, while it has only as many meaningful workers as your database server has CPUs.
One connection is not a bad idea, because MongoDB is designed to accept several requests on a connection and answer them as responses become ready. So it will utilize as many resources as your MongoDB server can offer, with one limitation: you have only one pipe for writing, and if it closes accidentally you will need to recreate the pipe yourself.
So it makes more sense to have a pool of connections. It doesn't need to be big. I had an app which authenticated users and gave them tokens; with 2500 concurrent users per second it only had 3-4 concurrent connections to the database.
Here are the benefits a connection pool gives you:
If you hit the pool's connection limit, you wait for the next available connection instead of getting a runtime error. Your app waits a little bit instead of rejecting the client.
The pool recreates connections for you. You can configure the pool to close excess connections and to create more, up to a certain limit, as you need them. If a connection breaks while you read from or write to it, you just take another connection from the pool; if you don't return the broken connection to the pool, the pool will create another one for you.
If the database connection is closed, the MongoDB listener on that connection exits, printing an error message to your terminal, and your app receives an IO error. To handle this error you need to create another connection and try again. Once you start handling this situation yourself, you'll see that it's easier to use a db pool, because eventually your solution will resemble a connection pool very closely.
I do auth once, as part of opening a connection. If you need to auth as another user later, you can always do so.
Yes, MongoDB handles simultaneous usage, but as I said it gives you only one pipe to write to, and that soon becomes a bottleneck. If you create at least as many connections as your MongoDB server can afford threads to handle them (the CPU count), then they will run at full speed.
If I missed something feel free to ask for clarifications.
Thank you for your question.
What you really want is a database connection pool. Take a look at the code from this other answer.
Instead of auth, you can use withMongoDBPool if your MongoDB server is in secure mode.
Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
You do not want to open one connection and then use it. The HTTP server you are using, which underpins Scotty, is called Warp. Warp has a multi-core, multi-green-thread design. You are allowed to share the same connection across all threads, since Database.MongoDB says outright that connections are thread-safe, but what will happen is that when one thread is blocked waiting for a response (the MongoDB protocol follows a simple request-response design) all threads in your web service will block. This is unfortunate.
We can instead create a connection on every request. This trivially solves the problem of one thread's blocking another but leads to its own share of problems. The overhead of setting up a TCP connection, while not substantial, is also not zero. Recall that every time we want to open or close a socket we have to jump from the user to the kernel, wait for the kernel to update its internal data structures, and then jump back (a context switch). We also have to deal with the TCP handshake and goodbyes. We would also, under high load, run out of file descriptors or memory.
It would be nice if we had a solution somewhere in between. The solution should be
Thread-safe
Let us max-bound the number of connections so we don't exhaust the finite resources of the operating system
Quick
Share connections across threads under normal load
Create new connections as we experience increased load
Allow us to clean up resources (like closing a handle) as connections are deleted under reduced load
Hopefully already written and battle-tested by other production systems
It is exactly this problem that resource-pool tackles.
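As a minimal sketch, here is the app from the question reworked to use resource-pool's classic createPool interface; the stripe count, idle timeout, and pool size below are illustrative placeholders, not tuned values:
{-# LANGUAGE OverloadedStrings #-}
import Web.Scotty
import Database.MongoDB
import Data.Pool (createPool, withResource)
import qualified Data.Text.Lazy as T
import Control.Monad.IO.Class

main :: IO ()
main = do
  -- 1 stripe, close pipes idle for 60 seconds, keep at most 10 open
  pool <- createPool (connect $ host "127.0.0.1") close 1 60 10
  scotty 3000 $
    get "/" $ do
      res <- liftIO $ withResource pool $ \pipe ->
        access pipe master "nutrition" (find (select [] "stock_foods") >>= rest)
      text $ T.pack $ show res
Each request borrows a pipe only for the duration of its query, so a slow response on one pipe no longer blocks every other thread, and a broken pipe can be discarded instead of being shared forever.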
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have built-in support for simultaneous usage?
It is unclear what you mean by simultaneous usage. There is one interpretation I can guess at: you mean something like HTTP/2, which has pipelining built into the protocol.
standard picture of pipelining http://research.worksap.com/wp-content/uploads/2015/08/pipeline.png
Above we see the client making multiple requests to the server, without waiting for a response, and then the client can receive responses back in some order. (Time flows from the top to the bottom.) This MongoDB does not have. This is a fairly complicated protocol design that is not that much better than just asking your clients to use connection pools. And MongoDB is not alone here: the simple request-and-response design is something that Postgres, MySQL, SQL Server, and most other databases have settled on.
And: it is true that a connection pool limits the load you can take as a web service before all threads are blocked and your user just sees a loading bar. But this problem would exist in any of the three scenarios (connection pooling, one shared connection, one connection per request)! The computer has finite resources, and at some point something will collapse under sufficient load. Connection pooling's advantage is that it scales gracefully right up until the point it cannot. The correct solution to handling more traffic is to increase the number of computers; we should not avoid pooling simply due to this problem.
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
I believe these kinds of what-ifs are outside the scope of Stack Overflow and deserve no better answer than "try it and see." Buuuuuuut given that the server terminates the connection, I can take a stab at what might happen: assuming Warp forks a green thread for each request (which I think it does), each thread will experience an unchecked IOException as it tries to write to the closed TCP connection. Warp would catch this exception and serve it as an HTTP 500, hopefully writing something useful to the logs also. Assuming a single-connection model like you have now, you could do something clever (but high in lines of code) where you "reboot" your main function and set up a second connection. Alternatively, something I do for hobby projects: should anything odd occur, like a dropped connection, I ask my supervisor process (like systemd) to watch the logs and restart the web service. Though clearly not a great solution for a production, money-makin' website, it works well enough for small apps.
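For concreteness, here is a hypothetical "retry once on a fresh connection" helper under the single-connection model. It assumes it lives in the module from the question (so runQuery and the Database.MongoDB imports are in scope), and it follows the answer's assumption that a dead pipe surfaces as an IOException; the exact exception type may vary with driver version:
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception (IOException, catch)

-- Hypothetical helper, not Warp's actual recovery mechanism: retry a
-- query once on a fresh pipe if the old one turns out to be dead. Note
-- that the fresh pipe is not stored anywhere afterward; a pool makes
-- exactly this bookkeeping disappear.
runQueryRetrying :: Pipe -> Query -> IO [Document]
runQueryRetrying pipe q =
  runQuery pipe q `catch` \(_ :: IOException) -> do
    close pipe                           -- discard the broken pipe
    pipe' <- connect (host "127.0.0.1")  -- open a fresh one
    runQuery pipe' q                     -- retry exactly once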
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
It should be called once after creating the connection. MongoDB authentication is per-connection. You can see an example here of how the db.auth() command mutates the MongoDB server's data structures corresponding to the current client connection.
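In Database.MongoDB that looks roughly like the sketch below; the credentials and the "admin" database are placeholders, and auth is run once against the freshly opened pipe:
{-# LANGUAGE OverloadedStrings #-}
import Database.MongoDB

main :: IO ()
main = do
  pipe <- connect (host "127.0.0.1")
  -- auth runs once, against this particular pipe; every later
  -- access over the same pipe is then authenticated
  ok <- access pipe master "admin" (auth "myUser" "myPassword")
  print ok
  close pipe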

Concurrent read operations on MongoDB

I have a scala application which is accessing a Mongo Collection with 13 million records over 4 threads.
I want the four threads to access Mongo concurrently and want to make sure that they never read a same record. Also, a record accessed by thread 2 in pass 3 should not be accessed by any other thread in future.
Any suggestion on how could I achieve it?
It looks like a good place for a dispatcher.
The dispatcher will need to read all the ids and then, using (say) a round-robin queue, push the ids to f1, f2, f3, f4. There is no lock mechanism that will prevent reading data from a SINGLE document, so once an id has been dispatched, the underlying function will have to carry out all operations on that document. A sketch of this single-queue idea follows.
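The pattern itself is language-agnostic; here is a sketch in Haskell (to match the other examples in this thread) of one shared queue feeding four workers, with plain Ints standing in for the collection's _id values. Each readChan atomically removes the next id, so no two workers ever see the same id, and a dispatched id is never handed out again:
import Control.Concurrent
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  queue <- newChan
  done  <- newChan
  let ids = [1 .. 100 :: Int]              -- stand-ins for the _ids to process
  forM_ ids (writeChan queue . Just)
  replicateM_ 4 (writeChan queue Nothing)  -- one stop marker per worker
  forM_ [1 .. 4 :: Int] $ \w -> forkIO (worker w queue done)
  replicateM_ 4 (readChan done)            -- wait until all four workers finish

worker :: Int -> Chan (Maybe Int) -> Chan () -> IO ()
worker w queue done = loop
  where
    loop = do
      next <- readChan queue
      case next of
        Nothing -> writeChan done ()     -- no ids left: report and stop
        Just i  -> do
          putStrLn ("worker " ++ show w ++ " processes _id " ++ show i)
          loop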

Pattern for a singleton application process using the database

I have a backend process that maintains state in a PostgreSQL database, which needs to be visible to the frontend. I want to:
Properly handle the backend being stopped and started. This alone is as simple as clearing out the backend state tables on startup.
Guard against multiple instances of the backend trampling each other. There should only be one backend process, but if I accidentally start a second instance, I want to make sure either the first instance is killed, or the second instance is blocked until the first instance dies.
Solutions I can think of include:
Exploit the fact that my backend process listens on a port. If a second instance of the process tries to start, it will fail with "Address already in use". I just have to make sure it does the listen step before connecting to the database and wiping out state tables.
Open a secondary connection and run the following:
BEGIN;
LOCK TABLE initech.backend_lock IN EXCLUSIVE MODE;
Note: the reason for IN EXCLUSIVE MODE is that LOCK defaults to the AccessExclusive locking mode. This conflicts with the AccessShare lock acquired by pg_dump.
Don't commit. Leave the table locked until the program dies.
What's a good pattern for maintaining a singleton backend process that maintains state in a PostgreSQL database? Ideally, I would acquire a lock for the duration of the connection, but LOCK TABLE cannot be used outside of a transaction.
Background
Consider an application with a "broker" process which talks to the database, and accepts connections from clients. Any time a client connects, the broker process adds an entry for it to the database. This provides two benefits:
The frontend can query the database to see what clients are connected.
When a row changes in another table called initech.objects, and clients need to know about it, I can create a trigger that generates a list of clients to notify of the change, writes it to a table, then uses NOTIFY to wake up the broker process.
Without the table of connected clients, the application has to figure out what clients to notify. In my case, this turned out to be quite messy: store a copy of the initech.objects table in memory, and any time a row changes, dispatch the old row and new row to handlers that check if the row changed and act if it did. To do it efficiently involves creating "indexes" against both the table-stored-in-memory, and handlers interested in row changes. I'm making a poor replica of SQL's indexing and querying capabilities in the broker program. I'd rather move this work to the database.
In summary, I want the broker process to maintain some of its state in the database. It vastly simplifies dispatching configuration changes to clients, but it requires that only one instance of the broker be connected to the database at a time.
It can be done with advisory locks:
http://www.postgresql.org/docs/9.1/interactive/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
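With a session-level advisory lock the whole check reduces to one call at backend startup. Sketched here with Haskell's postgresql-simple, though any client library works the same way; the lock key 42 and the connection string are arbitrary placeholders:
{-# LANGUAGE OverloadedStrings #-}
import Database.PostgreSQL.Simple

main :: IO ()
main = do
  conn <- connectPostgreSQL "host=localhost dbname=initech"
  -- Session-level advisory lock, held until this connection closes.
  -- pg_try_advisory_lock returns False immediately if another
  -- backend instance already holds key 42.
  [Only gotLock] <- query_ conn "SELECT pg_try_advisory_lock(42)"
  if gotLock
    then putStrLn "lock acquired: we are the only backend instance"
    else putStrLn "another instance already holds the lock; exiting"
Because the lock is tied to the connection, it is released automatically when the process dies, which also covers the requirement of handling the backend being stopped and restarted.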
I solved this today in a way I thought was concise:
CREATE TYPE mutex AS ENUM ('active');
CREATE TABLE singleton (status mutex DEFAULT 'active' NOT NULL UNIQUE);
Then your backend process tries to do this:
INSERT INTO singleton VALUES ('active');
And quits or waits if it fails to do so.