Replicate only documents where {'public':true} in MongoDB

I have the following network/mongodb setup:
1 primary mongodb database (10.0.0.1, not accessible from the
Internet) - contains private info in collection A, and a collection
B, with documents created by a trusted user. At any point in time, a
user can mark any document in collection B as 'public', which changes
its property from {'public':false} to {'public':true}.
1 public mongodb database (10.0.0.2, runs a webserver accessible from
the Internet via a reverse proxy) - does not contain collection A,
but should contain all documents marked as 'public' from collection
B. This machine will serve those public documents to users outside
the network.
How would I set up mongodb so that when a document in the primary database (10.0.0.1) is updated with {'public':true}, it gets replicated to the public mongodb database (10.0.0.2)?
Other details:
I'm using the PHP driver
The documents are small, max 2KB
The load on these servers will probably never exceed 10 concurrent users
Eventual consistency is ok, up to a few minutes, but I'd like to know what my options are.
So, just to reiterate, here's a use case:
John VPNs into our private network, opens http://10.0.0.1/, creates a
document (call it D2), marks it as private. John then views an older
document, D1, and decides to make it public, by clicking the 'Make
public' button. The server automagically makes the document available
on the public server example.com (public IP x.y.z.w, internal IP
10.0.0.2).
John sends an e-mail to Sarah and asks her to read document D1 (the
one that was made public). Sarah goes to http://example.com and is
able to read D1, but never sees D2.
My goal is to achieve this without having to manually write scripts to synchronize those two databases. I suspect it should be possible, but I can't figure it out from what I've read about MongoDB replication.
I welcome any advice.
Thank you!

MongoDB (as at 2.0.6) does not have support for filtered replication.
However ... it may be possible for you to implement your own scheme to update records based on a tailable cursor of MongoDB's oplog. The local oplog.rs capped collection is the same mechanism used to relay changes to members of a replica set and includes details for inserts, deletes, and updates.
For an example of this technique, see this blog post: Creating Triggers for MongoDB.
In your case the actions would be something like:
copy a document from collection B on the primary to the public server when it is inserted or updated with public:true
remove a document from the public server when it is deleted from collection B on the primary or updated with public:false
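For illustration, here is a minimal sketch of that scheme in the mongo shell (the question uses the PHP driver, but the same logic applies there). The database name "mydb", the collection name "B", and the connection details are assumptions, and error handling such as re-opening a dead cursor is omitted:
// Run against the primary (10.0.0.1); "target" is the public server (10.0.0.2).
var target = new Mongo("10.0.0.2:27017").getDB("mydb");
var oplog = db.getSiblingDB("local").oplog.rs;
var cursor = oplog.find({ ns: "mydb.B" })
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);
while (true) {
    if (!cursor.hasNext()) { sleep(500); continue; }
    var entry = cursor.next();
    if (entry.op === "d") {
        // Deleted on the primary: remove it from the public server too.
        target.B.remove({ _id: entry.o._id });
    } else if (entry.op === "i" || entry.op === "u") {
        // Re-read the current document and mirror it according to its 'public' flag.
        var id = (entry.op === "u") ? entry.o2._id : entry.o._id;
        var doc = db.getSiblingDB("mydb").B.findOne({ _id: id });
        if (doc && doc.public === true) {
            target.B.save(doc);               // insert or overwrite the public copy
        } else {
            target.B.remove({ _id: id });     // private or removed: keep it off the public server
        }
    }
}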


All documents in the collection magically disappeared. Can I find out what happened?

I cloned 2 of my collections from localhost to a remote location on the MongoLab platform yesterday. I was trying to debug my (MEAN stack) application (with the WebStorm IDE) and I realized one of those collections has no data in it. Well, there were 7800 documents in it this morning...
I am pretty much the only one who works on the database, and especially with this collection. I didn't run any query to remove all of the documents from this collection. On MongoLab's website there is a button that says 'delete all documents from collection'. I am pretty sure I didn't hit that button. I asked my teammates; no one even opened that web page today.
Assuming that my team is telling the truth and I didn't remove everything and have a blackout...
Is there a way to find out what happened?
And, is there a way to keep a query history (like the Unix command-line history) for a mongo database that runs on a remote server? And if yes, how?
So, I am just curious about what happened. Also note that I don't have any DBA responsibilities or experience in that field.
MongoDB replica sets have a special collection called oplog. This collection stores all write operations for all databases in that replica set.
Here are instructions on how to access the oplog on MongoLab:
Accessing the MongoDB oplog
Here is a query that will find all delete operations:
use local
db.oplog.rs.find({"op": "d", "ns" : "db_name.collection_name"})

How to check if a database exists on MongoDB without using mongClient.getDatabaseNames()

We configured our MongoDB deployment to forbid the "listDatabases" feature (i.e., mongoClient.getDatabaseNames()) for privacy reasons.
I would like to check whether a database exists in MongoDB without using mongoClient.getDatabaseNames().
If I use mongoClient.getDB("mydb"), MongoDB creates a new DB instance, which doesn't help me check whether the database name exists.
Any suggestions please?
We have a common 'UI connection wizard' to connect to different MongoDB servers which have different authentication settings. The wizard has to verify the "database_name" field, i.e. whether the user entered a correct database name. In this case "listDatabases" (i.e. client.getDatabaseNames()) can't be used because authentication fails, and "use db" (i.e. client.getDB()) can't be used because it just creates a new db instance.
When do you consider a database to exist? Why do you need to do this check? MongoDB creates a database when it creates a collection in that database, which happens when data is inserted, an index is created, or the collection is created explicitly with db.createCollection(). A reasonable condition for a database not to exist is that it contains no collections. Another reasonable condition would be that all the collections it does contain hold no data. You can check the former using db.getCollectionNames(), and the latter by examining db.<collection>.count() for each collection. Yet a third condition might be that the database has zero storage size, which you could check with db.stats().
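A minimal shell sketch of those three checks, assuming the database is named "mydb" (the equivalent calls exist in the Java driver, but the logic is the same):
var candidate = db.getSiblingDB("mydb");
// 1. Does it contain any collections at all?
var collections = candidate.getCollectionNames();
var hasCollections = collections.length > 0;
// 2. Do any of those collections actually contain documents?
var hasData = collections.some(function (name) {
    return candidate.getCollection(name).count() > 0;
});
// 3. Does it occupy any storage?
var hasStorage = candidate.stats().dataSize > 0;
print("collections: " + hasCollections + ", data: " + hasData + ", storage: " + hasStorage);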
In your situation where getDatabaseNames() is forbidden for security reasons, you should control access to the databases with authentication and authorization rather than worrying about existence. If a database exists, it should be restricted to the appropriate users and roles so that if another user tries to interact with that database, they will be rejected.
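For example, with role-based access control (MongoDB 2.6+) you could restrict a user to a single database; the user name, password, and database name below are placeholders:
use mydb
db.createUser({
    user: "appUser",
    pwd: "secret",
    roles: [ { role: "readWrite", db: "mydb" } ]
})
// appUser can read and write "mydb" but is rejected everywhere else.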

data synchronization between clients

I don't know if this is the right place to ask my question, but here it is.
Inspired by Firebase, I decided to write a little framework to synchronize data between clients. That should simplify the development of web applications such as chats, forums, etc...
Let's assume that there are one or more servers. A client can connect to one server and access a particular collection (a list of chat messages, for instance). If and when the client modifies the collection, those modifications will be sent to the other clients who requested access to the same collection.
I'd like my solution to be fast and general. The propagation of the modifications should be very fast and the collections should be persisted on a DB.
The collections may be very large but the clients may request just a view of the collection (for instance, the chat messages of the last 20 minutes).
Possible solution
We have n servers, 1 node with a fast in-memory DB (Redis) and a cluster with a NoSQL DB.
The cluster will contain the full collections.
When a client connects to a server and is given access to a collection for the first time, the requested part of the collection is read directly from the cluster.
When a client modifies a collection C, the modification is written to the in-memory DB which will contain something like:
123 added "message..."
124 deleted id235
125 modified id143 "new message..."
where 123, 124 and 125 are the versions of the collection.
In this case, the cluster contains the entire collection C and its version number which is 122.
When a client first connects to a server and accesses the collection C, the server reads the requested part of the collection from the cluster and then reads the updates from the in-memory DB in order to update the collection from version 122 to version 125.
When a client modifies the collection C,
the description of the modification is inserted into the in-memory DB;
the other servers are informed that a new version of C is available;
the client is sent the update.
Of course, the other servers, once informed, will send the updates to their clients as well.
Another process in the background will update the cluster the following way:
while (the in-memory database contains less than K updates for the collection C)
read the next update, U, from the in-memory database;
use U to update the collection C and its version number in the cluster ATOMICALLY.
The updates must be linearizable, i.e. no server should ever see the collection C in a state where a later update has been applied but an earlier one has not.
When the cluster is fully-consistent, we remove the updates from the in-memory DB from lowest to highest version.
Problem
My solution requires a DB (for the cluster) which supports transactions (ACID?) and offers strong consistency. For instance, I can't use MongoDB.
Question
Can you think of a simpler solution?
or
If my solution is acceptable, what DB do you recommend for the cluster?
Thank you for your patience.
If each element of the collection is assigned a unique id and the updates in the in-memory DB include those ids, then one doesn't need a version number for the collection in the cluster and so transactions are unnecessary.
The idea is that the ids can be used to decide if an update is needed or not. For instance, if an update says
version=123 action=ADD id=05276 msg="Message"
and the collection in the cluster already contains a document with id=05276, then this update is old and was already applied to the collection in the cluster.
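As an illustration, here is a hypothetical sketch in mongo-shell style JavaScript of applying such updates idempotently; the field names action, id and msg are taken from the update format shown above:
function applyUpdate(coll, update) {
    if (update.action === "ADD") {
        // Upsert by id: re-applying the same ADD is harmless.
        coll.update({ _id: update.id }, { _id: update.id, msg: update.msg }, { upsert: true });
    } else if (update.action === "MODIFY") {
        // Setting the same value twice leaves the document unchanged.
        coll.update({ _id: update.id }, { $set: { msg: update.msg } });
    } else if (update.action === "DELETE") {
        // Removing an already-removed document is a no-op.
        coll.remove({ _id: update.id });
    }
}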
We only need to pay attention to this: before we remove some updates from the in-memory DB, we must make sure that those updates have been applied to the collection in the cluster and that the cluster is fully consistent with respect to that collection.
When a client requests access to a collection, the server it connected to needs to:
read all the updates from the in-memory DB and save them in memory
read (the relevant part of) the collection from the cluster
update the read collection with the updates saved in memory
It's important to first read all the updates from the in-memory DB to avoid a race condition. Consider this scenario:
1. we read an old version of the collection from the cluster
2. the collection is updated
3. the cluster becomes fully consistent
4. some updates are removed from the in-memory DB
5. we update our read collection with the new updates
Problem is, in point 5 we'd miss some updates.

Can MongoDB be a consistent event store?

When storing events in an event store, the order in which the events are stored is very important, especially when projecting the events later to restore an entity's current state.
MongoDB seems to be a good choice for persisting the event store, given its speed and flexible schema (and it is often recommended as such), but there is no such thing as a transaction in MongoDB, meaning the correct event order cannot be guaranteed.
Given that fact, should you not use MongoDB if you are looking for a consistent event store, but rather stick with a conventional RDBMS? Or is there a way around this problem?
I'm not familiar with the term "event store" as you are using it, but I can address some of the issues in your question. I believe it is probably reasonable to use MongoDB for what you want, with a little bit of care.
In MongoDB, each document has an _id field which is by default in ObjectId format: a 4-byte timestamp, followed by a machine identifier, a process id, and a 3-byte sequence counter. So as long as the ObjectIds are all generated on the same machine, you can sort on that field and get your documents back in creation order (the timestamp has one-second resolution, with the counter breaking ties within a second).
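For example (using an assumed collection name "events"), reading the events back in creation order is then just a sort on _id:
db.events.find().sort({ _id: 1 }).forEach(printjson)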
Most MongoDB client drivers, however, create the _id field locally before sending an insert command to the database. So if you have multiple clients connecting to the database, sorting by _id won't reliably reflect insertion order: the embedded timestamps come from different client clocks, which may be skewed, and documents created within the same second by different clients can sort out of order.
But if you can convince your MongoDB client driver to not include the _id in the insert command, then the server will generate the ObjectId for each document and they will have the properties you want. Doing this will depend on what language you're working in since each language has its own client driver. Read the driver docs carefully or dive into their source code -- they're all open source. Most drivers also include a way to send a raw command to the server. So if you construct an insert command by hand this will certainly allow you to do what you want.
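As a sketch of the raw-command route, here is the mongo shell equivalent using the insert command (MongoDB 2.6+); the collection name "events" and the event fields are assumptions. A document sent without an _id gets its ObjectId generated on the server side:
db.runCommand({
    insert: "events",
    documents: [ { type: "OrderCreated", orderId: 42 } ]   // no _id: the server assigns one
})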
This will break down if your system is so massive that a single database server can't handle all of your write traffic. The MongoDB solution to needing to write thousands of records per second is to set up a sharded database. In this case the ObjectIds will again be created by different machines and won't have the nice sorting property you want. If you're concerned about outgrowing a single server for writes, you should look to another technology that provides distributed sequence numbers.

Splitting MongoDB collections in multiple servers via mongos Router

Having a MongoDB database named maindatabase which has 3 document collections named users, tags and categories, I would like to know if it is possible to have them split across three different servers (on different cloud service providers).
I mean not as replicas, but one collection per server (one db with just the categories collection on one server, one with users on another server, and one with tags on the third), routed selectively by a mongos router.
Does anyone know if this is possible?
Aside from #matulef's answer regarding manual manipulation of databases through movePrimary, maybe this calls for a simpler solution of just maintaining 3 database connections: one per server, each in a different cloud provider's data center as you originally specified. You wouldn't have the simplicity of a single mongos connection point, but with your three connections, you could then directly manipulate users, tags, and categories on each of their respective connections.
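A sketch of that three-connection setup in mongo shell terms (the hostnames are placeholders for your three cloud providers):
var usersDb      = new Mongo("users.provider-a.example.com:27017").getDB("maindatabase");
var tagsDb       = new Mongo("tags.provider-b.example.com:27017").getDB("maindatabase");
var categoriesDb = new Mongo("categories.provider-c.example.com:27017").getDB("maindatabase");
usersDb.users.findOne();   // each collection is read and written through its own connection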
Unfortunately you can't currently split up the collections in a single database this way. However it is possible to do this if you put each collection in a different database. In a sharded system, each database has a "primary shard" associated with it, where all the unsharded collections on that database live. If you separate your 3 collections into 3 different databases, you can individually move them to different shards using the "movePrimary" command:
http://www.mongodb.org/display/DOCS/movePrimary+Command
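A sketch of that approach, run against a mongos (the shard names and the per-collection database names are assumptions):
// One database per collection, each pinned to its own shard.
db.adminCommand({ movePrimary: "usersdb",      to: "shard0000" })
db.adminCommand({ movePrimary: "tagsdb",       to: "shard0001" })
db.adminCommand({ movePrimary: "categoriesdb", to: "shard0002" })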
There is, however, some overhead associated with making more databases, so it's not clear whether this is the best solution for your needs.