MongoDB Shard key - enforce in every query

If we use a compound shard key, say {a, b}, is there a way to throw an error on any query that does not include these fields in its criteria, at the Java driver level? I.e., some callback/lifecycle event before the query gets executed. AbstractMongoEventListener offers onAfterLoad and onAfterConvert, but our requirement is before the query executes. Something at the Java driver level.

I understand why you want this capability: if the query does not include even a single shard key field in its criteria, it will result in a "scatter-gather" query, which causes significant performance degradation. But best practice suggests that an API should be designed for a single purpose and be generic; if we added this capability to the Java driver, it would impose an additional constraint that might not be required. Hence there is no out-of-the-box API that does this for you.
What you can do to make it work: write a wrapper on top of the API with the additional capability.
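For example, a minimal sketch of such a wrapper (the class and field names are hypothetical; it only inspects top-level keys of the filter, so nested operators like $and/$or would need extra handling):

    import com.mongodb.client.FindIterable;
    import com.mongodb.client.MongoCollection;
    import org.bson.BsonDocument;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.Set;

    // Hypothetical wrapper that rejects finds missing any shard key field.
    public class ShardKeyEnforcingCollection {

        private final MongoCollection<Document> collection;
        private final Set<String> shardKeyFields; // e.g. {"a", "b"}

        public ShardKeyEnforcingCollection(MongoCollection<Document> collection,
                                           Set<String> shardKeyFields) {
            this.collection = collection;
            this.shardKeyFields = shardKeyFields;
        }

        public FindIterable<Document> find(Bson filter) {
            BsonDocument query = filter.toBsonDocument(BsonDocument.class,
                    collection.getCodecRegistry());
            // Fail fast before the query is ever sent to the server.
            if (!query.keySet().containsAll(shardKeyFields)) {
                throw new IllegalArgumentException(
                        "Query must include all shard key fields " + shardKeyFields
                        + " to avoid scatter-gather; got " + query.keySet());
            }
            return collection.find(filter);
        }
    }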

Related

Spring Data Mongo - lookup vs client side filtering

I am new to Spring and MongoDB. When I have to process records from more than one collection, is it the better option to do a lookup, or to write the join/filtering code in Spring?
This cannot be decided outright; it depends on the following factors:
Data size
Number of collections which will be part of lookup
Index usage
Query efficiency.
It's better to evaluate both options and decide. This is a good article for understanding schema design.
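For evaluating the server-side option, a $lookup in Spring Data might look roughly like this (a sketch assuming an injected MongoTemplate; collection and field names are made up, and the client-side alternative would simply be two find queries joined in Java):

    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.data.mongodb.core.aggregation.Aggregation;
    import org.bson.Document;
    import java.util.List;

    // Join "orders" into "customers" on the server: for each customer,
    // attach the orders whose customerId matches the customer's _id.
    Aggregation agg = Aggregation.newAggregation(
            Aggregation.lookup("orders", "_id", "customerId", "customerOrders"));

    List<Document> joined = mongoTemplate
            .aggregate(agg, "customers", Document.class)
            .getMappedResults();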

MongoDB large one-time query load on production system

I have a MongoDB database holding tens of millions of documents.
Let's say I want to query a single value out of each document (the target key, under the 0 key, under the references key),
so it's a 3rd-level nested key, and only if referenceType equals "CopiedFrom" (the references level doesn't exist in all documents).
There are ~10M documents that will match this condition, and this is a one-time query.
The DBA in my org tells me this database is transactional (and not for reporting) and serves many clients in production; hence, a query like the one I'm asking about will put great load on the system and will compromise production response times.
I don't have much experience with MongoDB and cannot evaluate this claim (besides the fact that it's absurd to have historical data you cannot effectively access).
Is he right, or is he exaggerating?
Knowing this can help me deal with his claim and get the data I need.
Thanks!
Your use case is addressed by adding dedicated hidden nodes to the replica set for analytics queries. See here for example.
The DBA is generally correct in that an expensive analytical query is unsuitable for executing against servers that serve transactional workloads.
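If such a node (or any tagged analytics secondary) is available, the one-time query could be routed to it, roughly like this (a sketch assuming an existing MongoClient named client; the tag, database, and collection names are assumptions based on the question):

    import com.mongodb.ReadPreference;
    import com.mongodb.Tag;
    import com.mongodb.TagSet;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;

    // Route reads to a secondary tagged for analytics so the
    // transactional primary is not loaded by the scan.
    MongoCollection<Document> coll = client.getDatabase("prod")
            .getCollection("documents")
            .withReadPreference(ReadPreference.secondary(
                    new TagSet(new Tag("use", "analytics"))));

    // Match only documents that have a "CopiedFrom" reference and
    // project just the nested field we need.
    for (Document doc : coll.find(
                Filters.elemMatch("references",
                        Filters.eq("referenceType", "CopiedFrom")))
            .projection(Projections.include("references.target"))) {
        // process doc ...
    }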

MongoDB aggregation from few operations

Every user in our system (like Facebook and Twitter) has an option to add other users to his predefined lists like "Favorites", "Follow", "Blocked", "Closed Friends". Then, we want to allow him to search the lists, filter, and see cumulative data from all the above lists. For example:
UserA {
    IsFollow: 1,
    IsFavorite: 0,
    ...
    IsBlocked: 0
}
We also want to keep some additional information when a user adds another user to one of the above lists, such as addingDate.
Option One is to manage different collections like "Favorites", "Follow", "Blocked", "Closed Friends".
Option Two is to manage one collection, like "Relations", and keep all the data in that collection without the need for $lookup.
Option Three is to use Option One but create a flat collection with all the relevant data from each table (RabbitMQ, transactional updates, etc.).
Since I'm new to MongoDB (I'm migrating the system from MS SQL), I'm wondering what the best approach is for a high-scale system.
I would suggest you go with option 2, where all the keys will be present in one document.
MongoDB recommends a schema design where all the data is embedded in a single document. They claim that this leads to fewer read/write operations to the DB and faster CRUD operations compared to the relational-mapping approach.
But, there is a catch here. The data should be embedded in a single document only if the relations are One-to-One, One-to-Few, or One-to-Many.
DO NOT GO WITH THE DOCUMENT EMBEDDING APPROACH IF YOUR DATA MAPPING RELATION IS One-to-Squillions. I recommend reading this article.
The reason why I am not recommending Option 1 (a separate collection per list) is that you will have to make more requests to the DB for each and every collection linkage. Although the $lookup stage is fast, it is not as efficient as the embedding approach.
As far as Option 3 goes, it's a viable approach (if you use transactions properly and effectively), but it adds complexity on the coding side.
I have personally used both the Option 1 and Option 2 approaches, and Option 1 has always driven the AWS EC2 instance running MongoDB to higher CPU and RAM usage. As far as Option 2 goes, I have a collection that has almost 1000 array elements (with keys indexed) and 15K keys in each record (I am not joking), and MongoDB had no issues processing it. Just make sure that you use projection on returned documents everywhere.
So, go for Option 2 as the standard approach and Option 3 for One-to-Squillions relation mapping.
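To make Option 2 concrete, here is a sketch of a "Relations" document and a projected query (the relations collection and the field names are assumptions drawn from the question; ownerId and targetId are assumed to be in scope):

    import com.mongodb.client.FindIterable;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Projections;
    import org.bson.Document;
    import java.util.Date;

    // One document per (owner, target) pair, with all list flags embedded.
    Document relation = new Document("ownerId", ownerId)   // id of the list owner
            .append("targetId", targetId)                  // id of the listed user
            .append("isFollow", true)
            .append("isFavorite", false)
            .append("isBlocked", false)
            .append("addingDate", new Date());
    relations.insertOne(relation);

    // Fetch one list with a projection, as recommended above.
    FindIterable<Document> favorites = relations
            .find(Filters.and(Filters.eq("ownerId", ownerId),
                              Filters.eq("isFavorite", true)))
            .projection(Projections.include("targetId", "addingDate"));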
For referencing two or more collections, make sure that you use the MongoDB-generated ObjectId instead of your own custom referencing, since I have seen a minor performance impact when using multi-document relation mapping with anything other than ObjectId (even if that particular key is indexed).
Hope this helps. Reach out to me if you have additional queries.

Couchbase BulkGet in spring data couchbase

I am using Couchbase with Spring Data and wish to implement Couchbase's bulkGet. Please let me know the following:
Is it possible via Spring Data?
If yes, can you share an example?
Is findAll (using _all view) comparable to bulkGet in terms of performance?
Can I fetch the _id along with the Couchbase document?
Environment: Couchbase 4.0, Spring Data 2.0.0.RELEASE, Java 8.
Thanks in Advance!
I assume you are asking about a bulk get in the context of repositories.
First, there is currently no complete support of a "bulkGet" in Spring Data Couchbase. Most of the implementation is based on the SDK synchronous API, and bulk get is something usually done using the asynchronous API, using RxJava.
Note that there is no actual "bulkGet" operation at the protocol level in Couchbase, it's just the SDK issuing multiple single Get and batching them together.
To answer your second question, the above is important. The bulk get pattern discussed in the Couchbase Java SDK documentation (here) gives a slight performance boost because, unlike in synchronous mode, we don't wait for the retrieval of one item before requesting the next.
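For reference, that asynchronous batching pattern looks roughly like this with the 2.x Java SDK and RxJava 1.x (a sketch; bucket and the list of ids are assumed to exist):

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.document.JsonDocument;
    import rx.Observable;
    import java.util.List;

    // Fire all gets asynchronously and collect the results as they arrive.
    // Missing documents complete empty and are simply skipped.
    List<JsonDocument> docs = Observable
            .from(ids)                          // ids: List<String> of document keys
            .flatMap(id -> bucket.async().get(id))
            .toList()
            .toBlocking()
            .single();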
The findAll() and findAll(Iterable) methods in Spring Data Couchbase both operate on top of a view, which lets you retrieve only the documents matching the entity type of your repository, but introduces a level of indirection that can lower performance compared to a pure sequence of key/value gets.
So the closest you could get to a bulk operation like that in Spring Data Couchbase would be to know all the IDs you're interested in and then perform a findOne per ID.
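In repository terms, that would look something like this (a sketch; MyEntity and myRepository are placeholders):

    import java.util.ArrayList;
    import java.util.List;

    // Emulate a bulk get with one lookup per known ID;
    // each findOne is a direct key/value get, so no view is involved.
    List<MyEntity> results = new ArrayList<>();
    for (String id : ids) {
        MyEntity entity = myRepository.findOne(id);
        if (entity != null) {
            results.add(entity);
        }
    }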
In the near term, the code behind the findAll(Iterable) signature could maybe be improved by applying a bulk get pattern on all provided IDs, but that would mean forgetting about the type checking induced by the view, so I'm not sure...

mongodb indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know upfront how each tenant is going to define the fields for the entities they are creating. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, one tenant might define a Customer as First Name, Last Name, Email. Another tenant might define a Shipment as Shipment Ref, Ship Date, Owner, etc. Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow the tenant to sort/search/order/query large collections/entities quickly when/if the dataset becomes too large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all, Mongo requires each document to have an _id field, and it indexes it automatically. You should take advantage of this and not create yet another ID field if you require your clients to have an ID field. I'm not sure if that's the case in your application.
What you are asking for can't have a perfect solution, or even a most optimal one, but I can suggest a couple of options:
Create a single-field index for each field in the document, and let Mongo's query optimizer decide which index to use depending on the query. Disadvantages: takes lots of space on disk and in memory, and makes inserts slower. Mongo can use only one index per condition clause, so it will not be able to combine them the way a compound index would. You can easily extract the schema with a tool like this. I wrote this little prototype that analyzes and prints a Mongo schema.
Let your application learn which indexes to create. Get slow queries from the Mongo profiler (in the Mongo log), analyze the common parts (automatically?), and create indexes on the most commonly used fields. That's not so easy to implement, and its effectiveness might change over time if your client changes queries or data. The application will be slow at the start, until it learns about itself :).
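Either way, creating the indexes at runtime via the Java driver is straightforward; a minimal sketch (the helper and whatever triggers it are hypothetical):

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.IndexOptions;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    // Hypothetical helper: index a tenant-defined field at runtime, e.g.
    // when the tenant marks a field as sortable/filterable, or when the
    // profiler analysis flags it as commonly queried.
    void ensureIndex(MongoCollection<Document> collection, String fieldName) {
        // createIndex is idempotent: a no-op if the index already exists.
        collection.createIndex(Indexes.ascending(fieldName),
                new IndexOptions().background(true));
    }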
I would just like to emphasize, in choosing your design: if the ID field you mention (as opposed to _id) is actually some unique entity identifier, then you are better off putting it in _id.
The reason is that the performance trade-off of maintaining another unique index on top of the required _id is considerable overhead. Also, since _id is required, it is the first thing MongoDB looks at when determining which index to use. Otherwise, consider a compound _id field containing your entity information plus some other useful source of uniqueness.
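For illustration, a compound _id along those lines might look like this (the field names and values are made up, and collection is assumed to exist):

    import org.bson.Document;

    // Embed the tenant's entity identifier in _id instead of maintaining a
    // separate unique-indexed field; the extra component keeps it unique
    // across tenants.
    Document doc = new Document("_id",
                new Document("entityId", "CUST-0001")
                        .append("tenant", "acme"))
            .append("firstName", "Ada")
            .append("lastName", "Lovelace");
    collection.insertOne(doc);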
As for the user-defined fields, which are kind of the essence of Mongo documents, for my money I would make it part of the API to set up indexes as required. Depending on the type of searching that is happening, you'll probably want compound indexes, and generated queries that make sense for them.
Simply indexing every field will probably have limited use, as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a longer-term option could be to set indexes according to the usage patterns, but it could take some work to do.