Spring Data Mongo - lookup vs client side filtering - mongodb

I am new to Spring and MongoDB. When I have to process records from more than one collection, is it better to do a lookup in MongoDB or to write the combining code in Spring?

There is no clear-cut answer, because it depends on the following factors:
Data size
Number of collections that will be part of the lookup
Index usage
Query efficiency
It's better to evaluate both options and then decide. This is a good article to understand schema design


How to use the appropriate index while querying mongo using mongospark connectors via withPipeline feature?

I am trying to load a huge amount of data from MongoDB; the data size is in the millions. So it makes sense to pull this data using appropriate indexes and also to query MongoDB in parallel. That's why I am using Mongo Spark for batch reads.
How do I make sure the appropriate index is used when querying MongoDB through the Mongo Spark connector's withPipeline feature?
Also, I was exploring "com.mongodb.reactivestreams.client.MongoCollection". If possible, can someone shed some light on this?

Couchbase BulkGet in spring data couchbase

I am using Couchbase with Spring Data and wish to implement bulkGet of Couchbase. Please let me know the following:
Is it possible via Spring Data?
If yes, can you share an example?
Is findAll (using _all view) comparable to bulkGet in terms of performance?
Can I fetch the _id along with the Couchbase document?
Environment:- Couchbase 4.0, Spring Data 2.0.0.RELEASE, Java 8.
Thanks in Advance!
I assume you are asking about a bulk get in the context of repositories.
First, there is currently no complete support of a "bulkGet" in Spring Data Couchbase. Most of the implementation is based on the SDK synchronous API, and bulk get is something usually done using the asynchronous API, using RxJava.
Note that there is no actual "bulkGet" operation at the protocol level in Couchbase; it's just the SDK issuing multiple single gets and batching them together.
To answer your second question, the above is important. The bulk get pattern discussed in the Couchbase Java SDK documentation (here) gives a slight performance boost because, unlike in synchronous mode, we don't wait for the retrieval of one item before requesting the next.
The findAll() and findAll(Iterable) methods in Spring Data Couchbase both operate on top of a view. The view lets you retrieve only the documents that match the entity type of your repository, but it introduces a level of indirection that can lower performance compared to a pure sequence of key/value gets.
So the closest you could get to a bulk operation like that in Spring Data Couchbase would be to know all the IDs you're interested in and then perform a findOne per ID.
In the near term, the code behind the findAll(Iterable) signature could maybe be improved by applying a bulk get pattern on all provided IDs, but that would mean forgetting about the type checking induced by the view, so I'm not sure...
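The pattern described in this answer (issue every single-key get up front instead of waiting for each one in turn) can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Couchbase SDK or Spring Data, `fetchOne` is a hypothetical stand-in for a single key/value get, an in-memory map plays the role of the bucket, and `CompletableFuture` is swapped in for the RxJava API the SDK actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class BulkGetSketch {

    // Issues one asynchronous get per id, then collects the results in id order.
    // fetchOne is a hypothetical stand-in for a single key/value get.
    static <V> Map<String, V> bulkGet(List<String> ids,
                                      Function<String, V> fetchOne,
                                      ExecutorService pool) {
        List<CompletableFuture<V>> pending = new ArrayList<>();
        for (String id : ids) {
            // Start every get immediately; none waits for the previous one.
            pending.add(CompletableFuture.supplyAsync(() -> fetchOne.apply(id), pool));
        }
        Map<String, V> found = new LinkedHashMap<>();
        for (int i = 0; i < ids.size(); i++) {
            V value = pending.get(i).join(); // waits only for this particular get
            if (value != null) {             // ids with no document are skipped
                found.put(ids.get(i), value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // In-memory map standing in for a bucket (assumption for the demo).
        Map<String, String> bucket = new HashMap<>();
        bucket.put("user::1", "Alice");
        bucket.put("user::2", "Bob");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Map<String, String> docs = bulkGet(
                    Arrays.asList("user::1", "user::404", "user::2"), bucket::get, pool);
            System.out.println(docs); // prints {user::1=Alice, user::2=Bob}
        } finally {
            pool.shutdown();
        }
    }
}
```

As the answer notes, this kind of key-based retrieval does no entity type checking, so the caller has to know that every id belongs to the expected document type.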

Denormalization vs Parent Referencing vs MapReduce

I have a highly normalized data model with me. Currently I'm using manual referencing by storing the _id and running sequential queries to fetch details from the deepest collection.
The referencing is one-way and the flow has around 5-6 collections. For one particular use case, I'm having to query down to the deepest collection by querying subsequent "_id" from the higher level collections. So technically I'm hitting the database every time I run a
db.collection_name.find({_id: ****}).
My prime goal is to optimize the read without hugely affecting the atomicity of the other collections. I have read about de-normalization and it does not make sense to me because I want to keep an option for changing the cardinality down the line and hence want to maintain a separate collection altogether.
I was initially thinking of using MapReduce to do an aggregation from the back and have a collection primarily for the particular use-case. But well even that does not sound that good.
In a relational db, I would be breaking the query in sub-queries and performing a join to get the data sets that intersect from the initial results. Since mongodb does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.
Denormalize your data.
MongoDB is not built around JOINs.
Before version 3.2 there was no operation on the database that gets data from more than one collection: not find(), not aggregate() and not MapReduce. Since 3.2 the aggregation pipeline has a $lookup stage that performs a left outer join against another collection in the same database, but that does not change the advice here. Otherwise, when you need to piece your data together from more than one collection, there is no other way than doing it on the application layer. For that reason you should organize your data in a way that any common and performance-relevant query can be resolved by querying just a single collection.
In order to do that you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
If this feels "dirty" to you, then you should either accept that your performance will be sub-optimal or use a different kind of database, like a classic relational database or a graph database.

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need information on the following.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries that were inserted after the integration?
If we need to periodically update Solr, maintaining the data in Solr as well as in MongoDB becomes an extra overhead. What are the best approaches to overcome it?
As far as I know there is no official (supported and complete) solution for integrating MongoDB and Solr, but let me give you some ideas and direction.
For me the best approach, when it is possible to modify the application, is to have the persistence layer perform every write operation in both MongoDB and Solr at the "same" time. That way you control exactly what you send to the database and what you index for full-text operations. But as I said, this means changing your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
You can use a "connector" approach where MongoDB and Solr are kind of connected together. This can be done in various ways.
You can use for example the MongoDB Connector available here : https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here : http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it so cannot comment, but it is also an approach)
Your point #2 is true: you have to manage two clusters and make sure the data are in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB. So you need to see whether the best approach for your application is MongoDB alone or MongoDB with Solr (see comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval", but I am not sure that should be the reason. If you write correct queries with correct indexes in MongoDB, you should be able to get that without Solr. If your requirement is really oriented towards the power of Solr, meaning full-text indexing with all its related features, then it makes sense.
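The first approach in this answer (the persistence layer writes to both stores and decides what gets indexed) might look roughly like the sketch below. It is deliberately free of any MongoDB or Solr client code: the two `Consumer` parameters are hypothetical stand-ins for "insert into MongoDB" and "send to Solr", and the class and field names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class DualWriteSketch {
    private final Consumer<Map<String, Object>> database;    // stand-in for the MongoDB write
    private final Consumer<Map<String, Object>> searchIndex; // stand-in for the Solr update
    private final List<String> indexedFields;                // fields worth searching on

    public DualWriteSketch(Consumer<Map<String, Object>> database,
                           Consumer<Map<String, Object>> searchIndex,
                           List<String> indexedFields) {
        this.database = database;
        this.searchIndex = searchIndex;
        this.indexedFields = indexedFields;
    }

    // Stores the full document, then sends only the searchable fields to the index.
    // The two writes are not atomic: if the second one fails, the stores drift apart.
    public void save(Map<String, Object> document) {
        database.accept(document);

        Map<String, Object> indexDoc = new LinkedHashMap<>();
        indexDoc.put("id", document.get("_id"));
        for (String field : indexedFields) {
            if (document.containsKey(field)) {
                indexDoc.put(field, document.get(field));
            }
        }
        searchIndex.accept(indexDoc);
    }

    public static void main(String[] args) {
        // Lists standing in for the two stores (assumption for the demo).
        List<Map<String, Object>> stored = new ArrayList<>();
        List<Map<String, Object>> indexed = new ArrayList<>();
        DualWriteSketch repository =
                new DualWriteSketch(stored::add, indexed::add, Arrays.asList("title"));

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", 42);
        doc.put("title", "Hello");
        doc.put("internalNotes", "not searchable");
        repository.save(doc);

        System.out.println(indexed.get(0)); // prints {id=42, title=Hello}
    }
}
```

Because the two writes are separate calls, a real implementation would still need some way to detect and repair the inconsistency mentioned in the answer, for example a periodic re-index.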
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify whether your need fits into MongoDB or whether you need to spill over to Solr.
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job that re-indexes once a day may work for you. Ideally Solr would work well for some form of master data.

MongoDB Shard key - enforce in every query

If we use a compound shard key, say {a,b}, is it possible to throw an error at the Java driver level for any query that does not include these fields? That is, are there any callbacks or lifecycle events that fire before the query is executed? AbstractMongoEventListener offers onAfterLoad and onAfterConvert, but our requirement is to intervene before the query is executed, at the Java driver level.
I understand why you want this capability: if the query does not include even a single shard key field as part of its criteria, it results in "scatter and gather" queries, which cause significant performance degradation. But best practice for APIs suggests that an API should be designed for a single purpose and be generic; adding this capability to the Java driver would impose an additional constraint that might not be required. Hence there is no out-of-the-box API that does this for you.
What you can do to make it work: write a wrapper on top of this API with the additional capability.
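A wrapper of that kind could be sketched as follows. Everything here is an assumption made for illustration: the filter is modelled as a plain `Map` rather than a driver query object, `delegate` is a hypothetical stand-in for whatever actually runs the query, and the check requires every shard key field to appear as a top-level key (it does not look inside operators such as `$and` or `$or`).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class ShardKeyGuard<R> {
    private final Set<String> shardKeyFields;
    private final Function<Map<String, Object>, R> delegate; // stand-in for the real query call

    public ShardKeyGuard(Set<String> shardKeyFields,
                         Function<Map<String, Object>, R> delegate) {
        this.shardKeyFields = new LinkedHashSet<>(shardKeyFields);
        this.delegate = delegate;
    }

    // Rejects the filter before execution if any shard key field is absent.
    public R find(Map<String, Object> filter) {
        Set<String> missing = new LinkedHashSet<>(shardKeyFields);
        missing.removeAll(filter.keySet());
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Query is missing shard key field(s): " + missing);
        }
        return delegate.apply(filter);
    }

    public static void main(String[] args) {
        ShardKeyGuard<String> guard = new ShardKeyGuard<>(
                new LinkedHashSet<>(Arrays.asList("a", "b")),
                filter -> "executed " + filter);

        Map<String, Object> targeted = new LinkedHashMap<>();
        targeted.put("a", 1);
        targeted.put("b", 2);
        System.out.println(guard.find(targeted)); // prints executed {a=1, b=2}

        Map<String, Object> scattered = new LinkedHashMap<>();
        scattered.put("a", 1);
        try {
            guard.find(scattered);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Query is missing shard key field(s): [b]
        }
    }
}
```

Routing all data access through one such class keeps the rule in a single place; whether to require the full key or only a prefix of it is a policy choice for the application.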