GridFS with location-based sharding - mongodb

Is there any way to shard the GridFS files and chunks based on location information?
I want to set up a MongoDB configuration with sharding and replication that handles multiple sites and guarantees that the data produced on one site remains accessible even if the sites are isolated from each other.
I followed this MongoDB documentation, and it works perfectly for standard collections but not for GridFS:
https://docs.mongodb.com/v3.2/tutorial/sharding-segmenting-data-by-location/
All I found for GridFS was _id-based or files_id-based sharding.
So far the only solution I have is to use different GridFS prefixes, which means different collections, and to shard each of them to a given site.
I'm using:
MongoDB 3.2
Java driver
Python driver
Thanks for any advice.

I found an answer on StackExchange:
How to define shardKey on GridFS collections to achieve Location/Data Centre affinity (MongoDB)
Given that the only possibility is to shard the chunks on the ID, the idea is to use MongoGridFSCreateOptions to force a location-based ID on the chunks.
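A minimal sketch of that idea in Python, not taken from the linked answer: build a file ID that starts with a site code, so that a range-based shard key on the ID can be split into one key range per site. The site codes, the helper names and the range layout are all hypothetical. With a real driver, the generated value would be supplied as the file's ID when the file is created.

```python
import uuid
from typing import Tuple

# Hypothetical site codes; each site owns the key range [code + ":", code + ";").
SITES = ("eu", "us")


def make_file_id(site: str) -> str:
    """Return a file ID that sorts inside the key range owned by `site`."""
    if site not in SITES:
        raise ValueError(f"unknown site: {site}")
    return f"{site}:{uuid.uuid4().hex}"


def zone_range(site: str) -> Tuple[str, str]:
    """Lower (inclusive) and upper (exclusive) bound of the site's key range."""
    # ";" is the character right after ":" in ASCII, so every ID with the
    # prefix "<site>:" falls inside [lower, upper).
    return f"{site}:", f"{site};"


def site_for(file_id: str) -> str:
    """Find which site's range a file ID (or a chunk's files_id) falls into."""
    for site in SITES:
        lower, upper = zone_range(site)
        if lower <= file_id < upper:
            return site
    raise LookupError(f"no range covers {file_id!r}")


file_id = make_file_id("eu")
print(file_id, "->", site_for(file_id))
```

The bounds returned by `zone_range` are the kind of per-site ranges that the location tutorial assigns to shards. Whether this layout suits a given deployment is an assumption to verify.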

Related

Which approach of search is feasible (elastic search + mongoDB) or mongoDB text indexes

In my project I have to implement text search and have to choose a feasible
approach between these two:
Synchronising the MongoDB database with Elasticsearch.
MongoDB's own text indexes, which have Elasticsearch-like text searching
capabilities.
I have gone through many articles that list the pros of each option, but I haven't found any document that compares the two approaches, says which one is better, or describes the limitations of each.
Note: I am using Node.js with Express.js.
MongoDB is a general-purpose database; Elasticsearch is a distributed text search engine backed by Lucene. You can store data in MongoDB and use Elasticsearch exclusively for its full-text search capabilities. Depending on your use case, you can send only a subset of the MongoDB fields to Elasticsearch.
Synchronization of MongoDB with Elasticsearch can be done with mongoosastic. You address the data-safety concern by persisting in MongoDB and speed up your search by using Elasticsearch. You can also use a Python script with the elasticsearch.py package to sync MongoDB and Elasticsearch.
MongoDB text search is very slow compared to Elasticsearch. Indexing in MongoDB also takes more time and more resources.

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need information on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries that were inserted after the integration?
If we need to periodically update Solr, maintaining the data in Solr as well as in MongoDB becomes an extra overhead. What are the best approaches for overcoming it?
As far as I know there is no official (supported/complete) solution for integrating MongoDB and Solr, but let me give you some ideas and direction.
For me the best approach, when it is possible to modify the application, is to have the persistence layer perform every write operation in MongoDB and in Solr at the "same" time. That way you control exactly what you send to the database and what you index for full-text operations. But as I said, this means you have to change your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
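A minimal sketch of such a persistence layer in Python, with in-memory stand-ins in place of real MongoDB and Solr clients. The class names, the `upsert` method and the list of indexed fields are hypothetical and only illustrate the shape of the approach.

```python
class InMemoryStore:
    """Hypothetical stand-in for a MongoDB collection or a Solr core."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, doc):
        self.docs[doc_id] = dict(doc)


class DualWriteRepository:
    """Persistence layer that sends every write to the database and the index."""

    INDEXED_FIELDS = ("title", "body")  # assumed: only these fields are searchable

    def __init__(self, database, search_index):
        self.database = database
        self.search_index = search_index

    def _index(self, doc_id, doc):
        subset = {k: v for k, v in doc.items() if k in self.INDEXED_FIELDS}
        self.search_index.upsert(doc_id, subset)

    def save(self, doc_id, doc):
        self.database.upsert(doc_id, doc)  # full document goes to the database
        self._index(doc_id, doc)           # searchable subset goes to the index

    def reindex_all(self):
        """Index documents that existed before the search index was added."""
        for doc_id, doc in self.database.docs.items():
            self._index(doc_id, doc)


database, index = InMemoryStore(), InMemoryStore()
database.upsert(1, {"title": "old post", "body": "written earlier", "views": 10})

repo = DualWriteRepository(database, index)
repo.reindex_all()  # first-time indexing of existing documents
repo.save(2, {"title": "new post", "body": "written now", "views": 0})
print(index.docs)
```

The two writes in `save` are not atomic, so a real implementation would still need a way to repair the index after a partial failure. `reindex_all` could serve for that as well as for the first-time indexing.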
You can also use a "connector" approach, where MongoDB and Solr are connected together; this can be done in various ways.
For example, you can use the MongoDB Connector available here: https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here: http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it so I cannot comment, but it is also an approach.)
Your point #2 is true: you have to manage two clusters, make sure the data stay in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB. So you need to see whether the best approach for your application is MongoDB alone or MongoDB with Solr (see the comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval"; I am not sure that should be the reason. If you write correct queries with correct indexes in MongoDB, you should be able to do this without Solr. If your requirement is really about the power of Solr, meaning full-text indexing with all its related features, then it makes sense.
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to work out whether your need fits into MongoDB or whether you need to spill over to Solr.
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job that re-indexes once a day may work for you. Ideally, Solr would work well for some form of master data.

Shard key and how to choose it?

I'm new to NoSQL databases and am now using MongoDB. I have a question about the MongoDB shard key: what does it actually do? Is it related to query performance? And how can we choose a good shard key for a collection?
Thanks in advance
From 10gen's docs: http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
Choosing a shard-key is very dependent on your data and your use case.
Here's some more documentation you may find relevant:
http://docs.mongodb.org/manual/faq/sharding/
http://docs.mongodb.org/manual/sharding/
Specifically:
http://docs.mongodb.org/manual/core/sharding/
Essentially sharding allows you to partition your data across different servers. This means different writes/reads are going to different servers -- distributing the load of the application across multiple servers.
The shard key is the value in the collection that is evaluated to determine which shard/server a document is routed to.
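As a rough illustration of that routing, here is a small Python simulation of range-based placement. The split points, shard names and the `user_id` key are made up for the example. A real cluster keeps this mapping in its config servers and adjusts it as data grows.

```python
from bisect import bisect_right

# Hypothetical boundaries on a numeric shard key: each shard owns a contiguous
# range of key values (lower bound inclusive, upper bound exclusive).
SPLIT_POINTS = [1000, 2000, 3000]
SHARDS = ["shard0", "shard1", "shard2", "shard3"]


def route(shard_key_value: int) -> str:
    """Return the shard whose key range contains the given value."""
    return SHARDS[bisect_right(SPLIT_POINTS, shard_key_value)]


documents = [{"user_id": 42}, {"user_id": 1500}, {"user_id": 2999}, {"user_id": 7000}]
placement = {doc["user_id"]: route(doc["user_id"]) for doc in documents}
print(placement)
```

A query that includes the shard key can be sent to a single shard in the same way, which is one reason the choice of key affects query performance.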
You can find more explanation of shard key selection and how it works in Kristina Chodorow's book "Scaling MongoDB".
Check out this also

Using GridFS - Should it be on a separate DB?

I am building a site that stores a lot of audio (terabytes), and I want to use GridFS so that I can shard it and easily expand the database across multiple machines.
My question is: would it be better to put the files in a separate MongoDB database? There will be a good number of documents in MongoDB, and I am not sure what happens when you start sharding the GridFS portion.
Thanks!
Even if you keep the GridFS storage in the same database as your other collections, you can still choose which collections to shard (or not) when you need to move to sharding. That said, if you have it in a separate database, you will be able to more easily move it to a separate cluster if you so choose -- so you could, for instance, have a 3 shard cluster for your "main" collections and a 5 shard cluster for GridFS (or any other configuration you choose).
As far as sharding GridFS collections, please see the MongoDB docs on choosing a shard key for GridFS. Commonly, people shard the chunks collection (which is where the file data itself is stored) on files_id so that all chunks for the same file reside on the same shard. Again, please see the documentation page for more detail.
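A small sketch of what that setup could look like, written as Python dictionaries for the `enableSharding` and `shardCollection` admin commands. The database name `audio` and the helper function are hypothetical, and the key shown is the plain `files_id` key mentioned above.

```python
def gridfs_sharding_commands(database: str, bucket: str = "fs") -> list:
    """Build admin command documents for sharding a GridFS chunks collection."""
    return [
        # Sharding has to be enabled on the database first.
        {"enableSharding": database},
        # Shard only the chunks collection, keeping a file's chunks together.
        {"shardCollection": f"{database}.{bucket}.chunks", "key": {"files_id": 1}},
    ]


for command in gridfs_sharding_commands("audio"):
    print(command)
```

With a driver, each of these documents would be run against the admin database through a mongos, for example with PyMongo's `client.admin.command(...)`. The sketch only builds the documents, so it runs without a cluster.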

Sharding GridFS on MongoDB

I'm reading up on GridFS and the possibility of sharding it across different machines.
According to the documentation here, the suggested shard key is chunks.files_id. This key is linked to the _id of the files collection, so it is incremental: every new file I save in the grid gets a new, increasing _id.
The O'Reilly book "Scaling MongoDB" discourages the use of an incremental shard key in order to avoid hotspots (the last shard receives all the writes and reads).
What is your suggestion for sharding the GridFS collection?
Has anybody experienced the hotspot problem?
thank you.
You should shard on files_id to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _ids in the fs.files collection (probably MD5s would be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
You can shard GridFS data because GridFS is just two collections: chunks and files. Sharding GridFS is a very useful thing. As for the GridFS shard key, it is always bad to choose a random or incremental shard key, because the data is not evenly distributed across shards. With an incremental shard key all writes go to the last shard, which keeps growing; once the difference between shards reaches 10 or more chunks, the balancer moves data to other shards. Moving data to another shard is always a difficult task that should be avoided as much as possible.
So when you choose a shard key you should care about even distribution of data.
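The hotspot with an incremental key can be seen in a small Python simulation. The split points, the number of shards and the MD5-to-bucket mapping below are assumptions made for the illustration and do not describe how MongoDB places data internally.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

SHARDS = 4

# Hypothetical split points, chosen when the collection held ids 0..999.
split_points = [250, 500, 750]


def route_range(key: int) -> int:
    """Range-based routing: index of the key range that contains `key`."""
    return bisect_right(split_points, key)


def md5_bucket(file_id: int) -> int:
    """Map the MD5 of the id onto one of SHARDS equal slices of the hash space."""
    digest = hashlib.md5(str(file_id).encode()).digest()
    return digest[0] * SHARDS // 256


new_ids = range(1000, 2000)  # the next 1000 files get ever-increasing ids

incremental = Counter(route_range(file_id) for file_id in new_ids)
hashed = Counter(md5_bucket(file_id) for file_id in new_ids)

print("incremental:", dict(incremental))
print("md5-based:  ", dict(hashed))
```

All 1000 new incremental ids fall into the last range, while the MD5-derived values are spread over the four slices. This is the effect that the suggestion to use MD5s instead of ObjectIds aims for.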
Also, if you are lucky, maybe the author of "Scaling MongoDB", Kristina (a great specialist in shard keys), will answer your question.
The documentation says that in common cases you should choose the default index files_id: 1, n: 1 as the shard key:
There are different ways that GridFS can be sharded, depending on the need. One common way to shard, based on pre-existing indexes, is:
"files" collection is not sharded. All file records will live in 1 shard. It is highly recommended to make that shard very resilient (at least 3 node replica set)
"chunks" collection gets sharded using the existing index "files_id: 1, n: 1". Some files at the end of ranges may have their chunks split across shards, but most files will be fully contained within the same shard.
Currently, as of version 1.8.1, MongoDB supports sharding GridFS only on the "files_id" field, because md5 is used to verify the upload and it doesn't work across shards yet. So you cannot split a single file across shards.
Answer on Google group