Hadoop MongoInputFormat Read Preference does not work with MongoDB shard - mongodb

I have a Spark job (PySpark) running a query against MongoDB, and I intend the query to be served by a secondary.
My mongo input URI looks like this:
'mongo.input.uri': 'mongodb://'+host+':27017/'+dbName+'.'+collection+'?readPreference=secondary'
As I understand it, passing readPreference=secondary as an option in the mongo input URI is the way to make it read from a secondary (ref: https://github.com/mongodb/mongo-hadoop/wiki/Configuration-Reference).
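For reference, a minimal sketch of how such a configuration is typically handed to MongoInputFormat from PySpark (the host, database, and collection names are placeholders, and the key/value classes are the ones commonly shown in mongo-hadoop examples, so treat the details as assumptions):

from pyspark import SparkContext

# Requires the mongo-hadoop-core jar on the Spark classpath (e.g. via --jars)
sc = SparkContext(appName="mongo-secondary-read")

host = "mongos-host"        # placeholder mongos/replica-set address
dbName = "mydb"             # placeholder database name
collection = "mycoll"       # placeholder collection name

config = {
    # readPreference is passed as an ordinary MongoDB URI option
    "mongo.input.uri": "mongodb://" + host + ":27017/" + dbName + "." + collection
                       + "?readPreference=secondary"
}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=config,
)
print(rdd.count())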
However, when I run the job, I see a spike in faults on the monitoring of my primary nodes. Checking the logs on the primary nodes confirms that the query runs against the primary instead of a secondary.
What did I do wrong? Did I put the configuration in the wrong place? Does this only work for a plain replica set, and not for a sharded cluster?
Note:
The MongoDB deployment is sharded, running version 2.6.x.

Related

About MongoDB: does the router server need to restart after adding a shard?

I built a MongoDB sharding environment and want to test the performance of migrating data.
I inserted one billion rows into a collection in Replica Set A.
Then I added another shard, Replica Set B.
MongoDB started to balance chunks between those shards.
After balancing finished, I found that I couldn't look up some data.
That data had been moved to Replica Set B, and only after restarting all the mongos router services am I able to query it.
Is this a normal and unavoidable procedure, or is there a way to reload the whole system (through a mongo shell command or anything else)?
Thank you !!!
I found a command that seems to reload the router config:
db.adminCommand({"flushRouterConfig":1});
2017-05-18: after testing, it works!
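If you script this, a small sketch (router hostnames are placeholders) of running the same command against every mongos with PyMongo:

from pymongo import MongoClient

mongos_hosts = ["router1:27017", "router2:27017"]   # placeholder router addresses

for host in mongos_hosts:
    client = MongoClient(host)
    # Equivalent to db.adminCommand({"flushRouterConfig": 1}) in the mongo shell
    print(host, client.admin.command("flushRouterConfig"))
    client.close()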

Setting up mongo replication in production

How do you set up MongoDB replication in production environments? I started using CloudFormation with this template, but it fails halfway through. I want to set up MongoDB with one primary and two secondaries.
I haven't found a good tutorial on how to set up MongoDB replication.
Some other questions I have are:
How does failover work? If I have three EC2 instances, each running mongod, and the primary fails, another instance becomes the primary, but how do my clients (PyMongo and the Scala Mongo driver) learn the address of the new primary?
Let's say the primary goes down for an hour and there are 2,000 writes in the meantime. When it comes back up, how does that node get updated? Do I need a script for this?
I am trying to do this with Flask and PyMongo.
I ended up testing this on my local machine; here is what I found.
Failover is handled by the client: in the Mongo URI you list all of your replica set members, and when PyMongo connects it checks which member is the primary and writes to that one.
When a member comes back up, the members sync so that they all hold the same records.
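A minimal sketch of that behaviour with PyMongo (the host names, replica set name, and collection are made up):

from pymongo import MongoClient

# The URI lists every member plus the replica set name; the driver discovers
# the current primary from this seed list.
client = MongoClient(
    "mongodb://ec2-node1:27017,ec2-node2:27017,ec2-node3:27017/?replicaSet=rs0"
)

# Writes always go to whichever member is currently primary.
client.mydb.events.insert_one({"msg": "hello"})

# After a failover the driver re-discovers the new primary automatically;
# the application never needs to know its IP address.
print(client.primary)   # (host, port) of the current primary, or None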
Read the Docs hosts a step-by-step manual on setting up a MongoDB cluster on different platforms, including AWS EC2:
https://mongodb-documentation.readthedocs.io/en/latest/ecosystem/tutorial/install-mongodb-on-amazon-ec2.html#deploy-a-multi-node-replica-set
To provide your clients with a working mongo instance you can employ several different strategies. For example:
Set up Route53 failover. Route53 will monitor the health of the primary node and change the DNS record to point to a secondary in case of failure.
Use service discovery. Consul, etcd, ZooKeeper, and doozerd are worth exploring.
If a MongoDB node fails and then comes back, it will receive the latest data from the other nodes; that is just what a replica set does.

MongoDB write operations in a sharded setting

I am new to MongoDB. While going through a tutorial, a question came to mind: in a sharded environment, during a read operation, mongos first checks the config server to find out which shard it has to query. But what about a write operation? Does it first check which shard it has to write to?
Thanks in advance,
Kitty
I am going to answer based on the current stable release of MongoDB v3.2.
The config servers store the cluster's metadata in the config database. The mongos instances cache this data and use it to route reads and writes to shards.
MongoDB only writes data to the config servers when the metadata changes, such as:
After a chunk migration, or
After a chunk split.
MongoDB reads data from the config server in the following cases:
A new mongos starts for the first time, or an existing mongos restarts.
After change in the cluster metadata, such as after a chunk migration.
See also: Sharded Cluster Mechanics.
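To see the metadata that mongos caches for routing, here is a small sketch (the connection string is an assumption) that reads it from the config database through a mongos:

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect through a mongos

# The shards registered in the cluster
for shard in client.config.shards.find():
    print(shard["_id"], shard["host"])

# Chunk ranges and the shard each one lives on; this mapping is what
# routes both reads and writes for sharded collections.
for chunk in client.config.chunks.find().limit(5):
    print(chunk["ns"], chunk["min"], chunk["max"], "->", chunk["shard"])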

One Shard with Multiple Mongos

Can we have this type of configuration?
Two servers, each running the following:
1. Mongo config server
2. Mongo router (mongos)
3. Application
Four EC2 servers in total:
First server: running the web application & mongos.
Second server: running the web application & mongos.
Third server: running the first shard with a complete DB (say, for example, Demo).
Fourth server: running the second shard with a complete DB (say, for example, Demo).
Should both mongos instances point to one shard named Shard1?
Yes, you can have multiple mongos instances running against a single shard. Think of the mongos instances as clients for the sharded cluster that have to run as daemon processes in order to keep metadata and heartbeats up to date.
Edit: as for having a complete DB on one shard, this is only possible per database. You can have one DB on shard1 and the other DB on shard2, for example, but you can never have a single complete DB on two shards. To achieve the goal of having db1 on shard1 and db2 on shard2, you simply make the respective shard the primary shard of the respective database and don't shard any collection. Please read the docs for the movePrimary command for details (see the sketch below).
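A small sketch of that approach (shard, database, and host names are placeholders), issuing movePrimary through PyMongo instead of the shell:

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # must connect via a mongos

# Equivalent to db.adminCommand({movePrimary: "db1", to: "shard0000"}) in the shell
client.admin.command("movePrimary", "db1", to="shard0000")
client.admin.command("movePrimary", "db2", to="shard0001")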
A bit off-topic:
However, running a single config server is strongly advised against, and for good reason. If the single config server goes down or gets corrupted, your cluster will be impossible to use, and recreating the sharded cluster will not be an easy task; it is going to be a lengthy process. So please, use three config servers.

mongodb single DB replication

I have a working MongoDB replica set made up of 3 servers.
It stores two DBs, and I wonder whether it is possible to replicate only one of the DBs without running more than one MongoDB instance (one per DB).
Here is a sketch of the "problem":
        Server1   Server2   Server3
DB1        X         X         X
DB2        X         X
An X marks a server the DB has to be replicated to.
Thanks.
I don't believe it is possible.
Unlike sharding, where you specify down to the collection level what gets sharded, with replica sets you are defining that a given MongoDB instance is part of a replica set. Since only one node in a replica set can be the master at any given time, in the scenario you describe there would be a problem if, for example, Server1 went down and Server3 was promoted to master, as DB2 could then no longer be written to.
I had a similar problem and found a fairly easy solution in JavaScript to be executed in a mongo shell.
Source code available here:
http://www.suenkel.de/blog/2012/02/mongodb-replicate-one-database-or-collection/
By opening a tailable cursor on the oplog of the master server, each operation can be applied to another server (of course you can filter by the namespace of the collections or even by database...).
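A rough sketch of the same idea in Python with PyMongo (host names and the namespace filter are assumptions, and a real script also needs resume and error handling):

import time
from pymongo import MongoClient, CursorType

source = MongoClient("mongodb://primary-host:27017")     # replica set primary
target = MongoClient("mongodb://other-server:27017")     # standalone target

oplog = source.local["oplog.rs"]

# Tailable cursor over the oplog, filtered to operations on DB2 only
cursor = oplog.find(
    {"ns": {"$regex": "^DB2\\."}},
    cursor_type=CursorType.TAILABLE_AWAIT,
)

while cursor.alive:
    for op in cursor:
        if op["op"] == "i":                      # insert
            db, coll = op["ns"].split(".", 1)
            target[db][coll].insert_one(op["o"])
        # 'u' (update) and 'd' (delete) entries would be handled similarly
    time.sleep(1)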
With the current MongoDB replica set architecture, you can't have a single replica set in which some members hold only part of the databases or collections.
However, if you need to replicate a single database or collection in real time to another location, I ended up with the following workaround:
Use directoryPerDB to separate the desired database's files (create a new replica with this option enabled if you don't have it already).
Copy the directory of desired database to the new location.
Deploy a new ReplicaSet with this single database.
Write a simple script and use Change Streams to perform the replication for you.
As I said, you will end up with another replica set dedicated to this database, but replication happens in real time and both replica sets hold the data consistently (you have to perform your write operations on the first replica set, though).
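A minimal sketch of the change-stream script from step 4 (the hosts, database name, and replica set names are placeholders; change streams require MongoDB 3.6 or newer):

from pymongo import MongoClient

source = MongoClient("mongodb://source-host:27017/?replicaSet=rs0")
target = MongoClient("mongodb://dedicated-host:27017/?replicaSet=rs1")

# Watch every collection in the source database and replay the changes
# into the dedicated replica set.
with source["db1"].watch(full_document="updateLookup") as stream:
    for change in stream:
        ns = change["ns"]
        coll = target[ns["db"]][ns["coll"]]
        if change["operationType"] in ("insert", "replace", "update"):
            coll.replace_one(
                {"_id": change["documentKey"]["_id"]},
                change["fullDocument"],
                upsert=True,
            )
        elif change["operationType"] == "delete":
            coll.delete_one({"_id": change["documentKey"]["_id"]})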