Executing Mongo mapreduce jobs on a machine without mongo installed - mongodb

I have a set of mapreduce jobs which I need to execute from my Java program. Right now I am executing them via a Java Process calling
$MONGO_HOME/bin/mongo host:port/database jsFiles
Is there a way I could execute these mapreduce tasks on a machine that does not have Mongo installed? Does the Mongo Java driver support such functionality?
Thanks!

MongoDB MapReduce jobs are always run on the Mongo server, never in the client, and any client can send a job to the server.
@Chris Shain pointed you to the docs (http://api.mongodb.org/java/current/com/mongodb/MapReduceCommand.html), and I recommend you read them. Understand, though, that most MapReduce operations are about reducing the huge volume of data stored in your database down to smaller result sets. The closer this happens to where the data is actually stored, the better, and most people do not execute commands directly on the server, so for MapReduce to be useful, Mongo had to (and did!) provide a way to invoke it from the client. For general strategies, see here: http://www.mongodb.org/display/DOCS/MapReduce
Note that because the operation runs on the server, you may notice increased lock percentage. Consider running the MapReduce job on a slave or secondary Mongo instance if this is a problem for you.

The Java client driver for Mongo has the MapReduceCommand, documented here: http://api.mongodb.org/java/current/com/mongodb/MapReduceCommand.html
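To make the map/reduce contract concrete, here is a pure-Python sketch of what the server does with your map and reduce functions once a client submits a job (this is an illustration of the semantics, not driver code; a real job goes through MapReduceCommand against a live mongod):

```python
from collections import defaultdict

# Sample documents, standing in for a collection of orders (hypothetical data).
docs = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]

def map_fn(doc):
    # Plays the role of the JS map function: emit(key, value) per document.
    yield doc["customer"], doc["amount"]

def reduce_fn(key, values):
    # Plays the role of the JS reduce function: must be associative and
    # commutative, collapsing all emitted values for one key.
    return sum(values)

# The server groups emitted values by key, then reduces each group.
emitted = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        emitted[key].append(value)

result = {key: reduce_fn(key, values) for key, values in emitted.items()}
print(result)  # {'a': 17, 'b': 5}
```

Note how the three input documents collapse into two result documents; that reduction is exactly the work you want happening next to the data on the server, not on the client.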

Related

Under which circumstances can documents inserted with insert_many not appear in the DB?

We are using pymongo 3.12, Python 3.12, and MongoDB 4.2. We write the results of our tasks from a Celery worker process into MongoDB using pymongo. MongoClient is not instantiated on each request; we rely on connection pooling and reuse connections. There are multiple instances of Celery workers competing to run a job, so our Celery server has multiple connections to MongoDB.
The problem: sometimes the results of one particular operation are not in MongoDB, yet no error is logged, and our code captures all exceptions, so it looks like no exception ever happens. We use a plain insert_many with default parameters, which means our insert is ordered and any failure should trigger an error. Only this particular operation fails; others that read or write data from/to the same or another MongoDB instance work fine. The problem can be reproduced on different systems. We added the maxIdleTimeMS parameter to the MongoDB connection in order to close idle connections, but it did not help.
Is there a way to tell programmatically which local port is used by the pymongo connection that is going to serve my request?

In MongoDB, can I run the compact command without shutting down each instance?

In our server structure, the primary, secondary, and arbiter each run on separate physical machines.
The MongoDB version is 4.2.3.
We deleted the oldest documents because too many had accumulated in a specific collection.
However, even deleting documents did not release the storage area.
Upon checking, I found that mongodb's mechanism retains reusable bytes even if the document is deleted.
Also, I found out that unnecessary disk space can be freed with the compact command in the WiredTiger engine.
Currently, all clients connected to the db are querying using the arbiter ip and port.
Since the DB uses only replication, not sharding, I expect that if I run the compact command on each instance independently, then even while one instance is locked, queries will be directed to the currently available instances.
Is this possible?
Or should I shut down each instance, run it standalone, run the compact command, and then reconfigure the PSA set?
You may upgrade your MongoDB to the latest version, 4.4. From the documentation for compact:
Blocking
Changed in version 4.4.
Starting in v4.4, on WiredTiger, compact only blocks the following metadata operations:
db.collection.drop
db.collection.createIndex and db.collection.createIndexes
db.collection.dropIndex and db.collection.dropIndexes
compact does not block MongoDB CRUD operations for the database it is currently operating on.
Before v4.4, compact blocked all operations for the database it was compacting, including MongoDB CRUD operations, and was therefore recommended for use only during scheduled maintenance periods. Starting in v4.4, the compact command is appropriate for use at any time.
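For reference, compact is an administrative command run against the database that holds the collection. A minimal sketch of the command document follows (the collection name "mycollection" is hypothetical; actually sending it requires a live mongod, e.g. via pymongo's db.command, and the force flag was historically required to run compact on a replica-set primary, so check your version's docs):

```python
# The compact command document as it would be sent to the server.
# "compact" names the target collection.
compact_cmd = {"compact": "mycollection", "force": True}

# With a live server and pymongo this would be roughly:
#   client = pymongo.MongoClient(...)
#   client.mydb.command(compact_cmd)
print(compact_cmd)
```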
To anyone looking for the answer with 4.4: please see this bug and the documentation entry, as the compact routine still forces the node into the RECOVERING state if you are running in a replica set (and I assume this is the default use case for most projects).

MongoDB inserts slow down when in Replica Set mode

I'm running MongoDB 2.4.5, and recently I've started digging into replica sets to get some redundancy.
I started the same mongo instance with the --replSet parameter and also added an arbiter to the running replica set. What happened was that writes to mongo slowed down significantly (from 15 ms to 30-60 ms, sometimes even around 300 ms). As soon as I restarted it in non-replica-set mode, performance went back to normal.
I also set up the newest 3.0 version of MongoDB with no data and ran the same tester as before, and the result was quite similar: writes were at least 50% slower while running in replica set mode.
I could not find many examples of such behaviour online so I guess something is wrong with my mongo configuration or OS configuration.
Any ideas? Thanks for help.
It sounds like you're using "replica acknowledged" write concern, which means the operation will not return until the data has been written to both the primary and a replica. The write concern can be set when doing any write operation from 2.6 onwards; the 2.4 documentation suggests that calling getLastError causes a write concern of replica acknowledged in 2.4. Are you doing that in your test code?
Read this section (v3) or this section (v2.4) of the MongoDB documentation to understand the implications of different write concerns, and try explicitly setting it to Acknowledged.
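As a rough sketch of the difference, these are the write concern documents involved (the field names are from the write concern spec; which one is right for you depends on your durability requirements):

```python
# Write concern documents as they appear on the wire.
acknowledged = {"w": 1}           # primary only confirms -- one round trip
replica_ack = {"w": 2}            # primary plus one secondary must confirm,
                                  # adding replication latency to every write
majority = {"w": "majority"}      # a majority of data-bearing members confirm
print(acknowledged, replica_ack, majority)
```

With {"w": 2} every write waits for replication to a secondary before returning, which is consistent with the 2x-and-worse latencies described in the question.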
Okay, so the problem was the C# library. I switched to the native C# driver (which works fine even with MongoDB 2.4.5) and there no longer seems to be any difference in performance. Thanks for the help.

Does MongoDB's aggregate engine run on the server or the client?

In designing an infrastructure I'd like to put as much processing into the application servers as opposed to my database servers - as my application servers can scale horizontally more easily.
So when thinking about how I would use MongoDB I wondered if MongoDB would utilise the computational power of the client (the mongo-client on the application server) over the computational power of the database server. I imagined this along the lines of the database server sends all the documents (maybe post index-lookup) to the client and the client then aggregates the data to the desired result.
I haven't found any documentation confirming if this is possible though what I have read seems to imply that all the aggregation is done on the database server.
So my question is: is it possible to use the MongoDB client to aggregate data, as opposed to the MongoDB server?
I don't believe you are going to be able to do that with the current version of MongoDB. The aggregation pipeline has always run on top of the mongod process, which is the database software running on the server. Previously there was one special case, sharded collections and databases, which split the aggregation pipeline into two pieces and ran the second half on the mongos process. If that were still true, you could run mongos on your application server and run at least part of the aggregation framework there.
Unfortunately that changed in version 2.6: it now all runs on the shards and does not utilize the mongos process. From the documentation:
Aggregation Pipeline and Sharded Collections
Behavior
Changed in version 2.6.
When operating on a sharded collection, the aggregation pipeline is split into two parts. The first pipeline runs on each shard, or if an early $match can exclude shards through the use of the shard key in the predicate, the pipeline runs on only the relevant shards.
The second pipeline consists of the remaining pipeline stages and runs on the primary shard. The primary shard merges the cursors from the other shards and runs the second pipeline on these results. The primary shard forwards the final results to the mongos. In previous versions, the second pipeline would run on the mongos. [1] Until all shards upgrade to v2.6, the second pipeline runs on the mongos if any shards are still running v2.4.
So you could run version 2.4 and use the mongos process - or keep at least one shard as version 2.4 and force the processing on to mongos.
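The two-part split described above can be mimicked in pure Python to show where the work lands (an illustration of the behavior only, with hypothetical data; in a real cluster the first pipeline runs on each shard's mongod and the second on the primary shard or, pre-2.6, the mongos):

```python
from collections import defaultdict

# Two "shards", each holding part of a sharded collection.
shard_1 = [{"city": "NY", "n": 1}, {"city": "LA", "n": 2}]
shard_2 = [{"city": "NY", "n": 3}, {"city": "SF", "n": 4}]

def first_pipeline(docs):
    # Runs on each shard: a partial $group/$sum over local documents only.
    partial = defaultdict(int)
    for doc in docs:
        partial[doc["city"]] += doc["n"]
    return dict(partial)

def second_pipeline(partials):
    # Runs on the merging node (primary shard in 2.6+, mongos before that):
    # combines the per-shard partial results into the final answer.
    merged = defaultdict(int)
    for partial in partials:
        for city, total in partial.items():
            merged[city] += total
    return dict(merged)

merged = second_pipeline([first_pipeline(shard_1), first_pipeline(shard_2)])
print(merged)  # {'NY': 4, 'LA': 2, 'SF': 4}
```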
Yes, aggregation is done on the server side. To aggregate on the MongoDB client side you would need to bring all the data from the server to the client and then process it there; transferring a large amount of data can be very slow. Take a look at this:
aggregation in mongodb
It says that you can do aggregation on the client side, but it doesn't say how. What I think it means is: bring all the data over from the server, store it in a variable, then aggregate it using JavaScript.
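That client-side approach looks roughly like this in Python (a sketch with hypothetical documents; in practice `fetched` would come from a find() over the wire, which is exactly the expensive part):

```python
from collections import defaultdict

# Documents as they would arrive from a plain find() with no server-side
# aggregation -- the entire working set crosses the network first.
fetched = [
    {"status": "ok", "ms": 12},
    {"status": "err", "ms": 30},
    {"status": "ok", "ms": 8},
]

# Client-side equivalent of
#   {"$group": {"_id": "$status", "avg": {"$avg": "$ms"}}}
groups = defaultdict(list)
for doc in fetched:
    groups[doc["status"]].append(doc["ms"])

averages = {status: sum(ms) / len(ms) for status, ms in groups.items()}
print(averages)  # {'ok': 10.0, 'err': 30.0}
```

This works, but notice the server sends three full documents so the client can produce two numbers; the server-side pipeline would send only the two numbers.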
Also check this out; it may be helpful:
https://groups.google.com/forum/#!topic/mongodb-user/UxUvgBTHnjM

Need an explanation of how select queries are handled by MongoDB

I have no issues running a select query on mongoS, in a sharded environment, but my question is:
If I have a two-shard setup and run a find query via the application layer, which part of the sharded environment is responsible for executing the query?
I am not able to see any change in any of the instances' consoles, and no new process is created. I tested this by executing 3000 find queries on a locally implemented sharding setup.
Can anybody explain where my understanding is wrong, or whether find statements simply don't put noticeable load on the servers?
How does MongoDB handle select or read operations?
I'm struggling to understand this.
Thanks in advance for responding
When you connect to a mongoD or mongoS server via the shell (mongo), you won't be able to watch the queries happening on that server. The shell is mainly there to execute queries, configure the database, and check its status.
MongoS is simply a router for the queries coming from the application.
To see the individual queries you'll need to check the log files, which are located on each server based on your configuration.
By default only slow queries (those taking over 100 ms) are logged, so you will need to enable the profiler to log all queries.
You can read these documentation pages for more info on sharding.
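For completeness, the profiler is controlled by the profile command; a sketch of the command documents involved (run against each mongod, not the mongos; the thresholds shown are illustrative):

```python
# Profiler command documents as sent to a mongod.
# Level 2 records all operations; level 1 records only those
# slower than slowms (in milliseconds); level 0 turns it off.
profile_all = {"profile": 2}
profile_slow = {"profile": 1, "slowms": 100}
profile_off = {"profile": 0}
print(profile_all, profile_slow, profile_off)
```

Captured operations land in the system.profile collection of the database being profiled, which is where you would look for your 3000 find queries.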