I am trying to make MongoDB count documents based on a where clause.
// Blocking helper, as in the Scala driver's quick-start Helpers: waits up to
// ten seconds for the single result of an Observable.
implicit class ObservableHelper[C](val observable: Observable[C]) {
  def headResult(): C = Await.result(observable.head(), Duration(10, TimeUnit.SECONDS))
}

val database: MongoDatabase = mongoClient.getDatabase("dbname")
val collection: MongoCollection[Document] = database.getCollection("tablename")
val recordCount = collection.countDocuments().headResult()
This query returns the count as 766,782, but it takes 2-2.5 seconds. When I run the same query through MongoDB Compass it takes 0.2 seconds.
db.tbltrackerdata.find({},{}).count()
Since the where clause is dynamic, I cannot precompute the count or maintain any metadata for it.
This isn't because of Scala. It's because db.collection.count({}) or db.collection.find({}).count() does not actually fetch the documents; instead, it simply retrieves the number of documents from the collection's metadata.
The countDocuments method performs a proper aggregation, which is expected to be slower but also more accurate. You can see it mentioned in the MongoDB documentation here:
https://docs.mongodb.com/manual/reference/method/db.collection.countDocuments/#mechanics
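For illustration, here is a minimal sketch with the Scala driver (assuming a driver version of 2.4+ that exposes estimatedDocumentCount, and placeholder connection details): the metadata-based count is the fast one, while countDocuments is the accurate one that also accepts a dynamic filter.

import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.mongodb.scala._
import org.mongodb.scala.model.Filters

// Placeholder connection; reuse your existing mongoClient and collection instead.
val mongoClient = MongoClient("mongodb://localhost")
val collection: MongoCollection[Document] =
  mongoClient.getDatabase("dbname").getCollection("tablename")

// Small blocking helper, same idea as headResult() above.
def awaitResult[T](observable: SingleObservable[T]): T =
  Await.result(observable.head(), Duration(10, TimeUnit.SECONDS))

// Fast: reads the count from the collection's metadata, like find().count() in the shell.
val approximateCount: Long = awaitResult(collection.estimatedDocumentCount())

// Accurate but slower: runs an aggregation; this is the variant that takes a filter.
// "someField" is a made-up example of a dynamic where clause.
val filteredCount: Long = awaitResult(collection.countDocuments(Filters.gte("someField", 100)))

With an empty filter, estimatedDocumentCount should behave like the sub-second count you see in Compass; once a dynamic where clause is involved, countDocuments plus an index that covers the filter is what determines the latency.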
I am running a Mongo query like this:
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true, $ne:[]}}).sort({dateField:-1})
The collection has approx. 10^6 documents. I have indexes on the stringField and dateField (both ascending). This query takes ~3-4 seconds to run.
However, if I change my query to any of the variants below, it executes within 100 ms.
Remove $ne
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true}}).sort({dateField:-1})
Remove $exists
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$ne:[]}}).sort({dateField:-1})
Remove sort
db.getCollection('collection').find({stringField:"stringValue", arrayField:{$exists:true, $ne:[]}})
Use arrayField.0
db.getCollection('collection').find({stringField:"stringValue", "arrayField.0":{$exists:true}}).sort({dateField:-1})
The explain output for these queries does not provide any insight into why the first query is so slow.
MongoDB version: 3.4.18
I am using the Scala API of Apache Spark Streaming to read from a Kafka server in a window with a size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. The values are supposed to be reduced with reduceByKeyAndWindow and saved to MongoDB.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
  Seconds(60), Seconds(60))

reduced.foreachRDD { rdd =>
  import spark.implicits._
  val aggregatedPower = rdd.map(x => MyJsonObj(x._2, x._1)).toDF()
  aggregatedPower.write.mode("append").mongo()
}
This works so far; however, it is possible that some messages arrive with a delay of a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and, if so, update that value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria match no existing document, and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field the data will be upserted, which means any existing documents with the same _id value will be updated, and documents whose _id value is not yet in the collection will be inserted.
Now you could try to create an RDD using MongoDB Spark aggregation, specifying a $match filter to query for documents whose timestamp matches the current window:
val aggregatedRdd = rdd.withPipeline(Seq(
  Document.parse("{ $match: { timestamp : '2018-03-14 15:00' } }")
))
Modify the value of the power field, and then write with write.mode("append").
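As a rough sketch of that write path (the case class name PowerDoc and the choice of using the window timestamp as the _id are my assumptions, not taken from the original code):

// Sketch only: relies on the connector 2.2+ behaviour described above, where a
// DataFrame column named _id makes the write update (upsert) existing documents.
case class PowerDoc(_id: String, power: Double)

reduced.foreachRDD { rdd =>
  import spark.implicits._
  // x._1 is the window timestamp (the key), x._2 the summed power for that window.
  // If the previous value for this timestamp was read back via the $match pipeline
  // above, add it to x._2 here before building the document.
  val upserts = rdd.map(x => PowerDoc(_id = x._1, power = x._2)).toDF()
  upserts.write.mode("append").mongo()
}

Whether the new window's value should overwrite or be added to the stored one is then decided before the write, as described above.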
You may also find the blog post Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.
I have a 30 GB MongoDB 3.6 collection with 500k documents. The main _id field is a float timestamp (I didn't manually define an index, but inserted on the _id field, assuming from the documentation that _id will be used as an index and automatically maintained).
Now, to query the latest data, I do this in Python 3:
cursor = cry.find().sort('_id', pymongo.DESCENDING).limit(600)
df = list(cursor)
However, just querying the last 600 records takes about 1 minute. How can this be if the index is maintained? Is there a faster way to query (like by natural order), or do I need to re-index, although the documentation says it's done automatically?
I also tried:
cursor = cry.find().skip(cry.count() - 1000)
df = list(cursor)
but this is just as slow.
I'm iterating over all of the Mongo documents on a Mongo slave using the Mongo Java API.
Mongo server: 2.4.10
Number of records on the slave: 300 million.
I have one Mongo master and one Mongo slave.
(No sharding done)
The Mongo slave is replicated at a very high frequency: 2000 insertions and deletions every 10 seconds.
The iteration is taking more than 10 hours.
My goal is to fetch each record in the collection, create a CSV, and load it into Redshift.
DB db = null;
DBCursor cursor = null;
mongo = new MongoClient(mongoHost);
mongo.slaveOk();
db = mongo.getDB(dbName);
DBCollection dbCollection = db.getCollection(dbCollectionName);
cursor = dbCollection.find();
while (cursor.hasNext()) {
    DBObject resultObject = cursor.next();
    String uid = (String) ((Map) resultObject.get("user")).get("uid");
    String category = (String) resultObject.get("category");
    resultMap.put(uid, category);
    if (resultMap.size() >= csvUpdateBatchSize) {
        // store to a csv - append to an existing csv
    }
}
Is there a way to bring the iteration time down to below 1 hour?
Infrastructure changes can be made too, like adding shards.
Please suggest.
Have you considered performing a parallel mongoexport on your collection?
If you have a way to partition your data with a query (something like a modulo over an id or an indexed field), you can run one export per partition and pipe each export's output to your program as standard input.
Your program then handles each document as a JSON row, which you can load into a designated object representing the document structure with GSON or a similar library, and eventually run your logic on that object.
Using mongoexport and adding parallelism can improve your performance greatly.
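To make the partitioning idea concrete, here is a minimal sketch of building non-overlapping _id ranges with the MongoDB Scala driver; the day-sized ranges, the 30-day window, and the assumption that _id values are ObjectIds are illustrative only, not taken from the question:

import java.util.Date
import org.bson.conversions.Bson
import org.bson.types.ObjectId
import org.mongodb.scala.model.Filters

// ObjectIds embed their creation time, so time-based ranges on _id give
// non-overlapping partitions that each worker (a mongoexport --query process
// or a separate cursor/thread) can scan independently.
val dayMillis = 24L * 60 * 60 * 1000
val startMillis = System.currentTimeMillis() - 30L * dayMillis

val partitionFilters: Seq[Bson] = (0 until 30).map { day =>
  val from = new ObjectId(new Date(startMillis + day * dayMillis))
  val to = new ObjectId(new Date(startMillis + (day + 1) * dayMillis))
  Filters.and(Filters.gte("_id", from), Filters.lt("_id", to))
}

// Each filter then goes to its own worker; the per-partition CSVs are
// concatenated (or loaded as separate files) before loading into Redshift.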
I am trying to use MongoDB's ObjectID to do a range query on the insertion time for a given collection. I can't really find any documentation that this is possible, except for this blog entry: http://mongotips.com/b/a-few-objectid-tricks/.
I want to fetch all documents created after a given timestamp. Using the Node.js driver, this is what I have:
var timeId = ObjectId.createFromTime(timestamp);
var query = {
  localUser: userId,
  _id: {$gte: timeId}
};
var cursor = collection.find(query).sort({_id: 1});
I always get the same number of records (19 in a collection of 27), independent of the timestamp. I noticed that createFromTime only fills the bytes in the ObjectId related to time; the other ones are left at 0 (like this: 4f6198be0000000000000000).
The reason I am trying to use an ObjectID for this is that I need the timestamp from when the document is inserted on the MongoDB server, not from when the document is passed to the MongoDB driver in Node.
Does anyone know how to make this work, or have another idea for how to generate and query insertion times that are generated on the MongoDB server?
Not sure about the Node.js driver, but in Ruby you can simply apply range queries like this:
jan_id = BSON::ObjectId.from_time(Time.utc(2012, 1, 1))
feb_id = BSON::ObjectId.from_time(Time.utc(2012, 2, 1))
#users.find({'_id' => {'$gte' => jan_id, '$lt' => feb_id}})
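For comparison, roughly the same range query with the MongoDB Scala driver might look like the sketch below; the connection details and collection name are placeholders, and, as in the Ruby example, only the timestamp portion of the boundary ObjectIds is meaningful.

import java.util.Date
import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, Sorts}
import org.bson.types.ObjectId

// Placeholders: adjust the URI, database and collection to your setup.
val client = MongoClient("mongodb://localhost")
val users = client.getDatabase("test").getCollection("users")

// ObjectIds built purely from a timestamp (remaining bytes zero) still work as
// range boundaries, because ObjectIds compare by their leading timestamp bytes.
val janId = new ObjectId(Date.from(java.time.Instant.parse("2012-01-01T00:00:00Z")))
val febId = new ObjectId(Date.from(java.time.Instant.parse("2012-02-01T00:00:00Z")))

val inJanuary = users
  .find(Filters.and(Filters.gte("_id", janId), Filters.lt("_id", febId)))
  .sort(Sorts.ascending("_id"))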
Make sure that var timeId = ObjectId.createFromTime(timestamp) is actually creating an ObjectId. Also try the query without localUser.