I have data in a collection 'Test' with entries like the following:
{
"_id" : ObjectId("588b65f1e9a1e01dfc55a5ff"),
"SALES_AMOUNT" : 4500
},
{
"_id" : ObjectId("588b65f1e9a1e01dfct5a5ff"),
"SALES_AMOUNT" : 500
}
and so on.
I want to separate them equally into 10 buckets.
E.g.:
If my total is 50 entries, then it should be like this:
first_bucket : first 5 entries from the Test collection
second_bucket : next 5 entries from the Test collection
...
tenth_bucket : last 5 entries from the Test collection.
Suppose the total entry count is 101; then it should be like this:
first_bucket : first 10 entries from the Test collection
second_bucket : next 10 entries from the Test collection
...
tenth_bucket : last 11 entries from the Test collection (because there is 1 additional entry).
$bucket & $bucketAuto is in mongo 3.4.. But I use mongo 3.2.
I have a MongoDB collection of 10 billion documents, some of which have incorrect information and require updating. The documents look like this:
{
"_id" : ObjectId("5567c71e2cdc06be25dbf7a0"),
"index1" : "stringIndex1",
"field" : "stringField",
"index2" : "stringIndex2",
"value" : 100,
"unique_id" : "b21fc73e-55a0-4e15-8db0-fa94e4ebcc0b",
"t" : ISODate("2015-05-29T01:55:39.092Z")
}
I want to update the value field for all documents that match criteria on index1, index2 and field. I want to do this across many combinations of the 3 fields.
In an ideal world we could create a second collection and compare the two before replacing the original, in order to guarantee that we haven't lost any documents, but the size of the collection means this isn't possible. Any suggestions for how to update this large amount of data without risking damaging it?
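A minimal sketch of how such an update could be issued per combination of the three fields, assuming the document shape shown above; the collection name mycollection and the combinations list are placeholders. { multi: true } makes each update touch every matching document, and filtering on the indexed fields keeps each pass bounded.
// Sketch only: "mycollection" and the combinations array are placeholders.
var combinations = [
    { index1: "stringIndex1", field: "stringField", index2: "stringIndex2", newValue: 200 }
    // ... many more combinations
];
combinations.forEach(function (c) {
    db.mycollection.update(
        { index1: c.index1, field: c.field, index2: c.index2 },  // filter on the indexed fields
        { $set: { value: c.newValue } },                         // only touch the value field
        { multi: true }                                          // update all matches, not just one
    );
});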
Here is an example of my MongoDB database in a screenshot:
I'm trying to create a query which counts the number of actors per owner.login in my movies collection.
The result should look like this for all the objects in the "movies" collection:
henry.tom total_actors : 1
tercik.comip total_actors : 9
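A minimal aggregation sketch of that count; since the schema is only visible in the screenshot, the owner.login field and an actors array are assumptions here.
// Sketch only: assumes each movie document has an "owner.login" field and an "actors" array.
db.movies.aggregate([
    { $unwind: "$actors" },                                          // one document per actor entry
    { $group: { _id: "$owner.login", total_actors: { $sum: 1 } } },  // count actors per owner
    { $project: { _id: 0, login: "$_id", total_actors: 1 } }
])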
I have some problems with very slow distinct commands that use a query.
From what I have observed, the distinct command only makes use of an index if you do not specify a query.
I have created a test database on my MongoDB 3.0.10 server with 1 million objects. Each object looks as follows:
{
"_id" : ObjectId("56e7fb5303858265f53c0ea1"),
"field1" : "field1_6",
"field2" : "field2_10",
"field3" : "field3_29",
"field4" : "field4_64"
}
The numbers at the end of the field values are random 0-99.
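For reference, a rough sketch of how such a test collection could be populated in the mongo shell; the batching is only there to keep the inserts fast, and the collection name dbtest matches the distinct commands below.
// Sketch: populate db.dbtest with 1,000,000 documents carrying random 0-99 suffixes.
for (var i = 0; i < 1000000; i += 10000) {
    var batch = [];
    for (var j = 0; j < 10000; j++) {
        batch.push({
            field1: "field1_" + Math.floor(Math.random() * 100),
            field2: "field2_" + Math.floor(Math.random() * 100),
            field3: "field3_" + Math.floor(Math.random() * 100),
            field4: "field4_" + Math.floor(Math.random() * 100)
        });
    }
    db.dbtest.insert(batch);    // bulk insert of 10,000 documents per round
}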
On the collection, two simple indexes and one compound index have been created:
{ "field1" : 1 } # simple index on "field1"
{ "field2" : 1 } # simple index on "field2"
{ # compound index on all fields
"field2" : 1,
"field1" : 1,
"field3" : 1,
"field4" : 1
}
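The corresponding index creation commands would look roughly like this (a sketch; dbtest is the collection name used in the distinct commands below):
// Sketch: create the two simple indexes and the compound index described above.
db.dbtest.createIndex({ field1: 1 })
db.dbtest.createIndex({ field2: 1 })
db.dbtest.createIndex({ field2: 1, field1: 1, field3: 1, field4: 1 })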
Now I execute distinct queries on that database:
db.runCommand({ distinct: 'dbtest',key:'field1'})
The result contains 100 values, nscanned=100, and the index on "field1" was used.
Now the same distinct query is limited by a query:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It again contains 100 values; however, nscanned=9991 and the index used is the third one, the compound index on all fields.
Now the third index, the one used in the last query, is dropped, and the same query is executed again:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It again contains 100 values, nscanned=9991, and the index used is the one on "field2".
Conclusion: If I execute a distinct command without a query, the result is taken directly from an index. However, when I combine a distinct command with a query, only the query uses an index; the distinct command itself does not use an index in that case.
My problem is that I need to perform a distinct command with a query on a very large database. The result set is very large but only contains ~100 distinct values, so the complete distinct command takes ages (> 5 minutes) as it has to cycle through all the values.
What needs to be done so that the distinct command presented above can be answered by the database directly from an index?
The index is used automatically for distinct queries if your MongoDB version supports it.
Using an index for a distinct command with a query requires MongoDB 3.4 or higher; it works with both the MMAPv1 and WiredTiger storage engines.
See also the ticket https://jira.mongodb.org/browse/SERVER-19507
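On 3.4 or newer this can be verified with the explain command: if the distinct is answered from the index, the winning plan contains a DISTINCT_SCAN stage. A sketch against the test collection above:
// Sketch: explain the distinct-with-query and look for a DISTINCT_SCAN stage in the winning plan.
db.runCommand({
    explain: { distinct: "dbtest", key: "field1", query: { field2: "field2_10" } },
    verbosity: "queryPlanner"
})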
I'm new to MongoDB, so please bear with me. I googled this but could not find a convincing answer.
I understand that the following should limit the result to n1 documents and skip n2 of those.
>db.mycol.find({},{"title":1}).limit(n1).skip(n2)
Why does the following query return the second document in the collection? Should it not return nothing? (Limit 1 gives the first document, and skipping that leaves us with nothing.)
>db.mycol.find({},{"title":1}).limit(1).skip(1)
What do you want to happen when you put limit before skip?
If you limit to N elements and then skip K of them, this is logically equivalent to skipping K and limiting to N-K.
I suppose the optimizer knows this, and expects you to know it as well.
See pipeline optimization
I understand the following should limit n1 documents in the result and skip n2 of that.
Nope, you got that wrong. Here is what happens:
Your query is processed by the query optimizer, which puts .sort(), .skip() and .limit() into exactly this order
The documents to be returned are identified, either by using indexes or by a collection scan
Now they are sorted according to the parameters of the .sort() clause, if present
From that sorted list, the first documents are skipped according to the parameter of the .skip() clause.
Finally, a number of documents equal to the parameter of the .limit() clause is returned.
Actually it is easy to prove:
> db.bg.insert({a:1})
WriteResult({ "nInserted" : 1 })
> db.bg.insert({a:2})
WriteResult({ "nInserted" : 1 })
> db.bg.insert({a:3})
WriteResult({ "nInserted" : 1 })
> db.bg.insert({a:4})
WriteResult({ "nInserted" : 1 })
> db.bg.find()
{ "_id" : ObjectId("56889a8a32a39e5b2c96acb5"), "a" : 1 }
{ "_id" : ObjectId("56889a8d32a39e5b2c96acb6"), "a" : 2 }
{ "_id" : ObjectId("56889a9032a39e5b2c96acb7"), "a" : 3 }
{ "_id" : ObjectId("56889ad332a39e5b2c96acb8"), "a" : 4 }
// According to your logic, this query would be empty
// (Only one doc returned, and of that returned one skipped)
// But it bears a result…
> db.bg.find().sort({a:-1}).limit(1).skip(1)
{ "_id" : ObjectId("56889a9032a39e5b2c96acb7"), "a" : 3 }
// …actually the same result when switching the place of the clauses
> db.bg.find().sort({a:-1}).skip(1).limit(1)
{ "_id" : ObjectId("56889a9032a39e5b2c96acb7"), "a" : 3 }
// Even when we put the sort clause to the end.
// If the query optimizer would not have enforced the order mentioned
// we would have natural order as in the default query,
// then skip 1 (we would be at {a:2}),and limit to that document, making
// the sort clause useless.
// But, as you can see, it is the same result as before
> db.bg.find().skip(1).limit(1).sort({a:-1})
{ "_id" : ObjectId("56889a9032a39e5b2c96acb7"), "a" : 3 }
https://docs.mongodb.org/v3.0/reference/method/cursor.skip/
NOTE
You must apply cursor.skip() to the cursor before retrieving any documents from the database.
Limit, on the other hand, is applied while the results are being returned.
So find identifies all the documents that match the criteria, skip is applied before any documents are retrieved, and then the number of documents given in limit is returned.
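A small sketch to illustrate the point with the bg collection from the earlier answer; the order in which the cursor methods are chained does not change the query plan:
// Sketch: both orderings produce the same winning plan, with the same SKIP and LIMIT stages.
db.bg.find().limit(1).skip(1).explain().queryPlanner.winningPlan
db.bg.find().skip(1).limit(1).explain().queryPlanner.winningPlan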
Let's say I have a collection with documents that look like this (just a simplified example, but it should show the schema):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function(){
emit(this.type,this.value);
},function(key,values){
var total = 0;
for(i in values) {total+=values[i]};
return total;
},
{out:'stat'})
will collect the totals in the 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I'm stuck on the next move:
The 'data' collection is constantly updated with new data (old documents stay unchanged; there are only inserts, no updates)
I would like to periodically update the 'stat' collection, but I do not want to query the whole 'data' collection every time, so I chose to run an incremental mapReduce
It may seem good to just update the 'stat' collection on every insert into the 'data' collection and not use mapReduce at all, but the real case is more complex than this example and I would like to get statistics only on demand.
To do this I should be able to query only the documents that were added after my last mapReduce
As far as I understand, I cannot rely on the ObjectId property by just storing the last one and later selecting every document with ObjectId > stored, because ObjectId is not the equivalent of an auto-increment id in SQL databases (for example, different shards will produce different ObjectIds).
I can change the ObjectId generator, but I am not sure how best to do that in a sharded environment
So the question is:
Is there any way to select only the documents added after the last mapReduce, so I can run an incremental mapReduce, or is there another strategy for updating statistics on a constantly growing collection?
You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
import bson
from bson.code import Code

m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use time and optionally any other keys you need here
q = bson.SON([("date", {"$gte": last_time, "$lt": end})])
collection.map_reduce(m, r, out={"reduce": <output_collection>}, query=q)
Note that we used reduce and not merge, because we don't want to overwrite what we had before; we want to combine the old results and the new results using the same reduce function.
You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the java docs... The JS version appears to be _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/
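For reference, in the mongo shell the same information is exposed via ObjectId.getTimestamp(); a small sketch against the 'data' collection from the question:
// Sketch: read the creation time (second precision) embedded in the newest _id.
var lastId = db.data.find().sort({ _id: -1 }).limit(1).next()._id
lastId.getTimestamp()   // returns an ISODate built from the id's timestamp portion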
I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, and expects to run in a cron job. It locks itself so two can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""
We solve this issue using 'normalized' ObjectIds. The steps we take:
Normalize the id - take the timestamp from the current/stored/last processed id and set the other parts of the id to their minimum values. C# code: new ObjectId(objectId.Timestamp, 0, short.MinValue, 0)
Run map-reduce over all items that have an id greater than our normalized id, skipping already processed items.
Store the last processed id, and mark all processed items.
Note: Some boundary items will be processed several times. To handle this, we set a flag on the processed items.
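A rough mongo shell sketch of the same normalization idea, zero-padding the non-timestamp bytes instead of using the exact C# minimum values; the stored id here is just the example value from the question.
// Sketch: keep only the timestamp of the last processed _id and zero-pad the rest,
// then select everything with a greater _id as candidates for the next incremental run.
var lastProcessedId = ObjectId("4e9c1f5daa3dd60ee98282d4")          // example value from the question
var seconds = Math.floor(lastProcessedId.getTimestamp().getTime() / 1000)
var normalizedId = ObjectId(seconds.toString(16) + "0000000000000000")
db.data.find({ _id: { $gt: normalizedId } })                        // items to map-reduce next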