I am trying to implement a research system for a scientific study of 2D astronomical coordinate systems. On the one hand, I have a process that generates a lot of geospatial data, which we can think of as organized in documents of two coordinates and a string value. On the other hand, I have a small set of data with only one coordinate, which must match at least one of the two coordinates of the generated data.
Simplifying, I have organized data in two collections:
collA: more than 50GB (continuously increasing)
collB: avg. size ~2GB (constant over time)
Where:
The schema of a document in collA is:
{
terrainType: 'myType00001',
lat: '000000123',
lon: '987000000'
},
{
terrainType: 'myType00002',
lat: '000000124',
lon: '987000000'
},
{
terrainType: 'myType00003',
lat: '000000124',
lon: '997000000'
}
Please note that, first of all, we created indexes to avoid a COLLSCAN. There are two indexes on collA: __lat_idx (unique) and __long_idx (unique). I can guarantee that the generation process cannot produce duplicates in the lat or lon columns (as seen above, lat and lon have nine digits, but that is only for simplicity; in the real case these values are extremely large).
The schema of a document in collB is:
{
latOrLon: '0045600'
},
{
latOrLon: '0045622'
},
{
latOrLon: '1145600'
}
I tried some different query strategies.
Strategy A
let cursor = collB.find(); // Should this load the entire 2GB collection into memory?
cursor.forEach(c => {
    collA.find({ lat: c.latOrLon });
    collA.find({ lon: c.latOrLon });
});
This takes two Mongo calls for each document in collB and is extremely slow.
Strategy B
let cursor = collB.find(); // Should this load the entire 2GB collection into memory?
cursor.forEach(c => {
    collA.find({ $expr: { $or: [{ $eq: ['$lat', c.latOrLon] }, { $eq: ['$lon', c.latOrLon] }] } });
});
This takes one Mongo call for each document in collB; it is faster than A, but still slow.
Strategy C
CHUNK_SIZE = 5000
batch_cursor = collB.find({}, {'_id': 0}, batch_size=CHUNK_SIZE)
chunks = yield_rows(batch_cursor, CHUNK_SIZE)  # yield_rows defined below

for chunk in chunks:
    docs = []
    for doc in chunk:
        docs.append(doc['latOrLon'])
    res = collA.find({
        '$or': [
            {'lon': {'$in': docs}},
            {'lat': {'$in': docs}}
        ]
    })
This first takes 5000 documents from collB, puts their values into an array, and sends a single find() query. Good efficiency: one Mongo call for every 5000 documents in collB.
This solution is very fast, but I noticed that it becomes slower as collA increases in size. Why? Since I am using indexes, the search for an indexed value should cost O(1) in terms of computation time. For example, when collA was around 25GB in size, it took roughly 30 minutes to run the full set of find() queries; now the collection is 50GB and it is taking more than 2 hours.
We expect the DB to reach at least 5TB next month, and this will be a problem.
I am asking the university about the possibility of parallelizing this job using MongoDB sharding, but that will not happen immediately. I am looking for a temporary solution until we can parallelize the job.
Please note that we tried more than these three strategies, and we mixed Python 3 and NodeJS.
Definition of yield_rows:
def yield_rows(batch_cursor, chunk_size):
    """
    Generator to yield chunks from a cursor.
    :param batch_cursor: cursor to read from
    :param chunk_size: number of documents per chunk
    :return: lists of at most chunk_size documents
    """
    chunk = []
    for i, row in enumerate(batch_cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
You can just index the fields you're querying by (the indexes don't have to be unique); that way only the matching data is returned and the query won't go through the entire collection.
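For reference, a minimal PyMongo sketch of creating such (non-unique) indexes on collA; the connection and database name are placeholders:

from pymongo import ASCENDING, MongoClient

collA = MongoClient().mydb.collA  # placeholder connection and database name

# Plain single-field indexes are enough for the equality lookups above;
# unique is optional.
collA.create_index([("lat", ASCENDING)], name="__lat_idx")
collA.create_index([("lon", ASCENDING)], name="__long_idx")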
Related
I'm using MongoDB 4.0 on a MongoDB Atlas cluster (3 replicas, 1 shard).
Assume I have a collection that contains multiple documents.
Each of these documents holds an array of subdocuments that represent cities in a certain year, with additional information. An example document looks like this (I removed unnecessary information to simplify the example):
{
  _id: 123,
  cities: [
    { name: "vienna", year: 1985 },
    { name: "berlin", year: 2001 },
    { name: "vienna", year: 1985 }
  ]
}
I have a compound index on cities.name and cities.year. What is the fastest way to count the occurrences of name and year combinations?
I already tried the following aggregation:
[{$unwind: {
    path: '$cities'
}}, {$group: {
    _id: {
        name: '$cities.name',
        year: '$cities.year'
    },
    count: {
        $sum: 1
    }
}}, {$project: {
    count: 1,
    name: '$_id.name',
    year: '$_id.year',
    _id: 0
}}]
Another approach I tried was a map-reduce in the following form; the map-reduce performed a bit better, needing about 30% less time.
map function:
function m() {
for (var i in this.cities) {
emit({
name: this.cities[i].name,
year: this.cities[i].year
},
1);
}
}
reduce function (I also tried replacing sum with length, but surprisingly sum is faster):
function r(id, counts) {
return Array.sum(counts);
}
function call in mongoshell:
db.test.mapReduce(m,r,{out:"mr_test"})
Now I was asking myself: is it possible to access the index directly? As far as I know it is a B+ tree that holds the pointers to the relevant documents on disk, so from a technical point of view I think it would be possible to iterate through all leaves of the index tree and just count the pointers. Does anybody know if this is possible?
Does anybody know another way to solve this in a highly performant way? (It is not possible to change the design because of other dependencies in the software, and we are running this on a very big dataset.) Does anybody have experience solving such a task with shards?
The index will not be very helpful in this situation.
MongoDB indexes were designed for identifying documents that match given criteria.
If you create an index on {cities.name:1, cities.year:1}
This document:
{
  _id: 123,
  cities: [
    { name: "vienna", year: 1985 },
    { name: "berlin", year: 2001 },
    { name: "vienna", year: 1985 }
  ]
}
Will have 2 entries in the b-tree that refer to this document:
vienna|1985
berlin|2001
Even if it were possible to count the incidences of a specific key in the index, that count would not necessarily correspond to the number of matching array elements (note that the duplicated vienna/1985 element above produces only one index entry).
MongoDB does not provide a method to examine the raw entries in an index, and it explicitly refuses to use an index on a field containing an array for counting.
The MongoDB count command and helper functions all count documents, not elements inside of them. As you noticed, you can unwind the array and count the items in an aggregation pipeline, but at that point you've already loaded all of the documents into memory, so it's too late to make use of an index.
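To make that distinction concrete, here is a small PyMongo sketch (collection name test as in the mapReduce call above; the database name is a placeholder). A count query over documents can use the index, while counting (name, year) combinations requires unwinding the array, which reads the documents:

from pymongo import MongoClient

db = MongoClient().mydb  # placeholder database name

# Counts documents containing at least one matching element; can use the index.
n_docs = db.test.count_documents({"cities.name": "vienna", "cities.year": 1985})

# Counts array elements per combination; must $unwind, so every document is read.
pipeline = [
    {"$unwind": "$cities"},
    {"$group": {"_id": {"name": "$cities.name", "year": "$cities.year"},
                "count": {"$sum": 1}}},
]
for row in db.test.aggregate(pipeline, allowDiskUse=True):
    print(row)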
Let's say I have a million entries in the db, each with 10 fields ("columns"). It seems to me that the more columns I search by, the faster the query runs; for example:
db.items.find( {
$and: [
{ field1: x },
{ field2: y },
{ field3: z}
]
} )
is faster than:
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
} )
While I would love to say "Great, this makes total sense to me", it doesn't. I just know it's happening in my particular case, and I'm wondering whether this is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indexes, does it help to have them in any particular order? For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these have any sort of order? If yes, does the order matter? Let's say 90% of items will match the field1 criterion, but only 1% of items will match the field3 criterion. Would ordering them make any difference?
It may simply be the case that the more restrictive query returns fewer documents, since 90% of items match the field1 criterion and only 1% match the field3 criterion. Check what explain() says for both queries.
Mongo has quite a good profiler. Give it a try. Play with different indexes and different queries. Not on the production db, of course.
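For example, a rough PyMongo sketch of comparing plans and using the profiler (field and collection names taken from the question; the database name is a placeholder):

from pymongo import MongoClient

db = MongoClient().mydb  # placeholder database name

# Compare the query plans of the two- and three-field queries.
print(db.items.find({"field1": "x", "field2": "y"}).explain())
print(db.items.find({"field1": "x", "field2": "y", "field3": "z"}).explain())

# Or turn the profiler on, run the queries under test, and inspect the slowest ops.
db.command("profile", 2)          # level 2 profiles every operation
# ... run the queries under test ...
for op in db.system.profile.find().sort("millis", -1).limit(5):
    print(op.get("millis"), op.get("ns"))
db.command("profile", 0)          # switch profiling off again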
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 }
and a query db.items.find( { field2: x, field3: y }), the index won't be used at all,
and for the query db.items.find( { field1: x, field3: y }) it can be used only partially, for field1.
On the other hand, the order of conditions in the query does not matter:
db.items.find( { field1: x, field2: y }) is as good as
db.items.find( { field2: y, field1: x }) and will use the index in both cases.
When choosing an indexing strategy, you should examine both the data and your typical queries. It may be the case that index intersection works better for you, and that instead of a single compound index you get better overall performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 } rather than multiple compound indexes for different kinds of queries.
It is also important to check that the index size fits in memory. In most cases, anyway.
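One way to check that from Python (a sketch; "items" is the collection name from the question, the database name is a placeholder): collStats reports total and per-index sizes in bytes, which you can compare against available RAM.

from pymongo import MongoClient

db = MongoClient().mydb  # placeholder database name

stats = db.command("collStats", "items")
print(stats["totalIndexSize"])  # total size of all indexes, in bytes
print(stats["indexSizes"])      # per-index breakdown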
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so running a query a second time may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes set up, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the "{ city: 1, building: 1, room: 1 }" index.
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case though; not all data is neatly distributed in a sort order. For example, English names or words tend to clump together under common letters - there aren't many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
})
.explain()
Is there a way to get the size of all the documents that meets a certain query in the MongoDB shell?
I'm creating a tool that will use mongodump (see here) with the query option to dump specific data to an external media device. However, I would like to check whether all the documents will fit on the external media device before starting the dump. That's why I would like to get the size of all the documents that match the query.
I am aware of the Object.bsonsize method described here, but it seems that it only returns the size of one document.
Here's the answer that I've found:
var cursor = db.collection.find(...); //Add your query here.
var size = 0;
cursor.forEach(
function(doc){
size += Object.bsonsize(doc)
}
);
print(size);
This should output the size of the documents in bytes fairly accurately.
I've run the command twice. The first time, there were 141,215 documents which, once dumped, totalled about 108 MB. The difference between the output of the command and the size on disk was 787 bytes.
The second time I ran the command, there were 35,914,179 documents which, once dumped, totalled about 57.8 GB. This time, the command's output matched the real size on disk exactly.
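For what it's worth, the same measurement can be done from Python by re-encoding each document to BSON; a sketch with PyMongo (connection details are placeholders, and bson.encode needs a reasonably recent PyMongo):

from bson import encode
from pymongo import MongoClient

coll = MongoClient().mydb.collection  # placeholder database and collection names

# Add your query inside find(); len(encode(doc)) mirrors Object.bsonsize(doc).
total = sum(len(encode(doc)) for doc in coll.find({}))
print(total)  # size in bytes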
Starting in Mongo 4.4, $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to sum the bson size of all documents matching your query:
// { d: [1, 2, 3, 4, 5] }
// { a: 1, b: "hello" }
// { c: 1000, a: "world" }
db.collection.aggregate([
{ $group: {
_id: null,
size: { $sum: { $bsonSize: "$$ROOT" } }
}}
])
// { "_id" : null, "size" : 177 }
This $groups all matching items together and $sums the grouped documents' $bsonSize.
$$ROOT represents the current document from which we get the bsonsize.
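To restrict the sum to documents matching a query rather than the whole collection, a $match stage can go before the $group; a PyMongo sketch (the filter here is just an example):

from pymongo import MongoClient

coll = MongoClient().mydb.collection  # placeholder database and collection names

pipeline = [
    {"$match": {"c": {"$gte": 1000}}},  # example filter; replace with your query
    {"$group": {"_id": None, "size": {"$sum": {"$bsonSize": "$$ROOT"}}}},
]
result = next(coll.aggregate(pipeline), None)  # None if nothing matches
print(result)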
I want to ask for some info related to findAndModify in MongoDB.
As far as I know, the query is "isolated by document".
This means that if I run 2 findAndModify operations like this:
{a:1}, {$set:{status:"processing", engine:1}}
{a:1}, {$set:{status:"processing", engine:2}}
and this query can potentially affect 2,000 documents, then, because there are 2 queries (2 engines), some documents may end up with engine:1 and others with engine:2.
I don't think findAndModify will isolate the "first query".
In order to isolate the first query I would need to use $isolated.
Is everything I have written correct?
UPDATE - scenario
The idea is to write a proximity engine.
The collection User has 1000-2000-3000 users, or even millions.
1 - Order by nearest from point "lng,lat"
2 - In NodeJS I do some computation that I CAN'T do in MongoDB
3 - Now I group the Users into a "UserGroup" and write a Bulk Update
When I have 2000-3000 Users, this process (steps 1 to 3) takes time.
So I want to have multiple threads in parallel.
Parallel threads mean parallel queries.
This can be a problem, since Query 3 can take some users from Query 1.
If this happens, then at step (2) I don't have the nearest Users, only the nearest "for this query", because another query may have taken the rest of the users. This could mean, for example, that some users in New York get grouped with users from Los Angeles.
UPDATE 2 - scenario
I have a collection like this:
{location:[lng,lat], name:"1",gender:"m", status:'undone'}
{location:[lng,lat], name:"2",gender:"m", status:'undone'}
{location:[lng,lat], name:"3",gender:"f", status:'undone'}
{location:[lng,lat], name:"4",gender:"f", status:'done'}
What I should be able to do is create 'Groups' of users by grouping the nearest ones. Each Group has 1 male + 1 female. In the example above, I expect to get only 1 group (user1 + user3), since they are Male + Female and very near each other (user2 is also Male, but is far away from user3, and user4 is also Female but has status 'done', so it has already been processed).
Now that the Group is created (only 1 group), the 2 users are marked as 'done' and the remaining user2 stays 'undone' for future operations.
I want to be able to handle 1000-2000-3000 users very fast.
UPDATE 3 : from community
Okay then. Can I please try to summarise your case? Given your data, you want to "pair" male and female entries together based on their proximity to each other. Presumably you don't want to do every possible match but just set up a list of general "recommendations", let's say 10 for each user, by the nearest location. Now I'd have to be stupid not to see the full direction of where this is going, but does this sum up the basic initial problem statement: process each user, find their "pairs", mark them as "done" once paired, and exclude them from other pairings once complete?
This is a non-trivial problem and can not be solved easily.
First of all, an iterative approach (which admittedly was my first one) may lead to wrong results.
Given we have the following documents
{
_id: "A",
gender: "m",
location: { longitude: 0, latitude: 1 }
}
{
_id: "B",
gender: "f",
location: { longitude: 0, latitude: 3 }
}
{
_id: "C",
gender: "m",
location: { longitude: 0, latitude: 4 }
}
{
_id: "D",
gender: "f",
location: { longitude: 0, latitude: 9 }
}
With an iterative approach, we would start with "A" and find the closest female, which of course would be "B", at a distance of 2. However, the closest distance between a male and a female is actually 1 (the distance from "B" to "C"). But even when we find this, it leaves the other match, "A" and "D", at a distance of 8, whereas with the previous pairing "A" would have had a distance of only 2 to "B".
So we need to decide which way to go:
Naively iterate over the documents
Find the lowest sum of distances between matching individuals (which itself isn't trivial to solve), so that all participants together have the shortest travel.
Match only participants within an acceptable distance
Do some sort of divide and conquer and match participants within a certain radius of a common landmark (say cities, for example)
Solution 1: Naively iterate over the documents
var users = db.collection.find(yourQueryToFindThe1000users);

// We can safely use an unordered op here,
// which has greater performance.
// Since we use the "done" array to keep track of
// the processed members, there is no drawback.
var pairs = db.pairs.initializeUnorderedBulkOp();
var done = new Array();

users.forEach(
    function(currentUser){

        // Skip users that have already been paired.
        if( done.indexOf(currentUser._id) !== -1 ) { return; }

        var genderToLookFor = ( currentUser.gender === "m" ) ? "f" : "m";

        // Using the $near operator, the returned documents are automatically
        // sorted from nearest to farthest, and since findAndModify returns
        // only one document we get the closest matching partner.
        var nearPartner = db.collection.findAndModify({
            query: {
                status: "undone",
                gender: genderToLookFor,
                location: {
                    $near: {
                        $geometry: {
                            type: "Point",
                            coordinates: currentUser.location
                        }
                    }
                }
            },
            update: { $set: { status: "done" } },
            fields: { _id: 1 }
        });

        // Obviously, the current user is already processed.
        // However, we store it to simplify the process of
        // setting the processed users to done.
        done.push(currentUser._id, nearPartner._id);

        // We have a pair, so we store it in a bulk operation.
        pairs.insert({
            _id: {
                a: currentUser._id,
                b: nearPartner._id
            }
        });
    }
);

// Write the found pairs.
pairs.execute();

// Mark all that are unmarked by now as done.
db.collection.update(
    {
        _id: { $in: done },
        status: "undone"
    },
    {
        $set: { status: "done" }
    },
    { multi: true }
);
Solution 2: Find the smallest sum of distances between matches
This would be the ideal solution, but it is extremely complex to solve. We would need to take all members of one gender, calculate the distances to all members of the other gender, and iterate over all possible sets of matches. In our example it is quite simple, since there are only 4 combinations for any given gender. Thinking about it twice, this might be at least a variant of the travelling salesman problem (multiple TSP?). If I am right about that, the number of combinations should be
n! / 2

for all n > 2, where n is the number of possible pairs, and hence

10! / 2 = 1,814,400

for n = 10, and an astonishing

25! / 2 = 7,755,605,021,665,492,992,000,000

for n = 25.
That's 7.755 quadrillion (long scale) or 7.755 septillion (short scale).
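A quick sanity check of those figures, assuming the n!/2 formula above:

import math

for n in (10, 25):
    print(n, math.factorial(n) // 2)
# 10 -> 1814400
# 25 -> 7755605021665492992000000  (about 7.755e24)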
While there are approaches to solving this kind of problem, the world record is somewhere in the range of 25,000 nodes using massive amounts of hardware and quite tricky algorithms. I think for all practical purposes, this "solution" can be ruled out.
Solution 3
In order to prevent people from being matched over unacceptable distances, and depending on your use case, you might want to match people based on their distance to a common landmark (where they are going to meet, for example the nearest larger city).
For our example, assume we have cities at [0,2] and [0,7]. The distance (5) between the cities therefore has to be our acceptable range for matches. So we run a query for each city:
db.collection.find({
    location: {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [0, 2]
            },
            $maxDistance: 5
        }
    },
    status: "undone"
})
and iterate over the results naively. Since "A" and "B" would be the first in the result set, they would be matched and marked as done. Bad luck for "C" here, as no girl is left for him. But when we run the same query for the second city, he gets his second chance. OK, his travel gets a bit longer, but hey, he got a date with "D"!
To find the respective distances, take a fixed set of cities (towns, metropolitan areas, whatever your scale is), order them by location, and set each city's radius to the larger of the two distances to its immediate neighbours. This way you get overlapping areas, so even when a match cannot be found in one place, it may be found in another.
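A rough Python sketch of that neighbour-based radius idea (not from the original answer; it works on a single axis, matching the simplified coordinates used in this example):

def city_radii(cities):
    """cities: list of (name, position-on-axis) tuples."""
    cities = sorted(cities, key=lambda c: c[1])
    radii = {}
    for i, (name, pos) in enumerate(cities):
        left = pos - cities[i - 1][1] if i > 0 else 0
        right = cities[i + 1][1] - pos if i < len(cities) - 1 else 0
        # Use the larger of the two neighbour distances so the areas overlap.
        radii[name] = max(left, right)
    return radii

print(city_radii([("city1", 2), ("city2", 7)]))  # {'city1': 5, 'city2': 5}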
IIRC, Google Maps allows you to grab the cities of a country based on their size. An easier way would be to let people choose their respective city.
Notes
The code shown is not production ready and needs to be refined.
Instead of using "m" and "f" for denoting a gender, I suggest using 1 and 0: Can still be easily mapped, but needs less space to save.
Same goes for status.
I think the last solution is the best, optimizing distances in some way while keeping the chances of a match high.
Basically, the collection output by an elaborate aggregation pipeline over a very large dataset looks similar to the following:
{
"_id" : {
"clienta" : NumberLong(460011766),
"clientb" : NumberLong(2886729962)
},
"states" : [
[
"fixed", "fixed.rotated","fixed.rotated.off"
]
],
"VBPP" : [
244,
182,
184,
11,
299
],
"PPF" : 72.4,
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (the length of the states array and the variance of VBPP) with PyMongo, before converting to arrays, is as follows:
records_list = []
cursor = db.clientAgg.find({}, {'_id' : 0,
'states' : 1,
'VBPP' : 1,
'PPF': 1})
for record in cursor:
records_list.append(record)
for dicts in records_list:
dicts['states'] = len(dicts['states'])
dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing 500k dictionaries into memory to modify them before converting them to arrays for a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following, with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)
def iter():
for item in cursor:
yield item
l = []
for x in iter():
x['VBPP'] = np.var(x['VBPP'])
# Or
# db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators, since the variance is as simple as subtracting the mean from each element of the array, squaring the results, then averaging them.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip, which will improve your performance.
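As a sketch of that batched approach using the bulk write API that PyMongo now ships (UpdateOne plus bulk_write): the values are still computed client-side, but the updates go out in batches instead of one round-trip per document. The target field names statesLen and VBPPVar are made up for the example, and the database name is a placeholder.

import numpy as np
from pymongo import MongoClient, UpdateOne

coll = MongoClient().mydb.clientAgg  # placeholder database name

ops = []
for doc in coll.find({}, {"_id": 1, "states": 1, "VBPP": 1}):
    ops.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {"statesLen": len(doc["states"]),
                  "VBPPVar": float(np.var(doc["VBPP"]))}},
    ))
    if len(ops) == 1000:                      # flush in batches to limit memory
        coll.bulk_write(ops, ordered=False)
        ops = []

if ops:
    coll.bulk_write(ops, ordered=False)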
Addendum:
I have two other ideas. First, run some JavaScript server-side. E.g., to set every document's b field to 2 * a:
db.eval(function() {
var collection = db.test_collection;
collection.find().forEach(function(doc) {
var b = 2 * doc.a;
collection.update({_id: doc._id}, {$set: {b: b}});
});
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
$project: {
a: '$a',
b: {$multiply: [2, '$a']}
}
}, {
$out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
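The drop-and-rename step can be done in a single call; a PyMongo sketch with the collection names used above (dropTarget drops the existing source collection as part of the rename, and the database name is a placeholder):

from pymongo import MongoClient

db = MongoClient().test  # placeholder database name

# Replace test_collection with the aggregated copy produced by $out.
db.test_collection2.rename("test_collection", dropTarget=True)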
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.