Data model for geospatial data and multiple queries - MongoDB

I have a MongoDB document, let's call it place, which contains geo information; here is an example:
{
    "_id": "234235425e3g33424",
    "geo": {
        "lon": 12.23456,
        "lat": 34.23322
    },
    "some_field": "value"
}
Each place has a list of associated features:
{
    "_id": "2334sgfgsr435d",
    "place_id": "234235425e3g33424",
    "feature_field": "some_value"
}
As you can see, features are linked to places via the place_id field. Now I would like to find the list of features connected with the nearest places, but I would also like to add search conditions on place.some_field and feature.feature_field. And, importantly, I would like to limit the results.
Currently I am using the following approach:
1. Query places with conditions on geo and some_field.
2. Query features with conditions on feature_field and place_id (limited to the places found in step 1).
3. Limit the results in my application code.
My question is: is there a better approach to this task? Right now I cannot use Mongo's limit() function: if I apply it to places, I can end up with too few results, since I still need to make the second query; and I cannot limit() the second query, because its results come back in arbitrary order while I want them sorted by distance.
I know I could put all the data into one document, but I expect the list of features to be long, so I could exceed the BSON size limit.

Running out of 16MB for just the features seems unlikely... but it's possible. I don't think you realize how much 16MB is, so do the maths before assuming anything!
In any case, with MongoDB you cannot do a query with fields from two collections; a query always deals with one specific collection. I have done something very similar to what you have here though, which I've described in an article: http://derickrethans.nl/indexing-free-tags.html — have a look at that for some more inspiration.
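For what it's worth, a minimal shell sketch of that two-step approach, assuming a 2d index on places.geo and the collection/field names from the question (the limit of 100 place candidates is an arbitrary illustration):
// Step 1: nearest places matching the extra condition; $near returns
// results sorted by distance, so the order of placeIds is meaningful.
var placeIds = db.places.find(
    { geo: { $near: [12.23456, 34.23322] }, some_field: "value" },
    { _id: 1 }
).limit(100).map(function (p) { return p._id; });

// Step 2: features for those places; $in does not preserve the distance
// order, so the final sort and limit still happen in application code.
var features = db.features.find(
    { place_id: { $in: placeIds }, feature_field: "some_value" }
).toArray();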

Related

MongoDb many to many with big relations

I've read a lot of documentation and examples here on Stack Overflow, but I'm not really sure about my conclusions, so I'm asking for help.
Imagine we have a collection Films and a collection Users, and we want to know which users have seen a film, and which films a user has seen.
One way to design this in MongoDB is:
User:
{
    "name": "User1",
    "films": [filmId1, filmId2, filmId3, filmId4] // ObjectIds from Films
}
Film:
{
    "name": "The incredible MongoDb Developer",
    "watched_by": [userId1, userId2, userId3] // ObjectIds from Users
}
OK, this may work if the number of users/films is low, but if, for example, we expect one film to be watched by 800k users, the size of the array will be close to 800k * 12 bytes ~ 9.5MB, which is near the 16MB maximum size of a BSON document.
In this case, is there an approach other than the typical relational-world way, which would be to create an intermediate collection for the relations?
Also, I don't know whether reading and parsing a document of about 10MB will perform better than the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
Putting the films a user has seen into an array is a viable way, depending on your use cases. But especially if you want relations with attributes (say, date and place of viewing), updates and statistical analysis become less performant: you would need to $unwind your docs first, and subsequent $matches become more costly.
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as well as embedding does and, in my experience, allows for higher performance:
Given a collection with a structure like
{
    _id: someObjectId,
    date: ISODate("2016-05-05T03:42:00Z"),
    movie: "nameOfMovie",
    user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 months, in descending order of date?
db.views.aggregate([
    { $match: { user: userName, date: { $gte: threeMonthAgo } } },
    { $sort: { date: -1 } },
    { $group: { _id: "$user", viewed: { $push: { movie: "$movie", date: "$date" } } } }
])
or, if you are ok with an iterator, even easier with:
db.views.find({ user: username, date: { $gte: threeMonthAgo } }).sort({ date: -1 })
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
    { $match: {
        movie: movieName,
        date: {
            $gte: ISODate("2016-05-30T00:00:00"),
            $lt: ISODate("2016-05-31T00:00:00")
        }
    } },
    { $group: {
        _id: "$movie",
        views: { $sum: 1 }
    } }
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
For a given movie, show all users which have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
One thing to note: since we use the usernames and movie names directly, we do not need a JOIN (or anything similar), which should give us good performance. Plus, we do not have to do rather costly update operations when adding entries; instead of an update, we simply insert the data.
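For completeness, a hedged sketch of indexes that would support the sample queries above, assuming the collection is called views as in the examples:
// compound indexes matching the $match/$sort patterns shown above
db.views.createIndex({ user: 1, date: -1 })   // per-user history, newest first
db.views.createIndex({ movie: 1, date: 1 })   // per-movie counts over date ranges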

Best way to structure MongoDB with the following use cases?

Sorry to have to ask this, but I am new to MongoDB (I only have experience with relational databases) and was just curious as to how you would structure your MongoDB.
The documents will be JSON objects with some of the following fields:
{
    "url": "http://....",
    "text": "entire ad content including HTML (very long)",
    "body": "text (50-200 characters)",
    "date": "01/01/1990",
    "phone": "8001112222",
    "posting_title": "buy now"
}
Some of the values will be very long strings.
Each document is essentially an ad from a certain city. We are storing all ads for many big cities in the US (about 422). We store more ads every day, and the number of ads per city varies from as few as 0 to as many as 2000. The average is probably around 700-900.
We need to do the following types of queries, in almost instant time (if possible):
Get all ads for any specific city, for any specific date range.
Get all ads that were posted by a specific phone number, for any city, for any date range.
What would you recommend? I'm thinking I should have 422 collections - one for each city. I'm just worried about the query time when we search by phone number, because that would need to go through every collection. I have an iterable list of all collection names.
Or would it be faster to just have one collection so that I don't have to switch through 422 collections?
Thank you so much, everyone. I'm here to answer any questions!
EDIT:
Here is my "iterating through all collections" snippet:
import glob

# Use a raw string so the backslashes in the Windows path are not escapes.
for name in glob.glob(r"Data\Nov. 12 - 5pm\*"):
    # e.g. "...5pm\boston.json" -> "boston"
    val = name.split("5pm")[1].split(".json")[0][1:]
    coll = db[val]
    # Add into collection here...
MongoDB does not offer any operations which get results from more than one collection, so putting your data in multiple collections is not advisable in this case.
You can considerably speed up all the use cases you mentioned by creating indexes for them. When you have a very large dataset and always query for exact equality, hashed indexes are the fastest.
When you query a range of dates (between day x and day y), you should use the Date type rather than strings: not only does this give you lots of handy date operators in aggregation, it also lets you speed up range queries and sorts with ascending or descending indexes.
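As a minimal sketch, assuming a single collection named ads (name is illustrative) with a city field added and date stored as a Date:
// compound indexes covering both query patterns from the question
db.ads.createIndex({ city: 1, date: 1 })    // ads per city over a date range
db.ads.createIndex({ phone: 1, date: 1 })   // ads per phone number over a date range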
Maybe I'm missing something, but wouldn't making "city" a field in your JSON solve your problem? That way you only need to do something like this:
db.posts.find({ city: { $in: ['Boston', 'Michigan'] } })

In MongoDB, is one big $or search faster than multiple single searches?

I have a list of about 50 tags in an array, and want to search through my documents to find records that match these tags.
Because they're user-submitted and MongoDB is case-sensitive, I'm using /wildcard/i as a means of searching. I know this is not the fastest way to search, but I can't think of a better solution.
I can run my query in two ways. The first is to loop over my tags array and, for each tag, perform:
db.collection.find({tags: /<tag[x]>/i})
Or, I can collect all of the tags and run one single lookup using $or, like so:
db.collection.find({$or:[{tags:/<tag1>/i},{tags:/<tag2>/i},{tags:/<tag3>/i}, ... {tags:/<tag50>/i}]});
I have tried both, and found using $or to be significantly faster - but because of the work-in-progress state of my application, it's very difficult to tell whether this is because it's actually faster or whether my app is causing significant overhead in other areas (it is).
So for clarification, in MongoDB is a big query performed once faster than small queries performed many times?
EDIT: Another example would be whether looking up 3 individual records based on _id is faster than doing one lookup using {$or:[{_id: ObjectId([id1])},{_id: ObjectId([id2])},{_id: ObjectId([id3])}]}. Is less more?
I recommend you adjust your schema so it keeps a normalized array of tags. When you insert a new document, do it like this:
tags : [ "business", "Computing", "PayPal" ],
lowercaseTags : [ "business", "computing", "paypal" ]
Similarly when you update the tags, update both arrays.
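For illustration, a minimal shell sketch of keeping both arrays in sync on insert (using the collection name from the queries below):
var tags = [ "business", "Computing", "PayPal" ];
db.collection.insert({
    tags: tags,
    // derive the normalized array from the user-submitted one
    lowercaseTags: tags.map(function (t) { return t.toLowerCase(); })
});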
Create an index on lowercaseTags, and then when you want to query them, use a single query with the $in operator, and the normalized form of the search terms.
For example, to search for business iTunes YouTube, use this query:
db.collection.find( { lowercaseTags: { $in: [ "business", "itunes", "youtube" ] } } )
This answer gives an example of this approach. It should be loads faster than what you have.
An alternate approach you can take is to create a text index and use the text command.
Both of these approaches are geared toward index optimization, and designing your schema to work well with Mongo. The payoff should be a lot higher than whatever difference there is between a single $or query and 50 simpler queries.
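For the text-index alternative, a minimal sketch; on recent MongoDB versions (2.6+) the text command is exposed through the $text query operator, and text search is case-insensitive by default:
db.collection.createIndex({ tags: "text" })
db.collection.find({ $text: { $search: "business itunes youtube" } })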

MongoDB selection with existing value

I'm using PyMongo to fetch data from MongoDB. All documents in the collection look like the structure below:
{
    "_id": ObjectId("50755d055a953d6e7b1699b6"),
    "actor": {
        "languages": ["nl"]
    },
    "language": {
        "value": "nl"
    }
}
I'm trying to fetch all the conversations where the property language.value is inside the property actor.languages.
At the moment I know how to look for all conversations with a constant value inside actor.languages (e.g. all conversations with en inside actor.languages).
But I'm stuck on how to do the same comparison with a variable value (language.value) inside the current document.
Any help is welcome, thanks in advance!
db.testcoll.find({$where:"this.actor.languages.indexOf(this.language.value) >= 0"})
You could use a $where, provided your query set is small; but at any real size you could start seeing problems, especially since this query seems like one that needs to run in real time on a page, and the JS engine is single-threaded, among other problems.
A better way in this case may actually be through the client side; it is quite straightforward: pull out records based on one of the values, then iterate and test the other (i.e. pull out documents whose language.value is nl and test whether actor.languages contains that value).
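A minimal shell sketch of that client-side approach, reusing the collection name from the $where example:
db.testcoll.find({ "language.value": "nl" }).forEach(function (doc) {
    // keep only documents whose actor.languages contains language.value
    if (doc.actor.languages.indexOf(doc.language.value) >= 0) {
        printjson(doc);
    }
});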
I would imagine you might be able to do this with the aggregation framework; however, at the moment you cannot use computed fields within $match. If you could, I would imagine it would look something like this:
{ $project: { language_value: "$language.value", languages: "$actor.languages" } },
{ $match: { languages: { $in: ["$language_value"] } } }
But there might be a way.
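For future readers: MongoDB 3.6 added the $expr query operator, which allows exactly this kind of intra-document comparison without $where or computed $project fields. A minimal sketch:
db.testcoll.find({ $expr: { $in: [ "$language.value", "$actor.languages" ] } })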

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now to my question: is there a way to query an Event and get all Participants that are, e.g., Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
1. Store them in a subdocument inside the event document called "Over18" or something. Insert them into that subdocument (and possibly into the main list too, if you want), and then when you query the collection, instruct the database to only return the "Over18" subdocument. The downside is that you store your participants in two different subdocuments and you have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to check arbitrary ages (i.e. sometimes it's 18, but sometimes 21 or 25, etc.), this will not work.
2. Query the collection, retrieve the Participants subdocument, and filter it in your application code (a short sketch follows). Despite what some people may believe, this isn't terrible, because you don't want your database doing too much work all the time. Offloading the computation to your application can actually benefit your database, because it can then spend more time querying and less time filtering. It leads to better scalability in the long run.
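A minimal shell sketch of option 2, assuming an events collection with an embedded participants array (someEventId is a placeholder):
var event = db.events.findOne({ _id: someEventId }, { participants: 1 });
// filter the embedded array in application code
var adults = event.participants.filter(function (p) { return p.age > 18; });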
Short answer: no. I tried to do the same a couple of months back, but MongoDB does not support it (at least in versions <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate([
    { $unwind: '$participants' },
    { $match: { 'participants.age': { $gte: 18 } } },
    { $project: { participants: 1 } }
])
This will return a list of n documents, where n is the number of participants over 18, and each entry looks like this (note that participants now holds a single subdocument instead of an array):
{
    _id: objectIdOfTheEvent,
    participants: { firstName: 'only one', lastName: 'participant' }
}
It could probably even be flattened on the server to return a plain list of participants. See the official documentation for more information.