There are several collections, i.e. Country, Province, City, Univ.
Just like in the real world, every country has several provinces, and every province has several cities, and every city has several universities.
How can I find out whether a university is in a given country? For example, country0 may have some universities; what are their _ids?
Documents in those collections are shown below:
{
_id:"country0",
provinces:[
{
$ref:"Province",
$id:"province0"
},
...
]
}
{
_id:"province0",
belongs:{$ref:"Country", $id:"country0"},
cities:[
{
$ref:"City",
$id:"city0"
}
...
]
}
{
_id:"city0",
belongsTo:{$ref:"Province",$id:"province0"},
univs:[
{
$ref:"Univ",
$id:"univ0"
}
...
]
}
{
_id:"univ0",
address:{$ref:"City", $id:"city0"}
}
If there were only two collections, I know fetch() may be useful.
Python drivers may also be useful, but I can't measure their performance well, because I can't use db.system.profile in a .py file.
MongoDB doesn't do joins. N queries are required to get information from N collections. In this situation, to get the _id's of universities in a given country in an array one could do the following (in the mongo shell):
> var country = db.countries.findOne({ "_id": "country0" });
> var province_ids = [];
> country.provinces.forEach(function(province) { province_ids.push(province["$id"]); });
> var provinces = db.provinces.find({ "_id": { "$in": province_ids } });
> var city_ids = [];
> provinces.forEach(function(province) { province.cities.forEach(function(city) { city_ids.push(city["$id"]); }); });
> var cities = db.cities.find({ "_id": { "$in": city_ids } });
> var univ_ids = [];
> cities.forEach(function(city) { city.univs.forEach(function(univ) { univ_ids.push(univ["$id"]); }); });
It's also possible to accomplish the same thing with the belongsTo field, using similar steps. This is cumbersome, and it seems like there should be a better way. There is: denormalize the data. Countries have provinces, which have cities, which have universities, but the relationships are fixed and not of huge cardinality. For queries like "what universities are in a given country?" I would suggest storing province documents entirely within country documents and university documents entirely within city documents. You could store cities inside province documents, or inside country documents directly, but a province or country could have hundreds or thousands of cities, and that might be too much information for one document (MongoDB limits a document to 16MB). Having provinces in countries and universities in cities reduces the number of queries necessary to two.
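As a rough sketch of that embedded layout, here is how the two lookups could fit together. Plain JavaScript arrays stand in for the two collections, and the field names (provinces embedded in the country, a province field and embedded univs on each city) are assumptions made for illustration, not part of the original schema.

```javascript
// Hypothetical documents: provinces are embedded in the country,
// and universities are embedded in each city.
const countries = [
  { _id: "country0", provinces: [{ _id: "province0" }, { _id: "province1" }] },
  { _id: "country1", provinces: [{ _id: "province2" }] }
];
const cities = [
  { _id: "city0", province: "province0", univs: [{ _id: "univ0" }, { _id: "univ1" }] },
  { _id: "city1", province: "province1", univs: [{ _id: "univ2" }] },
  { _id: "city2", province: "province2", univs: [{ _id: "univ3" }] }
];

// Step 1 stands in for a findOne on the countries collection.
// Step 2 stands in for a find on the cities collection with an "$in" on province.
function univIdsInCountry(countryId) {
  const country = countries.find(c => c._id === countryId);
  if (!country) return [];
  const provinceIds = new Set(country.provinces.map(p => p._id));
  return cities
    .filter(city => provinceIds.has(city.province))
    .flatMap(city => city.univs.map(u => u._id));
}

console.log(univIdsInCountry("country0"));
```

Only two lookups are needed here because the province level is already inside the country document and the university level is already inside each city document.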
Another option is to store more information in each child document. Essentially you have a forest (a collection of trees): countries are parents of provinces which are parents of cities which are parents of universities. The belongsTo field is a parent reference. You could store a reference to all ancestors instead of just the parent. Then finding all universities in a given country is one query on the universities collection.
> db.universities.findOne();
{
_id: "univ0",
city: "city0",
province: "province0",
country: "country0"
}
> db.universities.find({ "country": "country0" });
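To illustrate how such ancestor fields might be filled in when a university is written, here is a minimal in-memory sketch. The lookup tables provinceOfCity and countryOfProvince and the helper withAncestors are hypothetical names standing in for the parent references; the final filter is the plain JavaScript equivalent of the single find shown above.

```javascript
// Hypothetical parent lookups (child _id -> parent _id),
// standing in for the parent references stored on each document.
const provinceOfCity = { city0: "province0", city1: "province1" };
const countryOfProvince = { province0: "country0", province1: "country1" };

// Build a university document that carries every ancestor, not just the parent.
function withAncestors(univId, cityId) {
  const province = provinceOfCity[cityId];
  const country = countryOfProvince[province];
  return { _id: univId, city: cityId, province: province, country: country };
}

const universities = [
  withAncestors("univ0", "city0"),
  withAncestors("univ1", "city1")
];

// In-memory equivalent of a single query on the country field.
const inCountry0 = universities
  .filter(u => u.country === "country0")
  .map(u => u._id);
console.log(inCountry0);
```

The cost of this design is paid at write time: if a city ever moved to another province, every university in it would need its ancestor fields updated.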
The schema design that is best for you depends on the types of queries your application will need to answer and their relative frequencies and importance. I can't determine that from your question so I can't firmly recommend one schema over another.
As to your mini-question about performance and the db.system.profile collection, note that db.system.profile is a collection. You can query it from a .py file using a driver.
Related
Basically we have four tables that are joined: Malls, Stores, Brands, and Categories.
Malls can have many Stores.
Each Store is linked to a Brand.
Each Brand has many Categories.
e.g. Mall A has 2 "McDonald's Cafes", each belonging to the Brand "McDonald's". "McDonald's" Brand has "Fast Food" and "Breakfast" as Categories.
We're trying to figure out how to display all the Categories that exist within Mall A.
e.g. Mall A has "Fast Food" and "Breakfast" categories, based on the "McDonald's" stores.
Ideally this would be a query so that the Categories are updated automatically.
We've tried querying for all Stores within a Mall, and then finding the Categories via the Store-Brand join, then reducing duplicates. But some Malls have more than 700 Stores, so this process is quite expensive in terms of querying and processing the data.
I managed to figure it out! Using Sequelize, my query was as follows:
postgres.BrandsCategories.findAndCountAll({
limit,
offset,
include: [
{
model: postgres.Brands,
as: "brands",
include: [
{
model: postgres.Stores,
as: "stores",
where: {
mallId: parent.dataValues.id
}
}
]
}
]
})
I have a large collection (300K records) of very large and complex documents. The aggregation framework seems to struggle when working with nested arrays and objects. If I wanted to set up views that simplify each record, is it better to have many records of very basic data that together make up a single complex record, or a single flat record that has hundreds of dynamic field names?
Here is an example
var obj = {
orderID:1,
items:[
{
sku:"abc123",
price:"$1.21"
},
{
sku:"aaa111",
price:"$2.21"
},
{
sku:"bbb222",
price:"$3.21"
}
]
}
var vertical = [ //Each item is a record in the collection
{
orderID:1,
sku:"abc123",
price:"$1.21"
},
{
orderID:1,
sku:"aaa111",
price:"$2.21"
},
{
orderID:1,
sku:"bbb222",
price:"$3.21"
}
];
var flat = { // A single record in the collection
orderID:1,
abc123:"$1.21",
aaa111:"$2.21",
bbb222:"$3.21"
}
In my case, each record includes anywhere from 200-1000 items in the arrays within the record. So think of this scaled up to the extreme.
With vertical, I have well-defined (indexable) fields and the records are small. But in my DB it would generate around 50M of these bite-sized records.
With flat, I only have to pull a single record and as long as I know the fields I am looking for, it can work. But the fields are dynamic and not indexable.
With the only purpose of this being for reporting use, what is the best way to go?
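For reference, the two target shapes can be derived from the nested record mechanically. The sketch below uses hypothetical helper names (toVertical, toFlat) and runs on a plain object rather than a collection; it only illustrates the transformation and says nothing about which layout performs better.

```javascript
// The nested record from the example above.
const order = {
  orderID: 1,
  items: [
    { sku: "abc123", price: "$1.21" },
    { sku: "aaa111", price: "$2.21" },
    { sku: "bbb222", price: "$3.21" }
  ]
};

// "Vertical": one small record per item, each with fixed, indexable field names.
function toVertical(o) {
  return o.items.map(item => ({ orderID: o.orderID, sku: item.sku, price: item.price }));
}

// "Flat": one record per order, with one dynamic field per sku.
function toFlat(o) {
  const flat = { orderID: o.orderID };
  for (const item of o.items) {
    flat[item.sku] = item.price;
  }
  return flat;
}

console.log(toVertical(order));
console.log(toFlat(order));
```

Note that the flat form silently keeps only the last price if the same sku appears twice in one order, while the vertical form keeps both rows.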
I've got a question about designing documents so that aggregation can be performed efficiently. Take this dummy document as an example:
{
product: "Name of the product",
description: "A new product",
comments: [ObjectId(xxxxx), ObjectId(yyyy),....]
}
As you can see, I have a simple document that describes a product and references some comments on it. Imagine this product is so popular that it has millions of comments. A comment is a simple document with a date, a text and possibly some other fields. The problem is that such a product can easily grow larger than 16MB, so I need to keep the comments in a separate collection rather than embed them in the product.
What I would like to do now is perform an aggregation on the product collection. A first step could be, for example, to select various products and sort their comments by date. That is quite an easy operation with embedded documents, but how can I do it with this design? I only have the ObjectIds of the comments, not their content. Of course, I'd like to perform this aggregation in a single operation, i.e. I don't want to run the first part of the aggregation, then query the results and run another aggregation.
I don't know if that's clear enough? ^^
I would go about it this way: create a temp collection that is the exact copy of the product collection with the only exception being the change in the schema on the comments array, which would be modified to include a comment object instead of the object id. The comment object will only have the _id and the date field. The above can be done in one step:
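db.product.find().forEach( function (doc){
    // Start with a fresh array for each product so comments do not leak between products
    var comments = [];
    doc.comments.forEach( function(x) {
        var obj = {"_id": x };
        // Look up the referenced comment to copy its date
        var comment = db.comment.findOne(obj);
        obj["date"] = comment.date;
        comments.push(obj);
    });
    // Replace the array of ObjectIds with the array of {_id, date} objects
    doc.comments = comments;
    db.temp.insert(doc);
});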
You can then run your aggregation query against the temp collection:
db.temp.aggregate([
{
$match: {
// your match query
}
},
{
$unwind: "$comments"
},
{
$sort: { "comments.date": 1 } // sort the pipeline by comments date
}
]);
I have these collections:
city: {"_id", "name"}
company: {"_id", "name", "cityID"}
comments: {"_id", "text", "companyID"}
I need to select the last 10 comments on companies in a certain city.
Right now I first select the _id of every company in the city, and then the last 10 comments for those _ids.
Here's the code:
$db->execute('function () {
    var result = {};
    var company = [];
    result.company = [];
    db.company.find({"city": "msk"}, {"title": 1, "_id": 1}).forEach(function (a) {
        result.company[a._id] = Object.deepExtend(a, result[a._id]);
        company.push(a._id);
    });
    var comments = [];
    result.comments = [];
    db.comments.find({"company": {"$in": company}}).sort({"createTime": -1}).limit(10).forEach(function (a) {
        result.comments[a._id] = Object.deepExtend(a, result[a._id]);
        comments.push(a._id);
    });
    return result;
}');
Is there a better way to do what I need?
Thanks in advance for your advice!
Right now your code
db.comments
.find({"company": {"$in": company}})
.sort({"createTime": -1})
.limit(10)
is only fetching the last 10 comments overall for the specified city, rather than the last 10 for each company. Is this what you want? If so, then it might make sense to just store a "city" field for each comment, and just query the comments collection based on city.
If you want the last 10 for each company, with your current schema you will have to make individual queries on the comments collection for each relevant company.
What is the use case for your data?
Edit:
In a relational database model, you would be able to execute a JOIN query that would get you the information from both tables (companies and comments) in one query. However, since MongoDB does not support joins, if you want to be able to fetch the information you need in a single query then you need to denormalize the data a little bit by storing a city field in the comments collection.
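As a minimal sketch of that denormalized layout, assume each comment carries a city field alongside company and createTime (the sample values and the helper name lastCommentsForCity are made up for illustration). A plain array stands in for the comments collection, and the filter, sort and slice mirror a find on city followed by sort and limit.

```javascript
// Hypothetical comments, each storing its company's city directly.
const comments = [
  { _id: "c1", company: "acme", city: "msk", createTime: 1 },
  { _id: "c2", company: "beta", city: "msk", createTime: 3 },
  { _id: "c3", company: "gamma", city: "spb", createTime: 4 },
  { _id: "c4", company: "acme", city: "msk", createTime: 2 }
];

// One pass over a single collection: filter by city, newest first, keep `limit`.
function lastCommentsForCity(allComments, city, limit) {
  return allComments
    .filter(c => c.city === city)                  // filter returns a copy, so the sort below does not reorder the input
    .sort((a, b) => b.createTime - a.createTime)
    .slice(0, limit);
}

console.log(lastCommentsForCity(comments, "msk", 2).map(c => c._id));
```

The trade-off is that the city value is duplicated on every comment and has to be kept in sync if a company ever changes city.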
I have two collections: shoppers (everyone in a shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and on any given day a person can be on the beach, shopping, doing both, or doing neither. I now want to run a query for all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL databases. I saw similar questions about joins, and in most cases the suggestion was to denormalize. So one solution I could think of is to create an activity collection, indexed on date, with the user's actions embedded. Something like:
{
user_id
date
actions {
[action_type, ..]
}
}
Insertion now becomes costly, as now I will have to query before insert.
A few suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When a new activity comes in, you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date and activity, then your inserts are much faster, but your queries now have to do a LOT of work querying across users, dates and activities.
With the proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date": "somedate"}, {$inc: {"shopped": 1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist", also known as an "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
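To make the upsert-and-increment behaviour concrete, here is a small in-memory sketch in plain JavaScript. A Map keyed by user and date stands in for the collection, and the helper names recordActivity and shoppedButNotBeached are hypothetical. The query helper treats a missing counter as zero, which is an assumption of this sketch rather than something the statements above guarantee.

```javascript
// One record per user per day, keyed by "user|date".
const records = new Map();

// Mimics an upsert with an increment: create the record if needed, then bump the counter.
function recordActivity(user, date, activity) {
  const key = user + "|" + date;
  if (!records.has(key)) {
    records.set(key, { user: user, date: date });   // the "upsert" part
  }
  const doc = records.get(key);
  doc[activity] = (doc[activity] || 0) + 1;          // the "increment by 1" part
  return doc;
}

// Users who shopped at least once on `date` and never went to the beach that day.
// A counter that was never incremented is treated as zero here.
function shoppedButNotBeached(date) {
  return [...records.values()]
    .filter(d => d.date === date && (d.shopped || 0) > 0 && (d.beached || 0) === 0)
    .map(d => d.user);
}

recordActivity("user1", "2012-12-01", "shopped");
recordActivity("user1", "2012-12-01", "shopped");
recordActivity("user2", "2012-12-01", "shopped");
recordActivity("user2", "2012-12-01", "beached");

console.log(shoppedButNotBeached("2012-12-01"));
```

Every write touches exactly one record, and the read is a simple filter on fixed field names, which is the trade-off this schema is aiming for.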
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add a record per user per day for the activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
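That client-side step could be sketched as follows, in plain JavaScript over the array of documents returned by the query. The function name shoppersWhoSkippedBeach and the sample data are made up for illustration.

```javascript
// Given activity documents for the date range, return the user ids that
// have at least one "shopping" record and no "beach" record.
function shoppersWhoSkippedBeach(activities) {
  const shoppers = new Set();
  const beachGoers = new Set();
  for (const a of activities) {
    if (a.action === "shopping") {
      shoppers.add(a.user_id);
    } else if (a.action === "beach") {
      beachGoers.add(a.user_id);
    }
  }
  // Set difference: shoppers minus beach-goers.
  return [...shoppers].filter(id => !beachGoers.has(id));
}

// Hypothetical query result.
const sample = [
  { user_id: 1, action: "shopping" },
  { user_id: 1, action: "beach" },
  { user_id: 2, action: "shopping" },
  { user_id: 3, action: "beach" },
  { user_id: 2, action: "shopping" }
];

console.log(shoppersWhoSkippedBeach(sample));
```

Using sets keeps the step linear in the number of returned records and removes duplicates for users who shopped on several days.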
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: new Date(2012, 5, 1) },
{ action_type: "shopping", date: new Date(2012, 5, 2) }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people who went shopping, but did not go to the beach, in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
    $and: [
        // Each condition must be its own document inside the $and array
        { actions : {
            $elemMatch : {
                action_type : { $in: ["shopping"] },
                date : { $gt : start }
            }
        } },
        { actions : {
            $not: {
                $elemMatch : {
                    action_type : { $in: ["beach"] },
                    date : { $gt : start }
                }
            }
        } }
    ]
});