I have a music app that has a job to find music recommendations based on a tag id.
There are two entities involved:
Song - a song record contains its name and a list of music tag ids (genres) this song belongs to
MusicTag - the music tag itself, includes id, name etc.
Data is currently stored in MongoDB.
The Songs collections in mongo have millions of songs, and each song has an average of 7 tag ids.
The MusicTags has about 30K records.
The Songs collection looks like that:
[
{
name: "Metallica - one",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"601861cc8cef62ba86765017", // Heavy metal
"5fda07ac8db0615c1c503a46" // Hard Rock
]
},
{
name: "Metallica - unforgiven",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"5fda07ac8db0615c1c503a46", // Metal
]
},
{
name: "Lady Gaga - Bad Romance",
tags: [
"5fc7b9f95e38e17282896b64", // Pop
"5fc729be5e38e17282844eff", // Dance
]
}
]
Given the tag "6018703624d8a5e8efa1b76e" (Rock), I want to query the Songs collection and find all songs that have Rock tag in their tags array.
In Mongo this is the query i'm doing:
db.songs.find({ tags: { $in: [ObjectId("6018703624d8a5e8efa1b76e")] }});
The performance of it is very bad (between 10 to 40 seconds and getting worst as long as the collection grows), I tried to index Mongo in various ways (the table contains more data that involve in the search, such as score and duration, but it's not relevant for now) but my queries are still take too long, I can't explain it (and I read a lot of official and unofficial stuff) but I have a feeling that holding the data in this nested form makes the index worthless and somehow still make a full scan on the table each time - but I can't prove it (the Mongo "explain" not really explained me something :) )
I'm thinking of using ElasticSearch for it, sync all songs data, and query it instead of the Mongo that will stay as the data SSOT and other lightweight ops.
But then the question remains open and I want to make sure: is in Elastic I can hold the data in that form (nested array inside song) or I need to represent it differently (e.g. flat it so every record will be song_tag index etc?
Thanks.
Elasticsearch doesn't offer a dedicated array type so what you'd typically do is define the mapping based on the type of the individual array items -- in your case a keyword:
PUT songs
{
"mappings": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
Then you'd index the docs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
"6018703624d8a5e8efa1b76e",
"601861cc8cef62ba86765017",
"5fda07ac8db0615c1c503a46"
]
}
and query the tags:
POST songs/_search
{
"query": {
"bool": {
"must": [
{ ... other queries },
{
"terms": {
"tags": [
"6018703624d8a5e8efa1b76e" // one or more
]
}
}
]
}
}
}
The tags are unique keywords but are not human-readable so you'd need to keep the map of them vs. the actual genres somewhere. Since the genres are probably set once and rarely, if ever, updated, you could use nested fields too. But your tags would then become an array of key-value pairs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
{
"tag": "6018703624d8a5e8efa1b76e",
"genre": "Rock"
}
...
]
}
The mapping would be slightly different and so would be the queries but now you wouldn't need the translation map, plus you could query or aggregate by human-readable values -- tags.genre.
Related
I'm putting together a REST based API but I'm not sure on how I should deliver the response for collections vs individual resources.
Does it make sense to have a slimmed down representation for a collection over a single item in the world of REST?
Say I have something along the lines of this for a collection of albums:
{
items: [
{
"id": 1,
"title": "Thriller"
},
...
]
}
But then for the actual individual item I had
{
"id": 1,
"title": "Thriller",
"artist": "Michael Jackson",
"released": "1982",
"imageLinks": {
"smallThumbnail": "...",
"largeThumbnail": "..."
}
...
}
A resource representation should be unique irrespective of whether it is given as a collection or a single item. But, you can introduce a new parameter like fields which can be used by the clients to get only the required field thereby optimising the bandwidth.
/albums - This should give the list of objects each having the structure of what you would give in a individual item api
/albums?fields=id,title - This can give the list of objects with just the id & title.
I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
_id : "Tree",
id: "18955875",
linksFrom: " [
{
name: "Forest",
count: 6
},
[...]
],
categories: [
"Trees",
"Forest_ecology"
[...]
]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
{
aggregate: "articles",
pipeline : [
{
$unwind: "$linksFrom"
},
{
$sort: { "linksFrom.count": -1 }
},
{
$project:
{
name: "$_id",
linksFrom: "$linksFrom"
}
},
{
$group:
{
_id: "$linksFrom.name",
linksTo: { $push: { name: "$name", count: { $sum : "$linksFrom.count" } } },
}
},
{
$out: "TEMPORARY"
}
] ,
allowDiskUse: true
}
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles, which link to United_States article and the corresponding document's size grows above 16MB, currently almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and then I can immiedately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for label with a given name.
Get all articles it might
point to.
Compare the incoming and outcoming links for these
articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongoshell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Lets begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link to?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTocould occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have use the following statements for creating a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that in case you already have answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 of use the .count()method. The aggregation again is simple
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields is probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do no always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to get the questions answered you have to the data.
I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type)
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
name: "O'Reilly Media",
founded: 1980,
location: "CA",
books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 12346789
},
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 234567890
}
So in order to get those sort of row results into the structure you wanted, one way to do this would be using the "upsert" functionality of the .update() method:
So assuming you have some way of looping the input values and they are identified with some structure then an analog to this would be something like:
books.forEach(function(book) {
db.publishers.update(
{
"name": book.publisher
},
{
"$setOnInsert": {
"founded": book.founded,
"location": book.location,
},
"$addToSet": { "books": book.book }
},
{ "upsert": true }
);
})
This essentially simplified the code so that MongoDB is doing all of the data collection work for you. So where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, as the "name".
In the case where that document is not found, then a new document is inserted. So either the database or driver will take care of creating the new _id value for this document and your "condition" is also automatically inserted to the new document since it was an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However it is not very "batch" like as you are still performing some operation to the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
batch.push({
"q": { "name": book.publisher },
"u": {
"$setOnInsert": {
"founded": book.founded,
"location": book.location,
},
"$addToSet": { "books": book.book }
},
"upsert": true
});
if ( ( batch.length % 500 ) == 0 ) {
db.runCommand( "update", "updates": batch );
batch = [];
}
});
db.runCommand( "update", "updates": batch );
So what is doing in setting up all of the constructed update statements into a single call to the server with a sensible size of operations sent in the batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB so this can be altered appropriate to your data.
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.
I am having some difficulty structuring my data so that I can benefit from reactivity in Meteor. Mainly nesting arrays of objects makes queries tricky.
The three main views I am trying to project this data onto are
waiter: shows order for one table, each persons meal (items nested, essentially what I have below)
kitchen manager: columns of orders by table (only needs table, note, and the items)
cook: columns of items, by category where started=true (only need item info)
Currently I have a meteor collection of order objects like this:
Order {
table: "1",
waiter: "neil",
note: "note from kitchen",
meals: [
{
seat: "1",
items: [ {n: "potato", category: "fryer", started: false },
{n: "water", category: "drink" }
]
},
{
seat: "2",
items: [ {n: "water", category: "drink" } ]
},
]
}
Is there any way to query inside the nested array and apply some projection, or do I need to look at an entirely different data model?
Assuming you're building this app for one restaurant, there shouldn't be many active orders at any given time—presumably you don't have thousands of tables. Even if you want to keep the orders in the database after they're served, you could add a field active to separate out the current ones.
Then your query is simple: activeOrders = Orders.find({active: true}).fetch(). The fetch returns an array, which you could loop through several times for each of your views, using nested if and for loops as necessary to dig down into the child objects. See also Underscore's _.pluck. You don't need to get everything right with some complicated Mongo query, and in fact your app will run faster if you query once and process the results however many times you need to.
I've got two tables.
Movies, which lists all the movies in the database.
Users, which has the users.
Usually, I'd create a join table to connect a user to a movie (as in, the user likes a certain movie).
However, since you can't do that in MongoDB, what should I do? I want to be able to find all the movies a certain user likes, as well as all the users that like a certain movie, and movies that a given set of users like.
Embedded documents?
Thanks!
For a many-to-many relationship between movies and users like this, I'd probably have separate collections for each, but denormalise users who like a movie into the movies collection by embedding their _id and name fields into a likes array.
This way, you can retrieve the names of users who like a movie without having to make a separate lookup to the users collection, but still have extra user fields that won't be embedded inside movies.
The trade off is that you'd need to update both collections if a user changed their name, but I think that's a worthwhile cost.
db.movies
{
_id: <objectid>,
name:"Star Wars",
likes: [
{ userid: <user-objectid>, name: "John Smith" },
{ userid: <user-objectid>, name: "Alice Brown" }
]
}
db.users
{
_id: <objectid>,
name: "John Smith",
username: "jsmith",
passwordhash: "d131dd02c5e6eec4693d"
}
Movies a certain user likes
db.movies.find( { "likes.userid": <user-objectid> }, { "name": 1 } );
Users that like a certain movie
db.movies.find( { "_id": <movie-objectid> },
{ "likes.userid": 1, "likes.name": 1 } );
Movies that a given set of users like
db.movies.find( { "likes.userid":
{ $in: [ <user1-objectid>, <user2-objectid> ] } },
{ "name": 1 } );
You can do that in MongoDB, you just can't do the 'join' operation at the database level, you have to do it at the application level.
If you have millions of movies and millions of users you have to do it using a join collection because there is no way you can fit the number of likes for one movie or one user into either document.
Lookup the User and get their _id
Lookup the UserMovie documents with matching _id values
Lookup the Movies as necessary
The denormalization you might do here would be to store the Movie names in the UserMovie collection so you can display the movies a user likes without having to fetch each one from the Movie collection.
A possible optimization
One optimization you can try on this scheme is to create documents in the UserMovie collection which contain multiple relationships instead of using a single document for each relationship (like you would in SQL).
For example, if the most common access pattern is finding what movies a user likes, you could group them by user and put them in one or more documents indexed by that user id. Take a look at the StatementGroups in this blog post for a more complete explanation.