Full Join/Intersection in couchdb - nosql

I have some documents which have 2 sets of attributes: tag and lieu. Here is an example of what they look like:
{
title: "doc1",
tag: ["mountain", "sunny", "forest"],
lieu: ["france", "luxembourg"]
},
{
title: "doc2",
tag: ["sunny", "lake"],
lieu: ["france", "germany"]
},
{
title: "doc3",
tag: ["sunny"],
lieu: ["belgium", "luxembourg", "france"]
}
How can I map/reduce and query my DB to be able to retrieve only the intersection of documents that match these criteria:
lieu: ["france", "luxembourg"]
tag: ["sunny"]
Returns: doc1 and doc3
I cannot figure out any format map/reduce could return that would let me do this in a single query. What I am doing now is: emit every lieu/tag as a key with the related document's id as the value, then reduce so that every key holds an array of document ids. From my app I query this view, intersect the results on the app side (keeping only the docs that appear under all 3 keys: luxembourg, france and sunny), and then query CouchDB again with those ids to retrieve the actual docs. I feel that's not the right/best way to do it?
I am using lists to do the intersection job, and it works quite well. But I still need another request to get the documents by their ids. Any idea what I could do differently to retrieve the documents directly?
Thank you!

This is going to be awkward. The basic idea is that you have to build a view where the map function emits every possible combination of tags and countries as the key, and there's no reduce function. This way, looking for ["france","luxembourg"] would return every document that emitted that key (and is therefore in the intersection). A view without a reduce function returns one row per emit, carrying the id of the emitting document, and querying it with include_docs=true returns the document itself in each row. This way, you only have to do one request.
This causes a lot of emits to happen, but you can lower that number by sorting the tags both when emitting and when searching (automatically turn ["luxembourg","france"] into ["france","luxembourg"]), and by taking advantage of the ability of CouchDB to query prefixes (this means that emitting ["belgium","france","luxembourg"] will let you match searches for ["belgium"] and ["belgium","france"]).
In your example above, for the countries, you would only emit:
// doc 1
emit(["luxembourg"],null);
emit(["france","luxembourg"],null);
// doc 2
emit(["germany"],null);
emit(["france","germany"],null);
// doc 3
emit(["luxembourg"],null);
emit(["belgium","luxembourg"],null);
emit(["france","luxembourg"],null);
emit(["belgium","france","luxembourg"],null);
Anyway, for complex queries like this one, consider looking into a CouchDB-Lucene combination.
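The emit list above follows a pattern: once the values are sorted, a combination needs its own emit only if it contains the document's last value, because every other combination is a prefix of one that does. A minimal sketch of a map function built on that observation follows. The helper name combinationKeys is hypothetical, the sketch covers only the lieu field as in the example, and it assumes the array holds no duplicates.

```javascript
// Hypothetical helper: keys to emit for one array of values.
function combinationKeys(values) {
  var sorted = values.slice().sort();
  if (sorted.length === 0) {
    return [];
  }
  var last = sorted[sorted.length - 1];
  var rest = sorted.slice(0, -1);
  var keys = [];
  // Each bit of mask decides whether the matching element of rest is included.
  for (var mask = 0; mask < (1 << rest.length); mask++) {
    var key = [];
    for (var i = 0; i < rest.length; i++) {
      if (mask & (1 << i)) {
        key.push(rest[i]);
      }
    }
    key.push(last); // every emitted key ends with the largest value
    keys.push(key);
  }
  return keys;
}

// Map function for the view; emit is provided by CouchDB.
// In a design document the helper would have to be defined inside this function.
function map(doc) {
  if (!Array.isArray(doc.lieu)) {
    return;
  }
  combinationKeys(doc.lieu).forEach(function (key) {
    emit(key, null);
  });
}
```

For doc3 this produces the same four keys as listed above. A prefix search such as ["belgium"] would then be a range query from startkey=["belgium"] to endkey=["belgium",{}]; the same document can appear in several rows of that range, so the rows may need de-duplicating. The number of keys still doubles with every extra value in the array.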

Related

custom sort for a mongodb collection in meteor

I have this collection of products and I want to display a top 10 of products based on a custom sort function
[{ _id: 1, title, tags:['a'], createdAt:ISODate("2016-01-28T00:00:00Z") } ,
{ _id: 2, title, tags:['d','a','e'], createdAt:ISODate("2016-01-24T00:00:00Z") }]
What I want to do is sort it based on a "magic score" that can be calculated. For example, based on this formula: tag_count*5 - number_of_days_since_it_was_created.
If the first one is 1 day old, this makes the score:
[{_id:1 , score: 4}, {_id:2, score: 10}]
I have a few ideas on how I can achieve this, but I'm not sure how good they are, especially since I'm new to both mongo and meteor:
- start an observer (Meteor.observe) and every time a document is modified (or a new one created), recalculate the score and update it on the collection itself. If I do this, I could just use $orderBy where I need it.
- after some reading I discovered that mongo aggregate or map_reduce could help me achieve the same result, but as far as I found out, meteor doesn't support it directly
- sort the collection on the client side as an array, but using this method I'm not sure how it will behave with pagination (considering that I subscribe to a limited number of documents)
Thank you for any information you can share with me!
Literal function sorting is just being implemented in meteor, so you should be able to do something like
Products.find({}, {sort: scoreComparator});
in an upcoming release.
You can use the transform property when creating collection. In this transform, store the magic operation as a function.
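score=function(){
// return some score computed from this document (e.g. this.tags, this.createdAt)
};
transformer=function(product){
product.score=score;
// one could also use prototypal inheritance
return product; // a transform function must return the transformed document
};
Products=new Meteor.Collection('products',{transform:transformer});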
Unfortunately, you cannot yet use the sort operator on virtual fields, because minimongo does not support it.
So, as you mentioned, the ultimate fallback while neither virtual-field sorting nor literal-function sorting is supported in minimongo is client-side sorting:
// Later, within some template
scoreComparator=function(prd1,prd2){
// highest score first, so that the top products come first
return prd2.score()-prd1.score();
}
Template.myTemplate.helpers({
products:function(){
return Products.find().fetch().sort(scoreComparator);
}
});
Regarding "I'm not sure how it will behave with pagination (considering that I subscribe to a limited number of documents)":
EDIT: the score will indeed be computed only among the subscribed documents.
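As an illustration of the formula from the question combined with client-side sorting, here is a minimal sketch in plain JavaScript, independent of Meteor. The function names magicScore and topProducts are hypothetical, and the sketch assumes createdAt is a Date and tags is an array.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Score from the question: tag_count * 5 - days since creation.
function magicScore(product, now) {
  var ageInDays = Math.floor((now - product.createdAt) / MS_PER_DAY);
  return product.tags.length * 5 - ageInDays;
}

// Highest score first, keeping at most limit products.
// slice() copies the array so the caller's array is not reordered.
function topProducts(products, now, limit) {
  return products
    .slice()
    .sort(function (a, b) {
      return magicScore(b, now) - magicScore(a, now);
    })
    .slice(0, limit);
}
```

With the two sample products and a current date of 2016-01-29, the scores come out as 4 and 10, matching the example in the question. Because the score depends on the current date, a stored score would go stale daily, which is one argument for computing it at read time on the subscribed documents.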

In MongoDB, when to use a simple subdocument, when an array with 2-field elements?

Background
I am storing table rows as MongoDb documents, with each column having a name. Let's say table has these columns of interest: Identifier, Person, Date, Count. The MongoDb document also has some extra fields separate from the table data, represented by timestamp. Columns are not fixed (which is why I use schema-free database to store them in the first place).
There will be a need to run various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say, 1000 new MongoDb documents per day). So the amount of data will grow steadily over time.
Example
The straight-forward approach is having a collection of MongoDb documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in accepted answer of this question), using array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how to choose which to use? Especially, are there some specific kinds of queries which are easy/cheap with one approach, hard/costly with another? Any "rules of thumb" on which way to go, or pro-con lists for both? Example real-life cases of one approach being inconvenient would be especially valuable.
In your specific example the first version is a lot simpler and more appropriate. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume mongoDB would generate better query plans because the structure is a lot more simple (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
I don't think that the situation in the other example here and yours are the same. In the other example, they're creating a list of items with one of two answers, which would be more appropriately in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object since they all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.
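To make the difference in query shape concrete, here is a minimal sketch that builds the filter object for each layout and converts a subdocument into the pair-array layout. The function names are hypothetical; the filters are the shapes one would pass to find(), with $elemMatch used for the array layout so that field and value are matched on the same array element.

```javascript
// Filter for the first layout: data is a plain subdocument.
function subdocumentQuery(name, value) {
  var query = {};
  query["data." + name] = value;
  return query;
}

// Filter for the second layout: data is an array of {field, value} pairs.
// $elemMatch requires both conditions to hold on the same array element.
function pairArrayQuery(name, value) {
  return { data: { $elemMatch: { field: name, value: value } } };
}

// Convert a subdocument into the pair-array layout.
function toPairs(data) {
  return Object.keys(data).map(function (name) {
    return { field: name, value: data[name] };
  });
}
```

For example, subdocumentQuery("Identifier", "AB002") gives {"data.Identifier": "AB002"}, while the array layout needs the nested $elemMatch form for the same lookup.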

Datastore solution for tag search

I've got millions of items ordered by a precomputed score. Each item has many boolean attributes.
Let's say that there are about ten thousand possible attributes in total, each item having a dozen of them.
I'd like to be able to request in realtime (few milliseconds) the top n items given ~any combination of attributes.
What solution would you recommend? I am looking for something extremely scalable.
--
- We are currently looking at mongodb and array index, do you see any limitation ?
- SolR is a possible solution but we do not need text search capabilities.
MongoDB can handle what you want if you store your objects like this
{ score:2131, attributes: ["attr1", "attr2", "attr3"], ... }
Then the following query will match all the items that have both attr1 and attr2
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] } })
but this one won't match the example item above, because it has no attr4
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr4" ] } })
The query returns a cursor. If you want this cursor to be sorted, just add the sort parameters to the query like so (use score:-1 instead to get the highest scores first)
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] }}).sort({score:1})
Have a look at Advanced Queries to see what's possible.
Appropriate indexes can be setup as follows
db.mycol.ensureIndex({attributes:1, score:1})
And you can get performance information using
db.mycol.find({ attributes: { $all: [ "attr1" ] }}).explain()
Mongo reports how many objects were scanned, how long the operation took, and various other statistics.
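What the $all query plus a sort computes can be illustrated in memory. This is only a sketch of the semantics, not of how MongoDB executes the query; the names hasAll and topN are hypothetical, and it assumes "top" means highest score first.

```javascript
// True when attributes contains every entry of required, like $all.
function hasAll(attributes, required) {
  return required.every(function (name) {
    return attributes.indexOf(name) !== -1;
  });
}

// Items that carry all required attributes, highest score first, at most n.
// filter() returns a new array, so sorting does not reorder the input.
function topN(items, required, n) {
  return items
    .filter(function (item) {
      return hasAll(item.attributes, required);
    })
    .sort(function (a, b) {
      return b.score - a.score;
    })
    .slice(0, n);
}
```

On the server, the equivalent of slice(0, n) would be a limit on the cursor, so that only the n best matches are sent back.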
This is exactly what Mongo can deal with. The fact that your attributes are boolean type helps here. A possible schema is listed below:
[
{
true_tags:[attr1, attr2, attr3, ...],
false_tags: [attr4, attr5, attr6, ...]
},
]
Then we can index on true_tags and false_tags. And it should be efficient to search with $in, $all, ... query operators.
Redis would be a perfect candidate for "the top n items" for "millions of items ordered by score".
Redis has a built-in data structure that you can start from: the Sorted Set. Every member of a Sorted Set is associated with a score, and members can be retrieved by score range, for example with ZRANGEBYSCORE:
ZRANGEBYSCORE key min max [WITHSCORES] [LIMIT offset count]
I encourage you to look at Sorted Set commands and get a feel for Redis, as your problem (as it is stated) asks for it. You may of course keep as many attributes as you'd like within a single Set element.
As far as MongoDB goes, since you mentioned millions, unless you can bend incremental queries to work for your problem, I would not expect a sub-second response.
As @nickdos mentioned, Solr relevancy is quite a powerful feature, but the number of attributes will be a problem, since it would need to keep all these attributes in memory for each item. Although a dozen for each may not be that bad => just try and see.

MongoDB: Speed of field ("inside record") search in comparison with speed of search in "global scope"

My question may not be very well formulated because I haven't worked with MongoDB yet, so I'd like to know one thing.
I have an object (record/document/anything else) in my database - in global scope.
And have a really huge array of other objects in this object.
So, what about speed of search in global scope vs search "inside" object? Is it possible to index all "inner" records?
Thanks beforehand.
So, like this
users: {
..
user_maria:
{
age: "18",
best_comments :
{
goodnight:"23rr",
sleeptired:"dsf3"
..
}
}
user_ben:
{
age: "18",
best_comments :
{
one:"23rr",
two:"dsf3"
..
}
}
So, how can I make it fast to find user_maria->best_comments->goodnight (index context of collections "best_comment") ?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing :
db.users {
name:"maria",
age: 18,
best_comments: [
{
title: "goodnight",
comment: "23rr"
},
{
title: "sleeptired",
comment: "dsf3"
}
]
}
With that schema in mind you can put an index on name and best_comments.title for example like so :
db.users.ensureIndex({name:1, 'best_comments.title':1})
Then, when you want the query you mentioned, simply do
db.users.find({name:"maria", 'best_comments.title':"goodnight"})
And the database will hit the index and will return this document very fast.
Now, all that said, your schema is very questionable. You mention you want to query specific comments, but that requires either the comments being in a separate collection or you filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem. Documents have a 16 MB limit, and if documents increase in size all the time Mongo will have to continuously move them on disk.
My advice :
Put comments in a separate collection
Either do a document per comment or make comment bucket documents (say, 100 comments per document)
Read up on Mongo/NoSQL schema design. You always query for root documents so if you end up needing a small part of a large embedded structure you need to reexamine your schema or you'll be pumping huge documents over the connection and require app-side filtering.
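The bucket idea from the advice above can be sketched as a plain function that splits one user's comments into bucket documents. The function name toBuckets and the field names user and bucket are hypothetical; only the bucket size of 100 comes from the suggestion.

```javascript
// Split one user's comments into bucket documents of at most bucketSize comments.
function toBuckets(userName, comments, bucketSize) {
  var buckets = [];
  for (var start = 0; start < comments.length; start += bucketSize) {
    buckets.push({
      user: userName,           // which user the bucket belongs to
      bucket: buckets.length,   // running bucket number, starting at 0
      comments: comments.slice(start, start + bucketSize)
    });
  }
  return buckets;
}
```

Each bucket would be stored as its own document in the separate comments collection, so no single document grows without bound and a query for one comment only pulls one bucket over the connection.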
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
Collection 'table1':
records: {a:1,b:1,c:1}
{a:1,b:2,d:1}
{a:1,c:1,d:1}
indices:
ensureIndex({a:ascending, d:ascending}) <- this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well first of all, in your table here, you are assigning multiple values to the attribute "name" and "value". MongoDB will ignore/overwrite the original instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now my question is: is there a way to query an Event and get all Participants that have, for example, Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want) and then when you query the collection, you can instruct the database to only return the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes it's 18 but sometimes it's 21 or 25, etc.) then this will not work.
Query the collection and retrieve the Participants subdocument and then filter it in your application code. Despite what some people may believe, this isn't terrible because you don't want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database because it now can spend more time querying and less time filtering. It leads to better scalability in the long run.
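The second option, filtering in application code, can be as small as the following sketch. It assumes the event document has a participants array whose entries carry a numeric age field; the function name participantsOlderThan is hypothetical.

```javascript
// Return the participants of one event document that are older than minAge.
// A missing participants array is treated as empty.
function participantsOlderThan(event, minAge) {
  return (event.participants || []).filter(function (participant) {
    return participant.age > minAge;
  });
}
```

Because the threshold is a parameter, the same function covers the arbitrary-age case (18, 21, 25, and so on) that the first option cannot handle.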
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
{ $unwind: '$participants' },
{ $match: {'participants.age': {$gt: 18}}},
{ $project: {participants: 1}}
)
This will return a list of n documents where n is the number of participants > 18 where each entry looks like this (note that the "participants" array field now holds a single entry instead):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the official documentation for more information.