I have unique documents which are indexed by non-unique keys. What makes a document unique is the combination of multiple keys within it. For example:
{
    first: 'John',
    last: 'Foo'
}
{
    first: 'Henry',
    last: 'Bar'
}
{
    first: 'Frank',
    last: 'Foo'
}
{
    first: 'John',
    last: 'Bar'
}
So, based on the example above: If we wanted to query for the first name of Frank, we would only get one result. Ideally, since we only have one result, we wouldn't even need to compare the last name to our query. However, if we query for the name John, we would get two results, so we would need to compare the secondary argument.
How would this style of query be achieved in Mongo? The goal is simply to save needless compares if there is only a single match to begin with.
Note that I am aware this style of query doesn't guarantee the correct document. It assumes that the primary field match, and each subsequent field match, is "good enough" to verify the identity of the document if only one document matches. Though if there are other, less obvious reasons why this method should not be used, by all means discuss them :)
I wouldn't worry about this at all, especially if you have an index on this. A compound index on first, last will only scan the index entries whose "first" value matches. If that's one document, then it stops there. If it also needs to match "last", then it will scan only those parts of the index.
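As a minimal sketch of that behaviour (the collection name is just illustrative):

db.people.createIndex({ first: 1, last: 1 })
// Only the single "Frank" index entry is examined, so "last" is barely touched:
db.people.find({ first: "Frank", last: "Foo" })
// Only the "John" entries are scanned, then narrowed by "last":
db.people.find({ first: "John", last: "Bar" })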
Related
I am creating a Mongo DB collection which will contain tens of trillions of records. The shape of documents will be like so:
{
    "_id": ObjectId("AbCdhijk"),
    "val": "hello world"
},
{
    "_id": ObjectId("aBCDlmnop"),
    "val": "goodbye world"
}
I have two query requirements:
query all values where id begins with a prefix string
query all values where id begins with a prefix string, ignore case
For example: querying for AbC should give one document (the one with val=hello world), whereas querying for abc while ignoring case should return both documents. The queries should take as little time as possible, ideally with logarithmic performance (rather than needing to scan the whole collection) per query.
A soft requirement would be supporting an endsWith query as well.
What would be the ideal indices to add and queries to use?
Changing the shape of the documents to accomplish this is acceptable, even having multiple documents per value, potentially spread across multiple collections.
For example, I was considering making two inserts per value: one with the original ID and one with the ID transformed to all lowercase to aid the ignore-case lookup.
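A rough sketch of that idea (collection and field names — vals, id, idLower — are my own, purely illustrative): store the id once as-is and once lowercased, index both, and use anchored regexes so each prefix lookup can walk an index range instead of scanning the collection.

db.vals.createIndex({ id: 1 })
db.vals.createIndex({ idLower: 1 })
db.vals.insertOne({ id: "AbCdhijk", idLower: "abcdhijk", val: "hello world" })
db.vals.insertOne({ id: "aBCDlmnop", idLower: "abcdlmnop", val: "goodbye world" })

// Prefix, case-sensitive: only "hello world" matches.
db.vals.find({ id: /^AbC/ })

// Prefix, ignore case: lowercase the prefix app-side; both documents match.
db.vals.find({ idLower: /^abc/ })

// The soft "ends with" requirement could be handled the same way by also
// storing a reversed copy of the id and prefix-querying the reversed suffix.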
So I need to create a lookup collection in MongoDB to verify uniqueness. The requirement is to check whether the same two values are being repeated or not. In SQL, I would do something like this:
SELECT COUNT(id) FROM lookup WHERE key1 = 'value1' AND key2 = 'value2'
If the above query returns a count greater than zero, it means the combination is not unique. I have 2 solutions in mind but I am not sure which one is more scalable. There are 30M+ docs against which I need to create this mapping.
Solution1:
I create a collection of docs with compound index on key1 and key2
{
    _id: <MongoID>,
    key1: <value1>,
    key2: <value2>
}
Solution2:
I write application logic to create custom _id by concatenating value1 and value2
{
    _id: <value1>_<value2>
}
Personally, I feel the second one is more optimised, as it only needs a single index and the document size is also smaller. But I am not sure if it is good practice to create my own _id values, as they may not be completely random. What do you think?
Thanks in advance.
Update:
My database already has a lot of indexes which take up memory, so I want to keep index size as low as possible, especially for collections which are only used to verify uniqueness.
I would suggest Solution 1, i.e. use a compound index over the two separate properties key1 and key2:
db.yourCollection.ensureIndex( { "key1": 1, "key2": 1 }, { unique: true } )
You can also search by an individual field if required: a query on key1 alone can still use the compound index, since key1 is its prefix (key2 alone would need its own index). If you make _id a combination of the keys instead, it becomes hard to search by an individual value.
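For instance (collection name is illustrative), both of these queries can use the compound index above, since key1 is its leading field:

db.lookup.find({ key1: "value1", key2: "value2" })
db.lookup.find({ key1: "value1" })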
Document size is rarely the main concern when designing documents in Mongo.
If, in the near future, you ever need to change the key values of a document, that stays easy with separate fields (whereas an _id cannot be modified in place). Keep this in mind especially if other collections reference this document.
In terms of scalability, the default _id index stays roughly sequential, is easy to shard on, and you can let MongoDB manage it for you.
If you search by those keys, the query will use that index; otherwise it will use whatever other index your search requires.
If document size still matters to you more than search flexibility, you can go with the _id approach instead, but make _id an embedded document rather than a concatenated string:
{ _id: { key1: <value1>, key2: <value2> } }
This way you can also query on a specific _id.key1 (though note that such a sub-field query cannot use the _id index).
Update:
Yes, if document size is your main concern: as long as you are sure the key values of a document will not change later (or, if they might change, that no other collection references this document), you can go with the _id approach. Just build _id as an embedded object of keys rather than an underscore-joined string; that way you can also add more keys later if needed.
I think Solution 2 is more suitable for your requirement. It is absolutely OK to generate the _id value yourself; many applications populate the _id value with a UUID. In your case, it makes sense to concatenate value 1 and value 2 for the _id value, assuming this collection is primarily used for verifying uniqueness (i.e. a kind of temporary table) or lookup purposes.
Solution 1 is more expensive, as it requires an additional index. Again, it depends on whether you are going to use this collection for verifying uniqueness alone or for some other use case as well.
Please note that you need to create the compound index as unique, so that it does not allow inserting duplicate value pairs.
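To illustrate how Solution 2 behaves in practice (collection name and error handling are only a sketch), the uniqueness check becomes a plain insert; a duplicate key error on _id means the combination already exists:

try {
    db.lookup.insertOne({ _id: "value1_value2" })
    print("combination is new")
} catch (e) {
    if (e.code === 11000) {        // duplicate key error on _id
        print("combination already exists")
    } else {
        throw e
    }
}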
Background
I am storing table rows as MongoDb documents, with each column having a name. Let's say table has these columns of interest: Identifier, Person, Date, Count. The MongoDb document also has some extra fields separate from the table data, represented by timestamp. Columns are not fixed (which is why I use schema-free database to store them in the first place).
There will be a need to do various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say, 1000 new MongoDB documents per day). So the amount of data will steadily grow over time.
Example
The straightforward approach is having a collection of MongoDB documents like:
{
    _id: XXXX,
    insertDate: ISODate("2012-10-15T21:26:17Z"),
    flag: true,
    data: {
        Identifier: "AB002",
        Person: "John002",
        Date: ISODate("2013-11-16T21:26:17Z"),
        Count: 1
    }
}
Now, I have seen an alternative approach (for example in the accepted answer of this question), using an array with two fields per object:
{
    _id: XXXX,
    insertDate: ISODate("2012-10-15T21:26:17Z"),
    flag: true,
    data: [
        { field: "Identifier", value: "AB002" },
        { field: "Person", value: "John001" },
        { field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
        { field: "Count", value: 1 }
    ]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how do I choose which to use? Especially, are there some specific kinds of queries which are easy/cheap with one approach and hard/costly with the other? Any "rules of thumb" on which way to go, or pro/con lists for both? Real-life examples of one approach being inconvenient would be especially valuable.
In your specific example, the first version is a lot more appropriate and simpler. You have to think in terms of how you would query your documents.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like:
{
    _id: "AB002",
    insertDate: ISODate("2012-10-15T21:26:17Z"),
    flag: true,
    Person: "John002",
    Date: ISODate("2013-11-16T21:26:17Z"),
    Count: 1
}
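With that flatter shape, indexing and querying stay straightforward, for example (collection name is illustrative):

db.rows.createIndex({ Person: 1 })
db.rows.find({ Person: "John002" })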
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is a lot simpler (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name (see the sketch below)
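To illustrate that last point, a sketch might look like this (collection name is illustrative); the query has to pin field and value to the same array element with $elemMatch:

db.rows.createIndex({ "data.field": 1, "data.value": 1 })
db.rows.find({ data: { $elemMatch: { field: "Identifier", value: "AB002" } } })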
I don't think the situation in the other example and yours are the same. In the other example, they're creating a list of items with one of two answers, which is more appropriately an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object, since the fields all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.
I tried to update an existing document using two dot-notation parameters. My query:
{ _id: "4eda5...", "comments._id": "4eda6...", "comments.author": "john" }
my update was:
{ "comments.$.deleted": true }
However, weirdly enough, when I passed a non-existent combination of comment id+author, it just updated the first matching comment by that author.
Any ideas why that's happening?
EDIT: C# Code sample
var query = Query.And(Query.EQ("_id", itemId), Query.EQ("cmts._id", commentId));
if (!string.IsNullOrEmpty(author))
query = Query.And(query, Query.EQ("cmts.Author", author));
var update = Update.Set("cmts.$.deleted", true);
var result = myCol.Update(query, update, UpdateFlags.None, SafeMode.True);
You want $elemMatch if you want the _id and author to be in the same comment. Really, your query doesn't make much sense including the author, as the _id should be as unique as you can get, no?
The update is applied to the first matching array element, which is what the "$" stands in for.
This is working as designed. It behaves like an OR across array elements: the query matches a document where the _id matches in some array element and the author matches in some (possibly different) element.
The query is not working the way you are expecting it to. Basically, when using the $ positional notation you need to make sure that your query only has one clause that queries an array, otherwise it is ambiguous which of the two array comparisons the $ should refer to.
In your case, you are asking for a document where:
The _id is equal to some value
The comments array contains some document where the _id is equal to some value
The comments array contains some document where the author is equal to some value
Nothing in your query says that 2. and 3. need to be satisfied by the same embedded document (the same array element).
So even though you are using a non-existent combination of comment._id and comment.author, your comment array does have at least one entry where the _id is equal to your search value and some other entry (just not the same one) where the author is equal to your search value.
Since the author was the last one checked, that's what set the value of the $, and that's why that array element got updated.
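In shell syntax, a sketch of the $elemMatch version (the collection name is assumed; field names follow the C# sample above) would be: both conditions must now be satisfied by the same comments element, and that element is what $ refers to in the update.

db.items.updateOne(
    { _id: "4eda5...", cmts: { $elemMatch: { _id: "4eda6...", Author: "john" } } },
    { $set: { "cmts.$.deleted": true } }
)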
My question may not be very well formulated because I haven't worked with MongoDB yet, so I just want to know one thing.
I have an object (record/document/whatever you call it) in my database, in global scope, and it contains a really huge array of other objects.
So, what about the speed of searching in global scope vs. searching "inside" an object? Is it possible to index all the "inner" records?
Thanks beforehand.
So, like this
users: {
    ..
    user_maria: {
        age: "18",
        best_comments: {
            goodnight: "23rr",
            sleeptired: "dsf3"
            ..
        }
    },
    user_ben: {
        age: "18",
        best_comments: {
            one: "23rr",
            two: "dsf3"
            ..
        }
    }
}
So, how can I make it fast to find user_maria -> best_comments -> goodnight (i.e. index the contents of the "best_comments" collections)?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing :
db.users {
    name: "maria",
    age: 18,
    best_comments: [
        {
            title: "goodnight",
            comment: "23rr"
        },
        {
            title: "sleeptired",
            comment: "dsf3"
        }
    ]
}
With that schema in mind, you can put an index on name and best_comments.title, for example like so:
db.users.ensureIndex({name: 1, 'best_comments.title': 1})
Then, when you want the query you mentioned, simply do
db.users.find({name:"maria", 'best_comments.title':"first"})
And the database will hit the index and will return this document very fast.
Now, all that said, your schema is still questionable. You mention you want to query specific comments, but that requires either keeping comments in a separate collection or filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem: documents have a 16 MB limit, and if documents keep increasing in size, Mongo will have to continuously move them on disk.
My advice:
Put comments in a separate collection
Either do one document per comment or make comment bucket documents, say 100 comments per document (see the sketch after this list)
Read up on Mongo/NoSQL schema design. You always query for root documents, so if you only ever need a small part of a large embedded structure, you should re-examine your schema, or you'll be pumping huge documents over the connection and doing app-side filtering.
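A rough sketch of the bucket idea (collection and field names are purely illustrative):

db.comment_buckets.insertOne({
    user: "maria",
    bucket: 0,                  // incremented when a bucket fills up
    count: 2,
    comments: [
        { title: "goodnight", comment: "23rr" },
        { title: "sleeptired", comment: "dsf3" }
    ]
})
db.comment_buckets.createIndex({ user: 1, "comments.title": 1 })

// Fetch only the bucket(s) containing a given comment:
db.comment_buckets.find({ user: "maria", "comments.title": "goodnight" })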
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
    Collection 'table1':
        records: { a: 1, b: 1, c: 1 }
                 { a: 1, b: 2, d: 1 }
                 { a: 1, c: 1, d: 1 }
        indices:
            db.table1.ensureIndex({ a: 1, d: 1 }) // this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well, first of all, in your table here you are assigning multiple values to the attributes "name" and "value". MongoDB will ignore/overwrite the earlier instances of a repeated key, so only the final ones will end up in the document.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query