MongoDB Map Reduce: Auto-created index name too long, possible to customize? - mongodb

Debugging MongoDB MapReduce is painful, so I'm not 100% sure I understand what's going on here, but I think I get the general idea...
The error message I'm getting is this: mr failed, removing collectionCannotCreateIndex: namespace name generated from index name "my_dbname.tmp.mr.collectionname_69.$_id.aggregation_method_1__id.date_key.start_1__id.date_key.timeres_1__id.region.center_2dsphere" is too long (127 byte max)
The key I'm using for MapReduce is a complex object with four or five properties, so I'm guessing what's happening is this: when Mongo creates its temporary output collections using my specified key, it tries to auto-create an index on that complex key; but since the key itself has several properties, the default name for the index is too long. When I index complex objects like this under "normal" circumstances, I just give the index a custom name. But I don't see a way to do that for the collections MapReduce generates automatically.
Is there a simple way to fix this without changing my key structure?

Well, it turns out I was tricked by the error message! The <collectionname> in the error message referenced above is the name of the INPUT collection whose records I'm processing with MapReduce... but the index it's referring to is an index that was part of the OUTPUT collection! So I just had to give the index on the output collection a name, and voilà, problem solved. What weird behavior.
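For anyone hitting the same thing, here is a minimal sketch of the fix using the MongoDB Java driver. The output collection name is a hypothetical stand-in (the key fields are taken from the error message above); the point is simply the explicit, short index name:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class NamedOutputIndex {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create().getDatabase("my_dbname");
        MongoCollection<Document> out = db.getCollection("my_output_collection");

        // An explicit short name keeps the generated namespace
        // (db name + collection + index name) under the 127-byte limit.
        out.createIndex(
                Indexes.compoundIndex(
                        Indexes.ascending("_id.aggregation_method"),
                        Indexes.ascending("_id.date_key.start"),
                        Indexes.ascending("_id.date_key.timeres"),
                        Indexes.geo2dsphere("_id.region.center")),
                new IndexOptions().name("agg_key_idx"));
    }
}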

Related

@ExistsQuery in Spring Data MongoDB

Hello, I would like to do an exists query in a Spring Mongo repository. I read about @ExistsQuery, but I don't know how to write the query inside it. My method now:
@ExistsQuery("{ 'userAccount.socialTokenId': ?1 }")
boolean existBySocialAccountId(String socialAccountId);
But I'm getting an IndexOutOfBoundsException. 'userAccount' is a List of objects which contain the variable socialTokenId. I know that I can just get the whole User object and search it myself, but I would like to optimize my queries :).
I believe your problem is that the parameters are zero-indexed, so there is no parameter with index 1, which is causing the IndexOutOfBoundsException.
Try changing your code to the following:
@ExistsQuery("{ 'userAccount.socialTokenId': ?0 }")
boolean existBySocialAccountId(String socialAccountId);
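For context, a minimal sketch of how the corrected method might sit in a full repository interface, assuming a mapped User document class (the repository name is made up):

import org.springframework.data.mongodb.repository.ExistsQuery;
import org.springframework.data.mongodb.repository.MongoRepository;

// Hypothetical repository; User is assumed to be a mapped document class.
public interface UserRepository extends MongoRepository<User, String> {

    // ?0 binds the first (and only) method argument into the query.
    @ExistsQuery("{ 'userAccount.socialTokenId': ?0 }")
    boolean existBySocialAccountId(String socialAccountId);
}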

MongoDB empty string value vs null value

In MongoDB production, if the value of a key is empty or not provided (optional), should I use an empty string or null as the value?
1) Are there any pros and cons to using an empty string vs null?
2) Are there any pros and cons to setting a value to undefined (to remove the property from an existing doc) vs leaving the property's value as either an empty string or null?
Thanks
I think the best way is undefined, i.e. I would suggest not including this key altogether. Mongo doesn't work like SQL, where you have to have at least null in every column. If you don't have a value, simply don't include the key. Then a query for all documents where this key doesn't exist will work correctly; otherwise it won't. Also, if you don't include the key, you save a little bit of disk space. So this is the correct way in Mongo.
// Mongoose setter: return undefined for null/missing input so the key
// is omitted from the stored document entirely.
function deleteEmpty(v) {
  if (v == null) {
    return undefined;
  }
  return v;
}

var UserSchema = new Schema({
  email: { type: String, set: deleteEmpty }
});
I would say that null indicates the absence of a value, while an empty string indicates that the value is there, but it's empty.
While reading the data you can distinguish between blank values and non-existing values.
Still, it depends upon your use case.
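To make the distinction concrete, a small sketch using the MongoDB Java driver (the collection and field names are made up):

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.exists;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MissingVsEmpty {
    public static void main(String[] args) {
        MongoCollection<Document> users = MongoClients.create()
                .getDatabase("test").getCollection("users");

        // Documents where the email key is absent entirely:
        users.find(exists("email", false));

        // Documents where email is null (note: in MongoDB a null
        // equality match also matches documents missing the key):
        users.find(eq("email", null));

        // Documents where email is present but an empty string:
        users.find(eq("email", ""));
    }
}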
This question has been answered at least 4 times by me and a Google search will get you a lot of information.
You must take into consideration what removing the key means. If your document will eventually use that schema in most of its defined state within the application, then you could see a lot of movement of the document as the missing keys get filled in, which negates the benefit of not having these keys: space. The couple of bytes you save will be rendered useless and you will get a swiss-cheese effect.
However, if you do not use these fields at all, then those few extra bytes across millions of documents in your working set could cause real problems that need not be there (if you for some reason want to shove that many documents into your working set). As for the space issue: MongoDB fundamentally has a space issue anyway, and I have not really known omitting a couple of keys to do anything to help it.

Get duplicate records in a large file using MapReduce

I have a large file containing > 10 million lines. I want to find the duplicate lines using MapReduce.
How can I solve this problem?
Thanks for the help
You need to make use of the fact that the default behaviour of MapReduce is to group values based on a common key.
So the basic steps required are:
Read each line of your file into your mapper, probably using something like TextInputFormat.
Set the output key (a Text object) to the value of each line. The contents of the output value don't really matter; you can just use a NullWritable if you want.
In the reducer, check the number of values grouped for each key. If you have more than one value, you know you have a duplicate.
If you just want the duplicate values, write out the keys that have multiple values (see the sketch below).
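A minimal sketch of those steps with the Hadoop MapReduce API (the class names are made up, and the job/driver setup is omitted):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emit each line as the key; the framework groups identical lines together.
public class DuplicateLineMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(line, NullWritable.get());
    }
}

// A key with more than one grouped value is a duplicated line.
class DuplicateLineReducer
        extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text line, Iterable<NullWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int count = 0;
        for (NullWritable ignored : values) {
            if (++count > 1) {
                ctx.write(line, NullWritable.get());
                return;
            }
        }
    }
}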

Rogue query orderAsc with variable field according to its name

I am using Rogue/Lift Mongo Record to query MongoDB. I am trying to build different queries according to the sort field name, so I have the string name of the field that I want to use to sort the results.
I have tried to use Record.fieldByName in orderAsc:
...query.orderAsc(elem => elem.fieldByName(columnName).open_!)
but I obtain "no type parameter for orderAsc".
How can I make it work? Honestly, all the type-level programming in Rogue is quite difficult to follow.
Thanks
The problem is that you cannot easily generate a query dynamically with Rogue. As a solution I used Lift Mongo DB directly, which allows the use of strings (without compile-time checking) for the kinds of operations that require dynamic sorting.

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is what matters here: if you set CompareWith to UTF8Type, the keys will be interpreted as UTF-8 strings; if you set CompareWith to TimeUUIDType, the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page, which is a good place to start: http://wiki.apache.org/cassandra/API Also, you might find this article useful: http://www.sodeso.nl/?p=80 (in the third part or so he talks about slice-ranging his queries).
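For illustration, this is the kind of storage-conf.xml declaration being described; the column family name here is made up:

<ColumnFamily Name="Events" CompareWith="TimeUUIDType"/>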
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Conceptually, the cluster is a circular hash ring, and keys get hashed by the partitioner to be distributed around the cluster.
Beware of using dates as row keys, however: even the randomization of the default RandomPartitioner is limited, and you could end up clustering your data.
What's more, if that date changes, you would have to delete the previous row, since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row, with a start value and an end value. This is used mostly for wide rows, as columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a WHERE clause uses secondary indexes; you may use inequality operators there, but there must be at least ONE equals clause in your statement (see also https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a RandomPartitioner, as the MD5 hash of your key doesn't preserve lexical ordering (see the sketch below).
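To make that concrete, a small self-contained sketch showing that lexically ordered keys hash to scattered MD5 tokens (the key names are made up):

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5OrderDemo {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Sequential keys land at scattered positions on the token ring,
        // which is why a lexical key range cannot be served efficiently.
        for (String key : new String[] {"user-001", "user-002", "user-003"}) {
            BigInteger token = new BigInteger(1, md5.digest(key.getBytes()));
            System.out.println(key + " -> " + token.toString(16));
        }
    }
}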
What you want to use is a Column Family-based index using a wide row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key (a "shard key") that will split the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem; it's how it is designed. So what you must ask yourself is "what do I need to query?" and then design a Column Family for it, rather than trying to fit everything into one CF as you would in an RDBMS.
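As a rough illustration of that composite layout, here is a sketch of how a (TimeUUID | UserID) column name could be packed by hand, mimicking CompositeType's length-prefixed encoding. The class and method names are made up, and in practice your client library does this for you:

import java.nio.ByteBuffer;
import java.util.UUID;

public class CompositeColumnName {
    // Pack (timeUuid, userId) the way CompositeType does: each component
    // is a 2-byte length, the raw bytes, then an end-of-component byte.
    static ByteBuffer pack(UUID timeUuid, String userId) {
        byte[] uuidBytes = ByteBuffer.allocate(16)
                .putLong(timeUuid.getMostSignificantBits())
                .putLong(timeUuid.getLeastSignificantBits())
                .array();
        byte[] userBytes = userId.getBytes();
        ByteBuffer buf = ByteBuffer.allocate(
                2 + uuidBytes.length + 1 + 2 + userBytes.length + 1);
        buf.putShort((short) uuidBytes.length).put(uuidBytes).put((byte) 0);
        buf.putShort((short) userBytes.length).put(userBytes).put((byte) 0);
        buf.flip();
        return buf;
    }
}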