My Mongo schema is as follows:
KEY: Client ID
Value: { Location1: Bitwise1, Location2: Bitwise2, ...}
So the column names are the names of locations. This data represents the locations a client has visited, and each bitwise value captures the days on which the client was present at that location.
I'd like to run a map-reduce query on the above schema. In it, I'd like to iterate over all the columns of the Value for a row. How can that be done? Could someone give a small code snippet that explains it clearly? I'm having a hard time finding one on the web.
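One possible shape for such a map function is sketched below. It assumes each document looks like { _id: clientId, value: { LocationName: bitmask } }, that each bitmask fits in 32 bits, and that the goal is the total number of days per location; none of this is stated in the question. The emit stub, countDays helper, and sample data are hypothetical stand-ins so the sketch runs without a server.

```javascript
// Count the set bits of a bitmask (assumed to fit in 32 bits), i.e. days present.
function countDays(mask) {
  var days = 0;
  while (mask !== 0) {
    days += mask & 1;
    mask = mask >>> 1;
  }
  return days;
}

// Map: called once per document with `this` bound to the document.
// A for...in loop visits every field (location) of the value sub-document.
function mapFn() {
  for (var location in this.value) {
    if (Object.prototype.hasOwnProperty.call(this.value, location)) {
      emit(location, countDays(this.value[location]));
    }
  }
}

// Reduce: sums the day counts emitted for one location.
function reduceFn(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Local stand-in for the server: collect emitted values per key.
var emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Hypothetical sample documents.
var docs = [
  { _id: 'client1', value: { Paris: 5, Berlin: 1 } }, // 5 = binary 101, 2 days
  { _id: 'client2', value: { Paris: 7 } }             // 7 = binary 111, 3 days
];
docs.forEach(function (doc) { mapFn.call(doc); });

var results = {};
Object.keys(emitted).forEach(function (key) {
  results[key] = reduceFn(key, emitted[key]);
});
console.log(results);
```

On a real server, mapFn and reduceFn would be passed to mapReduce, and a helper such as countDays would have to be inlined into the map function or otherwise made available to it.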
I have an array of MongoDB Object IDs which I want to send to my nodejs server in order to get the objects' data.
The problem is that this array contains hundreds of IDs, and I am not sure how I should send it and run the query: concatenating it to the query string will not do, as a URL can have at most 2048 characters.
You can do the following:
If the records in the DB are ordered by _id:
Send the array of IDs in the body of a POST request, or send only the min and max IDs as query parameters in a GET request.
Instead of querying for each individual ObjectId or using {$in : [ObjectId( )... ObjectId.....]}, do as described in the next point.
Using the min and max IDs from the array, query the DB like this:
User.find({"_id": {$lte: idmax, $gte: idmin}})
If the records are not ordered by _id:
User.find({"_id": {$in: [_id1, _id2.....]}})
PS: Learn more about range queries.
The relationship between the order of ObjectId values and generation
time is not strict within a single second. If multiple systems, or
multiple processes or threads on a single system generate values,
within a single second; ObjectId values do not represent a strict
insertion order. Clock skew between clients can also result in
non-strict ordering even for values, because client drivers generate
ObjectId values, not the mongod process. So if this is the case, you will
have to query using the $in operator: {$in : [ObjectId( ), ObjectId.....]}
Long story short: if the IDs are not stored in order in the DB, you will have to use
the $in operator, because otherwise the range query will fetch
extra records.
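The difference between the two query shapes can be illustrated with a small in-memory sketch. Short strings stand in for ObjectIds here, and the find helper is a hypothetical matcher that understands only these two filter shapes; it is not Mongoose's User.find.

```javascript
// Hypothetical stand-in collection; the wanted IDs are not contiguous.
var collection = ['a1', 'a2', 'a3', 'a4', 'a5'].map(function (id) {
  return { _id: id };
});
var wanted = ['a1', 'a3', 'a5'];

// Filter that matches exactly the listed IDs.
function inFilter(ids) {
  return { _id: { $in: ids } };
}

// Filter that matches everything between the smallest and largest ID.
function rangeFilter(ids) {
  var sorted = ids.slice().sort();
  return { _id: { $gte: sorted[0], $lte: sorted[sorted.length - 1] } };
}

// Minimal in-memory matcher for the two filter shapes above.
function find(docs, filter) {
  var cond = filter._id;
  return docs.filter(function (doc) {
    if (cond.$in) {
      return cond.$in.indexOf(doc._id) !== -1;
    }
    return doc._id >= cond.$gte && doc._id <= cond.$lte;
  });
}

function idsOf(docs) {
  return docs.map(function (doc) { return doc._id; });
}

var byIn = idsOf(find(collection, inFilter(wanted)));
var byRange = idsOf(find(collection, rangeFilter(wanted)));
console.log(byIn);    // only the three wanted IDs
console.log(byRange); // all five IDs, including the two that were not asked for
```

The same filter objects could be built on the server from an array of IDs received in a POST body, which avoids the URL length limit raised in the question.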
This is what the Cloud Datastore doc says, but I'm having a hard time understanding what exactly it means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say I have a table of questions and I only want to get the question text, sorted by the created date. Would this be counted as a single read, and the rest as small operations?
If your goal is to just project the date and text fields, you can create a composite index on those two fields. When you query, this is a small operation with all the results as a single read. You are not trying to de-duplicate (so no distinct/on) in this case and so it is a small operation with a single read.
I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal amount of queries), and I need this list to be random every time. I need this to be efficient -- relatively on par with such a query with MySQL. I want to reproduce the following but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method of adding a field to each document, generating a random value for each field, and using {rand: {$gte:rand()}} each query we want randomized. But, my concern is that two queries could theoretically return the same set.
You can do it with two requests, in an efficient way:
Your first request just gets the list of the "_id" values of all documents in your collection. Be sure to use a Mongo projection: db.products.find({}, { '_id' : 1 }).
Now that you have the list of "_id" values, just pick N of them at random.
Do a second query using the $in operator.
What is especially important is that your first query is fully supported by an index (because it's "_id"). This index is likely fully in memory (else you'd probably have performance problems). So, only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading actual documents, the index will help a lot.
If you can do things this way, you should try.
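The middle step (picking N IDs at random from the list) can be sketched as follows. The pickRandom helper and the generated ID list are hypothetical; in practice the list would come from the first query, and the resulting filter would be passed to the second one.

```javascript
// Pick n distinct elements with a partial Fisher-Yates shuffle on a copy.
// `random` is injectable for testing and defaults to Math.random.
function pickRandom(ids, n, random) {
  random = random || Math.random;
  var pool = ids.slice();
  var count = Math.min(n, pool.length);
  for (var i = 0; i < count; i++) {
    // Swap position i with a random position in the not-yet-chosen tail.
    var j = i + Math.floor(random() * (pool.length - i));
    var tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, count);
}

// Hypothetical stand-in for the result of the first (projection) query.
var allIds = [];
for (var k = 0; k < 1000; k++) {
  allIds.push('id' + k);
}

var sample = pickRandom(allIds, 50);

// Filter for the second query, which fetches the actual documents.
var secondQueryFilter = { _id: { $in: sample } };
console.log(sample.length);
```

Only the first N positions are shuffled, so the cost of the selection grows with N rather than with the size of the ID list (apart from the initial copy).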
I don't think MySQL ORDER BY rand() is particularly efficient - as I understand it, it essentially assigns a random number to each row, then sorts the table on this random number column and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document will be assigned a unique positive integer, sequentially. It doesn't matter what document gets what number, as long as the assignment is unique and the numbers are sequential, and you either don't delete documents or you complicate the counter document scheme to handle holes.
You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment the counter value atomically. Then insert the new document with the counter value.
To find N random values, find the max counter value, then generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages should have random libraries that will handle generating the N random integers in a range.
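The last step could look roughly like this. It is a sketch that assumes counters start at 1 and have no holes; the field name counter, the distinctRandomInts helper, and the maxCounter value are hypothetical.

```javascript
// Draw n distinct integers from 1..max by retrying on duplicates.
// `random` is injectable for testing and defaults to Math.random.
function distinctRandomInts(n, max, random) {
  random = random || Math.random;
  if (n > max) {
    throw new RangeError('cannot draw more distinct values than the range holds');
  }
  var chosen = new Set();
  while (chosen.size < n) {
    chosen.add(1 + Math.floor(random() * max));
  }
  return Array.from(chosen);
}

// Hypothetical value; in practice this is read from the counter collection.
var maxCounter = 100000;

var counters = distinctRandomInts(50, maxCounter);

// Filter for the query that retrieves the randomly chosen documents.
var filter = { counter: { $in: counters } };
console.log(counters.length);
```

Retrying on duplicates is cheap when N is small compared with the range; if N approaches the max counter, a shuffle of the whole range would be the safer choice.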
I want to compare two very big collections. The main goal of the operation is to know which elements have changed or been deleted.
Collections 1 and 2 have the same structure and hold more than 3 million records.
Example:
record 1 {id:'7865456465465',name:'tototo', info:'tototo'}
So I want to know which elements have changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exact same values given their ids are unique. Note that mongo does not guarantee field order, and if you update a field it might move to the end of the document which is fine.
2) I would use some framework that can connect to mongo and fetch data at the same time converting it to a map-like data structure or even JSON. For instance I would go with Scala + Lift record (db.coll.findAll()) + Lift JSON. Lift JSON library has Diff function that will give you a diff of 2 JSON docs.
3) Finally I would sort both collections by ids, open db cursors, iterate and compare.
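Steps 1 and 3 can be sketched without any framework. The example below uses plain arrays already sorted by id as stand-ins for the two cursors, and assumes flat documents with string ids as in the question; sameFields and diffSorted are hypothetical helper names.

```javascript
// Step 1: two flat documents are equal when they have the same fields
// with the same values, regardless of field order.
function sameFields(a, b) {
  var keysA = Object.keys(a);
  var keysB = Object.keys(b);
  if (keysA.length !== keysB.length) {
    return false;
  }
  return keysA.every(function (key) {
    return Object.prototype.hasOwnProperty.call(b, key) && a[key] === b[key];
  });
}

// Step 3: walk both id-sorted sequences in lockstep, like a merge.
function diffSorted(first, second) {
  var changed = [];
  var missing = [];
  var i = 0;
  var j = 0;
  while (i < first.length) {
    if (j >= second.length || first[i].id < second[j].id) {
      missing.push(first[i].id); // present in collection 1 only
      i++;
    } else if (first[i].id > second[j].id) {
      j++; // present in collection 2 only, not reported here
    } else {
      if (!sameFields(first[i], second[j])) {
        changed.push(first[i].id);
      }
      i++;
      j++;
    }
  }
  return { changed: changed, missing: missing };
}

// Hypothetical sample data, sorted by id.
var collection1 = [
  { id: '1', name: 'a', info: 'x' },
  { id: '2', name: 'b', info: 'y' },
  { id: '3', name: 'c', info: 'z' }
];
var collection2 = [
  { id: '1', info: 'x', name: 'a' },       // same fields, different order
  { id: '3', name: 'c', info: 'CHANGED' },
  { id: '4', name: 'd', info: 'w' }
];

var diff = diffSorted(collection1, collection2);
console.log(diff);
```

Each collection is read once and only one document per side is held at a time, which matters at 3 million records. The comparison used in the loop has to match the sort order the database applies to the ids.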
If the schema is flat, as it is in your case, you can use a free tool (dataq.io) to compare the data in two tables.
Disclaimer: I am the founder of this product.
I am new to MongoDB and am planning to use map-reduce for computing over a large amount of data.
As you know, we have the map function to match the criteria and then emit the required data for a given field. In my map function I have multiple emits. As of now I have 50 fields emitted from a single document. That means a single document in a collection explodes to 40 documents in the temp table. So if I have 1 million documents to be processed, there will be 1 million * 40 documents in the temp table by the end of the map function.
The next step is to sort this collection. (I haven't used the sort param of map; will it help?)
I thought of splitting the map function into two, but then there is one more problem: if I run into an exception while performing the map function, I want to skip the entire document (i.e. not emit any data from that document), and if I split the function I won't be able to.
In mongoDB.org i found a comment which said..."When I run MR job, with sort - it takes 1.5 days to reach 23% at first stage of MR. When I run MR job, without sort, it takes about 24-36 hours for all job.Also when turn off jsMode is speed up my MR twice ( before i turn off sorting )"
Will enabling sort help? Or will turning off jsMode help? I am using Mongo 2.0.5.
Any suggestion?
Thanks in advance .G
The next step is to sort this collection. (I haven't used the sort param of map; will it help?)
I don't know what you mean. MRs don't have sort params; only the incoming query has a sort param, and it only sorts the data going in. Unless you are looking for some specific behaviour that uses an incoming sort to avoid sorting the final output, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in real time, or else you would just kill your servers, so I'm guessing it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs, so that you do delta updates throughout the day to limit the amount of resources used at any given time.
So if I have 1 million documents to be processed, there will be 1 million * 40 documents in the temp table by the end of the map function.
Are you emitting multiple times? If not, then the temp table should have only one key per row, with documents of the format:
{
_id: emitted_id
[{ //each of your docs that you emit }]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
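The grouping described above can be pictured with a small local simulation. This is a simplified model for illustration, not MongoDB's actual temp storage; the emit stub, the key format, and the sample documents are hypothetical.

```javascript
// Hypothetical input documents.
var docs = [
  { _id: 1, city: 'Paris', plan: 'free' },
  { _id: 2, city: 'Paris', plan: 'paid' }
];

var emitCalls = 0;
var grouped = {};

// Local stand-in for emit: values with the same key end up in one entry.
function emit(key, value) {
  emitCalls++;
  (grouped[key] = grouped[key] || []).push(value);
}

// A map function that emits more than once per document.
function mapFn() {
  emit('city:' + this.city, 1);
  emit('plan:' + this.plan, 1);
}

docs.forEach(function (doc) { mapFn.call(doc); });

var keys = Object.keys(grouped);
console.log(emitCalls);   // number of emit calls
console.log(keys.length); // number of distinct keys
console.log(grouped);
```

In this model the number of entries depends on how many distinct keys are emitted, not on the number of emit calls.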
Or will turning off jsMode help? I am using Mongo 2.0.5.
Turning off jsmode is unlikely to do anything significant and results from it have varied.