I took a shortcut earlier and built the primary key of my MongoDB collection by concatenating various fields to create a "unique id".
I would now like to change it to actually use the ObjectId. What's the best approach to do it? I have a little over 3M documents and would like this to be as least disruptive as possible.
A simple approach would be to bring the site down for a bit and copy every document into a new collection that uses ObjectIds, but I'd like to keep the application running if I can. I imagine one way would be to write to both collections while the migration happens, but that would require maintaining two similar code bases, so I wonder if there's a way to avoid all that.
To provide some additional information:
It's just one collection that's not referenced by any others. I have another MySQL database that contains some values that are used to create the queries that read from this MongoDB collection.
I'm using the PyMongo/MongoEngine libraries to interact with MongoDB from Python, and I don't know if it's possible to just change the primary key of a collection.
You shouldn't bring your site down for any reason if it does not go down itself. :)
No matter how many millions of records you have, the right solution depends on how you use your ids.
If you cross-reference documents in different collections using these ids, then for every updated object you will also have to update every other object that references it.
As a first step, update your system to stop creating new objects the old way. If your system lets you do this easily, then you can update your database very easily. If this change is not easy to make, your system has some architectural problems and you should fix those first. If that is your situation, please update your question so I can update my answer.
Since I don't know anything about your applications and data, what I say will be too general. Let's call the collection you want to update coll_bad_id. Every item in this collection is referenced in other collections like coll_poor_guy and coll_wisdom_searcher. How I would do this is to run over coll_bad_id one item at a time like this:
1. read one item
2. replace its _id with a new-style _id
3. insert the item back into the collection
-- now we have two copies of the same item: one with the old-style id, one with the new
4. update every item that references this one to use the new-style id
5. remove the duplicate with the old-style id from the collection
One thing you should keep in mind is that BSON ObjectIds embed a date/time component that can be very useful. Since you are rebuilding all of these objects on a single day, their ObjectIds will not reflect the correct creation times; for newly added items, they will. You can note the first newly added item as the milestone for items whose ids carry correct creation times.
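As an aside, the shell can read that embedded timestamp back out of any ObjectId, which is handy for spot-checking (testcoll as in the sample below; $type 7 matches ObjectId values):

    // getTimestamp() extracts the creation time embedded in an ObjectId.
    print(ObjectId().getTimestamp());                                     // roughly "now"
    print(db.testcoll.findOne({ _id: { $type: 7 } })._id.getTimestamp()); // a migrated document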
UPDATE: Here is a code sample to run in the Mongo shell.
This is not the most efficient way to do this, but it is safe to run, since we never remove anything before adding it back with a new _id. A better approach is to work in small batches by adding query criteria to the find() call.
var cursor = db.testcoll.find();
cursor.forEach(function(item) {
    var oldid = item._id;               // save the old _id so we can remove the duplicate below
    delete item._id;                    // without an _id, Mongo generates a fresh ObjectId on insert
    db.testcoll.insert(item);           // add the item back with its new ObjectId
    db.testcoll.remove({ _id: oldid }); // delete the copy that has the bad _id
});
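For example, a batched variant could look like this; it assumes the old concatenated ids are plain strings (BSON type 2), so each pass only picks up documents that haven't been migrated yet, and you rerun it until the query matches nothing:

    // Migrate up to 1000 documents per pass; old ids are strings, new ones are ObjectIds.
    db.testcoll.find({ _id: { $type: 2 } }).limit(1000).forEach(function(item) {
        var oldid = item._id;
        delete item._id;                    // Mongo assigns a fresh ObjectId on insert
        db.testcoll.insert(item);
        db.testcoll.remove({ _id: oldid });
    });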
Related
I'm currently learning a lot about the MEAN stack and obviously MongoDB. I want to set my database up so that nothing is ever 'removed', things are only marked as deleted or moved somewhere else, like an archived collection/database. What's the industry standard way of doing this?
The way I see it, I have two options, both of which raise more questions:
Marking documents as deleted with a deleted key.
Would I store this as a timestamp with an accompanying array of timestamps? The array is needed because I also want to build a 'restore' functionality, which means a document can be deleted more than once, and I want to track each deletion. This will also mean updating a lot of my queries to ignore that key.
Move the documents to another collection or database.
This would require the most work, as I'd need to handle any other functionality that references that document. For example, when deleting a user from a cinema database, would I have to archive their previous bookings as well, or just update queries to also search the archive?
I couldn't find any useful resources on this, but if you know of any, please point me in that direction :) Thanks.
Thanks Hector! His answer:
"Actually, there is no "standard" way to do this; each company does it its own way. For your first option, you don't need to store a timestamp array, just a flag indicating that the document is "deleted". Then, in another collection, you can store the events. For instance: {event: "deleted", date: "03/08/2017 08:00:00", documentId: "7726"}. An event store is the way to go."
How do I do this with the Mongo aggregation framework?
Given these records:
record 1. {id: 1, action: 'clicked',    user: 'id', time: '1'}
record 2. {id: 2, action: 'video play', user: 'id', time: '2'}
record 3. {id: 3, action: 'page load',  user: 'id', time: '3'}
record 4. {id: 4, action: 'video play', user: 'id', time: '4'}
record 5. {id: 1, action: 'clicked',    user: 'id', time: '5'}
record 6. {id: 2, action: 'video play', user: 'id', time: '6'}
Now, how do I get all the "video play" actions that occur after a "clicked" action? Has anybody come across this kind of aggregation?
You will need to redesign your schema. I can think of two approaches. In your application you can track the click path of a session: when you insert an action into your collection, also record the previous interaction. Once you have this, you just need to do something like db.actions.find({prevAction: "clicked", action: "video play"}).count(). This will be very fast.
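As a rough sketch, the insert and the count could look like this (all field names are assumptions):

    // Each inserted action carries the action that immediately preceded it in the session.
    db.actions.insert({
        user: userId,
        action: "video play",
        prevAction: "clicked",
        time: new Date()
    });

    // Counting "video play" actions that directly followed a click is then a single query.
    db.actions.find({ prevAction: "clicked", action: "video play" }).count();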
Alternatively, if you decide you would like to track session click-path information, you may have a document like:

{
    _id: sessionId,
    user: userId,
    actions: [
        {...login},
        {...click link},
        {...play video}
    ]
}
You can build this collection with upserts. Make sure you keep the action subdocuments small so you don't exceed the 16MB document size limit. Also set the collection to use power-of-2 record allocation (the usePowerOf2Sizes collMod flag).
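For instance, a minimal upsert sketch (the sessions collection name is an assumption):

    // Append each action to its session document, creating the document on first write.
    db.sessions.update(
        { _id: sessionId, user: userId },
        { $push: { actions: { action: "video play", time: new Date() } } },
        { upsert: true }
    );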
Once you have this collection, you can pull these documents apart to get some interesting information. The specific aggregation you want would be more complex on this collection than with my previous suggestion, though. You would need a map/reduce job, possibly run periodically in the background, that emits a key/value pair only when a session's actions array contains the expected sequence of actions.
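A rough sketch of such a map/reduce job over the session documents above, with the "expected sequence" simplified to a click immediately followed by a video play:

    var map = function() {
        // Emit once per session whose actions contain a click immediately followed by a video play.
        for (var i = 0; i < this.actions.length - 1; i++) {
            if (this.actions[i].action === "clicked" &&
                this.actions[i + 1].action === "video play") {
                emit("click_then_play", 1);
                return;
            }
        }
    };
    var reduce = function(key, values) {
        return Array.sum(values);
    };
    db.sessions.mapReduce(map, reduce, { out: "action_sequences" });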
I have a use case where I need to get a list of objects from Mongo based on a query, and to improve performance I am adding pagination.
For the first call I get a list of, say, 10 objects; for the next call I need 10 more. But I cannot use offset and pageSize directly, because the first 10 objects displayed on the page may have been modified [deleted] in the meantime.
My solution is to take the ObjectId of the last object returned and retrieve the next 10 objects after that ObjectId.
How can I do this efficiently using Morphia?
Using Morphia you can do this with a query like the following:
datastore.find(YourClass.class).field("_id").smallerThan(lastId).limit(10).order("-ts");
Since you query for the items that come after the last retrieved id, you won't have to worry about deleted items.
One thing I have thought of is that you will have the same problem here as with skip() unless you intend to change how your interface works.
Using ranged queries like this demands a different kind of interface, since it is much harder to detect exactly what page you are on and how many pages lie ahead, especially if you are doing this to avoid the problems of conventional paging.
The type of interface that usually arises from this kind of paging is the infinitely scrolling page; think of YouTube video comments, the Facebook wall feed, or Google+. There is no physical pagination or "pages"; instead you have a "get more" button.
This is the type of interface you will need to use to get ranged paging working better.
As for the query, @cubbuk gives a good example:
datastore.find(YourClass.class).field("_id").smallerThan(lastId).limit(10).order("-ts");
Except that it should be greaterThan(lastId), since you want to find everything above that last _id. I would also sort by _id, unless you create your ObjectIds some time before you insert the record; in that case you can use a specific timestamp set at insert time instead.
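In shell terms, the corrected range query amounts to the following (the items collection name and lastId are placeholders):

    // Fetch the next page: everything after the last _id seen, in _id order.
    db.items.find({ _id: { $gt: lastId } }).sort({ _id: 1 }).limit(10);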
I have a Student collection and a Person collection.
Person contains the fields: name, address, etc.
Student contains: rollno, plus a person field that stores the person._id for this student.
Now I want to show the student's name in the student template, but note that there's no name field in Student; I'll need to get it from that student's Person document.
Is there a way to get a MongoDB cursor on the client that has the student information as well as selected fields from that student's Person document?
Also, is there a better or more standard way of achieving what I'm trying to achieve?
Note: I don't want to introduce redundancy by storing the name field on the Student document, so that's not a solution.
is there a better or more standard way of achieving what I'm trying to achieve?
It sounds like you are trying to read all information about a student in one read - the only way to do that is to have all that information in a single document.
The flexible schema of document databases allows you to have documents in a single collection that are not required to share the same schema, i.e. the same set of fields.
So I would recommend that you consider why you actually need separate collections for Person and Student - this causes writes to two collections when you add a student (and while a single write is atomic, two writes are not), and it also causes the issue you have now, where you need two separate reads to get all the information about a student.
This SO question is somewhat related to your situation.
See the accepted answer in this thread:
Possible bug when observing a cursor, when deleting from collection
It involves using a modified version of the built-in _publishCursor titled publishModifiedCursor, which allows you to specify a callback to add properties to each document in the cursor you are publishing.
I would change your code to have a role / job attribute in the Person object.
It's more semantic, at least to me - and consider how difficult it would be to handle someone changing jobs under your original design versus simply changing the role field.
Then you could just search:
Persons.find({role: 'student'})
And that would be totally analogous to having a Student object. As Asya said, the students can just have extra fields that the other documents don't have.
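A quick sketch of that shape (the field values are made up):

    // Students live in the same collection as everyone else, just with extra fields.
    Persons.insert({ name: "Ada", address: "...", role: "student", rollno: 42 });

    // Querying for students is then just a filter on role.
    Persons.find({ role: "student" });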
I am trying to design a simple application with two entities, Notebook and Note, where a Notebook can contain multiple Notes. In an RDBMS I would have two tables with a one-to-many relationship between them. In MongoDB I am not sure whether I should take a two-collection approach or embed the notes in the Notebook collection. What would you suggest?
That seems like a perfectly reasonable situation to use a single collection called Notebook, and each Notebook document contains embedded Notes. You can easily index on embedded documents.
If a Notebook document has a "notes" key whose value is a list of notes:

{
    "notes": [
        {"created_on": Date(1343592000000), "text": "A note."}
    ]
}

// create the index
db.notebook.ensureIndex({"notes.created_on": -1})
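That index would then support queries such as the following (the date is arbitrary):

    // Find notebooks that contain notes created on or after a given date.
    db.notebook.find({ "notes.created_on": { $gte: new Date("2012-07-01") } });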
My opinion is to embed as much as possible, and to fall back to referencing another collection by id only when the reference is to a more general set of data that is shared and might change - for instance, a collection of category documents that many other collections reference and that can be updated over time. In your case, though, a note should always belong to a notebook.
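For illustration, the reference pattern described above might look like this (the categories collection and category_id field are assumed names):

    // Shared, mutable data lives in its own collection...
    db.categories.insert({ _id: "travel", name: "Travel" });

    // ...and other documents reference it by id rather than embedding it.
    db.notebook.insert({ title: "Trips", category_id: "travel", notes: [] });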
You should ask yourself what kind of queries you will need to run on it. The "by default" approach is to embed them, but there are cases (that will depend on how you plan on using them) where a more relational approach is applicable. So the simple answer is "probably, but you should probably think about it" :)