MongoDB: Making sure referenced document still exists

Let's say I have two types of MongoDB documents: 'Projects' and 'Tasks'. A Project can have many tasks. In my case it is more suitable to link the documents rather than embed.
When a user wants to save a task I first verify that the project the task is being assigned to exists, like so:
// Create new task
var task = new Task(data);

// Make sure project exists
Project.findById(task.project, function(err, project) {
    if (project) {
        // If project exists, save task
        task.save(function(err) {
            ...
        });
    } else {
        // Project not found
    }
});
My concern is that if another user happens to delete the project after the Project.findById() query is run, but before the task is saved, the task will be created anyway without a referenced project.
Is this a valid concern? Is there any practice that would prevent this from happening, or is this just something that has to be faced with MongoDB?

Technically yes, this is something you have to face when using MongoDB. But it's not really a big deal, since it's rare for one user to delete a project at the exact moment another user is unknowingly creating a task for it. I would not use the if statement to check whether the project exists; rather, just let the task be created as a bad (orphaned) record. You can either remove those bad records manually or schedule a cron job to clean them up.
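For example, a rough sketch of such a cleanup, assuming the Task and Project models from the question and callback-style Mongoose (run it from a cron job or whatever scheduler you prefer):
// Remove tasks whose project no longer exists (orphaned records)
Project.distinct('_id', function(err, projectIds) {
    if (err) return console.error(err);
    Task.remove({ project: { $nin: projectIds } }, function(err) {
        if (err) return console.error(err);
        // Orphaned tasks are gone; log or report here if needed
    });
});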

The way you appear to be doing it, i.e. with two separate Models -- not subdocuments (hard to tell without seeing the Models) -- I guess you will have that race condition. The if won't help. You'd need to take advantage of the atomic modifiers to avoid this issue, and with separate Models (each being its own MongoDB collection), the atomic modifiers are not available across them. In the SQL world, you'd use a transaction to ensure consistency. Similarly, with a document store like MongoDB, you'd make each Task a subdocument of a Project and then just .push() new tasks. But perhaps your use case necessitates separate, unrelated Models. MongoDB is great for offering that flexibility, but it also lets you retain SQL-thinking without being SQL, which can lead to design problems.
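As a rough sketch of that subdocument approach (the schema, field names, and projectId variable here are made up for illustration):
// Tasks embedded in the Project document; a $push on a single document is atomic
var mongoose = require('mongoose');

var projectSchema = new mongoose.Schema({
    name: String,
    tasks: [{ title: String, done: Boolean }]
});
var Project = mongoose.model('Project', projectSchema);

// The push only succeeds if the project document still exists
Project.update(
    { _id: projectId },
    { $push: { tasks: { title: 'New task', done: false } } },
    function(err, result) {
        // If the result shows that no document was matched,
        // the project was deleted in the meantime
    }
);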
More to the point, though, the race condition you're worried about doesn't seem to be a big deal. After all, the Project could be deleted after the task is saved, too. You obviously have a method for cleaning that up. One more cleanup function isn't the end of the world -- probably a good thing to have in your back pocket anyway.

Related

Data syncing with pouchdb-based systems client-side: is there a workaround to the 'deleted' flag?

I'm planning on using rxdb + hasura/postgresql in the backend. I'm reading this rxdb page for example, which off the bat requires sync-able entities to have a deleted flag.
Q1 (main question)
Is there ANY point at which I can finally hard-delete these entities? What conditions would have to be met - e.g. could I simply use "older than X months" and then force my app to only ever display data that is less than X months old?
Is such a hard-delete, if possible, best carried out directly in the central db, since it will be the source of truth? Would there be any repercussions client-side that I'm not foreseeing/understanding?
I foresee the number of deleted records growing rapidly in my app and I don't want to have to store all this extra data forever.
Q2 (bonus / just curious)
What is the (algorithmic) basis for needing a 'deleted' flag? Is it that it's just faster to check a flag rather than to check for the omission of an object from, say, a very large list? I apologize if it's kind of a stupid question :(
Ultimately it comes down to a decision that's informed by your particular business/product with regards to how long you want to keep deleted entities in your system. For some applications it's important to always keep a history of deleted things or even individual revisions to records stored as a kind of ledger or history. You'll have to make a judgement call as to how long you want to keep your deleted entities.
I'd recommend that you also add a deleted_at column if you haven't already and then you could easily leverage something like Hasura's new Scheduled Triggers functionality to run a recurring job that fully deletes records older than whatever your threshold is.
You could also leverage Hasura's permissions system to ensure that rows that have been deleted aren't returned to the client. There is documentation, along with examples, on ways to work with soft deletes in Hasura.
For your second question it is definitely much faster to check for the deleted flag on records than to have to try and diff the entire dataset looking for things that are now missing.
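As a rough sketch of the recurring hard-delete (hypothetical table and column names, using a plain Node.js Postgres client rather than any Hasura-specific API; a scheduled trigger webhook or a plain cron job could run something like this):
// cleanup.js - hard-delete soft-deleted rows older than a threshold
const { Client } = require('pg');

async function purgeDeletedTasks() {
    const client = new Client(); // connection settings come from environment variables
    await client.connect();
    const result = await client.query(
        "DELETE FROM tasks WHERE deleted = true AND deleted_at < now() - interval '6 months'"
    );
    console.log('Purged ' + result.rowCount + ' rows');
    await client.end();
}

purgeDeletedTasks().catch(console.error);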

Mongo schema: Todo-list with groups

I want to learn mongo and decided to create a more complex todo-application for learning purposes.
The basic idea is a task-list where tasks are grouped in folders. Users may have different access to those folders (read, write) and tasks may be moved to other folders. Usually (especially for syncing) tasks will be requested by-folder and not alone.
Basically I thought about three approaches and would like to hear your opinion on them. Maybe I missed some points or just have the wrong way of thinking.
A - List of References
Collections: User, Folder, Task
Folders contain references to Users
Folders contain references to Tasks
Problem
When updating a Task, a reference to the Folder is needed. Either that reference is stored within the Task (redundancy) or it must be passed with each API call.
B - Subdocuments
Collections: User, Folder
Folders contain references to Users
Tasks are subdocuments within Folders
Problem
There is no way to update a Task without knowing its Folder, so both still need to be transmitted, but compared to A there is no redundancy.
C - References
Collections: User, Folder, Task
Folders contain references to Users
Tasks keep a reference to their Folder
Problem
Requesting a folder means searching in a long list instead of having direct references (A) or just returning the folder (B).
If you don't need any metadata for the folder except the name you could also go with:
Collections: User, Task
Task has field folder
User has arrays read_access and write_access
Then
You can get a list of all folders with
db.task.distinct("folder")
The folders a specific user can access are automatically retrieved when you fetch the user document, so those are basically known at login.
You can get all tasks a user can read with
db.task.find( { folder: { $in: read_access } } )
with read_access being the respective array you got from the user's document. The same goes for write_access.
You can find all tasks within a folder with a simple find query for the folder name.
Renaming a folder can be achieved with one update query on each of the collections.
Creating a folder or moving a task to another folder can also be achieved in simple manners.
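For example, a rename and a move with this schema could look roughly like the following (mongo shell syntax with the MongoDB 3.2+ update helpers; the folder names and the taskId variable are made up):
// Rename a folder: one update on each collection
db.task.updateMany({ folder: "inbox" }, { $set: { folder: "archive" } });
db.user.updateMany({ read_access: "inbox" }, { $set: { "read_access.$": "archive" } });
db.user.updateMany({ write_access: "inbox" }, { $set: { "write_access.$": "archive" } });

// Move a single task to another folder
db.task.updateOne({ _id: taskId }, { $set: { folder: "archive" } });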
So without metadata for folders, that is what I would do. If you need metadata for folders it can become a little more complicated, but basically you could manage it independently of the tasks and users above, using a folder collection that contains the metadata, with _id being the folder name referenced in user and task.
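A rough sketch of that variant (the metadata fields and the user _id are made up):
// Folder metadata keyed by the folder name that user and task documents reference
db.folder.insertOne({ _id: "inbox", color: "blue", createdAt: new Date() });

// Tasks and user access arrays keep referring to the folder by name
db.task.insertOne({ title: "Write schema", folder: "inbox" });
db.user.updateOne({ _id: "alice" }, { $addToSet: { read_access: "inbox" } });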
Edit:
Comparison of the different approaches
I stumbled over this link, which might be of interest for you. It contains a discussion of transitioning from a relational database model to mongo. The difference being that in a relational database you usually try to go for third normal form, where one of the goals is to avoid bias towards any particular access pattern, whereas in mongodb you can model your data to best fit your access patterns (while keeping in mind not to introduce possible data anomalies through redundancy).
So with that in mind:
Your model A is how you could do it in a relational database (each type of information in one table, referenced by id).
Model B would be tailored for an access pattern where you always list a complete folder and tasks are only edited when the folder is opened (if you retrieve one folder you have all the tasks without an additional query).
C would be a different relational model than A, and I think a little closer to third normal form (without knowing the exact tables).
My suggestion does not support folder access as well as B, but it makes it easier to show and edit single tasks.
Problems that could come up with the schemas: since A and C are basically relational, you can get a problem with foreign keys, because mongodb does not enforce foreign key constraints (e.g. in C you could delete a folder while there are still tasks referencing it, and in A you could delete a task without deleting its reference in the folder). You could circumvent this problem by enforcing it from the application. For B the 16MB document limit could become a problem, which you could work around by allowing a folder to split into multiple documents once it reaches a certain task count.
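A minimal sketch of enforcing that from the application for variant C (mongo shell syntax; whether you cascade the delete like this or refuse it while tasks still exist is a design choice):
// Delete a folder and clean up the tasks that reference it
function deleteFolder(folderId) {
    db.task.deleteMany({ folder: folderId });
    db.folder.deleteOne({ _id: folderId });
    // The two statements are still not atomic; a crash in between leaves orphaned tasks
}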
So, new conclusion: I think A and C might not show you the advantages of mongodb (and might even be more work to build in mongodb than in sql) since they are what you would do in a traditional relational database, which is not what mongodb was designed for (e.g. the missing join statement, no foreign key constraints). In sum, B best matches your access pattern "Usually (especially for syncing) tasks will be requested by-folder" while still allowing you to easily edit and move tasks once the folder is opened.
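For completeness, a rough sketch of what B could look like (field names are made up):
// A folder document with embedded tasks
db.folder.insertOne({
    name: "inbox",
    read_access: ["alice"],
    write_access: ["alice"],
    tasks: [{ _id: ObjectId(), title: "Write schema", done: false }]
});

// Adding a task is an atomic $push on the folder document
db.folder.updateOne(
    { name: "inbox", write_access: "alice" },
    { $push: { tasks: { _id: ObjectId(), title: "Review schema", done: false } } }
);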

MongoDB in Go (golang) with mgo: How do I update a record, find out if update was successful and get the data in a single atomic operation?

I am using mgo driver for MongoDB under Go.
My application asks for a task (with just a record select in Mongo from a collection called "jobs") and then registers itself as an assignee to complete that task (an update to that same "job" record, setting itself as assignee).
The program will be running on several machines, all talking to the same Mongo. When my program lists the available tasks and then picks one, other instances might have already obtained that assignment, and the current assignment would have failed.
How can I be sure that the record I read and then update does or does not have a certain value (in this case, an assignee) at the time of being updated?
I am trying to get one assignment, no matter which one, so I think I should first select a pending task and try to assign it, keeping it only if the update was successful.
So, my query should be something like:
"From all records on collection 'jobs', update just one that has assignee=null, setting my ID as the assignee. Then, give me that record so I could run the job."
How could I express that with mgo driver for Go?
This is an old question, but just in case someone is still watching at home, this is nicely supported via the Query.Apply method. It does run the findAndModify command as indicated in another answer, but it's conveniently hidden behind Go goodness.
The example in the documentation matches pretty much exactly the question here:
change := mgo.Change{
    Update:    bson.M{"$inc": bson.M{"n": 1}},
    ReturnNew: true,
}
info, err = col.Find(bson.M{"_id": id}).Apply(change, &doc)
fmt.Println(doc.N)
I hope you saw the comments on the answer you selected, but that approach is incorrect. Doing a select and then an update results in a round trip, and two machines could fetch the same job before one of them updates the assignee. You need to use the findAndModify method instead: http://www.mongodb.org/display/DOCS/findAndModify+Command
The MongoDB guys describe a similar scenario in the official documentation: http://www.mongodb.org/display/DOCS/Atomic+Operations
Basically, all you have to do is fetch any job with assignee=null. Let's suppose you get the job with _id=42 back. You can then go ahead and modify the document locally, by setting assignee="worker1.example.com", and call Collection.Update() with the selector {_id=42, assignee=null} and your updated document. If the database is still able to find a document that matches this selector, it will replace the document atomically. Otherwise you will get an ErrNotFound, indicating that another thread has already claimed the task. If that's the case, try again.
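For reference, the shell-level equivalent of that claim operation looks roughly like this (collection and field names follow the question; the assignee value is just an example):
// Atomically claim one unassigned job and return the updated document
db.jobs.findAndModify({
    query:  { assignee: null },
    update: { $set: { assignee: "worker1.example.com" } },
    new:    true
});
// A null result means no unassigned job was available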

Salesforce.com: UNABLE_TO_LOCK_ROW, unable to obtain exclusive access to this record

In our production org, we have a system of uploading sales data into Salesforce using command line data loader. This data is loaded into a temporary object Temp. We have created a formula field (which combines three fields) to form a unique key. The purpose of the object is to reduce user efforts for creating the key manually.
There is an after insert trigger on Temp which calls an asynchronous method that upserts the data to another object SalesData using the key. The insert/update trigger on SalesData checks the various fields and creates/updates the records in another object SalesRecords. After the insert/update is complete, all the records in the temporary object Temp are deleted. The SalesRecords object does not have any trigger on it and is a child of another object Sales. The Sales object has some rollup fields which sum up fields from the SalesRecords object.
Lately, we are getting the below error for some of the records which are updated.
UNABLE_TO_LOCK_ROW, unable to obtain exclusive access to this record
Please provide some pointers to resolve the issue
This could be caused either by conflicting DML operations in the various trigger executions or by some recursive trigger execution. I would assume that the async executions cause multiple subsequent updates on the same records, probably on the SalesRecords object. I would recommend trying to simplify the process to avoid too many related trigger executions.
I'm a little surprised you were able to get this to work in the first place. After triggers should be used with caution and only when before triggers can't be. One reason for this is that you don't need to perform additional DML to make changes to records in before triggers: you simply change the values and the insert/update commit happens automatically. But recursive trigger firing is the main problem with after triggers.
One quick way to avoid trigger re-entry is to use a public static Boolean in a class that states whether you're already in this trigger from the same thread of execution.
Something like (TriggerGuard is just a placeholder class name):
public class TriggerGuard {
    public static Boolean isExecuting = false;
}
Once set to true, any trigger code that is a re-fire can be avoided with:
if (TriggerGuard.isExecuting == false)
{
    TriggerGuard.isExecuting = true;
    // Perform trigger logic
    // ...
}
Additionally, since the order of trigger execution cannot be determined up front, you might be seeing an issue with deletions or other data changes that depend on other parts of your flow to finish first.
Also, without knowing the details of your custom unique 3-part key, I'd wonder if there's a problem there too such as whether it's truly unique or not. Case insensitivity is a common mistake and it's the reason there are 15 AND 18 character Ids in Salesforce. For example, when people export to Excel (a case-insensitive environment) and do VLOOKUPs, they would occasionally find the wrong record. The 3-digit calculated suffix was added to disambiguate for case-insensitive environments.
Googling for this same error led me to this post:
http://boards.developerforce.com/t5/General-Development/Unable-to-obtain-exclusive-access-to-this-record/td-p/345319
Which points out some common causes for this to happen:
Sharing Rules are being calculated.
A picklist value has been replaced and replacement is in progress.
A custom index creation/removal is in progress.
Most unlikely one - someone else is already editing the same record that you are trying to access at the same time.
Posting here in case somebody else needs it.
I got this error multiple times today. Turned out one of our vendors was updating their installed package during that time in the same org. All kinds of things were going wrong also - some object validation exceptions were being thrown on DMLs, without any error message content.
Resolution
The error is shown when a field update, such as a roll-up summary field, is being attempted on a parent object that already had a field update causing the roll-up summary field to recalculate. This can also occur if a trigger or another Apex job is running on the master object and is also attempting to do an update.
You can either reduce the batch size and try again or create separate smaller files to be imported if this issue occurs.

Getting past Salesforce trigger governors

I'm trying to write an "after update" trigger that does a batch update on all child records of the record that has just been updated. This needs to be able to handle 15k+ child records at a time. Unfortunately, the limit appears to be 100, which is so far below my needs it's not even close to acceptable. I haven't tried splitting the records into batches of 100 each, since this will still put me at a cap of 10k updates per trigger execution. (Maybe I could just daisy-chain triggers together? ugh.)
Does anyone know what series of hoops I can jump through to overcome this limitation?
Edit: I tried calling the following @future method from my trigger, but it never updates the child records:
global class ParentChildBulkUpdater
{
    @future
    public static void UpdateChildDistributors(String parentId) {
        Account[] children = [SELECT Id FROM Account WHERE ParentId = :parentId];
        for (Account child : children) {
            child.Site = 'Bulk Updater Fired';
        }
        update children;
    }
}
The best (and easiest) route to take with this problem is to use Batch Apex: you can create a batch class and fire it from the trigger. Like @future it runs in a separate thread, but it can process up to 50,000,000 records!
You'll need to pass some information to your batch class before calling Database.executeBatch so that it has the list of parent IDs to work with, or you could just get all of the accounts, of course ;)
I've only just noticed how old this question is but hopefully this answer will help others.
It's worse than that: you're not even going to be able to get those 15k records in the first place, because there is a 1,000 row query limit within a trigger (this scales with the number of rows the trigger is being called for, but that probably doesn't help).
I guess your only way to do it is with the @future annotation - read up on that in the docs. It gives you much higher limits. Although you can only call so many of those in a day, so you may need to somehow keep track of which parent objects still have children to update, and then process those offline.
A final option may be to use the API via some external tool. But you'll still have to make sure everything in your code is batched up.
I thought these limits were draconian at first, but actually you can do a hell of a lot within them if you batch things correctly; we regularly update thousands of rows from triggers. And from an architectural point of view, much more than that and you're really talking batch processing anyway, which isn't normally activated by a trigger. One thing's for sure - they make you jump through hoops to do it.
I think Codek is right, going the API / external tool route is a good way to go. The governor limits still apply, but are much less strict with API calls. Salesforce recently revamped their DataLoader tool, so that might be something to look into.
Another thing you could try is using a Workflow rule with an Outbound Message to call a web service on your end. Just send over the parent object and let a process on your end handle the child record updates via the API. One thing to be aware of with outbound messages: it is best to queue up the process on your end somehow and immediately respond to Salesforce, otherwise Salesforce will resend the message.
@future doesn't work (does not update records at all)? Weird. Did you try using your function in an automated test? It should work, and the annotation should be ignored (during the test it will be executed instantly; test methods have higher limits). I suggest you investigate this a bit more; it seems like the best solution for what you want to accomplish.
Also - maybe try to call it from your class, not the trigger?
Daisy-chaining triggers together will not work, I've tried it in the past.
Your last option might be batch Apex (from Winter'10 release so all organisations should have it by now). It's meant for mass data update/validation jobs, things you typically run overnight in normal databases (it can be scheduled). See http://www.salesforce.com/community/winter10/custom-cloud/program-cloud-logic/batch-code.jsp and release notes PDF.
I believe in version 18 of the API the 1,000 limit has been removed (so the documentation says, but in some cases I still hit a limit).
So you may be able to use Batch Apex with a single Apex update statement.
Something like:
List<ChildObject__c> children = new List<ChildObject__c>();
for (ChildObject__c c : [SELECT ....]) {
    c.foo__c = 'bar';
    children.add(c);
}
update children;
Be sure you bulkify your trigger; also see http://sfdc.arrowpointe.com/2008/09/13/bulkifying-a-trigger-an-example/
Maybe a change to your data model is the better option here. Think about creating a formula field on the child object that accesses the data from the parent. That would probably be far more efficient.