Why is the index not created after teardown if some connections persist? - mongodb

I set up and tear down my MongoDB database during functional tests.
One of my models makes use of GridFS, and I am going to run that test (which also calls setup and teardown). Suppose we start out with a clean, empty database called test_repoapi:
python serve.py testing.ini
nosetests -a 'write-file'
The second time I run the test, I am getting this:
OperationFailure: command SON([('filemd5', ObjectId('518ec7d84b8aa41dec957d3c')), ('root', u'fs')]) failed: need an index on { files_id : 1 , n : 1 }
If we look at the database from the mongo shell:
> use test_repoapi
switched to db test_repoapi
> show collections
fs.chunks
system.indexes
users
Here is the log: http://pastebin.com/1adX4svG
There are three kinds of timestamps:
(1) the top one is from when I first launched the web app,
(2) anything before 23:06:27 is from the first iteration, and
(3) everything else is from the second iteration.
As you can see, I did issue commands to drop the database. Two possible explanations:
(1) the web app holds two active connections to the database, and
(2) some kind of "lock" prevents the index from being fully created. Note also that fs.files was not recreated.
The workaround is to stop the web app, start it again, and run the test; then the error does not appear.
By the way, I am using Mongoengine as my ODM in my web app.
Any thoughts on this?

We used to have a similar issue with mongoengine failing to recreate indexes after drop_collection() during tests, because it failed to realise that dropping a collection also drops its indexes. But that was happening with normal collections and a rather ancient version of mongoengine (a call to QuerySet._reset_already_indexed() fixed it for us, but we haven't needed that since 0.6).
Maybe this is another case of mongoengine internally keeping track of which indexes have been created, and it is simply failing to realize that the database/collection vanished and those indexes must be recreated? FWIW, using drop_collection() between tests works for us, and that includes GridFS.
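Here is a minimal sketch of that drop_collection()-between-tests pattern (the model, field, and function names are hypothetical; it just assumes a mongoengine Document with a GridFS-backed FileField):

from mongoengine import Document, FileField, StringField, connect

connect('test_repoapi')              # database name taken from the question

class RepoFile(Document):            # hypothetical model with a GridFS-backed field
    name = StringField()
    payload = FileField()

def teardown():
    # Dropping the collection also drops its indexes; mongoengine is expected
    # to recreate them the next time the model is used.
    RepoFile.drop_collection()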

Related

NpgsqlConnection fails when database has been dropped and recreated

For an XUnit integration test automation project that runs against a PostgreSQL database, I have created a script that first drops and then recreates the database, so that every test starts with the same set of data as input. When I run the tests individually (one by one) through the test explorer, they all run fine. When I try to run them all in the same test run, it fails on the second test that is executed.
The structure of every test is:
initialize the new database using the script that drops, creates and fills it with data
run the test
open an NpgsqlConnection to the database
query the database and check if the resulting content matches my expectations
The second time, this causes an Npgsql.NpgsqlException: Exception while writing to stream.
It seems that when the connection is created for the second time, Npgsql sees it as a previously used connection and reuses it; but the database it pointed to has been dropped, so the connection can't be used again.
If, for instance, I don't run the query on the first connection and only run it on the second connection, it also works fine.
I hope someone can give me a good suggestion on how to deal with this. This is the first time I have used PostgreSQL in one of my projects. I could maybe use the Entity Framework data provider for PostgreSQL, but I will try asking this first...
I added Pooling=false to the connection string and now it works. I can now drop and recreate the database as often as I want in the same test run, and simply reconnect to it from the C# code.
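For reference, a connection string with pooling disabled might look something like this (all values are placeholders; Pooling=false is the relevant part):

Host=localhost;Port=5432;Database=testdb;Username=test;Password=secret;Pooling=false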

Can I debug a PostgreSQL query sent from an external source, that I can't edit?

I see how to debug queries stored as functions in the database, but my problem is with an external QGIS plugin that connects to my Postgres 10.4 over the network, runs a complex query and calculations, and stores the results back into PostGIS tables:
FOR r IN c LOOP
    SELECT
        (1 - ST_LineLocatePoint(path.geom, ST_Intersection(r.geom, path.geom))) * ST_Length(path.geom)
    INTO
        station
(continues ...)
When it errors, it just reports that line number as the failing location, with no clue where it was in the loop over hundreds of features. (And any features it has already processed are not stored to the output tables when it fails.) I don't know nearly enough about the plugin or about SQL to hack the external query, and I suspect that if it were a reasonable task, the plugin author would have included more revealing debug messages.
So is there some way I could use pgAdmin4 (or anything) from the server side to watch the query process? Even being able to see if it fails the first time through the loop or later would help immensely. Knowing the loop count at failure would point me to the exact problem feature. Being able to see "station" or "r.geom" would make it even easier.
It is perfectly fine if the process is miserably slow or interferes with other queries; I'm the only user on this server.
This is not actually a way to watch the RiverGIS query in action, but it is the best I have found. It extracts the failing ST_Intersects() call from the RiverGIS code and runs it under your control, where you can display any clues you want.
When you're totally mystified where the RiverGIS problem might be, run this SQL query:
SELECT
    xs."XsecID" AS "XsecID",
    xs."ReachID" AS "ReachID",
    xs."Station" AS "Station",
    xs."RiverCode" AS "RiverCode",
    xs."ReachCode" AS "ReachCode",
    ST_Intersection(xs.geom, riv.geom) AS "Fraction"
FROM
    "<your project name>"."StreamCenterlines" AS riv,
    "<your project name>"."XSCutLines" AS xs
WHERE
    ST_Intersects(xs.geom, riv.geom)
ORDER BY xs."ReachID" ASC, xs."Station" DESC
Obviously replace <your project name> with the QGIS project name.
Also works for the BankLines step if you replace "StreamCenterlines" with "BankLines". Probably could be adapted to other situations where ST_Intersects() fails without a clue.
You'll get a listing with shorter geometry strings for good cross sections and double-length strings for bad ones. You'll probably need to widen your display column a lot to see this.
This works for me in pgAdmin4, or in QGIS3 -> Database -> DB Manager -> (click the wrench icon). You could select only the bad lines, but I find the background info helpful.

MongoDB in Go (golang) with mgo: How do I update a record, find out if update was successful and get the data in a single atomic operation?

I am using mgo driver for MongoDB under Go.
My application asks for a task (with just a record select in Mongo from a collection called "jobs") and then registers itself as an assignee to complete that task (an update to that same "job" record, setting itself as assignee).
The program will be running on several machines, all talking to the same Mongo. When my program lists the available tasks and then picks one, other instances might have already taken that assignment, and the current assignment would fail.
How can I be sure that the record I read and then update does or does not have a certain value (in this case, an assignee) at the moment it is updated?
I am trying to get one assignment, no matter which one, so I think I should first select a pending task and try to assign it, keeping it only if the update was successful.
So, my query should be something like:
"From all records on collection 'jobs', update just one that has assignee=null, setting my ID as the assignee. Then, give me that record so I could run the job."
How could I express that with mgo driver for Go?
This is an old question, but just in case someone is still watching at home, this is nicely supported via the Query.Apply method. It does run the findAndModify command as indicated in another answer, but it's conveniently hidden behind Go goodness.
The example in the documentation matches pretty much exactly the question here:
change := mgo.Change{
    Update:    bson.M{"$inc": bson.M{"n": 1}},
    ReturnNew: true,
}
// Apply runs findAndModify under the hood, so the update and the read of the
// resulting document happen atomically on the server.
info, err = col.Find(bson.M{"_id": id}).Apply(change, &doc)
fmt.Println(doc.N)
I hope you saw the comments on the answer you selected, but that approach is incorrect. Doing a select and then an update results in a round trip, and two machines could fetch the same job before one of them updates the assignee. You need to use the findAndModify command instead: http://www.mongodb.org/display/DOCS/findAndModify+Command
The MongoDB guys describe a similar scenario in the official documentation: http://www.mongodb.org/display/DOCS/Atomic+Operations
Basically, all you have to do is fetch any job with assignee=null. Let's suppose you get the job with _id=42 back. You can then modify the document locally, setting assignee="worker1.example.com", and call Collection.Update() with the selector {_id: 42, assignee: null} and your updated document. If the database is still able to find a document that matches this selector, it will replace the document atomically. Otherwise you will get ErrNotFound, indicating that another worker has already claimed the task. If that's the case, try again.
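For illustration only, here is the same claim-only-if-still-unassigned pattern sketched in Python with pymongo (the database, collection, and worker names are made up; in Go you would stick with Query.Apply or Collection.Update as described above):

from pymongo import MongoClient, ReturnDocument

client = MongoClient()           # assumes a local mongod
jobs = client.mydb.jobs          # database/collection names are hypothetical

# find_one_and_update issues a single findAndModify, so the match on
# assignee=null and the $set happen atomically on the server; two workers
# can never claim the same job.
job = jobs.find_one_and_update(
    {"assignee": None},
    {"$set": {"assignee": "worker1.example.com"}},
    return_document=ReturnDocument.AFTER,
)
if job is None:
    print("nothing pending, or another worker claimed it first")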

Issue with Entity Framework 4.2 Code First taking a long time to add rows to a database

I am using Entity Framework 4.2 with Code First. I have a Windows 2008 application server and a database server running on Amazon EC2. The application server has a Windows service installed that runs once per day. The service executes the following code:
// returns between 2000-4000 records
var users = userRepository.GetSomeUsers();

// do some work
foreach (var user in users)
{
    var userProcessed = new UserProcessed { User = user };
    userProcessedRepository.Add(userProcessed);
}

// Calls SaveChanges() on DbContext
unitOfWork.Commit();
This code takes a few minutes to run. It also maxes out the CPU on the application server. I have tried the following measures:
Removed the unitOfWork.Commit() call to see if the problem was network-related (the application server talking to the database). This did not change the outcome.
Changed my application server from a medium instance to a high-CPU instance on Amazon to see if it was resource-related. The server no longer maxed out the CPU and the execution time improved slightly; however, it was still a few minutes.
As a test, I modified the above code to run three times to see how the execution time changed for the second and third loops using the same DbContext. Every consecutive loop took longer to run than the previous one, but that could be related to using the same DbContext.
Am I missing something? Is it really possible that something as simple as this takes minutes to run? Even if I don't commit to the database after each loop? Is there a way to speed this up?
Entity Framework (as it stands) isn't really well suited to this kind of bulk operation. Are you able to use one of the bulk insert methods with EC2? Otherwise, you might find that hand-coding the T-SQL INSERT statements is significantly faster. If performance is important then that probably outweighs the benefits of using EF.
My guess is that your ObjectContext is accumulating a lot of entity instances. SaveChanges seems to have a phase whose running time is linear in the number of entities loaded, which is likely why each run takes longer and longer.
A way to resolve this is to use multiple, smaller ObjectContexts to get rid of old entity instances.

MongoDB: Making sure referenced document still exists

Let's say I have two types of MongoDB documents: 'Projects' and 'Tasks'. A Project can have many tasks. In my case it is more suitable to link the documents rather than embed them.
When a user wants to save a task I first verify that the project the task is being assigned to exists, like so:
// Create new task
var task = new Task(data);

// Make sure project exists
Project.findById(task.project, function(err, project) {
    if (project) {
        // If project exists, save task
        task.save(function(err) {
            ...
        });
    } else {
        // Project not found
    }
});
My concern is that if another user happens to delete the project after the Project.findById() query is run, but before the task is saved, the task will be created anyway without a referenced project.
Is this a valid concern? Is there any practice that would prevent this from happening, or is this just something that has to be faced with MongoDB?
Technically yes, this is something you need to face when using MongoDB. But it's not really a big deal, since it is rare for someone to delete a project while another person, unaware of the deletion, is creating a task for it. I would not use the if statement to check the project's existence; rather, just let the task be created as a bad record. You can either remove those bad records manually or schedule a cron job to clean them up.
The way you appear to be doing it, i.e. with two separate Models rather than subdocuments (hard to tell without seeing the Models), I guess you will have that race condition. The if won't help. You'd need to take advantage of the atomic modifiers to avoid this issue, and with separate Models (each being its own MongoDB collection) the atomic modifiers are not available across them. In the SQL world, you'd use a transaction to ensure consistency. Similarly, with a document store like MongoDB, you'd make each Task a subdocument of a Project and then just .push() new tasks. But perhaps your use case necessitates separate, unrelated Models. MongoDB is great for offering that flexibility, but it lets you retain SQL thinking without being SQL, which can lead to design problems.
More to the point, though, the race condition you're worried about doesn't seem to be a big deal. After all, the Project could be deleted after the task is saved, too. You obviously have a method for cleaning that up; one more cleanup function isn't the end of the world, and it is probably a good thing to have in your back pocket anyway.
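To make the atomic-modifier suggestion concrete, here is a minimal sketch in Python with pymongo (the thread's own code uses Mongoose, so treat this purely as an illustration of the pattern; the collection and field names are hypothetical):

from pymongo import MongoClient

client = MongoClient()                  # assumes a local mongod
projects = client.mydb.projects         # database/collection names are made up

def add_task(project_id, task_doc):
    # The filter and the $push are applied as one atomic operation, so the
    # task is only written if the project document still exists.
    result = projects.update_one(
        {"_id": project_id},
        {"$push": {"tasks": task_doc}},
    )
    return result.matched_count == 1    # False: project was deleted meanwhile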