Performance of repeatedly executing the same javascript on MongoDb? - mongodb

I'm looking at using some JavaScript in a MongoDb query. I have a couple of choices:
db.system.js.save the function in the db then execute it
db.myCollection.find with a $where clause and send the JS each time
exec_js in MongoEngine (which I imagine uses one of the above)
I plan to use the JavaScript in a regularly used query that's executed as part of a request to a site or API (i.e. not a batch administrative jobs) so it's important that the query executes with reasonable speed.
I'm looking at a 30ish line function.
Is the Javascript interpreted fresh each time? Will the performance be ok? Is it a sensible basis upon which to build queries?

Is the Javascript interpreted fresh each time?
Pretty much. MongoDB only has one "javascript instance" per running instance of MongoDB. You'll notice this if you try to run two different Map/Reduces at the same time.
Will the performance be ok?
Obviously, there are different definitions of "OK" here. The $where clause can not use indexes. You can combine that clause with another indexed query. In either case each object will need to be pushed from BSON over to the Javascript run-time and then acted on inside the run-time.
The process is definitely not what you would call "performant". Of course, by that measure Map/Reduce is also not very performant and people use that on production systems.
Is it a sensible basis upon which to build queries?
The real barrier here isn't the number of lines in the code, it's the number of possible documents this code will interpret. Even though it's "server-side" javascript, it's still a bunch of work that the server has to do. (in one thread, in an interpreted environment)
If you can test it and scope it correctly, it may well work out. Just don't expect miracles.

What is your point here? Write a JS script and call it regularly through cron. What should be the problem with that?

Related

Time of first occurrence of the query from pg_stat_statements. Possible to get?

I'm debugging a DB performance issue. There's a lead that the issue was introduced after a certain deploy, e.g. when DB started to serve some new queries.
I'm looking to correlate deployment time with the performance issues, and would like to identify the queries that are causing this.
Using pg_stat_statements has been very handy so far. Unfortunately it does not store the time stamp of the first occurrence of each query.
Is there any auxiliary tables I could look into to see the time of first occurrence of queries?
Ideally, if this information would have been available in pg_stat_statements, I'd make a query like this:
select queryid from where date(first_run) = '2020-04-01';
Additionally, it'd be cool to see last_run as well, so to filter out some old queries that no longer execute at all, but remain in pg_stat_statements. That's more of a nice thing that's a necessity though.
This information is not stored anywhere, and indeed it would not be very useful. If the problem statement is a new one, you can easily identify it in your application code. If it is not a new statement, but something made the query slower, the first time the query was executed won't help you.
Is your source code not under version control?

Not recommended to use server-side functions in MongoDB, does this go for MapReduce as well?

The MongoDB documentation states that it is not recommended to use its stored functions feature. This question goes through some of the reasons, but they all seem to boil down to "eval is evil".
Are there specific reasons why server-side functions should not be used in a MapReduce query?
The system.js functions are available to Map Reduce jobs by default ( https://jira.mongodb.org/browse/SERVER-8632 notes a slight glitch to that in 2.4.0rc ).
They are not actually evaled within the native V8/Spidermonkey evironment so tehcnically that part of them is also gone.
So no, there is no real problems, they will run as though native within that Map Reduce and should run just as fast and "good" as any other javascript you write. In fact the system.js collection is more designed to house code for map reduce jobs, it is later uses that sees it used as a hack for "stored procedures".

Is eval that evil?

I understand that eval locks the whole database, which can't be good for throughput - however I have a scenario where a very specific transaction involving several documents must be isolated.
Because that transaction does not happen very often and is fairly quick (a few updates on indexed queries), I was thinking of using eval to execute it.
Are their any pitfalls that I should be aware of (I have seen several eval=evil posts but without much explanation)?
Does it make a difference if the database is part of a replica set?
Many developers would suggest using eval is "evil" as their are obvious security concerns with potentially unsanitized JavaScript code executing within the context of the MongoDB instance. Normally MongoDB is immune to those types of injection attacks.
Some of the performance issues of using JavaScript in MongoDB via the eval command are mitigated in version 2.4, as muliple JavaScript operations can execute at the same time (depending on the setting of the nolock option). By default though, it takes a global lock (which is what you specifically want apparently).
When a eval is being used to try to perform an (ACID-like) transactional update to several documents, there's one primary concern. The biggest issue is that if all operations must succeed for the data to be in a consistent state, the developer is running the risk that a failure mid-way through the operation may result in a partially complete update to the database (like a hardware failure for example). Depending on the nature of the work being performed, replication settings, etc., the data may be OK, or may not.
For situations where database corruption could occur as a result of a partially complete eval operation, I would suggest considering an alternative schema design and avoiding eval. That's not to say that it wouldn't work 99.9999% of the time, it's really up to you to decide ultimately whether it's worth the risk.
In the case you describe, there are a few options:
{ version: 7, isCurrent: true}
When a version 8 document becomes current, you could for example:
Create a second document that contains the current version, this would be an atomic set operation. It would mean that all reads would potentially need to read the "find the current version" document first, followed by the read of the full document.
Use a timestamp in place of a boolean value. Find the most current document based on timestamp (and your code could clear out the fields of older documents if desired once the now current document has been set)

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table who's size is about 40 million records and growing and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dump the queries with the execution time into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude+) that what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact' but really the difference is, the application is using a prepared statement to which I bind values at runtime while the queries I run in DbVisualizer has those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
Also consider unnamed prepared statements like #peufeu provided. Those re-plan the query considering parameters every time - and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is to send parameter values separately, which avoids escaping and parsing overhead, not to mention the opportunity for SQL injections and bugs if you don't use an API that handles parameters in a manner you can't forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for postgres, since it uses named prepared statements, but re-prepares every time you call prepare(), no plan caching takes place. So you get the worst of both : many roundtrips and plan without parameters. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries, pg_query_params uses unnamed prepared statements. Usually one is faster than the other, that depends on the size of parameter data.

will I typically get better performance if I run an update/calc loop via javascript?

I have a script that loops over a set of records, performs some statistical calculations and updates the records. It's a big cursor: get record, calculate statistics from embedded documents, set fields on record, save record. There's <5k records that are being looped and each one embeds 90 history entries.
Question: would I get substantially better performance if I did this via javascript? The alternative being writing it in Ruby. My opinion (unfounded) is that since this can be done entirely in the database I will get better performance if send a chunk of js to Mongodb instead of adding Ruby in to the mix.
Related: is map/reduce appropriate for finding the median and mode of a set of values for many records?
The answer is really "it depends" - if the fields you need to do the calculations are very large, doing the calculation on the server side with JS might be a lot faster simply by cutting down on network traffic.
But, executing JS on the server side also holds a write lock, so depending on how complicated the calculations are, it might be more efficient to just do your calculations on the client side and then simply update the document.
Your best bet is to do a simple benchmark with ruby vs. server side JS. If you need to serve other database traffic at the same time, this should also be considered as well, because your lock % could be different in the two scenarios (you can monitor this with mongostat).
Also, keep in mind that using db.eval will not work with sharding, so avoid it if you are using a sharded environment or plan to in the future.