MongoDB aggregation pipelines and the expressions available inside them make them look like a full language. However, I do not know how to test whether a system is Turing complete. Has anybody written or said anything about the Turing completeness of MongoDB aggregations?
I was curious about this as well. Theoretically it seems like it could be Turing complete, similar to how vanilla SQL can be Turing complete. See the answer here: https://stackoverflow.com/a/7580013/15314201
However, in practice the aggregation pipeline isn't meant for "scripting" but rather for a linear flow through the pipeline. Any sort of loops or functions would probably be better done in a language that interfaces with Mongo, such as using pymongo for the linear pipeline and Python for more advanced control flow, as in the sketch below.
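For illustration, here is a minimal, hedged sketch of that split (PyMongo; the `orders` collection and its fields are made-up names): the pipeline itself stays a linear match-then-group flow, and the looping happens in Python around it.

```python
from pymongo import MongoClient

coll = MongoClient().test.orders  # hypothetical collection

def totals_for(status):
    # A plain linear pipeline: filter, then group/sum. No loops inside.
    return list(coll.aggregate([
        {"$match": {"status": status}},
        {"$group": {"_id": "$customerId", "total": {"$sum": "$amount"}}},
    ]))

# Control flow lives in Python, not in the pipeline.
for status in ("open", "shipped", "returned"):
    print(status, totals_for(status))
```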
The MongoDB documentation states that it is not recommended to use its stored functions feature. This question goes through some of the reasons, but they all seem to boil down to "eval is evil".
Are there specific reasons why server-side functions should not be used in a MapReduce query?
The system.js functions are available to Map Reduce jobs by default (https://jira.mongodb.org/browse/SERVER-8632 notes a slight glitch with that in 2.4.0rc).
They are not actually eval'd within the native V8/SpiderMonkey environment, so technically that part of the concern is gone as well.
So no, there are no real problems: they will run as though native within that Map Reduce job and should run just as fast and as well as any other JavaScript you write. In fact, the system.js collection was designed primarily to house code for Map Reduce jobs; it was only later that it came to be used as a hack for "stored procedures".
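As a rough illustration of that intended use, here is a hedged PyMongo sketch (map-reduce and stored server-side JS are deprecated in recent MongoDB releases, and the collection/field names here are made up): store a helper under system.js, then call it from the map function.

```python
from pymongo import MongoClient
from bson.code import Code

db = MongoClient().test

# Store a helper where Map Reduce jobs can (per the answer above) see it.
db["system.js"].replace_one(
    {"_id": "normalize"},
    {"_id": "normalize",
     "value": Code("function (s) { return s.toLowerCase(); }")},
    upsert=True,
)

# The map function calls the stored helper by name.
result = db.command(
    "mapReduce", "events",
    map=Code("function () { emit(normalize(this.category), 1); }"),
    reduce=Code("function (k, vals) { return Array.sum(vals); }"),
    out={"inline": 1},
)
print(result["results"])
```

Whether the helper is visible to the job without an explicit load varies by server version, so treat this as a sketch of the idea rather than a guaranteed recipe.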
It's sounding like the V8 JavaScript engine might be replacing SpiderMonkey in MongoDB v2.2+.
What benefits, if any, will this bring to MongoDB map-reduce performance?
For example:
Will overall JavaScript evaluation performance improve (I'm assuming this one's a given?)
Will concurrent map and reduce operations be better able to run in parallel on a single instance?
Will map-reduces still block each other?
Yes, it will help with parallelism and help performance. The SpiderMonkey engine restricts MongoDB to a single JavaScript thread; however, the operations are usually short and allow other threads to interleave, so the exact impact is hard to quantify. Of course, testing is always the way to really figure out the benefits.
As you can see here: https://jira.mongodb.org/browse/SERVER-4258
And here: https://jira.mongodb.org/browse/SERVER-4191
Some of the improvements are already available for testing in the development release. To test with V8, just build using V8 as outlined here:
http://www.mongodb.org/display/DOCS/Building+with+V8
I'm looking at using some JavaScript in a MongoDb query. I have a couple of choices:
db.system.js.save the function in the db, then execute it
db.myCollection.find with a $where clause and send the JS each time
exec_js in MongoEngine (which I imagine uses one of the above)
I plan to use the JavaScript in a regularly used query that's executed as part of a request to a site or API (i.e. not a batch administrative job), so it's important that the query executes with reasonable speed.
I'm looking at a 30ish line function.
Is the JavaScript interpreted fresh each time? Will the performance be OK? Is it a sensible basis upon which to build queries?
Is the JavaScript interpreted fresh each time?
Pretty much. MongoDB has only one JavaScript instance per running mongod. You'll notice this if you try to run two different Map/Reduces at the same time.
Will the performance be OK?
Obviously, there are different definitions of "OK" here. The $where clause cannot use indexes, though you can combine it with another, indexed query. In either case, each object will need to be pushed from BSON over to the JavaScript runtime and then acted on inside the runtime.
The process is definitely not what you would call "performant". Of course, by that measure Map/Reduce is also not very performant and people use that on production systems.
Is it a sensible basis upon which to build queries?
The real barrier here isn't the number of lines of code; it's the number of documents the code will have to process. Even though it's "server-side" JavaScript, it's still a bunch of work that the server has to do (in one thread, in an interpreted environment).
If you can test it and scope it correctly, it may well work out. Just don't expect miracles.
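As a concrete, hedged sketch of the "combine with an indexed query" advice above (PyMongo; collection and field names are made up): let an indexed filter narrow the candidate set so the JavaScript only runs over the survivors.

```python
from pymongo import MongoClient

coll = MongoClient().test.orders   # hypothetical collection
coll.create_index("status")        # the indexed part of the query

cursor = coll.find({
    "status": "open",  # uses the index to narrow candidates
    # Evaluated in the JS runtime for each remaining document:
    "$where": "this.items.length > this.maxItems",
})
for doc in cursor:
    print(doc["_id"])
```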
What is your point here? Write a JS script and call it regularly through cron. What would be the problem with that?
Does a MongoDB MapReduce job lock the database? I am developing a multi-user MongoDB web application and am worried about multi-user conflicts and performance. Does anyone have any words of wisdom for me?
Simple answer? Sometimes ...
It depends a lot on how you are using map/reduce ... but in my experience it's never been a problem.
There isn't much info on this, but it's clearly stated in the docs that it does sometimes lock, while it "allows substantial concurrent operation."
There are a couple of questions in the mongodb-user group asking about this; the best response I've seen officially is that "in 1.4 it yields but isn't as nice as it should be, in 1.5 it's much friendlier to other requests."
That does not mean that it doesn't block at all, but compared to db.eval() which blocks the whole mongod process ... it's your best bet.
That said, in 1.7.2 and up there is now a nolock option for db.eval() ...
No, mapreduce does not lock the database. See the note here, just after "Using db.eval()" (it explains why mapreduce may be more appropriate to use than eval, because mapreduce does not block).
If you are going to run a lot of mapreduce jobs you should use sharding, because that way the job can run in parallel on all the shards. Unfortunately, mapreduce jobs can't run on secondaries in a replica set, since the results must be written and secondaries are read-only.
Version 2.1.0 added a "nonAtomic" flag to the output option.
See: https://jira.mongodb.org/browse/SERVER-2581
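A hedged sketch of that flag in use (PyMongo; collection and field names are made up; map-reduce and this flag were deprecated in much later server releases): nonAtomic is valid with the "merge" and "reduce" output modes, so post-processing doesn't hold the write lock for the whole job.

```python
from pymongo import MongoClient
from bson.code import Code

db = MongoClient().test
db.command(
    "mapReduce", "events",
    map=Code("function () { emit(this.userId, 1); }"),
    reduce=Code("function (k, vals) { return Array.sum(vals); }"),
    # nonAtomic applies to the merge/reduce output modes.
    out={"merge": "event_counts", "nonAtomic": True},
)
```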
I'd like to find a good, robust MapReduce framework that can be used from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011): there is also the Scoobi framework, which is impressively expressive.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark
http://spark-project.org/
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive mapreduce operations.
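To make the in-memory point concrete, here is a minimal sketch (using Spark's Python API for illustration, since this rewrite keeps one language for examples; the Scala API is analogous, and the input path is made up): cache() keeps the RDD in cluster memory, so repeated passes avoid re-reading from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")
lines = sc.textFile("hdfs:///data/corpus.txt").cache()  # hypothetical path; kept in memory

counts = (lines.flatMap(lambda line: line.split())  # first pass over the cached data
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# A second action reuses the in-memory copy instead of re-reading from disk.
print(lines.count())
sc.stop()
```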
You may be interested in scouchdb, a Scala interface to using CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining a key as a String with a key as a Long, lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix standard streams: your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner).
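A minimal sketch of that contract (word count, in Python; file names are arbitrary). The mapper emits tab-delimited pairs; Hadoop sorts by key before the reducer sees them, so equal keys arrive contiguously.

```python
#!/usr/bin/env python
# mapper.py: read lines from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py: sum counts per key; input arrives sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```

You'd run these with the hadoop-streaming jar, something like `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar's path varies by installation).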
I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.