Can I search across collections in MongoDB? - mongodb

I am inserting my data into MongoDB and had 240 such files. Instead of inserting everything into one big collection, I was thinking of inserting the files as a collection by themselves. Is this a good idea if I do a lot of queries on a commonly indexed column?
If so, how can I initiate a query to query all the collections in my database?

Using an application server such as Solr can help you achieve what you want, also with the addition of fuzzy matching, synonyms, phonetic matching, misspellings, etc.
Solor is built on top of Lucene. It's docs are here:
http://lucene.apache.org/solr/
The learning curve is a little bit steep, but you can get pretty good searchability using much of its defaults, leaving you to build a schema and index your data to get started.

I think the answer you're looking for is really here on your other question: Is there any multicore exploiting NoSQL system?
There is no way to query across all collections in Mongo. It wouldn't make a lot of sense to do so. MongoDB's strength is focused on tactically denormalizing data into collections. Providing operations to query across all collections run exactly counter to the concept of tactical denormalization.
In theory, you could just run 240 queries. But more practically you'll probably end up "partitioning" your data so that you only need to query some of the collections. At this point you end up back at the link I provided above, which suggests that sharding your data is probably the answer here.

Related

Why use ElasticSearch with Mongo?

I have read a few articles recently on the combination of mongodb for storage and elasticsearch for indexing/search. I feel like I'm missing something though. Why would you go this route as opposed to just using mongo to index the data? What benefits does elasticsearch bring and is it worth the added complexity?
ElasticSearch implements a lot more features, such as custom splitting of text into words, custom stemming, facetted search and a whole lot more. While MongoDB's (rather simple) text search does some of this, it is not nearly as powerful as ElasticSearch.
If all you ever do is look for a single string in a single field, then MongoDB's normal query system will work excellently for that. If you need to look for words in multiple fields, then MongoDB's text search will work. If you need anything more than that, ElasticSearch is the way to go.
A search engine and a database do some fundamentally different things. A good search engine (like ElasticSearch) supports far more elaborate and complex indexing, facets, highlighting etc. In the case of ElasticSearch, you also get your replies 'real-time'. On the other hand, a search engine doesn't return every single document that matches your query. Instead, it will score documents according to how much they match, and return the top scoring ones. When you query a database such as MongoDB, you should expect it to return everything that matches your query.
You can store the entire document in ElasticSearch, but it is usually not the optimal solution. Normally you will have it configured to return the document id's, which you use to fetch the document from a database. MongoDB is a database optimized for document based storage. this is why you hear about people using them together.
edit:
When this was posted, it matched the recommendations, but this may no longer be the case.
Derick's answer pretty much nails it. The questions behind all this is:
What are the features you want to implement in your application?
If you rely on heavy searching capabilities in large chunks of text, ElasticSearch is probably a good thing to use. If you want to have a flexible datastore that can cope with complex ad-hoc queries, Mongo might be a good fit. If you have different requirements for a datastore, it is often a good thing to combine two tools instead of implementing all kind of workarounds to make it work with just one datastore.
Choose the right tool for the job.

Is mongoDB efficient in doing multi-key lookups?

I'm evaluating MongoDB, coming from Membased/memcached because I want more flexibility.
Of course Membase is excellent in doing fast (multi)-key lookups.
I like the additional options that MongoDB gives me, but is it also fast in doing multi-key lookups? I've seen the $or and $in operator and I'm sure I can model it with that. I just want to know if it's performant (in the same league) as Membase.
use-case, e.g., Lucene/Solr returns 20 product-ids. Lookup these product-ids in Couchdb to return docs/ appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids and it worked fine (it was a hack). To my surprise, it worked rather well, in the lower millisecond area.
Of course, it's hard to compare this, and -as usual- theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one index per query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and give you a definite and affirmative answer. If your queries turn out slow, people here and on the news group probably have some ideas how to improve the exact query, or the indexation.
One index per query. It's sometimes thought that queries on multiple
keys can use multiple indexes; this is not the case with MongoDB. If
you have a query that selects on multiple keys, and you want that
query to use an index efficiently, then a compound-key index is
necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is Mongo will be great if your indexes are in memory and you are indexing on the columns you want to query using composite keys. If you have poor indexing then your performance will suffer as a result. This is pretty much in line with most systems.

Extract to MongoDB for analysis

I have a relational database with about 300M customers and their attributes from several perspectives (360).
To perform some analytics I intent to make an extract to a MongoDB in order to have a 'flat' representation that is more suited to apply data mining techniques.
Would that make sense? Why?
Thanks!
No.
Its not storage that would be the concern here, its your flattening strategy.
How and where you store the flattened data is a secondary concern, note MongoDB is a document database and not inherently flat anyway.
Once you have your data in the shape that is suitable for your analytics, then, look at storage strategies, MongoDB might be suitable or you might find that something that allows easy Map Reduce type functionality would be better for analysis... (HBase for example)
It may make sense. One thing you can do is setup MongoDB in a horizontal scale-out setup. Then with the right data structures, you can run queries in parallel across the shards (which it can do for you automatically):
http://www.mongodb.org/display/DOCS/Sharding
This could make real-time analysis possible when it otherwise wouldn't have been.
If you choose your data models right, you can speed up your queries by avoiding any sorts of joins (again good across horizontal scale).
Finally, there is plenty you can do with map/reduce on your data too.
http://www.mongodb.org/display/DOCS/MapReduce
One caveat to be aware of is there is nothing like SQL Reporting Services for MongoDB AFAIK.
I find MongoDB's mapreduce to be slow (however they are working on improving it, see here: http://www.dbms2.com/2011/04/04/the-mongodb-story/ ).
Maybe you can use Infobright's community edition for analytics? See here: http://www.infobright.com/Community/
A relational db like Postgresql can do analytics too (afaik MySQL can't do a hash join but other relational db's can).

NoSQL engines that support dynamic queries?

What "NoSQL" database engines support dynamic / advanced queries in a similar fashion to MongoDB (http://www.mongodb.org/display/DOCS/Advanced+Queries) ?
Specifically interested in options that support ad-hoc querying from a shell or within client languages.
None just use MongoDB ;)
Honestly, it really depends on what type of querying you plan to do. For Key/Value style queries where you plan to just pull up one document at a time, then basically all of the NoSQL DBs are good for this.
When it comes to pulling back "sets" of data or using alternate keys, then MongoDB is probably your best "crossover" here. Many NoSQL DBs have limited querying functions, especially on non-key fields. Of course, that's kind of the point of "Key-Value stores", so Mongo is kind of a mutant here.
The last I checked with Cassandra, there was definitely some "hoop-jumping" involved to really support ad-hoc non-key queries. And CouchDB seems to point to "just Map / Reduce".
That stated, I believe that there is motion from several NoSQL dbs to support such ad-hoc querying mechanism. So this answer could be completely wrong in 2 months :)

CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases. They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
Is that actually how it works? Sounds worse than using a plain RBMS without any indexed keys.
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4) gives the database a simple list of ID's which it uses to find in the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me because I know I must be missing something.
#Xeoncross
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4)
Well then, you'll love MongoDB. MongoDB support indexes so you can index user_id and revision and this query will be able to return relatively quickly.
However, please note that many NoSQL DBs only support Key lookups and don't necessarily support "secondary indexes" so you have to do you homework on this one.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well if you run a query in an SQL-based database and you don't have an index that database will perform a table scan (i.e.: looking through every row).
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
So in practice most NoSQL databases support this. But please never use it for real-time queries. This option is primarily for performing map-reduce operations that are used to summarize data.
Here's maybe a different take on NoSQL. SQL is really good at relational operations, however relational operations don't scale very well. Many of the NoSQL are focused on Key-Value / Document-oriented concepts instead.
SQL works on the premise that you want normalized non-repeated data and that you to grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but that you're willing to wait for data dependent on "big sets" (running map-reduces in the background).
It's a big trade-off, but if makes a lot of sense on modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question) and most of the data is really tied to or "hanging off" that element. So the concept of grabbing everything you need with one query horizontally-scalable query is really useful.
It's the not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the Map function can be Javascript, but it can also be Erlang. (or another language altogether, if you pull in a 3rd Party View Server)
Additionally, Views are calculated incrementally. In other words, the map function is run on all the documents in the database upon creation, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as b-trees, which some RDBMSs use to store their indexes.
Think CouchDB stores the docs in a btree according to the "index" (view) and just walks this tree.. so it's not searching..
see http://guide.couchdb.org/draft/btree.html
You should study them up a bit more. It's not "worse" than and RDMBS it's different ... in fact, given certain domains/functions the "NoSQL" paradigm works out to be much quicker than traditional and in some opinions, outdated, RDMBS implementations. Think Google's Big Table platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish. The primary difference is that most of these NoSQL solutions focus on Key/Value stores (some call these "document" databases) and have limited to no concept of relationships (in the primary/foreign key respect) and joins. Join operations on tables can be very expensive. Also, let's not forget the object relational impedance mismatch issue... You don't need an ORM to access MongoDB. It can actually store your code object (or document) as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I don't know what you mean when you say "Not only SQL" database? It's a NoSQL paradigm - wherein no SQL is used to query the underlying data store of the system. NoSQL also means not an RDBMS which SQL is generally built on top of. Although, MongoDB does has an SQL like syntax that can be used from .NET when retrieving data - it's called NoRM.
I will say I've only really worked with Riak and MongoDB... I'm by no means familiar with Cassandra or CouchDB past a reading level and feature set comprehension. I prefer to use MongoDB over them all. Riak was nice too but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB and Riak as I've found them to be the easiest with more support for .NET based languages. It will just make sense for certain applications. All in all, the NoSQL or Document databse or OODBMS ... whatever you want to call it is very appealing and gaining lots of movement.
I also forgot about your javascript question... MongoDB has JavaScript "bindings" that enable it to be used as one method of searching for data. Riak handles data via a JSON format. MongoDB uses BSON I believe and I can't remember what the others use. In any case, the point is instead of SQL (structured query language) to "ask" the database for information some of these (MongoDB being one) use Javascript and/or RESTful syntax to ask the NoSQL system for data. I believe CouchDB and Riak can be queried over HTTP to which makes them very accessible. Not to mention, that's pretty frickin cool.
Do your research.... download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono