Sphinx + NoSQL Help

So I'm looking to run Sphinx over a NoSQL system such as MongoDB, HBase, Cassandra, etc.
Right now, we're comparing all the NoSQL systems out there. Basically, we need to run full-text queries over 50+ million rows of product data thousands of times a second, so we're trying to find the most efficient NoSQL system.
Here is our question, though. If we use any NoSQL system with Sphinx, when we perform the actual searches, will the search have any interaction with the NoSQL system itself, or will Sphinx be doing the work as it has the data indexed? If it's only Sphinx, then wouldn't the performance of the NoSQL system be only secondary?
Thanks!

Using the recently added string attributes, you can cut the database out of the search path completely; that will be much more efficient.

To my understanding, yes, you can do it. Since I'm only familiar with MongoDB and HBase, I can only speak to this question based on those two databases. You need to do some work on the indexer to build the data/attributes into the Sphinx index file, and to include the primary key (which identifies the unique record in the database) in it as well (for MongoDB that's the ObjectId, for HBase it's the row key). Then, after you do the fulltext search, you can fetch the whole record/attributes from the database by the primary key.
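A minimal, hypothetical Java sketch of that flow, assuming a Sphinx index named products_index that stores each MongoDB ObjectId as a string attribute mongo_id, and a SphinxQL listener on port 9306 (SphinxQL speaks the MySQL wire protocol, so a standard MySQL JDBC driver works); all names here are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

public class SphinxMongoSearch {
    public static void main(String[] args) throws Exception {
        try (Connection sphinx = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306/");
             MongoClient mongo = MongoClients.create("mongodb://127.0.0.1:27017")) {

            MongoCollection<Document> products =
                mongo.getDatabase("shop").getCollection("products");

            // Step 1: the full-text match runs entirely inside Sphinx.
            PreparedStatement stmt = sphinx.prepareStatement(
                "SELECT mongo_id FROM products_index WHERE MATCH(?) LIMIT 20");
            stmt.setString(1, "wireless headphones");

            try (ResultSet rs = stmt.executeQuery()) {
                // Step 2: only the matched ids go back to the store.
                while (rs.next()) {
                    ObjectId id = new ObjectId(rs.getString("mongo_id"));
                    Document doc = products.find(Filters.eq("_id", id)).first();
                    System.out.println(doc);
                }
            }
        }
    }
}
```

Note that the full-text matching itself never touches MongoDB; the store only serves point lookups by primary key, which is exactly why its raw query performance becomes secondary.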
Besides, there is another full-text search engine that supports NoSQL databases very well: Solr. You can try it and see whether its performance satisfies your requirements.

Related

Real Time Searching of a Lucene Index that is Updated Frequently - Is this practical?

I have a query that involves many joins on many tables and takes too long to run. I've been asked to try to use Lucene to speed things up. What I've done is export the query to XML, use Java to parse the XML, use Lucene to index the XML, and create an API in Java to query this index. This reduces the query time 6-10 fold.
However, unless a dedicated VM or machine constantly queries the database, exports the data, and reindexes the data, any end user who uses the API to search the Lucene index will be receiving not-up-to-date data. Even if a machine is dedicated for this purpose, the data will not be up to date on every attempt to search the Lucene index.
I believe that "delta import" for Solr is what I am talking about. I think that is unique to Solr though, not Lucene.
What options exist for Lucene to index data that will change with some frequency, and allow users to search/query in real time? Is this too much to ask from Lucene?
Solr happens to be a search application built on top of Lucene, so any indexing and searching functionality it provides comes from Lucene.
Lucene supports Near real time search - http://wiki.apache.org/lucene-java/NearRealtimeSearch
For your indexing concerns, I would say it depends on your app, which syncs data between your database and Lucene. Lucene can index at a very high throughput. http://people.apache.org/~mikemccand/lucenebench/indexing.html
So your app should be smart enough to figure out the changes made in the database and re-index only that "delta".
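As a hedged illustration of that "delta" approach, here is a small Lucene sketch (field names and the id value are made up) that uses a near-real-time reader opened straight from the IndexWriter, so an updated document becomes searchable without a full commit:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class NrtDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Simulate the app syncing one changed row from the database (the "delta").
        Document doc = new Document();
        doc.add(new StringField("id", "42", Field.Store.YES));
        doc.add(new TextField("body", "updated product description", Field.Store.NO));
        writer.updateDocument(new Term("id", "42"), doc); // upsert keyed on primary key

        // NRT reader opened from the writer: sees the update without writer.commit().
        try (DirectoryReader reader = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("body", "updated")), 10);
            System.out.println("hits: " + hits.totalHits);
        }
        writer.close();
    }
}
```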

NoSQL engines that support dynamic queries?

What "NoSQL" database engines support dynamic / advanced queries in a similar fashion to MongoDB (http://www.mongodb.org/display/DOCS/Advanced+Queries) ?
Specifically interested in options that support ad-hoc querying from a shell or within client languages.
None, just use MongoDB ;)
Honestly, it really depends on what type of querying you plan to do. For key/value-style queries where you plan to pull up just one document at a time, basically all of the NoSQL DBs are good.
When it comes to pulling back "sets" of data or using alternate keys, MongoDB is probably your best "crossover" here. Many NoSQL DBs have limited querying functions, especially on non-key fields. Of course, that's kind of the point of "key-value stores", so Mongo is kind of a mutant here.
The last I checked with Cassandra, there was definitely some "hoop-jumping" involved to really support ad-hoc non-key queries. And CouchDB seems to point to "just Map / Reduce".
That stated, I believe there is motion among several NoSQL DBs to support such ad-hoc querying mechanisms. So this answer could be completely wrong in 2 months :)
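In the meantime, to make the "ad-hoc queries on non-key fields" point concrete, a hypothetical sketch with the MongoDB Java driver (database, collection, and field names made up) where no predefined view is required:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class AdHocQuery {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> posts =
                client.getDatabase("blog").getCollection("posts");

            // Ad-hoc query on non-key fields, composed at call time.
            for (Document d : posts.find(Filters.and(
                        Filters.eq("author", "alice"),
                        Filters.gt("votes", 10)))
                    .sort(Sorts.descending("votes"))
                    .limit(5)) {
                System.out.println(d.toJson());
            }
        }
    }
}
```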

CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases, they search over each record and pass it to a JavaScript function you write, which decides which results to return by looking at each one.
Is that actually how it works? It sounds worse than using a plain RDBMS without any indexed keys.
I built my schemas so they don't require join operations, which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM, and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4) gives the database a simple list of IDs which it uses to find the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me because I know I must be missing something.
@Xeoncross
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4)
Well then, you'll love MongoDB. MongoDB supports indexes, so you can index user_id and revision, and this query will return relatively quickly.
However, please note that many NoSQL DBs only support key lookups and don't necessarily support "secondary indexes", so you have to do your homework on this one.
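As a hedged sketch of what those secondary indexes look like with the MongoDB Java driver, mirroring the WHERE clause from the question (database, collection, and field names are assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class SecondaryIndexes {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> docs =
                client.getDatabase("app").getCollection("documents");

            // Secondary indexes on non-key fields; many pure key-value stores can't do this.
            docs.createIndex(Indexes.ascending("user_id"));
            docs.createIndex(Indexes.ascending("revision"));

            // Both branches of the OR can now be resolved from the indexes.
            for (Document d : docs.find(Filters.or(
                    Filters.in("user_id", 12, 43, 5, 2),
                    Filters.eq("revision", 4)))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```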
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well, if you run a query in a SQL-based database and you don't have an index, the database will perform a table scan (i.e., look through every row).
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
So, in practice, most NoSQL databases do support this. But please never use it for real-time queries; this option is primarily for map-reduce operations that are used to summarize data.
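For example, a summarization job of that kind through the MongoDB Java driver might look like the following (collection and field names are hypothetical); the JavaScript strings run server-side over the whole collection, which is exactly why this is a batch tool rather than a real-time one:

```java
import com.mongodb.client.MapReduceIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MapReduceSummary {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // Server-side JavaScript: total order value per customer.
            MapReduceIterable<Document> totals = orders.mapReduce(
                "function() { emit(this.customer_id, this.total); }",
                "function(key, values) { return Array.sum(values); }");

            for (Document d : totals) {
                System.out.println(d.toJson());
            }
        }
    }
}
```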
Here's maybe a different take on NoSQL. SQL is really good at relational operations; however, relational operations don't scale very well. Many of the NoSQL systems are focused on key-value / document-oriented concepts instead.
SQL works on the premise that you want normalized, non-repeated data and that you grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but that you're willing to wait for anything dependent on "big sets" (running map-reduces in the background).
It's a big trade-off, but it makes a lot of sense for modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question), and most of the data is really tied to or "hanging off" that element. So the concept of grabbing everything you need with one horizontally-scalable query is really useful.
It's not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the map function can be JavaScript, but it can also be Erlang (or another language altogether, if you pull in a third-party view server).
Additionally, Views are calculated incrementally. In other words, the map function is run on all the documents in the database upon creation, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as b-trees, which some RDBMSs use to store their indexes.
I think CouchDB stores the docs in a B-tree according to the "index" (the view) and just walks this tree, so it's not scanning every document.
see http://guide.couchdb.org/draft/btree.html
You should study them a bit more. It's not "worse" than an RDBMS; it's different. In fact, for certain domains/functions the "NoSQL" paradigm works out to be much quicker than traditional, and in some opinions outdated, RDBMS implementations. Think Google's BigTable platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish.
The primary difference is that most of these NoSQL solutions focus on key/value stores (some call these "document" databases) and have limited to no concept of relationships (in the primary/foreign key respect) or joins. Join operations on tables can be very expensive. Also, let's not forget the object-relational impedance mismatch issue... You don't need an ORM to access MongoDB. It can actually store your code object (or document) as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I don't know what you mean by a "Not only SQL" database. It's the NoSQL paradigm, wherein no SQL is used to query the underlying data store of the system. NoSQL also means not an RDBMS, which SQL is generally built on top of. That said, MongoDB can be queried with a LINQ-style syntax from .NET through a driver called NoRM.
I will say I've only really worked with Riak and MongoDB... I'm by no means familiar with Cassandra or CouchDB past a reading level and feature-set comprehension. I prefer MongoDB over them all. Riak was nice too, but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB and Riak, as I've found them to be the easiest, with more support for .NET-based languages. It will just make sense for certain applications. All in all, the NoSQL or document database or OODBMS... whatever you want to call it... is very appealing and gaining lots of momentum.
I also forgot about your JavaScript question... MongoDB has JavaScript "bindings" that enable it to be used as one method of searching for data. Riak handles data via a JSON format. MongoDB uses BSON, I believe, and I can't remember what the others use. In any case, the point is that instead of SQL (structured query language) to "ask" the database for information, some of these (MongoDB being one) use JavaScript and/or RESTful syntax to ask the NoSQL system for data. I believe CouchDB and Riak can be queried over HTTP too, which makes them very accessible. Not to mention, that's pretty frickin cool.
Do your research.... download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: http://www.mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono

Do you need Solr/Lucene for MongoDB, CouchDB and Cassandra?

If you have an RDBMS, you probably have to use Solr to index your relational tables into fully nested documents.
I'm new to non-SQL databases like MongoDB, CouchDB and Cassandra, but it seems to me that the data you save is already in a document structure like the documents saved in Solr/Lucene.
Does this mean that you don't have to use Solr/Lucene when using these databases?
Is it already indexed so that you can do full-text search?
It depends on your needs. They have full-text search. In CouchDB the search is Lucene-based (same as Solr). Unfortunately, this is just a full-text index; if you need complex scoring or DisMax-type searching, you'll likely want the added capabilities of an independent Solr index.
Solr (Lucene) uses an algorithm to return relevant documents for a query. It returns a score to indicate how relevant each document is to the query.
This is different from what a database (relational or not) does, which is returning results that either match a query or don't.
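To see that difference in practice, you can ask Solr to return the relevance score alongside each hit; a minimal SolrJ sketch (the core name, query, and fields are assumptions):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class ScoredSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("title:headphones");
            query.setFields("id", "score"); // request the relevance score explicitly

            // Results come back ranked by score, not just matched/unmatched.
            for (SolrDocument doc : solr.query(query).getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("score"));
            }
        }
    }
}
```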

NoSQL (MongoDB) vs Lucene (or Solr) as your database [closed]

With the NoSQL movement growing around document-based databases, I've looked at MongoDB lately. I have noticed a striking similarity in how items are treated as "documents", just like in Lucene (and for users of Solr).
So, the question: Why would you want to use NoSQL (MongoDB, Cassandra, CouchDB, etc) over Lucene (or Solr) as your "database"?
What I am (and I am sure others are) looking for in an answer is some deep-dive comparisons of them. Let's skip over relational database discussions all together, as they serve a different purpose.
Lucene gives some serious advantages, such as powerful searching and weighting systems. Not to mention facets in Solr (and Solr is being integrated into Lucene soon, yay!). You can use Lucene documents to store IDs and access the documents as such, just like MongoDB. Mix it with Solr, and you now get a WebService-based, load-balanced solution.
You can even throw in a comparison of out-of-proc cache providers such as Velocity or MemCached when talking about similar data storage and the scalability of MongoDB.
The restrictions around MongoDB remind me of using MemCached, but I can use Microsoft's Velocity and get more grouping and list-collection power than MongoDB (I think). You can't get any faster or more scalable than caching data in memory. Even Lucene has a memory provider.
MongoDB (and others) do have some advantages, such as the ease of use of their API. New up a document, create an id, and store it. Done. Nice and easy.
This is a great question, something I have pondered over quite a bit. I will summarize my lessons learned:
You can easily use Lucene/Solr in lieu of MongoDB for pretty much all situations, but not vice versa. Grant Ingersoll's post sums it up here.
MongoDB etc. seem to serve a purpose where there is no requirement for searching and/or faceting. It appears to be a simpler, and arguably easier, transition for programmers detoxing from the RDBMS world. Unless one's used to them, Lucene & Solr have a steeper learning curve.
There aren't many examples of using Lucene/Solr as a datastore, but the Guardian has made some headway and summarizes this in an excellent slide deck, though they too are non-committal about totally jumping on the Solr bandwagon and are "investigating" combining Solr with CouchDB.
Finally, I will offer our experience; unfortunately I cannot reveal much about the business case. We work at the scale of several TB of data in a near-real-time application. After investigating various combinations, we decided to stick with Solr. No regrets thus far (6 months & counting) and we see no reason to switch.
Summary: if you do not have a search requirement, Mongo offers a simple & powerful approach. However if search is key to your offering, you are likely better off sticking to one tech (Solr/Lucene) and optimizing the heck out of it - fewer moving parts.
My 2 cents, hope that helped.
You can't partially update a document in Solr; you have to re-post all of the fields in order to update a document.
And performance matters: if you do not commit, your change to Solr does not take effect, and if you commit every time, performance suffers.
There are no transactions in Solr.
As Solr has these disadvantages, sometimes NoSQL is a better choice.
UPDATE: Solr 4+ supports atomic updates and soft commits. Refer to the latest documentation: https://lucene.apache.org/solr/guide/8_5/
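A hedged SolrJ sketch of those two Solr 4+ features, an atomic ("set") field update plus a soft commit (core, document id, and field names are made up; atomic updates also assume the update log is enabled in the Solr config):

```java
import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            // Atomic update: only the "price" field is re-posted, not the whole document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("price", Collections.singletonMap("set", 19.99));
            solr.add(doc);

            // Soft commit: the change becomes searchable quickly, without the
            // cost of a full hard commit on every write.
            solr.commit(true, true, true); // waitFlush, waitSearcher, softCommit
        }
    }
}
```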
We use MongoDB and Solr together and they perform well. You can find my blog post here, where I described how we use these technologies together. Here's an excerpt:
[...] However, we observe that query performance of Solr decreases when index size increases. We realized that the best solution is to use both Solr and MongoDB together. Then we integrated Solr with MongoDB by storing contents in MongoDB and creating an index using Solr for full-text search. We only store the unique id for each document in the Solr index and retrieve the actual content from MongoDB after searching on Solr. Getting documents from MongoDB is faster than Solr because there are no analyzers, scoring, etc. [...]
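A hypothetical Java sketch of that pattern, searching Solr for ids only and hydrating the documents from MongoDB (core, database, collection, and field names are all assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.bson.Document;
import org.bson.types.ObjectId;

public class SolrMongoHybrid {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/contents").build();
             MongoClient mongo = MongoClients.create("mongodb://127.0.0.1:27017")) {

            MongoCollection<Document> contents =
                mongo.getDatabase("cms").getCollection("contents");

            // Step 1: full-text search in Solr; the index stores only the Mongo id.
            SolrQuery query = new SolrQuery("body:earthquake");
            query.setFields("id");

            // Step 2: fetch the actual content from MongoDB by primary key.
            for (SolrDocument hit : solr.query(query).getResults()) {
                ObjectId id = new ObjectId((String) hit.getFieldValue("id"));
                System.out.println(contents.find(Filters.eq("_id", id)).first());
            }
        }
    }
}
```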
Also, please note that some people have integrated Solr/Lucene into Mongo by having all indexes stored in Solr and also monitoring oplog operations and cascading relevant updates into Solr.
With this hybrid approach, you can really have the best of both worlds, with capabilities such as full-text search and fast reads from a reliable datastore that can also have blazing write speed.
It's a bit technical to set up, but there are lots of oplog tailers that can integrate into Solr. Check out what Rangespan did in this article.
http://denormalised.com/home/mongodb-pub-sub-using-the-replication-oplog.html
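A rough sketch of such an oplog tailer with the MongoDB Java driver (this assumes a replica set, since only replica sets keep an oplog; what you do with each entry, e.g. pushing it into Solr, is left as a stub):

```java
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class OplogTailer {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            // The oplog is a capped collection in the "local" database.
            MongoCollection<Document> oplog =
                client.getDatabase("local").getCollection("oplog.rs");

            // Tailable-await cursor: blocks and streams new entries as they arrive.
            try (MongoCursor<Document> cursor = oplog
                    .find(Filters.in("op", "i", "u", "d")) // inserts, updates, deletes
                    .cursorType(CursorType.TailableAwait)
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document entry = cursor.next();
                    // Stub: cascade this change into Solr (add/update/delete by id).
                    System.out.println(entry.get("op") + " on " + entry.get("ns"));
                }
            }
        }
    }
}
```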
From my experience with both, Mongo is great for simple, straightforward usage. The main Mongo disadvantage we've suffered is poor performance on unanticipated queries (you cannot create Mongo indexes for all the possible filter/sort combinations; you simply can't).
And here is where Lucene/Solr prevails big time, especially with FilterQuery caching: performance is outstanding.
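For reference, the filter-query trick from SolrJ: fq clauses don't affect scoring and their document sets are cached in Solr's filterCache, so repeated filters stay fast (core and field names are hypothetical):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FilteredSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("name:camera"); // scored main query
            // Each fq is cached independently in the filterCache, so reusing
            // the same filter on later queries is close to free.
            query.addFilterQuery("category:electronics", "in_stock:true");
            System.out.println(solr.query(query).getResults().getNumFound());
        }
    }
}
```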
Since no one else mentioned it, let me add that MongoDB is schema-less, whereas Solr enforces a schema. So, if the fields of your documents are likely to change, that's one reason to choose MongoDB over Solr.
@mauricio-scheffer mentioned Solr 4 - for those interested, LucidWorks describes Solr 4 as "the NoSQL Search Server" and there's a video at http://www.lucidworks.com/webinar-solr-4-the-nosql-search-server/ where they go into detail on the NoSQL(ish) features. (The -ish is because their version of schemaless is actually a dynamic schema.)
If you just want to store data in key-value format, Lucene is not recommended, because its inverted index will waste too much disk space. And with the data saved on disk, its performance is much slower than NoSQL databases such as Redis, because Redis keeps data in RAM. The biggest advantage of Lucene is that it supports many kinds of queries, so fuzzy queries can be supported.
MongoDB Atlas will have a Lucene-based search engine soon. The big announcement was made at this week's MongoDB World 2019 conference. This is a great way to encourage more usage of their high-revenue MongoDB Atlas product.
I was hoping to see it rolled into the MongoDB Enterprise version 4.2 but there's been no news of bringing it to their on-prem product line.
More info here: https://www.mongodb.com/atlas/full-text-search
Third-party solutions, like a Mongo oplog tailer, are attractive. Some thoughts and questions remain about whether the solutions could be tightly integrated, assuming a development/architecture perspective. I don't expect to see a tightly integrated solution for these features, for a few reasons (somewhat speculative, subject to clarification, and not up to date with development efforts):
mongo is c++, lucene/solr are java
maybe lucene could use some mongo libs
maybe mongo could rewrite some lucene algorithms, see also:
http://clucene.sourceforge.net/
http://lucy.apache.org/
lucene supports various doc formats
mongo is focused on JSON (BSON)
lucene uses immutable documents
single field updates are an issue, if they are available
lucene indexes are immutable with complex merge ops
mongo queries are javascript
mongo has no text analyzers / tokenizers (AFAIK)
mongo doc sizes are limited, that might go against the grain for lucene
mongo aggregation ops may have no place in lucene
lucene has options to store fields across docs, but that's not the same thing
solr somehow provides aggregation/stats and SQL/graph queries