Real-Time Searching of a Lucene Index that is Updated Frequently - Is this Practical?

I have a query that involves many joins across many tables and takes too long to run. I've been asked to try using Lucene to speed things up. What I've done is export the query results to XML, parse the XML with Java, index it with Lucene, and build a Java API to query that index. This reduces the query time 6-10 fold.
However, unless a dedicated VM or machine constantly queries the database, exports the data, and re-indexes it, any end user who searches the Lucene index through the API will be receiving stale data. Even with a machine dedicated to this purpose, the data will not be up to date on every search of the Lucene index.
I believe that "delta import" for Solr is what I am talking about. I think that is unique to Solr though, not Lucene.
What options exist for Lucene to index data that will change with some frequency, and allow users to search/query in real time? Is this too much to ask from Lucene?

Solr is a search application built on top of Lucene, so any indexing and searching functionality it provides comes from Lucene.
Lucene supports near-real-time search - http://wiki.apache.org/lucene-java/NearRealtimeSearch
As for your indexing concern, it depends on the app that syncs data between your database and Lucene. Lucene itself can index at very high throughput: http://people.apache.org/~mikemccand/lucenebench/indexing.html
So your app should be smart enough to detect the changes made in the database and re-index only that "delta".
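To make that concrete, here is a minimal sketch of the near-real-time pattern in recent Java Lucene (8.x/9.x APIs assumed; the index path, field names and id value are made up): a SearcherManager gives searchers a fresh view of an IndexWriter without closing it, and updateDocument re-indexes only the changed row, keyed by its primary key.
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class NearRealTimeDelta {
    public static void main(String[] args) throws Exception {
        // Index path and field names below are illustrative only.
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/report-index")),
                new IndexWriterConfig(new StandardAnalyzer()));
        // Searchers obtained from the manager see the writer's latest changes.
        SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

        // "Delta" indexing: replace only the row that changed, keyed by primary key.
        Document doc = new Document();
        doc.add(new StringField("id", "42", Field.Store.YES));
        doc.add(new TextField("body", "updated row content", Field.Store.YES));
        writer.updateDocument(new Term("id", "42"), doc);

        manager.maybeRefresh();  // make the change visible without reopening the index

        IndexSearcher searcher = manager.acquire();
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 10);
            System.out.println(hits.totalHits);
        } finally {
            manager.release(searcher);
        }
    }
}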

Related

ElasticSearch & Mongo

Very newbie question, I assume... I started playing around with ES and MongoDB, and I'm trying to move data out of a SQL DB as an exercise.
I can't help but wonder, what data would I store in Mongo and what in ES? Can I store everything in ES? Assume big data load, as in price trends.
To begin with, MongoDB is what is called a document store. The key feature of this concept is that it stores schema-dynamic documents:
Each record in a document collection can have a different structure
The types of fields can differ from record to record
Document properties (columns) can have nested structures
It's not schema-free, it's schema-dynamic (or flexible schema). To get into the concept, you can find a great tutorial here: https://docs.mongodb.org/manual/data-modeling/
MongoDB is the most widely used document store - please, see http://db-engines.com/en/system/MongoDB.
It has "drivers" for most programming languages, enabling rapid development. You can dive into Mongo quite quickly, there are a lot of tutorials and official Mongo University - a great course for developers and DBAs.
In short, it supports indexing, aggregations, filters, load balancing, sharding, replication (replica sets), etc. Data is stored and transferred in the BSON format (http://bsonspec.org/).
A good comparison of MongoDB vs RDBMS concepts can be found in this official reference: https://docs.mongodb.org/manual/reference/sql-comparison/
What is it good for? It enables agile development where the schema can change over time, and suits form-based data, user-generated content, location-based data, user profiles and more. It also allows storing fairly large documents (up to 16 MB each).
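As a small illustration of the flexible schema (a sketch using the MongoDB Java sync driver; the database, collection and field names are made up), two documents with different shapes can live in the same collection:
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class FlexibleSchemaExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical database and collection names.
            MongoCollection<Document> prices =
                    client.getDatabase("market").getCollection("price_trends");
            // A flat record...
            prices.insertOne(new Document("symbol", "ACME").append("price", 12.5));
            // ...and a record with extra fields and a nested array, in the same collection.
            prices.insertOne(new Document("symbol", "GLOBEX")
                    .append("currency", "EUR")
                    .append("history", Arrays.asList(
                            new Document("ts", "2015-01-01").append("price", 10.1),
                            new Document("ts", "2015-01-02").append("price", 10.4))));
        }
    }
}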
Now, Elasticsearch is not a database. It is a search engine with some great aggregation capabilities (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html - make sure you check out Metrics, Buckets and Pipeline aggregations).
A typical RDBMS is not designed for full-text search or loosely structured data. Queries in ES can return results much faster than the equivalent query in a database (e.g. seconds in an RDBMS versus milliseconds in ES). Keep in mind that the key is to design your indexes well, and that they will take up disk space.
There is a very detailed article comparing both in regards to performance, you may find it useful: http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html.
You can actually use both successfully - MongoDB stores your data, while ES is used as the serving layer (search, aggregations, etc.).
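To make the serving-layer role a bit more concrete, here is a minimal aggregation request against Elasticsearch's REST API (a sketch using Java's built-in HTTP client; the index name "prices" and the field "price" are assumptions). It computes an average without returning any documents:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsAvgPrice {
    public static void main(String[] args) throws Exception {
        // size:0 -> skip the hits, return only the aggregation result.
        String body = "{\"size\":0,\"aggs\":{\"avg_price\":{\"avg\":{\"field\":\"price\"}}}}";
        HttpRequest request = HttpRequest.newBuilder()
                // "prices" is a hypothetical index name.
                .uri(URI.create("http://localhost:9200/prices/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // look for aggregations.avg_price.value
    }
}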
There is a big difference between MongoDB and ES.
MongoDB is a database designed to store data and query it, while Elasticsearch is a Lucene-based indexer in which you should only index data for searching - you should not treat Elasticsearch as your system of record. Even though you can use store: true in Elasticsearch, it is not recommended and I wouldn't rely on it for important data.

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need info on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries inserted after the integration?
If we need to periodically update Solr, it becomes extra overhead to maintain the data in Solr as well as in MongoDB. What are the best approaches to overcome this?
As far as I know there is no official (supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas/directions.
For me the best approach, when it is possible to modify the application, is to have the persistence layer perform all write operations against MongoDB and Solr at (roughly) the "same" time. That way you control exactly what you send to the database and what you index for full-text search. But as I said, this means changing your application code (you will have to change it anyway to be able to query Solr when needed). And yes, you have to index all the existing documents the first time.
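A rough sketch of that dual-write idea in the persistence layer, using the MongoDB Java driver and SolrJ (the Solr core, collection and field names here are invented, and real code would need error handling/retries to keep the two stores in sync):
import org.bson.Document;
import com.mongodb.client.MongoCollection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualWriteRepository {
    private final MongoCollection<Document> articles;  // system of record (hypothetical collection)
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

    public DualWriteRepository(MongoCollection<Document> articles) {
        this.articles = articles;
    }

    // Write to MongoDB first, then index only the searchable fields in Solr.
    public void save(String id, String title, String bodyText) throws Exception {
        articles.insertOne(new Document("_id", id)
                .append("title", title)
                .append("body", bodyText));
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);      // keep Mongo's _id so hits can be resolved back
        doc.addField("title", title);
        doc.addField("body", bodyText);
        solr.add(doc);
        solr.commit();               // in production prefer autoCommit/commitWithin
    }
}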
You can use a "connector" approach where MongoDB and Solr are kind of connected together, this could be done in various ways.
You can use for example the MongoDB Connector available here : https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here: http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it, so I cannot comment on it, but it is another approach).
Your point #2 is true: you have to manage two clusters, make sure the data stays in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB... So you need to decide whether the best approach for your application is MongoDB alone or MongoDB with Solr (see the comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval"; I'm not sure that alone should be the reason. If you write correct queries with the right indexes in MongoDB, you should be able to get there without Solr. Adding Solr makes sense if your requirement is really oriented towards its strengths, namely full-text indexing (with all the related features).
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify whether your need fits within MongoDB or you need to spill over to Solr.
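For example, a text index and a $text query with the MongoDB Java driver look roughly like this (a sketch; the "shop" database, "products" collection and field names are made up):
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class MongoTextSearchExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical database, collection and field names.
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products");

            // One text index per collection; it can span several fields.
            products.createIndex(Indexes.compoundIndex(
                    Indexes.text("name"), Indexes.text("description")));

            // $text query: matches stemmed terms across the indexed fields.
            for (Document d : products.find(Filters.text("wireless keyboard"))) {
                System.out.println(d.toJson());
            }
        }
    }
}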
About the indexing part: how often is your data updated? If you can live with infrequent updates, then a batch job that re-indexes once a day may work for you. Ideally Solr works well for some form of master data.

CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases, they search over each record and pass it to a JavaScript function you write, which decides, by looking at each one, which results to return.
Is that actually how it works? That sounds worse than using a plain RDBMS without any indexed keys.
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4) gives the database a simple list of ID's which it uses to find in the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me because I know I must be missing something.
@Xeoncross
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4)
Well then, you'll love MongoDB. MongoDB supports indexes, so you can index user_id and revision, and this query will return relatively quickly.
However, please note that many NoSQL DBs only support key lookups and don't necessarily support "secondary indexes", so you have to do your homework on this one.
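For illustration, the equivalent of that indexed-column lookup with the MongoDB Java driver might look like this (a sketch; the database, collection and field names are assumptions):
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class SecondaryIndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical database and collection.
            MongoCollection<Document> pages =
                    client.getDatabase("wiki").getCollection("pages");

            // Secondary indexes, analogous to indexed int columns in an RDBMS.
            pages.createIndex(Indexes.ascending("user_id"));
            pages.createIndex(Indexes.ascending("revision"));

            // Equivalent of: WHERE user_id IN (12, 43, 5, 2) OR revision = 4
            for (Document d : pages.find(Filters.or(
                    Filters.in("user_id", Arrays.asList(12, 43, 5, 2)),
                    Filters.eq("revision", 4)))) {
                System.out.println(d.toJson());
            }
        }
    }
}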
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well, if you run a query in an SQL-based database and you don't have an index, that database will perform a table scan (i.e. look through every row).
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
In practice most NoSQL databases support this, but please never use it for real-time queries. This option is primarily for map-reduce jobs that summarize data.
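As an illustration of that kind of JavaScript-driven summarization, here is a sketch using the MongoDB Java driver's mapReduce helper (the collection and field names are made up, and note that map-reduce has been deprecated in recent MongoDB releases in favour of the aggregation pipeline):
import org.bson.Document;
import com.mongodb.client.MongoCollection;

public class MapReduceExample {
    // Count documents per author with server-side JavaScript map/reduce.
    public static void countByAuthor(MongoCollection<Document> posts) {
        // "author" is a hypothetical field name.
        String map = "function() { emit(this.author, 1); }";
        String reduce = "function(key, values) { return Array.sum(values); }";

        // Runs the JavaScript over the whole collection on the server; fine for
        // batch summaries, far too slow for per-request queries.
        for (Document d : posts.mapReduce(map, reduce)) {
            System.out.println(d.toJson());
        }
    }
}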
Here's maybe a different take on NoSQL. SQL is really good at relational operations, but relational operations don't scale very well. Many of the NoSQL systems are focused on key-value / document-oriented concepts instead.
SQL works on the premise that you want normalized, non-repeated data and that you grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but are willing to wait for answers that depend on "big sets" (running map-reduces in the background).
It's a big trade-off, but it makes a lot of sense for modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question) and most of the data is really tied to or "hanging off" that element. So the concept of grabbing everything you need with one horizontally-scalable query is really useful.
It's not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the map function can be JavaScript, but it can also be Erlang (or another language altogether, if you pull in a third-party view server).
Additionally, Views are calculated incrementally. In other words, the map function is run on all the documents in the database upon creation, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as b-trees, which some RDBMSs use to store their indexes.
I think CouchDB stores the view output in a B-tree according to that "index" (the view) and just walks the tree - so it's not searching every document at query time.
see http://guide.couchdb.org/draft/btree.html
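To see what that looks like in practice, here is a sketch that defines and queries such a view over CouchDB's HTTP API using Java's built-in HTTP client (the database name "articles", the design document "app" and the doc.type field are made up):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchViewExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Define a view: the JavaScript map function emits key/value pairs that
        // CouchDB stores in a B-tree, updated incrementally - not re-run per query.
        String designDoc = "{\"views\":{\"by_type\":{"
                + "\"map\":\"function(doc){ if (doc.type) emit(doc.type, 1); }\"}}}";
        http.send(HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/articles/_design/app"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(designDoc))
                .build(), HttpResponse.BodyHandlers.ofString());

        // Querying the view is a key lookup in that B-tree, not a scan of every document.
        HttpResponse<String> rows = http.send(HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/articles/_design/app/_view/by_type?key=%22post%22"))
                .build(), HttpResponse.BodyHandlers.ofString());
        System.out.println(rows.body());
    }
}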
You should study them up a bit more. It's not "worse" than an RDBMS, it's different... in fact, for certain domains/functions the "NoSQL" paradigm works out to be much quicker than traditional (and, in some opinions, outdated) RDBMS implementations. Think of Google's BigTable platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish.
The primary difference is that most of these NoSQL solutions focus on key/value stores (some call these "document" databases) and have limited to no concept of relationships (in the primary/foreign key sense) or joins. Join operations on tables can be very expensive. Also, let's not forget the object-relational impedance mismatch issue... you don't need an ORM to access MongoDB. It can actually store your code object (or document) as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I'm not sure what you mean by "Not only SQL" database. It's the NoSQL paradigm, wherein no SQL is used to query the underlying data store of the system. NoSQL also means not an RDBMS, which SQL is generally built on top of. Although MongoDB does have an SQL-like syntax that can be used from .NET when retrieving data - it's called NoRM.
I will say I've only really worked with Riak and MongoDB... I'm by no means familiar with Cassandra or CouchDB beyond a reading level and feature-set comprehension. I prefer MongoDB over them all; Riak was nice too, but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB and Riak, as I've found them to be the easiest, with the most support for .NET-based languages. It will just make sense for certain applications. All in all, the NoSQL (or document database, or OODBMS... whatever you want to call it) approach is very appealing and gaining lots of momentum.
I also forgot about your JavaScript question... MongoDB has JavaScript "bindings" that can be used as one way of searching for data. Riak handles data via a JSON format; MongoDB uses BSON, I believe, and I can't remember what the others use. In any case, the point is that instead of SQL (Structured Query Language) to "ask" the database for information, some of these systems (MongoDB being one) use JavaScript and/or RESTful syntax to query the NoSQL store. I believe CouchDB and Riak can be queried over HTTP too, which makes them very accessible. Not to mention, that's pretty frickin' cool.
Do your research.... download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono

Sphinx + NoSQL Help

So I'm looking to run Sphinx over a NoSQL system such as MongoDB, HBase, Cassandra, etc.
Right now, we're comparing all the NoSQL systems out there. Basically, we need to query 50+ Million rows of product data with fulltext searches thousands of times a second, so we're trying to find the most efficient NoSQL system.
Here is our question, though. If we use any NoSQL system with Sphinx, when we perform the actual searches, will the search have any interaction with the NoSQL system itself, or will Sphinx be doing the work as it has the data indexed? If it's only Sphinx, then wouldn't the performance of the NoSQL system be only secondary?
Thanks!
Using the (relatively new) string attributes, you can cut the database out of the search path completely; that will be much more efficient.
In my understanding, yes, you can do it. Since I'm only familiar with MongoDB and HBase, I can only answer based on those two databases. You need to do some work in the indexer to build the data/attributes into the Sphinx index file, and to include the primary key (which identifies the unique record in the database) as well - for MongoDB that's the ObjectId, for HBase it's the row key. Then, after you do the full-text search, you can fetch the whole data/attributes from the database by primary key.
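A sketch of that flow for MongoDB (all names here - the "products_idx" index, the "mongo_id" string attribute, the ports - are assumptions; Sphinx speaks SphinxQL over the MySQL wire protocol, so a plain JDBC connection can be used):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.bson.Document;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class SphinxThenMongo {
    // Full-text search in Sphinx, then fetch the full documents from MongoDB by primary key.
    public static List<Document> search(MongoCollection<Document> products, String terms)
            throws Exception {
        List<String> ids = new ArrayList<>();
        // SphinxQL listens on the MySQL protocol (default port 9306), so a MySQL JDBC
        // driver works; "products_idx" and "mongo_id" are made-up names, and the search
        // terms should be escaped/parameterized in real code.
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:9306");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT mongo_id FROM products_idx WHERE MATCH('" + terms + "') LIMIT 50")) {
            while (rs.next()) {
                ids.add(rs.getString("mongo_id"));
            }
        }
        // The search itself never touches MongoDB; it is only consulted afterwards
        // to hydrate the matching rows by primary key.
        return products.find(Filters.in("_id", ids)).into(new ArrayList<>());
    }
}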
Besides, there is another full-text search engine that supports NoSQL databases very well: Solr. You can try it and see whether its performance satisfies your requirements.

Search multiple indices at once using Lucene Search

I am using Zend_Search_Lucene to implement site search. I created separate indices for different data types (e.g. one for users, one for posts etc). The results are similarly divided by data type; however, there is an 'all' option which should show a combination of the different result types. Is it possible to search across the different indices at once, or do I have to index everything in an 'all' index?
Update: the README for ZF 1.8 suggests this is now possible, but I've been unable to track down where it is covered in the documentation.
So, after some research: you have to use Zend_Search_Lucene_Interface_MultiSearcher. I don't see any mention of it in the documentation as of this writing, but if you look at the actual class in ZF 1.8 it's straightforward to use:
// Combine several physical indexes behind one searcher
$index = new Zend_Search_Lucene_Interface_MultiSearcher();
$index->addIndex(Zend_Search_Lucene::open('search/index1'));
$index->addIndex(Zend_Search_Lucene::open('search/index2'));
$hits = $index->find('someSearchQuery');
NB: it doesn't follow the PEAR naming convention, so it won't work with Zend_Loader::loadClass.
That's exactly how I handled search for huddler.com. I used multiple Zend_Search_Lucene indexes, one per data type. For the "all" option, I simply had another index which included everything from all the indexes -- so when I added docs to the index, I added them twice: once to the appropriate "type" index, and once to the "all" index. Zend Lucene is severely under-featured compared to other Lucene implementations, so this was the best solution I found. You'll find that Zend's port supports only a subset of the Lucene query syntax, and poorly -- even on moderate indexes (10-100 MB), queries as simple as "a*" or quoted phrases fail to perform adequately (if at all).
When we brought a large site onto our platform, we discovered that Zend Lucene doesn't scale. Our index reached roughly 1.0 GB, and simple queries took up to 15 seconds. Some queries took a minute or longer. And building the index from scratch took about 20 hours.
I switched to Solr; Solr not only performs 50x faster during indexing and 1000x faster for many queries (most queries finish in < 5 ms, all finish in < 100 ms), it's also far more powerful. We were also able to rebuild our 100,000+ document index from scratch in 30 minutes (down from 20 hours).
Now, everything's in one Solr index with a "type" field; I run multiple queries against the index for each search, each one with a different "type:" filter query, and one without a "type:" for the "all" option.
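In SolrJ terms, that looks roughly like this (a sketch; the core name, query text and "type" field values are placeholders):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TypedSearchExample {
    public static void main(String[] args) throws Exception {
        // "site" is a hypothetical core name.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/site").build();

        // Same query text, restricted per tab with a filter query on "type";
        // the "all" tab simply omits the filter.
        SolrQuery userQuery = new SolrQuery("zend lucene");
        userQuery.addFilterQuery("type:user");
        SolrQuery allQuery = new SolrQuery("zend lucene");   // no type filter

        QueryResponse users = solr.query(userQuery);
        QueryResponse all = solr.query(allQuery);
        System.out.println(users.getResults().getNumFound() + " users, "
                + all.getResults().getNumFound() + " total");
    }
}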
If you plan on growing your index to 100+ MB, expect at least a few search requests per minute, or want to offer any sort of advanced search functionality, I strongly recommend abandoning Zend_Search_Lucene.
I don't know how it integrates with Zend, but in (Java) Lucene one would use a MultiSearcher instead of the usual IndexSearcher.
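For reference, MultiSearcher has since been removed from Java Lucene; in current versions the equivalent is to wrap the individual readers in a MultiReader. A minimal sketch, with made-up index paths and field name:
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical index directories, mirroring the Zend example above.
        DirectoryReader users = DirectoryReader.open(FSDirectory.open(Paths.get("search/index1")));
        DirectoryReader posts = DirectoryReader.open(FSDirectory.open(Paths.get("search/index2")));

        // One searcher over both indexes; results are merged transparently.
        IndexSearcher searcher = new IndexSearcher(new MultiReader(users, posts));
        TopDocs hits = searcher.search(
                new QueryParser("body", new StandardAnalyzer()).parse("someSearchQuery"), 10);
        System.out.println(hits.totalHits);
    }
}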