Zend: index generation and the pros and cons of Zend_Search_Lucene - zend-framework

I've never come across an app/class like Zend_Search_Lucene before, as I've always just queried my database.
Zend_Search_Lucene operates with documents as atomic objects for indexing. A document is divided into named fields, and fields have content that can be searched.
A document is represented by the Zend_Search_Lucene_Document class, and objects of this class contain instances of Zend_Search_Lucene_Field that represent the fields on the document.
It is important to note that any information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.
So this is basically saying that I can apply this to anything, including databases; the key thing here is building indexes for searching.
What I'm trying to grasp is where exactly I should store the indexes in my application. Let's say, for example, we have phones stored in a database, along with manufacturers and models - how should I organize the indexes?
If I'm making indexes of users with, say, addresses, I obviously wouldn't want them to be publicly viewable. I'm just confused about how it all fits together, whether there are known disadvantages, and any gotchas I should know about while using it.

A Lucene index is stored outside the database. I'd store it in a "data" directory as a sister to your controllers, models, and views. But you can store it anywhere; you just need to specify the path when you open the index for querying.
It's basically a redundant copy of the documents stored in your database, and you have to keep them in sync yourself. That's one of the disadvantages: you have to write code to populate the Lucene index based on results of a query against your database. As you add data to the database, you have to update your Lucene index as well.
An advantage of using an external full-text index solution is that you can reduce the workload on your RDBMS. To find a document, you execute a search using the Lucene API. The result should include a field containing the primary key value (stored as part of the document, but there is no need to make it an analyzed field for full-text search). You get this field back when you do a Lucene search, so you can look up the corresponding row in the database.
Does that help answer your question?
I gave a presentation recently for MySQL University comparing full-text search solutions:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
I also publish my slides at http://www.SlideShare.net/billkarwin.

Related

Multi-Criteria search with CouchDB

How do I get data from CouchDB, filtering over multiple fields?
For example, if I have a person database with fields like Name, State, Country, etc., and a search form on a web page, how do I get the data from CouchDB, considering only the non-null conditions?
In SQL, I would append conditions to the WHERE clause (WHERE Person.Name = "John" AND Person.State IN ("NY","CA")), but how do I frame this query as a CouchDB view?
In CouchDB you use map/reduce views. In SQL you have to say explicitly which fields an index will be created on; in CouchDB you write a custom function that creates the index, so it can be more specific to your needs. If you want an index for something as simple as a search on the name, state and country fields, the view is just a map function:
function (doc) {
  if (doc.name && doc.state && doc.country) {
    emit([doc.name, doc.state, doc.country], doc);
  }
}
To search using this view, you search for the key ["my_name", "my_state", "my_country"]. You can also use it for querying with a subset of name, state and country, as long as they form a prefix of the emitted array (for example, searching by name but not by state and country), because the result of the map is sorted lexicographically by key.
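For concreteness, here is a minimal sketch (not from the answer) of querying that view over HTTP; the database name "people", the design doc "_design/people" and the view name "by_name_state_country" are assumptions for illustration.
// Query the view above with fetch (Node 18+ or a browser).
const base =
  "http://localhost:5984/people/_design/people/_view/by_name_state_country";

async function demo() {
  // Exact match: the key must be the full emitted array.
  const exact = await fetch(
    base + "?key=" + encodeURIComponent(JSON.stringify(["John", "NY", "USA"]))
  ).then(res => res.json());

  // Prefix match on name only: request the key range ["John"] .. ["John", {}].
  // This works because view rows are sorted by key and {} collates after strings.
  const byName = await fetch(
    base +
      "?startkey=" + encodeURIComponent(JSON.stringify(["John"])) +
      "&endkey=" + encodeURIComponent(JSON.stringify(["John", {}]))
  ).then(res => res.json());

  console.log(exact.rows, byName.rows);
}

demo();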
In principle, a view is an index with some query capabilities, though not as flexible as SQL queries. Views are built once and stored on disk, and are incrementally recalculated for new or modified data. Keep in mind that things which are inefficient in a distributed system (which CouchDB is designed for) are hard to do: more complicated joins, searching without an index, and so on. On the other hand, the artificial division of data into tables required by the relational model is often unnecessary when structured documents are available, and some of the joins are simply not needed.
For a brief comparison of CouchDB vs. SQL, see this chapter of the CouchDB: The Definitive Guide book; other chapters and the official wiki have more information about views.
You need to create a view whose key contains [doc.name, doc.state]. This is described well enough in the documentation. The real problem is how to select persons from two arbitrary states.
Here is a good article answering this question: Using Multiple Start and End Keys for CouchDB Views
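A rough sketch of that idea, assuming (purely for illustration) a view "by_state_name" that emits [doc.state, doc.name]: query each state's key range separately and merge the rows in the client.
// Hypothetical view "by_state_name"; database and design-doc names are made up.
const view =
  "http://localhost:5984/people/_design/people/_view/by_state_name";

function personsInState(state) {
  const qs =
    "?startkey=" + encodeURIComponent(JSON.stringify([state])) +
    "&endkey=" + encodeURIComponent(JSON.stringify([state, {}]));
  return fetch(view + qs)
    .then(res => res.json())
    .then(data => data.rows);
}

// "Persons in NY or CA" becomes two range queries merged client-side.
Promise.all([personsInState("NY"), personsInState("CA")])
  .then(([ny, ca]) => console.log([...ny, ...ca].map(row => row.value)));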

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc. Each of these types has its own schema plus some common information. For instance:
comment: {"_id": (mongoId), "type": "comment", "date": 4/7/2012, "user": "Franck", "text": "This is a sample comment"}
search: {"_id": (mongoId), "type": "search", "date": 4/6/2012, "user": "Franck", "query": "mongodb"}
etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schemaless, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (a collection Actions plus a collection Comments with a link key to its parent Action, etc.).
My question is: what about performance / response time if I try to search by specific fields?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the fields "type" + "query". But that index would not cover the whole data set, only the documents of type "search".
Will the MongoDB engine scan the whole collection, or only the documents having this specific schema?
If you create a sparse index, Mongo will ignore any documents that don't have the indexed key. Sparse indexes do have the specific limitation that they can only index one field, though.
However, if you are only going to query using common fields, there's absolutely no reason not to use a single collection.
I.e. if an index on user + type (or date + user + type) will satisfy all your querying needs, there's no reason to create multiple collections.
Tip: use date objects for dates, and use ObjectIds rather than names where appropriate.
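For concreteness, a rough mongo-shell sketch of the single-collection approach discussed above (the field names come from the question, the collection is called "actions" here, and the exact index choices are only examples):
// Compound index that serves "every user searching for mongodb":
db.actions.ensureIndex({ type: 1, query: 1 });

// Sparse variant: only documents that actually have a "query" field
// (i.e. the "search" actions) are included in the index.
db.actions.ensureIndex({ query: 1 }, { sparse: true });

// With either index, this query walks the matching index range rather than
// scanning every document in the collection:
db.actions.find({ type: "search", query: "mongodb" });

// A broader index on common fields, if all queries filter by user and type:
db.actions.ensureIndex({ user: 1, type: 1, date: -1 });
db.actions.find({ user: "Franck", type: "comment" }).sort({ date: -1 });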
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media documents, such as video, consider using GridFS, a convention implemented by all the drivers that stores the binary data across many smaller documents.

MongoDB - One Collection Using Indexes

OK, so the more I develop in MongoDB, the more I start to wonder about the need for multiple collections vs. having one large collection with indexes (since columns and fields can be different for each document, unlike tabular data). If I am trying to develop in the most efficient way possible (meaning less code and more reusable code), can I use one collection for all documents and just index on a field? By having all documents in one collection with indexes, I can reuse all my form processing code and other code, since it will all be inserting into the same collection.
For Example:
Let's say I am developing a contact manager and I have two types of contacts: "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses, but that was because I'm used to developing in SQL, where this would be appropriate since the columns would be different for each table. The more I thought about the flexibility of document DBs, the more I started to ask, "do I really need two collections for this?" If I just add a field to each document called "contact type" and index on that, do I really need two collections? Since the fields/columns in each document do not have to be the same for all documents (as they do in SQL), each document can have its own fields, as long as I have a "document type" field and an index on that field.
So then I took that concept further: if I only need one collection for "individuals" and "businesses", do I even need a separate collection for "Users" or "Contact History" or any other data? In theory, couldn't I build the entire solution in one collection and just have a field in each document that specifies its "type" ("Users", "Individual Contact", "Business Contacts", "Contact History", etc.) and index on it, and if a document is related to another document, index on the "parent/foreign key" id field?
This would allow me to code the front end dynamically, since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding, but I want to make sure that, by using indexes and secondary indexes, the DB would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands or even millions of documents in this collection as the user base grows, but it would have indexes and secondary indexes to optimize performance.
My question is: is this a common method MongoDB developers use? Why or why not? What are the drawbacks, if any? If this is a commonly used method, please also give any positives to using it. Thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough metadata associated with them that they could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties); it would be a bad idea to make Name a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions - in the form of atomic aggregates. When you insert an object into Mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
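As a purely illustrative sketch (field names invented), documents following that advice might look like this in the mongo shell: the Name object stays embedded, businesses live in their own collection, and contacts are referenced by id with a small duplicated preview for quick display.
// Hypothetical documents; field names are not from the question.
db.businesses.insert({
  _id: ObjectId(),
  name: "Acme Corp",
  address: { street: "1 Main St", city: "Springfield" }
});

db.individuals.insert({
  _id: ObjectId(),
  name: { first: "John", last: "Smith" },        // embedded, not a collection
  businessIds: [ /* ObjectIds of related businesses (many-to-many) */ ],
  contactIds: [ /* ObjectIds of other individuals */ ],
  contactPreviews: [                              // duplicated just for display
    { name: { first: "Jane", last: "Doe" }, phone: "555-0100" }
  ]
});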
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections, because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted a screen that displayed all contacts in alphabetical order. If you have one single collection for contacts, then it's really easy, but if you have two collections it becomes a more complicated proposition.
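As a quick mongo-shell sketch of that point (the collection and "name" field are assumptions):
// One collection: the alphabetical listing is a single indexed query.
db.contacts.ensureIndex({ name: 1 });
db.contacts.find().sort({ name: 1 });

// With separate "individuals" and "businesses" collections you would have to
// query both and merge-sort the two result sets in application code.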
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it very easy to extract that user's contacts.

MongoDB embedded documents vs. referencing by unique ObjectIds for a system user profile

I'd like to code a web app where most of the sections are dependent on the user profile (for example, different to-do lists per person, etc.) and I'd love to use MongoDB. I was thinking of creating about 10 embeddable documents for the main profile document and keeping everything related to one user inside his own document.
I don't see a clear way of using foreign keys in MongoDB. The only way would be to create a field such as to_do_id of type ObjectId, for example, but the documents would be totally unrelated internally; they would just happen to have the same ids, which I'd have to query for.
Is there a limit on the number of embedded document types inside a top level document that could degrade performance?
How do you guys solve the issue of having a central profile document that most of the documents have to relate to in presenting a view per person?
Do you use semi-foreign keys inside MongoDB, with ObjectId fields that hold some other document's unique id instead of embedding that document?
I can't get a feel for which approach should be taken when. Thank you very much!
There is no special limit with respect to performance. The larger the document, though, the longer it takes to transmit over the wire. The whole document is always retrieved.
I do it with references. You can choose between simple manual references and the database DBRef as per this page: http://www.mongodb.org/display/DOCS/Database+References
The link above documents how to have references in a document in a semi-foreign key way. The DBRef might be good for what you are trying to do, but the simple manual reference is very efficient.
I am not sure a general rule of thumb exists for which reference approach is best. Since I use Java or Groovy mostly, I like the fact that I get a DBRef object returned. I can check for this datatype and use that to decide how to handle the reference in a generic way.
So I tend to use a simple manual reference for references to different documents in the same collection, and a DBRef for references across collections.
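A rough mongo-shell sketch of both styles (collection and field names are assumptions, loosely borrowed from the question):
// Simple manual reference: store the other document's _id and resolve it
// yourself with a second query.
var todoId = ObjectId();
db.todos.insert({ _id: todoId, title: "Buy milk" });
db.profiles.insert({ user: "alice", to_do_id: todoId });

var profile = db.profiles.findOne({ user: "alice" });
var todo = db.todos.findOne({ _id: profile.to_do_id });

// DBRef: the reference also records which collection it points into, so
// generic code can follow it without knowing the schema; most drivers expose
// a DBRef type that can resolve it for you.
db.profiles.insert({ user: "bob", todo: new DBRef("todos", todoId) });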
I hope that helps.

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable by several forms that are very efficiently expressed as SQL Queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is: if I do this and perform a search, Lucene will return the IDs of the matching documents in the index, and I will then have to look up these entities in the relational database.
This could be done in two ways (that I can think of so far):
N individual queries (horrible)
Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). This has the downside of being limited to the maximum parameter size, and the slow performance of a UDF to split the string into a temporary table.
I'm almost tempted to mirror everything into the Lucene index, so that I can periodically regenerate the index from the backing store but only need to access Lucene for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built in ft support, with stemming and thesaurus support). This way the database can query using both SQL and ft commands. The downside is that you need a DB that has full-text-search capabilities, and these capabilities might be inferior to what lucene can do.
I guess the answer depends on what you are going to do with the results. If you are going to display the results in a grid and let the user choose the exact document he wants to access, then you may want to add enough text to the index to help the user identify the document, like a blurb of, say, 200 characters, and then once the user selects a document, hit the DB to retrieve the whole thing.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
Probably not an option depending on how much stuff is in your database, but what I have done is store the db ids in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db id, image URLs, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db id, and I use the ids returned by Lucene.NET to pull data from the in-memory cache.
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).