Hibernate with Lucene search - hibernate-search

I would like to perform searches based on synonyms, misspelled words, and so on. Can somebody suggest a good example using the latest version of Hibernate Search?

There are really two things at play here: first the synonyms, then the spelling mistakes. For the former I recommend you have a look at SynonymFilterFactory and how to use @AnalyzerDef. Obviously you somehow need the synonym file to begin with.
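For illustration, here is a minimal sketch of a synonym-aware analyzer using Hibernate Search 4.x-style annotations (depending on your version the filter factories live in org.apache.solr.analysis or org.apache.lucene.analysis; synonyms.txt is an assumed classpath resource in Solr synonym format):

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.apache.solr.analysis.LowerCaseFilterFactory;
    import org.apache.solr.analysis.StandardTokenizerFactory;
    import org.apache.solr.analysis.SynonymFilterFactory;
    import org.hibernate.search.annotations.*;

    @Entity
    @Indexed
    @AnalyzerDef(name = "synonymAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = SynonymFilterFactory.class, params = {
                @Parameter(name = "synonyms", value = "synonyms.txt"),
                @Parameter(name = "ignoreCase", value = "true"),
                @Parameter(name = "expand", value = "true")
            })
        })
    public class Book {
        @Id Long id;

        // Index-time analysis of this field runs through the synonym filter
        @Field(analyzer = @Analyzer(definition = "synonymAnalyzer"))
        String title;
    }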
The latter problem (spelling mistakes) is not so much an indexing issue (as with synonyms), but more of a search issue. To cater for different spelling mistakes you can search using FuzzyQuery.
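A minimal sketch using the Hibernate Search query DSL (4.x-style, reusing the Book entity above): fuzzy() builds a Lucene FuzzyQuery under the hood, and in raw Lucene you could equally construct new FuzzyQuery(new Term("title", "hibernat")) yourself.

    import org.apache.lucene.search.Query;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.query.dsl.QueryBuilder;

    public class FuzzySearchExample {
        // Assumes an open FullTextEntityManager
        public static Query buildFuzzyQuery(FullTextEntityManager ftem) {
            QueryBuilder qb = ftem.getSearchFactory()
                    .buildQueryBuilder().forEntity(Book.class).get();
            return qb.keyword()
                    .fuzzy()               // tolerate small edit distances
                    .onField("title")
                    .matching("hibernat")  // misspelling still matches "hibernate"
                    .createQuery();
        }
    }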

Related

Postgresql Rule vs Trigger

I looked through the official documentation and did some research, but I still don't know which to use in which situations. Could you give me a simple example and explain the difference? What are the advantages of writing a RULE?
Forget about rules.
The Postgres Wiki recommends never using them.
Why not?
Rules are incredibly powerful, but they don't do what they look like they do. They look like they're some conditional logic, but they actually rewrite a query to modify it or add additional queries to it.
That means that all non-trivial rules are incorrect.
Depesz has more to say about them.
When should you?
Never. While the rewriter is an implementation detail of VIEWs, there is no reason to pry up this cover plate directly.
(emphasis mine)
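To make the pitfall concrete, here is a sketch (hypothetical table names; JDBC used just to keep the whole example in one place) of how a seemingly innocent auditing rule misbehaves, because the rewriter substitutes NEW.val as an expression rather than a value:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RulePitfall {
        public static void main(String[] args) throws Exception {
            // Assumed connection details - adjust for your environment
            try (Connection c = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "pass");
                 Statement st = c.createStatement()) {

                st.execute("CREATE TABLE t (val int)");
                st.execute("CREATE TABLE audit (val int)");

                // The rule *rewrites* the INSERT into two INSERTs. NEW.val is
                // substituted as an expression, so a volatile function like
                // random() is evaluated once per rewritten query.
                st.execute("CREATE RULE t_audit AS ON INSERT TO t " +
                           "DO ALSO INSERT INTO audit VALUES (NEW.val)");
                st.execute("INSERT INTO t VALUES (floor(random() * 100)::int)");
                // t.val and audit.val will usually differ - the rule re-ran
                // random(). A trigger, by contrast, sees the actual row:
                //   CREATE FUNCTION t_audit_fn() RETURNS trigger AS $$
                //   BEGIN INSERT INTO audit VALUES (NEW.val); RETURN NEW; END
                //   $$ LANGUAGE plpgsql;
                //   CREATE TRIGGER t_audit_trg AFTER INSERT ON t
                //       FOR EACH ROW EXECUTE PROCEDURE t_audit_fn();
            }
        }
    }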

Postgresql - case insensitive build to allow all wheres, joins, group bys etc to be case insensitive

I've had this thought brewing for some time but I can't find anyone online who's discussed this as a possibility.
Currently the recommendations available for making case insensitive searches seem to be either to use "ilike" or "citext".
We're moving away from Microsoft SQL Server to PostgreSQL, and all our code assumes case-insensitive comparisons - but our T-SQL code base is huge, so changing it all to use UPPER() or ilike or citext etc. isn't really feasible as a commercial development project.
However, it must be possible to grab the PostgreSQL source and change some of the C code so that all string comparisons are case-insensitive, and then make our own build of the whole product. It would possibly require only a few lines of code to be changed, so upgradability might not be a huge issue.
So I'm wondering whether anyone on here knows the PostgreSQL code base well enough to kick around ideas about whether this is feasible, and whereabouts the comparison code lives, just to help us get started. I'm continuing to research this in the meantime, starting with just being able to build PostgreSQL on Windows, but the hope is to bring others on board so that a community project could be started; as well as case insensitivity there might be other tweaks to let T-SQL code work better, easing migration projects. My company would contribute strongly.
Sorry if this is off topic, but it seems to lean strongly towards being a developer question, and I'm sure many other Postgres users would appreciate a case-insensitive build in this day and age - thanks.
I understand your sentiment, but I believe that you are wrong to assume that this would be a simple change. Otherwise PostgreSQL would probably already have case insensitive collations...
I'd say that your best bet is to use citext throughout. What is the problem you have with that?
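For reference, a minimal sketch of what "citext throughout" looks like (assumed connection details; the citext extension ships with PostgreSQL's contrib modules):

    import java.sql.*;

    public class CitextDemo {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "pass");
                 Statement st = c.createStatement()) {

                st.execute("CREATE EXTENSION IF NOT EXISTS citext");
                st.execute("CREATE TABLE users (name citext)");
                st.execute("INSERT INTO users VALUES ('Alice')");

                // Plain '=' is case-insensitive on citext columns, so WHEREs,
                // JOINs and GROUP BYs work unchanged - no ILIKE or UPPER()
                try (ResultSet rs = st.executeQuery(
                        "SELECT count(*) FROM users WHERE name = 'ALICE'")) {
                    rs.next();
                    System.out.println(rs.getInt(1)); // prints 1
                }
            }
        }
    }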
You should take this to the hackers list to start a serious discussion, but make sure you read the archives first, because the problem is not a new one.

Document similarity framework

I would like to create an application which searches for similar documents in its database; e.g. the user uploads a document (text, image, etc.), and I would like to query my application for similar ones.
I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.); I'm looking for a framework that couples all of these.
For example, if I were to implement it in Lucene, I would do the following:
Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
Then add the created elements to the Lucene index
And finally use the MoreLikeThis class to find the similar documents (a rough sketch follows)
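For that last step, this is how MoreLikeThis is typically used (Lucene 3.x package names shown; in 4.x the class moved to org.apache.lucene.queries.mlt, and "contents" stands in for whatever field your custom analyzer populates):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;

    public class SimilarDocs {
        // Assumes "reader" is an open IndexReader over the feature index
        public static TopDocs findSimilar(IndexReader reader, int docId)
                throws Exception {
            IndexSearcher searcher = new IndexSearcher(reader);
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] { "contents" });
            mlt.setMinTermFreq(1); // loosen defaults for short documents
            mlt.setMinDocFreq(1);
            Query query = mlt.like(docId); // query built from the doc's own terms
            return searcher.search(query, 10);
        }
    }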
So basically, Lucene might be a good choice - but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based search engine.
My question is: are there any applications/frameworks which might fit the above-mentioned problem?
Thanks,
krisy
UPDATE: It seems the process I described above is called Content-Based Media Retrieval (sound, image, video).
There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but I still haven't found a dedicated framework...
Since you're using Lucene, you might take a look at SOLR. I do realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record and the fact that there are a lot of useful resources out there, SOLR might help you get your job done.
Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with SOLR (but you have probably already read that in the meantime).
If I understand correctly, you have your own database, and you are checking whether an uploaded document is a duplicate of, or similar to, one already in that database, while/after the user uploads it.
If that is the case, the problem domain is very big.
1) For images you will need pattern matching; there are several papers available online about image duplicate detection - search for them and you will find many options.
2) For documents there is again a characteristic division by format:
DOC(X)
PDF
TXT
RTF, etc.
Each format carries different properties. Lucene may help you here, but it is a search engine,
and when searching for language patterns there are many things to check, since you are looking for similar (not exactly identical) documents.
So a fuzzy-matching approach will come in handy.
This requirement is too large for a forum page to explain everything, but I hope this much will do.

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx. 200k) that have a title and an abstract. There is other metadata available for each document, for example category (exactly one of cooking, health, exercise, etc.) and genre (exactly one of humour, action, anger, etc.). The metadata is well structured, and all of this is available in a MySQL DB.
I need to show our user related documents while she is reading one of these documents on our site, and I need to give the product managers weightings for title, abstract, and metadata to experiment with for this service.
I am planning to run clustering on top of this data, but am hampered by the fact that all the Mahout clustering examples use either DenseVectors formulated on top of numbers, or Lucene-based text vectorization.
The examples cover either numeric data only or text data only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I can do this from first principles: extract all titles and abstracts into a DB, calculate TF-IDF & LLR, treat each word as a dimension, and go about this experiment with a lot of code writing. That seems like a long way to a solution.
That, in a nutshell, is where I am stuck - am I doomed to first principles, or does there exist a tool/methodology that I have somehow missed? I would love to hear from folks out there who have solved a similar problem.
Thanks in advance
You have a text similarity problem here and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done; then feed the vectors into whatever clusterer you want. Term extraction is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
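In case it helps, here is a minimal first-principles sketch of the combining step: one vector per document, TF-IDF-weighted text terms plus one-hot metadata dimensions, with per-field weights as the knobs for your product managers. All names here are illustrative; the resulting maps can be converted to Mahout vectors or fed to any clusterer.

    import java.util.*;

    public class DocVectorizer {
        public static Map<String, Double> vectorize(
                String title, String abstractText, String category,
                Map<String, Double> idf,
                double titleWeight, double abstractWeight, double metaWeight) {

            Map<String, Double> vec = new HashMap<>();
            addTfIdf(vec, title, idf, titleWeight);
            addTfIdf(vec, abstractText, idf, abstractWeight);
            // One-hot dimension per category value, scaled by its weight
            vec.merge("category=" + category, metaWeight, Double::sum);
            return vec; // feed into k-means, cosine-similarity kNN, etc.
        }

        private static void addTfIdf(Map<String, Double> vec, String text,
                                     Map<String, Double> idf, double weight) {
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                double w = weight * idf.getOrDefault(term, 1.0);
                vec.merge(term, w, Double::sum); // raw TF * IDF * field weight
            }
        }
    }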
I'm actually working on something similar, but without the need for a distinction between numeric and text fields.
I have decided to go with the semanticvectors package, which does all the TF-IDF work, the semantic vector space construction, and the similarity search. It uses a Lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road of course).
The only caveat I'm facing with this approach is that the indexing part can't be incremental: I have to re-index everything every time a new document is added or an old document is modified. People using semanticvectors say they have very good indexing times, but I don't know how large their corpora are. I'm going to test these issues with the Wikipedia dump to see how fast it can be.

Full Text Searching in Apple's Core Data Framework

I would like to implement full-text search in an iPhone application. I have data stored in an SQLite database that I access via the Core Data framework. Just using predicates and ORing together a bunch of "contains[cd]" clauses for every search word and column does not work well at all.
What have you done that seems to work well?
We have FTS3 working very nicely on 150,000+ records. We are getting subsecond query times returning over 200 results on a single keyword query.
Presently the only way to get SQLite FTS3 working on the iPhone is to compile your own binary and link it into your project. To my knowledge, a binary included in your own project will not work with Core Data. Perhaps Apple will turn on the FTS3 compile-time option in a future release?
You can still link in your own SQLite FTS3 binary and use it just for full-text searches. This would be very similar to the way Sphinx or Lucene is used in web app environments. Note you will still have to update the search index at some point to keep it in sync with the Core Data stores.
Good luck!
I assume that by "does not work well" you mean 'performs badly'. Full-text search is always relatively slow, especially in memory or space constrained environments. You may be able to speed things up by making sure the attributes you're searching against are indexed and using BEGINSWITH[cd] instead of CONTAINS[cd]. My recollection (can't find the cocoa-dev post at this time) is that SQLite will use the index for prefix matching, but falls back to linear search for infix searches.
I use contains[cd] in my predicate and this works fine. Perhaps you could post your predicate and we could see if there's an obvious fault.
SQLite has its own full-text indexing module: http://sqlite.org/fts3.html
You have to have full control of the SQL you send to the DB (I don't know how Core Data works), but using the full-text indexing module is key to both speed of execution and simplicity in the SQL SELECT statements that do full-text searching.
Using CONTAINS is fine if you don't need fast execution, but SELECTs made with it can't make use of regular indexes, so they are destined to be slow, and the larger the database the slower they will be. Using real full-text indexing allows the same sort of searches as 'CONTAINS', but everything is indexed for fast results even with large DBs.
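For illustration, the FTS3 SQL itself is tiny. It is sketched here through JDBC (assuming the Xerial sqlite-jdbc driver on the classpath) purely for brevity - on the iPhone you would issue the same statements through the SQLite C API:

    import java.sql.*;

    public class Fts3Demo {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection("jdbc:sqlite:notes.db");
                 Statement st = c.createStatement()) {

                // FTS3 virtual table: every column is full-text indexed
                st.execute("CREATE VIRTUAL TABLE IF NOT EXISTS notes_fts " +
                           "USING fts3(title, body)");
                st.execute("INSERT INTO notes_fts (title, body) VALUES " +
                           "('Shopping', 'buy apples and oranges')");

                // MATCH uses the full-text index; unlike CONTAINS-style scans
                // it stays fast as the table grows
                try (ResultSet rs = st.executeQuery(
                        "SELECT title FROM notes_fts WHERE notes_fts MATCH 'apples'")) {
                    while (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }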
I've been working on this same problem and just got around to following up on my post about it from a few weeks ago. Instead of using CONTAINS, I created a separate entity with an instance for each canonicalized word. I added an index on the words (in the Xcode model builder) and can then use a BEGINSWITH operator to exploit the index. Nevertheless, as I just posted a few minutes ago, query time is still very slow even for small data sets.
There must be a better way! After all, we see this sort of full text search in lots of apps!