Lucene .NET different results on different servers - lucene.net

We use Lucene .NET in a web application. We build an index from an SQL database, and then search it.
We have a situation that is baffling: same database on two servers, same code, same configuration (believe me, we've checked really thoroughly). BUT running identical searches returns wildly different numbers of documents (one server produces many results, the other very few).
We log the actual queries run against Lucene, and we have run those in the Lucene query analyser and we get the expected results on both servers. However, executing the query via our code gives a completely different result between servers.
We've even copied the index files between servers and we get the same result - the one server always gives a lot of results, the other doesn't.
We've checked the configuration of the two servers - both servers are Windows 2008 R2, 64 bit version, both set to the same locale etc.
We're baffled - any ideas are greatly appreciated.

Luke is a Java application using the Java-based analyzers available. I would expect the results to differ between Luke and Lucene.Net, especially if you use custom analyzers which differ from the Java versions.
Also, remember that a new TermQuery("f", "my value") will produce a query that looks like "f:my value", but if you throw that into Luke it will actually be parsed as "f:my default-field:value", i.e. a completely different search query.
I would start by verifying that you are using the correct analyzers, and that they are producing the correct tokens. I would also check for culture-sensitive things, since they usually sneak up on you when you least expect it (for example, processes executing under non-system locales, or threads changing their locale during execution).

Related

Unit tests producing different results when using PostgreSQL

I've been working on a module that works pretty well when using MySQL, but when I try to run the unit tests I get an error when testing under PostgreSQL (using Travis).
The module itself is here: https://github.com/silvercommerce/taxable-currency
An example failed build is here: https://travis-ci.org/silvercommerce/taxable-currency/jobs/546838724
I don't have a huge amount of experience using PostgreSQL, and I am not really sure why this might be happening. The only thing I could think of that might cause this is that I am trying to manually set the IDs in my fixtures file and maybe PostgreSQL does not support this?
If this is not the case, does anyone have an idea what might be causing this issue?
Edit: I have looked into this again and the errors appear to be caused by this assertion, which should find the Tax Rate vat but instead finds the Tax Rate reduced.
I am guessing there is an issue in my logic that is causing the incorrect rate to be returned, though I am unsure why...
In the end it appears that Postgres has different default sorting to MySQL (https://www.postgresql.org/docs/9.1/queries-order.html). The line of interest is:
The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on
In the end I didn't really need to test a list with multiple items, so instead I just removed the additional items.
If you are working on something that needs to support MySQL and Postgres though, you might need to consider defining a consistent sort order as part of your query.
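A minimal sketch of that idea, with assumed table and column names rather than anything from the module itself:
-- Without an ORDER BY, both MySQL and PostgreSQL are free to return rows in
-- whatever order the plan produces, and they often differ. An explicit sort
-- ending on a unique column makes the result deterministic on both databases.
SELECT ID, Title, Rate
FROM TaxRate
ORDER BY Rate DESC, ID ASC;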

Why is Crystal Reports Query so slow?

I have many Crystal Reports against the same database. Some execute quickly given the same date parameters, and many of the fields are the same, as are the tables they access. One of my reports that used to run quickly is now running very slowly, and I can see it looking through all the records - shown at the bottom as 0 of 100000 until it finds records. I have no idea what I may have changed to make it do this. Some reports still run fast and some do not, and these findings are consistent for the reports I am talking about. Does anyone know what setting might be causing this?
I have tried looking for any subtle differences between them - I cannot see anything. Many of them were cloned from the original (which still works fast).
My CR book, in its performance section, states that if the where clause cannot be translated it will be ignored and all records will be processed - which is what this looks like - though I have a valid where clause when I check it in the report.
Use Indexes Or Server For Speed is checked. All other settings in Report Options are identical.
Thanks
You can do some troubleshooting:
Try running your query directly on the database and see how long it takes (see the sketch below).
Is there any business logic added in your report?
Maybe also try putting the same query in a fresh report and see if it takes a similar time.
Also try debugging your application and see if some part of your code is making the report slow to display.
Are you running it against a local database or on some server?
Also, if you can share your query, I can take a look.
Let me know if you need more help.
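For that first step, a rough sketch of timing the report's underlying query straight in SSMS; the table, columns and date parameters are placeholders, not taken from your actual report:
-- Compare the elapsed time reported here against what Crystal Reports shows.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

DECLARE @StartDate date = '2015-01-01';
DECLARE @EndDate   date = '2015-01-31';

SELECT *
FROM dbo.Orders
WHERE OrderDate BETWEEN @StartDate AND @EndDate;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;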

EntityFramework taking excessive time to return records for a simple SQL query

I have already combed through this old article:
Why is Entity Framework taking 30 seconds to load records when the generated query only takes 1/2 of a second?
but no success.
I have tested the query:
without lazy loading (not using .Include of related entities) and
without merge tracking (using AsNoTracking)
I do not think I can easily switch to compiled queries in general due to the complexity of queries and using a Code First model, but let me know if you experience otherwise...
Setup
Entity Framework '4.4' (.Net 4.0 with EF 5 install)
Code First model and DbContext
Testing directly on the SQL Server 2008 machine hosting the database
Query
- It's just returning simple fields from one table:
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Active] AS [Active],
[Extent1].[ChangeUrl] AS [ChangeUrl],
[Extent1].[MatchValueSetId] AS [MatchValueSetId],
[Extent1].[ConfigValueSetId] AS [ConfigValueSetId],
[Extent1].[HashValue] AS [HashValue],
[Extent1].[Creator] AS [Creator],
[Extent1].[CreationDate] AS [CreationDate]
FROM [dbo].[MatchActivations] AS [Extent1]
The MatchActivations table has relationships with other tables, but for this purpose using explicit loading of related entities as needed.
Results (from SQL Server Profiler)
For Microsoft SQL Server Management Studio Query: CPU = 78 msec., Duration = 587 msec.
For EntityFrameworkMUE: CPU = 31 msec., Duration = 8216 msec.!
Does anyone know, besides suggesting the use of compiled queries, if there is anything else to be aware of when using Entity Framework for such a simple query?
A number of people have run into problems where cached query execution plans due to parameter sniffing cause SQL Server to produce a very inefficient execution plan when running a query through ADO.NET, while running the exact same query directly from SQL Server Management Studio uses a different execution plan because some flags on the query are set differently by default.
Some people have reported success in forcing a refresh of the query execution plans by running one or both of the following commands:
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
But a more long-term, targeted solution to this problem would be to use Query Hints like OPTIMIZE FOR and OPTION(Recompile), as described in this article, to help ensure that good execution plans are chosen more consistently in the first place.
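As a hedged sketch, this is what those hints look like on the query from the question (the OPTIMIZE FOR variant uses a hypothetical parameter, since the posted query has none):
-- Force a fresh plan on every execution, so a plan sniffed for an atypical
-- parameter value cannot be reused.
SELECT [Extent1].[Id], [Extent1].[Active], [Extent1].[ChangeUrl],
       [Extent1].[MatchValueSetId], [Extent1].[ConfigValueSetId],
       [Extent1].[HashValue], [Extent1].[Creator], [Extent1].[CreationDate]
FROM [dbo].[MatchActivations] AS [Extent1]
OPTION (RECOMPILE);

-- Or, for a parameterised query, pin the plan to a representative value:
-- ... WHERE [Extent1].[Creator] = @creator
-- OPTION (OPTIMIZE FOR (@creator = N'typical-user'));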
I think the framework is doing something funky if what you say is true, i.e. running the query in Management Studio takes half a second while Entity Framework takes 8.2 seconds. My hunch is that it's trying to do something with that 25K+ record set (perhaps bind it to something else).
Can you download NP NET profiler and profile your app once? http://www.microsoft.com/en-in/download/details.aspx?id=35370
This nifty little program is going to record every method call and their execution time and basically give you info from under the hood on where it's spending those 7+ seconds. If that does not help, I also recommend trying out JetBrains .NET profiler. https://www.jetbrains.com/profiler/
The previous answer suggests that the execution plan can be off, and that's true in many cases, but it's also worth sometimes looking under the hood to determine the cause.
My thanks to Kalagen and others who responded to this - I did come to a conclusion on this, but forgot about this post.
It turns out it is the number of records being returned multiplied by the processing time (LINQ/EF, I presume) to repurpose the raw SQL data back into objects on the client side. I set up Wireshark on the SQL server to monitor the network traffic between it and client machines post-query and discovered:
There is a constant stream of network traffic between the SQL server and the client machine
the rate of packet processing varies greatly between different machines (8x)
While that is occurring, the SQL server CPU utilization is < 25% and no resource starvation seems to be happening (working set, virtual memory, thread and handle counts, etc.)
so it is basically the constant conversion of the results back into EF objects.
The query in question, BTW, was part of a 'performance' unit test, so we ended up culling it down to a more reasonable, typical web-page load of 100 records in under 1 sec., which passes easily.
If anyone wants to chime in on the details of how Entity Framework processes records post-query, I'm sure that would be useful to know.
It was an interesting discovery that the processing time depended more heavily on the client machine than on the SQL server machine (this is an intranet application).

Database design: Postgres or EAV to hold semi-structured data

I was given the task to decide whether our stack of technologies is adequate to complete the project we have at hand or should we change it (and to which technologies exactly).
The problem is that I'm just a SQL Server DBA and I have a few days to come up with a solution...
This is what our client wants:
They want a web application to centralize pharmaceutical researches separated into topics, or projects, in their jargon. These researches are sent as csv files and they are somewhat structured as follows:
Project (just a name for the project)
Segment (could be behavioral, toxicology, etc. There is a finite set of about 10 segments. Each csv file holds a segment)
Mandatory fixed fields (a small set of fields that are always present, like Date, subject IDs, etc. These will be the PKs).
Dynamic fields (could be anything here, but always as key/value pairs, and there shouldn't be more than 200 fields)
Whatever files (images, PDFs, etc.) that are associated with the project.
At the moment, they just want to store these files and retrieve them through a simple search mechanism.
They don't want to crunch the numbers at this point.
98% of the files have a couple of thousand lines, but there's a 2% with a couple of million rows (and around 200 fields).
This is what we are developing so far:
The back-end is SQL 2008R2. I've designed EAVs for each segment (before anything please keep in mind that this is not our first EAV design. It worked well before with less data.) and the mid-tier/front-end is PHP 5.3 and Laravel 4 framework with Bootstrap.
The issue we are experiencing is that PHP chokes on the big files. It can't insert into SQL in a timely fashion when there are more than 100k rows, because there's a lot of pivoting involved and, on top of that, PHP needs to get back all the field IDs first before it can start inserting. I'll explain: this is necessary because the client wants some sort of control over the field names. We created a repository of all the possible fields to try to minimize ambiguity problems; fields named, for instance, "Blood Pressure", "BP", "BloodPressure" or "Blood-Pressure" should all be stored under the same name in the database. So, to minimize the issue, the user has to actually insert his csv fields into another table first - we called it the properties table. This action won't completely solve the problem, but as he's inserting the fields, he's seeing possible matches already inserted. When the user types in blood, there's a panel showing all the fields already used containing the word blood. If the user thinks it's the same thing, he has to change the csv header to that field. Anyway, all this is to explain that it's not a simple EAV structure and there's a lot of back and forth of IDs.
This issue is giving us second thoughts about our technologies stack choice, but we have limitations on our possible choices: I only have worked with relational DBs so far, only SQL Server actually and the other guys know only PHP. I guess a MS full stack is out of the question.
It seems to me that a non-SQL approach would be best. I have read a lot about MongoDB, but honestly I think it would be a super steep learning curve for us, and if they want to start crunching the numbers or even have some reporting capabilities,
I guess Mongo wouldn't be up to that. I'm reading about PostgreSQL, which is relational, and its famous hstore type. So here is where my questions start:
Would you think that Postgres would be a better fit than SQL Server for this project?
Would we be able to convert the csv files into JSON objects or whatever to be stored in hstore fields and be somewhat queryable? (See the sketch after this post.)
Are there any issues with Postgres sitting on a Windows box? I don't think our client has Linux admins. Nor do we, for that matter...
Is its licensing free for commercial applications?
Or should we stick with what we have and try to sort the problem out with staging tables, bulk inserts or another technique that relies on the back-end to do the heavy lifting?
Sorry for the long post and thanks for your input guys, I appreciate all answers as I'm pulling my hair out here :)
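To make the hstore question above concrete, here is a minimal sketch under assumed table and field names (nothing here is from the actual project):
-- Fixed, mandatory fields become ordinary columns; the ~200 dynamic
-- key/value fields from each csv row go into a single hstore column.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE research_rows (
    id          bigserial PRIMARY KEY,
    project     text NOT NULL,
    segment     text NOT NULL,
    measured_on date NOT NULL,
    subject_id  text NOT NULL,
    attrs       hstore
);

INSERT INTO research_rows (project, segment, measured_on, subject_id, attrs)
VALUES ('Project X', 'toxicology', '2014-05-01', 'SUBJ-001',
        hstore(ARRAY['Blood Pressure', '120/80', 'Weight', '71.4']));

-- A GIN index makes "which rows have this key / key = value" searches fast.
CREATE INDEX research_rows_attrs_gin ON research_rows USING gin (attrs);

SELECT subject_id, attrs -> 'Blood Pressure' AS blood_pressure
FROM research_rows
WHERE segment = 'toxicology' AND attrs ? 'Blood Pressure';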

full text search in java web application [duplicate]

I'm building a Django site and I am looking for a search engine.
A few candidates:
Lucene/Lucene with Compass/Solr
Sphinx
Postgresql built-in full text search
MySQL built-in full text search
Selection criteria:
result relevance and ranking
searching and indexing speed
ease of use and ease of integration with Django
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
scalability
extra features such as "did you mean?", related searches, etc
Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.
EDIT: As for indexing needs, as users keep entering data into the site, that data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in the index with no more than a 15-30 minute delay.
Good to see someone's chimed in about Lucene - because I've no idea about that.
Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.
Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.
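For what it's worth, the delta-index approach usually boils down to two source queries along these lines in sphinx.conf (the posts table and the sph_counter helper are placeholders, not Sphinx built-ins):
-- sph_counter is a tiny one-row table you maintain yourself; the main index's
-- pre-query records the current MAX(id) in it before indexing.
-- Main index: everything up to the last recorded document id.
SELECT id, title, body
FROM posts
WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);

-- Delta index: only what has arrived since, re-indexed every minute or two.
SELECT id, title, body
FROM posts
WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);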
I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.
I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.
I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.
Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.
As for CPU & RAM requirements, performing a search in Lucene doesn't tax your CPU too much, though indexing your data does; you don't do that too often (maybe once or twice a day), so it isn't much of a hurdle.
It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.
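For comparison, a minimal sketch of such a MySQL full-text setup (the articles table and its columns are invented for the example):
-- FULLTEXT indexes require MyISAM before MySQL 5.6; InnoDB supports them from 5.6 on.
CREATE TABLE articles (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200),
    body  TEXT,
    FULLTEXT (title, body)
) ENGINE = MyISAM;

SELECT id, title,
       MATCH (title, body) AGAINST ('django search' IN NATURAL LANGUAGE MODE) AS score
FROM articles
WHERE MATCH (title, body) AGAINST ('django search' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;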
Apache Solr
Apart from answering the OP's queries, let me throw in some insights on Apache Solr, from a simple introduction to detailed installation and implementation.
Simple Introduction
Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.
Solr shouldn't be used to solve real-time problems, but for search engines Solr is pretty much the game and works flawlessly.
Solr works fine in high-traffic web applications (I read somewhere that it is not suited for this, but I am backing that statement up). It utilizes the RAM, not the CPU.
result relevance and ranking
Boosting helps you rank your results so that they show up on top. Say you're trying to search for a name john in the fields firstname and lastname, and you want to give relevance to the firstname field; then you need to boost up the firstname field as shown.
http://localhost:8983/solr/collection1/select?q=firstname:john^2&lastname:john
As you can see, firstname field is boosted up with a score of 2.
More on SolrRelevancy
searching and indexing speed
The speed is unbelievably fast, with no compromise on that. It's the reason I moved to Solr.
Regarding indexing speed, Solr can also handle JOINS from your database tables. A higher and more complex JOIN does affect the indexing speed. However, an enormous RAM configuration can easily tackle this situation.
The higher the RAM, the faster Solr's indexing speed.
ease of use and ease of integration with Django
I've never attempted to integrate Solr and Django; however, you can achieve that with Haystack. I found an interesting article on the subject, and here's the GitHub repo for it.
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
Solr thrives on RAM, so if the RAM is high, you don't have to worry about Solr.
Solr's RAM usage shoots up on a full indexing run if you have billions of records; you could smartly make use of delta imports to tackle this situation. As explained, Solr is only a near-real-time solution.
scalability
Solr is highly scalable. Have a look on SolrCloud.
Some key features of it.
Shards (sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
Load Balancing (if Solrj is used with SolrCloud, it automatically takes care of load balancing using its round-robin mechanism)
Distributed Search
High Availability
extra features such as "did you mean?", related searches, etc
For the above scenario, you could use the SpellCheckComponent that is packaged with Solr. There are a lot of other features; for example, the SnowballPorterFilterFactory helps to retrieve records so that if you typed books instead of book, you will be presented with results related to book.
This answer broadly focuses on Apache Solr & MySQL. Django is out of scope.
Assuming that you are in a Linux environment, you can proceed with this article further (mine was Ubuntu 14.04).
Detailed Installation
Getting Started
Download Apache Solr from here. The version would be 4.8.1. You could download newer versions; I found this one stable.
After downloading the archive, extract it to a folder of your choice.
Say Downloads or whatever, so it will look like Downloads/solr-4.8.1/
At your prompt, navigate inside the directory:
shankar@shankar-lenovo: cd Downloads/solr-4.8.1
So now you are here:
shankar@shankar-lenovo: ~/Downloads/solr-4.8.1$
Start the Jetty Application Server
Jetty is available inside the example folder of the solr-4.8.1 directory, so navigate inside it and start the Jetty Application Server.
shankar@shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar
Now, do not close the terminal; minimize it and let it stay aside.
(TIP: Use & after start.jar to make the Jetty Server run in the background)
To check whether Apache Solr is running successfully, visit this URL in your browser: http://localhost:8983/solr
Running Jetty on custom Port
It runs on port 8983 by default. You can change the port either here or directly inside the jetty.xml file.
java -Djetty.port=9091 -jar start.jar
Download the JConnector
This JAR file acts as a bridge between MySQL and JDBC. Download the Platform Independent version here.
After downloading it, extract the folder, copy the mysql-connector-java-5.1.31-bin.jar and paste it into the lib directory.
shankar@shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib
Creating the MySQL table to be linked to Apache Solr
To put Solr to use, you need to have some tables and data to search. For that, we will use MySQL to create a table and push in some random names, and then we can use Solr to connect to MySQL and index that table and its entries.
1.Table Structure
CREATE TABLE test_solr_mysql
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NULL,
created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
2.Populate the above table
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');
Getting inside the core and adding the lib directives
1.Navigate to
shankar@shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf
2.Modifying the solrconfig.xml
Add these two directives to this file..
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
Now add the DIH (Data Import Handler)
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler" >
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
3.Create the db-data-config.xml file
If the file already exists, just add these lines to it. As you can see in the first line, you need to provide the credentials of your MySQL database: the database name, username and password.
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
<document>
<entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
OR `created` > '${dataimporter.last_index_time}'" >
<field name="id" column="rid" />
<field name="solr_name" column="name" />
</entity>
</document>
</dataConfig>
(TIP: You can have any number of entities, but watch out for the id field; if the ids are the same, indexing will be skipped.)
4.Modify the schema.xml file
Add this to your schema.xml as shown..
<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />
Implementation
Indexing
This is where the real deal is. You need to index the data from MySQL into Solr in order to make use of Solr queries.
Step 1: Go to Solr Admin Panel
Hit the URL http://localhost:8983/solr in your browser. The admin screen opens.
As the marker indicates, go to Logging in order to check whether any of the above configuration has led to errors.
Step 2: Check your Logs
OK, so now you are here. As you can see, there are a lot of yellow messages (WARNINGS). Make sure you don't have any error messages marked in red. Earlier, in our configuration, we added a select query in our db-data-config.xml; if there were any errors in that query, they would have shown up here.
Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport.
Step 3: DIH (Data Import Handler)
Using the DIH, you connect to MySQL from Solr through the configuration file db-data-config.xml via the Solr interface and retrieve the 10 records from the database, which get indexed into Solr.
To do that, choose full-import, and check the options Clean and Commit. Now click Execute as shown.
Alternatively, you could use a direct full-import query like this too..
http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
After you click Execute, Solr begins to index the records. If there were any errors, it would say Indexing Failed and you would have to go back to the Logging section to see what had gone wrong.
Assuming there are no errors with this configuration and the indexing completes successfully, you will get this notification.
Step 4: Running Solr Queries
It seems everything went well; now you can use Solr queries to query the data that was indexed. Click Query on the left and then press the Execute button at the bottom.
You will see the indexed records as shown.
The corresponding Solr query for listing all the records is
http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true
Well, there go all 10 indexed records. Say we need only the names starting with Ja; in this case, you need to target the column solr_name, hence your query goes like this.
http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true
That's how you write Solr queries. To read more about them, check this beautiful article.
I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).
The answer at the link below details a few things about Sphinx which also applies to Solr.
Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Solr also provides the following additional features:
1. Supports replication
2. Multiple cores (think of these as separate databases with their own configuration and own indexes)
3. Boolean searches
4. Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
5. Update index via XML or delimited file
6. Communicate with the search server via HTTP (it can even return JSON, native PHP/Ruby/Python)
7. PDF, Word document indexing
8. Dynamic fields
9. Facets
10. Aggregate fields
11. Stop words, synonyms, etc.
12. More Like This...
13. Index directly from the database with custom queries
14. Auto-suggest
15. Cache Autowarming
16. Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
17. Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
18. Fielded searches (if a search user knows the field he/she wants to search, they can narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- a much better user experience)
BTW, there are tons more features; I've listed just the features that I have actually used in production. Out of the box, MySQL supports #1, #3, and #11 (limited) from the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.
Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)
I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine, really good extended character and multilingual support, nice tight integration with text fields in the database.
But it doesn't have user-friendly search operators like + or AND (it uses & | !), and I'm not thrilled with how it works on their documentation site. While it has bolding of match terms in the result snippets, the default algorithm for choosing which match terms to show is not great. Also, if you want to index RTF, PDF or MS Office files, you have to find and integrate a file format converter.
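To make the operator point concrete, this is roughly what those queries look like (documents and its columns are an assumed table, not part of any particular setup):
-- tsquery uses & (AND), | (OR) and ! (NOT) rather than friendlier syntax.
SELECT title
FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search & engine & !mysql');

-- plainto_tsquery() at least accepts plain words and ANDs them for you.
SELECT title
FROM documents
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'search engine');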
OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: http://www.searchtools.com/analysis/mediawiki-search/
In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to make search that satisfies almost everyone.
For SHAILI - Solr includes the Lucene search code library and has the components to be a nice stand-alone search engine.
Just my two cents to this very old question. I would highly recommend taking a look at ElasticSearch.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
The advantages over other FTS (full text search) Engines are:
RESTful interface
Better scalability
Large community
Built by Lucene developers
Extensive documentation
There are many open source libraries available (including Django)
We are using this search engine in our project and are very happy with it.
SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."
FYI, the MySQL full-text minimum word length has been adjustable since at least MySQL 5.0. Google 'mysql fulltext min length' for simple instructions.
That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...
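A quick sketch of checking and applying that setting (the table name is a placeholder):
-- MyISAM full-text uses ft_min_word_len; InnoDB (5.6+) uses innodb_ft_min_token_size.
SHOW VARIABLES LIKE 'ft_min_word_len';
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- After lowering the value in my.cnf and restarting the server, the full-text
-- index has to be rebuilt, e.g. for a MyISAM table:
REPAIR TABLE your_table QUICK;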
I would add mnoGoSearch to the list. It is an extremely performant and flexible solution, which works like Google: the indexer fetches data from multiple sites, you can use basic criteria, or you can invent your own hooks for maximum search quality. It can also fetch the data directly from the database.
The solution is not so well known today, but it meets most needs. You can compile and install it either on a standalone server or even on your main server; it doesn't need as many resources as Solr, as it's written in C and runs perfectly even on small servers.
To begin with, you need to compile it yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.
As you are using the Django framework, you could either use a PHP client in the middle or find a solution in Python; I have seen some articles about that.
And, of course, mnoGoSearch is open source (GNU GPL).
We just switched from Elasticsearch to Postgres full-text search. Since we already use Postgres, we now save ourselves the hassle of keeping a separate index up to date.
But this only affects full-text search. There are, however, use cases where Elasticsearch is significantly better - maybe facets or something like that.