Document similarity framework

Document similarity framework - frameworks

I would like to create an application which searches for similar documents in its database; eg. the user uploads a document (text, image, etc.), and I would like to query my application for similar ones.
I have already created the neccesseary algorithms for the process (fingerprinting, feature extraction, hashing, hash compare, etc.), I'm looking for a framework, which couples all of these.
For example, if I would implement it in Lucene, I would do the following:
Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
Than adding the created elements to the Lucene index
And finally using the MoreLikeThis class to find the similar documents
So, basically Lucene might be a good choice - but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based searchengine.
My question is: are the any applications/frameworks, which might fit for the above mentioned problem?
Thanks,
krisy
UPDATE: It seems like the process I described above is called Content Based Media (Sound, Image, Video.) Retrieval.
There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but still didn't found any dedicated framework ...

Since you're using Lucene, you might take a look at SOLR. I do realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record and the fact that there are a lot of useful resources out there, SOLR might help you get your job done.
Also, the answer that #mindas pointed to, links to the blog post describing the technical details at how to accomplish your goal with SOLR (but you probably already read that in meantime).

If I am getting correctly you have your own database, and you are searching if its duplicate, or copy/similar, in database while/after user uploads.
If That is the case, the domain is very big in comparison..
1) For Image you must use pattern matching, there are few papers available for image duplicate finder, on net, search for them you will get many options for that,
2) for Document there is again characteristically division
DOC(x)
PDF
TXT
RTF, etc..
Each document carry different property, now here Lucene may help you but its search engine,
While searching for Language pattern, there are many things we need to check, as you are searching for similar(not exact same).
So, fuzzy language program will come handy.
This requirement is too large that the forum page will not be enough to explain everything anyways, I hope this much will do

Related

MongoDB + Elasticsearch or only Elasticsearch?

We have a new project there for index a large amount of data and for provide real time. I have also complexe search with facets, full text, geospatial...
The first prototype is to index in MongoDB and next, into Elasticsearch, because I had read that Elasticsearch does not apply a checksum on stored files and the index can't be fully trusted.
But since last versions (in the version 1.5), there is now a checksum and I'm guessing if we can use Elasticsearch as primary data store ? And what is the benefit to use MongoDB in addition to Elasticsearch ?
I can't find up to date answer about thoses features in Elasticsearch
Thanks a lot

Talking about arguments to use Mongo instead of/together with ES:
User/role management.
Built-in in MongoDB. May not fit all your needs, may be clumsy somewhere, but it exists and it was implemented pretty long time ago.
The only thing for security in ES is shield. But it ships only for Gold/Platinum subscription for production use.
Schema
ES is schemaless, but its built on top of Lucene and written in Java. The core idea of this tool - index and search documents, and working this way requires index consistency. At back end, all documents should be fitted in flat lucene index, which requires some understanding about how ES should deal with your nested documents and values, and how you should organize your indexes to maintain balance between speed and data completeness/consistency. Working with ES requires you to keep some things about schema in mind constantly. I.e: as you can index almost anything to ES without putting corresponding mapping in advance, ES can "guess" mapping on the fly but sometimes do it wrong and sometimes implicit mapping is evil, because once it put, it can't be changed w/o reindexing whole index. So, its better to not treat ES as schemaless store, because you can step on a rake some time (and this will be pain :) ), but rather treat it as schema-intensive, at least when you work with documents, that can be sliced to concrete fields.
Mongo, on the other hand, can "chew and leave no crumbs" out of almost anything you put in it. And most your queries will work fine, `til you remember how Mongo will deal with your data from JavaScript perspective. And as JS is weakly typed, you can work with really schemaless workflow (for sure, if you need such)
Handling non-table-like data.
ES is limited to handle data without putting it to search index. And this solution is good enough, when you need to store and retrieve some extra data (comparing to data you want to search against).
MongoDB supports gridFS. This gives you ability to handle large chunks of data behind the same interface. I.e., you can store binary data in Mongo and retrieve it within the same interface, from your code perspective.

Well, choose the right tool for the right job. If you require searching capabilities such as full text search, faceting etc, then nothing can beat a full fledged search engine. ElasticSearch(ES) or Solr is just a matter of choice.
You can actually feed(index) documents into ES for searching and then fetch the complete details for a particular entry from MongoDB or any other database.
I can make your task easier, do take a look at my open source work that's using MongoDB, ES, Redis and RabbitMQ, all integrated at one place, here on github
Please note that the application is built in .Net C#.

After having used Elasticsearch on production, I can add up to this thread few notes :
We securized our Elasticsearch clustering via a reverse proxy which check client certificate authenticity at request time before letting the query in : it proves that there is multiple way to add authentication anyway. (If you need more accuracy in security, like by using roles, there is few plugins that can be added to manage permissions)
Elasticsearch mapping and settings (tuning) are really important concepts to fully understand before going on production with it, and that's no that easy to get how everything works quickly.
Clustering and horizontal scaling is very flexible and easy to set up
The suite tools (Kibana, beats, etc ..) are a very convinient way to gather logs, expose key data, etc ...
Search features are extremely advanced, you can really do amazing things when you master a bit how full text search works (fuzzyness, boosting, scoring, stemming, tokenizer, analyzers, and so on ...).
API's are a bit scattered and there is not unique ways to achieve something. And some API are really WTF to use, like the bulk insert API: you need to pass binary data, with JSON format (ofc don't forget end of line characters) and repeating some fields multiple times. This is very verbose and I guess it's legacy code like we all have in our projects ;).
Last thing : if you develop a Java project, do not use Hibernate Search to duplicate data from a datasource to your ES cluster, we had so much issues with Hibernate Search, if we had to do that again, we'd do that manually.
Now about the real question :
To my mind, using only Elasticsearch is sufficient and may reduce complexity of having a multiple NoSQL storage systems.
I think it's worthy when you are doing a duo Relational and Transactional database + NoSQL search engine, but having two system which roughly serves the same purposes is a bit overkilled

I have recently developed a feature in my company,
we wanted to perform some searches and rank the result according to its relevance on multiple factors and conditions.
So in my application, we were already using MongoDB as Db,
So on ElasticSearch index, I exported some of the fields from MongoDB that I want to perform search and filters on.
So according to required conditions I prepared my mongo query and elasticsearch query also and performed the search. Then I filtered and sorted the result according to my need.
The whole flow will was designed in such a way that,
even if there is an error from ES, mongo will fetch the records.
If I get the result from ES then, mongo result will depend on ES result.
This is how I used mongo and ES in combination.
Also, don't forget to properly handle all updates, deletes and new record insertions.
And Just to Know, results for me were Really Good.

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx 200k) that have a title and abstract. There is other meta data available for each document for example category - (only one of cooking, health, exercise etc), genre - (only one of humour, action, anger) etc. The meta data is well structured and all this is available in a MySql DB.
I need to show to our user related documents while she is reading one of these document on our site. I need to provide the product managers weight-ages for title, abstract and meta data to experiment with this service.
I am planning to run clustering on top of this data, but am hampered by the fact that all Mahout Clustering example use either DenseVectors formulated on top of numbers, or Lucene based text vectorization.
The examples are either numeric data only or text data only. Has any one solved this kind of a problem before. I have been reading Mahout in Action book and the Mahout Wiki, without much success.
I can do this from the first principles - extract all titles and abstracts in to a DB, calculate TFIDF & LLR, treat each word as a dimension and go about this experiment with a lot of code writing. That seems like a longish way to the solution.
That in a nutshell is where I am trapped - am I doomed to the first principles or there exist a tool / methodology that I somehow missed. I would love to hear from folks out there who have solved similar problem.
Thanks in advance

You have a text similarity problem here and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done. Then feed it into whatever clusterer you want. The term extractions is not something you do with Mahout, though there are certainly libraries and tools that are good at it.

I'm actually working on something similar, but without the need of distinciton between numeric and text fields.
I have decided to go with the semanticvectors package which does all the part about tfidf, the semantic space vectors building, and the similarity search. It uses a lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road of course).
The only caveat I'm facing with this approach is that the indexing part can't be iterative. I have to index everything every time a new document is added, or an old document is modified. People using semanticvectors say they have very good indexing times. But I don't know how large their corpora are. I'm going to test these issues with the wikipedia dump to see how fast it can be.

full text search in java web applicatioin [duplicate]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 12 months ago.
The community reviewed whether to reopen this question 12 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I'm building a Django site and I am looking for a search engine.
A few candidates:
Lucene/Lucene with Compass/Solr
Sphinx
Postgresql built-in full text search
MySQl built-in full text search
Selection criteria:
result relevance and ranking
searching and indexing speed
ease of use and ease of integration with Django
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
scalability
extra features such as "did you mean?", related searches, etc
Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.
EDIT: As for indexing needs, as users keep entering data into the site, those data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in index with no more than 15 - 30 minutes delay

Good to see someone's chimed in about Lucene - because I've no idea about that.
Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.
Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.
I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.
I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.

I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.
Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.
As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.
It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.

Apache Solr
Apart from answering OP's queries, Let me throw some insights on Apache Solr from simple introduction to detailed installation and implementation.
Simple Introduction
Anyone who has had experience with the search engines above, or other
engines not in the list -- I would love to hear your opinions.
Solr shouldn't be used to solve real-time problems. For search engines, Solr is pretty much game and works flawlessly.
Solr works fine on High Traffic web-applications (I read somewhere that it is not suited for this, but I am backing up that statement). It utilizes the RAM, not the CPU.
result relevance and ranking
The boost helps you rank your results show up on top. Say, you're trying to search for a name john in the fields firstname and lastname, and you want to give relevancy to the firstname field, then you need to boost up the firstname field as shown.
http://localhost:8983/solr/collection1/select?q=firstname:john^2&lastname:john
As you can see, firstname field is boosted up with a score of 2.
More on SolrRelevancy
searching and indexing speed
The speed is unbelievably fast and no compromise on that. The reason I moved to Solr.
Regarding the indexing speed, Solr can also handle JOINS from your database tables. A higher and complex JOIN do affect the indexing speed. However, an enormous RAM config can easily tackle this situation.
The higher the RAM, The faster the indexing speed of Solr is.
ease of use and ease of integration with Django
Never attempted to integrate Solr and Django, however you can achieve to do that with Haystack. I found some interesting article on the same and here's the github for it.
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
Solr breeds on RAM, so if the RAM is high, you don't to have to worry about Solr.
Solr's RAM usage shoots up on full-indexing if you have some billion records, you could smartly make use of Delta imports to tackle this situation. As explained, Solr is only a near real-time solution.
scalability
Solr is highly scalable. Have a look on SolrCloud.
Some key features of it.
Shards (or sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
Load Balancing (if Solrj is used with Solr cloud it automatically takes care of load-balancing using it's Round-Robin mechanism)
Distributed Search
High Availability
extra features such as "did you mean?", related searches, etc
For the above scenario, you could use the SpellCheckComponent that is packed up with Solr. There are a lot other features, The SnowballPorterFilterFactory helps to retrieve records say if you typed, books instead of book, you will be presented with results related to book.
This answer broadly focuses on Apache Solr & MySQL. Django is out of scope.
Assuming that you are under LINUX environment, you could proceed to this article further. (mine was an Ubuntu 14.04 version)
Detailed Installation
Getting Started
Download Apache Solr from here. That would be version is 4.8.1. You could download new versions, I found this stable.
After downloading the archive , extract it to a folder of your choice.
Say .. Downloads or whatever.. So it will look like Downloads/solr-4.8.1/
On your prompt.. Navigate inside the directory
shankar#shankar-lenovo: cd Downloads/solr-4.8.1
So now you are here ..
shankar#shankar-lenovo: ~/Downloads/solr-4.8.1$
Start the Jetty Application Server
Jetty is available inside the examples folder of the solr-4.8.1 directory , so navigate inside that and start the Jetty Application Server.
shankar#shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar
Now , do not close the terminal , minimize it and let it stay aside.
( TIP : Use & after start.jar to make the Jetty Server run in the
background )
To check if Apache Solr runs successfully, visit this URL on the browser. http://localhost:8983/solr
Running Jetty on custom Port
It runs on the port 8983 as default. You could change the port either here or directly inside the jetty.xml file.
java -Djetty.port=9091 -jar start.jar
Download the JConnector
This JAR file acts as a bridge between MySQL and JDBC , Download the Platform Independent Version here
After downloading it, extract the folder and copy themysql-connector-java-5.1.31-bin.jar and paste it to the lib directory.
shankar#shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib
Creating the MySQL table to be linked to Apache Solr
To put Solr to use, You need to have some tables and data to search for. For that, we will use MySQL for creating a table and pushing some random names and then we could use Solr to connect to MySQL and index that table and it's entries.
1.Table Structure
CREATE TABLE test_solr_mysql
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NULL,
created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
2.Populate the above table
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');
Getting inside the core and adding the lib directives
1.Navigate to
shankar#shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf
2.Modifying the solrconfig.xml
Add these two directives to this file..
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
Now add the DIH (Data Import Handler)
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler" >
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
3.Create the db-data-config.xml file
If the file exists then ignore, add these lines to that file. As you can see the first line, you need to provide the credentials of your MySQL database. The Database name, username and password.
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
<document>
<entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
OR `created` > '${dataimporter.last_index_time}'" >
<field name="id" column="rid" />
<field name="solr_name" column="name" />
</entity>
</document>
</dataConfig>
( TIP : You can have any number of entities but watch out for id field,
if they are same then indexing will skipped. )
4.Modify the schema.xml file
Add this to your schema.xml as shown..
<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />
Implementation
Indexing
This is where the real deal is. You need to do the indexing of data from MySQL to Solr inorder to make use of Solr Queries.
Step 1: Go to Solr Admin Panel
Hit the URL http://localhost:8983/solr on your browser. The screen opens like this.
As the marker indicates, go to Logging inorder to check if any of the above configuration has led to errors.
Step 2: Check your Logs
Ok so now you are here, As you can there are a lot of yellow messages (WARNINGS). Make sure you don't have error messages marked in red. Earlier, on our configuration we had added a select query on our db-data-config.xml, say if there were any errors on that query, it would have shown up here.
Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport
Step 3: DIH (Data Import Handler)
Using the DIH, you will be connecting to MySQL from Solr through the configuration file db-data-config.xml from the Solr interface and retrieve the 10 records from the database which gets indexed onto Solr.
To do that, Choose full-import , and check the options Clean and Commit. Now click Execute as shown.
Alternatively, you could use a direct full-import query like this too..
http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
After you clicked Execute, Solr begins to index the records, if there were any errors, it would say Indexing Failed and you have to go back to the Logging section to see what has gone wrong.
Assuming there are no errors with this configuration and if the indexing is successfully complete., you would get this notification.
Step 4: Running Solr Queries
Seems like everything went well, now you could use Solr Queries to query the data that was indexed. Click the Query on the left and then press Execute button on the bottom.
You will see the indexed records as shown.
The corresponding Solr query for listing all the records is
http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true
Well, there goes all 10 indexed records. Say, we need only names starting with Ja , in this case, you need to target the column name solr_name, Hence your query goes like this.
http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true
That's how you write Solr Queries. To read more about it, Check this beautiful article.

I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).
The answer at the link below details a few things about Sphinx which also applies to Solr.
Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Solr also provides the following additional features:
Supports replication
Multiple cores (think of these as separate databases with their own configuration and own indexes)
Boolean searches
Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
Update index via XML or delimited file
Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
PDF, Word document indexing
Dynamic fields
Facets
Aggregate fields
Stop words, synonyms, etc.
More Like this...
Index directly from the database with custom queries
Auto-suggest
Cache Autowarming
Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)
BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.
Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)

I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine, really good extended character and multilingual support, nice tight integration with text fields in the database.
But it doesn't have user-friendly search operators like + or AND (uses & | !) and I'm not thrilled with how it works on their documentation site. While it has bolding of match terms in the results snippets, the default algorithm for which match terms is not great. Also, if you want to index rtf, PDF, MS Office, you have to find and integrate a file format converter.
OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: http://www.searchtools.com/analysis/mediawiki-search/
In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to make search that satisfies almost everyone.
for SHAILI - SOLR includes the Lucene search code library and has the components to be a nice stand-alone search engine.

Just my two cents to this very old question. I would highly recommend taking a look at ElasticSearch.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
The advantages over other FTS (full text search) Engines are:
RESTful interface
Better scalability
Large community
Built by Lucene
developers
Extensive documentation
There are many open source libraries available (including Django)
We are using this search engine at our project and very happy with it.

SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."
FYIs, The MySQL fulltext min word length is adjustable since at least MySQL 5.0. Google 'mysql fulltext min length' for simple instructions.
That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...

I would add mnoGoSearch to the list. Extremely performant and flexible solution, which works as Google : indexer fetches data from multiple sites, You could use basic criterias, or invent Your own hooks to have maximal search quality. Also it could fetch the data directly from the database.
The solution is not so known today, but it feets maximum needs. You could compile and install it or on standalone server, or even on Your principal server, it doesn't need so much ressources as Solr, as it's written in C and runs perfectly even on small servers.
In the beginning You need to compile it Yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.
As You are using Django framework, You could use or PHP client in the middle, or find a solution in Python, I saw some articles.
And, of course mnoGoSearch is open source, GNU GPL.

We just switched from Elasticsearch to Postgres Full Text. Since we have already used Postgres, we now save ourselves the hassle of keeping the index up to date.
But this only affects the full text search. There are, however, use cases where Elasicsearch is significantly better. Maybe facets or something like that.

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like having descriptive field names (the ones I like the most) can take much space in the memory for big collections. I don't like the idea of giving them short and cryptic names to save memory, neither do I like the idea to translate field names to shortened fields somewhere in the application.
Is there a way to tell mongo not to store every field name as text?

For now the only thing you can do is to vote and wait for SERVER-863 to be solved. After almost a year of discussion the status of this issue has been changes to planned but not scheduled...
The workaround is to use document mapping libraries likes Spring Data Document or morphia (in Java world) and work with nicely named objects. But the underlying database names are still cryptic.

If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I provide any more guidance on which ODMs might fit your needs.

Full Text Searching in Apple's Core Data Framework

I would like to implement a full text search in an iPhone application. I have data stored in an sqlite database that I access via the Core Data framework. Just using predicates and ORing a bunch of "contains[cd]" phrases for every search word and column does not work well at all.
What have you done that seems to work well?

We have FTS3 working very nicely on 150,000+ records. We are getting subsecond query times returning over 200 results on a single keyword query.
Presently the only way to get Sqlite FTS3 working on the iPhone is to compile your own binary and link it to your project. To my knowledge, the binary included in your own project will not work with Core Data. Perhaps Apple will turn on the FTS3 compiler option in a future release?
You can still link in your own Sqlite FTS3 binary and use it just for full text searches. This would be very similar to the way Sphinx or Lucene are used in Web App environments. Note you will still have to update the search index at some point to keep synchronicity with the Core Data stores.
Good luck !!

I assume that by "does not work well" you mean 'performs badly'. Full-text search is always relatively slow, especially in memory or space constrained environments. You may be able to speed things up by making sure the attributes you're searching against are indexed and using BEGINSWITH[cd] instead of CONTAINS[cd]. My recollection (can't find the cocoa-dev post at this time) is that SQLite will use the index for prefix matching, but falls back to linear search for infix searches.

I use contains[cd] in my predicate and this works fine. Perhaps you could post your predicate and we could see if there's an obvious fault.

Sqlite has its own full text indexing module: http://sqlite.org/fts3.html
You have to have full control of the SQL you send to the db (I don't know how Core Data works), but using the full text indexing module is key to speed of execution and simplicity in your SQL SELECT statements that do full text searching.
Using CONTAINS is fine if you don't need fast execution, but selects made with it can't make use of regular indexes so are destined to be slow, and the larger the database the slower it will be. Using real full text indexing allows same sort of searches as you can do with 'CONTAINS', but things are indexed for fast results even with large db's.

I've been working on this same problem and just got around to following up on my post about this from a few weeks ago. Instead of using CONTAINS, I created a separate entity with an instance for each canonicalized word. I added an index on the words (in XCode model builder) and can then use a BEGINSWITH operator to exploit the index. Nevertheless, as I just posted a few minutes ago, query time is still very slow for even small data sets.
There must be a better way! After all, we see this sort of full text search in lots of apps!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse