Are there alternative/deeper ways to analyze a Sphinx query besides `SHOW META`?

I have a complex Sphinx config which uses regexp filters and wordforms as well.
I often get unexpected results, and have started using SHOW META to see what my various manipulations are actually searching for.
I have records 1, 2, and 3 that should each be found by SphinxQL query A or SphinxQL query B. Yet only 1 and 2 are found by query A, and only 3 is found by query B.
When I run SHOW META, however, it shows me that the expanded keywords for query A and query B are identical.
I am unclear why all 3 records are not found by both queries, when each is found by one or the other and both queries ultimately request the same keywords.
In fact, if I manually do a SphinxQL search for the same keywords that each of the original queries ended up producing (as reported by SHOW META), I get all the records.
Is there another analysis tool in Sphinx that might help me uncover this mystery?

Well, I mentioned a range of tools in another post of yours:
How to see what Sphinx is actually finding?
In particular, SHOW PLAN might be helpful, as might the data from PACKEDFACTORS() - lots of information, but in particular it can show, for example, which fields each document matched.
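As a sketch (the index name and keywords here are placeholders, not taken from the question), these diagnostics can be chained in one SphinxQL session. Note that PACKEDFACTORS() is only available with an expression ranker, and SHOW PLAN reports the transformed query tree for the most recent search:

```sql
-- Run the search with an expression ranker so PACKEDFACTORS()
-- can report per-document, per-field match factors:
SELECT id, PACKEDFACTORS()
FROM myindex
WHERE MATCH('keyword1 keyword2')
OPTION ranker=expr('sum(lcs)*1000+bm25');

-- Then, in the same session, inspect how searchd actually
-- transformed the query (after regexp filters, wordforms, etc.):
SHOW PLAN;

-- And the per-keyword docs/hits statistics:
SHOW META;
```

Comparing the SHOW PLAN output for query A and query B is often more revealing than SHOW META alone, since it shows the final operator tree rather than just the keyword list.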

Related

How do the search engines of databases like MongoDB and Postgres compare to something like Elasticsearch? (in 2020)

I am working on a website where users can create entries in a database. These entries all have the same form, so I have been using SQLite (with FTS5) so far, which is the only database I know ;)
However, it is crucial that users can search these entries on the website. The full-text search is working decently well (the users know approximately what they are looking for), but I need two improvements:
Wrong spelling should still return the correct entry (I have seen the spellfix extension for SQLite for that, but I don't know how well it works).
More importantly, if a user enters a few query words on the website, I try to MATCH those with an SQL query. If the user enters too many words, it returns 0 matches.
For example, if a user inputs "sars covid 19" into the search bar:
CREATE VIRTUAL TABLE TEST USING FTS5(SomeText);
INSERT INTO TEST(SomeText) VALUES
('Covid 19');
SELECT SomeText
FROM TEST
WHERE SomeText MATCH 'sars covid 19';
=> 0 matches, but I would want it to return the 'Covid 19' entry.
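As an aside, one way to get that behaviour in FTS5 itself is to OR the user's terms rather than AND-ing them (the default), and sort by FTS5's built-in bm25 rank so rows matching more terms come first - a sketch against the table above:

```sql
-- Quote each term and join with OR so any single hit qualifies;
-- FTS5's rank column (bm25) sorts best matches first.
SELECT SomeText
FROM TEST
WHERE SomeText MATCH '"sars" OR "covid" OR "19"'
ORDER BY rank;
-- now returns the 'Covid 19' row, since two of the three terms match
```

Building that OR string from the raw user input (with the terms individually quoted) also guards against users accidentally typing FTS5 operators.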
I realise that SQLite might be too simplistic for a database that needs to handle searches well. Now my question is: do Postgres or MongoDB have search engines that include the functionality I need? Or is it worth diving into solutions like Elasticsearch?
Most articles I found on this are 5-10 years old, so I am wondering what the current state of affairs is regarding search engines in databases. Any hints are greatly appreciated.
The combination of ES + MongoDB works well: you index and perform the full-text search in ES, and you keep the original documents, with some key fields indexed, in MongoDB.
Elasticsearch will work for sure. You only have to think about how you will index your documents, and you will be able to find them the way you indexed them. In your context it seems the default text mapping with a match query will work:
https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
MongoDB will work too in this simple case: https://docs.mongodb.com/manual/text-search/ - but Mongo won't let you configure the tokenizer, so if you need to upgrade your text search later, Mongo will be limiting.
PostgreSQL could do it, using LIKE, but I am not familiar enough with it. If you have 10k entries, it will be fine for sure; if you expect 1 million, Mongo or ES would be better.
If you have to choose between MongoDB and ES, you have to be more specific in your question. For full-text search, ES is really nice, with a lot of features; MongoDB gives you some nice database tools too. Sometimes ES will be better, sometimes Mongo - it depends on what you need. If you only want full-text search, ES is a must.

Filter unnecessary words when doing full text search in PostgreSQL

I've created full-text search in PostgreSQL based on this wonderful article.
It works well enough, but one thing needs to be fixed.
Say I have a blog post in my DB with the text:
"All kittens go to heaven"
If a user searches for "All kittens go to heaven, may be..." the DB will return nothing, because the words "may be" are not found.
I can post my SQL query, but it's pretty much the same as described in the article.
Is there a way to return the articles which contain most of the searched words?
This is a fundamental problem with PostgreSQL's text search.
You could try to pre-parse the query and strip out any terms that aren't in the "corpus" of terms across all your documents, but that doesn't really solve your problem.
You could try changing your query to 'or' all the terms together, but this could have performance problems.
The best bet would be to try the smlar extension (written by the text search authors), which can use cosine/TF-IDF weighting. This means the query can contain terms that aren't in the document and still match.
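For comparison, the 'or' approach mentioned above looks roughly like this - a sketch, assuming a hypothetical `posts` table with a precomputed `tsv` tsvector column, as in the usual tutorials:

```sql
-- OR the lexemes instead of AND-ing them, then rank so documents
-- matching more of the searched words sort first:
SELECT title, ts_rank(tsv, query) AS score
FROM posts,
     to_tsquery('english', 'kittens | go | heaven | may | be') AS query
WHERE tsv @@ query
ORDER BY score DESC;
```

Because ts_rank scores documents by how many query lexemes they contain, "All kittens go to heaven" would rank above documents matching only "may" or "be", which approximates the "most of the searched words" behaviour the question asks for.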

Full Text Search & Inverted Indexes in MongoDB

I'm playing around with MongoDB at the moment to see what nice features it has. I've created a small test suite representing a simple blog system with posts, authors and comments - very basic.
I've experimented with a search function which uses the MongoRegex class (PHP driver), where I'm just searching through all post content and post titles for the sentence 'lorem ipsum', case-insensitively via the /i flag.
My code looks like this:
$regex = new MongoRegex('/lorem ipsum/i');
$query = array('post' => $regex, 'post_title' => $regex);
But I'm confused and stunned about what happens. I time every query (I record microtime before and after the query and take the difference, with 15 decimals).
For my first test I added 110,000 blog documents and 5,000 authors, everything randomly generated. When I run my search, it finds 6,824 posts containing the sentence "lorem ipsum", and it takes 0.000057935714722 seconds to do the search. This is after restarting the MongoDB service (on Windows), and without any index other than the default one on _id.
MongoDB uses a B-tree index, which most definitely isn't very efficient for full-text search. If I create an index on my post-content attribute, the same query runs in 0.000150918960571 seconds - which, funnily enough, is slower than without any index (slower by 0.000092983245849 seconds). This can happen for several reasons because it uses a B-tree cursor.
I've tried to find an explanation for how it can run the query so fast. I guess it probably keeps everything in RAM (I've got 4 GB and the database is about 500 MB), which is why I restart the MongoDB service before measuring, to get an honest result.
Can anyone with experience with MongoDB help me understand what is going on with this kind of full-text search, with or without an index - and definitely without an inverted index?
Sincerely
- Mestika
I think you simply didn't iterate over the results. With just a find(), the driver will not send the query to the server; you need to fetch at least one result for that. I don't believe MongoDB is this fast - I believe the error is in your benchmark.
As a second thing: for a regular-expression search that is not anchored at the beginning of the field's value with ^, no index is used at all. You should play with explain() to see what is actually happening.

Running a second search against the results of the first

Here's the scenario:
I have an indexed database table with over a half-million records.
Search #1 is run against this table to generate, say, the 20 best matches, which are then ordered descending by relevance.
Search #2 needs to be run against only these results. This search may, or may not, be the exact same query as Search #1. Regardless, it needs to create a second, independent, set of weights against the results of Search #1.
Any pointers or suggestions on how to go about implementing something like this?
Not asking for someone to write me a solution - tips on what methods and objects to look at would be significantly helpful.
Thanks!
You should be able to do
... WHERE id IN ({list of ids from q1})
in SphinxQL for your second query.
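A minimal sketch of that two-pass flow (the index name, queries, and ids are placeholders): run the first search, collect the ids, then constrain the second search - with its own, independent ranking - to just those documents:

```sql
-- Pass 1: top 20 matches for the first query, best first
SELECT id FROM myindex
WHERE MATCH('first query')
ORDER BY WEIGHT() DESC
LIMIT 20;

-- Pass 2: re-rank only those documents against the second query;
-- WEIGHT() here is a fresh set of weights, independent of pass 1
SELECT id, WEIGHT() AS w FROM myindex
WHERE MATCH('second query') AND id IN (101, 205, 317)
ORDER BY w DESC;
```

Since `id` is an implicit attribute in Sphinx, the IN filter is cheap: the second pass only scores the 20 candidate documents rather than the full half-million-row table.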

Search multiple indices at once using Lucene Search

I am using Zend_Search_Lucene to implement site search. I created separate indices for different data types (e.g. one for users, one for posts, etc.). The results are similarly divided by data type; however, there is an 'all' option which should show a combination of the different result types. Is it possible to search across the different indices at once, or do I have to index everything in an 'all' index?
Update: the readme for ZF 1.8 suggests that this is now possible in ZF 1.8, but I've been unable to track down where it is in the documentation.
So, after some research: you have to use Zend_Search_Lucene_Interface_MultiSearcher. I don't see any mention of it in the documentation as of this writing, but if you look at the actual class in ZF 1.8, it's straightforward to use:
$index = new Zend_Search_Lucene_Interface_MultiSearcher();
$index->addIndex(Zend_Search_Lucene::open('search/index1'));
$index->addIndex(Zend_Search_Lucene::open('search/index2'));
$index->find('someSearchQuery');
NB: it doesn't follow the PEAR naming convention, so it won't work with Zend_Loader::loadClass.
That's exactly how I handled search for huddler.com. I used multiple Zend_Search_Lucene indexes, one per data type. For the "all" option, I simply had another index which included everything from all the indexes - so when I added docs, I added them twice: once to the appropriate "type" index, and once to the "all" index. Zend Lucene is severely under-featured compared to other Lucene implementations, so this was the best solution I found. You'll find that Zend's port supports only a subset of the Lucene query syntax, and poorly - even on moderate indexes (10-100 MB), queries as simple as "a*", or quoted phrases, fail to perform adequately (if at all).
When we brought a large site onto our platform, we discovered that Zend Lucene doesn't scale. Our index reached roughly 1.0 GB, and simple queries took up to 15 seconds. Some queries took a minute or longer. Building the index from scratch took about 20 hours.
I switched to Solr. Solr not only performs 50x faster during indexing and 1000x faster on many queries (most queries finish in under 5 ms, all finish in under 100 ms), it's also far more powerful. We were able to rebuild our 100,000+ document index from scratch in 30 minutes (down from 20 hours).
Now everything is in one Solr index with a "type" field. For each search I run multiple queries against the index, each with a different "type:" filter query, plus one without a "type:" filter for the "all" option.
If you plan on growing your index past 100 MB, receive at least a few search requests per minute, or want to offer any sort of advanced search functionality, I strongly recommend abandoning Zend_Search_Lucene.
I don't know how it integrates with Zend, but in (Java) Lucene one would use a MultiSearcher instead of the usual IndexSearcher.