Sphinx - How to get an Exact Match, i.e. same as mysql WHERE column = "value"

Sphinx - How to get an Exact Match, i.e. same as mysql WHERE column = "value" - sphinx

I've got a real time index containing information on people (a definition is included below). The problem is that I'm trying to run an exact match on a phone number and email address and no matter what I try, I'm getting matches even if the database column values contains what I've searched for, not where the column value exactly matches.
The query I'm using is:
SELECT id, first_name,last_name,email_personal, phone_number, WEIGHT() as relevance FROM people WHERE MATCH('#(phone_number,email_personal) "^+447111$" "^myemail#gmail\.com$ "');
That returns rows that contains a full phone number (i.e. +44711122334), as far as I understand it, shouldn't, it should be trying to match "^+447111$" as the start & end of the field?
I've also tried this test query and have much the same issue, apart from the fact it returns a lot more matches, as it would do it was matching any of the field values containing the criteria, rather than the whole field value. The values aren't the full values I'm looking for, but this is a test as it should be matching rows that only have a phone number of "+447711" and email of "#gmail.com", which don't exist in the database, but it does return rows, where the phone number starts with +447711 and the email has #gmail.com in it.
SELECT id, first_name,last_name,email_personal,phone_number, WEIGHT() as relevance FROM people WHERE MATCH('#phone_number "^+447711$" #email_personal "^#gmail\.co$"') ORDER BY relevance DESC;
Just to confirm, I'm trying to find matches where the values of the fields match the exact text, i.e. this would be the SQL query (and yes, this doesn't work either!)
SELECT id,first_name,last_name,email_personal,phone_number FROM people WHERE phone_number = '+44711122334' AND email_personal = 'myemail#gmail.com';
Config:
index people
{
type = rt
path = /var/local/sphinx/indexes/ppl/
rt_field = first_name
rt_field = last_name
rt_field = phone_number
rt_field = email_personal
stored_fields = first_name,last_name,phone_number,email_personal
rt_mem_limit = 512M
expand_keywords = 1
min_prefix_len = 2
min_word_len = 2
index_exact_words = 1
}

bah! This is always the way. You spend hours trying to figure it out, post it to StackOverflow and then within a few moments the answer jumps out at you.
It turns out it was the 'expand_keywords' setting in the config that was responsible. For those who don't know, this is what it does...
Queries against indexes with expand_keywords feature enabled are internally expanded as follows. If the index was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of keyword itself and a respective prefix or infix (keyword with stars). If the index was built with both stemming and index_exact_words enabled, exact form is also added. Here's an example that shows how internal expansion works when all of the above (infixes, stemming, and exact words) are combined:
running -> ( running | *running* | =running )
So that despite trying to search for exact matches, that was causing it to always expand and search for the text within the column, not that the column exactly matched.
Taking that line out of the config & restarting Sphinx solved the issue straight away, you don't even need to re-index, which is good.
I thought I'd leave the question and answer here incase anyone else has a similar "issue" ;)

Related

Sphinx Mulit-Level Sort with Randomize

Here is my challenge with Sphinx Sort where I have Vendors who pay for premium placement and those who don't:
I already do a multi-level order including the PaidVendorStatus which is either 0 or 1 as:
order by PaidVendorStatus,Weight()
So in essence I end up with multiple sort groups:
PaidVendorStatus=1, Weight1
....
PaidVendorStatus=1, WeightN
PaidVendorStatus=0, Weight1
...
PaidVendorStatus=0, WeightN
The problem is I have three goals:
Randomly prioritize each vendor in any given sort group
Have each vendor's 'odds' of being randomly assigned top position be equal regardless of how many records they have returned in the group (so if Vendor A has 50 results and VendorB has 2 results they still both have 50% odds of being randomly assigned any given spot)
Ideally, maintain the same results order in any given search (so that if the user searches again the same order will be displayed
I've tried various solutions:
Select CRC32(Vendor) as RANDOM...Order by PaidVendorStatus,Weight(),RANDOM
which solves 2 and 3 except due to the nature of CRC32 ALWAYS puts the same vendor first (and second, third, etc.) so in essence does not solve the issue at all.
I tried making a sphinx sql_attr_string in my Sphinx Configuration which was a concatenation of Vendor and the record Title (Select... concat(Vendor,Title) as RANDOMIZER..)` and then used that to randomize
Select CRC32(RANDOMIZER) as RANDOM...
which solves 1 and 3 as now the Title field gets thrown in the randomization mis so that the same Vendor does not always get first billing. However, it fails at 2 since in essence I am only sorting by Title and thus Vendor B with two results now has a very low change of being sorted first.
In an ideal world naturally I could just order this way;
Order by PaidVendorStatus,Weight(),RAND(Vendor)
but that is not possible.
Any thoughts on this appreciated. I did btw check out as per Barry Hunter's suggestion this thread on UDF but unless I am not understanding it at all (possible) it does not seem to be the solution for this problem.

Well one idea is:
SELECT * FROM (
SELECT *,uniqueserial(vendor_id) AS sorter FROM index WHERE MATCH(...)
ORDER BY PaidVendorStatus DESC ,Weight() DESC LIMIT 1000
) ORDER BY sorter DESC, WEIGHT() DESC:
This exploits SPhixnes 'multiple sort' function with pysudeo subquery.
This works wors becasuse the inner query is sorted by PaidVendor first, so their items are fist. Which works to affect the ordr that unqique serial is called in.
Its NOT really 'randomising' the results as such, seems you jsut randomising them to mix up the vendors (so a single vendor doesnt domninate results. Uniqueserial works by 'spreading' the particular vendors results out - the results will tend to cycle through the vendors.
This is tricky as it exploits a relative undocumented sphinx feature - subqueries.
For the UDF see http://svn.geograph.org.uk/svn/modules/trunk/sphinx/

Still dont have an answer for your biased random (as in 2.)
but just remembered another feature taht can help with 3. - can supply s specific seed to the random number. Typically random generators are seeded from the current time, which gives ever changing values, But using a specific seed.
Seed is however a number, so need a predictable, but changing number. Could CRC the query?
... sphinx doesnt support expressions in the OPTION so would have to caculate the hash in the app.
<?php
$query = $db->Quote($_GET['q']);
$crc = crc32($query);
$sql = "SELECT id,IDIV(WEIGHT(),100) as i,RAND() as r FROM index WHERE MATCH($query)
ORDER BY PaidVendorStatus DESC,i DESC,r ASC OPTION random_seed=$crc";
If wanted the results to only slowly evolve, add the current date, so each day is a new selection...
$crc = crc32($query.date('Ymd'));

Slow query with order and limit clause but only if there are no records

I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob' the query time is fast on a table of 10M records (<.5 seconds)
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different than this post as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need that there. I've tried re-writing it (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1

Examine the execution plan to see which index is being used. In this case, the separate indexes for name and address are not enough. You should create a combined index of name, then address for this query.
Think of an index as a system maintained copy of certain columns, in a different order from the original. In this case, you want to first find matches by name, then tie-break on address, then take until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first. Then address will serve as our tie-breaker.
Under the original indexes, if the address index is the one chosen then the query's speed will vary based on how quickly it can find matches.
The plan (in english) would be: Proceed through all of the rows which happen to already be sorted by address, discard any that do not match the name, keep going until we have enough.
So if you do not get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: Proceed through all of the name matching rows which happen to already be sorted by address. Start with the first one and take them until you have enough. If you run out, stop.

Since the situation is that a query without the Order By is much faster than the one with the Order By clause; I'd make 2 queries:
-One without the order by, limit 1, to know if you have at least one record.
In the case you have at least one, it's safe to run the query with Order by.
-If there's no record, no need to run the second query.
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it will be useful to know which indices you have created. Without these it will be very hard to give you a proper solution!

Return rows where words match in two columns OR in match in one column and the other column is empty?

This is a follow-up to another question I recently asked.
I currently have a SphinxQL query like this:
SELECT * FROM my_index
WHERE MATCH(\'#field1 "a few words"/1 #field2 "more text here"/1\')
However, I would still like it to match rows in the case where one of the fields in the row is empty.
For example, let's say the following rows exist in the database:
field1 | field2
-----------------------
words in here | text in here
| text in here
The above query would match the first row, but it would not match the second row because the quorum operator specifies that there has to be one or more matches for each field.
Is what I'm asking possible?
The actual query I'm trying to make this work with was provided in Barry Hunter's answer to my previous question:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('#tags "cute hairy happy"/1 #tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);

First problem is sphinx doesn't index "empty" so you can't search for it. (well actually the field_len attribute will be zero. But it can be hard to combine attribute filter with MATCH())
... so arrange for empty to be something to index
sql_query = SELECT id,...,IF(tags='','_empty_',tags) AS tags FROM ...
Then modify the query. As it happens your quorum search is easy!
#field1 "a few words _empty_"/1
Its just another word. But a more complex query would just have to be OR'ed with the word.
Then there is making it work within your complex query. But as luck would have it, its really easy. _empty_ is just another word. And in the case of the field being empty, one word will match. (ie there are no words in the field, not in the query)
So just add _empty_ into the two quorums and you done!

simplest example of a query by date range in cassandra 1.x

I want to store an ID and a date and I want to retrieve all entries from dateA up to dateB, what exactly do I need to be able to perform select from my_column_family where date >= dateA and date < dateB; ?

the guys at #cassandra (IRC) helped me find a way, there's many subtle details so I'd like to document that here.
first you need to declare a column family similar to this (examples from cassandra-cli):
create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
{column_name: id, validation_class: LongType}
{column_name: name, validation_class: UTF8Type, index_type: KEYS}
{column_name: age, validation_class: LongType}
];
few important things about this declaration:
the comparator and key_validation_class are there to be able to use strings as key names
the first declared column is special, it's the "row key" which is used to address each row and therefore cannot contain duplicate values (the INSERT is really an UPSERT so when there's duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
the dates are stored as Long datatypes, interpretation is up to the client
now let's add some values:
set users[1][name] = john;
set users[1][age] = 19;
set users[2][name] = jane;
set users[2][age] = 21;
set users[3][name] = john;
set users[3][age] = 32;
according to this: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/ Cassandra does not support the < operators, what it does is to manually exclude the rows that don't match but it does that AFTER there's a resultset and it also refuses to do so unless and actual filtering has taken place.
what that means is that a query like get users where age > 20; will return null but if we add a predicate that includes = it'll magically work.
here's where the secondary index is important, without it you can't use = so on this example I can do get users where name = jane; but I cannot ask for get users where age = 21;
the funny thing is that, after using = the < works so having a secondary index allows you to ask for get users where name = john and age > 20; and it'll filter correctly.

There are a few ways to solve this. The simplest is probably the secondary index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid', setting the value to 1. Then the queries can become where valid=1 and date>nnnn
The other solutions require additional column families and additional queries.
When loading the data, create and add to a column family which contains the timestamps as keys, and each entry would list all the user ids as column names.
If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, using the column values for each user id and if needed, query the original column family for the data associated with each id. Cassandra always stores the column names sorted, and can be reversed when reading.
But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.
Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.
I've used variations on this, usually adding Composite keys or columns for additional flexibility.

Sqlite statement match exact string iphone

I have the following words in table database:
name id
aba 0
abac 1
abaca 2
abace 3
I want to make the following select:
Select id from words where name='abaca';
I tried but it doesn't work. I want the exact match of the word I'm entering to match.

I tried with your entries and following query does work perfectly fine for an Exact match
SELECT * FROM Match WHERE title='abaca';
Where Match is the table name with following schema
CREATE TABLE Match (title text,recordId integer)
Following is the record inserted
Though I made the same mistake by Taking Match as Table name as you did by taking id as the column name, but I only realized it while posting the code. But that should really not make much difference here.
Have you used it in some code to fetch result from database? in case yes you should consider avoiding KEYWORDS as variable,column names.. Just a suggestion.