Set Sphinx relevance ranking for RT index

How do I include a ranking type (such as SPH_RANK_NONE) in a Sphinx RT query?
select id from my_index where match('hello')
order by date
limit 600;
Also, is there a way to just set it once, for example, in the config file?
Sphinx doc:
http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/

The default ranking mode is SPH_RANK_PROXIMITY_BM25 and it can't be changed using config.
This is how you set a ranking mode for a query (note that ORDER BY must have an explicit ASC/DESC clause):
SELECT id FROM my_index where MATCH('hello')
ORDER BY date DESC LIMIT 600 OPTION ranker=sph04;
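For the SPH_RANK_NONE mode mentioned in the question, the corresponding SphinxQL ranker name is none, so the same query becomes:
SELECT id FROM my_index WHERE MATCH('hello')
ORDER BY date DESC LIMIT 600 OPTION ranker=none;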
Relevant parts in the doc:
http://sphinxsearch.com/docs/current.html#weighting
http://sphinxsearch.com/docs/current.html#sphinxql-select

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (one from Germany, one from France, one from the UK). Is there a simple way to do that?
Normally, a simple GROUP BY would suffice for this type of solution, however as you have specified that you want to include ALL of the columns in the result, then we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window. Please update that (or the question) with any other field you would like; the PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
SELECT *
, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
) AS ranked
WHERE _rn = 1
The PARTITION BY makes the ROW_NUMBER() count restart at 1 for each distinct Country value, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause could have gone in the outer query if you really wanted, but ROW_NUMBER() can only appear in the SELECT or ORDER BY clauses of a query (and PostgreSQL requires an alias on the derived table), so to use it as a filter criterion we are forced to wrap the results in a subquery.
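As a side note not in the original answer, PostgreSQL also has a DISTINCT ON shortcut for this exact one-row-per-group pattern; a minimal sketch against the same Customers table:
SELECT DISTINCT ON (Country) *
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
ORDER BY Country, Name;
DISTINCT ON keeps the first row of each set of rows sharing the same Country, as determined by the ORDER BY.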

Postgres pagination with non-unique keys?

Suppose I have a table of events with (indexed) columns id : uuid and created : timestamp.
The id column is unique, but the created column is not. I would like to walk the table in chronological order using the created column.
Something like this:
SELECT * FROM events WHERE created >= $<after> ORDER BY created ASC LIMIT 10
Here $<after> is a template parameter that is taken from the previous query.
Now, I can see two issues with this:
Since created is not unique, the order will not be fully defined. Perhaps the sort should be id, created?
Each row should only be on one page, but with this query the last row is always included on the next page.
How should I go about this in Postgres?
SELECT * FROM events
WHERE created >= $<after> and (id >= $<id> OR created > $<after>)
ORDER BY created ASC ,id ASC LIMIT 10
That way the events within each timestamp value will be ordered by id, and you can split pages anywhere.
You can say the same thing this way:
SELECT * FROM events
WHERE (created,id) >= ($<after>,$<id>)
ORDER BY created ASC ,id ASC LIMIT 10
and for me this produces a slightly better plan.
An index on (created, id) will help performance most, but in many circumstances an index on created alone may suffice.
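A sketch of that composite index, assuming the events table described in the question:
CREATE INDEX events_created_id_idx ON events (created, id);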
First, as you said, you should enforce a total ordering. Since the main thing you care about is created, you should start with that; id can be the secondary ordering, a tie breaker invisible to the user that just ensures the ordering is consistent. Second, instead of messing around with conditions on created, you could just use an OFFSET clause to return later results:
SELECT * FROM events ORDER BY created ASC, id ASC LIMIT 10 OFFSET <10 * page number>
-- Note that page number is zero based

Are there plans to add 'OR' to attribute searches in Sphinx?

A little background is in order for this question since it is on surface too generic:
Recently I ran into an issue where I had to move the attribute values I was pushing into my SphinxQL query into the full-text part of the query, because the attribute needed to be part of an 'OR' condition.
In other words I was doing:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3)
When I tried to add an 'OR' to the attributes such as:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
it failed because Sphinx 2.* does not support OR in the attribute query.
I was also unable to simply put the name and customer IDs into the query:
Select * from idx_test where MATCH('Terms (@name_id (1|2|3) | @customer_id (4|5|6))')
Because (as far as I can tell) you can't push integer fields into the full_text search.
My solution was to index the id fields a second time appended by _text:
Select name_id, name_id as name_id_text
and then add that to the field list:
sql_attr_uint = name_id
sql_field_string = name_id_text
sql_attr_uint = customer_id
sql_field_string = customer_id_text
So now I can do my OR query as full_text:
Select * from idx_test where MATCH('Terms (@name_id_text (1|2|3) | @customer_id_text (4|5|6))')
However recently I found an article that discussed the tradeoff between attribute and full-text searches. The upshot is that "it could reduce performance of queries that otherwise match few records", which is precisely what my name_id/customer_id query does. In an ideal world then I'd be able to go back to:
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
If Sphinx would only allow OR between attributes, since as far as I can tell, once I have a query that filters down to a relatively low number of results, I'd have a much faster query using attributes vs full-text.
So my two-part question therefore is:
Am I in fact correct that this is the case (a query that reduces the number of results significantly is better served filtering on attributes than on full-text)?
If so are there plans to add OR to the attribute part of the SphinxQL query?
If so, when?
An OR filter has been added in Manticore, the Sphinx fork (from the 2.3 branch); see https://github.com/manticoresoftware/manticore/commit/76b04de04feb8a4db60d7309bf1e57114052e298
For now it only works between attributes; OR between MATCH and attributes is not supported yet.
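Once on that Manticore build, an OR between two attribute filters along these lines (a sketch, untested against any particular version) should be accepted:
SELECT * FROM idx_test WHERE name_id IN (1,2,3) OR customer_id IN (4,5,6);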
While yes, OR is not supported directly in WHERE, you can still run the query. Your
Select * from idx_test where MATCH('Terms') and name_id in (1,2,3) OR customer_id in (4,5,6)
example can be written as
Select *, IN(name_id,1,2,3) + IN(customer_id,4,5,6) as filter
from idx_test where MATCH('Terms') and filter > 0
It is a bit more cumbersome, but should work. You still get the full benefit of the full-text inverted index, so performance actually shouldn't be bad. The filter is only executed against docs matching the terms.
(This may look crazy if you come from, say, a MySQL background, but remember SphinxQL isn't MySQL :)
You don't get short-circuiting (i.e. the customer_id filter will still be run even if name_id matches), so perhaps
Select *, IF(IN(name_id,1,2,3) OR IN(customer_id,4,5,6),1,0) as filter
from idx_test where MATCH('Terms') and filter =1
is even better; the IF function has an OR operator! (Sphinx could potentially short-circuit it, but I don't know if it does.)
(But also yes, if the 'filter' is highly selective (matching few rows), then including it in the full-text query can be good, as it discards rows earlier in processing. The problem with non-selective filters is that they match lots of rows, and hence there is a long doclist to process during text-query processing.)

Postgres DESC index on date field

I have a date field on a large table that I mostly query and sort in DESC order. I have an index on that field with the default ASC order. I read that if an index is on a single field it does not matter if it is in ASC or DESC order since an index can be read from both directions. Will I benefit from changing my index to DESC?
Operating systems are generally more efficient reading files in a forward direction, so you may get a slight speed-up by creating a DESC index.
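A sketch of such an index; the table and column names are hypothetical since the question names neither:
CREATE INDEX mytable_created_desc_idx ON mytable (created_at DESC);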
For a big speed up create the DESC index and CLUSTER the table on it.
CLUSTER tablename USING indexname;
Clustering on the ASC index will also give an improvement, but it will be less.

Equivalent of LIMIT for DB2

How do you do LIMIT in DB2 for iSeries?
I have a table with more than 50,000 records and I want to return records 0 to 10,000, and records 10,000 to 20,000.
I know in SQL you write LIMIT 0,10000 at the end of the query for 0 to 10,000 and LIMIT 10000,10000 at the end of the query for 10000 to 20,000
So, how is this done in DB2? What's the code and syntax?
(full query example is appreciated)
Using FETCH FIRST [n] ROWS ONLY:
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db29.doc.perf/db2z_fetchfirstnrows.htm
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY;
To get ranges, you'd have to use ROW_NUMBER() (since v5r4) and use that within the WHERE clause (stolen from here: http://www.justskins.com/forums/db2-select-how-to-123209.html):
SELECT code, name, address
FROM (
SELECT row_number() OVER ( ORDER BY code ) AS rid, code, name, address
FROM contacts
WHERE name LIKE '%Bob%'
) AS t
WHERE t.rid BETWEEN 20 AND 25;
I developed this method:
You NEED a table that has a unique value that can be ordered.
If you want rows 10,000 to 25,000 and your table has 40,000 rows, first you need to get the starting point and total rows:
int start = 40000 - 10000;
int total = 25000 - 10000;
And then pass these values into the query:
SELECT * FROM
(SELECT * FROM schema.mytable
ORDER BY userId DESC fetch first {start} rows only ) AS mini
ORDER BY mini.userId ASC fetch first {total} rows only
Support for OFFSET and LIMIT was recently added to DB2 for i 7.1 and 7.2. You need the following DB PTF group levels to get this support:
SF99702 level 9 for IBM i 7.2
SF99701 level 38 for IBM i 7.1
See here for more information: OFFSET and LIMIT documentation, DB2 for i Enhancement Wiki
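With those PTF levels installed, the two ranges from the question can be expressed directly; a sketch reusing the schema.mytable/userId names from the answer above:
SELECT * FROM schema.mytable ORDER BY userId LIMIT 10000 OFFSET 0;     -- records 0 to 10,000
SELECT * FROM schema.mytable ORDER BY userId LIMIT 10000 OFFSET 10000; -- records 10,000 to 20,000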
Here's the solution I came up with:
select FIELD from TABLE where FIELD > LASTVAL order by FIELD fetch first N rows only;
By initializing LASTVAL to 0 (or '' for a text field), then setting it to the last value in the most recent set of records, this will step through the table in chunks of N records.
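For example, with a hypothetical mylib.orders table keyed by a numeric orderId and chunks of 10:
SELECT orderId FROM mylib.orders WHERE orderId > 0 ORDER BY orderId FETCH FIRST 10 ROWS ONLY;
-- suppose the last row returned had orderId = 42; the next chunk is then
SELECT orderId FROM mylib.orders WHERE orderId > 42 ORDER BY orderId FETCH FIRST 10 ROWS ONLY;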
#elcool's solution is a smart idea, but you need to know the total number of rows (which can even change while you are executing the query!). So I propose a modified version, which unfortunately needs 3 subqueries instead of 2:
select * from (
select * from (
select * from MYLIB.MYTABLE
order by MYID asc
fetch first {last} rows only
) I
order by MYID desc
fetch first {length} rows only
) II
order by MYID asc
where {last} should be replaced with the row number of the last record I need and {length} should be replaced with the number of rows I need, calculated as last row - first row + 1.
E.g. if I want rows from 10 to 25 (totally 16 rows), {last} will be 25 and {length} will be 25-10+1=16.
Try this
SELECT * FROM
(
SELECT T.*, ROW_NUMBER() OVER() R FROM TABLE T
)
WHERE R BETWEEN 10000 AND 20000
The LIMIT clause lets you limit the number of rows returned by a query. It is an extension of the SELECT statement with the following syntax:
SELECT select_list
FROM table_name
ORDER BY sort_expression
LIMIT n [OFFSET m];
In this syntax:
n is the number of rows to be returned.
m is the number of rows to skip before returning the n rows.
Another, shorter version of the LIMIT clause is as follows:
LIMIT m, n;
This syntax means skipping m rows and returning the next n rows from the result set.
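For example, assuming a hypothetical books table, these two statements both skip 5 rows and return the next 10:
SELECT * FROM books ORDER BY book_id LIMIT 10 OFFSET 5;
SELECT * FROM books ORDER BY book_id LIMIT 5, 10;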
A table may store rows in an unspecified order. If you don’t use the ORDER BY clause with the LIMIT clause, the returned rows are also unspecified. Therefore, it is a good practice to always use the ORDER BY clause with the LIMIT clause.
See Db2 LIMIT for more details.
You should also consider the OPTIMIZE FOR n ROWS clause. More details on all of this in the DB2 LUW documentation in the Guidelines for restricting SELECT statements topic:
The OPTIMIZE FOR clause declares the intent to retrieve only a subset of the result or to give priority to retrieving only the first few rows. The optimizer can then choose access plans that minimize the response time for retrieving the first few rows.
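A sketch of the clause applied to the EMP example from earlier:
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
OPTIMIZE FOR 20 ROWS;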
There are 2 solutions to paginate efficiently on a DB2 table:
1 - the technique using the ROW_NUMBER() function and the OVER clause, which has been presented in another answer ("SELECT row_number() OVER ( ORDER BY ... )"). On some big tables I have sometimes noticed a degradation in performance.
2 - the technique using a scrollable cursor. The implementation depends on the language used. That technique seems more robust on big tables.
I presented the 2 techniques implemented in PHP during a seminar last year. The slides are available at this link:
http://gregphplab.com/serendipity/uploads/slides/DB2_PHP_Best_practices.pdf
Sorry, but this document is only in French.
There are these available options:
DB2 has several strategies to cope with this problem.
You can use the "scrollable cursor" feature.
In this case you can open a cursor and, instead of re-issuing the query, you can FETCH forward and backward (see the sketch after this list).
This works great if your application can hold state, since it doesn't require DB2 to rerun the query every time.
You can use the ROW_NUMBER() OLAP function to number rows and then return the subset you want.
This is ANSI SQL.
You can use the ROWNUM pseudo column, which does the same as ROW_NUMBER() but is suitable if you have Oracle skills.
You can use LIMIT and OFFSET if you lean more towards a MySQL or PostgreSQL dialect.
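A sketch of the scrollable-cursor option from the list above, in embedded SQL (the :lastname host variable is hypothetical):
EXEC SQL DECLARE C1 SCROLL CURSOR FOR
  SELECT LASTNAME FROM EMP ORDER BY LASTNAME;
EXEC SQL OPEN C1;
EXEC SQL FETCH ABSOLUTE +10000 FROM C1 INTO :lastname; -- jump straight to row 10,000
EXEC SQL FETCH NEXT FROM C1 INTO :lastname;            -- then walk forward one row at a time
EXEC SQL CLOSE C1;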