SphinxSearch Ranker=matchany on multiple fields - sphinx

Using Sphinx 2.1.4-id64-dev (rel21-r4324)
I want to search over multiple fields but do not want "duplicate words" to increase weight.
So I am using the ranker=matchany option.
This works as I want when the duplicates are in a single field:
MySQL [(none)]> select id, val, val2, weight() FROM nptest WHERE match('@(val,val2) bar') OPTION ranker=matchany;
+------+---------+------+----------+
| id | val | val2 | weight() |
+------+---------+------+----------+
| 3 | bar | | 1 |
| 4 | bar bar | | 1 |
+------+---------+------+----------+
2 rows in set (0.00 sec)
=> weights are equal, despite the duplicate word in doc 4.
But that no longer works when the duplicates are spread over multiple fields:
MySQL [(none)]> select id, val, val2, weight() FROM nptest WHERE match('@(val,val2) foo') OPTION ranker=matchany;
+------+------+------+----------+
| id | val | val2 | weight() |
+------+------+------+----------+
| 2 | foo | foo | 2 |
| 1 | foo | | 1 |
+------+------+------+----------+
2 rows in set (0.00 sec)
=> the weight of doc 2 is greater than the weight of doc 1.
Is there a way to apply a "matchany" ranking mode on multiple fields?
Here is a sample sphinx.conf file :
source nptest
{
type = mysql
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = test
sql_port = 3306
sql_query = \
SELECT 1, 'foo' AS val, '' AS val2 \
UNION \
SELECT 2, 'foo', 'foo' \
UNION \
SELECT 3, 'bar', '' \
UNION \
SELECT 4, 'bar bar', ''
sql_field_string = val
sql_field_string = val2
}
index nptest
{
type = plain
source = nptest
path = /var/lib/sphinxsearch/data/nptest
morphology = none
}

You need the expression ranker:
http://sphinxsearch.com/docs/current.html#weighting
You can start with the default expression for matchany and tweak it.
Using doc_word_count instead of sum(word_count) should be useful.

After upgrading to Sphinx 2.2.1-id64-beta (r4330) I was able to use the top() aggregate function in a custom expression ranker like this:
MySQL [(none)]> SELECT id, val, val2, weight() FROM nptest WHERE match('@(val,val2) foo') OPTION ranker=expr('top((word_count+(lcs-1)*max_lcs)*user_weight)'), field_weights=(val=3,val2=4);
+------+-------------+------+----------+
| id | val | val2 | weight() |
+------+-------------+------+----------+
| 2 | foo | foo | 4 |
| 1 | foo | | 3 |
| 5 | bar bar foo | bar | 3 |
+------+-------------+------+----------+
3 rows in set (0.00 sec)
That way, multiple occurrences across multiple fields do not increase the overall weight, and if the fields have different weights, the score of the highest-weighted matching field is used.
Many thanks to barryhunter for his great help!
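To make the arithmetic behind that result easier to follow, here is a rough Python model of the expression for a single-keyword query. It is only a sketch, not Sphinx code: the helper names (field_score, top_weight) are made up for illustration, and it assumes that only fields containing the keyword contribute a value to top().

```python
# Simplified model of the ranker expression
#   top((word_count + (lcs - 1) * max_lcs) * user_weight)
# for a single-keyword query. Assumption: only fields that contain the
# keyword contribute a value; fields without a match are skipped.

def field_score(field_text, keyword, user_weight):
    """Score one field, or return None if the keyword does not occur in it."""
    if keyword not in field_text.split():
        return None
    word_count = 1  # unique query keywords matched in this field
    lcs = 1         # with one keyword, the longest matching run has length 1
    max_lcs = 1     # upper bound of lcs for a one-keyword query
    return (word_count + (lcs - 1) * max_lcs) * user_weight


def top_weight(doc, keyword, field_weights):
    """Take the maximum per-field score, mirroring top()."""
    scores = [
        field_score(doc.get(name, ""), keyword, weight)
        for name, weight in field_weights.items()
    ]
    return max((s for s in scores if s is not None), default=0)


field_weights = {"val": 3, "val2": 4}
docs = {
    2: {"val": "foo", "val2": "foo"},
    1: {"val": "foo", "val2": ""},
    5: {"val": "bar bar foo", "val2": "bar"},
}
weights = {doc_id: top_weight(doc, "foo", field_weights) for doc_id, doc in docs.items()}
print(weights)  # {2: 4, 1: 3, 5: 3}
```

Replacing max with sum in top_weight gives 7 for doc 2, which is the kind of cross-field accumulation the question wanted to avoid.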


Postgresql - How to use string_to_array on another column value

How can I use string_to_array or split_part on another column's value?
I want to do something like select * from tenants where id IN (select string_to_array(select ancestry from tenants where id = 39,'/'));
-[ RECORD 1 ]-------------+----------------------
id | 1
domain |
subdomain |
name | My Company
login_text |
logo_file_name |
logo_content_type |
logo_file_size |
logo_updated_at |
login_logo_file_name |
login_logo_content_type |
login_logo_file_size |
login_logo_updated_at |
ancestry |
divisible | t
description | Tenant for My Company
use_config_for_attributes | t
default_miq_group_id | 1
source_type |
source_id |
-[ RECORD 3 ]-------------+----------------------
id | 35
domain |
subdomain |
name | Tenant_2
login_text |
logo_file_name |
logo_content_type |
logo_file_size |
logo_updated_at |
login_logo_file_name |
login_logo_content_type |
login_logo_file_size |
login_logo_updated_at |
ancestry | 1
divisible | t
description | Tenant_2
use_config_for_attributes | f
default_miq_group_id | 36
source_type |
source_id |
-[ RECORD 7 ]-------------+----------------------
id | 39
domain |
subdomain |
name | Child_Teanant_202
login_text |
logo_file_name |
logo_content_type |
logo_file_size |
logo_updated_at |
login_logo_file_name |
login_logo_content_type |
login_logo_file_size |
login_logo_updated_at |
ancestry | 1/35
divisible | t
description | Child_Teanant_202
use_config_for_attributes | f
default_miq_group_id | 52
source_type |
source_id |
Use regex to enforce word boundaries:
select *
from tenants
where (select ancestry from tenants where id = 39)
~ ('\y' || id || '\y')
Without the word boundaries, an id of 1 would match an ancestry of 123.
Note that Postgres uses \y for a word boundary, where most other regex flavors use \b.
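The effect of the word boundaries can be checked outside the database. Here is a small Python sketch of the same idea; Python's re module uses \b where Postgres uses \y, and the function name is_ancestor is made up for illustration.

```python
import re


def is_ancestor(tenant_id, ancestry):
    """True if tenant_id appears as a whole number in the '/'-separated ancestry."""
    pattern = r"\b" + re.escape(str(tenant_id)) + r"\b"
    return re.search(pattern, ancestry) is not None


print(is_ancestor(1, "1/35"))   # True: 1 is a whole path element
print(is_ancestor(35, "1/35"))  # True
print(is_ancestor(1, "123"))    # False: 1 is only part of a longer number
print("1" in "123")             # True: a plain substring test gives a false positive
```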
There are two ways to solve this.
One is to simply unnest the elements of ancestry:
select *
from tenants
where id in (select a.id::int
from tenants t2
cross join unnest(string_to_array(t2.ancestry, '/')) as a(id)
where t2.id = 39);
The other is to convert the string to an array so that you can use the = ANY() operator. This is a bit tricky, because you need two levels of parentheses plus a type cast to an integer array to make it work:
select *
from tenants
where id = any ((select string_to_array(t2.ancestry, '/')
from tenants t2
where t2.id = 39)::int[]);
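Both queries boil down to "split the ancestry of row 39 on '/', cast the parts to integers, and keep the rows whose id is in that list". A minimal Python sketch of that logic, using the three rows from the question; the helper name ancestors_of is hypothetical, and treating a blank ancestry as None is an assumption, since the expanded output does not show whether it is NULL or an empty string.

```python
tenants = [
    {"id": 1, "name": "My Company", "ancestry": None},
    {"id": 35, "name": "Tenant_2", "ancestry": "1"},
    {"id": 39, "name": "Child_Teanant_202", "ancestry": "1/35"},
]


def ancestors_of(tenant_id, rows):
    """Return the rows whose id appears in the ancestry path of tenant_id."""
    by_id = {row["id"]: row for row in rows}
    ancestry = by_id[tenant_id]["ancestry"]
    # Equivalent of string_to_array(ancestry, '/')::int[]
    ancestor_ids = [int(part) for part in ancestry.split("/")] if ancestry else []
    # Equivalent of WHERE id = ANY (...)
    return [row for row in rows if row["id"] in ancestor_ids]


print([row["id"] for row in ancestors_of(39, tenants)])  # [1, 35]
```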

How to expand columns into individual timesteps in PostgreSQL

I have a table of columns that represent a time series. The datatypes are not important, but anything after timestep2 could potentially be NULL.
| id | timestep1 | timestep2 | timestep3 | timestep4 |
|----|-----------|-----------|-----------|-----------|
| a | foo1 | bar1 | baz1 | qux1 |
| b | foo2 | bar2 | baz2 | NULL |
I am attempting to retrieve a view of the data more suitable for modeling. My modeling use-case requires that I break each time series (row) into rows representing their individual "states" at each step. That is:
| id | timestep1 | timestep2 | timestep3 | timestep4 |
|----|-----------|-----------|-----------|-----------|
| a | foo1 | NULL | NULL | NULL |
| a | foo1 | bar1 | NULL | NULL |
| a | foo1 | bar1 | baz1 | NULL |
| a | foo1 | bar1 | baz1 | qux1 |
| b | foo2 | NULL | NULL | NULL |
| b | foo2 | bar2 | NULL | NULL |
| b | foo2 | bar2 | baz2 | NULL |
How can I accomplish this in PostgreSQL?
Use UNION.
select id, timestep1, timestep2, timestep3, timestep4
from my_table
union
select id, timestep1, timestep2, timestep3, null
from my_table
union
select id, timestep1, timestep2, null, null
from my_table
union
select id, timestep1, null, null, null
from my_table
order by
id,
timestep2 nulls first,
timestep3 nulls first,
timestep4 nulls first
There is a more compact solution, maybe more convenient when dealing with a greater number of timesteps:
select distinct
id,
timestep1,
case when i > 1 then timestep2 end as timestep2,
case when i > 2 then timestep3 end as timestep3,
case when i > 3 then timestep4 end as timestep4
from my_table
cross join generate_series(1, 4) as i
order by
id,
timestep2 nulls first,
timestep3 nulls first,
timestep4 nulls first
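For anyone who wants to check the logic of the generate_series version without a database, here is a rough Python equivalent. It is only an illustration: expand_states is a made-up name, and the table is the two-row sample from the question.

```python
def expand_states(row, n_steps):
    """Return one tuple per distinct prefix of the row's timesteps."""
    row_id, steps = row[0], list(row[1:])
    seen = []
    for i in range(1, n_steps + 1):       # plays the role of generate_series(1, n)
        # Keep the first i timesteps and blank out the rest (the CASE expressions).
        state = (row_id, *steps[:i], *([None] * (n_steps - i)))
        if state not in seen:             # mirrors SELECT DISTINCT
            seen.append(state)
    return seen


table = [
    ("a", "foo1", "bar1", "baz1", "qux1"),
    ("b", "foo2", "bar2", "baz2", None),
]
states = [state for row in table for state in expand_states(row, 4)]
for state in states:
    print(state)
```

Row b yields only three states because its last step is already NULL, so the fourth prefix is a duplicate of the third. That is why the DISTINCT is needed in the query.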

postgresql | batch update with insert in single query, 1:n to 1:1

I need to turn a 1:n relationship into a 1:1 relationship with the data remaining the same.
I want to know if it is possible to achieve this with a single pure SQL statement (no plpgsql, no external language).
Below there are more details, a MWE and some extra context.
To illustrate, if I have
+------+--------+ +------+----------+--------+
| id | name | | id | foo_id | name |
|------+--------| |------+----------+--------|
| 1 | foo1 | | 1 | 1 | baz1 |
| 2 | foo2 | | 2 | 1 | baz2 |
| 3 | foo3 | | 3 | 2 | baz3 |
+------+--------+ | 4 | 2 | baz4 |
| 5 | 3 | baz5 |
+------+----------+--------+
I want to get to
+------+--------+ +------+----------+--------+
| id | name | | id | foo_id | name |
|------+--------| |------+----------+--------|
| 4 | foo1 | | 1 | 4 | baz1 |
| 5 | foo1 | | 2 | 5 | baz2 |
| 6 | foo2 | | 3 | 6 | baz3 |
| 7 | foo2 | | 4 | 7 | baz4 |
| 8 | foo3 | | 5 | 8 | baz5 |
+------+--------+ +------+----------+--------+
Here is some code to set up the tables if needed:
drop table if exists baz;
drop table if exists foo;
create table foo(
id serial primary key,
name varchar
);
insert into foo (name) values
('foo1'),
('foo2'),
('foo3');
create table baz(
id serial primary key,
foo_id integer references foo (id),
name varchar
);
insert into baz (foo_id, name) values
(1, 'baz1'),
(1, 'baz2'),
(2, 'baz3'),
(2, 'baz4'),
(3, 'baz5');
I managed to work out the following query, which updates only one entry (i.e., the
pair <baz id, foo id> has to be provided):
with
existing_foo_values as (
select name from foo where id = 1
),
new_id as (
insert into foo(name)
select name from existing_foo_values
returning id
)
update baz
set foo_id = (select id from new_id)
where id = 1;
The real-world case (a DB migration in a Node.js environment) was solved using
something similar to:
const existingPairs = await runQuery(`
select id, foo_id from baz
`);
await Promise.all(existingPairs.map(({
id, foo_id
}) => runQuery(`
with
existing_foo_values as (
select name from foo where id = ${foo_id}
),
new_id as (
insert into foo(name)
select name from existing_foo_values
returning id
)
update baz
set foo_id = (select id from new_id)
where id = ${id};
`)));
// Then delete all the orphan entries from `foo`
Here's a solution that works by first putting together what we want foo to look like (using values from the sequence), and then making the necessary changes to the two tables based on that.
WITH new_ids AS (
SELECT nextval('foo_id_seq') as foo_id, baz.id as baz_id, foo.name as foo_name
FROM foo
JOIN baz ON (foo.id = baz.foo_id)
),
inserts AS (
INSERT INTO foo (id, name)
SELECT foo_id, foo_name
FROM new_ids
),
updates AS (
UPDATE baz
SET foo_id = new_ids.foo_id
FROM new_ids
WHERE new_ids.baz_id = baz.id
)
DELETE FROM foo
WHERE id < (SELECT min(foo_id) FROM new_ids);
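To see what each CTE contributes, here is a small Python model of the same four steps on the sample data. It is a sketch under a few assumptions: the function name split_one_to_many is made up, the id counter starts at 4 because the sequence has already handed out 1 to 3, and baz rows are processed in id order (the SQL join does not guarantee any particular order, so the pairing of new ids with baz rows may differ there).

```python
from itertools import count

foo = {1: "foo1", 2: "foo2", 3: "foo3"}
baz = [
    {"id": 1, "foo_id": 1, "name": "baz1"},
    {"id": 2, "foo_id": 1, "name": "baz2"},
    {"id": 3, "foo_id": 2, "name": "baz3"},
    {"id": 4, "foo_id": 2, "name": "baz4"},
    {"id": 5, "foo_id": 3, "name": "baz5"},
]


def split_one_to_many(foo, baz, next_id):
    """Give every baz row its own copy of the foo row it points to."""
    # new_ids: one fresh foo id per baz row, carrying the old foo name.
    new_ids = [(next(next_id), b["id"], foo[b["foo_id"]]) for b in baz]
    first_new = min(foo_id for foo_id, _, _ in new_ids)
    # inserts: add the copied foo rows under their new ids.
    new_foo = dict(foo)
    for foo_id, _, foo_name in new_ids:
        new_foo[foo_id] = foo_name
    # updates: repoint each baz row at its own foo copy.
    remap = {baz_id: foo_id for foo_id, baz_id, _ in new_ids}
    new_baz = [{**b, "foo_id": remap[b["id"]]} for b in baz]
    # delete: drop the original foo rows (ids below the first new id).
    new_foo = {k: v for k, v in new_foo.items() if k >= first_new}
    return new_foo, new_baz


new_foo, new_baz = split_one_to_many(foo, baz, count(4))
print(new_foo)
print([(b["id"], b["foo_id"]) for b in new_baz])
```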

How to get the minimum of three column values in PostgreSQL

The usual function to get the minimum value of a column is min(column), but what I want is the minimum value within a row, based on the values of 3 columns. For example, using the following base table:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 2 | 1 | 3 |
| 10 | 0 | 1 |
| 13 | 12 | 2 |
+------+------+------+
I want to query it as:
+-----------+
| min_value |
+-----------+
| 1 |
| 0 |
| 2 |
+-----------+
I found the following solution, but it was written for another SQL dialect, not PostgreSQL, and I cannot get it to work in PostgreSQL:
select
(
select min(minCol)
from (values (t.col1), (t.col2), (t.col3)) as minCol(minCol)
) as minCol
from t
I could write something using a CASE expression, but I would like to write a query like the one above for PostgreSQL. Is this possible?
You can use least() (and greatest() for the maximum):
select least(col1, col2, col3) as min_value
from the_table
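One detail worth knowing: PostgreSQL's least() ignores NULL arguments and returns NULL only when every argument is NULL. A small Python sketch of that row-wise behaviour on the sample table; the function here is just a stand-in named after the SQL one.

```python
def least(*values):
    """Row-wise minimum that skips None, like PostgreSQL's least() skips NULL."""
    present = [v for v in values if v is not None]
    return min(present) if present else None


rows = [(2, 1, 3), (10, 0, 1), (13, 12, 2)]
min_values = [least(*row) for row in rows]
print(min_values)  # [1, 0, 2]
```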

Sphinx query takes too much time

I am making an index on a table with ~90 000 000 rows. Fulltext search must be done on a varchar field, called email. I also set parent_id as an attribute.
When I search for words that match only a small number of emails, the queries return immediately:
mysql> SELECT count(*) FROM users WHERE MATCH('diedsmiling');
+----------+
| count(*) |
+----------+
| 26 |
+----------+
1 row in set (0.00 sec)
mysql> show meta;
+---------------+-------------+
| Variable_name | Value |
+---------------+-------------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | diedsmiling |
| docs[0] | 26 |
| hits[0] | 26 |
+---------------+-------------+
6 rows in set (0.00 sec)
Things get slow when I search for words that match a large number of emails:
mysql> SELECT count(*) FROM users WHERE MATCH('mail');
+----------+
| count(*) |
+----------+
| 33237994 |
+----------+
1 row in set (9.21 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 9.210 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
+---------------+----------+
6 rows in set (0.00 sec)
Filtering on the parent_id attribute does not help:
mysql> SELECT count(*) FROM users WHERE MATCH('mail') AND parent_id = 62003;
+----------+
| count(*) |
+----------+
| 21404 |
+----------+
1 row in set (8.66 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 8.666 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
Here are my sphinx configs:
source src1
{
type = mysql
sql_host = HOST
sql_user = USER
sql_pass = PASS
sql_db = DATABASE
sql_port = 3306 # optional, default is 3306
sql_query = \
SELECT id, parent_id, email \
FROM users
sql_attr_uint = parent_id
}
index test1
{
source = src1
path = /var/lib/sphinx/test1
}
The query that I need to run looks like:
SELECT * FROM users WHERE MATCH('mail') AND parent_id = 62003;
I need to get all emails that match a certain word and have a certain parent_id.
My questions are:
Is there a way to optimize the situation described above? Maybe there is a more convenient matching mode for such type of queries? If I migrate to a server with SSD disks will the performance growth be significant?
If you only need the count, you can run
Select id from index where match(...) limit 0 option ranker=none; show meta;
and read the count from total_found.
That will be much more efficient than count(*), which invokes group by.
Or, if the query is only a single word, you can even use call keywords('word','index',1);