Limit results on OR condition in Sphinx - sphinx

I am trying to limit results by somehow grouping them. This attempted query should make things clear:
@namee ("Cameras") limit 5 | @namee ("Mobiles") limit 5 | @namee ("Washing Machine") limit 5 | @namee ("Graphic Cards") limit 5
where namee is the column.
Basically I am trying to limit results based upon specific criteria.
Is this possible? Is there any alternative way of doing what I want?
I am on Sphinx 2.2.9.

There is no Sphinx syntax to do this directly.
The easiest would be to just run 4 separate queries and 'UNION' them in the application itself. Performance isn't going to be terrible.
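For illustration, a minimal sketch of that approach (the index name myindex is a placeholder; assumes the namee field from the question):
sphinxQL> SELECT id FROM myindex WHERE MATCH('@namee "Cameras"') LIMIT 5;
sphinxQL> SELECT id FROM myindex WHERE MATCH('@namee "Mobiles"') LIMIT 5;
sphinxQL> SELECT id FROM myindex WHERE MATCH('@namee "Washing Machine"') LIMIT 5;
sphinxQL> SELECT id FROM myindex WHERE MATCH('@namee "Graphic Cards"') LIMIT 5;
-- then append the four result sets together in application code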
... If you REALLY want to do it in Sphinx, there are a couple of tricks that get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!), each with the same data, but with the field called something different (they duplicate each other!). You would also need an attribute on each one (more on why later):
source str1 {
    sql_query     = SELECT id, namee AS field1, 1 AS idx FROM ...
    sql_attr_uint = idx
}
source str2 {
    sql_query     = SELECT id, namee AS field2, 2 AS idx FROM ...
    sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
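A minimal sketch of such a distributed index (assuming plain indexes idx1..idx4 built from the sources above; the names are placeholders):
index dist_index
{
    type  = distributed
    local = idx1
    local = idx2
    local = idx3
    local = idx4
}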
Then you can run a single query to get all the results kinda magically unioned...
MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
(The @@relaxed is important: as the fields are different, the matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute, the attribute identifies which term matched.
In Sphinx, there is a nice GROUP N BY syntax where you only get a certain number of results per attribute value, so (putting all that together) you could do:
SELECT *, WEIGHT() AS weight
FROM dist_index
WHERE MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
simples eh?
(Note it only works if you want 4 from each index; if you want different limits it gets much more complicated!)

Related

How to optimize fuzzy string matching pairs of phrases (intersection names) in PostgreSQL

We have a table of intersection names like 'Main St / Broadway Ave' and we are trying to match potentially messy user input (of the form (street1, street2)) to these names. There's no guarantee the input would be in the same order as the street names.
We split the intersection names to a long format table in order to optimize doing two fuzzy distance comparisons, e.g.
+--------+----------------+
| int_id | street         |
+--------+----------------+
|      1 | 'Broadway Ave' |
|      1 | 'Main St'      |
+--------+----------------+
And put a gist trigram index on the street column.
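For reference, a minimal sketch of that setup (assuming the pg_trgm extension and the table above):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- gist trigram index to accelerate <% and <-> on street
CREATE INDEX intersections_street_trgm_idx
    ON intersections USING gist (street gist_trgm_ops);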
So then the query finds all int_ids that are close to one or the other street input, and then does a GROUP BY to find the one with the closest combined distances (the query is below). This works pretty well, but we still need it to work faster. Is there something in PostgreSQL's full text search library that could do the trick?
Query example used in a function, with the related explain: https://explain.depesz.com/s/J9lj
SELECT intersections.int_id,
       SUM(LEAST(intersections.street <-> street1,
                 intersections.street <-> street2)) AS total_distance
FROM intersections
WHERE street1 <% intersections.street
   OR street2 <% intersections.street
GROUP BY intersections.int_id
HAVING COUNT(DISTINCT TRIM(intersections.street)) > 1
ORDER BY AVG(LEAST(intersections.street <-> street1,
                   intersections.street <-> street2))
I know you are looking for a PostgreSQL solution, but this would be much easier (and faster) if you clone your data into Elasticsearch and do the searches there. Elasticsearch also gives you way more flexibility than a relational database could.

Partial vs complete match ranking in postgresql text search

I am trying to implement incremental search using PostgreSQL. The problem I am running into is result ranking. I would like complete matches to be ranked higher than partial matches and I don't really know how to do that. For example in this query (to show how things are ranked as the user types the query):
select
ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jon:*')),
ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jonath:*')),
ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jonathan:*'))
or the other way around (to show how different documents rank the same query)
select
ts_rank_cd(to_tsvector('hello jon'), to_tsquery('jon:*')),
ts_rank_cd(to_tsvector('hello jonah'), to_tsquery('jon:*')),
ts_rank_cd(to_tsvector('hello jonathan'), to_tsquery('jon:*'))
all rankings return 0.1. How would I go about making more complete results rank higher?
I would try using an operator from pg_trgm to break ties among the ts_rank_cd. Probably the "<->>>" operator (introduced in v11) would be my first choice:
select
'hello jon' <->>> 'jon:*',
'hello jonah' <->>> 'jon:*',
  'hello jonathan' <->>> 'jon:*';
 ?column? |  ?column?  | ?column?
----------+------------+----------
        0 | 0.57142854 |      0.7
Note that this returns distance, not similarity, so lower is better.
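Putting the two together, a sketch of what the tie-break could look like (the people table and name column here are made up for illustration; since <->>> returns a distance, ascending order puts the most complete match first):
select name,
       ts_rank_cd(to_tsvector(name), to_tsquery('jon:*')) as rank
from people
where to_tsvector(name) @@ to_tsquery('jon:*')
order by rank desc,
         name <->>> 'jon';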

Postgres: Are There Downsides to Using a JSON Column vs. an integer[] Column?

TLDR: If I want to save arrays of integers in a Postgres table, are there any pros or cons to using an array column (integer[]) vs. using a JSON column (e.g. does one perform better than the other)?
Backstory:
I'm using a PostgreSQL database, and Node/Knex to manage it. Knex doesn't have any way of directly defining a PostgreSQL integer[] column type, so someone filed a Knex bug asking for it ... but one of the Knex devs closed the ticket, essentially saying that there was no need to support PostgreSQL array column types when anyone can instead use the JSON column type.
My question is, what downsides (if any) are there to using a JSON column type to hold a simple array of integers? Are there any benefits, such as improved performance, to using a true array column, or am I equally well off by just storing my arrays inside a JSON column?
EDIT: Just to be clear, all I'm looking for in an answer is either of the following:
A) an explanation of how JSON columns and integer[] columns in PostgreSQL work, including either how one is better than the other or how the two are (at least roughly) equal.
B) no explanation, but at least a reference to some benchmarks that show that one column type or the other performs better (or that the two are equal)
An int[] is a lot more efficient in terms of the storage it requires. Consider the following query, which returns the size of an array with 500 elements:
select pg_column_size(array_agg(i)) as array_size,
pg_column_size(jsonb_agg(i)) as jsonb_size,
pg_column_size(json_agg(i)) as json_size
from generate_series(1,500) i;
returns:
 array_size | jsonb_size | json_size
------------+------------+-----------
       2024 |       6008 |      2396
(I am quite surprised that the JSON value is so much smaller than the JSONB, but that's a different topic)
If you always use the array as a single value, it does not really matter in terms of query performance. But if you do need to look into the array and search for specific value(s), that will be a lot more efficient with a native array.
There are a lot more functions and operators available for native arrays than there are for JSON arrays. You can easily search for a single value in a JSON array, but searching for multiple values requires workarounds.
The following query demonstrates that:
with array_test (id, int_array, json_array) as (
    values
        (1, array[1,2,3], '[1,2,3]'::jsonb)
)
select id,
       int_array @> array[1]        as array_single,
       json_array @> '1'            as json_single,
       int_array @> array[1,2]      as array_all,
       json_array ?& array['1','2'] as json_all,
       int_array && array[1,2]      as array_any,
       json_array ?| array['1','2'] as json_any
from array_test;
You can easily check whether an array contains one specific value; this also works for JSON arrays. Those are the expressions array_single and json_single. With a native array you could also use 1 = any(int_array) instead.
But checking whether an array contains all values from a list, or any value from a list, does not work the same way with JSON arrays.
The above test query returns:
 id | array_single | json_single | array_all | json_all | array_any | json_any
----+--------------+-------------+-----------+----------+-----------+----------
  1 | true         | true        | true      | false    | true      | false
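For completeness, a hedged sketch of possible workarounds (these are illustrations, not from the answer above): jsonb containment covers the "all" case even for numeric elements, and expanding the array covers "any":
-- "all": jsonb containment also works for numeric elements
select '[1,2,3]'::jsonb @> '[1,2]'::jsonb;   -- true
-- "any": expand the JSON array and test each element
select exists (
    select 1
    from jsonb_array_elements_text('[1,2,3]'::jsonb) as e(val)
    where e.val::int in (1, 2)
);                                           -- true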

array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?
According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.
It seems there is a similar (or almost the same) question here:
PostgreSQL array_agg order
I prefer ebk's answer:
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
But you can still add an ORDER BY inside the array_agg function to force the same order.
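For example, a minimal sketch that imposes a consistent (if arbitrary) order on both aggregates for the foo table above:
select array_agg(a order by a, b) as a_agg,
       array_agg(b order by a, b) as b_agg
from foo;
 a_agg     | b_agg
-----------+-----------
 {1,1,2,3} | {1,2,3,1}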

How to match rows with one or more words in query, but without any words not in query?

I have a table in a MySQL database that has a list of comma separated tags in it.
I want users to be able to enter a list of comma separated tags and then use Sphinx or MySQL to select rows that have at least one of the tags in the query but not any tags the query doesn't have.
The query can have additional tags that are not in the rows, but the rows should not be matched if they have tags not in the query.
I either want to use Sphinx or MySQL to do the searching.
Here's an example:
creatures:
------------------------
| name | tags          |
------------------------
| cat  | wily,hairy    |
| dog  | cute,hairy    |
| fly  | ugly          |
| bear | grumpy,hungry |
------------------------
Example searches:
wily,hairy <-- should match cat
cute,hairy,happy <-- should match dog
happy,cute <-- no match (dog has hairy)
ugly,yuck,gross <-- should match fly
hairy <-- no match (dog has cute, cat has wily)
grumpy <-- no match (bear has hungry)
grumpy,hungry <-- should match bear
wily,grumpy,hungry <-- should match bear
Is it possible to do this with Sphinx or MySQL?
To reiterate: the query will be a list of comma separated tags, and rows that have at least one of the entered tags, but no tags the query doesn't have, should be selected.
Sphinx's expression ranker should be able to do this:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1') AND w > 0
OPTION ranker=expr('IF(word_count>=tags_len,1,0)');
Basically you want the number of matched tags never to be less than the total number of tags in the row.
Note this just gives all matching documents a weight of 1; if you want more elaborate ranking (e.g. to also match other keywords) it gets more complicated.
You need index_field_lengths enabled on the index to get the tags_len attribute.
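For reference, a hedged sketch of where that setting lives (the index name is a placeholder):
index tagsindex
{
    # store per-field lengths, exposing attributes like tags_len
    index_field_lengths = 1
}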
(The same concept is obviously possible in MySQL, probably using FIND_IN_SET to do the matching, and either a second column to store the number of tags, or computing that number with, say, the REPLACE function.)
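For illustration, a hedged MySQL sketch along those lines (assumes the creatures table from the question and the query tags cute, hairy, happy; the tag count is derived with REPLACE):
-- a row matches when every one of its tags appears in the query list,
-- i.e. the number of query tags found in the row equals the row's tag count
SELECT name
FROM creatures
WHERE (FIND_IN_SET('cute', tags) > 0)
    + (FIND_IN_SET('hairy', tags) > 0)
    + (FIND_IN_SET('happy', tags) > 0)
    = LENGTH(tags) - LENGTH(REPLACE(tags, ',', '')) + 1;
-- returns: dog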
Edit to add, details about multiple fields...
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two three"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);
The SUM function is run once for each field in turn, so you need to use the user_weight system to be able to distinguish which field is currently being enumerated.