I am trying to check if my config has issues or I am not understanding Show Meta correctly;
If I make a regex in the config:
regexp_filter=NY=>New York
then if I do a SphinxQL search on 'NY'
Search Index where MATCH('NY')
and then Show Meta
it should show keyword1=New and keyword2=York not NY is that correct?
And if it does not then somehow my config is not working as intended?
it should show keyword1=New and keyword2=York not NY is that correct?
This is correct. When you do MATCH('NY') and have NY=>New York regexp conversion then Sphinx first converts NY into New York and only after that it starts searching, i.e. it forgets about NY completely. The same happens when indexing: it first prepares tokens, then indexes them forgetting about the original text.
To demonstrate (this is in Manticore (fork of Sphinx), but in terms of processing regexp_filter and how it affects searching works the same was as Sphinx):
mysql> create table t(f text) regexp_filter='NY=>New York';
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t values(0, 'I low New York');
Query OK, 1 row affected (0.01 sec)
mysql> select * from t where match('NY');
+---------------------+----------------+
| id | f |
+---------------------+----------------+
| 2810862456614682625 | I low New York |
+---------------------+----------------+
1 row in set (0.01 sec)
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | new |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | york |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
9 rows in set (0.00 sec)
Related
I'm trying to fill a table with data to test a system.
I have two tables
User
+----+----------+
| id | name |
+----+----------+
| 1 | Majikaja |
| 2 | User 2 |
| 3 | Markus |
+----+----------+
Goal
+----+----------+---------+
| id | goal | user_id |
+----+----------+---------+
I want to insert into goal one record for every user only using their IDs (they have to exists) and some fixed or random value.
I was thinking in something like this:
INSERT INTO Goal (goal, user_id) values ('Fixed value', select u.id from user u)
So it will generate:
Goal
+----+-------------+---------+
| id | goal | user_id |
+----+-------------+---------+
| 1 | Fixed value | 1 |
| 2 | Fixed value | 2 |
| 3 | Fixed value | 3 |
+----+-------------+---------+
I could just write a simple PHP script to achieve it but I wonder if is it possible to do using raw SQL only.
I use Proximity to good use with Sphinx e.g. Twain NEAR/1 Mark will return
Mark Twain
and
Twain, Mark
But say I had a word form like:
Weekday > Week Day
How could I set any given search to use Proximity NEAR/3 (or NEAR/X) so it would find
Week Day
and
Day of Week
I get in this case there are other ways to skin the cat but in general, looking for a way that the multiple word map doe not get pushed as 'Word1 Word2' i.e. 'Week Day' because otherwise I get docs such as
'I worked for one entire day before realizing it was going to take a
full week'
There's no easy way out of the box. You can perhaps make a change in your app so it does changes each 'word' to "word"~N in your search query or even better do that only for the same wordforms that Sphinx deals with. Here's an example:
mysql> select *, weight() from idx_min where match('weekday');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 1 | Weekday | 1 | 2319 |
| 2 | day of week | 2 | 1319 |
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 1319 |
+------+-------------------------------------------------------------------------------+------+----------+
3 rows in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"weekday"');
+------+---------+------+----------+
| id | doc | a | weight() |
+------+---------+------+----------+
| 1 | Weekday | 1 | 2319 |
+------+---------+------+----------+
1 row in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"weekday"~2');
+------+-------------+------+----------+
| id | doc | a | weight() |
+------+-------------+------+----------+
| 1 | Weekday | 1 | 2319 |
| 2 | day of week | 2 | 1319 |
+------+-------------+------+----------+
2 rows in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"entire"~2 "day"~2');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 1500 |
+------+-------------------------------------------------------------------------------+------+----------+
1 row in set (0.00 sec)
mysql> select *, weight() from idx_min where match('weekday full week');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 2439 |
+------+-------------------------------------------------------------------------------+------+----------+
1 row in set (0.01 sec)
mysql> select *, weight() from idx_min where match('"weekday"~2 full week');
Empty set (0.00 sec)
The last one would be the best way to go, but you would have to:
1) parse your query. E.g. like this:
mysql> call keywords('weekday full week', 'idx_min');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | weekday | week |
| 2 | weekday | day |
| 3 | full | full |
| 4 | week | week |
+------+-----------+------------+
4 rows in set (0.00 sec)
and if you see that for the same tokenized word you get 2 different normalized words that can be a signal for your app to wrap the tokenized word into "word"~N.
2) run the query. In this case "weekday"~2 full week
I am test keywords with sphinxQL
call keywords ('azori', 'test', 1);
And get results
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | azori | a260 | 1550 | 1551 |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)
What means a260?
That looks like a soundex representation
https://en.wikipedia.org/wiki/Soundex
seems like you have morphology=soundex enabled on the particular index
http://sphinxsearch.com/docs/current.html#conf-morphology
I have a table with a few million rows of data that looks like this:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /mom | pizza | 15 |
| /dad | pizza | 8 |
| /uncle | pizza | 2 |
| /brother | pizza | 7 |
| /mom | pasta | 12 |
| /dad | pasta | 23 |
+---------------+--------------+-------------------+
My goal is to run a HiveQL Query that will return the largest 'interactions' number for each unique page/term combo. For example:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /dad | pasta | 23 |
| /mom | pizza | 15 |
+---------------+--------------+-------------------+
How would I write this considering that each unique page has hundreds of thousands of search_terms, but I only want to pull the one search_term with the most interactions?
I have tried using max(interactions) and max(struct(interactions, search_term)).col1 but have had no luck. My output is consistently giving me all of the search_terms for each page no matter how many interactions.
Thanks!
Use row_number() analytic function:
select page, search_term, interactions
from
(select page, search_term, interactions,
row_number() over (partition by page order by interactions desc ) rn
)s
where rn = 1;
I am making an index on a table with ~90 000 000 rows. Fulltext search must be done on a varchar field, called email. I also set parent_id as an attribute.
When I run queries to search emails that match words with small amount of hits, they are fired immediately:
mysql> SELECT count(*) FROM users WHERE MATCH('diedsmiling');
+----------+
| count(*) |
+----------+
| 26 |
+----------+
1 row in set (0.00 sec)
mysql> show meta;
+---------------+-------------+
| Variable_name | Value |
+---------------+-------------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | diedsmiling |
| docs[0] | 26 |
| hits[0] | 26 |
+---------------+-------------+
6 rows in set (0.00 sec)
Things get complicated when I am searching for emails that match words with a big amount of hits:
mysql> SELECT count(*) FROM users WHERE MATCH('mail');
+----------+
| count(*) |
+----------+
| 33237994 |
+----------+
1 row in set (9.21 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 9.210 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
+---------------+----------+
6 rows in set (0.00 sec)
Using parent_id attribute, doesn't give any profit:
mysql> SELECT count(*) FROM users WHERE MATCH('mail') AND parent_id = 62003;
+----------+
| count(*) |
+----------+
| 21404 |
+----------+
1 row in set (8.66 sec)
mysql> show meta;
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| total | 1 |
| total_found | 1 |
| time | 8.666 |
| keyword[0] | mail |
| docs[0] | 33237994 |
| hits[0] | 33253762 |
Here are my sphinx configs:
source src1
{
type = mysql
sql_host = HOST
sql_user = USER
sql_pass = PASS
sql_db = DATABASE
sql_port = 3306 # optional, default is 3306
sql_query = \
SELECT id, parent_id, email \
FROM users
sql_attr_uint = parent_id
}
index test1
{
source = src1
path = /var/lib/sphinx/test1
}
The query that I need to run looks like:
SELECT * FROM users WHERE MATCH('mail') AND parent_id = 62003;
I need to get all emails that match a certain work and have a certain parent_id.
My questions are:
Is there a way to optimize the situation described above? Maybe there is a more convenient matching mode for such type of queries? If I migrate to a server with SSD disks will the performance growth be significant?
Just to get count can just do
Select id from index where match(...) limit 0 option ranker=none; show meta;
And get from total_found.
Will be much more efficient than count[*) which invokes group by.
Or even call keywords('word','index',1); if only single words.