Force Proximity Search into multiple word wordform? - sphinx

I put proximity search to good use with Sphinx, e.g. Twain NEAR/1 Mark will return
Mark Twain
and
Twain, Mark
But say I had a word form like:
Weekday > Week Day
How could I set any given search to use Proximity NEAR/3 (or NEAR/X) so it would find
Week Day
and
Day of Week
I get that in this case there are other ways to skin the cat, but in general I am looking for a way to keep the multi-word mapping from being pushed as 'Word1 Word2', i.e. 'Week Day', because otherwise I get docs such as
'I worked for one entire day before realizing it was going to take a
full week'

There's no easy way out of the box. You can perhaps change your app so that it rewrites each 'word' to "word"~N in your search query or, even better, does that only for the words that Sphinx expands through wordforms. Here's an example:
mysql> select *, weight() from idx_min where match('weekday');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 1 | Weekday | 1 | 2319 |
| 2 | day of week | 2 | 1319 |
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 1319 |
+------+-------------------------------------------------------------------------------+------+----------+
3 rows in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"weekday"');
+------+---------+------+----------+
| id | doc | a | weight() |
+------+---------+------+----------+
| 1 | Weekday | 1 | 2319 |
+------+---------+------+----------+
1 row in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"weekday"~2');
+------+-------------+------+----------+
| id | doc | a | weight() |
+------+-------------+------+----------+
| 1 | Weekday | 1 | 2319 |
| 2 | day of week | 2 | 1319 |
+------+-------------+------+----------+
2 rows in set (0.00 sec)
mysql> select *, weight() from idx_min where match('"entire"~2 "day"~2');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 1500 |
+------+-------------------------------------------------------------------------------+------+----------+
1 row in set (0.00 sec)
mysql> select *, weight() from idx_min where match('weekday full week');
+------+-------------------------------------------------------------------------------+------+----------+
| id | doc | a | weight() |
+------+-------------------------------------------------------------------------------+------+----------+
| 3 | I worked for one entire day before realizing it was going to take a full week | 3 | 2439 |
+------+-------------------------------------------------------------------------------+------+----------+
1 row in set (0.01 sec)
mysql> select *, weight() from idx_min where match('"weekday"~2 full week');
Empty set (0.00 sec)
The last one would be the best way to go, but you would have to:
1) parse your query. E.g. like this:
mysql> call keywords('weekday full week', 'idx_min');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | weekday | week |
| 2 | weekday | day |
| 3 | full | full |
| 4 | week | week |
+------+-----------+------------+
4 rows in set (0.00 sec)
If the same tokenized word yields two different normalized words, that can be a signal for your app to wrap the tokenized word as "word"~N.
2) run the query. In this case "weekday"~2 full week
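The two steps above could be sketched on the application side roughly as follows. This is a minimal sketch, not a Sphinx API: the function name `wrap_multiword_forms` and the `distance` parameter are hypothetical, and the rows are hard-coded from the CALL KEYWORDS output shown earlier rather than fetched from the server.

```python
from itertools import groupby


def wrap_multiword_forms(keyword_rows, distance=2):
    """Rebuild a query string from (qpos, tokenized, normalized) rows.

    Hypothetical helper: a token that expands into several different
    normalized words is wrapped as "token"~distance.
    """
    parts = []
    # Consecutive rows with the same tokenized value come from one query word.
    for token, group in groupby(keyword_rows, key=lambda row: row[1]):
        normalized = [row[2] for row in group]
        if len(set(normalized)) > 1:
            # One token became several different words: a multi-word wordform.
            parts.append('"%s"~%d' % (token, distance))
        else:
            # Ordinary token (possibly repeated): emit it unchanged.
            parts.extend([token] * len(normalized))
    return " ".join(parts)


# Rows copied from the CALL KEYWORDS output for 'weekday full week'.
rows = [
    (1, "weekday", "week"),
    (2, "weekday", "day"),
    (3, "full", "full"),
    (4, "week", "week"),
]
query = wrap_multiword_forms(rows)
print(query)
```

The resulting string can then be passed to MATCH() as in step 2.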

Related

Create a pivot table for Month over Month variation

I have these records returned from a query
+---------+--------------+-----------+----------+
| Country | other fields | sales | date |
+---------+--------------+-----------+----------+
| US | 1 | $100.00 | 01/01/21 |
| CA | 1 | $100.00 | 01/01/21 |
| UK | 1 | $100.00 | 01/01/21 |
| FR | 1 | $100.00 | 01/01/21 |
| US | 1 | $200.00 | 01/02/21 |
| CA | 1 | $200.00 | 01/02/21 |
| UK | 1 | $200.00 | 01/02/21 |
| FR | 1 | $200.00 | 01/02/21 |
+---------+--------------+-----------+----------+
And I want to show the sales variation from one month to the previous one, like this:
+---------+----------+----------+------+
| Country | 01/02/21 | 01/01/21 | Var% |
+---------+----------+----------+------+
| US | $200.00 | $100.00 | 100% |
| CA | $200.00 | $100.00 | 100% |
| FR | $200.00 | $100.00 | 100% |
+---------+----------+----------+------+
How could this be done with a Postgres query?
If you are always comparing only two months:
select country
, sum(sales) filter (where date ='01/01/21') month1
, sum(sales) filter (where date ='01/02/21') month2
, ((sum(sales) filter (where date ='01/02/21') /sum(sales) filter (where date ='01/01/21')) - 1) * 100 var
from tablename
where date in ('01/01/21' , '01/02/21')
group by country
You can also look at crosstab from the tablefunc extension, which basically does the same as the query above.
CREATE EXTENSION IF NOT EXISTS tablefunc;
select * , (("01/02/21" / "01/01/21") - 1) * 100 var
from(
select * from crosstab ('select Country,date , sales from tablename order by 1,2')
as ct(country varchar(2),"01/01/21" money , "01/02/21" money)
) t
For more info about crosstab, see the tablefunc documentation.
But if you want to show dates in rows instead of columns, you can easily generalize it to all dates:
select *
, ((sales / LAG(sales,1,1) over (partition by country order by date)) -1)* 100 var
from
tablename
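To check the variation arithmetic, (month2 / month1 - 1) * 100, outside Postgres, here is a small sketch using Python's built-in sqlite3 module. It is an illustration under assumptions: sales is stored as a plain number instead of money, and the Postgres FILTER clause is swapped for the more portable SUM(CASE WHEN ...) form of conditional aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (country TEXT, sales REAL, date TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [
        ("US", 100.0, "01/01/21"), ("CA", 100.0, "01/01/21"),
        ("UK", 100.0, "01/01/21"), ("FR", 100.0, "01/01/21"),
        ("US", 200.0, "01/02/21"), ("CA", 200.0, "01/02/21"),
        ("UK", 200.0, "01/02/21"), ("FR", 200.0, "01/02/21"),
    ],
)

# Conditional aggregation: one SUM per month, then the percentage change.
PIVOT_SQL = """
SELECT country,
       SUM(CASE WHEN date = '01/01/21' THEN sales END) AS month1,
       SUM(CASE WHEN date = '01/02/21' THEN sales END) AS month2,
       (SUM(CASE WHEN date = '01/02/21' THEN sales END)
        / SUM(CASE WHEN date = '01/01/21' THEN sales END) - 1) * 100 AS var
FROM tablename
WHERE date IN ('01/01/21', '01/02/21')
GROUP BY country
ORDER BY country
"""

# Map each country to (month1, month2, var).
variation = {row[0]: row[1:] for row in conn.execute(PIVOT_SQL)}
for country, values in variation.items():
    print(country, values)
```

Going from 100 to 200 gives (200 / 100 - 1) * 100 = 100, which matches the Var% column in the question.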

Output of Show Meta in SphinxQL

I am trying to check if my config has issues or I am not understanding Show Meta correctly;
If I make a regex in the config:
regexp_filter=NY=>New York
then if I do a SphinxQL search on 'NY'
Search Index where MATCH('NY')
and then Show Meta
it should show keyword1=New and keyword2=York not NY is that correct?
And if it does not then somehow my config is not working as intended?
it should show keyword1=New and keyword2=York not NY is that correct?
This is correct. When you do MATCH('NY') and have NY=>New York regexp conversion then Sphinx first converts NY into New York and only after that it starts searching, i.e. it forgets about NY completely. The same happens when indexing: it first prepares tokens, then indexes them forgetting about the original text.
To demonstrate (this is in Manticore, a fork of Sphinx, but regexp_filter processing and its effect on searching work the same way as in Sphinx):
mysql> create table t(f text) regexp_filter='NY=>New York';
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t values(0, 'I low New York');
Query OK, 1 row affected (0.01 sec)
mysql> select * from t where match('NY');
+---------------------+----------------+
| id | f |
+---------------------+----------------+
| 2810862456614682625 | I low New York |
+---------------------+----------------+
1 row in set (0.01 sec)
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | new |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | york |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
9 rows in set (0.00 sec)
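The order of operations (rewrite first, tokenize afterwards) can be imitated in a few lines of Python. This is only a rough analogy, not how Sphinx or Manticore is implemented: Python's re module stands in for the real regexp engine, and splitting on whitespace plus lowercasing is a simplification of the actual tokenizer. The function name is hypothetical.

```python
import re


def keywords_after_filter(query, pattern=r"NY", replacement="New York"):
    """Apply a regexp rewrite first, then split the result into keywords."""
    # Step 1: the filter rewrites the text; the original 'NY' is gone after this.
    rewritten = re.sub(pattern, replacement, query)
    # Step 2: only the rewritten text is tokenized (simplified: split + lowercase).
    return [word.lower() for word in rewritten.split()]


keywords = keywords_after_filter("NY")
print(keywords)
```

The keyword list contains new and york but no trace of ny, which is what SHOW META reports above.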

How to query just the last record of every second within a period of time in postgres

I have a table with hundreds of millions of records in 'prices' table with only four columns: uid, price, unit, dt. dt is a datetime in standard format like '2017-05-01 00:00:00.585'.
I can quite easily select a period using
SELECT uid, price, unit from prices
WHERE dt > '2017-05-01 00:00:00.000'
AND dt < '2017-05-01 02:59:59.999'
What I can't understand is how to select the price for the last record in each second. (I also need the very first one of each second, but I guess it will be a similar separate query.) There are some similar examples (here), but they did not work for me: they generated errors when I tried to adapt them to my needs.
Could someone please help me crack this nut?
Let's say that there is a table which has been generated with the help of this command:
CREATE TABLE test AS
SELECT timestamp '2017-09-16 20:00:00' + x * interval '0.1' second As my_timestamp
from generate_series(0,100) x
This table contains an increasing series of timestamps, each timestamp differs by 100 milliseconds (0.1 second) from neighbors, so that there are 10 records within each second.
| my_timestamp |
|------------------------|
| 2017-09-16T20:00:00Z |
| 2017-09-16T20:00:00.1Z |
| 2017-09-16T20:00:00.2Z |
| 2017-09-16T20:00:00.3Z |
| 2017-09-16T20:00:00.4Z |
| 2017-09-16T20:00:00.5Z |
| 2017-09-16T20:00:00.6Z |
| 2017-09-16T20:00:00.7Z |
| 2017-09-16T20:00:00.8Z |
| 2017-09-16T20:00:00.9Z |
| 2017-09-16T20:00:01Z |
| 2017-09-16T20:00:01.1Z |
| 2017-09-16T20:00:01.2Z |
| 2017-09-16T20:00:01.3Z |
.......
The query below determines and prints the first and the last timestamp within each second:
SELECT my_timestamp,
CASE
WHEN rn1 = 1 THEN 'First'
WHEN rn2 = 1 THEN 'Last'
ELSE 'Somewhere in the middle'
END as Which_row_within_a_second
FROM (
select *,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp
) rn1,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp DESC
) rn2
from test
) xx
WHERE 1 IN (rn1, rn2 )
ORDER BY my_timestamp
;
| my_timestamp | which_row_within_a_second |
|------------------------|---------------------------|
| 2017-09-16T20:00:00Z | First |
| 2017-09-16T20:00:00.9Z | Last |
| 2017-09-16T20:00:01Z | First |
| 2017-09-16T20:00:01.9Z | Last |
| 2017-09-16T20:00:02Z | First |
| 2017-09-16T20:00:02.9Z | Last |
| 2017-09-16T20:00:03Z | First |
| 2017-09-16T20:00:03.9Z | Last |
| 2017-09-16T20:00:04Z | First |
| 2017-09-16T20:00:04.9Z | Last |
| 2017-09-16T20:00:05Z | First |
| 2017-09-16T20:00:05.9Z | Last |
A working demo you can find here
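The same row_number() idea can be applied to the prices table from the question. Below is a minimal sketch using Python's built-in sqlite3 module, assuming SQLite 3.25 or newer (needed for window functions). SQLite has no date_trunc, so substr(dt, 1, 19) stands in for date_trunc('second', dt), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (uid INTEGER, price REAL, unit TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    [
        # Invented sample rows: two in the first second, three in the next.
        (1, 10.0, "kg", "2017-05-01 00:00:00.100"),
        (2, 10.5, "kg", "2017-05-01 00:00:00.585"),
        (3, 11.0, "kg", "2017-05-01 00:00:01.020"),
        (4, 11.5, "kg", "2017-05-01 00:00:01.700"),
        (5, 12.0, "kg", "2017-05-01 00:00:01.950"),
    ],
)

# Number the rows of each second from newest to oldest and keep number 1.
LAST_PER_SECOND_SQL = """
SELECT uid, price, unit, dt
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY substr(dt, 1, 19)   -- 'YYYY-MM-DD HH:MM:SS'
               ORDER BY dt DESC
           ) AS rn
    FROM prices
    WHERE dt >= '2017-05-01 00:00:00.000'
      AND dt <  '2017-05-01 03:00:00.000'
) ranked
WHERE rn = 1
ORDER BY dt
"""

last_rows = conn.execute(LAST_PER_SECOND_SQL).fetchall()
for row in last_rows:
    print(row)
```

Changing ORDER BY dt DESC to ORDER BY dt inside the window gives the first record of each second instead.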

Using HiveQL, how do I pull the row with the highest integer?

I have a table with a few million rows of data that looks like this:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /mom | pizza | 15 |
| /dad | pizza | 8 |
| /uncle | pizza | 2 |
| /brother | pizza | 7 |
| /mom | pasta | 12 |
| /dad | pasta | 23 |
+---------------+--------------+-------------------+
My goal is to run a HiveQL Query that will return the largest 'interactions' number for each unique page/term combo. For example:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /dad | pasta | 23 |
| /mom | pizza | 15 |
+---------------+--------------+-------------------+
How would I write this considering that each unique page has hundreds of thousands of search_terms, but I only want to pull the one search_term with the most interactions?
I have tried using max(interactions) and max(struct(interactions, search_term)).col1 but have had no luck. My output is consistently giving me all of the search_terms for each page no matter how many interactions.
Thanks!
Use row_number() analytic function:
select page, search_term, interactions
from
(select page, search_term, interactions,
row_number() over (partition by page order by interactions desc ) rn
from your_table -- hypothetical name; the subquery needs a source table
)s
where rn = 1;
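What rn = 1 selects can be illustrated in plain Python on the sample rows from the question. This is a sketch of the logic only, not something Hive runs: the function name is hypothetical, and on ties it keeps the first row seen, whereas row_number() picks an arbitrary one among equals.

```python
def top_term_per_page(rows):
    """Return {page: (search_term, interactions)} for the highest count per page."""
    best = {}
    for page, search_term, interactions in rows:
        # Keep a row only if it beats the best one seen so far for this page.
        if page not in best or interactions > best[page][1]:
            best[page] = (search_term, interactions)
    return best


rows = [
    ("/mom", "pizza", 15),
    ("/dad", "pizza", 8),
    ("/uncle", "pizza", 2),
    ("/brother", "pizza", 7),
    ("/mom", "pasta", 12),
    ("/dad", "pasta", 23),
]
top = top_term_per_page(rows)
for page, (term, count) in sorted(top.items()):
    print(page, term, count)
```

Note that every page gets exactly one row, so /uncle and /brother appear too; add a filter on interactions if only the busiest pages are wanted.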

Join column with timestamps where value is maximum

I have a table that looks like
+-------+-----------+
| value | timestamp |
+-------+-----------+
and I'm trying to build a query that gives a result like
+-------+-----------+------------+------------------------+
| value | timestamp | MAX(value) | timestamp of max value |
+-------+-----------+------------+------------------------+
so that the result looks like
+---+----------+---+----------+
| 1 | 1.2.1001 | 3 | 1.1.1000 |
| 2 | 5.5.1021 | 3 | 1.1.1000 |
| 3 | 1.1.1000 | 3 | 1.1.1000 |
+---+----------+---+----------+
but I got stuck on joining the column with the corresponding timestamps.
Any hints or suggestions?
Thanks in advance!
For further information (if that helps):
In the real project the max-values are grouped by month and day (with group by clause, which works btw), but somehow I got stuck on joining the timestamps for max-values.
EDIT
Cross joins are a good idea, but I want to have them grouped by month e.g.:
+---+----------+---+----------+
| 1 | 1.1.1101 | 6 | 1.1.1300 |
| 2 | 2.6.1021 | 5 | 5.6.1000 |
| 3 | 1.1.1200 | 6 | 1.1.1300 |
| 4 | 1.1.1040 | 6 | 1.1.1300 |
| 5 | 5.6.1000 | 5 | 5.6.1000 |
| 6 | 1.1.1300 | 6 | 1.1.1300 |
+---+----------+---+----------+
EDIT 2
I've added a fiddle with some sample data and an example of the current query.
http://sqlfiddle.com/#!1/efa42/1
How to add the corresponding timestamp to the maximum?
Try a cross join of two subqueries: the first one selects all records, and the second one returns a single row holding the maximum value and its time_stamp, <3;"1000-01-01"> for example.
SELECT col_value,col_timestamp,max_col_value, col_timestamp_of_max_value FROM table1
cross join
(
select max(col_value) max_col_value ,col_timestamp col_timestamp_of_max_value from table1
group by col_timestamp
order by max_col_value desc
limit 1
) A --One row that represents the time_stamp of the max value, ie: <3;"1000-01-01">
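Since you are on PostgreSQL, use window functions:
Select *, max( value ) over (), max( timestamp ) over() from table1
This adds the maximum of each column, computed over all rows, to every row. Note that max( timestamp ) over() is the latest timestamp overall, not the timestamp of the row that holds the maximum value; for that, use first_value( timestamp ) over (order by value desc) instead.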
http://www.postgresql.org/docs/9.1/static/tutorial-window.html
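For the month-grouped variant from the question's first EDIT, the intended result can be sketched in plain Python. This only illustrates the logic and is not a replacement for the SQL: the function name is hypothetical, and it assumes the timestamps in the question read as D.M.YYYY, so that the middle field is the month.

```python
def attach_monthly_max(rows):
    """Extend each (value, 'D.M.YYYY') row with its month's max value
    and the timestamp of the row that holds that max."""
    best = {}  # month -> (max value, timestamp of that value)
    for value, stamp in rows:
        month = stamp.split(".")[1]
        if month not in best or value > best[month][0]:
            best[month] = (value, stamp)
    # Second pass: append the winning (value, timestamp) of the row's month.
    return [(value, stamp) + best[stamp.split(".")[1]] for value, stamp in rows]


rows = [
    (1, "1.1.1101"),
    (2, "2.6.1021"),
    (3, "1.1.1200"),
    (4, "1.1.1040"),
    (5, "5.6.1000"),
    (6, "1.1.1300"),
]
result = attach_monthly_max(rows)
for row in result:
    print(row)
```

In SQL, the same grouping corresponds to adding a partition by the month expression to the window, instead of the empty over ().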