Use field twice in Sphinx query with different user_weights

I've got a working index and everything is fine, but there's a little thing that is bothering me.
There is one query in which I'm searching the same field twice with different strings and I'd like to assign different user_weights to those.
I've got this working by altering the index, selecting the field twice with different names.
But this feels a little bit wrong to me.
Is it possible to use the same Field twice with different user weights?
I read about keyword boosting, but as my queries are usually a bit more complex than a single keyword, this would get messy.
Example query (I cleaned this up a little bit; the lists of search terms are generated):
SELECT titel, age_firstdate, age_priodatum, uniqueserial(partner) as sortid,
weight() as w, if (w > 200000, IF(w > 500000, 2, 1) ,0) as intitel, IF (w>5000, 1, 0) as ingroup,
sort_w + sort_intitel as score, 0 as geodist
FROM anzeigen
WHERE MATCH('(
(@titel_dup wachfrau) |
(@titel_mf (wachmann|wachfrau|wachleute)) |
(@titel (("Fachkraft Schutz Sicherheit"|[...]|"Wachleute"))) |
(@titel_low (("Security Detektiv"|[...]|"450 Security"))))
') AND geodist <= 30
ORDER BY geodist ASC, intitel DESC, ingroup DESC, score DESC
LIMIT 10000
OPTION max_matches=100000, field_weights = (titel_dup=500001, titel_mf=200001, titel=5001, titel_low = 100, beschreibung=5),
ranker=expr('sum(word_count*user_weight*lcs)')
index-definition (cleaned up too):
sql_query = SELECT SQL_NO_CACHE \
s.id, s.titel, s.titel as titel_dup, s.titel as titel_mf, s.titel as titel_low,[...] \
FROM _sphinxSource s
sql_field_string = titel
sql_field_string = titel_dup
sql_field_string = titel_mf
sql_field_string = titel_low
I also tried the following in Sphinxql:
SELECT [...] titel as title_dup, title as title_low
[...] OPTIONS field_weights = (titel_dup=500001, titel_mf=200001)
[...]
Which would be a nice solution but doesn't work.
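For reference, this is roughly what the keyword boosting I mentioned above would look like (assuming the per-keyword ^ boost modifier from the extended query syntax; the factors below are made up, and every single keyword needs its own factor, which is exactly why this gets messy with long, generated term lists):
SELECT titel, weight() as w
FROM anzeigen
WHERE MATCH('@titel (wachfrau^4 | wachmann^2 | wachleute^2 | "Fachkraft Schutz Sicherheit" | "Security Detektiv")')
OPTION ranker=expr('sum(word_count*user_weight*lcs)')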

Related

Can somebody help me translate this into postgresql?

I am very new to SQL and I do not know much about writing code for different DBMSs. I am trying to write a report in our school's MOODLE platform, which uses postgresql, using a configurable report found here. However, the code does not work in postgresql. In particular, how do I rewrite the lines with variable assignments like @prevtime := so that the code works in postgresql?
Here is the complete code from the link.
SELECT
l.id,
l.timecreated,
DATE_FORMAT(FROM_UNIXTIME(l.timecreated),'%d-%m-%Y') AS dTime,
@prevtime := (SELECT MAX(timecreated) FROM mdl_logstore_standard_log
WHERE userid = %%USERID%% AND id < l.id ORDER BY id ASC LIMIT 1) AS prev_time,
IF (l.timecreated - @prevtime < 7200, @delta := @delta + (l.timecreated-@prevtime),0) AS sumtime,
l.timecreated-@prevtime AS delta,
"User" AS TYPE
FROM prefix_logstore_standard_log AS l,
(SELECT @delta := 0) AS s_init
# CHANGE UserID
WHERE l.userid = %%USERID%% AND l.courseid = %%COURSEID%%
%%FILTER_STARTTIME:l.timecreated:>%% %%FILTER_ENDTIME:l.timecreated:<%%
This is supposed to report the time spent by students in courses in MOODLE.
I assume the original query was written for MySQL. You haven't explained what the query actually does, but the @prevtime hack is usually a workaround for missing window functions, so most probably this can be done using lag() in Postgres, something along these lines:
select l.id,
       l.timecreated,
       to_char(to_timestamp(l.timecreated), 'dd-mm-yyyy') as dtime,
       lag(l.timecreated) over w as prev_time,
       l.timecreated - lag(l.timecreated) over w as delta,
       'User' as type
from prefix_logstore_standard_log as l
where l.userid = %%USERID%%
  and l.courseid = %%COURSEID%%
window w as (partition by l.userid order by l.id)
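If you also need the running total from the original query (the @delta accumulation for gaps below 7200 seconds), you can layer a window sum on top of the lag() result, something along these lines (an untested sketch using the same placeholders as above):
select id,
       timecreated,
       to_char(to_timestamp(timecreated), 'dd-mm-yyyy') as dtime,
       prev_time,
       delta,
       sum(case when delta < 7200 then delta else 0 end)
           over (order by id) as sumtime,  -- running equivalent of @delta
       'User' as type
from (
    select l.id,
           l.timecreated,
           lag(l.timecreated) over w as prev_time,
           l.timecreated - lag(l.timecreated) over w as delta
    from prefix_logstore_standard_log as l
    where l.userid = %%USERID%%
      and l.courseid = %%COURSEID%%
    window w as (partition by l.userid order by l.id)
) t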

Optimizing Postgres query with timestamp filter

I have a query:
SELECT DISTINCT ON (analytics_staging_v2s.event_type, sent_email_v2s.recipient, sent_email_v2s.sent) sent_email_v2s.id, sent_email_v2s.user_id, analytics_staging_v2s.event_type, sent_email_v2s.campaign_id, sent_email_v2s.recipient, sent_email_v2s.sent, sent_email_v2s.stage, sent_email_v2s.sequence_id, people.role, people.company, people.first_name, people.last_name, sequences.name as sequence_name
FROM "sent_email_v2s"
LEFT JOIN analytics_staging_v2s ON sent_email_v2s.id = analytics_staging_v2s.sent_email_v2_id
JOIN people ON sent_email_v2s.person_id = people.id
JOIN sequences on sent_email_v2s.sequence_id = sequences.id
JOIN users ON sent_email_v2s.user_id = users.id
WHERE "sent_email_v2s"."status" = 1
AND "people"."person_type" = 0
AND (sent_email_v2s.sequence_id = 1888) AND (sent_email_v2s.sent >= '2016-03-18')
AND "users"."team_id" = 1
When I run EXPLAIN ANALYZE on it, I get:
Then, if I change that by just removing the (sent_email_v2s.sent >= '2016-03-18') filter, as follows:
SELECT DISTINCT ON (analytics_staging_v2s.event_type, sent_email_v2s.recipient, sent_email_v2s.sent) sent_email_v2s.id, sent_email_v2s.user_id, analytics_staging_v2s.event_type, sent_email_v2s.campaign_id, sent_email_v2s.recipient, sent_email_v2s.sent, sent_email_v2s.stage, sent_email_v2s.sequence_id, people.role, people.company, people.first_name, people.last_name, sequences.name as sequence_name
FROM "sent_email_v2s"
LEFT JOIN analytics_staging_v2s ON sent_email_v2s.id = analytics_staging_v2s.sent_email_v2_id
JOIN people ON sent_email_v2s.person_id = people.id
JOIN sequences on sent_email_v2s.sequence_id = sequences.id
JOIN users ON sent_email_v2s.user_id = users.id
WHERE "sent_email_v2s"."status" = 1
AND "people"."person_type" = 0
AND (sent_email_v2s.sequence_id = 1888) AND "users"."team_id" = 1
when I run EXPLAIN ANALYZE on this query, the results are:
EDIT:
The results above from today are about as I expected. When I ran this last night, however, the difference created by including the timestamp filter was about 100x slower (0.5s -> 59s). The EXPLAIN ANALYZE from last night showed all of the time increase to be attributed to the first unique/sort operation in the query plan above.
Could there be some kind of caching issue here? I am worried now that there might be something else going on (transiently) that might make this query take 100x longer since it happened at least once.
Any thoughts are appreciated!

Postgres reverse LIKE lookup indexing and performance

We have a musicians table containing records with multiple string fields, say:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Sting", "", "Bass"
"Ringo", "Starr", "Drums"
"Paul", "McCartney", "Bass"
I want to pass postgres a long string, say:
"It is known that Jimi liked to set light to his guitar and smash up
all the drums while on stage."
and i want to get returned the fields that have any matches - preferably in order of the most matches first:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Ringo", "Starr", "Drums"
because i need the search to be case insensitive, i'm constructing a query like this...
select * from musicians where lowercase_string like '%'||firstname||'%' or lowercase_string like '%'||lastname||'%' or lowercase_string like '%'||instrument||'%'
and then looping through (in ruby in my case) to capture the result with the most matches.
this is however very slow in the sql stage (1 minute+).
i've tried adding lower-case GIN index using pg_trgm as suggested here - but it's not helping - presumably because the like query is back to front?
Thanks!
With my testing, it seems that no trigram index can help your query as it stands, and no other index type could speed up an (I)LIKE / FTS based search in that direction either.
I should mention that all of the queries below do use the trigram indexes when they are run "reversed": when the table contains the document (which is indexed) and your parameter is the query. The (I)LIKE variant, for example, is 2-3 times faster with the index in that case.
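For reference, the trigram indexes I used in those tests look like this (a sketch; pg_trgm and gin_trgm_ops are the standard pieces, the index names are just placeholders):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- one trigram GIN index per searched column
CREATE INDEX musicians_firstname_trgm_idx ON musicians USING gin (firstname gin_trgm_ops);
CREATE INDEX musicians_lastname_trgm_idx ON musicians USING gin (lastname gin_trgm_ops);
CREATE INDEX musicians_instrument_trgm_idx ON musicians USING gin (instrument gin_trgm_ops);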
These are the queries I've tested:
select *
from musicians
where :input_string ilike '%' || firstname || '%'
or :input_string ilike '%' || lastname || '%'
or :input_string ilike '%' || instrument || '%'
At first, FTS seemed a great idea, but my testing shows that even without ranking, it is 60-100 times slower than the (I)LIKE variant. (So even though you don't have to post-process the results with these methods, they are not worth it.)
select *
from musicians
where to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
However, ORDER BY rank doesn't slow down that much further: it is 70-120 times slower than the (I)LIKE variant.
select *
from musicians
where to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
order by ts_rank(to_tsvector(:input_string), plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(instrument))
Then, for a last effort, I tried the (fairly new) "word similarity" operators of the trigram module: <% and %> (available from PostgreSQL 9.6).
select *
from musicians
where :input_string %> firstname
or :input_string %> lastname
or :input_string %> instrument
select *
from musicians
where firstname <% :input_string
or lastname <% :input_string
or instrument <% :input_string
These were somewhat faster than FTS: around 50-70 times slower than the (I)LIKE variant.
(Partially working) rextester: it is run against PostgreSQL 9.5, so the 9.6 operators obviously won't run here.
Update: IF full word match is enough for you, you can actually reverse your query, to be able to use indexes. You'll need to "parse" your query (aka. "long string") though:
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(ls, '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where firstname ilike word
or lastname ilike word
or instrument ilike word
group by musicians.id
Note: I split the query on every complete word. You could use some other logic there, or it could even be parsed on the client side.
The default btree index shines here, as it is much faster than the trigram index with (I)LIKE (which we don't need anyway, as we are looking for complete word matches):
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*
from musicians, words
where lower(firstname) = word
or lower(lastname) = word
or lower(instrument) = word
group by musicians.id
http://rextester.com/PSABJ6745
You could even get the match count with something like
sum((lower(firstname) = word)::int
+ (lower(lastname) = word)::int
+ (lower(instrument) = word)::int)
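Putting that together with the word-split query above, a ranked variant could look something like this (same assumptions as before, ordering by the match count):
with long_string(ls) as (
values (:input_string)
),
words(word) as (
select s
from long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
where s <> ''
)
select musicians.*,
       sum((lower(firstname) = word)::int
         + (lower(lastname) = word)::int
         + (lower(instrument) = word)::int) as matches
from musicians, words
where lower(firstname) = word
   or lower(lastname) = word
   or lower(instrument) = word
group by musicians.id
order by matches desc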
The ilike option with match ordering:
with long_string (ls) as (values
('It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage.')
)
select musicians.*, matches
from
musicians
cross join
long_string
cross join lateral
(select
(ls ilike format ('%%%s%%', first_name) and first_name != '')::int +
(ls ilike format ('%%%s%%', last_name) and last_name != '')::int +
(ls ilike format ('%%%s%%', instrument) and instrument != '')::int
as matches
) m
where matches > 0
order by matches desc
;
 first_name | last_name | instrument | matches
------------+-----------+------------+---------
 Jimi       | Hendrix   | Guitar     |       2
 Phil       | Collins   | Drums      |       1
 Ringo      | Starr     | Drums      |       1

Count previous occurences of a value split by date ranges

Here's a simple query we do for ad hoc requests from our Marketing department on the leads we received in the last 90 days.
SELECT ID
,FIRST_NAME
,LAST_NAME
,ADDRESS_1
,ADDRESS_2
,CITY
,STATE
,ZIP
,HOME_PHONE
,MOBILE_PHONE
,EMAIL_ADDRESS
,ROW_ADDED_DTM
FROM WEB_LEADS
WHERE ROW_ADDED_DTM BETWEEN @START AND @END
They are asking for more derived columns to be added that show the number of previous occurrences of ADDRESS_1 where the EMAIL_ADDRESS matches. But they want it for different date ranges.
So the derived columns would look like this:
,COUNT_ADDRESS_1_LAST_1_DAYS,
,COUNT_ADDRESS_1_LAST_7_DAYS
,COUNT_ADDRESS_1_LAST_14_DAYS
etc.
I've manually filled these derived columns using update statements when there were just a few. The above query is really just a sample of a much larger query with many more columns. The actual request has blossomed into 6 date ranges for 13 columns. I'm asking if there's a better way than using 78 additional update statements.
I think you will have a hard time writing a query that includes all of these 78 metrics per e-mail address without actually creating a query that hard-codes the different choices. However you can generate such a pivot query with dynamic SQL, which will save you some keystrokes and will adjust dynamically as you add more columns to the table.
The result you want to end up with will look something like this (but of course you won't want to type it):
;WITH y AS
(
SELECT
EMAIL_ADDRESS,
/* aggregation portion */
[ADDRESS_1] = COUNT(DISTINCT [ADDRESS_1]),
[ADDRESS_2] = COUNT(DISTINCT [ADDRESS_2]),
... other columns
/* end agg portion */
FROM dbo.WEB_LEADS AS wl
WHERE ROW_ADDED_DTM >= /* one of 6 past dates */
GROUP BY wl.EMAIL_ADDRESS
)
SELECT EMAIL_ADDRESS,
/* pivot portion */
COUNT_ADDRESS_1_LAST_1_DAYS = *count address 1 from 1 day ago*,
COUNT_ADDRESS_1_LAST_7_DAYS = *count address 1 from 7 days ago*,
... other date ranges ...
COUNT_ADDRESS_2_LAST_1_DAYS = *count address 2 from 1 day ago*,
COUNT_ADDRESS_2_LAST_7_DAYS = *count address 2 from 7 days ago*,
... other date ranges ...
... repeat for 11 more columns ...
/* end pivot portion */
FROM y
GROUP BY EMAIL_ADDRESS
ORDER BY EMAIL_ADDRESS;
This is a little involved, and it should all be run as one script, but I'm going to break it up into chunks to intersperse comments on how the above portions are populated without typing them. (And before long @Bluefeet will probably come along with a much better PIVOT alternative.) I'll enclose my interspersed comments in /* */ so that you can still copy the bulk of this answer into Management Studio and run it with the comments intact.
Code/comments to copy follows:
/*
First, let's build a table of dates that can be used both to derive labels for pivoting and to assist with aggregation. I've added the three ranges you've mentioned and guessed at a fourth, but hopefully it is clear how to add more:
*/
DECLARE @d DATE = SYSDATETIME();
CREATE TABLE #L(label NVARCHAR(15), d DATE);
INSERT #L(label, d) VALUES
(N'LAST_1_DAYS', DATEADD(DAY, -1, @d)),
(N'LAST_7_DAYS', DATEADD(DAY, -8, @d)),
(N'LAST_14_DAYS', DATEADD(DAY, -15, @d)),
(N'LAST_MONTH', DATEADD(MONTH, -1, @d));
/*
Next, let's build the portions of the query that are repeated per column name. First, the aggregation portion is just in the format col = COUNT(DISTINCT col). We're going to go to the catalog views to dynamically derive the list of column names (except ID, EMAIL_ADDRESS and ROW_ADDED_DTM) and stuff them into a #temp table for re-use.
*/
SELECT name INTO #N FROM sys.columns
WHERE [object_id] = OBJECT_ID(N'dbo.WEB_LEADS')
AND name NOT IN (N'ID', N'EMAIL_ADDRESS', N'ROW_ADDED_DTM');
DECLARE @agg NVARCHAR(MAX) = N'', @piv NVARCHAR(MAX) = N'';
SELECT @agg += ',
' + QUOTENAME(name) + ' = COUNT(DISTINCT '
+ QUOTENAME(name) + ')' FROM #N;
PRINT @agg;
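/*
For illustration only: with just the two ADDRESS columns from the example, PRINT @agg emits something like the following (your real column list will be longer):

,
[ADDRESS_1] = COUNT(DISTINCT [ADDRESS_1]),
[ADDRESS_2] = COUNT(DISTINCT [ADDRESS_2])
*/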
/*
Next we'll build the "pivot" portion (even though I am angling for the poor man's pivot - a bunch of CASE expressions). For each column name we need a conditional against each range, so we can accomplish this by cross joining the list of column names against our labels table. (And we'll use this exact technique again in the query later to make the /* one of 6 past dates */ portion work.)
*/
SELECT @piv += ',
COUNT_' + n.name + '_' + l.label
+ ' = MAX(CASE WHEN label = N''' + l.label
+ ''' THEN ' + QUOTENAME(n.name) + ' END)'
FROM #N as n CROSS JOIN #L AS l;
PRINT @piv;
/*
Now, with those two portions populated as we'd like them, we can build a dynamic SQL statement that fills out the rest:
*/
DECLARE @sql NVARCHAR(MAX) = N';WITH y AS
(
SELECT
EMAIL_ADDRESS, l.label' + @agg + '
FROM dbo.WEB_LEADS AS wl
CROSS JOIN #L AS l
WHERE wl.ROW_ADDED_DTM >= l.d
GROUP BY wl.EMAIL_ADDRESS, l.label
)
SELECT EMAIL_ADDRESS' + @piv + '
FROM y
GROUP BY EMAIL_ADDRESS
ORDER BY EMAIL_ADDRESS;';
PRINT @sql;
EXEC sp_executesql @sql;
GO
DROP TABLE #N, #L;
/*
Now again, this is a pretty complex piece of code, and perhaps it can be made easier with PIVOT. But I think even @Bluefeet will write a version of PIVOT that uses dynamic SQL because there is just way too much to hard-code here IMHO.
*/

DB2 V9 ZOS - Performance tuning

Background
Currently I am using DB2 V9 version. One of my stored procedure is taking time to execute. I looked BMC apptune and found the following SQL.
There are three tables we were using to execute the following query.
ACCOUNT table is having 3413 records
EXCHANGE_RATE is having 1267K records
BALANCE is having 113M records
Someone has added recently following piece of code in the query. I think because of this we had a problem.
AND (((A.ACT <> A.EW_ACT)
AND (A.EW_ACT <> ' ')
AND (C.ACT = A.EW_ACT))
OR (C.ACT = A.ACT))
Query
SELECT F1.CLO_LED
INTO :H :H
FROM (SELECT A.ACT, A.BNK, A.ACT_TYPE,
CASE WHEN :H = A.CUY_TYPE THEN DEC(C.CLO_LED, 21, 2)
ELSE DEC(MULTIPLY_ALT(C.CLO_LED, COALESCE(B.EXC_RATE, 0)), 21, 2)
END AS CLO_LED
FROM ACCOUNT A
LEFT OUTER JOIN EXCHANGE_RATE B
ON B.EFF_DATE = CURRENT DATE - 1 DAY
AND B.CURCY_FROM = A.CURNCY_TYPE
AND B.CURCY_TO = :H
AND B.STA_TYPE = 'A'
, BALANCE C
WHERE A.CUSR_ID = :DCL.CUST-ID
AND A.ACT = :DCL.ACT
AND A.EIG_RTN = :WS-BNK-ID
AND A.ACT_TYPE = :DCL.ACT-TYPE
AND A.ACT_CAT = :DCL.ACT-CAT
AND A.STA_TYPE = 'A'
AND (((A.ACT <> A.EW_ACT)
AND (A.EW_ACT <> ' ')
AND (C.ACT = A.EW_ACT))
OR (C.ACT = A.ACT))
AND C.BNK = :WS-BNK-ID
AND C.ACT_TYPE = :DCL.ACT-TYPE
AND C.BUS_DATE = :WS-DATE-FROM) F1
WITH UR
There are a number of weird things going on in this query. The most twitchy is the mixing of explicit joins with the implicit-join syntax; frankly, I'm not certain how the system interprets it. You also appear to be using the same host variable for both input and output; please don't.
Also, why are your column names so short? DB2 (that version, at least) supports column names that are much longer. Please save people's sanity, if at all possible.
We can't completely say why things are slow - we may need to see access plans. In the meantime, here's your query, restructured to what may be a faster form:
SELECT CASE WHEN :inputType = a.cuy_type THEN DEC(b.clo_led, 21, 2)
ELSE DEC(MULTIPLY_ALT(b.clo_led, COALESCE(c.exc_rate, 0)), 21, 2) END
INTO :amount :amountIndicator -- if you get results, do you need the indicator?
FROM Account as a
JOIN Balance as b -- This is assumed to not be a 'left', given coalesce not used
ON b.bnk = a.eig_rtn
AND b.act_type = a.act_type
AND b.bus_date = :ws-date-from
AND ((a.act <> a.ew_act -- something feels wrong here, but
AND a.ew_act <> ' ' -- without knowing the data, I don't
AND b.act = a.ew_act) -- want to muck with it.
OR b.act = a.act)
LEFT JOIN Exchange_Rate as c
ON c.eff_date = current_date - 1 day
AND c.curcy_from = a.curncy_type
AND c.sta_type = a.sta_type
AND c.curcy_to = :destinationCurrency
WHERE a.cusr_id = :dcl.cust-id
AND a.act = :dcl.act
AND a.eig_rtn = :ws-bnk-id
AND a.act_type = :dcl.act-type
AND a.act_cat = :dcl.act-cat
AND a.sta_type = 'A'
WITH UR
FETCH FIRST 1 ROW ONLY
A few other notes:
Only specify exactly those columns needed - under certain circumstances, this permits index-only access, where otherwise a followup table-access may be needed. However, this probably won't help here.
COALESCE(c.exc_rate, 0) feels off somehow - if no exchange rate is present, you return an amount of 0, which could otherwise be a valid amount. You may need to return some sort of indicator, or make it a normal join, not an outer one.
Also, try both this version, and possibly a version where the host variables are specified in addition to the conditions between tables. The optimizer should be able to propagate those values across the join automatically, but may not under some conditions (implementation detail).
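For example, restating the host-variable filters on Balance directly (a sketch showing only the changed join; the duplicated predicates are deliberately redundant with the WHERE clause):
JOIN Balance as b
ON b.bnk = a.eig_rtn
AND b.bnk = :ws-bnk-id -- restates a.eig_rtn = :ws-bnk-id against b
AND b.act_type = a.act_type
AND b.act_type = :dcl.act-type -- restates a.act_type = :dcl.act-type against b
AND b.bus_date = :ws-date-from
AND ((a.act <> a.ew_act
AND a.ew_act <> ' '
AND b.act = a.ew_act)
OR b.act = a.act)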