Sphinx Mulit-Level Sort with Randomize - sphinx

Here is my challenge with Sphinx Sort where I have Vendors who pay for premium placement and those who don't:
I already do a multi-level order including the PaidVendorStatus which is either 0 or 1 as:
order by PaidVendorStatus,Weight()
So in essence I end up with multiple sort groups:
PaidVendorStatus=1, Weight1
....
PaidVendorStatus=1, WeightN
PaidVendorStatus=0, Weight1
...
PaidVendorStatus=0, WeightN
The problem is I have three goals:
Randomly prioritize each vendor in any given sort group
Have each vendor's 'odds' of being randomly assigned top position be equal regardless of how many records they have returned in the group (so if Vendor A has 50 results and VendorB has 2 results they still both have 50% odds of being randomly assigned any given spot)
Ideally, maintain the same results order in any given search (so that if the user searches again the same order will be displayed
I've tried various solutions:
Select CRC32(Vendor) as RANDOM...Order by PaidVendorStatus,Weight(),RANDOM
which solves 2 and 3 except due to the nature of CRC32 ALWAYS puts the same vendor first (and second, third, etc.) so in essence does not solve the issue at all.
I tried making a sphinx sql_attr_string in my Sphinx Configuration which was a concatenation of Vendor and the record Title (Select... concat(Vendor,Title) as RANDOMIZER..)` and then used that to randomize
Select CRC32(RANDOMIZER) as RANDOM...
which solves 1 and 3 as now the Title field gets thrown in the randomization mis so that the same Vendor does not always get first billing. However, it fails at 2 since in essence I am only sorting by Title and thus Vendor B with two results now has a very low change of being sorted first.
In an ideal world naturally I could just order this way;
Order by PaidVendorStatus,Weight(),RAND(Vendor)
but that is not possible.
Any thoughts on this appreciated. I did btw check out as per Barry Hunter's suggestion this thread on UDF but unless I am not understanding it at all (possible) it does not seem to be the solution for this problem.

Well one idea is:
SELECT * FROM (
SELECT *,uniqueserial(vendor_id) AS sorter FROM index WHERE MATCH(...)
ORDER BY PaidVendorStatus DESC ,Weight() DESC LIMIT 1000
) ORDER BY sorter DESC, WEIGHT() DESC:
This exploits SPhixnes 'multiple sort' function with pysudeo subquery.
This works wors becasuse the inner query is sorted by PaidVendor first, so their items are fist. Which works to affect the ordr that unqique serial is called in.
Its NOT really 'randomising' the results as such, seems you jsut randomising them to mix up the vendors (so a single vendor doesnt domninate results. Uniqueserial works by 'spreading' the particular vendors results out - the results will tend to cycle through the vendors.
This is tricky as it exploits a relative undocumented sphinx feature - subqueries.
For the UDF see http://svn.geograph.org.uk/svn/modules/trunk/sphinx/

Still dont have an answer for your biased random (as in 2.)
but just remembered another feature taht can help with 3. - can supply s specific seed to the random number. Typically random generators are seeded from the current time, which gives ever changing values, But using a specific seed.
Seed is however a number, so need a predictable, but changing number. Could CRC the query?
... sphinx doesnt support expressions in the OPTION so would have to caculate the hash in the app.
<?php
$query = $db->Quote($_GET['q']);
$crc = crc32($query);
$sql = "SELECT id,IDIV(WEIGHT(),100) as i,RAND() as r FROM index WHERE MATCH($query)
ORDER BY PaidVendorStatus DESC,i DESC,r ASC OPTION random_seed=$crc";
If wanted the results to only slowly evolve, add the current date, so each day is a new selection...
$crc = crc32($query.date('Ymd'));

Related

Active Record efficient querying on multiple different tables

Let me give a summary of what I've been attempting to do and the efficiency issues I've been running into:
Essentially I want my users to be able to select parameters to filter data from my database, then I want to pass relevant data which passes those filters from the controller.
However, these filters query on data from multiple different tables (that is, about 5-6 different tables), some of which are quite large (as in 100k+ rows). These tables are all related to what I want to show, e.g. Here is a bond that meets so and so criteria, which is issued by so and so issuer, which must meet these criteria, and so on.
From an end result, I only really need about 100 rows after querying based on the parameters given by the user, but it feels like I need to look at everything in every table because I dont know how strict the filters will be beforehand. e.g. With a starting universe of 100k sets of data, passing filter f1,f2 of Table 1 might leave 90k, but after passing through filter f3 of table 2, f4,f5,f6 of table 3, and so ..., we might end up with 100 or less sets of data that pass these parameters because the last filters checked might be quite strict.
How can I go about querying through these multiple different tables efficiently?
Doing a join between them seems like it'd yield some time complexity of |T_1||T_2||T_3||T_4||T_5||T_6| where T_i is the "size" of table i.
On the other hand, just looking through the other tables based off the ids of the ones that pass the previous filter (as in, id 5,7,8 pass filters in T_1, which of those ids then pass filters in T_2, then which of those pass filters in T_3 and so on) looks like it might(?) have time complexity of |T_1| + |T_2| + ... + |T_6|.
I'm relatively new to Ruby on Rails, so im not entirely sure all of the tools at my disposal that could help with optimizing this, but at the same time I'm not entirely sure how to best approach this algorithmically.

Search engine like full text search in PostgreSQL

I have a list of titles and descriptions in a table which are indexed in a tsvector column. How can I implement Google Search like full text search functionality in Postgres for these fields. I tried various functions offered by standard Postgres like
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query. But I want results including both apple and orange first but can still have results including even one of these terms just later in the results.
phraseto_tsquery('apple orange') -- apple <> orange
This function only matches orange followed by apple but not vice versa. But for me orange <> apple is also still relevant.
I also tried websearch_to_tsquery() but it behaves very similar to above functions.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then revert to | if you don't get enough rows.
The built in ranking functions don't care about order, but also don't care about proximity. So they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced some corresponding ranking functions. But since they invented only ordered proximity, not proximity without order, it is unlikely they would do you want if they did exist.

Slow query with order and limit clause but only if there are no records

I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob' the query time is fast on a table of 10M records (<.5 seconds)
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different than this post as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need that there. I've tried re-writing it (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1
Examine the execution plan to see which index is being used. In this case, the separate indexes for name and address are not enough. You should create a combined index of name, then address for this query.
Think of an index as a system maintained copy of certain columns, in a different order from the original. In this case, you want to first find matches by name, then tie-break on address, then take until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first. Then address will serve as our tie-breaker.
Under the original indexes, if the address index is the one chosen then the query's speed will vary based on how quickly it can find matches.
The plan (in english) would be: Proceed through all of the rows which happen to already be sorted by address, discard any that do not match the name, keep going until we have enough.
So if you do not get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: Proceed through all of the name matching rows which happen to already be sorted by address. Start with the first one and take them until you have enough. If you run out, stop.
Since the situation is that a query without the Order By is much faster than the one with the Order By clause; I'd make 2 queries:
-One without the order by, limit 1, to know if you have at least one record.
In the case you have at least one, it's safe to run the query with Order by.
-If there's no record, no need to run the second query.
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it will be useful to know which indices you have created. Without these it will be very hard to give you a proper solution!

Perl : Tracking duplicates

I am trying to figure out what would be the best way to go ahead and locate duplicates in a 5 column csv data. The real data has more than million rows in it.
Following is the content of mentioned 6 columns.
Name, address, city, post-code, phone number, machine number
Data does not have fixed length, data might in certain columns might be missing in certain instances.
I am thinking of using perl to first normalize all the short forms used in names, city and address. Fellow perl enthusiasts from stackoverflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering is it possible to match content based on "LIKELINESS / SIMILARITY" (eg. google similar to gugl) the likeliness would be required to overcome errors that creeped in while collecting data.
I have 2 tasks in hand w.r.t. the data.
Flag duplicate rows with certain identifier
Mention the percentage match between similar rows.
I would really appreciate if I could get suggestions as to what all possible methods could be employed and which would propbably be best because of their certain merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you wanted to search case insensitively you can add an index on the lower-case name. (Thanks #asjo for pointing that out)
CREATE INDEX ON whatever ((lower(name)));
// This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ');
Finally, you can use the Postgres Trigram module to add fuzzy indexing to your table (thanks again to #asjo).

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on first letter of certain nvarchar column and not the usual one, that usually pages on number of results.
And I'm not faced with a challenge whether to filter results using LIKE operator or equality (=) operator.
select *
from table
where name like #firstletter + '%'
vs.
select *
from table
where left(name, 1) = #firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = #firstletter
That's because most database are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest to use like operator when the search column contains index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)= 'ABA' OR ... up to 9 OR clauses. My count displays 7301477 records with 4 secs in left and 1 second in like i.e where column_name like 'AAA%' OR Column_Name like 'ABA%' or ... up to 9 like clauses.
Calling a function in where clause is not a best practice. Refer http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.startsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.