PostgreSQL Fuzzy Searching multiple words with Levenshtein - postgresql

I am working out a PostgreSQL query to allow fuzzy searching for a company's name in an app that I am working on. I have found and have been working with Postgres's levenshtein function (part of the fuzzystrmatch module), and for the most part it works. However, it only seems to work when the company's name is one word. For example:
With Apple (which is stored in the database as simply apple) I can run the following query and have it work near perfectly (it returns a levenshtein distance of 0):
SELECT * FROM contents
WHERE levenshtein(company_name, 'apple') < 4;
However, when I take the same approach with Sony (which is stored in the database as Sony Electronics INC), I am unable to get any useful results (entering Sony gives a Levenshtein distance of 16).
I have tried to remedy this problem by breaking the company's name down into individual words and inputting each one individually, resulting in something like this:
user input => 'sony'
SELECT * FROM contents
WHERE levenshtein('Sony', 'sony') < 4
OR levenshtein('Electronics', 'sony') < 4
OR levenshtein('INC', 'sony') < 4;
So my question is this: is there some way that I can accurately implement a multi-word fuzzy search with the current general approach that I have now, or am I looking in the complete wrong place?
Thanks!

Given your data and the following query, with deliberately wild values for the Levenshtein insertion (10000), deletion (100) and substitution (1) costs:
with sample_data as (
    select 101 as id, 'Sony Entertainment Inc' as name
    union
    select 102 as id, 'Apple Corp' as name
)
select sample_data.id, sample_data.name, components.part,
       levenshtein(components.part, 'sony', 10000, 100, 1) as ld_sony
from sample_data
inner join (
    select sd.id,
           lower(unnest(regexp_split_to_array(sd.name, E'\\s+'))) as part
    from sample_data sd
) components on components.id = sample_data.id;
The output is:
id | name | part | ld_sony
-----+------------------------+---------------+---------
101 | Sony Entertainment Inc | sony | 0
101 | Sony Entertainment Inc | entertainment | 903
101 | Sony Entertainment Inc | inc | 10002
102 | Apple Corp | apple | 104
102 | Apple Corp | corp | 3
(5 rows)
Row 1 - no changes
Row 2 - 9 deletions and 3 substitutions
Row 3 - 1 insertion and 2 substitutions
Row 4 - 1 deletion and 4 substitutions
Row 5 - 3 substitutions
I've found that splitting the words out causes a lot of false positives when you apply a threshold. You can order by the Levenshtein distance to position the better matches near the top. Maybe tweaking the Levenshtein costs will help you order the matches better. Sadly, Levenshtein doesn't weight earlier changes differently from later changes.
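If it helps to see the mechanics outside of SQL, here is the same idea sketched in plain Python (the function names are mine): split the stored name into tokens and keep the smallest per-token distance, which is effectively what the unnest/regexp_split query computes per row before you order by it.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_token_distance(company_name: str, query: str) -> int:
    """Split the stored name on whitespace and keep the best (smallest)
    per-token distance to the query."""
    tokens = re.split(r"\s+", company_name.lower())
    return min(levenshtein(t, query.lower()) for t in tokens)

print(best_token_distance("Sony Entertainment Inc", "sony"))  # 0
print(best_token_distance("Apple Corp", "sony"))              # 3 ('corp' -> 'sony')
```

Ordering companies by `best_token_distance` mirrors ordering the SQL result by `ld_sony`, with the same caveat about false positives when you only apply a fixed threshold.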


Should KSQL tables be showing multiple rows per key for aggregates?

My understanding of KSQL tables is that they show an "as is" view of our data rather than all the data.
So if I have a simple aggregating query and I SELECT from my table, I should see the data as it is at this point in time.
My data (stream):
MY_TOPIC_STREAM:
15 | BEACH | Steven Ebb | over there
24 | CIRCUS | John Doe | an adress
30 | CIRCUS | Alice Small | another address
35 | CIRCUS | Barry Share | a home
35 | CIRCUS | Garry Share | a home
40 | CIRCUS | John Mee | somewhere
45 | CIRCUS | David Three | a place
45 | CIRCUS | Mary Three | a place
45 | CIRCUS | Joffrey Three | a place
My table definition:
CREATE TABLE MY_TABLE WITH (VALUE_FORMAT='AVRO') AS
SELECT ROWKEY AS APPLICATION, COUNT(*) AS NUM_APPLICANTS
FROM MY_TOPIC_STREAM
WHERE header->eventType = 'CIRCUS'
GROUP BY ROWKEY;
Why do I see multiple rows in my table, even though the eventual aggregates are correct?
SELECT * FROM MY_TABLE;
APPLICATION NUM_APPLICANTS
24 1
30 1
--> 35 1 <-- why do I see this?
35 2
40 1
--> 45 1 <-- why do I see this?
--> 45 2 <-- why do I see this?
45 3
My sink topic also shows me the same as the table output - presumably this is correct?
I expected my table result to be:
APPLICATION NUM_APPLICANTS
24 1
30 1
35 2
40 1
45 3
Outputs abridged for brevity and readability above, but you get the gist.
So - are my expectations of the table and sink topic outputs off the mark?
UPDATE
Matthias's answer below correctly explains that the table and sink topic show changelog events, so it is normal to see intermediate values. What was confusing me, however, was that I was seeing all intermediate rows. It turned out that this was because I was using the Confluent 5.2.1 docker-compose file, which sets the environment variable KSQL_STREAMS_CACHE_MAX_BYTES_BUFFERING=0. This disables caching of all intermediate results in KSQL aggregations, so the table shows more rows than expected while eventually arriving at the correct aggregates. Setting this to e.g. 10 MB caused the data to be output as expected. This behaviour is not immediately obvious in the documentation for those starting to play with KSQL and standing up instances with Docker! This issue pointed me in the right direction, and this page documents the parameters. I had spent a long time on this and could not work out why it was not behaving as expected; I hope this helps someone.
Not sure what version you are using; however, SELECT * FROM MY_TABLE; does not return the current content of the table, but the table's changelog stream (this holds for older versions; in newer versions the query you show is not valid, as the syntax was changed).
Since the transition from KSQL to ksqlDB, the query you showed would be called a push query expressed as SELECT * FROM my_table EMIT CHANGES;.
Furthermore, ksqlDB introduced pull queries that allow you to lookup the current state. However SELECT * FROM my_table; is not supported as a pull query yet (it will be added in the future). You can only do table lookups for a specific key, i.e., there must be a WHERE clause at the moment.
Check out the docs for more details: https://docs.ksqldb.io/en/latest/concepts/queries/pull/
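To make the caching effect from the update above concrete, here is a toy Python model (not the actual Kafka Streams cache; the function name and structure are mine). Every input record updates a per-key count; a cache of size 0 forwards every single update downstream, while a large enough cache coalesces updates so only the latest count per key is emitted.

```python
from collections import OrderedDict

def changelog(events, cache_size):
    """Emit (key, count) updates for a stream of keys.
    cache_size=0 forwards every update, like setting
    KSQL_STREAMS_CACHE_MAX_BYTES_BUFFERING=0; a larger cache
    coalesces repeated updates to the same key before forwarding."""
    counts = {}
    cache = OrderedDict()
    out = []
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        cache[key] = counts[key]          # overwrite any buffered update
        cache.move_to_end(key)
        while len(cache) > cache_size:    # evict oldest buffered updates
            out.append(cache.popitem(last=False))
    out.extend(cache.items())             # flush whatever is still buffered
    return out

stream = [24, 30, 35, 35, 40, 45, 45, 45]
print(changelog(stream, cache_size=0))  # every intermediate count appears
print(changelog(stream, cache_size=8))  # only the final count per key
```

With `cache_size=0` you see `(35, 1)` before `(35, 2)` and all three `45` updates, exactly like the surprising table output above; with a big cache only `(35, 2)` and `(45, 3)` survive.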

Designing a database to save and query a dynamic range?

I need to design a (postgres) database table which can save a dynamic range of something.
Example:
We have a course table. Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
A math course can be started with 4 to 10 students while a physics course needs to have exactly 8 students to start.
After that, I want to be able to query that.
Let's say, I want all courses which can take 6 students. The math course should be returned, the physics course shouldn't as it requires exactly 8 students.
When I query for 8 students, both courses should be returned.
For the implementation I thought about two simple fields: min_students and max_students. Then I could simply check if the number is equal to or between these numbers.
The issue is: I have to fill both columns every time, even for the physics course, which requires exactly 8 students.
example:
name | min_students | max_students
--------|--------------|-------------
math | 4 | 10
physics | 8 | 8
Is there a more elegant/efficient way? I also thought about making the max_students column nullable so I could check for
min_students = X OR (min_students >= X AND max_students <= Y)
Would that be more efficient? What about the performance?
Each course can have (a minimum AND a maximum) OR (a specific amount) of participants.
All courses have a minimum and a maximum; for some courses they happen to be the same value. It might seem trivial, but thinking about it that way lets you define the problem in a simpler way.
Instead of:
min_students == X OR (min_students >= X AND max_students <= Y)
you can express it as:
num_students BETWEEN min_students AND max_students
BETWEEN is inclusive, so 8 BETWEEN 8 AND 8 is true.
Regarding optimizations
Additional conditionals make queries exponentially harder for humans to understand, which leads to missed edge cases and usually results in inefficient queries anyway. Focus on making the code easy to understand, or "elegant", and never sacrifice readability for performance unless you are really sure that you have a performance problem in the first place and that your optimization actually helps.
If you have a table with 10M rows it might be worth super-optimizing disk usage if you run on extremely limited hardware, but reducing the disk usage of a table by even 20 MB is almost certainly a waste of time in any normal circumstance, even when it doesn't make the code more complicated.
Besides, each row takes up 23-24 bytes in addition to any actual data it contains, so shaving off a byte or two wouldn't make a big difference. Setting values to NULL can actually increase disk usage in some situations.
Alternative solution
When using a range data type the comparison would look like this:
num_students #> x
where num_students represents a range (for example 4 to 10) and #> means "contains the value"
create table num_sequence (num int);
create table courses_range (name text, num_students int4range);
insert into num_sequence select generate_series(3,10);
insert into courses_range values
('math', '[4,4]'), ('physics', '[6,7]'), ('dance', '[7,9]');
select * from num_sequence
left join courses_range on num_students #> num;
num | name | num_students
-----+---------+--------------
3 | |
4 | math | [4,5)
5 | |
6 | physics | [6,8)
7 | physics | [6,8)
7 | dance | [7,10)
8 | dance | [7,10)
9 | dance | [7,10)
10 | |
Note that ranges are output in the form [x,y): square brackets mean inclusive, parentheses mean exclusive, and for integers [4,4] = [4,5) = (3,5).
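For anyone without a Postgres prompt handy, the normalization and containment semantics can be mimicked in a few lines of Python (a sketch only; the helper names are mine, and this only models the inclusive/exclusive bound handling shown above):

```python
def int4range(lo, hi, bounds="[)"):
    """Mimic PostgreSQL's int4range normalization: whatever bound flags
    are given, return the canonical half-open [lo, hi) pair."""
    if bounds[0] == "(":
        lo += 1
    if bounds[1] == "]":
        hi += 1
    return (lo, hi)

def contains(rng, value):
    """The #> ("range contains element") test on a normalized range."""
    lo, hi = rng
    return lo <= value < hi

courses = {
    "math": int4range(4, 4, "[]"),     # normalizes to [4,5)
    "physics": int4range(6, 7, "[]"),  # normalizes to [6,8)
    "dance": int4range(7, 9, "[]"),    # normalizes to [7,10)
}
for num in range(3, 11):
    matches = [name for name, r in courses.items() if contains(r, num)]
    print(num, matches)
```

Running this reproduces the join output above: 4 matches only math, 6 matches physics, 7 matches both physics and dance, and 3, 5 and 10 match nothing.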

Issue with displaying of information in a Tables/Matrix visual - Power BI

Hi, I'm new to Power BI Desktop, but I have come across an issue when displaying information. Hopefully it's due to my lack of knowledge, but I can't seem to find a way to display values in rows one after the other, similar to Pivot Table functionality.
For example so if I had the following table
Location | Salary | Number
A | 100 | 1
A | 200 | 2
B | 100 | 3
B | 400 | 4
C | 400 | 5
D | 800 | 6
What I'd like to produce is something like .....
A | B | C | D
300 | 500 | 400 | 800 <-- Salary Sum
3 | 7 | 5 | 6 <-- Number Sum
I have a direct link with my data source, please suggest a way to display the same with tables/matrix
Thank you in advance
Unfortunately this is currently not supported in Power BI, but maybe there is some light at the end of the tunnel... The Power BI team have started working on this much requested feature. See here
As Tom said, this is available with the August release. You can check which version you have by going to File -> Help -> About. If you have an older version, you can go here to download the right one for you (32-bit vs 64-bit).
Once you have made sure you are running the August version, simply create a matrix with Location in the Columns field and Salary and Number in the Values field. Then go into the formatting pane and under Values, turn Show on rows to on.
Try this: go to the Query Editor, select the first column of the desired table (Location in your case) and, from the Transform tab, select Unpivot Other Columns.
That's it! Now go and drop your visual.

postgres ANY() with BETWEEN Condition

In case someone is wondering: I am recycling a different question I answered myself, because I realized that my problem has a different root cause than I thought.
My question actually seems pretty simple, but I cannot find a way.
How do I query Postgres for whether any element of an array is between two values?
The documentation states that a BETWEEN b AND c is equivalent to a >= b AND a <= c.
This however does not work on arrays, as
ANY({1, 101}) BETWEEN 10 and 20 has to be false
while
ANY({1,101}) > 10 AND ANY({1,101}) < 20 has to be true.
{1,101} meaning an array containing the two elements 1 and 101.
How can I solve this problem without resorting to workarounds?
Regards,
BillDoor
EDIT, for clarity:
The scenario I have is that I am querying an XML document via xpath(), but for this problem a column containing an array of type int[] does the job.
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Bob
I want all names where there is a number between 20 and 30 - so I want Bob.
The query
SELECT name from table where ANY(numbers) > 20 AND ANY(numbers) < 30
will return Alice and Bob, because Alice has numbers > 20 as well as other numbers < 30.
The BETWEEN syntax is not allowed in this case; however, BETWEEN only gets mapped to >= 20 AND <= 30 internally anyway.
Quoting the docs on the Between Operators' mapping to > and < documentation:
There is no difference between the two respective forms apart from the
CPU cycles required to rewrite the first one into the second one
internally.
PS:
Just to avoid adding a new question for this: how can I solve
id::int | numbers::int[] | name::text
1 | {1,3,200} | Alice
2 | {21,100} | Alicia
SELECT id FROM table WHERE ANY(name) LIKE 'Alic%'
result: 1, 2
I can only find examples of matching one value against multiple patterns, but not matching one pattern against a set of values. Besides, the syntax shown is invalid: ANY has to be the second operand, but the second operand of LIKE has to be the pattern.
exists (select * from (select unnest(array[1,101]) x ) q1 where x between 10 and 20 )
You can create a function based on this query.
second approach:
select int4range(10,20,'[]') #> any(array[1, 101])
For timestamps and dates it looks like:
select tsrange( '2015-01-01'::timestamp,'2015-05-01'::timestamp,'[]') #> any(array['2015-05-01', '2015-05-02']::timestamp[])
for more info read: range operators
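The difference between the two formulations is easy to demonstrate outside SQL. Here is a small Python sketch (the function names are mine) of the element-wise test that the EXISTS/unnest query performs, versus the broken rewrite into two independent ANY comparisons:

```python
def any_between(values, lo, hi):
    """True if any single element lies in [lo, hi] -- the semantics the
    EXISTS/unnest query (or int4range(lo, hi, '[]') #> any(...)) implements."""
    return any(lo <= v <= hi for v in values)

def naive_split(values, lo, hi):
    """The broken rewrite ANY(...) > lo AND ANY(...) < hi, where the two
    ANY tests may be satisfied by *different* elements of the array."""
    return any(v > lo for v in values) and any(v < hi for v in values)

print(any_between([1, 101], 10, 20))   # False: no single element qualifies
print(naive_split([1, 101], 10, 20))   # True: 101 > 10 and 1 < 20
print(any_between([21, 100], 20, 30))  # True: 21 is in range, so Bob matches
```

This is exactly why the original `ANY(numbers) > 20 AND ANY(numbers) < 30` query returns Alice as well as Bob.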

Is the last digit of a phone number random?

I have a telephony app which has a prompt which requires user choice. I made the app select one of 10 different phone prompts based on the last digit of the caller's phone number. Then I measure whether the user responds to the prompt (accept) or decides to skip to the next step (reject). I thought this would work well enough as a random selection, but I think I may be wrong.
What I'm finding is that the exact same prompt has a dramatically different response rate (25% vs 35%) for two different last digits. Now I'm curious why this is. Does anyone know how phone numbers are assigned and why the last digit would be significant?
I checked our billing database. We use Asterisk as our PBX and store billing records in a PostgreSQL database.
select substring(cdr_callerid_norm from '[0-9]$') as last_digit, count(*)
from asterisk_cdr
where cdr_callerid_norm is not null and length(cdr_callerid_norm) > 2
group by last_digit
order by last_digit
Result:
last_digit | count
------------+-------
0 | 17919
1 | 13811
2 | 8257
3 | 20708
4 | 13492
5 | 13708
6 | 8813
7 | 6943
8 | 11693
9 | 7942
| 2584
(11 rows)
To me those numbers are not random. You can do a similar thing with your billing and check it. I think phone numbers can be random in general, but if a few customers call you much more often, then the caller numbers will not be random. Consider using something else to vary the prompt: a random number, the time, etc.
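As a quick sanity check, the counts above can be put through a chi-squared test for uniformity. The sketch below is plain Python (no SciPy); the critical value is taken from a standard chi-squared table (9 degrees of freedom at the 0.1% level), and the 2584 rows with no digit are dropped.

```python
# Last-digit counts from the billing query above (empty caller-id row dropped).
counts = [17919, 13811, 8257, 20708, 13492, 13708, 8813, 6943, 11693, 7942]

total = sum(counts)
expected = total / len(counts)  # what a uniform distribution would predict
chi2 = sum((observed - expected) ** 2 / expected for observed in counts)

# Critical value for 9 degrees of freedom at the 0.1% level is ~27.88;
# a statistic above it rejects "all last digits are equally likely".
print(f"chi-squared = {chi2:.0f}")
print("uniform?", chi2 < 27.88)  # prints: uniform? False
```

The statistic comes out in the thousands, far beyond the critical value, which backs up the eyeball impression that these last digits are nowhere near uniformly distributed.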