Selecting words out of table which sound similar - tsql

I read an interesting article about English and phonetics - and would like to see if my newfound knowledge can be applied in TSQL to generate a fuzzy result set. In one of my applications, there is a table containing words, which I extracted from a word list. It is literally a one-column table -
Word |
------
A
An
Apple
...
their
there
Is there a built-in function in SQL Server to select a word which sounds the same, even though it is spelled differently? (The globalization settings are on en-ZA, as of the last time I checked.)
SELECT Word FROM WordTable WHERE Word = <word that sounds similar>

SoundEx()
SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken.
Difference()
Returns an integer value that indicates the difference between the SOUNDEX values of two character expressions.
SELECT word
, SoundEx(word) As word_soundex
, SoundEx(word_that_sounds_similar) As similar_soundex
, Difference(word, word_that_sounds_similar) As how_similar
FROM wordtable
WHERE Difference(word, word_that_sounds_similar) >= 3 /* quite close! */
Difference() takes the two character expressions themselves and compares their SOUNDEX values internally, so there is no need to wrap the arguments in SoundEx().
The value it returns indicates how similar the two words are.
A value of 4 indicates a strong match and a value of 0 means slim-to-no match.
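To make the four-character code and the 0-to-4 score concrete, here is a minimal Python sketch of classic American Soundex plus a simplified position-by-position comparison. It is only an illustration: the function names are made up for this example, and SQL Server's own SOUNDEX() and DIFFERENCE() may differ in edge cases (for instance in how H and W are treated), so the database remains the source of truth. In this sketch a score of 4 means all four code characters agree.

```python
# Classic American Soundex, written for illustration only.
# Assumption: this is NOT guaranteed to match SQL Server's SOUNDEX() in every case.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return a four-character code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate two letters with the same code
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code after it counts again
    return (code + "000")[:4]


def difference(a, b):
    """Simplified score: number of positions (0 to 4) where the two codes agree."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))


print(soundex("their"), soundex("there"), difference("their", "there"))
```

Both "their" and "there" reduce to the same code here, which is why a filter on a high score finds them as a pair.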

Related

PostgreSQL pattern matching with Unicode graphemes

Is there any way to pattern match with Unicode graphemes?
As a quick example, when I run this query:
CREATE TABLE test (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (id),
UNIQUE (name)
);
INSERT INTO test (name) VALUES ('👍🏻 One');
INSERT INTO test (name) VALUES ('👍 Two');
SELECT * FROM public.test WHERE test.name LIKE '👍%';
I get both rows returned, rather than just '👍 Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match '👍 Two', because 👍🏻 is a different grapheme.
Is this possible?
It's a very interesting question!
I am not quite sure if it is possible at all:
The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand 👍, which is followed by an emoji skin modifier 🏻
This is how the light skinned hand is stored internally. So, for me, your result makes sense:
When you query any string that begins with 👍, it will return:
👍 Two (trivial)
👍_🏻 One (ignore the underscore, I try to suppress the automated ligature with this)
So, you can see, the light skinned emoji internally also starts with 👍. That's why I believe your query doesn't work the way you would like.
Workarounds/Solutions:
You can add a space to your query. This ensures that there's no skin modifier after your 👍 character. Naturally, this only works in your case, where all data sets have a space after the hand:
SELECT * FROM test WHERE name LIKE '👍 %';
You can simply extend the WHERE clause like this:
SELECT * FROM test
WHERE name LIKE '👍%'
AND name NOT LIKE '👍🏻%'
AND name NOT LIKE '👍🏼%'
AND name NOT LIKE '👍🏽%'
AND name NOT LIKE '👍🏾%'
AND name NOT LIKE '👍🏿%'
You can use regular expression pattern matching to exclude the skins:
SELECT * FROM test
WHERE name ~ '^👍[^🏻🏼🏽🏾🏿]*$'
See demo: db<>fiddle (note that the fiddle does not seem to render the joined emoji, so the two characters are displayed separately there)
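The reasoning above can be checked outside the database. The following is a small Python sketch, not a Postgres solution: it shows that the light-skinned thumbs-up is two code points and mirrors the "exclude the skin modifiers" workaround as a prefix check. The helper names are invented for this example, and it assumes the five skin tone modifiers U+1F3FB to U+1F3FF are the only things you need to rule out.

```python
# Illustration only: code points are written as escapes to avoid encoding problems.
THUMBS_UP = "\U0001F44D"
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}


def code_points(text):
    """List the code points that make up a string."""
    return [f"U+{ord(ch):04X}" for ch in text]


def starts_with_unmodified(text, emoji=THUMBS_UP):
    """True if text starts with the emoji and no skin tone modifier follows it."""
    if not text.startswith(emoji):
        return False
    following = text[len(emoji):len(emoji) + 1]
    return following not in SKIN_TONES


print(code_points(THUMBS_UP + "\U0001F3FB"))
print(starts_with_unmodified(THUMBS_UP + " Two"))
print(starts_with_unmodified(THUMBS_UP + "\U0001F3FB One"))
```

A plain prefix test such as startswith() behaves like LIKE here and accepts both strings; only the extra look at the next code point tells them apart.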

Postgres - difference between to_tsquery, to_tsvector and plainto_tsquery

PostgreSQL full text search includes several functions for searching: plainto_tsquery, to_tsquery and to_tsvector.
I don't get the difference between them. The results always contain the same words, but with to_tsvector each word is followed by its position number.
SELECT plainto_tsquery('simple', 'The & Fat & Rats');
result will be like this:
plainto_tsquery: 'fat' & 'rat'
to_tsquery: 'fat' & 'rat'
to_tsvector: 'fat':2 'rat':3
I have tried longer queries, but I haven't found a bigger difference than that.
I already read the documentation, but I didn't get the difference there either.
I am happy for any help.
"plainto_tsquery" takes a phrase in plain English (or in this case plain "simple"--although your question is not consistent. "simple" does not strip out the word 'the', the way you show, unless you made nonstandard modifications to it) and converts it to a tsquery. Since "&" is punctuation, it gets ignored. But then it adds '&' in between the words, because that is what "plainto_tsquery" does. So those changes are not visible, because you chose a poor example to feed to plainto_tsquery.
"to_tsquery" compiles the query you gave it into the structure used for searching. But then, because you are selecting it rather than using it with a ts query operator, it converts it back to text again so it can display it. It requires that what you feed it already looks mostly like a tsquery (for example, has boolean operators between each word), otherwise it throws an error. Surely you noticed that when you tried longer queries?
"to_tsvector" creates a tsvector. This is not a tsquery, rather it is what the tsquery gets applied to.

officejs : Search Word document using regular expression

I want to search strings like "number 1" or "number 152" or "number 36985".
In all above strings "number " will be constant but digits will change and can have any length.
I tried the Search option using wildcards, but it doesn't seem to work.
Basic regex operators like + don't seem to work.
I tried 'number*[1-9]*' and 'number*[1-9]+' but no luck.
This expression only selects up to one digit, e.g. if the string is 'number 12345' it only matches 'number 1'.
Does anyone know how to do this?
Word doesn't use regular expressions in its search (Find) functionality. It has its own set of wildcard rules. These are very similar to RegEx, but not identical and not as powerful.
Using Word's wildcards, the search text below locates the examples given in the question. (Note that the semicolon separator in 1;100 may be something else, depending on the list separator set in Windows (or on the Mac). My European locale uses a semicolon; the United States would use a comma, for example.)
"number [0-9]{1;100}"
The 100 is an arbitrary number I chose for the maximum number of repeats of the search term just before it. Depending on how long you expect a number to be, this can be much smaller...
The logic of the search text is: number is a literal; the valid range of characters following the literal are 0 through 9; there may be one to one hundred of these characters - anything in that range is a match.
The only way RegEx can be used in Word is to extract a string and run the search on the string. But this dissociates the string from the document, meaning Word-specific content (formatting, fields, etc.) will be lost.
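For comparison, this is roughly what the "extract a string and run the search on the string" route looks like once the text is out of the document. It is a sketch in Python rather than Office.js, and it assumes the document text has already been pulled into a plain string by some other means; the names find_numbers and NUMBER_PATTERN are made up for the example.

```python
import re

# "number " followed by one or more digits, the same intent as the wildcard above
NUMBER_PATTERN = re.compile(r"number [0-9]+")


def find_numbers(text):
    """Return every 'number <digits>' occurrence in an already extracted string."""
    return NUMBER_PATTERN.findall(text)


print(find_numbers("number 1, number 152 and number 36985"))
```

As noted above, matches found this way are just strings, so any formatting or fields around them in the document are not part of the result.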
Try putting < and > on the ends of your search string to indicate the beginning and ending of the desired strings. This works for me: '<number [1-9]*>'. So does '<number [0-9]@>', which is probably what you want. Note that in Word wildcards the @ is used where + is used in other regex systems (one or more occurrences of the previous item).

Postgresql Function to sort characters within a string

Is there a postgresql function, preferably native function, that can sort a string such as 'banana' to 'aaabnn'?
Algorithmic efficiency of sorting is not of much importance since words will never be too long. However, database join efficiency is of some but not critical importance.
There is no native function with such functionality, but you can use regexp_split_to_table to do it like this:
select theword
from (select regexp_split_to_table('banana',E'(?=.)') theword) tab
order by theword;
The result will be:
theword
a
a
a
b
n
n
The pattern (?=.) is a lookahead that matches the empty position before every character, so the string is split into single characters, spaces included. If the word contains spaces and you do not want them in the result, split on E'\\s*' instead, which also consumes any whitespace between the characters.
This is explained in the docs in the section on regexp_split_to_table. To get a single string such as 'aaabnn' back rather than one row per letter, aggregate the rows with string_agg(theword, '' ORDER BY theword).
EDIT: The E before the string marks an escape string constant, in which backslash sequences are processed. See: What's the "E" before a Postgres string?

zip code + 4 mail merge treated like an arithmetic expression

I'm trying to do a simple mail merge in Word 2010, but when I insert an Excel field that's supposed to represent a zip code from Connecticut (e.g. 06880) I am having 2 problems:
The leading zero gets suppressed, so 06880 becomes 6880 instead. I know that I can toggle the field code so that it reads {MERGEFIELD ZipCode \# 00000}, and that at least works.
but here's the real problem I can't seem to figure out:
A zip+4 field such as 06470-5530 gets treated like an arithmetic expression. 6470 - 5530 = 940, so with the field code above it becomes 00940, which is wrong.
Perhaps is there something in my excel spreadsheet or an option in Word that I need to set to make this properly work? Please advise, thanks.
See macropod's post in this conversation
As long as the ZIP codes are reaching Word (with or without "-" signs in the 5+4 format ZIPs), his field code should sort things out. However, if you are mixing text and numeric formats in your Excel column, there is a danger that the OLE DB provider or ODBC driver - if that is what you are using to get the data - will treat the column as numeric and return all the text values as 0.
Yes, Word sometimes treats text strings as numeric expressions as you have noticed. It will do that when you try to apply a numeric format, or when you try to do a calculation in an { = } field, when you sum table cell contents in an { = } field, or when Word decides to do a numeric comparison in (say) an { IF } field - in the latter case you can get Word to treat the expression as a string by surrounding the comparands by double-quotes.
In Excel, to force the string data type when entering data that looks like a number, a date, a fraction etc. but is not numeric (zip, phone number, etc.), simply type an apostrophe before the data.
06470 will be interpreted as the number 6470, but '06470 will be the string "06470".
The simplest fix I've found is to save the Excel file as CSV. Word takes it all at face value then.
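The arithmetic from the question is easy to reproduce. The sketch below is in Python and only mimics the two ways of reading the value; it is not how Word's field engine is implemented, and both function names are hypothetical. It shows why a numeric reading turns 06470-5530 into 00940, while a text reading keeps the ZIP intact and can still restore a lost leading zero.

```python
def numeric_reading(zip_text):
    """Treat 'a-b' as a subtraction, then apply a five-digit 00000 style format."""
    parts = zip_text.split("-")
    total = int(parts[0]) - sum(int(p) for p in parts[1:])
    return f"{total:05d}"


def text_reading(zip_text):
    """Keep the value as text; only pad the five-digit part with leading zeros."""
    head, sep, tail = zip_text.partition("-")
    return head.zfill(5) + sep + tail


print(numeric_reading("06470-5530"))
print(text_reading("06470-5530"))
print(text_reading("6880"))
```

This is why the advice above comes down to making sure the column reaches Word as text.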