How do I use pg_trgm to be more permissible - postgresql

I used pg_trgrm to check string matches and I am pretty happy with the results. But it is not pefrectly the way I want it. I want that searches like "poduto" finds "produtos" (the r was missing). And Also that "sofáa" finds "sofa". I am using posgresql 9.6.
It does find "vermelho" when I type "vermelo" (h is missing). And it does find "sofa" when I type "sof". It seems that only some letters in middle can be left out and I always can miss a final letter. I want to be able to miss any letter in the middle of the word. And also be able to commit "two mistakes" in the case of sofáa and sofá (I used an accent and used one additional "a").

The solution is to lower pg_trgm.similarity_threshold (or pg_trgm.word_similarity_threshold if you are using <% or %>).
Then words with lower similarity will also be found.

Related

In Python (or any language) what does an "upper" function do to Hindi, Amharric and other non-Latin character sets?

Subject says it all. Been looking for an answer, but cannot seem to find it.
I am writing a web app that will store data in a database and also have language files translated into a wide variety of character sets. At various moments, the text will be presented. I want to control presentation such as spurious blank spaces at the beginning and end of strings. Also I want to ensure some letters are upper or lower case.
My question is: what happens in upper/lower case functions when the character set only has one case?
EDIT Sub question: Are there any unexpected side effects to be aware of?
My guess is that you simply get back the one and only character.
EDIT - Added Description
The main reason for asking this question is that I am writing a webapp that will be distributed and run on machines in remote areas with little or no chance to fix "on-the-spot" bugs. It's not a complicated webapp, but will run with many different language char sets. I want to be certain of my footing before releasing the server.
First of all the upper() and lower() method in python can be applied to Hindi, Amharric and non-letter character sets.
For instance will the upper() method converts the lowercase characters if an equivalent uppercase of this char exists. If not, then not.
Or better said, if there is nothing to convert, it stays the same.

postgresql fulltext returning wrong results

I'm using postgresql full text tsvector column.
But I found a problem:
When I search for "calça"
The results contains the following results:
1- calça red
2- calça blue
3- calçado red
Why "calçado" is being returned when I search for "calça" ?
Is there any configuration so I can solve this?
Thanks.
It isn't just a matter that one string contains the other. The Portuguese stemmer thinks this is the way they should be stemmed. If you turn the longer word into 'calçadot', for example, it no longer stems it, because (presumably) 'adot' is not recognized as a Portuguese suffix which ought to be removed the way 'ado' is.
If you don't want stemming at all, then you could change the config to 'simple', which doesn't stem. But at that point, maybe you don't want full text search at all, and could just use LIKE instead with a pg_trgm index.
If it is just this particular word that you don't want stemmed, I think you can set up a synonym dictionary which will map calçado to itself, which will bypass stemming.

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

I can't understand the behaviour of btrim()

I'm currently working with postgresql, I learned about this function btrim, I checked many websites for explanation, but I don't really understand.
Here they mention this example:
btrim('xyxtrimyyx', 'xyz')
It gives trim.
When I try this example:
btrim('xyxtrimyyx', 'yzz')
or
btrim('xyxtrimyyx', 'y')
I get this: xyxtrimyyx
I don't understand this. Why didn't it remove the y?
From the docs you point to, the definition says:
Remove the longest string consisting only of characters in characters
(a space by default) from the start and end of string
The reason your example doesn't work is because the function tries to strip the text from Both sides of the text, consisting only of the characters specified
Lets take a look at the first example (from the docs):
btrim('xyxtrimyyx', 'xyz')
This returns trim, because it goes through xyxtrimyyx and gets up to the t and doesn't see that letter in xyz, so that is where the function stops stripping from the front.
We are now left with trimyyx
Now we do the same, but from the end of the string.
While one of xyz is the last letter, remove that letter.
We do this until m, so we are left with trim.
Note: I have never worked with any form of sql. I could be wrong about the exact way that postgresql does this, But I am fairly certain from the docs that this is how it is done.

Is there a "n/a" symbol in unicode?

Is there an unicode symbol for "n/a"? There are some fractions like ½, but a n/a symbol seems to be missing.
If there is none, what would be the most appropriate unicode symbol to use for n/a in a website (which should be contained in common fonts, to avoid needing a webfont)?
Looking at the Unicode code charts, I do not see a single N/A symbol. I do, however, see ⁿ (U+207F) and ₐ (U+2090), which you could separate with / (U+002F) eg: ⁿ/ₐ, or ̷ (U+0337), eg: ⁿ̷ₐ, or ̸ (U+0338), eg: ⁿ̸ₐ. Probably not what you are hoping for, though. And I don't know if "common" fonts implement them, either.
For future reference, the fastest way I know to answer questions like the OP's when I have them myself is to go to unicodelookup.com, because of the way it works: there's a search bar at the top, and you just type a string and it will return any and all unicode characters containing that string (this is also a great way to discover new and useful symbols). So in the OP's case, he could proceed like this:
first try entering "not" (without the quotes) in the search field
visually scan through the results... doing so would not reveal a "not
applicable" character in this case
try again but this time entering "applic" in the search field
again, doing so would not turn up anything along the lines of what he's
looking for
At that point he would be reasonably confident the current Unicode standard does not have a "n/a" symbol.
If you use Firefox you can define a keyword like "uni" to search that site from the URL bar, meaning any time the browser is open and regardless of what page or site is currently showing, you could do this:
hit [F6]... this moves the cursor to the URL bar at the top
type something like "uni applic" and hit [Enter]... this brings up the
unicodelookup.com website with the search results for "applic" already
showing
For the above to work you would need to define your keyword ("uni" or wtv you prefer) to point to location http://unicodelookup.com/#%s.
There's a Negative Acknowlege icon...
␕ symbol for negative acknowledge 022025 9237 0x2415 ␕
Found by searching negative on the Unicode Lookup site.
I'm not a fan, and for my purposes have just gone with __N/A__ (Markdown..)
I see lots of answers going head-on at the "Not Applicable" abbreviation, without exploring what a symbol is. A quick search for the equivalent phrase "out of scope" brings up a couple of variations on the No symbol: ⃠ – this seems to fit the bill (and since I was looking for a way to represent inapplicability, I'll be using it in my technical document).
Per the Wikipedia article, the Unicode codepoint U+20E0 is a combining character, so it is superimposed on the preceding character; e.g. ! ⃠ overlays an exclamation point. To get it to appear isolated, use a non-breaking space
If you don't want to bother with the combining symbol, the article mentions there's also an emoji U+1F6AB 🚫 but it's typically going to be colored red, or won't render!
There's actually a single character that could be repurposed for this: the "Square Na" character ㎁ (U+3381), which is used to represent the nanoampere in fullwidth (CJK) scripts.
What about the "SYMBOL FOR NULL" ␀ (U+2400)?