Normalize human names in PostgreSQL

What is the easiest way to normalize a text field in a PostgreSQL table?
I am trying to find duplicates.
For example, I want to consider O'Reilly a duplicate of oreilly. La Salle should be a duplicate of la'salle as well.
In a nutshell, we want to
lowercase all text,
strip accents,
strip punctuation marks such as [.'-_], and
strip spaces.
Can this all be done in one or two simple steps? Ideally using built-in PostgreSQL functions.
Cheers

The following will give you what you want, using standard Postgres functions plus the unaccent extension that ships with contrib:
regexp_replace (lower(unaccent(string_in)),'[^0-9a-z]','','g')
Or if you do not want digits, then just:
regexp_replace (lower(unaccent(string_in)),'[^a-z]','','g')
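To go straight from there to finding the duplicates, here is a minimal sketch; unaccent has to be enabled once per database, and the people table with its name column is only a hypothetical example:
CREATE EXTENSION IF NOT EXISTS unaccent;
-- Group rows by the normalized form; any group with more than one row is a set of duplicates.
SELECT regexp_replace(lower(unaccent(name)), '[^0-9a-z]', '', 'g') AS normalized,
       array_agg(name) AS variants,
       count(*) AS cnt
FROM people
GROUP BY 1
HAVING count(*) > 1;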

Related

DB2 UNLOAD in Unicode with a two-character CHARDEL

I have to create an UNLOAD job for a DB2 table and save the unload in Unicode. That's no problem.
But unfortunately the table columns contain characters that clash with the separators.
For example, I would like to use the combination #! as a separator, but I can't do that in Unicode.
Can someone tell me how to do this?
Now my statement looks like this:
DELIMITED COLDEL X'3B' CHARDEL X'24' DECPT X'2E'
UNICODE
Thanks a lot for your help.
The delimiter can be a single character (not two characters, as you want).
In this case the chosen solution was to find a single character that did not appear in the data.
When that is not possible, consider a non-delimited output format, or a different technique to get the data to the external system (for example via federation or other SQL-based interchange, XML, etc.).
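For illustration only (not from the original exchange): assuming the pipe character never occurs in the data, and that the hex values are interpreted in the CCSID the unload actually uses (X'7C' is the pipe in ASCII/Unicode-based code pages), the control statement keeps a single-character CHARDEL:
DELIMITED COLDEL X'3B' CHARDEL X'7C' DECPT X'2E'
UNICODE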

Remove infrequent words from ts_vector in postgres full text search

Is it possible to make postgres to_tsvector consider only words which occur more than N times in the table?
The only option I can see is to calculate the word frequencies myself beforehand and then build a dictionary from that list which replaces each such word with an empty string. Is there a more elegant solution in the configuration?
There is no dynamic solution. You have to write a stopword file.
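If you do precompute the frequencies, ts_stat can produce the word list, and the result can be written out as a stopword file for a dictionary placed in front of the stemmer. A rough sketch, assuming a hypothetical docs table with a body column and a cut-off of 5 occurrences (the 'simple' configuration is used for counting so that the listed words match the raw lowercased tokens the stopword check sees):
-- Words that occur fewer than 5 times across the table.
SELECT word
FROM ts_stat('SELECT to_tsvector(''simple'', body) FROM docs')
WHERE nentry < 5;
-- Save those words, one per line, as rare.stop under $SHAREDIR/tsearch_data/,
-- then chain a stopword-only dictionary before the normal stemmer:
CREATE TEXT SEARCH DICTIONARY rare_stop (
    TEMPLATE = simple, StopWords = rare, Accept = false
);
CREATE TEXT SEARCH CONFIGURATION english_rare (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_rare
    ALTER MAPPING FOR asciiword, word WITH rare_stop, english_stem;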

When do Postgres column or table names need quotes and when don't they?

Let's consider the following postgres query:
SELECT *
FROM "MY_TABLE"
WHERE "bool_var"=FALSE
AND "str_var"='something';
The query fails when I remove the quotes around "str_var", but not when I remove them around "bool_var". Why? What is the proper way to write the query in that case: no quotes around the boolean column and quotes around the text column? Something else?
PostgreSQL folds all names (table names, column names, etc.) to lowercase unless you prevent it by double-quoting them, as in CREATE TABLE "My_Table_ABC" ( "My_Very_Upper_and_Lowercasy_Column" numeric, ...). If you have names like this, you must always double quote them in SELECTs and other references.
I would recommend not creating tables like this, and not using characters outside a-z, 0-9 and _. You cannot guarantee that every piece of software or library ever used against your database will support case sensitivity, and all that double quoting is tedious to remember and to type.
Thanks to @TimBiegeleisen's comment, I was able to pinpoint the problem: I used a reserved keyword ("user") as a column name.
Link to reserved keywords in the doc: https://www.postgresql.org/docs/current/sql-keywords-appendix.html.
Now I know not to use quotes to query column names, but rather to avoid reserved keywords as column names.
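For anyone hitting the same thing, a small demonstration of both behaviours (the table and column names are made up):
-- Unquoted identifiers are folded to lowercase, so the case you type does not matter:
CREATE TABLE demo (str_var text, bool_var boolean);
SELECT Str_Var FROM demo;       -- works, folded to str_var
-- Quoted identifiers are taken literally, so the case must match exactly:
SELECT "Str_Var" FROM demo;     -- ERROR: column "Str_Var" does not exist
-- "user" is a reserved keyword: unquoted it is the built-in SQL construct, not your column.
SELECT user;                    -- returns the current session user
-- A column that really is named user must always be written as "user" in queries.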

How to specify minimum word length in PostgreSQL full text search?

Is there a way to prevent words shorter than a specified length from ending up in a tsvector? MySQL has the ft_min_word_len option; is there something similar for PostgreSQL?
The short answer would be no.
Postgres full text search (the former tsearch2 module, now built in) uses dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered
in a search (stop words), and to normalize words so that different
derived forms of the same word will match. A successfully normalized
word is called a lexeme.
See also the documentation on how dictionaries are used during parsing and lexing.
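As a workaround (my own sketch, not a built-in option, and it needs unnest(tsvector) and array_to_tsvector from PostgreSQL 9.6 or later), short lexemes can be stripped after the fact; positions and weights are lost in the process:
CREATE FUNCTION drop_short_lexemes(v tsvector, min_len int) RETURNS tsvector AS $$
    SELECT coalesce(
               array_to_tsvector(array_agg(lexeme) FILTER (WHERE length(lexeme) >= min_len)),
               ''::tsvector)
    FROM unnest(v);
$$ LANGUAGE sql IMMUTABLE;
-- Example: keep only lexemes of at least 4 characters.
SELECT drop_short_lexemes(to_tsvector('english', 'a cat sat on the mat today'), 4);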

Finding first alphabetic character in a DB2 database field

I'm doing a bit of work which requires me to truncate DB2 character-based fields. Essentially, I need to discard all text which is found at or after the first alphabetic character.
e.g.
102048994BLAHBLAHBLAH
becomes:-
102048994
In SQL Server, this would be a doddle - PATINDEX would swoop in and save the day. Much celebration would ensue.
My problem is that I need to do this in DB2. Worse, the result needs to be used in a join query, also in DB2. I can't find an easy way to do this. Is there a PATINDEX equivalent in DB2?
Is there another way to solve this problem?
If need be, I'll hardcode 26 chained LOCATE functions to get my result, but if there is a better way, I am all ears.
SELECT TRANSLATE(lower(column), ' ', 'abcdefghijklmnopqrstuvwxyz')
FROM table
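To actually do the truncation, a sketch building on that trick (col and my_table are placeholders, and DB2 platforms differ slightly in function availability): TRANSLATE turns every letter into a blank, LOCATE finds the first blank, and LEFT keeps everything before it. Appending 'x' guarantees a match even when the value contains no letters; note that a genuine blank in the data would also end the match.
SELECT LEFT(col,
            LOCATE(' ', TRANSLATE(LOWER(col) || 'x', ' ', 'abcdefghijklmnopqrstuvwxyz')) - 1)
FROM my_table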
Alternatively, write a small UDF (user-defined function) in C or Java that does the task.
Peter