How to change case of characters identified via pattern matching in PostgreSQL - postgresql

I have several PostgreSQL tables with "comment" columns (data type = text) for which I am trying to standardize the use of upper and lowercase. Specifically, I'd like to change the case of comment strings from all-caps to capitalization of only the first character in each sentence (there are typically 1-3 sentences per comment). I standardized the number of spaces between sentences (to 1) with
update table
set comment = regexp_replace(comment, '( ){2,}',' ','g');
and set all characters in each string except the first to lower case with
update table
set comment = upper(left(comment, 1)) || lower(right(comment, -1))
Now, how do I change the case of the first character after each period to uppercase? I can select the relevant characters with
select regexp_matches('Testing. this. using. some. text.', '([.]\s\S)', 'g');
but haven't been able to figure out how to capitalize these. Also, I'm sure there is a better way to conduct these steps in a more integrative way, but this is my noob-ish attempt.

The following worked in my situation, in which comments are made up of one or more sentences and sentences are separated by a period and single space:
with source as (
select regexp_split_to_table('hello. world', '\.\s') sentence
)
select array_to_string(
array(
select upper(left(sentence, 1)) || right(sentence, -1) sentence from source
), '. '
) modified_comment;

Related

How to extract an uppercase word from a string in Postgresql only if entire word is in capital letters

I am trying to extract words from a column in a table only if the entire word is in uppercase letters (I am trying to find all acronyms in a column). I tried using the following code, but it gives me all capital letters in a string even if it is just the first letter of a word. Appreciate any help you can provide.
SELECT title, REGEXP_REPLACE(title, '[^A-Z]+', '', 'g') AS acronym
FROM table;
Here is my desired output:
title
acronym
I will leave ASAP
ASAP
David James is LOL
LOL
BTW I went home
BTW
Please RSVP today
RSVP
You could use this regular expression:
regexp_replace(title, '.*?\m([[:upper:]]+)\M.*', '\1 ', 'g')
.*? matches arbitrary letters is a non-greedy fashion
\m matches the beginning of a word
( is the beginning of the part we are interested in
[[:upper:]]* are arbitrarily many uppercase characters
) is the end of that part we are interested in
\M matches the end of a word
.* matches arbitrary characters
\1 references the part delimited by the parentheses
You could try REGEXP_MATCHES function
WITH data AS
(SELECT 'I will leave ASAP' AS title
UNION ALL SELECT 'David James is LOL' AS title
UNION ALL SELECT 'BTW I went home' AS title
UNION ALL SELECT 'Please RSVP today' AS title)
SELECT title, REGEXP_MATCHES(title, '[A-Z][A-Z]+', 'g') AS acronym
FROM data;
See demo here

Postgresql: How to replace multiple substrings in a string and with regexp

Postgresql 9.4: 1 column in 1 table is a string of text representing a route of flight for an aircraft.
The complete field consists of "Fixes" and "Routes" up to 80
characters in total length.
Routes and Fixes can be either 3 or 5 characters in length.
Routes and Fixes can have the same name.
There may be zero, one, or two Routes
Routes are followed by a single non-zero digit or a hash.
Routes and Fixes can be preceded or followed by a "+" or "*".
The field may contain CR/LF or double-triple spaces which should remain.
Each schema contains 6-20000s fields in this table
There are nearly 1800 Route names, but generally only 40-80 per schema
Examples:
"KIND ROCKY1 STL BUM OATHE CLASH5 KDEN"
"+MEARZ7 OKK+
KIND OKK FWA MIZAR3 KDTW"
"KIND OOM OOM5 WEGEE PXV J131 LIT BYP5 KDFW"
"KIND MEARZ# OKK ECK YEE YXI N171B VALIEE***EGSS"
The task is to clean up the lazy use of the hash instead of a digit and to update the Route versions (the trailing numbers). I.e. replace-in-place the Route with the correct digit rather than the # or what might be a wrong number. So every instance of "MEARZ7" or "MEARZ#" becomes "MEARZ9" and "OOM5" becomes "OOM6" but "OOM " stays "OOM ".
Currently I have been testing this:
UPDATE target SET detail =
CASE WHEN POSITION('CLASH' in detail) > 0
AND SUBSTRING(detail,POSITION('CLASH' in detail)+5,1) != ' '
THEN REGEXP_REPLACE (detail, 'CLASH.', 'CLASH5')
WHEN POSITION('MEARZ' in detail) > 0
AND SUBSTRING(detail,POSITION('MEARZ' in detail)+5,1) != ' '
THEN REGEXP_REPLACE (detail, 'MEARZ.', 'MEARZ9')
WHEN POSITION('OOM' in detail) > 0
AND SUBSTRING(detail,POSITION('OOM' in detail)+3,1) != ' '
THEN REGEXP_REPLACE (detail, 'OOM.', 'OOM6')
WHEN POSITION('ROCKY' in fsrtedtail) > 0
AND SUBSTRING(detail,POSITION('ROCKY' in detail)+5,1) != ' '
THEN REGEXP_REPLACE (detail, 'ROCKY.', 'ROCKY1')
ELSE detail END;
My logic was to:
Find the Route name.
Check if it's followed by a space.
If not, replace it with the correct Route+digit
I hadn't yet attempted to avoid "+" or "* ". I was thinking I could first replace the "#" with a number, then update the Route+digit as to not worry about the # and this would eliminate the need to look for the "+" or "* ". Then I could just look for a trailing space.
The second Route (in order of the WHEN statements) does not get updated so I guess am barking up the wrong tree.
They other big obstacle is there can be 80 or more Routes in a schema so if I have to nest a statement, it's gonna be huge.
I have tried array_to_string(array_replace(string_to_array( but it leaves behind double quotes, commas, and curly brackets so doesn't seem feasible.
At this point I'm thinking a function is the way to go, but I don't know where to start.

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q.\"Content\", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
SELECT ts_rank_cd(f.\"SearchVector\", websearch_to_tsquery('english', {request.Text})) AS Rank, *
FROM public.\"FileExtracts\" f, websearch_to_tsquery('english', {request.Text}) as tsq
WHERE f.\"SearchVector\" ## tsq
ORDER BY rank DESC
) q
Searches for customer work but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery but you can do it with to_tsquery (because ts_query allows to add a :* wildcard) and add the websearch syntax yourself in in your backend.
For example in a node.js environment you could do smth. like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) //split on every whitespace and remove whitespace
let searchWithStar = searchArray.join(' & ' ) + ':*' //join word back together adds AND sign in between an star on last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and than use it in your SQL
search_column ## to_tsquery('english', ${escapedSearch})
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have hard time handling partial match. What is it supposed to do, take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial match indicator? And how would it even know partial match was intended, rather than just being another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to have everything from both worlds:
select * from your_table where ts_vector_col ## to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query so building the tsquery looks like this
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & '), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non word chars and adds trailing wildcards to each term, then ANDs the terms (:* & )
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts to a ts_query
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you have multi word partial matches e.g. searching spi ma would return results matching spider man

How to keep the upper case and lower case letters in a column alias in the results in Redshift

In Redshift we are trying to give more meaningful aliases to the columns we are returning from the queries as we are importing the results into TABLEAU, the issue is that RedShift turns all the letter to lower case ones, i.e. from "Event Date" it then returns "event date", any idea on how to work this one out to keep the alias given?
I know I'm a bit late to the party but for anyone else looking, you can enable case sensitivity, so if you want to return a column with camel casing for example
SET enable_case_sensitive_identifier TO true;
Then in your query wrap what you want to return the column as in double quotes
SELECT column AS "thisName"
Or as per OP's example
SELECT a.event_date AS "Event Date"
https://docs.aws.amazon.com/redshift/latest/dg/r_enable_case_sensitive_identifier.html
Edit: To have this behaviour as default for the cluster you will need to create/update a parameter group in Configurations => Workload Management. You can't change the settings for the default parameter group. Note, you will need to reboot the cluster after applying the parameter group for the changes to take effect.
No, you cannot do this in Redshift. all columns are lowercase only.
You can enforce upper case only by using
set describe_field_name_in_uppercase to on;
Also see the examples here https://docs.aws.amazon.com/redshift/latest/dg/r_names.html you can see that the upper case characters are returned as lower case. and it says "identifiers are case-insensitive and are folded to lowercase in the database"
You can of course rename the column to include uppercase within Tableau.
I was going through AWS docs for redshift and looks like INTCAP function can solve your use case
For reference => https://docs.aws.amazon.com/redshift/latest/dg/r_INITCAP.html
Brief description (copied)
The INITCAP function makes the first letter of each word in a string uppercase, and any subsequent letters are made (or left) lowercase. Therefore, it is important to understand which characters (other than space characters) function as word separators. A word separator character is any non-alphanumeric character, including punctuation marks, symbols, and control characters. All of the following characters are word separators:
! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~
And in your case you have declared field name as event_date which will convert to Event_Date.
And next you can use REPLACE function to replace underscore '_'
For reference => https://docs.aws.amazon.com/redshift/latest/dg/r_REPLACE.html
You need to put
set describe_field_name_in_uppercase to on;
in your Tableau's Initial SQL.

removing leading zero and hyphen in Postgres

I need to remove leading zeros and hyphens from a column value in Postgresql database, for example:
121-323-025-000 should look like 12132325
060579-0001 => 605791
482-322-004 => 4823224
timely help will be really appreciated.
Postgresql string functions.
For more advanced string editing, regular expressions can be very powerful. Be aware that complex regular expressions may not be considered maintainable by people not familiar with them.
CREATE TABLE testdata (id text, expected text);
INSERT INTO testdata (id, expected) VALUES
('121-323-025-000', '12132325'),
('060579-0001', '605791'),
('482-322-004', '4823224');
SELECT id, expected, regexp_replace(id, '(^|-)0*', '', 'g') AS computed
FROM testdata;
How regexp_replace works. In this case we look for the beginning of the string or a hyphen for a place to start matching. We include any zeros that follow that as part of the match. Next we replace that match with an empty string. Finally, the global flag tells us to repeat the search until we reach the end of the string.