PostgreSQL: Processing Text, Detect out of Alphabetical order rows

I have some processed text that's in (mostly) alphabetical order, e.g. these are the first word of each paragraph:
Adelanto
Agoura Hills
Alameda
Albany
Old Albany
New Albany
Alhambra
Aliso Viejo
Alturas
So each of the words above represents the start of a paragraph e.g.:
Adelanto, a city in San Bernardino County, California about 9 miles (14 km) northwest of Victorville in the High Desert portion of the Inland Empire of the Greater Los Angeles Area...
The text can have many paragraphs per entry, so paragraphs that are not in alphabetical order are treated as new entries.
Each entry corresponds to a place.
In the example, O(ld) comes after A(lbany), so Old Albany is a new entry, but N(ew) comes before O(ld), so New Albany is a continuation of Old Albany.
My question is: is there an existing facility in PostgreSQL for this, other than taking the ASCII difference between the first letters of Albany and Old Albany/New Albany? E.g. ASCII('A') - ASCII('O') gives -14.
So do I just use ASCII values on the first characters? Or is there a more general solution?

Currently I'm using the ASCII difference between the first letters of the text, comparing each row with previousRow.description and also nextRow.description, e.g.
ABS(ASCII(substring(currentRow.description, 1, 1)) -
    ASCII(substring(previousRow.description, 1, 1)))
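A more general check than first-letter arithmetic is to compare the whole strings, since string comparison already defines an ordering. Below is a minimal sketch of that idea in Python rather than SQL. The function name `out_of_order` and the case-insensitive comparison are assumptions for illustration, not part of the question. It flags every row that sorts before the row immediately preceding it.

```python
def out_of_order(rows):
    """Return the indexes of rows that sort before the row preceding them."""
    flagged = []
    for i in range(1, len(rows)):
        # Compare whole strings, case-insensitively, instead of first-letter codes.
        if rows[i].casefold() < rows[i - 1].casefold():
            flagged.append(i)
    return flagged


first_words = [
    "Adelanto", "Agoura Hills", "Alameda", "Albany",
    "Old Albany", "New Albany", "Alhambra", "Aliso Viejo", "Alturas",
]
print([first_words[i] for i in out_of_order(first_words)])
```

This only reports where the order breaks relative to the previous row. Deciding whether such a row starts a new entry or continues the current one is left to the rule described in the question. In PostgreSQL itself, a similar row-to-previous-row comparison can probably be expressed with the `lag()` window function and the ordinary `<` operator on text.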

Related

How to extract an uppercase word from a string in Postgresql only if entire word is in capital letters

I am trying to extract words from a column in a table only if the entire word is in uppercase letters (I am trying to find all acronyms in a column). I tried using the following code, but it gives me all capital letters in a string even if it is just the first letter of a word. Appreciate any help you can provide.
SELECT title, REGEXP_REPLACE(title, '[^A-Z]+', '', 'g') AS acronym
FROM table;
Here is my desired output:
title | acronym
I will leave ASAP | ASAP
David James is LOL | LOL
BTW I went home | BTW
Please RSVP today | RSVP
You could use this regular expression:
regexp_replace(title, '.*?\m([[:upper:]]+)\M.*', '\1 ', 'g')
.*? matches arbitrary characters in a non-greedy fashion
\m matches the beginning of a word
( is the beginning of the part we are interested in
[[:upper:]]+ is one or more uppercase characters
) is the end of that part we are interested in
\M matches the end of a word
.* matches arbitrary characters
\1 references the part delimited by the parentheses
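To try the same idea outside the database, here is a rough Python equivalent. Python's `re` module has no `\m` and `\M`, so `\b` stands in for both word boundaries. The `{2,}` quantifier is an assumption added here so that one-letter words such as "I" are not reported as acronyms, and the function name `acronyms` is hypothetical.

```python
import re

# \b is Python's word boundary; {2,} requires at least two capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def acronyms(title):
    """Return all words in title that consist only of ASCII capital letters."""
    return ACRONYM.findall(title)


for title in ["I will leave ASAP", "David James is LOL",
              "BTW I went home", "Please RSVP today"]:
    print(title, "->", acronyms(title))
```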
You could try the REGEXP_MATCHES function:
WITH data AS
(SELECT 'I will leave ASAP' AS title
UNION ALL SELECT 'David James is LOL' AS title
UNION ALL SELECT 'BTW I went home' AS title
UNION ALL SELECT 'Please RSVP today' AS title)
SELECT title, REGEXP_MATCHES(title, '[A-Z][A-Z]+', 'g') AS acronym
FROM data;

postgres substring split text using regex

I have the following string pattern and want to split the text into 4 fields.
NIFTY21JUN11100CE --> NIFTY, 21JUN, 11100, CE
In the above string, only 2 parts have a fixed format. For example, 21JUN represents the year and month and is always 5 characters long. The part before it is the name, which can be any number of characters. I think the regex will be something like (([1-2][0-9]))(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)
The last 2 characters are also fixed-length, and the value can be either PE or CE. The value between 21JUN and CE|PE is the strike price; it is always numeric but can be any number of digits.
Now I want to split the string into 4 fields and am struggling to get the regex right. Is anyone familiar with a Postgres command for this requirement?
You can use SELECT regexp_match('NIFTY21JUN11100CE','^(\D+)(\d{2}[A-Z]{3})(\d+)(PE|CE)$');
Step by step:
^ Beginning of the string
( start capture
\D+ one or more non-digit chars
) end capture
( start capture
\d{2} exactly 2 digits
[A-Z]{3} exactly 3 chars in the range from A to Z
) end capture
( start capture
\d+ one or more digit chars
) end capture
( start capture
PE|CE one of 'PE' or 'CE'
) end capture
$ end of the string
The year-month regexes from your question using character classes [1-2][0-9] and alternations (JAN|FEB|...) are a little bit more strict and could also be used.
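The same pattern can be checked quickly outside Postgres. This is a small Python sketch using the regex from the answer unchanged. The function name `split_symbol` and the choice to return `None` on a non-match are assumptions for illustration.

```python
import re

# Same pattern as above: name, 2-digit year + 3-letter month, strike, option type.
PATTERN = re.compile(r"^(\D+)(\d{2}[A-Z]{3})(\d+)(PE|CE)$")


def split_symbol(symbol):
    """Return (name, year_month, strike, option_type), or None if there is no match."""
    m = PATTERN.match(symbol)
    return m.groups() if m else None


print(split_symbol("NIFTY21JUN11100CE"))
```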

How to change case of characters identified via pattern matching in PostgreSQL

I have several PostgreSQL tables with "comment" columns (data type = text) for which I am trying to standardize the use of upper and lowercase. Specifically, I'd like to change the case of comment strings from all-caps to capitalization of only the first character in each sentence (there are typically 1-3 sentences per comment). I standardized the number of spaces between sentences (to 1) with
update table
set comment = regexp_replace(comment, '( ){2,}',' ','g');
and set all characters in each string except the first to lower case with
update table
set comment = upper(left(comment, 1)) || lower(right(comment, -1))
Now, how do I change the case of the first character after each period to uppercase? I can select the relevant characters with
select regexp_matches('Testing. this. using. some. text.', '([.]\s\S)', 'g');
but haven't been able to figure out how to capitalize these. Also, I'm sure there is a better way to conduct these steps in a more integrative way, but this is my noob-ish attempt.
The following worked in my situation, in which comments are made up of one or more sentences and sentences are separated by a period and single space:
with source as (
    select regexp_split_to_table('hello. world', '\.\s') as sentence
)
select array_to_string(
    array(
        select upper(left(sentence, 1)) || right(sentence, -1) as sentence
        from source
    ), '. '
) as modified_comment;
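For comparison, the lowercase-then-capitalize steps can be combined into one pass when a regex replacement is allowed to call a function. The sketch below shows that in Python rather than SQL. The function name `capitalize_sentences` is hypothetical, and it assumes sentences are separated by a period followed by whitespace, as in the question.

```python
import re


def capitalize_sentences(comment):
    """Lowercase the comment, then uppercase the first letter of each sentence."""
    lowered = comment.lower()
    # Match the start of the string, or a period plus whitespace,
    # followed by one non-space character, and uppercase that character.
    return re.sub(r"(^|\.\s+)(\S)",
                  lambda m: m.group(1) + m.group(2).upper(),
                  lowered)


print(capitalize_sentences("TESTING. THIS. USING. SOME. TEXT."))
```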

How to copy last 13 characters of a string?

In Notepad++ I have a list of entries and at the end of each entry is a phone number (with dashes, 12 characters total). How do I go about either removing all the text before the number or copy/cut the number from the end of the entry for multiple entries? Thanks!
i.e.
1 $1,300 Deposit $1,300 Available 12/15/16 2050 Hurricane Shoals 678-790-0986
2 7 $1,400 Deposit $1,400 Available 12/22/16 1453 Alamein Dr  404-294-6441
3 $1,500 - $1,590 Not Income Based  /  Deposit $1,500 - $1,590 678-328-7952
Here is a way:
Ctrl+H
Find what: ^.*([\d-]{12})$
Replace with: $1
Search Mode: Regular expression (leave ". matches newline" unchecked)
Replace all
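If the list ever needs to be processed outside Notepad++, the same find-and-replace can be sketched in Python. The function name `phone_numbers` is hypothetical. `re.MULTILINE` is used so that `^` and `$` apply to each line, which is how the pattern behaves in the editor.

```python
import re


def phone_numbers(text):
    """Keep only the trailing 12-character phone number on each line."""
    # re.MULTILINE makes ^ and $ match at the start and end of every line.
    return re.sub(r"^.*([\d-]{12})$", r"\1", text, flags=re.MULTILINE)


entries = (
    "1 $1,300 Deposit $1,300 Available 12/15/16 2050 Hurricane Shoals 678-790-0986\n"
    "2 7 $1,400 Deposit $1,400 Available 12/22/16 1453 Alamein Dr  404-294-6441"
)
print(phone_numbers(entries))
```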

Encode a Date and a four digit number into a string with max 8 characters

I have a datetime and a four-digit number and I need to encode them into an 8-character, case-insensitive ASCII string.
The four-digit number is not arbitrary: there are only about 20 possible values, such as 2513, 2595, 2579, ...
My current approach is to use Base36 encoding. Further, I have a dictionary for the four digit numbers that maps like this:
2513 -> '00'
2595 -> '01'
...
The first two characters of the resulting string are used for this. The remaining six characters are used for encoding a Unix timestamp with seconds stripped (I only need minute resolution) in Base36.
So, (2513, 07.01.2015) maps to '000E3HEU'.
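A minimal sketch of the scheme as described, under a few stated assumptions: the `CODES` table holds only the two mappings given above, the timestamp is counted in whole minutes since the Unix epoch (the example string's last six characters decode to 23677158, which corresponds to minutes rather than seconds on 7 January 2015), and the time of day 11:18 UTC is picked only because it reproduces the example string. The names `to_base36` and `encode` are hypothetical.

```python
from datetime import datetime, timezone

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Hypothetical lookup table; only these two entries are given in the question.
CODES = {2513: "00", 2595: "01"}


def to_base36(n, width):
    """Encode a non-negative integer in Base36, left-padded with zeros to width."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(width, "0")


def encode(number, when):
    """Two characters for the number, six for whole minutes since the Unix epoch."""
    minutes = int(when.timestamp()) // 60
    return CODES[number] + to_base36(minutes, 6)


# 11:18 UTC is an assumed time of day; the example only gives the date.
print(encode(2513, datetime(2015, 1, 7, 11, 18, tzinfo=timezone.utc)))
```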
My question is: can anyone think of an even more compact encoding?