get till end of string after first space - postgresql

I need to split a string as follows, but I had no luck with split_part:
select split_part('8 HAMPSHIRE RD',' ',2)
Expected output: HAMPSHIRE RD

A cheaper solution without a regular expression:
SELECT substring (
'8 HAMPSHIRE RD'
FROM position(' ' IN '8 HAMPSHIRE RD') + 1
);

Use regexp_replace():
select regexp_replace('8 HAMPSHIRE RD', '.*?\s', '');
regexp_replace
----------------
HAMPSHIRE RD
(1 row)
An alternative solution using string manipulation functions:
with my_table(str) as (
values ('8 HAMPSHIRE RD')
)
select right(str, -strpos(str, ' '))
from my_table;
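If you want to skip the first word only when it consists of digits, use \d (digit) instead of . (any char) and anchor the pattern to the start of the string. Without the anchor, \d*?\s can match zero digits followed by the first whitespace anywhere in the string:
select regexp_replace('8 HAMPSHIRE RD', '^\d+\s', '');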
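The three approaches above agree when the string contains a space, but they differ when it does not. The following sketch, which adds a second sample value ('HAMPSHIRE') purely as an illustration, puts them side by side so the difference can be checked before picking one:

```sql
-- 'HAMPSHIRE' is an added test value with no space in it (not from the question).
with my_table(str) as (
    values ('8 HAMPSHIRE RD'), ('HAMPSHIRE')
)
select str,
       -- position() returns 0 when there is no space, so this starts at 1
       substring(str from position(' ' in str) + 1) as via_substring,
       -- strpos() returns 0 when there is no space, so this is right(str, 0)
       right(str, -strpos(str, ' '))                as via_right,
       -- no match means nothing is replaced
       regexp_replace(str, '.*?\s', '')             as via_regexp
from my_table;
```

For the value without a space, substring() and regexp_replace() return the string unchanged, while right() returns an empty string.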

REGEXP_COUNT in postgres

We are migrating from Oracle to Postgres.
Here is the SQL I used to extract data from the employee_name column for reporting,
but now I am not sure how to do the regexp_count part.
Oracle SQL
with A4 as
(
select 'govinda j/INDIA_MH/9975215025' as employee_name from dual
)
select employee_name ,
TRIM(SUBSTR(upper(A4.employee_name),1,INSTR(A4.employee_name,'/',1,1)-1)) AS employee_name1,
TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'/',1,1)+1,INSTR(A4.employee_name,'_',1,1)-INSTR(A4.employee_name,'/',1,1)-1)) AS Country,
TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'_',1,1)+1,INSTR(A4.employee_name,'/',1,2)-INSTR(A4.employee_name,'_',1,1)-1)) AS STATE,
CASE WHEN REGEXP_COUNT(A4.employee_name,'_')>1 THEN 'WRONG_NAME>1_'
WHEN REGEXP_COUNT(A4.employee_name,'/')>2 THEN 'WRONG_NAME>2/'
WHEN TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'/',1,1)+1,INSTR(A4.employee_name,'_',1,1)-INSTR(A4.employee_name,'/',1,1)-1))NOT IN
('INDIA','NEPAL') THEN 'WRONG_COUNTRY'
ELSE 'CORRECT' END AS VALIDATION
from A4
In Postgres, with some help, I was able to convert it into the query below.
with A4 as
(
select 'govinda j/INDIA_MH/9975215025'::text as employee_name
)
select employee_name,
split_part(employee_name, '/', 1) as employee_name1,
split_part(split_part(employee_name, '/', 2), '_', 1) as country,
split_part(split_part(employee_name, '/', 2), '_', 2) as state
from A4
But I am not able to convert the validation part. Any help is highly appreciated, as we are very new to Postgres.
You can create a custom function:
create or replace function number_of_chars(text, text)
returns integer language sql immutable as $$
select length($1) - length(replace($1, $2, ''))
$$;
Use:
with example(str) as (
values
('a_b_c'),
('a___b'),
('abc')
)
select str, number_of_chars(str, '_') as count
from example
str | count
-------+-------
a_b_c | 2
a___b | 3
abc | 0
(3 rows)
Note that the above function just counts occurrences of a character in a string and does not use regular expressions, which in general are more expensive.
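Applied to the query in the question, the validation column could be sketched roughly as follows. This assumes the number_of_chars() function defined above has been created; the upper(trim(...)) around the country is carried over from the Oracle version.

```sql
with A4 as (
    select 'govinda j/INDIA_MH/9975215025'::text as employee_name
)
select employee_name,
       case
           -- more than one underscore
           when number_of_chars(employee_name, '_') > 1 then 'WRONG_NAME>1_'
           -- more than two slashes
           when number_of_chars(employee_name, '/') > 2 then 'WRONG_NAME>2/'
           -- country is the part between the first '/' and the '_'
           when upper(trim(split_part(split_part(employee_name, '/', 2), '_', 1)))
                not in ('INDIA', 'NEPAL') then 'WRONG_COUNTRY'
           else 'CORRECT'
       end as validation
from A4;
```

For the sample value this yields CORRECT, since it contains one underscore, two slashes, and the country INDIA.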
A Postgres equivalent of regexp_count() may look like this:
create or replace function regexp_count(text, text)
returns integer language sql as $$
select count(m)::int
from regexp_matches($1, $2, 'g') m
$$;
with example(str) as (
values
('a_b_c'),
('a___b'),
('abc')
)
select str, regexp_count(str, '_') as single, regexp_count(str, '__') as double
from example
str | single | double
-------+--------+--------
a_b_c | 2 | 0
a___b | 3 | 1
abc | 0 | 0
(3 rows)
For anyone who (like me) is visiting this question in the present day: regexp_count is built in as of Postgres 15, see https://pgpedia.info/r/regexp_count.html
It has the following syntax:
regexp_count ( string text, pattern text [, start integer [, flags text ] ] ) → integer
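A minimal usage sketch, assuming a server running PostgreSQL 15 or later; the sample string is the one used in the earlier examples:

```sql
-- Assumes PostgreSQL 15+; on older versions regexp_count() does not exist
-- unless you create it yourself as shown earlier.
select regexp_count('a___b', '_')  as one_underscore,   -- 3
       regexp_count('a___b', '__') as two_underscores;  -- 1
```

Matches do not overlap, which is why the two-underscore pattern is found only once in 'a___b'.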

T-SQL tag after JOIN

I have a t-sql query that looks like this:
select * from (
SELECT [Id], replace(ca.[AKey], '-', '') as [AKey1], rtrim(replace(replace(replace(lower([Name]), '#', ''), '(1.0)', ''), '(2.5)', '')) as [Name], [Key], dw.[AKey], replace(lower(trim([wName])), '#', '') as [wName]
FROM [dbo].[wTable] ca
FULL JOIN (select * from [dw].[wTable]) dw on
rtrim(left( replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', ''), 5))+'%'
like
rtrim(left( replace(replace(replace(lower(ca.[Name] ), '(1.0)', ''), '(2.5)', ''), '#', ''), 5))+'%'
and
right(rtrim(replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
like
right(rtrim(replace(replace(replace(lower(ca.[Name] ), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
) tp
As you can see, during the JOIN, it's removing some fuzzy characters that may or may not exist, and it's checking to see if the first 5 characters in the wName column match with the first 5 characters in the Name column, then doing the same for the last 2 characters in the columns.
So essentially, it's matching on the first 5 characters AND last 2 characters.
What I'm trying to add is an additional column that will tell me if the resulting columns are an exact match or if they are fuzzy. In other words, if they are an exact match it should say 'True' or something like that, and if they are a fuzzy match I would ideally like it to tell me how far off they are. For example, how many characters do not match.
As JNevil mentioned, you could use Levenshtein. You can also use Damerau-Levenshtein or the Longest Common Substring, depending on how accurate you want to get and what your performance expectations are.
Below are two solutions. The first is a Levenshtein solution using a copy I grabbed from Phil Factor here. The Longest Common Substring solution uses my version of the Longest Common Substring, which is the fastest available for SQL Server (by far).
-- sample data
declare @t1 table (string1 varchar(100));
declare @t2 table (string2 varchar(100));
insert @t1 values ('abc'),('xxyz'),('1234'),('9923');
insert @t2 values ('abcd'),('xyz'),('2345'),('zzz');
-- Levenshtein
select string1, string2, Ld
from
(
select *, Ld = dbo.LEVENSHTEIN(t1.string1, t2.string2)
from @t1 t1
cross join @t2 t2
) compare
where ld <= 2;
-- Longest Common Substring
select string1, string2, lcss = item, lcssLen = itemlen, diff = mx.L-itemLen
from @t1 t1
cross join @t2 t2
cross apply dbo.lcssWindowAB(t1.string1, t2.string2, 20)
cross apply (values (IIF(len(string1) > len(string2), len(string1),len(string2)))) mx(L)
where mx.L-itemLen <= 2;
RESULTS
string1 string2 Ld
-------- -------- -----
abc abcd 1
xxyz xyz 1
1234 2345 2
string1 string2 lcss lcssLen diff
-------- -------- ----- ----------- -----------
abc abcd abc 3 1
xxyz xyz xyz 3 1
1234 2345 234 3 1
9923 2345 23 2 2
This does not answer your question but should get you started.
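P.S. The Levenshtein function I posted reports a distance of 4 between "9923" and "2345", which is why that pair is missing from the first result set. That is in fact the correct Levenshtein distance (for example, delete both 9s and append 4 and 5); the 2 shown for that pair is the Longest Common Substring diff. There are other Levenshtein functions out there, though.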
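To get closer to the column the question asks for, the distance can be folded into a single label. This is only a sketch: it assumes a scalar dbo.LEVENSHTEIN function like the one used above is installed (it is not built into SQL Server), and the match_type name, the label text, and the sample rows are made up for illustration.

```sql
-- sample data; the values are illustrative only
declare @t1 table (string1 varchar(100));
declare @t2 table (string2 varchar(100));
insert @t1 values ('abc'),('xxyz');
insert @t2 values ('abc'),('xyz');

select t1.string1, t2.string2,
       -- 'True' for an exact match, otherwise the number of edits apart
       match_type = case
                        when t1.string1 = t2.string2 then 'True'
                        else concat('Fuzzy: ',
                                    dbo.LEVENSHTEIN(t1.string1, t2.string2),
                                    ' edit(s)')
                    end
from @t1 t1
cross join @t2 t2;
```

In the original query the same CASE expression would compare the cleaned-up Name and wName values instead of string1 and string2. Note that the equality test follows the column collation, so it is case-insensitive on a default installation.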

Trim white space from array values

My records in the table are as follows:
id column1
1 'Record1'
2 ' Record2'
3 ' Record3a, Record3b'
4 'Record4a , Record4b, Record4c '
column1 type: text
pre-defined array= {record1,record2,record3a}
When I check the values against a pre-defined array using the && operator, most of the values are missed because of the unnecessary spaces around the delimiters.
Hence I need to first remove the spaces at the beginning or end (only) of each value and then apply string_to_array(), so that the result can be compared to my pre-defined array.
Use trim() to remove leading and trailing whitespace:
SELECT string_to_array(trim(both ' ' from regexp_replace(column1, '\s*,\s*', ',')), ',')
FROM yourTable
It works, but the 'g' flag should be added to regexp_replace() so that the whitespace around every comma is removed, not just around the first one:
SELECT string_to_array(
trim(both ' ' from regexp_replace(column1, '\s*,\s*', ',', 'g')), ',')
FROM yourTable
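Putting the pieces together with the sample rows from the question, the overlap check could be sketched like this. The lower() call is an added assumption: the pre-defined array in the question is lower case while the stored values are not, and && compares elements case-sensitively.

```sql
with yourTable(id, column1) as (
    values (1, 'Record1'),
           (2, ' Record2'),
           (3, ' Record3a, Record3b'),
           (4, 'Record4a , Record4b, Record4c ')
),
cleaned as (
    select id,
           -- lower-case, strip spaces around every comma, trim both ends, then split
           string_to_array(
               trim(regexp_replace(lower(column1), '\s*,\s*', ',', 'g')),
               ',') as items
    from yourTable
)
select id, items
from cleaned
where items && array['record1', 'record2', 'record3a'];
```

With these rows the query returns ids 1, 2 and 3; row 4 has no element in common with the array.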

Extract string of dynamic length when the indicator of completion exists in multiple instances. Postgres

Suppose I have a variable-length varchar column, let's call it ID (samples below):
97.128.39.256.1460854333288493
25.365.49.12.13454154815132
346.45.156.354.1523425161233
I want to grab, like LEFT in Excel, everything to the left of the 4th period. How do I find the fourth instance of a period when its position varies?
I know substring is a start, but I am not sure how to handle the dynamic length.
This is probably the easiest for someone else to read:
select split_part(i, '.', 1) || '.' ||
split_part(i, '.', 2) || '.' ||
split_part(i, '.', 3) || '.' ||
split_part(i, '.', 4)
from (select '97.128.39.256.1460854333288493' as i) as sub;
Or if you don't like split_part and prefer to use arrays:
select array_to_string((string_to_array(i, '.'))[1:4], '.')
from (select '97.128.39.256.1460854333288493' as i) as sub;
I think the array example is a bit harder to grasp at first glance but both work.
Updated answer based on revised question to also convert the Unix timestamp to a Greenplum timestamp:
select 'epoch'::timestamp + '1 second'::interval *
(split_part(i, '.', 5)::numeric/1000000) as event_time,
array_to_string((string_to_array(i, '.'))[1:4], '.') as ip_address
from (
select '97.128.39.256.1460854333288493' as i
) as sub;
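As a possible alternative for the timestamp part, to_timestamp() accepts seconds since the epoch directly. This sketch assumes, as the answer above does, that the fifth field holds microseconds, and that a timestamp with time zone result is acceptable; whether it behaves identically on a given Greenplum version has not been checked here.

```sql
select to_timestamp(
           -- microseconds to seconds; the cast matches to_timestamp(double precision)
           (split_part(i, '.', 5)::numeric / 1000000)::double precision
       ) as event_time,
       array_to_string((string_to_array(i, '.'))[1:4], '.') as ip_address
from (
    select '97.128.39.256.1460854333288493' as i
) as sub;
```

The displayed event_time depends on the session time zone, whereas the 'epoch'::timestamp form above returns a timestamp without time zone.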
You could also try this:
mydb=> select regexp_replace('97.128.39.256.1460854333288493', E'^((?:\\d+\\.){3}\\d+).+$', E'\\1');
regexp_replace
----------------
97.128.39.256
(1 row)
Time: 0.634 ms
with t (s) as ( values
('97.128.39.256.1460854333288493'),
('25.365.49.12.13454154815132'),
('346.45.156.354.1523425161233')
)
select a[1] || '.' || a[2] || '.' || a[3] || '.' || a[4]
from (
select regexp_split_to_array(s, '\.')
from t
) t (a)
;
?column?
----------------
97.128.39.256
25.365.49.12
346.45.156.354

Padding zeros and remove commas, decimals, and dashes in tsql

I have an amount field and a commission field from which I need to remove the comma (,), decimal point (.), dash (-), and percent sign (%), and then pad the result on the left with zeros.
I have tried replicate, format, replace and stuff,
right ('000000000')
right('000000000') + rtrim(field), len#)
RTRIM(replicate('0', 9 - len(field)) + REPLACE(REPLACE(REPLACE(cast(field as varchar), ',', ''), '.',''), '-', ''))
RTRIM(replicate('0', 9 - len(t.Commission_Amount)) + REPLACE(REPLACE(REPLACE(cast(t.Commission_Amount as varchar(9)), ',', ''), '.',''), '-', ''))
but I never get the results that I want. When I use replace it replaces the comma, dash, or % but cuts the field short and does not pad to the left with zeros. I know it's probably right in front of my face I just need some clarity please.
00-126.47 comes out as 0012647
0.00 comes out as 00000000
000126.47 comes out as 00012647
Try this:
create sample table
DECLARE @Table as table (
field varchar(15)
)
populate sample table
INSERT INTO @Table VALUES
('00-126.47'),
('0.00'),
('000126.47'),
('00033%2.422')
select
SELECT field As before,
RIGHT(REPLICATE('0', 9) +
REPLACE(
REPLACE(
REPLACE(
REPLACE(field, '-', '')
, '.', '')
, ',', '')
, '%', '')
, 9) As [After]
FROM @Table
results:
before After
--------------- ---------
00-126.47 000012647
0.00 000000000
000126.47 000012647
00033%2.422 000332422
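On SQL Server 2017 or later, the nested REPLACE calls can be shortened with TRANSLATE. This is a sketch under that version assumption: it maps each unwanted character to a space and then removes the spaces, so it would also strip any spaces already present in the field.

```sql
-- Assumes SQL Server 2017+ (TRANSLATE is not available in earlier versions).
declare @Table as table (field varchar(15));
insert into @Table values ('00-126.47'), ('0.00'), ('000126.47'), ('00033%2.422');

select field as [before],
       right(replicate('0', 9)
             -- '-', '.', ',' and '%' each become a space, then all spaces are dropped
             + replace(translate(field, '-.,%', '    '), ' ', ''),
             9) as [After]
from @Table;
```

The padding works for the same reason as in the query above: the zeros are prepended after the characters have been stripped, and RIGHT(..., 9) then keeps the last nine characters, so the length is never computed from the uncleaned value.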
This solved my problem:
RIGHT(REPLICATE('0', 9) + REPLACE(REPLACE(REPLACE(REPLACE(field, '-', ''), '.', ''), ',', ''), '%', ''), 9) As [After]