REGEXP_COUNT in Postgres

We are migrating from Oracle to Postgres.
Here is the SQL I used to extract data from the employee_name column for reporting, but now I am not sure how to convert the REGEXP_COUNT part.
Oracle SQL
with A4 as
(
select 'govinda j/INDIA_MH/9975215025' as employee_name from dual
)
select employee_name ,
TRIM(SUBSTR(upper(A4.employee_name),1,INSTR(A4.employee_name,'/',1,1)-1)) AS employee_name1,
TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'/',1,1)+1,INSTR(A4.employee_name,'_',1,1)-INSTR(A4.employee_name,'/',1,1)-1)) AS Country,
TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'_',1,1)+1,INSTR(A4.employee_name,'/',1,2)-INSTR(A4.employee_name,'_',1,1)-1)) AS STATE,
CASE WHEN REGEXP_COUNT(A4.employee_name,'_')>1 THEN 'WRONG_NAME>1_'
WHEN REGEXP_COUNT(A4.employee_name,'/')>2 THEN 'WRONG_NAME>2/'
WHEN TRIM(SUBSTR(upper(A4.employee_name),INSTR(A4.employee_name,'/',1,1)+1,INSTR(A4.employee_name,'_',1,1)-INSTR(A4.employee_name,'/',1,1)-1))NOT IN
('INDIA','NEPAL') THEN 'WRONG_COUNTRY'
ELSE 'CORRECT' END AS VALIDATION
from A4
With some help I was able to convert it to the Postgres version below.
with A4 as
(
select 'govinda j/INDIA_MH/9975215025'::text as employee_name
)
select employee_name,
split_part(employee_name, '/', 1) as employee_name1,
split_part(split_part(employee_name, '/', 2), '_', 1) as country,
split_part(split_part(employee_name, '/', 2), '_', 2) as state
from A4
But I am not able to convert the validation part. Any help is highly appreciated, as we are very new to Postgres.

You can create a custom function:
create or replace function number_of_chars(text, text)
returns integer language sql immutable as $$
select length($1) - length(replace($1, $2, ''))
$$;
Use:
with example(str) as (
values
('a_b_c'),
('a___b'),
('abc')
)
select str, number_of_chars(str, '_') as count
from example
str | count
-------+-------
a_b_c | 2
a___b | 3
abc | 0
(3 rows)
Note that the above function just counts occurrences of a character in a string and does not use regular expressions, which in general are more expensive.
A Postgres equivalent of regexp_count() may look like this:
create or replace function regexp_count(text, text)
returns integer language sql as $$
select count(m)::int
from regexp_matches($1, $2, 'g') m
$$;
with example(str) as (
values
('a_b_c'),
('a___b'),
('abc')
)
select str, regexp_count(str, '_') as single, regexp_count(str, '__') as double
from example
str | single | double
-------+--------+--------
a_b_c | 2 | 0
a___b | 3 | 1
abc | 0 | 0
(3 rows)
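Tying this back to the original query, the validation CASE can then be written with either helper. Here is a sketch using the cheaper number_of_chars() together with the split_part() columns from the question (untested, assuming the same A4 CTE):
with A4 as (
  select 'govinda j/INDIA_MH/9975215025'::text as employee_name
)
select employee_name,
       split_part(employee_name, '/', 1) as employee_name1,
       split_part(split_part(employee_name, '/', 2), '_', 1) as country,
       split_part(split_part(employee_name, '/', 2), '_', 2) as state,
       case when number_of_chars(employee_name, '_') > 1 then 'WRONG_NAME>1_'
            when number_of_chars(employee_name, '/') > 2 then 'WRONG_NAME>2/'
            when upper(split_part(split_part(employee_name, '/', 2), '_', 1))
                 not in ('INDIA', 'NEPAL') then 'WRONG_COUNTRY'
            else 'CORRECT'
       end as validation
from A4;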

For anyone who (like me) is visiting this question in the present day, regexp_count is apparently going to be included in Postgres 15 as per: https://pgpedia.info/r/regexp_count.html
It has the following syntax:
regexp_count ( string text, pattern text [, start integer [, flags text ] ] ) → integer
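On Postgres 15 or later the same checks could use the built-in function directly; a brief sketch (untested, again assuming the A4 CTE from the question):
select employee_name,
       case when regexp_count(employee_name, '_') > 1 then 'WRONG_NAME>1_'
            when regexp_count(employee_name, '/') > 2 then 'WRONG_NAME>2/'
            else 'CORRECT'
       end as validation
from A4;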

Related

DB2: How to transpose mutlidimensional table from row to column to find data changes across rows

I am trying the following with Db2:
Problem
So I've got a table with 80+ columns and two rows.
What I need to accomplish is checking which columns have changed value between the two rows, and returning a table of the column names that have changed, their initial value from row 1, and their new value from row 2.
Approach so far
My initial idea was to pivot the two rows into two columns (row 1 as column 1, row 2 as column 2), then join a column of column names (likely taken from syscat.columns) to the table as column 3, at which point I could do a select where column1 != column2, returning the rows with all the data needed. But alas, not long after coming up with this I discovered that DB2 doesn't support pivot / unpivot...
Question
So is there any idea for how to accomplish this in DB2, taking a table with 80+ columns and two rows like so:
| Col A | Col B | Col C | ... | Col Z|
| ----- | ----- | ----- | --- | ---- |
| Val A | Val B | 123 | ... | 01/01/2021 |
| Val C | Val B | 124 | ... | 02/01/2021 |
And returning a table with the columns changed, their initial value, and their new value:
| Initial | New | ColName|
| ----- | ----- | ----- |
| Val A | Val C | Col A |
| 123 | 124 | Col C |
| 01/01/2021 | 02/01/2021 | Col Z |
Also note that the column data types vary, so they will need to be converted to varchar.
DB2 version is 11.1
EDIT: For reference, as requested in the comments, this is the code I attempted to use to achieve this goal:
WITH
INIT AS (SELECT * FROM TABLE WHERE SOMEDATE=(SELECT MIN(SOMEDATE) FROM TABLE)),
LATE AS (SELECT * FROM TABLE WHERE SOMEDATE=(SELECT MAX(SOMEDATE) FROM TABLE)),
COLS AS (SELECT COLNAME FROM SYSCAT.COLUMNS WHERE TABNAME='TABLE' ORDER BY COLNO)
SELECT * FROM (
SELECT
COLNAME AS ATTRIBUTE,
(SELECT COLNAME AS INITIAL FROM INIT),
(SELECT COLNAME AS NEW FROM LATE)
FROM
COLS
WHERE
(INITIAL != NEW) OR (INITIAL IS NULL AND NEW IS NOT NULL) OR (INITIAL IS NOT NULL AND NEW IS NULL));
The only issue with this one is that I couldn't figure out how to use the values from the COLS table as the columns to be selected.
You may easily generate the text of the expressions needed, if you don't want to type them manually.
Consider the following example, which prints only the differing column values in 2 rows of the same rather wide table SYSCAT.TABLES. We use the following query to generate those expressions.
SELECT
'DECODE(I.I, '
|| LISTAGG(COLNO || ', A.' || COLNAME || CASE WHEN TYPENAME NOT LIKE '%CHAR%' AND TYPENAME NOT LIKE '%GRAPHIC' THEN '::VARCHAR(128)' ELSE '' END, ', ')
|| ') AS INITIAL' AS EXPR_INITIAL
, 'DECODE(I.I, '
|| LISTAGG(COLNO || ', B.' || COLNAME || CASE WHEN TYPENAME NOT LIKE '%CHAR%' AND TYPENAME NOT LIKE '%GRAPHIC' THEN '::VARCHAR(128)' ELSE '' END, ', ')
|| ') AS NEW' AS EXPR_NEW
, 'DECODE(I.I, '
|| LISTAGG(COLNO || ', ''' || COLNAME || '''', ', ')
|| ') AS COLNAME' AS EXPR_COLNAME
FROM SYSCAT.COLUMNS C
WHERE TABSCHEMA = 'SYSCAT' AND TABNAME = 'TABLES'
AND TYPENAME NOT LIKE '%LOB';
It doesn't matter how many columns the table contains. We just filter out the columns of *LOB types as an example. If you want them as well, you should change the ::VARCHAR(128) casting to some ::CLOB(XXX).
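For illustration only, against a hypothetical three-column table the generated EXPR_INITIAL text would look roughly like this (the column numbers and names here are made up; the real output lists every non-LOB column of SYSCAT.TABLES):
DECODE(I.I, 0, A.TABSCHEMA, 1, A.TABNAME, 2, A.COLCOUNT::VARCHAR(128)) AS INITIAL
That is, for each enumerated column number I.I it picks the matching column from the row aliased A, casting non-character columns to VARCHAR so every DECODE branch has a compatible type.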
We put these 3 generated expressions into the corresponding places in the query below:
WITH MYTAB AS
(
-- We enumerate the rows to reference them later
SELECT ROWNUMBER() OVER () RN_, T.*
FROM SYSCAT.TABLES T
WHERE TABSCHEMA = 'SYSCAT'
FETCH FIRST 2 ROWS ONLY
)
SELECT *
FROM
(
SELECT
-- Place here the result got in the EXPR_INITIAL column
-- , Place here the result got in the EXPR_NEW column
-- , Place here the result got in the EXPR_COLNAME column
FROM MYTAB A, MYTAB B
,
(
SELECT COLNO AS I
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'SYSCAT' AND TABNAME = 'TABLES'
AND TYPENAME NOT LIKE '%LOB'
) I
WHERE A.RN_ = 1 AND B.RN_ = 2
)
WHERE INITIAL IS DISTINCT FROM NEW;
The result I got in my database:
|INITIAL |NEW |COLNAME |
|--------------------------|--------------------------|---------------|
|2019-06-04-22.44.14.493001|2019-06-04-22.44.14.502001|ALTER_TIME |
|26 |15 |COLCOUNT |
|2019-06-04-22.44.14.493001|2019-06-04-22.44.14.502001|CREATE_TIME |
|2019-06-04-22.44.14.493001|2019-06-04-22.44.14.502001|INVALIDATE_TIME|
|2019-06-04-22.44.14.493001|2019-06-04-22.44.14.502001|LAST_REGEN_TIME|
|ATTRIBUTES |AUDITPOLICIES |TABNAME |

Converting a table with a key and comment field into a key and row for every word in the column field

I have a table with unstructured data I am trying to analyze to try to build a relational lookup. I do not have use of word cloud software.
I really have no idea how to solve this problem. Searching for solutions has led me to tools that might do this for me but cost money, not coded solutions.
Basically my data looks like this:
CK1 CK2 Comment
--------------------------------------------------------------
1 A This is a comment.
2 A Another comment here.
And this is what I need to create:
CK1 CK2 Words
--------------------------------------------------------------
1 A This
1 A is
1 A a
1 A comment.
2 A Another
2 A comment
2 A here.
What you are trying to do is tokenize a string using a space as a delimiter. In the SQL world, people often refer to functions that do this as a "splitter". The potential pitfall of using a splitter for this type of thing is that words can be separated by multiple spaces, tabs, CHAR(10)s, CHAR(13)s, etc. Poor grammar, such as not adding a space after a period, results in this:
" End of sentence.Next sentence"
sentence.Next is returned as a word.
The way I like to tokenize human text is to:
Replace any text that isn't a character with a space
Replace duplicate spaces
Trim the string
Split the newly transformed string using a space as the delimiter.
Below is my solution followed by the DDL to create the functions used.
-- Sample Data
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES
(1,'A','This is a typical comment...Follewed by another...'),
(2,'A','This comment has double spaces and tabs and even carriage
returns!');
-- Solution
SELECT t.CK1, t.CK2, split.itemNumber, split.itemIndex, split.itemLength, split.item
FROM @yourtable AS t
CROSS APPLY samd.patReplace8K(t.Comment,'[^a-zA-Z ]',' ') AS c1
CROSS APPLY dbo.RemoveDupChar8K(c1.newString,' ') AS c2
CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;
Results (truncated for brevity):
CK1 CK2 itemNumber itemIndex itemLength item
----------- ---- -------------------- ----------- ----------- --------------
1 A 1 1 4 This
1 A 2 6 2 is
1 A 3 9 1 a
1 A 4 11 7 typical
1 A 5 19 7 comment
...
2 A 1 1 4 This
2 A 2 6 7 comment
2 A 3 14 3 has
2 A 4 18 6 double
...
Note that the splitter I'm using is based on Jeff Moden's DelimitedSplit8K, with a couple of tweaks.
Functions used:
CREATE FUNCTION dbo.rangeAB
(
  @low  bigint,
  @high bigint,
  @gap  bigint,
  @row1 bit
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
GO
CREATE FUNCTION samd.NGrams8k
(
  @string VARCHAR(8000), -- Input string
  @N      INT            -- requested token size
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  position = r.RN,
  token    = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM dbo.rangeAB(1, LEN(@string)+1-@N, 1, 1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO
CREATE FUNCTION samd.patReplace8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50),
  @replace VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
  SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
              WHEN @pattern+@replace+@string IS NOT NULL THEN
                CASE WHEN PATINDEX(@pattern, token COLLATE Latin1_General_BIN) = 0
                     THEN ng.token ELSE @replace END END
  FROM samd.NGrams8K(@string, 1) AS ng
  ORDER BY ng.position
  FOR XML PATH(''),TYPE
).value('text()[1]', 'VARCHAR(8000)');
GO
CREATE FUNCTION samd.delimitedSplitAB8K
(
  @string    VARCHAR(8000), -- input string
  @delimiter CHAR(1)        -- delimiter
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  itemNumber = ROW_NUMBER() OVER (ORDER BY d.p),
  itemIndex  = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),
  itemLength = CHECKSUM(item.ln),
  item       = SUBSTRING(@string, d.p+1, item.ln)
FROM (VALUES (DATALENGTH(@string))) AS l(s) -- length of the string
CROSS APPLY
(
  SELECT 0 UNION ALL -- for handling leading delimiters
  SELECT ng.position
  FROM samd.NGrams8K(@string, 1) AS ng
  WHERE token = @delimiter
) AS d(p) -- delimiter.position
CROSS APPLY (VALUES( --LEAD(d.p, 1, l.s+l.d) OVER (ORDER BY d.p) - (d.p+l.d)
  ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);
GO
CREATE FUNCTION dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString =
  replace(replace(replace(replace(replace(replace(replace(
  @string COLLATE LATIN1_GENERAL_BIN,
  replicate(@char,33), @char), --33
  replicate(@char,17), @char), --17
  replicate(@char,9 ), @char), -- 9
  replicate(@char,5 ), @char), -- 5
  replicate(@char,3 ), @char), -- 3
  replicate(@char,2 ), @char), -- 2
  replicate(@char,2 ), @char); -- 2
GO
1) If you are using SQL Server 2016 or above, then you should probably use the built-in function STRING_SPLIT:
-- SQL 2016 and above
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
select [value] from STRING_SPLIT(@txt, ' ')
2) Only if option 1 does not fit: if the number of separators (the space in our case) is at most 3, which fits your sample data, then we should probably use PARSENAME:
-- Before SQL 2016, if we have no more than 4 parts
DECLARE @txt NVARCHAR(100) = N'This is a comment.'
DECLARE @Temp NVARCHAR(200) = REPLACE (@txt,'.','#')
SELECT t FROM (VALUES(1),(2),(3),(4))T1(n)
CROSS APPLY (SELECT REPLACE(PARSENAME(REPLACE(@Temp,' ','.'),T1.n), '#','.'))T2(t)
3) Only if options 1 and 2 do not fit, then we should use a SQLCLR function:
http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/
4) Only if we cannot use 1 or 2, and we cannot use SQLCLR (which implies a real administration problem and has nothing to do with security, since you can keep all the SQLCLR functions in a read-only database for the use of all users, as I explain in my lectures), then you can use T-SQL and create a UDF; a minimal sketch of such a splitter follows the link below.
https://sqlperformance.com/2012/07/t-sql-queries/split-strings
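For completeness, a minimal T-SQL splitter UDF of the kind option 4 refers to might look like the sketch below. This is my own illustration, not taken from the linked article; the recursive CTE is subject to the default MAXRECURSION limit of 100 delimiters per string.
-- Minimal recursive-CTE splitter (illustrative sketch)
CREATE FUNCTION dbo.SimpleSplit (@string VARCHAR(8000), @delim CHAR(1))
RETURNS TABLE AS RETURN
WITH pieces(pos, nextpos) AS
(
    -- start before the first character, find the first delimiter
    SELECT 0, CHARINDEX(@delim, @string)
    UNION ALL
    -- walk forward one delimiter at a time
    SELECT nextpos, CHARINDEX(@delim, @string, nextpos + 1)
    FROM pieces
    WHERE nextpos > 0
)
SELECT item = SUBSTRING(@string, pos + 1,
                        CASE WHEN nextpos > 0 THEN nextpos - pos - 1
                             ELSE LEN(@string) - pos END)
FROM pieces;
GO
-- Usage: returns one row per space-separated word
SELECT item FROM dbo.SimpleSplit('This is a comment.', ' ');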

Redshift. Convert comma delimited values into rows with all combinations

I have:
user_id|user_name|user_action
-----------------------------
1 | Shone | start,stop,cancell
I would like to see:
user_id|user_name|parsed_action
-------------------------------
1 | Shone | start
1 | Shone | start,stop
1 | Shone | start,cancell
1 | Shone | start,stop,cancell
1 | Shone | stop
1 | Shone | stop,cancell
1 | Shone | cancell
....
You can create the following Python UDF:
create or replace function get_unique_combinations(list varchar(max))
returns varchar(max)
stable as $$
    from itertools import combinations
    arr = list.split(',')
    response = []
    for L in range(1, len(arr)+1):
        for subset in combinations(arr, L):
            response.append(','.join(subset))
    return ';'.join(response)
$$ language plpythonu;
This will take your list of actions and return the unique combinations separated by semicolons (elements within each combination remain separated by commas). Then you use a UNION hack to split the values into separate rows, like this:
WITH unique_combinations as (
SELECT
user_id
,user_name
,get_unique_combinations(user_actions) as action_combinations
FROM your_table
)
,unwrap_lists as (
SELECT
user_id
,user_name
,split_part(action_combinations,';',1) as parsed_action
FROM unique_combinations
UNION ALL
SELECT
user_id
,user_name
,split_part(action_combinations,';',2) as parsed_action
FROM unique_combinations
-- add as many UNIONs as the maximum possible number of combinations for a single row, with the 3rd parameter (the 1-based index) increasing by 1 in each branch
)
SELECT *
FROM unwrap_lists
WHERE parsed_action is not null
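As a quick sanity check of the UDF itself, for the sample row it should return all seven combinations in one semicolon-delimited string (expected output shown as a comment; untested sketch):
SELECT get_unique_combinations('start,stop,cancell');
-- start;stop;cancell;start,stop;start,cancell;stop,cancell;start,stop,cancell
So the UNION ALL chain above needs seven branches (split_part indexes 1 through 7) to cover every combination of a three-element list.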

Postgresql - Basic Arrays and array_agg

As a Test I created this schema:
CREATE TABLE simple_table (client_id int4, order_id int4);
INSERT INTO simple_table (client_id, order_id)
VALUES
(1,2),(1,3),(1,4),(1,6),(1,8),(1,12),(1,16),(1,18),(1,25),(1,32),(1,33),(1,37),(1,43),
(1,56),(1,57),(1,66),(2,2),(2,3),(2,5),(2,7),(2,9),(2,12),(2,17),(2,19),(2,22),(2,30),
(2,33),(2,38),(2,44),(2,56),(2,58),(2,66)
;
Then used array_agg:
SELECT client_id, array_agg(order_id) FROM simple_table GROUP BY client_id;
to create the arrays for client 1 and client 2:
| CLIENT_ID | ARRAY_AGG |
----------------------------------------------------------
| 1 | 2,3,4,6,8,12,16,18,25,32,33,37,43,56,57,66 |
| 2 | 2,3,5,7,9,12,17,19,22,30,33,38,44,56,58,66 |
Now I would like to compare the 2 rows and identify the common values. I tried the && overlap operator (have elements in common), e.g. ARRAY[1,4,3] && ARRAY[2,1], from the PostgreSQL documentation, but I am having problems.
Perhaps I am looking at this wrong. Any help or guidance would be appreciated!
The && operator is a predicate that yields a true or false result, not a list of values.
If you're looking for the list of order_id that exist for both client_id=1 and client_id=2, the query would be:
select order_id from simple_table where client_id in (1,2)
group by order_id having count(*)=2;
That's equivalent to the intersection of the two arrays if you consider these arrays to be sets (no duplicates, and the positions of the values are irrelevant), except that you don't need to use arrays at all; simple standard SQL is good enough.
Take a look at the "array_intersect" functions here:
Array Intersect
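If you do want the result back as an array rather than as rows, a minimal intersection helper can be built from unnest() and INTERSECT. This is a sketch of my own (the name array_intersect is just a placeholder, not a built-in, and not taken from the linked page):
create or replace function array_intersect(anyarray, anyarray)
returns anyarray language sql immutable as $$
  select array(
    select unnest($1)
    intersect
    select unnest($2)
  );
$$;
-- Usage: select array_intersect(array[1,4,3], array[2,1]);  --> {1}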
To see elements that are not common to both arrays:
create or replace function arrxor(anyarray,anyarray) returns anyarray as $$
select ARRAY(
(
select r.elements
from (
(select 1,unnest($1))
union all
(select 2,unnest($2))
) as r (arr, elements)
group by 1
having min(arr) = max(arr)
)
)
$$ language sql strict immutable;
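A quick usage sketch (untested; element order in the result is not guaranteed since there is no ORDER BY):
select arrxor(array[2,3,4,6], array[2,3,5,7]);
-- e.g. {4,6,5,7} : the values that appear in exactly one of the two arrays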

Converting Access Pivot Table to SQL Server

I'm having trouble converting a MS Access pivot table over to SQL Server. Was hoping someone might help..
TRANSFORM First(contacts.value) AS FirstOfvalue
SELECT contacts.contactid
FROM contacts RIGHT JOIN contactrecord ON contacts.[detailid] = contactrecord.[detailid]
GROUP BY contacts.contactid
PIVOT contactrecord.wellknownname
;
Edit: Responding to some of the comments
Contacts table has three fields
contactid | detailid | value |
1 1 Scott
contactrecord has something like
detailid | wellknownname
1 | FirstName
2 | Address1
3 | foobar
contactrecord is dynamic in that the user can at any time create a field to be added to contacts.
The Access query pulls out
contactid | FirstName | Address1 | foobar
1 | Scott | null | null
which is the pivot on the wellknownname. The key here is that the number of columns is dynamic, since the user can, at any time, create another field for the contact. Being new to pivot tables altogether, I'm wondering how I can recreate this Access query in SQL Server.
As for TRANSFORM: that's a built-in Access function. More information about it can be found here. First() will just take the first result for that matching row.
I hope this helps and appreciate all the help.
A quick search for dynamic pivot tables comes up with this article.
After renaming things in his last query on the page I came up with this:
DECLARE @PivotColumnHeaders VARCHAR(max);
SELECT @PivotColumnHeaders = COALESCE(@PivotColumnHeaders + ',['+ CAST(wellknownname as varchar) + ']','['+ CAST(wellknownname as varchar) + ']')
FROM contactrecord;
DECLARE @PivotTableSQL NVARCHAR(max);
SET @PivotTableSQL = N'
SELECT *
FROM (
    SELECT
        c.contactid,
        cr.wellknownname,
        c.value
    FROM contacts c
    RIGHT JOIN contactrecord cr
        on c.detailid = cr.detailid
) as pivotData
pivot(
    min(value)
    for wellknownname in (' + @PivotColumnHeaders +')
) as pivotTable
'
;
execute(@PivotTableSQL);
which, despite its ugliness, does the job.
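One small hardening worth considering, not part of the original answer: build the header list from distinct values and wrap each name with QUOTENAME, so duplicate or irregular wellknownname values don't break the generated SQL. A sketch of the altered first statement:
DECLARE @PivotColumnHeaders VARCHAR(max);
SELECT @PivotColumnHeaders =
       COALESCE(@PivotColumnHeaders + ',' + QUOTENAME(wellknownname),
                QUOTENAME(wellknownname))
FROM (SELECT DISTINCT wellknownname FROM contactrecord) AS names;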