Why does atttypmod differ from character_maximum_length?

I'm converting some information_schema queries to system catalog queries and I'm getting different results for character maximum length.
SELECT column_name,
data_type ,
character_maximum_length AS "maxlen"
FROM information_schema.columns
WHERE table_name = 'x'
returns the results I expect, e.g.:
city character varying 255
company character varying 1000
The equivalent catalog query
SELECT attname,
atttypid::regtype AS datatype,
NULLIF(atttypmod, -1) AS maxlen
FROM pg_attribute
WHERE CAST(attrelid::regclass AS varchar) = 'x'
AND attnum > 0
AND NOT attisdropped
seems to return every length + 4:
city character varying 259
company character varying 1004
Why the difference? Is it safe to always simply subtract 4 from the result?

It is safe to subtract 4 from the result for the types char and varchar. What the information_schema.columns view does under the hood is call the function information_schema._pg_char_max_length (this is your difference, since your query doesn't), whose body is:
CREATE OR REPLACE FUNCTION information_schema._pg_char_max_length(typid oid, typmod integer)
RETURNS integer
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT
AS $function$SELECT
CASE WHEN $2 = -1 /* default typmod */
THEN null
WHEN $1 IN (1042, 1043) /* char, varchar */
THEN $2 - 4
WHEN $1 IN (1560, 1562) /* bit, varbit */
THEN $2
ELSE null
END$function$
That said, for char and varchar it always subtracts 4: for these types, atttypmod stores the declared maximum length plus 4 bytes for the varlena length header.
This is what makes your query not equivalent: to match the view you need to pass each column's type OID and typmod through that function, because more types than just char and varchar come into play. If you wish to simplify, you can do it without a join (it won't be bulletproof, though):
SELECT attname,
atttypid::regtype AS datatype,
NULLIF(information_schema._pg_char_max_length(atttypid, atttypmod), -1) AS maxlen
FROM pg_attribute
WHERE CAST(attrelid::regclass AS varchar) = 'x'
AND attnum > 0
AND NOT attisdropped
This should do it for you. Should you wish to investigate the matter further, refer to the view definition of information_schema.columns, for example as below.
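One way to pull that definition up (pg_get_viewdef is a standard catalog function; psql's \d+ information_schema.columns works as well):
SELECT pg_get_viewdef('information_schema.columns'::regclass, true);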


Concatenate string instead of just replacing it

I have a table with standard columns where I want to perform regular INSERTs.
But one of the columns is of type varchar with special semantics. It's a string that's supposed to behave as a set of strings, where the elements of the set are separated by commas.
Eg. if one row has in that varchar column the value fish,sheep,dove, and I insert the string ,fish,eagle, I want the result to be fish,sheep,dove,eagle (ie. eagle gets added to the set, but fish doesn't because it's already in the set).
I have here this Postgres code that does the "set concatenation" that I want:
SELECT string_agg(unnest, ',') AS x FROM (SELECT DISTINCT unnest(string_to_array('fish,sheep,dove' || ',fish,eagle', ','))) AS x;
But I can't figure out how to apply this logic to insertions.
What I want is something like:
CREATE TABLE IF NOT EXISTS t00(
userid int8 PRIMARY KEY,
a int8,
b varchar);
INSERT INTO t00 (userid,a,b) VALUES (0,1,'fish,sheep,dove');
INSERT INTO t00 (userid,a,b) VALUES (0,1,',fish,eagle')
ON CONFLICT (userid)
DO UPDATE SET
a = EXCLUDED.a,
b = SELECT string_agg(unnest, ',') AS x FROM (SELECT DISTINCT unnest(string_to_array(t00.b || EXCLUDED.b, ','))) AS x;
How can I achieve something like that?
Storing comma-separated values is a huge mistake to begin with. But if you really want to make your life harder than it needs to be, you might want to create a function that merges two comma-separated lists:
create function merge_lists(p_one text, p_two text)
returns text
as
$$
select string_agg(item, ',')
from (
select e.item
from unnest(string_to_array(p_one, ',')) as e(item)
where e.item <> '' --< necessary because of the leading , in your data
union
select t.item
from unnest(string_to_array(p_two, ',')) t(item)
where t.item <> ''
) t;
$$
language sql;
If you are using Postgres 14 or later, unnest(string_to_array(..., ',')) can be replaced with string_to_table(..., ','), as sketched below.
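A minimal sketch of that variant, assuming Postgres 14+ (same name and signature, so it simply replaces the function above):
create or replace function merge_lists(p_one text, p_two text)
returns text
as
$$
select string_agg(item, ',')
from (
select e.item
from string_to_table(p_one, ',') as e(item)
where e.item <> '' --< still needed because of the leading , in the data
union
select t.item
from string_to_table(p_two, ',') as t(item)
where t.item <> ''
) t;
$$
language sql;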
Then your INSERT statement gets a bit simpler:
INSERT INTO t00 (userid,a,b) VALUES (0,1,',fish,eagle')
ON CONFLICT (userid)
DO UPDATE SET
a = EXCLUDED.a,
b = merge_lists(excluded.b, t00.b);
I think I was only missing parentheses around the SELECT statement:
INSERT INTO t00 (userid,a,b) VALUES (0,1,',fish,eagle')
ON CONFLICT (userid)
DO UPDATE SET
a = EXCLUDED.a,
b = (SELECT string_agg(unnest, ',') AS x FROM (SELECT DISTINCT unnest(string_to_array(t00.b || EXCLUDED.b, ','))) AS x);
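One caveat: string_agg without an ORDER BY returns the elements in no guaranteed order, so the merged list may come back shuffled from run to run. If deterministic output matters, an ORDER BY can go inside the aggregate, e.g. (alphabetical order; the alias item is ours):
INSERT INTO t00 (userid,a,b) VALUES (0,1,',fish,eagle')
ON CONFLICT (userid)
DO UPDATE SET
a = EXCLUDED.a,
b = (SELECT string_agg(x.item, ',' ORDER BY x.item)
FROM (SELECT DISTINCT unnest(string_to_array(t00.b || EXCLUDED.b, ',')) AS item) AS x);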

How to use queried table name in subquery

I'm trying to query field names as well as their maximum length in their corresponding table with a single query - is it at all possible? I've read about correlated subqueries, but I couldn't get the desired result.
Here is the query I have so far:
select T1.RDB$FIELD_NAME, T2.RDB$FIELD_NAME, T2.RDB$RELATION_NAME as tabName, T1.RDB$CHARACTER_SET_ID, T1.RDB$FIELD_LENGTH,
(select max(char_length(T2.RDB$FIELD_NAME))
FROM tabName as MaxLength)
from RDB$FIELDS T1, RDB$RELATION_FIELDS T2
The above doesn't work because, of course, here the subquery tries to find "tabName" table. My guess is that I should use some kind of joins, but my SQL skills are very limited in this matter.
The origin of the request is that I want to apply this script in order to transform all my non-UTF8 fields to UTF8, but I run into "string truncation" issues, as I have a few VARCHAR(8192) fields that lead to string truncation errors with the script. Usually none of the fields would actually use these 8192 chars, but I'd rather make sure before truncating.
What you're trying to do cannot be done this way. It looks like you want to obtain the actual maximum length of fields in tables, but you cannot dynamically reference table and column names like this; being able to do that would be a SQL injection heaven. In addition, your use of a SQL-89 cross join instead of an inner join (preferably in SQL-92 style) causes other problems, as you will combine fields incorrectly (as a Cartesian product).
Instead you need to write PSQL that dynamically builds and executes the statement to obtain the lengths, using EXECUTE BLOCK (or a stored procedure) together with EXECUTE STATEMENT.
For example, something like this:
execute block
returns (
table_name varchar(63) character set unicode_fss,
column_name varchar(63) character set unicode_fss,
type varchar(10),
length smallint,
charset_name varchar(63) character set unicode_fss,
collation_name varchar(63) character set unicode_fss,
max_length smallint)
as
begin
for select
trim(rrf.RDB$RELATION_NAME) as table_name,
trim(rrf.RDB$FIELD_NAME) as column_name,
case rf.RDB$FIELD_TYPE when 14 then 'CHAR' when 37 then 'VARCHAR' end as type,
coalesce(rf.RDB$CHARACTER_LENGTH, rf.RDB$FIELD_LENGTH / rcs.RDB$BYTES_PER_CHARACTER) as length,
trim(rcs.RDB$CHARACTER_SET_NAME) as charset_name,
trim(rc.RDB$COLLATION_NAME) as collation_name
from RDB$RELATIONS rr
inner join RDB$RELATION_FIELDS rrf
on rrf.RDB$RELATION_NAME = rr.RDB$RELATION_NAME
inner join RDB$FIELDS rf
on rf.RDB$FIELD_NAME = rrf.RDB$FIELD_SOURCE
inner join RDB$CHARACTER_SETS rcs
on rcs.RDB$CHARACTER_SET_ID = rf.RDB$CHARACTER_SET_ID
left join RDB$COLLATIONS rc
on rc.RDB$CHARACTER_SET_ID = rf.RDB$CHARACTER_SET_ID
and rc.RDB$COLLATION_ID = rf.RDB$COLLATION_ID
and rc.RDB$COLLATION_NAME <> rcs.RDB$DEFAULT_COLLATE_NAME
where coalesce(rr.RDB$RELATION_TYPE, 0) = 0 and coalesce(rr.RDB$SYSTEM_FLAG, 0) = 0
and rf.RDB$FIELD_TYPE in (14 /* char */, 37 /* varchar */)
into table_name, column_name, type, length, charset_name, collation_name
do
begin
execute statement 'select max(character_length("' || replace(column_name, '"', '""') || '")) from "' || replace(table_name, '"', '""') || '"'
into max_length;
suspend;
end
end
As an aside, the maximum length of a VARCHAR of character set UTF8 is 8191, not 8192: a VARCHAR can occupy at most 32765 bytes, and UTF8 characters are counted at 4 bytes each, so 32765 / 4 = 8191.

Removing all the Alphabets from a string using a single SQL Query [duplicate]

I'm currently doing a data conversion project and need to strip all alphabetical characters from a string. Unfortunately I can't create or use a function, as we don't own the source machine, making the methods I've found in previous posts unusable.
What would be the best way to do this in a SELECT statement? Speed isn't too much of an issue, as this will only be running over 30,000 records or so and is a one-off statement.
You can do this in a single statement. You're not really creating a statement with 200+ REPLACEs are you?!
update tbl
set S = U.clean
from tbl
cross apply
(
select Substring(tbl.S,v.number,1)
-- this table will cater for strings up to length 2047
from master..spt_values v
where v.type='P' and v.number between 1 and len(tbl.S)
and Substring(tbl.S,v.number,1) like '[0-9]'
order by v.number
for xml path ('')
) U(clean)
Working SQL Fiddle showing this query with sample data
Replicated below for posterity:
create table tbl (ID int identity, S varchar(500))
insert tbl select 'asdlfj;390312hr9fasd9uhf012 3or h239ur ' + char(13) + 'asdfasf'
insert tbl select '123'
insert tbl select ''
insert tbl select null
insert tbl select '123 a 124'
Results
ID S
1 390312990123239
2 123
3 (null)
4 (null)
5 123124
A recursive CTE comes to the rescue here.
;WITH CTE AS
(
SELECT
[ProductNumber] AS OrigProductNumber
,CAST([ProductNumber] AS VARCHAR(100)) AS [ProductNumber]
FROM [AdventureWorks].[Production].[Product]
UNION ALL
SELECT OrigProductNumber
,CAST(STUFF([ProductNumber], PATINDEX('%[^0-9]%', [ProductNumber]), 1, '') AS VARCHAR(100) ) AS [ProductNumber]
FROM CTE WHERE PATINDEX('%[^0-9]%', [ProductNumber]) > 0
)
SELECT * FROM CTE
WHERE PATINDEX('%[^0-9]%', [ProductNumber]) = 0
OPTION (MAXRECURSION 0)
output:
OrigProductNumber ProductNumber
WB-H098 098
VE-C304-S 304
VE-C304-M 304
VE-C304-L 304
TT-T092 092
Here is RichardTheKiwi's script wrapped in a function, for use in selects without CROSS APPLY.
I also kept the dot (and convert commas to dots), because in my case I use it for double and money values stored in a varchar field:
CREATE FUNCTION dbo.ReplaceNonNumericChars (@string VARCHAR(5000))
RETURNS VARCHAR(1000)
AS
BEGIN
SET @string = REPLACE(@string, ',', '.')
SET @string = (SELECT SUBSTRING(@string, v.number, 1)
FROM master..spt_values v
WHERE v.type = 'P'
AND v.number BETWEEN 1 AND LEN(@string)
AND (SUBSTRING(@string, v.number, 1) LIKE '[0-9]'
OR SUBSTRING(@string, v.number, 1) LIKE '[.]')
ORDER BY v.number
FOR
XML PATH('')
)
RETURN @string
END
GO
Thanks RichardTheKiwi +1
Well if you really can't use a function, I suppose you could do something like this:
SELECT REPLACE(REPLACE(REPLACE(LOWER(col),'a',''),'b',''),'c','')
FROM dbo.table...
Obviously it would be a lot uglier than that, since I only handled the first three letters, but it should give you the idea; the start of the full chain is sketched below.
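For the record, roughly how the full chain would begin (extend the same pattern through 'z'; col and dbo.tbl are placeholder names):
SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LOWER(col),
'a',''),'b',''),'c',''),'d',''),'e',''),'f','') -- ...26 levels deep in total
FROM dbo.tbl;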

Select that splits rows with range into several rows with smaller ranges?

I have two tables that contain categorized tsrange values. The ranges in each table are non-overlapping per category, but the ranges in b might overlap those in a.
create table a ( id serial primary key, category int, period tsrange );
create table b ( id serial primary key, category int, period tsrange );
What I would like to do is combine these two tables into a CTE for another query. The combined values needs to be the tsranges from table a subtracted by any overlapping tsranges in table b with the same category.
The complication is that when an overlapping b.period is contained inside an a.period, the result of the subtraction is two rows. The Postgres range - operator does not support this (it raises an error if the result would not be contiguous), so I created a function that will return 1 or 2 rows:
create function subtract_tsrange( a tsrange , b tsrange )
returns table (period tsrange)
language 'plpgsql' as $$
begin
if a #> b and not isempty(b) and lower(a) <> lower(b) and upper(b) <> upper(a)
then
period := tsrange(lower(a), lower(b), '[)');
return next;
period := tsrange(upper(b), upper(a), '[)');
return next;
else
period := a - b;
return next;
end if;
return;
end
$$;
There can also be several b.periods overlapping an a.period, so one row from a might potentially be split into a lot of rows with shorter periods.
Now I want to create a select that takes each row in a and returns:
The original a.period if there is no overlapping b.period with the same category
or
1 or several rows representing the original a.period minus all overlapping b.periods with the same category.
After reading lots of other posts I figure I should use SELECT LATERAL in combination with my function somehow, but I'm still scratching my head as to how?? (We're talking Postgres 9.6 btw!)
Notes: your problem can easily be generalized to all range types, therefore I will use the anyrange pseudo-type in my answer, but you don't have to. In fact, because of this I had to create a generic constructor for range types, because PostgreSQL has not defined one (yet):
create or replace function to_range(t anyrange, l anyelement, u anyelement, s text default '[)', out to_range anyrange)
language plpgsql as $func$
begin
execute format('select %I($1, $2, $3)', pg_typeof(t)) into to_range using l, u, s;
end
$func$;
Of course, you can use the appropriate range constructor instead of to_range() calls.
Also, I will use the numrange type for testing purposes, as it can be created and checked more easily than the tsrange type, but my answer should work with that as well.
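A quick usage check of to_range() with numrange (results shown as comments):
select to_range(null::numrange, 1.5, 2.5);        -- [1.5,2.5)
select to_range(null::numrange, 1.5, 2.5, '(]');  -- (1.5,2.5]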
Answer:
I rewrote your function to handle any type of bounds (inclusive, exclusive, even unbounded ranges). Also, it will return an empty result set when a <@ b.
create or replace function range_div(a anyrange, b anyrange)
returns setof anyrange
language sql as $func$
select * from unnest(case
when b is null or a <@ b then '{}'
when a #> b then array[
to_range(a, case when lower_inf(a) then null else lower(a) end,
case when lower_inf(b) then null else lower(b) end,
case when lower_inc(a) then '[' else '(' end ||
case when lower_inc(b) then ')' else ']' end),
to_range(a, case when upper_inf(b) then null else upper(b) end,
case when upper_inf(a) then null else upper(a) end,
case when upper_inc(b) then '(' else '[' end ||
case when upper_inc(a) then ']' else ')' end)
]
else array[a - b]
end)
$func$;
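A quick sanity check, again with numrange (results shown as comments):
select range_div('[1,10]'::numrange, '[3,5]'::numrange);
-- [1,3)
-- (5,10]
select range_div('[3,5]'::numrange, '[1,10]'::numrange);
-- (no rows: a is contained in b)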
With this in mind, what you need is somewhat an inverse of aggregation. For example, with sum() one starts from an empty value (0) and keeps adding values to it; here you start from your initial value and need to keep removing parts of it.
One solution to that is to use recursive CTEs:
with recursive r as (
select *
from a
union
select r.id, r.category, d
from r
left join b using (category)
cross join range_div(r.period, b.period) d -- this is in fact an implicit lateral join
where r.period && b.period
)
select r.*
from r
left join b on r.category = b.category and r.period && b.period
where not isempty(r.period) and b.period is null
My sample data:
create table a (id serial primary key, category int, period numrange);
create table b (id serial primary key, category int, period numrange);
insert into a (category, period) values (1, '[1,4]'), (1, '[2,5]'), (1, '[3,6]'), (2, '(1,6)');
insert into b (category, period) values (1, '[2,3)'), (1, '[1,2]'), (2, '[3,3]');
The query above produces:
 id | category | period
----+----------+--------
  3 |        1 | [3,6]
  1 |        1 | [3,4]
  2 |        1 | [3,5]
  4 |        2 | (1,3)
  4 |        2 | (3,6)

Get columns that differ between 2 rows

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
It is of course possible to compare column by column, 60 times over, but I'm looking for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows along with their original row numbers, then we pair consecutive rows based on those row numbers and filter out the pairs with the same "value" column:
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
where r1.value <> r2.value;
Sample result:
key
-----
id
t3
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id, which, as the primary key, will obviously always differ between rows.
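Put together (still against the demo table above), the refined query and its output:
SELECT skeys((h1-h2)-'id'::text) FROM
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
-- skeys
-- -------
-- t3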
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table, and loops over them. A difference is when the count of the distinct items is more than one.
Also, the output is:
the count of differences (the function's return value)
a NOTICE for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;