I want to get max length (in bytes) of a variable-length column. One of my columns has the following definition:
shortname character varying(35) NOT NULL
userid integer NOT NULL
columncount smallint NOT NULL
I tried to retrieve some info from the pg_attribute table, but the attlen column has a value of -1 for all variable-length columns. I also tried to use the pg_column_size function, but it doesn't accept the name of a column as an input parameter.
It can be easily done in SQL Server.
Are there any other ways to get the value I'm looking for?
You will need to use a CASE expression that checks pg_attribute.attlen and then calculates the maximum size in bytes depending on that. To get the max size for a varchar column you can "steal" the expression used in information_schema.columns.character_octet_length for varchar or char columns.
Something along these lines:
select a.attname,
       t.typname,
       case
         when a.attlen <> -1 then a.attlen
         when t.typname in ('bytea', 'text') then pg_size_bytes('1GB')
         when t.typname in ('varchar', 'bpchar', 'char') then information_schema._pg_char_octet_length(information_schema._pg_truetypid(a.*, t.*), information_schema._pg_truetypmod(a.*, t.*))
       end as max_bytes
from pg_attribute a
  join pg_type t on a.atttypid = t.oid
where a.attrelid = 'stuff.test'::regclass
  and a.attnum > 0
  and not a.attisdropped;
Note that this won't return a proper size for numeric as that is also a variable length type. The documentation says "The actual storage requirement is two bytes for each group of four decimal digits, plus three to eight bytes overhead".
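If you need a feel for it, pg_column_size() on sample values shows how numeric storage grows with the actual value rather than with the declared precision (the exact byte counts depend on the value and version, so I'm not quoting them here):
SELECT pg_column_size(1::numeric)                    AS small_numeric,
       pg_column_size(123456789.123456789::numeric)  AS larger_numeric;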
As a side note: this seems an extremely strange thing to do. Especially given your mention of temp tables in stored procedures. More often than not, the use of temp tables is not needed in Postgres. Instead of blindly copying the old approach that might have worked well in SQL Server, you should understand how Postgres works and change the approach to match the best practices in Postgres.
I have seen many migrations fail or deliver mediocre performance because of the assumption that the best practices for "System A" can be applied without any change to "System B". You need to migrate your mindset as well.
If this checks the columns of a temp table, then why not simply check the actual size of the column values using pg_column_size()?
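For example (a minimal sketch; the temp table name is just a placeholder, the column is the one from your definition):
SELECT max(pg_column_size(shortname)) AS max_bytes_used
FROM my_temp_table;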
I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text, count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact, which seems to be acceptable in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run often enough - where "enough" is defined by the level of write activity on your table.
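If in doubt how fresh those statistics are, you can check the timestamps in pg_stat_user_tables (a quick sketch; adjust schema and table name):
SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND relname = 'mytable';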
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
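For example, the same query with to_regclass() simply returns no row for a missing table instead of raising an exception:
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = to_regclass('myschema.mytable');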
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
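A sketch for a single-level partitioned table, summing the per-partition estimates via pg_inherits (the parent table name is a placeholder; multi-level partitioning would need recursion):
SELECT sum(c.reltuples)::bigint AS estimate
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'myschema.my_partitioned_table'::regclass;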
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even to know within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
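A variation on the same idea that avoids regex parsing (my own sketch, not part of the original approach): wrap EXPLAIN (FORMAT JSON) in a small plpgsql helper and read the planner's row estimate from the JSON plan. The function name and details are illustrative.
CREATE OR REPLACE FUNCTION row_estimate(_query text)
  RETURNS bigint
  LANGUAGE plpgsql AS
$func$
DECLARE
   _plan text;
BEGIN
   -- run EXPLAIN on the given query and capture the JSON plan
   EXECUTE 'EXPLAIN (FORMAT JSON) ' || _query INTO _plan;
   -- top-level plan node carries the planner's row estimate
   RETURN (_plan::json -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END
$func$;

-- usage: SELECT row_estimate('SELECT * FROM token');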
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column (see the sketch below).
Again, this is to decrease the workload (scan through a narrow column index)
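A minimal sketch of the hash-column option, assuming Postgres 12+ (generated columns) and the token.text column from your example query; md5() here is just one convenient immutable hash:
ALTER TABLE token ADD COLUMN text_hash text GENERATED ALWAYS AS (md5(text)) STORED;
CREATE INDEX token_text_hash_idx ON token (text_hash);

-- group on the narrow hash column instead of the wide text column
SELECT text_hash, count(*)
FROM token
GROUP BY text_hash
ORDER BY count(*) DESC;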
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is.
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN.
The query adds up the row counts from all partitions of a (possibly) partitioned table.
index_id = 0 is an unordered table (heap) and index_id = 1 is an ordered table (clustered index).
Even faster (but unreliable) methods are detailed here.
I'm trying to get a better understanding of the tradeoffs involved in creating Postgres indexes. As part of that, I'd love to understand how much space indexes usually use. I've read through the docs, but can't find any information on this. I've been doing my own little experiments creating tables and indexes, but it would be amazing if someone could offer an explanation of why the size is what it is. Assume a common table like this with 1M rows, where each row has a unique id and a unique outstanding.
CREATE TABLE account (
    id integer,
    active boolean NOT NULL,
    outstanding double precision NOT NULL
);
and the indexes created by:
CREATE INDEX id_idx ON account(id);
CREATE INDEX outstanding_idx ON account(outstanding);
CREATE INDEX id_outstanding_idx ON account(id, outstanding);
CREATE INDEX active_idx ON account(active);
CREATE INDEX partial_id_idx ON account(id) WHERE active;
What would you estimate the index sizes to be in bytes and more importantly, why?
Since you did not specify the index type, I'll assume default B-tree indexes. Other types can be a lot different.
Here is a simplistic function to compute the estimated minimum size in bytes for an index on the given table with the given columns:
CREATE OR REPLACE FUNCTION f_index_minimum_size(_tbl regclass, _cols VARIADIC text[], OUT estimated_minimum_size bigint)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _missing_column text;
BEGIN
   -- assert
   SELECT i.attname
   FROM   unnest(_cols) AS i(attname)
   LEFT   JOIN pg_catalog.pg_attribute a ON a.attname = i.attname
                                        AND a.attrelid = _tbl
   WHERE  a.attname IS NULL
   INTO   _missing_column;

   IF FOUND THEN
      RAISE EXCEPTION 'Table % has no column named %', _tbl, quote_ident(_missing_column);
   END IF;

   SELECT INTO estimated_minimum_size
          COALESCE(1 + ceil(reltuples/trunc((blocksize-page_overhead)/(4+tuple_size)))::int, 0) * blocksize -- AS estimated_minimum_size
   FROM  (
      SELECT maxalign, blocksize, reltuples, fillfactor, page_overhead
           , (maxalign -- up to 16 columns, else nullbitmap may force another maxalign step
              + CASE WHEN datawidth <= maxalign THEN maxalign
                     WHEN datawidth%maxalign = 0 THEN datawidth
                     ELSE (datawidth + maxalign) - datawidth%maxalign END -- add padding to the data to align on MAXALIGN
             ) AS tuple_size
      FROM  (
         SELECT c.reltuples, count(*)
              , 90 AS fillfactor
              , current_setting('block_size')::bigint AS blocksize
              , CASE WHEN version() ~ '64-bit|x86_64|ppc64|ia64|amd64|mingw32' -- MAXALIGN: 4 on 32bits, 8 on 64bits
                     THEN 8 ELSE 4 END AS maxalign
              , 40 AS page_overhead -- 24 bytes page header + 16 bytes "special space"
                -- avg data width without null values
              , sum(ceil((1-COALESCE(s.null_frac, 0)) * COALESCE(s.avg_width, 1024))::int) AS datawidth -- ceil() because avg width has a low bias
         FROM   pg_catalog.pg_class c
         JOIN   pg_catalog.pg_attribute a ON a.attrelid = c.oid
         JOIN   pg_catalog.pg_stats s ON s.schemaname = c.relnamespace::regnamespace::text
                                     AND s.tablename = c.relname
                                     AND s.attname = a.attname
         WHERE  c.oid = _tbl
         AND    a.attname = ANY(_cols) -- all exist, verified above
         GROUP  BY 1
         ) sub1
      ) sub2;
END
$func$;
Call examples:
SELECT f_index_minimum_size('my_table', 'col1', 'col2', 'col3');
SELECT f_index_minimum_size('public.my_table', VARIADIC '{col1, col2, col3}');
db<>fiddle here
About VARIADIC parameters:
Return rows matching elements of input array in plpgsql function
Basically, all indexes use data pages of typically 8 kb block size (rarely 4 kb). There is one data page overhead for B-tree indexes to start with. Each additional data page has a fixed overhead of 40 bytes (currently). Each page stores tuples like depicted in the manual here. Each tuple has a tuple header (typically 8 bytes incl. alignment padding), possibly a null bitmap, data (possibly incl. alignment padding between columns for multicolumn indices), and possibly alignment padding to the next multiple of MAXALIGN (typically 8 bytes). Plus, there is an ItemId of 4 bytes per tuple. Some space may be reserved initially for later additions with a fillfactor - 90 % by default for B-tree indexes.
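As a rough worked example of what that boils down to for id_idx from the question (1M rows, a single integer column), assuming a 64-bit build (MAXALIGN 8), the default 8 kb block size and ignoring the fillfactor - my own arithmetic, so treat it as a ballpark: the index tuple is 8 bytes header + 4 bytes data, padded to 16 bytes, plus a 4-byte ItemId, i.e. 20 bytes per entry.
SELECT (1 + ceil(1000000.0 / trunc((8192 - 40) / (4 + 16))))::bigint * 8192 AS estimated_minimum_bytes;
-- 20144128 bytes, i.e. roughly 19 MB minimum for id_idx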
Important notes & disclaimers
The reported size is the estimated minimum size. An actual index will typically be bigger by around 25 % due to natural bloat from page splits. Plus, the calculation does not take possible alignment padding between multiple columns into account. Can add another couple percent (or more in extreme cases). See:
Calculating and saving space in PostgreSQL
Estimations are based on column statistics in the view pg_stats which is based on the system table pg_statistic. (Using the latter directly would be faster, but only allowed for superusers.) In particular, the calculation is based on null_frac, the "fraction of column entries that are null" and avg_width, the "average width in bytes of column's entries" to compute an average data width - ignoring possible additional alignment padding for multicolumn indexes.
The default 90 % fillfactor is taken into account. (One might specify a different one.)
Up to 50 % bloat is typically natural for B-tree indexes and nothing to worry about.
Does not work for expression indexes.
No provision for partial indexes.
Function raises an exception if anything but existing plain column names is passed. Case-sensitive!
If the table is new (or in any case if statistics may be out of date), be sure to run ANALYZE on the table before calling the function to update (or even initiate!) statistics.
Due to major optimizations, B-tree indexes in Postgres 12 waste less space and are typically closer to the reported minimum size.
Does not account for deduplication that's introduced with Postgres 13, which can compact indexes with duplicate values.
Parts of the code are taken from ioguix' bloat estimation queries here:
https://github.com/ioguix/pgsql-bloat-estimation
More gory details in the Postgres source code here:
https://doxygen.postgresql.org/bufpage_8h_source.html
You can calculate it yourself. Each index entry has an overhead of 8 bytes. Add the average size of your indexed data (in the internal binary format).
There is some more overhead, like page header and footer and internal index pages, but that doesn't account for much, unless your index rows are very wide.
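For instance, to pull the average data width of the indexed columns from statistics for the account table in the question (assuming it lives in the public schema and has been ANALYZEd), and add the roughly 8 bytes of per-entry overhead on top:
SELECT attname,
       avg_width,
       avg_width + 8 AS rough_bytes_per_index_entry  -- data + ~8 bytes overhead per entry
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'account'
  AND attname IN ('id', 'outstanding', 'active');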
I have been trying to do an outer join across two different tables in two different schemas. I am trying to filter out beforehand, from the table variants, the values that are shorter than 4 or longer than 5 digits. The join was not working with a simple WHERE clause at the end, hence this decision.
The problem is that if I do not put the quotes, Snowflake says I am using invalid identifiers. However, when I run this with the quotes, it works, but every value in the raw.stitch_heroku.spree_variants.SKU column comes back as the column name itself, all across the table!
SELECT
analytics.dbt_lcasucci.product_category.product_description,
'raw.stitch_heroku.spree_variants.SKU'
FROM analytics.dbt_lcasucci.product_category
LEFT JOIN (
SELECT * FROM raw.stitch_heroku.spree_variants
WHERE LENGTH('raw.stitch_heroku.spree_variants.SKU')<=5
and LENGTH('raw.stitch_heroku.spree_variants.SKU')>=4
) ON 'analytics.dbt_lcasucci.product_category.product_id'
= 'raw.stitch_heroku.spree_variants.SKU'
Is there a way to work this around? I am confused and have not found this issue on forums yet!
Thanks in advance!
Firstly, single quotes define a string literal, 'this is text', whereas double quotes are for table/column names, "this_is_a_table_name".
Adding aliases to the tables makes the SQL more readable, and the duplicated LENGTH calls can be reduced with a BETWEEN, thus this should work better:
SELECT pc.product_description,
       sp.SKU
FROM analytics.dbt_lcasucci.product_category AS pc
LEFT JOIN (
    SELECT SKU
    FROM raw.stitch_heroku.spree_variants
    WHERE LENGTH(SKU) BETWEEN 4 AND 5
) AS sp
  ON pc.product_id = sp.SKU;
So I reduced the sub-select's results since you only use SKU from sp; but given you are comparing product_id to SKU, as your example stands you don't really need the join to sp at all.
The invalid identifiers error implies to me that something is named incorrectly. The first step there is to check that the tables exist, that the columns are named as you expect, and that the column types match for the JOIN x ON y clause, via:
describe table analytics.dbt_lcasucci.product_category;
describe table raw.stitch_heroku.spree_variants;
I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which have a significant number of their columns stored as numeric codes; these codes correspond to a lookup table with the real values. We are talking 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
Normally, I would just add a WHERE clause and match on equality, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some need DECODE statements that would have 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it with plain joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (may help the optimizer creating a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license admits, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option if your lookup table is fairly static is to cache the lookup table in the application. Read the TEST_TABLE from the database, and lookup descriptions in the application. Further improvements may be to add triggers that invalidate the cache when lookup table is modified.
If you don't want to do all these joins you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id; if not found, NULL.
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
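Usage would then look something like this (a sketch, assuming all coded columns resolve against the same test.lookup table):
SELECT TEST_ID,
       LOOKUP(TEST_STATUS)    AS TEST_STATUS,
       LOOKUP(ANOTHER_STATUS) AS ANOTHER_STATUS
FROM TEST_TABLE;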
I am converting the following T-SQL statement to Redshift. The purpose of the query is to convert a column containing a comma-delimited string of up to 60 values into multiple rows with one value per row.
SELECT
id_1
, id_2
, value
into dbo.myResultsTable
FROM myTable
CROSS APPLY STRING_SPLIT([comma_delimited_string], ',')
WHERE [comma_delimited_string] is not null;
In SQL Server this processes 10 million records in just under 1 hour, which is fine for my purposes. Obviously a direct conversion to Redshift isn't possible, since Redshift has no CROSS APPLY or STRING_SPLIT functionality, so I built a solution using the process detailed here (Redshift. Convert comma delimited values into rows), which utilizes split_part() to split the comma delimited string into multiple columns, plus another query that unions everything to get the final output into a single column. But the typical run takes over 6 hours to process the same amount of data.
I wasn't expecting to run into this issue just knowing the power difference between the machines. The SQL Server I was using for the comparison test was a simple server with 12 processors and 32 GB of RAM while the Redshift server is based on the dc1.8xlarge nodes (I don't know the total count). The instance is shared with other teams but when I look at the performance information there are plenty of available resources.
I'm relatively new to Redshift, so I assume I'm not understanding something, but I have no idea what I'm missing. Are there things I need to check to make sure the data is loaded in an optimal way (I'm not an admin, so my ability to check this is limited)? Are there other Redshift query options that are better than the example I found? I've searched for other methods and optimizations, but short of looking into cross joins - something I'd like to avoid (plus, when I tried to talk to the DBAs running the Redshift cluster about this option their response was a flat "No, can't do that.") - I'm not even sure where to go at this point, so any help would be much appreciated!
Thanks!
I've found a solution that works for me.
You need to do a JOIN on a numbers table, for which you can take any table as long as it has more rows than the number you need. You need to make sure the numbers are int by forcing the type. Using the function regexp_count on the column to be split in the ON clause to count the number of fields (delimiters + 1) will generate a row per repetition.
Then you use the split_part function on the column, using the numbers.num column to extract a different part of the text for each of those rows.
SELECT comma_delimited_string,
       numbers.num,
       REGEXP_COUNT(comma_delimited_string, ',') + 1 AS nfields,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS field
FROM mytable
JOIN (
    SELECT (row_number() OVER (ORDER BY 1))::int AS num
    FROM mytable
    LIMIT 15 -- max num of fields
) AS numbers
  ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
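To mirror the original SELECT ... INTO and materialize the result, a sketch using the column names from your T-SQL (LIMIT 60 matches the stated maximum of 60 values per string):
CREATE TABLE myResultsTable AS
SELECT id_1,
       id_2,
       SPLIT_PART(comma_delimited_string, ',', numbers.num) AS value
FROM myTable
JOIN (
    SELECT (row_number() OVER (ORDER BY 1))::int AS num
    FROM myTable
    LIMIT 60  -- max number of values per string
) AS numbers
  ON numbers.num <= REGEXP_COUNT(comma_delimited_string, ',') + 1
WHERE comma_delimited_string IS NOT NULL;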