Does varchar's length have any effect on performance - db2

We are having discussions with our development staff over the use of VARCHAR columns, as they define every varchar field as varchar(255), varchar(500), ... much bigger than the maximum length of the actual data.
Does varchar's length have any effect on performance in Db2? We have found the recommendation to use CHAR instead of VARCHAR for columns of 30 bytes or less, and our concern is about varchar fields that are longer than 30 bytes.

Allowing excessive column length is not a good idea. If you allow, let's say, a FirstName column to have a maximum length of 500, you may eventually find quite a long irrelevant story in there, because why not, if it's allowed :)
As for performance implications: the main problem arises if extended row size is turned on for the database (you simply can't create too "wide" a table otherwise) and the total length of a row exceeds the tablespace page size. In that case some varchar column value is moved off the data page, and more I/O will be needed to access such a row later. Keep this behavior in mind; the probability of it happening is higher when varchar column lengths are left uncontrolled.
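For illustration, here is a minimal sketch (table, column, and tablespace names are made up) of a declared row width that exceeds a 4 KB page; the CREATE TABLE only succeeds when the extended_row_sz database configuration parameter is set to ENABLE, and rows whose varchar data does not fit on the page are then partly stored out of row:

-- Two VARCHAR(2500) columns declare roughly 5000 bytes, more than a 4 KB page holds.
-- With extended_row_sz = ENABLE this works; oversized values spill off the data page
-- and cost extra I/O to read back.
CREATE TABLE wide_demo (
    id    INTEGER NOT NULL,
    note1 VARCHAR(2500),
    note2 VARCHAR(2500)
) IN tbsp_4k;   -- tbsp_4k stands in for some tablespace with 4 KB pages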

This can have a performance hit with ORGANIZE BY COLUMN tables. There is a limit on the total declared width that can be processed within the Columnar Data Engine; if this limit is breached in a query plan, the remainder of the query is processed in the Row Data Engine.
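One way to check whether a given query falls back is to capture its access plan and look for the CTQ operator, which marks the transition from column-organized to row-organized processing. A rough sketch (database, table, and file names are placeholders; the explain tables must already exist):

SET CURRENT EXPLAIN MODE EXPLAIN;
SELECT SUM(amount) FROM my_col_table GROUP BY region;   -- the query to inspect (placeholder)
SET CURRENT EXPLAIN MODE NO;
-- Format the captured plan from the command line, for example:
--   db2exfmt -d MYDB -1 -o plan.txt
-- Operators sitting above the CTQ in the plan run in the Row Data Engine.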

Related

Scalable approach to calculating balances over thousands of records in PostgreSQL

I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.
The rows in the table have many different columns, each useful in calculating that row's contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for a profile from all entries belonging to that profile, according to some set of rules. Complexity should be O(N), N being the number of entries that belong to the profile.
The different approaches I took:
Fetching the rows and calculating the balances on the backend. This doesn't scale well and degrades quickly with the number of entries that belong to the profile. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow.
I figured that a lot of time was being spent on transport; additionally, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we can calculate the balances there and save both the backend calculations and the transport of thousands of rows, which led to the second approach:
The second approach was creating a PG query that iterates over the rows and calculates the balance. This has proven to be more scalable when there are many entries per profile. However, probably due to the complexity of the PG query, this approach puts a lot of load on the database: it's enough to run 3-4 of these queries concurrently to max out the database CPU.
The third approach is to create a PL/pgSQL function that loops over the relevant entries and returns the resulting balances, hoping to reduce the impact on the database. This is the next thing I want to try.
Main question is - what would be the most efficient way to achieve this while being 'database friendly'?
Additionally:
Do you think these approaches are sane?
Am I missing another obvious solution?
Is it unlikely that I'll improve on the performance of the query with the help of a function looping over the same rows as the query, or is it worth trying?
I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.
EDIT:
To be a bit more specific, I'm dealing with the following data:
CREATE TABLE entries (
    profileid bigint NOT NULL,
    programid bigint NOT NULL,
    ledgerid  text   NOT NULL,  -- this provides further granularity, on top of 'programid'
    startdate timestamptz,
    enddate   timestamptz,
    amount    numeric NOT NULL
);
What I want to get is the balances for a certain profileid, separated by (programid, ledgerid).
The desired form is:
RETURNS TABLE (
    programid      bigint,
    ledgerid       text,
    currentbalance numeric,
    pendingbalance numeric,
    expiredbalance numeric,
    spentbalance   numeric
)
The four balance values are produced by applying arithmetic to certain entries. For example, a negative amount only adds to spentbalance, the expired balance is generated from entries that have a positive amount and an enddate after now(), etc.
While I did manage to create a very large aggregate query with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering whether I'd gain anything by porting that logic into a PL/pgSQL function. However, when trying to implement such a function I realized I don't know how to iterate over the result of one query and return a result set with different columns and rows. Should I use a temp table for this? That seems like overkill, as this query is expected to execute tens of times every second...
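Roughly, the structure of that aggregate looks like the sketch below (the CASE conditions and the sign convention are simplified placeholders, not the real business rules):

-- One pass over the profile's entries, one result row per (programid, ledgerid).
SELECT e.programid,
       e.ledgerid,
       COALESCE(SUM(CASE WHEN e.amount > 0 /* ...rule for "current"... */ THEN e.amount END), 0) AS currentbalance,
       COALESCE(SUM(CASE WHEN e.amount > 0 /* ...rule for "pending"... */ THEN e.amount END), 0) AS pendingbalance,
       COALESCE(SUM(CASE WHEN e.amount > 0 /* ...rule for "expired"... */ THEN e.amount END), 0) AS expiredbalance,
       COALESCE(SUM(CASE WHEN e.amount < 0 /* ...rule for "spent"...   */ THEN -e.amount END), 0) AS spentbalance
FROM entries AS e
WHERE e.profileid = 12345        -- the profile being queried
GROUP BY e.programid, e.ledgerid;

An index whose leading column is profileid keeps the scan limited to the profile's own entries, so the work stays proportional to that profile's entry count.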

Does it waste space to define a column in IBM Db2 on Cloud with a longer VARCHAR() than required?

We often have columns that can contain values of varying sizes. For these, I like to set the data type to VARCHAR with a size well beyond the current maximum length. For example, if I have a column where the current minimum length of a value is 10 and the maximum length is 35, I might set the data type to VARCHAR(64). My rationale is that Db2 stores a 2-byte length followed by the exact value, so from a storage perspective there is no difference between defining the data type as VARCHAR(64) and VARCHAR(35). And I don't get an error if a value with a length of 36 comes along.
Is there a nuance that I'm missing and should I not be so glib about my VARCHAR assignments?
The exact formula to calculate row length is described in the docs for CREATE TABLE. VARCHAR(64) or VARCHAR(35) should not make a difference.
Be aware that rows are stored in data pages in tablespaces. Database systems usually pre-allocate pages for performance reasons. Moreover, pages might not be fully filled, or there may be compression. You might also have defined indexes, which require their own pages and structures. Plus there is metadata in the system catalog.
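If you want to verify this empirically, one rough sketch (schema and table names are placeholders) is to compare the declared lengths in the catalog with the physically allocated size reported by the ADMIN_GET_TAB_INFO table function; widening a VARCHAR's declared length alone should leave the latter unchanged:

-- Declared lengths are catalog metadata only:
SELECT colname, typename, length
FROM syscat.columns
WHERE tabschema = 'MYSCHEMA' AND tabname = 'MYTABLE';

-- Physically allocated data size in KB:
SELECT data_object_p_size
FROM TABLE(SYSPROC.ADMIN_GET_TAB_INFO('MYSCHEMA', 'MYTABLE'));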

What's the impact of TOAST on performance? (adding a hundred varchar columns)

Consider a table with the following data:
id bigint Auto Increment
name character varying(255) NULL
category character varying(255) NULL
english character varying(255) NULL
french character varying(255) NULL
pivot character varying(255) NULL
credits character varying(255) NULL
hash character varying(20) NULL
The english column contains data of the following size (in bytes): max 116, min 5, average 42, median: 40.
The number of rows in the table is around 30,000 and will hardly change.
The new 107 columns will be translations of the English.
Will adding 107 columns hurt performance?
The Postgres site says the maximum number of columns on a Postgres table is
250-1600 depending on column types
and
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page
Will the data fall under that limit?
Size of the largest row
What is the actual storage size of the table's rows? pg_column_size is the
Number of bytes used to store a particular value (possibly compressed)
SELECT id, pg_column_size(t.*) FROM my_table AS t ORDER BY pg_column_size DESC;  -- my_table stands for the actual table name
-- Some stats derived from the query:
-- Min 87 bytes
-- Max 514 bytes
-- Average 216 bytes
-- Median: 209 bytes
But no compression is actually happening here, because:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Compression would start to kick in once the table gets bigger and those new columns are added.
It's unclear to me what the compression ratio would be for such data?
I wonder how effective it'll be on lots of short multilingual sentences. Also, I tried to find the exact name of the compression algorithm used by Postgres: the docs say "the LZ family of compression techniques", but which one – LZ77? LZ78? A twist on one of them?
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
TOAST'ed?
If the size of a row goes beyond the page size limit, then Postgres will rely on TOAST not just to compress but also to split off data and store it "out of line".
I understand this will increase fetch times for those rows that don't fit… But what's the impact of TOAST on performance? Is it negligible for such a use case?
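One way to see whether TOAST is in play at all, rather than guessing, is to compare the heap size with the size of the table's TOAST relation (my_table is a placeholder for the real table name):

-- toast_size is NULL when the table has no TOAST table at all.
SELECT pg_size_pretty(pg_relation_size('my_table'))    AS heap_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'my_table';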
Bottom-line
At the end of the day…
Is adding those 107 columns a good idea, or should I use a different approach?
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
Or am I approaching this the wrong way, i.e. is it a case of premature optimization where I'd have been better off just adding the columns and only investigating later if faced with problems?
Using Postgres 9.6. Upgrading is an option if needed.
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
I'd just copy the English version into each of the 107 columns. That should be good enough to get some useful findings. You might worry that the repetition would cause the compression to be idiosyncratic, but each value is compressed in isolation, so it won't "know" it is identical to some other value.
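For example, something along these lines (french/german/spanish stand in for the real 107 columns):

-- Size before the experiment, including TOAST and indexes.
SELECT pg_size_pretty(pg_total_relation_size('my_table'));

-- Fill a few of the new translation columns with the English text.
UPDATE my_table SET french = english, german = english, spanish = english;

-- VACUUM FULL first so dead tuples from the UPDATE don't distort the comparison,
-- then measure again.
VACUUM FULL my_table;
SELECT pg_size_pretty(pg_total_relation_size('my_table'));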
It's unclear to me what the compression ratio would be for such data?
Not very much. For example, the paragraph of yours I quoted first doesn't get any benefit from compression (when I copied it into 107 other columns). Short segments of ordinary text do not have enough repetition in them to be very compressible. Translating them to other languages is unlikely to change this.
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
This question has a very clear answer. You should absolutely select only what you need. Assembling a row from 100+ toasted columns, just to throw most of them away, will slow you down.
I don't know if this falls under "premature optimization" so much as under poor design. One way or another you will need some method of knowing which of the 108 versions you need. And what happens when you need to add the 108th translation, or delete, say, the 93rd? So use this information to form a key to a translation table, something like Translation_Test (for_ref_in bigint, language text, translation text), and access the necessary text (including, perhaps, the English version) from that table.
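A rough sketch of that layout (assuming the source table's id column is the reference key):

-- One row per (source row, language) instead of one column per language.
CREATE TABLE translation_test (
    for_ref_in  bigint NOT NULL,   -- references the id of the source row
    language    text   NOT NULL,
    translation text   NOT NULL,
    PRIMARY KEY (for_ref_in, language)
);

-- Fetch only the language the user actually needs:
SELECT t.translation
FROM translation_test AS t
WHERE t.for_ref_in = 42 AND t.language = 'french';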

Get size of all columns on a DB2 table

I have been asked to determine how much data our application uses and how fast it is growing. The problem is many applications share the same database and tables with a column being used to determine which application the data belongs to. It is a DB2 database.
Is there any way to find the size in bytes of all the columns a table uses for a given row? It is important that I select only those rows that belong to my application.
If a column is not nullable I do not include it in the SQL; I just multiply its size by the row count. I am primarily trying to determine the average size of the nullable and variable-size columns (we use VARCHAR and BLOB).
At the moment what I am doing looks something like this:
SELECT VALUE(LENGTH(COLUMN_1), 0) AS LEN_COL_1   -- repeated for each variable-size column
FROM MY_TABLE T                                  -- MY_TABLE stands for the shared table
WHERE T.APP_ID = 'my app'                        -- the identifier of my application
The most accurate way to determine size would be to look at the sizes of the files that make up the DB2 tables.
Prorate the file sizes by the percentage of rows that belong to your application.
This way, you count most of DB2's overhead size, including indexes.
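A hedged sketch of that proration in SQL (schema, table, and the APP_ID literal are placeholders), using the ADMIN_GET_TAB_INFO table function to get the allocated sizes in KB:

-- Allocated size of the data, index and LOB objects, prorated by the share of
-- rows that belong to this application. The result is only an estimate.
SELECT (i.data_object_p_size + i.index_object_p_size + i.lob_object_p_size)
       * DOUBLE(a.app_rows) / a.all_rows AS estimated_kb_for_app
FROM TABLE(SYSPROC.ADMIN_GET_TAB_INFO('MYSCHEMA', 'MYTABLE')) AS i,
     (SELECT COUNT(*) AS all_rows,
             SUM(CASE WHEN app_id = 'my app' THEN 1 ELSE 0 END) AS app_rows
      FROM myschema.mytable) AS a;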

How to predict PostgreSQL index size

Is it possible to predict the amount of disk space/memory that will be used by a basic index in PostgreSQL 9.0?
E.g. If I have a default B-tree index on an integer column in a table of 1 million rows, how much space would be taken up by the index? Is the entire index held in memory at all times?
Not really a definitive answer, but I looked at a table in a 9.0 test system I have, with a couple of int indexes on a table of 280k rows. The indexes all report a size of 6232 kB, so roughly 22 bytes per row.
There is no way to say that in general. It depends on the type of operations you will perform, as PostgreSQL stores many different versions of the same row, including row versions referenced from index files.
Just create the table you are interested in and check it.
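In the same spirit, a quick throwaway test (table and index names are arbitrary) gives a concrete number for a B-tree over 1 million integers:

-- Build a disposable table, index it, and measure the index.
CREATE TABLE idx_size_test AS
SELECT g AS val FROM generate_series(1, 1000000) AS g;

CREATE INDEX idx_size_test_val ON idx_size_test (val);

SELECT pg_size_pretty(pg_relation_size('idx_size_test_val')) AS index_size;

DROP TABLE idx_size_test;   -- clean up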