How to calculate Row size of table in db2 - db2

We have to do some analysis on some tables for that we have to find out the maximum possible row size of each table in db2 db.
Please let us know..

Check out the Db2 documentation for CREATE TABLE. It contains the lengthy formula to compute the row size for a table. It depends on many attributes like
the type of table,
the column data types,
if they allow NULL,
if value compression is enabled,
...
The maximum possible row size depends on the page size, but there is also a column count limit.
If you don't need it precisely, you can sum up the byte count for each column data type in your table, add some extrac bytes. Then, make sure it is below 1/4 of the page size.

Related

What is space partitioning and dimensions in TimesclaleDB

I am new to the timescale database. I was learning about chunks and how to create chunks based on time.
But there is another time/space chunking which is confusing me a lot. Please help me with below queries.
What is a dimension in a timescale DB?
What is space chunking and how it works?
Thanks in advance.
A dimension in TimescaleDB is associated with a column. Each hypertable requires to define at least a time dimension, which is a time column for the time series. Then a hypertable is divided into chunks, where each chunk contains data for a time interval of the time dimension. As result all new data usually arrives into the latets chunk, while other chunks contain older data.
Then, it is possible to define space dimensions on other columns, for example device column or/and location column. No interval is defined for space dimensions, instead a number of partitions is defined. So for the same time interval, several chunks will be created, which is equivalent to the number of partitions. Data are distributed by a hashing function on the values of the space dimension. For example, if 3 partitions are defined for a space dimension on device column and 12 different device values were present in the data, each space chunk will contain 4 different values with a hash function uniformly distributing the values.
Space dimensions are specifically useful for parallel I/O, when data are stored on several disks. Another scenario is multinode, i.e., distributed version of hypertable (beta feature, which coming to release in 2.0).
There are some complex usage cases when space partitioning will be also helpful.
You can read more in add_dimension docs, cloud KB about space partitioning
A note in the doc:
Supporting more than one additional dimension is currently experimental.

What is a wide column in Postgres?

I've seen the following terms used:
wide field – e.g. "the TOAST mechanism first attempts to compress any wide field values"
wide column – e.g. "If you have many wide columns" or "even complete toasting doesn't allow a row with more than about 450 wide columns".
But I haven't seen them defined.
I understand Posgres has a page/block size limit (typically of 8192 bytes) and that it will rely on TOAST when the data doesn't fit within the limit. But, as I understand it, this is based on the size of the row, not on the size of any one column. So I understand how one could say a row to be wide… but a particular column? (But maybe I'm taking this too literally.)
In this context, what's the threshold for considering a column to be wide?

Create view or table

I have a table that will have about 3 * 10 ^ 12 lines (3 trillion), but with only 3 attributes.
In this table you will have the IDs of 2 individuals and the similarity between them (it is a number between 0 and 1 that I multiplied by 100 and put as a smallint to decrease the space).
It turns out that I need to perform, for a certain individual that I want to do the research, the summarization of these columns and returning how many individuals have up to 10% similarity, 20%, 30%. These values ​​are fixed (every 10) until identical individuals (100%).
However, as you may know, the query will be very slow, so I thought about:
Create a new table to save summarized values
Create a VIEW to save these values.
As individuals are about 1.7 million, the search would not be so time consuming (if indexed, returns quite fast). So, what can I do?
I would like to point out that my population will be almost fixed (after the DB is fully populated, it is expected that almost no increase will be made).
A view won't help, but a materialized view sounds like it would fit the bill, if you can afford a sequential scan of the large table whenever the materialized view gets updated.
It should probably contain a row per user with a count for each percentile range.
Alternatively, you could store the aggregated data in an independent table that is updated by a trigger on the large table whenever something changes there.

Best way to store data : Many columns vs many rows for a case of 10,000 new rows a day

after checking a lot of similar questions on stackoverflow, it seems that context will tell which way is the best to hold the data...
Short story, I add over 10,000 new rows of data in a very simple table containing only 3 columns. I will NEVER update the rows, only doing selects, grouping and making averages. I'm looking for the best way of storing this data to make the average calculations as fast as possible.
To put you in context, I'm analyzing a recorded audio file (Pink Noise playback in a sound mixing studio) using FFTs. The results for a single audio file is always in the same format: The frequency bin's ID (integer) and its value in decibels (float value). I'm want to store these values in a PostgreSQL DB.
Each bin (band) of frequencies (width = 8Hz) gets an amplitude in decibels. The first bin is ignored, so it goes like this (not actual dB values):
bin 1: 8Hz-16Hz, -85.0dB
bin 2: 16Hz-32Hz, -73.0dB
bin 3: 32Hz-40Hz, -65.0dB
...
bin 2499: 20,000Hz-20,008Hz, -49.0dB
The goal is to store an amplitude of each bin from 8Hz through 20,008Hz (1 bin covers 8Hz).
Many rows approach
For each analyzed audio file, there would be 2,499 rows of 3 columns: "Analysis UID", "Bin ID" and "dB".
For each studio (4), there is one recording daily that is to be appended in the database (that's 4 times 2,499 = 9,996 new rows per day).
After a recording in one studio, the new 2,499 rows are used to show a plot of the frequency response.
My concern is that we also need to make a plot of the averaged dB values of every bin in a single studio for 5-30 days, to see if the frequency response tends to change significantly over time (thus telling us that a calibration is needed in a studio).
I came up with the following data structure for the many rows approach:
"analysis" table:
analysisUID (serial)
studioUID (Foreign key)
analysisTimestamp
"analysis_results" table:
analysisUID (Foreign key)
freq_bin_id (integer)
amplitude_dB (float)
Is this the optimal way of storing data? A single table holding close to 10,000 new rows a day and making averages of 5 or more analysis, grouping by analysisUIDs and freq_bin_ids? That would give me 2,499 rows (each corresponding to a bin and giving me the averaged dB value).
Many columns approach:
I thought I could do it the other way around, breaking the frequency bins in 4 tables (Low, Med Low, Med High, High). Since Postgres documentation says the column limit is "250 - 1600 depending on column types", it would be realistic to make 4 tables containing around 625 columns (2,499 / 4) each representing a bin and containing the "dB" value, like so:
"low" table:
analysisUID (Foreign key)
freq_bin_id_1_amplitude_dB (float)
freq_bin_id_2_amplitude_dB (float)
...
freq_bin_id_625_amplitude_dB (float)
"med_low" table:
analysisUID (Foreign key)
freq_bin_id_626_amplitude_dB (float)
freq_bin_id_627_amplitude_dB (float)
...
freq_bin_id_1250_amplitude_dB (float)
etc...
Would the averages be computed faster if the server only has to Group by analysisUIDs and make averages of each column?
Rows are not going to be an issue, however, the way in which you insert said rows could be. If insert time is one of the primary concerns, then make sure you can bulk insert them OR go for a format with fewer rows.
You can potentially store all the data in a jsonb format, especially since you will not be doing any updates to the data-- it may be convenient to store it all in one table at a time, however the performance may be less.
In any case, since you're not updating the data, the (usually default) fillfactor of 100 is appropriate.
I would NOT use the "many column" approach, as the
amount of data you're talking about really isn't that much. Using your first example of 2 tables and few columns is very likely the optimal way to do your results.
It may be useful to index the following columns:
analysis_results.freq_bin_id
analysis.analysisTimestamp
As to breaking the data into different sections, it'll depend on what types of queries you're running. If you're looking at ALL freq bins, using multiple tables will just be a hassle and net you nothing.
If only querying at some freq_bin's at a time, it could theoretically help, however, you're basically doing table partitions and once you've moved into that land, you might as well make a partition for each frequency band.
If I were you, I'd create your first table structure, fill it with 30 days worth of data and query away. You may (as we often do) be overanalyzing the situation. Postgres can be very, very fast.
Remember, the raw data you're analyzing is something on the order of a few (5 or less) meg per day at an absolute maximum. Analyzing 150 mb of data is no sweat for a DB running with modern hardware if it's indexed and stored properly.
The optimizer is going to find the correct rows in the "smaller" table really, really fast and likely cache all of those, then go looking for the child rows, and it'll know exactly what ID's and ranges to search for. If your data is all inserted in chronological order, there's a good chance it'll read it all in very few reads with very few seeks.
My main concern is with the insert speed, as a doing 10,000 inserts can take a while if you're not doing bulk inserts.
Since the measurements seem well behaved, you could use an array, using the freq_bin as an index (Note: indices are 1-based in sql)
This has the additional advantage of the aray being stored in toasted storage, keeping the fysical table small.
CREATE TABLE herrie
( analysisUID serial NOT NULL PRIMARY KEY
, studioUID INTEGER NOT NULL REFERENCES studio(studioUID)
, analysisTimestamp TIMESTAMP NOT NULL
, decibels float[] -- array with 625 measurements
, UNIQUE (studioUID,analysisTimestamp)
);

Maximum data size for table per row

According to documentation
Maximum Columns per Table 250 - 1600 depending on column types
Ok, but if have less than 250 columns, but several they contains really big data, many text type columns and many array type columns (with many elements), then there is any limit?
Question is: there is any size limit for per row? (sum of all columns content).