TSQL Optimal / Acceptable table column count - tsql

What is the optimal / acceptable column count when you are designing a table that serves as a datastore for huge forms to be filled?

I don't think there is any optimal number of columns, as such.
Obviously you need to consider the maximum width of a row, which is just over 8000 bytes.
Consider the way you're going to be handling the data. Will most of the columns be used all the time, or are there groups of columns that are sometimes unused (e.g., on the form, questions 8-12 are only used when the answer to question 7 is "Yes"). In those cases, it might make sense to put the optional questions into a separate table.
Making the rows wider means that your data pages, and hence your clustered key (if any) will take up more space; hence you can fit fewer records onto a page, and that means you'll have to do more IO operations when data is retrieved from the table. Thus, if there are logical ways to partition the data so that the rows are narrower, you may improve your read access speeds.
You might even consider going one step farther in normalization, so that a single form entry has a header row in a parent table, and then each question in the form gets its own row in a child table.

Related

How is the performance of Postgres / TimescaleDB unique composite keys?

To make it short:
I have a TimescaleDB table with multiple rows:
from
to
code
point
...
There are a few more rows.
I need to keep the rows unique so that the combination of the above rows (from, to, code, point) is unique. The rows can have multiple identical entries, but the combination must be unique.
How would that affect the performance? There would be relative many insertions (a few hundred insertions per minute).
EDIT:
The project is still relatively new so we can make changes to the database. There are alternatives that I can consider. I also do not know if there will only be a few hundred insertions per minute or if the volume will scale with time. I know the question leaves space for debates and I am grateful for every answer, but I asked how composite keys affect performance so I can make an informed decision.

PostgreSQL varchar length performance impact

I have a table in PostgreSQL which is under heavy load (reads). It is practically core table of an application. One column is used as a discriminator - column used by application, that determines type of entity (class) that represents given row. It has to be exactly one varchar column. Currently I do store full class name in it, like: "bank_statement_transaction".
When application selects all bank statement transactions, query is built like ... WHERE Discriminator = 'bank_statement_transaction' . This brings more readability and clarity to data, structure and code.
Table contains currently 3M rows and counting, approximately 100k new rows monthly. Discriminator was indexed during some performance tunings. I don't have any performance issues right now.
I am working on a new feature that requires some little refactoring and yeah I had an idea to change full class name (bank_statement_transaction) to short unique codes (BST)
I replicated dbo and changed full class name to code. With 3M rows, performance gain is barely measurable, same or 1-2 milliseconds faster.
Can anyone share experience with VARCHAR length impact on INDEX size and performance? On bigger data set? Is this change worth of it?
If you index strings, the index will become larger if the strings are long. The fan-out will be less, so the index will become deeper.
With an index scan that searches for a few rows, this won't be noticable: reading a few blocks more and running comparisons on longer strings may be lost in the noise for any but the simplest queries. Still, you'll be faster with smaller strings.
Maybe the most noticeable effect will be that a smaller index needs less RAM for caching, so the number of disk reads should go down.

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the followings fields:
id_variable[int](PK),
id_departamento[int](PK),
fecha [date](PK),
valor [number]`.
So, I want to get the minimum, maximum and the average of valor grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query?, should I create a new index only by id_variable or the default index is valid in this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there was, say, 50 columns, or TEXTs which make the table file huge. In this case reading the whole table just to average a few int's would need to grind in tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one column or two out of a huge table, and keep the small column set in cache. But this is not the case here, you only got small columns, so this reason is out.
...and of course slightly slower UPDATEs since the index needs to be updated. Also, the index needs to be cached, its gonna use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it avoids a hash-aggregate, which super fast anyway, not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash-aggregate, which can be a sizable amount, depends on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.

Redshift and ultra wide tables

In attempt to handle custom fields for specific objects in multi-tenant dimensional DW I created ultra wide denormalized dimension table (hundreds of columns, hard coded limit of column) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even innocent update query on single column on handful of records takes approximately 20 seconds. (Which is kind of surprising as I would guess it shouldn't be such a problem on columnar database.)
Any pointer how to modify design for better reporting from normalized source table (one user has multiple different attributes, one attribute is one line) to denormalized (one row per user with generic columns, different for each of the tenants)?
Or anyone tried to perform transposing (pivoting) of normalized records into denormalized view (table) in Redshift? I am worried about performance.
Probably important to think about how Redshift stores data and then implements updates on that data.
Each column is stored in it's own sequence of 1MB blocks and the content of those blocks is determined by the SORTKEY. So, how ever many rows of the sort key's values can fit in 1MB is how many (and which) values are in corresponding 1MB for all other columns.
When you ask Redshift to UPDATE a row it actually writes a new version of the entire block for all columns that correspond to that row - not just the block(s) which change. If you have 1,600 columns that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.

Don't more than a few dozen partitions make sense?

I store time-series simulation results in PostgreSQL.
The db schema is like this.
table SimulationInfo (
simulation_id integer primary key,
simulation_property1,
simulation_property2,
....
)
table SimulationResult ( // The size of one row would be around 100 bytes
simulation_id integer,
res_date Date,
res_value1,
res_value2,
...
res_value9,
primary key (simulation_id, res_date)
)
I usually query data based on simulation_id and res_date.
I partitioned the SimulationResult table into 200 sub-tables based on the range value of simulation_id. A fully filled sub table has 10 ~ 15 millions rows. Currently about 70 sub-tables are fully filled, and the database size is more than 100 gb. The total 200 sub tables would be filled soon, and when it happens, I need to add more sub tables.
But I read this answers, which says more than a few dozen partitions does not make sense. So my questions are like below.
more than a few dozen partitions not make sense? why?
I checked the execution plan on my 200 sub-tables, and it scan only the relevant sub-table. So i guessed more partitions with smaller each sub-table must be better.
if number of partitions should be limited, like 50, then is it no problem to have billions rows in one table? How big one table can be without big problem given the schema like mine?
It's probably unwise to have that many partitions, yes. The main reason to have partitions at all is not to make indexed queries faster (which they are not, for the most part), but to improve performance for queries that have to sequentially scan the table based on constraints that can be proved to not hold for some of the partitions; and to improve maintenance operations (like vacuum, or deleting large batches of old data which can be achieved by truncating a partition in certain setups, and such).
Maybe instead of using ranges of simulation_id (which means you need more and more partitions all the time), you could partition using a hash of it. That way all partitions grow at a similar rate, and there's a fixed number of partitions.
The problem with too many partitions is that the system is not prepared to deal with locking too many objects, for example. Maybe 200 work fine, but it won't scale well when you reach a thousand and beyond (which doesn't sound that unlikely given your description).
There's no problem with having billions of rows per partition.
All that said, there are obviously particular concerns that apply to each scenario. It all depends on the queries you're going to run, and what you plan to do with the data long-term (i.e. are you going to keep it all, archive it, delete the oldest, ...?)