There is a rule I once heard that, when assigning storage sizes to char and varchar, instead of using the usual 4, 8, 16, 32 progression you should actually use 3, 7, 15, 31. Apparently it has something to do with optimizing the space in which the value is stored.
Does anyone know if there is any validity to this, or is there a better way of assigning sizes to char and varchar in PostgreSQL? Also, is this rule specific to PostgreSQL, or is it something to keep in mind in all SQL databases?
You're mis-remembering something that applies at a much lower level.
Strings in the "C" language are terminated by a zero-byte. So: "hello" would traditionally take six bytes. Of course, that was back when everyone assumed a single character would fit neatly into a single byte. Not the case any more.
The other (main) way to store strings is to have a length stored at the front, followed by the characters. As it happens, that is what PostgreSQL does, and it even has an optimisation so the length header doesn't take up as much space for short strings.
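If you want to see that length-header behaviour from SQL rather than take it on faith, here is a minimal sketch using PostgreSQL's built-in pg_column_size() function (the temp table is purely for illustration, and the exact numbers vary a little by version and by whether the value has been compressed or toasted):

-- Scratch table purely to look at stored value sizes.
CREATE TEMP TABLE string_sizes (s text);
INSERT INTO string_sizes VALUES ('hello'), (repeat('x', 200));

-- pg_column_size() reports the stored size of each value:
-- short strings pay roughly 1 byte of length overhead,
-- longer ones roughly 4 bytes (before any compression).
SELECT left(s, 10) AS prefix, pg_column_size(s) FROM string_sizes;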
There are also separate issues where memory access is cheaper/easier at 2/4/8 byte boundaries (depending on the age of the machine) and memory allocation can be more efficient in powers of 2 (1024, 2048, 4096 bytes).
For PostgreSQL (or any of the major scripting languages / Java) just worry about representing your data accurately. About 99% of the time fiddly low-level optimisation is irrelevant. Actually, even if you are writing in "C", don't worry about it there until you need to.
This may be a silly question, but I was wondering whether I should use VARCHAR(n) or TEXT for a column that holds English phrases. The reason I'm not sure is that I don't know the maximum length; some phrases can contain 15 words or more. I suppose VARCHAR(500) would work, but I was also thinking about the worst-case scenario. I read that there is no performance difference between TEXT and VARCHAR(n) in PostgreSQL. Should I go for TEXT in this case?
TEXT has no length limit, so it would be a correct choice, and you're right about the performance. From the Postgres documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs and slower sorting. In most situations text or character varying should be used instead.
But be aware that TEXT isn't standard SQL.
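If you do go with TEXT, here is a minimal sketch of both options (the table and column names are made up). The upside of TEXT is no arbitrary cap; if you later decide you want a hard limit anyway, a CHECK constraint on the TEXT column gives you one without changing the type:

-- Option A: plain TEXT, no arbitrary cap
CREATE TABLE phrases (phrase text NOT NULL);

-- Option B: VARCHAR with a generous cap, if you really want a hard limit
-- CREATE TABLE phrases (phrase varchar(500) NOT NULL);

-- A length limit can also be added to the TEXT column later:
-- ALTER TABLE phrases ADD CONSTRAINT phrase_len CHECK (length(phrase) <= 500);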
What is a good Amazon Redshift column encoding for a VARCHAR column where each row contains a short (usually 50-100 characters) value that contains little repetition, but for which there is a high degree of similarity across the rows? (Identical prefixes, in particular.)
The maddeningly terse LZO description makes it sound like LZO is applied individually to each value. In that case, there will be no shared dictionary across the rows and little commonality to exploit. OTOH, if the LZO is applied to an entire 1 MB block of values written to disk, it would perform well.
Byte Dictionary sounds like it only yields savings when the values are identical rather than similar, so not a good option.
Compression is applied per block, which means that LZO is almost always the right choice for VARCHAR. Most of the other alternatives require the values to be either completely identical to other values (e.g. BYTEDICT, RUNLENGTH), or be numeric (e.g. DELTA, MOSTLY8).
The only other alternative for VARCHARs is TEXT255/TEXT32K, which might work for your use case. These build a dictionary of the first N words (245 for TEXT255, variable for TEXT32K) and replace occurrences of those words with a one-byte index. If your values share a lot of words, TEXT255 might work better than LZO.
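If you would rather measure than guess, a rough sketch for trying both follows (the table and column names are invented, and the report produced by ANALYZE COMPRESSION can differ between Redshift versions):

-- Declare the encoding explicitly when creating the table.
CREATE TABLE url_values (
    val VARCHAR(100) ENCODE lzo   -- or ENCODE text255 if rows share many leading words
);

-- After loading a representative sample of rows, ask Redshift
-- which encoding it would recommend for each column.
ANALYZE COMPRESSION url_values;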
I need a suggestion on which data type would give better performance if we set one of these as the primary key in DB2: BIGINT or DECIMAL(13,0)?
I suspect DECIMAL(13,0) will have issues once the key grows to a very big size, but I wanted a better answer/understanding of this.
Thanks.
DECIMAL does not have issues as such. The only thing is that DB2 has to do more work to use the value once it is read: it reads the data and then still has to interpret the decimal part (the precision), even when the scale is 0.
With BIGINT, on the other hand, DB2 reads the value and needs no further processing; the number is usable as-is from the bufferpool.
If you are really going to use 13-digit numbers for most keys, DECIMAL may come out slightly smaller, since you are not paying for unused range, although decimals do carry extra bytes for the precision. Used that way you can optimize storage, which translates into less IO and better performance. However, it also depends on the other columns of your table; you have to test which of the two gives you better performance.
When compression is in use, extra CPU cycles are needed to decompress the data, so you should test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). Can't speak to other deployments, unfortunately.
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
Note that, really, IDs are just a sequence of bits; the fact that they happen to be integers (usually) is irrelevant. It's best to consider them essentially random data: sequential assignment is an optimization detail, there are often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
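For reference, a minimal sketch of the BIGINT approach as an identity-generated surrogate key (DB2 for LUW syntax; the table and column names are placeholders):

CREATE TABLE orders (
    order_id   BIGINT NOT NULL
               GENERATED ALWAYS AS IDENTITY,  -- surrogate key; no arithmetic ever done on it
    order_date DATE   NOT NULL,
    PRIMARY KEY (order_id)
);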
I'm designing a database, and since I'm not yet comfortable with the data types, I was wondering about the following:
If I don't care about the length of a vector of characters, is it appropriate to use VARCHAR (even if the strings are expected to be quite short)?
Until now I've been using TEXT (VARCHAR) when I needed to store really long strings, but from reading the documentation and asking around I heard that PostgreSQL does not store extra bytes when a maximum length is supplied and the stored string is shorter than that length.
I guess the maximum length is there for cases where the remote device can't afford much memory.
Can anyone confirm or explain this?
From the docs:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
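To make the options concrete, a small sketch (the column names are invented): PostgreSQL also accepts character varying with no length at all, which behaves like text with no limit, while a declared length only adds a check on insert:

CREATE TABLE notes (
    short_label varchar(50),  -- values longer than 50 characters are rejected
    body        varchar,      -- no length given: effectively the same as text
    body_alt    text          -- no limit either
);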
Is VARCHAR(100) any better than VARCHAR(500) from a performance point of view? What about disk usage?
Talking about PostgreSQL today, not some database some time in history.
They are identical.
From the PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/static/datatype-character.html
Tip: There are no performance differences between these three types, apart from increased storage size when using the blank-padded type, and a few extra cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
Here they are talking about the differences between char(n), varchar(n) and text (= varchar(1G)). The official story is that there is no difference between varchar(100) and text (very large varchar).
There is no difference between varchar(m) and varchar(n).
http://archives.postgresql.org/pgsql-admin/2008-07/msg00073.php
There is a difference between varchar(n) and text though: varchar(n) has a built-in constraint which must be checked, so it is actually a little slower.
http://archives.postgresql.org/pgsql-general/2009-04/msg00945.php
TEXT /is/ the same as VARCHAR without an explicit length. The text
"The storage requirement for a short
string (up to 126 bytes) is 1 byte
plus the actual string, which includes
the space padding in the case of
character. Longer strings have 4 bytes
overhead instead of 1. Long strings
are compressed by the system
automatically, so the physical
requirement on disk might be less.
Very long values are also stored in
background tables so that they do not
interfere with rapid access to shorter
column values. In any case, the
longest possible character string that
can be stored is about 1 GB."
refers to both VARCHAR and TEXT (since VARCHAR(n) is just a limited version of TEXT). Limiting your VARCHARS artificially has no real storage or performance benefits (the overhead is based on the actual length of the string, not the length of the underlying varchar), except possibly for comparisons against wildcards and regexes (but at the level where that starts to matter, you should probably be looking at something like PostgreSQL's full-text indexing support).
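If you want to verify this on your own installation, a quick check with pg_column_size() along these lines should do (the column names are arbitrary, and exact sizes depend on the PostgreSQL version):

CREATE TEMP TABLE widths (a varchar(100), b varchar(500));
INSERT INTO widths VALUES ('some phrase', 'some phrase');

-- Identical stored sizes: the overhead depends on the actual string,
-- not on the declared maximum.
SELECT pg_column_size(a) AS a_size, pg_column_size(b) AS b_size FROM widths;

-- The only practical difference is the length check on the way in:
-- INSERT INTO widths (a) VALUES (repeat('x', 200));  -- error: value too long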