Using varchar over varchar(n) with PostgreSQL

I'm designing a database, and since I'm not yet comfortable with the data types, I was wondering about the following:
If I don't care about the length of a character string, is it appropriate to use VARCHAR (even if the strings are expected to be quite short)?
Until now I've been using TEXT (VARCHAR) when I needed to store really long strings, but from reading the documentation and asking around, I understand that PostgreSQL does not store extra bytes when a maximum length is supplied and the stored string is shorter than that maximum.
I guess the maximum length is there for cases where the remote device can't afford much memory.
Could someone confirm or explain that?

From the docs:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
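As a rough illustration of what the quoted tip describes, here is a toy Python model of the three types. The function names are made up for this sketch, and it deliberately ignores details of the real server (for example, how over-length trailing spaces are handled), so treat it as a picture of the idea rather than of PostgreSQL's implementation.

```python
def store_char(value: str, n: int) -> str:
    """Toy character(n): check the length, then pad with spaces up to n."""
    if len(value) > n:
        raise ValueError(f"value too long for type character({n})")
    return value.ljust(n)


def store_varchar(value: str, n: int) -> str:
    """Toy character varying(n): check the length, store the value unchanged."""
    if len(value) > n:
        raise ValueError(f"value too long for type character varying({n})")
    return value


def store_text(value: str) -> str:
    """Toy text: no length check, value stored unchanged."""
    return value


word = "short"
print(len(store_char(word, 100)))     # blank-padded to the declared length
print(len(store_varchar(word, 100)))  # only the characters actually supplied
print(len(store_text(word)))          # only the characters actually supplied
```

In this model the only cost of varchar(n) over text is the length comparison, and only char(n) grows the stored value.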

Related

Which is the data type to use for storing a firebase uid in postgresql

I want to store the uids generated by firebase auth in a postgres database. As it is not a valid uuid I am not sure which datatype to choose. Mainly I am not sure if I should use a char or a varchar.
I would say use varchar to allow for the uid changing over time. From the Postgres end there is really no difference, see here:
Tip
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
Firebase Authentication UIDs are just strings. The strings don't contain any data - they are just random. A varchar seems appropriate.

What would be the ideal PostgreSQL data type for storing English phrases?

This may be a silly question, but I was wondering if I should use VARCHAR(n) or TEXT for a column that holds English phrases. The reason I'm not sure is that I don't know the maximum length; some phrases can contain 15 words or more. I suppose VARCHAR(500) would work well, but I was also thinking about the worst-case scenario. I read that there is no performance difference between TEXT and VARCHAR(n) in PostgreSQL. Should I go for TEXT in this case?
Text has no limit so it could be a correct choice, and you're right about performance: this is the postgres documentation:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs and slower sorting. In most situations text or character varying should be used instead.
Be aware, though, that text isn't standard SQL.

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUIDs as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error-prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
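The arithmetic behind that figure can be checked with a few lines of Python. This is just a sketch: the constant names are mine, and it assumes 365-day years, which is what reproduces the number quoted above.

```python
BIGINT_MAX = 2**63 - 1                 # largest positive bigint value
IDS_PER_SECOND = 1_000_000             # the "insanely high" burn rate
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day years, no leap days


def years_until_exhausted(max_id: int = BIGINT_MAX,
                          ids_per_second: int = IDS_PER_SECOND) -> int:
    """Whole years before the positive range runs out at a constant rate."""
    return max_id // (ids_per_second * SECONDS_PER_YEAR)


print(BIGINT_MAX)               # 9223372036854775807
print(years_until_exhausted())  # 292471
```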
UUID is really just for distributed systems and other special cases.
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
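To put rough numbers on the point about unique values, here is a small back-of-the-envelope sketch. The helper name is hypothetical and it assumes values are spread evenly across the rows, which real data rarely is; the actual planner works from collected statistics, not from this formula.

```python
def avg_rows_per_key(total_rows: int, distinct_values: int) -> float:
    """Average number of rows an equality lookup on one key value would match,
    assuming a perfectly even distribution of values."""
    return total_rows / distinct_values


rows = 1_000_000_000

# A boolean column has at most three distinct entries: true, false, NULL.
print(avg_rows_per_key(rows, 3))

# A GUID column has one distinct value per row.
print(avg_rows_per_key(rows, rows))
```

With hundreds of millions of matches per key, an index on the boolean column saves almost nothing over scanning the table, while the GUID index narrows each lookup to a single row.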
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes (32 hex digits plus 1 byte of overhead, or 37 bytes if the hyphens of the canonical form are stored), but could vary significantly depending on the encoding used.
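Those sizes can be reproduced with Python's standard uuid module, used here only as a stand-in to count bytes. The function name and the header constant are my own; the 1-byte figure is the short-string overhead mentioned above, and the sketch assumes a single-byte encoding of the hex digits.

```python
import uuid

SHORT_HEADER_BYTES = 1  # assumed overhead for a short text value, as stated above


def uuid_storage_sizes(u: uuid.UUID) -> dict:
    """Compare the binary size of a UUID with two textual representations."""
    return {
        "uuid": len(u.bytes),                                # raw 128-bit value
        "text_compact": SHORT_HEADER_BYTES + len(u.hex),     # 32 hex digits
        "text_canonical": SHORT_HEADER_BYTES + len(str(u)),  # 8-4-4-4-12 with hyphens
    }


print(uuid_storage_sizes(uuid.uuid4()))
# {'uuid': 16, 'text_compact': 33, 'text_canonical': 37}
```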
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not in usual cases.
I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating uuids, uuid-ossp.
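To see why comparing as a GUID can differ from comparing as a plain string, here is a sketch using Python's uuid module as an analogy for the cast. The sample value and helper names are arbitrary; the server-side equivalent would be the foo::uuid cast mentioned above, which is not exercised here.

```python
import uuid

# Three spellings of the same identifier (an arbitrary sample value).
upper = "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"
braced = "{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}"
compact = "a0eebc999c0b4ef8bb6d6bb9bd380a11"


def equal_as_text(x: str, y: str) -> bool:
    """Plain string comparison: case, braces and hyphens all matter."""
    return x == y


def equal_as_uuid(x: str, y: str) -> bool:
    """Parse both strings as UUIDs first, then compare the 128-bit values."""
    return uuid.UUID(x) == uuid.UUID(y)


print(equal_as_text(upper, braced))    # False
print(equal_as_uuid(upper, braced))    # True
print(equal_as_uuid(braced, compact))  # True
```

A text column will also happily store a value that is not a GUID at all, whereas parsing (like the cast) rejects it: uuid.UUID("not-a-guid") raises ValueError.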

unix db2 BIGINT Vs Decimal as primary key

I need a suggestion on which data type would give better performance as a primary key in DB2: BIGINT or Decimal(13,0)?
I suspect Decimal(13,0) will have issues once the key grows to a very big size but I wanted a better answer/understanding for this.
Thanks.
Decimal does not have issues. The only difference is that DB2 has to do more work to retrieve the data once it is read: DB2 reads the data and then has to locate the decimal part (the precision), even if it is 0.
On the other hand, DB2 reads the BigInt and needs no further processing. The number is in the bufferpool.
If most of your keys are going to be 13-digit integers, Decimal will probably be better because you are not going to use extra bytes; however, decimals do have extra bytes for the precision. Using decimal in this way optimizes storage, which translates into better I/O and better performance. However, it depends on the other columns of your table. You have to test which of them gives you better performance.
When using compression, more CPU cycles are needed to recover the information. You have to test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). Can't speak to other deployments, unfortunately.
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
Note that, really, ids are just a sequence of bits - the fact that it happens to be an integer (usually) is irrelevant. It's best to consider them random data; sequential assignment is an optimization detail, there are often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
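The "7 or 13" bytes mentioned above correspond to the two usual decimal encodings. The sketch below assumes the common rules of thumb (packed decimal stores two digits per byte plus a sign nibble, zoned decimal stores one digit per byte); the function names are mine, and the exact on-disk layout depends on the platform, so check your own deployment.

```python
BIGINT_BYTES = 8
BIGINT_MAX_DIGITS = len(str(2**63 - 1))  # decimal digits in the largest bigint


def packed_decimal_bytes(precision: int) -> int:
    """Packed decimal (assumed): two digits per byte plus a sign nibble."""
    return precision // 2 + 1


def zoned_decimal_bytes(precision: int) -> int:
    """Zoned decimal (assumed): one byte per digit."""
    return precision


print(BIGINT_BYTES, BIGINT_MAX_DIGITS)  # 8 19
print(packed_decimal_bytes(13))         # 7
print(zoned_decimal_bytes(13))          # 13
```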

Does a varchar field's declared size have any impact in PostgreSQL?

Is VARCHAR(100) any better than VARCHAR(500) from a performance point of view? What about disk usage?
Talking about PostgreSQL today, not some database some time in history.
They are identical.
From the PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/static/datatype-character.html
Tip: There are no performance differences between these three types, apart from increased storage size when using the blank-padded type, and a few extra cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
Here they are talking about the differences between char(n), varchar(n) and text (= varchar(1G)). The official story is that there is no difference between varchar(100) and text (very large varchar).
There is no difference between varchar(m) and varchar(n).
http://archives.postgresql.org/pgsql-admin/2008-07/msg00073.php
There is a difference between varchar(n) and text though: varchar(n) has a built-in constraint which must be checked, so it is actually a little slower.
http://archives.postgresql.org/pgsql-general/2009-04/msg00945.php
TEXT /is/ the same as VARCHAR without an explicit length; the text
"The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB."
refers to both VARCHAR and TEXT (since VARCHAR(n) is just a limited version of TEXT). Limiting your VARCHARs artificially has no real storage or performance benefits (the overhead is based on the actual length of the string, not the length of the underlying varchar), except possibly for comparisons against wildcards and regexes (but at the level where that starts to matter, you should probably be looking at something like PostgreSQL's full-text indexing support).
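The storage rule in that quote can be written down as a tiny estimator. This is a simplified model of my own, not PostgreSQL code: it ignores compression, out-of-line storage and alignment, and it assumes UTF-8 as the database encoding. Note that the declared limit n does not appear anywhere in it.

```python
SHORT_LIMIT_BYTES = 126  # threshold quoted above for the 1-byte overhead


def estimated_stored_bytes(value: str, encoding: str = "utf-8") -> int:
    """Estimate the stored size of a varchar/text value: overhead plus payload.

    Simplified model: 1 byte of overhead up to 126 bytes of payload,
    4 bytes of overhead beyond that; no compression, no alignment.
    """
    payload = len(value.encode(encoding))
    overhead = 1 if payload <= SHORT_LIMIT_BYTES else 4
    return overhead + payload


# The same 10-character string costs the same in varchar(100), varchar(500) or text.
print(estimated_stored_bytes("a" * 10))   # 11
print(estimated_stored_bytes("a" * 127))  # 131
```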