Storing small negative values - tsql

Note: I realize the question here won't grossly impact performance or storage size in my database, but I'm interested in hearing expert opinions (if there are any).
I'm writing an application that needs to perform an operation on a datetime value and store the resulting offset in a database. I expect the offsets to stay around -3, but because the source date (the beginning date of each term) is set by a college-level committee, and because my application contacts students at regular, predictable intervals (also set by a committee), the offset could plausibly fall anywhere between -30 and 30.
I originally intended to store the resulting offset as a tinyint, but since tinyint is unsigned in SQL Server 2008, I'm not sure what would work best here. Should I use a smallint, or something like a numeric(2,0)?

Just use a smallint. Don't overthink it.
(Insert standard text about premature optimization here.)

numeric(2,0) has a storage size of 5 bytes, vs. 2 bytes for smallint, so just use smallint.

Numeric(2,0) will actually take 5 bytes of storage, so I would suggest smallint, which only takes 2 bytes.
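For illustration, a minimal sketch of the column declaration (table and column names are hypothetical, not from the question):

    CREATE TABLE dbo.TermReminder (
        TermReminderID int IDENTITY(1,1) PRIMARY KEY,
        TermStart      date NOT NULL,
        -- smallint is signed (-32,768 to 32,767) and takes 2 bytes,
        -- which easily covers an offset between -30 and 30
        OffsetDays     smallint NOT NULL
    );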

Depending on your size constraints, I would lean towards a less error-prone implementation and give yourself enough leeway on min/max values. Use a datatype that will meet your needs today and in the future, for all possible cases. If you feel you may need the room, go with the larger datatype.
In either case, your data access layer must accommodate the database column size.

Related

Store arbitrary precision integer in PostgreSQL

I have an application that needs to store cryptocurrency values in a PostgreSQL database. The application uses arbitrary precision integers, and those are what I have to store in the database. What's the most efficient way to do that?
Why arbitrary precision? For two reasons:
For security. There shall never be an overflow.
For necessity. For example, Ethereum uses uint256 internally by default, and 1 Ether = 10^18 wei, so transactions carry a gigantic number of digits that has to be stored if accuracy is to be preserved (which it is).
The best solution I came up with is to convert the number to a blob and store the number as bits in raw format. But I'm hoping there's a better way that's more suitable for a database.
EDIT:
The reason I need a better storage method is performance. I don't want to get into benchmarks and all that detail; that's why I'm keeping the question simple, otherwise it'll get complicated. So the question is whether there's a proper way to do this.
Have a look at the documentation.
If you need efficiency but also depend on accurate values (which I would agree with), then you really should pre-define columns or separate tables with specific presets using decimal(precision, scale).
If your tests reveal that the standard data types are not performing well enough, you might want to have a look at bignum and maybe other extensions.
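As a rough sketch of that decimal/numeric approach (table and column names here are made up, not from the question): a uint256 value never exceeds 78 decimal digits, so a scale-0 numeric column can hold any Ethereum wei amount exactly.

    -- numeric(78,0) is wide enough for any uint256 value, stored exactly
    CREATE TABLE eth_transfer (
        id         bigserial PRIMARY KEY,
        amount_wei numeric(78,0) NOT NULL CHECK (amount_wei >= 0)
    );

    -- 1 Ether = 10^18 wei
    INSERT INTO eth_transfer (amount_wei) VALUES (1000000000000000000);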

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUIDs as primary keys and storing them as a text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process, and is more error-prone. @khampson's answer provides most of the rationale; oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 million million million positive numbers; in other words, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six billion and change.
If you burn 1 million IDs per second (an insanely high rate), you can keep doing so for about 292,471 years, and then another 292,471 years using the negative numbers. "Tens or hundreds of millions" is not even close.
UUID is really just for distributed systems and other special cases.
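To make the two options concrete, here is a rough sketch (hypothetical table names) of a native uuid primary key next to a bigint one:

    -- native uuid primary key: 16 bytes per value
    CREATE TABLE account_uuid (
        id   uuid PRIMARY KEY,
        name text NOT NULL
    );

    -- bigint primary key: 8 bytes per value, generated from a sequence
    CREATE TABLE account_bigint (
        id   bigserial PRIMARY KEY,
        name text NOT NULL
    );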
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string is either the primary key of a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the index comparisons will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance from a unique index on md5sums in a table with billions of rows. I have found that it is usually other aspects of a query that cause performance issues. For example, when you end up needing to query a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
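A rough sketch of that chunk-and-UNION idea (the events table and id ranges are hypothetical), splitting one large scan into key ranges:

    SELECT * FROM events WHERE id BETWEEN       1 AND 1000000
    UNION ALL
    SELECT * FROM events WHERE id BETWEEN 1000001 AND 2000000
    UNION ALL
    SELECT * FROM events WHERE id BETWEEN 2000001 AND 3000000;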
Re: your concern about indexing text, while I'm sure there are some cases where a dataset produces a key distribution that performs terribly, GUIDs, much like md5sums, sha1s, etc., should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors in how an index performs is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it will basically end up with a huge number of row collisions for each of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collisions (by definition, in theory, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a uuid GUID is the same thing as a text GUID as far as indexing goes? Our entire table structure uses text fields with a GUID-like string, but I'm not sure Postgres recognizes it as a GUID, just as a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, indexes on text-based UUIDs will certainly be larger since the values are larger, and comparing two strings is in theory less efficient than comparing two numeric values, but it's not something that's likely to make a huge difference here, at least not in typical cases.
I would not optimize up front when doing so would have a significant cost and is likely never to be needed. That bridge can be crossed if the time comes (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
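For example (foo and the table name are placeholders), the cast can be used directly in a comparison so the values are matched with UUID semantics rather than as plain strings:

    SELECT *
    FROM   my_table
    WHERE  foo::uuid = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid;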
There's also a module available for generating uuids, uuid-ossp.
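A minimal sketch of using that module to generate values (assuming the extension is available on your server):

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

    -- generate a random (version 4) UUID
    SELECT uuid_generate_v4();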

unix db2 BIGINT Vs Decimal as primary key

Need a suggestion on which datatype would give better performance if we set one of these as the primary key in DB2: BIGINT or DECIMAL(13,0)?
I suspect DECIMAL(13,0) will have issues once the key grows very large, but I wanted a better answer/understanding of this.
Thanks.
Decimal does not have issues. The only thing is that DB2 has to do more work once the data is read: it reads the value and then has to interpret the decimal part (the scale), even if the scale is 0.
On the other hand, DB2 reads a BIGINT and needs no further processing; the number is in the bufferpool as-is.
If most of your keys really need 13 digits, DECIMAL may be the better choice because you won't waste bytes, although decimals carry extra bytes for the precision. Used that way you optimize storage, which translates into less I/O and better performance. However, it also depends on the other columns of your table; you have to test which option gives you better performance.
When using compression, more CPU cycles are needed to recover the information, so you have to test whether performance is affected.
Use BIGINT:
Can store ~19 digits (versus 13)
Will take 8 bytes (versus maybe either 7 or 13 - see next)
Depending on platform, DECIMAL will be stored as a form of Binary Coded Decimal - for example, on the iSeries (and I can't remember if it's Packed or Zoned). Can't speak to other deployments, unfortunately.
You aren't doing math on these values (things like "next entry" don't count) - save DECIMAL/NUMERIC for measurements/values.
Note that, really, IDs are just a sequence of bits - the fact that they happen to be integers (usually) is irrelevant. It's best to consider them random data; sequential assignment is an optimization detail, there are often gaps (rollbacks, system crashes, whatever), and they're meaningless for anything other than joining.
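As an illustration only (a hypothetical table, assuming DB2 for LUW syntax), a BIGINT identity primary key looks like this:

    CREATE TABLE orders (
        order_id   BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
        created_at TIMESTAMP NOT NULL WITH DEFAULT CURRENT TIMESTAMP,
        PRIMARY KEY (order_id)
    );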

using varchar over varchar(n) with Postgresql

I'm designing a database, and since I'm not yet comfortable with the data types I was wondering about the following:
If I don't care about the length of a character string, is it appropriate to use VARCHAR? (even if the strings are expected to be quite short?)
Until now I've been using TEXT (VARCHAR) when I needed to store really long strings, but reading the documentation and asking around I heard that PostgreSQL does not store extra bytes when a maximum length is supplied and the stored string is shorter than it.
I guess the maximum length is there for cases where the remote device can't afford much memory.
Can someone confirm or explain this?
From the docs:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
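So in practice a sketch like the following (hypothetical table) is idiomatic: use text, and add an explicit check only where a length limit is an actual business rule rather than a guess.

    CREATE TABLE app_user (
        id       serial PRIMARY KEY,
        -- no arbitrary limit needed; enforce one only if it's a real rule
        username text NOT NULL CHECK (char_length(username) <= 50),
        bio      text
    );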

Does a varchar field's declared size have any impact in PostgreSQL?

Is VARCHAR(100) any better than VARCHAR(500) from a performance point of view? What about disk usage?
Talking about PostgreSQL today, not some database some time in history.
They are identical.
From the PostgreSQL documentation:
http://www.postgresql.org/docs/8.3/static/datatype-character.html
Tip: There are no performance differences between these three types, apart from increased storage size when using the blank-padded type, and a few extra cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, it has no such advantages in PostgreSQL. In most situations text or character varying should be used instead.
Here they are talking about the differences between char(n), varchar(n) and text (= varchar(1G)). The official story is that there is no difference between varchar(100) and text (very large varchar).
There is no difference between varchar(m) and varchar(n).
http://archives.postgresql.org/pgsql-admin/2008-07/msg00073.php
There is a difference between varchar(n) and text, though: varchar(n) has a built-in constraint which must be checked, so it is actually a little slower.
http://archives.postgresql.org/pgsql-general/2009-04/msg00945.php
TEXT /is/ the same as VARCHAR without an explicit length; the text
"The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB."
refers to both VARCHAR and TEXT (since VARCHAR(n) is just a limited version of TEXT). Limiting your VARCHARs artificially has no real storage or performance benefit (the overhead is based on the actual length of the string, not the declared length of the varchar), except possibly for comparisons against wildcards and regexes (and at the level where that starts to matter, you should probably be looking at something like PostgreSQL's full-text indexing support).
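One quick way to see this for yourself (a sketch; exact byte counts depend on your version, but all three come out equal) is to compare the reported size of the same string under different declared types:

    SELECT pg_column_size('hello'::varchar(100)) AS vc100,
           pg_column_size('hello'::varchar(500)) AS vc500,
           pg_column_size('hello'::text)         AS txt;
    -- the three columns report the same size for the same value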