Store arbitrary precision integer in PostgreSQL

I have an application that needs to store cryptocurrency values to PostgreSQL database. The application uses arbitrary precision integers, and those I have to store to the database. What's the most efficient way to do that?
Why arbitrary precision? For two reasons:
For security. There shall never be an overflow.
For necessity. For example, Ethereum uses uint256 by default internally, and 1 Ether = 10^18 wei. So transactions will have a gigantic number of digits that has to be stored if accuracy is to be sought (which it is).
The best solution I came up with is to convert the number to a blob and store the number as bits in raw format. But I'm hoping there's a better way that's more suitable for a database.
EDIT:
The reason why I need this method for storing to be better is performance. I don't want to get into benchmarks and all this detail. That's why I'm keeping the question simple, or otherwise it'll get complicated. So the question is whether there's a proper way to do this.

Have a look at the documentation.
If you need efficiency but also depend on accurate values (which I would agree with), then you really should pre-define columns or different tables with specific presets using decimal(precision, scale).
If your tests reveal that the standard data types do not perform well enough, you might want to have a look at bignum and maybe others.
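As a rough check on what decimal(precision, scale) would need for the Ethereum case in the question, the sketch below counts the decimal digits of the largest uint256 value and round-trips a value through a fixed-width byte string (the "blob" idea from the question). The names UINT256_MAX, to_blob and from_blob are made up for illustration, and reading the result as numeric(78, 0) is an inference from the digit count, not something the answer states.

```python
from decimal import Decimal

# Largest value an unsigned 256-bit integer can hold.
UINT256_MAX = 2**256 - 1

# Number of decimal digits needed to store it without loss.
DIGITS_NEEDED = len(str(UINT256_MAX))


def to_blob(value: int) -> bytes:
    """Encode a non-negative integer below 2**256 as 32 big-endian bytes."""
    return value.to_bytes(32, byteorder="big")


def from_blob(blob: bytes) -> int:
    """Decode the big-endian byte string back into an int."""
    return int.from_bytes(blob, byteorder="big")


one_ether_in_wei = 10**18

# Decimal built from an int keeps every digit, so it is a safe in-memory
# counterpart for an exact numeric column (assumption: the database driver
# in use maps numeric to Decimal).
as_decimal = Decimal(UINT256_MAX)
```

The digit count is the precision a fixed decimal column would need with scale 0; the blob form always occupies 32 bytes but cannot be compared or summed by the database as a number.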

Related

Which access method shall be used for a Berkeley DB that it is going to store 15.000.000 of integer keys?

I am planning to evaluate BerkeleyDB for a project where I have to store 15.000.000 of key/value pairs.
Keys are integers of 10 digits.
Values are variable length binary data.
In the BerkeleyDB documentation (https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/am_conf/intro.html) it is said that there are four access methods that can be configured:
Btree
Hash
Queue
Recno
While the documentation describes each access method, I cannot fully understand which access method would fit this specific data set best.
Which access method shall be used for this kind of data?
When unsure, choose btree. It's the most flexible access method. Sure, if you're positive that your application fits in one of the other ones, go for it.
A note of caution: writing an application using BDB that really works, that's transactional, recoverable, and offers consistency guarantees is going to be time consuming and prone to error at every step. And, if you're using this for commercial purposes, the licensing could be a total dealbreaker. For some things, it's really the best option. Just make sure you weigh all the other key value store options before embarking on your BDB quest: https://en.wikipedia.org/wiki/Key-value_database

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUID's as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
UUID is really just for distributed systems and other special cases.
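The bigint arithmetic above can be reproduced with a few lines. This is only a back-of-the-envelope sketch; the constant names are illustrative, and a 365-day year is assumed (leap days ignored), which is what yields the figure quoted in the answer.

```python
# Upper bound of a signed 64-bit integer (PostgreSQL bigint).
BIGINT_MAX = 2**63 - 1

IDS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

# Whole years until the positive half of the range is used up.
years_until_exhausted = BIGINT_MAX // IDS_PER_SECOND // SECONDS_PER_YEAR

print(BIGINT_MAX)             # 9223372036854775807
print(years_until_exhausted)  # 292471
```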
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.
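To make the size difference concrete, here is a small sketch using Python's uuid module as a stand-in for the two representations. The sample UUID is arbitrary, and the per-value header that PostgreSQL adds to text is not modelled here.

```python
import uuid

# Arbitrary sample value, used only for measuring sizes.
u = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")

binary_size = len(u.bytes)   # raw 128-bit value
text_size = len(str(u))      # canonical text form with hyphens
hex_size = len(u.hex)        # text form without hyphens

print(binary_size, text_size, hex_size)  # 16 36 32
```

Adding the 1-byte header to the 32-character form gives the 33-byte minimum mentioned above; the hyphenated form is larger still.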
I would not optimize up front when doing so would carry a significant cost and is likely never to be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
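One case where treating the value as a UUID rather than a string changes the result is letter case. The sketch below uses Python's uuid module as an analogy for the cast, not PostgreSQL itself; the sample values are arbitrary.

```python
import uuid

# The same identifier written in upper and lower case.
upper = "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"
lower = "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11"

# Compared as plain strings, the two spellings are different values.
same_as_text = upper == lower

# Parsed as UUIDs, they are the same 128-bit value.
same_as_uuid = uuid.UUID(upper) == uuid.UUID(lower)

print(same_as_text, same_as_uuid)  # False True
```

A text primary key would therefore accept both spellings as distinct rows unless the application normalizes the case itself.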
There's also a module available for generating uuids, uuid-ossp.

Store enum MongoDB

I am storing enums for things such as ranks (administrator, moderator, user...) and achievements for each user in my Mongo database. As far as I know Mongo does not have an enum data type which means I have to store it using another type.
I have thought of storing it using integers which I would assume uses less space than storing strings for everything that could easily be expressed as an integer. Another upside I see of using integers is that if I wanted to rename an achievement or rank I could easily change it without even having to touch the database. A benefit I see for using strings is that the data requires less processing before it is used and is more human readable which could help in tracking down bugs.
Are there any better ways of storing enums in Mongo? Is there a strong reason to use either integers or strings? (trying to stay away from a which is better question)
TL;DR: Strings are probably the safer choice, and the performance difference should be negligible. Integers make sense for huge collections where the enum must be indexed. YMMV.
I have thought of storing it using integers which I would assume uses less space than storing strings for everything that could easily be expressed as an integer
True.
Another upside I see of using integers is that if I wanted to rename an achievement or rank I could easily change it without even having to touch the database.
This is a key benefit of integers in my opinion. However, it also requires you to make sure the associated values of the enum don't change. If you screw that up, you'll almost certainly wreak havoc, which is a huge disadvantage.
A benefit I see for using strings is that the data requires less processing before it is used
If you're actually using an enum data type, it's probably some kind of integer internally, so the integer should require less processing. Either way, that overhead should be negligible.
Is there a strong reason to use either integers or strings?
I'm repeating a lot of what's been said, but maybe that helps other readers. Summing up:
Mixing up the enum value map wreaks havoc. Imagine your Declined states are suddenly interpreted as Accepted, because Declined had the value '2' and now it's Accepted because you reordered the enum and forgot to assign values manually... (shudders)
Strings are more expressive
Integers take less space. Disk space doesn't matter, usually, but index space will eat RAM which is expensive.
Integer updates don't resize the object. Strings, if their lengths vary greatly, might require a reallocation. String padding and padding factor should alleviate this, though.
Integers can be flags (not queryable yet, unfortunately, see SERVER-3518)
Integers can be queried by $gt / $lt so you can efficiently implement complex $or queries, though that is a rather arcane requirement and there's nothing wrong with $or queries...
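The first point in the list, about the value map getting mixed up, is usually addressed by assigning every enum value explicitly. Below is a minimal sketch of that idea; the Status enum, its members other than Declined and Accepted, and the to_document / from_document helpers are hypothetical, and a plain dict stands in for the stored MongoDB document.

```python
from enum import IntEnum


class Status(IntEnum):
    # Values are assigned by hand, so reordering or inserting members
    # does not change what an already stored integer means.
    PENDING = 0
    ACCEPTED = 1
    DECLINED = 2


def to_document(status: Status) -> dict:
    """Build the document to store; the enum is saved as a plain int."""
    return {"status": int(status)}


def from_document(doc: dict) -> Status:
    """Map the stored int back to the enum; unknown values raise ValueError."""
    return Status(doc["status"])


stored = to_document(Status.DECLINED)
print(stored)                      # {'status': 2}
print(from_document(stored).name)  # DECLINED
```

Renaming a member only touches the application code, while an integer that no longer maps to any member fails loudly instead of being silently reinterpreted.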

Storing small negative values

Note: I realize the question here won't grossly impact performance or storage size in my database, but I'm interested hearing expert opinions (if there are any).
I'm writing an application that needs to perform an operation on a datetime value and store the resulting offset in a database. I expect the offsets to stay the same (around -3), but because the source date (the beginning date of each term) is set by a college-level committee, and because my application is dealing with contacting students at regular, predictable intervals (which would also be set by a committee) it's possible that it could be between -30 and 30.
I originally intended to store the resulting offset as a tinyint, but since it's unsigned in SQL Server 2008, I'm not sure what would work best here. Should I use a smallint, or something like a numeric(2,0)?
Just use a smallint. Don't overthink it.
(Insert standard text about premature optimization here.)
numeric(2,0) has a storage size of 5 bytes, vs. 2 bytes for smallint, so just use smallint.
Numeric(2,0) will actually take 5 bytes of storage, so I would suggest smallint, which only takes 2 bytes.
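The range argument can be checked outside the database. The sketch below uses Python's struct module to mimic a 2-byte signed integer (like smallint) and a 1-byte unsigned integer (like SQL Server's tinyint); the helper name fits_unsigned_byte is made up for illustration.

```python
import struct

offset = -30

# "<h" is a little-endian signed 16-bit integer, the same width as smallint.
packed = struct.pack("<h", offset)
restored = struct.unpack("<h", packed)[0]


def fits_unsigned_byte(value: int) -> bool:
    """Return True if value fits a 1-byte unsigned integer (0 to 255)."""
    try:
        struct.pack("<B", value)
        return True
    except struct.error:
        return False


print(len(packed), restored)         # 2 -30
print(fits_unsigned_byte(offset))    # False
```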
Depending on your size constraints, I would lean towards a less error prone implementation, and give yourself enough leeway on min/max values. Use a datatype that will meet your needs today, and in the future, for all possible cases. If you feel like you may need the storage, give yourself the larger datatype.
Either case, your data access layer must accommodate the database column size.

Cost of isEqualToString: vs. Numerical comparisons

I'm working on a project that involves designing a Core Data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word, I first want to run through all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here and in a lot of my reading that doing string comparisons is a far more expensive processing than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition I'm wondering if it would be worth using some method that would represent the key word strings numerically for the purpose of this process. Possibly breaking down each character in the string into a number formed from the UTF code for each character and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might find useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collision effects.)
Comparing intrinsic numbers in C code is much faster for several reasons. It avoids the Objective-C runtime dispatch overhead. It requires accessing less total memory. And the executable code for each comparison is usually just an instruction or three, rather than a loop with incrementers and several decision points.
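The hash idea can be sketched as follows. It is written in Python rather than Objective-C purely for illustration; keyword_id and KeywordIndex are hypothetical names, and truncating SHA-256 to 64 bits is just one possible choice of hash, not something the answer prescribes. Strings are only compared when their numeric ids match, which covers the collision check mentioned above.

```python
import hashlib


def keyword_id(word: str) -> int:
    """Derive a 64-bit integer from a keyword (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


class KeywordIndex:
    """Groups keywords by numeric id so lookups avoid scanning every string."""

    def __init__(self) -> None:
        # Maps numeric id -> set of keywords that hashed to that id.
        self._by_id = {}

    def add(self, word: str) -> bool:
        """Add a keyword; return False if it is already present."""
        bucket = self._by_id.setdefault(keyword_id(word), set())
        if word in bucket:  # string comparison only among id matches
            return False
        bucket.add(word)
        return True


index = KeywordIndex()
first = index.add("sunset")
second = index.add("sunset")
print(first, second)  # True False
```

With the id stored as an indexed numeric attribute, checking for an existing keyword becomes a lookup on one number followed by at most a handful of string comparisons.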