Store enum MongoDB - mongodb

I am storing enums for things such as ranks (administrator, moderator, user...) and achievements for each user in my Mongo database. As far as I know Mongo does not have an enum data type which means I have to store it using another type.
I have thought of storing it using integers which I would assume uses less space than storing strings for everything that could easily be expressed as an integer. Another upside I see of using integers is that if I wanted to rename an achievement or rank I could easily change it without even having to touch the database. A benefit I see for using strings is that the data requires less processing before it is used and is more human readable which could help in tracking down bugs.
Are there any better ways of storing enums in Mongo? Is there a strong reason to use either integers or strings? (trying to stay away from a which is better question)

TL;DR: Strings are probably the safer choice, and the performance difference should be negligible. Integers make sense for huge collections where the enum must be indexed. YMMV.
I have thought of storing it using integers which I would assume uses less space than storing strings for everything that could easily be expressed as an integer
True.
other upside I see of using integers is that if I wanted to rename an achievement or rank I could easily change it without even having to touch the database.
This is a key benefit of integers in my opinion. However, it also requires you to make sure the associated values of the enum don't change. If you screw that up, you'll almost certainly wreak havoc, which is a huge disadvantage.
A benefit I see for using strings is that the data requires less processing before it is used
If you're actually using an enum data type, it's probably some kind of integer internally, so the integer should require less processing. Either way, that overhead should be negligible.
Is there a strong reason to use either integers or strings?
I'm repeating a lot of what's been said, but maybe that helps other readers. Summing up:
Mixing up the enum value map wreaks havoc. Imagine your Declined states are suddenly interpreted as Accepted, because Declined had the value '2' and now it's Accepted because you reordered the enum and forgot to assign values manually... (shudders)
Strings are more expressive
Integers take less space. Disk space doesn't matter, usually, but index space will eat RAM which is expensive.
Integer updates don't resize the object. Strings, if their lengths vary greatly, might require a reallocation. String padding and padding factor should alleviate this, though.
Integers can be flags (not queryable yet, unfortunately; see SERVER-3518)
Integers can be queried by $gt / $lt so you can efficiently implement complex $or queries, though that is a rather arcane requirement and there's nothing wrong with $or queries...
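To make the trade-offs above concrete, here is a minimal Python sketch. The Rank enum, the field names and the helper functions are hypothetical and only for illustration; the documents are plain dicts of the kind a MongoDB driver would accept. The important part is that the integer values are pinned explicitly, so reordering the enum members cannot change the meaning of data that is already stored.

```python
from enum import IntEnum


class Rank(IntEnum):
    # Values are pinned explicitly so that reordering or inserting
    # members never changes what is already stored in the database.
    USER = 10
    MODERATOR = 20
    ADMINISTRATOR = 30


def to_int_document(username, rank):
    """Compact form: store the pinned integer value."""
    return {"username": username, "rank": int(rank)}


def to_string_document(username, rank):
    """Readable form: store the member name as a string."""
    return {"username": username, "rank": rank.name}


def rank_from_document(document):
    """Decode either representation back into the enum."""
    raw = document["rank"]
    return Rank(raw) if isinstance(raw, int) else Rank[raw]


# With integers, "moderator or higher" becomes a simple range filter.
staff_filter = {"rank": {"$gte": int(Rank.MODERATOR)}}
```

Leaving gaps between the values (10, 20, 30) is just one possible convention; it leaves room to add a rank in between later without renumbering.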

Related

Store arbitrary precision integer in PostgreSQL

I have an application that needs to store cryptocurrency values to PostgreSQL database. The application uses arbitrary precision integers, and those I have to store to the database. What's the most efficient way to do that?
Why arbitrary precision? For two reasons:
For security. There shall never be an overflow.
For necessity. For example, Ethereum uses uint256 by default internally, and 1 Ether = 10^18 wei. So transactions will have a gigantic number of digits that has to be stored if accuracy is to be sought (which it is).
The best solution I came up with is to convert the number to a blob and store the number as bits in raw format. But I'm hoping there's a better way that's more suitable for a database.
EDIT:
The reason why I need this method for storing to be better is performance. I don't want to get into benchmarks and all this detail. That's why I'm keeping the question simple, or otherwise it'll get complicated. So the question is whether there's a proper way to do this.
Have a look at the documentation.
If you need efficiency but also depend on accurate values (which I would agree with), then you really should pre-define columns or different tables with specific presets using decimal(precision, scale).
If your tests reveal that the standard data types are not performing well enough, you might want to have a look at bignum and maybe others.
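To pick a precision for decimal(precision, scale), it helps to count how many digits the largest value needs. The following Python sketch assumes the values are unsigned 256-bit integers, as in the Ethereum example from the question; the function names are hypothetical. It uses Decimal because that is the type Python database drivers commonly map to numeric, but treat that as an assumption to check against your driver.

```python
from decimal import Decimal

UINT256_MAX = 2**256 - 1
WEI_PER_ETHER = 10**18

# Number of decimal digits in the largest uint256 value. This would be
# the precision of a column declared as numeric(78, 0).
REQUIRED_PRECISION = len(str(UINT256_MAX))


def to_db_value(amount_wei):
    """Convert an int amount to Decimal, rejecting out-of-range values."""
    if not 0 <= amount_wei <= UINT256_MAX:
        raise ValueError("amount does not fit in uint256")
    # Constructing a Decimal from an int is exact, whatever the context precision.
    return Decimal(amount_wei)


def from_db_value(value):
    """Convert an integer-valued Decimal back to a Python int."""
    return int(value)


print(REQUIRED_PRECISION)  # 78
```

Storing whole wei with scale 0 keeps every value an exact integer, so no rounding can occur on the way in or out.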

PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database is using GUIDs as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 million million million positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
UUID is really just for distributed systems and other special cases.
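The 292471-year figure above can be checked with a few lines of Python. This is only the arithmetic, assuming 365-day years; the constant names are made up for the example.

```python
BIGINT_MAX = 2**63 - 1                  # largest positive bigint value
IDS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes 365-day years

years_until_exhausted = BIGINT_MAX // (IDS_PER_SECOND * SECONDS_PER_YEAR)
print(years_until_exhausted)  # 292471
```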
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.
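The size difference is easy to see outside the database as well. This sketch uses Python's standard uuid module as a stand-in; it shows payload sizes only and does not include the 1 or 4 bytes of overhead Postgres adds to a text value. The example value is arbitrary.

```python
import uuid

example = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")

binary_size = len(example.bytes)      # 128 bits packed as raw bytes
hyphenated_size = len(str(example))   # usual text form with hyphens
compact_size = len(example.hex)       # text form without hyphens

print(binary_size, hyphenated_size, compact_size)  # 16 36 32
```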
I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
There's also a module available for generating uuids, uuid-ossp.
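As an illustration of how a string comparison and a GUID comparison can disagree, here is a small Python sketch. It uses Python's uuid module as an analogy for the cast and is not a demonstration of Postgres itself; the variable names are hypothetical.

```python
import uuid

upper = "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"
lower = "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11"
no_dashes = "a0eebc999c0b4ef8bb6d6bb9bd380a11"

# As plain strings these are three different values.
texts_equal = upper == lower or lower == no_dashes

# Parsed as UUIDs they are all the same 128-bit value.
uuids_equal = uuid.UUID(upper) == uuid.UUID(lower) == uuid.UUID(no_dashes)

print(texts_equal, uuids_equal)  # False True
```

So if differently formatted strings for the same GUID can end up in a text column, uniqueness and equality checks on the raw text will not behave the way a real uuid column would.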

What's the fastest way to create a C-compatible unbounded string in Ada?

I'm creating an Ada program for Windows that needs to be able to pass strings to some functions written in C. Until now I have been manipulating the strings in Ada using the Unbounded_String type, and then converting the data to an Interfaces.C.char_array before passing it to the C functions.
This works fine, only performance is a bit of an issue on slower, older computers. The C function is sometimes called repeatedly on a slightly modified version of a string, and requires the Unbounded_String to be converted to a similar char_array every time. The strings aren't modified by the C functions, so they only ever have to be converted to char_array.
I have thought of storing the strings in char_array, and converting from an Ada type each time the string is manipulated. The data is passed to C more often than it is changed, so it would improve performance. The problem with this approach is that often the length of the string will change, sometimes by a lot, and there is no way of knowing the maximum length beforehand.
The ideal solution would be to have something similar to an Unbounded_String only storing the string as a char_array. By this I mean something that is dynamically sized, allocating a new array when the old one isn't big enough and it should allow Ada Characters/Strings to be inserted (and also removed) into the array, converting only those characters to C chars.
Is there any (relatively) easy, fast way of doing this without having to implement it myself? Or is there any other quick way of manipulating C-compatible strings in Ada? Thanks in advance for any suggestions.
You don't mention how many objects you expect to have of your type, but I will assume that we are not talking about so many that you will be anywhere near exhausting your available address space.
Just encapsulate a sufficiently large char_array (say 10 times the largest expected size) in a private record, and create the needed operations to manipulate it.
If you're very unlucky, you may need to tell your compiler/run-time environment that you need an unusually large stack, but save that worry for when you actually experience it.

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it'd be good to have. Ditto arrays (there was work to get foreign keys to arrays to work but I don't think it ever got committed).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accommodate garbage ORMs to the point where you avoid using features that'll greatly improve the data model and solve real world problems - for example, if you have the choice of storing native hstore fields, or using an EAV schema, consider just using hstore (or better, in 9.4, json with hstore features).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.

Cost of isEqualToString: vs. Numerical comparisons

I'm working on a project with designing a core data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word, I first want to run through all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here and in a lot of my reading that string comparisons are far more expensive than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition, I'm wondering if it would be worth using some method that would represent the key word strings numerically for the purpose of this process. Possibly breaking down each character in the string into a number formed from the UTF code for each character and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might find useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collision effects.)
Comparing intrinsic numbers in C code is much faster for several reasons. It avoids the Objective C runtime dispatch overhead. It requires accessing less total memory. And the executable code for each comparison is usually just an instruction or 3, rather than a loop with incrementers and several decision points.
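Here is a rough sketch of the hashing idea, written in Python rather than Objective C to keep it short. FNV-1a is just one hash function I picked for the example, and the KeywordIndex class is hypothetical. Keywords are grouped by their 64-bit hash, and the actual strings are compared only within a bucket, which covers the collision case.

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(text):
    """Hash the UTF-8 bytes of text with 64-bit FNV-1a."""
    value = FNV_OFFSET_BASIS_64
    for byte in text.encode("utf-8"):
        value ^= byte
        value = (value * FNV_PRIME_64) & MASK_64
    return value


class KeywordIndex:
    """Detects duplicate keywords by numeric hash, then confirms by string."""

    def __init__(self):
        self._buckets = {}  # hash value -> list of keywords with that hash

    def add(self, keyword):
        """Add keyword; return False if it was already present."""
        bucket = self._buckets.setdefault(fnv1a_64(keyword), [])
        if keyword in bucket:  # string comparison only inside one bucket
            return False
        bucket.append(keyword)
        return True


index = KeywordIndex()
print(index.add("sunset"), index.add("sunset"))  # True False
```

In practice the hash could be stored in a numeric ID property next to each keyword, so a lookup compares numbers first and touches the strings only when the hashes match.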