standard mangling for reserved column or table names - postgresql

I am wary of using quoted identifiers in PostgreSQL when my "business" table or column names happen to be reserved words. Are there any standard or best-practice mangling approaches? I would like a consistent mangling mechanism, as there is also some code generation in place and I need to be able to automatically translate back and forth between the mangled names used in PostgreSQL and the non-mangled names used in code.

The simplest, and probably best, way to "mangle" names so that none of them is a reserved word is to prefix them all with an underscore character.
The best way, though, is to not use reserved words at all: let the DBA choose the names wherever business terms clash with DB terms.
Also consider terms that might clash with reserved words in your application language.
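If you do automate it, the round trip is easy to keep consistent. Here is a minimal Python sketch, assuming you maintain your own set of reserved words and that business names never themselves start with an underscore (the names below are hypothetical, not from the question):

# Reversible underscore-prefix mangling against a hand-maintained set of
# reserved words (hypothetical; build yours from the PostgreSQL keyword list).
RESERVED = {"user", "order", "group", "table", "column"}

def mangle(name: str) -> str:
    """Prefix reserved words with an underscore so they are safe unquoted."""
    return "_" + name if name.lower() in RESERVED else name

def demangle(name: str) -> str:
    """Reverse the mangling for the non-mangled names used in generated code."""
    if name.startswith("_") and name[1:].lower() in RESERVED:
        return name[1:]
    return name

assert demangle(mangle("order")) == "order"        # reserved: becomes _order
assert demangle(mangle("customer")) == "customer"  # untouched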

I like to reserve the underscore prefix (_name) for parameters and variables in server-side functions to rule out conflicts. I have seen that a lot lately among people working with functions.
For identifiers of database objects I pick the names by hand. Working with non-English terms (reduced to ASCII letters) avoids most of the conflicts.
Another alternative is to use the plural form for table names, since most reserved words are in the singular form (with a few exceptions!).
Most one-letter prefixes would do the job without exception, like n_name. But since I hand-pick identifiers, there is no need for automation.
Anything that avoids the error-prone need for double-quoting is good. I never use reserved words, even where Postgres would allow them.

Related

Using ampersand in SSAS Tabular models

Some people in my company have gone to great lengths to remove & characters from data and measure names in our Tabular models. I wasn't around when they made this decision, but it destroys readability in our financial reporting. Instead of R&D and SG&A in our statements, we have RD and SGA.
The offenders are no longer around to answer for their crimes. I am trying to convince my co-workers to re-add the &, but they won't budge without some idea why this was done in the first place. My best guess is that a consultant told them not to use & in models. I think they meant in object names only, but our team got carried away. I've been able to find this page that says & was a reserved character in Compatibility Level 1100, but that goes back to SQL Server 2012! I think our lowest environment is SSAS 2017.
Am I missing anything or can we re-add & to our data and measure names? Is there any reason you would avoid the use of & anywhere in an SSAS tabular model? Links to documentation appreciated!
https://learn.microsoft.com/en-us/analysis-services/multidimensional-models/olap-physical/object-naming-rules-analysis-services?view=asallproducts-allversions
Exceptions: When Reserved Characters are Allowed
As noted, databases of a specific modality and compatibility level can have object names that include reserved characters. For tabular databases (compatibility level 1103 or higher) that allow the use of extended characters, dimension attribute, hierarchy, level, measure, and KPI object names can include reserved characters:
[Table: "Server mode and database compatibility level" vs. "Reserved characters allowed?"]
Databases can have a ModelType of Default. Default is equivalent to Multidimensional, and thus does not support the use of reserved characters in column names.

REST API - string or numerical identifier in URL

We're developing a REST API for our platform. Let's say we have organisations and projects, and projects belong to organisations.
After reading this answer, I would be inclined to use numerical IDs in the URL, so that some of the URLs would become (say, with a prefix of /api/v1):
/organisations/1234
/organisations/1234/projects/5678
However, we want to use the same URL structure for our front-end UI, so that if you type these URLs into the browser, you get the relevant web page in the response instead of a JSON document, much in the same way you see the names of people and organisations in URLs on sites like Facebook or GitHub.
Using this, we could get something like:
/organisations/dutchpainters
/organisations/dutchpainters/projects/nightwatch
It looks like GitHub actually exposes their API in the same way.
The advantages and disadvantages I can come up with for using names instead of IDs for URL definitions, are the following:
Advantages:
More intuitive URLs for end users
1 to 1 mapping of front end UI and JSON API
Disadvantages:
Have to use unique names
Have to take care of conflicts with reserved names, such as count, so that later on you can still add an API endpoint like /organisations/count and actually get the number of organisations rather than the organisation called count.
Especially the latter one seems to become a potential pain in the rear. Still, after reading this answer, I'm almost convinced to use the string identifier, since it doesn't seem to make a difference from a convention point of view.
My questions are:
Did I miss important advantages / disadvantages of using strings instead of numerical IDs?
Did GitHub develop their string-based approach after their platform matured, or did they know from the start that it would imply some limitations (like the one I mentioned earlier; it seems that they did not implement such functionality)?
It's common to use a combination of both:
/organisations/1234/projects/5678/nightwatch
where the last part is simply ignored but used to make the URL more readable.
In your case, with multiple levels of collections you could experiment with this format:
/organisations/1234/dutchpainters/projects/5678/nightwatch
If somebody writes
/organisations/1234/germanpainters/projects/5678/wanderer
it would still map to the same project (5678), but that should be OK. This leaves room for editing the names without breaking URLs already out there. Also, the names don't have to be unique if you don't really need them to be.
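A minimal sketch of that routing scheme in Python (the paths mirror the examples above; the regex and function names are my own hypothetical choices). Only the numeric IDs are used for lookup; the readable names are purely decorative:

import re

# Match /organisations/<id>[/<name>][/projects/<id>[/<name>]], capturing
# only the numeric IDs and ignoring the decorative name segments.
PATTERN = re.compile(
    r"^/organisations/(?P<org_id>\d+)(?:/[^/]+)?"
    r"(?:/projects/(?P<project_id>\d+)(?:/[^/]+)?)?$"
)

def parse(path: str) -> dict:
    m = PATTERN.match(path)
    return {k: v for k, v in m.groupdict().items() if v} if m else {}

print(parse("/organisations/1234/dutchpainters/projects/5678/nightwatch"))
# {'org_id': '1234', 'project_id': '5678'}
print(parse("/organisations/1234/germanpainters/projects/5678/wanderer"))
# same IDs, so it resolves to the same project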
Reserved URL characters, such as ":", "/", "?", "#", "[", "]" and "@": these characters (and a few others) are "reserved" in URL syntax to carry special meaning, so that they are distinguishable from ordinary data in the URL. If a variable value within the path contains one or more of these reserved characters, it will break the path and generate a malformed request. You can work around reserved characters in query-string parameters by URL-encoding them, or sometimes by double-escaping them, but you cannot do so in path parameters.
https://www.serviceobjects.com/blog/path-and-query-string-parameter-calls-to-a-restful-web-service
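For what it's worth, percent-encoding a value before it goes into a URL looks like this in Python (a minimal sketch with a made-up value; passing safe="" encodes even "/", which matters for path segments):

from urllib.parse import quote

# Percent-encode a value destined for a URL; safe="" ensures even "/"
# is encoded instead of being treated as a path separator.
segment = quote("R&D: budget/2024?", safe="")
print(segment)  # R%26D%3A%20budget%2F2024%3F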
Consecutive numeric IDs are no longer recommended, because they make it very easy to guess records in your database, and someone might use that to obtain information they shouldn't have access to.
Numeric IDs are used because in the database they have fixed-length storage, which makes indexing easy. For example, in MySQL an INT is 4 bytes and a BIGINT is 8 bytes, so every number occupies the same space (100 takes the same room as 200), which makes records easy to index and search.
If you have a lot of entries in the database, indexing a VARCHAR field is a bad idea. You could use a fixed-width type like CHAR(32) and pad the difference with spaces, but then you have to add logic in your program to handle the padding when searching the database.
Another idea would be to use slugs, but here you should take into consideration that some records might end up with the same slug, depending on what you use to form it. https://en.wikipedia.org/wiki/Semantic_URL#Slug
I would recommend using UUIDs, since they all have the same length and resolve these issues easily.
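To make the trade-off concrete, here is a hypothetical Python sketch combining both ideas: a slug for readability, paired with a UUID that actually identifies the record (slugify and the example URL layout are my own assumptions):

import re
import uuid

def slugify(name: str) -> str:
    """Lower-case the name and collapse runs of non-alphanumerics to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

# Slugs can collide, so the UUID does the identifying and the slug is cosmetic.
record_id = uuid.uuid4()
print(f"/projects/{record_id}/{slugify('The Night Watch')}")
# e.g. /projects/0f8e0e2a-.../the-night-watch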

Lex reserved word rules versus lookup table

This web page suggests that if your lex program "has a large number of reserved words, it is more efficient to let lex simply match a string and determine in your own code whether it is a variable or reserved word."
My question is: More efficient where, and why? If it means compiling the lexer is faster, I don't really care about that because it is one step removed from the program which uses the lexer to parse input.
It seems that lex just uses your description to build a state machine that processes one character at a time. It does not seem logical that increasing the size of the state machine would necessarily make it any slower than using one rule for identifiers and then doing several string comparisons.
Additionally, if it turns out that there is some logical reason for this to make sense as an optimization, what would be considered a large number of reserved words? I have approximately 20, as compared to about 30 other rules for various things. Would that be considered a large number of reserved words? Should I attempt to use the same strategy for some of the other symbols?
I have attempted to google for a result, but the only relevant articles I found stated this strategy as though it were well-known without giving any reason.
In case it is relevant, I am using flex 2.5.35.
Edit: Here is another reference which claims that lex produces an inefficient scanner when asked to match several long literal strings. It also does not give a reason.
According to the flex manual, "[t]he speed of the scanner is independent of the number of rules or ... how complicated the rules are with regard to operators such as '*' and '|'."
The main performance losses are due to backtracking, which can be avoided by (among other things) using catch-all rules that match tokens which "start with" the offending token. For example, if your reserved words consist of the characters [a-zA-Z_] and you also have a rule matching identifiers of the form [a-zA-Z_][a-zA-Z_0-9]*, the identifier rule will catch any identifier that merely starts with the name of a reserved word, without the scanner having to back up and try to match again.
According to the FAQ, flex generates a deterministic finite automaton which "does all the matching simultaneously, in parallel." The result, as stated above, is that the speed of the scanner is independent of the number of rules. A chain of string comparisons, on the other hand, is linear in the number of reserved words.
As a result, the reserved word rules should, in fact, be considerably faster than a lookup table.
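For contrast, here is roughly what the lookup-table strategy from the question looks like, sketched in Python rather than in a flex action (the token names and reserved-word set are hypothetical). Note that with a hash set the per-token check is constant time on average, so the practical gap is smaller than a naive chain of string comparisons would suggest:

# A sketch of the lookup-table strategy: the scanner matches one generic
# identifier pattern, then a post-match check decides keyword vs. name.
# In a real flex scanner this check would live in the identifier rule's action.
RESERVED = {"if", "else", "while", "return", "int", "float"}

def classify(lexeme: str) -> str:
    # Membership in a hash set is O(1) on average, not a linear scan.
    return "KEYWORD" if lexeme in RESERVED else "IDENTIFIER"

print(classify("while"))   # KEYWORD
print(classify("whilst"))  # IDENTIFIER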

PostgreSQL naming conventions

Where can I find a detailed manual about PostgreSQL naming conventions? (table names vs. camel case, sequences, primary keys, constraints, indexes, etc...)
Regarding tables names, case, etc, the prevalent convention is:
SQL keywords: UPPER CASE
identifiers (names of databases, tables, columns, etc): lower_case_with_underscores
For example:
UPDATE my_table SET name = 5;
This is not written in stone, but the bit about lower-case identifiers is highly recommended, IMO. PostgreSQL treats identifiers case-insensitively when not quoted (it actually folds them to lower case internally), and case-sensitively when quoted; many people are not aware of this idiosyncrasy. If you always use lower case, you are safe. That said, it's acceptable to use camelCase or PascalCase (or UPPER_CASE), as long as you are consistent: either quote identifiers always or never (and this includes the schema creation!).
I am not aware of many more conventions or style guides. Surrogate keys are normally generated from a sequence (usually via the serial pseudo-type); if you create such sequences by hand, it is convenient to stick to the naming Postgres itself uses (tablename_colname_seq).
Note: PostgreSQL 10 introduced identity columns as a standard-conforming replacement for serial.
There isn't really a formal manual, because there's no single style or standard.
So long as you understand the rules of identifier naming you can use whatever you like.
In practice, I find it easier to use lower_case_underscore_separated_identifiers because it isn't necessary to "Double Quote" them everywhere to preserve case, spaces, etc.
If you wanted to name your tables and functions "#MyAṕṕ! ""betty"" Shard$42" you'd be free to do that, though it'd be a pain to type everywhere.
The main things to understand are:
Unless double-quoted, identifiers are case-folded to lower-case, so MyTable, MYTABLE and mytable are all the same thing, but "MYTABLE" and "MyTable" are different;
Unless double-quoted:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).
You must double-quote keywords if you wish to use them as identifiers.
In practice I strongly recommend that you do not use keywords as identifiers. At least avoid reserved words. Just because you can name a table "with" doesn't mean you should.
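If application or generated code ever does have to reference mixed-case or reserved identifiers, it's safer to let the driver do the quoting than to concatenate double-quotes by hand. A minimal sketch using psycopg2's sql module (the table and column names are hypothetical):

from psycopg2 import sql

# Compose a statement with safely double-quoted identifiers; psycopg2
# handles quoting and escaping, so "user" or "MyTable" are fine.
query = sql.SQL("SELECT {col} FROM {tbl}").format(
    col=sql.Identifier("user"),     # reserved word: will be rendered "user"
    tbl=sql.Identifier("MyTable"),  # mixed case: will be rendered "MyTable"
)
# query.as_string(conn) yields: SELECT "user" FROM "MyTable"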
The only two other answers here are six years old, and I'm not sure snake_case being the best case still holds; here's my take for modern times. (I'm also forgoing the extra complication of needing to double-quote; I think flow is more important than avoiding a minor inconvenience.)
Given that there are no strict guidelines or style guides, I'd say it is best to use the same case as the project code. For example, with an OOP approach in a language like JavaScript, table names would be in PascalCase and attributes in camelCase; with a functional approach, both would be camelCase. By convention, JS classes are PascalCase and attributes are camelCase anyway, so this lines up naturally.
On the other hand, if you are coding in Python with SQLAlchemy, it only makes sense to use snake_case names for function-derived models and PascalCase names for class-derived models. In both cases, attributes/columns should be snake_case.
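For instance, a typical SQLAlchemy declarative model keeps Python's and SQL's conventions side by side (a hypothetical model, shown only to illustrate the split):

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class UserAccount(Base):            # PascalCase class, per Python convention
    __tablename__ = "user_account"  # snake_case table, per SQL convention
    id = Column(Integer, primary_key=True)
    display_name = Column(String)   # snake_case column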

Using constants for message keys and database table names and column names

Recently there was a big debate during a code review session about the use of constants.
The developers had used constants for the following purposes:
Each and every message key used in the i18n application was declared as a constant. The application contained around 3,000 message keys, and hence the same number of constants.
Each and every database column name was declared as a constant. There were around 5,000 column names, and still counting...
Does it make sense to have such a huge number of constants in any application?
IMHO, common sense should prevail. Message keys just don't need to be declared as constants. We already have one level of indirection; why add one more?
Regarding database column names, I have mixed opinions. If a column is used in multiple classes, does it make sense to declare its name as a global constant?
Please pour in with your thoughts...
If I18N message keys aren't defined as constants, how do you enforce consistency? How do you automatically differentiate between a typo and a missing value? How do you audit to make sure that all I18N keys are fulfilled in each new language file?
As to database columns, you could definitely use some indirection - if your application knows about column names, you've got a binding problem. So there, you might consider a config file with the actual column names - but of course, you would want to refer to the column names by symbolic keys, which should be defined as auditable constants, just like the I18N keys.
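A minimal Python sketch of that kind of audit, with hypothetical keys: the keys are declared once as constants, so a typo fails fast, and each language file can be checked against the full set:

# Message keys declared once as constants; a typo in code fails fast
# instead of silently rendering a missing translation.
MSG_GREETING = "home.greeting"
MSG_FAREWELL = "home.farewell"

ALL_KEYS = {MSG_GREETING, MSG_FAREWELL}

def audit(translations: dict[str, str]) -> set[str]:
    """Return the keys a given language file is missing."""
    return ALL_KEYS - translations.keys()

print(audit({"home.greeting": "Hallo"}))  # {'home.farewell'}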
I think it is good practice to declare the message keys used for i18n as constants.
I don't see much benefit in doing the same for the DB columns if you have a well-designed persistence layer.
This depends on the programming language, I think.
In PHP it's not uncommon to use defines (a.k.a. constants) for such things, while I'd not do this in Java or C#.
In most projects we tried to extract the SQL into templates, so that not just the table and column names but the whole SQL statement was configurable. We used Velocity for basic templating mechanics like variables, small loops, and so on.
Regarding the language constants:
Another layer doesn't make much sense to me, but you have to choose your identifiers for the language translation carefully. Using the whole English sentence as the key can end up in a lot of work for the translators: if you fix the wording of the English sentence without changing its meaning, all the translators have to update their files.
If the constant is used in multiple places and the compiler can really catch the problem for you, then yes.