Where can I find a detailed manual about PostgreSQL naming conventions? (table names vs. camel case, sequences, primary keys, constraints, indexes, etc...)
Regarding table names, case, etc., the prevalent convention is:
SQL keywords: UPPER CASE
identifiers (names of databases, tables, columns, etc): lower_case_with_underscores
For example:
UPDATE my_table SET name = 5;
This is not written in stone, but the bit about lower-case identifiers is highly recommended, IMO. PostgreSQL treats identifiers case-insensitively when not quoted (it actually folds them to lowercase internally), and case-sensitively when quoted; many people are not aware of this idiosyncrasy. If you always use lowercase, you are safe. That said, it's acceptable to use camelCase or PascalCase (or UPPER_CASE), as long as you are consistent: either quote identifiers always or never (and this includes the schema creation!).
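To see the folding in action (a minimal sketch):

CREATE TABLE my_table (name text);

-- All three statements hit the same table; unquoted identifiers fold to lowercase:
SELECT name FROM my_table;
SELECT name FROM My_Table;
SELECT name FROM MY_TABLE;

-- This one fails: quoted identifiers are case-sensitive, and no table
-- was created with that exact mixed-case name:
SELECT name FROM "My_Table";  -- ERROR: relation "My_Table" does not exist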
I am not aware of many more conventions or style guides. Surrogate keys are normally generated from a sequence (usually via the serial pseudo-type); if you create such sequences by hand, it is convenient to stick to the naming PostgreSQL uses itself (tablename_colname_seq).
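For instance (a sketch; the table names are arbitrary):

CREATE TABLE items (id serial PRIMARY KEY);
-- implicitly creates the sequence items_id_seq

-- the hand-made equivalent, following the same naming convention:
CREATE SEQUENCE orders_id_seq;
CREATE TABLE orders (
    id integer NOT NULL DEFAULT nextval('orders_id_seq')
);
ALTER SEQUENCE orders_id_seq OWNED BY orders.id;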
See also some discussion here, here and (for general SQL) here, all with several related links.
Note: PostgreSQL 10 introduced identity columns as an SQL-conforming replacement for serial.
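For example (sketch):

CREATE TABLE items (
    id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
);
-- the backing sequence still follows the items_id_seq naming pattern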
There isn't really a formal manual, because there's no single style or standard.
So long as you understand the rules of identifier naming you can use whatever you like.
In practice, I find it easier to use lower_case_underscore_separated_identifiers because it isn't necessary to "Double Quote" them everywhere to preserve case, spaces, etc.
If you wanted to name your tables and functions "#MyAṕṕ! ""betty"" Shard$42" you'd be free to do that, though it'd be a pain to type everywhere.
The main things to understand are:
Unless double-quoted, identifiers are case-folded to lower-case, so MyTable, MYTABLE and mytable are all the same thing, but "MYTABLE" and "MyTable" are different;
Unless double-quoted:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($).
You must double-quote keywords if you wish to use them as identifiers.
In practice I strongly recommend that you do not use keywords as identifiers. At least avoid reserved words. Just because you can name a table "with" doesn't mean you should.
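A quick illustration of why (toy example, not a recommendation):

CREATE TABLE with (id integer);    -- ERROR: syntax error at or near "with"
CREATE TABLE "with" (id integer);  -- works, but now every reference must quote it
SELECT * FROM "with";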
The only two answers here are six years old, and I'm not sure snake_case being the best case still holds. Here's my take for modern times. I'm also setting aside the extra complication of needing to double-quote; I think flow is more important than avoiding a minor inconvenience.
Given that there are no strict guidelines or style guides, I'd say it is best to use the same case as your project code. For example, with an OOP approach in a language like JavaScript, table names would be in PascalCase whereas attributes would be in camelCase. If you're taking the functional approach, they'd both be camelCase. This also matches JS convention, where classes are PascalCase and attributes are camelCase anyway.
On the other hand, if you are coding in Python using SQLAlchemy, then it only makes sense to use snake_case names for function-derived models and PascalCase names for class-derived models. In both cases, attributes/columns should be snake_case.
The SQL language, and PostgreSQL 9+ in particular, has many ways to do the same thing... But in many circumstances (see the Context Notes below for a rationale) we need to cut down on that diversity and opt for one standard way.
There is a tendency to adopt the text data type instead of varchar, making it "the standard way to express strings" in PostgreSQL (!) and avoiding time lost in project discussions and casts between similar formats...
But how can we use text while preserving a size-limit constraint?
I use CHECK(char_length(field)<N) and have had no problem changing the limit in a live environment, so it is perhaps the best way... Is it?
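Concretely, the pattern is (a sketch; the table and column names are placeholders):

CREATE TABLE product (
    description text CHECK (char_length(description) < 280)
);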
Some variations: in general, what is the best choice?
1. In CREATE TABLE:
1.1. CHECK after the data type, inline, just like a default value definition. Is this the best practice?
1.2. CHECK after all column definitions. Usual for multi-column checks like CHECK(char_length(col1)<N1 AND char_length(col2)<N2).
1.2.1. Some people also like to express all individual CHECKs this way, to avoid "polluting" the column declarations.
2. Use in a trigger: is there any advantage?
3. Other ways... any other relevant one?
1.1, 1.2, 2 or 3: what is the best practice? (See the sketch below for 1.1 and 1.2.)
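In code, variations 1.1 and 1.2 look like this (sketch):

-- 1.1: column constraint, inline with the data type
CREATE TABLE t1 (
    col1 text CHECK (char_length(col1) < 100)
);

-- 1.2: table constraint, after all column definitions
CREATE TABLE t2 (
    col1 text,
    col2 text,
    CHECK (char_length(col1) < 100 AND char_length(col2) < 200)
);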
CONTEXT NOTES
In projects and teams with KISS or convention-over-configuration demands, we need "good practice" recommendations... I was looking for them in the context of CREATE TABLE ... text/varchar and project maintenance. There is no unbiased "good practices" recommendation on the horizon: Stack Overflow voting is the only reasonable record of this kind of recommendation.
Convention scope
(edit) For individual use, of course, as @ConsiderMe commented, "no matter what you choose, as long as you stick with it throughout the entire time there will be no problem with it".
This question, on the other hand, is about "SQL community" or "PostgreSQL community" best-practice conventions.
I like to keep the code as short as possible, so I'd go with length(string) in the CHECK constraint. I do not see a particular use for char_length in this case; it just takes up more "code space".
Internally, they both map to textlen anyway.
You should be careful about characters that take more than one byte. In that case I would use octet_length. As an example, consider the character ą, which returns 1 for char_length and 2 for octet_length. Migrations between database systems with different length enforcement have been a pain for exactly this reason.
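The difference is easy to demonstrate (in a UTF-8 database):

SELECT char_length('ą'), octet_length('ą');
-- char_length: 1 (one character), octet_length: 2 (two bytes)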
I believe that a good source for "best practices" would be to follow documentation.
It says that a CHECK constraint written inline with a column is a column constraint, bound to that particular column.
It also describes the table constraint, which is written separately from any column definition and can enforce data correctness across several columns.
Basically, in the projects I'm involved in, I follow this rule for readability and maintenance purposes.
I wouldn't even consider creating a trigger for such things. To me, triggers are designed for much more complex tasks; I don't see a reason to enforce simple data-correctness rules in them.
I can't think of any other solution that would be as basic as the standard ones while still doing its simple job as well as those mentioned above.
The Depesz article on which this reasoning was based is outdated. The only argument against varchar(N) was that changing N required a table rewrite, and as of Postgres 9.2, this is no longer the case.
This gives varchar(N) a clear advantage, as increasing N is basically instantaneous, while changing the CHECK constraint on a text field will involve re-checking the entire table.
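For example (sketch; the table and column names are placeholders, though t_name_check matches the name PostgreSQL generates for an inline column CHECK):

-- widening a varchar: a metadata-only change since 9.2, effectively instantaneous
ALTER TABLE t ALTER COLUMN name TYPE varchar(500);

-- widening a CHECK on a text column: drop and re-add, which re-checks every row
ALTER TABLE t DROP CONSTRAINT t_name_check;
ALTER TABLE t ADD CONSTRAINT t_name_check CHECK (char_length(name) < 500);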
I am wary of using quoted identifiers in PostgreSQL when my "business" table or column names happen to be reserved words. Are there any standard or best-practice mangling approaches? I would like to have a consistent mangling mechanism as there's also some code-generation in place and need to be able to automatically translate back and forth between mangled names used in PostgreSQL and non-mangled names used in code.
The simplest, and probably best, way to "mangle" terms to guarantee all names are not reserved words is to prefix them with an underscore character.
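For example (sketch): user and order are reserved words, but the prefixed forms need no quoting, and demangling is just stripping the leading underscore:

CREATE TABLE _user (
    _id integer,
    _order integer
);
SELECT _id FROM _user WHERE _order > 0;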
The best way is to not use reserved words; let the DBA choose the names for any clashes between business terms and DB terms.
Also consider terms that might clash with reserved words in your application language.
I like to reserve the underscore prefix (_name) for parameters and variables in server-side functions to rule out conflicts. I have seen that a lot lately among people working with functions.
For identifiers of database objects I pick the names by hand. Working with non-English terms (reduced to ASCII letters) avoids most of the conflicts.
Another alternative is to use the plural form for table names, since most reserved words are in singular form (a few exceptions!).
Most 1-letter prefixes would do the job without exception. Like: n_name. But since I hand-pick identifiers, there is no need for automation.
Anything is good that avoids the error-prone need for double-quotes. I never use reserved words even if Postgres would allow them.
I am designing a rest api where users can pass in queries using a search query language I will define.
The language will allow a number of operators eq, ne, gt, lt (equals, not equals, greater than, less than) etc etc.
The language will allow grouping and logical operators AND and OR.
So for example a query about companies may look like the following
/api/companies?q=(CompanyName eq Microsoft Or CompanyName eq Apple) And State eq California
So this should give me all companies where company name equals 'Microsoft' or 'Apple' and the state is California.
So this all works fine, except for the fact that the system I am writing the API against is extremely flexible and allows almost any character to be inserted into field values. Additionally, I must also support custom fields, and those are able to have special characters in the field name.
Initially my main concern was fields that contained parentheses. I will be converting this query into a SQL Server query, and I need a way to ensure that I do not confuse a parenthesis in a field value with one that is intended for grouping. My second thought was to force field values to be quoted, but I think this would cause similar problems.
I was also considering that there might be a simple approach involving HTML encoding, but I am unable to see exactly how that would work.
What I am looking for is any advice or examples of reasonable approaches to handle a rest search query with such flexible data.
You should use percent encoding to escape characters in your query string, see RFC 3986. This previous StackOverflow post contains some useful background information about URI encoding.
Initially my main concern was fields that contained parentheses. I will be converting this query into a SQL Server query and I need a way to ensure that I do not confuse a parenthesis in a field value with one that is intended for grouping.
If this might be a problem, then it sounds like your application will be susceptible to SQL injection. You should be escaping, or better yet parameterizing, any external data before constructing an SQL query.
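Better than hand-escaping is to keep values out of the SQL string entirely via bind parameters. A minimal sketch, shown in PostgreSQL's SQL-level PREPARE/EXECUTE syntax for illustration (the companies table and its columns are hypothetical; in application code you would use your driver's placeholder mechanism instead):

PREPARE company_search(text, text, text) AS
    SELECT *
    FROM companies
    WHERE (company_name = $1 OR company_name = $2)
      AND state = $3;
EXECUTE company_search('Microsoft', 'Apple', 'California');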
/api/companies?q=(CompanyName eq Microsoft Or CompanyName eq Apple) And State eq California
Based on this example you could take advantage of the URI query string to better represent your query:
/api/companies?CompanyName=Microsoft%20OR%20Apple&State=California
Here is an example.
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/
This web page suggests that if your lex program "has a large number of reserved words, it is more efficient to let lex simply match a string and determine in your own code whether it is a variable or reserved word."
My question is: More efficient where, and why? If it means compiling the lexer is faster, I don't really care about that because it is one step removed from the program which uses the lexer to parse input.
It seems that lex just uses your description to build a state machine that processes one character at a time. It does not seem logical that increasing the size of the state machine would necessarily make it any slower than using one rule for identifiers and then doing several string comparisons.
Additionally, if it turns out that there is some logical reason for this to make sense as an optimization, what would be considered a large number of reserved words? I have approximately 20, as compared to about 30 other rules for various things. Would that be considered a large number of reserved words? Should I attempt to use the same strategy for some of the other symbols?
I have attempted to google for a result, but the only relevant articles I found stated this strategy as though it were well-known without giving any reason.
In case it is relevant, I am using flex 2.5.35.
Edit: Here is another reference which claims that lex produces an inefficient scanner when asked to match several long literal strings. It also does not give a reason.
According to the flex manual, "[t]he speed of the scanner is independent of the number of rules or ... how complicated the rules are with regard to operators such as '*' and '|'."
The main performance losses are due to backtracking. This can be avoided by (among other things) using catch-all rules which will match tokens that "start with" the offending token. For example, if you have a list of reserved words made up of characters in [a-zA-Z_], followed by a rule for matching identifiers of the form [a-zA-Z_][a-zA-Z_0-9]*, the identifier rule will catch any identifier that starts with the name of a reserved word, without having to back up and try to match again.
According to the FAQ, flex generates a deterministic finite automaton which "does all the matching simultaneously, in parallel." The result, as said above, is that the speed of the scanner is independent of the number of rules. String comparison, on the other hand, is linear in the number of reserved words.
As a result, the reserved word rules should, in fact, be considerably faster than a lookup table.
Recently there was a big debate during a code review session on the use of constants.
The developers had used constants for the following purposes:
Each and every message key used in the i18n application was declared as a constant. The application contained around 3,000 message keys, and hence the same number of constants.
Each and every database column name was declared as a constant. There were around 5,000 column names, and still counting...
Does it make sense to have such a huge number of constants in any application?
IMHO, common sense should prevail. Message keys just don't need to be declared as constants. We already have one level of indirection - why add one more?
Regarding database column names, I have mixed opinions. If a column is being used in multiple classes, does it make sense to declare it as a global constant?
Please pour in with your thoughts...
If I18N message keys aren't defined as constants, how do you enforce consistency? How do you automatically differentiate between a typo and a missing value? How do you audit to make sure that all I18N keys are fulfilled in each new language file?
As to database columns, you could definitely use some indirection - if your application knows about column names, you've got a binding problem. So there, you might consider a config file with the actual column names - but of course, you would want to refer to the column names by symbolic keys, which should be defined as auditable constants, just like the I18N keys.
I think it is good practice to declare the message keys used for i18n as constants.
I don't see much benefit in doing the same for the DB columns if you have a well-designed persistence layer.
This depends on the programming language, I think.
In PHP it's not uncommon to use defines, a.k.a. constants, for such things, while I'd not do this in Java or C#.
In most projects we tried to extract the SQL into templates, so that not only the table and column names but the whole SQL statement was configurable. We used Velocity for basic templating mechanics like variables, small loops, and so on.
Regarding the language constants:
Another layer doesn't make much sense to me, but you have to choose your identifiers for the language translations carefully. Using the whole English sentence as the key can mean a lot of work for translators: if you fix the wording of the English sentence without changing its meaning, all translators would have to update their files.
If the constant is used in multiple places and the compiler really catches the problem, yes.