Enforce printable characters only in text fields - postgresql

The text data type in a PostgreSQL database (encoding UTF-8) can contain any UTF-8 character. These include a number of control characters (https://en.wikipedia.org/wiki/Unicode_control_characters).
While I agree there are cases where control characters are needed, there is little to no use for them in ordinary attributes like a person's name, a document number, etc. In fact, allowing such characters to be stored in the DB can lead to nasty problems, because the characters are not visible and the value of the attribute is not what it appears to be to the end user.
As the problem seems to be very general, is there a way to prevent control characters in text fields? Maybe there is a special text type (like citext for case-insensitive text)? Or should this behaviour be implemented as a domain? Are there any other options? All I could find was people talking about finding these characters using regex.
I could not find any general recommendations for solving the problem, so maybe I'm missing something obvious here.

The exact answer will depend on what you consider printable.
However, a domain is the way to go. If you want to go with what your database collation considers a printable character, use a domain like this:
CREATE DOMAIN printable_text AS text CHECK (VALUE !~ '[^[:print:]]');
SELECT 'a'::printable_text;
printable_text
════════════════
a
(1 row)
SELECT E'\u0007'::printable_text; -- bell character (ASCII 7)
ERROR: value for domain printable_text violates check constraint "printable_text_check"
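What counts as "printable" differs between tools, so it can help to check candidate values outside the database as well. The following is only a rough client-side sketch, not a replacement for the domain: it uses Python's str.isprintable(), which rejects characters in the Unicode "Other" and "Separator" categories except the ASCII space, and that definition may not match what your PostgreSQL collation treats as [:print:]. The helper name nonprintable_chars is hypothetical.

```python
import unicodedata

def nonprintable_chars(value: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, general category) for every character
    that str.isprintable() rejects."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.category(ch))
        for i, ch in enumerate(value)
        if not ch.isprintable()
    ]

print(nonprintable_chars("John\u0007 Doe"))  # bell character inside a name
print(nonprintable_chars("John\u200bDoe"))   # zero width space
print(nonprintable_chars("John Doe"))        # nothing to report
```

Reporting the position and category makes it easier to decide which categories the database-side check should reject.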

Related

Are periods in object names bad practice?

For example, a constraint for a default value of 0 could be named DF__tablename.columnname.
Although my search for this being bad practice doesn't yield results, in the numerous constraints examples I've seen on SO and many other sites, I never spotted a period.
Using period in an object name is bad practice.
Don't use dot character in an identifier. Yes it can be done but the drawbacks outweigh any benefits.
tl;dr
Special characters, such as a dot, are not allowed in regular identifiers. If an identifier does not follow the rules for regular identifier, then references to the identifier must be enclosed in square brackets (or ANSI double quotes).
https://learn.microsoft.com/en-us/sql/relational-databases/databases/database-identifiers?view=sql-server-2017
The period (dot character) is not allowed in a regular identifier, but it can be used in an identifier enclosed in square brackets.
The dot character is even more of a special-ish character in SQL; it's used to separate an identifier from a preceding qualifier.
SELECT mytable.mycolumn FROM mytable
We could also write that as
SELECT [mytable].[mycolumn] FROM mytable
We could also write
SELECT [mytable.mycolumn] FROM mytable
but that means something very different. With that, we aren't referencing a column named mycolumn, we are now referencing an identifier that contains a dot character.
SQL Server will deal with this just fine.
But if we do this, and start using the dot character in our identifiers, we will be causing confusion and frustration to future readers. Any benefit we would gain by using dot characters in identifiers is going to be far outweighed by the downside for others.
For the same reason, we don't create tables named WHERE (1=1) OR, or columns named SUBSTR(foo.bar,1,10), so that we avoid monstrosities like
SELECT [SUBSTR(foo.bar,1,10)] FROM [WHERE (1=1) OR]
This may be valid SQL, but it will cause future readers to become very upset, and cause them to curse us, our descendants and loved ones. Don't make them do that. For the love of all that is good and beautiful in this world, don't use dot characters in identifiers.
It is perfectly valid to have periods in the object names. However, this requires you to use square brackets around the object name when referring to it. In case you forget these square brackets you will get some error messages that can be less intuitive to the inexperienced developer. For this reason I recommend not to use periods in the object names. I would also guess this is the main reason you don't often see examples of periods in object names on the internet.
In your example, you could use another underscore instead of the period, like this: DF__tablename_columnname

Is there a special character that cannot be typed or copied by user, but can be inserted/read by code into/from text?

I need to have a temporary delimiter, inserted server-side, that cannot possibly exist in content created by user.
The purpose for this is to have prepared content for CSV export, with configurable value delimiter, that will replace this untypeable character client-side, right before the export.
Does such character even exist?
There is no character that cannot possibly exist; however there are many characters (in particular control codes - those lower than decimal 32, excluding cr/lf/tab) that are extremely unlikely to exist in any reasonable text content. This is why escaping is often required in text-based protocols. There is no reserved space of characters that will be escaped in CSV, other than those already used in CSV itself.
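One way to act on this is sketched below. It assumes U+001F (the ASCII unit separator) is chosen as the temporary delimiter: the character is stripped from user input so it cannot appear in content, and it is swapped for the configured delimiter at export time. The function names are hypothetical, and the export step is written in Python here only to show the idea.

```python
import csv
import io

TEMP_DELIM = "\x1f"  # ASCII unit separator; an assumption, any stripped control code would do

def sanitize(value: str) -> str:
    """Remove the temporary delimiter from user-supplied text."""
    return value.replace(TEMP_DELIM, "")

def prepare_row(values: list[str]) -> str:
    """Join sanitized values with the temporary delimiter (server side)."""
    return TEMP_DELIM.join(sanitize(v) for v in values)

def export_row(prepared: str, delimiter: str = ",") -> str:
    """Split on the temporary delimiter and emit one CSV line; the csv module
    quotes any value that contains the real delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(prepared.split(TEMP_DELIM))
    return buf.getvalue()

row = prepare_row(["Doe, John", "42\x1f", "note"])
print(export_row(row), end="")       # "Doe, John",42,note
print(export_row(row, ";"), end="")  # Doe, John;42;note
```

The guarantee comes from the sanitizing step, not from the character itself: the delimiter is safe only because it is removed from everything the user submits.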
The zero-width joiner (U+200D) is an invisible Unicode character: it is present in the text but nothing is displayed for it. You can use that! :)

How do I create a character set like ASCII?

I'm curious about the way that in the past it was implemented and I want to get information about how can I implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" character set, and remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127). In the 8-bit code pages built on top of it, the most significant bit of a byte indicates whether the byte is an ASCII character (if zero) or a character from the current code page (if one).
Extra (non-ascii) characters were then added to these codepages, and the user's computer would load a specific codepage to use. Unfortunately this meant that you needed to load the correct codepage before viewing a file or the wrong characters would appear.
We have now moved on, and most systems use Unicode, usually stored in a variable-length encoding such as UTF-8 (rather than the single-byte characters used previously). Unicode can represent well over a hundred thousand characters, so a single character set covers what would have required multiple code pages under the old ASCII-plus-code-page method.
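Both points can be seen with Python's built-in codecs. This is a small sketch; cp1252 and cp437 are just an example pair of code pages, chosen here for illustration.

```python
# The same byte means different characters under different code pages.
raw = "ä".encode("cp1252")   # a single byte with the high bit set
print(raw.hex())             # e4
print(raw.decode("cp1252"))  # ä
print(raw.decode("cp437"))   # Σ (wrong character: wrong code page loaded)

# UTF-8 uses between one and four bytes per character.
for ch in ["A", "ä", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
```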
That's the brief history. As to how to create your own character set, I'm not sure what you are trying to achieve. You can create your own fonts, but if you're talking about an actual character set (i.e. characters that do not already exist), then you'll have to get your characters added to a standard such as Unicode so that other computers can make use of them, which would be a considerable amount of work (and I have no idea how you'd even go about it). It's worth considering, however, that almost every character in existence is already in Unicode, so you may want to review what's already been done before taking on a mammoth undertaking such as creating an entirely new character set.

Replace characters with multi-character strings

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.
Is there a more elegant way to do that? Currently it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
While trying different commands I ran into another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only ü (lowercase) was replaced correctly. It must have something to do with Unicode, but how do I solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, make sure the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution. A couple of umlauts are best replaced with a chain of nested replace() calls.
As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
Among other reasons, I decided to run the replacement from Python. As Erwin wrote above, there seems to be no better solution than combining replace() calls.
In general it is pretty simple; no explicit encoding handling was needed. My "final" solution now looks like this:
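# Uppercase German and Danish special characters to transliterate.
ger_UE = "Ü"
ger_AE = "Ä"
ger_OE = "Ö"
ger_SS = "ß"
dk_AA = "Å"
dk_OE = "Ø"
dk_AE = "Æ"
# Assumes cur is an open psycopg2-style cursor (%s placeholders). The characters are
# passed as bound parameters instead of being formatted into the SQL string.
cur.execute(
    """SELECT addr,
              REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                  addr, %s, 'UE'), %s, 'OE'), %s, 'AE'), %s, 'SS'), %s, 'AA'), %s, 'OE'), %s, 'AE')
       FROM search WHERE x = '1';""",
    (ger_UE, ger_OE, ger_AE, ger_SS, dk_AA, dk_OE, dk_AE),
)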
I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.
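If the replacement is done on the Python side rather than in SQL, str.translate accepts a mapping from single characters to strings of any length, so one pass handles all substitutions. This is a minimal sketch: the mapping mirrors the uppercase characters above, and the lowercase entries and the lowercase "ss" for ß are assumptions added for illustration.

```python
# Map each special character to its multi-character transliteration.
TRANSLIT = str.maketrans({
    "Ü": "UE", "Ä": "AE", "Ö": "OE",
    "ü": "ue", "ä": "ae", "ö": "oe",
    "ß": "ss",
    "Å": "AA", "Ø": "OE", "Æ": "AE",
})

def transliterate(addr: str) -> str:
    """Replace umlauts and related letters in a single pass."""
    return addr.translate(TRANSLIT)

print(transliterate("Münchner Straße"))  # Muenchner Strasse
print(transliterate("ÅRHUS"))            # AARHUS
```

Whether this is faster than the nested REPLACE() calls depends on where the data lives; pulling a large table into Python just to transliterate it may cost more than doing the work in the database.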

What Unicode characters are dangerous?

What Unicode characters (more precisely codepoints) are dangerous and should be blacklisted and prohibited for the users to use?
I know that BIDI override characters and the "zero width space" are very prone to make problems, but what others are there?
Thanks
Characters aren’t dangerous: only inappropriate uses of them are.
You might consider reading things like:
Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
RFC 3454: Preparation of Internationalized Strings (“stringprep”)
It is impossible to guess what you mean by dangerous.
A golden rule in security is to whitelist instead of blacklist: rather than trying to cover all bad characters, it is a much better idea to validate that the user only uses known good characters.
There are solutions that help you build the large whitelist that is required for international whitelisting. For example, in .NET there is UnicodeCategory.
The idea is that instead of whitelisting thousands of individual characters, the library assigns them to categories such as alphanumeric characters, punctuation, control characters, and so on.
Tutorial on whitelisting international characters in .NET
Unicode Regex: Categories
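Python's unicodedata module exposes the same Unicode general categories. Below is a minimal sketch of a category-based whitelist, assuming that letters, marks, numbers, punctuation, and the plain ASCII space are acceptable for a name field. The allowed set and the helper name is_allowed are illustrative choices, not a recommendation.

```python
import unicodedata

# Major category classes to accept: Letter, Mark, Number, Punctuation.
ALLOWED_MAJOR = {"L", "M", "N", "P"}

def is_allowed(value: str) -> bool:
    """True if every character is an ASCII space or belongs to an allowed category."""
    return all(
        ch == " " or unicodedata.category(ch)[0] in ALLOWED_MAJOR
        for ch in value
    )

print(is_allowed("Zoë O'Brien"))       # True
print(is_allowed("Renée\u202eeéneR"))  # False: U+202E is a format character (Cf)
print(is_allowed("tab\tseparated"))    # False: U+0009 is a control character (Cc)
```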
'HANGUL FILLER' (U+3164)
Since Unicode 1.1 in 1993, there has been an empty, wide character that renders as blank space.
We can't see it, and we can't copy/paste it on its own because we can't select it!
It needs to be generated, for example with the Unix keyboard shortcut CTRL + SHIFT + u followed by 3164.
It can pretty much 💩 up anything: variables, function names, URLs, file names, DNS names (by mimicking them), hash strings (by invalidating them), database entries, blog posts, logins, accounts (by allowing fake identical ones), etc.
DEMO 1: Altering variables
The variable declared as hijacked below contains a Hangul Filler character in its name; the console.log call uses the name without that character, so it is a different identifier:
const normal = "Hello w488ld"
const hijaㅤcked = "Hello w488ld"
console.log(normal)
console.log(hijacked)
DEMO 2: Hijacking URLs
These 3 URLs lead to xn--stackoverflow-fr16ea.com:
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
See Unicode Security Considerations Report.
It covers various aspects, from spoofing of rendered strings to dangers of processing UTF encodings in unsafe languages.
U+2800 BRAILLE PATTERN BLANK - a Braille character without any "dots". It looks like a regular "space" but is not classified as one.
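Both U+3164 and U+2800 show why a whitespace check alone misses such characters. A quick look at their properties with Python's unicodedata (a sketch; the helper name describe is hypothetical):

```python
import unicodedata

def describe(ch: str) -> tuple[str, str, str, bool]:
    """Return code point, Unicode name, general category, and whether
    Python's str.isspace() treats the character as whitespace."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch), ch.isspace())

for ch in ["\u3164", "\u2800", " "]:
    print(describe(ch))
```

The Hangul filler is classified as a letter (Lo) and the blank Braille pattern as a symbol (So), so neither is caught by stripping whitespace, and a letter-based whitelist would still accept U+3164.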