Is there a special character that cannot be typed or copied by user, but can be inserted/read by code into/from text? - special-characters

I need to have a temporary delimiter, inserted server-side, that cannot possibly exist in content created by user.
The purpose for this is to have prepared content for CSV export, with configurable value delimiter, that will replace this untypeable character client-side, right before the export.
Does such character even exist?

There is no character that cannot possibly exist; however there are many characters (in particular control codes - those lower than decimal 32, excluding cr/lf/tab) that are extremely unlikely to exist in any reasonable text content. This is why escaping is often required in text-based protocols. There is no reserved space of characters that will be escaped in CSV, other than those already used in CSV itself.

Zero-width joiner is a unicode invisible kind of character which exist but do not exist. You can use that! :)

Related

Enforce printable characters only in text fields

The text data type in PostgreSQL database (encoding utf-8) can contain any UTF-8 character. These include a number of control characters (https://en.wikipedia.org/wiki/Unicode_control_characters)
While I agree there are cases when the control characters are needed, there is little (to none) of use of these characters in normal attributes like persons name, document number etc. In fact allowing such characters to be stored in DB can lead to nasty problems as the characters are not visible and the value of the attribute is not what it seems to be to the end user.
As the problem seems to be very general, is there a way to prevent control chars in text fields? Maybe there is a special text type (like citext for case-incencitive text)? Or should this behaviour be realized as a domain? Are there any other options? All I could find people talk is finding these characters using regex.
I could not find any general recommendations to solving the problem so maybe I'm missing something obvious here.
The exact answer will depend on what you consider printable.
However, a domain is the way to go. If you want to go with what your database collation considers a printable character, use a domain like this:
CREATE DOMAIN printable_text AS text CHECK (VALUE !~ '[^[:print:]]');
SELECT 'a'::printable_text;
printable_text
════════════════
a
(1 row)
SELECT E'\u0007'::printable_text; -- bell character (ASCII 7)
ERROR: value for domain printable_text violates check constraint "printable_text_check"

What is this character: 🔖 ? Where can I see the similar characters?

🔖
I am not sure whether everyone can see the above character, but I can see it. I got it when I input "booknote" in Chinese on my iPhone. To my surprise, this character seems "platform-insensative", it can be seen on my phones, chrome on laptop, and even in MacOS terminal.
Is it an ASCII character? I've never seen colorful characters like this before. Since when these have been around? And where I can get a list of similar characters?
Here: http://www.unicode.org/charts/nameslist/index.html
You put the character on an HTML page. All characters on an HTML page are from the Unicode character set. Characters that are not in the Unicode character set either soon will be or are too specialized to be of general use.
The Unicode Consortium occasionally publishes a new version of the character set. Since you ask about the kind of character, the common partitions of the character set are blocks, categories, and—stretching a bit—which version the character was added in. Some characters are in a script (for a language writing system), some are not. You see the block and category of 🔖 at http://www.fileformat.info/info/unicode/char/1f516/index.htm.
The Unicode character set is published in text files called the Unicode Character Database (UCD), as well as many supplementary documents and webpages. The data includes important information about usage and relationships. For example, for applicable characters, which character is considered the uppercase form of another in a particular language.
To see any character, you have to use a font that presents it. This can be a problem for some characters. There is probably no one font that presents every Unicode character as it was meant to be.
You mentioned ASCII. Although it used every day in HTTP headers and other specialized and historical applications, ASCII is such a limited character set that it hasn't generally been used in decades.

filter specific unicode character in SAS

I need to replace a specific unicode character in SAS, exactly the U+0191 with a whitespace or blank. How can I do it by COMPRESS ? Thanks in advance.
You should use the KCOMPRESS function rather than COMPRESS for compressing unicode characters, as it is considered safer for Unicode and DBCS environments.
However, it sounds like you actually want to TRANSLATE, or more accurately KTRANSLATE, which actually replaces characters with whitespace or other characters (as opposed to removing them, as COMPRESS does).
Here's an example:
data have;
charvar = "Ƒellow Americans";
fixed_charvar = translate(charvar,'F','Ƒ');
kfixed_charvar= ktranslate(charvar,'F','Ƒ');
put _all_;
run;
Here I convert U+0191 to a normal F; of course you can convert to space as you wish (Replace the 'F' with whatever you want it converted to).
This will work in an instance of SAS set up in Unicode mode; if you're running in WLATIN1 or similar, you may have more difficulty, particularly with actually passing SAS the U+0191 character.

How do I create a character set like ASCII?

I'm curious about the way that in the past it was implemented and I want to get information about how can I implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" characterset, and remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127) with the most significant bit of a byte indicating if the rest of the byte refers to ASCII (if zero) or the current Codepage.
Extra (non-ascii) characters were then added to these codepages, and the user's computer would load a specific codepage to use. Unfortunately this meant that you needed to load the correct codepage before viewing a file or the wrong characters would appear.
We have now moved on, and most systems use Unicode which is a variable character length (rather than the single-byte characters used previously) which can contain thousands upon thousands of characters, allowing for a single encoding to cater for what would have been multiple codepages using the ASCII+Codepage method of old.
That's the brief history; As to how to create your own characterset, I'm not sure what you are trying to achieve - You can create your own fonts, but if you're talking about an actual characterset (i.e. characters that do not already exist) then you'll have to get your characterset added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it) -- It's worth considering, however, that almost every character in existence already exists in Unicode so you may want to review what's already been done before you try and take on a mammoth undertaking such as creating an entirely new characterset.

Allowed characters in CSS 'content' property?

I've read that we must use Unicode values inside the content CSS property i.e. \ followed by the special character's hexadecimal number.
But what characters, other than alphanumerics, are actually allowed to be placed as is in the value of content property? (Google has no clue, hence the question.)
The rules for “escaping” characters are in the CSS 2.1 specification, clause 4.1.3 Characters and case. The special rules for quoted strings, as in content property value, are in clause 4.3.7 Strings. Within a quoted string, any character may appear as such, except for the character used to quote the string (" or '), a newline character, or a backslash character \.
The information that you must use \ escapes is thus wrong. You may use them, and may even need to use them if the character encoding of the document containing the style sheet does not let you enter all characters directly. But if the encoding is UTF-8, and is properly declared, then you can write content: '☺ Я Ω ⁴ ®'.
As far as I know, you can insert any Unicode character. (Here's a useful list of Unicode characters and their codes.)
To utilize these codes, you must escape them, like so:
U+27BA Becomes \27BA
Or, alternatively, I think you may just be able to escape the character itself:
content: '\➺';
Source: http://mathiasbynens.be/notes/css-escapes