Each Unicode code point is assigned to a character category. I'm looking for a list of the ranges that have the category "Letter". Ideally it would be a CSV in the format "FROM_CODEPOINT;TO_CODEPOINT" containing all ranges that define letters.
It looks like the Unicode Consortium publishes a database: the UnicodeData.txt file contains the character category for every assigned code point, so I can extract the ranges from there with a simple utility.
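A minimal sketch of such a utility in Java, assuming a copy of UnicodeData.txt (from https://www.unicode.org/Public/UNIDATA/UnicodeData.txt) sits in the working directory: it treats every General Category starting with "L" (Lu, Ll, Lt, Lm, Lo) as "Letter", expands the "<..., First>"/"<..., Last>" convention used for blocks like the CJK ideographs, merges consecutive code points, and prints FROM;TO in hex.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class LetterRanges {
        public static void main(String[] args) throws IOException {
            List<int[]> ranges = new ArrayList<>();
            int runStart = -1;   // first code point of the current run of letters
            int runEnd = -1;     // last code point added to the current run

            for (String line : Files.readAllLines(Paths.get("UnicodeData.txt"))) {
                if (line.isEmpty()) continue;
                String[] f = line.split(";", -1);
                int cp = Integer.parseInt(f[0], 16);
                boolean letter = f[2].startsWith("L");   // Lu, Ll, Lt, Lm, Lo

                if (f[1].endsWith(", Last>")) {
                    // second half of a "<..., First>"/"<..., Last>" pair:
                    // extend the current run up to the "Last" code point
                    if (letter && runStart >= 0) runEnd = cp;
                } else if (letter && runStart >= 0 && cp == runEnd + 1) {
                    runEnd = cp;                         // contiguous letter: extend the run
                } else {
                    if (runStart >= 0) ranges.add(new int[] { runStart, runEnd });
                    runStart = letter ? cp : -1;         // start a new run, or none
                    runEnd = cp;
                }
            }
            if (runStart >= 0) ranges.add(new int[] { runStart, runEnd });

            for (int[] r : ranges)
                System.out.printf("%04X;%04X%n", r[0], r[1]);
        }
    }

Unassigned code points simply don't appear in the file, so a gap in the numbering ends the current run automatically.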
Related
I am trying to use the SOH character as a delimiter for a CSV file that my code generates. However, it looks like there are two Unicode characters for SOH:
https://www.compart.com/en/unicode/U+2401
https://www.compart.com/en/unicode/U+0001
I am not sure what the difference between the two is, or which one I should use.
U+0001 is the control character itself. U+2401 (SYMBOL FOR START OF HEADING, from the Control Pictures block) is a symbolic picture of that character.
Example: ␁ (May not display in all browsers, but is a single pictograph of SOH)
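So for the delimiter itself, use the real control character U+0001; U+2401 exists only so the control character can be shown visibly. A minimal sketch in Java (the field values are made up):

    // Join fields with the real SOH control character, U+0001.
    String record = String.join("\u0001", "field1", "field2", "field3");

    // U+2401 (␁) is only a printable stand-in; useful when you need to *show*
    // the delimiter in logs or documentation, not as the delimiter itself.
    String visible = record.replace('\u0001', '\u2401');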
Given two Unicode strings encoding a first and last name (in Japanese or Chinese), what would be the best approach to tell whether the first/last name is Chinese or Japanese?
For example, is it possible to tell if the following are Chinese or Japanese names?
任天堂
金城武
唐泽西
白川轩
竹中宇
叶山明
林慧梦
No. It is impossible to tell the language of a string from its raw character content alone. Names written entirely in Han ideographs use code points that are shared between Chinese and Japanese (Han unification), so the characters themselves carry no language information.
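The closest you can get from the characters alone is a script check, which only helps when kana are present. A sketch using Java's Character.UnicodeScript (the returned strings are my own labels):

    import java.lang.Character.UnicodeScript;

    class ScriptGuess {
        // Hiragana or katakana imply Japanese text; Han-only strings (like every
        // example name above) remain ambiguous between Chinese and Japanese.
        static String guessByScript(String name) {
            boolean hasKana = false, hasHan = false;
            for (int i = 0; i < name.length(); ) {
                int cp = name.codePointAt(i);
                i += Character.charCount(cp);
                UnicodeScript script = UnicodeScript.of(cp);
                if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) hasKana = true;
                if (script == UnicodeScript.HAN) hasHan = true;
            }
            if (hasKana) return "probably Japanese";
            if (hasHan)  return "Han only: could be Chinese or Japanese";
            return "no CJK content";
        }
    }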
If a DICOM file does not define a Specific Character Set (0008,0005), what character set does it use by default? Is ASCII the default encoding for DICOM files?
TL;DR
A DICOM file contains a German ä in one of its tags, but the file does not specify any character set. I assume that in this case the file is only allowed to contain ASCII characters (the default character set), so I report the file as invalid. Before I submit my change, I want to make sure that I understood DICOM correctly.
As specified in DICOM PS3.5 (Data Structures and Encoding):
6.1.2.5.4 Levels of Implementation and Initial Designation
a) Attribute Specific Character Set (0008,0005) not present:
7-bit code
Implementation level: ISO 2022 Level 1 - Elementary 7-bit code (code-level identifier 1)
Initial designation: ISO-IR 6 (ASCII) as G0.
Code Extension shall not be used
Reference:
http://dicom.nema.org/medical/dicom/current/output/chtml/part05/chapter_6.html#sect_6.1.2.5.4
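In other words, without (0008,0005) only the 7-bit default repertoire (ISO-IR 6) applies, so an ä is indeed out of range. A minimal sketch of the check described in the question (class and method names are mine; it only tests the 7-bit constraint, not which control characters a particular VR allows):

    class DefaultRepertoireCheck {
        // With no Specific Character Set (0008,0005), text values must stay within
        // the 7-bit ISO-IR 6 repertoire (ASCII as G0, no code extension).
        static boolean isSevenBit(byte[] value) {
            for (byte b : value) {
                if ((b & 0x80) != 0) return false;   // high bit set, e.g. 0xE4 ('ä' in Latin-1)
            }
            return true;
        }
    }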
To add to JonnyQ's answer: the DICOM standard also defines a mechanism for implementations confronted with character sets that are unknown to them or unsupported (see PS 3.5, Section 6.1.2.3). Implementations may print or display such characters by replacing each unknown character with the four characters "\nnn", where "nnn" is the three-digit octal representation of the byte.
An example given in the standard, for an ASCII-based machine, is as follows:
Character String: Günther
Encoded representation: 04/07 15/12 06/14 07/04 06/08 06/05 07/02
ASCII-based machine: G\374nther
Implementations may also encounter Control Characters which they have no means to print or display. Applications may print or display such Control Characters by replacing them with the four characters "\nnn", where "nnn" is the three-digit octal representation of each byte.
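A sketch of that fallback rendering (my illustration, not code from the standard): every byte outside printable ASCII is replaced by a backslash plus its three octal digits, which is how the Latin-1 byte 0xFC for ü becomes \374 in the "G\374nther" example above.

    class OctalFallback {
        // Replace every byte outside printable ASCII with "\nnn" (three octal digits),
        // as PS 3.5 Section 6.1.2.3 describes for unknown or undisplayable bytes.
        static String escapeUnknownBytes(byte[] value) {
            StringBuilder out = new StringBuilder();
            for (byte b : value) {
                int u = b & 0xFF;
                if (u >= 0x20 && u < 0x7F) {
                    out.append((char) u);                               // printable ASCII, keep
                } else {
                    out.append('\\').append(String.format("%03o", u));  // e.g. 0xFC -> \374
                }
            }
            return out.toString();
        }
    }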
What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict user input in the application to code points that are legitimately used in written language. Because the input will be saved and displayed, I do not want to allow pranksters to submit text consisting entirely of things like diacritics, Unicode combining characters, Unicode directional formatting characters, etc.
Regrettably, I am not fluent in every language found in Unicode. Has anyone compiled a list of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in Chapter 4 of the Unicode Standard, specifically in Section 4.5; see the General Category table there (page 131 of the standard, page 12 of the chapter PDF). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, and Pc, Pd, Ps, et al. are various kinds of punctuation. The first letter of each two-letter abbreviation is the major class: L (letter), M (mark), N (number), P (punctuation), S (symbol), Z (separator), or C (other).
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
Ranges like these use the "<..., First>"/"<..., Last>" convention instead of one line per code point (and the exact end points move as new characters are assigned). The extracted/DerivedGeneralCategory.txt file in the Unicode Character Database also lists General Category assignments directly as ranges.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
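If you define it that way, note that Java already exposes the same General Category data through Character.getType, so you don't have to parse UnicodeData.txt at all. A sketch of a category whitelist; the particular set of categories here is my own guess and would need tuning, and many scripts genuinely need the mark categories:

    class WritingFilter {
        // Accept a code point only if its General Category is in a whitelist of
        // categories that occur in ordinary running text.
        static boolean isAllowed(int codePoint) {
            switch (Character.getType(codePoint)) {
                case Character.UPPERCASE_LETTER:          // Lu
                case Character.LOWERCASE_LETTER:          // Ll
                case Character.TITLECASE_LETTER:          // Lt
                case Character.MODIFIER_LETTER:           // Lm
                case Character.OTHER_LETTER:              // Lo
                case Character.NON_SPACING_MARK:          // Mn
                case Character.COMBINING_SPACING_MARK:    // Mc
                case Character.DECIMAL_DIGIT_NUMBER:      // Nd
                case Character.DASH_PUNCTUATION:          // Pd
                case Character.START_PUNCTUATION:         // Ps
                case Character.END_PUNCTUATION:           // Pe
                case Character.INITIAL_QUOTE_PUNCTUATION: // Pi
                case Character.FINAL_QUOTE_PUNCTUATION:   // Pf
                case Character.OTHER_PUNCTUATION:         // Po
                case Character.SPACE_SEPARATOR:           // Zs
                    return true;
                default:
                    return false;
            }
        }
    }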
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer, especially the part about removing characters from MES-1, which was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski
I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs in UTF-16 content. Does that mean that there are no alphanumeric characters that are represented as surrogate pairs?
Characters outside the BMP can be letters. (Michael Kaplan recently discussed a bug in the classification of the character U+1F48C.) But IsCharAlphaNumeric cannot see characters outside the BMP (for the reasons you noted), so you cannot obtain classification information for them that way.
If you have a surrogate pair, call GetStringType with cchSrc = 2 and check for C1_ALPHA and C1_DIGIT.
Edit: The second half of this answer is incorrect; GetStringType does not support surrogate pairs.
By looking at the Unicode plane assignments, you can determine for yourself what you are missing by not being able to inspect non-BMP code points.
For example, you won't be able to identify Imperial Aramaic characters as alphanumeric. Shame.
Does that mean that there are no alphanumeric characters that are surrogate pairs?
No, there are supplementary code points that are in the Letter category.
The underlying problem is testing a single char where a full code point is needed. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
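A quick sketch of doing the test per code point instead (the string literal is my own example; U+20000 is a CJK ideograph with General Category Lo):

    String s = "\uD840\uDC00";                        // the surrogate pair for U+20000
    int cp = s.codePointAt(0);                        // 0x20000, the full code point

    System.out.println(Character.isLetter('\uD840')); // false: a lone surrogate is not a letter
    System.out.println(Character.isLetter(cp));       // true: the int overload classifies the
                                                      // supplementary code point correctly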