I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic. However, I cannot find them in the Unicode Character Database no matter how I search for them. Can somebody provide a list of them or just a search facility for characters with specified Unicode properties?
The Unicode Character Database comprises all the text files in the distribution. It is not just a single file as it once was long ago.
The Alphabetic property is a derived property.
You really do not want to use code point ranges for this. You want to use the property properly; there are simply too many code points to enumerate. Using the unichars script, we learn that there are more than ten thousand in the Basic Multilingual Plane alone, not counting Han or Hangul:
$ unichars '\p{Alphabetic}' | wc -l
10052
If we include the other 16 astral planes, now we’re at fourteen thousand:
$ unichars -a '\p{Alphabetic}' | wc -l
14736
And if we include Han and Hangul, which the Alphabetic property in fact does, we just blew past a hundred thousand code points:
$ unichars -ua '\p{Alphabetic}' | wc -l
101539
I hope you can see that you do not want to specifically enumerate these using code point ranges. Down that road lies madness.
By the way, if you find the unichars script useful,
you might also like the uniprops script and perhaps the uninames script.
Derived Core Properties can be calculated from the other properties.
The Alphabetic property is defined as: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
So, if you take all the characters in Lu, Ll, Lt, Lm, Lo, Nl, and all the characters with the Other_Alphabetic property, you will have the Alphabetic characters.
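If you have perl handy, you can cross-check that formula directly, since its regex engine exposes both the derived property and the pieces it is generated from. A rough sketch (the exact counts depend on the Unicode version your perl ships with, and looping over the full range takes a few seconds):

use strict;
use warnings;

# Count code points matching the derived property versus the union it is
# generated from; the two totals should agree.
my ($alphabetic, $union) = (0, 0);
for my $cp (0 .. 0x10FFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip the surrogate range
    my $ch = chr $cp;
    $alphabetic++ if $ch =~ /\p{Alphabetic}/;
    $union++      if $ch =~ /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Other_Alphabetic}]/;
}
print "Alphabetic: $alphabetic, union: $union\n";

Both counters should come out equal, which is just the "Generated from" line restated in code.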
Quoting your source: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
These abbreviations seem to be explained here.
I found the UniView web application which provides a nice search interface. Searching for the Letter property (with Local unchecked) gives 14723 results...
This is a follow-up to this question. I'm interested in different glyphs for the same character, also known as "Unicode Compatibility Characters".
Let's take the following two Arabic "reversed-character" words: كلمة ةملك
First word is:
كلمة
in hex code:
0643 0644 0645 0629
Second word is:
ةملك
in hex code:
0629 0645 0644 0643
If I paste those two words in Microsoft Word using Deja Vu Sans, I get this:
With the following pseudo-code using FreeType2, I get:
FT_Face face;
FT_New_Face(library, "DejaVuSans.ttf", 0, &face);
FT_GlyphSlot slot;
for (/* each_character in the word, one code point at a time */) {
    // Loads the glyph for the bare code point; no contextual shaping happens here.
    FT_Load_Char(face, each_character, FT_LOAD_RENDER);
    slot = face->glyph;
    // Use slot->bitmap.buffer
}
FT_Done_Face(face);
What am I missing? How can I get the right glyphs depending on the context?
My key issue is that I store each "character" (I should say glyph, but until now I treated character and glyph as equivalent) in a table, so this is going to be complicated. I'm limited by speed, not by space. Can I have two different Unicode characters for the same logical character?
libraqm is a solution for getting the glyph for each character depending on its position in the sentence. But I'm still interested in getting the character corresponding to the glyph (I know it's not a 1-to-1 relation). For instance, there are 4 characters for the 4 glyphs of the letter Kaf, as stated in the comment above.
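If what you want in the reverse direction is the base letter behind a presentation-form code point, the compatibility decompositions in the Unicode data already encode that mapping. A small Perl sketch, using the four Kaf presentation forms U+FED9..U+FEDC as the example (this only helps for the legacy presentation-form characters; a shaping engine such as libraqm or HarfBuzz picks glyphs, not characters):

use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Map the isolated/final/initial/medial presentation forms of Kaf
# (U+FED9 .. U+FEDC) back to the plain letter via compatibility normalization.
for my $cp (0xFED9 .. 0xFEDC) {
    my $base = NFKC(chr $cp);
    printf "U+%04X -> U+%04X\n", $cp, ord $base;    # each one prints "-> U+0643"
}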
I am trying to grep across a list of tokens that include several non-ASCII characters. I want to match only emoji; other characters such as ð or ñ are fine. The Unicode range for emoji appears to be U+1F600-U+1F1FF, but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f
You need to specify the code points with full value (not 1F6 but 1F600) and wrap them with curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
The Unicode range for emoji is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emoji are placed in many blocks. If you want to cover almost all emoji:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see the emoji sequence data on unicode.org. The List of emoji article on Wikipedia also shows the characters in ordered tables, but it might not list the latest ones.
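If perl is an option, another way to avoid hand-maintained ranges is to match by block name. A rough sketch covering a few of the blocks listed above (it deliberately does not try to be complete; blocks such as Dingbats or Miscellaneous Symbols would still need to be added):

use strict;
use warnings;
use utf8;                              # the literals below are UTF-8 in this source
binmode STDOUT, ':encoding(UTF-8)';

# A few of the blocks from the range list above, referenced by name instead of by code point.
my $emoji_blocks = qr/[\p{Block=Emoticons}\p{Block=Miscellaneous_Symbols_and_Pictographs}\p{Block=Transport_and_Map_Symbols}\p{Block=Supplemental_Symbols_and_Pictographs}]/;

for my $token ('😀', 'ñandú') {
    print "$token: ", ($token =~ $emoji_blocks ? "contains an emoji" : "no emoji"), "\n";
}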
You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U). The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches and produce more precise results.
ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII/UTF-8 when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, and 1250 to 1258. ugrep searches text and binary files and produces hexdumps for binary matches.
The Unicode ranges for emoji are larger than just U+1F1FF to U+1F600. See the official Unicode 12 emoji chart: https://unicode.org/emoji/charts-12.0/full-emoji-list.html
Disclaimer:
I have found several examples on this site that address questions/problems similar to mine, though unfortunately I was not able to figure out the modifications that would be needed to fit my case.
The "Problem":
I have a list of servers (VMs) whose names have a UUID embedded in them. I need to get rid of that in order to obtain the "pure/clean" server name. The problem is precisely that: I need to get rid of the UUID (which has a very specific and constant format, more details below) and ONLY that, nothing else.
The UUID - as you might already know or have noticed - has a specific and constant format which consists of the following parts:
It starts with a dash (-).
Which is followed by a group of 8 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a group of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a group of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a group of 4 alphanumeric characters (letters are always lowercase).
Which is followed by a dash (-).
Which is followed by a group of 12 alphanumeric characters (letters are always lowercase).
Samples of results achieved using "my" "code":
In this case the result is the expected one:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
In this case the result is the expected one too:
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f_OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022_OLD
In this case the result is not the expected one (expected: PRODSERVER0022-OLD):
echo PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f-OLD | sed 's/-[a-z0-9]*//g'
PRODSERVER0022
In this case the result is not the expected one either (expected: PRODSERVER00-22):
echo PRODSERVER00-22-872151c8-1a75-43fb-9b63-e77652931d3f-old | sed 's/-[a-z0-9]*//g'
PRODSERVER00
I know that, in the sed universe, a . means "any character", while a * means "zero or more of the preceding item". However, what I would need here, as I see it at least, is a way to tell sed to do the replacement only if this specific sequence is present: 8 alphanumeric characters (any, but exactly 8, no more, no fewer), followed by a dash, then 4 alphanumeric characters (any, but exactly 4), and so on. So the question is: is there a regex construction (or a combination of several, through piping if it has to be that way) that can achieve the expected results in this case?
Note that even though server names may contain additional dashes (-), the resulting substrings will never be 8 or 4 characters long. They might end up being 12 characters long, which would superficially match the last part of the UUID; however, such a substring will not be at the end of the name, so that can be used to tell the two 12-character substrings apart (and it is not a problem anyway if there is a regex combination that removes the UUID as a whole).
Try this to match the UUID:
-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}
Embed it in the sed command line in the usual way. As Benjamin W. has said, we need to use extended regular expressions (the -E option).
sed -E 's/-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}//g'
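If you want to sanity-check the pattern before wiring it into sed, the same regex works unchanged in Perl's engine. A quick sketch against the examples from the question:

use strict;
use warnings;

# The UUID pattern from the answer above, used outside of sed.
my $uuid = qr/-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/;

for my $name ('PRODSERVER0022-872151c8-1a75-43fb-9b63-e77652931d3f-OLD',
              'PRODSERVER00-22-872151c8-1a75-43fb-9b63-e77652931d3f-old') {
    (my $clean = $name) =~ s/$uuid//g;
    print "$clean\n";    # PRODSERVER0022-OLD, then PRODSERVER00-22-old
}

Note that the lowercase "-old" suffix in the second example survives, because on its own it does not have the 8-4-4-4-12 shape.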
I am trying to find and replace single characters in a number of text files in a directory. Apologies for the possible duplication but I have not been able to find an answer in other sed threads.
I used Homebrew to install gnu-sed, and I'm using the command:
find . -name "*.txt" -exec gsed -i -e 's/ñ/–/g' '{}' \;
I have a 'test' file in the directory containing the characters I need to replace, and these are all found and replaced correctly. But the same characters in other text files are not, e.g. 'Weíre to Denmark ñ all' (the ñ isn't found/replaced either).
Why might this be? How can I fix it? Thank you!
Edit - output of od -c for a file that works and for one that does not:
$ od -c filethatworks.txt | head -2
0000000 – ** ** \n – ** ** \n “ ** ** \n “ ** ** \n
0000020 — ** ** \n — ** ** \n - \n “ ** ** \n “ **
$ od -c filethatdoesnot.txt | head -2
0000000 T h o s e b l e s s e d d a
0000020 y s o f s u m m e r a r e
For a file that works, the file command returns
test.txt: UTF-8 Unicode text
and for one that does not:
ca001_mci_17071971.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
Characters are human concepts. When characters are to be represented in computer files, they need to be encoded. Encoding associates each character with an integer called a code point.
For example, take the character "ă" (that's lower case "a" with a breve on top, used in Romanian spelling for the vowel /ə/); in the old days of MS-DOS, we quite often used an encoding called "code page 852", where "ă" has the code point 199. Then Windows came, and on Windows we often used an encoding called "code page 1250", where "ă" has the code point 227. Then came Unicode, and in Unicode "ă" has the code point 259.
Since Unicode code points can have values larger than 255, there must be a way to represent them using bytes with values between 0 and 255. Those methods are called "Unicode Transformation Formats" (UTF), of which the most widely used are UTF-8 (very popular in Linux) and UTF-16 (of two kinds, little and big endian, and very popular on Windows). In UTF-8, "ă" is represented as two bytes, with the values 196 and 131 (by the rules of UTF-8, those two bytes together represent code point 259); in little endian UTF-16, "ă" is represented by two bytes, with the values 3 and 1 (by the rules of little endian UTF-16, those two bytes together represent code point 259).
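If you want to see those byte values for yourself, here is a tiny Perl illustration using the Encode module that ships with perl:

use strict;
use warnings;
use Encode qw(encode);

my $ch = "\x{0103}";    # "ă", Unicode code point 259
my @utf8    = map { ord } split //, encode('UTF-8',    $ch);
my @utf16le = map { ord } split //, encode('UTF-16LE', $ch);
print "UTF-8 bytes:    @utf8\n";       # UTF-8 bytes:    196 131
print "UTF-16LE bytes: @utf16le\n";    # UTF-16LE bytes: 3 1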
The point is that in order to make sense of a text file you need to know (1) what encoding is used, and (2) in the case of Unicode, what transformation format is used. Now, on Linux and on the Web we are very close to a consensus that all text is represented in UTF-8; nevertheless, old files still exist, and occasionally new files come from Windows, so there is a very nice program called iconv (available both on Linux and on Windows) which is used to translate text files from one encoding into another.
For example, assuming that your problematic file is encoded in Windows-1250 (one of the encodings the Windows documentation calls ANSI, although the American National Standards Institute had nothing to do with it), you could say
iconv -f windows-1250 -t utf-8 ca001_mci_17071971.txt | gsed -e 's/ñ/–/g'
Sadly, there is no way to use gsed -i in this pipeline; you must write to a temporary output file and then rename the temporary file on top of the source file, of course only after checking that everything went well.
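If you would rather keep the whole thing in one script than pipe iconv into gsed, the temporary-file-and-rename dance can also be done in a few lines of Perl. A sketch, reusing the file name and the Windows-1250 assumption from above:

use strict;
use warnings;
use utf8;    # so the 'ñ' and '–' literals below are read as characters, not bytes

my $file = 'ca001_mci_17071971.txt';

open my $in,  '<:encoding(cp1250)', $file       or die "$file: $!";
open my $out, '>:encoding(UTF-8)',  "$file.tmp" or die "$file.tmp: $!";

while (my $line = <$in>) {
    $line =~ s/ñ/–/g;                  # the replacement from the question
    print {$out} $line;
}

close $in  or die "close: $!";
close $out or die "close: $!";
rename "$file.tmp", $file or die "rename: $!";    # only rename once everything went well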
Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)
Is this an XY problem? If you want to compare/match two Unicode strings and you're worried that different ways of representing the accented characters would create false negatives, then the best approach is to normalize the two strings using one of the normalization functions from Unicode::Normalize before doing the comparison or match.
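For the comparison case that is only a couple of lines. A minimal sketch with Unicode::Normalize, using two of the spellings of U+1EAD from the question:

use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{1EAD}";                    # U+1EAD
my $decomposed  = "\x{0061}\x{0323}\x{0302}";    # a + combining dot below + combining circumflex

# Canonically equivalent strings normalize to the same form.
print "canonically equivalent\n" if NFD($precomposed) eq NFD($decomposed);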
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining-character names within the full name (e.g. CIRCUMFLEX) against the combining characters' real names (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
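For what it's worth, the name-splitting step described above looks something like this (only the splitting; it does not attempt the hard part of mapping CIRCUMFLEX back to COMBINING CIRCUMFLEX ACCENT):

use strict;
use warnings;
use charnames ();

my $name = charnames::viacode(0x1EAD);
# LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
my ($base, $rest) = split / WITH /, $name, 2;
my @modifiers = defined $rest ? split(/ AND /, $rest) : ();

print "base:      $base\n";        # LATIN SMALL LETTER A
print "modifiers: @modifiers\n";   # CIRCUMFLEX DOT BELOW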
This would be my naive attempt, there may be better ways of doing this, but since so far no one has volunteered the information...