remove one unicode range from a unicode textfile in Python3

I have a Unicode textfile consisting of ranges q to r and s to t. I want to remove range s to t (which is in Chinese) leaving q to r (which is in English).
How can I do that in Python3?

Use the string translate method. To quote from the 3.1.3 Std Lib doc:
str.translate(map)
Return a copy of the s where all characters have been mapped through the map
which must be a dictionary of Unicode ordinals (integers) to Unicode ordinals,
strings or None. Unmapped characters are left untouched. Characters mapped to
None are deleted.
You can use str.maketrans() to create a translation map from
character-to-character mappings in different formats.
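As a minimal sketch of the approach quoted above: build a table that maps every code point in the unwanted range to None, then pass it to str.translate. The helper name strip_range is hypothetical, and the range U+4E00 to U+9FFF (CJK Unified Ideographs) is only an assumed stand-in for "s to t"; substitute the real endpoints.

```python
def strip_range(text, first, last):
    """Delete every character whose code point lies in [first, last]."""
    # dict.fromkeys maps each ordinal to None, which translate() treats as "delete".
    table = dict.fromkeys(range(ord(first), ord(last) + 1))
    return text.translate(table)


# Assumed range: CJK Unified Ideographs. Adjust to the actual s..t range.
cleaned = strip_range("hello 你好 world", "\u4e00", "\u9fff")
```

For a file, the same function could be applied to each line read with open(path, encoding='utf-8') and the result written to a second file.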

Related

Adding the combining overline unicode character

I'm writing a program that converts an integer to a Roman numeral.
Roman numerals over 3999 are overlined, so IV overlined is 4000, CM overlined is 900,000, etc. These lines can stack.
So as to not limit my program, stopping it at just 3999 isn't good enough.
The question is, how do I add the "combining overline" unicode character to my string to achieve this?
My program is written in Rust, but I suspect the solution is similar across most languages that support unicode strings.

Just add the combining mark after each character.
Here's a Python example. What you see depends on support for combining marks in your console/IDE/browser.
with open('test.txt', 'w', encoding='utf-8-sig') as f:
    print('I\u0305V\u0305', file=f)
Output:
I̅V̅
In testing, U+0305 COMBINING OVERLINE could stack up to two, but Chrome drew incorrectly for three. There is also U+033F COMBINING DOUBLE OVERLINE.

You can just use them in string constants, either with the Unicode escape sequence (here shown for Rust) or directly (as they can be easily represented in UTF-8 source code files):
println!("I\u{0305}V\u{0305} - I̅V̅");
Note however, that each letter with overline requires two Unicode codepoints. So they do not fit into a single char. You need to use a string.
The combining overline character itself fits into a single character:
let combining_overline = '\u{0305}';
To apply it, insert it after the base character that needs the overline.
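A small Python sketch of "insert it after the base character", applied to a whole numeral. The function name overline and the levels parameter are hypothetical; levels=2 stacks two marks, which the earlier answer reports may not render well everywhere.

```python
COMBINING_OVERLINE = "\u0305"  # U+0305 COMBINING OVERLINE


def overline(text, levels=1):
    """Append `levels` combining overlines after every character of `text`."""
    return "".join(ch + COMBINING_OVERLINE * levels for ch in text)


four_thousand = overline("IV")
```

Each overlined letter is then two code points long, so len(four_thousand) is 4, not 2.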

unicode:characters_to_list doesn't seem to work for a utf8 list

I am trying to convert a UTF-8 string to a list of Unicode code points with the Erlang library "unicode". My input data is the string "АБВ" (a Russian string whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? According to the specification, the input data can be either a binary or a list of chars, so as far as I can tell I am doing everything right.

The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and binaries encoded in InEncoding. It returns a list of Unicode characters. Characters in Erlang are integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op as long as Data is a list of integers). But when you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of six characters; since they are already code points, the output is exactly the same as the input.
There is no byte type in Erlang, but you are assuming that unicode:characters_to_list/2 will accept a list of bytes and behave correctly.
To sum up: there are two usual ways to represent a string in Erlang, bitstrings and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in one of these representations (or a combination of them), encoded in InEncoding, and converts it to a list of Unicode code points.
If you have a list [208,144,208,145,208,146] like in your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
The unicode module supports only Unicode and Latin-1. Thus, since the function expects Unicode or Latin-1 code points, characters_to_list does not need to do anything with a flat list of code points. However, the list may be deep (unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).), which is the reason the Data argument supports the list datatype.

<<208,144,208,145,208,146>> is an UTF-8 binary.
[208,144,208,145,208,146] is a list of bytes (not code points).
[1040,1041,1042] is a list of code points.
You are passing a list of bytes, but the function wants a list of chars or a binary.
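The bytes-versus-code-points distinction can be illustrated outside Erlang as well. The following Python sketch is only an analogy, not the Erlang API: it interprets the same six integers once as UTF-8 bytes and once as code points. The variable names are hypothetical.

```python
raw = [208, 144, 208, 145, 208, 146]

# Interpretation 1: the integers are UTF-8 bytes (like the Erlang binary).
from_bytes = [ord(ch) for ch in bytes(raw).decode("utf-8")]

# Interpretation 2: the integers are already code points (like the Erlang list).
# Each one becomes its own character, so nothing is decoded.
from_code_points = [ord(ch) for ch in "".join(chr(n) for n in raw)]
```

The first interpretation yields the three Cyrillic code points; the second returns the six integers unchanged, which mirrors the behaviour seen in the question.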

Unicode alphanumeric character range

I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs for UTF16 content. Does that mean that there are no alphanumeric characters that are surrogate pairs?

Characters outside the BMP can be letters. (Michael Kaplan recently discussed a bug in the classification of the character U+1F48C.) But IsCharAlphaNumeric cannot see characters outside the BMP (for the reasons you noted), so you cannot obtain classification information for them that way.
If you have a surrogate pair, call GetStringType with cchSrc = 2 and check for C1_ALPHA and C1_DIGIT.
Edit: The second half of this answer is incorrect; GetStringType does not support surrogate pairs.

You can determine yourself by looking at the Unicode plane assignment what you are missing by not being able to inspect non-BMP codepoints.
For example, you won't be able to identify Imperial Aramaic characters as alphanumeric. Shame.

Does that mean that there are no alphanumeric characters that are surrogate pairs?
No, there are supplementary code-points that are in the letter group.
Comparing a char to a code-point?
For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
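The same point can be checked in Python, which classifies whole code points rather than UTF-16 code units. This is a sketch for illustration, not the Windows or Java API. U+20000 is chosen because its UTF-16 high surrogate is the D840 value mentioned above; the variable names are hypothetical.

```python
supplementary = "\U00020000"  # a CJK ideograph outside the BMP
aramaic_aleph = "\U00010840"  # IMPERIAL ARAMAIC LETTER ALEPH

# The UTF-16 encoding of U+20000 is the surrogate pair D840 DC00.
utf16_units = supplementary.encode("utf-16-be")

# The full code point is a letter; a lone high surrogate is not.
whole_is_letter = supplementary.isalpha()
half_is_letter = "\ud840".isalpha()
```

An API that inspects one 16-bit unit at a time only ever sees the second case.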

what is the way to represent a unichar in lua

Suppose I need the following Python value, the Unicode character with code point 0:
>>> unichr(0)
u'\x00'
How can I define it in Lua?

There isn't one.
Lua has no concept of a Unicode value. Lua has no concept of Unicode at all. All Lua strings are 8-bit sequences of "characters", and all Lua string functions will treat them as such. Lua does not treat strings as having any Unicode encoding; they're just a sequence of bytes.
You can insert an arbitrary number into a string. For example:
"\065\066"
Is equivalent to:
"AB"
The \ notation is followed by up to three decimal digits (or one of the escape characters), and the resulting value must be less than or equal to 255. Lua is perfectly capable of handling strings with embedded \000 characters.
But you cannot directly insert Unicode codepoints into Lua strings. You can decompose the codepoint into UTF-8 and use the above mechanism to insert the codepoint into a string. For example:
"x\226\131\151"
This is the x character followed by U+20D7 COMBINING RIGHT ARROW ABOVE.
But since no Lua functions actually understand UTF-8, you will have to expose some function that expects a UTF-8 string in order for it to be useful in any way.
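To work out the decimal escapes for a given code point, one option is to let another tool do the UTF-8 decomposition. The Python sketch below prints the \ddd form that a Lua string literal accepts; the helper name lua_decimal_escape is hypothetical.

```python
def lua_decimal_escape(code_point):
    """Return the code point's UTF-8 bytes as Lua-style \\ddd escapes."""
    utf8 = chr(code_point).encode("utf-8")
    # Always emit three digits so a following digit cannot be misread.
    return "".join("\\%03d" % byte for byte in utf8)


arrow = lua_decimal_escape(0x20D7)
```

Here arrow holds the text \226\131\151, matching the literal used in the answer.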

How about
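function unichr(ord)
    -- Returns a printable escape for the code point, not the encoded character itself.
    if ord == nil then return nil end
    if ord < 32 then return string.format('\\x%02x', ord) end      -- control characters
    if ord < 127 then return string.char(ord) end                  -- printable ASCII
    if ord < 65536 then return string.format("\\u%04x", ord) end   -- rest of the BMP
    if ord <= 1114111 then return string.format("\\u%08x", ord) end -- up to U+10FFFF
end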

While native Lua does not directly support or handle Unicode, its strings are really buffers of arbitrary bytes that by convention hold ASCII characters. Since strings may contain any byte values, it is relatively straightforward to build support for Unicode on top of native strings. Should byte buffers prove to be insufficiently robust for the purpose, one can also use a userdata object to hold anything, and with the addition of a suitable metatable, endow it with methods for creation, translation to a desired encoding, concatenation, iteration, and anything else that is needed.
There is a page at the Lua User's Wiki that discusses various ways to handle Unicode in Lua programs.

For a more modern answer, Lua 5.3 now has utf8.char:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
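What such a conversion involves can be sketched by hand. The following is a Python illustration of the standard UTF-8 bit layout, not Lua's actual implementation; the name utf8_char is borrowed for the sketch, and the range check to U+10FFFF is an assumption.

```python
def utf8_char(*code_points):
    """Encode each integer as UTF-8 and return the concatenated bytes."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("code point out of range: %r" % cp)
        if cp < 0x80:        # 1 byte:  0xxxxxxx
            out.append(cp)
        elif cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:                # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


encoded = utf8_char(0x78, 0x20D7)
```

The result is the same four bytes as the "x\226\131\151" literal shown earlier.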

How to enumerate all Unicode canonically equivalent sequences in Perl?

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)

Is this an XY problem? If you want to compare/match 2 unicode strings and you're worried that different ways of encoding the accented characters would create false negatives, then the best way to do this would be to normalize the 2 strings using one of the normalization functions from Unicode::Normalize, before doing the comparison or match.
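The normalization idea is sketched here in Python's unicodedata module rather than Perl's Unicode::Normalize, purely for illustration. It checks that the five sequences from the question collapse to a single string under NFC and under NFD.

```python
import unicodedata

# The five canonically equivalent spellings listed in the question.
sequences = [
    "\u0061\u0302\u0323",
    "\u0061\u0323\u0302",
    "\u00e2\u0323",
    "\u1ea1\u0302",
    "\u1ead",
]

# Normalizing first makes a plain equality comparison reliable.
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences}
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences}
```

Both sets contain exactly one element, so any two of the inputs compare equal after normalization.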
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining characters names in the full name (eg CIRCUMFLEX) with the combining character real name (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
This would be my naive attempt, there may be better ways of doing this, but since so far no one has volunteered the information...
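Another naive route avoids character names altogether and leans on normalization. The Python sketch below (not Perl, and not a standard module) collects every code point whose canonical decomposition fits inside the target's NFD form, then brute-forces short sequences of those and keeps the ones with the same NFD. The function name canonical_equivalents is hypothetical, and the search is exponential, so it only suits short combining sequences.

```python
import itertools
import sys
import unicodedata
from collections import Counter


def canonical_equivalents(text):
    """Brute-force the strings canonically equivalent to a short `text`."""
    target = unicodedata.normalize("NFD", text)
    budget = Counter(target)

    # Building blocks: code points whose decomposition uses only characters
    # (with multiplicity) that occur in the target's decomposition.
    alphabet = []
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.normalize("NFD", chr(cp))
        if all(p in budget for p in parts) and not (Counter(parts) - budget):
            alphabet.append(chr(cp))

    # No equivalent string can be longer than the fully decomposed form.
    found = set()
    for length in range(1, len(target) + 1):
        for combo in itertools.product(alphabet, repeat=length):
            candidate = "".join(combo)
            if unicodedata.normalize("NFD", candidate) == target:
                found.add(candidate)
    return sorted(found, key=lambda s: (len(s), s))


equivalents = canonical_equivalents("\u1ead")
```

For U+1EAD the result includes the five sequences listed in the question. The full scan of the code space takes a noticeable moment, so the alphabet would be worth caching if the function is called repeatedly.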
