I think answer to this question might not be difficult but tricky obviously . I do not know why I cannot show the Nepali numbers in my application form. Though alphabetical Nepalese characters (क, ख, ग, etc) are displayed correctly, whenever I type or copy numbers typed in Nepalese unicode fonts it automatically gets converted to English.
For eg. १२३४५६७८९० are Nepalese unicode characters representing 1234567890. Could you please help what is being missed. For your reference I have pasted the link here.
I am converting this from English Numbers to Nepali numbers using the following one-liner code. The output is in Nepali numbers. Other characters will not be affected.
If you use the "onkeyup" event for the field you could do the translation automatically.
const toNe = n => n.replace(/\d/g, d => "०१२३४५६७८९"[d])
// Tests
console.log(toNe("The value is 12000"));
console.log(toNe("0123456789"));
console.log(toNe("यो संख्या 25000 हो"));
console.log(toNe("जन्म मिति 25 अक्टोबर 2010 हो"));
Related
In a mapping editor, the display is correct after the legacy to unicode conversion for DEVANAGARI text shown using a unicode font (Arial Unicode MS). However, in MS-WORD, the display isn't as expected for the same unicode text in the unicode font (Arial Unicode MS) or any other Devanagari unicode fonts. The expected sequence of unicodes are provided as per the documentation. The sequence can be seen on the left-hand side table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[aside: the way to tell if it's being treated as a single run, is to save the document as an xml file and then open it with something like notepad++ and look at the xml "w:t" element (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++) and then copy it from there and paste it back in Word -- that might cause it to be imported into Word in a single run.
I'm writing an AppleScript to count characters in Unicode strings. The script works well except that it does not count Arabic diacritics, for example:
considering diacriticals, hyphens and punctuation
set count_a to count characters of ("فما")
set count_b to count characters of ("فَمّا")
end considering
This gives count_a = 3, which is correct. But, it also gives count_b = 3, which is wrong! count_b should be 5 because of the two extra diacritics added to the word.
Any idea how can I make AppleScript to count for diacritics?
AppleScript is working as designed. Like Swift and other languages that have a decent understand of Unicode, AppleScript counts glyphs, not codepoints.
If for some reason you really need to count raw codepoints, use the AppleScript-ObjC bridge to convert it to NSString (which being old and dumb has no concept of glyphs) and count that. Bear in mind that the raw codepoint count can also vary dependending on the normalization form used by a given piece of text. It really isn't a useful measure of anything other than the number of bytes used to store it.
Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
According to wikipedia simplified Chinese v. traditional, kanji, or other formats is left up to the font rendering in many cases. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF
The question is quiet simple: I've got an arabic text with an US formated Date in it. What is the correct display order of this date? Is it(for instance) 01/10/2009 or 2009/10/1?
The bidi algorithm recognizes the numbers an slashes as neutral and orders them in the same direction like the surrounding text. So the date should be backwards but that's not what any browser does. On the other hand, i can't find any rule in the unicode bidi algorithm which excludes date patterns. So, what is correct here and (especially) why?
without going deep in the technical details
I can tell that 01/10/2009 is the correct one and some times it's 10/01/2009
but it's never 2009/10/1
I am writing a function that transliterates UNICODE digits into ASCII digits, and I am a bit stumped on what to do if the string contains digits from different sets of UNICODE digits. So for example, if I have the string "\x{2463}\x{24F6}" ("④⓶"). Should my function
return 42?
croak that the string contains mixed sets?
carp that the string contains mixed sets and return 42?
give the user an additional argument to specify one of the three above behaviours?
do something else?
Your current function appears to do #1.
I suggest that you should also write another function to do #4, but only when the requirement appears, and not before .
I'm sure Joel wrote about "premature implementation" in a blog article sometime recently, but I can't find it.
I'm not sure I see a problem.
You support numeric conversion from a range of scripts, which is to say, you are aware of the Unicode codepoints for their numeric characters.
If you find an unknown codepoint in your input data, it is an error.
It is up to you what you do in the event of an error; you may insert a space or underscore, or you may abort conversion. What you would do will depend on the environment in which your function executes; it is not something we can tell you.
My initial thought was #4; strictly based on the fact that I like options. However, I changed my mind, when I viewed your function.
The purpose of the function seems to be, simply, to get the resulting digits 0..9. Users may find it useful to send in mixed sets (a feature :) . I'll use it.
If you ever have to handle input in bases greater than 10, you may end up having to treat many variants on the first 6 letters of the Latin alphabet ('ABCDEF') as digits in all their forms.