How will Perl 6 handle the new combining emoji length?

Some emoji now combine. For instance, U+1f441 (👁) U+200d (ZWJ) U+1f5e8 (🗨) combine to make 👁‍🗨 (I am a witness). Rakudo 2016.07.1 on MoarVM 2016.07 says there are two graphemes:
> "\x[1f441]\x[200d]\x[1f5e8]".chars
2
I think that should be 1. It seems to have a similar problem with
> "\x[1f441]\x[fe0f]\x[200d]\x[1f5e8]\x[fe0f]".chars
2
But at least it handles U+fe0f (VS-16, emoji representation) correctly.
Are there plans to fix this in a later version of Perl 6 or am I misunderstanding the intent of the chars method?

The ZWJ sequence you mentioned is only part of Unicode Emoji 4.0 which is still in draft status and planned for release in November 2016. Under this new version, U+1F5E8 has the Grapheme_Cluster_Break property E_Base_GAZ (EBG), so the sequence should indeed form a single grapheme cluster.
I'm sure that Perl 6 will catch up at some point.
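In the meantime, here is an illustrative sketch of the intended segmentation using Python's third-party regex module rather than Perl 6: its \X pattern matches extended grapheme clusters, so with Emoji 4.0-era Unicode data the ZWJ sequence should count as a single cluster (this says nothing about Rakudo's internals, it only shows the expected result):
# Not Perl 6: a Python sketch using the third-party "regex" module (pip install regex).
# \X matches an extended grapheme cluster; with Emoji 4.0-era rules the ZWJ
# sequence below should form a single cluster.
import regex

s = "\U0001F441\u200D\U0001F5E8"        # EYE, ZERO WIDTH JOINER, LEFT SPEECH BUBBLE
print(len(s))                            # 3 code points
print(len(regex.findall(r"\X", s)))      # ideally 1 grapheme cluster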

Tk text widget index expressions and Unicode

(This question is based on a related question.)
Let us consider the following code:
package require Tk 8.6
pack [text .t]
.t insert end "abcdefgh\nабвгґдеє\n一伊依医咿噫欹泆"
puts "[.t index 1.4+1l] [.t index 1.4+2l]"
puts "[.t index 3.4-1l] [.t index 3.4-2l]"
exit 0
Output:
2.2 3.2
2.6 1.8
I would rather expect +1l and -1l to preserve the column if the line is long enough, that is, to print 2.4 3.4 and 2.4 1.4. It looks like the result depends on the number of bytes needed to encode each character.
Should it be this way? Is it documented somewhere?
What font are you using? What exact patch-version of Tk are you using? (It should be reported by doing puts [package require Tk].)
I think the text widget currently uses character display widths when working out the actual motions for index movement by lines. This has changed between past versions. The problem is that different bits of code want different things: sometimes you want visible motions (e.g., when handling users' cursor motion, especially with tabs set) and sometimes you want character-space motions (which is what you appear to be expecting).
Tk shouldn't ever be doing anything (that you can see) with the byte widths of Unicode characters. It's really supposed to handle that transparently (at least for any character in the Basic Multilingual Plane; you might find bugs outside that).
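For comparison, the same experiment can be driven from Python's tkinter, which wraps the very same Tk text widget; a sketch only (it needs a display, and the output will depend on your Tk patch level):
# Reproduce the question's index arithmetic through Python's tkinter,
# which uses the same underlying Tk text widget.
import tkinter as tk

root = tk.Tk()
t = tk.Text(root)
t.pack()
t.insert("end", "abcdefgh\nабвгґдеє\n一伊依医咿噫欹泆")
root.update_idletasks()                  # let the widget lay out the text
print(t.index("1.4+1l"), t.index("1.4+2l"))
print(t.index("3.4-1l"), t.index("3.4-2l"))
root.destroy()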

How to convert the old emoji encoding to the latest encoding in iOS 5?

Sadly, after iOS 5 was finally released, I got reports from my users that they cannot log in.
There are emoji symbols in their names, and Apple changed the encoding of emoji.
Their usernames contain the old version of the emoji, so how can I convert them to the new encoding?
Thanks!
To be specific: for the emoji symbol "tiger", it is "\U0001f42f" in iOS 5, but "\ue050" in earlier iOS versions.
iOS 5 and OS X 10.7 (Lion) use the Unicode 6.0 standard ‘unified’ code points for emoji.
iOS 4 on SoftBank iPhones used a set of unofficial code points in the Unicode Private Use Area, which aren't compatible with any other systems. To convert from this format to proper Unicode 6.0 characters, you'll need to run a big lookup table from Softbank codes to Unified over all your current data and all new form data as it gets submitted. You might also want to do Unicode normalisation at this point, so that e.g. fullwidth letters match normal ASCII letters.
See for example this table from a library that does emoji conversion tasks for PHP.
Emoji in usernames though?
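To make the lookup-table idea concrete, here is a rough Python sketch; the table below contains only the single "tiger" mapping quoted in the question (U+E050 to U+1F42F), so treat it as a placeholder for the full Softbank table from a library like the one linked above:
# Sketch of the Softbank -> Unified migration; the table holds only the one
# mapping given in the question and must be replaced with a complete table.
import unicodedata

SOFTBANK_TO_UNIFIED = {
    0xE050: "\U0001F42F",   # TIGER FACE: Softbank PUA U+E050 -> Unicode 6.0 U+1F42F
}

def migrate_username(name):
    converted = name.translate(SOFTBANK_TO_UNIFIED)
    # Optional normalisation mentioned above, so that e.g. fullwidth letters
    # match their ASCII counterparts.
    return unicodedata.normalize("NFKC", converted)

print(migrate_username("\ue050 tiger"))  # -> "🐯 tiger"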
I had the same problem; after digging for hours I finally found this answer, which works for me.
If you are using Rails as your server, this is all you need to do. There's no need to do anything in iOS/Xcode; just pass the NSString to the server without doing any UTF-8/16 encoding work.
Postgres stores the code correctly; it's just that when you send the JSON response back to your iOS client (assuming you do render json: @message), the JSON encoding has a problem.
You can test whether you have this JSON encoding problem by running a simple test in your Rails console:
test = {"smiley"=>"\u{1f604}"}
test.to_json
If it prints out "{\"smiley\":\"\uf604\"}" (notice the 1 is lost), then you have this problem, and the patch from the link will fix it.

Getting first symbol from a glyph

Related (in fact, perhaps a duplicate of): how to extract characters from a Korean string in VBA
The linked question doesn't give me satisfactory answers and it's 2 years old so I'm making a new question.
I want to find the first symbol in a Korean glyph, i.e. "한" -> "ㅎ" or "가" -> "ㄱ". I also want to recognize inputs that are already single symbols, such as "ㄱ".
I'm working with NSString, which I believe uses UTF-8. Do I have to convert the string to EUC-KR, then start reading bytes, or what?
As a disclaimer, I have no experience in working with iphone or NSString, except for what I've read in the documentation in order to answer this question. I'm addressing the question mainly as a unicode problem.
In order to find the first symbol (jamo) from a Korean glyph, you have to perform a decomposition as described in my answer to how to extract characters from a Korean string in VBA (it's a new answer so you didn't see it when you posted your question). To apply my answer (which is derived directly from the Unicode standard), you have to work with the Unicode code points (numerical values) of the Korean syllables. It looks like calling the method dataUsingEncoding passing NSUnicodeStringEncoding as a parameter should do the trick.
In order to identify single symbols, you have to check whether the Unicode code point of the character you are checking is in any of the following ranges:
1100-11FF (Hangul Jamo). I think this should cover most of the real life cases.
A960-A97F (Hangul Jamo Extended-A)
D7B0-D7FF (Hangul Jamo Extended-B)
3130-318F (Hangul Compatibility Jamo)
FFA0-FFDC (Halfwidth Jamo)
Check the Unicode Code Charts for a complete reference.
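If it helps, here is a small Python sketch (rather than Objective-C) of the decomposition arithmetic the Unicode standard defines for precomposed syllables; the compatibility-jamo table is my own mapping of the leading-consonant index, so double-check it against the code charts:
# Map a precomposed Hangul syllable (U+AC00..U+D7A3) to its leading consonant.
S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT              # 588 syllables per leading consonant

# Compatibility jamo (U+3131..) in standard choseong order -- assumed mapping.
COMPAT_CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

def first_jamo(ch):
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:           # precomposed syllable
        return COMPAT_CHOSEONG[(cp - S_BASE) // N_COUNT]
    return ch                             # already a single jamo (or not Hangul at all)

print(first_jamo("한"))   # ㅎ
print(first_jamo("가"))   # ㄱ
print(first_jamo("ㄱ"))   # ㄱ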

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents:
http://www.herongyang.com/gb2312/
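A rough Python sketch of one way to do the conversion without an external mapping file, using the built-in gb2312 codec; the byte ranges for the hanzi rows are my reading of the GB 2312 layout, so please double-check them:
# Enumerate the GB 2312 hanzi rows (first byte 0xB0-0xF7, second byte 0xA1-0xFE)
# and decode each two-byte code to its Unicode equivalent.
chars = []
for hi in range(0xB0, 0xF8):
    for lo in range(0xA1, 0xFF):
        try:
            chars.append(bytes([hi, lo]).decode("gb2312"))
        except UnicodeDecodeError:        # unassigned positions in the table
            pass
print(len(chars), "".join(chars[:10]))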
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a traditional/simplified pair looks like this:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
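Here is a rough Python sketch of harvesting those pairs from Unihan_Variants.txt; the three whitespace-separated fields follow the sample lines above, but verify them against the actual file:
# Collect traditional -> simplified mappings from Unihan_Variants.txt.
simplified_of = {}
with open("Unihan_Variants.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue
        code, field, value = line.split(None, 2)
        if field == "kSimplifiedVariant":
            trad = chr(int(code[2:], 16))
            # Strip any "<source" annotation before parsing the code point.
            simp = [chr(int(v.split("<")[0][2:], 16)) for v in value.split()]
            simplified_of[trad] = simp

print(simplified_of.get("機"))            # expected: ['机']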
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
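For instance, a rough Python sketch of that harvesting step; the file name cedict_ts.u8 and the "traditional simplified [pinyin] /definitions/" layout are assumptions based on the sample entry above:
# Collect every character that appears in the simplified (second) column.
simplified = set()
with open("cedict_ts.u8", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):                     # header/comment lines
            continue
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        for ch in parts[1]:                          # second column: simplified form
            if "\u4e00" <= ch <= "\u9fff":           # keep CJK ideographs only
                simplified.add(ch)
print(len(simplified))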
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex I made of all simplified Chinese characters. For some reason Stack Overflow is complaining about it, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and also that these are UTF-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it hasn't come up once in 9 years), iterate over all the chars from ['一-龥'] and build a new list. Or run two regexes: one to check that a character is Chinese, and one to check that it is not simplified Chinese.
According to Wikipedia, whether a character appears as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font rendering. So while you could have a selection of simplified Chinese code points, such a list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.

Which programming languages were designed with Unicode support from the beginning?

Which widely used programming languages were designed ground-up with Unicode support?
A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?
Java was probably the first popular language to have ground-up Unicode support.
Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.
There were many breaking changes in Python 3, among them the switch to Unicode for all text.
So Python wasn't designed ground-up for Unicode, but Python 3 was.
I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class, etc.) Unicode-aware, but Unicode is also fully supported in source code:
using משליט = System.Object;
using תוצאה = System.Int32;

public class שלום : משליט {
    public תוצאה בית() {
        int אלף = 0;
        for (int λ = 0; λ < 20; λ++) אלף+=λ;
        return אלף;
    }
}
Google's Go programming language supports Unicode and works with UTF-8.
It really is difficult to design Unicode support for the future into a programming language right from the beginning.
Java is one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from that in v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.
Early implementations of the JLS could claim Unicode support primarily because Unicode itself supported 65,536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0), which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1; it's an evolving standard, usually with more characters added in each release. The characters added in 3.1 and later, beyond the original 16-bit range, are called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
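A small illustration of why that 16-bit assumption broke, sketched in Python rather than Java: a supplementary character occupies one code point but two UTF-16 code units (a surrogate pair), which is exactly what the post-JSR-204 APIs have to handle:
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
s = "\U0001D11E"
print(len(s))                                # 1 code point
print(len(s.encode("utf-16-be")) // 2)       # 2 UTF-16 code units (surrogate pair)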
Therefore, don't be surprised if different programming languages implement Unicode support differently.
On the other hand, PHP(!!) and Ruby did not have Unicode support built into them during inception.
PS: Support for Unicode 5.1 is planned for Java 7.
Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
Other languages that support non-ASCII identifiers include current PHP.
Perl 6 has complete Unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation)
General overview
Unicode operators
Strings, regular expressions and grammars all operate on graphemes, even for codepoint combinations that have no precomposed representation (an artificial composed codepoint is generated on the fly in those cases).
A special encoding, "utf8-c8", exists to handle data of unknown encoding: it assumes UTF-8 where possible, but creates artificial codepoints for undecodable sequences, allowing them to round-trip if necessary.
Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html
Sometimes a feature that was included in a language when it was first designed turns out not to be the best.
Languages have changed over time, and many have become bloated with extra features while not necessarily keeping the features they first included up to date.
So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.
With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.
Java uses characters from the Unicode character set.
Java and the .NET languages.