Bluemix language identify does not identify English - ibm-cloud

So I am beta-testing the Language Translate - Identify API and it is very strange - it identifies Spanish very well but not English. Huh? Check this out...
Go to https://watson-api-explorer.mybluemix.net/apis/language-translation-v2#!/identify/identifyLanguageGet
Paste this text in the box:
Michael+is+a+hard+working+student+who+shows+responsibility+in+his+daily+tasks.+He+is+enthusiastic%2C+engaged+and+active+in+his+learning%2C+all+qualities+shown+through+his+level+of+participation+and+ability+to+self-assess+his+work.+He+has+become+very+good+at+identifying+his+strengths+and+uses+feedback+to+improve+and+revise+his+assignments.+Congratulations+on+a+great+first+semester.
I get back "et" which is ESTONIAN. Huh?
And if I enter this text (same as before but with one more character at the end):
Michael+is+a+hard+working+student+who+shows+responsibility+in+his+daily+tasks.+He+is+enthusiastic%2C+engaged+and+active+in+his+learning%2C+all+qualities+shown+through+his+level+of+participation+and+ability+to+self-assess+his+work.+He+has+become+very+good+at+identifying+his+strengths+and+uses+feedback+to+improve+and+revise+his+assignments.+Congratulations+on+a+great+first+semester.%21
I get "ja" which is JAPANESE. I also get "ht" which is HAITIAN CREOLE from time to time....Hello Watson?? English??

This is definitely an encoding issue. Everything must be strictly UTF-8, as the docs say. Note that the text pasted above is already URL-encoded (+ for spaces, %2C for commas), so the service is most likely identifying the language of that encoded string rather than of plain English.
FYI, in a POST request, even though you do need to encode the text, the text itself must also be strictly UTF-8. This was one of my challenges, since you are obviously dealing with a lot of accented characters with Language Translation.
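For what it's worth, here is a minimal Python sketch of the same call; the v2 identify endpoint URL and the basic-auth credentials are placeholders, not values taken from this thread. The point is to send plain UTF-8 text and let the HTTP library do the URL encoding:

import requests

text = "Michael is a hard working student who shows responsibility in his daily tasks."

# Endpoint and credentials below are illustrative placeholders.
resp = requests.get(
    "https://gateway.watsonplatform.net/language-translation/api/v2/identify",
    params={"text": text},          # requests URL-encodes the UTF-8 text for you
    auth=("username", "password"),  # service credentials go here
)
print(resp.text)  # plain English text should come back identified as "en"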

Related

Japanese characters unreadable

I am working on my thesis and got access to a database that was used by Japanese scientists. They included some readme files, but the text that is supposed to be Japanese is displayed in characters like these:
ÉRÅ[ÉqÅ[Ç…É~ÉãÉNÇì¸ÇÍÇ‹Ç∑Ç©ÅB
I've tried everything to convert them to Japanese characters, but I can't get it right. The database is from 1999; maybe that makes it harder to convert?
Does anybody know how to fix this?
So you have a text file, but with these strange characters? Does your text editor allow you to change the page encoding?
For example, in Atom, once your text file is open, you can switch the page encoding using the status bar: Atom knows (though perhaps this is inherited from the host system) Shift JIS, CP 932 and EUC-JP, which all seem to be related to Japanese character encoding.
Maybe you can find helpful details on this page?
But even once that's done, I guess you will have to find a native speaker to tell you whether the results make sense...
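The garbled sample above looks like Japanese bytes (Shift JIS or a relative) displayed through a Western code page such as Mac Roman. A minimal Python sketch under that assumption (the file name and the list of codecs to try are made up):

# Hypothetical file name; candidate Japanese codecs are an assumption.
with open("readme.txt", "rb") as f:
    raw = f.read()

for codec in ("shift_jis", "cp932", "euc_jp"):
    try:
        print(codec, "->", raw.decode(codec)[:80])
    except UnicodeDecodeError:
        print(codec, "-> failed")

# If all you have is the already-garbled text, re-encoding it through the
# code page it was (mis)displayed in may recover the original bytes:
garbled = "ÉRÅ[ÉqÅ["
print(garbled.encode("mac_roman", errors="replace").decode("shift_jis", errors="replace"))
# should print コーヒー ("coffee") if the Mac Roman / Shift JIS guess is right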

How to interpret emoji in Web API

I am trying to intercept and replace emoji with corresponding text. I left the default encoding on the Web API (UTF-8 / UTF-16 respectively).
How can I convert an emoji like 😉 to something like U+1F609?
Here is something that helped me out, although it is in Perl. You can encode and decode with it, so this should be what you're looking for: https://metacpan.org/pod/Encode::JP::Emoji
This is quite an old post, and even though I'm no longer on the project, I still want to answer with my findings for future reference in case someone else has the same problem.
What I ended up doing was creating a dictionary with the emoji's code point combination as the key and the replacement text as the value. One piece of advice: I made sure the longest combinations (some emoji consist of 4 or even 5 code points) came first, because otherwise some emoji were never reached. Not the perfect, future-proof solution I was hoping for, but it worked for us and shipped to production, where it has been running since 2017.
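For reference, a minimal Python sketch (not the original Web API code) of both steps: turning an emoji into U+XXXX notation and doing longest-sequence-first replacement from a dictionary. The emoji and replacement texts are just examples:

def codepoints(s):
    # "U+XXXX" notation for every code point in the string
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

print(codepoints("😉"))    # U+1F609
print(codepoints("👍🏽"))   # multi-code-point emoji: U+1F44D U+1F3FD

replacements = {
    "👍🏽": "(thumbs up)",
    "👍": "(thumbs up)",
    "😉": "(wink)",
}

def replace_emoji(text):
    # Longest sequences first, otherwise the short key shadows the long one
    for emoji in sorted(replacements, key=len, reverse=True):
        text = text.replace(emoji, replacements[emoji])
    return text

print(replace_emoji("Nice work 😉"))  # Nice work (wink)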

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s which include a significant amount of Korean text. The HTML lacks character set metadata, so of course none of the Korean text renders properly any more. The following examples all use the same excerpt of text.
In text editors such as Coda and Text Wrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
which, in the absence of character set metadata in <head>, is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding euc-kr metadata to <head>
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212
\262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
All of those octal codes that Emacs revealed are less than 254 (or \376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the ASCII range. If this is right, you'll just have to try to figure out what font it was intended for, find it, and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals that my system, at least, can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. According to Wikipedia, Korean is also BIG5HKSCS, CSKSC5636 and KSC5636. Try them all until something reasonable pops out.
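The same trial-and-error can be scripted; here is a minimal Python sketch looping over a few candidate Korean codecs (the file name and codec list are assumptions). Note that if the bytes really come from a custom pre-Unicode font mapping, as suggested above, none of these will produce readable Korean:

candidates = ["euc_kr", "cp949", "iso2022_kr", "johab"]

with open("page.html", "rb") as f:   # hypothetical file name
    raw = f.read()

for codec in candidates:
    try:
        print("---", codec, "---")
        print(raw.decode(codec)[:200])  # eyeball which output looks like real Korean
    except UnicodeDecodeError:
        print("---", codec, ": failed ---")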
Even though this thread is old, it's still an issue. Not having found a way to convert the files in bulk (outside of using a Korean version of Windows 7), I'm now using Naver, which has a cloud service like Google Docs; if you upload those weirdly encoded files there, it handles them very well. I just edit and copy the text, and it's back to standard when I paste it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
By the way, you can register for the cloud account with an ID even if you do not live in South Korea; there's some minimal English to get by.

What text encoding to use?

I need to set up my PostgreSQL DB's text encoding to handle characters beyond American English, the kind that show up in languages such as German, Spanish, and French. What character encoding should I use?
Start with UTF-8. It covers every character used in the world. Prepare your DB for world domination.
Unless you have a very good reason not to, use UTF-8. For the list of languages you cite, Latin-1 would almost be acceptable (but not quite: it misses one, admittedly rare, character for French: œ). Unicode is very mature now, and there is little reason to reject it on principle. On the contrary, if you ever need to extend the list of languages you work with, you will be glad to have chosen an encoding that can deal with them.
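As a sketch of what that looks like in practice, here is a minimal Python/psycopg2 snippet creating a UTF-8 database; the database name and connection details are placeholders:

import psycopg2

# Connection details and database name are placeholders.
conn = psycopg2.connect(dbname="postgres", user="postgres", password="secret", host="localhost")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
with conn.cursor() as cur:
    # TEMPLATE template0 allows an encoding different from the cluster default
    cur.execute("CREATE DATABASE myapp ENCODING 'UTF8' TEMPLATE template0")
conn.close()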

Should I use accented characters in URLs?

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick to non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters can appear anywhere: in document titles, in tags, in user names, etc., so they are not always under the complete supervision of the website's maintainer.
A possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, pointing to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the encoded version won't work, because the underlying definition of URLs simply doesn't support it. So for it to work consistently, you need to percent-encode.
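A minimal Python sketch of the two mechanisms described above, percent-encoding for the path and IDNA/punycode for the host (the example strings match the ones used in this thread):

from urllib.parse import quote, unquote

print(quote("Éléphant"))              # %C3%89l%C3%A9phant (UTF-8 bytes, percent-encoded)
print(unquote("%C3%89l%C3%A9phant"))  # Éléphant

print("café.com".encode("idna"))      # b'xn--caf-dma.com'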
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible under either the accented or the unaccented spelling. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting and character-translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question: as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
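A minimal Python sketch of the kind of character-translating step such a rewrite rule could use, mapping an accented path onto its ASCII-only primary form (the function name is made up):

import unicodedata

def deaccent(path):
    # Decompose accented letters (NFKD) and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", path)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(deaccent("myresumé.html"))  # myresume.html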
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but, still, people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that may be entered manually in the browser by users. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding they use. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Name) rules, and can contain most Unicode characters.
The path (after the first /), the user name and the password can again be almost anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is up to the HTTP server to interpret them).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/parameters be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the '80s anymore), but exactly what you do with it can be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
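A minimal Python sketch pulling a URL apart into the areas listed above and decoding the percent-escaped parts as UTF-8 (the recommended, but not guaranteed, interpretation); the URL itself is made up:

from urllib.parse import urlsplit, unquote, parse_qs

url = "https://user:pa%C3%9F@fr.wikipedia.org/wiki/%C3%89l%C3%A9phant?q=caf%C3%A9"
parts = urlsplit(url)

print(parts.scheme)                  # https (plain ASCII)
print(parts.hostname)                # fr.wikipedia.org (IDN rules apply here)
print(unquote(parts.username or "")) # user
print(unquote(parts.password or "")) # paß
print(unquote(parts.path))           # /wiki/Éléphant
print(parse_qs(parts.query))         # {'q': ['café']} (assumes UTF-8)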