How can I spell check for multiple languages in emacs? - emacs

I write mostly my documentation in HTML using emacs as my main editor. Emacs let you interactively spell-check the current buffer with the command ispell-buffer.
Since I switch between a number of languages, I have an HTML comment at the end of the file specifying the main dictionary and personal dictionary for that file, E.g. for Norwegian (norsk) I use the following pair of dictionaries:
<!-- Local IspellDict: norsk -->
<!-- Local IspellPersDict: ~/.aspell/personal.dict -->
This works great.
However, sometimes I have a paragraph in another language (e.g. English) embedded in an otherwise Norwegian document. Example:
<p xml:lang="en">This paragraph is in English.</p>
The spell-checker naturally flag all the words in such a paragraph as misspellings (since the dictionary only contain Norwegian words).
To avoid this, I've tried to add a "british" dictionary to the document, like this:
<!-- Local IspellDict: british -->
<!-- Local IspellDict: norsk -->
<!-- Local IspellPersDict: ~/.aspell/personal.dict -->
Unfortunately, this does not work. The "british" dictionary is simply ignored.
My prefered solution would to load an additional dictionary and use this, toghether with the primary dictionary, for spell-checking. Is this possible?
However, I am also interested in a solution that let me mark paragraphs for not being spell checked. It is not ideal, but it would stop valid English words from being flagged as misspellings.
PS: I have also looked at the answer to this question: Multilingual spell checking with language detection, but it is much broader and does not address the specific use emacs ispell for doing the spell-check.

Try ispell-multi and flyspell-xml-lang http://www.dur.ac.uk/p.j.heslin/Software/Emacs/
You can spawn multiple instances of ispell, and use the xml:lang tag to decide which language to check for.

Related

change '#' key in freemarker templates

In order to use if statements in Freemarker templates, the following syntax is used;
[#if ${numberCoupons} <= 1]
[#assign couponsText = 'coupon']
[/#if]
Is there a way to replace the '#' character with something else, because I am trying to integrate it with drools (a java based rule engine) and the '#' character is used to mark start of comments so the formatting breaks?
There isn't anything for that out of the box (it uses a JavaCC generated parser, which is static). But you can write a TemplateLoader that just delegates to another TemplateLoader, but replaces the Reader with a FilterReader that replaces [% and [/% and [%-- and --%] with [#, etc. Then then you can use % instead of # in the FreeMarker tags. (It's somewhat confusing though, as error messages will still use #, etc.)
As #ddekany wrote, you can write code that tranform the template without the pound sign, But notice it can clash with HTML or XML (and similar) tags, at least from an editor prespective.

Eclipse Juno spell check strange behavior

I am trying to setup eclipse Juno to do spell checking with a custom dictionary (to do Spanish spell checking). When I enable spell checking I get errors on every Spanish word, because the English dictionary detects them as wrong.
I have seen several guides on this topic and they all point to the same solution; add a user defined dictionary with the Spanish words, and disable the English dictionary. So I proceed to set it up like the following picture:
But it didn't work (no words were detected as incorrect), so on the 'Platform dictionary' dropdown, I selected 'US english' to re-enable the english dictionary. Like this:
Now the Spanish words are detected as wrong again. BUT, the interesting point is that I have a user dictionary file defined, so if I add a Spanish word to the dictionary, I can see the dictionary file to check if that word has been added. So there we go.
The end of the original dictionary file:
The code with the incorrectly spelled word:
The end of the dictionary file after adding the word 'Pedro':
Anybody have a clue of what's going on here?. I would expect the word 'Pedro' to be added at the end of the file, but I am getting these strange characters instead. Although, that explains why none of the words already present in the dictionary are being recognized.
I have tried different file encoding settings, but that made no difference.
Help please!
Most probably that your dictionary file is encoded in UTF-8. Try to change encoding setting in Eclipse to UTF-8.

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s, which include a significant ammount of Korean text. The HTML lacks character set metadata, so of course all of the Korean text now does not render properly. The following examples will all make use of the same excerpt of text .
In text editors such as Coda and Text Wrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
Which in the absence of character set metadata in < head > is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding euc-kr metadata to < head >
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212
\262\351\322\215\202\354\270\346\253\354\261\224 \262\3\
51\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
All of those octal codes that emacs revealed are less than 254 (or \376 in octal), so it looks like one of those old pre-Unicode fonts that just used it's own mapping in the ASCII range. If this is right, you'll just have to try to figure out what font it was intended for, find it and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals at least my system can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. Korean is also BIG5HKSCS, CSKSC5636 and KSC5636 according to Wikipedia. Try them all until something reasonable pops out.
Even if this thread is old, it's still an issue, and not having found a way to convert the files in bulk (outside of using a Korean version of Windows7), now I'm using Naver, which has a cloud service like Google docs and if you upload those weirdly encoded files there, it deals with them very well. I just edit and copy the text, and it's back to being standard when I copy it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
You can register for the cloud account with an ID, even if you do not live in SKorea by the way, there's some minimal english to get by.

Is Localizable.strings required for the root language of an app?

As we're enabling our (English) application to be localized, we're replaced all in-line strings with NSLocalizedString() calls. Since all of our English strings are all there in-line with the code, e.g. NSLocalizedString(#"OK, #"OK button in a message box"), is there any reason we need the English version of Localizable.strings? When we try removing the strings from the English Localizable.strings, the program seems to work fine. Just wanted to double check if there was some side-effect to not having that around. Thanks, alex
One of the main points of using the NSLocalizedString() macro is so that your programming code can be parsed with the genstrings command-line tool to generate the corresponding Localizable.strings file(s) (see Resource Programming Guide: About the String-Loading Macros and Using the Genstrings Tool to Create Strings Files).
That Localizable.strings file then serves as a starting point for your translators, to use to translate to another language. Without that file to work with, your translators would basically need access to your source code in order to see all the strings you want to use (which kind of defeats the purpose).
Yes, your English version works fine right now, since if a localized version of the string you try to get in code –– for example, NSLocalizedString(#"OK", #"") –– cannot be found in a .strings file, it simply uses the #"OK" string that you passed in.
Another reason why you should likely be keeping the English Localizable.strings is that you should generally try to avoid using high-ASCII characters in your code, but should use the full range of available characters in your actual user interface. For example, you may not want to put the following characters in your code, but would want to use them in your user interface:
… (horizontal ellipsis) (U+2026)
“ (left double quotation mark) (U+201C)
” (right double quotation mark) (U+201D)
‘ (left single quotation mark) (U+2018)
’ (right single quotation mark) (U+2019)
So in code, you'd do something like this:
NSLocalizedString(#"Add Bookmark...", #"")
and then in your .strings file (which is UTF16, so this is fine):
"Add Bookmark..." = "Add Bookmark…";
It's not recommended to use the words as keys, if you want to change that Ok for something else and you have any other Localizable.strings, you will have to edit every single of them to update the Okkey.
Surely your translators will want your Localizable.strings file to work from. And want it to be ALL the strings.
(It is true that the system will fall back to the key, but relying on that seems like bad practice. And you'll find it more reliable to use proper punctuation in a Unicode file, which source code seldom is.)

Website localization for multibyte languages

I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character set) I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in unicode form within the locale file (i.e. 台北 instead of 台北 as string value)?
Would it make more sense to store the localization in a DB (i.e. MySQL) and retrieve the respective values via a localization function in PHP?
Your thought input is much appreciated.
Best regards
$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements
I doubt that string externalization would be your biggest problem. But let me give you some advise.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translation in plain text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve this at runtime. The script would need to detect end-user's language.
Language detection
Web browsers are so nice to send AcceptLanguage header each time they send a request. What you need to do, is to read the content of this header and check if you support any of the language user has listed. If so, read the resource file (as defined above) and return strings for given language, return your default language otherwise. The code example below will give you the most desired language (which is not necessary the one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
This is still, not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications are styles. People tend to put style definitions in-line, which is problematic to say the least. Also, designers tend to think that Arial is the best font for entire Universe, as well as emphasis always have to come with bolded font. The only problem is, the font might be unreadable under some circumstances.
I must admit, I don't know why it happens, but most of the times web browsers tend to ignore bold attribute for Asian scripts (which is good), but sometimes they do not and it could became a major challenge for end users if your font definition is say font-family:Arial; font-size:10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text could also give you a headache, you would need to either externalize such texts (and write them down just like any other HTML element), or prepare multilingual resources structure (i.e. put all images to directories named after language code ("en", "ja", "ko")).
The real challenge however, are hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays, style classes should be used instead but the common practice is different. You would probably need to replace them with style classes; each element could have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you would probably be forced to implement some sort of CSS Localization Mechanism. This is needed in the lights of what I wrote above. The easiest way to do that is to create directory structure similar to the one I mentioned before - "en" for English base CSS files, "ja" for Japanese and "ko" for Korean, so each language would have their own, separate set of CSS files. This is similar to UI skins, only in that case user won't be able to choose the skin, you will decide on which CSS to present them - you would detect language anyway.
As for in-line style definitions (<p style="whatever">), after you define CSS L10n Mechanism, you could override any style by forcing it with !important keyword. That is, unless somebody in his very wrong mind put this keyword to in-line style definition.
Concatenations
Well, this is your biggest challenge. Even people who understand the need of string externalization tend to concatenate the strings like this:
$result = $label + ": " + $product;
$message = "$your_basket_is + $basket_status + ".";
This poses serious problem for Internationalization (and if it is not resolved for Localization as well). That is because, the order of the sentence tend to be different after translating text into different language (this especially regards to Korean). Also, I showed you hard-coded punctuations, which are not necessary correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do, is to remove such concatenations, or use some means of message formatting. The PHP example (taken directly from web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
As you can see in this example, numbers are also formatted to much locale style. This leads us to:
Locale aware formatting
Dates, times, numbers and currencies or other similar information need to be formatted according to user-detected Locale. There is a slight difference here: you should attempt to do that, even if you do not support related language resources (do not have translations). Of course for currency symbol, you would use whatever is your real currency, not the user's default, but the format should respect end user's cultural background.
Summary
I have just presented you with a short introduction to multilingual web site design with focus on Japanese and Korean target markets. If at some point you would need to support Chinese Simplified as well, support for GB18030 encoding would be probably needed as well. This would be very challenging...
You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.