Google text-to-speech (Wavenet): Is there a list of date formats for each supported language?

Wavenet correctly converts dates like 01/02/2019 into a spoken date. Different languages use different date formats and separators, e.g. dd/mm/yyyy and mm/dd/yyyy. Is there a list of date formats and separators for each supported Wavenet language?
Does Wavenet follow the country standard formats given in https://en.wikipedia.org/wiki/Date_format_by_country?

The great thing about Google's text-to-speech service (as well as others, like Amazon's Polly) is that, besides the plain text you seem to be using, it accepts SSML, which stands for Speech Synthesis Markup Language. This allows you to provide XML tags that indicate how to pronounce certain parts of speech. Among them, dates:
<speak>
<say-as interpret-as="date" format="yyyymmdd" detail="1">
1960-09-10
</say-as>
</speak>
(Example taken from https://cloud.google.com/text-to-speech/docs/ssml#sayas)
As you surely know, you can test this out directly in the browser here: https://cloud.google.com/text-to-speech/.
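One practical addition of my own (not part of the original answer): if you do not want to depend on whatever per-language default the voice applies to a bare 01/02/2019, you can state the interpretation explicitly. A minimal PHP sketch, where the dateSsml() helper is my own naming and the format codes ("dmy", "mdy", "yyyymmdd", ...) are the ones described on the SSML page linked above:
<?php
// Build SSML that states the date format explicitly, so the engine does not have
// to guess between dd/mm/yyyy and mm/dd/yyyy for the target language.
function dateSsml(string $date, string $format): string
{
    return sprintf(
        '<speak><say-as interpret-as="date" format="%s">%s</say-as></speak>',
        htmlspecialchars($format, ENT_XML1),
        htmlspecialchars($date, ENT_XML1)
    );
}

echo dateSsml('01/02/2019', 'dmy');  // intended to be read as 1 February 2019
echo dateSsml('01/02/2019', 'mdy');  // intended to be read as January 2, 2019
?>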

Related

Unicode for Contextual forms of ټ,ګ,ځ,څ,ڼ,ښ,ډ,ۍ,ړ,ې in Pashto language

I am developing a program that gives the correct contextual form of text. For example, if I write سلام it gives FEB3, FEE0, FE8E and FEE2, which are the Unicode code points of سـ, ـلـ, ﺎ, ـم. But if I write ټول, there is a Unicode code point for the character ټ, which is 067C, while there is no code point for ټـ, its initial contextual form.
I found the code points for the isolated forms of ټ,ګ,ځ,څ,ڼ,ښ,ډ,ۍ,ړ,ې on Wikipedia, but I can't find the code points of the contextual forms.
For example, the code points of ټـ ,ـټـ,ـټ.
If anyone knows a solution to this problem, I would appreciate a response.
Thanks...
A Unicode character is intended to be abstract in the sense that it doesn't have a particular presentation form. The preferred way to display cursive scripts like Arabic is to store the standard, non-contextual forms, and convert them to their cursive forms at display time - that is, as one of the final stages of a text display system in an operating system or word processor.
The cursive forms are usually provided as glyphs in the font, and are chosen using information in tables in the font file embodying the contextual rules.
Unicode stores quite a large number of Arabic contextual forms, but only for compatibility with older encodings, and with traditional metal type, for which only a finite number of physical glyphs can be supplied. Unfortunately for your purposes, these contextual forms don't cover all the extended characters used in languages other than Arabic, such as the example you give, which is U+067C ARABIC LETTER TEH WITH RING, used in Pashto.
It's very unlikely that further contextual Arabic forms will be added, in my opinion. Therefore your proposed program cannot be made to work, at least according to its current design.
Earlier Unicode versions included separate codes for the different presentation forms of most, but not all, Arabic letters. The Arabic script is used to write Pashto, Farsi, Urdu, and a few other languages. The letters used in Arabic, Farsi, and perhaps a couple of other languages were assigned a code for each of their forms; however, the letters used only by less widely supported languages like Pashto, which you are asking about, were assigned codes only for their isolated forms. In later versions of Unicode it was decided to assign just a single code to each letter, which leaves the Pashto-only letters with codes for the isolated forms alone.
Actually, there was no need for a separate code for each form; that was a bad decision in the earlier Unicode versions. A rendering engine (editors, and other programs that deal with plain text) should account for the different forms of each letter and display the correct form according to its position.
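To make the point concrete, here is a small PHP illustration of my own (it assumes the intl extension; the question itself is not PHP-specific): the standard encoding has a single code point for the abstract letter, while compatibility presentation forms exist only for a subset of letters, which is why you can find an initial form for BEH but not for TEH WITH RING.
<?php
// Unicode encodes the abstract letter once; the initial/medial/final shapes of
// U+067C are chosen by the shaping/rendering engine, not by separate code points.
var_dump(IntlChar::charName("\u{067C}")); // "ARABIC LETTER TEH WITH RING" (only this abstract code point exists)
var_dump(IntlChar::charName("\u{FE91}")); // "ARABIC LETTER BEH INITIAL FORM" (a compatibility form that does exist, for BEH)
?>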

Unicode range mapping between languages

There are 7,707 languages listed at http://www.sil.org/iso639-3/download.asp and http://en.wikipedia.org/wiki/ISO_639:a.
Unicode also supports the writing systems of these languages, but I want to know the mapping between the languages and the Unicode ranges.
The Unicode ranges are listed at http://www.unicode.org/roadmaps/bmp/
Example of one Unicode range: "start" => "0x0900", "end" => "0x097F", "block_name" => "Devanagari" (which languages use this range?).
Is there any documentation? I need a full mapping of the languages supported by Unicode to their ranges.
You can take a look at the ICU4C locale API (http://icu-project.org/apiref/icu4c/uloc_8h.html).
You can get all the locales (with uloc_getAvailable), then for each locale call uloc_addLikelySubtags, and then uloc_getScript on the result.
This is going to give you the most likely script used by a language. But there are languages that use more than one script. Some of them are captured by ICU, but some are not.
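For what it's worth, the same ICU operations are exposed by PHP's intl extension, so the approach can be sketched without writing C (an aside of mine; the locale and script data comes from ICU either way):
<?php
// Rough equivalent of the uloc_getAvailable / uloc_addLikelySubtags / uloc_getScript
// recipe above, using PHP's intl wrapper around ICU.
$map = [];
foreach (ResourceBundle::getLocales('') as $locale) {
    $language  = Locale::getPrimaryLanguage($locale);
    $maximized = Locale::addLikelySubtags($language);   // e.g. "hi" -> "hi_Deva_IN"
    $map[$language] = Locale::getScript($maximized);    // e.g. "Deva"
}
print_r($map); // most likely script per language; languages with several scripts get only one entry
?>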

Website localization for multibyte languages

I have started to code a multi-language feature for a medium-sized website with a lot of hardcoded text. As the website is supposed to be translated into Japanese and Korean (multibyte character sets), I am considering the following:
If I use string externalization, do the strings for Japanese or Korean need to be in escaped (HTML entity) form within the locale file (i.e. &#21488;&#21271; instead of 台北 as the string value)?
Would it make more sense to store the localization in a DB (e.g. MySQL) and retrieve the respective values via a localization function in PHP?
Your thought input is much appreciated.
Best regards
$0.02 from someone who has some experience with i18n...
Keep your translations in human-readable form, as it will likely be translators and not coders managing these resources.
If this text (hard-coded, you say) is not subject to frequent change, then you may wish to store these resources as files that you read in at runtime.
If this text is subject to frequent change, then you may wish to explore other alternatives for storing resources, such as databases or in-memory key-value stores.
Depending upon your requirements, you may want to consider a mixture of the above.
But I strongly suggest that you avoid mixing code (the HTML character entities) with your translation resources. Most translators will not understand what they mean and may break them when they are translating. And on the flip-side, a programmer may not understand how to insert code or formatting into the translation resources properly, unless they actually understand that language.
tl;dr
- use UTF-8
- don't mix any code/formatting into the translations themselves
- how you store the translations depends upon your requirements
I doubt that string externalization will be your biggest problem, but let me give you some advice.
String externalization
Of course you would need to separate translatable strings from the code. I would recommend storing translations in a plain-text, UTF-8 encoded file containing key-value pairs:
some.key=some translation
Of course you would need to write a helper script to resolve this at runtime. The script would also need to detect the end user's language.
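Such a helper could be as small as the sketch below (the loadMessages() name and the messages/ directory layout are my own examples, not an established convention):
<?php
// Load key=value pairs from a UTF-8 text file into an associative array.
function loadMessages(string $lang): array
{
    $messages = [];
    foreach (file("messages/{$lang}.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$key, $value] = explode('=', $line, 2);
        $messages[trim($key)] = trim($value);
    }
    return $messages;
}

$messages = loadMessages('ja');
echo $messages['some.key'] ?? 'some.key'; // fall back to the key itself if a translation is missing
?>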
Language detection
Web browsers are nice enough to send an Accept-Language header each time they make a request. What you need to do is read the content of this header and check whether you support any of the languages the user has listed. If so, read the resource file (as defined above) and return the strings for that language; otherwise return your default language. The code example below will give you the most desired language (which is not necessarily one you support):
<?php
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
echo $locale;
?>
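To go from "most desired" to "best supported", you can match the header against the languages you actually ship; a small sketch (the $supported list and the 'en' fallback are of course placeholders):
<?php
// Pick the best match among the supported languages, with a default fallback.
$supported = ['en', 'ja', 'ko'];
$requested = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '') ?: 'en';
echo Locale::lookup($supported, $requested, true, 'en');
?>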
This is still not the biggest of your challenges.
Styles and style sheets
The real problem with multilingual web sites or web applications is styles. People tend to put style definitions in-line, which is problematic to say the least. Designers also tend to think that Arial is the best font for the entire Universe, and that emphasis always has to come with a bold font. The only problem is, the font might be unreadable under some circumstances.
I must admit I don't know why it happens, but most of the time web browsers tend to ignore the bold attribute for Asian scripts (which is good), but sometimes they do not, and that can become a major challenge for end users if your font definition is, say, font-family: Arial; font-size: 10px;.
The other problem could be colors. Depending on your web site design, some colors used might be inappropriate for target customers. That is because we all tend to assign meaning to colors based on our cultural background.
Images containing localizable text can also give you a headache; you would need to either externalize such text (and write it out just like any other HTML element) or prepare a multilingual resource structure (i.e. put the images into directories named after the language code: "en", "ja", "ko").
The real challenge, however, is hard-coded formatting tags like <b>, <i>, <u>, <strong>, etc. Nobody should use them nowadays; style classes should be used instead, but common practice is different. You would probably need to replace them with style classes; each element can have more than one style class, which to my surprise is not common knowledge (for example <p class="main boldText">).
OK, once you have your styles externalized, you will probably be forced to implement some sort of CSS localization mechanism. This is needed in light of what I wrote above. The easiest way to do that is to create a directory structure similar to the one I mentioned before - "en" for the English base CSS files, "ja" for Japanese and "ko" for Korean - so each language has its own, separate set of CSS files. This is similar to UI skins, except that in this case the user won't be able to choose the skin; you decide which CSS to present, since you are detecting the language anyway.
As for in-line style definitions (<p style="whatever">), after you define the CSS L10n mechanism you can override any style by forcing it with the !important keyword. That is, unless somebody, very wrongly, has already put this keyword into an in-line style definition.
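A tiny sketch of the per-language CSS idea (the directory layout and file names here are only examples I made up):
<?php
// Emit the base stylesheet plus a language-specific override sheet,
// based on the detected language.
$locale = Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '') ?: 'en';
$lang = in_array(Locale::getPrimaryLanguage($locale), ['en', 'ja', 'ko'], true)
    ? Locale::getPrimaryLanguage($locale)
    : 'en';
echo '<link rel="stylesheet" href="/css/base.css">' . "\n";
echo '<link rel="stylesheet" href="/css/' . $lang . '/overrides.css">' . "\n";
?>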
Concatenations
Well, this is your biggest challenge. Even people who understand the need for string externalization tend to concatenate strings like this:
$result = $label . ": " . $product;
$message = $your_basket_is . $basket_status . ".";
This poses a serious problem for internationalization (and, if it is not resolved, for localization as well). That is because the order of the sentence tends to be different after translating the text into another language (this especially applies to Korean). I have also shown hard-coded punctuation, which is not necessarily correct for Asian languages. That is what I have to go through on a daily basis :/
What you would probably need to do is remove such concatenations, or use some means of message formatting. The PHP example (taken directly from the web page I am referencing) would be:
<?php
$fmt = new MessageFormatter("en_US", "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree");
echo $fmt->format(array(4560, 123, 4560/123));
$fmt = new MessageFormatter("de", "{0,number,integer} Affen auf {1,number,integer} Bäumen sind {2,number} Affen pro Baum");
echo $fmt->format(array(4560, 123, 4560/123));
?>
As you can see in this example, the numbers are also formatted to match the locale's style. This leads us to:
Locale aware formatting
Dates, times, numbers, currencies and other similar information need to be formatted according to the user's detected locale. There is a slight difference here: you should attempt to do that even if you do not support the related language resources (i.e. do not have translations). Of course, for the currency symbol you would use whatever your real currency is, not the user's default, but the format should respect the end user's cultural background.
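For illustration, my own short sketch using the same intl extension as the MessageFormatter example above (the locales and the EUR currency are arbitrary choices):
<?php
// The same amount and timestamp, formatted per locale.
foreach (['en_US', 'ja_JP', 'ko_KR'] as $locale) {
    $money = new NumberFormatter($locale, NumberFormatter::CURRENCY);
    $date  = new IntlDateFormatter($locale, IntlDateFormatter::LONG, IntlDateFormatter::SHORT);
    echo $money->formatCurrency(1234.56, 'EUR'), ' | ',
         $date->format(new DateTime('2019-02-01 12:00')), "\n";
}
?>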
Summary
I have just presented you with a short introduction to multilingual web site design, with a focus on the Japanese and Korean target markets. If at some point you need to support Simplified Chinese as well, support for the GB18030 encoding would probably be needed too. That would be very challenging...
You do not want to store all your text as HTML entities. It'll drive you mad. The only reason to do this is if you need to serve your document in an ASCII encoding and cannot embed the characters directly. But in this day and age there's no reason for that; serve your document as UTF-8 and write and store your contents in UTF-8 and be done with it.
Whether or not to store translations in the database depends on many factors, including performance, caching, whether you need to be able to search for the text, whether the text should be editable by non-programmers etc. Usually .mo/.po translation files with gettext are a good way to go unless proven otherwise.
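If you do go the gettext route, the PHP side is roughly the sketch below (assuming the gettext extension and compiled .mo files under ./locale/<locale>/LC_MESSAGES/; the domain name "messages" and the paths are just examples):
<?php
// Bind a translation domain and look up strings through gettext.
$locale = 'ja_JP.UTF-8';
putenv("LC_ALL={$locale}");
setlocale(LC_ALL, $locale);
bindtextdomain('messages', __DIR__ . '/locale');
bind_textdomain_codeset('messages', 'UTF-8');
textdomain('messages');

echo _('Add to basket'); // printed in Japanese if a translation exists, in English otherwise
?>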

Latin<->Han Conversion in ICU?

I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.
According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, this seems surprising to me, as Latin-Han conversion is particularly difficult to do without highly advanced statistical techniques (the closest I have seen is Google Transliterate, which actually does a great job with this even without user input, but is infeasible for the present project), much less conversion without tone marks. I am skeptical that this is even possible without resorting to the de facto foreign-name borrowing characters such as 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).
Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.
While Han-Latin worked pretty passably (about 80% accuracy for simple data), Latin-Han seemed not to work at all, returning the same "latin" string that was input, which is consistent with the results I get using the online transform sample, and consistent with what I know about Chinese. I managed to find this table, which I think is what is used for both sources, as we can see here:
{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },
I would presume this meant that given a pinyin string it could potentially work to reproduce the original, but this does not seem to be the case.
I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?
Thank you for your time
Note that the data is from the CLDR project, http://cldr.unicode.org. ICU supports many script pairs, and it will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round trip.
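As an aside (my own addition, not part of the original answer): the same ICU transforms are reachable from PHP's intl extension, which makes it easy to poke at the behaviour described above and to list the transform IDs that are actually shipped:
<?php
// Han-Latin works in the forward direction; the reverse transform is not expected
// to round trip, per the note in the CLDR Han-Latin data file.
$hanToLatin = Transliterator::create('Han-Latin');
echo $hanToLatin->transliterate('比尔·莫瑞'), "\n"; // Latin (pinyin-style) transliteration

$latinToHan = Transliterator::create('Latin-Han');
var_dump($latinToHan); // may be created, but do not expect it to reconstruct the original Han text

print_r(Transliterator::listIDs()); // the full list of transform IDs this ICU build supports
?>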

Which Perl module can handle a variety of date formats containing unicode characters?

My requirement is parsing XML files which contain a wide variety of timestamps based on the locales in which they were written. They may contain Unicode characters in the case of Chinese or Korean locales. I have to parse these timestamps and put them into a standard format, something like 2009-11-26 12:40:54, in order to store them in an Oracle database. Sometimes I may not even know the locale, and yet I have to parse the timestamps.
I am looking for a module that automatically detects the timestamp format (including Unicode characters for am and pm in the local language) and converts it to epoch time, so that I can then convert it back to whatever format I like.
I have gone through similar questions in this forum. A few suggested the DateFormat module and the Date::Parse module. The Perl distribution I am using is 5.10, so Date::Manip doesn't come as a core module. As I am supposed to use just the basic core modules and a few CPAN modules (on request; I cannot ask for all),
I kindly request that you suggest a good module that satisfies all my requirements.
Thanks in advance
DateTime::Locale should do what you want.
You might have a look at Date::Manip. I don't know if it supports the languages you need, but there is some UTF-8 support in it. In any case, once you have the dates extracted, it has a UnixDate method to easily output them in whatever format you want. It also resolves time zones.