Where can I find a list of language + region codes?

I have googled (well, DuckDuckGo'ed, actually) till I'm blue in the face, but cannot find a list of language codes of the type en-GB or fr-CA anywhere.
There are excellent resources about the components, in particular the W3C I18n page, but I was hoping for a simple alphabetical listing, fairly canonical if possible (something like this one). I can't find one anywhere.
Can anyone point me in the right direction? Many thanks!

There are several language code systems and several region code systems, as well as their combinations. As you refer to a W3C page, I presume that you are referring to the system defined in BCP 47. That system is orthogonal in the sense that codes like en-GB and fr-CA simply combine a language code and a region code. This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
The specifications that you have found define the general principles as well as the authoritative sources for the different “subtags” (such as the primary language code and the region code). For the most important parts, the official registration authority maintains the two- and three-letter ISO 639 codes for languages, and the ISO site contains the two-letter ISO 3166 codes for regions. The lists are quite readable, and I see no reason to use anything other than these primary resources, especially where possible changes are concerned.

There are two components in play here:
The language tag, which is generally defined by ISO 639-1 alpha-2
The region tag, which is generally defined by ISO 3166-1 alpha-2
You can mix and match languages and regions in whichever combination makes sense to you, so there is no list of all possibilities.
BTW, you're effectively using a BCP47 tag, which defines the standards for each locale segment.
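To make that concrete, here is a minimal Python sketch of pulling such a tag apart into its two subtags. The regex only covers the simple language-region shape; real validation should consult the ISO/IANA registries mentioned above rather than just the pattern.

import re

# Minimal sketch: check the shape of a simple language-region tag
# (primary language subtag + optional region subtag). Real validation
# should use the IANA language-subtag registry, not just this pattern.
TAG_PATTERN = re.compile(r"^(?P<language>[a-z]{2,3})(?:-(?P<region>[A-Z]{2}|\d{3}))?$")

def split_tag(tag: str):
    """Split a tag like 'en-GB' into its language and region subtags."""
    match = TAG_PATTERN.match(tag)
    if not match:
        raise ValueError(f"not a simple language-region tag: {tag!r}")
    return match.group("language"), match.group("region")

print(split_tag("en-GB"))   # ('en', 'GB')
print(split_tag("fr-CA"))   # ('fr', 'CA')
print(split_tag("es-419"))  # ('es', '419'), a UN M.49 region code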

Unicode maintains such a list:
http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html
Even better, you can get it in XML format (ideal for parsing), which also includes the usual writing systems used by each language:
http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
(look at the /languageData element)
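If you go that route, the XML is easy to mine with a standard parser. A minimal Python sketch, assuming the file has been downloaded locally; the attribute names are those used by recent CLDR releases and may differ in older snapshots:

import xml.etree.ElementTree as ET

# Sketch: extract language -> territories from CLDR supplementalData.xml.
# Assumes a local copy of the file; the "type" and "territories"
# attributes follow recent CLDR releases.
tree = ET.parse("supplementalData.xml")
for lang in tree.getroot().iterfind("./languageData/language"):
    code = lang.get("type")                       # e.g. "en"
    regions = (lang.get("territories") or "").split()
    if regions:
        print(code, regions)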

One solution would be to parse this list; it contains all of the keys needed to create the list you are looking for.
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
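The registry is a plain-text file of records separated by %% lines, so it is straightforward to parse. A rough Python sketch (records can repeat fields such as Description; this sketch keeps only the last value of each field):

# Sketch: parse the IANA language-subtag registry into dictionaries.
# Continuation lines, which start with whitespace, are folded into the
# previous field; repeated fields keep only their last value.
def parse_registry(text: str):
    records = []
    for chunk in text.split("%%"):
        record, last_key = {}, None
        for line in chunk.splitlines():
            if line.startswith((" ", "\t")) and last_key:
                record[last_key] += " " + line.strip()
            elif ": " in line:
                last_key, value = line.split(": ", 1)
                record[last_key] = value
        if record:
            records.append(record)
    return records

with open("language-subtag-registry", encoding="utf-8") as f:
    subtags = parse_registry(f.read())
languages = [r for r in subtags if r.get("Type") == "language"]
regions = [r for r in subtags if r.get("Type") == "region"]
print(len(languages), "languages,", len(regions), "regions")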

I think you can take it from here: http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html

This can be found at Unicode's Common Locale Data Repository. Specifically, a JSON file of this information is available in their cldr-json repo
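For example, a quick Python sketch that pulls the list of available locales. The raw-file URL and the JSON layout are assumptions based on the repo's current structure and may change between CLDR releases:

import json
import urllib.request

# Sketch: fetch the list of available CLDR locales from the cldr-json
# repo. URL and JSON layout are assumptions that may change per release.
URL = ("https://raw.githubusercontent.com/unicode-org/cldr-json/main/"
       "cldr-json/cldr-core/availableLocales.json")
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)
print(data["availableLocales"]["modern"][:10])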

We maintain a working list that we use for language code/language name referencing for Localizejs. Hope that helps:
List of Language Codes in YAML or JSON?

List of primary language subtags, with common region subtags for each language (based on population of language speakers in each region):
https://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
For example, for English:
en-US (320,000,000)
en-IN (250,000,000)
en-NG (110,000,000)
en-PK (100,000,000)
en-PH (68,000,000)
en-GB (64,000,000)
(Jukka K. Korpela and tigrish give good explanations for why any combination of language + region code is valid, but it might be helpful to have a list of codes most likely to be in actual use. s-f's link has such useful information sorted by region, so it might also be helpful to have this information sorted by language.)

Related

How does this German word work in a custom dictionary?

The official help page provides an example in German.
https://help.libreoffice.org/6.1/he/text/shared/optionen/01010400.html
In language-dependent custom dictionaries, the field contains a known root word, as a model of affixation of the new word or its usage in compound words. For example, in a German custom dictionary, the new word “Litschi” (lychee) with the model word “Gummi” (gum) will result in recognition of “Litschis” (lychees), “Litschibaum” (lychee tree), “Litschifrucht” (lychee fruit), etc.
Can someone provide an english example just to make it clear how it works?
Indeed (I am German). But you certainly know that this is to some extent a "German problem", because we have tons and tons of compound words. Everything is a compound word. Just some examples: "door handle" is "Türklinke", "coffee machine" is "Kaffeemaschine", "birthday cake" is "Geburtstagskuchen", and the infamous "Association for Subordinate Officials of the Main Maintenance Building of the Danube Steam Shipping Electrical Services" is "Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft". Really.
Now to your question.
I must say, after searching a bit, I think for English this is basically a useless feature. There are just too few real compound words, and they are not very flexible. But here is an actual example. Take "sun" as the new word, and specify "moon" as the model word. Then the dictionary can deduce from "moonlight" that there is also a "sunlight". But I think in English this example almost exhausts the entire capability, while in a language like German you can generate a very large number of words, including all of their declensions (which is really useful).
For apple (instead of lychee) some examples would be:
English: Apple tree, apple pie, apple juice, apples
German: Apfelbaum, Apfelkuchen, Apfelsaft, Äpfel
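To illustrate the mechanism, here is a toy Python sketch of the "model word" idea. This is only an illustration, not how the actual LibreOffice/Hunspell machinery works internally:

# Toy illustration of the "model word" idea: known forms of the model
# word are turned into suffix patterns, which are applied to the new word.
def derive_forms(new_word: str, model_word: str, known_model_forms: list[str]):
    forms = []
    for form in known_model_forms:
        if form.startswith(model_word):   # suffixation / compounding
            forms.append(new_word + form[len(model_word):])
    return forms

# "moon" is the model word; "sun" is the new word being added.
print(derive_forms("sun", "moon", ["moons", "moonlight", "moonrise"]))
# ['suns', 'sunlight', 'sunrise']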
What exactly don't you understand? BTW, I am German.

Canonical list of Scala 'operators'

It seems easy to find general info on a specific 'operator' (method, syntactic sugar), but I can't seem to find anything that has a list of all, or even just most, of these goodies. This makes it fairly difficult, or at least overly time-consuming, to work through learning the language.
I have already looked over this question. While it has great information, and definitely shows you how to find any information you need, I was hoping for something like a 'pocket ref' that just had all the relevant info and was only dedicated to that.
So, my question is this:
Is there such a list?
Am I getting ahead of myself by looking for such a reference early on in learning the language?
Thanks in advance.
Well, a list of all operators makes as much sense as a list of all methods in the library, regardless of the type. It isn't going to be particularly useful except for finding information about a specific operator.
However, if you do want one, at any ScalaDoc site (http://www.scala-lang.org/api/current/ for the standard library) there is an alphabetic index just under the search bar. The first link (#) lists all the non-alphabetic methods (i.e. "operators").
Many of these are rarely used, or only in specific circumstances.
Obviously, any other library can introduce its own operators, and you'll need to check its own documentation.

Get language of string

Let's say I have title strings written in different languages.
Is there a way to check which language each string is in?
I have not played with it, but you should look at NSLinguisticTagger and its - (NSOrthography *)orthographyAtIndex:(NSUInteger)charIndex effectiveRange:(NSRangePointer)effectiveRange method. From the NSOrthography docs:
The NSOrthography class describes the linguistic content of a piece of text, typically used for the purposes of spelling and grammar checking.
An NSOrthography instance describes:
Which scripts the text contains.
A dominant language, and possibly other languages, for each of these scripts.
A dominant script and language for the text as a whole.
Scripts are uniformly described by standard four-letter tags (Latn, Grek, Cyrl, etc.), with the supertags Jpan and Kore typically used for Japanese and Korean text, and Hans and Hant for Chinese text; the tag Zyyy is used if a specific script cannot be identified. See Internationalization Programming Topics for more information on internationalization.
Languages are uniformly described by BCP-47 tags, preferably in canonical form; the tag und is used if a specific language cannot be determined.
You can simply use the Google Translate REST API to find the language.
And you can use something like RestKit to make the REST requests to Google's servers.
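The call has the same shape in any HTTP stack (RestKit included); here is a minimal Python sketch against the v2 detect endpoint, assuming you have your own API key. The response layout follows the v2 REST API and may differ in newer API versions:

import requests  # third-party: pip install requests

# Sketch: detect the language of a string with the Cloud Translation v2
# REST API. YOUR_API_KEY is a placeholder; the response layout follows
# the v2 API and may differ in newer versions.
resp = requests.post(
    "https://translation.googleapis.com/language/translate/v2/detect",
    params={"key": "YOUR_API_KEY"},
    data={"q": "Guten Morgen"},
)
resp.raise_for_status()
print(resp.json()["data"]["detections"][0][0]["language"])  # e.g. "de"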
You could use N-gram frequency sampling techniques. I am not an expert, but they are rumored to work well in practice.
See Netspeak and papers like this one, etc.
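A bare-bones Python sketch of the character n-gram idea. The tiny training texts here are placeholders; real implementations build profiles from large corpora (cf. Cavnar & Trenkle's n-gram text categorization):

from collections import Counter

# Bare-bones character-trigram language guesser. The tiny training
# texts are placeholders; real profiles come from large corpora.
def trigram_profile(text: str) -> Counter:
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def similarity(a: Counter, b: Counter) -> int:
    return sum((a & b).values())  # overlap of trigram counts

profiles = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "de": trigram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess_language(text: str) -> str:
    probe = trigram_profile(text)
    return max(profiles, key=lambda lang: similarity(profiles[lang], probe))

print(guess_language("the dog sleeps"))     # en
print(guess_language("der hund schlaeft"))  # de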
There's a Google Translate API available that handles translation between languages. I am sure you will find something there that returns the matched language for your string. See the Google Translate APIs for details.

Universal meta-language for "simple" programs

I realize that it is impossible to have one language that is best for everything.
But there is a class of simple programs, whose source code looks virtually identical in any language.
I am thinking not just "hello world", but also arithmetic, maybe string manipulation, basic stuff that you would typically see in utility classes.
I would like to keep my utilities in this meta-language and have it automatically translated to a bunch of popular languages. I do this by hand right now.
Again, I am not asking for translation of every single possible program. I am thinking of a very limited, simple, but super-portable language.
Do you know of anything like that? Is there a reason why it should not exist?
Check out Haxe and its Wikipedia page. It's open source, and its main purpose is what you describe: generating code in many languages from a single source.
Just about any language that you choose is going to have some feature that doesn't map to another in a natural way. The closest thing I can think of is probably a useful subset of JavaScript. Of course, if you are the language author you can limit it as much as you want, providing only constructs that are common to just about any language (loops, conditionals, etc.)
For purposes of mutability, an XML representation would be best, but you wouldn't want to code in it.
If you find that there is no universal language, you can try a pragmatic model-driven development approach, using a template-based code generator.
In the template you capture the underlying concepts of an algorithm. Then you add code for this algorithm in one or more specific languages (C++, Java, JS, Python) as necessary. You would have to do that anyway, whatever language or approach you choose. A configuration switch picks the correct language for any template you apply.
AtomWeaver is a code generator that works with templates and employs ABSE as the modeling approach.
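A minimal Python sketch of that template idea. The format here is hand-rolled and hypothetical, not AtomWeaver's actual one:

# Minimal sketch of template-based generation: one language-neutral
# description of an algorithm, one hand-written template per target
# language. Hypothetical format, not AtomWeaver's actual one.
TEMPLATES = {
    "python": "def {name}({arg}):\n    return {arg} * {arg}\n",
    "java":   "static int {name}(int {arg}) {{ return {arg} * {arg}; }}\n",
    "c":      "int {name}(int {arg}) {{ return {arg} * {arg}; }}\n",
}

model = {"name": "square", "arg": "x"}  # the language-neutral "model"

for language, template in TEMPLATES.items():
    print(f"--- {language} ---")
    print(template.format(**model))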
I did some looking and found this:
https://www.indiegogo.com/projects/universal-programming-language
It looks interesting.
Classic Pascal is very simple. Oberon is another similar option. Or you could invent your own derivative language similar to the pseudocode from computer science textbooks. It's trivial to implement a translator from one of those languages into any decent modern imperative language.

How was the Google Books' Popular passages feature developed?

I'm curious if anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their Popular Passages feature. If you know of any other application that can do the same, please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... by Georgios N. Yannopoulos, you can see something like:
Popular passages
... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86
Appears in 15 books from 1968-2003
This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86
Appears in 8 books from 1968-2000
It must be an intensive pattern-matching process. I can only think of n-gram models, text corpora, and automatic plagiarism detection. But n-grams are probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, popular passages can contain a great many words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response which programming languages are best suited for this stuff: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself.)
PS: Can someone add the tag automatic-plagiarism-detection? I can't.
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
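The core trick in that line of work is shingling: slide a fixed-length word window over every book, and passages quoted in many books surface as windows shared across sources. A minimal Python sketch of the idea (a simplification, not Google's actual pipeline):

from collections import defaultdict

# Sketch: find word 8-grams ("shingles") shared across books. Passages
# quoted in many books show up as shingles with many distinct sources.
# A simplification of the shingling idea, not Google's pipeline.
SHINGLE_LEN = 8

def shingles(text: str):
    words = text.lower().split()
    for i in range(len(words) - SHINGLE_LEN + 1):
        yield " ".join(words[i:i + SHINGLE_LEN])

def popular_passages(books: dict[str, str], min_books: int = 2):
    sources = defaultdict(set)
    for title, text in books.items():
        for shingle in shingles(text):
            sources[shingle].add(title)
    return {s: b for s, b in sources.items() if len(b) >= min_books}

books = {
    "book_a": "we have not settled because we have not anticipated the question",
    "book_b": "as hart said we have not settled because we have not anticipated it",
}
for passage, found_in in popular_passages(books).items():
    print(f"{passage!r} appears in {len(found_in)} books")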
In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help to detect plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.
If you know which books are citing or referencing other books, you don't need to look at all possible books, only the books that are citing each other. If it is a scientific reference, line and page numbers are often included with the quote, or can be found in the bibliography at the end of the book, so maybe Google parses only this information?
Google Scholar certainly has information about citations from paper to paper, and maybe from book to book too.