Let's say I have a title string, written in different languages.
Is there a way to check which language each string is written in?
I have not played with it but you should look at NSLinguisticTagger and its - (NSOrthography *)orthographyAtIndex:(NSUInteger)charIndex effectiveRange:(NSRangePointer)effectiveRange method. From the NSOrthography docs:
The NSOrthography class describes the linguistic content of a piece of
text, typically used for the purposes of spelling and grammar
checking.
An NSOrthography instance describes:
Which scripts the text contains.
A dominant language and possibly other languages for each of these scripts.
A dominant script and language for the text as a whole.
Scripts are uniformly described by standard four-letter tags (Latn,
Grek, Cyrl, etc.) with the supertags Jpan and Kore typically used for
Japanese and Korean text, Hans and Hant for Chinese text; the tag Zyyy
is used if a specific script cannot be identified. See
Internationalization Programming Topics for more information on
internationalization.
Languages are uniformly described by BCP-47 tags, preferably in
canonical form; the tag und is used if a specific language cannot be
determined.
You can simply use the Google Translate REST API to find the language.
And you can use something like RestKit to make the REST requests to the Google servers.
You could use N-gram frequency techniques. I am not an expert, but they are rumored to work well in practice.
See netspeak and papers like this etc etc.
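To make the N-gram idea concrete, here is a minimal sketch that compares character-trigram counts of a title against per-language profiles using cosine similarity. The sample sentences, the language set and the function names are assumptions made for illustration; a practical detector would build its profiles from much larger corpora.

```python
from collections import Counter
import math

def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat sat on the mat with the other animals",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chat est assis sur le tapis avec les autres animaux",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann sitzt die katze auf der matte mit den anderen tieren",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def guess_language(title):
    """Return the code of the sample language whose profile is closest to the title."""
    profile = trigram_profile(title)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Very short titles carry few trigrams, so results on one- or two-word strings should be treated with caution.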
There's a Google translation API available that handles language conversion. I am sure you will find something in it that returns the detected language for your string. See Google Translate APIs for details.
I am trying to make chat bot. I searched for some solutions and programs to help me.
Can someone tell me if Program-O uses natural language processing?
I have searched on Google but I didn't find the answer.
Program-O is basically an engine that uses recursive pattern-matching on AIML to find a suitable response.
The answer given here explains NLP in AIML in a bit more detail.
The pertinent paragraph being:
If by "natural language processing" you mean what is commonly called a "learning bot," the ALICE (AIML) bot does not meet the definition. The ALICE program (whose "brain" is the AIML scripting language) is a pattern-matching program. It searches a fairly large database - usually about 40,000 entries - for a phrase or term that matches one in the input, then selects a reply from the set designated by the closest match. It neither writes to its own files or generates spontaneous output. It doesn't "learn" by itself. Any changes or new information must be hard-coded into the AIML files by the botmaster.
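The pattern-matching behaviour described in that paragraph can be illustrated with a small sketch. This is not Program-O's or ALICE's actual code, and it does not parse AIML; the rule table, the `*` wildcard handling and the function names are assumptions chosen to show the idea of picking the most specific matching pattern and filling in a canned reply.

```python
import re

# Hypothetical rule table standing in for AIML categories: pattern -> reply template.
RULES = {
    "HELLO": "Hi there!",
    "MY NAME IS *": "Nice to meet you, {star}.",
    "*": "I do not have an answer for that.",
}

def normalize(text):
    """Upper-case the input and drop punctuation, as pattern matchers commonly do."""
    return re.sub(r"[^A-Z0-9 ]", "", text.upper()).strip()

def respond(user_input):
    words = normalize(user_input)
    # Try patterns with the most literal words first, so the bare "*" is the last resort.
    for pattern in sorted(RULES, key=lambda p: -len(p.replace("*", "").split())):
        regex = "^" + re.escape(pattern).replace(r"\*", "(.+)") + "$"
        match = re.match(regex, words)
        if match:
            star = match.group(1).title() if match.groups() else ""
            return RULES[pattern].format(star=star)
    return ""  # nothing matched (only possible for empty input here)
```

Nothing in this loop learns: every reply comes from the hand-written table, which is the point the quoted answer makes about AIML.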
I would like to have my design stored as a file for version control.
Are there any standards or commonly used formats?
For example, I can write one file for structure definition:
User {
uid,
name
}
And another file for API definition:
GET /users/:uid => User
GET /users?name=:name => [User]
However, these are just my own conventions. Are there any commonly used formats for representing these?
I expect it to be something like UML, regardless of language, just focusing on API itself.
The notation you mention is quite close to what developers would expect to get as a design or specification, so that might be enough.
However, if your project reaches a certain scale, you can adopt a notation that tools can then use to automate code generation, testing or documentation.
In particular, Swagger is quite commonly used for this. If you write your specification following its format, you'll get documentation and even some code generation from the tooling.
https://swagger.io/specification/
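As a rough sketch of how the User structure and the two GET endpoints from the question might look in that format, the snippet below builds an OpenAPI 3.0 style document as a Python dictionary and renders it as JSON, which can be committed to version control. The title, version, field types and descriptions are assumptions; the question does not specify them.

```python
import json

# Assumed field types: the question only names the fields uid and name.
user_schema = {
    "type": "object",
    "properties": {"uid": {"type": "string"}, "name": {"type": "string"}},
}
user_ref = {"$ref": "#/components/schemas/User"}

def json_response(description, schema):
    """Build a 200 response entry that returns JSON with the given schema."""
    return {"200": {"description": description,
                    "content": {"application/json": {"schema": schema}}}}

spec = {
    "openapi": "3.0.3",
    "info": {"title": "Users API (hypothetical)", "version": "0.1.0"},
    "paths": {
        # GET /users/:uid => User
        "/users/{uid}": {"get": {
            "parameters": [{"name": "uid", "in": "path", "required": True,
                            "schema": {"type": "string"}}],
            "responses": json_response("A single user", user_ref),
        }},
        # GET /users?name=:name => [User]
        "/users": {"get": {
            "parameters": [{"name": "name", "in": "query",
                            "schema": {"type": "string"}}],
            "responses": json_response("Matching users",
                                       {"type": "array", "items": user_ref}),
        }},
    },
    "components": {"schemas": {"User": user_schema}},
}

def render_spec():
    """Serialize the specification as indented JSON text."""
    return json.dumps(spec, indent=2)
```

The same structure is often written by hand in YAML instead; generating it from code is only one option.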
I have googled (well, DuckDuckGo'ed, actually) till I'm blue in the face, but cannot find a list of language codes of the type en-GB or fr-CA anywhere.
There are excellent resources about the components, in particular the W3C I18n page, but I was hoping for a simple alphabetical listing, fairly canonical if possible (something like this one). I cannot find one.
Can anyone point me in the right direction? Many thanks!
There are several language code systems and several region code systems, as well as their combinations. As you refer to a W3C page, I presume that you are referring to the system defined in BCP 47. That system is orthogonal in the sense that codes like en-GB and fr-CA simply combine a language code and a region code. This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
The specifications that you have found define the general principles and also the authoritative sources on different “subtags” (like primary language code and region code). For the most important parts, the official registration authority maintains the three- and two-letter ISO 639 codes for languages, and the ISO site contains the two-letter ISO 3166 codes for regions. The lists are quite readable, and I see no reason to consider using other than these primary resources, especially regarding possible changes.
There are two components in play here:
The language tag, which is generally defined by ISO 639-1 alpha-2.
The region tag, which is generally defined by ISO 3166-1 alpha-2.
You can mix and match languages and regions in whichever combination makes sense to you so there is no list of all possibilities.
BTW, you're effectively using a BCP47 tag, which defines the standards for each locale segment.
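A minimal sketch of the mix-and-match idea: the helpers below split a tag into its language and region parts and build every combination of two small lists. The regular expression covers only the simple language[-region] shape and the function names are made up for this example; full BCP 47 tags can also carry script, variant and extension subtags, which are not handled here.

```python
import re

# Simplified shape (assumption): 2-3 letter language, optional 2-letter or 3-digit region.
TAG_RE = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}|[0-9]{3}))?$")

def split_tag(tag):
    """Split a simple tag such as 'en-GB' into (language, region or None)."""
    match = TAG_RE.match(tag)
    if not match:
        raise ValueError(f"not a simple language[-region] tag: {tag!r}")
    language, region = match.groups()
    # Conventional casing: lower-case language, upper-case region.
    return language.lower(), (region.upper() if region else None)

def combine(languages, regions):
    """Build every language-region combination, whether or not it is in real use."""
    return [f"{lang}-{region}" for lang in languages for region in regions]
```

For example, `combine(["en", "fr"], ["GB", "CA"])` yields four tags, including `fr-GB`, which is well-formed even though it is rarely needed.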
Unicode maintains such a list:
http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html
Even better, you can get it in XML format (ideal for parsing the list), along with the usual writing systems used by each language:
http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
(look in /LanguageData)
One solution would be to parse this list; it would give you all of the keys needed to create the list you are looking for:
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
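A sketch of what parsing that registry could look like, assuming its plain-text layout: records separated by lines containing only `%%`, each made of `Key: value` fields, with indented lines continuing the previous value. The embedded excerpt is a hand-written stand-in (its File-Date is a placeholder) rather than the downloaded file, and the function names are illustrative.

```python
# Hand-written excerpt shaped like the registry; the File-Date value is a placeholder.
SAMPLE = """\
File-Date: 2000-01-01
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: fr
Description: French
%%
Type: region
Subtag: GB
Description: United Kingdom
"""

def parse_registry(text):
    """Parse registry text into a list of {field name: [values]} records."""
    records = []
    for chunk in text.split("\n%%\n"):
        record, last_key = {}, None
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and last_key:
                # Continuation line: append to the previous field's last value.
                record[last_key][-1] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            last_key = key.strip()
            # A field such as Description may repeat, so values are kept in a list.
            record.setdefault(last_key, []).append(value.strip())
        if record:
            records.append(record)
    return records

def subtags(records, kind):
    """Return the subtags of every record whose Type equals kind."""
    return [r["Subtag"][0] for r in records if r.get("Type") == [kind] and "Subtag" in r]
```

Feeding the language and region subtags into a cross product gives every possible pair, so a practical list still needs filtering down to the combinations that matter.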
I think you can take it from here http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
This can be found at Unicode's Common Locale Data Repository. Specifically, a JSON file of this information is available in their cldr-json repo
We have a working list that we work off of for language code/language name referencing for Localizejs. Hope that helps
List of Language Codes in YAML or JSON?
List of primary language subtags, with common region subtags for each language (based on population of language speakers in each region):
https://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
For example, for English:
en-US (320,000,000)
en-IN (250,000,000)
en-NG (110,000,000)
en-PK (100,000,000)
en-PH (68,000,000)
en-GB (64,000,000)
(Jukka K. Korpela and tigrish give good explanations for why any combination of language + region code is valid, but it might be helpful to have a list of codes most likely to be in actual use. s-f's link has such useful information sorted by region, so it might also be helpful to have this information sorted by language.)
Is there a REST best practice for GETting resources in different languages? Currently, we have
www.mysite.com/books?locale=en
I know we can use the Accept-Language header, but is it better for us to do
www.mysite.com/books/en or www.mysite.com/books.en
or does it not matter?
If you're trying to have your server return different translations or localized versions of the same books (in other words, the same resource from a RESTful perspective), then use Accept-Language because the resource is the same but the representation is different based on the client's needs.
However, if you're trying to return completely different books based on the client's locale (say, returning books written in French if you know that the user is in France), then the URIs should be different, since different resources would be returned. At that point, you're talking about a query more than anything else. For what it's worth, the /books/en approach sounds reasonable. Another approach would be to add the locale or language as a query parameter to GET, as in /books?lang=en.
I think the best way would be to support both of the following:
HTTP's Accept-Language header
Language prefix in URI such as /en/books/...
In other words, you can accept the language from both sources. The implementation would be as follows:
Check if the Accept-Language header is provided and keep its value in a variable;
If the request URI starts with /en, /fr or another known language code (one supported by your system), overwrite the language variable with this new value and strip it from the URI, i.e. if the URI is /en/books you will end up with /books.
If no language is provided, keep the default language in the variable, for example "en".
With this approach you can make sure that a) path routing will be language agnostic and your system will work uniformly with paths; b) language handling/negotiation will be completely separated from your scripts. You can use language information in your scripts without even knowing what was the source and how it was requested.
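The steps above can be sketched as a small framework-independent function. The supported-language set, the default and the function names are assumptions for illustration, and the Accept-Language parsing is deliberately simplified (it reads q-values but ignores the `*` wildcard).

```python
SUPPORTED = {"en", "fr", "de"}  # hypothetical set of languages the system supports
DEFAULT = "en"

def parse_accept_language(header):
    """Return language tags from an Accept-Language header, highest q-value first."""
    weighted = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        params = params.strip()
        quality = 1.0
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        weighted.append((quality, tag.strip().lower()))
    # sorted() is stable, so tags with equal q-values keep their header order.
    ordered = sorted(weighted, key=lambda item: -item[0])
    return [tag for quality, tag in ordered if quality > 0]

def resolve_language(path, accept_language=""):
    """Return (language, path with any language prefix stripped)."""
    language = None
    # Step 1: take the best supported language from the header, if any.
    for tag in parse_accept_language(accept_language):
        primary = tag.split("-")[0]
        if primary in SUPPORTED:
            language = primary
            break
    # Step 2: a known language prefix in the URI overrides the header and is stripped.
    segments = path.lstrip("/").split("/", 1)
    if segments[0] in SUPPORTED:
        language = segments[0]
        path = "/" + (segments[1] if len(segments) > 1 else "")
    # Step 3: fall back to the default language.
    return language or DEFAULT, path
```

Calling `resolve_language` before routing means handlers only ever see language-free paths such as /books, together with a language value whose origin they do not need to know.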
I agree with manuel-aldana's comment on the answer to the question RESTful URL: where should I put locale? example.com/en/page vs example.com/page?locale=en
Check for a parameter (e.g. locale=en) first, to allow the client to explicitly specify the language, and fall back to Accept-Language.
I realize that it is impossible to have one language that is best for everything.
But there is a class of simple programs, whose source code looks virtually identical in any language.
I am thinking not just "hello world", but also arithmetic, maybe string manipulation, basic stuff that you would typically see in utility classes.
I would like to keep my utilities in this meta-language and have it automatically translated to a bunch of popular languages. I do this by hand right now.
Again, I do not ask for translation of every single possible program. I am thinking a very limited, simple language, but superportable.
Do you know of anything like that? Is there a reason why it should not exist?
Check Haxe, and its Wikipedia page. It's open source and its main purpose is what you describe: generating code in many languages from only one source.
Just about any language that you choose is going to have some feature that doesn't map to another in a natural way. The closest thing I can think of is probably a useful subset of JavaScript. Of course, if you are the language author you can limit it as much as you want, providing only constructs that are common to just about any language (loops, conditionals, etc.)
For purposes of mutability, an XML representation would be best, but you wouldn't want to code in it.
If you find that there is no universal language, you can try a pragmatic model-driven development approach, using a template-based code generator.
In the template you keep the underlying concepts of an algorithm. Then, you would add code for this algorithm in one or more specific languages (C++, Java, JS, Python) when necessary. You would have to do it anyway, whatever the language or approach you choose. A configuration switch would pick the correct language for any template you apply.
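A very small sketch of the template idea, using Python's standard `string.Template`: one concept (clamping a value to a range) has a hand-written body per target language, and a switch picks which one to emit. The concept, the templates and the `generate` helper are assumptions for illustration and are unrelated to how any particular generator tool works.

```python
from string import Template

# Hypothetical per-language templates for one concept: clamp a value to a range.
TEMPLATES = {
    "python": Template(
        "def $name(x, lo, hi):\n"
        "    return max(lo, min(x, hi))\n"
    ),
    "javascript": Template(
        "function $name(x, lo, hi) {\n"
        "  return Math.max(lo, Math.min(x, hi));\n"
        "}\n"
    ),
    "java": Template(
        "static double $name(double x, double lo, double hi) {\n"
        "    return Math.max(lo, Math.min(x, hi));\n"
        "}\n"
    ),
}

def generate(language, name):
    """Emit source code for the clamp concept in the chosen target language."""
    return TEMPLATES[language].substitute(name=name)
```

The per-language bodies are still written by hand; the template layer only keeps them side by side and makes the selection mechanical.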
AtomWeaver is a code generator that works with templates and employs ABSE as the modeling approach.
I did some looking and found this:
https://www.indiegogo.com/projects/universal-programming-language
It looks interesting.
Classic Pascal is very simple. Oberon is another similar option. Or you could invent your own derivative language similar to the pseudocode from computer science textbooks. It's trivial to implement a translator from one of those languages into any decent modern imperative language.