How does this German model word work in a custom dictionary? - libreoffice-writer

The official help page provides an example for German:
https://help.libreoffice.org/6.1/he/text/shared/optionen/01010400.html
In language-dependent custom dictionaries, the field contains a known
root word, as a model of affixation of the new word or its usage in
compound words. For example, in a German custom dictionary, the new
word “Litschi” (lychee) with the model word “Gummi” (gum) will result
in the recognition of “Litschis” (lychees), “Litschibaum” (lychee tree),
“Litschifrucht” (lychee fruit) etc.
Can someone provide an English example just to make it clear how it works?

Indeed (I am German). But you certainly know that this is to some extent a "German problem", because we have tons and tons of compound words. Everything is a compound word. Just some examples: "door handle" is "Türklinke", "coffee machine" is "Kaffeemaschine", "birthday cake" is "Geburtstagskuchen", and the infamous "Association for Subordinate Officials of the Main Maintenance Building of the Danube Steam Shipping Electrical Services" is "Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft". Really.
Now to your question.
I must say, after searching a bit, I think for English this is basically a useless feature: there are just too few real compound words, and they are not very flexible. But here is an actual example. Take "sun" as the new word and specify "moon" as the model word. Then the dictionary can deduce from "moonlight" that there is also a "sunlight". I think in English this example almost exhausts the entire capabilities, while in a language like German you can generate a very large number of words, including all of their declensions (which is really useful).

For apple (instead of lychee) some examples would be:
English: Apple tree, apple pie, apple juice, apples
German: Apfelbaum, Apfelkuchen, Apfelsaft, Äpfel
What exactly don't you understand? By the way, I am German.

Related

What type of NLP method to choose?

So, I'm going to build a prototype of a social web application that:
- Incorporates Facebook data of users (working hours, home and work locations)
to create a web app so that friends and friends of friends who have similar routes can drive/bike with each other.
However, for this app to be useful it should be able to extract keywords (e.g. working hours, or whether someone has to work late and posts this on Facebook). Now I'm reading about a lot of methods but I don't know which one to choose:
- Sentiment analysis
- Lexical analysis
- Syntactic parsing
Thanks in advance.
Ultimately what you want is a human-like intelligence that can read between the lines of all the posts to extract information. So, in general terms, you have the same Too Hard (currently) problem that everyone else in every branch of NLP faces. I'm just pointing that out because then you realize your question becomes: which imperfect approximation should I use?
Personally, I'd start with a simple text matcher. Look for strings like "Starting work today at 9". Gather your core list of sentences.
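As a minimal sketch of that first stage, in Python with nothing but regular expressions (the patterns and example posts are invented for illustration):

import re

# A couple of hand-written patterns for the core sentences; each captures the hour.
PATTERNS = [
    re.compile(r"\bstart(?:ing)?\s+work\s+today\s+at\s+(\d{1,2})\b", re.IGNORECASE),
    re.compile(r"\b(\d{1,2})\s+is\s+my\s+start\s+time\s+today\b", re.IGNORECASE),
]

def extract_start_time(post):
    """Return the hour captured by the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(post)
        if match:
            return int(match.group(1))
    return None

print(extract_start_time("Starting work today at 9"))  # 9
print(extract_start_time("Off to the beach!"))         # None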
Then you realize there are variations due to rephrasing: "Start work today at 9", "Starting today at 9", "9 is my start time today", etc. Bring in a sentence analyzer at this point, and instead of a string of ASCII codes the sentence turns into a string of nouns, adjectives and verbs.
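If you go the NLTK route, that step looks roughly like this (a sketch; it assumes the punkt and averaged_perceptron_tagger data packages have been downloaded with nltk.download()):

import nltk

tokens = nltk.word_tokenize("Starting work today at 9")
print(nltk.pos_tag(tokens))
# e.g. [('Starting', 'VBG'), ('work', 'NN'), ('today', 'NN'), ('at', 'IN'), ('9', 'CD')]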
You also have synonyms: "Starting my job today at 9", "Starting down the office today at 9", "Starting work today an hour later than normal". WordNet (and semantic networks generally) can help a bit. The last example there, though, not only requires parsing a fairly complicated clause, but also knowing their usual start time is 8. (Oh, on all the above you needed to know if they meant 9am or 9pm...)
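WordNet ships with NLTK, so a crude synonym lookup is short (again assuming the wordnet corpus has been downloaded):

from nltk.corpus import wordnet as wn

# Every lemma that shares a synset with "work" -- a crude synonym list.
synonyms = {lemma for synset in wn.synsets("work") for lemma in synset.lemma_names()}
print(synonyms)  # contents depend on the WordNet version; includes e.g. 'employment'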
By this point you realize you are gathering lots of fuzzy data. That is when you bring in some machine learning, to have it discover for you that one combination of the verb "start", the noun "work", the time-noun "today" and the number "9" is useful to you, and another isn't (e.g. "At work today learnt that new drama starts at 9. Forgot to set recorder. Aaarrggh!").
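That stage could start as small as NLTK's built-in naive Bayes classifier fed with hand-labelled feature dicts (the features and training data below are invented for illustration):

import nltk

# Toy, hand-labelled training data: (feature dict, label) pairs.
train = [
    ({"verb_start": True, "noun_work": True, "has_number": True}, "start_time"),
    ({"verb_start": True, "noun_work": True, "has_number": True}, "start_time"),
    ({"verb_start": True, "noun_work": True, "has_number": False}, "irrelevant"),
    ({"verb_start": False, "noun_work": True, "has_number": True}, "irrelevant"),
]
classifier = nltk.NaiveBayesClassifier.train(train)

# Should come out as "start_time" for this toy data.
print(classifier.classify({"verb_start": True, "noun_work": True, "has_number": True}))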
I think what you are looking for is a customized Named Entity Recognizer. NLTK could be a good starting point. However, the default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus and has not been trained to recognize dates and times, so you need to train your own classifier if you want to do that.
The link below gives a neat and detailed explanation:
http://mattshomepage.com/articles/2016/May/23/nltk_nec/
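For reference, running NLTK's default chunker on one of the example posts looks roughly like this (a sketch; it assumes the relevant NLTK data packages are installed). It will usually pick out names and places, but, as said above, not times:

import nltk

# Requires 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words'.
text = "Starting work at Google in London today at 9"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
print(tree)  # 'Google' and 'London' usually come back as named-entity subtrees; '9' does not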
Also, there is a module called timex in nltk_contrib which might help you with your needs.
https://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py
Cheers!

What does lowercaseStringWithLocale do?

Does anyone know exactly what the new (iOS 6) lowercaseStringWithLocale method of NSString does? The documentation is very skimpy, and I didn't find a single reference to this method in Apple's developer forums.
While localizing my app, I'm interested in changing words from my strings file to lowercase when they appear in a sentence -- except in the German version, where some words should stay in uppercase at all times. Is that what this method is for? Or something completely different?
The discussion in the documentation for lowercaseString might shed some light:
Note: This method performs the canonical (non-localized) mapping. It is suitable for programming operations that require stable results not depending on the user's locale preference. For localized case mapping for strings presented to users, use the corresponding lowercaseStringWithLocale: method.
So if you're computing the lowercase version of a string for a purpose such as case-insensitive database lookup, use lowercaseString. If you intend to show the user the result, then use lowercaseStringWithLocale.
Note that lowercaseStringWithLocale won't make a decision based on the actual words as to whether the word should be lowercased or not. It does what you ask it to do, and doesn't question your motives.
Lower/uppercasing is indeed locale-dependent. The only example I know about (and it's a killer one, a source of many globalization bugs) is the Turkish i issue. See here for an overview: http://www.codinghorror.com/blog/2008/03/whats-wrong-with-turkey.html
Basically, when you uppercase "Hi" you get "HI", except in a Turkish locale, where you get "Hİ".
Likewise, when you lowercase "HI" you get "hi", except in a Turkish locale, where you get "hı".
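If you want to see the locale-sensitive mapping in action outside of Foundation, here is a quick sketch using the PyICU bindings to ICU (assuming PyICU is installed; ICU implements the same Unicode case-mapping rules):

from icu import Locale, UnicodeString

print(str(UnicodeString("Hi").toUpper(Locale("en_US"))))  # HI
print(str(UnicodeString("Hi").toUpper(Locale("tr_TR"))))  # Hİ (dotted capital I)
print(str(UnicodeString("HI").toLower(Locale("tr_TR"))))  # hı (dotless lowercase i)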

Where can I find a list of language + region codes?

I have googled (well, DuckDuckGo'ed, actually) till I'm blue in the face, but cannot find a list of language codes of the type en-GB or fr-CA anywhere.
There are excellent resources about the components, in particular the W3C I18n page, but I was hoping for a simple alphabetical listing, fairly canonical if possible (something like this one). I cannot find one.
Can anyone point me in the right direction? Many thanks!
There are several language code systems and several region code systems, as well as their combinations. As you refer to a W3C page, I presume that you are referring to the system defined in BCP 47. That system is orthogonal in the sense that codes like en-GB and fr-CA simply combine a language code and a region code. This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
The specifications that you have found define the general principles and also the authoritative sources on different “subtags” (like primary language code and region code). For the most important parts, the official registration authority maintains the three- and two-letter ISO 639 codes for languages, and the ISO site contains the two-letter ISO 3166 codes for regions. The lists are quite readable, and I see no reason to consider using other than these primary resources, especially regarding possible changes.
There are 2 components in play here:
The language tag, which is generally defined by ISO 639-1 alpha-2
The region tag, which is generally defined by ISO 3166-1 alpha-2
You can mix and match languages and regions in whichever combination makes sense to you so there is no list of all possibilities.
BTW, you're effectively using a BCP47 tag, which defines the standards for each locale segment.
Unicode maintains such a list:
http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/index.html
Even better, you can have it in an XML format (ideal for parsing the list), and also with the usual writing systems used by each language:
http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
(look in /LanguageData)
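A rough sketch of pulling language/territory pairs out of supplementalData.xml with Python's standard library; the element and attribute names follow the CLDR file as I understand it, so double-check them against the current schema:

import xml.etree.ElementTree as ET
from urllib.request import urlopen

CLDR_URL = "http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml"

tree = ET.parse(urlopen(CLDR_URL))
for language in tree.iterfind(".//languageData/language"):
    lang = language.get("type")                                    # e.g. "en"
    for territory in (language.get("territories") or "").split():  # e.g. "US GB IN ..."
        print(f"{lang}-{territory}")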
One solution would be to parse this list; it would give you all of the keys needed to create the list you are looking for.
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
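That registry is plain text with records separated by "%%" lines, so it is straightforward to split apart; a minimal sketch in Python (field names follow the registry format defined in BCP 47):

from urllib.request import urlopen

REGISTRY = "http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry"

languages, regions = [], []
for record in urlopen(REGISTRY).read().decode("utf-8").split("%%\n"):
    # Turn "Field: value" lines into a dict; continuation lines are ignored.
    fields = dict(line.split(": ", 1) for line in record.splitlines() if ": " in line)
    if fields.get("Type") == "language":
        languages.append(fields["Subtag"])
    elif fields.get("Type") == "region":
        regions.append(fields["Subtag"])

print(len(languages), "language subtags,", len(regions), "region subtags")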
I think you can take it from here http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
This can be found at Unicode's Common Locale Data Repository. Specifically, a JSON file of this information is available in their cldr-json repo
We keep a working list that we use for language code/language name referencing for Localizejs. Hope that helps.
List of Language Codes in YAML or JSON?
List of primary language subtags, with common region subtags for each language (based on population of language speakers in each region):
https://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
For example, for English:
en-US (320,000,000)
en-IN (250,000,000)
en-NG (110,000,000)
en-PK (100,000,000)
en-PH (68,000,000)
en-GB (64,000,000)
(Jukka K. Korpela and tigrish give good explanations for why any combination of language + region code is valid, but it might be helpful to have a list of codes most likely to be in actual use. s-f's link has such useful information sorted by region, so it might also be helpful to have this information sorted by language.)

How was Google Books' Popular Passages feature developed?

I'm curious if anyone understands, knows, or can point me to comprehensive literature or source code on how Google created their Popular Passages feature. However, if you know of any other application that can do the same, please post your answer too.
If you do not know what I am writing about, here is a link to an example of Popular Passages. When you look at the overview of the book Modelling the legal decision process for information technology applications ... by Georgios N. Yannopoulos, you can see something like:
Popular passages
... direction, indeterminate. We have not settled, because we have not anticipated, the question which will be raised by the unenvisaged case when it occurs; whether some degree of peace in the park is to be sacrificed to, or defended against, those children whose pleasure or interest it is to use these things. When the unenvisaged case does arise, we confront the issues at stake and can then settle the question by choosing between the competing interests in the way which best satisfies us. In doing... Page 86
Appears in 15 books from 1968-2003
This would be a world fit for "mechanical" jurisprudence. Plainly this world is not our world; human legislators can have no such knowledge of all the possible combinations of circumstances which the future may bring. This inability to anticipate brings with it a relative indeterminacy of aim. When we are bold enough to frame some general rule of conduct (eg, a rule that no vehicle may be taken into the park), the language used in this context fixes necessary conditions which anything must satisfy... Page 86
Appears in 8 books from 1968-2000
more
It must be an intensive pattern matching process. I can only think of n-gram models, text corpora, automatic plagiarism detection. But n-grams are probabilistic models for predicting the next item in a sequence, and text corpora (to my knowledge) are created manually. And in this particular case, popular passages, there can be a great deal of words.
I am really lost. If I wanted to create such a feature, how or where should I start? Also, include in your response which programming languages are best suited for this stuff: F# or any other functional language, Perl, Python, Java... (I am becoming an F# fan myself)
PS: can someone include the tag automatic-plagiarism-detection, because I can't
Read this ACM paper by Kolak and Schilit, the Google researchers who developed Popular Passages. There are also a few relevant slides from this MapReduce course taught by Baldridge and Lease at The University of Texas at Austin.
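As a very rough illustration of the underlying idea (finding word sequences shared by many books), here is a toy shingling sketch in Python. It is my own simplification for illustration, not the algorithm from the paper:

from collections import defaultdict

def shingles(text, n=8):
    """Yield every n-word window ('shingle') of the text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def shared_passages(books, min_books=2, n=8):
    """Map each shingle to the set of book ids containing it, and keep
    the shingles that appear in at least min_books different books."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for shingle in shingles(text, n):
            index[shingle].add(book_id)
    return {s: ids for s, ids in index.items() if len(ids) >= min_books}

A real system would then merge overlapping shingles back into complete passages, rank them by how many books share them, and do all of this at MapReduce scale rather than in memory.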
In the small sample I looked over, it looks like all the passages picked were inline or block quotes. Just a guess, but perhaps Google Books looks for quote marks/differences in formatting and a citation, then uses a parsed version of the bibliography to associate the quote with the source. Hooray for style manuals.
This approach is obviously of no help for detecting plagiarism, and is of little help if the corpus isn't in a format that preserves text formatting.
If you know which books are citing or referencing other books, you don't need to look at all possible books, only at the books that cite each other. If it is a scientific reference, line and page numbers are often included with the quote, or can be found in the bibliography at the end of the book, so maybe Google parses only this information?
Google Scholar certainly has citation information from paper to paper, and maybe from book to book too.

Lucene.Net features

I am new to Lucene.Net.
Which is the best analyzer to use in Lucene.Net?
Also, I want to know how to use the stop words and word stemming features.
I'm also new to Lucene.Net, but I do know that the SimpleAnalyzer doesn't remove any stop words and indexes all tokens/words.
Here's a link to some Lucene info: http://darksleep.com/lucene/. There's a section in there about the three analyzers: Simple, Stop, and Standard. By the way, the .NET version is an almost line-by-line port of the Java version, so the Java documentation should work fine in most cases.
I'm not sure how Lucene.Net handles word stemming, but this link, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2, demonstrates how to create your own Analyzer in Java, and uses a PorterStemFilter to do word-stemming.
...[T]he Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English
I hope that is helpful.
The best analyzer I have found is the StandardAnalyzer, in which you can also specify stop words.
For example:
string indexFileLocation = @"C:\Index";
string stopWordsLocation = @"C:\Stopwords.txt";
var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
Analyzer analyzer = new StandardAnalyzer(
    Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));
It depends on your requirements. If your requirements are ultra simple - e.g. case-insensitive, non-stemming searches - then StandardAnalyzer is a good choice. If you look into the Analyzer class and get familiar with Filters, particularly TokenFilter, you can exert an enormous amount of control over your index by rolling your own analyzer.
Stemmers are tricky, and it's important to have a deep understanding of what type of stemming you really need. I've used the Snowball stemmers. For example, the words "policy" and "police" have the same root in the English Snowball stemmer, and getting hits on documents containing "policy" when the search term is "police" isn't so hot. I've implemented strategies to support both stemmed and non-stemmed search so that this can be avoided, but it's important to understand the impact.
Beware of temptations like stop words. If you need to search for the phrase "to be or not to be" and the standard stop words are enabled, your search will fail to find documents with that phrase.