Configure Lucene.Net to recognize homophones

We have a site where the user can enter the name of a city. Lucene.Net 2.1.0.3 is the search engine used to look up cities that have already been created. As configured, Lucene does not recognise that Saint Jerome is the same as St. Jerome, or that Lake Phillip is the same as Lac Phillip.
Any tips on widening the search strategy for Lucene.Net?

I've read a bit about synonym handling and "sounds like" matching (read: "I currently have no experience with this"). To me it seems like two different problems: abbreviation "synonyms" and "sounds like".
Sounds Like
Soundex is an older algorithm that was designed for misspellings of "American" names. There is an improved algorithm called Double Metaphone that addresses some of the complaints about Soundex. This library looks promising:
http://sourceforge.net/projects/phonetixnet/
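To give a feel for what a phonetic key buys you, here is a simplified, stand-alone Soundex sketch in C# (not the phonetixnet implementation, and it skips the full algorithm's H/W special cases): spelling variants collapse to the same short code, and you match on that code instead of on the raw text.

```csharp
using System;
using System.Text;

// Simplified Soundex: keep the first letter, turn later consonants into digits,
// drop vowels and repeated digits, and pad the result to four characters.
// (The real algorithm also has special handling for H and W, omitted here.)
public static class SimpleSoundex
{
    private static char Digit(char c)
    {
        switch (char.ToUpperInvariant(c))
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels and anything else carry no digit
        }
    }

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        StringBuilder code = new StringBuilder();
        code.Append(char.ToUpperInvariant(word[0]));

        char previous = Digit(word[0]);
        for (int i = 1; i < word.Length && code.Length < 4; i++)
        {
            char digit = Digit(word[i]);
            if (digit != '0' && digit != previous) // skip vowels and doubled sounds
                code.Append(digit);
            previous = digit;
        }

        return code.ToString().PadRight(4, '0');
    }
}

// SimpleSoundex.Encode("Phillip") and SimpleSoundex.Encode("Philip") both give "P410",
// so the two spellings can be matched on their phonetic key.
```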
Abbreviation Synonyms
While it seems there could be a generic synonym system, I would expect it to give "Garden City" synonyms like "Plot Town" or "Patch Burg". I am guessing you'll achieve better results with your own domain-specific synonyms.
It seems that words like 'Saint' ('St.') and 'Mount' ('Mt') would be best handled as synonyms. Here is an article that proposes a fairly simple solution using a custom analyzer: http://www.codeproject.com/KB/cs/lucene_custom_analyzer.aspx
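For the abbreviation case, a simple alternative (or complement) to a custom analyzer is to normalize city names with your own domain-specific map before they reach Lucene, applying the same normalization at index time and at query time. Here is a minimal C# sketch; the CityNameNormalizer class and its synonym list are hypothetical illustrations, not part of Lucene.Net:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pre-processing step: map known abbreviations to a canonical form
// so that "St. Jerome" and "Saint Jerome" index and search as the same tokens.
public static class CityNameNormalizer
{
    // Domain-specific synonyms; extend with whatever abbreviations your data contains.
    private static readonly Dictionary<string, string> Synonyms =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "st",  "saint" }, { "st.", "saint" },
            { "ste", "sainte" },
            { "mt",  "mount" }, { "mt.", "mount" },
            { "lac", "lake" }
        };

    public static string Normalize(string cityName)
    {
        string[] words = cityName.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i++)
        {
            string canonical;
            if (Synonyms.TryGetValue(words[i], out canonical))
                words[i] = canonical;
        }

        return string.Join(" ", words).ToLowerInvariant();
    }
}

// CityNameNormalizer.Normalize("St. Jerome")  -> "saint jerome"
// CityNameNormalizer.Normalize("Lac Phillip") -> "lake phillip"
```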

Related

VSCode query format for searching for files, types, etc

I'm trying to understand what the query format is when I press (Cmd + P) or (Cmd + T) and then type something.
Let's say I type ABC. It seems to me that VSCode searches using the regex A.*B.*C.*. Is that correct? It also appears that * is allowed in the query, but I got confusing results, for example here.
Can someone please point me to the documentation about the query format?
It is called "fuzzy" matching or searching. I couldn't find any formal documentation other than something like "implementing fuzzy matching". For your odd test case of vs*b, it looks like they are trying to implement fuzzy matching with out-of-order symbols, as some other editors have.
See also: More fuzzy matching in the VSCode documentation.
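For intuition only (this is not VSCode's actual code): the basic filtering step behaves like a case-insensitive subsequence match, which is why typing ABC behaves roughly like the regex A.*B.*C; the scoring and ranking applied on top of that is considerably more involved. A rough C# sketch of the subsequence part:

```csharp
using System;

public static class FuzzyFilter
{
    // True when every character of the query appears in the candidate, in order
    // (not necessarily adjacent). Ranking/scoring of matches is a separate concern.
    public static bool IsSubsequenceMatch(string query, string candidate)
    {
        int q = 0;
        for (int c = 0; c < candidate.Length && q < query.Length; c++)
        {
            if (char.ToLowerInvariant(candidate[c]) == char.ToLowerInvariant(query[q]))
                q++;
        }
        return q == query.Length;
    }
}

// FuzzyFilter.IsSubsequenceMatch("abc", "src/app/BasicController.ts") -> true
// FuzzyFilter.IsSubsequenceMatch("abc", "readme.md")                   -> false
```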
The file picker is not using regular expressions, but a fuzzy search algorithm. I think this feature is somehow connected to IntelliSense, but I am not aware of any detailed technical documentation. However, it was introduced in December 2015 (VSCode 0.10.6) and became a default setting in January 2016 (VSCode 0.10.9).
On GitHub you can find an issue collecting bug reports / feature requests regarding the fuzzy searching. If you want to dig much deeper into this topic, you might find a good starting point there.
As a side note, the User Settings (File > Preferences > Settings) also seem to use the same kind of fuzzy search.

Cite a paper using github markdown syntax

I am writing a README for a repo on GitHub, and I want to add a reference to a paper. What is the most appropriate way to write the citation? E.g., as a blockquote, as code, as simple text, etc.?
Suggestions?
I agree with Horizon_Net that it depends on personal preference.
I like to have something which looks similar to LaTeX.
An example is provided below.
Note that it demonstrates a numeric citation style.
An alphabetic or reading style is possible too.
In a numeric citation style, higher numbers should appear later in the text, which can make maintaining the numbering cumbersome.
To avoid this problem, I typically use an alphabetic citation style.
"...the **go to** statement should be abolished..." [[1]](#1).
## References
<a id="1">[1]</a>
Dijkstra, E. W. (1968).
Go to statement considered harmful.
Communications of the ACM, 11(3), 147-148.
"...the go to statement should be abolished..." [1].
References
[1]
Dijkstra, E. W. (1968).
Go to statement considered harmful.
Communications of the ACM, 11(3), 147-148.
In GitHub-flavored Markdown and most other Markdown flavors, you can actually click on [1] to jump to the reference.
Apologies for taking Dijkstra's sentence out of context.
The full sentence would make this example more difficult to read.
EDIT:
If the references all have a stable link, it is also possible to use those:
The field of natural language processing (NLP) has become mostly dominated by deep learning approaches
(Young et al., [2018](https://doi.org/10.1109/MCI.2018.2840738)).
Some are based on transformer neural networks
(e.g., Devlin et al, [2018](https://arxiv.org/abs/1810.04805)).
This renders as:
The field of natural language processing (NLP) has become mostly dominated by deep learning approaches
(Young et al., 2018).
Some are based on transformer neural networks
(e.g., Devlin et al, 2018).
From what I know, there is no built-in mechanism for this. That leaves it to subjective opinion and personal preference. I personally like to have a separate section called References. An example would look like the following:
References
some reference
another reference
Update
Another way would be to use simple HTML embedded in your Markdown.
The alternative (to pure markdown) is to use the new GitHub integration from Aug. 2021:
Enhanced support for citations on GitHub
GitHub now has built-in support for CITATION.cff files.
This new feature enables academics and researchers to let people know how to correctly cite their work, especially in academic publications/materials.
Originally proposed by the research software engineering community, CITATION.cff files are plain text files with human- and machine-readable citation information.
When we detect a CITATION.cff file in a repository, we use this information to create convenient APA or BibTeX style citation links that can be referenced by others.
How this works
Under the hood, we’re using the ruby-cff RubyGem to parse the contents of the CITATION.cff file and build a citation string that is then shown in GitHub when someone browses a repository with one of those files.
Now, that helps others by making your GitHub paper easily citable.
But, as documented, to "add a reference to a paper":
Citing something other than software
If you would prefer the GitHub citation information to link to another resource such as a research article, then you can use the preferred-citation override in CFF with the following types.
| Resource | Type |
| --- | --- |
| Research article | article |
| Conference paper | conference-paper |
| Book | book |
Extract
preferred-citation:
  type: article
  authors:
  - family-names: "Lisa"
    given-names: "Mona"
    orcid: "https://orcid.org/0000-0000-0000-0000"

What's a Good package for Phonetic Representation for Various Human Languages?

I'm currently working on a project for which I think being able to come up with phonetic representations of words in various languages would be really helpful. I know Aspell does this pretty well, but I don't think there's a very easy way to get at their phonetic representations, so I ask: is there some other good package for getting the phonetic representation of a word given the word and the language/dialect/accent/whatever it's coming from?
This doesn't need to be in any particular language, but if it were Perl, that would be best.
I've already tried Soundex, Metaphone, DoubleMetaphone, and everything else in Text::Phonetic, and none of that stuff was very good – definitely nowhere near as good as the stuff in Aspell.
The first thing that springs to mind is Soundex. Of course, there is a Perl Soundex module, too. While it is designed to generate a Soundex "key" from its input, it might be useful for mapping different variants to a common key.
There is a package Text::Aspell in CPAN. Might be useful.
If you are trying to make a Google-style suggestion/correction system, it's not based just on phonetics or AI, but on a massive amount of user input. When a user makes a search and doesn't click on any link, but corrects the input and searches again, that gives Google far more data about "correct" spelling than a phonetics test or dictionary matching would.
The main problem is human language itself: people don't speak or write in a deterministic way, let alone across multiple languages.
Of course, I might be wrong, but if you need a library that lets you do this:
getLanguage(string);
I want to see that working, really.

Syntax changes from the examples in 'The Little Schemer' to the real Scheme

I have recently started following the examples from The Little Schemer and when trying out the examples in DrScheme, I have realised that there are some minor syntax changes from the examples in the book to what I can write in DrScheme.
First of all, as a language in DrScheme, I chose Pretty Big (one of the Legacy Languages).
Is this the correct choice for trying the examples in the book?
Regarding the syntax changes, I have noticed that, for example, I need to prefix identifiers with a ' for them to work.
For example:
(rember 'jelly '(peanut butter jelly))
Are there any more changes (syntactical or not) that I need to be aware of when trying the examples from 'The Little Schemer' book?
IIRC, the book uses a different font for quoted pieces of data, and in real Scheme code that requires using quote. As for your use of PLT Scheme -- the "Pretty Big" language is really there just as a legacy language. You should use the Module language, and have all files start with #lang scheme (which should be there by default).
(The "new" way of using different languages in DrScheme is to always be in the Module "language" and specify the actual language using a #lang line.)
See the "Guidelines for the reader" section in the Preface. (I'm looking at the 4th edition here.)

Can aptitude for learning Programming paradigms be influenced by culture or native language's grammar? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
It is well known that different people have different aptitudes regarding various programming paradigms (e.g. some people have trouble learning non-procedural, especially functional languages. Some people have trouble understanding pointers - see Joel Spolsky's blog for musings on that. Some people have trouble grasping recursion).
I was recently reading about a study that looked at how the grammar of someone's native language affected their speed of learning math. I can't find that article now, but a quick googling found this reference.
That led me to wondering whether someone's native culture or first language might affect their aptitude for various programming paradigms. I'm more curious about positive influences - e.g. some trait that makes it easier/faster for someone to learn a particular paradigm, for example a native language grammar that is very recursion-oriented.
To be clear, I'm looking for how culture/language grammar may affect the difference in the same person's aptitude for various paradigms, as opposed to how it affects overall programming aptitude between different people.
Important: the only answers I'm interested in are either references to scientific studies, or personal observations from someone intimately familiar with a particular culture/language, including from their own experience.
E.g. I'm not interested in your opinion of how Chinese being your first language affects anything unless you speak Chinese or have worked extensively with a very large set of Chinese-native programmers.
I'm OK with your guesstimates not based on scientific studies, but please be sure to supply your reasoning about plausible causes of your observation.
I'm not interested in culture-bashing (any such comments will be deleted or flagged for deletion).
I'm also not particularly interested in culture-building - we all know Linus is from Finland and Tetris was written in Russia and Larry Wall is an American. Any culture/nation can produce a brilliant mind in any discipline. I'm interested in averages.
Disclaimer: I was a Cultural Anthropologist before I got into programming, so you know I'm going to be on a high horse, here.
Obviously, a person's history will have an impact on their aptitude for any particular task, but I think this has less to do with the structure or grammar of a person's language than it does with the particular material conditions of the culture in which that language is spoken.
For example, a pair of Anthropologists in the 60's went to various African communities and tested people's susceptibility to various optical illusions. Here is a classic one:
In this illusion, the bottom line looks longer, because the angled lines connecting it make it appear to be off in the distance.
These Anthropologists found that in many African cultures, the illusion doesn't work at all - people consider the lines to be the same length. By refining their study, they found that the only people who were susceptible to the illusion were people who had grown up in an urban environment. They hypothesized that the illusion did not work on people from remote jungle environments, because these people had little or no experience with right angles and seeing things at very long distances.
My point with this is that even if you successfully found a correlation between programmers' native languages and their abilities with certain aspects of programming, you couldn't be sure that the correlation wasn't spurious. For example, you might think that Asians tend to be bad drivers, and you might even be able to demonstrate this statistically. If you then concluded, however, that "bad driving" is some sort of fundamental characteristic of Asian-ness, you would be ignoring the fact that Asians are more likely to be from Asia, and thus to have had much less experience driving cars (or even being in cars) while growing up than Westerners (and especially Americans) have had.
With programming, we might think that a particular language inhibits programming ability, and not take note of the fact that the society in which that language is spoken has much less access to computers, and thus people growing up with that language appear to have less programming aptitude or ability to understand certain programming concepts.
In short, I wouldn't give much credence to the idea that language inhibits anyone's ability to understand anything in particular. The human mind is much too flexible and adaptable for that to be true.
This seems analogous to the Sapir-Whorf Hypothesis - that the facilities of a language affect the ease with which one can cogitate about certain subjects, or in the words of the Wikipedia article:
"The linguistic relativity principle (also known as the Sapir-Whorf Hypothesis) is the idea that the varying cultural concepts and categories inherent in different languages affect the cognitive classification of the experienced world in such a way that speakers of different languages think and behave differently because of it."
( http://en.wikipedia.org/wiki/Linguistic_relativity )
While there appears to be little definitive information here, the discussions appear to be relevant to the question, and perhaps worthy of further exploration.
Just a few random thoughts. I think the influence is generally very weak and can mostly be neglected, but it does exist and can sometimes make itself felt.
In Chinese grammar, for example, we don't quite distinguish between plural and singular forms, but I wouldn't think we Chinese have any noticeable difficulty understanding the concepts of scalar and array in Perl. The reason might be this: although we generally don't need particular suffixes or changes in form to indicate whether something is singular or plural, we do have the concepts of plural and singular, and we mostly depend on context to tell them apart. Grammar-wise, context in Chinese may well be more important than in the languages of the Indo-European family. We omit a lot of things, sometimes when they have already been mentioned and sometimes when we just presume that they are implicitly well understood by the listener. In either case, we don't need the indefinite and definite articles (a, an, the) or the relative pronouns (that, which, who) to indicate whether something is being mentioned for the first time or yet again. Maybe that's partially why I feel very comfortable with Perl's default variable $_: print, chomp, and split all act upon $_, which has never been mentioned explicitly. But this is quite subjective.
I think the Chinese language is characterized more by implicitness and fuzziness than Indo-European languages are. For example, we never pay attention to subject-verb agreement and we never conjugate verbs to denote tense. This could mean that the Chinese are inclined to use a not-quite-so-logical mode of thinking. One of my teachers once used an example to try to generalize (or maybe over-generalize) the difference between the Chinese non-logical mode of thinking and the American logical mode of thinking.
If the American version of quarrelling is this:
“I can lick you.”
“No, you can’t.”
“Yes, I can.”
“No, you can’t.”
“I can.”
“you can’t.”
“Can!”
“Can’t!”
The Chinese version (translated in English) would be something like this:
I can lick you.
How dare you!
What if I dare?
Then you try.
Try? Hm, you wait and see.
Wait and see? I’m not afraid.
Not afraid? OK. You don’t run away.
Who runs away? Come on and lick
Well, I agree that there may be some differences between the Chinese way of thinking and that of other countries, but the example looks like a stereotype, because the Chinese may easily switch to the American version. Back to the question: I think language and culture may indeed influence a programmer's learning process in one way or another, but this influence is definitely not decisively noticeable. Maybe the culture you're exposed to makes you a little uncomfortable getting used to some notions in some programming language, recursion or whatever, but time will solve it.
I was recently reading about a study that looked at how the grammar of someone's native language affected their speed of learning math. ... Important: the only answers I'm interested in are either references to scientific studies, or personal observations from someone intimately familiar with a particular culture/language, including from their own experience.
I learned a lot of maths before I started programming (enough to count as "intimately familiar"), and IMO programming is relatively easy: more tangible.
Sometimes I've wondered whether it's beneficial to know more than one human language: if you only know one language, then you might think of the words "cat" and "dog" as being values, i.e. synonymous with cat and dog objects; but if you're fluent in more than one language, then "cat" and "dog" become pointers: because for example the French words "chat" and "chien" are referring/pointing to the same objects as "cat" and "dog", and so clearly there's a distinction between the word and the object.
It's disappointing that you posted the question without linking to the article that inspired it. I thought of "reverse Polish notation" and wondered whether that was at all the kind of difference in "grammar" considered in the original study.
The reference you cite seems to rest on the assumption that making it easier helps with learning. In my understanding, there is a countereffect: without enough challenge, you're not learning enough.
There are theories/studies (anyone with a link?) that the development of language created crucial pressure to expand the cerebral cortex and thus "made us human" (in very Darwinistic terms: more grey matter ==> better language capabilities ==> better teamwork ==> better survival as a group). So language complexity can't be all bad for learning.
(My only qualification is being an eager follower of The Frontal Cortex blog, so take this with a grain of salt.)
In German we have a strange ordering of numbers: the 10^0 and 10^1 positions are switched, but the others are normal (e.g. 25 is 'five and twenty', 125 is 'one hundred five and twenty'). It's been claimed that this makes learning numbers harder, and thus that German should adopt a more intuitive ordering.
I guess that it helps a lot with doing additions in your head - at least if you stay below 100 or 200 - you can first add the 10^0 position and already say it / write it down while taking any carry into account for the 10^1 position.
(That doesn't continue for 10^2; I guess the majority would do that in writing anyway.)
Also: abstractions. There are languages where numbers aren't abstracted from objects, "two coconuts" and "two sabretooth tigers" don't share a common "two" word / concept. Such a language would probably be very bad for developing math skills. Here the abstraction (separating number and object) in language is important.
Generally, I'd say the language has a strong effect on shaping a developing mind, and I see no reason why this should not extend to culture.
Of course it's still open what would be the "right kind of complexity" - for what, and how particular language features affect general improvement vs. establishment of an elite (i.e. "sharpening the skills of the gifted, while hampering the rest").
Interesting Question, no doubt - looking forward to other replies.