Suppose you have a database of articles that reference authors and you want to be able to find results even when an alias is used. For example:
Stephen King
-> Richard Bachman
-> Gus Pillsbury
If someone searches Richard Bachman, I also want it to return Stephen King results (and perhaps vice versa). However, I also want to continue to find results that explicitly use the Bachman name.
I looked into wordforms (ref: http://sphinxsearch.com/docs/current.html#conf-wordforms) and tried something like this:
richard bachman => stephen king
gus pillsbury => stephen king
However, this seems to completely replace either pen name with king and doesn't also continue to search for the original query.
Perhaps something like this doesn't exist, or perhaps I just don't know the right wording to find it, but so far I'm coming up empty.
It's a bit tricky, but you could do:
richard bachman => richard bachman stephen king
gus pillsbury => gus pillsbury stephen king
This way, if you search for just "bachman", the document index still has the word "bachman" in it.
You may also want
stephen king => stephen king richard bachman gus pillsbury
so that searching for "bachman" will still find Stephen King books too.
... it breaks down a bit on phrase searches, e.g. searching for
"some say stephen king is a great author"
that would still find documents that DO have that quote, as the document AND query will change to
"some say stephen king richard bachman gus pillsbury is a great author"
But a document that contained "some say gus pillsbury is a great author" would not match. But you could perhaps just make sure the names are all in the same order in the destination tokens.
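The "same order in the destination tokens" idea can be sketched as a small generator for the wordforms lines. This is only an illustration: the alias group below is the one from the question, the function name is made up, and whether Sphinx then matches phrase queries the way you need is something to verify against your own index.

```python
# Hypothetical alias groups; every name in a group refers to the same author.
ALIAS_GROUPS = [
    ["stephen king", "richard bachman", "gus pillsbury"],
]

def wordforms_lines(groups):
    """Build 'source => destination' lines in which every name of a group
    maps to one shared destination: all names of the group, in a fixed order."""
    lines = []
    for group in groups:
        destination = " ".join(group)  # same token order for every source name
        for name in group:
            lines.append(f"{name} => {destination}")
    return lines

for line in wordforms_lines(ALIAS_GROUPS):
    print(line)
```

Because each source name expands to the identical token sequence, a document and a query that mention different aliases end up with the same tokens at that position.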
I am trying to make a webpage using GitHub, but I got stuck when trying to make a hanging indent.
I want to make a reference list in APA format (in normal text, not in code):
A., B., & C., D. (2020, August 1). Name of the paper. Name of the paper Name of the paper
    Name of the paper Name of the paper. Journal of ABC. Retrieved from
    https://doi.org/1.1.2.2.2./deew/37.4.433
I have been looking around, but I could not find a way to do this.
Finally, this is what I found:
A., B., & C., D. (2020, August 1). Name of the paper. Name of the paper Name of the paper Name of the paper <br> Name of the paper. Journal of ABC. Retrieved from https://doi.org/1.1.2.2.2./deew/37.4.433
Which produces the following:
A., B., & C., D. (2020, August 1). Name of the paper. Name of the paper Name of the paper Name of the paper Name of the paper. Journal of ABC. Retrieved from https://doi.org/1.1.2.2.2./deew/37.4.433
But I think this is not a very efficient way to do it. I wonder if you have any better ideas? Thanks!
Please let me know if you need more information. I am still trying to learn GitHub.
Sorry about the late response, but this is for others with the same problem.
The hanging indent you want resembles what you get inside a list item when the continuation line is indented with spaces:
1. List item itself
(space)(space)Hanging indent that is handled to make nice multiline list items
2. Next list item
It is usually not possible in Markdown itself. In HTML+CSS you can do this:
<p style="padding-left: 2em; text-indent: -2em;">(a) in relation to an exempt payment service provider mentioned in subsection (1)(a), means any of the following payment services:</p>
But you cannot always use CSS in the HTML that Markdown supports.
Some of the solutions have been explored here: https://serial-comma.com/blog/posts/2020-09-13-hanging-paragraphs-in-markdown.html
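If the page is generated rather than hand-written, the same inline style can be applied to every reference with a few lines of script. A minimal sketch, assuming the renderer keeps inline style attributes (not all Markdown pipelines do); the function name and the sample reference are made up for illustration.

```python
from html import escape

def hanging_indent_paragraphs(references, indent="2em"):
    """Wrap each reference in a <p> whose first line is pulled back by `indent`,
    so that wrapped lines appear indented relative to the first line."""
    style = f"padding-left: {indent}; text-indent: -{indent};"
    return "\n".join(
        f'<p style="{style}">{escape(ref)}</p>' for ref in references
    )

# Illustrative reference text only.
refs = ["A., B., & C., D. (2020, August 1). Name of the paper. Journal of ABC."]
print(hanging_indent_paragraphs(refs))
```

The browser then handles the line wrapping, so no manual <br> is needed.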
I'm writing a parser that converts messy author strings into neatly formatted strings in the following format: ^([A-Z]\. )+[[:surname:]]$. Some examples below:
Smith JS => J. S. Smith
John Smith => J. Smith
John S Smith => J. S. Smith
J S Smith => J. S. Smith
I've managed to get quite far using various regular expressions to cover most of these, but have hit a wall for instances where a full name is provided in an unknown order. For example:
Smith John
John Smith
Smith John Stone
Obviously regular expressions won't be able to discern what order the forename, surname and middle name(s) are in, so my thought is to perform a lexical analysis on the author string, returning a type and confidence score for each token. Has anyone coded such a solution before, preferably in Perl? If so, I imagine my code would look something like this:
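use strict;
use warnings;
# UnknownModule::NamePredictor is a placeholder for the module being sought.
use UnknownModule::NamePredictor qw( predict_name );

my $messy_author = "Smith John Stone";
my @names = split(' ', $messy_author);
for my $name (@names) {
    # Expect a token type (e.g. forename or surname) and a confidence score.
    my ($type, $confidence) = predict_name($name);
}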
I've seen a post here explaining the problem I have, but no viable solution has been suggested. I'd be quite surprised if no one has coded such a solution before if I'm honest, as there are huge training sets available. I may go down this route myself if it hasn't been done already.
Other things to consider:
I don't need this to be perfect. I'm looking for precision >90% ideally.
I have >100,000 messy author strings to play with. My goal is to pass as many cleanly as possible, evaluate and improve the approach over time.
These are definitely author strings, but they're muddled together in lots of different formats, hence the challenge I've set myself.
For everyone trying to point out that names aren't necessarily possible to categorise: in short, yes, of course there will be those instances, hence why I'm gunning for imperfect precision. However, the majority can be categorised pretty comfortably. I know this simply because my human brain, with all its clever pattern-recognising abilities, allows me to do it pretty well.
UPDATE: In the absence of an existing solution I've been looking at creating a model from a Support Vector Machine, using LIBSVM. I should be able to build large and accurate training and test datasets using forenames and surnames taken from PubMed, which has a nice library of >25M articles containing categorised names. Unfortunately these don't have middle names though, just initials.
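Before reaching for an SVM, the "type plus confidence per token" idea can be tried with plain frequency counts. The sketch below is not an existing module: predict_name mirrors the hypothetical function in the question, and the counts are invented stand-ins for whatever a real training set would give.

```python
from collections import Counter

# Made-up counts standing in for a real training set; illustrative only.
FORENAME_COUNTS = Counter({"john": 900, "smith": 10, "stone": 5})
SURNAME_COUNTS = Counter({"smith": 950, "stone": 120, "john": 40})

def predict_name(token):
    """Return (type, confidence) for one token, using relative frequency."""
    key = token.lower()
    as_forename = FORENAME_COUNTS[key]
    as_surname = SURNAME_COUNTS[key]
    total = as_forename + as_surname
    if total == 0:
        return ("unknown", 0.0)  # token never seen in either list
    if as_forename >= as_surname:
        return ("forename", as_forename / total)
    return ("surname", as_surname / total)

for token in "Smith John Stone".split():
    print(token, predict_name(token))
```

This looks at each token in isolation; position in the string and the other tokens' scores would be natural extra features for a trained model.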
I have tried a lot to change the word "and" between two authors using a citation program (Papers2, Mac) and a specific citation style file (.csl), but my efforts haven't worked.
What I want to do is change the "and" to the German "und", in both the inline citation and the reference list:
[Shaw and Riha, 2012]
Shaw, S. B., and S. J. Riha (2012), Title, J. Hydrol., 434-435(C), 46–54, doi:10.1016/j.jhydrol.2012.02.034.
Does anybody know how I can configure this delimiter-word in the style file?
Thanks in advance!
Micha
Probably the best and easiest way to do this is to set the "default-locale" of this style to "de-DE" (for German), which should automatically result in the use of "und" instead of "and". See http://citationstyles.org/downloads/specification.html#the-root-element-cs-style .
Which style are you using?
It is likely that the CSL file was not properly loaded into Papers. As @RintzeZelle suggested, please make sure to change both the ID and the title in your new style. To override a built-in style in Papers (as coming from the official repo), you need to keep both the title and the ID, or else change both to create a separate style. In your case, it makes sense to have a separate style for the German version. I suggest using the id http://www.zotero.org/styles/american-geophysical-union-german and the title American Geophysical Union (German).
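The two changes suggested above (setting default-locale on the root element, and giving the style a new id and title) can be scripted. This is a rough sketch: the SAMPLE string is a stripped-down stand-in for a real style file, the function name is invented, and whether Papers then loads the result is outside what the code can check.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"
ET.register_namespace("", CSL_NS)  # serialise without an ns0: prefix

# Stripped-down stand-in for a real .csl file; a real style has much more.
SAMPLE = """<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0">
  <info>
    <title>American Geophysical Union</title>
    <id>http://www.zotero.org/styles/american-geophysical-union</id>
  </info>
</style>"""

def make_german_variant(csl_text, new_id, new_title):
    """Set default-locale to de-DE and replace the style's id and title."""
    root = ET.fromstring(csl_text)
    root.set("default-locale", "de-DE")
    info = root.find(f"{{{CSL_NS}}}info")
    info.find(f"{{{CSL_NS}}}title").text = new_title
    info.find(f"{{{CSL_NS}}}id").text = new_id
    return ET.tostring(root, encoding="unicode")

print(make_german_variant(
    SAMPLE,
    "http://www.zotero.org/styles/american-geophysical-union-german",
    "American Geophysical Union (German)",
))
```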
I've just started learning a dialect of Lisp (Racket) and I'd like to know if someone has a link or can point me to the theoretical foundations of the Lisp family of languages. By resources I mean papers, articles or books, anything that you can think of.
Preferably indicating which mathematical concepts it uses, how it constructs its operators, how it resolves them, unifies identities, etcetera. I've read the SEXP article on Wikipedia, but I find it a bit shallow.
I'm interested in the foundations because I like to be able to explain how things work to others.
Thanks in advance.
You could start at the beginning: http://www-formal.stanford.edu/jmc/recursive.html
http://library.readscheme.org
http://en.wikipedia.org/wiki/Lisp_in_Small_Pieces
I would also add Landin's "The Next 700 Programming Languages" to this list; where McCarthy reveals the notion of programs interpreting other programs, Landin shows how the same theoretical framework can be seen to underlie nearly all programming languages.
In fact, I think it's not unreasonable to suggest that the theory of LISP-like languages is simply... the theory of programming languages.
Paul Graham has some nice mini-articles on the history of Lisp: http://www.paulgraham.com/lisp.html
Don't miss the original lambda papers by Guy Steele and Gerald Sussman.
"Lambda: The Ultimate Imperative"
"Lambda: The Ultimate Declarative"
"Lambda: The Ultimate GOTO"
Here are a couple links:
Lisp Primer by Allen and Dhagat
Lisp Tutorial
We have a site where the user can enter the name of a city. Lucene.net 2.1.0.3 is the search engine used to look for cities that have already been created. As configured, Lucene does not recognise that Saint Jerome is the same as St. Jerome or that Lake Phillip is the same as Lac Phillip.
Any tips on widening the search strategy for Lucene.Net?
I've read a bit about this synonyming and "sounds like" (read "I currently have no experience with this"). To me it seems like two different problems: abbreviation "synonyms" and "sounds like".
Sounds Like
Soundex is an older algorithm that was designed for misspellings of "American" names. An improved algorithm called 'Double Metaphone' addresses some of the complaints about Soundex. This library looks promising:
http://sourceforge.net/projects/phonetixnet/
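To get a feel for what a "sounds like" key does, here is a small sketch of classic American Soundex (the older algorithm, not Double Metaphone, and not the code of the library linked above). Two spellings match when their keys are equal.

```python
# Letter -> digit table for American Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the 4-character Soundex key for a name ("" if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (key + "000")[:4]

print(soundex("Phillip"), soundex("Philip"))
```

In a search index, the key would typically be stored in an extra field next to the original name and queried with the key of the user's input.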
Abbreviation Synonyms
While it seems there could be a generic synonyming system, I would expect "Garden City" might get synonyms of "Plot Town" or "Patch burg". I am guessing you'll achieve better results with your own domain-specific synonyms.
It seems that words like 'Saint' ('St.') and 'Mount' ('Mt') would be best handled as synonyms. Here is an article that proposes a fairly simple solution to custom synonyming: http://www.codeproject.com/KB/cs/lucene_custom_analyzer.aspx .
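The domain-specific synonym idea can be prototyped outside Lucene by normalising names to one canonical form before both indexing and querying. A minimal sketch, with a made-up synonym table and function name; in Lucene.Net the equivalent logic would live in a custom analyzer, as the linked article describes.

```python
import re

# Hypothetical, domain-specific synonym table: variant -> canonical token.
SYNONYMS = {"st": "saint", "mt": "mount", "lac": "lake"}

def normalize_city(name):
    """Lower-case the name, drop punctuation, and map known variants
    to their canonical token."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(SYNONYMS.get(token, token) for token in tokens)

print(normalize_city("St. Jerome"))
print(normalize_city("Lac Phillip"))
```

Applying the same function to the stored city names and to the user's input makes "St. Jerome" and "Saint Jerome" compare equal; a blanket rule such as lac -> lake may be too aggressive for some place names, so the table needs curating.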