Lucene searching numerical - lucene.net

I have a strange issue with lucene.net 2.9:
If I searching for: high-quality it doesn't find any results. I found hyphenation char (-) is a problem for Lucene, so I search for high quality and it worked perfectly.
When I search for 30-40 it is showing results but for 30 40 is not showing any.
The second scenarios is in contradiction with first one.
I guess the second one is related as I have numerical text, but I didn't find something on web related.

I'm guessing you are using StandardAnalyzer when indexing your terms, and then are searching without analysis in some form, or with a different form of analysis.
The 2.9 StandardAnalyzer (ClassicAnalyzer, as of version 3.1) has some interesting behavior around hyphens. To quote the StandardTokenizer documentation:
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
So, two hyphenated words (or any collection of letters) will be split into separate tokens, when any number thrown into the mix will interpret the whole thing as a product number, and index as a ingle token, hyphens and all, so:
"high-qualtiy" --> "high" and "quality"
"ab-cd" ---------> "ab" and "cd"
"30-40" ---------> "30-40"
"ab-c4" ---------> "ab-c4"
"30 40" ---------> "30" and "40"
So, if you construct a TermQuery for "high-quality" on such an analyzed field, you will get no results (though you would if using the QueryParser with the same analyzer). When searching for "30-40", the TermQuery for "30-40" will be an exact match. but matches will be found for neither "30" nor "40".
So, I'm not how you are querying to run into the mismatch there (perhaps using StandardAnalyzer when indexing, and WhitespaceAnalyzer when querying?), but hopefully that points in the right direction.

You need to encrypt "-" sign to URL parameter. I think it will works fine.

Related

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

Sphinx search: multi-term wordforms not indexed correctly

I'm having an issue with specific entries in my wordforms file that are not being
interpreted as expected.
Here are a couple of examples:
1/48 > forty-eighth
1/96 > ninety-sixth
As you can see, these entries contain both slashes and hyphens, which may be related to
my issue.
For some reason, Sphinx doesn't correctly equate each fraction to the spelled out
version. Search results for "1/48" are not the same as for "forty-eighth", as they should
be. In other words, the mapping between these equivalent forms is not working.
In my Sphinx config, I have the forward slash (/) set as a blend character, so I assume
that the fraction is being recognized properly.
In support of that belief, the following wordforms entry does work correctly:
1/4 > fourth
Does anyone have any idea why my multi-term synonyms would not be working as expected?
I have tried replacing the hyphen with a space, but this doesn't change the result at
all. Would it help to change the order of the terms (i.e., on which side of the ">" they
should be placed)?
Thank you very much for any help.
When using characters in Sphinx it is always good to keep in mind the following:
By default, the Sphinx tokenizer handles unknown characters as whitespace
https://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
That has given me weird results too when using wordforms.
I would suggest you add the hyphen to charset_tables so ninety-sixth becomes one word. ignore_chars is also an option but then you will be looking for ninetysixth instead.
Much depends on the rest of your dataset and use cases ofcourse.

Typeahead Bloodhound - Filter

My index contains the word dog how can i also find this entry if i type dogs? I would find all parts of the word 'dogs','dog','do' to a min length of 2 or 3 chars
I'm not an expert on Bloodhound, but what you're talking about here is called stemming, and it seems like the kind of thing that you could do using the datumTokenizer and the queryTokenizer.
There are stemmers for most languages of varying quality, but I think the one most people are using for English these days is the Snowball Stemmer. There are a number of implementations in JavaScript floating around.
In general for things to work properly you'll want to stem both the uer's query and the results.

Iphone work out if string is a UK Postcode

In my app before I send a string off I need to work out if the text entered in the textbox is a UK Postcode. I don't have the regex ability to work that out for myself and after searching around I can't seem to work it out! Just wondered if anyone has done a similar thing in the past?
Or if anyone can point me in the right direction I would be most appreciative!
Tom
Wikipedia has a good section about this. Basically the answer depends on what sort of pathological cases you want to handle. For example:
An alternative short regular expression from BS7666 Schema is:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
The above expressions fail to exclude many non-existent area codes (such as A, AA, Z and ZY).
Basically, read that section of Wikipedia thoroughly and decide what you need.
for post codes without spaces (e.g. SE19QZ) I use: (its not failed me yet ;-) )
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
if spaces (e.g. SE1 9QZ) , then:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2})$
You can match most post codes with this regex:
/[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}/i
Which means... A-Z one or two times ({1,2}) followed by 0-9 1 or two times, followed by a space \s optionally ? followed by 0-9 one or two times, followed by A-Z one or two times.
This will match some false positives, as I can make up post codes like ZZ00 00ZZ, but to accurately match all post codes, the only way is to buy post code data from the post office - which is quite expensive. You could also download free post code databases, but they do not have 100% coverage.
Hope this helps.
Wikipedia has some regexes for UK Postcodes: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation

Count the number of words in NSString

I'm trying to implement a word count function for my app that uses UITextView.
There's a space between two words in English, so it's really easy to count the number of words in an English sentence.
The problem occurs with Chinese and Japanese word counting because usually, there's no any space in the entire sentence.
I checked with three different text editors in iPad that have a word count feature and compare them with MS Words.
For example, here's a series of Japanese characters meaning the world's idea: 世界(the world)の('s)アイデア(idea)
世界のアイデア
1) Pages for iPad and MS Words count each character as one word, so it contains 7 words.
2) iPad text editor P*** counts the entire as one word --> They just used space to separate words.
3) iPad text editor i*** counts them as three words --> I believe they used CFStringTokenizer with kCFStringTokenizerUnitWord because I could get the same result)
I've researched on the Internet, and Pages and MS Words' word counting seems to be correct because each Chinese character has a meaning.
I couldn't find any class that counts the words like Pages or MS Words, and it would be very hard to implement it from scratch because besides Japanese and Chinese, iPad supports a lot of different foreign languages.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
Is there a way to count words in NSString like Pages and MSWords?
Thank you
I recommend keep using CFStringTokenizer. Because it's platform feature, so will be upgraded by platform upgrade. And many people in Apple are working hardly to reflect real cultural difference. Which are hard to know for regular developers.
This is hard because this is not a programming problem essentially. This is a human cultural linguistic problem. You need a human language specialist for each culture. For Japanese, you need Japanese culture specialist. However, I don't think Japanese people needs word count feature seriously, because as I heard, the concept of word itself is not so important in the Japanese culture. You should define concept of word first.
And I can't understand why you want to force concept of word count into the character count. The Kanji word that you instanced. This is equal with counting universe as 2 words by splitting into uni + verse by meaning. Not even a logic. Splitting word by it's meaning is sometimes completely wrong and useless by the definition of word. Because definition of word itself are different by the cultures. In my language Korean, word is just a formal unit, not a meaning unit. The idea that each word is matching to each meaning is right only in roman character cultures.
Just give another feature like character counting for the users in east-asia if you think need it. And counting character in unicode string is so easy with -[NSString length] method.
I'm a Korean speaker, (so maybe out of your case :) and in many cases we count characters instead of words. In fact, I never saw people counting words in my whole life. I laughed at word counting feature on MS word because I guessed nobody would use it. (However now I know it's important in roman character cultures.) I have used word counting feature only once to know it works really :) I believe this is similar in Chinese or Japanese. Maybe Japanese users use the word counting because their basic alphabet is similar with roman characters which have no concept of composition. However they're using Kanji heavily which are completely compositing, character-centric system.
If you make word counting feature works greatly on those languages (which are using by people even does not feel any needs to split sentences into smaller formal units!), it's hard to imagine someone who using it. And without linguistic specialist, the feature should not correct.
This is a really hard problem if your string doesn't contain tokens identifying word breaks (like spaces). One way I know derived from attempting to solve anagrams is this:
At the start of the string you start with one character. Is it a word? It could be a word like "A" but it could also be a part of a word like "AN" or "ANALOG". So the decision about what is a word has to be made considering all of the string. You would consider the next characters to see if you can make another word starting with the first character following the first word you think you might have found. If you decide the word is "A" and you are left with "NALOG" then you will soon find that there are no more words to be found. When you start finding words in the dictionary (see below) then you know you are making the right choices about where to break the words. When you stop finding words you know you have made a wrong choice and you need to backtrack.
A big part of this is having dictionaries sufficient to contain any word you might encounter. The English resource would be TWL06 or SOWPODS or other scrabble dictionaries, containing many obscure words. You need a lot of memory to do this because if you check the words against a simple array containing all of the possible words your program will run incredibly slow. If you parse your dictionary, persist it as a plist and recreate the dictionary your checking will be quick enough but it will require a lot more space on disk and more space in memory. One of these big scrabble dictionaries can expand to about 10MB with the actual words as keys and a simple NSNumber as a placeholder for value - you don't care what the value is, just that the key exists in the dictionary, which tells you that the word is recognised as valid.
If you maintain an array as you count you get to do [array count] in a triumphal manner as you add the last word containing the last characters to it, but you also have an easy way of backtracking. If at some point you stop finding valid words you can pop the lastObject off the array and replace it at the start of the string, then start looking for alternative words. If that fails to get you back on the right track pop another word.
I would proceed by experimentation, looking for a potential three words ahead as you parse the string - when you have identified three potential words, take the first away, store it in the array and look for another word. If you find it is too slow to do it this way and you are getting OK results considering only two words ahead, drop it to two. If you find you are running up too many dead ends with your word division strategy then increase the number of words ahead you consider.
Another way would be to employ natural language rules - for example "A" and "NALOG" might look OK because a consonant follows "A", but "A" and "ARDVARK" would be ruled out because it would be correct for a word beginning in a vowel to follow "AN", not "A". This can get as complicated as you like to make it - I don't know if this gets simpler in Japanese or not but there are certainly common verb endings like "ma su".
(edit: started a bounty, I'd like to know the very best way to do this if my way isn't it.)
If you are using iOS 4, you can do something like
__block int count = 0;
[string enumerateSubstringsInRange:range
options:NSStringEnumerationByWords
usingBlock:^(NSString *word,
NSRange wordRange,
NSRange enclosingRange,
BOOL *stop)
{
count++;
}
];
More information in the NSString class reference.
There is also WWDC 2010 session, number 110, about advanced text handling, that explains this, around minute 10 or so.
I think CFStringTokenizer with kCFStringTokenizerUnitWord is the best option though.
That's right, you have to iterate through text and simply count number of word tokens encontered on the way.
Not a native chinese/japanese speaker, but here's my 2cents.
Each chinese character does have a meaning, but concept of a word is combination of letters/characters to represent an idea, isn't it?
In that sense, there's probably 3 words in "sekai no aidia" (or 2 if you don't count particles like NO/GA/DE/WA, etc). Same as english - "world's idea" is two words, while "idea of world" is 3, and let's forget about the required 'the' hehe.
That given, counting word is not as useful in non-roman language in my opinion, similar to what Eonil mentioned. It's probably better to count number of characters for those languages.. Check around with Chinese/Japanese native speakers and see what they think.
If I were to do it, I would tokenize the string with spaces and particles (at least for japanese, korean) and count tokens. Not sure about chinese..
With Japanese you can create a grammar parser and I think it is the same with Chinese. However, that is easier said than done because natural language tends to have many exceptions, but it is not impossible.
Please note it won't really be efficient since you have to parse each sentence before being able to count the words.
I would recommend the use of a parser compiler rather than building one yourself as well to start at least you can concentrate on doing the grammar than creating the parser yourself. It's not efficient, but it should get the job done.
Also have a fallback algorithm in case your grammar didn't parse the input correctly (perhaps the input really didn't make sense to begin with) you can use the length of the string to make it easier on you.
If you build it, there could be a market opportunity for you to use it as a natural language Domain Specific Language for Japanese/Chinese business rules as well.
Just use the length method:
[#"世界のアイデア" length]; // is 7
That being said, as a Japanese speaker, I think 3 is the right answer.