Do stopwords interfere with queries that use SENTENCE in Sphinx?

I do not have Sphinx installed yet; I am still evaluating whether it makes sense for me, so pardon a question I could probably answer on my own otherwise.
If I made the period a stopword (so that I can match e.g. U.S. to US), will a query that uses SENTENCE still work? In other words, will the query ignore the period because it is a stopword and thus fail to recognize the sentence boundary?

Related

postgresql - performance when searching a phone number column using LIKE operator

I have a phone number column in my database that could potentially have somewhere close to 50 million records.
As I have the phone numbers stored with the country code, I am a bit confused on how to implement the search functionality.
Options I have in mind
When the user puts in a phone number to search, use the LIKE operator to find the right phone number. [When using the LIKE operator, does it slow down the search?]
Split the phone number column into two: one with just the area code and the other with the phone number. [The reason I am looking into this implementation is that I don't have to use the LIKE operator here.]
Please suggest any other ideas! People here who have really good experience with Postgres, please chime in with the best practices.
Since they are stored with a country code, you can just include the country code when you search for them. That should be by far the most performant option. If you know what country each person is in, or if your user base is predominantly from one country, you could just add the code to "short" numbers in order to complete them.
If LIKE is too slow (at 50 million rows it probably would be), you can put a pg_trgm index on it (a sketch follows below). You will probably need to remove, or at least standardize, the punctuation in both the data and the query, or it could cause problems with LIKE (as well as with every other method).
The problem I see with making two columns, country code (plus area code? I would expect that to go in the other column) and one column for the main body of the number, is that it probably wouldn't do what people want. I would think people are either going to expect partial matching on however many digits they feel like typing, meaning you would still need to use LIKE, or people who type in the full number (minus country code) are going to expect it to find only numbers in "their" country. On the other hand, splitting off the country code from the main body of the number might avoid having an extremely common country code pollute any pg_trgm indexes you do build with low-selectivity trigrams.
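As a rough sketch of the pg_trgm suggestion above, assume a hypothetical contacts(phone text) table; the table name, column names, and sample number are placeholders, not from the original question.
create extension if not exists pg_trgm;
-- keep a normalized copy with the punctuation stripped, so data and queries agree
alter table contacts add column phone_digits text;
update contacts set phone_digits = regexp_replace(phone, '\D', '', 'g');
-- a trigram GIN index supports LIKE/ILIKE even with a leading wildcard
create index on contacts using gin (phone_digits gin_trgm_ops);
-- partial match on whatever digits the user typed, normalized the same way
select *
from contacts
where phone_digits like '%' || regexp_replace('+1 (555) 010-4477', '\D', '', 'g') || '%';
In practice you would keep phone_digits up to date with a trigger or a generated column rather than a one-off update.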

Pattern matching performance issue in Postgres

I found that the query below takes a long time; this pattern matching is the performance bottleneck in my batch job.
Query:
select a.id, b.code
from table a
left join table b
on a.desc_01 like '%'||b.desc_02||'%';
I have tried the LEFT and STRPOS functions to improve the performance, but I end up losing some data if I apply these functions.
Any other suggestions, please?
It's not entirely clear what your data (or structure) really looks like, but your search is performing a contains comparison. That's not the simplest thing to optimize, because a standard index, and many matching algorithms, are biased towards the start of the string. When you lead with %, a B-tree can't be used efficiently, as it splits/branches based on the front of the string.
Depending on how you really want to search, have you considered trigram indexes? They're pretty great. Your string gets split into three-letter chunks, which overcomes a lot of the problems with left-anchored text comparison. The reason is simple: now every character is the start of a short, left-anchored chunk. There are traditionally two methods of generating trigrams (n-grams), one with leading padding and one without. Postgres uses padding, which is the better default. I got help with a related question recently that may be relevant to you:
Searching on expression indexes
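To make the chunking concrete, pg_trgm ships a show_trgm() helper that lists exactly which trigrams a string produces; note the leading-space padding:
select show_trgm('html');
-- produces the padded trigrams of 'html': "  h", " ht", "htm", "tml", "ml "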
If you want something more like a keyword match, then full text search might help. I hadn't been using it much because I've got a data set where converting words to "lexemes" doesn't make sense. It turns out that you can tell the parser to use the "simple" dictionary instead, and that gets you a unique word list without any stemming transformations. Here's a recent question on that:
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes/251185#251185
If that sounds more like what you need, you might also want to get rid of stop/skip/noise words. Here's a thread that I think is a bit clearer than the docs regarding how to set this up (it's not hard):
https://dba.stackexchange.com/questions/145016/finding-the-most-commonly-used-non-stop-words-in-a-column/186754#186754
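For illustration, here is a minimal sketch of the "simple"-dictionary approach against the question's anonymized table a; it assumes you are searching for whole words you already know, rather than arbitrary substrings pulled from another table:
-- expression index so queries on the same expression can use it
create index on a using gin (to_tsvector('simple', desc_01));
-- every word must be present; the 'simple' dictionary skips stemming
select id
from a
where to_tsvector('simple', desc_01) @@ plainto_tsquery('simple', 'some words');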
The long term answer is to clean up and re-organize your data so you don't need to do this.
Using a pg_trgm index might be the short term answer.
create extension pg_trgm;
create index on a using gin (desc_01 gin_trgm_ops);
How fast this will be is going to depend on what is in b.desc_02.

Fuzzy search on large table

I have a very large PostgreSQL table with 12M names and I would like to show an autocomplete. Previously I used an ILIKE 'someth%' clause, but I'm not really satisfied with it: for example, it doesn't sort by similarity, and any spelling error causes wrong or no results. The field is a string, usually one or two words (in any language). I need a fast response because suggestions are shown live to the user while typing (i.e. autocomplete). I cannot restrict the fuzzy match to a subset because all names are equally important. I can also say that most names are different.
I have tried pg_trgm, but even with a GIN index it is very slow. A search for a name similar to 'html' takes a few milliseconds, but (don't ask me why) other searches like 'htm' take many seconds, e.g. 25 seconds. Other people have also reported performance issues with pg_trgm on large tables.
Is there anything I can do to efficiently show an autocomplete on that field?
Would a full text search engine (e.g. Lucene, Solr) be an appropriate solution? Or would I encounter the same inefficiency?
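For concreteness, this is roughly the pg_trgm setup being described above, with an assumed names(name text) table; the table name, query string, and limit are placeholders:
create extension if not exists pg_trgm;
create index on names using gin (name gin_trgm_ops);
-- pg_trgm.similarity_threshold (default 0.3) controls how strict the % filter is
select name, similarity(name, 'htm') as sim
from names
where name % 'htm'
order by sim desc
limit 10;
Very short inputs like 'htm' yield only a few, very common trigrams, which is one reason such searches can end up scanning a large part of the index.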

How do I prove a DFA has no synchronizing word?

To find a synchronizing word I have always just used trial and error, which is fine for small DFAs but not so useful on larger ones. What I want to know, however, is whether there exists an algorithm for determining a synchronizing word, or a way of telling that one does not exist (rather than just saying "I can't find one, therefore one cannot exist", which is by no means a proof).
I have had a look around on Google and so far have only come across methods for determining upper and lower bounds on the length of a synchronizing word based on the number of states; however, this is not helpful to me.
The existence of upper bounds on the length of a synchronizing word immediately implies the existence of a (very slow) algorithm for finding one: just list all strings up to the upper bound in length and test whether each is a synchronizing word. If any of them is, a synchronizing word exists; if none of them is, there's no synchronizing word. This is exponentially slow, though, so it's not advisable on large DFAs.
David Eppstein designed a polynomial-time algorithm for finding synchronizing words in DFAs, though I'm not very familiar with this algorithm.
Hope this helps!

Word Prediction - Get most frequent predecessor and successor

Given a word, I want to get the list of the most frequent predecessors and successors of that word in the English language.
I have developed code that does bigram analysis on any corpus (I have used the Enron email corpus) and can predict the most frequent next possible word, but I want some other solution because
a) I want to check the working / accuracy of my prediction
b) Corpus or dataset based solutions fail for an unseen word
For example, given the word "excellent" I want to get the words that are most likely to come before excellent and after excellent
My question is whether any particular service or API exists for this purpose.
Any solution to this problem is bound to be a corpus-based method; you just need a bigger corpus. I'm not aware of any web service or library that does this for you, but there are ways to obtain bigger corpora:
Google has published a huge corpus of n-grams collected from the English part of the web. It's available via the Linguistic Data Consortium (LDC), but I believe you must be an LDC member to obtain it. (Many universities are.)
If you're not an LDC member, try downloading a Wikipedia database dump (get enwiki) and training your predictor on that.
If you happen to be using Python, check out the nice set of corpora (and tools) delivered with NLTK.
As for the unseen words problem, there are ways to tackle it, e.g. by replacing all words that occur less often than some threshold by a special token like <unseen> prior to training. That will make your evaluation a bit harder.
You have to give some more instances or context for the "unseen" word so that the algorithm can make some inference.
One indirect way is to read the rest of the words in the sentence and look in a dictionary for the contexts where those words are encountered.
In general, you can't expect the algorithm to learn and understand the inference the first time. Think about yourself: if you were given a new word, how well could you make out its meaning? Probably by looking at how it has been used in the sentence and how good your own understanding is; you make an educated guess, and over a period of time you come to understand the meaning.
I just re-read the original question and I realize that the answers, mine included, got off base. I think the original poster just wanted to solve a simple programming problem, not look for datasets.
If you list all distinct word pairs and count them, then you can answer your question with simple math on that list.
Of course you have to do a lot of processing to generate the list. While it's true that if the total number of distinct words is as much as 30,000 then there are close to a billion possible pairs, I doubt that in practice there are that many. So you can probably make a program with a huge hash table in memory (or on disk) and just count them all. If you don't need the insignificant pairs, you could write a program that flushes out the less important ones periodically while scanning. Also, you can segment the word list and generate pairs of a hundred words versus the rest, then the next hundred, and so on, and calculate in passes.
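As an illustration of the counting idea (not necessarily the tool you would use), suppose the token stream were loaded into a hypothetical tokens(doc_id int, pos int, word text) table, one row per word in reading order; pairing each word with its successor and grouping gives the frequency list:
with pairs as (
    select word,
           lead(word) over (partition by doc_id order by pos) as next_word
    from tokens
)
select word, next_word, count(*) as freq
from pairs
where next_word is not null
group by word, next_word
order by freq desc;
Filtering on next_word = 'excellent' gives the most frequent predecessors of "excellent"; filtering on word = 'excellent' gives the most frequent successors.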
My original answer is below; I'm leaving it because it covers my own related question:
I'm interested in something similar (I'm writing an entry system that suggests word completions and punctuation, and I would like it to be multilingual).
I found a download page for Google's ngram files, but they're not that good; they're full of scanning errors. 'i's become '1's, words run together, etc. Hopefully Google has improved their scanning technology since then.
The just-download-Wikipedia, unpack it and strip the XML idea is a bust for me; I don't have a fast computer (heh, I have a choice between an Atom netbook and an Android device here). Imagine how long it would take me to unpack a 3-gigabyte bz2 file into what, 100 gigabytes of XML, and then process it with Beautiful Soup and filters that he admits crash partway through each file and need to be restarted.
For your purpose (previous and following words) you could create a dictionary of real words and filter the ngram lists to exclude the mis-scanned words. One might hope that the scanning was good enough that you could exclude misscans by taking only the most popular words... but I saw signs of consistent mistakes.
The ngram datasets are here, by the way: http://books.google.com/ngrams/datasets
This site may also have what you want: http://www.wordfrequency.info/