Suppose I have the text Foo Bar Baz-Qux. How can I configure Sphinx's indexer so that Sphinx can find a match for any of the following queries?
Foo Bar Baz-Qux
Foo BazQux Bar
Baz Qux Foo Bar
Currently I have the dash symbol as the value of the ignore_chars setting, and Sphinx returns a result for the first two queries but not for the third.
Please note that the solution must be general: it must not rely on the particular words from the example or on their relative order.
Thanks!
I have found a solution (or a workaround): using regexp_filter.
The Sphinx index config now looks like this:
...
ignore_chars = -
regexp_filter = \b([\w\d]+)-([\w\d]+)\b => \1\2 \1 \2
...
Right before Sphinx puts text into its index, this filter splits every dash-containing word into two forms: one where the dash is simply removed and one where the dash is replaced with a space. At index creation time, the text "Foo-Bar" therefore produces three indexed words: "FooBar", "Foo" and "Bar". This lets me search with any of the following queries: "Foo-Bar" (the dash is removed, since it is in the ignore_chars list), "FooBar" (this word is in the index) and "Foo Bar" (both words are in the index).
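The effect of the filter can be simulated outside Sphinx. Here is a small JavaScript sketch (an approximation, not Sphinx itself; `\w` already covers the `[\w\d]` classes from the config above):

```javascript
// Simulate the regexp_filter: for every dash-joined word pair, emit
// the concatenated form plus both halves, just as the filter
// "\b([\w\d]+)-([\w\d]+)\b => \1\2 \1 \2" would before indexing.
const filtered = "Foo Bar Baz-Qux".replace(/\b(\w+)-(\w+)\b/g, "$1$2 $1 $2");
console.log(filtered); // "Foo Bar BazQux Baz Qux"
```

All three token forms ("BazQux", "Baz", "Qux") end up in the transformed text, which is why all three query styles can match.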
The main problem here is that you cannot use exact phrase match for both types of queries at the same time. I.e. if you search for "Bar BazQux" or "Bar Baz-Qux" you'll get a result, but for "Bar Baz Qux" you will get nothing. In my specific case this is not an issue, but anyone who wants to use this approach has been warned.
If you know a better way to do this, or if this workaround has some disadvantages that I have missed, please let me know.
Another possible solution is using trigrams, as shown here. That approach also tolerates user typos, but it is more difficult to implement.
I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings. Say going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for short multi-word patterns that occur at least twice in the text.
By playing with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to a specific number of words (in that case 3: the first word plus 2 repetitions) and you will get different results.
(?=.+\1)+ is a lookahead pattern that does not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str = "If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not.";
// capture a multi-word sequence that occurs again later in the string
const rx = /(\b\w+(\s+\w+\b)+)(?=.+\1)+/g;
const r = {};
let t;
while ((t = rx.exec(str))) {
  // remember the pattern; restart the search right after the start of this match
  r[t[1]] = (rx.lastIndex += 1 - t[1].length);
}
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and then by name:
const res = Object.keys(r)
  .map(p => [p, [...str.matchAll(p)].length])
  .sort((a, b) => b[1] - a[1] || b[0].localeCompare(a[0]));
console.log(res);
I extended my snippet a little by collecting all the matches as keys in an object (r). At the end I list all the keys of this object together with their occurrence counts, sorted by count and then by name.
In the while loop I also reset the rx.lastIndex property to restart the search for the next pattern immediately after the start of the last one found: rx.lastIndex += 1 - t[1].length.
I have a text search problem where I need to search systematically-generated text, i.e. not human-written natural language text.
The typical to_tsvector('english', 'foo bar baz') is not particularly helpful. In some cases it generates tokens which I know will lead to false-positive search results.
Instead I'd really just like to either provide the tokens in a string where each token is separated by whitespace, or provide an array of ordered tokens.
For example, something along the lines of to_tsvector(array["foo", "bar", "baz"]) should produce three tokens: foo, bar, and baz. This seems like a pretty basic thing, but so far I haven't found any explicit documentation of this functionality.
This is indeed a basic thing, and all you have to do is use the simple text search configuration:
to_tsvector('simple', 'foo bar baz')
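For illustration, the simple configuration just lowercases the input and splits it into words, with no stemming and no stop-word removal, so every input word becomes its own lexeme:

```sql
SELECT to_tsvector('simple', 'foo bar baz');
-- 'bar':2 'baz':3 'foo':1   (lexemes are stored alphabetically, with positions)

-- compare with the english configuration, which stems and drops stop words:
SELECT to_tsvector('english', 'The birds are flying');
-- 'bird':2 'fli':4
```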
I'm searching for the function named "init"; along with init, I also get things like:
"function g = compute_gradient()"
where the substring characters aren't consecutive. Most of the time, it makes the whole search useless (faster to use a simple string search).
How do I fix that?
By the way, is this a bug? If not, what's the idea behind such a search? I could understand matching separate (space-separated) words; I don't get a search by separate letters.
I used pg_trgm to check string matches and I am pretty happy with the results. But it is not perfectly the way I want it. I want searches like "poduto" to find "produtos" (the r is missing), and also "sofáa" to find "sofá". I am using PostgreSQL 9.6.
It does find "vermelho" when I type "vermelo" (the h is missing), and it does find "sofá" when I type "sof". It seems that only some letters in the middle can be left out, and that a final letter can always be missing. I want to be able to miss any letter in the middle of the word, and also to be able to make "two mistakes", as in the case of "sofáa" vs "sofá" (an accent plus one extra "a").
The solution is to lower pg_trgm.similarity_threshold (or pg_trgm.word_similarity_threshold if you are using <% or %>).
Then words with lower similarity will also be found.
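To see why lowering the threshold helps, here is a rough JavaScript sketch of how pg_trgm scores two strings (a simplification: real pg_trgm also splits on non-word characters and pads each word with two leading and one trailing blank, which this sketch mimics):

```javascript
// Approximate pg_trgm similarity: the set of 3-grams of "  word ",
// scored as |intersection| / |union|. A sketch, not the exact Postgres code.
function trigrams(word) {
  const padded = "  " + word.toLowerCase() + " ";
  const set = new Set();
  for (let i = 0; i + 3 <= padded.length; i++) set.add(padded.slice(i, i + 3));
  return set;
}

function similarity(a, b) {
  const ta = trigrams(a), tb = trigrams(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

console.log(similarity("produtos", "poduto")); // ≈ 0.33
```

Dropping a letter in the middle of a word destroys up to three trigrams at once, so the score falls quickly; with pg_trgm.similarity_threshold lowered to something like 0.2, a pair scoring around 0.33 passes comfortably.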
I have an app that utilizes hashtags to help tag posts. I am trying to have a more detailed search.
Lets say one of the records I'm searching is:
The #bird flew very far.
When I search for "flew", "fle", or "#bird", it should return the record.
However, when I search for "#bir", it should NOT return the sentence, because the whole tag being searched for doesn't match.
I'm also not sure if "bird" should even return the sentence. I'd be interested in how to do that as well, though.
Right now, I have a very basic search:
SELECT "posts".* FROM "posts" WHERE (body LIKE '%search%')
Any ideas?
You could do this with LIKE, but it would be rather hideous; regexes will serve you better here. If you want to ignore the hashes then a simple search like this will do the trick:
WHERE body ~ E'\\mbird\\M'
That would find 'The bird flew very far.' and 'The #bird flew very far.'. You'd want to strip off any #s before searching, though, as this:
WHERE body ~ E'\\m#bird\\M'
wouldn't find either of those results due to the nature of \m and \M.
If you don't want to ignore #s in body then you'd have to expand and modify the \m and \M shortcuts yourself with something like this:
WHERE body ~ E'(^|[^\\w#])#bird($|[^\\w#])'
-- search term goes here  ^^^^^
Using E'(^|[^\\w#])#bird($|[^\\w#])' would find 'The #bird flew very far.' but not 'The bird flew very far.', whereas E'(^|[^\\w#])bird($|[^\\w#])' would find 'The bird flew very far.' but not 'The #bird flew very far.'. You might also want to look at \A instead of ^ and \Z instead of $ as there are subtle differences, but I think ^ and $ are what you want here.
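Since the expanded patterns use only portable character classes, you can sanity-check the matching behaviour in any regex engine; here is a quick JavaScript check of the same two expressions:

```javascript
// The expanded word-boundary patterns, written as JavaScript regexes.
// They behave the same way as the Postgres ~ patterns above.
const withHash = /(^|[^\w#])#bird($|[^\w#])/;
const bareWord = /(^|[^\w#])bird($|[^\w#])/;

console.log(withHash.test("The #bird flew very far.")); // true
console.log(withHash.test("The bird flew very far."));  // false
console.log(bareWord.test("The bird flew very far."));  // true
console.log(bareWord.test("The #bird flew very far.")); // false
```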
You should keep in mind that none of these regex searches (or your LIKE search, for that matter) will use indexes, so you're setting yourself up for lots of table scans and performance problems unless you can restrict the search using something that will use an index. You might want to look at a full-text search solution instead.
It might help to parse the hash tags out of the text when articles are inserted/updated and store them in an array in a separate column called, say, hashtags. Remove them from the article body before feeding it into to_tsvector and store the resulting tsvector in a column of the table. Then use:
WHERE body_tsvector @@ to_tsquery('search') OR 'search' = ANY (hashtags)
You could use a trigger on the table to maintain the hashtags column and the body_tsvector stripped of hash tags, so that the application doesn't have to do the work. Parse them out of the text when entries are INSERTed or UPDATEd.
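A hedged sketch of such a trigger (the table and column names posts, body, hashtags and body_tsvector are assumptions taken from the discussion above):

```sql
ALTER TABLE posts
  ADD COLUMN hashtags text[],
  ADD COLUMN body_tsvector tsvector;

CREATE FUNCTION posts_maintain() RETURNS trigger AS $$
BEGIN
  -- collect the tags themselves (without the leading #)
  NEW.hashtags := ARRAY(SELECT (regexp_matches(NEW.body, '#(\w+)', 'g'))[1]);
  -- index the body with the hash tags stripped out
  NEW.body_tsvector :=
    to_tsvector('english', regexp_replace(NEW.body, '#\w+', '', 'g'));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_maintain
  BEFORE INSERT OR UPDATE ON posts
  FOR EACH ROW EXECUTE PROCEDURE posts_maintain();
```

With this in place the application never touches the derived columns; every INSERT or UPDATE keeps them in sync automatically.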