ensure if hashtag matches in search, that it matches whole hashtag - postgresql

I have an app that utilizes hashtags to help tag posts. I am trying to have a more detailed search.
Lets say one of the records I'm searching is:
The #bird flew very far.
When I search for "flew", "fle", or "#bird", it should return the record.
However, when I search "#bir", it should NOT return the sentence because the whole the tag being searched for doesn't match.
I'm also not sure if "bird" should even return the sentence. I'd be interested how to do that though as well.
Right now, I have a very basic search:
SELECT "posts".* FROM "posts" WHERE (body LIKE '%search%')
Any ideas?

You could do this with LIKE but it would be rather hideous, regexes will serve you better here. If you want to ignore the hashes then a simple search like this will do the trick:
WHERE body ~ E'\\mbird\M''
That would find 'The bird flew very far.' and 'The #bird flew very far.'. You'd want to strip off any #s before search though as this:
WHERE body ~ E'\\m#bird\M''
wouldn't find either of those results due to the nature of \m and \M.
If you don't want to ignore #s in body then you'd have to expand and modify the \m and \M shortcuts yourself with something like this:
WHERE body ~ E'(^|[^\\w#])#bird($|[^\\w#])'
-- search term goes here^^^^^
Using E'(^|[^\\w#])#bird($|[^\\w#])' would find 'The #bird flew very far.' but not 'The bird flew very far.' whereas E'(^|[^\\w#])bird($|[^\\w#])' would find 'The bird flew very far.' but not 'The #bird flew very far.'. You might also want to look at \A instead of ^ and \Z instead of $ as there are subtle differences but I think $ and ^ would be what you want.
You should keep in mind that none of these regex searches (or your LIKE search for that matter) will uses indexes so you're setting yourself up for lots of table scans and performance problems unless you can restrict the searches using something that will use an index. You might want to look at a full-text search solution instead.

It might help to parse the hash tags out of the text and store them in an array in a separate column called say hashtags when the articles are inserted/updated. Remove them from the article body before feeding it into to_tsvector and store the tsvector in a column of the table. Then use:
WHERE body_tsvector ## to_tsquery('search') OR 'search' IN hashtags
You could use a trigger on the table to maintain the hashtags column and the body_tsvector stripped of hash tags, so that the application doesn't have to do the work. Parse them out of the text when entries are INSERTed or UPDATEd.

Related

How to find common patterns in thousands of strings?

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings. Say going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for small (multiple word) patterns that occur at least twice in the text.
By playing around with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to any number of words (in the above case to 3: original + 2 repetitions) and you will get different results.
(?=.+\1)+ is a look ahead pattern that will not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str="If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not."
const rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g, r={};
let t;
while (t=rx.exec(str)) r[t[1]]=(rx.lastIndex+=1-t[1].length);
const res=Object.keys(r).map(p=>
[p,[...str.matchAll(p)].length]).sort((a,b)=>b[1]-a[1]||b[0].localeCompare(a[0]));
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and alphabet:
console.log(res);
I extended my snippet a little bit by collecting all the matches as keys in an object (r). At the end I list all the keys of this object alphabetically with Object.keys(r).sort().
In the while loop I also reset the rx.lastIndex property to start the search for that next pattern immediately after the start of the last one found: rx.lastIndex+=1-t[1].length.

Postgres - difference between to_tsquery, to_tsvector and plainto_tsquery

The Full text search of postgres includes some of these functions to search: plainto_tsquery, to_tsquery and to_tsvector .
I don't get the difference between it, the results contain the same words always, but in tsvector it is detached with the number of position of that word.
SELECT plainto_tsquery('simple', 'The & Fat & Rats');
result will be like this:
plainto_tsquery: 'fat' & 'rat'
to_tsquery: 'fat' & 'rat'
to_tsvector: 'fat':2 'rat':3
I have tried longer queries, but i haven't found a bigger difference than that.
I already read the documentation, but I didnt get the difference there either.
I am happy for any help.
"plainto_tsquery" takes a phrase in plain English (or in this case plain "simple"--although your question is not consistent. "simple" does not strip out the word 'the', the way you show, unless you made nonstandard modifications to it) and converts it to a tsquery. Since "&" is punctuation, it gets ignored. But then it adds '&' in between the words, because that is what "plainto_tsquery" does. So those changes are not visible, because you chose a poor example to feed to plainto_tsquery.
"to_tsquery" compiles the query you gave it into the structure used for searching. But then, because you are selecting it rather than using it with a ts query operator, it converts it back to text again so it can display it. It requires that what you feed it already looks mostly like a tsquery (for example, has boolean operators between each word), otherwise it throws an error. Surely you noticed that when you tried longer queries?
"to_tsvector" creates a tsvector. This is not a tsquery, rather it is what the tsquery gets applied to.

Odd to_tsquery results for s:* and t:*

I was experimenting with PostgreSQL's text search feature - particularly with the normalization function to_tsquery.
I was using english dictionary(config) and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they matched to single space and tab?
Here is the query:
select
to_tsquery('english', 'a:*') as for_a,
to_tsquery('english', 's:*') as for_s,
to_tsquery('english', 't:*') as for_t,
to_tsquery('english', 'u:*') as for_u
fiddle just in case.
You would see 'u:*' is returning as 'u:*' and 'a:*' is not returning anything.
The letters s and t are considered stop words in the english text search dictionary, therefore they get discarded. You can read the stop word list under tsearch_data/english.stop in the postgres shared folder, which you can locate by typing pg_config --sharedir
With pg 11 on ubuntu/debian/mint, that would be
cat /usr/share/postgresql/11/tsearch_data/english.stop
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to discard english grammar and think of words in a programmatic and logical way as described above. Full text search does not try to infer context based on sentence structuring so it has no use for these words. After all, it's called full text search and not natural language search.
As to how they arrived on the conclusion to add s and t to the stop word list, statistical analysis must have revealed these characters to be noise.

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (not ending with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems as though even if I set () as separator characters to index that only effects the direct attriubtes that are searchable, not the tag. is this correct? can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it is currently failing is because of the syntax of tagFilters. When you pass a string, it tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'

Cannot use t-sql contains with short words

I call my statement with CONTAINS function, but sometimes it does not return correct records, e.g. I want to return row which contain in one field word 'Your':
SELECT [Email]
,[Comment]
FROM [USERS]
WHERE CONTAINS(Comment, 'Your')
It gives mi 0 result despite that this field contains this word (the same with 'as', 'to', 'was', 'me'). When I use 'given' instead of 'Your' then I receive a result. Is there maybe a list of words which cannot be used with CONTAINS? Or maybe this words are to short (when i use 'name' then i receive the results)? The work 'Your' is at the beginning in field Comment.
The field is of type 'text' and has enabled full-text index.
Words such as those you mention are "stop words"; they are expressly excluded from being indexed and searched in Full Text Search due to how common (and thereby meaningless for searches) they are. You'll notice the same thing when searching Google, for instance.
It is possible to edit the list, but I would avoid doing so except perhaps to add words to it; the words in the list are chosen very well, IMHO, for their lack of utility in searches.