Similarity in tsv column - postgresql

I need some help getting the SQL to work in PostgreSQL 9.5.1 using pgAdmin III. I have a column status (datatype text) holding Facebook statuses in the format they were typed, and another column status_tsv which stores a tsvector of the status column with stop words removed and the words stemmed.
I'd like to find similar statuses by comparing the similarity of the tsvector column in a self-join.
So far I have tried combining regexp_replace with the pg_trgm similarity function to keep only the a-zA-Z characters in the tsvector column, but this didn't work: regexp_replace can't operate on tsvector columns, so I changed the datatype of the tsv column to text.
The problem now is that it only compares the similarity of the first word in each row and ignores the rest. That is no use; I need it to compare the whole row.
My SQL just now looks like
`SELECT * FROM status_table AS x
JOIN status_table AS y
ON ST_Dwithin (x.geom54032, y.geom54032,5000)
WHERE status_similarity (x.tsvector_status, y.tsvector_status) > 0.7
AND x.status_id != y.status_id;`
status_similarity does this: `(regexp_replace(x.tsvector_status, '[^a-zA-Z]', '', 'g'), regexp_replace(y.tsvector_status, '[^a-zA-Z]', '', 'g'))`, which I'm sure keeps only the a-zA-Z characters from the tsvector_status column.
What must I change to get this to return similar statuses?

Related

Counting the Number of Occurrences of a Multi-Word Phrase in Text with PostgreSQL

I have a problem: I need to count how often a multi-word phrase appears within a text field in a PostgreSQL database.
I'm aware of functions such as to_tsquery(), and I'm using it to check whether a phrase exists within the text using to_tsquery('simple', 'sample text'). However, I'm unsure how to count these occurrences accurately.
If the words are contained just once in the string (I am supposing here that your table contains two columns, one with an id and another with a text column called my_text):
SELECT
count(id)
FROM
my_table
WHERE
my_text ~* 'the_words_i_am_looking_for'
If the occurrences are more than one per field, this nested query can be used:
SELECT
id,
count(matches) as matches
FROM (
SELECT
id,
regexp_matches(my_text, 'the_words_i_am_looking_for', 'g') as matches
FROM
my_table
) t
GROUP BY 1
The syntax of this function and much more about string pattern matching can be found in the PostgreSQL documentation on pattern matching.
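For engines without regexp_matches, the per-row count can also be approximated with plain string functions. The following is a minimal sketch, not the method from the answer above: it uses Python's standard-library sqlite3 module as a stand-in database and swaps in a length/replace trick, which counts case-sensitive, non-overlapping occurrences. The helper name count_phrase and the sample rows are hypothetical, and the phrase is assumed to be non-empty.

```python
import sqlite3

def count_phrase(texts, phrase):
    """Return (id, occurrences of phrase) for each row, computed in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, my_text TEXT)")
    conn.executemany(
        "INSERT INTO my_table (my_text) VALUES (?)", [(t,) for t in texts]
    )
    # Removing every occurrence shortens the text by count * length(phrase),
    # so the length difference divided by the phrase length gives the count.
    rows = conn.execute(
        """
        SELECT id,
               (length(my_text) - length(replace(my_text, :p, ''))) / length(:p)
                   AS matches
        FROM my_table
        ORDER BY id
        """,
        {"p": phrase},
    ).fetchall()
    conn.close()
    return rows

print(count_phrase(
    ["sample text and sample text", "no match here", "one sample text"],
    "sample text",
))
```

Unlike the regexp_matches query, this variant also returns rows with zero matches.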

How to add a leading zero when the length of the column is unknown?

How can I add a leading zero to a varchar column when I don't know the length of the column? If the column is not null, I should add a leading zero.
Examples:
345 - output should be 0345
4567 - output should be 04567
I tried:
SELECT lpad(column1,WHAT TO SPECIFY HERE?, '0')
from table_name;
I will run an update query after I get this.
You may be overthinking this. Use plain concatenation:
SELECT '0' || column1 AS padded_col1 FROM table_name;
If the column is NULL, nothing happens: concatenating anything to NULL returns NULL.
In particular, don't use concat(). You would get '0' for NULL columns, which you do not want.
If you also have empty strings (''), you may need to do more, depending on what you want.
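The NULL and empty-string behaviour described above can be checked quickly. This is a minimal sketch that uses Python's standard-library sqlite3 module as a stand-in for PostgreSQL, on the assumption that both follow the SQL rule that || with a NULL operand yields NULL. The helper name pad_with_zero is hypothetical; the values 345 and 4567 come from the question.

```python
import sqlite3

def pad_with_zero(values):
    """Return '0' || column1 for each input value, evaluated by the SQL engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_name (column1 TEXT)")
    conn.executemany(
        "INSERT INTO table_name VALUES (?)", [(v,) for v in values]
    )
    rows = conn.execute(
        "SELECT '0' || column1 AS padded_col1 FROM table_name ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# None stands for SQL NULL; the empty string is a separate case.
print(pad_with_zero(["345", "4567", None, ""]))
```

The NULL row stays NULL, while the empty string becomes '0', which is the case that may need extra handling.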
And since you mentioned your plan to update the table: consider not doing this. You would be adding noise that could instead be added for display with the simple expression above. A VIEW might come in handy for this.
If all your varchar values are in fact valid numbers, use an appropriate numeric data type instead and format for display with the same expression as above. The concatenation automatically produces a text result.
If circumstances should force your hand and you need to update anyway, consider this:
UPDATE table_name
SET column1 = '0' || column1
WHERE column1 IS DISTINCT FROM '0' || column1;
The added WHERE clause avoids empty updates. Compare:
How do I (or can I) SELECT DISTINCT on multiple columns?
try concat instead?..
SELECT concat(0::text,column1) from table_name;

Only getting 1 result from postgres tsvector

I am using PostgreSQL 9.3. I have built a dataset with a tsvector field called vector.
Then I execute a query against it
SELECT id, vector, relative_path, title
FROM site_server.indexed_url, plainto_tsquery('english','booking') query
WHERE vector @@ query;
Only 1 row is returned. When I look at the data there are at least 6 rows that would match. How do I get it to retrieve all matching records?
Data file
The values in the vector column of your data sample are not normalized, and COPY does not normalize them. As per the docs:
It is important to understand that the tsvector type itself does not
perform any word normalization; it assumes the words it is given are
normalized appropriately for the application
If you run:
SELECT id, vector, relative_path, title
FROM site_server.indexed_url
WHERE to_tsvector('english', vector::text) @@ plainto_tsquery('english', 'booking');
I think it will produce the expected result.

How do I find a word that certain rows contain in SQLite?

I am having a hard time figuring out how to do the following in SQLite:
I have a table with let's say the following:
table name: terms
golden
waterfall
inception
castaway
I would now like to look up all of the terms in the table that are contained in a specific string. So a string like "abc_golden#hotmail.com" should return a match. Or "life_waterfall_5" should return a match.
I understand how to do this with the LIKE statement if it were the other way around (if I were looking for matches in the table that contain a specific word). But how do I do it in my case, where I have to match all entries that are contained WITHIN my search term?
To find rows that contain a string:
SELECT * FROM tbl WHERE col LIKE '%word%';
To find rows that a string contains, just turn it backwards:
SELECT * FROM tbl WHERE 'some string' LIKE '%' || col || '%';
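The reversed LIKE can be tried against the terms from the question. This is a minimal sketch using Python's standard-library sqlite3 module; the column name term and the helper name terms_contained_in are assumptions, since the question only names the table.

```python
import sqlite3

def terms_contained_in(search_string, terms):
    """Return the stored terms that occur somewhere inside search_string."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (term TEXT)")
    conn.executemany("INSERT INTO terms VALUES (?)", [(t,) for t in terms])
    # The search string is on the left of LIKE; each row supplies the pattern.
    rows = conn.execute(
        "SELECT term FROM terms WHERE ? LIKE '%' || term || '%' ORDER BY term",
        (search_string,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

terms = ["golden", "waterfall", "inception", "castaway"]
print(terms_contained_in("abc_golden#hotmail.com", terms))
print(terms_contained_in("life_waterfall_5", terms))
```

Because each stored value is used as a LIKE pattern, any % or _ inside a term acts as a wildcard. If the terms may contain those characters, a substring function such as SQLite's instr() may be a safer choice.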

ltrim(rtrim(x)) leave blanks on rtl content - anyone knows on a work around?

I have a table [Company] with a column [Address3] defined as varchar(50).
I cannot control the values entered into that table, but I need to extract the values without leading and trailing spaces. I run the following query:
SELECT DISTINCT RTRIM(LTRIM([Address3])) Address3 FROM [Company] ORDER BY Address3
The column contains both RTL and LTR values.
Most of the data is retrieved correctly, but SOME (not all) RTL values are returned with leading and/or trailing spaces.
I also tried the following query:
SELECT DISTINCT ltrim(rTRIM(ltrim(rTRIM([Address3])))) c, ltrim(rTRIM([Address3])) b, [Address3] a, rtrim(LTRIM([Address3])) Address3 FROM [Company] ORDER BY Address3
But it showed the same problem in all columns. Does anyone have any idea what could cause it?
The rows that return with extraneous spaces might have a kind of space or invisible character the trim functions don't know about. The documentation doesn't even mention what is considered "a blank" (pretty damn sloppy if you ask me). Try taking one of those rows and looking at the characters one by one to see what character they are.
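Once a suspect value has been pulled out of the database, its characters can be listed one by one. This is a minimal sketch in Python using the standard-library unicodedata module; the helper name describe_chars and the sample value are hypothetical, and the right-to-left mark and no-break space in the sample are only examples of invisible characters, not a claim about what the Address3 data actually contains.

```python
import unicodedata

def describe_chars(value):
    """List (code point, Unicode name) for every character in value."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in value
    ]

# Hypothetical value: a right-to-left mark, visible text, a no-break space.
sample = "\u200fabc\u00a0"
for code, name in describe_chars(sample):
    print(code, name)
```

Any character that is not an ordinary space (U+0020) at the start or end of the list is a candidate for what the trim functions leave behind.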
Since you are using varchar, you can run this to get the ASCII code of the bad trailing characters:
--identify the bad character
SELECT
COUNT(*) AS CountOf
,'>'+RIGHT(LTRIM(RTRIM(Address3)),1)+'<' AS LastChar_Display
,ASCII(RIGHT(LTRIM(RTRIM(Address3)),1)) AS LastChar_ASCII
FROM Company
GROUP BY RIGHT(LTRIM(RTRIM(Address3)),1)
ORDER BY 3 ASC
Do a one-time fix to the data to remove the bogus character, where xxxx is the ASCII value identified in the previous select:
--only one bad character found in previous query
UPDATE Company
SET Address3=REPLACE(Address3,CHAR(xxxx),'')
--multiple different bad characters found by previous query
UPDATE Company
SET Address3=REPLACE(REPLACE(Address3,CHAR(xxxx1),''),char(xxxx2),'')
If you have bogus characters in your data, remove them from the data rather than each time you select it. You WILL have to add this REPLACE logic to all INSERTs and UPDATEs on this column to keep new data free of the bogus characters.
If you can't alter the data, you can just select it this way:
SELECT
LTRIM(RTRIM(REPLACE(Address3,CHAR(xxxx),'')))
,LTRIM(RTRIM(REPLACE(REPLACE(Address3,CHAR(xxxx1),''),char(xxxx2),'')))
...