How to query only the data that has emojis (PostgreSQL)

I have data that contains emojis within a database column, for example:
message_text
-------
🙂
😀
Hi 😀
I want to query only the rows whose data contains emojis. Is there a simple way to do this in Postgres? In pseudocode:
select data from table where data_col contains emoji
Currently I use a simple query
select message_text from cb_messages_v1 cmv
where message_text IN ('👍🏻','😀','😐','🙂', '😧')
but I want it to be more dynamic, so that if new emojis are added in the future, the query will still capture them.

From your example it seems like you are not only interested in emoticons (U+1F600 - U+1F64F), but also in Miscellaneous Symbols and Pictographs (U+1F300 - U+1F5FF) and Transport and Map Symbols (U+1F680 - U+1F6C5).
You can find values that contain one of these with
WHERE textcol ~ '[\U0001F300-\U0001F6FF]'
~ is the regular expression matching operator, and the pattern is a range of Unicode characters.
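For example, against the table from the question (a sketch reusing the question's table and column names):

SELECT message_text
FROM cb_messages_v1 cmv
WHERE message_text ~ '[\U0001F300-\U0001F6FF]';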

Related

Combining columns in Qlik

I have an Excel sheet that has two separate columns of data that I need combined for my table in Qlik. I know it would be easy to combine the two in the Excel document, but because this is a data feed I would prefer not to do any extra work. One column has the first name and the other the last. Thank you.
I tried to concat but that did not work.
It sounds like what you're trying to achieve is string concatenation, where you'd combine two or more strings together. It'd be the same concept for fields, as long as their values can be coerced to a string type. In Qlik, this is done very simply by using the ampersand & operator. You can use this in three places:
Data Manager
If done in the Data Manager, you are creating a new calculated field. This can be done by editing the table that you're loading in from Excel, selecting Add field >> Calculated field, and then using an expression like this:
first_name & ' ' & last_name
That expression takes the field first_name, concatenates its values with a space ' ', and then concatenates the values of the last_name field.
So your new field, which I'm naming full_name here, would look like this:
first_name | last_name | full_name
-----------+-----------+----------------
Chris      | Schaeuble | Chris Schaeuble
Stefan     | Stoichev  | Stefan Stoichev
Austin     | Spivey    | Austin Spivey
Then after you load the data, you will have a new field with the combined names.
Data Load Editor
Doing this in the Data Load Editor will also result in a new field, using the exact same expression inside the load script:
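A minimal load-script sketch (the connection path and sheet name here are illustrative assumptions, not from the original post):

Customers:
LOAD
    first_name,
    last_name,
    first_name & ' ' & last_name AS full_name  // same concatenation expression
FROM [lib://MyData/names.xlsx]
(ooxml, embedded labels, table is Sheet1);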
Chart expression
The other option you have is to use this expression "on-the-fly" in a chart without creating a new column in the app data model like the first two options. Just use that same expression from above in a chart field expression and you'll get the same result.

AWS Redshift - ILIKE doesn't work with accented words

We've been working with AWS Redshift for some time now and recently we faced a quite interesting situation.
Suppose we have the following table.
CREATE TEMPORARY TABLE cities (city VARCHAR(256), state VARCHAR(256));
And the following sample data
INSERT INTO cities (city, state) VALUES
('Campos do Jordão', 'São Paulo'),
('CAMPOS DO JORDÃO', 'SÃO Paulo'),
('campos do jordão', 'são paulo'),
('Balneário Camburiú', 'Santa Catarina'),
('balneÁrio camburiú', 'Santa Catarina'),
('BALNEÁRIO camburiÚ', 'Santa Catarina'),
('Açailândia', 'Maranhão'),
('AÇailândia', 'Maranhão'),
('AÇAILÂNDIA', 'Maranhão'),
('Salvador', 'Bahia'),
('SALvADOR', 'BAHIA'),
('salVAdor', 'BAHiA')
;
We want to filter all rows corresponding to a specific city. Consider that the data hasn't passed through a validating process, so the same city name is written in multiple different ways.
We tried using ILIKE, such as SELECT * FROM cities WHERE city ILIKE 'Campos do Jordão', but the result was the following
       city       |   state
------------------+-----------
 Campos do Jordão | São Paulo
 campos do jordão | são paulo
We got two records instead of three. After some testing we discovered that the problem occurs due to the accented characters (such as ã, ç, á). For example the query
SELECT * FROM cities WHERE city ILIKE '%Ú%';
returns only the record ('BALNEÁRIO camburiÚ', 'Santa Catarina'), while the same query returns the other two records if we replace %Ú% by %ú%.
I thought this happened because Redshift treats these accented characters as special characters, but UPPER worked as expected. For example, the query below returned all three records of Balneário Camburiú.
SELECT * FROM cities WHERE UPPER(city) ILIKE '%Ú%';
I'm posting this example here to ask if I'm missing something on the ILIKE command or if this is actually some kind of bug.
https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html
LIKE performs a case-sensitive pattern match. ILIKE performs a case-insensitive pattern match for single-byte UTF-8 (ASCII) characters. To perform a case-insensitive pattern match for multibyte characters, use the LOWER function on expression and pattern with a LIKE condition.
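Applied to the sample table above, that documented workaround looks like this (a sketch; per the docs it should match all three spellings):

SELECT * FROM cities
WHERE LOWER(city) LIKE LOWER('%Campos do Jordão%');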
I think the problem comes from the fact that Redshift does not actually understand Unicode.
Redshift stores into a varchar whatever bit patterns you put in there, valid Unicode or not.
When it performs comparisons, it performs a byte-by-byte comparison, not a character-by-character comparison.
There are some functions which do understand Unicode, such as upper() and lower(); they are written separately from the main code base, because you have to understand Unicode to change the case of a multi-byte UTF-8 character. LIKE and ILIKE, however, are operators rather than functions, so they come from the core database code, which is not Unicode aware. You have to do some of the work for them, using the Unicode-aware functions, before they can behave correctly.
BTW, it was a fascinating question and answer also. Thank you for asking.

Where are the encoding_types stored in the system tables?

When I query pg_catalog.pg_attribute I see a column of encoding_types. These are IDs, not strings.
When I query pg_catalog.pg_table_def I see a column of encodings which are strings.
So far, I've reverse-engineered the following subset of the table I'm looking for:
id  | description
----+------------
128 | none
131 | lzo
Where are these encodings stored so I can complete my table?
This issue on the AWS Labs GitHub mentions a (possibly undocumented?) function called format_encoding which converts the integer representation called attencodingtype into a string:
For example:
SELECT DISTINCT
    attencodingtype,
    format_encoding(attencodingtype) AS attencoding
FROM pg_catalog.pg_attribute;
This will show you all the currently used encodings in your database, their attencodingtype and the associated string representation.
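If you would rather see the mapping per table, format_encoding can be joined through pg_class as well (a sketch; 'your_table' is a placeholder):

SELECT c.relname,
       a.attname,
       a.attencodingtype,
       format_encoding(a.attencodingtype) AS encoding
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_class c ON c.oid = a.attrelid
WHERE c.relname = 'your_table'
  AND a.attnum > 0;  -- skip system columns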

Why is this PostgreSQL full text search query returning ts_rank of 0?

Before I invest in using solr or lucene or sphinx, I wanted to try to implement a search capability on my system using postgresql full text search.
I have a national list of businesses in a table that I want to search. I created a ts vector that combines the business name and city so that I can do a search like "outback atlanta".
I am also implementing an auto-completion function by using the wildcard capability of the search: I append ":*" to the search pattern and insert " & " between keywords, so the search pattern "outback atl" turns into "outback & atl:*" before getting converted into a query using to_tsquery().
Here's the problem that I am running into currently:
If the search pattern is entered as "ou", many "Outback Steakhouse" records are returned.
If the search pattern is entered as "out", no results are returned.
If the search pattern is entered as "outb", many "Outback Steakhouse" records are returned.
Doing a little debugging, I came up with this:
select ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('ou:*')) as "ou",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('out:*')) as "out",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('outb:*')) as "outb"
which results in this:
    ou     | out |   outb
-----------+-----+-----------
 0.0607927 |   0 | 0.0607927
What am I doing wrong?
Is this a limitation of pg full text search?
Is there something that I can do with my dictionary or configuration to get around this anomaly?
UPDATE:
I think that "out" may be a stop word.
When I run this debug query, I don't get any lexemes for "out":
SELECT * FROM ts_debug('english','out back outback');
   alias   |   description   |  token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+---------+----------------+--------------+-----------
 asciiword | Word, all ASCII | out     | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | back    | {english_stem} | english_stem | {back}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | outback | {english_stem} | english_stem | {outback}
So now I ask how do I modify the stop word list to remove a word?
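(For reference, the documented way to do that is to derive a configuration whose snowball dictionary points at an edited stop-word file. A sketch, where my_english is an assumed name and my_english.stop is a copy of english.stop in $SHAREDIR/tsearch_data/ with "out" removed:

CREATE TEXT SEARCH DICTIONARY my_english_stem (
    TEMPLATE  = snowball,
    Language  = english,
    StopWords = my_english   -- reads my_english.stop
);

CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);

ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword, word
    WITH my_english_stem;

-- then to_tsvector('my_english', 'out back outback') keeps 'out'
)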
UPDATE:
Here is the query that I am currently using:
select id,name,address,city,state,likes
from view_business_favorite_count
where textsearchable_index_col @@ to_tsquery('simple',$1)
ORDER BY ts_rank(textsearchable_index_col, to_tsquery('simple',$1)) DESC
When I execute the query (I'm using Strongloop Loopback + Express + Node), I pass the pattern in to replace the $1 parameter. The pattern (as stated above) will look something like "keyword:*" or "keyword1 & keyword2 & ... & keywordN:*".
thanks
The problem here is that you are searching against business names, and as @Daniel correctly pointed out, the 'english' dictionary will not help you find a "fuzzy" match for non-dictionary words like "Outback Steakhouse";
'simple' dictionary
The 'simple' dictionary on its own will not help you either; in your case business names will only match exactly, since all the words are unstemmed.
'simple' dictionary + pg_trgm
But if you use the 'simple' dictionary together with the pg_trgm module, it will be exactly what you need. In particular:
with to_tsvector('simple','<business name>') you don't need to worry about the stop word "hack", and you get all the lexemes unstemmed;
using similarity() from pg_trgm you get the highest "rank" for the best match.
Look at this:
WITH pg_trgm_test(business_name,search_pattern) AS ( VALUES
('Outback Steakhouse','ou'),
('Outback Steakhouse','out'),
('Outback Steakhouse','outb')
)
SELECT business_name,search_pattern,similarity(business_name,search_pattern)
FROM pg_trgm_test;
result:
business_name | search_pattern | similarity
--------------------+----------------+------------
Outback Steakhouse | ou | 0.1
Outback Steakhouse | out | 0.15
Outback Steakhouse | outb | 0.2
(3 rows)
Ordering by similarity DESC, you will be able to get what you need.
UPDATE
For your situation there are two possible options.
Option #1.
Just create a trgm index for the name column in the view_business_favorite_count table; the index definition may be the following:
CREATE INDEX name_trgm_idx ON view_business_favorite_count USING gin (name gin_trgm_ops);
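The % operator, similarity(), and the gin_trgm_ops operator class all come from the pg_trgm contrib module; if it is not enabled yet, install it first:

CREATE EXTENSION IF NOT EXISTS pg_trgm;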
The query will look something like this:
SELECT
id,
name,
address,
city,
state,
likes,
similarity(name,$1) AS trgm_rank -- similarity score
FROM
view_business_favorite_count
WHERE
name % $1 -- trgm search
ORDER BY trgm_rank DESC;
Option #2.
With full text search, you need to:
create a separate table, for example unnested_business_names, with 2 columns: the 1st column keeps all the lexemes from the to_tsvector('simple',name) function, and the 2nd column holds vbfc_id (an FK to id in the view_business_favorite_count table);
add a trgm index on the column which contains the lexemes;
add a trigger for unnested_business_names which inserts, updates, or deletes values as view_business_favorite_count changes, to keep all the words up to date (see the sketch below).
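A hypothetical sketch of that setup (all object names are illustrative, and the trigger wiring is left out):

CREATE TABLE unnested_business_names (
    vbfc_id integer,   -- id from view_business_favorite_count
    lexeme  text       -- one unstemmed word of the business name
);

CREATE INDEX unnested_lexeme_trgm_idx
    ON unnested_business_names USING gin (lexeme gin_trgm_ops);

-- Initial fill; lowercased, whitespace-split words approximate
-- the lexemes that to_tsvector('simple', name) would produce:
INSERT INTO unnested_business_names (vbfc_id, lexeme)
SELECT v.id, w
FROM view_business_favorite_count AS v,
     LATERAL regexp_split_to_table(lower(v.name), '\s+') AS w;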

When will Postgres's full text search support phrase match and proximity match?

As of Postgres 8.4, the database's full text search supports neither exact phrase match nor proximity match between 2 terms. For example, there is no way to tell Postgres to match on content that has word #1 within a specified proximity of word #2. Does anyone know PostgreSQL's plans, and possibly in which version phrase and proximity match will be supported?
PostgreSQL 9.6 text search supports phrases now
select
*
from (values
('i heart new york'),
('i hate york new')
) docs(body)
where
to_tsvector(body) @@ phraseto_tsquery('new york')
(1 row retrieved)
or by distance between words:
-- a distance of exactly 2 "hops" between "quick" and "fox"
select
*
from (values
('the quick brown fox'),
('quick brown cute fox')
) docs(body)
where
to_tsvector(body) @@ to_tsquery('quick <2> fox')
(1 row retrieved)
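For reference, phraseto_tsquery() simply compiles the phrase into tsquery's FOLLOWED BY operator, so the two spellings below are equivalent:

select phraseto_tsquery('new york');
-- 'new' <-> 'york'
select to_tsvector('i heart new york') @@ to_tsquery('new <-> york');
-- true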
http://linuxgazette.net/164/sephton.html
<snip>
Search Vectors
How does one turn document content into an array of lexemes using the parser and dictionaries? How does one match a search criterion to body text? PostgreSQL provides a number of functions to do this. The first one we will look at is to_tsvector().
A tsvector is an internal data type containing an array of lexemes with position information. The lexeme positions are used when searching, to rank the search result based on proximity and other information. One may control the ranking by labelling the different portions which make up the search document content, for example the title, body and abstract may be weighted differently during search by labelling these sections differently. The section labels, quite simply A,B,C & D, are associated with the tsvector at the time it is created, but the weight modifiers associated with those labels may be controlled after the fact.
</snip>
For full-phrase searching, see here.
The PostgreSQL website does not have a roadmap; instead, you are referred to the Open Issues page. At the moment, that page makes no mention of full-phrase searching.