Question says it all really... I've been able to use both of these but can't tell what the difference is. Using pg_trgm module...
SELECT * from tbl WHERE postcode % 'w4%' LIMIT 10;
SELECT * from tbl WHERE postcode ILIKE 'w4%' LIMIT 10;
ILIKE and the % operator are quite different.
% is the similarity operator provided by pg_trgm. Its output depends on the configured similarity threshold.
set pg_trgm.similarity_threshold = .6;
select 'abcdef' % 'abZZef';
--> false
set pg_trgm.similarity_threshold = .1;
select 'abcdef' % 'abZZef';
--> true
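To see the value that the threshold is compared against, you can call similarity() directly (a quick check; the exact figure shown is approximate):
select similarity('abcdef', 'abZZef');
--> 0.27272728 (3 shared trigrams out of 11 distinct ones)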
ILIKE, on the other hand, performs case-insensitive pattern matching:
select 'abcdef' ilike 'abc%';
--> true
select 'abcdef' ilike 'abZ%';
--> false
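Note that a single trigram index can speed up both kinds of query; a minimal sketch, using the tbl and postcode names from the question (the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX tbl_postcode_trgm_idx ON tbl USING gin (postcode gin_trgm_ops);
A GIN index with gin_trgm_ops supports the % operator as well as LIKE/ILIKE pattern matching.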
I have two tables Q and T, both containing a column of float numbers.
What I want to do is, for each number in Q, I want to find a number in T that has the smallest distance to it.
For example, for T={1,7,9} and Q={2,6,10}, I want to return Q,T pairs as {(2,1),(6,7),(10,9)}.
How should I express this query with SQL?
In addition, is it possible to accelerate this join with an index, e.g. by adding an operator class that binds "FOR ORDER BY <->" to a fabs calculation?
create table t (val_t integer);
create table q (val_q integer);
insert into t values (1),(7),(9);
insert into q values (2),(6),(10);
Start with a query that cross joins the two tables and adds a rank based on the difference:
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true ;
Use this query in a CTE or subquery and filter by rank:
WITH src AS (
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true )
SELECT val_q, val_t FROM src
WHERE rank = 1;
 val_q | val_t
-------+-------
     2 |     1
     6 |     7
    10 |     9
See https://www.postgresql.org/docs/12/tutorial-window.html
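As for the index question: one possible approach, sketched here under the assumption that the btree_gist extension is available (it provides a <-> distance operator for plain scalar types), is a LATERAL subquery that can use a KNN index scan:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX ON t USING gist (val_t);

SELECT q.val_q, nearest.val_t
FROM q
CROSS JOIN LATERAL (
  SELECT val_t
  FROM t
  ORDER BY val_t <-> q.val_q  -- nearest-neighbour ordering, index-assisted
  LIMIT 1
) nearest;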
Given this schema:
create table t (tn float);
insert into t values (1), (7), (9);
create table q (qn float);
insert into q values (2), (6), (10);
DISTINCT ON is the most straightforward way:
select distinct on (qn) qn, tn
from q
cross join t
order by qn, abs(qn - tn);
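With the sample data above, this should return exactly the pairs from the question:
 qn | tn
----+----
  2 |  1
  6 |  7
 10 |  9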
Exploiting a numeric range may perform better, depending on your data sizes. If performance is an issue, you can create an actual temp table for the range_tn CTE and put a GiST index on it:
with all_tn as (
select tn
from t
union select null
), range_tn as (
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from all_tn
window w as (order by tn nulls first)
)
select qn,
case
when lower_inf(tr) then upper(tr)
when upper_inf(tr) then lower(tr)
when 2 * qn - lower(tr) - upper(tr) > 0 then upper(tr)
else lower(tr)
end as tn
from q
join range_tn
on qn::numeric <@ tr;
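A minimal sketch of that temp-table variant (table and index names are illustrative):
create temp table range_tn as
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from (select tn from t union select null) as all_tn
window w as (order by tn nulls first);

create index range_tn_tr_idx on range_tn using gist (tr);
The final select from q can then join against this table exactly as above.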
This is how I do fuzzy string search in PostgreSQL:
select * from table where levenshtein(name, 'value') < 2;
But what can I do if the 'name' column contains an array?
P.S.: It is necessary to use an index, and that is what makes this question different.
You can use unnest() over the array:
select * from
(
  select unnest(name) as name_in_array, id from
  (
    select 1 as id, ARRAY['value1','valu','lav'] as name
    union all
    select 2 as id, ARRAY['value2','orange','yellow'] as name
  ) t1
) t2
where levenshtein(name_in_array, 'value') < 2;
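Note that levenshtein() is not built in: it comes from the fuzzystrmatch extension, which has to be enabled once per database:
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;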
I have a table with a field containing an array of strings (type is character varying(255)[]).
I'd like to compare a given string with a wildcard, say 'query%', to any of the elements of this field.
This request works and gets back the expected results:
SELECT * FROM my_table WHERE 'query' ILIKE ANY(my_field)
But with the wildcard, I got no results:
SELECT * FROM my_table WHERE 'query%' ILIKE ANY(my_field)
I think the reason is that the wildcard pattern is only recognized on the right side of the ILIKE operator, but ANY(my_field) also has to be on the right side.
Is there a way to achieve what I want?
Using PostgreSQL 9.5.
You have to unnest the array field:
with my_table(my_field) as (
values
(array['query medium', 'large query']),
(array['large query', 'small query'])
)
select t.*
from my_table t,
lateral unnest(my_field) elem
where elem ilike 'query%';
my_field
--------------------------------
{"query medium","large query"}
(1 row)
Convert the array into a set with unnest() and use an EXISTS clause:
SELECT * FROM my_table t WHERE EXISTS (SELECT 1 FROM unnest(t.my_field) AS f WHERE f ILIKE 'query%')
I can't find a straightforward answer. My query is spitting out the wrong result, and I think it's because the "AND" isn't being treated as part of the join condition.
Can you do something like this and if not, what is the correct approach:
SELECT * from X
LEFT JOIN Y
ON
y.date = x.date AND y.code = x.code
?
This is possible:
The ON clause is the most general kind of join condition: it takes a Boolean value expression of the same kind as is used in a WHERE clause. A pair of rows from T1 and T2 match if the ON expression evaluates to true for them.
http://www.postgresql.org/docs/9.1/static/queries-table-expressions.html#QUERIES-FROM
Your SQL looks OK.
It's fine. In fact, you can put any condition in the ON clause, even one not related to the key columns or even to the tables at all, e.g.:
SELECT * from X
LEFT JOIN Y
ON y.date = x.date
AND y.code = x.code
AND EXTRACT (dow from current_date) = 1
Another, arguably more readable way of writing the join is to use tuples of columns:
SELECT * from X
LEFT JOIN Y
ON
(y.date, y.code) = (x.date, x.code)
;
This clearly indicates that the join is based on equality across several columns.
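And when the join columns have the same names in both tables, as they do here, the same join can be written even more compactly with a USING clause:
SELECT * from X
LEFT JOIN Y
USING (date, code);
Note that USING also collapses each matched pair of columns into a single output column.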
This solution has good performance:
select * from
(select md5(concat(date, code)) as md5_x from x) as x1
left join
(select md5(concat(date, code)) as md5_y from y) as y1
on x1.md5_x = y1.md5_y;
I have a table with users. Each user has a country. What I want is to get the list of all countries with the numbers of users and the percent/total. What I have so far is:
SELECT
country_id,
COUNT(*) AS total,
((COUNT(*) * 100) / (SELECT COUNT(*) FROM users WHERE cond1 = true AND cond2 = true AND cond3 = true)::decimal) AS percent
FROM users
WHERE cond1 = true AND cond2 = true AND cond3 = true
GROUP BY country_id
Conditions in both of queries are the same. I tried to do this without a subquery but then I can't get the total number of users but total per country. Is there a way to do this without a subquery? I'm using PostgreSQL. Any help is highly appreciated.
Thanks in advance
I guess the reason you want to eliminate the subquery is to avoid scanning the users table twice. Remember the total is the sum of the counts for each country.
WITH c AS (
SELECT
country_id,
count(*) AS cnt
FROM users
WHERE cond1=...
GROUP BY country_id
)
SELECT
*,
100.0 * cnt / (SELECT sum(cnt) FROM c) AS percent
FROM c;
This query builds a small CTE with the per-country statistics. It will only scan the users table once, and generate a small result set (only one row per country).
The total (SELECT sum(cnt) FROM c) is calculated only once on this small result set, so it uses negligible time.
You could also use a window function :
SELECT
country_id,
cnt,
100.0 * cnt / (sum(cnt) OVER ()) AS percent
FROM (
SELECT country_id, count(*) as cnt from users group by country_id
) foo;
(which is the same as nightwolf's query with the errors removed, lol)
Both queries take about the same time.
This is really old, but both of the select examples above either don't work, or are overly complex.
SELECT
country_id,
COUNT(*),
(COUNT(*) / (SUM(COUNT(*)) OVER() )) * 100
FROM
users
WHERE
cond1 = true AND cond2 = true AND cond3 = true
GROUP BY
country_id
The second count is not necessary; it's just for debugging, to ensure you're getting the right results. The trick is the SUM on top of the COUNT over the recordset.
Hope this helps someone.
Also, if anyone wants to do this in Django, just hack up an aggregate:
from django.db.models import Aggregate, DecimalField

class PercentageOverRecordCount(Aggregate):
    function = 'OVER'
    template = '(COUNT(*) / (SUM(COUNT(*)) OVER() )) * 100'

    def __init__(self, expression, **extra):
        super().__init__(
            expression,
            output_field=DecimalField(),
            **extra
        )
Now it can be used in annotate.
I am not a PostgreSQL user, but the general solution would be to use window functions.
Read up on how to use them at http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
The best way I can describe it: a window function basically allows you to do a group by on one field without the GROUP BY clause.
I believe this might do the trick:
SELECT
country_id,
COUNT(*) OVER (PARTITION BY country_id),
((((COUNT(*) OVER (PARTITION BY country_id)) * 100) / COUNT(*) OVER ())::decimal) AS percent
FROM
users
WHERE
cond1 = true AND cond2 = true AND cond3 = true
Using a recent PostgreSQL version, the query can be written as follows:
CREATE TABLE users (
id serial,
country_id int
);
INSERT INTO users (country_id) VALUES (1),(1),(1),(2),(2),(3);
select distinct
country_id,
round(
((COUNT(*) OVER (partition by country_id )) * 100)::numeric
/ COUNT(*) OVER ()
, 2) as percent
from users
order by country_id
;
Result on SQLize.online
+============+=========+
| country_id | percent |
+============+=========+
| 1 | 50.00 |
+------------+---------+
| 2 | 33.33 |
+------------+---------+
| 3 | 16.67 |
+------------+---------+