If I have a table with the columns name and surname, how do I search across both of those values?
Example
I set a threshold to filter out unwanted matches, and then I can compute the similarity between the name column and the searched text. If the search matches only name or only surname this works fine, but when both are combined the similarity score comes out different.
However, in the similarity function I can only pass a single value, either name or surname:
SET pg_trgm.similarity_threshold = 0.8;
SELECT name, surname, similarity(name, 'searching text') AS sml
FROM goods_goods
WHERE name % 'searching text'
ORDER BY sml DESC, name;
How can I pass more values to the similarity function? Something like similarity((name, surname, and so on if necessary), 'searching text').
EDITED
One More Example To Clarify
The table name is table.
There are three columns: sentence1, sentence2, sentence3.
I want to find the row in which the search string is most similar to sentence1, sentence2 and sentence3 all taken together, not as separate sentences.
Here is an example with some data:
table
| sentence1         | sentence2     | sentence3          |
|-------------------|---------------|--------------------|
| I have an apple   | Samsung Tv    | Dji mavic 2 Zoom   |
| Tiger is red!!!   | postgresql    | Dji mavic 2 Zoom   |
| Basketball ABCD   | battery AC    | Dji mavic 3 Zoom   |
| Tiger is red!!!   | postgresql    | Dji mavic 3 Zoom   |
and now my searching text is something like Tiger postgres dji mavic 2
As we can see, the most similar is row 2 (61%), while row 4 is at 57%; that is correct, because row 4 contains ...mavic 3 and we want ...mavic 2.
select strict_word_similarity('I have an apple Samsung Tv Dji mavic 2 Zoom','Tiger postgres dji mavic 2' ) --> 0.27906978
select strict_word_similarity('Tiger is red!!! postgresql Dji mavic 2 Zoom','Tiger postgres dji mavic 2' ) --> 0.61904764
select strict_word_similarity('Basketball ABCD battery AC Dji mavic 3 Zoom','Tiger postgres dji mavic 2' ) --> 0.24390244
select strict_word_similarity('Tiger is red!!! postgresql Dji mavic 2 Zoom','Tiger postgres dji mavic 2' ) --> 0.5714286
But if we compare the columns one by one, some individual columns are similar even though the sentences taken together are completely wrong.
For example, rows 1 and 2 have the same text in sentence3, but row 1 is not what I want.
So how can I do that with pg_trgm?
It sounds to me like you want this:
SELECT name, surname,
       greatest(similarity(name, 'searching text'), similarity(surname, 'searching text')) AS sml
FROM goods_goods
WHERE name % 'searching text' or surname % 'searching text'
ORDER BY sml DESC, name;
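For the edited multi-column example, one option (just a sketch, using the three-column table from the edit; strict_word_similarity requires pg_trgm 1.4 or later) is to concatenate the columns with spaces and compare the whole row text at once:

```sql
SELECT sentence1, sentence2, sentence3,
       -- concat_ws joins with a space and skips NULL columns
       strict_word_similarity(
           concat_ws(' ', sentence1, sentence2, sentence3),
           'Tiger postgres dji mavic 2') AS sml
FROM "table"          -- quoted because "table" is a reserved word
ORDER BY sml DESC
LIMIT 1;
```

This scores each row on all three sentences together, which is exactly the behavior the single-string strict_word_similarity calls above demonstrate.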
Related
id | acct_num | name    | orderdt
---+----------+---------+---------
 1 | 1006A    | Joe Doe | 1/1/2021
 2 | 1006A    | Joe Doe | 1/5/2021
EXPECTED OUTPUT
id | acct_num | name    | orderdt  | id1 | acct_num1 | name1   | orderdt1
---+----------+---------+----------+-----+-----------+---------+---------
 1 | 1006A    | Joe Doe | 1/1/2021 |  2  | 1006A     | Joe Doe | 1/5/2021
My query is the following:
Select id,
acct_num,
name,
orderdt
from order_tbl
where acct_num = '1006A'
and orderdt >= '1/1/2021'
If you always have one or two rows, you could do it like this (I'm assuming the latest version of SQL Server because you said T-SQL):
NOTE: If you have a known max (e.g. 4), this solution can be extended to support that number by changing the modulus and adding more columns and another join.
WITH order_table_numbered as
(
  SELECT ID, ACCT_NUM, NAME, ORDERDT,
         ROW_NUMBER() OVER (PARTITION BY ACCT_NUM ORDER BY ORDERDT) as RN
  FROM order_tbl
)
SELECT first.id as id, first.acct_num as acct_num, first.name as name, first.orderdt as orderdt,
       second.id as id1, second.acct_num as acct_num1, second.name as name1, second.orderdt as orderdt1
FROM order_table_numbered first
LEFT JOIN order_table_numbered second ON first.ACCT_NUM = second.ACCT_NUM and (second.RN % 2 = 0)
WHERE first.RN % 2 = 1
If you have an unknown number of rows I think you should solve this on the client OR convert the groups to XML -- the XML support in SQL Server is not bad.
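If the maximum per account is known but larger than two, a conditional-aggregation sketch (same ROW_NUMBER idea, reusing the table and column names from the question) lays the rows out side by side without the self-join; extend the CASE blocks up to your known max:

```sql
WITH order_table_numbered AS (
    SELECT id, acct_num, name, orderdt,
           ROW_NUMBER() OVER (PARTITION BY acct_num ORDER BY orderdt) AS rn
    FROM order_tbl
)
SELECT acct_num,
       MAX(CASE WHEN rn = 1 THEN id END)      AS id,
       MAX(CASE WHEN rn = 1 THEN name END)    AS name,
       MAX(CASE WHEN rn = 1 THEN orderdt END) AS orderdt,
       MAX(CASE WHEN rn = 2 THEN id END)      AS id1,
       MAX(CASE WHEN rn = 2 THEN name END)    AS name1,
       MAX(CASE WHEN rn = 2 THEN orderdt END) AS orderdt1
       -- repeat with rn = 3, rn = 4, ... for a larger known max
FROM order_table_numbered
GROUP BY acct_num;
```

Each MAX(CASE ...) picks out the nth order per account, so one output row per acct_num is produced regardless of how the input rows interleave.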
I am very new to PostgreSQL, SQL, and databases. I hope you can help me with this: I want to know which posts have the most comments and which have the fewest, and the users need to be specified too.
CREATE SCHEMA perf_demo;
SET search_path TO perf_demo;
-- Tables
CREATE TABLE users(
id SERIAL -- PRIMARY KEY
, email VARCHAR(40) NOT NULL UNIQUE
);
CREATE TABLE posts(
id SERIAL -- PRIMARY KEY
, user_id INTEGER NOT NULL -- REFERENCES users(id)
, title VARCHAR(100) NOT NULL UNIQUE
);
CREATE TABLE comments(
id SERIAL -- PRIMARY KEY
, user_id INTEGER NOT NULL -- REFERENCES users(id)
, post_id INTEGER NOT NULL -- REFERENCES posts(id)
, body VARCHAR(500) NOT NULL
);
-- Generate approx. N users
-- Note: NULL values might lead to fewer rows than N.
INSERT INTO users(email)
WITH query AS (
SELECT 'user_' || seq || '#'
|| ( CASE (random() * 5)::INT
WHEN 0 THEN 'my'
WHEN 1 THEN 'your'
WHEN 2 THEN 'his'
WHEN 3 THEN 'her'
WHEN 4 THEN 'our'
END )
|| '.mail' AS email
FROM generate_series(1, 5) seq -- Important: Replace N with a useful value
)
SELECT email
FROM query
WHERE email IS NOT NULL;
-- Generate N posts
INSERT INTO posts(user_id, title)
WITH expanded AS (
SELECT random(), seq, u.id AS user_id
FROM generate_series(1, 8) seq, users u -- Important: Replace N with a useful value
),
shuffled AS (
SELECT e.*
FROM expanded e
INNER JOIN (
SELECT ei.seq, min(ei.random) FROM expanded ei GROUP BY ei.seq
) em ON (e.seq = em.seq AND e.random = em.min)
ORDER BY e.seq
)
-- Top 20 programming languages: https://www.tiobe.com/tiobe-index/
SELECT s.user_id,
'Let''s talk about (' || s.seq || ') '
|| ( CASE (random() * 19 + 1)::INT
WHEN 1 THEN 'C'
WHEN 2 THEN 'Python'
WHEN 3 THEN 'Java'
WHEN 4 THEN 'C++'
WHEN 5 THEN 'C#'
WHEN 6 THEN 'Visual Basic'
WHEN 7 THEN 'JavaScript'
WHEN 8 THEN 'Assembly language'
WHEN 9 THEN 'PHP'
WHEN 10 THEN 'SQL'
WHEN 11 THEN 'Ruby'
WHEN 12 THEN 'Classic Visual Basic'
WHEN 13 THEN 'R'
WHEN 14 THEN 'Groovy'
WHEN 15 THEN 'MATLAB'
WHEN 16 THEN 'Go'
WHEN 17 THEN 'Delphi/Object Pascal'
WHEN 18 THEN 'Swift'
WHEN 19 THEN 'Perl'
WHEN 20 THEN 'Fortran'
END ) AS title
FROM shuffled s;
-- Generate N comments
-- Note: The cross-join is a performance killer.
-- Try the SELECT without INSERT with small N values to get an estimation of the execution time.
-- With these values you can extrapolate the execution time for a bigger N value.
INSERT INTO comments(user_id, post_id, body)
WITH expanded AS (
SELECT random(), seq, u.id AS user_id, p.id AS post_id
FROM generate_series(1, 10) seq, users u, posts p -- Important: Replace N with a useful value
),
shuffled AS (
SELECT e.*
FROM expanded e
INNER JOIN ( SELECT ei.seq, min(ei.random) FROM expanded ei GROUP BY ei.seq ) em ON (e.seq = em.seq AND e.random = em.min)
ORDER BY e.seq
)
SELECT s.user_id, s.post_id, 'Here some comment: ' || md5(random()::text) AS body
FROM shuffled s;
Could someone show me how this could be done, please? I am new to SQL/Postgres; any help would be much appreciated. An example would be very helpful too.
Good effort pasting the whole dataset creation procedure; that is exactly what needs to be included to make the example reproducible.
Let's start with how to join several tables: your posts table contains the user_id, and we can use it to join with users as follows.
SELECT email,
users.id user_id,
posts.id post_id,
title
from posts join users
on posts.user_id=users.id;
This will list the posts together with their authors. Check the joining condition (after the ON) stating the fields we're matching on. The result should be similar to the below:
email | user_id | post_id | title
------------------+---------+---------+----------------------------------------
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language
user_3#her.mail | 3 | 8 | Let's talk about (8) R
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic
user_5#your.mail | 5 | 4 | Let's talk about (4) R
user_5#your.mail | 5 | 3 | Let's talk about (3) C
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby
(8 rows)
Now it's time to join this result with the comments table. Since a post may or may not have comments, and you want to show all posts even those without comments, you should use a LEFT OUTER JOIN (more info about join types here).
So let's rewrite the above to include comments:
SELECT email,
users.id user_id,
posts.id post_id,
title,
comments.body
from posts
join users
on posts.user_id=users.id
left outer join comments
on posts.id = comments.post_id
;
Check out the join between posts and comments based on post_id.
The result of the query is the list of posts, related author and comments, similar to the below
email | user_id | post_id | title | body
------------------+---------+---------+----------------------------------------+-----------------------------------------------------
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic |
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language |
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 200bb07acfbac893aed60e018b47b92b
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 66159adaed11404b1c88ca23b6a689ef
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: e5cc1f7c10bb6103053bf281d3cadb60
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 5ae8674c2ef819af0b1a93398efd9418
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | Here some comment: 5b818da691c1570dcf732ed8f6b718b3
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | Here some comment: 88a990e9495841f8ed628cdce576a766
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic |
user_5#your.mail | 5 | 4 | Let's talk about (4) R | Here some comment: ed19bb476eb220d6618e224a0ac2910d
user_5#your.mail | 5 | 3 | Let's talk about (3) C | Here some comment: 23cd43836a44aeba47ad212985f210a7
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | Here some comment: b83999120bd2bb09d71aa0c6c83a05dd
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | Here some comment: b4895f4e0aa0e0106b5d3834af80275e
(13 rows)
Now you can start aggregating and counting the comments for each post. You can use PG's aggregation functions; we'll use COUNT here.
SELECT email,
users.id user_id,
posts.id post_id,
title,
count(comments.id) nr_comments
from posts
join users
on posts.user_id=users.id
left outer join comments
on posts.id = comments.post_id
group by email,
users.id,
posts.id,
title
;
Note that we're counting the comments.id field, but we could also use count(*), which simply counts the rows. Also note that we group our results by email, users.id, posts.id and title, the columns we show alongside the count.
The result should be similar to:
email | user_id | post_id | title | nr_comments
------------------+---------+---------+----------------------------------------+-------------
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | 2
user_5#your.mail | 5 | 3 | Let's talk about (3) C | 1
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | 2
user_3#her.mail | 3 | 8 | Let's talk about (8) R | 4
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic | 0
user_5#your.mail | 5 | 4 | Let's talk about (4) R | 1
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic | 0
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language | 0
(8 rows)
This should be the result you're looking for. Just bear in mind that you're showing the user from users who wrote the post, not the one who commented. To see who commented, you'll need to change the join conditions.
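To answer the original question directly (most and fewest comments), you only need to sort that aggregate; a sketch:

```sql
SELECT email,
       users.id AS user_id,
       posts.id AS post_id,
       title,
       count(comments.id) AS nr_comments
FROM posts
JOIN users            ON posts.user_id = users.id
LEFT OUTER JOIN comments ON posts.id = comments.post_id
GROUP BY email, users.id, posts.id, title
ORDER BY nr_comments DESC;   -- most commented first; use ASC for the fewest
```

Adding LIMIT 1 would return only the single most (or least) commented post, though ties would then be cut off arbitrarily.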
I need to store a series of GPS points with timestamps in the database (traces of various vehicles).
Initially I wanted to write something of my own, but it would take a lot more computational power than just having the result come from a single query.
I explored a bit and came across PostGIS, but I'm not sure if it's suitable or possible to solve this problem.
The idea is to check whether two vehicles passed each other at the same time.
I have a table with the coordinates, each coordinate is in a separate row, and each row has a timestamp associated with it.
The table has the following schema: (vehicle_id, latitude, longitude, timestamp).
So, given multiple coordinates of a vehicle, I need to check whether it crossed paths with other vehicles at the same time. I found that I could use ST_MakeLine to create a line string from a sequence of GPS points, and I saw that there are various intersection functions, but those require the coordinates to match exactly, whereas here the offset may be, say, 30 meters, and the timestamp has to be taken into account.
Any answer would help.
Thanks
If I understood your use case correctly, I believe you don't need to create LineStrings to check whether your trajectories intersect or get close at a certain point in time.
Data Sample:
CREATE TABLE t (vehicle_id INT, longitude NUMERIC, latitude NUMERIC, ts TIMESTAMP);
INSERT INTO t VALUES (1,1,1.1111,'2019-05-01 15:30:00'),
(1,1,2.1112,'2019-05-01 15:40:00'),
(1,1,3.1111,'2019-05-01 15:50:00'),
(2,2,2.1111,'2019-05-01 15:30:00'),
(2,1,2.1111,'2019-05-01 15:40:00'),
(2,1,4.1111,'2019-05-01 15:05:00');
As you can see in the sample data above, vehicle_id 1 and 2 are close (less than 30m) to each other at 2019-05-01 15:40:00, which can be found using a query like this:
SELECT
t1.vehicle_id,t2.vehicle_id,t1.ts,
ST_AsText(ST_MakePoint(t1.longitude,t1.latitude)::GEOGRAPHY) AS p1,
ST_AsText(ST_MakePoint(t2.longitude,t2.latitude)::GEOGRAPHY) AS p2,
ST_Distance(
ST_MakePoint(t1.longitude,t1.latitude)::GEOGRAPHY,
ST_MakePoint(t2.longitude,t2.latitude)::GEOGRAPHY) AS distance
FROM t t1, t t2
WHERE
t1.vehicle_id <> t2.vehicle_id AND
t1.ts = t2.ts AND
ST_Distance(
ST_MakePoint(t1.longitude,t1.latitude)::GEOGRAPHY,
ST_MakePoint(t2.longitude,t2.latitude)::GEOGRAPHY) <= 30
vehicle_id | vehicle_id | ts | p1 | p2 | distance
------------+------------+---------------------+-----------------+-----------------+-------------
1 | 2 | 2019-05-01 15:40:00 | POINT(1 2.1112) | POINT(1 2.1111) | 11.05757826
2 | 1 | 2019-05-01 15:40:00 | POINT(1 2.1111) | POINT(1 2.1112) | 11.05757826
(2 rows)
As you can see, the result is sort of duplicated, since 1 is close to 2 and 2 is close to 1 at the same time. You can correct this using DISTINCT ON(), but since I'm not familiar with your data, you'd better adjust this yourself.
Note that the data type is GEOGRAPHY, not GEOMETRY. That's because with ST_Distance, distances over geometries are computed in degrees, while over geographies they are in meters.
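As a side note, ST_DWithin is usually preferable to filtering on ST_Distance, since it can use a spatial index and stops measuring once the threshold is exceeded; a sketch of the same query (the vehicle_id inequality also removes the duplicated pairs):

```sql
SELECT t1.vehicle_id, t2.vehicle_id, t1.ts
FROM t t1
JOIN t t2
  ON  t1.vehicle_id < t2.vehicle_id
  AND t1.ts = t2.ts
  AND ST_DWithin(
        ST_MakePoint(t1.longitude, t1.latitude)::GEOGRAPHY,
        ST_MakePoint(t2.longitude, t2.latitude)::GEOGRAPHY,
        30);  -- meters, because the operands are GEOGRAPHY
```

With the GEOGRAPHY column and GiST index suggested in the edit below the ST_MakePoint calls can be replaced by the stored column directly.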
EDIT: To address a question mentioned in comments.
To avoid the overhead of creating geography records at execution time, you might want to store the coordinates as geography in the first place. In that case the table would look like this:
CREATE TABLE t (vehicle_id INT, geom GEOGRAPHY, ts TIMESTAMP);
And you could populate it like this:
INSERT INTO t (vehicle_id, geom, ts)
VALUES (1,ST_MakePoint(1,1.1111),'2019-05-01 15:30:00');
In case you want to avoid repopulating the table, you can instead move the data to a new column and (if you wish) get rid of latitude and longitude:
ALTER TABLE t ADD COLUMN geom GEOGRAPHY;
UPDATE t SET geom = ST_MakePoint(longitude,latitude);
ALTER TABLE t DROP COLUMN longitude, DROP COLUMN latitude;
CREATE INDEX idx_point ON t USING GIST(geom);
SELECT vehicle_id,ts,ST_AsText(geom) FROM t;
vehicle_id | ts | st_astext
------------+---------------------+-----------------
1 | 2019-05-01 15:30:00 | POINT(1 1.1111)
1 | 2019-05-01 15:40:00 | POINT(1 2.1112)
1 | 2019-05-01 15:50:00 | POINT(1 3.1111)
2 | 2019-05-01 15:30:00 | POINT(2 2.1111)
2 | 2019-05-01 15:40:00 | POINT(1 2.1111)
2 | 2019-05-01 15:05:00 | POINT(1 4.1111)
(6 rows)
Let's say I get the following table when I do
select name, alternative_name from persons;
name | alternative_name
--------------------------+----------------------------------
Johnny A | John the first
Johnny B | The second John
Now with this query
select name from persons where to_tsvector(name || alternative_name) @@ to_tsquery('John');
name | alternative_name
--------------------------+----------------------------------
Johnny A | John the first
Shouldn't I get both? How can I do a full text search on both the name and alternative_name columns so that I get all rows that match the search query?
Edit: Yes, there is indeed a typo here. It is to_tsquery
you concatenate without a space:
t=# with c(n,a) as (values('Johnny A','John the first'),('Johny B','The second John'))
select * from c
where to_tsvector(n || a) @@ to_tsquery('John')
;
n | a
---------+-----------------
Johny B | The second John
(1 row)
so the first haystack becomes Johnny AJohn the first, thus the lexemes do not match; try:
t=# with c(n,a) as (values('Johnny A','John the first'),('Johny B','The second John'))
select * from c
where to_tsvector(n ||' '|| a) @@ to_tsquery('John')
;
n | a
----------+-----------------
Johnny A | John the first
Johny B | The second John
(2 rows)
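One further caveat with plain ||: if either column is NULL, the whole concatenation becomes NULL and the row can never match. concat_ws (or coalesce around each column) avoids that; a sketch against the persons table from the question:

```sql
SELECT name
FROM persons
-- concat_ws inserts the separator and skips NULL arguments
WHERE to_tsvector(concat_ws(' ', name, alternative_name)) @@ to_tsquery('John');
```

If you later add an expression index for this search, the indexed expression must match the query expression exactly, so pick one form and stick with it.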
I have a range of data on search queries across different merchants.
I have a Python script that first creates the head, torso & tail query sets from the main table in SQL, based on count(query) instance thresholds of 1000, 100, etc.
Since the merchants my script runs over may or may not have queries that meet those thresholds, the script does not always produce the "head.csv", "torso.csv" and "tail.csv" outputs.
How can I break the queries into head, torso & tail groups while respecting the logic above?
I also tried ntile to break the groups into percentiles (33, 33, 33), but that skews both the head & torso if a merchant has a very long tail.
Current :
# head
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) >=1000
#torso
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <1000 and count(*) >=100
#tail
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <100
# using ntile - but note that I have percentiles of "3" , 33.#% each, which introduces the skew
select trim(query), count(*) as query_count,
ntile(3) over(order by query_count desc) AS group_ntile
from my_merchant_table
group by trim(query)
order by query_count desc limit 100;
Ideally the solution can build on top of this -:
select trim(query), count(*) as query_count,
ntile(100) over(order by query_count desc) AS group_ntile
from my_merchant_table
-- other conditions & date range
group by trim(query)
order by query_count desc
This gives:
btrim | query_count | group_ntile
------+-------------+------------
q0    | 1277        | 1
q1    | 495         | 1
q2    | 357         | 1
q3    | 246         | 1
-- and so on up to group_ntile = 100, while query_count decreases.
Question:
What is the best way to implement this logic so that it is merchant-agnostic, with no hard-coded configs?
Note: I am fetching the data in Redshift; the solution should be compatible with Postgres 8.0, and with Redshift in particular.
I imagine that you invoke these queries from some programming language to process the information. My recommendation is to fetch all the records and apply the filtering there. Consider that if you push several operations over the data into the database query itself, the application's response time will be affected.
Assuming that the main challenge is to create the 'tiles' from a list of values, here is some sample code. It takes the 13 provinces of Canada and breaks them into a requested number of groups. It uses the province names, but numbers would work just as well.
SELECT * FROM Provinces ORDER BY province; -- To see what we are working with
+---------------------------+
| province |
+---------------------------+
| Alberta |
| British Columbia |
| Manitoba |
| New Brunswick |
| Newfoundland and Labrador |
| Northwest Territories |
| Nova Scotia |
| Nunavut |
| Ontario |
| Prince Edward Island |
| Quebec |
| Saskatchewan |
| Yukon |
+---------------------------+
13 rows in set (0.00 sec)
Now for the code:
SELECT @n := COUNT(*), -- Find total count (13)
       @j := 0.5, -- 'trust me'
       @tiles := 3 -- The number of groupings
    FROM Provinces;
SELECT group_start
    FROM (
        SELECT
            IF((@j * @tiles) % @n < @tiles, province, NULL) AS group_start,
            @j := @j + 1
        FROM Provinces
        ORDER BY province
         ) x
    WHERE group_start IS NOT NULL;
+---------------------------+
| group_start |
+---------------------------+
| Alberta |
| Newfoundland and Labrador |
| Prince Edward Island |
+---------------------------+
3 rows in set (0.00 sec)
With @tiles set to 4:
+---------------+
| group_start |
+---------------+
| Alberta |
| New Brunswick |
| Nova Scotia |
| Quebec |
+---------------+
4 rows in set (0.00 sec)
It is reasonably efficient: one pass to count the rows, one pass to do the computation, and one pass to filter out the non-break values.
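Note that the sample above uses MySQL user variables; on Redshift (or PostgreSQL 8.4+, where window functions exist) the same break-point idea is more naturally expressed with window functions. A merchant-agnostic sketch that buckets queries by their cumulative share of total traffic instead of fixed 1000/100 thresholds (the 0.33/0.66 cut-offs here are assumptions you would tune):

```sql
WITH counts AS (
    SELECT trim(query) AS query, count(*) AS query_count
    FROM my_merchant_table
    -- other conditions & date range
    GROUP BY trim(query)
)
SELECT query, query_count,
       CASE WHEN cum_share <= 0.33 THEN 'head'
            WHEN cum_share <= 0.66 THEN 'torso'
            ELSE 'tail'
       END AS bucket
FROM (
    SELECT query, query_count,
           -- running total of traffic, divided by the overall total
           sum(query_count) OVER (ORDER BY query_count DESC
                                  ROWS UNBOUNDED PRECEDING)::float
             / sum(query_count) OVER () AS cum_share
    FROM counts
) c;
```

Because the cut-offs are shares of each merchant's own volume, every merchant always gets all three buckets (a merchant with a long tail simply has a larger tail bucket), which sidesteps the missing-CSV problem described in the question.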