How to improve ORDER BY in a search query? (PostgreSQL)

I need some help.
The site has two sort types: by relevancy and by date.
Sometimes it happens that the issues with the highest score are too old, while the newest ones have a low score.
So I need some combined ordering based on both measures.
The relevancy query looks like ORDER BY ts_rank(EXTENDED_INDEX, custom_tsquery('english', 'test', 0))
and the second one is just ORDER BY table.date.
Any ideas how to improve the search? Maybe a second ts_rank by date?

It's unclear from the question which dataset you are using as an example, but you can simply use ORDER BY rank DESC, date DESC in your query, so the most "recent" and highly "ranked" rows come first in your result set.
WITH t(id, t, d) AS (VALUES
  (1, to_tsvector('english', 'one'),                            '2016-03-18'::DATE),
  (2, to_tsvector('english', 'two words'),                      '2016-03-17'::DATE),
  (3, to_tsvector('english', 'three words we are looking for'), '2016-03-16'::DATE),
  (4, to_tsvector('english', 'four words goes here'),           '2016-03-15'::DATE)
)
SELECT
  id,
  ts_rank(t, q) AS rank,
  d
FROM t, to_tsquery('english', 'three | words') AS q
ORDER BY rank DESC NULLS LAST, d DESC;
Result:
id | rank | d
----+-----------+------------
3 | 0.0607927 | 2016-03-16
2 | 0.0303964 | 2016-03-17
4 | 0.0303964 | 2016-03-15
1 | 0 | 2016-03-18
(4 rows)
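If you want a single ordering that trades rank off against age rather than a strict two-level sort, one common approach is an exponential time decay on the rank. A minimal sketch against the same sample data; the 30-day half-life is an assumption you would tune for your data:

```sql
WITH t(id, t, d) AS (VALUES
  (1, to_tsvector('english', 'one'),                            '2016-03-18'::DATE),
  (2, to_tsvector('english', 'two words'),                      '2016-03-17'::DATE),
  (3, to_tsvector('english', 'three words we are looking for'), '2016-03-16'::DATE),
  (4, to_tsvector('english', 'four words goes here'),           '2016-03-15'::DATE)
)
SELECT
  id,
  -- rank decayed by age: the score halves every 30 days (assumed half-life)
  ts_rank(t, q) * exp(-ln(2.0) * (CURRENT_DATE - d) / 30.0) AS score,
  d
FROM t, to_tsquery('english', 'three | words') AS q
ORDER BY score DESC, d DESC;
```

With this, a slightly lower-ranked but much newer row can overtake an old high-ranked one, which addresses the "high score but too old" problem directly.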

Related

PostgreSQL - query for record that is either side of the result set

Let's say I have this table (balances) schema and data:
+----+---------+------------+
| id | balance | createdAt |
+----+---------+------------+
| 1 | 10 | 2021-11-18 |
| 2 | 12 | 2021-11-16 |
| 3 | 6 | 2021-11-04 |
+----+---------+------------+
To retrieve the last 7 days of balances, I would do something like this:
SELECT * FROM "balances" WHERE "createdAt" BETWEEN '2021-11-12T10:04:17.488Z' AND '2021-11-19T09:04:17.488Z' ORDER BY "createdAt" ASC
This will give me 2 records (IDs: 1 & 2), which is fine. However, what I'm looking at doing, probably with a second query, is to grab the record that is previous to that result set, by createdAt date, as my query is ordered by createdAt. Is there a way to do this with PG?
So whatever time range I use, I would also like to retrieve the record that is n-1 relative to the result set.
To obtain the record you want, you may use a LIMIT query:
SELECT *
FROM balances
WHERE "createdAt" < '2021-11-12T10:04:17.488Z'
ORDER BY "createdAt" DESC
LIMIT 1;
This takes the newest record earlier than the lower bound of your window, i.e. the record immediately preceding the result set. It assumes there are no ties on "createdAt"; if ties are possible, break them by adding more columns to the ORDER BY clause.
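If you'd rather fetch the window and its preceding record in one round trip, the two queries can be combined with UNION ALL; the parentheses are required so the inner LIMIT/ORDER BY apply per branch (same table and bounds as in the question):

```sql
(SELECT *
 FROM balances
 WHERE "createdAt" < '2021-11-12T10:04:17.488Z'  -- lower bound of the window
 ORDER BY "createdAt" DESC
 LIMIT 1)                                        -- the n-1 record
UNION ALL
(SELECT *
 FROM balances
 WHERE "createdAt" BETWEEN '2021-11-12T10:04:17.488Z'
                       AND '2021-11-19T09:04:17.488Z')
ORDER BY "createdAt" ASC;                        -- orders the combined result
```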

How to calculate mean date difference with Postgres

I have a table on my postgres database, which has two main fields: agent_id and quoted_at.
I need to group my data by agent_id, and calculate the mean difference among all quoted_at.
So, for example, if I have the following rows:
agent_id | quoted_at
---------+-----------
1 | 2020-04-02
1 | 2020-04-04
1 | 2020-04-05
The mean difference would be calculated as:
( (2020-04-05 - 2020-04-04) + (2020-04-04 - 2020-04-02) ) / 2 = 1.5 days
What I want to see after grouping the info is:
agent_id | mean
---------+---------
1 | 1.5 days
I know, by the end, I just need to calculate (last - first) / (#_occurrences - 1)
It is just not really clear how (and if) it is possible to do that using a single query on Postgres.
Use the lag() window function to calculate your differences. Once you have those differences, use the avg() aggregation function.
with diffs as (
  select agent_id, quoted_at,
         quoted_at - lag(quoted_at) over (partition by agent_id
                                          order by quoted_at) as diff_days
  from your_table
)
select agent_id, avg(diff_days) as mean
from diffs
where diff_days is not null
group by agent_id;
The diff_days for an agent's first record is null. avg() ignores nulls, so the filter does not change the average itself, but it drops agents that have only a single quote, whose mean would otherwise come back as null.
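Since the differences telescope, the same mean can also be computed without a window function, using the (last - first) / (#_occurrences - 1) shortcut from the question (table name your_table assumed, as above):

```sql
SELECT agent_id,
       -- date subtraction yields whole days; NULLIF avoids division by zero
       -- for agents with a single quote (their mean comes back as NULL)
       (MAX(quoted_at) - MIN(quoted_at))::numeric
         / NULLIF(COUNT(*) - 1, 0) AS mean_days
FROM your_table
GROUP BY agent_id;
```

For the sample rows this gives 3 days / 2 = 1.5 days for agent 1, matching the window-function version.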

How to find MAX(date) from BETWEEN(dates) in column 2 with DUPLICATES in column 1?

I have a Database that has product names in column 1 and product release dates in column 2. I want to find 'old' products by their release date. However, I'm only interested in finding 'old' products that released a minimum of 1 year ago. I cannot make any edits to the original database infrastructure.
The table looks like this:
Product| Release_Day
A | 2018-08-23
A | 2017-08-23
A | 2019-08-21
B | 2018-08-22
B | 2016-08-22
B | 2017-08-22
C | 2018-10-25
C | 2016-10-25
C | 2019-08-19
I have already tried multiple versions of DISTINCT, MAX, BETWEEN, >, <, etc.
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2015-08-22' and '2018-08-22'
and release_day not between '2018-08-23' and '2019-08-22'
GROUP BY 1
ORDER BY MAX(release_day) DESC
The expected results should not contain any products found by this query:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2018-08-23' and '2019-08-22'
AND product = 'A'
GROUP BY 1
However, every check I complete returns a product from this date range.
This is the output of the initial query:
Product|Most_Recent_Release
A | 2018-08-23
B | 2018-08-22
C | 2016-10-25
And, for example, if I run the check query on Product A, I get this:
Product|Most_Recent_Release
A | 2019-08-21
Use HAVING to filter on the most recent release:
SELECT product, MAX(release_day) AS most_recent_release
FROM Product_Release
GROUP BY product
HAVING MAX(release_day) < '2018-08-23'
ORDER BY most_recent_release DESC
Note that in PostgreSQL the HAVING clause cannot reference a SELECT alias, so the MAX(release_day) expression has to be repeated there.
There's no need to use DISTINCT when you use GROUP BY -- you can't get duplicates if there's only one row per product.
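If "a minimum of 1 year ago" should track the current date rather than a hard-coded cutoff, the same query works with a rolling boundary (a sketch, assuming the cutoff is meant to move with today's date):

```sql
SELECT product, MAX(release_day) AS most_recent_release
FROM Product_Release
GROUP BY product
-- rolling cutoff: only products whose newest release is over a year old
HAVING MAX(release_day) < CURRENT_DATE - INTERVAL '1 year'
ORDER BY most_recent_release DESC;
```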

Recursive CTE PostgreSQL Connecting Multiple IDs with Additional Logic for Other Fields

Within my PostgreSQL database, I have an id column that shows each unique lead that comes in. I also have a connected_lead_id column which shows whether accounts are related to each other (ie husband and wife, parents and children, group of friends, group of investors, etc).
When we count the number of ids created during a time period, we want to see the number of unique "groups" of connected_ids during a period. In other words, we wouldn't want to count both the husband and wife pair, we would only want to count one since they are truly one lead.
We want to be able to create a view that only has the "first" id based on the "created_at" date and then contains additional columns at the end for "connected_lead_id_1", "connected_lead_id_2", "connected_lead_id_3", etc.
We want to add in additional logic so that we take the "first" id's source, unless that is null, then take the "second" connected_lead_id's source unless that is null and so on. Finally, we want to take the earliest on_boarded_date from the connected_lead_id group.
id  | created_at    | connected_lead_id | on_boarded_date | source
----+---------------+-------------------+-----------------+---------
2   | 9/24/15 23:00 | 8                 |                 |
4   | 9/25/15 23:00 | 7                 |                 | event
7   | 9/26/15 23:00 | 4                 |                 |
8   | 9/26/15 23:00 | 2                 |                 | referral
11  | 9/26/15 23:00 | 336               | 7/1/17          | online
142 | 4/27/16 23:00 | 336               |                 |
336 | 7/4/16 23:00  | 11                | 9/20/18         | referral
End Goal:
id | created_at    | on_boarded_date | source
---+---------------+-----------------+---------
2  | 9/24/15 23:00 |                 | referral
4  | 9/25/15 23:00 |                 | event
11 | 9/26/15 23:00 | 7/1/17          | online
Ideally, we would also have i number of extra columns at the end to show each connected_lead_id that is attached to the base id.
Thanks for the help!
OK, the best I can come up with at the moment is to first build maximal groups of related IDs, and then join back to your table of leads to get the rest of the data (see this SQL Fiddle for the setup, full queries and results).
To get the maximal groups you can use a recursive common table expression to first grow the groups, followed by a query to filter the CTE results down to just the maximal groups:
with recursive cte(grp) as (
  select case when l.connected_lead_id is null then array[l.id]
              else array[l.id, l.connected_lead_id]
         end
  from leads l
  union all
  select grp || l.id
  from leads l
  join cte
    on l.connected_lead_id = any(grp)
   and not l.id = any(grp)
)
select * from cte c1
The CTE above outputs several similar groups as well as intermediary groups. The query predicate below prunes out the non maximal groups, and limits results to just one permutation of each possible group:
where not exists (select 1 from cte c2
where c1.grp && c2.grp
and ((not c1.grp #> c2.grp)
or (c2.grp < c1.grp
and c1.grp #> c2.grp
and c1.grp <# c2.grp)));
Results:
| grp |
|------------|
| 2,8 |
| 4,7 |
| 14 |
| 11,336,142 |
| 12,13 |
Next join the final query above back to your leads table and use window functions to get the remaining column values, along with the distinct operator to prune it down to the final result set:
with recursive cte(grp) as (
...
)
select distinct
first_value(l.id) over (partition by grp order by l.created_at) id
, first_value(l.created_at) over (partition by grp order by l.created_at) create_at
, first_value(l.on_boarded_date) over (partition by grp order by l.created_at) on_boarded_date
, first_value(l.source) over (partition by grp
order by case when l.source is null then 2 else 1 end
, l.created_at) source
, grp CONNECTED_IDS
from cte c1
join leads l
on l.id = any(grp)
where not exists (select 1 from cte c2
where c1.grp && c2.grp
and ((not c1.grp #> c2.grp)
or (c2.grp < c1.grp
and c1.grp #> c2.grp
and c1.grp <# c2.grp)));
Results:
| id | create_at | on_boarded_date | source | connected_ids |
|----|----------------------|-----------------|----------|---------------|
| 2 | 2015-09-24T23:00:00Z | (null) | referral | 2,8 |
| 4 | 2015-09-25T23:00:00Z | (null) | event | 4,7 |
| 11 | 2015-09-26T23:00:00Z | 2017-07-01 | online | 11,336,142 |
| 12 | 2015-09-26T23:00:00Z | 2017-07-01 | event | 12,13 |
| 14 | 2015-09-26T23:00:00Z | (null) | (null) | 14 |
demo:db<>fiddle
Main idea, as a sketch:
Loop through the ordered set. Get all ids that haven't been seen before in any connected_lead_id (cli). These are your starting points for the recursion.
The problem is your id 142, which hasn't been seen before but is in the same group as 11 because of its cli. So it is better to collect the clis of the unseen ids; with these values it's much simpler to calculate the ids of the groups later, in the recursion part. Because of the loop, a function/stored procedure is necessary.
The recursion part: the first step is to get the ids of the starting clis, calculating the first referring id by using the created_at timestamp. After that a simple tree recursion over the clis can be done.
1. The function:
CREATE OR REPLACE FUNCTION filter_groups() RETURNS int[] AS $$
DECLARE
_seen_values int[];
_new_values int[];
_temprow record;
BEGIN
FOR _temprow IN
-- 1:
SELECT array_agg(id ORDER BY created_at) as ids, connected_lead_id FROM groups GROUP BY connected_lead_id ORDER BY MIN(created_at)
LOOP
-- 2:
IF array_length(_seen_values, 1) IS NULL
OR (_temprow.ids || _temprow.connected_lead_id) && _seen_values = FALSE THEN
_new_values := _new_values || _temprow.connected_lead_id;
END IF;
_seen_values := _seen_values || _temprow.ids;
_seen_values := _seen_values || _temprow.connected_lead_id;
END LOOP;
RETURN _new_values;
END;
$$ LANGUAGE plpgsql;
1. Group all ids that refer to the same cli.
2. Loop through the id arrays. If no element of the array has been seen before, add the referred cli to the output variable (_new_values). In both cases add the ids and the cli to the variable that stores all ids seen so far (_seen_values).
3. Return the clis.
The result so far is {8, 7, 336} (which is equivalent to the ids {2,4,11,142}!)
2. The recursion:
-- 1:
WITH RECURSIVE start_points AS (
SELECT unnest(filter_groups()) as ids
),
filtered_groups AS (
-- 3:
SELECT DISTINCT
1 as depth, -- 3
first_value(id) OVER w as id, -- 4
ARRAY[(MIN(id) OVER w)] as visited, -- 5
MIN(created_at) OVER w as created_at,
connected_lead_id,
MIN(on_boarded_date) OVER w as on_boarded_date, -- 6
first_value(source) OVER w as source
FROM groups
WHERE connected_lead_id IN (SELECT ids FROM start_points)
-- 2:
WINDOW w AS (PARTITION BY connected_lead_id ORDER BY created_at)
UNION
SELECT
fg.depth + 1,
fg.id,
array_append(fg.visited, g.id), -- 8
LEAST(fg.created_at, g.created_at),
g.connected_lead_id,
LEAST(fg.on_boarded_date, g.on_boarded_date), -- 9
COALESCE(fg.source, g.source) -- 10
FROM groups g
JOIN filtered_groups fg
-- 7
ON fg.connected_lead_id = g.id AND NOT (g.id = ANY(visited))
)
SELECT DISTINCT ON (id) -- 11
id, created_at,on_boarded_date, source
FROM filtered_groups
ORDER BY id, depth DESC;
1. The WITH part gives out the results from the function; unnest() expands the id array into one row per id.
2. Creating a window: the window function groups all values by their clis and orders each window by the created_at timestamp. In your example every value is in its own window except 11 and 142, which are grouped together.
3. depth is a helper variable used later to pick the deepest rows.
4. first_value() gives the first value of the ordered window frame. If 142 had a smaller created_at timestamp, the result would have been 142; as it is, it's 11.
5. A variable is needed to record which ids have been visited so far. Without this information an infinite loop would be created: 2-8-2-8-2-8-...
6. The minimum date of the window is taken (same thing here: if 142 had a smaller date than 11, that would be the result).
This completes the starting query of the recursion. The following describes the recursion part:
7. Join the table (the original function results) against the previous recursion result. The second condition is the stop for the infinite loop mentioned above.
8. Append the currently visited id to the visited variable.
9. If the current on_boarded_date is earlier, it is taken.
10. COALESCE gives the first NOT NULL value, so the first NOT NULL source is kept throughout the whole recursion.
11. The recursion yields a row for every recursion step, and we want to keep only the deepest visit of every starting id. DISTINCT ON (id) gives out the row with the first occurrence of each id; to make that the deepest one, the whole set is ordered descending by the depth variable.

fastest query to select specific row when using distinct + filter

Table answers:
Answer ID | User ID | Question ID | deleted
1 | 1 | 1 | f
2 | 1 | 2 | f
3 | 1 | 2 | f
4 | 1 | 1 | t
5 | 2 | 1 | f
6 | 2 | 2 | f
7 | 2 | 2 | f
I want to select all answers distinct on (userID, questionID) using the latest answer (based on the highest id) and from this result set I want to remove all entries having deleted = t.
So my result should be
Answer ID | User ID | Question ID | deleted
3 | 1 | 2 | f
5 | 2 | 1 | f
7 | 2 | 2 | f
I guess this can't be done with the generated query methods from the repository interface? I am using a @Query annotation instead:
@Query("SELECT a1 FROM Answer a1 WHERE ... ")
List<Answer> findLatestAnswers();
I came up with this (SQL Fiddle: http://sqlfiddle.com/#!15/02339/8/0 ) and am not even using DISTINCT, GROUP BY, or ORDER BY. It does the job but seems very inefficient for larger data sets. What would be a faster statement?
SELECT * FROM answer a1
WHERE NOT EXISTS ( -- where no newer answer exists
SELECT * FROM answer a2
WHERE a1.user_id = a2.user_id
AND a1.question_id = a2.question_id
AND a1.id < a2.id
)
AND a1.deleted = FALSE;
There is no problem with using DISTINCT, GROUP BY, or aggregate functions. These are essential in data-warehouse and analytics software, where millions of records are processed in every request (billions and trillions in big data).
The only adjustment needed is index creation based on your data and queries.
The function you need for your scenario is max. You have to select the max answer id, grouped by user_id and question_id, as follows:
SOLUTION 1
@Query("select max(answer) from Answer answer where answer.deleted = false group by answer.userId, answer.questionId")
List<Answer> findLatestAnswersByUserQuestionNotDeleted();
This statement returns 4 records because, rightly, if you exclude deleted answers, the latest answer of user 1 on question 1 becomes answer 1.
I don't know why you didn't consider this, but I will follow your question as it is.
Because of this you have to filter out deleted answers programmatically, as you described, so the @Query becomes:
@Query("select max(answer) from Answer answer group by answer.userId, answer.questionId")
List<Answer> findLatestAnswersByUserQuestion();
Again you have, rightly, 4 records, because deleted answers are also present and must be filtered programmatically.
SOLUTION 2 (two queries, because of your requirement to ignore deleted answers without falling back to older ones)
Step 1 - find the ids of the latest answers, including deleted ones (just the ids):
@Query("select max(answer.id) from Answer answer group by answer.userId, answer.questionId")
List<Long> findLatestAnswersId();
Step 2 - load the answers by id, excluding deleted ones:
List<Answer> findAllByDeletedIsFalseAndIdIn(List<Long> ids);
SOLUTION 3 (one query)
@Query("select answer from Answer answer where answer.deleted = false and answer.id in (select max(inAnswer.id) from Answer inAnswer group by inAnswer.userId, inAnswer.questionId)")
List<Answer> findLatestNotDeleted();
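As a PostgreSQL-native alternative, here is a sketch using DISTINCT ON; it would have to run as a native query (e.g. via @Query(..., nativeQuery = true)) and assumes snake_case column names (answer table with id, user_id, question_id and a boolean deleted column). The inner query picks the highest id per (user_id, question_id) in one pass, and the outer filter then drops pairs whose latest answer is deleted, matching the expected ids 3, 5 and 7:

```sql
SELECT *
FROM (
  -- keep only the newest answer (highest id) per user/question pair
  SELECT DISTINCT ON (user_id, question_id) *
  FROM answer
  ORDER BY user_id, question_id, id DESC
) latest
WHERE NOT deleted;  -- drop pairs whose latest answer is deleted
```

An index on (user_id, question_id, id DESC) lets PostgreSQL satisfy the DISTINCT ON without a sort, which is the usual way to make this fast on larger data sets.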